Buckyball

Buckyball is a scalable framework for domain-specific architectures, built on the RISC-V architecture and optimized for high-performance computing and machine learning accelerator design.

Project Overview

The buckyball framework provides a complete hardware design, simulation verification, and software development toolchain, supporting the full development process from RTL design to system-level verification. The framework adopts a modular design that supports flexible configuration and extension, suitable for various specialized computing scenarios.

Quick Start

Environment Dependencies

Before getting started, please ensure your system meets the following dependency requirements:

Required Software:

  • Anaconda/Miniconda (Python environment management)
  • Ninja Build System
  • GTKWave (waveform viewer)
  • Bash Shell environment (doesn't need to be the primary shell)

Installing Dependencies:

# Install Anaconda
# Download from: https://www.anaconda.com/download/

# Install system tools
sudo apt install ninja-build gtkwave

# Optional: FireSim passwordless configuration
# Add to /etc/sudoers: user_name ALL=(ALL) NOPASSWD:ALL

Source Build

1. Clone Repository

git clone https://github.com/DangoSys/buckyball.git
cd buckyball

2. Initialize Environment

./scripts/init.sh

Note: Initialization takes approximately 3 hours, including dependency downloads and compilation

3. Environment Activation

source env.sh

4. Verify Installation

Run Verilator simulation test to verify installation:

bbdev verilator --run '--jobs 16 --binary ctest_vecunit_matmul_ones_singlecore-baremetal --config sims.verilator.BuckyballToyVerilatorConfig --batch'

Docker Quick Experience

We provide a Docker environment for rapid deployment of buckyball.

Notice:

  • Docker images are provided only for specific release versions.
  • The Docker image may not be the latest version; building from source is recommended.

We do not provide support for this version as it is not a stable release.

Buckyball as a library

We provide a streamlined buckyball installation that can be integrated as a generator within Chipyard.

Notice:

  • buckyball-as-a-lib is maintained only for specific release versions.

We do not provide support for this version as it is not a stable release.

Quick Tutorial

You can start learning about ball and blink from here.

Additional Resources

You can learn more from DeepWiki and Zread

Community

Join our discussion on Slack

Contributors

Thank you for considering contributing to buckyball!

Buckyball Project Structure Overview

Buckyball is a scalable framework for domain-specific architectures. The project adopts a modular design with clear directory responsibilities, supporting a complete toolchain from hardware design to software development.

Main Directory Structure

Core Architecture Module

  • arch/ - Hardware architecture implementation, containing RTL code written in Scala/Chisel
    • Based on Rocket-chip and Chipyard framework
    • Implements custom RoCC coprocessors and memory subsystems
    • Supports various configuration and extension options

Test Verification Module

  • bb-tests/ - Unified test framework
    • workloads/ - Application workload tests
    • customext/ - Custom extension verification
    • sardine/ - Sardine test framework
    • uvbb/ - Unit test suite

Simulation Environment Module

  • sims/ - Simulators and verification environments
    • Supports Verilator, VCS and other simulators
    • Integrates FireSim FPGA accelerated simulation
    • Provides performance analysis and debugging tools

Development Tools Module

  • scripts/ - Build and deployment scripts

    • Environment initialization scripts
    • Automated build tools
    • Dependency management and configuration
  • workflow/ - Development workflows and automation

    • CI/CD pipeline configuration
    • Documentation generation tools
    • Code quality checks

Documentation System

  • docs/ - Project documentation
    • bb-note/ - Technical documentation based on mdBook
    • img/ - Documentation image resources
    • Supports automatic generation and updates

Third-party Dependencies

  • thirdparty/ - External dependency modules (submodules)
    • chipyard/ - Berkeley Chipyard SoC design framework
    • circt/ - CIRCT circuit compiler toolchain

Development Workflow

  1. Environment Setup: Use scripts/init.sh to initialize the development environment
  2. Architecture Development: Perform hardware design and modifications in the arch/ directory
  3. Test Verification: Use test suites in bb-tests/ for functional verification
  4. Simulation Debugging: Perform performance analysis through simulation environments in the sims/ directory
  5. Documentation Updates: Automatically generate or manually update technical documentation in docs/

Build System

The project supports multiple build methods:

  • Make: Traditional Makefile builds
  • SBT: Scala project build tool
  • CMake: Test framework build system
  • Conda: Python environment and dependency management

Version Management Notes

  • Submodules: Modules under thirdparty/ need independent updates
  • Main Repository: Core code and configuration update synchronously with the main branch
  • Documentation: Supports automatic generation, keeping in sync with code changes

Tutorial for buckyball

by Bohan Wang

This document will be gradually updated as the author continues to solve and summarize encountered issues.

This document explains the step-by-step process and problem-solving approaches for a complete buckyball development workflow. As an example, we build a ball operator module that executes the relu() function:

First, we write the hardware for this module, i.e., implement it in Chisel (Scala) and generate the corresponding Verilog code.

Second, we write test software for relu(): a reference function that runs in software on the CPU, and an experimental function that runs on the dedicated hardware built in step one. If the two results match, the test passes; otherwise, proceed to step three to debug.

Third, we simulate at the hardware level and inspect waveforms for debugging. There are also other details, such as compiler documentation changes and instruction set updates, which are explained below.

If you encounter issues during development, you can consult DangoSys/buckyball | DeepWiki or Project Overview - Buckyball Technical Documentation.

Chisel learning resources: binder

Before starting officially, let's initialize the environment:

cd /path/to/buckyball
source env.sh
# source ./env.sh if this gives an error
# All paths in this document are relative paths starting from ./buckyball

I. Writing Chisel Hardware Module

Create a Chisel implementation of the ReLU accelerator in the arch/src/main/scala/prototype/ directory. Following the structure of the existing accelerators, it is recommended to create a new subdirectory under prototype/, for example prototype/relu/Relu.scala, and write the hardware code there (a rough sketch follows).
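
As a rough orientation only, a minimal combinational sketch of the ReLU datapath might look like the following. The module and port names here are illustrative assumptions, not framework code; the real accelerator must also implement the Ball command and SRAM interfaces described in the next sections.

import chisel3._

// Hypothetical sketch: element-wise ReLU over a vector of signed values.
class ReluCore(val lanes: Int = 16, val width: Int = 8) extends Module {
  val io = IO(new Bundle {
    val in  = Input(Vec(lanes, SInt(width.W)))
    val out = Output(Vec(lanes, SInt(width.W)))
  })

  // ReLU: keep positive values, clamp negatives to zero
  for (i <- 0 until lanes) {
    io.out(i) := Mux(io.in(i) > 0.S, io.in(i), 0.S)
  }
}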

II. Hardware Instruction Decoding

Next, add hardware instruction decoding. Support for the ReLU instruction needs to be added on the hardware side so that the hardware decoder recognizes it, and the instruction must be registered for this ball.

This work is mainly divided into the following five aspects:

  • Instruction enumeration (DISA) defines func7 → instruction name (RELU)
  • Decoder (DomainDecoder) defines func7 → decoding rules (read/write/address/iter) → BID (e.g., 4)
  • Bus registration (busRegister) defines BID → actual Ball instance (ReluBall indexed at 4)
  • Reservation station registration (rsRegister) is used for RS/issue descriptions, aligned with the BID, facilitating system issue/completion management and debugging
  • Create a new Ball execution unit class ReluUnit to handle ReLU operations

If any link is missing or inconsistent, the ReLU instruction cannot be correctly recognized, routed, or executed on the actual hardware.

1. Define RELU_BITPAT in DISA.scala

arch/src/main/scala/examples/toy/balldomain/DISA.scala defines the funct7 encoding (BitPat) for Ball instructions, such as TRANSPOSE, IM2COL, etc. It can be viewed as an "instruction set enumeration table" for decoder matching.

Add the bit pattern definition for the ReLU instruction in this file:

val RELU_BITPAT = BitPat("b0100110") // func7 = 38 = 0x26

2. Add ReLU instruction to Ball domain decoder

arch/src/main/scala/examples/toy/balldomain/DomainDecoder.scala is the Ball domain decoder. Its functions are as follows:

  • Input: PostGDCmd from global decoding (already determined to be a Ball category command).
  • Output: Structured BallDecodeCmd, including:
    • Whether to use op1/op2, whether to write back to scratchpad, whether operands come from scratchpad
    • Operand/writeback bank and address
    • Iteration count iter
    • Target Ball ID (BID)
    • Other dedicated fields special, etc.
  • Internally maps different funct7 instructions to a set of boolean switches and field extraction rules through ListLookup(func7, ...).

Add the decoding entry for the ReLU instruction to the decoding list in this file. Referring to the implementation of other instructions (e.g., TRANSPOSE), add:

// Add to BallDecodeFields ListLookup
// Fill in the decoding fields according to the specific ReLU instruction requirements;
// the number of list entries must stay consistent, and the other instructions can be used as a reference.
RELU                 -> List(Y,N,Y,Y,N, rs1(spAddrLen-1,0), 0.U(spAddrLen.W), rs2(spAddrLen-1,0), rs2(spAddrLen + 9,spAddrLen), 7.U, rs2(63,spAddrLen + 10), Y)

3. Add the ReluBall generator and register it

a. arch/src/main/scala/examples/toy/balldomain/bbus/busRegister.scala is the Ball bus registration table, using a Seq(() => new SomeBall(...)) to register the actual Ball modules to be instantiated in the system.

Add ReluBall with its new Ball ID in this file.

class BBusModule(implicit b: CustomBuckyballConfig, p: Parameters)
    extends BBus(
      // Define Ball device generator to register
      Seq(
        () => new examples.toy.balldomain.vecball.VecBall(0),
        () => new examples.toy.balldomain.matrixball.MatrixBall(1),
        () => new examples.toy.balldomain.im2colball.Im2colBall(2),
        () => new examples.toy.balldomain.transposeball.TransposeBall(3),
        ...
        () => new examples.toy.balldomain.reluball.ReluBall(7) // Ball ID 7 - newly added
      )
    ) {
  override lazy val desiredName = "BBusModule"
}

b. arch/src/main/scala/examples/toy/balldomain/rs/rsRegister.scala is the "Ball reservation station" registration table, using a list to register which Balls exist in the system (specifying ID and name by ballId). The reservation station (RS) is responsible for managing Ball issue, occupancy, completion and other metadata, usually also used for visualization/statistics, naming and logging.

Register ReluBall in this file:

class BallRSModule(implicit b: CustomBuckyballConfig, p: Parameters)
    extends BallReservationStation(
      // Define Ball device information to register
      Seq(
        BallRsRegist(ballId = 0, ballName = "VecBall"),
        BallRsRegist(ballId = 1, ballName = "MatrixBall"),
        BallRsRegist(ballId = 2, ballName = "Im2colBall"),
        BallRsRegist(ballId = 3, ballName = "TransposeBall"),
        ...
        BallRsRegist(ballId = 7, ballName = "ReluBall") // Ball ID 7 - newly added
      )
    ) {
  override lazy val desiredName = "BallRSModule"
}

4. Write ReluBall interface file

Create a reluball folder in the arch/src/main/scala/examples/toy/balldomain directory, enter it, and create ReluBall.scala containing the interface code; a rough skeleton is sketched below.
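
For orientation, a skeleton that mirrors the generic Ball template shown later in this document might look like the following. Treat it as an assumption-laden sketch: the exact bundle and trait names (BlinkIO, BallRegist, cmdReq/cmdResp) should be copied from an existing ball such as TransposeBall.

// Hypothetical skeleton only; copy the real interface from an existing *Ball wrapper.
class ReluBall(id: Int)(implicit b: CustomBuckyballConfig, p: Parameters)
  extends Module with BallRegist {
  val io = IO(new BlinkIO)

  def ballId = id.U
  def Blink = io

  // Instantiate the prototype ReLU unit here and connect it to the
  // cmdReq/cmdResp and SRAM read/write interfaces of the Blink protocol.
}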

III. Writing Test Software and Compilation Settings

1. Create test file

Create relu_test.c under bb-tests/workloads/src/CTest/toy/ and write the test code. The core of the test calls void bb_relu(uint32_t op1_addr, uint32_t wr_addr, uint32_t iter); the declaration and definition of this function are covered below.

2. Modify CMakeLists.txt

Add test target in bb-tests/workloads/src/CTest/toy/CMakeLists.txt: CMakeLists.txt:120-127

add_cross_platform_test_target(ctest_relu_test relu_test.c)

And add to the main build target: CMakeLists.txt:137-162

add_custom_target(buckyball-CTest-build ALL DEPENDS
  # ... other tests ...
  ctest_relu_test
  COMMENT "Building all workloads for Buckyball"
  VERBATIM)

3. Need to add ReLU instruction API

a. isa.h

  • Add the declaration for the ReLU instruction in bb-tests/workloads/lib/bbhw/isa/isa.h (isa.h:33-43).

  • Add to the InstructionType enum:

RELU_FUNC7 = 38,  // 0x26 - ReLU function code (or another value you choose)

  • Add to the function declaration section (isa.h:72-73):

void bb_relu(uint32_t op1_addr, uint32_t wr_addr, uint32_t iter);

b. isa.c

  • Add 38_relu.c in bb-tests/workloads/lib/bbhw/isa and implement void bb_relu(uint32_t op1_addr, uint32_t wr_addr, uint32_t iter) in it.

  • Add the dispatch case in bb-tests/workloads/lib/bbhw/isa/isa.c (isa.c:53-76):

case RELU_FUNC7:
	return &relu_config;

  • Add the declaration in isa.c:37-47:

extern const InstructionConfig relu_config;

4. Update CMakeLists.txt

Add compilation and linking of 38_relu.c in all three compilation commands in bb-tests/workloads/lib/bbhw/isa/CMakeLists.txt:

  1. Linux version: Add in COMMAND of add_custom_command:

    && riscv64-unknown-linux-gnu-gcc -c ${CMAKE_CURRENT_SOURCE_DIR}/38_relu.c -march=rv64gc -I${CMAKE_CURRENT_SOURCE_DIR} -I${CMAKE_CURRENT_SOURCE_DIR}/.. -o linux-38_relu.o
    

    And add linux-38_relu.o to the ar rcs command

  2. Baremetal version: Add in COMMAND of add_custom_command:

    && riscv64-unknown-elf-gcc -c ${CMAKE_CURRENT_SOURCE_DIR}/38_relu.c -g -fno-common -O2 -static -march=rv64gc -mcmodel=medany -fno-builtin-printf -D__BAREMETAL__ -I${CMAKE_CURRENT_SOURCE_DIR} -I${CMAKE_CURRENT_SOURCE_DIR}/.. -o baremetal-38_relu.o
    

    And add baremetal-38_relu.o to the ar rcs command

  3. x86 version: Add in COMMAND of add_custom_command:

    && gcc -c ${CMAKE_CURRENT_SOURCE_DIR}/38_relu.c -fPIC -D__x86_64__ -I${CMAKE_CURRENT_SOURCE_DIR} -I${CMAKE_CURRENT_SOURCE_DIR}/.. -o x86-38_relu.o
    

    And add x86-38_relu.o to the ar rcs command

  4. The ISA submodule library defined at the beginning of the file also needs the corresponding 38_relu.c file added.

IV. Test Operation Steps

Step 1: Compile test program

cd bb-tests/build
rm -rf *
cmake -G Ninja ../

Warning: Before executing rm -rf *, make sure you are in the bb-tests/build directory, otherwise forcing deletion in the wrong folder will be catastrophic!

If a disaster does occur, you can pull the original files from GitHub again, but files modified on the server that have not been pushed cannot be recovered.

ninja ctest_relu_test  # software compilation

If ninja ctest_relu_test reports an error, software compilation has failed; please check "III. Writing Test Software and Compilation Settings" and the related files.

bbdev workload --build

This compiles/packages the selected workload sources or configuration into artifacts (such as executables, images, runtime scripts, and input data packages) that the simulation or runtime environment can use, for subsequent runs on the Verilator/simulation platform or on the host side.

Step 2: Generate Verilog

cd buckyball
bbdev verilator --verilog '--config sims.verilator.BuckyballToyVerilatorConfig'

If bbdev verilator --verilog reports an error, hardware compilation has failed; please check the files related to "I. Writing Chisel Hardware Module" and "II. Hardware Instruction Decoding".

Step 3: Run simulation

bbdev verilator --run '--jobs 16 --binary ctest_relu_test_singlecore-baremetal --batch'

If bbdev verilator --run reports an error, the hardware has issues such as timeouts or deadlocks; please check the files related to "I. Writing Chisel Hardware Module".

Step 4: View simulation files

In arch/waveform/<SimulationFileName> (e.g. 2025-10-08-00-03-ctest_vecunit_matmul_random1_singlecore-baremetal), download the waveform.fst file to your local machine using software such as FileZilla, and view it with a local waveform viewer (e.g. GTKWave).

Note that the simulation file folder should only contain the waveform.fst file. If a waveform.fst.hier file exists, it means the simulation failed.

If the waveform does not match theoretical expectations while the software test code is correct, check the files related to "I. Writing Chisel Hardware Module".

To check whether the software code has problems, you can compare against its execution results on the CPU: temporarily remove the hardware accelerator calls from relu_test.c and test only the CPU version.

V. Simulation Waveform

After opening waveform.fst locally in GTKWave, locate the following node in the design hierarchy: TOP.TestHarness.chiptop0.system.tile_prci_domain.element_reset_domain_tile.buckyball.ballDomain.bbus.balls_4.reluUnit. The signals under this node correspond to the hardware in Relu.scala; double-click them to view their waveforms.

The naming in different routines may not be exactly the same, but it is broadly similar.

VI. Performance Testing

Query the number of clock cycles used (a speed performance metric):

cat /home/MikeNotFound/code/buckyball/arch/log/2025-10-24-16-59-ctest_relu_test_singlecore-baremetal/disasm.log | grep "PMC"

  • Preparation

  1. In the /home/<server_name>/bash.sh file, add the required environment variables at the end:

    export SNPSLMD_LICENSE_FILE=27000@amax
    export PATH="$PATH:/opt/riscv/bin"
    export VCS_HOME="/data0/tools/Synopsys/vcs/vcs/W-2024.09-SP1"
    export PATH="$PATH:$VCS_HOME/bin"
    export VERDI_HOME="/data0/tools/Synopsys/verdi/verdi/W-2024.09-SP1"
    export PATH="$PATH:$VERDI_HOME/bin"
    export SCL_HOME="/data0/tools/Synopsys/scl/scl/2024.06"
    export PATH="$PATH:$SCL_HOME/linux64/bin"
    export DC_HOME="/data0/tools/Synopsys/dc/syn/W-2024.09-SP1"
    export PATH="$PATH:$DC_HOME/bin"
    export PT_HOME="/data0/tools/Synopsys/ptpx/prime/W-2024.09-SP1/"
    export PATH="$PATH:$PT_HOME/bin"
    
    export LM_LICENSE_FILE=/data0/tools/Synopsys/lic/Synopsys.dat
    
    alias vcs="vcs -full64"
    alias lmli="lmgrd -c /data0/tools/Synopsys/lic/Synopsys.dat"
    
  2. In the /home/<server_name>/code/buckyball/evals/run-dc.sh file, remove the -retime option around line 126.


  • Formal Test
  1. Go back to the buckyball directory and run the command

    bbdev verilator --verilog "--balltype ReluBall --output_dir ReluBall_1"
    

    This will generate a Verilog folder for the specified ball under the arch directory.

  2. Grant execution permission to the script:

    chmod 777 evals/run-dc.sh
    
  3. Run the DC command:

    ./evals/run-dc.sh --srcdir arch/ReluBall_1 --top ReluBall
    

    This means performing the DC test on the top-level file ReluBall.sv located in the arch/ReluBall_1 folder.

  4. You can find the test results in

    /home/<server_name>/buckyball/bb-tests/output/dc/reports
    

Buckyball Architecture Design Overview

The Buckyball architecture module contains complete hardware design implementations, based on the RISC-V instruction set architecture, developed using the Scala/Chisel hardware description language. The architecture design follows modular and extensible principles, supporting various configurations and custom extensions.

Architecture Hierarchy

System-Level Architecture

Buckyball adopts a layered design consisting of, from top to bottom:

  • SoC Subsystem: Integrates multi-core processors, cache hierarchy, interconnect networks
  • Processor Core: Custom implementation based on Rocket core
  • Coprocessor: Dedicated accelerators supporting RoCC interface
  • Memory Subsystem: High-performance memory controllers and DMA engines

Core Features

  • Configurability: Supports parameter configuration for core count, cache size, bus width, etc.
  • Extensibility: Provides standardized coprocessor interfaces and extension mechanisms
  • Compatibility: Maintains compatibility with the standard RISC-V ecosystem
  • Performance Optimization: Performance-optimized design for specific application scenarios

Directory Structure

arch/
├── src/main/scala/
│   └── framework/          - Buckyball framework core
│       ├── rocket/         - Rocket core custom implementation
│       └── builtin/        - Built-in component library
│           └── memdomain/  - Memory domain implementation
│               ├── mem/    - Memory components
│               └── dma/    - DMA engine
└── thirdparty/            - Third-party dependencies
    └── chipyard/          - Chipyard framework

Design Principles

Modular Design

Each functional module has clear interface definitions and independent implementations, facilitating testing, verification, and reuse. Modules communicate through standardized interfaces, reducing coupling.

Parameterized Configuration

All hardware modules support parameterized configuration, achieving flexible hardware generation through Scala's type system and configuration framework. Configuration parameters include:

  • Data path width
  • Cache size and organization
  • Parallelism and pipeline depth
  • Coprocessor types and quantities
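
As a loose illustration of this idea (the module and parameter names below are assumptions, not framework code), a module can derive its port widths and loop bounds directly from a configuration value, so changing the configuration regenerates different hardware without touching the RTL:

import chisel3._

// Hypothetical example: the lane count is a configuration parameter.
class ConfiguredAdder(veclane: Int = 16) extends Module {
  val io = IO(new Bundle {
    val a   = Input(Vec(veclane, UInt(8.W)))
    val b   = Input(Vec(veclane, UInt(8.W)))
    val sum = Output(Vec(veclane, UInt(8.W)))
  })

  // One adder per configured lane
  for (i <- 0 until veclane) {
    io.sum(i) := io.a(i) + io.b(i)
  }
}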

Performance Optimization

Specialized performance optimizations for target application scenarios:

  • Memory access pattern optimization
  • Data pipeline design
  • Parallel computing support
  • Low-latency communication mechanisms

Development Workflow

  1. Requirement Analysis: Determine performance and functional requirements for target applications
  2. Architecture Design: Select appropriate configuration parameters and extension modules
  3. RTL Implementation: Use Chisel for hardware description and implementation
  4. Functional Verification: Verify functional correctness through unit tests and integration tests
  5. Performance Evaluation: Use simulators and FPGA for performance analysis and optimization

Toolchain Support

  • Chisel/FIRRTL: Hardware description and synthesis toolchain
  • Verilator: Fast simulation and verification
  • VCS: Commercial-grade simulation tools
  • FireSim: FPGA accelerated simulation platform
  • Chipyard: Integrated development environment and toolchain

Buckyball Scala Source Code

This directory contains all Scala/Chisel hardware description language source code for the Buckyball project, implementing hardware architecture design and simulation environments.

Overview

Buckyball uses Scala/Chisel as the hardware description language, built on Berkeley's Rocket-chip and Chipyard frameworks. This directory contains implementations from low-level hardware components to system-level integration.

Main functional modules include:

  • framework: Core framework implementation, including processor core, memory subsystem, bus interconnect, etc.
  • prototype: Prototype implementation of dedicated accelerators
  • examples: Example configurations and reference designs
  • sims: Simulation environment configurations and interfaces
  • Util: General utility classes and helper functions

Code Structure

scala/
├── framework/          - Buckyball core framework
│   ├── blink/          - Blink communication components
│   ├── builtin/        - Built-in hardware components
│   │   ├── frontend/   - Frontend processing components
│   │   ├── memdomain/  - Memory domain implementation
│   │   └── util/       - Framework utility classes
│   └── rocket/         - Rocket core extensions
├── prototype/          - Dedicated accelerator prototypes
│   ├── format/         - Data format processing
│   ├── im2col/         - Image processing acceleration
│   ├── matrix/         - Matrix computation engine
│   ├── transpose/      - Matrix transpose acceleration
│   └── vector/         - Vector processing unit
├── examples/           - Examples and configurations
│   └── toy/            - Toy example system
├── sims/               - Simulation configurations
│   ├── firesim/        - FireSim FPGA simulation
│   └── verilator/      - Verilator simulation
└── Util/               - General utility classes

Module Description

framework/ - Core Framework

Implements Buckyball's core architecture components, including:

  • Processor core and extensions
  • Memory subsystem and cache hierarchy
  • Bus interconnect and communication protocols
  • System configuration and parameterization mechanisms

prototype/ - Accelerator Prototypes

Contains hardware implementations of dedicated computation accelerators:

  • Machine learning accelerators (matrix operations, convolution, etc.)
  • Data processing accelerators (format conversion, transpose, etc.)
  • Vector processing units (SIMD, multi-threading, etc.)

examples/ - Example Configurations

Provides system configuration examples and reference designs:

  • Basic configuration templates
  • Custom extension examples
  • Integration test cases

sims/ - Simulation Environment

Supports multiple simulators and verification environments:

  • Verilator simulation
  • FireSim FPGA simulation
  • Performance analysis and debugging tools

Development Guide

Build System

Buckyball uses Mill as the build tool:

# Compile all modules
mill arch.compile

# Generate Verilog
mill arch.runMain examples.toy.ToyBuckyball

# Run tests
mill arch.test

Code Standards

  • Follow Scala and Chisel coding conventions
  • Use ScalaFmt for code formatting
  • Each module includes documentation and tests
  • Configuration parameterization uses Chipyard Config system

Extension Development

  1. Add new accelerator: Create new module in prototype/ directory
  2. Modify framework: Extend existing components in framework/ directory
  3. Add configuration: Create new configuration files in examples/ directory
  4. Integration testing: Use simulation environments in sims/ directory for verification

Buckyball Utility Library

Overview

This directory contains general utility functions and helper modules in the Buckyball framework, primarily providing reusable hardware design components. Located at arch/src/main/scala/Util, it serves as the base utility layer throughout the architecture, providing common hardware building blocks for other modules.

Main functionality includes:

  • Pipeline: Pipeline control and management tools
  • Common hardware design pattern implementations

Code Structure

Util/
└── Pipeline.scala    - Pipeline control implementation

File Dependencies

Pipeline.scala (Base utility layer)

  • Provides general pipeline control logic
  • Referenced by other modules requiring pipeline functionality
  • Implements standard pipeline interfaces and control signals

Module Description

Pipeline.scala

Main functionality: Provides general pipeline control and management functionality

Key components:

class Pipeline extends Module {
  val io = IO(new Bundle {
    val flush = Input(Bool())
    val stall = Input(Bool())
    val valid_in = Input(Bool())
    val ready_out = Output(Bool())
    val valid_out = Output(Bool())
  })

  // Pipeline control logic
  val pipeline_valid = RegInit(false.B)

  when(io.flush) {
    pipeline_valid := false.B
  }.elsewhen(!io.stall) {
    pipeline_valid := io.valid_in
  }

  io.ready_out := !io.stall
  io.valid_out := pipeline_valid && !io.flush
}

Pipeline control signals:

  • flush: Pipeline flush signal, clears all pipeline stages
  • stall: Pipeline stall signal, maintains current state
  • valid_in: Input data valid signal
  • ready_out: Ready to receive new data signal
  • valid_out: Output data valid signal

Inputs/Outputs:

  • Input: Control signals (flush, stall) and data valid signal
  • Output: Pipeline state and data valid indication
  • Edge cases: flush has higher priority than stall, ensuring correct pipeline behavior

Dependencies: Chisel3 base library, standard Module and Bundle interfaces

Usage

Integrating pipeline control:

class MyModule extends Module {
  val pipeline = Module(new Pipeline)

  // Connect control signals
  pipeline.io.flush := flush_condition
  pipeline.io.stall := stall_condition
  pipeline.io.valid_in := input_valid

  // Use pipeline output
  val output_enable = pipeline.io.valid_out
}

Design Patterns

Pipeline cascading:

  • Supports cascaded connection of multi-stage pipelines
  • Provides standard ready/valid handshake protocol
  • Ensures correctness and timing of data flow

Backpressure handling (see the sketch after this list):

  • Implements standard backpressure propagation mechanism
  • Supports pause and resume of upstream modules
  • Guarantees no data loss or duplication
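
A minimal sketch of these two patterns, assuming the Pipeline module defined above and an externally supplied consumer_ready signal (an assumption for illustration), is shown below: backpressure propagates upstream by driving each stage's stall from the readiness of the stage behind it.

import chisel3._

class TwoStagePipeline extends Module {
  val io = IO(new Bundle {
    val valid_in       = Input(Bool())
    val flush          = Input(Bool())
    val consumer_ready = Input(Bool())   // readiness of the downstream consumer
    val ready_out      = Output(Bool())
    val valid_out      = Output(Bool())
  })

  val stage0 = Module(new Pipeline)
  val stage1 = Module(new Pipeline)

  // Backpressure: stage 1 stalls when the consumer is not ready,
  // and stage 0 stalls when stage 1 cannot accept new data.
  stage1.io.stall := !io.consumer_ready
  stage0.io.stall := !stage1.io.ready_out

  stage0.io.flush    := io.flush
  stage1.io.flush    := io.flush
  stage0.io.valid_in := io.valid_in
  stage1.io.valid_in := stage0.io.valid_out

  io.ready_out := stage0.io.ready_out
  io.valid_out := stage1.io.valid_out
}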

Notes

  1. Timing constraints: flush signal should be asserted synchronously at clock rising edge
  2. Reset behavior: Pipeline should clear all valid bits on reset
  3. Combinational logic: ready signal is combinational logic, avoid timing path issues
  4. Extensibility: Design supports parameterized pipeline depth and data width

Buckyball Framework Core

Overview

This directory contains the core implementation of the Buckyball framework, serving as the foundation layer for the entire hardware architecture. Located at arch/src/main/scala/framework, it provides a complete implementation of processor cores, built-in components, and system interconnects.

Main functional modules include:

  • builtin: Built-in hardware component library, including memory domain and frontend modules
  • blink: System interconnect and communication framework

Code Structure

framework/
├── builtin/          - Built-in component library
│   ├── memdomain/    - Memory domain implementation
│   │   ├── dma/      - DMA engines (BBStreamReader/Writer)
│   │   ├── mem/      - Memory components (Scratchpad, Accumulator, SRAM banks)
│   │   ├── rs/       - Memory domain reservation station
│   │   ├── tlb/      - TLB implementation
│   │   ├── MemController.scala  - Memory controller
│   │   ├── MemDomain.scala      - Memory domain top-level
│   │   ├── MemLoader.scala      - Load instruction handler
│   │   └── MemStorer.scala      - Store instruction handler
│   ├── frontend/     - Frontend components
│   │   ├── GobalDecoder.scala   - Global instruction decoder
│   │   ├── globalrs/            - Global reservation station
│   │   │   ├── GlobalReservationStation.scala
│   │   │   └── GlobalROB.scala  - Global reorder buffer
│   │   └── rs/                  - Ball domain reservation station
│   ├── util/         - Framework utility functions
│   └── BaseConfigs.scala - Base configuration parameters
└── blink/            - System interconnect framework
    ├── baseball.scala    - Ball device base trait
    ├── blink.scala       - Blink protocol definitions
    └── bbus.scala        - Ball bus implementation

Module Dependencies

Application Layer → builtin components → blink interconnect → Physical interface
                        ↓                    ↓
                   Memory domain        Ball protocol
                   Frontend             System bus

Module Details

builtin/ - Built-in Component Library

Main Function: Provides standardized hardware component implementations

Component Categories:

memdomain/ - Memory Domain

The memory domain encapsulates all memory-related functionality:

Key Components:

  • MemDomain.scala: Top-level memory domain module

    • Integrates MemController, MemLoader, MemStorer, and TLB
    • Provides unified interface to Global RS
    • Handles both load and store operations
  • MemController.scala: Memory controller

    • Encapsulates Scratchpad and Accumulator
    • Provides DMA and Ball Domain interfaces
    • Handles bank arbitration and routing
  • MemLoader.scala: Load instruction handler

    • Receives load instructions from reservation station
    • Issues DMA read requests
    • Writes data to Scratchpad/Accumulator
  • MemStorer.scala: Store instruction handler

    • Receives store instructions from reservation station
    • Reads data from Scratchpad/Accumulator
    • Issues DMA write requests with data alignment and masking
  • dma/: DMA engines

    • BBStreamReader: Streaming DMA read with TLB support
    • BBStreamWriter: Streaming DMA write with alignment handling
    • Transaction ID management for multiple outstanding requests
  • mem/: Memory components

    • Scratchpad.scala: 4-bank scratchpad memory (256KB total)
    • AccBank.scala: Accumulator bank with accumulation pipeline
    • SramBank.scala: Generic single-port SRAM bank implementation
  • rs/: Memory domain reservation station

    • reservationStation.scala: Local FIFO-based scheduler
    • rob.scala: Local reorder buffer for memory instructions
    • ringFifo.scala: Circular FIFO implementation
  • tlb/: Translation Lookaside Buffer

    • Virtual to physical address translation
    • Integrated with DMA engines

frontend/ - Frontend Components

The frontend handles global instruction management:

Key Components:

  • GobalDecoder.scala: Global instruction decoder

    • Classifies instructions into Ball/Memory/Fence types
    • Constructs PostGDCmd for domain-specific decoders
    • Interfaces with Global RS
  • globalrs/: Global reservation station

    • GlobalReservationStation.scala: Central instruction manager
      • Allocates ROB entries
      • Issues instructions to Ball and Memory domains
      • Handles instruction completion from both domains
      • Manages Fence instruction synchronization
    • GlobalROB.scala: Global reorder buffer
      • Tracks instruction state across domains
      • Supports out-of-order completion
      • Sequential commit of completed instructions
  • rs/: Ball domain reservation station

    • reservationStation.scala: Ball-specific scheduler
    • rob.scala: Local ROB for Ball instructions

util/ - Framework Utilities

Common utility functions and helper modules

BaseConfigs.scala

Configuration Parameters:

case class BaseConfig(
  veclane: Int = 16,              // Vector lane width
  accveclane: Int = 4,            // Accumulator vector lane width
  rob_entries: Int = 16,          // Number of ROB entries
  rs_out_of_order_response: Boolean = true,  // Out-of-order response support
  sp_banks: Int = 4,              // Scratchpad bank count
  acc_banks: Int = 8,             // Accumulator bank count
  sp_capacity: BuckyballMemCapacity = CapacityInKilobytes(256),
  acc_capacity: BuckyballMemCapacity = CapacityInKilobytes(64),
  spAddrLen: Int = 15,            // SPAD address length
  memAddrLen: Int = 32,           // Memory address length
  numVecPE: Int = 16,             // Vector PEs per thread
  numVecThread: Int = 16,         // Vector threads
  emptyBallid: Int = 5            // Empty ball ID
)
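
For example (illustrative only), a smaller design point could be obtained by overriding a few fields of this case class:

// Hypothetical configuration override using the defaults shown above
val smallConfig = BaseConfig(
  veclane     = 8,
  sp_banks    = 2,
  sp_capacity = CapacityInKilobytes(128)
)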

blink/ - System Interconnect Framework

Main Function: Implements system-level interconnect and Ball protocol

Key Components:

  • baseball.scala: Ball device base trait

    • Defines BallRegist trait for Ball device registration
    • Provides common interface for all Ball devices
  • blink.scala: Blink protocol definitions

    • Command/response interfaces
    • Status and control signals
    • SRAM read/write interfaces
  • bbus.scala: Ball bus implementation (BBus)

    • Manages multiple Ball device connections
    • Command router: Routes commands to appropriate Ball devices
    • Bus router: Arbitrates Ball device responses
    • Memory router: Handles memory access arbitration
    • Performance monitoring counters

Interconnect Features:

  • Support for multiple bus protocols
  • Arbitration and routing functionality
  • Latency and bandwidth management
  • Dynamic Ball device registration

Usage Guide

Framework Integration

Configuration System:

class BuckyballConfig extends Config(
  new WithBuiltinComponents ++
  new WithBlinkInterconnect ++
  new BaseConfig
)

Module Instantiation:

class BuckyballSystem(implicit p: Parameters) extends LazyModule {
  // Memory domain
  val memdomain = Module(new MemDomain)

  // Ball domain
  val balldomain = Module(new BallDomain)

  // Global RS
  val globalRS = Module(new GlobalReservationStation)

  // Connect modules
  balldomain.io.issue <> globalRS.io.ballIssue
  memdomain.io.issue <> globalRS.io.memIssue
  globalRS.io.ballComplete <> balldomain.io.complete
  globalRS.io.memComplete <> memdomain.io.complete
}

Extension Development

Adding New Components:

  1. Create new component module in builtin directory
  2. Implement standard Module interface
  3. Register in configuration system
  4. Update interconnect and routing logic

Custom Ball Device:

  1. Extend BallRegist trait
  2. Implement Blink protocol interfaces
  3. Register in BBus
  4. Add to Ball RS device list

Design Principles

  1. Parameter Passing: Use Chipyard's Parameters system for configuration
  2. Clock Domains: Pay attention to clock domain crossing between modules
  3. Reset Strategy: Ensure proper reset sequencing and dependencies
  4. Performance Optimization: Focus on critical paths and timing constraints
  5. Debug Support: Integrate necessary debug and monitoring interfaces
  6. Memory Access: Respect bank access constraints (op1 and op2 cannot access same bank)
  7. Handshake Protocols: Use ready/valid handshake for all data transfers

Architecture Highlights

Instruction Flow

RoCC → Global Decoder → Global RS → Ball Domain / Mem Domain
                          ↓                ↓            ↓
                      Global ROB    Ball Decoder  Mem Decoder
                   (tracks state)       ↓            ↓
                                   Ball Devices  Loader/Storer
                                        ↓            ↓
                                   MemController ← → MemController

Memory Access Flow

Ball Devices ──→ MemController ──→ Scratchpad (4 banks)
                      │           └→ Accumulator (8 banks)
                      │
Mem Domain    ──→ MemController
  (Loader/Storer)     │
                      ↓
                  DMA + TLB
                      ↓
                 Main Memory

Performance Considerations

  1. ROB Size: 16 entries support up to 16 in-flight instructions
  2. Bank Parallelism: 4 scratchpad + 8 accumulator banks enable parallel access
  3. Out-of-Order Execution: Global RS supports out-of-order completion when enabled
  4. DMA Bandwidth: 128-bit bus width provides high memory bandwidth
  5. Pipeline Depth: Multi-stage pipeline allows high clock frequency

Common Issues and Solutions

Issue: Instructions stall in Global RS

  • Solution: Check ROB capacity and completion signals from domains

Issue: Memory access conflicts

  • Solution: Ensure op1 and op2 don't access same bank, respect bank boundaries

Issue: DMA timeout

  • Solution: Verify TLB configuration and page table walker connectivity

Issue: Ball device not responding

  • Solution: Check Ball device registration in BBus and RS device list

Buckyball Prototype Accelerators

This directory contains prototype implementations of various domain-specific computation accelerators in the Buckyball framework, covering hardware accelerator designs for machine learning, numerical computation, and data processing domains.

Directory Structure

prototype/
├── format/      - Data format conversion accelerators
├── im2col/      - Image-to-column transformation accelerator
├── matrix/      - Matrix computation accelerators
├── relu/        - ReLU activation accelerator
├── transpose/   - Matrix transpose accelerator
└── vector/      - Vector processing unit

Accelerator Components

format/ - Data Format Processing

Implements hardware acceleration for various data format conversions and arithmetic operations:

  • Arithmetic.scala: Custom arithmetic operation units
  • Dataformat.scala: Data format conversion and encoding

Key Features:

  • Support for multiple data formats (INT8, FP16, FP32, BBFP)
  • Abstract arithmetic interface for extensibility
  • Concrete implementations for different data types

Use Cases:

  • Floating-point format conversion
  • Fixed-point arithmetic optimization
  • Data compression and decompression
  • Mixed-precision computation

im2col/ - Image Processing Acceleration

Specialized accelerator for im2col operations in convolutional neural networks:

  • im2col.scala: Hardware implementation of image-to-column matrix transformation

Key Features:

  • Configurable kernel size and stride
  • Efficient data reorganization for convolution
  • Pipeline-based processing for high throughput
  • Support for different input dimensions

Use Cases:

  • CNN convolution layer acceleration
  • Image preprocessing pipeline
  • Feature extraction optimization
  • Memory-efficient convolution implementation

matrix/ - Matrix Computation Engine

Matrix computation accelerator implementation with multiple modules:

Core Components:

  • bbfpIns_decode.scala: Instruction decoder for matrix operations
  • bbfp_load.scala: Data loading unit for matrix operands
  • bbfp_ex.scala: Execution unit for matrix multiplication
  • bbfp_pe.scala: Processing Element (PE) array implementation
  • bbfp_control.scala: Control logic for matrix operations

PE Array Architecture:

  • BBFP_PE: Individual processing element with weight stationary mode
  • BBFP_PE_Array2x2: 2×2 PE array building block
  • BBFP_PE_Array16x16: 16×16 PE array for high-performance computing
  • Systolic array dataflow for efficient matrix multiplication

Supported Formats:

  • INT8 integer arithmetic
  • FP16 half-precision floating-point
  • FP32 single-precision floating-point
  • BBFP (Brain Floating Point) custom format

Use Cases:

  • Deep learning training and inference
  • Scientific computing acceleration
  • Linear algebra operations
  • High-performance GEMM operations

relu/ - ReLU Activation

Efficient hardware implementation of ReLU (Rectified Linear Unit) activation:

  • Relu.scala: Pipelined ReLU accelerator

Key Features:

  • Element-wise ReLU computation
  • Configurable tile size
  • Pipeline-based processing
  • Integrated with scratchpad memory

Use Cases:

  • Neural network activation layers
  • Non-linear transformation
  • Post-convolution activation

transpose/ - Matrix Transpose

Efficient hardware implementation for matrix transpose operations:

  • Transpose.scala: Matrix transpose accelerator

Key Features:

  • Tile-based transpose for large matrices
  • Optimized memory access patterns
  • Configurable tile size
  • Pipeline-based implementation

Use Cases:

  • Matrix operation preprocessing
  • Data reorganization and transformation
  • Memory access pattern optimization
  • Transpose in GEMM operations

vector/ - Vector Processing Unit

Vector processing architecture supporting SIMD and multi-threading:

Core Components:

  • VecUnit.scala: Vector processor top-level module
  • VecCtrlUnit.scala: Vector control unit for instruction dispatch
  • VecLoadUnit.scala: Vector load unit for data fetching
  • VecEXUnit.scala: Vector execution unit with multiple functional units
  • VecStoreUnit.scala: Vector store unit for result write-back

Submodules:

  • bond/: Binding and synchronization mechanisms

    • Various bond types (VSSBond, VVVBond, VSVBond, VVSBond, VVBond)
    • Operand routing and data distribution
  • op/: Vector operation implementations

    • AddOp, MulOp, CascadeOp, SelectOp, etc.
    • Arithmetic and logical operations
  • thread/: Multi-threading support

    • Thread-level parallelism
    • Warp-based execution model
  • warp/: Thread bundle management (MeshWarp)

    • 16×16 PE mesh for vector operations
    • Parallel execution of vector instructions

Architecture Highlights:

  • Configurable number of PEs and threads
  • Support for various vector operations (add, mul, cascade, select)
  • Flexible data routing through bond mechanisms
  • High parallelism with warp-level execution

Use Cases:

  • Parallel numerical computation
  • Signal processing acceleration
  • High-performance computing applications
  • SIMD-style data processing

Design Features

Modular Design

Each accelerator adopts modular design for:

  • Independent development and testing
  • Flexible composition and configuration
  • Performance tuning and extension
  • Easy integration with Buckyball framework

Pipeline Architecture

Most accelerators use deep pipeline design:

  • Improved throughput and frequency
  • Support for continuous data stream processing
  • Optimized resource utilization
  • Latency hiding through pipelining

Configurable Parameters

Support rich configuration parameters:

  • Data width and precision
  • Parallelism and pipeline depth
  • Cache size and organization
  • Interface protocol and timing

Integration Method

All Ball accelerators implement the Blink protocol interface:

class CustomBall(implicit b: CustomBuckyballConfig, p: Parameters)
  extends Module with BallRegist {
  val io = IO(new BlinkIO)
  def ballId = <unique_id>.U
  def Blink = // Implement Blink protocol
}

Blink Interface Components:

  • cmdReq: Command request interface with rob_id tracking
  • cmdResp: Command response interface for completion signaling
  • status: Status signals (ready, valid, idle, complete)
  • sramRead/Write: SRAM interfaces for scratchpad and accumulator access

Memory Interface

Support multiple memory access patterns:

  • DMA bulk transfer through MemDomain
  • Scratchpad direct access for low-latency operations
  • Accumulator access for result accumulation
  • Bank-aware memory access (op1 and op2 must access different banks)

Configuration Integration

Parameterized through Buckyball configuration system:

case class BaseConfig(
  veclane: Int = 16,        // Vector lane width
  numVecPE: Int = 16,       // Number of vector PEs
  numVecThread: Int = 16,   // Number of vector threads
  // ... more parameters
)

Performance Optimization

Data Locality

  • Optimize data access patterns for spatial and temporal locality
  • Reduce memory bandwidth requirements through data reuse
  • Improve cache hit rate with tile-based processing
  • Scratchpad memory for frequently accessed data

Parallel Processing

  • Multi-level parallelism design
    • Instruction-level parallelism (ILP) through pipelining
    • Data-level parallelism (DLP) through vector operations
    • Thread-level parallelism (TLP) through multiple warps
  • Pipeline parallelism for continuous data flow
  • Data parallelism through PE arrays

Resource Sharing

  • Arithmetic unit reuse across different operations
  • Storage resource sharing between modules
  • Control logic optimization for area efficiency
  • Flexible routing for resource utilization

Verification and Testing

Each accelerator comes with corresponding test cases:

  • Functional correctness verification
  • Performance benchmark testing
  • Boundary condition checking
  • Random test generation
  • Integration testing with complete system

Development Guidelines

Adding New Accelerators

Steps:

  1. Implement Ball device with BallRegist trait
  2. Define Blink protocol interfaces
  3. Implement computation logic
  4. Add SRAM access logic (respect bank constraints)
  5. Register in BBus and Ball RS

Example Template:

class NewBall(implicit b: CustomBuckyballConfig, p: Parameters)
  extends Module with BallRegist {
  val io = IO(new BlinkIO)

  def ballId = <unique_id>.U
  def Blink = io

  // State machine
  val sIdle :: sCompute :: sComplete :: Nil = Enum(3)
  val state = RegInit(sIdle)

  // Computation logic
  switch(state) {
    is(sIdle) {
      when(io.cmdReq.fire) {
        state := sCompute
      }
    }
    is(sCompute) {
      // Perform computation
      when(done) {
        state := sComplete
      }
    }
    is(sComplete) {
      io.cmdResp.valid := true.B
      state := sIdle
    }
  }
}

Performance Optimization Tips

  1. Memory Access:

    • Group memory accesses to same bank
    • Use streaming access patterns
    • Minimize random access
  2. Pipeline Design:

    • Balance pipeline stages
    • Add registers for timing closure
    • Use buffering for throughput
  3. Resource Utilization:

    • Share expensive resources (multipliers, dividers)
    • Use LUTs for simple operations
    • Optimize control logic

Common Pitfalls

  1. Bank Conflict: op1 and op2 accessing same bank - violates design constraint
  2. ROB ID Tracking: Must forward rob_id from request to response
  3. Ready/Valid Protocol: Carefully implement handshake to avoid deadlock
  4. Iteration Count: Properly handle iteration for multi-row operations

Future Enhancements

Potential areas for extension:

  • Support for additional data formats (INT4, BF16)
  • Advanced matrix operations (SVD, QR decomposition)
  • Fused operations (Conv+ReLU, GEMM+BiasAdd)
  • Dynamic reconfiguration for different workloads
  • Power management and clock gating
  • Advanced synchronization mechanisms

Data Format Processing Module

Overview

This directory implements data format definitions and arithmetic operation abstractions in Buckyball, providing a unified data type processing interface. Located at arch/src/main/scala/prototype/format, it serves as the data format layer, providing type-safe data format support for other prototype accelerators.

Core components:

  • Dataformat.scala: Data format definitions and factory classes
  • Arithmetic.scala: Arithmetic operation type class implementations

Code Structure

format/
├── Dataformat.scala  - Data format definitions
└── Arithmetic.scala  - Arithmetic operation abstractions

File Dependencies

Dataformat.scala (Format definition layer)

  • Defines DataFormat abstract class and concrete format implementations
  • Provides DataFormatFactory factory class
  • Implements DataFormatParams parameter class

Arithmetic.scala (Operation abstraction layer)

  • Defines Arithmetic type class interface
  • Implements UIntArithmetic concrete operations
  • Provides ArithmeticFactory factory class

Module Description

Dataformat.scala

Main functionality: Defines supported data format types

Format definition:

abstract class DataFormat {
  def width: Int
  def dataType: Data
  def name: String
}

Supported formats:

class INT8Format extends DataFormat {
  override def width: Int = 8
  override def dataType: Data = UInt(8.W)
  override def name: String = "INT8"
}

class FP16Format extends DataFormat {
  override def width: Int = 16
  override def dataType: Data = UInt(16.W)
  override def name: String = "FP16"
}

class FP32Format extends DataFormat {
  override def width: Int = 32
  override def dataType: Data = UInt(32.W)
  override def name: String = "FP32"
}

Factory class:

object DataFormatFactory {
  def create(formatType: String): DataFormat = formatType.toUpperCase match {
    case "INT8" => new INT8Format
    case "FP16" => new FP16Format
    case "FP32" => new FP32Format
    case _ => throw new IllegalArgumentException(...)
  }
}

Parameter class:

case class DataFormatParams(formatType: String = "INT8") {
  def format: DataFormat = DataFormatFactory.create(formatType)
  def width: Int = format.width
  def dataType: Data = format.dataType
}

Arithmetic.scala

Main functionality: Provides type-safe arithmetic operation abstractions

Type class definition:

abstract class Arithmetic[T <: Data] {
  def add(x: T, y: T): T
  def sub(x: T, y: T): T
  def mul(x: T, y: T): T
  def div(x: T, y: T): T
  def gt(x: T, y: T): Bool
}

UInt implementation:

class UIntArithmetic extends Arithmetic[UInt] {
  override def add(x: UInt, y: UInt): UInt = x + y
  override def sub(x: UInt, y: UInt): UInt = x - y
  override def mul(x: UInt, y: UInt): UInt = x * y
  override def div(x: UInt, y: UInt): UInt = Mux(y =/= 0.U, x / y, 0.U)
  override def gt(x: UInt, y: UInt): Bool = x > y
}

Factory class:

object ArithmeticFactory {
  def createArithmetic[T <: Data](dataType: T): Arithmetic[T] = {
    dataType match {
      case _: UInt => new UIntArithmetic().asInstanceOf[Arithmetic[T]]
      case _ => throw new IllegalArgumentException(...)
    }
  }
}

Usage
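
A minimal sketch of how these pieces might be combined inside a module, assuming the classes above are in scope (the module name AdderLane is illustrative):

import chisel3._

class AdderLane(formatType: String = "INT8") extends Module {
  // Resolve width and Chisel data type from the format parameters
  val params = DataFormatParams(formatType)

  val io = IO(new Bundle {
    val a   = Input(UInt(params.width.W))
    val b   = Input(UInt(params.width.W))
    val sum = Output(UInt(params.width.W))
  })

  // Pick the arithmetic implementation for the underlying data type (UIntArithmetic here)
  val arith = ArithmeticFactory.createArithmetic(UInt(params.width.W))
  io.sum := arith.add(io.a, io.b)
}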

Notes

  1. Floating-point support: FP16 and FP32 currently use UInt representation, can be extended to true floating-point types later
  2. Division by zero protection: UInt division operation includes division-by-zero check, returns 0 as default value
  3. Type safety: Uses Scala type system to ensure operation type safety
  4. Extensibility: Factory pattern supports adding new data formats and arithmetic implementations
  5. Parameterization: DataFormatParams provides convenient parameterized configuration interface

Im2col Image Processing Accelerator

Overview

This directory implements Buckyball's Im2col operation accelerator for image-to-column matrix conversion in convolutional neural networks. Located at arch/src/main/scala/prototype/im2col, it serves as an image processing accelerator that converts convolution operations to matrix multiplication operations to improve computational efficiency.

Core components:

  • im2col.scala: Im2col accelerator main implementation

Code Structure

im2col/
└── im2col.scala  - Im2col accelerator implementation

Module Responsibilities

Im2col.scala (Accelerator implementation layer)

  • Implements image-to-column matrix conversion logic
  • Manages SRAM read/write operations
  • Provides Ball domain command interface

Module Description

im2col.scala

Main functionality: Implements sliding convolution window and data rearrangement

State machine definition:

val idle :: read :: read_and_convert :: complete :: Nil = Enum(4)
val state = RegInit(idle)

Key registers:

val ConvertBuffer = RegInit(VecInit(Seq.fill(4)(VecInit(Seq.fill(b.veclane)(0.U(b.inputType.getWidth.W))))))
val rowptr = RegInit(0.U(10.W))    // Convolution window top-left row pointer
val colptr = RegInit(0.U(5.W))     // Convolution window top-left column pointer
val krow_reg = RegInit(0.U(log2Up(b.veclane).W))  // Convolution kernel row count
val kcol_reg = RegInit(0.U(log2Up(b.veclane).W))  // Convolution kernel column count

Command parsing:

when(io.cmdReq.fire) {
  rowptr := io.cmdReq.bits.cmd.special(37,28)      // Start row
  colptr := io.cmdReq.bits.cmd.special(27,23)      // Start column
  kcol_reg := io.cmdReq.bits.cmd.special(3,0)      // Convolution kernel column count
  krow_reg := io.cmdReq.bits.cmd.special(7,4)      // Convolution kernel row count
  incol_reg := io.cmdReq.bits.cmd.special(12,8)    // Input matrix column count
  inrow_reg := io.cmdReq.bits.cmd.special(22,13)   // Input matrix row count
}

Data conversion logic:

// Fill window data
for (i <- 0 until 4; j <- 0 until 4) {
  when(i.U < krow_reg && j.U < kcol_reg) {
    val bufferRow = (rowcnt + i.U) % krow_reg
    val bufferCol = (colptr + j.U) % incol_reg
    window((i.U * kcol_reg) + j.U) := ConvertBuffer(bufferRow)(bufferCol)
  }.otherwise {
    window((i.U * kcol_reg) + j.U) := 0.U
  }
}

SRAM interface:

val io = IO(new Bundle {
  val cmdReq = Flipped(Decoupled(new BallRsIssue))
  val cmdResp = Decoupled(new BallRsComplete)
  val sramRead = Vec(b.sp_banks, Flipped(new SramReadIO(...)))
  val sramWrite = Vec(b.sp_banks, Flipped(new SramWriteIO(...)))
})

Processing flow:

  1. idle: Wait for command, parse convolution parameters
  2. read: Read initial convolution kernel-sized data into buffer
  3. read_and_convert: Slide window, convert data and write back
  4. complete: Send completion signal

Inputs/Outputs:

  • Input: Ball domain commands containing convolution parameters and address information
  • Output: Converted column matrix data, completion signal
  • Edge cases: Fill zero values when handling boundaries

Usage

Algorithm Principle

Im2col conversion: Convert convolution operation to matrix multiplication

  • Input: H×W image, K×K convolution kernel
  • Output: (H-K+1)×(W-K+1) windows of size K×K, expanded as column vectors
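
For example, a 5×5 input with a 3×3 kernel yields (5−3+1)×(5−3+1) = 9 window positions; each window is flattened into a 9-element column, so the converted output is a 9×9 matrix.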

Sliding window:

  • Slide convolution window in row-major order
  • Each window position generates a column vector
  • Uses circular buffer to optimize memory access

Notes

  1. Buffer management: Uses 4×veclane conversion buffer to store window data
  2. Boundary handling: Fill zero values for positions beyond image boundaries
  3. Address calculation: Supports configurable start address and bank selection
  4. Pipeline optimization: Prefetch next row read requests during conversion
  5. Parameter limitation: Maximum support for 4×4 convolution kernel size

Matrix Computation Accelerator

Overview

This directory implements Buckyball's matrix computation accelerator for matrix multiplication and related operations. Located at arch/src/main/scala/prototype/matrix, it serves as a matrix computation accelerator supporting multiple data formats and operation modes.

Core components:

  • bbfp_control.scala: Matrix computation controller
  • bbfp_pe.scala: Processing Element (PE) and MAC unit
  • bbfp_buffer.scala: Data buffer management
  • bbfp_load.scala: Data load unit
  • bbfp_ex.scala: Execution unit
  • bbfpIns_decode.scala: Instruction decoder

Code Structure

matrix/
├── bbfp_control.scala   - Controller main module
├── bbfp_pe.scala        - Processing element implementation
├── bbfp_buffer.scala    - Buffer management
├── bbfp_load.scala      - Load unit
├── bbfp_ex.scala        - Execution unit
└── bbfpIns_decode.scala - Instruction decode

File Dependencies

bbfp_control.scala (Controller layer)

  • Integrates submodules (ID, LU, EX, etc.)
  • Manages SRAM and Accumulator interfaces
  • Handles Ball domain commands

bbfp_pe.scala (Computation core layer)

  • Implements MacUnit multiply-accumulate unit
  • Defines PEControl control signals
  • Handles signed/unsigned operations

Other modules (Functional support layer)

  • Provides data buffering, loading, execution and other support functions

Module Description

bbfp_control.scala

Main functionality: Top-level control module for matrix computation accelerator

Module integration:

class BBFP_Control extends Module {
  val BBFP_ID = Module(new BBFP_ID)
  val ID_LU = Module(new ID_LU)
  val BBFP_LoadUnit = Module(new BBFP_LoadUnit)
  val LU_EX = Module(new LU_EX)
}

Interface definition:

val io = IO(new Bundle {
  val cmdReq = Flipped(Decoupled(new BallRsIssue))
  val cmdResp = Decoupled(new BallRsComplete)
  val is_matmul_ws = Input(Bool())
  val sramRead = Vec(b.sp_banks, Flipped(new SramReadIO(...)))
  val sramWrite = Vec(b.sp_banks, Flipped(new SramWriteIO(...)))
  val accRead = Vec(b.acc_banks, Flipped(new SramReadIO(...)))
  val accWrite = Vec(b.acc_banks, Flipped(new SramWriteIO(...)))
})

Data flow:

cmdReq → BBFP_ID → ID_LU → BBFP_LoadUnit → LU_EX
                              ↓
                         SRAM/ACC interface

bbfp_pe.scala

Main functionality: Implements basic processing element for matrix computation

MAC unit definition:

class MacUnit extends Module {
  val io = IO(new Bundle {
    val in_a = Input(UInt(7.W))    // [6]=sign, [5]=flag, [4:0]=value
    val in_b = Input(UInt(7.W))    // [6]=sign, [5]=flag, [4:0]=value
    val in_c = Input(UInt(32.W))   // [31]=sign, [30:0]=value
    val out_d = Output(UInt(32.W)) // Output result
  })
}

Data format processing:

// Extract sign bit and value
val sign_a = io.in_a(6)
val sign_b = io.in_b(6)
val flag_a = io.in_a(5)
val flag_b = io.in_b(5)
val value_a = io.in_a(4, 0)
val value_b = io.in_b(4, 0)

// Determine left shift based on flag bit
val shifted_a = Mux(flag_a === 1.U, value_a << 2, value_a)
val shifted_b = Mux(flag_b === 1.U, value_b << 2, value_b)

Signed arithmetic:

val a_signed = Mux(sign_a === 1.U, -(shifted_a.zext), shifted_a.zext).asSInt
val b_signed = Mux(sign_b === 1.U, -(shifted_b.zext), shifted_b.zext).asSInt

Control signals:

class PEControl extends Bundle {
  val propagate = UInt(1.W)   // Propagation control
}

Usage

Data Format

Input format: 7-bit compressed format

  • bit[6]: Sign bit (0=positive, 1=negative)
  • bit[5]: Flag bit (1=left shift by 2)
  • bit[4:0]: 5-bit value

Output format: 32-bit signed number

  • bit[31]: Sign bit
  • bit[30:0]: 31-bit value
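
As a sanity check on this format, a small software decoder (plain Scala, not part of the hardware) behaves as follows:

// Decode one 7-bit compressed operand: bit 6 = sign, bit 5 = flag (shift left
// by 2 when set), bits 4:0 = 5-bit magnitude.
def decode7(x: Int): Int = {
  val sign      = (x >> 6) & 1
  val flag      = (x >> 5) & 1
  val magnitude = if (flag == 1) (x & 0x1f) << 2 else x & 0x1f
  if (sign == 1) -magnitude else magnitude
}

// Example: 0b1100011 (sign=1, flag=1, value=3) decodes to -(3 << 2) = -12
assert(decode7(Integer.parseInt("1100011", 2)) == -12)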

Operation Characteristics

MAC operation: Multiply-Accumulate operation

  • Supports signed and unsigned operations
  • Configurable shift operations
  • 32-bit accumulator output

Pipeline structure:

  • ID: Instruction decode stage
  • LU: Load unit stage
  • EX: Execution unit stage

Notes

  1. Data format: Uses custom 7-bit compressed format to reduce storage overhead
  2. Sign handling: Supports correct signed number operations and sign extension
  3. Shift optimization: Controls data preprocessing shift through flag bit
  4. Interface compatibility: Fully compatible with SRAM and Accumulator interfaces
  5. Pipeline design: Multi-stage pipeline improves throughput

Matrix Transpose Accelerator

Overview

This directory, located at arch/src/main/scala/prototype/transpose, implements Buckyball's matrix transpose accelerator, supporting pipelined transpose operations.

Core components:

  • Transpose.scala: Pipelined transposer implementation

Code Structure

transpose/
└── Transpose.scala  - Pipelined transposer

Module Responsibilities

Transpose.scala (Transpose implementation layer)

  • Implements PipelinedTransposer module
  • Manages matrix data read, transpose, and write-back
  • Provides Ball domain command interface

Module Description

Transpose.scala

Main functionality: Implements pipelined matrix transpose operation

State machine definition:

val idle :: sRead :: sWrite :: complete :: Nil = Enum(4)
val state = RegInit(idle)

Storage structure:

// Matrix storage register (veclane x veclane)
val regArray = Reg(Vec(b.veclane, Vec(b.veclane, UInt(b.inputType.getWidth.W))))

Counter management:

val readCounter = RegInit(0.U(log2Ceil(b.veclane + 1).W))
val respCounter = RegInit(0.U(log2Ceil(b.veclane + 1).W))
val writeCounter = RegInit(0.U(log2Ceil(b.veclane + 1).W))

Instruction registers:

val robid_reg = RegInit(0.U(10.W))    // ROB ID
val waddr_reg = RegInit(0.U(10.W))    // Write address
val wbank_reg = RegInit(0.U(log2Up(b.sp_banks).W))  // Write bank
val raddr_reg = RegInit(0.U(10.W))    // Read address
val rbank_reg = RegInit(0.U(log2Up(b.sp_banks).W))  // Read bank
val iter_reg = RegInit(0.U(10.W))     // Iteration count

Interface definition:

val io = IO(new Bundle {
  val cmdReq = Flipped(Decoupled(new BallRsIssue))
  val cmdResp = Decoupled(new BallRsComplete)
  val sramRead = Vec(b.sp_banks, Flipped(new SramReadIO(...)))
  val sramWrite = Vec(b.sp_banks, Flipped(new SramWriteIO(...)))
})

Processing flow:

  1. idle: Wait for command, parse transpose parameters
  2. sRead: Read matrix data row by row into register array
  3. sWrite: Write transposed data column by column
  4. complete: Send completion signal

Transpose algorithm:

  • Uses veclane×veclane register array to store matrix
  • Reads row-wise, writes column-wise to implement transpose
  • Supports block-wise transpose for matrices of arbitrary size

Usage

Implementation Details

State machine:

val idle :: sRead :: sWrite :: complete :: Nil = Enum(4)

  • idle: Wait for instruction
  • sRead: Read matrix data
  • sWrite: Write transpose result
  • complete: Complete and respond

Register array:

val regArray = Reg(Vec(b.veclane, Vec(b.veclane, UInt(b.inputType.getWidth.W))))

Uses veclane×veclane register array to cache matrix data.

Transpose operation:

  • Read phase: Read data row by row into regArray(row)(col)
  • Write phase: Read regArray(i)(col) column by column to form new rows for writing
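
The indexing behind these two phases can be sketched as follows; field names such as resp.bits.data and req.bits.data are assumptions about the SramReadIO/SramWriteIO bundles rather than verified signal names:

// sRead: each SRAM response fills one row of regArray
when (state === sRead && io.sramRead(rbank_reg).resp.fire) {
  for (col <- 0 until b.veclane) {
    regArray(respCounter)(col) :=
      io.sramRead(rbank_reg).resp.bits.data((col + 1) * b.inputType.getWidth - 1,
                                            col * b.inputType.getWidth)
  }
}

// sWrite: column writeCounter of regArray is gathered and written back as one row
when (state === sWrite) {
  val transposedRow = VecInit((0 until b.veclane).map(i => regArray(i)(writeCounter)))
  io.sramWrite(wbank_reg).req.bits.data := transposedRow.asUInt
}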

Configuration Parameters

  • Matrix size: Determined by the b.veclane parameter
  • Data width: Determined by b.inputType.getWidth
  • Bank configuration: Supports multi-bank SRAM access

Notes

  1. Matrix size limitation: Maximum support for veclane×veclane matrices
  2. Memory bandwidth: Transpose operation has high memory bandwidth requirements
  3. Register overhead: Requires veclane² registers to store matrix
  4. Address calculation: Transposed address calculation needs to be handled correctly
  5. Pipeline control: Read/write counters need to be synchronized correctly

Vector Processing Unit

Overview

The Vector Processing Unit is a specialized computation accelerator in the Buckyball framework, located at prototype/vector. This module implements a complete vector processing pipeline, including control unit, load unit, execution unit, and store unit, supporting parallel processing of vector data.

File Structure

vector/
├── VecUnit.scala         - Vector processing unit top module
├── VecCtrlUnit.scala     - Vector control unit
├── VecLoadUnit.scala     - Vector load unit
├── VecEXUnit.scala       - Vector execution unit
├── VecStoreUnit.scala    - Vector store unit
├── bond/                 - Binding and synchronization mechanisms
├── op/                   - Vector operation implementations
├── thread/               - Thread management
└── warp/                 - Thread warp management

Core Components

VecUnit - Vector Processing Unit Top Level

VecUnit is the top-level module of the vector processor, integrating all sub-units:

class VecUnit(implicit b: CustomBuckyballConfig, p: Parameters) extends Module {
  val io = IO(new Bundle {
    val cmdReq = Flipped(Decoupled(new BallRsIssue))
    val cmdResp = Decoupled(new BallRsComplete)

    // Connected to Scratchpad SRAM read/write interfaces
    val sramRead = Vec(b.sp_banks, Flipped(new SramReadIO(b.spad_bank_entries, spad_w)))
    val sramWrite = Vec(b.sp_banks, Flipped(new SramWriteIO(b.spad_bank_entries, spad_w, b.spad_mask_len)))
    // Connected to Accumulator read/write interfaces
    val accRead = Vec(b.acc_banks, Flipped(new SramReadIO(b.acc_bank_entries, b.acc_w)))
    val accWrite = Vec(b.acc_banks, Flipped(new SramWriteIO(b.acc_bank_entries, b.acc_w, b.acc_mask_len)))
  })
}

Interface Description

Command interface:

  • cmdReq: Vector instruction request from reservation station
  • cmdResp: Completion response returned to reservation station

Memory interface:

  • sramRead/sramWrite: Read/write interfaces connected to Scratchpad
  • accRead/accWrite: Read/write interfaces connected to Accumulator

VecCtrlUnit - Vector Control Unit

The vector control unit is responsible for instruction decode and pipeline control:

class VecCtrlUnit(implicit b: CustomBuckyballConfig, p: Parameters) extends Module {
  val io = IO(new Bundle{
    val cmdReq = Flipped(Decoupled(new BallRsIssue))
    val cmdResp_o = Decoupled(new BallRsComplete)

    val ctrl_ld_o = Decoupled(new ctrl_ld_req)
    val ctrl_st_o = Decoupled(new ctrl_st_req)
    val ctrl_ex_o = Decoupled(new ctrl_ex_req)

    val cmdResp_i = Flipped(Valid(new Bundle {val commit = Bool()}))
  })
}

Control State

val rob_id_reg    = RegInit(0.U(log2Up(b.rob_entries).W))
val iter          = RegInit(0.U(10.W))
val op1_bank      = RegInit(0.U(2.W))
val op1_bank_addr = RegInit(0.U(12.W))
val op2_bank_addr = RegInit(0.U(12.W))
val op2_bank      = RegInit(0.U(2.W))
val wr_bank       = RegInit(0.U(2.W))
val wr_bank_addr  = RegInit(0.U(12.W))
val is_acc        = RegInit(false.B)

Data Flow Architecture

The vector processing unit uses a pipeline architecture with the following data flow:

Instruction input → VecCtrlUnit → Control signal dispatch
                          ↓
                  VecLoadUnit (Load data)
                          ↓
                  VecEXUnit (Execute computation)
                          ↓
                  VecStoreUnit (Store results)
                          ↓
                      Completion response

Module Connections

// Control unit
val VecCtrlUnit = Module(new VecCtrlUnit)
VecCtrlUnit.io.cmdReq <> io.cmdReq
io.cmdResp <> VecCtrlUnit.io.cmdResp_o

// Load unit
val VecLoadUnit = Module(new VecLoadUnit)
VecLoadUnit.io.ctrl_ld_i <> VecCtrlUnit.io.ctrl_ld_o

// Execution unit
val VecEX = Module(new VecEXUnit)
VecEX.io.ctrl_ex_i <> VecCtrlUnit.io.ctrl_ex_o
VecEX.io.ld_ex_i <> VecLoadUnit.io.ld_ex_o

// Store unit
val VecStoreUnit = Module(new VecStoreUnit)
VecStoreUnit.io.ctrl_st_i <> VecCtrlUnit.io.ctrl_st_o
VecStoreUnit.io.ex_st_i <> VecEX.io.ex_st_o

Memory System Integration

Scratchpad Connection

The vector processing unit connects to Scratchpad through multiple banks:

for (i <- 0 until b.sp_banks) {
  io.sramRead(i).req <> VecLoadUnit.io.sramReadReq(i)
  VecLoadUnit.io.sramReadResp(i) <> io.sramRead(i).resp
}

Accumulator Connection

Execution results are written to Accumulator through the store unit:

for (i <- 0 until b.acc_banks) {
  io.accWrite(i) <> VecStoreUnit.io.accWrite(i)
}

Configuration Parameters

Vector Configuration

Configure vector processor parameters through CustomBuckyballConfig:

class CustomBuckyballConfig extends Config((site, here, up) => {
  case "veclane" => 16              // Vector lane count
  case "sp_banks" => 4              // Scratchpad bank count
  case "acc_banks" => 2             // Accumulator bank count
  case "spad_bank_entries" => 1024  // Entries per bank
  case "acc_bank_entries" => 512    // Accumulator entry count
})

Data Width

val spad_w = b.veclane * b.inputType.getWidth  // Scratchpad width
val acc_w = b.outputType.getWidth              // Accumulator width
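
As a worked example, assuming the toy configuration's 16 lanes with an 8-bit input type and a 32-bit output type, these widths evaluate to:

val spad_w = 16 * 8   // 128-bit Scratchpad row width
val acc_w  = 32       // 32-bit Accumulator entry width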

Usage

Creating Vector Processing Unit

val vecUnit = Module(new VecUnit())

// Connect command interface
vecUnit.io.cmdReq <> reservationStation.io.issue
reservationStation.io.complete <> vecUnit.io.cmdResp

// Connect memory system
for (i <- 0 until sp_banks) {
  scratchpad.io.read(i) <> vecUnit.io.sramRead(i)
  scratchpad.io.write(i) <> vecUnit.io.sramWrite(i)
}

for (i <- 0 until acc_banks) {
  accumulator.io.read(i) <> vecUnit.io.accRead(i)
  accumulator.io.write(i) <> vecUnit.io.accWrite(i)
}

Vector Instruction Format

Vector instructions are passed through the BallRsIssue interface:

class BallRsIssue extends Bundle {
  val cmd = new Bundle {
    val iter = UInt(10.W)           // Iteration count
    val op1_bank = UInt(2.W)        // Operand 1 bank
    val op1_bank_addr = UInt(12.W)  // Operand 1 address
    val op2_bank = UInt(2.W)        // Operand 2 bank
    val op2_bank_addr = UInt(12.W)  // Operand 2 address
    val wr_bank = UInt(2.W)         // Write bank
    val wr_bank_addr = UInt(12.W)   // Write address
  }
  val rob_id = UInt(log2Up(rob_entries).W)
}

Execution Model

Pipeline Execution

  1. Instruction decode: VecCtrlUnit decodes vector instructions
  2. Data load: VecLoadUnit loads operands from Scratchpad
  3. Vector computation: VecEXUnit executes vector operations
  4. Result store: VecStoreUnit writes results to Accumulator
  5. Completion response: Returns completion signal to reservation station

Parallel Processing

  • Multi-lane parallelism: Supports parallel computation across multiple vector lanes
  • Bank-level parallelism: Multiple memory banks support parallel access
  • Pipeline overlap: Different stages can overlap execution

Submodule Description

Binding Mechanism (Bond)

Provides inter-thread synchronization and data binding functionality, supporting producer-consumer pattern data transfer.

Vector Operations (Op)

Implements specific vector computation operations, including arithmetic operations, logical operations, and special functions.

Thread Management (Thread)

Provides thread abstraction and management functionality, supporting different types of vector threads.

Thread Warp Management (Warp)

Implements thread warp organization and scheduling, supporting large-scale parallel computation.

Performance Characteristics

  • High parallelism: Supports multi-lane vector parallel processing
  • Pipelined: Multi-stage pipeline improves throughput
  • Memory optimization: Multi-bank memory system reduces access conflicts
  • Flexible configuration: Supports different vector lengths and data types

Binding Module

Overview

The binding module implements data interfaces and synchronization mechanisms in the vector processing unit, located at prototype/vector/bond. This module defines inter-thread data transfer interfaces, supporting different types of data binding patterns.

File Structure

bond/
├── BondWrapper.scala    - Binding wrapper base class
└── vvv.scala           - VVV binding implementation

Core Components

VVV - Vector-to-Vector Binding

VVV (Vector-Vector-Vector) binding implements a data interface from two input vectors to a single output vector:

class VVV(implicit p: Parameters) extends Bundle {
  val lane = p(ThreadKey).get.lane
  val bondParam = p(ThreadBondKey).get
  val inputWidth = bondParam.inputWidth
  val outputWidth = bondParam.outputWidth

  // Input interface (Flipped Decoupled)
  val in = Flipped(Decoupled(new Bundle {
    val in1 = Vec(lane, UInt(inputWidth.W))
    val in2 = Vec(lane, UInt(inputWidth.W))
  }))

  // Decoupled output interface
  val out = Decoupled(new Bundle {
    val out = Vec(lane, UInt(outputWidth.W))
  })
}

Interface Description

Input interface:

  • in.bits.in1: First input vector, width is inputWidth
  • in.bits.in2: Second input vector, width is inputWidth
  • in.valid: Input data valid signal
  • in.ready: Input ready signal

Output interface:

  • out.bits.out: Output vector, width is outputWidth
  • out.valid: Output data valid signal
  • out.ready: Output ready signal

Parameter Configuration

VVV binding parameters are obtained through the configuration system:

val lane = p(ThreadKey).get.lane                    // Vector lane count
val bondParam = p(ThreadBondKey).get                // Binding parameter
val inputWidth = bondParam.inputWidth               // Input width
val outputWidth = bondParam.outputWidth             // Output width

CanHaveVVVBond - VVV Binding Trait

The CanHaveVVVBond trait provides VVV binding functionality for threads:

trait CanHaveVVVBond { this: BaseThread =>
  val vvvBond = params(ThreadBondKey).filter(_.bondType == "vvv").map { bondParam =>
    IO(new VVV()(params))
  }

  def getVVVBond = vvvBond
}

Usage

Thread classes gain VVV binding capability by mixing in this trait:

class MulThread(implicit p: Parameters) extends BaseThread
  with CanHaveMulOp
  with CanHaveVVVBond {

  // Connect operation and binding
  for {
    op <- mulOp
    bond <- vvvBond
  } {
    op.io.in <> bond.in
    op.io.out <> bond.out
  }
}

BondWrapper - Binding Wrapper

BondWrapper provides Diplomacy-based binding encapsulation:

abstract class BondWrapper(implicit p: Parameters) extends LazyModule {
  val bondName = "vvv"

  def to[T](name: String)(body: => T): T = {
    LazyScope(s"bond_to_${name}", s"Bond_${bondName}_to_${name}") { body }
  }

  def from[T](name: String)(body: => T): T = {
    LazyScope(s"bond_from_${name}", s"Bond_${bondName}_from_${name}") { body }
  }
}

Scope Management

BondWrapper provides named scope management functionality:

  • to(): Creates binding scope in output direction
  • from(): Creates binding scope in input direction
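
A hypothetical usage sketch is shown below; the concrete subclass and the ProducerWrapper/ConsumerWrapper LazyModules are illustrative names only:

class MyVVVBond(implicit p: Parameters) extends BondWrapper {
  // Wrap the construction of neighboring modules in named LazyScopes
  val consumer = to("consumer")   { LazyModule(new ConsumerWrapper) }
  val producer = from("producer") { LazyModule(new ProducerWrapper) }
  lazy val module = new LazyModuleImp(this) {}
}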

Binding Types

VVV Binding Pattern

VVV binding supports the following data flow patterns:

  1. Dual input single output: Two vector inputs, one vector output
  2. Width conversion: Supports different input and output widths
  3. Vector parallelism: Supports multi-lane parallel data transmission

Data Flow Control

VVV binding uses Decoupled interface for flow control:

// Producer side
producer.io.out.valid := dataReady
producer.io.out.bits.in1 := inputVector1
producer.io.out.bits.in2 := inputVector2

// Consumer side
consumer.io.in.ready := canAcceptData
when(consumer.io.in.fire) {
  processData(consumer.io.in.bits.out)
}

Configuration Parameters

Binding Parameters

Binding parameters are defined through BondParam:

case class BondParam(
  bondType: String,           // Binding type ("vvv")
  inputWidth: Int = 8,        // Input width
  outputWidth: Int = 32       // Output width
)

Configuration Example

val bondConfig = BondParam(
  bondType = "vvv",
  inputWidth = 8,
  outputWidth = 32
)

val threadConfig = ThreadParam(
  lane = 16,
  attr = "vector",
  threadName = "mul_thread",
  Op = OpParam("mul", bondConfig)
)

Usage

Creating VVV Binding

// Using VVV binding in thread
class CustomThread(implicit p: Parameters) extends BaseThread
  with CanHaveVVVBond {

  // Get binding interface
  for (bond <- vvvBond) {
    // Connect input
    bond.in.valid := inputValid
    bond.in.bits.in1 := inputVector1
    bond.in.bits.in2 := inputVector2

    // Connect output
    outputValid := bond.out.valid
    outputVector := bond.out.bits.out
    bond.out.ready := outputReady
  }
}

Binding Connection

// Connect binding interfaces of two modules
val producer = Module(new ProducerThread())
val consumer = Module(new ConsumerThread())

// Direct binding interface connection
for {
  prodBond <- producer.vvvBond
  consBond <- consumer.vvvBond
} {
  consBond.in <> prodBond.out
}

Synchronization Mechanisms

Handshake Protocol

VVV binding uses standard Decoupled handshake protocol:

  1. Data preparation: Producer sets valid and bits
  2. Receive ready: Consumer sets ready
  3. Data transmission: Transfer completes when valid && ready
  4. State update: Both sides update internal state

Backpressure Handling

Binding interface supports backpressure mechanism:

// When downstream is not ready, the producer must wait: it keeps valid
// asserted and holds its data stable until the handshake fires (valid && ready).
// dataPending is a hypothetical producer-side flag illustrating this.
upstream.valid := dataPending
when (upstream.valid && downstream.ready) {
  dataPending := false.B   // advance only after the transfer completes
}

Extensibility

New Binding Types

New binding types can be defined following a similar pattern:

// Single input single output binding
class VV(implicit p: Parameters) extends Bundle {
  val lane = p(ThreadKey).get.lane
  val bondParam = p(ThreadBondKey).get

  val in  = Flipped(Decoupled(Vec(lane, UInt(bondParam.inputWidth.W))))
  val out = Decoupled(Vec(lane, UInt(bondParam.outputWidth.W)))
}

// Corresponding trait
trait CanHaveVVBond { this: BaseThread =>
  val vvBond = params(ThreadBondKey).filter(_.bondType == "vv").map { _ =>
    IO(new VV()(params))
  }
}

Parameterization Support

The binding module supports full parameterized configuration:

  • Vector lane count configurable
  • Input/output width configurable
  • Binding type extensible

Vector Operations Module

Overview

The vector operations module implements specific computation operations in the vector processing unit, located at prototype/vector/op. This module provides implementations of different types of vector operations, including multiplication operations and cascade operations.

File Structure

op/
├── cascade.scala    - Cascade addition operation
└── mul.scala       - Multiplication operation

Core Components

CascadeOp - Cascade Addition Operation

CascadeOp implements element-wise addition operation on vector elements:

class CascadeOp(implicit p: Parameters) extends Module {
  val lane = p(ThreadKey).get.lane
  val bondParam = p(ThreadBondKey).get
  val outputWidth = bondParam.outputWidth

  val io = IO(new VVV()(p))
}

Operation Logic

val reg1 = RegInit(VecInit(Seq.fill(lane)(0.U(outputWidth.W))))
val valid1 = RegInit(false.B)

when (io.in.valid) {
  valid1 := true.B
  reg1 := VecInit(io.in.bits.in1.zip(io.in.bits.in2).map { case (a, b) => a + b })
}

Function description:

  • Receives two input vectors in1 and in2
  • Performs element-wise addition: out[i] = in1[i] + in2[i]
  • Uses register to cache computation results
  • Supports pipelined operations

Flow Control Mechanism

io.in.ready := io.out.ready

when (io.out.ready && valid1) {
  io.out.valid := true.B
  io.out.bits.out := reg1
}.otherwise {
  io.out.valid := false.B
  io.out.bits.out := VecInit(Seq.fill(lane)(0.U(outputWidth.W)))
}

MulOp - Multiplication Operation

MulOp implements vector multiplication operation with broadcast mode support:

class MulOp(implicit p: Parameters) extends Module {
  val lane = p(ThreadKey).get.lane
  val bondParam = p(ThreadBondKey).get
  val inputWidth = bondParam.inputWidth

  val io = IO(new VVV()(p))
}

Operation Logic

val reg1 = RegInit(VecInit(Seq.fill(lane)(0.U(inputWidth.W))))
val reg2 = RegInit(VecInit(Seq.fill(lane)(0.U(inputWidth.W))))
val cnt = RegInit(0.U(log2Ceil(lane).W))
val active = RegInit(false.B)

when (io.in.valid) {
  reg1 := io.in.bits.in1
  reg2 := io.in.bits.in2
  cnt := 0.U
  active := true.B
}

Function description:

  • Receives two input vectors and caches them in registers
  • Uses counter cnt to control output sequence
  • Implements broadcast multiplication: out[i] = reg1[cnt] * reg2[i]

Sequential Output

for (i <- 0 until lane) {
  io.out.bits.out(i) := reg1(cnt) * reg2(i)
}

when (active && io.out.ready) {
  cnt := cnt + 1.U
  when (cnt === (lane-1).U) {
    active := false.B
  }
}

Output mode:

  • Outputs one set of multiplication results per cycle
  • reg1[cnt] multiplied with all elements of reg2
  • Counter increments to achieve sequential output
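
A plain-Scala reference of this broadcast sequence (software model only):

// Cycle t outputs a(t) * b(i) for every lane i; the full sequence takes lane cycles.
def mulOpSequence(a: Seq[Int], b: Seq[Int]): Seq[Seq[Int]] =
  a.indices.map(t => b.map(bi => a(t) * bi))

// Example: a = Seq(1, 2), b = Seq(10, 20) yields Seq(Seq(10, 20), Seq(20, 40))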

Operation Traits

CanHaveCascadeOp - Cascade Operation Trait

trait CanHaveCascadeOp { this: BaseThread =>
  val cascadeOp = params(ThreadOpKey).filter(_.OpType == "cascade").map { opParam =>
    Module(new CascadeOp()(params))
  }

  def getCascadeOp = cascadeOp
}

CanHaveMulOp - Multiplication Operation Trait

trait CanHaveMulOp { this: BaseThread =>
  val mulOp = params(ThreadOpKey).filter(_.OpType == "mul").map { opParam =>
    Module(new MulOp()(params))
  }

  def getMulOp = mulOp
}

Usage

Using Operations in Threads

class CasThread(implicit p: Parameters) extends BaseThread
  with CanHaveCascadeOp
  with CanHaveVVVBond {

  // Connect operation and binding
  for {
    op <- cascadeOp
    bond <- vvvBond
  } {
    op.io.in <> bond.in
    op.io.out <> bond.out
  }
}

Configuring Operation Parameters

val opParam = OpParam(
  OpType = "cascade",                    // Operation type
  bondType = BondParam(
    bondType = "vvv",
    inputWidth = 32,
    outputWidth = 32
  )
)

Operation Type Comparison

CascadeOp vs MulOp

Feature              | CascadeOp             | MulOp
---------------------|-----------------------|--------------------------
Operation type       | Element-wise addition | Broadcast multiplication
Input width          | Arbitrary             | Usually smaller
Output width         | Arbitrary             | Usually larger
Latency              | 1 cycle               | lane cycles
Throughput           | 1 group per cycle     | 1 group per lane cycle
Resource consumption | Adder × lane          | Multiplier × lane

Application Scenarios

CascadeOp is suitable for:

  • Vector addition operations
  • Accumulation operations
  • Data merging

MulOp is suitable for:

  • Matrix-vector multiplication
  • Convolution operations
  • Scaling operations

Data Flow Patterns

CascadeOp Data Flow

Input: [a0, a1, ..., an], [b0, b1, ..., bn]
      ↓
Compute: [a0+b0, a1+b1, ..., an+bn]
      ↓
Output: [c0, c1, ..., cn] (1 cycle)

MulOp Data Flow

Input: [a0, a1, ..., an], [b0, b1, ..., bn]
      ↓
Cycle 0: [a0*b0, a0*b1, ..., a0*bn]
Cycle 1: [a1*b0, a1*b1, ..., a1*bn]
...
Cycle n: [an*b0, an*b1, ..., an*bn]

Extended Operations

Adding New Operations

New vector operations can be added following a similar pattern:

class SubOp(implicit p: Parameters) extends Module {
  val io = IO(new VVV()(p))

  // Implement subtraction operation
  io.out.bits.out := VecInit(io.in.bits.in1.zip(io.in.bits.in2).map {
    case (a, b) => a - b
  })
}

trait CanHaveSubOp { this: BaseThread =>
  val subOp = params(ThreadOpKey).filter(_.OpType == "sub").map { _ =>
    Module(new SubOp()(params))
  }
}

Complex Operations

For more complex operations, multiple basic operations can be combined:

class FMAOp(implicit p: Parameters) extends Module {
  // Fused multiply-add operation: out = a * b + c
  val mulOp = Module(new MulOp())
  val addOp = Module(new CascadeOp())

  // Connect operation pipeline
  addOp.io.in.bits.in1 := mulOp.io.out.bits.out
  // ...
}

Performance Optimization

Pipeline Optimization

  • Use registers to cache intermediate results
  • Support continuous data stream processing
  • Minimize combinational logic delay

Resource Optimization

  • Choose appropriate hardware resources based on operation type
  • Support resource sharing and reuse
  • Configurable parallelism

Thread Module

Overview

The thread module implements thread abstractions in the vector processing unit, located at prototype/vector/thread. This module defines the basic structure and specific implementations of threads, constructing threads with specific functionality by combining different operations (Op) and bindings (Bond).

File Structure

thread/
├── BaseThread.scala    - Thread base class definition
├── CasThread.scala     - Cascade operation thread
└── MulThread.scala     - Multiplication operation thread

Core Components

BaseThread - Thread Base Class

BaseThread is the base class for all threads, defining basic thread parameters and configuration:

class BaseThread(implicit p: Parameters) extends Module {
  val io = IO(new Bundle {})
  val params = p
  val threadMap = p(ThreadMapKey)
  val threadParam = threadMap.getOrElse(
    p(ThreadKey).get.threadName,
    throw new Exception(s"ThreadParam not found for threadName: ${p(ThreadKey).get.threadName}")
  )
  val opParam = p(ThreadOpKey).get
  val bondParam = p(ThreadBondKey).get
}

Parameter Definition

The thread module uses the following parameter structure:

case class ThreadParam(lane: Int, attr: String, threadName: String, Op: OpParam)
case class OpParam(OpType: String, bondType: BondParam)
case class BondParam(bondType: String, inputWidth: Int = 8, outputWidth: Int = 32)

Parameter description:

  • lane: Vector lane count
  • threadName: Thread name identifier
  • OpType: Operation type ("cascade", "mul")
  • bondType: Binding type ("vvv")
  • inputWidth: Input data width, default 8 bits
  • outputWidth: Output data width, default 32 bits

Specific Thread Implementations

CasThread - Cascade Operation Thread

CasThread implements cascade addition operation, combining CascadeOp and VVVBond:

class CasThread(implicit p: Parameters) extends BaseThread
  with CanHaveCascadeOp
  with CanHaveVVVBond {

  // Connect CascadeOp and VVVBond
  for {
    op <- cascadeOp
    bond <- vvvBond
  } {
    op.io.in <> bond.in
    op.io.out <> bond.out
  }
}

Function: Performs element-wise addition operation on two input vectors.

MulThread - Multiplication Operation Thread

MulThread implements multiplication operation, combining MulOp and VVVBond:

class MulThread(implicit p: Parameters) extends BaseThread
  with CanHaveMulOp
  with CanHaveVVVBond {

  // Connect MulOp and VVVBond
  for {
    op <- mulOp
    bond <- vvvBond
  } {
    op.io.in <> bond.in
    op.io.out <> bond.out
  }
}

Function: Implements vector broadcast multiplication, outputting one group of products per cycle across lane cycles.

Configuration System

The thread module uses Chipyard's configuration system for parameterization:

case object ThreadKey extends Field[Option[ThreadParam]](None)
case object ThreadOpKey extends Field[Option[OpParam]](None)
case object ThreadBondKey extends Field[Option[BondParam]](None)
case object ThreadMapKey extends Field[Map[String, ThreadParam]](Map.empty)

Configuration key description:

  • ThreadKey: Current thread parameter
  • ThreadOpKey: Operation parameter
  • ThreadBondKey: Binding parameter
  • ThreadMapKey: Thread mapping table

Usage

Creating Thread Instance

// Configure parameters
val threadParam = ThreadParam(
  lane = 4,
  attr = "vector",
  threadName = "mul_thread",
  Op = OpParam("mul", BondParam("vvv", 8, 32))
)

// Create thread (BaseThread also reads ThreadMapKey, so register the thread there)
val mulThread = Module(new MulThread()(
  new Config((site, here, up) => {
    case ThreadKey     => Some(threadParam)
    case ThreadOpKey   => Some(threadParam.Op)
    case ThreadBondKey => Some(threadParam.Op.bondType)
    case ThreadMapKey  => Map(threadParam.threadName -> threadParam)
  })
))

Connecting Interfaces

Threads exchange data through the VVV binding interface created by CanHaveVVVBond (vvvBond is an Option, so it is accessed with a for-comprehension):

for (bond <- mulThread.vvvBond) {
  // Input data
  bond.in.valid := inputValid
  bond.in.bits.in1 := inputVector1
  bond.in.bits.in2 := inputVector2

  // Output data
  outputValid := bond.out.valid
  outputVector := bond.out.bits.out
  bond.out.ready := outputReady
}

Thread Warp Module

Overview

The thread warp module implements thread warp management functionality in the vector processing unit, located at prototype/vector/warp. This module organizes multiple threads into a mesh structure, implementing parallel computation and dataflow management.

File Structure

warp/
├── MeshWarp.scala    - Mesh warp implementation
└── VecBall.scala     - Vector ball processor

Core Components

MeshWarp - Mesh Warp

MeshWarp implements a 32-thread mesh structure containing 16 multiplication threads and 16 cascade threads:

class MeshWarp(implicit p: Parameters) extends Module {
  val io = IO(new Bundle {
    val in = Flipped(Decoupled(new MeshWarpInput))
    val out = Decoupled(new MeshWarpOutput)
  })
}

Input/Output Interface

class MeshWarpInput extends Bundle {
  val op1 = Vec(16, UInt(8.W))        // First operand vector
  val op2 = Vec(16, UInt(8.W))        // Second operand vector
  val thread_id = UInt(10.W)          // Thread identifier
}

class MeshWarpOutput extends Bundle {
  val res = Vec(16, UInt(32.W))       // Result vector
}

Thread Configuration

Threads in the mesh are configured according to the following rules:

val threadMap = (0 until 32).map { i =>
  val threadName = i.toString
  val opType = if (i < 16) "mul" else "cascade"
  val bond = if (opType == "mul") {
    BondParam("vvv", inputWidth = 8, outputWidth = 32)
  } else {
    BondParam("vvv", inputWidth = 32, outputWidth = 32)
  }
  val op = OpParam(opType, bond)
  val thread = ThreadParam(16, s"attr$threadName", threadName, op)
  threadName -> thread
}.toMap

Thread allocation:

  • Threads 0-15: Multiplication operation threads (8-bit input → 32-bit output)
  • Threads 16-31: Cascade operation threads (32-bit input → 32-bit output)

Data Flow Connection

Data flow in the mesh is connected as follows:

// Connect mul thread output to cascade thread input
casBond.in.bits.in1 := mulBond.out.bits.out
mulBond.out.ready   := casBond.in.ready

// Cascade connection between cascade threads
if (i == 0) {
  casBond.in.bits.in2 := VecInit(Seq.fill(16)(0.U(32.W)))
} else {
  casBond.in.bits.in2 := prevCasBond.out.bits.out
}

Data flow path:

  1. Input data → Multiplication threads (thread 0-15)
  2. Multiplication results → Cascade threads (thread 16-31)
  3. Serial connection between cascade threads
  4. Final result output from thread 31

VecBall - Vector Ball Processor

VecBall is a wrapper for MeshWarp, providing state management and iteration control:

class VecBall(implicit p: Parameters) extends Module {
  val io = IO(new VecBallIO())
}

Interface Definition

class VecBallIO extends BallIO {
  val op1In = Flipped(Valid(Vec(16, UInt(8.W))))    // Operand 1 input
  val op2In = Flipped(Valid(Vec(16, UInt(8.W))))    // Operand 2 input
  val rstOut = Decoupled(Vec(16, UInt(32.W)))       // Result output
}

class BallIO extends Bundle {
  val iterIn = Flipped(Decoupled(UInt(10.W)))       // Iteration count input
  val iterOut = Valid(UInt(10.W))                   // Current iteration output
}

State Management

VecBall maintains the following internal state:

val start  = RegInit(false.B)      // Start flag
val arrive = RegInit(false.B)      // Arrival flag
val done   = RegInit(false.B)      // Completion flag
val iter   = RegInit(0.U(10.W))    // Total iteration count
val iterCounter = RegInit(0.U(10.W)) // Current iteration counter

Thread Scheduling

VecBall uses round-robin scheduling to assign threads:

val threadId = RegInit(0.U(4.W))
when (io.op1In.valid && io.op2In.valid && threadId < 15.U) {
  threadId := threadId + 1.U
} .elsewhen (io.op1In.valid && io.op2In.valid && threadId === 15.U) {
  threadId := 0.U
}

Usage

Creating MeshWarp Instance

val meshWarp = Module(new MeshWarp()(p))

// Connect input
meshWarp.io.in.valid := inputValid
meshWarp.io.in.bits.op1 := operand1
meshWarp.io.in.bits.op2 := operand2
meshWarp.io.in.bits.thread_id := selectedThread

// Connect output
outputValid := meshWarp.io.out.valid
result := meshWarp.io.out.bits.res
meshWarp.io.out.ready := outputReady

Creating VecBall Instance

val vecBall = Module(new VecBall()(p))

// Set iteration count
vecBall.io.iterIn.valid := iterValid
vecBall.io.iterIn.bits := totalIterations

// Input data
vecBall.io.op1In.valid := dataValid
vecBall.io.op1In.bits := inputVector1
vecBall.io.op2In.valid := dataValid
vecBall.io.op2In.bits := inputVector2

// Get result
outputReady := vecBall.io.rstOut.ready
when(vecBall.io.rstOut.valid) {
  result := vecBall.io.rstOut.bits
}

Computation Modes

Vector Multiply-Accumulate

Computation mode implemented by MeshWarp:

  1. Multiplication phase: 16 multiplication threads compute op1[i] * op2[i] in parallel
  2. Accumulation phase: 16 cascade threads accumulate multiplication results serially
  3. Output phase: Output final accumulated vector
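
Under this interpretation, a plain-Scala reference of the whole multiply-accumulate pattern would look as follows (assumption: each cascade stage adds one 16-element product vector into a running accumulator):

def meshWarpReference(op1s: Seq[Seq[Int]], op2s: Seq[Seq[Int]]): Seq[Int] =
  op1s.zip(op2s)
    .map { case (a, b) => a.zip(b).map { case (x, y) => x * y } } // mul threads
    .foldLeft(Seq.fill(16)(0)) { (acc, prod) =>                   // cascade threads
      acc.zip(prod).map { case (s, p) => s + p }
    }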

Iterative Processing

VecBall supports multi-iteration processing:

  1. Set iteration count iterIn
  2. Loop input data pairs
  3. Monitor iteration count iterOut
  4. Check completion status

Performance Characteristics

  • Parallelism: 16 multiplication operations execute in parallel
  • Pipeline: Supports continuous data stream processing
  • Throughput: Can process one 16-element vector pair per cycle
  • Latency: Combined latency of multiplication + cascade

Buckyball Example Configurations

Overview

This directory contains example configurations and reference implementations of the Buckyball framework, demonstrating how to configure and extend Buckyball systems. Located at arch/src/main/scala/examples, it serves as the configuration layer, providing configuration templates and system instances for developers.

Main components include:

  • BuckyballConfig.scala: Global configuration parameter definitions
  • toy/: Complete example system implementation with custom coprocessor and CSR extensions

Code Structure

examples/
├── BuckyballConfig.scala     - Global configuration definitions
└── toy/                      - Complete example system
    ├── balldomain/           - Ball domain component implementation
    │   ├── BallDomain.scala  - Ball domain top-level
    │   ├── bbus/             - Ball bus registration
    │   │   └── busRegister.scala
    │   ├── rs/               - Ball RS registration
    │   │   └── rsRegister.scala
    │   └── decoder/          - Ball decoder (if exists)
    ├── CustomConfigs.scala   - System configuration composition
    └── ToyBuckyball.scala    - System top-level module

File Dependencies

BuckyballConfig.scala (Base Configuration Layer)

  • Defines global configuration parameters and defaults
  • Inherited and extended by all other configuration files
  • Provides system-level configuration interface

toy/CustomConfigs.scala (Configuration Composition Layer)

  • Inherits from BuckyballConfig and adds custom parameters
  • Composes multiple configuration fragments into complete configuration
  • Provides configuration support for ToyBuckyball

toy/ToyBuckyball.scala (System Instantiation Layer)

  • Uses CustomConfigs to instantiate complete system
  • Serves as entry point for Mill build
  • Generates final Verilog code

Module Details

BuckyballConfig.scala

Main Function: Define global configuration parameters for the Buckyball framework

Key Components:

object BuckyballConfigs {
  val defaultConfig = BaseConfig
  val toyConfig = BuckyballToyConfig.defaultConfig

  // Actually used configuration
  val customConfig = toyConfig

  type CustomBuckyballConfig = BaseConfig
}

Configuration Selection: The framework uses customConfig to select the active configuration. This allows easy switching between different system configurations.

Input/Output:

  • Input: No direct input, parameters passed through configuration system
  • Output: Configuration parameters for use by other modules
  • Edge cases: Configuration conflicts resolved by priority-based overriding

toy/ - Example System

The toy system demonstrates a complete Buckyball implementation with various Ball devices.

toy/ToyBuckyball.scala

Main Function: System top-level module, instantiates complete toy system

Key Components:

class ToyBuckyball(implicit b: CustomBuckyballConfig, p: Parameters) extends LazyRoCC {
  override lazy val module = new ToyBuckyballModuleImp(this)
}

class ToyBuckyballModuleImp(outer: ToyBuckyball) extends LazyRoCCModuleImp(outer) {
  // Global Decoder
  val globalDecoder = Module(new GlobalDecoder)

  // Global Reservation Station (with ROB)
  val globalRS = Module(new GlobalReservationStation)

  // Ball Domain (regular Module, not LazyModule)
  val ballDomain = Module(new BallDomain)

  // Memory Domain (complete domain with DMA+TLB+SRAM)
  val memDomain = LazyModule(new MemDomain)

  // Connect components
  globalDecoder.io.rocc <> io.cmd
  globalRS.io.decode <> globalDecoder.io.issue
  ballDomain.io.issue <> globalRS.io.ballIssue
  memDomain.module.io.issue <> globalRS.io.memIssue
  // ... more connections
}

Build Flow:

  1. Load configuration from BuckyballConfig
  2. Instantiate ToyBuckyball LazyRoCC module
  3. Generate Verilog through ChiselStage
  4. Output to generated-src directory

Input/Output:

  • Input: RoCC interface commands from Rocket core
  • Output: RoCC interface responses, busy signals
  • Edge cases: Configuration errors cause build failure

toy/balldomain/ - Ball Domain Components

BallDomain.scala: Ball domain top-level module

  • Integrates Ball Decoder, local Ball RS, and BBus
  • Provides single-channel interface to Global RS
  • Routes commands to appropriate Ball devices

bbus/busRegister.scala: Ball bus registration

class BBusModule extends BBus {
  // Register Ball device generators
  registerBall(() => new VecBall, ballId = 0.U)
  registerBall(() => new MatrixBall, ballId = 1.U)
  registerBall(() => new TransposeBall, ballId = 2.U)
  registerBall(() => new Im2colBall, ballId = 3.U)
  registerBall(() => new ReluBall, ballId = 4.U)
}

rs/rsRegister.scala: Ball RS device registration

class BallRSModule extends BallReservationStation {
  // Register Ball device information
  registerBallInfo(name = "VecBall", bid = 0, latency = 10)
  registerBallInfo(name = "MatrixBall", bid = 1, latency = 20)
  registerBallInfo(name = "TransposeBall", bid = 2, latency = 15)
  registerBallInfo(name = "Im2colBall", bid = 3, latency = 15)
  registerBallInfo(name = "ReluBall", bid = 4, latency = 10)
}

toy/CustomConfigs.scala

Main Function: Compose multiple configuration fragments for the toy system

Configuration Composition:

object BuckyballToyConfig {
  val defaultConfig = BaseConfig(
    opcodes = OpcodeSet.custom3,
    inputType = UInt(8.W),        // INT8 input
    accType = UInt(32.W),         // INT32 accumulator
    veclane = 16,                 // 16-element vectors
    accveclane = 4,               // 4-element accumulator vectors
    rob_entries = 16,             // 16 ROB entries
    sp_banks = 4,                 // 4 scratchpad banks
    acc_banks = 8,                // 8 accumulator banks
    sp_capacity = CapacityInKilobytes(256),   // 256KB scratchpad
    acc_capacity = CapacityInKilobytes(64),   // 64KB accumulator
    numVecPE = 16,                // 16 vector PEs
    numVecThread = 16             // 16 vector threads
  )
}

Configuration Parameters:

  • opcodes: Custom instruction opcode set (custom3 = 0x7b)
  • inputType: Data type for input operands
  • accType: Data type for accumulator
  • veclane: Number of elements per vector lane
  • rob_entries: Reorder buffer depth
  • Memory configuration: Bank counts and capacities
  • Vector configuration: PE count and thread count

Usage Guide

Building the Toy System

Generate Verilog:

cd arch
mill arch.runMain examples.toy.ToyBuckyball

Generated Files:

  • Location: arch/generated-src/toy/
  • Files: Verilog (.v), FIRRTL (.fir), annotation (.anno.json)

Custom Configuration Development

Steps:

  1. Copy CustomConfigs.scala as template
  2. Modify configuration parameters to meet requirements
  3. Implement necessary custom components
  4. Update top-level module to reference new configuration
  5. Register Ball devices in BBus and Ball RS

Example: Adding New Ball Device:

  1. Implement the Ball device:

class MyCustomBall(implicit b: CustomBuckyballConfig, p: Parameters)
  extends Module with BallRegist {
  // Implement Ball interfaces
  val io = IO(new BlinkIO)
  def ballId = 6.U  // Assign unique Ball ID
  // ... implementation
}

  2. Register in BBusModule:

registerBall(() => new MyCustomBall, ballId = 6.U)

  3. Register in BallRSModule:

registerBallInfo(name = "MyCustomBall", bid = 6, latency = 12)

Configuration Best Practices

Parameter Selection:

  1. Memory Sizes: Balance capacity vs. area

    • Scratchpad: Main working memory for data
    • Accumulator: Smaller, used for accumulation results
  2. ROB Depth: Impacts instruction-level parallelism

    • Larger ROB: More in-flight instructions, higher parallelism
    • Smaller ROB: Lower area, simpler control logic
  3. Bank Counts: Affects memory bandwidth

    • More banks: Higher parallel access bandwidth
    • Fewer banks: Simpler arbitration, lower area
  4. Vector Configuration: Depends on workload

    • Vector lane width: Match data parallelism
    • PE/Thread count: Balance performance vs. area

Common Configurations:

// High-performance configuration
val highPerfConfig = BaseConfig(
  veclane = 32,                 // Wider vectors
  rob_entries = 32,             // Deeper ROB
  sp_banks = 8,                 // More banks
  sp_capacity = CapacityInKilobytes(512)
)

// Area-optimized configuration
val smallConfig = BaseConfig(
  veclane = 8,
  rob_entries = 8,
  sp_banks = 2,
  sp_capacity = CapacityInKilobytes(64)
)

Important Notes

  1. Configuration Priority: Later configurations in the chain override earlier ones with same parameter names
  2. Dependency Management: Ensure custom component dependencies are correctly declared in configuration
  3. Build Path: Generated file paths specified by TargetDirAnnotation
  4. Parameter Validation: Configuration parameters validated during instantiation; invalid configurations cause build failure
  5. Ball ID Uniqueness: Each Ball device must have unique ID across the system
  6. Bank Access Rules: Remember op1 and op2 cannot access same bank simultaneously

System Architecture

The toy system implements the complete Buckyball architecture:

┌─────────────────────────────────────────────────────────┐
│                  Rocket Core (via RoCC)                 │
└────────────────────┬────────────────────────────────────┘
                     │
            ┌────────▼────────┐
            │ Global Decoder  │
            └────────┬────────┘
                     │
            ┌────────▼────────┐
            │   Global RS     │
            │  (with ROB)     │
            └────┬──────┬─────┘
                 │      │
         ┌───────▼──┐ ┌▼──────────┐
         │  Ball    │ │   Mem     │
         │  Domain  │ │  Domain   │
         │          │ │           │
         │  ┌─────┐ │ │ ┌──────┐ │
         │  │BBus │ │ │ │ DMA  │ │
         │  └──┬──┘ │ │ │+TLB  │ │
         │     │    │ │ └───┬──┘ │
         │  ┌──▼───┐│ │     │    │
         │  │Balls ││ │  ┌──▼──┐ │
         │  └──────┘│ │  │Mem  │ │
         └──────┬───┘ │  │Ctrl │ │
                │     │  └─────┘ │
                │     └─────┬────┘
                │           │
            ┌───▼───────────▼───┐
            │  Memory Controller│
            │ (Scratchpad+Acc)  │
            └───────────────────┘

Supported Ball Devices:

  • VecBall (ID=0): Vector operations
  • MatrixBall (ID=1): Matrix multiplication (various formats)
  • TransposeBall (ID=2): Matrix transpose
  • Im2colBall (ID=3): Im2col transformation for convolution
  • ReluBall (ID=4): ReLU activation function

Troubleshooting

Issue: Build fails with "Ball ID conflict"

  • Solution: Ensure each Ball device has unique ID in both BBus and RS registration

Issue: Generated Verilog has timing violations

  • Solution: Reduce clock frequency or optimize critical paths

Issue: Simulation shows incorrect results

  • Solution: Verify Ball device implementation and memory access patterns

Issue: Configuration parameter not taking effect

  • Solution: Check configuration priority and ensure parameter is in correct config fragment

Toy Buckyball Example Implementation

Overview

This directory contains a complete example implementation of the Buckyball framework, demonstrating how to build a custom coprocessor based on the RoCC interface. Located in arch/src/main/scala/examples/toy, it serves as a reference implementation for the Buckyball system, integrating global decoder, Ball domain, and memory domain.

Core components:

  • ToyBuckyball.scala: Main RoCC coprocessor implementation
  • CustomConfigs.scala: System configuration and RoCC integration
  • CSR.scala: Custom control and status registers
  • balldomain/: Ball domain related components

Code Structure

toy/
├── ToyBuckyball.scala    - Main coprocessor implementation
├── CustomConfigs.scala   - Configuration definitions
├── CSR.scala            - CSR implementation
└── balldomain/          - Ball domain components

File Dependencies

ToyBuckyball.scala (Core implementation layer)

  • Extends LazyRoCCBB, implements RoCC coprocessor interface
  • Integrates GlobalDecoder, BallDomain, MemDomain
  • Manages TileLink connections and DMA components

CustomConfigs.scala (Configuration layer)

  • Defines BuckyballCustomConfig and BuckyballToyConfig
  • Configures RoCC integration and system parameters
  • Provides multi-core configuration support

CSR.scala (Register layer)

  • Implements FenceCSR control register
  • Provides simple 64-bit register interface

Module Description

ToyBuckyball.scala

Main functionality: Implements complete Buckyball RoCC coprocessor

Key components:

class ToyBuckyball(val b: CustomBuckyballConfig)(implicit p: Parameters)
  extends LazyRoCCBB (opcodes = b.opcodes, nPTWPorts = 2) {

  val reader = LazyModule(new BBStreamReader(...))
  val writer = LazyModule(new BBStreamWriter(...))
  val xbar_node = TLXbar()
}

System architecture:

// Frontend: global decoder
val gDecoder = Module(new GlobalDecoder)

// Backend: Ball domain and memory domain
val ballDomain = Module(new BallDomain)
val memDomain = Module(new MemDomain)

// Response arbitration
val respArb = Module(new Arbiter(new RoCCResponseBB()(p), 2))

TileLink connections:

xbar_node := TLBuffer() := reader.node
xbar_node := TLBuffer() := writer.node
id_node := TLWidthWidget(b.dma_buswidth/8) := TLBuffer() := xbar_node

Inputs/Outputs:

  • Input: RoCC command interface, PTW interface
  • Output: RoCC response, TileLink memory access
  • Edge cases: Busy-wait handling during Fence operations

CustomConfigs.scala

Main functionality: Defines system configuration and RoCC integration

Configuration class definition:

class BuckyballCustomConfig(
  buckyballConfig: CustomBuckyballConfig = CustomBuckyballConfig()
) extends Config((site, here, up) => {
  case BuildRoCCBB => up(BuildRoCCBB) ++ Seq(
    (p: Parameters) => {
      val buckyball = LazyModule(new ToyBuckyball(buckyballConfig))
      buckyball
    }
  )
})

System configuration:

class BuckyballToyConfig extends Config(
  new framework.rocket.WithNBuckyballCores(1) ++
  new BuckyballCustomConfig(CustomBuckyballConfig()) ++
  new chipyard.config.WithSystemBusWidth(128) ++
  new WithCustomBootROM ++
  new chipyard.config.AbstractConfig
)

Multi-core support:

class WithMultiRoCCToyBuckyball(harts: Int*) extends Config(...)

CSR.scala

Main functionality: Provides custom control and status registers

Implementation:

object FenceCSR {
  def apply(): UInt = RegInit(0.U(64.W))
}

Fence handling logic:

val fenceCSR = FenceCSR()
val fenceSet = ballDomain.io.fence_o
val allDomainsIdle = !ballDomain.io.busy && !memDomain.io.busy

when (fenceSet) {
  fenceCSR := 1.U
  io.cmd.ready := allDomainsIdle
}
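
The excerpt does not show how the CSR is cleared again; a plausible completion, which is an assumption rather than confirmed source behavior, is to release the fence once both domains drain:

// Assumed clear condition: fence is pending and both domains are idle
when (fenceCSR === 1.U && allDomainsIdle) {
  fenceCSR := 0.U
}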

Usage

System Integration

RoCC interface integration:

  • Register coprocessor through BuildRoCCBB configuration key
  • Support multi-core configuration
  • Provide 2 PTW ports for address translation

Inter-domain communication:

// BallDomain -> MemDomain bridge
ballDomain.io.sramRead <> memDomain.io.ballDomain.sramRead
ballDomain.io.sramWrite <> memDomain.io.ballDomain.sramWrite

DMA connections:

memDomain.io.dma.read.req <> outer.reader.module.io.req
memDomain.io.dma.write.req <> outer.writer.module.io.req

Notes

  1. Fence semantics: Use CSR to implement Fence operation synchronization
  2. Busy-wait detection: Assertion checks to prevent long simulation stalls
  3. TLB integration: TLB functionality integrated in MemDomain
  4. Response arbitration: BallDomain has higher priority than MemDomain
  5. Configuration dependencies: Correctly configure CustomBuckyballConfig parameters

BallDomain Example Implementation

Overview

This directory contains a complete example implementation of BallDomain in the Buckyball framework, demonstrating how to build a custom computation domain to manage specialized accelerators. BallDomain is a core concept in Buckyball architecture, used to encapsulate and manage a group of related computation units with unified control and dataflow management.

This directory implements the ball domain architecture, including:

  • BallDomain: Top-level module managing the entire computation domain
  • BallController: Ball domain controller for instruction scheduling and execution control
  • DISA: Distributed instruction scheduling architecture
  • DomainDecoder: Domain instruction decoder
  • Specialized accelerators: Including matrix, vector, im2col and other accelerator implementations

Code Structure

balldomain/
├── BallDomain.scala      - Ball domain top module
├── BallController.scala  - Ball domain controller
├── DISA.scala           - Distributed instruction scheduling architecture
├── DomainDecoder.scala  - Domain instruction decoder
├── bbus/                - Ball domain bus system
├── im2col/              - Image-to-column conversion accelerator
├── matrixball/          - Matrix computation ball domain
├── rs/                  - Reservation station implementation
└── vecball/             - Vector computation ball domain

File Dependencies

BallDomain.scala (Top-level module)

  • Integrates all submodules, provides unified ball domain interface
  • Manages dataflow and control flow within ball domain
  • Connects to system bus and RoCC interface

BallController.scala (Control layer)

  • Implements instruction scheduling and execution control for ball domain
  • Manages coordination between multiple accelerators
  • Provides state management and error handling

DISA.scala (Scheduling layer)

  • Distributed instruction scheduling architecture implementation
  • Supports concurrent execution of multiple instructions
  • Provides dynamic load balancing

DomainDecoder.scala (Decode layer)

  • Ball domain specific instruction decode
  • Instruction dispatch to corresponding execution units
  • Supports complex instruction decomposition and reorganization

Module Description

BallDomain.scala

Main functionality: Ball domain top module, integrates all computation units and control logic

Key components:

class BallDomain(implicit p: Parameters) extends LazyModule {
  val controller = LazyModule(new BallController)
  val matrixBall = LazyModule(new MatrixBall)
  val vecBall = LazyModule(new VecBall)
  val im2colUnit = LazyModule(new Im2colUnit)

  // Ball domain bus connections
  val bbus = LazyModule(new BBus)
  bbus.node := controller.node
  matrixBall.node := bbus.node
  vecBall.node := bbus.node
}

Inputs/Outputs:

  • Input: RoCC instruction interface, memory access interface
  • Output: Computation results, status information
  • Edge cases: Instruction conflict handling, resource contention management

BallController.scala

Main functionality: Ball domain controller, manages overall ball domain execution control

Key components:

class BallController extends Module {
  val io = IO(new Bundle {
    val rocc = Flipped(new RoCCCoreIO)
    val mem = new HellaCacheIO
    val domain_ctrl = new DomainControlIO
  })

  // Instruction queue and scheduling logic
  val inst_queue = Module(new Queue(new RoCCInstruction, 16))
  val scheduler = Module(new InstructionScheduler)
}

Scheduling strategy:

  • Static scheduling based on instruction type
  • Dynamic resource allocation and load balancing
  • Supports instruction pipelining and concurrent execution

DISA.scala

Main functionality: Distributed instruction scheduling architecture

Key components:

class DISA extends Module {
  val io = IO(new Bundle {
    val inst_in = Flipped(Decoupled(new Instruction))
    val exec_units = Vec(numUnits, new ExecutionUnitIO)
    val completion = Decoupled(new CompletionInfo)
  })

  // Distributed dispatch table
  val dispatch_table = Reg(Vec(numUnits, new DispatchEntry))
  val load_balancer = Module(new LoadBalancer)
}

Scheduling algorithms:

  • Round-robin scheduling for fairness
  • Priority scheduling for critical tasks
  • Dynamic scheduling adapts to load changes

DomainDecoder.scala

Main functionality: Ball domain instruction decoder

Key components:

class DomainDecoder extends Module {
  val io = IO(new Bundle {
    val inst = Input(UInt(32.W))
    val decoded = Output(new DecodedInstruction)
    val valid = Output(Bool())
  })

  // Instruction decode table
  val decode_table = Array(
    MATRIX_OP -> MatrixOpDecoder,
    VECTOR_OP -> VectorOpDecoder,
    IM2COL_OP -> Im2colOpDecoder
  )
}

Decode functionality:

  • Supports multiple instruction formats
  • Microcode expansion for complex instructions
  • Instruction dependency analysis and optimization

Usage

Design Features

  1. Modular architecture: Each accelerator is an independent module, easy to extend and maintain
  2. Unified interface: All accelerators communicate through unified ball domain bus
  3. Flexible scheduling: Supports multiple scheduling strategies, adapts to different computation patterns
  4. Scalability: Easy to add new accelerator types and functionality

Performance Optimization

  1. Pipeline design: Instruction decode, scheduling, execution use pipeline architecture
  2. Concurrent execution: Supports multiple accelerators working simultaneously
  3. Data management: On-chip data caching and coordinated access management
  4. Load balancing: Workload distribution across accelerators

Usage Example

// Create ball domain instance
val ballDomain = LazyModule(new BallDomain)

// Connect to RoCC interface
rocc.cmd <> ballDomain.module.io.rocc.cmd
rocc.resp <> ballDomain.module.io.rocc.resp

// Configure ball domain parameters
ballDomain.module.io.config := ballDomainConfig

Notes

  1. Resource management: Properly allocate computational resources, avoid resource conflicts
  2. Timing constraints: Pay attention to timing relationships and data synchronization between different modules
  3. Power control: Implement dynamic power management, shut down modules when not in use
  4. Debug support: Provide debug interfaces and status monitoring

BBus Ball Domain Bus System

Overview

This directory contains the implementation of Buckyball's ball domain bus system, primarily responsible for managing SRAM resource access by multiple Ball nodes within the ball domain. The bus system is implemented based on BBusNode from framework.blink, providing SRAM resource arbitration and routing functionality.

This directory implements two core components:

  • BallBus: Ball domain bus main module, manages SRAM access by multiple Ball nodes
  • BBusRouter: Bus router, provides routing functionality for Blink interface

Code Structure

bbus/
├── BallBus.scala    - Ball domain bus main module
└── router.scala     - Bus router implementation

File Dependencies

BallBus.scala (Main module)

  • Creates multiple BBusNode instances to manage Ball nodes
  • Connects external SRAM interfaces to each Ball node
  • Implements SRAM resource allocation and arbitration

router.scala (Routing module)

  • Implements routing functionality based on BBusNode
  • Provides Blink protocol interface encapsulation

Module Description

BallBus.scala

Main functionality: Ball domain bus main module, manages SRAM resource access by multiple Ball nodes

Key components:

class BallBus(maxReadBW: Int, maxWriteBW: Int, numBalls: Int) extends LazyModule {
  // Create multiple BBusNode instances
  val ballNodes = Seq.fill(numBalls) {
    new BBusNode(BallParams(sramReadBW = maxReadBW, sramWriteBW = maxWriteBW))
  }

  // External SRAM interfaces
  val io = IO(new Bundle {
    val sramRead = Vec(b.sp_banks, Flipped(new SramReadIO(...)))
    val sramWrite = Vec(b.sp_banks, Flipped(new SramWriteIO(...)))
    val accRead = Vec(b.acc_banks, Flipped(new SramReadIO(...)))
    val accWrite = Vec(b.acc_banks, Flipped(new SramWriteIO(...)))
  })
}

Resource allocation strategy:

  • First sp_banks ports connected to scratchpad SRAM
  • Next acc_banks ports connected to accumulator SRAM
  • Excess ports set to invalid state
  • All Ball nodes share the same SRAM resources
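
The allocation order above can be sketched as a simple index loop: the first sp_banks node ports route to the scratchpad, the next acc_banks ports route to the accumulator, and anything beyond that is tied off with DontCare. The bundle below is a stand-in for SramReadIO, and the parameter names are assumptions rather than the real BallBus code.

import chisel3._

// Simplified read-port stand-in; the real SramReadIO carries request/response channels.
class SimpleReadPort(w: Int) extends Bundle {
  val en   = Output(Bool())
  val addr = Output(UInt(12.W))
  val data = Input(UInt(w.W))
}

// Sketch of the port-allocation order for a single Ball node.
class PortAllocSketch(maxReadBW: Int, spBanks: Int, accBanks: Int, w: Int = 128) extends Module {
  val io = IO(new Bundle {
    val nodeRead = Vec(maxReadBW, Flipped(new SimpleReadPort(w)))  // requests from the Ball node
    val sramRead = Vec(spBanks,  new SimpleReadPort(w))            // toward scratchpad banks
    val accRead  = Vec(accBanks, new SimpleReadPort(w))            // toward accumulator banks
  })

  for (i <- 0 until maxReadBW) {
    if (i < spBanks) {
      io.sramRead(i) <> io.nodeRead(i)            // first sp_banks ports: scratchpad
    } else if (i < spBanks + accBanks) {
      io.accRead(i - spBanks) <> io.nodeRead(i)   // next acc_banks ports: accumulator
    } else {
      io.nodeRead(i) <> DontCare                  // excess ports: invalid / tied off
    }
  }
}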

Inputs/Outputs:

  • Input: SRAM access requests from each Ball node
  • Output: Read/write interfaces connected to external SRAM
  • Edge cases: Handle ports beyond configuration range, set to DontCare

Dependencies: framework.balldomain.blink.BBusNode, framework.builtin.memdomain.mem

router.scala

Main functionality: Bus router, provides routing functionality for Blink protocol interface

Key components:

class BBusRouter extends LazyModule {
  val node = new BBusNode(BallParams(
    sramReadBW = b.sp_banks,
    sramWriteBW = b.sp_banks
  ))

  val io = IO(new Bundle {
    val blink = Flipped(new BlinkBundle(node.edges.in.head))
  })
}

Routing functionality:

  • Implements standard Ball node interface based on BBusNode
  • Provides Blink protocol encapsulation and conversion
  • Supports configurable read/write bandwidth parameters

Inputs/Outputs:

  • Input: Blink protocol interface
  • Output: BBusNode standard interface
  • Edge cases: Depends on validity of node.edges.in.head

Dependencies: framework.balldomain.blink.BlinkBundle, framework.balldomain.blink.BBusNode

Usage

Configuration Parameters

Bus system configuration is controlled by the following parameters:

  • maxReadBW: Maximum read bandwidth (port count)
  • maxWriteBW: Maximum write bandwidth (port count)
  • numBalls: Ball node count
  • b.sp_banks: Scratchpad bank count
  • b.acc_banks: Accumulator bank count

Resource Management

  1. SRAM port allocation: Allocate ports in order of scratchpad first, accumulator second
  2. Multi-Ball sharing: All Ball nodes share the same SRAM resource pool
  3. Unused ports: Ports beyond the configured range are set to an invalid state (DontCare) to save resources

Usage Example

// Create ball domain bus
val ballBus = LazyModule(new BallBus(
  maxReadBW = 8,
  maxWriteBW = 8,
  numBalls = 4
))

// Connect external SRAM
scratchpad.io.read <> ballBus.module.io.sramRead
scratchpad.io.write <> ballBus.module.io.sramWrite
accumulator.io.read <> ballBus.module.io.accRead
accumulator.io.write <> ballBus.module.io.accWrite

Notes

  1. Resource conflicts: Multiple Ball nodes may access the same SRAM resources simultaneously, requiring upper-level coordination
  2. Bandwidth limitations: Actual available bandwidth is limited by configured maximum read/write bandwidth parameters
  3. Port mapping: Ensure SRAM port count matches configuration parameters to avoid out-of-bounds access
  4. Timing constraints: BBusNode timing requirements need to match external SRAM interfaces

Reservation Station & ROB

Overview

This module implements the Reservation Station and Reorder Buffer (ROB) in the Buckyball system for out-of-order execution and instruction scheduling support. The reservation station manages instruction issue and completion, while ROB ensures instructions commit in program order, maintaining precise exception semantics.

File Structure

rs/
├── reservationStation.scala  - Reservation station implementation
└── rob.scala                - Reorder buffer implementation

Core Components

BallReservationStation - Ball Domain Reservation Station

The reservation station is a key component connecting the instruction decoder and execution units, responsible for:

Main functionality:

  • Receives instructions from Ball domain decoder
  • Dispatches to different execution units based on instruction type
  • Manages instruction issue and completion status
  • Generates RoCC responses

Supported execution units:

  • ball1: VecUnit (vector processing unit)
  • ball2: BBFP (floating-point processing unit)
  • ball3: im2col (image processing accelerator)
  • ball4: transpose (matrix transpose accelerator)

Interface design:

class BallReservationStation extends Module {
  val io = IO(new Bundle {
    // Instruction input
    val ball_decode_cmd_i = Flipped(DecoupledIO(new BallDecodeCmd))

    // RoCC response output
    val rs_rocc_o = new Bundle {
      val resp = DecoupledIO(new RoCCResponseBB)
      val busy = Output(Bool())
    }

    // Execution unit interfaces
    val issue_o = new BallIssueInterface    // Issue interface
    val commit_i = new BallCommitInterface  // Commit interface
  })
}

Instruction dispatch logic:

// Dispatch instructions based on bid (Ball ID)
io.issue_o.ball1.valid := rob.io.issue.valid && rob.io.issue.bits.cmd.bid === 1.U  // VecUnit
io.issue_o.ball2.valid := rob.io.issue.valid && rob.io.issue.bits.cmd.bid === 2.U  // BBFP
io.issue_o.ball3.valid := rob.io.issue.valid && rob.io.issue.bits.cmd.bid === 3.U  // im2col
io.issue_o.ball4.valid := rob.io.issue.valid && rob.io.issue.bits.cmd.bid === 4.U  // transpose
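
The matching back-pressure path selects the chosen unit's ready signal back toward the ROB. A hedged sketch, reusing the names from the snippet above (the MuxLookup form and the false.B default are assumptions about the actual code):

// Back-pressure: the ROB head may only issue when the unit selected by bid is ready
rob.io.issue.ready := MuxLookup(rob.io.issue.bits.cmd.bid, false.B, Seq(
  1.U -> io.issue_o.ball1.ready,   // VecUnit
  2.U -> io.issue_o.ball2.ready,   // BBFP
  3.U -> io.issue_o.ball3.ready,   // im2col
  4.U -> io.issue_o.ball4.ready    // transpose
))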

ROB - Reorder Buffer

ROB implements sequential instruction management and out-of-order completion support:

Design features:

  • Uses FIFO queue to maintain instruction order
  • Uses completion status table to track instruction execution status
  • Supports out-of-order completion but in-order issue
  • Provides ROB ID for instruction identification

Core data structures:

class RobEntry extends Bundle {
  val cmd = new BallDecodeCmd           // Instruction content
  val rob_id = UInt(log2Up(rob_entries).W)  // ROB identifier
}

State management:

val robFifo = Module(new Queue(new RobEntry, rob_entries))  // Instruction queue
val robTable = Reg(Vec(rob_entries, Bool()))               // Completion status table
val robIdCounter = RegInit(0.U(log2Up(rob_entries).W))     // ID counter

Workflow

Instruction Allocation Flow

  1. Instruction enqueue: Instructions from decoder enter ROB
  2. Assign ROB ID: Allocate unique ROB ID to each instruction
  3. State initialization: Mark as incomplete in completion status table

when(io.alloc.fire) {
  robIdCounter := robIdCounter + 1.U
  robTable(robIdCounter) := false.B  // Mark as incomplete
}

Instruction Issue Flow

  1. Head check: Check if ROB head instruction is incomplete
  2. Type dispatch: Dispatch instruction to corresponding execution unit based on bid
  3. Ready control: Only issue when target execution unit is ready

val headEntry = robFifo.io.deq.bits
val headCompleted = robTable(headEntry.rob_id)
io.issue.valid := robFifo.io.deq.valid && !headCompleted

Instruction Completion Flow

  1. Completion arbitration: Multiple execution unit completion signals handled by arbiter
  2. State update: Update completion status table based on ROB ID
  3. Queue dequeue: Remove completed head instruction from ROB

val completeArb = Module(new Arbiter(UInt(log2Up(rob_entries).W), 4))
when(io.complete.fire) {
  robTable(io.complete.bits) := true.B  // Mark as completed
}
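
A rough sketch of how the four commit channels could feed that arbiter is shown below; the rob_id field name and the exact placement of this wiring are assumptions for illustration.

// Each execution unit's completion channel drives one arbiter input (rob_id payload)
completeArb.io.in(0).valid := io.commit_i.ball1.valid
completeArb.io.in(0).bits  := io.commit_i.ball1.bits.rob_id
io.commit_i.ball1.ready    := completeArb.io.in(0).ready
// ball2 / ball3 / ball4 connect to in(1) / in(2) / in(3) in the same way;
// the arbitrated output then drives the completion-table update shown above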

Configuration Parameters

Key Configuration Items

  • rob_entries: ROB entry count, affects out-of-order execution window size
  • Execution unit count: Currently supports 4 Ball execution units
  • Arbitration strategy: Uses round-robin arbitration for multiple completion signals

Performance Considerations

  • ROB size: Larger ROB supports more out-of-order execution but increases hardware overhead
  • Issue bandwidth: Currently maximum one instruction issued per cycle
  • Completion bandwidth: Supports multiple instruction completions per cycle

Interface Protocol

BallIssueInterface - Issue Interface

class BallIssueInterface extends Bundle {
  val ball1 = Decoupled(new BallRsIssue)  // VecUnit issue
  val ball2 = Decoupled(new BallRsIssue)  // BBFP issue
  val ball3 = Decoupled(new BallRsIssue)  // im2col issue
  val ball4 = Decoupled(new BallRsIssue)  // transpose issue
}

BallCommitInterface - Commit Interface

class BallCommitInterface extends Bundle {
  val ball1 = Flipped(Decoupled(new BallRsComplete))  // VecUnit commit
  val ball2 = Flipped(Decoupled(new BallRsComplete))  // BBFP commit
  val ball3 = Flipped(Decoupled(new BallRsComplete))  // im2col commit
  val ball4 = Flipped(Decoupled(new BallRsComplete))  // transpose commit
}

Usage Examples

Basic Configuration

// Configure ROB size in CustomBuckyballConfig
class MyBuckyballConfig extends CustomBuckyballConfig {
  override val rob_entries = 16  // 16-entry ROB
}

// Instantiate reservation station
val reservationStation = Module(new BallReservationStation)

Connecting Execution Units

// Connect VecUnit
vecUnit.io.cmd <> reservationStation.io.issue_o.ball1
reservationStation.io.commit_i.ball1 <> vecUnit.io.resp

// Connect BBFP
bbfp.io.cmd <> reservationStation.io.issue_o.ball2
reservationStation.io.commit_i.ball2 <> bbfp.io.resp

Debug and Monitoring

Status Signals

  • io.rs_rocc_o.busy: Reservation station busy status
  • rob.io.empty: ROB empty status
  • rob.io.full: ROB full status

Performance Counters

The following performance counters can be added for monitoring:

  • Instruction issue count
  • Instruction completion count
  • ROB utilization
  • Load distribution across execution units
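
For instance, an issue counter is just a register incremented on each issue fire; the 64-bit width and signal names below are assumptions.

import chisel3._

// Counter idiom for monitoring: count how many instructions have issued.
class IssueCounterExample extends Module {
  val io = IO(new Bundle {
    val issueFire  = Input(Bool())        // asserted for one cycle per issued instruction
    val issueCount = Output(UInt(64.W))
  })
  val cnt = RegInit(0.U(64.W))
  when(io.issueFire) { cnt := cnt + 1.U }
  io.issueCount := cnt
}

The same pattern covers completion counts and per-unit load counters; ROB utilization can be derived from the difference between allocation and completion counts.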

Extension Guide

Adding New Execution Units

  1. Add new issue port in BallIssueInterface
  2. Add corresponding commit port in BallCommitInterface
  3. Add corresponding dispatch and arbitration logic in reservation station
  4. Update completion signal arbiter port count
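
For example, adding a hypothetical fifth unit would touch the following places (a sketch based on the interfaces shown earlier; ball5 and bid value 5 are illustrative):

// 1. BallIssueInterface: add the new issue port
val ball5 = Decoupled(new BallRsIssue)              // new accelerator issue

// 2. BallCommitInterface: add the matching commit port
val ball5 = Flipped(Decoupled(new BallRsComplete))  // new accelerator commit

// 3. Reservation station: dispatch on the new bid value
io.issue_o.ball5.valid := rob.io.issue.valid && rob.io.issue.bits.cmd.bid === 5.U

// 4. Widen the completion arbiter from 4 to 5 inputs
val completeArb = Module(new Arbiter(UInt(log2Up(rob_entries).W), 5))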

Optimization Suggestions

  • Multi-issue support: Can be extended to issue multiple instructions per cycle
  • Dynamic scheduling: Implement more complex scheduling algorithms
  • Load balancing: Perform load balancing across multiple execution units of the same type

Simulation Configurations

This directory contains simulation configurations and interfaces for various simulators, providing unified configuration management for different simulation environments.

Directory Structure

sims/
├── firesim/
│   └── TargetConfigs.scala    - FireSim FPGA simulation configuration
├── verilator/
│   └── Elaborate.scala        - Verilator simulation top-level generation
└── verify/
    └── TargetConfig.scala     - Verification configurations

Verilator Simulation (verilator/)

Elaborate.scala

Top-level generator for Verilator simulation:

object Elaborate extends App {
  // Select Ball type from command line arguments
  val ballType = args.headOption.getOrElse("toy")

  val config = ballType match {
    case "toy" => new ToyBuckyballConfig
    case "vec" => new WithBlink(TargetBall.VecBall)
    case "matrix" => new WithBlink(TargetBall.MatrixBall)
    case "transpose" => new WithBlink(TargetBall.TransposeBall)
    case "im2col" => new WithBlink(TargetBall.Im2colBall)
    case "relu" => new WithBlink(TargetBall.ReluBall)
    case _ => new ToyBuckyballConfig
  }

  val gen = () => LazyModule(new TestHarness()(config)).module

  (new ChiselStage).execute(
    args.tail,  // Remaining args passed to firtool
    Seq(
      ChiselGeneratorAnnotation(gen),
      TargetDirAnnotation("generated-src/verilator")
    )
  )
}

Generation Flow:

  1. Parse command line arguments and configuration
  2. Instantiate Buckyball system module
  3. Generate Verilog RTL code
  4. Output auxiliary files for simulation

Output Files:

  • *.v - Verilog files
  • *.anno.json - FIRRTL annotation files
  • *.fir - FIRRTL intermediate representation

FireSim Simulation (firesim/)

TargetConfigs.scala

Configurations for running on FireSim FPGA platform:

class FireSimBuckyballConfig extends Config(
  new WithDefaultFireSimBridges ++
  new WithDefaultMemModel ++
  new WithFireSimConfigTweaks ++
  new BuckyballConfig
)

Key Configuration Items:

  • Bridge Configuration: UART, BlockDevice, NIC I/O bridges
  • Memory Model: DDR3/DDR4 memory controller configuration
  • Clock Domains: Multi-clock domain management
  • Debug Interface: JTAG and Debug Module configuration

Use Cases:

  • Large-scale system simulation
  • Long-running workload testing
  • Multi-core system performance evaluation
  • I/O-intensive application verification

Verification Configurations (verify/)

TargetConfig.scala

Configurations for single Ball device verification:

sealed trait TargetBall
object TargetBall {
  case object VecBall extends TargetBall
  case object MatrixBall extends TargetBall
  case object TransposeBall extends TargetBall
  case object Im2colBall extends TargetBall
  case object ReluBall extends TargetBall
}

WithBlink Configuration: Empty configuration class for composing with Ball-specific configs

Usage:

# Verify specific Ball device
mill arch.runMain sims.verilator.Elaborate matrix
mill arch.runMain sims.verilator.Elaborate transpose

Build and Usage

Verilator Simulation Build

# Generate Verilog
cd arch
mill arch.runMain sims.verilator.Elaborate [ball_type]

# Build simulator (in sims/verilator directory)
cd ../../sims/verilator
make CONFIG=ToyBuckyball

Available Ball Types:

  • toy: Complete toy system (default)
  • vec: Vector Ball only
  • matrix: Matrix Ball only
  • transpose: Transpose Ball only
  • im2col: Im2col Ball only
  • relu: ReLU Ball only

FireSim Deployment

# Set up FireSim environment
cd firesim
source sourceme-f1-manager.sh

# Build FPGA bitstream
firesim buildbitstream

# Run simulation
firesim runworkload

Debug and Optimization

Verilator Debug

  • Waveform Generation: Use --trace option to generate VCD files
  • Performance Profiling: Use --prof-cfuncs for profiling
  • Coverage: Use --coverage to generate coverage reports

FireSim Debug

  • Printf Debugging: Use printf statements for debug output
  • Assertion Checking: Enable runtime assertion verification
  • Performance Counters: Integrated HPM counters for monitoring

Configuration Parameters

Common Parameters

// Processor core configuration
case object RocketTilesKey extends Field[Seq[RocketTileParams]]

// Memory system configuration
case object MemoryBusKey extends Field[MemoryBusParams]

// Peripheral configuration
case object PeripheryBusKey extends Field[PeripheryBusParams]

Simulation-Specific Parameters

// Verilator simulation parameters
case object VerilatorDRAMKey extends Field[Boolean](false)

// FireSim simulation parameters
case object FireSimBridgesKey extends Field[Seq[BridgeIOAnnotation]]

Extension Development

Adding New Simulator Support

  1. Create new configuration directory (e.g., vcs/)
  2. Implement simulator-specific configuration classes
  3. Add build scripts and Makefiles
  4. Update documentation and test cases
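
As a rough template, a VCS backend could mirror the existing Verilator elaborator with its own target directory. Everything below (the sims.vcs package and output path) is hypothetical and modeled on the Elaborate object shown earlier.

package sims.vcs  // hypothetical new simulator backend

import chisel3.stage.{ChiselGeneratorAnnotation, ChiselStage}
import firrtl.options.TargetDirAnnotation
import freechips.rocketchip.diplomacy.LazyModule

// Hypothetical VCS elaborator: same generator as the Verilator flow,
// emitting RTL into a VCS-specific output directory.
object Elaborate extends App {
  val config = new examples.toy.BuckyballToyConfig
  val gen = () => LazyModule(new chipyard.harness.TestHarness()(config.toInstance)).module

  (new ChiselStage).execute(
    args,
    Seq(
      ChiselGeneratorAnnotation(gen),
      TargetDirAnnotation("generated-src/vcs")
    )
  )
}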

Custom Configuration

class MyCustomConfig extends Config(
  new WithMyCustomParameters ++
  new BuckyballConfig
)

FireSim Simulation Configuration

Overview

This directory contains Buckyball system simulation configuration for the FireSim platform. FireSim is an open-source FPGA-based simulation platform that provides hardware simulation environments, supporting system-level simulation and performance analysis.

File Structure

firesim/
└── TargetConfigs.scala  - FireSim target configuration

Configuration Description

TargetConfigs.scala

This file defines Buckyball system configuration for the FireSim platform:

WithBootROM Configuration:

class WithBootROM extends Config((site, here, up) => {
  case BootROMLocated(x) => {
    // Automatically select BootROM path
    val chipyardBootROM = new File(s"./thirdparty/chipyard/generators/testchipip/bootrom/bootrom.rv${MaxXLen}.img")
    val firesimBootROM = new File(s"./thirdparty/chipyard/target-rtl/chipyard/generators/testchipip/bootrom/bootrom.rv${MaxXLen}.img")

    // Prefer chipyard path, use firesim path if it doesn't exist
    val bootROMPath = if (chipyardBootROM.exists()) {
      chipyardBootROM.getAbsolutePath()
    } else {
      firesimBootROM.getAbsolutePath()
    }
  }
})

FireSimBuckyballToyConfig Configuration:

class FireSimBuckyballToyConfig extends Config(
  new WithBootROM ++                              // BootROM configuration
  new firechip.chip.WithDefaultFireSimBridges ++ // Default FireSim bridges
  new firechip.chip.WithFireSimConfigTweaks ++   // FireSim configuration tweaks
  new examples.toy.BuckyballToyConfig            // Buckyball toy configuration
)

Advanced Configuration

Custom BootROM:

class MyFireSimConfig extends Config(
  new WithBootROM ++
  new MyCustomBuckyballConfig ++
  // Other configurations...
)

Verilator Simulation Configuration

Overview

This directory contains Buckyball system simulation configuration for the Verilator platform. Verilator is an open-source Verilog/SystemVerilog simulator that compiles RTL code into high-performance C++ simulation models, providing a fast functional simulation and verification environment.

File Structure

verilator/
└── Elaborate.scala  - Verilator elaboration configuration

Core Implementation

Elaborate.scala

This file implements the Verilog generation and elaboration process for the Buckyball system:

object Elaborate extends App {
  val config = new examples.toy.BuckyballToyConfig
  val params = config.toInstance

  ChiselStage.emitSystemVerilogFile(
    new chipyard.harness.TestHarness()(config.toInstance),
    firtoolOpts = args,  // command-line arguments are forwarded to firtool
    args = Array.empty
  )
}

Compiler Build Guide

Basic Workload Compilation

To build the workload, follow these steps:

mkdir build && cd build
cmake -G Ninja ..
ninja

Model-Level Testing

To enable model-level testing with specific models and architectures:

mkdir build && cd build
cmake -G Ninja .. \
    -DMODEL="lenet,resnet18,mobilenetv3,bert,stablediffusion,llama2,deepseekr1" \
    -DARCH="gemmini,buckyball"
ninja

Note:

  1. Model downloads for bert, whisper, stable-diffusion, llama2, DeepseekR1 require pre-configured HuggingFace access
  2. whisper is currently not supported
  3. The llama2 model download requires an additional API key or cached credentials

Tile Dialect Refactoring Documentation

Refactoring Background and Goals

The core goal of this refactoring is to introduce a new intermediate layer between Linalg Dialect and Buckyball Dialect - the Tile Dialect - to achieve clearer separation of responsibilities and better code organization. In the original architecture, the conversion from linalg.matmul to hardware instructions was completed in one step through convert-linalg-to-buckyball, which caused Buckyball Dialect to handle both the slicing logic for arbitrary-size matrices and hardware-level memory management and computation scheduling, resulting in overly mixed responsibilities. The new architecture splits the conversion process into two phases: convert-linalg-to-tile and convert-tile-to-buckyball, making each layer have a clear and single responsibility.

New Architecture Design

The entire compilation flow is now divided into three clear layers. First is the Linalg layer, which represents high-level linear algebra operations, such as linalg.matmul representing matrix multiplication of arbitrary size. This layer does not care about hardware constraints. Next is the newly introduced Tile layer, whose core responsibility is to tile arbitrary-size matrix operations into fixed-size blocks that conform to hardware constraints. The Tile layer expresses this high-level tiling intent through the tile.tile_matmul operation. The specific tiling strategy, loop generation, and boundary handling are all implemented in the convert-tile-to-buckyball pass. Finally, the Buckyball layer focuses on hardware-level operations. buckyball.bb_matmul receives pre-tiled fixed-size matrix blocks and is responsible for generating precise hardware instruction sequences, including data movement (mvin/mvout), computation scheduling (mul_warp16), and memory address calculation.

Tile Dialect Design Details

The Tile Dialect defines the TileMatMulOp operation, which accepts three memref parameters representing matrices A, B, and C respectively. The semantics of this operation are: perform multiplication on input matrices of arbitrary size, automatically handling tiling, padding, and loops. In implementation, TileMatMulOp will be converted by the convert-tile-to-buckyball pass into multiple buckyball.bb_matmul operations and corresponding memref.subview operations. This conversion process will consider hardware scratchpad size limitations, warp and lane parallelism constraints, and generate an optimal tiling strategy. The design philosophy of the Tile layer is to provide a platform-independent intermediate representation, allowing upper-layer optimizations to transform matrix operations without understanding specific hardware details.

Buckyball Dialect Simplification

In the new architecture, the Buckyball Dialect has been significantly simplified. The original four operations VecTileMatMulOp, MergeTileMatMulOp, MetaTileMatMulOp, and VecMulWarp16Op have been unified into a single MatMulOp. This simplification is reasonable because the tiling logic has been moved up to the Tile layer, and the Buckyball layer only needs to express the single concept of "performing hardware-level multiplication on a matrix block that already conforms to hardware constraints." The lowering process of buckyball.bb_matmul will directly generate LLVM intrinsics: first load matrices A and B into the scratchpad through Mvin_IntrOp, then generate multiple Mul_Warp16_IntrOp operations based on warp and lane parameters for computation, and finally write the results back to main memory through Mvout_IntrOp. All address calculations and encodings are completed in this lowering process.

Key Implementation Details

When implementing the convert-linalg-to-tile pass, the core logic is very simple: match the linalg.matmul operation and directly replace it with tile.tile_matmul, passing the same three memref operands. The role of this pass is mainly type and semantic conversion, indicating that we have moved from the general linear algebra operation domain into the hardware-oriented tile operation domain.

The convert-tile-to-buckyball pass is the most complex part of the entire refactoring. It needs to extract matrix dimension information (M, K, N) from the operands of tile.tile_matmul, then calculate the optimal tiling strategy based on hardware parameters (dim, warp, lane). For the K dimension, it will tile according to warp size; for M and N dimensions, it will consider scratchpad capacity limitations. Each tile corresponds to a buckyball.bb_matmul operation, and tiles are connected through memref.subview to create matrix views. Special attention should be paid to handling boundary cases: when matrix dimensions cannot be evenly divided by tile size, the actual size of the last tile needs to be calculated to avoid out-of-bounds access.

When implementing BuckyballMatMulLowering, we encountered an important concept in MLIR's type conversion system: OpAdaptor. In conversion patterns, the types of the original operation (such as memref<32x16xi8>) will be converted to LLVM types (such as LLVM struct types) by the TypeConverter during the lowering process. OpAdaptor provides converted values, but we need to obtain type information (such as shape) from the original operation because this static information may no longer exist in the same form after conversion. Therefore, the correct approach is: obtain the original MemRefType from matMulOp.getOperandTypes() to extract shape information for address calculation and loop generation; for actual value operations (such as ExtractAlignedPointerAsIndexOp), use the original memref value, because MLIR's memref operations still require MemRefType.

Another key design decision is: MatMulOp's lowering should directly generate intrinsic operations (Mvin_IntrOp, Mul_Warp16_IntrOp, Mvout_IntrOp), rather than generating MvinOp, MvoutOp and then waiting for them to be lowered. The reason is that in the LLVM lowering stage, the type system has already been converted, and creating high-level Buckyball operations again would cause type mismatch issues. Directly generating intrinsics avoids multiple type conversions and makes the code clearer and more efficient. Referring to the Gemmini dialect implementation, we adopted the same strategy.

Test System

To verify the correctness of the new architecture, we created complete test cases in the bb-tests/workloads/src/OpTest/tile/ directory. Tests are divided into two categories: staged tests and end-to-end tests.

tile-matmul.mlir tests the conversion from Linalg to Tile, verifying that linalg.matmul is correctly converted to tile.tile_matmul. This is the most basic type conversion test. tile-to-buckyball.mlir tests the conversion from Tile to Buckyball, verifying that the tiling logic is correct and that the correct number of buckyball.bb_matmul operations and memref.subview operations are generated. buckyball-to-llvm.mlir tests the conversion from Buckyball MatMulOp to LLVM intrinsics, verifying that the correct sequences of buckyball.intr.bb_mvin, buckyball.intr.bb_mul_warp16, and buckyball.intr.bb_mvout instructions are generated.

end-to-end.mlir is the most important test, testing the complete conversion flow: starting from linalg.matmul, sequentially passing through the three passes -convert-linalg-to-tile, -convert-tile-to-buckyball, -lower-buckyball, and finally generating LLVM intrinsics. This test ensures that each part of the entire pipeline works correctly and that there are no issues with the connections between parts.

Pass Registration and Toolchain Integration

The two newly added passes need to be registered in multiple places. First, register the pass creation functions registerLowerLinalgToTilePass() and registerLowerTileToBuckyballPass() in InitAll.cpp, and also register buddy::tile::TileDialect. In the buddy-opt tool, buddy::tile::TileDialect needs to be added to the dialect registry so that the tool can recognize and parse tile dialect operations. In the CMake build system, the new libraries BuddyTile, LowerLinalgToTilePass, and LowerTileToBuckyballPass need to be added to the link dependencies, ensuring correct dependency relationships.

It is particularly worth noting that in the configureBuckyballLegalizeForExportTarget function in LegalizeForLLVMExport.cpp, we need to add target.addLegalDialect<memref::MemRefDialect>() and target.addLegalDialect<arith::ArithDialect>(), because memref and arith operations will be used during the lowering process of MatMulOp. If these dialects are not marked as legal, the conversion framework will attempt to lower these operations, causing type conversion conflicts.

Agent Workflow

AI assistant workflow in Buckyball framework, providing conversational interaction with AI models.

API Usage

chat

Endpoint: POST /agent/chat

Function: Conversational interaction with AI assistant

Parameters:

  • message [Required] - Message content to send to AI
  • model - AI model to use, default "deepseek-chat"

Examples:

# Basic conversation
bbdev agent --chat "--message 'Hello, can you help me with Buckyball development?'"

# Specify model
bbdev agent --chat "--message 'Explain this Scala code' --model deepseek-chat"

# Code analysis
bbdev agent --chat "--message 'Please analyze this Chisel module and suggest optimizations'"

Response:

{
  "traceId": "unique-trace-id",
  "status": "success"
}

Notes

  • Requires configured AI model API key
  • Responses use streaming output
  • Note message length limits

Compiler Workflow

Compiler build workflow in the Buckyball framework for building the Buckyball compiler toolchain.

API Usage

build

Endpoint: POST /compiler/build

Function: Build Buckyball compiler

Parameters: No specific parameters

Example:

bbdev compiler --build

Response:

{
  "status": 200,
  "body": {
    "success": true,
    "processing": false,
    "return_code": 0
  }
}

Notes

  • Ensure the system has necessary build tools and dependencies

Doc-Agent Workflow

Documentation generation workflow in the Buckyball framework, providing automated code documentation generation functionality.

API Usage Guide

generate

Endpoint: POST /doc/generate

Function: Generate documentation for specified directory

Parameters:

  • target_path [Required] - Target directory path
  • mode [Required] - Generation mode, options: "create", "update"

Example:

# Create new documentation for specified directory
bbdev doc --generate "--target_path arch/src/main/scala/framework --mode create"

# Update existing documentation
bbdev doc --generate "--target_path arch/src/main/scala/framework --mode update"

Response:

{
  "traceId": "unique-trace-id",
  "status": "success",
  "message": "Documentation generated successfully"
}

Supported Document Types

  • RTL hardware documentation
  • Test documentation
  • Script documentation
  • Simulator documentation
  • Workflow documentation

Important Notes

  • Requires AI model API key configuration
  • Generated documentation is automatically integrated into the mdBook system
  • Supports symbolic link management and automatic SUMMARY.md updates

Marshal Workflow

Marshal workflow in the Buckyball framework, used to build and launch the Marshal component.

API Usage Guide

build

Endpoint: POST /marshal/build

Function: Build Marshal component

Parameters: No specific parameters

Example:

bbdev marshal --build

launch

Endpoint: POST /marshal/launch

Function: Launch Marshal service

Parameters: No specific parameters

Example:

bbdev marshal --launch

Typical Workflow

# 1. Build Marshal
bbdev marshal --build

# 2. Launch Marshal service
bbdev marshal --launch

Response Format

All API calls return a unified format:

{
  "status": 200,
  "body": {
    "success": true,
    "processing": false,
    "return_code": 0
  }
}

Sardine Workflow

Sardine workflow in the Buckyball framework for running Sardine-related tasks.

API Usage

run

Endpoint: POST /sardine/run

Function: Run Sardine tasks

Parameters:

  • workload - Specify the workload to run

Example:

# Run specified workload
bbdev sardine --run "--workload /path/to/workload"

# Run default workload
bbdev sardine --run

Response:

{
  "status": 200,
  "body": {
    "success": true,
    "processing": false,
    "return_code": 0
  }
}

UVM Workflow

UVM (Universal Verification Methodology) workflow in the Buckyball framework for building and running UVM verification environments.

API Usage

builddut

Endpoint: POST /uvm/builddut

Function: Build DUT (Design Under Test)

Parameters:

  • jobs - Number of parallel build tasks, default 16

Example:

# Build DUT with default parallelism
bbdev uvm --builddut

# Specify number of parallel tasks
bbdev uvm --builddut "--jobs 8"

build

Endpoint: POST /uvm/build

Function: Build UVM executable

Parameters:

  • jobs - Number of parallel build tasks, default 16

Example:

# Build UVM with default parallelism
bbdev uvm --build

# Specify number of parallel tasks
bbdev uvm --build "--jobs 8"

Typical Workflow

# 1. Build DUT
bbdev uvm --builddut

# 2. Build UVM environment
bbdev uvm --build

Response Format:

{
  "status": 200,
  "body": {
    "success": true,
    "processing": false,
    "return_code": 0
  }
}

Verilator Simulation Workflow

Hardware simulation workflow based on Verilator in the Buckyball framework, providing a complete automation flow from RTL generation to simulation execution. Verilator is a high-performance Verilog simulator that supports fast functional verification and performance analysis.

I. API Usage Guide

run

Endpoint: POST /verilator/run

Function: Execute the complete workflow: clean the build directory, generate Verilog, compile with Verilator into a simulation executable, and run the simulation

Parameters:

  • jobs - Number of parallel compilation tasks
    • Default value: 16
  • binary [Required] - Test binary file path
    • Default value: ""

Example:

# bbdev wrapper
bbdev verilator --run "--jobs 256 --binary ${buckyball}/bb-tests/workloads/build/src/CTest/ctest_mvin_mvout_alternate_test_singlecore-baremetal --batch"

# Raw command
curl -X POST http://localhost:5000/verilator/run -H "Content-Type: application/json" -d '{"jobs": 8, "binary": "/home/user/test.elf"}'

clean

Endpoint: POST /verilator/clean

Function: Clean build folder

Parameters: None

Example:

curl -X POST http://localhost:5000/verilator/clean

verilog

Endpoint: POST /verilator/verilog

Function: Generate Verilog code only, without compiling or running the simulation

Parameters: None

Example:

curl -X POST http://localhost:5000/verilator/verilog -d '{"jobs": 8}'

build

Endpoint: POST /verilator/build

Function: Compile the Verilog and C++ source files into an executable simulation binary

Parameters:

  • jobs - Number of parallel compilation tasks
    • Default value: 16

Example:

curl -X POST http://localhost:5000/verilator/build -d '{"jobs": 16}'

sim

Endpoint: POST /verilator/sim

Function: Run existing simulation executable

Parameters:

  • binary [Required] - Custom test binary file path

Example:

curl -X POST http://localhost:5000/verilator/sim \
  -H "Content-Type: application/json" \
  -d '{"binary": "/home/user/test_program.elf"}'

II. Developer Documentation

Directory Structure

steps/verilator/
├── 00_start_node_noop_step.py      # Workflow entry node definition
├── 00_start_node_noop_step.tsx     # Frontend UI component
├── 01_run_api_step.py              # Complete workflow API entry
├── 01_clean_api_step.py            # Clean API endpoint
├── 01_verilog_api_step.py          # Verilog generation API endpoint
├── 01_build_api_step.py            # Build API endpoint
├── 01_sim_api_step.py              # Simulation API endpoint
├── 02_clean_event_step.py          # Clean build directory
├── 03_verilog_event_step.py        # Verilog code generation
├── 04_build_event_step.py          # Verilator compilation
├── 05_sim_event_step.py            # Simulation execution
├── 99_complete_event_step.py       # Completion handling
├── 99_error_event_step.py          # Error handling
└── README.md                       # This document

Workflow Steps Detailed

1. Entry Node (00_start_node_noop_step.py)

  • Type: noop node
  • Function: Provide UI interface entry point
  • Frontend: "Start Build Verilator" button

2. API Endpoints

  • Complete Workflow API (01_run_api_step.py): /verilator/run → verilator.run
  • Clean API (01_clean_api_step.py): /verilator/clean → verilator.clean
  • Verilog Generation API (01_verilog_api_step.py): /verilator/verilog → verilator.verilog
  • Build API (01_build_api_step.py): /verilator/build → verilator.build
  • Simulation API (01_sim_api_step.py): /verilator/sim → verilator.sim

3. Clean Step (02_clean_event_step.py)

  • Type: event step
  • Subscribes: verilator.run, verilator.clean
  • Emits: verilator.verilog, verilator.complete
  • Function: Delete the build directory; used by the full workflow or as a standalone operation

4. Verilog Generation (03_verilog_event_step.py)

  • Type: event step
  • Subscribes: verilator.verilog
  • Emits: verilator.build, verilator.complete
  • Function: Use mill to generate Verilog code to build directory

5. Verilator Compilation (04_build_event_step.py)

  • Type: event step
  • Subscribes: verilator.build
  • Emits: verilator.sim, verilator.complete
  • Function: Compile Verilog and C++ source files into executable simulation file

6. Simulation Execution (05_sim_event_step.py)

  • Type: event step
  • Subscribes: verilator.sim
  • Emits: verilator.complete
  • Function: Run simulation, supports custom binary parameter

7. Completion Handling (99_complete_event_step.py)

  • Type: event step
  • Subscribes: verilator.complete
  • Function: Print success message, mark workflow as complete

8. Error Handling (99_error_event_step.py)

  • Type: event step
  • Subscribes: verilator.error
  • Function: Print error message, handle workflow exceptions

Workflow Diagram

graph TD;
    API[POST /verilator<br/>Complete Workflow] --> RUN[verilator.run]

    CLEAN_DIRECT[verilator.clean<br/>Single-step Clean] --> CLEAN_STEP[02_clean_event_step]
    VERILOG_DIRECT[verilator.verilog<br/>Single-step Generate] --> VERILOG_STEP[03_verilog_event_step]
    BUILD_DIRECT[verilator.build<br/>Single-step Build] --> BUILD_STEP[04_build_event_step]
    SIM_DIRECT[verilator.sim<br/>Single-step Simulation] --> SIM_STEP[05_sim_event_step]

    RUN --> CLEAN_STEP
    CLEAN_STEP --> |Workflow Mode| VERILOG_STEP
    CLEAN_STEP --> |Single-step Mode| COMPLETE[verilator.complete]

    VERILOG_STEP --> |Workflow Mode| BUILD_STEP
    VERILOG_STEP --> |Single-step Mode| COMPLETE

    BUILD_STEP --> |Workflow Mode| SIM_STEP
    BUILD_STEP --> |Single-step Mode| COMPLETE

    SIM_STEP --> COMPLETE

    COMPLETE --> COMPLETE_STEP[99_complete_event_step]

    CLEAN_STEP -.-> |Error| ERROR[verilator.error]
    VERILOG_STEP -.-> |Error| ERROR
    BUILD_STEP -.-> |Error| ERROR
    SIM_STEP -.-> |Error| ERROR

    ERROR --> ERROR_STEP[99_error_event_step]

    classDef apiNode fill:#e1f5fe
    classDef eventNode fill:#f3e5f5
    classDef stepNode fill:#e8f5e8
    classDef endNode fill:#fff3e0

Workload Workflow

Workload build workflow in Buckyball framework, used to build test workloads and benchmark programs.

API Usage

build

Endpoint: POST /workload/build

Function: Build workload

Parameters:

  • workload - Specify workload name to build

Examples:

# Build specific workload
bbdev workload --build "--workload test_program"

# Build all workloads
bbdev workload --build

Response:

{
  "status": 200,
  "body": {
    "success": true,
    "processing": false,
    "return_code": 0
  }
}

Notes

  • Workload source code is located in the bb-tests/workloads directory
  • Build results typically output to bb-tests/workloads/build directory

Contributors

Thank you to all developers and researchers who have contributed to the Buckyball project.

Core Development Team

The Buckyball project is primarily developed by the DangoSys team, dedicated to building a high-performance domain-specific architecture framework.

Contribution Methods

We welcome contributions of all kinds:

Code Contributions

  • Hardware architecture design and optimization
  • Software toolchain improvements
  • Test cases and benchmark programs
  • Documentation writing and maintenance

Issue Feedback

  • Bug reports and fix suggestions
  • Feature requirements and improvement suggestions
  • Performance optimization suggestions
  • Usage experience feedback

Academic Collaboration

  • Research papers and technical reports
  • Conference presentations and technical sharing
  • Open source community promotion

Participation Guidelines

  1. Fork Project: Create a project branch from GitHub
  2. Local Development: Set up development environment according to documentation
  3. Submit Changes: Follow code standards and commit format
  4. Create PR: Describe changes and test results in detail
  5. Code Review: Cooperate with maintainers to complete code review process

Contact

  • GitHub: DangoSys/buckyball
  • Issues: Report issues through GitHub Issues
  • Discussions: Participate in Slack for discussions

Acknowledgments

Special thanks to the following open source projects and communities:

  • Buddy-Compiler development team
  • Chipyard project
  • RISC-V Foundation
  • All test users and feedback providers