Buckyball

Buckyball is a scalable framework for domain-specific architectures, built on the RISC-V architecture and optimized for high-performance computing and machine learning accelerator design.

Project Overview

The buckyball framework provides a complete hardware design, simulation verification, and software development toolchain, supporting the full development process from RTL design to system-level verification. The framework adopts a modular design that supports flexible configuration and extension, suitable for various specialized computing scenarios.

Quick Start

Environment Dependencies

Before getting started, please ensure your system meets the following dependency requirements:

Required Software:

  • Anaconda/Miniconda (Python environment management)
  • Ninja Build System
  • GTKWave (waveform viewer)
  • Bash Shell environment (doesn't need to be the primary shell)

Installing Dependencies:

# Install Anaconda
# Download from: https://www.anaconda.com/download/

# Install system tools
sudo apt install ninja-build gtkwave

# Optional: FireSim passwordless configuration
# Add to /etc/sudoers: user_name ALL=(ALL) NOPASSWD:ALL

Source Build

1. Clone Repository

git clone https://github.com/DangoSys/buckyball.git
cd buckyball

2. Initialize Environment

./scripts/init.sh

Note: Initialization takes approximately 3 hours, including dependency downloads and compilation

3. Environment Activation

source env.sh

4. Verify Installation

Run Verilator simulation test to verify installation:

bbdev verilator --run '--jobs 16 --binary ctest_vecunit_matmul_ones_singlecore-baremetal --config sims.verilator.BuckyballToyVerilatorConfig --batch'

Docker Quick Experience

We provide a Docker environment for rapid deployment of buckyball.

Notice:

  • Docker images are provided only for specific release versions.
  • The Docker image may not be the latest version; building from source is recommended.

We do not provide support for this version as it is not a stable release.

Buckyball as a library

We provide a streamlined buckyball installation that can be integrated as a generator within Chipyard.

Notice:

  • buckyball-as-a-lib is maintained only for specific release versions.

We do not provide support for this version as it is not a stable release.

Quick Tutorial

You can start learning about ball and blink from here.

Additional Resources

You can learn more from DeepWiki and Zread

Community

Join our discussion on Slack

Contributors

Thank you for considering contributing to buckyball!

Buckyball Project Structure Overview

Buckyball is a scalable framework for domain-specific architectures. The project adopts a modular design with clear directory responsibilities, supporting a complete toolchain from hardware design to software development.

Main Directory Structure

Core Architecture Module

  • arch/ - Hardware architecture implementation, containing RTL code written in Scala/Chisel
    • Based on Rocket-chip and Chipyard framework
    • Implements custom RoCC coprocessors and memory subsystems
    • Supports various configuration and extension options

Test Verification Module

  • bb-tests/ - Unified test framework
    • workloads/ - Application workload tests
    • customext/ - Custom extension verification
    • sardine/ - Sardine test framework
    • uvbb/ - Unit test suite

Simulation Environment Module

  • sims/ - Simulators and verification environments
    • Supports Verilator, VCS and other simulators
    • Integrates FireSim FPGA accelerated simulation
    • Provides performance analysis and debugging tools

Development Tools Module

  • scripts/ - Build and deployment scripts

    • Environment initialization scripts
    • Automated build tools
    • Dependency management and configuration
  • workflow/ - Development workflows and automation

    • CI/CD pipeline configuration
    • Documentation generation tools
    • Code quality checks

Documentation System

  • docs/ - Project documentation
    • bb-note/ - Technical documentation based on mdBook
    • img/ - Documentation image resources
    • Supports automatic generation and updates

Third-party Dependencies

  • thirdparty/ - External dependency modules (submodules)
    • chipyard/ - Berkeley Chipyard SoC design framework
    • circt/ - CIRCT circuit compiler toolchain

Development Workflow

  1. Environment Setup: Use scripts/init.sh to initialize the development environment
  2. Architecture Development: Perform hardware design and modifications in the arch/ directory
  3. Test Verification: Use test suites in bb-tests/ for functional verification
  4. Simulation Debugging: Perform performance analysis through simulation environments in the sims/ directory
  5. Documentation Updates: Automatically generate or manually update technical documentation in docs/

Build System

The project supports multiple build methods:

  • Make: Traditional Makefile builds
  • SBT: Scala project build tool
  • CMake: Test framework build system
  • Conda: Python environment and dependency management

Version Management Notes

  • Submodules: Modules under thirdparty/ need independent updates
  • Main Repository: Core code and configuration update synchronously with the main branch
  • Documentation: Supports automatic generation, keeping in sync with code changes

Tutorial for buckyball

by Bohan Wang

This document will be gradually updated as the author continues to solve and summarize encountered issues.

This document explains the step-by-step process and problem-solving approaches for a complete buckyball development workflow. As an example, we build a ball operator module that executes the relu() function:

First, we write the hardware for this module, i.e., implement it in Chisel (Scala) and generate the corresponding Verilog code.

Second, we write test software for relu(): a reference function that runs in software on the CPU, and an experimental function that runs on the dedicated hardware built in step one. If the two results match, the test passes; otherwise, proceed to step three to debug.

Third, we simulate at the hardware level and inspect waveforms for debugging. There are also other details, such as compiler documentation changes and instruction set updates, which are explained below.

If you encounter issues during development, you can consult DangoSys/buckyball | DeepWiki or Project Overview - Buckyball Technical Documentation.

Chisel learning resources: binder

Before starting officially, let's initialize the environment:

cd /path/to/buckyball
source env.sh
# source ./env.sh if this gives an error
# All paths in this document are relative paths starting from ./buckyball

I. Writing Chisel Hardware Module

Create a Chisel implementation of the ReLU accelerator in the arch/src/main/scala/prototype/ directory. Following the structure of the existing accelerators, it is recommended to create a new subdirectory under prototype/, for example prototype/relu/Relu.scala, and write the hardware code there (a rough sketch follows).
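
As a rough orientation only, a minimal combinational sketch of the ReLU datapath might look like the following. The module and port names here are illustrative assumptions, not framework code; the real accelerator must also implement the Ball command and SRAM interfaces described in the next sections.

import chisel3._

// Hypothetical sketch: element-wise ReLU over a vector of signed values.
class ReluCore(val lanes: Int = 16, val width: Int = 8) extends Module {
  val io = IO(new Bundle {
    val in  = Input(Vec(lanes, SInt(width.W)))
    val out = Output(Vec(lanes, SInt(width.W)))
  })

  // ReLU: keep positive values, clamp negatives to zero
  for (i <- 0 until lanes) {
    io.out(i) := Mux(io.in(i) > 0.S, io.in(i), 0.S)
  }
}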

II. Hardware Instruction Decoding

Next, add hardware instruction decoding. Support for the ReLU instruction needs to be added on the hardware side so that the hardware decoder recognizes it, and the instruction must be registered for this ball.

This work is mainly divided into the following five aspects:

  • Instruction enumeration (DISA) defines func7 → instruction name (RELU)
  • Decoder (DomainDecoder) defines func7 → decoding rules (read/write/address/iter) → BID (e.g., 4)
  • Bus registration (busRegister) defines BID → actual Ball instance (ReluBall indexed at 4)
  • Reservation station registration (rsRegister) is used for RS/issue descriptions, aligned with the BID, facilitating system issue/completion management and debugging
  • Create a new Ball execution unit class ReluUnit to handle ReLU operations

If any link is missing or inconsistent, the ReLU instruction cannot be correctly recognized, routed, or executed on the actual hardware.

1. Define RELU_BITPAT in DISA.scala

arch/src/main/scala/examples/toy/balldomain/DISA.scala defines the funct7 encoding (BitPat) for Ball instructions, such as TRANSPOSE, IM2COL, etc. It can be viewed as an "instruction set enumeration table" for decoder matching.

Add the bit pattern definition for the ReLU instruction in this file:

val RELU_BITPAT = BitPat("b0100110") // func7 = 38 = 0x26

2. Add ReLU instruction to Ball domain decoder

arch/src/main/scala/examples/toy/balldomain/DomainDecoder.scala is the Ball domain decoder. Its functions are as follows:

  • Input: PostGDCmd from global decoding (already determined to be a Ball category command).
  • Output: Structured BallDecodeCmd, including:
    • Whether to use op1/op2, whether to write back to scratchpad, whether operands come from scratchpad
    • Operand/writeback bank and address
    • Iteration count iter
    • Target Ball ID (BID)
    • Other dedicated fields special, etc.
  • Internally maps different funct7 instructions to a set of boolean switches and field extraction rules through ListLookup(func7, ...).

Add the decoding entry for the ReLU instruction to the decoding list in this file. Referring to the implementation of other instructions (e.g., TRANSPOSE), add:

// Add to BallDecodeFields ListLookup
// Fill in the decoding fields according to the specific ReLU instruction requirements;
// the number of list entries must stay consistent, and the other instructions can be used as a reference.
RELU                 -> List(Y,N,Y,Y,N, rs1(spAddrLen-1,0), 0.U(spAddrLen.W), rs2(spAddrLen-1,0), rs2(spAddrLen + 9,spAddrLen), 7.U, rs2(63,spAddrLen + 10), Y)

3. Add the ReluBall generator and register it

a. arch/src/main/scala/examples/toy/balldomain/bbus/busRegister.scala is the Ball bus registration table, using a Seq(() => new SomeBall(...)) to register the actual Ball modules to be instantiated in the system.

Add ReluBall with its new Ball ID in this file.

class BBusModule(implicit b: CustomBuckyballConfig, p: Parameters)
    extends BBus(
      // Define Ball device generator to register
      Seq(
        () => new examples.toy.balldomain.vecball.VecBall(0),
        () => new examples.toy.balldomain.matrixball.MatrixBall(1),
        () => new examples.toy.balldomain.im2colball.Im2colBall(2),
        () => new examples.toy.balldomain.transposeball.TransposeBall(3),
        ...
        () => new examples.toy.balldomain.reluball.ReluBall(7) // Ball ID 7 - newly added
      )
    ) {
  override lazy val desiredName = "BBusModule"
}

b. arch/src/main/scala/examples/toy/balldomain/rs/rsRegister.scala is the "Ball reservation station" registration table, using a list to register which Balls exist in the system (specifying ID and name by ballId). The reservation station (RS) is responsible for managing Ball issue, occupancy, completion and other metadata, usually also used for visualization/statistics, naming and logging.

Register ReluBall in this file:

class BallRSModule(implicit b: CustomBuckyballConfig, p: Parameters)
    extends BallReservationStation(
      // Define Ball device information to register
      Seq(
        BallRsRegist(ballId = 0, ballName = "VecBall"),
        BallRsRegist(ballId = 1, ballName = "MatrixBall"),
        BallRsRegist(ballId = 2, ballName = "Im2colBall"),
        BallRsRegist(ballId = 3, ballName = "TransposeBall"),
        ...
        BallRsRegist(ballId = 7, ballName = "ReluBall") // Ball ID 7 - newly added
      )
    ) {
  override lazy val desiredName = "BallRSModule"
}

4. Write ReluBall interface file

Create a reluball folder in the arch/src/main/scala/examples/toy/balldomain directory, enter it, and create ReluBall.scala containing the interface code; a rough skeleton is sketched below.
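
For orientation, a skeleton that mirrors the generic Ball template shown later in this document might look like the following. Treat it as an assumption-laden sketch: the exact bundle and trait names (BlinkIO, BallRegist, cmdReq/cmdResp) should be copied from an existing ball such as TransposeBall.

// Hypothetical skeleton only; copy the real interface from an existing *Ball wrapper.
class ReluBall(id: Int)(implicit b: CustomBuckyballConfig, p: Parameters)
  extends Module with BallRegist {
  val io = IO(new BlinkIO)

  def ballId = id.U
  def Blink = io

  // Instantiate the prototype ReLU unit here and connect it to the
  // cmdReq/cmdResp and SRAM read/write interfaces of the Blink protocol.
}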

III. Writing Test Software and Compilation Settings

1. Create test file

Create relu_test.c under bb-tests/workloads/src/CTest/toy/ and write the test code. The core of the test calls void bb_relu(uint32_t op1_addr, uint32_t wr_addr, uint32_t iter); the declaration and definition of this function are covered below.

2. Modify CMakeLists.txt

Add test target in bb-tests/workloads/src/CTest/toy/CMakeLists.txt: CMakeLists.txt:120-127

add_cross_platform_test_target(ctest_relu_test relu_test.c)

And add to the main build target: CMakeLists.txt:137-162

add_custom_target(buckyball-CTest-build ALL DEPENDS
  # ... other tests ...
  ctest_relu_test
  COMMENT "Building all workloads for Buckyball"
  VERBATIM)

3. Need to add ReLU instruction API

a. isa.h

  • Add the declaration for the ReLU instruction in bb-tests/workloads/lib/bbhw/isa/isa.h (isa.h:33-43).

  • Add to the InstructionType enum:

RELU_FUNC7 = 38,  // 0x26 - ReLU function code (or another value you choose)

  • Add to the function declaration section (isa.h:72-73):

void bb_relu(uint32_t op1_addr, uint32_t wr_addr, uint32_t iter);

b. isa.c

  • Add 38_relu.c in bb-tests/workloads/lib/bbhw/isa and implement void bb_relu(uint32_t op1_addr, uint32_t wr_addr, uint32_t iter) in it.

  • Add the dispatch case in bb-tests/workloads/lib/bbhw/isa/isa.c (isa.c:53-76):

case RELU_FUNC7:
	return &relu_config;

  • Add the declaration in isa.c:37-47:

extern const InstructionConfig relu_config;

4. Update CMakeLists.txt

Add compilation and linking of 38_relu.c in all three compilation commands in bb-tests/workloads/lib/bbhw/isa/CMakeLists.txt:

  1. Linux version: Add in COMMAND of add_custom_command:

    && riscv64-unknown-linux-gnu-gcc -c ${CMAKE_CURRENT_SOURCE_DIR}/38_relu.c -march=rv64gc -I${CMAKE_CURRENT_SOURCE_DIR} -I${CMAKE_CURRENT_SOURCE_DIR}/.. -o linux-38_relu.o
    

    And add linux-38_relu.o to the ar rcs command

  2. Baremetal version: Add in COMMAND of add_custom_command:

    && riscv64-unknown-elf-gcc -c ${CMAKE_CURRENT_SOURCE_DIR}/38_relu.c -g -fno-common -O2 -static -march=rv64gc -mcmodel=medany -fno-builtin-printf -D__BAREMETAL__ -I${CMAKE_CURRENT_SOURCE_DIR} -I${CMAKE_CURRENT_SOURCE_DIR}/.. -o baremetal-38_relu.o
    

    And add baremetal-38_relu.o to the ar rcs command

  3. x86 version: Add in COMMAND of add_custom_command:

    && gcc -c ${CMAKE_CURRENT_SOURCE_DIR}/38_relu.c -fPIC -D__x86_64__ -I${CMAKE_CURRENT_SOURCE_DIR} -I${CMAKE_CURRENT_SOURCE_DIR}/.. -o x86-38_relu.o
    

    And add x86-38_relu.o to the ar rcs command

  4. The ISA submodule library defined at the beginning of the file also needs the corresponding 38_relu.c file added.

IV. Test Operation Steps

Step 1: Compile test program

cd bb-tests/build
rm -rf *
cmake -G Ninja ../

Warning: Before executing rm -rf *, make sure you are in the bb-tests/build directory, otherwise forcing deletion in the wrong folder will be catastrophic!

If a disaster does occur, you can pull the original files from GitHub again, but files modified on the server that have not been pushed cannot be recovered.

ninja ctest_relu_test  # software compilation

If ninja ctest_relu_test reports an error, software compilation has failed; please check "III. Writing Test Software and Compilation Settings" and the related files.

bbdev workload --build

This compiles/packages the selected workload sources or configuration into artifacts (such as executables, images, runtime scripts, and input data packages) that the simulation or runtime environment can use, for subsequent runs on the Verilator/simulation platform or on the host side.

Step 2: Generate Verilog

cd buckyball
bbdev verilator --verilog '--config sims.verilator.BuckyballToyVerilatorConfig'

If bbdev verilator --verilog reports an error, hardware compilation has failed; please check the files related to "I. Writing Chisel Hardware Module" and "II. Hardware Instruction Decoding".

Step 3: Run simulation

bbdev verilator --run '--jobs 16 --binary ctest_relu_test_singlecore-baremetal --batch'

If bbdev verilator --run reports an error, the hardware has issues such as timeouts or deadlocks; please check the files related to "I. Writing Chisel Hardware Module".

Step 4: View simulation files

In arch/waveform/<SimulationFileName> (e.g. 2025-10-08-00-03-ctest_vecunit_matmul_random1_singlecore-baremetal), download the waveform.fst file to your local machine using software such as FileZilla, and view it with a local waveform viewer (e.g. GTKWave).

Note that the simulation file folder should only contain the waveform.fst file. If a waveform.fst.hier file exists, it means the simulation failed.

If the waveform does not match theoretical expectations while the software test code is correct, check the files related to "I. Writing Chisel Hardware Module".

To check whether the software code has problems, you can compare against its execution results on the CPU: temporarily remove the hardware accelerator calls from relu_test.c and test only the CPU version.

V. Simulation Waveform

After opening waveform.fst locally in GTKWave, locate the following node in the design hierarchy: TOP.TestHarness.chiptop0.system.tile_prci_domain.element_reset_domain_tile.buckyball.ballDomain.bbus.balls_4.reluUnit. The signals under this node correspond to the hardware in Relu.scala; double-click them to view their waveforms.

The naming in different routines may not be exactly the same, but it is broadly similar.

VI. Performance Testing

Query the number of clock cycles used (a speed performance metric):

cat /home/MikeNotFound/code/buckyball/arch/log/2025-10-24-16-59-ctest_relu_test_singlecore-baremetal/disasm.log | grep "PMC"

  • Preparation

  1. In the /home/<server_name>/bash.sh file, add the required environment variables at the end:

    export SNPSLMD_LICENSE_FILE=27000@amax
    export PATH="$PATH:/opt/riscv/bin"
    export VCS_HOME="/data0/tools/Synopsys/vcs/vcs/W-2024.09-SP1"
    export PATH="$PATH:$VCS_HOME/bin"
    export VERDI_HOME="/data0/tools/Synopsys/verdi/verdi/W-2024.09-SP1"
    export PATH="$PATH:$VERDI_HOME/bin"
    export SCL_HOME="/data0/tools/Synopsys/scl/scl/2024.06"
    export PATH="$PATH:$SCL_HOME/linux64/bin"
    export DC_HOME="/data0/tools/Synopsys/dc/syn/W-2024.09-SP1"
    export PATH="$PATH:$DC_HOME/bin"
    export PT_HOME="/data0/tools/Synopsys/ptpx/prime/W-2024.09-SP1/"
    export PATH="$PATH:$PT_HOME/bin"
    
    export LM_LICENSE_FILE=/data0/tools/Synopsys/lic/Synopsys.dat
    
    alias vcs="vcs -full64"
    alias lmli="lmgrd -c /data0/tools/Synopsys/lic/Synopsys.dat"
    
  2. In the /home/<server_name>/code/buckyball/evals/run-dc.sh file, remove the -retime option around line 126.


  • Formal Test
  1. Go back to the buckyball directory and run the command

    bbdev verilator --verilog "--balltype ReluBall --output_dir ReluBall_1"
    

    This will generate a Verilog folder for the specified ball under the arch directory.

  2. Grant execution permission to the script:

    chmod 777 evals/run-dc.sh
    
  3. Run the DC command:

    ./evals/run-dc.sh --srcdir arch/ReluBall_1 --top ReluBall
    

    This means performing the DC test on the top-level file ReluBall.sv located in the arch/ReluBall_1 folder.

  4. You can find the test results in

    /home/<server_name>/buckyball/bb-tests/output/dc/reports
    

Buckyball Architecture Design Overview

The Buckyball architecture module contains complete hardware design implementations, based on the RISC-V instruction set architecture, developed using the Scala/Chisel hardware description language. The architecture design follows modular and extensible principles, supporting various configurations and custom extensions.

Architecture Hierarchy

System-Level Architecture

Buckyball adopts a layered design consisting of, from top to bottom:

  • SoC Subsystem: Integrates multi-core processors, cache hierarchy, interconnect networks
  • Processor Core: Custom implementation based on Rocket core
  • Coprocessor: Dedicated accelerators supporting RoCC interface
  • Memory Subsystem: High-performance memory controllers and DMA engines

Core Features

  • Configurability: Supports parameter configuration for core count, cache size, bus width, etc.
  • Extensibility: Provides standardized coprocessor interfaces and extension mechanisms
  • Compatibility: Maintains compatibility with the standard RISC-V ecosystem
  • Performance Optimization: Performance-optimized design for specific application scenarios

Directory Structure

arch/
├── src/main/scala/
│   └── framework/          - Buckyball framework core
│       ├── rocket/         - Rocket core custom implementation
│       └── builtin/        - Built-in component library
│           └── memdomain/  - Memory domain implementation
│               ├── mem/    - Memory components
│               └── dma/    - DMA engine
└── thirdparty/            - Third-party dependencies
    └── chipyard/          - Chipyard framework

Design Principles

Modular Design

Each functional module has clear interface definitions and independent implementations, facilitating testing, verification, and reuse. Modules communicate through standardized interfaces, reducing coupling.

Parameterized Configuration

All hardware modules support parameterized configuration, achieving flexible hardware generation through Scala's type system and configuration framework. Configuration parameters include:

  • Data path width
  • Cache size and organization
  • Parallelism and pipeline depth
  • Coprocessor types and quantities
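
As a loose illustration of this idea (the module and parameter names below are assumptions, not framework code), a module can derive its port widths and loop bounds directly from a configuration value, so changing the configuration regenerates different hardware without touching the RTL:

import chisel3._

// Hypothetical example: the lane count is a configuration parameter.
class ConfiguredAdder(veclane: Int = 16) extends Module {
  val io = IO(new Bundle {
    val a   = Input(Vec(veclane, UInt(8.W)))
    val b   = Input(Vec(veclane, UInt(8.W)))
    val sum = Output(Vec(veclane, UInt(8.W)))
  })

  // One adder per configured lane
  for (i <- 0 until veclane) {
    io.sum(i) := io.a(i) + io.b(i)
  }
}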

Performance Optimization

Specialized performance optimizations for target application scenarios:

  • Memory access pattern optimization
  • Data pipeline design
  • Parallel computing support
  • Low-latency communication mechanisms

Development Workflow

  1. Requirement Analysis: Determine performance and functional requirements for target applications
  2. Architecture Design: Select appropriate configuration parameters and extension modules
  3. RTL Implementation: Use Chisel for hardware description and implementation
  4. Functional Verification: Verify functional correctness through unit tests and integration tests
  5. Performance Evaluation: Use simulators and FPGA for performance analysis and optimization

Toolchain Support

  • Chisel/FIRRTL: Hardware description and synthesis toolchain
  • Verilator: Fast simulation and verification
  • VCS: Commercial-grade simulation tools
  • FireSim: FPGA accelerated simulation platform
  • Chipyard: Integrated development environment and toolchain

Buckyball Scala Source Code

This directory contains all Scala/Chisel hardware description language source code for the Buckyball project, implementing hardware architecture design and simulation environments.

Overview

Buckyball uses Scala/Chisel as the hardware description language, built on Berkeley's Rocket-chip and Chipyard frameworks. This directory contains implementations from low-level hardware components to system-level integration.

Main functional modules include:

  • framework: Core framework implementation, including processor core, memory subsystem, bus interconnect, etc.
  • prototype: Prototype implementation of dedicated accelerators
  • examples: Example configurations and reference designs
  • sims: Simulation environment configurations and interfaces
  • Util: General utility classes and helper functions

Code Structure

scala/
├── framework/          - Buckyball core framework
│   ├── blink/          - Blink communication components
│   ├── builtin/        - Built-in hardware components
│   │   ├── frontend/   - Frontend processing components
│   │   ├── memdomain/  - Memory domain implementation
│   │   └── util/       - Framework utility classes
│   └── rocket/         - Rocket core extensions
├── prototype/          - Dedicated accelerator prototypes
│   ├── format/         - Data format processing
│   ├── im2col/         - Image processing acceleration
│   ├── matrix/         - Matrix computation engine
│   ├── transpose/      - Matrix transpose acceleration
│   └── vector/         - Vector processing unit
├── examples/           - Examples and configurations
│   └── toy/            - Toy example system
├── sims/               - Simulation configurations
│   ├── firesim/        - FireSim FPGA simulation
│   └── verilator/      - Verilator simulation
└── Util/               - General utility classes

Module Description

framework/ - Core Framework

Implements Buckyball's core architecture components, including:

  • Processor core and extensions
  • Memory subsystem and cache hierarchy
  • Bus interconnect and communication protocols
  • System configuration and parameterization mechanisms

prototype/ - Accelerator Prototypes

Contains hardware implementations of dedicated computation accelerators:

  • Machine learning accelerators (matrix operations, convolution, etc.)
  • Data processing accelerators (format conversion, transpose, etc.)
  • Vector processing units (SIMD, multi-threading, etc.)

examples/ - Example Configurations

Provides system configuration examples and reference designs:

  • Basic configuration templates
  • Custom extension examples
  • Integration test cases

sims/ - Simulation Environment

Supports multiple simulators and verification environments:

  • Verilator simulation
  • FireSim FPGA simulation
  • Performance analysis and debugging tools

Development Guide

Build System

Buckyball uses Mill as the build tool:

# Compile all modules
mill arch.compile

# Generate Verilog
mill arch.runMain examples.toy.ToyBuckyball

# Run tests
mill arch.test

Code Standards

  • Follow Scala and Chisel coding conventions
  • Use ScalaFmt for code formatting
  • Each module includes documentation and tests
  • Configuration parameterization uses Chipyard Config system

Extension Development

  1. Add new accelerator: Create new module in prototype/ directory
  2. Modify framework: Extend existing components in framework/ directory
  3. Add configuration: Create new configuration files in examples/ directory
  4. Integration testing: Use simulation environments in sims/ directory for verification

Buckyball Utility Library

Overview

This directory contains general utility functions and helper modules in the Buckyball framework, primarily providing reusable hardware design components. Located at arch/src/main/scala/Util, it serves as the base utility layer throughout the architecture, providing common hardware building blocks for other modules.

Main functionality includes:

  • Pipeline: Pipeline control and management tools
  • Common hardware design pattern implementations

Code Structure

Util/
└── Pipeline.scala    - Pipeline control implementation

File Dependencies

Pipeline.scala (Base utility layer)

  • Provides general pipeline control logic
  • Referenced by other modules requiring pipeline functionality
  • Implements standard pipeline interfaces and control signals

Module Description

Pipeline.scala

Main functionality: Provides general pipeline control and management functionality

Key components:

class Pipeline extends Module {
  val io = IO(new Bundle {
    val flush = Input(Bool())
    val stall = Input(Bool())
    val valid_in = Input(Bool())
    val ready_out = Output(Bool())
    val valid_out = Output(Bool())
  })

  // Pipeline control logic
  val pipeline_valid = RegInit(false.B)

  when(io.flush) {
    pipeline_valid := false.B
  }.elsewhen(!io.stall) {
    pipeline_valid := io.valid_in
  }

  io.ready_out := !io.stall
  io.valid_out := pipeline_valid && !io.flush
}

Pipeline control signals:

  • flush: Pipeline flush signal, clears all pipeline stages
  • stall: Pipeline stall signal, maintains current state
  • valid_in: Input data valid signal
  • ready_out: Ready to receive new data signal
  • valid_out: Output data valid signal

Inputs/Outputs:

  • Input: Control signals (flush, stall) and data valid signal
  • Output: Pipeline state and data valid indication
  • Edge cases: flush has higher priority than stall, ensuring correct pipeline behavior

Dependencies: Chisel3 base library, standard Module and Bundle interfaces

Usage

Integrating pipeline control:

class MyModule extends Module {
  val pipeline = Module(new Pipeline)

  // Connect control signals
  pipeline.io.flush := flush_condition
  pipeline.io.stall := stall_condition
  pipeline.io.valid_in := input_valid

  // Use pipeline output
  val output_enable = pipeline.io.valid_out
}

Design Patterns

Pipeline cascading:

  • Supports cascaded connection of multi-stage pipelines
  • Provides standard ready/valid handshake protocol
  • Ensures correctness and timing of data flow

Backpressure handling (see the sketch after this list):

  • Implements standard backpressure propagation mechanism
  • Supports pause and resume of upstream modules
  • Guarantees no data loss or duplication
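
A minimal sketch of these two patterns, assuming the Pipeline module defined above and an externally supplied consumer_ready signal (an assumption for illustration), is shown below: backpressure propagates upstream by driving each stage's stall from the readiness of the stage behind it.

import chisel3._

class TwoStagePipeline extends Module {
  val io = IO(new Bundle {
    val valid_in       = Input(Bool())
    val flush          = Input(Bool())
    val consumer_ready = Input(Bool())   // readiness of the downstream consumer
    val ready_out      = Output(Bool())
    val valid_out      = Output(Bool())
  })

  val stage0 = Module(new Pipeline)
  val stage1 = Module(new Pipeline)

  // Backpressure: stage 1 stalls when the consumer is not ready,
  // and stage 0 stalls when stage 1 cannot accept new data.
  stage1.io.stall := !io.consumer_ready
  stage0.io.stall := !stage1.io.ready_out

  stage0.io.flush    := io.flush
  stage1.io.flush    := io.flush
  stage0.io.valid_in := io.valid_in
  stage1.io.valid_in := stage0.io.valid_out

  io.ready_out := stage0.io.ready_out
  io.valid_out := stage1.io.valid_out
}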

Notes

  1. Timing constraints: flush signal should be asserted synchronously at clock rising edge
  2. Reset behavior: Pipeline should clear all valid bits on reset
  3. Combinational logic: ready signal is combinational logic, avoid timing path issues
  4. Extensibility: Design supports parameterized pipeline depth and data width

Buckyball Framework Core

Overview

This directory contains the core implementation of the Buckyball framework, serving as the foundation layer for the entire hardware architecture. Located at arch/src/main/scala/framework, it provides a complete implementation of processor cores, built-in components, and system interconnects.

Main functional modules include:

  • builtin: Built-in hardware component library, including memory domain and frontend modules
  • blink: System interconnect and communication framework

Code Structure

framework/
├── builtin/          - Built-in component library
│   ├── memdomain/    - Memory domain implementation
│   │   ├── dma/      - DMA engines (BBStreamReader/Writer)
│   │   ├── mem/      - Memory components (Scratchpad, Accumulator, SRAM banks)
│   │   ├── rs/       - Memory domain reservation station
│   │   ├── tlb/      - TLB implementation
│   │   ├── MemController.scala  - Memory controller
│   │   ├── MemDomain.scala      - Memory domain top-level
│   │   ├── MemLoader.scala      - Load instruction handler
│   │   └── MemStorer.scala      - Store instruction handler
│   ├── frontend/     - Frontend components
│   │   ├── GobalDecoder.scala   - Global instruction decoder
│   │   ├── globalrs/            - Global reservation station
│   │   │   ├── GlobalReservationStation.scala
│   │   │   └── GlobalROB.scala  - Global reorder buffer
│   │   └── rs/                  - Ball domain reservation station
│   ├── util/         - Framework utility functions
│   └── BaseConfigs.scala - Base configuration parameters
└── blink/            - System interconnect framework
    ├── baseball.scala    - Ball device base trait
    ├── blink.scala       - Blink protocol definitions
    └── bbus.scala        - Ball bus implementation

Module Dependencies

Application Layer → builtin components → blink interconnect → Physical interface
                        ↓                    ↓
                   Memory domain        Ball protocol
                   Frontend             System bus

Module Details

builtin/ - Built-in Component Library

Main Function: Provides standardized hardware component implementations

Component Categories:

memdomain/ - Memory Domain

The memory domain encapsulates all memory-related functionality:

Key Components:

  • MemDomain.scala: Top-level memory domain module

    • Integrates MemController, MemLoader, MemStorer, and TLB
    • Provides unified interface to Global RS
    • Handles both load and store operations
  • MemController.scala: Memory controller

    • Encapsulates Scratchpad and Accumulator
    • Provides DMA and Ball Domain interfaces
    • Handles bank arbitration and routing
  • MemLoader.scala: Load instruction handler

    • Receives load instructions from reservation station
    • Issues DMA read requests
    • Writes data to Scratchpad/Accumulator
  • MemStorer.scala: Store instruction handler

    • Receives store instructions from reservation station
    • Reads data from Scratchpad/Accumulator
    • Issues DMA write requests with data alignment and masking
  • dma/: DMA engines

    • BBStreamReader: Streaming DMA read with TLB support
    • BBStreamWriter: Streaming DMA write with alignment handling
    • Transaction ID management for multiple outstanding requests
  • mem/: Memory components

    • Scratchpad.scala: 4-bank scratchpad memory (256KB total)
    • AccBank.scala: Accumulator bank with accumulation pipeline
    • SramBank.scala: Generic single-port SRAM bank implementation
  • rs/: Memory domain reservation station

    • reservationStation.scala: Local FIFO-based scheduler
    • rob.scala: Local reorder buffer for memory instructions
    • ringFifo.scala: Circular FIFO implementation
  • tlb/: Translation Lookaside Buffer

    • Virtual to physical address translation
    • Integrated with DMA engines

frontend/ - Frontend Components

The frontend handles global instruction management:

Key Components:

  • GobalDecoder.scala: Global instruction decoder

    • Classifies instructions into Ball/Memory/Fence types
    • Constructs PostGDCmd for domain-specific decoders
    • Interfaces with Global RS
  • globalrs/: Global reservation station

    • GlobalReservationStation.scala: Central instruction manager
      • Allocates ROB entries
      • Issues instructions to Ball and Memory domains
      • Handles instruction completion from both domains
      • Manages Fence instruction synchronization
    • GlobalROB.scala: Global reorder buffer
      • Tracks instruction state across domains
      • Supports out-of-order completion
      • Sequential commit of completed instructions
  • rs/: Ball domain reservation station

    • reservationStation.scala: Ball-specific scheduler
    • rob.scala: Local ROB for Ball instructions

util/ - Framework Utilities

Common utility functions and helper modules

BaseConfigs.scala

Configuration Parameters:

case class BaseConfig(
  veclane: Int = 16,              // Vector lane width
  accveclane: Int = 4,            // Accumulator vector lane width
  rob_entries: Int = 16,          // Number of ROB entries
  rs_out_of_order_response: Boolean = true,  // Out-of-order response support
  sp_banks: Int = 4,              // Scratchpad bank count
  acc_banks: Int = 8,             // Accumulator bank count
  sp_capacity: BuckyballMemCapacity = CapacityInKilobytes(256),
  acc_capacity: BuckyballMemCapacity = CapacityInKilobytes(64),
  spAddrLen: Int = 15,            // SPAD address length
  memAddrLen: Int = 32,           // Memory address length
  numVecPE: Int = 16,             // Vector PEs per thread
  numVecThread: Int = 16,         // Vector threads
  emptyBallid: Int = 5            // Empty ball ID
)
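
For example (illustrative only), a smaller design point could be obtained by overriding a few fields of this case class:

// Hypothetical configuration override using the defaults shown above
val smallConfig = BaseConfig(
  veclane     = 8,
  sp_banks    = 2,
  sp_capacity = CapacityInKilobytes(128)
)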

blink/ - System Interconnect Framework

Main Function: Implements system-level interconnect and Ball protocol

Key Components:

  • baseball.scala: Ball device base trait

    • Defines BallRegist trait for Ball device registration
    • Provides common interface for all Ball devices
  • blink.scala: Blink protocol definitions

    • Command/response interfaces
    • Status and control signals
    • SRAM read/write interfaces
  • bbus.scala: Ball bus implementation (BBus)

    • Manages multiple Ball device connections
    • Command router: Routes commands to appropriate Ball devices
    • Bus router: Arbitrates Ball device responses
    • Memory router: Handles memory access arbitration
    • Performance monitoring counters

Interconnect Features:

  • Support for multiple bus protocols
  • Arbitration and routing functionality
  • Latency and bandwidth management
  • Dynamic Ball device registration

Usage Guide

Framework Integration

Configuration System:

class BuckyballConfig extends Config(
  new WithBuiltinComponents ++
  new WithBlinkInterconnect ++
  new BaseConfig
)

Module Instantiation:

class BuckyballSystem(implicit p: Parameters) extends LazyModule {
  // Memory domain
  val memdomain = Module(new MemDomain)

  // Ball domain
  val balldomain = Module(new BallDomain)

  // Global RS
  val globalRS = Module(new GlobalReservationStation)

  // Connect modules
  balldomain.io.issue <> globalRS.io.ballIssue
  memdomain.io.issue <> globalRS.io.memIssue
  globalRS.io.ballComplete <> balldomain.io.complete
  globalRS.io.memComplete <> memdomain.io.complete
}

Extension Development

Adding New Components:

  1. Create new component module in builtin directory
  2. Implement standard Module interface
  3. Register in configuration system
  4. Update interconnect and routing logic

Custom Ball Device:

  1. Extend BallRegist trait
  2. Implement Blink protocol interfaces
  3. Register in BBus
  4. Add to Ball RS device list

Design Principles

  1. Parameter Passing: Use Chipyard's Parameters system for configuration
  2. Clock Domains: Pay attention to clock domain crossing between modules
  3. Reset Strategy: Ensure proper reset sequencing and dependencies
  4. Performance Optimization: Focus on critical paths and timing constraints
  5. Debug Support: Integrate necessary debug and monitoring interfaces
  6. Memory Access: Respect bank access constraints (op1 and op2 cannot access same bank)
  7. Handshake Protocols: Use ready/valid handshake for all data transfers

Architecture Highlights

Instruction Flow

RoCC → Global Decoder → Global RS → Ball Domain / Mem Domain
                          ↓                ↓            ↓
                      Global ROB    Ball Decoder  Mem Decoder
                   (tracks state)       ↓            ↓
                                   Ball Devices  Loader/Storer
                                        ↓            ↓
                                   MemController ← → MemController

Memory Access Flow

Ball Devices ──→ MemController ──→ Scratchpad (4 banks)
                      │           └→ Accumulator (8 banks)
                      │
Mem Domain    ──→ MemController
  (Loader/Storer)     │
                      ↓
                  DMA + TLB
                      ↓
                 Main Memory

Performance Considerations

  1. ROB Size: 16 entries support up to 16 in-flight instructions
  2. Bank Parallelism: 4 scratchpad + 8 accumulator banks enable parallel access
  3. Out-of-Order Execution: Global RS supports out-of-order completion when enabled
  4. DMA Bandwidth: 128-bit bus width provides high memory bandwidth
  5. Pipeline Depth: Multi-stage pipeline allows high clock frequency

Common Issues and Solutions

Issue: Instructions stall in Global RS

  • Solution: Check ROB capacity and completion signals from domains

Issue: Memory access conflicts

  • Solution: Ensure op1 and op2 don't access same bank, respect bank boundaries

Issue: DMA timeout

  • Solution: Verify TLB configuration and page table walker connectivity

Issue: Ball device not responding

  • Solution: Check Ball device registration in BBus and RS device list

Buckyball Prototype Accelerators

This directory contains prototype implementations of various domain-specific computation accelerators in the Buckyball framework, covering hardware accelerator designs for machine learning, numerical computation, and data processing domains.

Directory Structure

prototype/
├── format/      - Data format conversion accelerators
├── im2col/      - Image-to-column transformation accelerator
├── matrix/      - Matrix computation accelerators
├── relu/        - ReLU activation accelerator
├── transpose/   - Matrix transpose accelerator
└── vector/      - Vector processing unit

Accelerator Components

format/ - Data Format Processing

Implements hardware acceleration for various data format conversions and arithmetic operations:

  • Arithmetic.scala: Custom arithmetic operation units
  • Dataformat.scala: Data format conversion and encoding

Key Features:

  • Support for multiple data formats (INT8, FP16, FP32, BBFP)
  • Abstract arithmetic interface for extensibility
  • Concrete implementations for different data types

Use Cases:

  • Floating-point format conversion
  • Fixed-point arithmetic optimization
  • Data compression and decompression
  • Mixed-precision computation

im2col/ - Image Processing Acceleration

Specialized accelerator for im2col operations in convolutional neural networks:

  • im2col.scala: Hardware implementation of image-to-column matrix transformation

Key Features:

  • Configurable kernel size and stride
  • Efficient data reorganization for convolution
  • Pipeline-based processing for high throughput
  • Support for different input dimensions

Use Cases:

  • CNN convolution layer acceleration
  • Image preprocessing pipeline
  • Feature extraction optimization
  • Memory-efficient convolution implementation

matrix/ - Matrix Computation Engine

Matrix computation accelerator implementation with multiple modules:

Core Components:

  • bbfpIns_decode.scala: Instruction decoder for matrix operations
  • bbfp_load.scala: Data loading unit for matrix operands
  • bbfp_ex.scala: Execution unit for matrix multiplication
  • bbfp_pe.scala: Processing Element (PE) array implementation
  • bbfp_control.scala: Control logic for matrix operations

PE Array Architecture:

  • BBFP_PE: Individual processing element with weight stationary mode
  • BBFP_PE_Array2x2: 2×2 PE array building block
  • BBFP_PE_Array16x16: 16×16 PE array for high-performance computing
  • Systolic array dataflow for efficient matrix multiplication

Supported Formats:

  • INT8 integer arithmetic
  • FP16 half-precision floating-point
  • FP32 single-precision floating-point
  • BBFP (Brain Floating Point) custom format

Use Cases:

  • Deep learning training and inference
  • Scientific computing acceleration
  • Linear algebra operations
  • High-performance GEMM operations

relu/ - ReLU Activation

Efficient hardware implementation of ReLU (Rectified Linear Unit) activation:

  • Relu.scala: Pipelined ReLU accelerator

Key Features:

  • Element-wise ReLU computation
  • Configurable tile size
  • Pipeline-based processing
  • Integrated with scratchpad memory

Use Cases:

  • Neural network activation layers
  • Non-linear transformation
  • Post-convolution activation

transpose/ - Matrix Transpose

Efficient hardware implementation for matrix transpose operations:

  • Transpose.scala: Matrix transpose accelerator

Key Features:

  • Tile-based transpose for large matrices
  • Optimized memory access patterns
  • Configurable tile size
  • Pipeline-based implementation

Use Cases:

  • Matrix operation preprocessing
  • Data reorganization and transformation
  • Memory access pattern optimization
  • Transpose in GEMM operations

vector/ - Vector Processing Unit

Vector processing architecture supporting SIMD and multi-threading:

Core Components:

  • VecUnit.scala: Vector processor top-level module
  • VecCtrlUnit.scala: Vector control unit for instruction dispatch
  • VecLoadUnit.scala: Vector load unit for data fetching
  • VecEXUnit.scala: Vector execution unit with multiple functional units
  • VecStoreUnit.scala: Vector store unit for result write-back

Submodules:

  • bond/: Binding and synchronization mechanisms

    • Various bond types (VSSBond, VVVBond, VSVBond, VVSBond, VVBond)
    • Operand routing and data distribution
  • op/: Vector operation implementations

    • AddOp, MulOp, CascadeOp, SelectOp, etc.
    • Arithmetic and logical operations
  • thread/: Multi-threading support

    • Thread-level parallelism
    • Warp-based execution model
  • warp/: Thread bundle management (MeshWarp)

    • 16×16 PE mesh for vector operations
    • Parallel execution of vector instructions

Architecture Highlights:

  • Configurable number of PEs and threads
  • Support for various vector operations (add, mul, cascade, select)
  • Flexible data routing through bond mechanisms
  • High parallelism with warp-level execution

Use Cases:

  • Parallel numerical computation
  • Signal processing acceleration
  • High-performance computing applications
  • SIMD-style data processing

Design Features

Modular Design

Each accelerator adopts modular design for:

  • Independent development and testing
  • Flexible composition and configuration
  • Performance tuning and extension
  • Easy integration with Buckyball framework

Pipeline Architecture

Most accelerators use deep pipeline design:

  • Improved throughput and frequency
  • Support for continuous data stream processing
  • Optimized resource utilization
  • Latency hiding through pipelining

Configurable Parameters

Support rich configuration parameters:

  • Data width and precision
  • Parallelism and pipeline depth
  • Cache size and organization
  • Interface protocol and timing

Integration Method

All Ball accelerators implement the Blink protocol interface:

class CustomBall(implicit b: CustomBuckyballConfig, p: Parameters)
  extends Module with BallRegist {
  val io = IO(new BlinkIO)
  def ballId = <unique_id>.U
  def Blink = // Implement Blink protocol
}

Blink Interface Components:

  • cmdReq: Command request interface with rob_id tracking
  • cmdResp: Command response interface for completion signaling
  • status: Status signals (ready, valid, idle, complete)
  • sramRead/Write: SRAM interfaces for scratchpad and accumulator access

Memory Interface

Support multiple memory access patterns:

  • DMA bulk transfer through MemDomain
  • Scratchpad direct access for low-latency operations
  • Accumulator access for result accumulation
  • Bank-aware memory access (op1 and op2 must access different banks)

Configuration Integration

Parameterized through Buckyball configuration system:

case class BaseConfig(
  veclane: Int = 16,        // Vector lane width
  numVecPE: Int = 16,       // Number of vector PEs
  numVecThread: Int = 16,   // Number of vector threads
  // ... more parameters
)

Performance Optimization

Data Locality

  • Optimize data access patterns for spatial and temporal locality
  • Reduce memory bandwidth requirements through data reuse
  • Improve cache hit rate with tile-based processing
  • Scratchpad memory for frequently accessed data

Parallel Processing

  • Multi-level parallelism design
    • Instruction-level parallelism (ILP) through pipelining
    • Data-level parallelism (DLP) through vector operations
    • Thread-level parallelism (TLP) through multiple warps
  • Pipeline parallelism for continuous data flow
  • Data parallelism through PE arrays

Resource Sharing

  • Arithmetic unit reuse across different operations
  • Storage resource sharing between modules
  • Control logic optimization for area efficiency
  • Flexible routing for resource utilization

Verification and Testing

Each accelerator comes with corresponding test cases:

  • Functional correctness verification
  • Performance benchmark testing
  • Boundary condition checking
  • Random test generation
  • Integration testing with complete system

Development Guidelines

Adding New Accelerators

Steps:

  1. Implement Ball device with BallRegist trait
  2. Define Blink protocol interfaces
  3. Implement computation logic
  4. Add SRAM access logic (respect bank constraints)
  5. Register in BBus and Ball RS

Example Template:

class NewBall(implicit b: CustomBuckyballConfig, p: Parameters)
  extends Module with BallRegist {
  val io = IO(new BlinkIO)

  def ballId = <unique_id>.U
  def Blink = io

  // State machine
  val sIdle :: sCompute :: sComplete :: Nil = Enum(3)
  val state = RegInit(sIdle)

  // Computation logic
  switch(state) {
    is(sIdle) {
      when(io.cmdReq.fire) {
        state := sCompute
      }
    }
    is(sCompute) {
      // Perform computation
      when(done) {
        state := sComplete
      }
    }
    is(sComplete) {
      io.cmdResp.valid := true.B
      state := sIdle
    }
  }
}

Performance Optimization Tips

  1. Memory Access:

    • Group memory accesses to same bank
    • Use streaming access patterns
    • Minimize random access
  2. Pipeline Design:

    • Balance pipeline stages
    • Add registers for timing closure
    • Use buffering for throughput
  3. Resource Utilization:

    • Share expensive resources (multipliers, dividers)
    • Use LUTs for simple operations
    • Optimize control logic

Common Pitfalls

  1. Bank Conflict: op1 and op2 accessing same bank - violates design constraint
  2. ROB ID Tracking: Must forward rob_id from request to response
  3. Ready/Valid Protocol: Carefully implement handshake to avoid deadlock
  4. Iteration Count: Properly handle iteration for multi-row operations

Future Enhancements

Potential areas for extension:

  • Support for additional data formats (INT4, BF16)
  • Advanced matrix operations (SVD, QR decomposition)
  • Fused operations (Conv+ReLU, GEMM+BiasAdd)
  • Dynamic reconfiguration for different workloads
  • Power management and clock gating
  • Advanced synchronization mechanisms

Data Format Processing Module

Overview

This directory implements data format definitions and arithmetic operation abstractions in Buckyball, providing a unified data type processing interface. Located at arch/src/main/scala/prototype/format, it serves as the data format layer, providing type-safe data format support for other prototype accelerators.

Core components:

  • Dataformat.scala: Data format definitions and factory classes
  • Arithmetic.scala: Arithmetic operation type class implementations

Code Structure

format/
├── Dataformat.scala  - Data format definitions
└── Arithmetic.scala  - Arithmetic operation abstractions

File Dependencies

Dataformat.scala (Format definition layer)

  • Defines DataFormat abstract class and concrete format implementations
  • Provides DataFormatFactory factory class
  • Implements DataFormatParams parameter class

Arithmetic.scala (Operation abstraction layer)

  • Defines Arithmetic type class interface
  • Implements UIntArithmetic concrete operations
  • Provides ArithmeticFactory factory class

Module Description

Dataformat.scala

Main functionality: Defines supported data format types

Format definition:

abstract class DataFormat {
  def width: Int
  def dataType: Data
  def name: String
}

Supported formats:

class INT8Format extends DataFormat {
  override def width: Int = 8
  override def dataType: Data = UInt(8.W)
  override def name: String = "INT8"
}

class FP16Format extends DataFormat {
  override def width: Int = 16
  override def dataType: Data = UInt(16.W)
  override def name: String = "FP16"
}

class FP32Format extends DataFormat {
  override def width: Int = 32
  override def dataType: Data = UInt(32.W)
  override def name: String = "FP32"
}

Factory class:

object DataFormatFactory {
  def create(formatType: String): DataFormat = formatType.toUpperCase match {
    case "INT8" => new INT8Format
    case "FP16" => new FP16Format
    case "FP32" => new FP32Format
    case _ => throw new IllegalArgumentException(...)
  }
}

Parameter class:

case class DataFormatParams(formatType: String = "INT8") {
  def format: DataFormat = DataFormatFactory.create(formatType)
  def width: Int = format.width
  def dataType: Data = format.dataType
}

Arithmetic.scala

Main functionality: Provides type-safe arithmetic operation abstractions

Type class definition:

abstract class Arithmetic[T <: Data] {
  def add(x: T, y: T): T
  def sub(x: T, y: T): T
  def mul(x: T, y: T): T
  def div(x: T, y: T): T
  def gt(x: T, y: T): Bool
}

UInt implementation:

class UIntArithmetic extends Arithmetic[UInt] {
  override def add(x: UInt, y: UInt): UInt = x + y
  override def sub(x: UInt, y: UInt): UInt = x - y
  override def mul(x: UInt, y: UInt): UInt = x * y
  override def div(x: UInt, y: UInt): UInt = Mux(y =/= 0.U, x / y, 0.U)
  override def gt(x: UInt, y: UInt): Bool = x > y
}

Factory class:

object ArithmeticFactory {
  def createArithmetic[T <: Data](dataType: T): Arithmetic[T] = {
    dataType match {
      case _: UInt => new UIntArithmetic().asInstanceOf[Arithmetic[T]]
      case _ => throw new IllegalArgumentException(...)
    }
  }
}

Usage
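
A minimal sketch of how these pieces might be combined inside a module, assuming the classes above are in scope (the module name AdderLane is illustrative):

import chisel3._

class AdderLane(formatType: String = "INT8") extends Module {
  // Resolve width and Chisel data type from the format parameters
  val params = DataFormatParams(formatType)

  val io = IO(new Bundle {
    val a   = Input(UInt(params.width.W))
    val b   = Input(UInt(params.width.W))
    val sum = Output(UInt(params.width.W))
  })

  // Pick the arithmetic implementation for the underlying data type (UIntArithmetic here)
  val arith = ArithmeticFactory.createArithmetic(UInt(params.width.W))
  io.sum := arith.add(io.a, io.b)
}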

Notes

  1. Floating-point support: FP16 and FP32 currently use UInt representation, can be extended to true floating-point types later
  2. Division by zero protection: UInt division operation includes division-by-zero check, returns 0 as default value
  3. Type safety: Uses Scala type system to ensure operation type safety
  4. Extensibility: Factory pattern supports adding new data formats and arithmetic implementations
  5. Parameterization: DataFormatParams provides convenient parameterized configuration interface

Im2col Image Processing Accelerator

Overview

This directory implements Buckyball's Im2col operation accelerator for image-to-column matrix conversion in convolutional neural networks. Located at arch/src/main/scala/prototype/im2col, it serves as an image processing accelerator that converts convolution operations to matrix multiplication operations to improve computational efficiency.

Core components:

  • im2col.scala: Im2col accelerator main implementation

Code Structure

im2col/
└── im2col.scala  - Im2col accelerator implementation

Module Responsibilities

Im2col.scala (Accelerator implementation layer)

  • Implements image-to-column matrix conversion logic
  • Manages SRAM read/write operations
  • Provides Ball domain command interface

Module Description

im2col.scala

Main functionality: Implements sliding convolution window and data rearrangement

State machine definition:

val idle :: read :: read_and_convert :: complete :: Nil = Enum(4)
val state = RegInit(idle)

Key registers:

val ConvertBuffer = RegInit(VecInit(Seq.fill(4)(VecInit(Seq.fill(b.veclane)(0.U(b.inputType.getWidth.W))))))
val rowptr = RegInit(0.U(10.W))    // Convolution window top-left row pointer
val colptr = RegInit(0.U(5.W))     // Convolution window top-left column pointer
val krow_reg = RegInit(0.U(log2Up(b.veclane).W))  // Convolution kernel row count
val kcol_reg = RegInit(0.U(log2Up(b.veclane).W))  // Convolution kernel column count

Command parsing:

when(io.cmdReq.fire) {
  rowptr := io.cmdReq.bits.cmd.special(37,28)      // Start row
  colptr := io.cmdReq.bits.cmd.special(27,23)      // Start column
  kcol_reg := io.cmdReq.bits.cmd.special(3,0)      // Convolution kernel column count
  krow_reg := io.cmdReq.bits.cmd.special(7,4)      // Convolution kernel row count
  incol_reg := io.cmdReq.bits.cmd.special(12,8)    // Input matrix column count
  inrow_reg := io.cmdReq.bits.cmd.special(22,13)   // Input matrix row count
}

Data conversion logic:

// Fill window data
for (i <- 0 until 4; j <- 0 until 4) {
  when(i.U < krow_reg && j.U < kcol_reg) {
    val bufferRow = (rowcnt + i.U) % krow_reg
    val bufferCol = (colptr + j.U) % incol_reg
    window((i.U * kcol_reg) + j.U) := ConvertBuffer(bufferRow)(bufferCol)
  }.otherwise {
    window((i.U * kcol_reg) + j.U) := 0.U
  }
}

SRAM interface:

val io = IO(new Bundle {
  val cmdReq = Flipped(Decoupled(new BallRsIssue))
  val cmdResp = Decoupled(new BallRsComplete)
  val sramRead = Vec(b.sp_banks, Flipped(new SramReadIO(...)))
  val sramWrite = Vec(b.sp_banks, Flipped(new SramWriteIO(...)))
})

Processing flow:

  1. idle: Wait for command, parse convolution parameters
  2. read: Read initial convolution kernel-sized data into buffer
  3. read_and_convert: Slide window, convert data and write back
  4. complete: Send completion signal

Inputs/Outputs:

  • Input: Ball domain commands containing convolution parameters and address information
  • Output: Converted column matrix data, completion signal
  • Edge cases: Fill zero values when handling boundaries

Usage

Algorithm Principle

Im2col conversion: Convert convolution operation to matrix multiplication

  • Input: H×W image, K×K convolution kernel
  • Output: (H-K+1)×(W-K+1) windows of size K×K, expanded as column vectors
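
For example, a 5×5 input with a 3×3 kernel yields (5−3+1)×(5−3+1) = 9 window positions; each window is flattened into a 9-element column, so the converted output is a 9×9 matrix.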

Sliding window:

  • Slide convolution window in row-major order
  • Each window position generates a column vector
  • Uses circular buffer to optimize memory access

Notes

  1. Buffer management: Uses 4×veclane conversion buffer to store window data
  2. Boundary handling: Fill zero values for positions beyond image boundaries
  3. Address calculation: Supports configurable start address and bank selection
  4. Pipeline optimization: Prefetch next row read requests during conversion
  5. Parameter limitation: Maximum support for 4×4 convolution kernel size

Matrix Computation Accelerator

Overview

This directory implements Buckyball's matrix computation accelerator for matrix multiplication and related operations. Located at arch/src/main/scala/prototype/matrix, it serves as a matrix computation accelerator supporting multiple data formats and operation modes.

Core components:

  • bbfp_control.scala: Matrix computation controller
  • bbfp_pe.scala: Processing Element (PE) and MAC unit
  • bbfp_buffer.scala: Data buffer management
  • bbfp_load.scala: Data load unit
  • bbfp_ex.scala: Execution unit
  • bbfpIns_decode.scala: Instruction decoder

Code Structure

matrix/
├── bbfp_control.scala   - Controller main module
├── bbfp_pe.scala        - Processing element implementation
├── bbfp_buffer.scala    - Buffer management
├── bbfp_load.scala      - Load unit
├── bbfp_ex.scala        - Execution unit
└── bbfpIns_decode.scala - Instruction decode

File Dependencies

bbfp_control.scala (Controller layer)

  • Integrates submodules (ID, LU, EX, etc.)
  • Manages SRAM and Accumulator interfaces
  • Handles Ball domain commands

bbfp_pe.scala (Computation core layer)

  • Implements MacUnit multiply-accumulate unit
  • Defines PEControl control signals
  • Handles signed/unsigned operations

Other modules (Functional support layer)

  • Provides data buffering, loading, execution and other support functions

Module Description

bbfp_control.scala

Main functionality: Top-level control module for matrix computation accelerator

Module integration:

class BBFP_Control extends Module {
  val BBFP_ID = Module(new BBFP_ID)
  val ID_LU = Module(new ID_LU)
  val BBFP_LoadUnit = Module(new BBFP_LoadUnit)
  val LU_EX = Module(new LU_EX)
}

Interface definition:

val io = IO(new Bundle {
  val cmdReq = Flipped(Decoupled(new BallRsIssue))
  val cmdResp = Decoupled(new BallRsComplete)
  val is_matmul_ws = Input(Bool())
  val sramRead = Vec(b.sp_banks, Flipped(new SramReadIO(...)))
  val sramWrite = Vec(b.sp_banks, Flipped(new SramWriteIO(...)))
  val accRead = Vec(b.acc_banks, Flipped(new SramReadIO(...)))
  val accWrite = Vec(b.acc_banks, Flipped(new SramWriteIO(...)))
})

Data flow:

cmdReq → BBFP_ID → ID_LU → BBFP_LoadUnit → LU_EX
                              ↓
                         SRAM/ACC interface

bbfp_pe.scala

Main functionality: Implements basic processing element for matrix computation

MAC unit definition:

class MacUnit extends Module {
  val io = IO(new Bundle {
    val in_a = Input(UInt(7.W))    // [6]=sign, [5]=flag, [4:0]=value
    val in_b = Input(UInt(7.W))    // [6]=sign, [5]=flag, [4:0]=value
    val in_c = Input(UInt(32.W))   // [31]=sign, [30:0]=value
    val out_d = Output(UInt(32.W)) // Output result
  })
}

Data format processing:

// Extract sign bit and value
val sign_a = io.in_a(6)
val sign_b = io.in_b(6)
val flag_a = io.in_a(5)
val flag_b = io.in_b(5)
val value_a = io.in_a(4, 0)
val value_b = io.in_b(4, 0)

// Determine left shift based on flag bit
val shifted_a = Mux(flag_a === 1.U, value_a << 2, value_a)
val shifted_b = Mux(flag_b === 1.U, value_b << 2, value_b)

Signed arithmetic:

val a_signed = Mux(sign_a === 1.U, -(shifted_a.zext), shifted_a.zext).asSInt
val b_signed = Mux(sign_b === 1.U, -(shifted_b.zext), shifted_b.zext).asSInt

Control signals:

class PEControl extends Bundle {
  val propagate = UInt(1.W)   // Propagation control
}

Usage

Data Format

Input format: 7-bit compressed format

  • bit[6]: Sign bit (0=positive, 1=negative)
  • bit[5]: Flag bit (1=left shift by 2)
  • bit[4:0]: 5-bit value

Output format: 32-bit signed number

  • bit[31]: Sign bit
  • bit[30:0]: 31-bit value
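
As a sanity check on this format, a small software decoder (plain Scala, not part of the hardware) behaves as follows:

// Decode one 7-bit compressed operand: bit 6 = sign, bit 5 = flag (shift left
// by 2 when set), bits 4:0 = 5-bit magnitude.
def decode7(x: Int): Int = {
  val sign      = (x >> 6) & 1
  val flag      = (x >> 5) & 1
  val magnitude = if (flag == 1) (x & 0x1f) << 2 else x & 0x1f
  if (sign == 1) -magnitude else magnitude
}

// Example: 0b1100011 (sign=1, flag=1, value=3) decodes to -(3 << 2) = -12
assert(decode7(Integer.parseInt("1100011", 2)) == -12)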

Operation Characteristics

MAC operation: Multiply-Accumulate operation

  • Supports signed and unsigned operations
  • Configurable shift operations
  • 32-bit accumulator output

Pipeline structure:

  • ID: Instruction decode stage
  • LU: Load unit stage
  • EX: Execution unit stage

Notes

  1. Data format: Uses custom 7-bit compressed format to reduce storage overhead
  2. Sign handling: Supports correct signed number operations and sign extension
  3. Shift optimization: Controls data preprocessing shift through flag bit
  4. Interface compatibility: Fully compatible with SRAM and Accumulator interfaces
  5. Pipeline design: Multi-stage pipeline improves throughput

Matrix Transpose Accelerator

Overview

This directory, located at arch/src/main/scala/prototype/transpose, implements Buckyball's matrix transpose accelerator, supporting pipelined transpose operations.

Core components:

  • Transpose.scala: Pipelined transposer implementation

Code Structure

transpose/
└── Transpose.scala  - Pipelined transposer

Module Responsibilities

Transpose.scala (Transpose implementation layer)

  • Implements PipelinedTransposer module
  • Manages matrix data read, transpose, and write-back
  • Provides Ball domain command interface

Module Description

Transpose.scala

Main functionality: Implements pipelined matrix transpose operation

State machine definition:

val idle :: sRead :: sWrite :: complete :: Nil = Enum(4)
val state = RegInit(idle)

Storage structure:

// Matrix storage register (veclane x veclane)
val regArray = Reg(Vec(b.veclane, Vec(b.veclane, UInt(b.inputType.getWidth.W))))

Counter management:

val readCounter = RegInit(0.U(log2Ceil(b.veclane + 1).W))
val respCounter = RegInit(0.U(log2Ceil(b.veclane + 1).W))
val writeCounter = RegInit(0.U(log2Ceil(b.veclane + 1).W))

Instruction registers:

val robid_reg = RegInit(0.U(10.W))    // ROB ID
val waddr_reg = RegInit(0.U(10.W))    // Write address
val wbank_reg = RegInit(0.U(log2Up(b.sp_banks).W))  // Write bank
val raddr_reg = RegInit(0.U(10.W))    // Read address
val rbank_reg = RegInit(0.U(log2Up(b.sp_banks).W))  // Read bank
val iter_reg = RegInit(0.U(10.W))     // Iteration count

Interface definition:

val io = IO(new Bundle {
  val cmdReq = Flipped(Decoupled(new BallRsIssue))
  val cmdResp = Decoupled(new BallRsComplete)
  val sramRead = Vec(b.sp_banks, Flipped(new SramReadIO(...)))
  val sramWrite = Vec(b.sp_banks, Flipped(new SramWriteIO(...)))
})

Processing flow:

  1. idle: Wait for command, parse transpose parameters
  2. sRead: Read matrix data row by row into register array
  3. sWrite: Write transposed data column by column
  4. complete: Send completion signal

Transpose algorithm:

  • Uses veclane×veclane register array to store matrix
  • Reads row-wise, writes column-wise to implement transpose
  • Supports block-wise transpose for matrices of arbitrary size

Usage

Implementation Details

State machine:

val idle :: sRead :: sWrite :: complete :: Nil = Enum(4)

  • idle: Wait for instruction
  • sRead: Read matrix data
  • sWrite: Write transpose result
  • complete: Complete and respond

Register array:

val regArray = Reg(Vec(b.veclane, Vec(b.veclane, UInt(b.inputType.getWidth.W))))

Uses veclane×veclane register array to cache matrix data.

Transpose operation:

  • Read phase: Read data row by row into regArray(row)(col)
  • Write phase: Read regArray(i)(col) column by column to form new rows for writing
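
The indexing behind these two phases can be sketched as follows; field names such as resp.bits.data and req.bits.data are assumptions about the SramReadIO/SramWriteIO bundles rather than verified signal names:

// sRead: each SRAM response fills one row of regArray
when (state === sRead && io.sramRead(rbank_reg).resp.fire) {
  for (col <- 0 until b.veclane) {
    regArray(respCounter)(col) :=
      io.sramRead(rbank_reg).resp.bits.data((col + 1) * b.inputType.getWidth - 1,
                                            col * b.inputType.getWidth)
  }
}

// sWrite: column writeCounter of regArray is gathered and written back as one row
when (state === sWrite) {
  val transposedRow = VecInit((0 until b.veclane).map(i => regArray(i)(writeCounter)))
  io.sramWrite(wbank_reg).req.bits.data := transposedRow.asUInt
}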

Configuration Parameters

  • Matrix size: Determined by the b.veclane parameter
  • Data width: Determined by b.inputType.getWidth
  • Bank configuration: Supports multi-bank SRAM access

Notes

  1. Matrix size limitation: Maximum support for veclane×veclane matrices
  2. Memory bandwidth: Transpose operation has high memory bandwidth requirements
  3. Register overhead: Requires veclane² registers to store matrix
  4. Address calculation: Transposed address calculation needs to be handled correctly
  5. Pipeline control: Read/write counters need to be synchronized correctly

Vector Processing Unit

Overview

The Vector Processing Unit is a specialized computation accelerator in the Buckyball framework, located at prototype/vector. This module implements a complete vector processing pipeline, including control unit, load unit, execution unit, and store unit, supporting parallel processing of vector data.

File Structure

vector/
├── VecUnit.scala         - Vector processing unit top module
├── VecCtrlUnit.scala     - Vector control unit
├── VecLoadUnit.scala     - Vector load unit
├── VecEXUnit.scala       - Vector execution unit
├── VecStoreUnit.scala    - Vector store unit
├── bond/                 - Binding and synchronization mechanisms
├── op/                   - Vector operation implementations
├── thread/               - Thread management
└── warp/                 - Thread warp management

Core Components

VecUnit - Vector Processing Unit Top Level

VecUnit is the top-level module of the vector processor, integrating all sub-units:

class VecUnit(implicit b: CustomBuckyballConfig, p: Parameters) extends Module {
  val io = IO(new Bundle {
    val cmdReq = Flipped(Decoupled(new BallRsIssue))
    val cmdResp = Decoupled(new BallRsComplete)

    // Connected to Scratchpad SRAM read/write interfaces
    val sramRead = Vec(b.sp_banks, Flipped(new SramReadIO(b.spad_bank_entries, spad_w)))
    val sramWrite = Vec(b.sp_banks, Flipped(new SramWriteIO(b.spad_bank_entries, spad_w, b.spad_mask_len)))
    // Connected to Accumulator read/write interfaces
    val accRead = Vec(b.acc_banks, Flipped(new SramReadIO(b.acc_bank_entries, b.acc_w)))
    val accWrite = Vec(b.acc_banks, Flipped(new SramWriteIO(b.acc_bank_entries, b.acc_w, b.acc_mask_len)))
  })
}

Interface Description

Command interface:

  • cmdReq: Vector instruction request from reservation station
  • cmdResp: Completion response returned to reservation station

Memory interface:

  • sramRead/sramWrite: Read/write interfaces connected to Scratchpad
  • accRead/accWrite: Read/write interfaces connected to Accumulator

VecCtrlUnit - Vector Control Unit

The vector control unit is responsible for instruction decode and pipeline control:

class VecCtrlUnit(implicit b: CustomBuckyballConfig, p: Parameters) extends Module {
  val io = IO(new Bundle{
    val cmdReq = Flipped(Decoupled(new BallRsIssue))
    val cmdResp_o = Decoupled(new BallRsComplete)

    val ctrl_ld_o = Decoupled(new ctrl_ld_req)
    val ctrl_st_o = Decoupled(new ctrl_st_req)
    val ctrl_ex_o = Decoupled(new ctrl_ex_req)

    val cmdResp_i = Flipped(Valid(new Bundle {val commit = Bool()}))
  })
}

Control State

val rob_id_reg    = RegInit(0.U(log2Up(b.rob_entries).W))
val iter          = RegInit(0.U(10.W))
val op1_bank      = RegInit(0.U(2.W))
val op1_bank_addr = RegInit(0.U(12.W))
val op2_bank_addr = RegInit(0.U(12.W))
val op2_bank      = RegInit(0.U(2.W))
val wr_bank       = RegInit(0.U(2.W))
val wr_bank_addr  = RegInit(0.U(12.W))
val is_acc        = RegInit(false.B)

Data Flow Architecture

The vector processing unit uses a pipeline architecture with the following data flow:

Instruction input → VecCtrlUnit → Control signal dispatch
                          ↓
                  VecLoadUnit (Load data)
                          ↓
                  VecEXUnit (Execute computation)
                          ↓
                  VecStoreUnit (Store results)
                          ↓
                      Completion response

Module Connections

// Control unit
val VecCtrlUnit = Module(new VecCtrlUnit)
VecCtrlUnit.io.cmdReq <> io.cmdReq
io.cmdResp <> VecCtrlUnit.io.cmdResp_o

// Load unit
val VecLoadUnit = Module(new VecLoadUnit)
VecLoadUnit.io.ctrl_ld_i <> VecCtrlUnit.io.ctrl_ld_o

// Execution unit
val VecEX = Module(new VecEXUnit)
VecEX.io.ctrl_ex_i <> VecCtrlUnit.io.ctrl_ex_o
VecEX.io.ld_ex_i <> VecLoadUnit.io.ld_ex_o

// Store unit
val VecStoreUnit = Module(new VecStoreUnit)
VecStoreUnit.io.ctrl_st_i <> VecCtrlUnit.io.ctrl_st_o
VecStoreUnit.io.ex_st_i <> VecEX.io.ex_st_o

Memory System Integration

Scratchpad Connection

The vector processing unit connects to Scratchpad through multiple banks:

for (i <- 0 until b.sp_banks) {
  io.sramRead(i).req <> VecLoadUnit.io.sramReadReq(i)
  VecLoadUnit.io.sramReadResp(i) <> io.sramRead(i).resp
}

Accumulator Connection

Execution results are written to Accumulator through the store unit:

for (i <- 0 until b.acc_banks) {
  io.accWrite(i) <> VecStoreUnit.io.accWrite(i)
}

Configuration Parameters

Vector Configuration

Configure vector processor parameters through CustomBuckyballConfig:

class CustomBuckyballConfig extends Config((site, here, up) => {
  case "veclane" => 16              // Vector lane count
  case "sp_banks" => 4              // Scratchpad bank count
  case "acc_banks" => 2             // Accumulator bank count
  case "spad_bank_entries" => 1024  // Entries per bank
  case "acc_bank_entries" => 512    // Accumulator entry count
})

Data Width

val spad_w = b.veclane * b.inputType.getWidth  // Scratchpad width
val acc_w = b.outputType.getWidth              // Accumulator width
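
As a worked example, assuming the toy configuration's 16 lanes with an 8-bit input type and a 32-bit output type, these widths evaluate to:

val spad_w = 16 * 8   // 128-bit Scratchpad row width
val acc_w  = 32       // 32-bit Accumulator entry width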

Usage

Creating Vector Processing Unit

val vecUnit = Module(new VecUnit())

// Connect command interface
vecUnit.io.cmdReq <> reservationStation.io.issue
reservationStation.io.complete <> vecUnit.io.cmdResp

// Connect memory system
for (i <- 0 until sp_banks) {
  scratchpad.io.read(i) <> vecUnit.io.sramRead(i)
  scratchpad.io.write(i) <> vecUnit.io.sramWrite(i)
}

for (i <- 0 until acc_banks) {
  accumulator.io.read(i) <> vecUnit.io.accRead(i)
  accumulator.io.write(i) <> vecUnit.io.accWrite(i)
}

Vector Instruction Format

Vector instructions are passed through the BallRsIssue interface:

class BallRsIssue extends Bundle {
  val cmd = new Bundle {
    val iter = UInt(10.W)           // Iteration count
    val op1_bank = UInt(2.W)        // Operand 1 bank
    val op1_bank_addr = UInt(12.W)  // Operand 1 address
    val op2_bank = UInt(2.W)        // Operand 2 bank
    val op2_bank_addr = UInt(12.W)  // Operand 2 address
    val wr_bank = UInt(2.W)         // Write bank
    val wr_bank_addr = UInt(12.W)   // Write address
  }
  val rob_id = UInt(log2Up(rob_entries).W)
}

Execution Model

Pipeline Execution

  1. Instruction decode: VecCtrlUnit decodes vector instructions
  2. Data load: VecLoadUnit loads operands from Scratchpad
  3. Vector computation: VecEXUnit executes vector operations
  4. Result store: VecStoreUnit writes results to Accumulator
  5. Completion response: Returns completion signal to reservation station

Parallel Processing

  • Multi-lane parallelism: Supports parallel computation across multiple vector lanes
  • Bank-level parallelism: Multiple memory banks support parallel access
  • Pipeline overlap: Different stages can overlap execution

Submodule Description

Binding Mechanism (Bond)

Provides inter-thread synchronization and data binding functionality, supporting producer-consumer pattern data transfer.

Vector Operations (Op)

Implements specific vector computation operations, including arithmetic operations, logical operations, and special functions.

Thread Management (Thread)

Provides thread abstraction and management functionality, supporting different types of vector threads.

Thread Warp Management (Warp)

Implements thread warp organization and scheduling, supporting large-scale parallel computation.

Performance Characteristics

  • High parallelism: Supports multi-lane vector parallel processing
  • Pipelined: Multi-stage pipeline improves throughput
  • Memory optimization: Multi-bank memory system reduces access conflicts
  • Flexible configuration: Supports different vector lengths and data types

Binding Module

Overview

The binding module implements data interfaces and synchronization mechanisms in the vector processing unit, located at prototype/vector/bond. This module defines inter-thread data transfer interfaces, supporting different types of data binding patterns.

File Structure

bond/
├── BondWrapper.scala    - Binding wrapper base class
└── vvv.scala           - VVV binding implementation

Core Components

VVV - Vector-to-Vector Binding

VVV (Vector-Vector-Vector) binding implements a data interface from two input vectors to a single output vector:

class VVV(implicit p: Parameters) extends Bundle {
  val lane = p(ThreadKey).get.lane
  val bondParam = p(ThreadBondKey).get
  val inputWidth = bondParam.inputWidth
  val outputWidth = bondParam.outputWidth

  // Input interface (Flipped Decoupled)
  val in = Flipped(Decoupled(new Bundle {
    val in1 = Vec(lane, UInt(inputWidth.W))
    val in2 = Vec(lane, UInt(inputWidth.W))
  }))

  // Decoupled output interface
  val out = Decoupled(new Bundle {
    val out = Vec(lane, UInt(outputWidth.W))
  })
}

Interface Description

Input interface:

  • in.bits.in1: First input vector, width is inputWidth
  • in.bits.in2: Second input vector, width is inputWidth
  • in.valid: Input data valid signal
  • in.ready: Input ready signal

Output interface:

  • out.bits.out: Output vector, width is outputWidth
  • out.valid: Output data valid signal
  • out.ready: Output ready signal

Parameter Configuration

VVV binding parameters are obtained through the configuration system:

val lane = p(ThreadKey).get.lane                    // Vector lane count
val bondParam = p(ThreadBondKey).get                // Binding parameter
val inputWidth = bondParam.inputWidth               // Input width
val outputWidth = bondParam.outputWidth             // Output width

CanHaveVVVBond - VVV Binding Trait

The CanHaveVVVBond trait provides VVV binding functionality for threads:

trait CanHaveVVVBond { this: BaseThread =>
  val vvvBond = params(ThreadBondKey).filter(_.bondType == "vvv").map { bondParam =>
    IO(new VVV()(params))
  }

  def getVVVBond = vvvBond
}

Usage

Thread classes gain VVV binding capability by mixing in this trait:

class MulThread(implicit p: Parameters) extends BaseThread
  with CanHaveMulOp
  with CanHaveVVVBond {

  // Connect operation and binding
  for {
    op <- mulOp
    bond <- vvvBond
  } {
    op.io.in <> bond.in
    op.io.out <> bond.out
  }
}

BondWrapper - Binding Wrapper

BondWrapper provides Diplomacy-based binding encapsulation:

abstract class BondWrapper(implicit p: Parameters) extends LazyModule {
  val bondName = "vvv"

  def to[T](name: String)(body: => T): T = {
    LazyScope(s"bond_to_${name}", s"Bond_${bondName}_to_${name}") { body }
  }

  def from[T](name: String)(body: => T): T = {
    LazyScope(s"bond_from_${name}", s"Bond_${bondName}_from_${name}") { body }
  }
}

Scope Management

BondWrapper provides named scope management functionality:

  • to(): Creates binding scope in output direction
  • from(): Creates binding scope in input direction
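
A hypothetical usage sketch is shown below; the concrete subclass and the ProducerWrapper/ConsumerWrapper LazyModules are illustrative names only:

class MyVVVBond(implicit p: Parameters) extends BondWrapper {
  // Wrap the construction of neighboring modules in named LazyScopes
  val consumer = to("consumer")   { LazyModule(new ConsumerWrapper) }
  val producer = from("producer") { LazyModule(new ProducerWrapper) }
  lazy val module = new LazyModuleImp(this) {}
}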

Binding Types

VVV Binding Pattern

VVV binding supports the following data flow patterns:

  1. Dual input single output: Two vector inputs, one vector output
  2. Width conversion: Supports different input and output widths
  3. Vector parallelism: Supports multi-lane parallel data transmission

Data Flow Control

VVV binding uses Decoupled interface for flow control:

// Producer side
producer.io.out.valid := dataReady
producer.io.out.bits.in1 := inputVector1
producer.io.out.bits.in2 := inputVector2

// Consumer side
consumer.io.in.ready := canAcceptData
when(consumer.io.in.fire) {
  processData(consumer.io.in.bits.out)
}

Configuration Parameters

Binding Parameters

Binding parameters are defined through BondParam:

case class BondParam(
  bondType: String,           // Binding type ("vvv")
  inputWidth: Int = 8,        // Input width
  outputWidth: Int = 32       // Output width
)

Configuration Example

val bondConfig = BondParam(
  bondType = "vvv",
  inputWidth = 8,
  outputWidth = 32
)

val threadConfig = ThreadParam(
  lane = 16,
  attr = "vector",
  threadName = "mul_thread",
  Op = OpParam("mul", bondConfig)
)

Usage

Creating VVV Binding

// Using VVV binding in thread
class CustomThread(implicit p: Parameters) extends BaseThread
  with CanHaveVVVBond {

  // Get binding interface
  for (bond <- vvvBond) {
    // Connect input
    bond.in.valid := inputValid
    bond.in.bits.in1 := inputVector1
    bond.in.bits.in2 := inputVector2

    // Connect output
    outputValid := bond.out.valid
    outputVector := bond.out.bits.out
    bond.out.ready := outputReady
  }
}

Binding Connection

// Connect binding interfaces of two modules
val producer = Module(new ProducerThread())
val consumer = Module(new ConsumerThread())

// Direct binding interface connection
for {
  prodBond <- producer.vvvBond
  consBond <- consumer.vvvBond
} {
  consBond.in <> prodBond.out
}

Synchronization Mechanisms

Handshake Protocol

VVV binding uses standard Decoupled handshake protocol:

  1. Data preparation: Producer sets valid and bits
  2. Receive ready: Consumer sets ready
  3. Data transmission: Transfer completes when valid && ready
  4. State update: Both sides update internal state

Backpressure Handling

Binding interface supports backpressure mechanism:

// When downstream is not ready, the producer must wait: it keeps valid
// asserted and holds its data stable until the handshake fires (valid && ready).
// dataPending is a hypothetical producer-side flag illustrating this.
upstream.valid := dataPending
when (upstream.valid && downstream.ready) {
  dataPending := false.B   // advance only after the transfer completes
}

Extensibility

New Binding Types

New binding types can be defined following a similar pattern:

// Single input single output binding
class VV(implicit p: Parameters) extends Bundle {
  val lane = p(ThreadKey).get.lane
  val bondParam = p(ThreadBondKey).get

  val in  = Flipped(Decoupled(Vec(lane, UInt(bondParam.inputWidth.W))))
  val out = Decoupled(Vec(lane, UInt(bondParam.outputWidth.W)))
}

// Corresponding trait
trait CanHaveVVBond { this: BaseThread =>
  val vvBond = params(ThreadBondKey).filter(_.bondType == "vv").map { _ =>
    IO(new VV()(params))
  }
}

Parameterization Support

The binding module supports full parameterized configuration:

  • Vector lane count configurable
  • Input/output width configurable
  • Binding type extensible

Vector Operations Module

Overview

The vector operations module implements specific computation operations in the vector processing unit, located at prototype/vector/op. This module provides implementations of different types of vector operations, including multiplication operations and cascade operations.

File Structure

op/
├── cascade.scala    - Cascade addition operation
└── mul.scala       - Multiplication operation

Core Components

CascadeOp - Cascade Addition Operation

CascadeOp implements element-wise addition operation on vector elements:

class CascadeOp(implicit p: Parameters) extends Module {
  val lane = p(ThreadKey).get.lane
  val bondParam = p(ThreadBondKey).get
  val outputWidth = bondParam.outputWidth

  val io = IO(new VVV()(p))
}

Operation Logic

val reg1 = RegInit(VecInit(Seq.fill(lane)(0.U(outputWidth.W))))
val valid1 = RegInit(false.B)

when (io.in.valid) {
  valid1 := true.B
  reg1 := VecInit(io.in.bits.in1.zip(io.in.bits.in2).map { case (a, b) => a + b })
}

Function description:

  • Receives two input vectors in1 and in2
  • Performs element-wise addition: out[i] = in1[i] + in2[i]
  • Uses register to cache computation results
  • Supports pipelined operations

Flow Control Mechanism

io.in.ready := io.out.ready

when (io.out.ready && valid1) {
  io.out.valid := true.B
  io.out.bits.out := reg1
}.otherwise {
  io.out.valid := false.B
  io.out.bits.out := VecInit(Seq.fill(lane)(0.U(outputWidth.W)))
}

MulOp - Multiplication Operation

MulOp implements vector multiplication operation with broadcast mode support:

class MulOp(implicit p: Parameters) extends Module {
  val lane = p(ThreadKey).get.lane
  val bondParam = p(ThreadBondKey).get
  val inputWidth = bondParam.inputWidth

  val io = IO(new VVV()(p))
}

Operation Logic

val reg1 = RegInit(VecInit(Seq.fill(lane)(0.U(inputWidth.W))))
val reg2 = RegInit(VecInit(Seq.fill(lane)(0.U(inputWidth.W))))
val cnt = RegInit(0.U(log2Ceil(lane).W))
val active = RegInit(false.B)

when (io.in.valid) {
  reg1 := io.in.bits.in1
  reg2 := io.in.bits.in2
  cnt := 0.U
  active := true.B
}

Function description:

  • Receives two input vectors and caches them in registers
  • Uses counter cnt to control output sequence
  • Implements broadcast multiplication: out[i] = reg1[cnt] * reg2[i]

Sequential Output

for (i <- 0 until lane) {
  io.out.bits.out(i) := reg1(cnt) * reg2(i)
}

when (active && io.out.ready) {
  cnt := cnt + 1.U
  when (cnt === (lane-1).U) {
    active := false.B
  }
}

Output mode:

  • Outputs one set of multiplication results per cycle
  • reg1[cnt] multiplied with all elements of reg2
  • Counter increments to achieve sequential output
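
A plain-Scala reference of this broadcast sequence (software model only):

// Cycle t outputs a(t) * b(i) for every lane i; the full sequence takes lane cycles.
def mulOpSequence(a: Seq[Int], b: Seq[Int]): Seq[Seq[Int]] =
  a.indices.map(t => b.map(bi => a(t) * bi))

// Example: a = Seq(1, 2), b = Seq(10, 20) yields Seq(Seq(10, 20), Seq(20, 40))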

Operation Traits

CanHaveCascadeOp - Cascade Operation Trait

trait CanHaveCascadeOp { this: BaseThread =>
  val cascadeOp = params(ThreadOpKey).filter(_.OpType == "cascade").map { opParam =>
    Module(new CascadeOp()(params))
  }

  def getCascadeOp = cascadeOp
}

CanHaveMulOp - Multiplication Operation Trait

trait CanHaveMulOp { this: BaseThread =>
  val mulOp = params(ThreadOpKey).filter(_.OpType == "mul").map { opParam =>
    Module(new MulOp()(params))
  }

  def getMulOp = mulOp
}

Usage

Using Operations in Threads

class CasThread(implicit p: Parameters) extends BaseThread
  with CanHaveCascadeOp
  with CanHaveVVVBond {

  // Connect operation and binding
  for {
    op <- cascadeOp
    bond <- vvvBond
  } {
    op.io.in <> bond.in
    op.io.out <> bond.out
  }
}

Configuring Operation Parameters

val opParam = OpParam(
  OpType = "cascade",                    // Operation type
  bondType = BondParam(
    bondType = "vvv",
    inputWidth = 32,
    outputWidth = 32
  )
)

Operation Type Comparison

CascadeOp vs MulOp

Feature              | CascadeOp             | MulOp
---------------------|-----------------------|--------------------------
Operation type       | Element-wise addition | Broadcast multiplication
Input width          | Arbitrary             | Usually smaller
Output width         | Arbitrary             | Usually larger
Latency              | 1 cycle               | lane cycles
Throughput           | 1 group per cycle     | 1 group per lane cycle
Resource consumption | Adder × lane          | Multiplier × lane

Application Scenarios

CascadeOp is suitable for:

  • Vector addition operations
  • Accumulation operations
  • Data merging

MulOp is suitable for:

  • Matrix-vector multiplication
  • Convolution operations
  • Scaling operations

Data Flow Patterns

CascadeOp Data Flow

Input: [a0, a1, ..., an], [b0, b1, ..., bn]
      ↓
Compute: [a0+b0, a1+b1, ..., an+bn]
      ↓
Output: [c0, c1, ..., cn] (1 cycle)

MulOp Data Flow

Input: [a0, a1, ..., an], [b0, b1, ..., bn]
      ↓
Cycle 0: [a0*b0, a0*b1, ..., a0*bn]
Cycle 1: [a1*b0, a1*b1, ..., a1*bn]
...
Cycle n: [an*b0, an*b1, ..., an*bn]

Extended Operations

Adding New Operations

New vector operations can be added following a similar pattern:

class SubOp(implicit p: Parameters) extends Module {
  val io = IO(new VVV()(p))

  // Implement subtraction operation
  io.out.bits.out := VecInit(io.in.bits.in1.zip(io.in.bits.in2).map {
    case (a, b) => a - b
  })
}

trait CanHaveSubOp { this: BaseThread =>
  val subOp = params(ThreadOpKey).filter(_.OpType == "sub").map { _ =>
    Module(new SubOp()(params))
  }
}

Complex Operations

For more complex operations, multiple basic operations can be combined:

class FMAOp(implicit p: Parameters) extends Module {
  // Fused multiply-add operation: out = a * b + c
  val mulOp = Module(new MulOp())
  val addOp = Module(new CascadeOp())

  // Connect operation pipeline
  addOp.io.in.bits.in1 := mulOp.io.out.bits.out
  // ...
}

Performance Optimization

Pipeline Optimization

  • Use registers to cache intermediate results
  • Support continuous data stream processing
  • Minimize combinational logic delay

Resource Optimization

  • Choose appropriate hardware resources based on operation type
  • Support resource sharing and reuse
  • Configurable parallelism

Thread Module

Overview

The thread module implements thread abstractions in the vector processing unit, located at prototype/vector/thread. This module defines the basic structure and specific implementations of threads, constructing threads with specific functionality by combining different operations (Op) and bindings (Bond).

File Structure

thread/
├── BaseThread.scala    - Thread base class definition
├── CasThread.scala     - Cascade operation thread
└── MulThread.scala     - Multiplication operation thread

Core Components

BaseThread - Thread Base Class

BaseThread is the base class for all threads, defining basic thread parameters and configuration:

class BaseThread(implicit p: Parameters) extends Module {
  val io = IO(new Bundle {})
  val params = p
  val threadMap = p(ThreadMapKey)
  val threadParam = threadMap.getOrElse(
    p(ThreadKey).get.threadName,
    throw new Exception(s"ThreadParam not found for threadName: ${p(ThreadKey).get.threadName}")
  )
  val opParam = p(ThreadOpKey).get
  val bondParam = p(ThreadBondKey).get
}

Parameter Definition

The thread module uses the following parameter structure:

case class ThreadParam(lane: Int, attr: String, threadName: String, Op: OpParam)
case class OpParam(OpType: String, bondType: BondParam)
case class BondParam(bondType: String, inputWidth: Int = 8, outputWidth: Int = 32)

Parameter description:

  • lane: Vector lane count
  • threadName: Thread name identifier
  • OpType: Operation type ("cascade", "mul")
  • bondType: Binding type ("vvv")
  • inputWidth: Input data width, default 8 bits
  • outputWidth: Output data width, default 32 bits

Specific Thread Implementations

CasThread - Cascade Operation Thread

CasThread implements cascade addition operation, combining CascadeOp and VVVBond:

class CasThread(implicit p: Parameters) extends BaseThread
  with CanHaveCascadeOp
  with CanHaveVVVBond {

  // Connect CascadeOp and VVVBond
  for {
    op <- cascadeOp
    bond <- vvvBond
  } {
    op.io.in <> bond.in
    op.io.out <> bond.out
  }
}

Function: Performs element-wise addition operation on two input vectors.

MulThread - Multiplication Operation Thread

MulThread implements multiplication operation, combining MulOp and VVVBond:

class MulThread(implicit p: Parameters) extends BaseThread
  with CanHaveMulOp
  with CanHaveVVVBond {

  // Connect MulOp and VVVBond
  for {
    op <- mulOp
    bond <- vvvBond
  } {
    op.io.in <> bond.in
    op.io.out <> bond.out
  }
}

Function: Implements vector broadcast multiplication, outputting one group of products per cycle across lane cycles.

Configuration System

The thread module uses Chipyard's configuration system for parameterization:

case object ThreadKey extends Field[Option[ThreadParam]](None)
case object ThreadOpKey extends Field[Option[OpParam]](None)
case object ThreadBondKey extends Field[Option[BondParam]](None)
case object ThreadMapKey extends Field[Map[String, ThreadParam]](Map.empty)

Configuration key description:

  • ThreadKey: Current thread parameter
  • ThreadOpKey: Operation parameter
  • ThreadBondKey: Binding parameter
  • ThreadMapKey: Thread mapping table

Usage

Creating Thread Instance

// Configure parameters
val threadParam = ThreadParam(
  lane = 4,
  attr = "vector",
  threadName = "mul_thread",
  Op = OpParam("mul", BondParam("vvv", 8, 32))
)

// Create thread (BaseThread also reads ThreadMapKey, so register the thread there)
val mulThread = Module(new MulThread()(
  new Config((site, here, up) => {
    case ThreadKey     => Some(threadParam)
    case ThreadOpKey   => Some(threadParam.Op)
    case ThreadBondKey => Some(threadParam.Op.bondType)
    case ThreadMapKey  => Map(threadParam.threadName -> threadParam)
  })
))

Connecting Interfaces

Threads exchange data through the VVV binding interface created by CanHaveVVVBond (vvvBond is an Option, so it is accessed with a for-comprehension):

for (bond <- mulThread.vvvBond) {
  // Input data
  bond.in.valid := inputValid
  bond.in.bits.in1 := inputVector1
  bond.in.bits.in2 := inputVector2

  // Output data
  outputValid := bond.out.valid
  outputVector := bond.out.bits.out
  bond.out.ready := outputReady
}

Thread Warp Module

Overview

The thread warp module implements thread warp management functionality in the vector processing unit, located at prototype/vector/warp. This module organizes multiple threads into a mesh structure, implementing parallel computation and dataflow management.

File Structure

warp/
├── MeshWarp.scala    - Mesh warp implementation
└── VecBall.scala     - Vector ball processor

Core Components

MeshWarp - Mesh Warp

MeshWarp implements a 32-thread mesh structure containing 16 multiplication threads and 16 cascade threads:

class MeshWarp(implicit p: Parameters) extends Module {
  val io = IO(new Bundle {
    val in = Flipped(Decoupled(new MeshWarpInput))
    val out = Decoupled(new MeshWarpOutput)
  })
}

Input/Output Interface

class MeshWarpInput extends Bundle {
  val op1 = Vec(16, UInt(8.W))        // First operand vector
  val op2 = Vec(16, UInt(8.W))        // Second operand vector
  val thread_id = UInt(10.W)          // Thread identifier
}

class MeshWarpOutput extends Bundle {
  val res = Vec(16, UInt(32.W))       // Result vector
}

Thread Configuration

Threads in the mesh are configured according to the following rules:

val threadMap = (0 until 32).map { i =>
  val threadName = i.toString
  val opType = if (i < 16) "mul" else "cascade"
  val bond = if (opType == "mul") {
    BondParam("vvv", inputWidth = 8, outputWidth = 32)
  } else {
    BondParam("vvv", inputWidth = 32, outputWidth = 32)
  }
  val op = OpParam(opType, bond)
  val thread = ThreadParam(16, s"attr$threadName", threadName, op)
  threadName -> thread
}.toMap

Thread allocation:

  • Threads 0-15: Multiplication operation threads (8-bit input → 32-bit output)
  • Threads 16-31: Cascade operation threads (32-bit input → 32-bit output)

Data Flow Connection

Data flow in the mesh is connected as follows:

// Connect mul thread output to cascade thread input
casBond.in.bits.in1 := mulBond.out.bits.out
mulBond.out.ready   := casBond.in.ready

// Cascade connection between cascade threads
if (i == 0) {
  casBond.in.bits.in2 := VecInit(Seq.fill(16)(0.U(32.W)))
} else {
  casBond.in.bits.in2 := prevCasBond.out.bits.out
}

Data flow path:

  1. Input data → Multiplication threads (thread 0-15)
  2. Multiplication results → Cascade threads (thread 16-31)
  3. Serial connection between cascade threads
  4. Final result output from thread 31

VecBall - Vector Ball Processor

VecBall is a wrapper for MeshWarp, providing state management and iteration control:

class VecBall(implicit p: Parameters) extends Module {
  val io = IO(new VecBallIO())
}

Interface Definition

class VecBallIO extends BallIO {
  val op1In = Flipped(Valid(Vec(16, UInt(8.W))))    // Operand 1 input
  val op2In = Flipped(Valid(Vec(16, UInt(8.W))))    // Operand 2 input
  val rstOut = Decoupled(Vec(16, UInt(32.W)))       // Result output
}

class BallIO extends Bundle {
  val iterIn = Flipped(Decoupled(UInt(10.W)))       // Iteration count input
  val iterOut = Valid(UInt(10.W))                   // Current iteration output
}

State Management

VecBall maintains the following internal state:

val start  = RegInit(false.B)      // Start flag
val arrive = RegInit(false.B)      // Arrival flag
val done   = RegInit(false.B)      // Completion flag
val iter   = RegInit(0.U(10.W))    // Total iteration count
val iterCounter = RegInit(0.U(10.W)) // Current iteration counter

Thread Scheduling

VecBall uses round-robin scheduling to assign threads:

val threadId = RegInit(0.U(4.W))
when (io.op1In.valid && io.op2In.valid && threadId < 15.U) {
  threadId := threadId + 1.U
} .elsewhen (io.op1In.valid && io.op2In.valid && threadId === 15.U) {
  threadId := 0.U
}

Usage

Creating MeshWarp Instance

val meshWarp = Module(new MeshWarp()(p))

// Connect input
meshWarp.io.in.valid := inputValid
meshWarp.io.in.bits.op1 := operand1
meshWarp.io.in.bits.op2 := operand2
meshWarp.io.in.bits.thread_id := selectedThread

// Connect output
outputValid := meshWarp.io.out.valid
result := meshWarp.io.out.bits.res
meshWarp.io.out.ready := outputReady

Creating VecBall Instance

val vecBall = Module(new VecBall()(p))

// Set iteration count
vecBall.io.iterIn.valid := iterValid
vecBall.io.iterIn.bits := totalIterations

// Input data
vecBall.io.op1In.valid := dataValid
vecBall.io.op1In.bits := inputVector1
vecBall.io.op2In.valid := dataValid
vecBall.io.op2In.bits := inputVector2

// Get result
outputReady := vecBall.io.rstOut.ready
when(vecBall.io.rstOut.valid) {
  result := vecBall.io.rstOut.bits
}

Computation Modes

Vector Multiply-Accumulate

Computation mode implemented by MeshWarp:

  1. Multiplication phase: 16 multiplication threads compute op1[i] * op2[i] in parallel
  2. Accumulation phase: 16 cascade threads accumulate multiplication results serially
  3. Output phase: Output final accumulated vector
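
Under this interpretation, a plain-Scala reference of the whole multiply-accumulate pattern would look as follows (assumption: each cascade stage adds one 16-element product vector into a running accumulator):

def meshWarpReference(op1s: Seq[Seq[Int]], op2s: Seq[Seq[Int]]): Seq[Int] =
  op1s.zip(op2s)
    .map { case (a, b) => a.zip(b).map { case (x, y) => x * y } } // mul threads
    .foldLeft(Seq.fill(16)(0)) { (acc, prod) =>                   // cascade threads
      acc.zip(prod).map { case (s, p) => s + p }
    }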

Iterative Processing

VecBall supports multi-iteration processing:

  1. Set iteration count iterIn
  2. Loop input data pairs
  3. Monitor iteration count iterOut
  4. Check completion status

Performance Characteristics

  • Parallelism: 16 multiplication operations execute in parallel
  • Pipeline: Supports continuous data stream processing
  • Throughput: Can process one 16-element vector pair per cycle
  • Latency: Combined latency of multiplication + cascade

Buckyball Example Configurations

Overview

This directory contains example configurations and reference implementations of the Buckyball framework, demonstrating how to configure and extend Buckyball systems. Located at arch/src/main/scala/examples, it serves as the configuration layer, providing configuration templates and system instances for developers.

Main components include:

  • BuckyballConfig.scala: Global configuration parameter definitions
  • toy/: Complete example system implementation with custom coprocessor and CSR extensions

Code Structure

examples/
├── BuckyballConfig.scala     - Global configuration definitions
└── toy/                      - Complete example system
    ├── balldomain/           - Ball domain component implementation
    │   ├── BallDomain.scala  - Ball domain top-level
    │   ├── bbus/             - Ball bus registration
    │   │   └── busRegister.scala
    │   ├── rs/               - Ball RS registration
    │   │   └── rsRegister.scala
    │   └── decoder/          - Ball decoder (if exists)
    ├── CustomConfigs.scala   - System configuration composition
    └── ToyBuckyball.scala    - System top-level module

File Dependencies

BuckyballConfig.scala (Base Configuration Layer)

  • Defines global configuration parameters and defaults
  • Inherited and extended by all other configuration files
  • Provides system-level configuration interface

toy/CustomConfigs.scala (Configuration Composition Layer)

  • Inherits from BuckyballConfig and adds custom parameters
  • Composes multiple configuration fragments into complete configuration
  • Provides configuration support for ToyBuckyball

toy/ToyBuckyball.scala (System Instantiation Layer)

  • Uses CustomConfigs to instantiate complete system
  • Serves as entry point for Mill build
  • Generates final Verilog code

Module Details

BuckyballConfig.scala

Main Function: Define global configuration parameters for the Buckyball framework

Key Components:

object BuckyballConfigs {
  val defaultConfig = BaseConfig
  val toyConfig = BuckyballToyConfig.defaultConfig

  // Actually used configuration
  val customConfig = toyConfig

  type CustomBuckyballConfig = BaseConfig
}

Configuration Selection: The framework uses customConfig to select the active configuration. This allows easy switching between different system configurations.

Input/Output:

  • Input: No direct input, parameters passed through configuration system
  • Output: Configuration parameters for use by other modules
  • Edge cases: Configuration conflicts resolved by priority-based overriding

toy/ - Example System

The toy system demonstrates a complete Buckyball implementation with various Ball devices.

toy/ToyBuckyball.scala

Main Function: System top-level module, instantiates complete toy system

Key Components:

class ToyBuckyball(implicit b: CustomBuckyballConfig, p: Parameters) extends LazyRoCC {
  override lazy val module = new ToyBuckyballModuleImp(this)
}

class ToyBuckyballModuleImp(outer: ToyBuckyball) extends LazyRoCCModuleImp(outer) {
  // Global Decoder
  val globalDecoder = Module(new GlobalDecoder)

  // Global Reservation Station (with ROB)
  val globalRS = Module(new GlobalReservationStation)

  // Ball Domain (regular Module, not LazyModule)
  val ballDomain = Module(new BallDomain)

  // Memory Domain (complete domain with DMA+TLB+SRAM)
  val memDomain = LazyModule(new MemDomain)

  // Connect components
  globalDecoder.io.rocc <> io.cmd
  globalRS.io.decode <> globalDecoder.io.issue
  ballDomain.io.issue <> globalRS.io.ballIssue
  memDomain.module.io.issue <> globalRS.io.memIssue
  // ... more connections
}

Build Flow:

  1. Load configuration from BuckyballConfig
  2. Instantiate ToyBuckyball LazyRoCC module
  3. Generate Verilog through ChiselStage
  4. Output to generated-src directory

Input/Output:

  • Input: RoCC interface commands from Rocket core
  • Output: RoCC interface responses, busy signals
  • Edge cases: Configuration errors cause build failure

toy/balldomain/ - Ball Domain Components

BallDomain.scala: Ball domain top-level module

  • Integrates Ball Decoder, local Ball RS, and BBus
  • Provides single-channel interface to Global RS
  • Routes commands to appropriate Ball devices

bbus/busRegister.scala: Ball bus registration

class BBusModule extends BBus {
  // Register Ball device generators
  registerBall(() => new VecBall, ballId = 0.U)
  registerBall(() => new MatrixBall, ballId = 1.U)
  registerBall(() => new TransposeBall, ballId = 2.U)
  registerBall(() => new Im2colBall, ballId = 3.U)
  registerBall(() => new ReluBall, ballId = 4.U)
}

rs/rsRegister.scala: Ball RS device registration

class BallRSModule extends BallReservationStation {
  // Register Ball device information
  registerBallInfo(name = "VecBall", bid = 0, latency = 10)
  registerBallInfo(name = "MatrixBall", bid = 1, latency = 20)
  registerBallInfo(name = "TransposeBall", bid = 2, latency = 15)
  registerBallInfo(name = "Im2colBall", bid = 3, latency = 15)
  registerBallInfo(name = "ReluBall", bid = 4, latency = 10)
}

toy/CustomConfigs.scala

Main Function: Compose multiple configuration fragments for the toy system

Configuration Composition:

object BuckyballToyConfig {
  val defaultConfig = BaseConfig(
    opcodes = OpcodeSet.custom3,
    inputType = UInt(8.W),        // INT8 input
    accType = UInt(32.W),         // INT32 accumulator
    veclane = 16,                 // 16-element vectors
    accveclane = 4,               // 4-element accumulator vectors
    rob_entries = 16,             // 16 ROB entries
    sp_banks = 4,                 // 4 scratchpad banks
    acc_banks = 8,                // 8 accumulator banks
    sp_capacity = CapacityInKilobytes(256),   // 256KB scratchpad
    acc_capacity = CapacityInKilobytes(64),   // 64KB accumulator
    numVecPE = 16,                // 16 vector PEs
    numVecThread = 16             // 16 vector threads
  )
}

Configuration Parameters:

  • opcodes: Custom instruction opcode set (custom3 = 0x7b)
  • inputType: Data type for input operands
  • accType: Data type for accumulator
  • veclane: Number of elements per vector lane
  • rob_entries: Reorder buffer depth
  • Memory configuration: Bank counts and capacities
  • Vector configuration: PE count and thread count

Usage Guide

Building the Toy System

Generate Verilog:

cd arch
mill arch.runMain examples.toy.ToyBuckyball

Generated Files:

  • Location: arch/generated-src/toy/
  • Files: Verilog (.v), FIRRTL (.fir), annotation (.anno.json)

Custom Configuration Development

Steps:

  1. Copy CustomConfigs.scala as template
  2. Modify configuration parameters to meet requirements
  3. Implement necessary custom components
  4. Update top-level module to reference new configuration
  5. Register Ball devices in BBus and Ball RS

Example: Adding New Ball Device:

  1. Implement the Ball device:

class MyCustomBall(implicit b: CustomBuckyballConfig, p: Parameters)
  extends Module with BallRegist {
  // Implement Ball interfaces
  val io = IO(new BlinkIO)
  def ballId = 6.U  // Assign unique Ball ID
  // ... implementation
}

  2. Register in BBusModule:

registerBall(() => new MyCustomBall, ballId = 6.U)

  3. Register in BallRSModule:

registerBallInfo(name = "MyCustomBall", bid = 6, latency = 12)

Configuration Best Practices

Parameter Selection:

  1. Memory Sizes: Balance capacity vs. area

    • Scratchpad: Main working memory for data
    • Accumulator: Smaller, used for accumulation results
  2. ROB Depth: Impacts instruction-level parallelism

    • Larger ROB: More in-flight instructions, higher parallelism
    • Smaller ROB: Lower area, simpler control logic
  3. Bank Counts: Affects memory bandwidth

    • More banks: Higher parallel access bandwidth
    • Fewer banks: Simpler arbitration, lower area
  4. Vector Configuration: Depends on workload

    • Vector lane width: Match data parallelism
    • PE/Thread count: Balance performance vs. area

Common Configurations:

// High-performance configuration
val highPerfConfig = BaseConfig(
  veclane = 32,                 // Wider vectors
  rob_entries = 32,             // Deeper ROB
  sp_banks = 8,                 // More banks
  sp_capacity = CapacityInKilobytes(512)
)

// Area-optimized configuration
val smallConfig = BaseConfig(
  veclane = 8,
  rob_entries = 8,
  sp_banks = 2,
  sp_capacity = CapacityInKilobytes(64)
)

Important Notes

  1. Configuration Priority: Later configurations in the chain override earlier ones with same parameter names
  2. Dependency Management: Ensure custom component dependencies are correctly declared in configuration
  3. Build Path: Generated file paths specified by TargetDirAnnotation
  4. Parameter Validation: Configuration parameters validated during instantiation; invalid configurations cause build failure
  5. Ball ID Uniqueness: Each Ball device must have unique ID across the system
  6. Bank Access Rules: Remember op1 and op2 cannot access same bank simultaneously

System Architecture

The toy system implements the complete Buckyball architecture:

┌─────────────────────────────────────────────────────────┐
│                  Rocket Core (via RoCC)                 │
└────────────────────┬────────────────────────────────────┘
                     │
            ┌────────▼────────┐
            │ Global Decoder  │
            └────────┬────────┘
                     │
            ┌────────▼────────┐
            │   Global RS     │
            │  (with ROB)     │
            └────┬──────┬─────┘
                 │      │
         ┌───────▼──┐ ┌▼──────────┐
         │  Ball    │ │   Mem     │
         │  Domain  │ │  Domain   │
         │          │ │           │
         │  ┌─────┐ │ │ ┌──────┐ │
         │  │BBus │ │ │ │ DMA  │ │
         │  └──┬──┘ │ │ │+TLB  │ │
         │     │    │ │ └───┬──┘ │
         │  ┌──▼───┐│ │     │    │
         │  │Balls ││ │  ┌──▼──┐ │
         │  └──────┘│ │  │Mem  │ │
         └──────┬───┘ │  │Ctrl │ │
                │     │  └─────┘ │
                │     └─────┬────┘
                │           │
            ┌───▼───────────▼───┐
            │  Memory Controller│
            │ (Scratchpad+Acc)  │
            └───────────────────┘

Supported Ball Devices:

  • VecBall (ID=0): Vector operations
  • MatrixBall (ID=1): Matrix multiplication (various formats)
  • TransposeBall (ID=2): Matrix transpose
  • Im2colBall (ID=3): Im2col transformation for convolution
  • ReluBall (ID=4): ReLU activation function

Troubleshooting

Issue: Build fails with "Ball ID conflict"

  • Solution: Ensure each Ball device has unique ID in both BBus and RS registration

Issue: Generated Verilog has timing violations

  • Solution: Reduce clock frequency or optimize critical paths

Issue: Simulation shows incorrect results

  • Solution: Verify Ball device implementation and memory access patterns

Issue: Configuration parameter not taking effect

  • Solution: Check configuration priority and ensure parameter is in correct config fragment

Toy Buckyball Example Implementation

Overview

This directory contains a complete example implementation of the Buckyball framework, demonstrating how to build a custom coprocessor based on the RoCC interface. Located in arch/src/main/scala/examples/toy, it serves as a reference implementation for the Buckyball system, integrating global decoder, Ball domain, and memory domain.

Core components:

  • ToyBuckyball.scala: Main RoCC coprocessor implementation
  • CustomConfigs.scala: System configuration and RoCC integration
  • CSR.scala: Custom control and status registers
  • balldomain/: Ball domain related components

Code Structure

toy/
├── ToyBuckyball.scala    - Main coprocessor implementation
├── CustomConfigs.scala   - Configuration definitions
├── CSR.scala            - CSR implementation
└── balldomain/          - Ball domain components

File Dependencies

ToyBuckyball.scala (Core implementation layer)

  • Extends LazyRoCCBB, implements RoCC coprocessor interface
  • Integrates GlobalDecoder, BallDomain, MemDomain
  • Manages TileLink connections and DMA components

CustomConfigs.scala (Configuration layer)

  • Defines BuckyballCustomConfig and BuckyballToyConfig
  • Configures RoCC integration and system parameters
  • Provides multi-core configuration support

CSR.scala (Register layer)

  • Implements FenceCSR control register
  • Provides simple 64-bit register interface

Module Description

ToyBuckyball.scala

Main functionality: Implements complete Buckyball RoCC coprocessor

Key components:

class ToyBuckyball(val b: CustomBuckyballConfig)(implicit p: Parameters)
  extends LazyRoCCBB (opcodes = b.opcodes, nPTWPorts = 2) {

  val reader = LazyModule(new BBStreamReader(...))
  val writer = LazyModule(new BBStreamWriter(...))
  val xbar_node = TLXbar()
}

System architecture:

// Frontend: global decoder
val gDecoder = Module(new GlobalDecoder)

// Backend: Ball domain and memory domain
val ballDomain = Module(new BallDomain)
val memDomain = Module(new MemDomain)

// Response arbitration
val respArb = Module(new Arbiter(new RoCCResponseBB()(p), 2))

TileLink connections:

xbar_node := TLBuffer() := reader.node
xbar_node := TLBuffer() := writer.node
id_node := TLWidthWidget(b.dma_buswidth/8) := TLBuffer() := xbar_node

Inputs/Outputs:

  • Input: RoCC command interface, PTW interface
  • Output: RoCC response, TileLink memory access
  • Edge cases: Busy-wait handling during Fence operations

CustomConfigs.scala

Main functionality: Defines system configuration and RoCC integration

Configuration class definition:

class BuckyballCustomConfig(
  buckyballConfig: CustomBuckyballConfig = CustomBuckyballConfig()
) extends Config((site, here, up) => {
  case BuildRoCCBB => up(BuildRoCCBB) ++ Seq(
    (p: Parameters) => {
      val buckyball = LazyModule(new ToyBuckyball(buckyballConfig))
      buckyball
    }
  )
})

System configuration:

class BuckyballToyConfig extends Config(
  new framework.rocket.WithNBuckyballCores(1) ++
  new BuckyballCustomConfig(CustomBuckyballConfig()) ++
  new chipyard.config.WithSystemBusWidth(128) ++
  new WithCustomBootROM ++
  new chipyard.config.AbstractConfig
)

Multi-core support:

class WithMultiRoCCToyBuckyball(harts: Int*) extends Config(...)

CSR.scala

Main functionality: Provides custom control and status registers

Implementation:

object FenceCSR {
  def apply(): UInt = RegInit(0.U(64.W))
}

Fence handling logic:

val fenceCSR = FenceCSR()
val fenceSet = ballDomain.io.fence_o
val allDomainsIdle = !ballDomain.io.busy && !memDomain.io.busy

when (fenceSet) {
  fenceCSR := 1.U
  io.cmd.ready := allDomainsIdle
}
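
The excerpt does not show how the CSR is cleared again; a plausible completion, which is an assumption rather than confirmed source behavior, is to release the fence once both domains drain:

// Assumed clear condition: fence is pending and both domains are idle
when (fenceCSR === 1.U && allDomainsIdle) {
  fenceCSR := 0.U
}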

Usage

System Integration

RoCC interface integration:

  • Register coprocessor through BuildRoCCBB configuration key
  • Support multi-core configuration
  • Provide 2 PTW ports for address translation

Inter-domain communication:

// BallDomain -> MemDomain bridge
ballDomain.io.sramRead <> memDomain.io.ballDomain.sramRead
ballDomain.io.sramWrite <> memDomain.io.ballDomain.sramWrite

DMA connections:

memDomain.io.dma.read.req <> outer.reader.module.io.req
memDomain.io.dma.write.req <> outer.writer.module.io.req

Notes

  1. Fence semantics: Use CSR to implement Fence operation synchronization
  2. Busy-wait detection: Assertion checks to prevent long simulation stalls
  3. TLB integration: TLB functionality integrated in MemDomain
  4. Response arbitration: BallDomain has higher priority than MemDomain
  5. Configuration dependencies: Correctly configure CustomBuckyballConfig parameters

BallDomain Example Implementation

Overview

This directory contains a complete example implementation of BallDomain in the Buckyball framework, demonstrating how to build a custom computation domain to manage specialized accelerators. BallDomain is a core concept in Buckyball architecture, used to encapsulate and manage a group of related computation units with unified control and dataflow management.

This directory implements the ball domain architecture, including:

  • BallDomain: Top-level module managing the entire computation domain
  • BallController: Ball domain controller for instruction scheduling and execution control
  • DISA: Distributed instruction scheduling architecture
  • DomainDecoder: Domain instruction decoder
  • Specialized accelerators: Including matrix, vector, im2col and other accelerator implementations

Code Structure

balldomain/
├── BallDomain.scala      - Ball domain top module
├── BallController.scala  - Ball domain controller
├── DISA.scala           - Distributed instruction scheduling architecture
├── DomainDecoder.scala  - Domain instruction decoder
├── bbus/                - Ball domain bus system
├── im2col/              - Image-to-column conversion accelerator
├── matrixball/          - Matrix computation ball domain
├── rs/                  - Reservation station implementation
└── vecball/             - Vector computation ball domain

File Dependencies

BallDomain.scala (Top-level module)

  • Integrates all submodules, provides unified ball domain interface
  • Manages dataflow and control flow within ball domain
  • Connects to system bus and RoCC interface

BallController.scala (Control layer)

  • Implements instruction scheduling and execution control for ball domain
  • Manages coordination between multiple accelerators
  • Provides state management and error handling

DISA.scala (Scheduling layer)

  • Distributed instruction scheduling architecture implementation
  • Supports concurrent execution of multiple instructions
  • Provides dynamic load balancing

DomainDecoder.scala (Decode layer)

  • Ball domain specific instruction decode
  • Instruction dispatch to corresponding execution units
  • Supports complex instruction decomposition and reorganization

Module Description

BallDomain.scala

Main functionality: Ball domain top module, integrates all computation units and control logic

Key components:

class BallDomain(implicit p: Parameters) extends LazyModule {
  val controller = LazyModule(new BallController)
  val matrixBall = LazyModule(new MatrixBall)
  val vecBall = LazyModule(new VecBall)
  val im2colUnit = LazyModule(new Im2colUnit)

  // Ball domain bus connections
  val bbus = LazyModule(new BBus)
  bbus.node := controller.node
  matrixBall.node := bbus.node
  vecBall.node := bbus.node
}

Inputs/Outputs:

  • Input: RoCC instruction interface, memory access interface
  • Output: Computation results, status information
  • Edge cases: Instruction conflict handling, resource contention management

BallController.scala

Main functionality: Ball domain controller, manages overall ball domain execution control

Key components:

class BallController extends Module {
  val io = IO(new Bundle {
    val rocc = Flipped(new RoCCCoreIO)
    val mem = new HellaCacheIO
    val domain_ctrl = new DomainControlIO
  })

  // Instruction queue and scheduling logic
  val inst_queue = Module(new Queue(new RoCCInstruction, 16))
  val scheduler = Module(new InstructionScheduler)
}

Scheduling strategy:

  • Static scheduling based on instruction type
  • Dynamic resource allocation and load balancing
  • Supports instruction pipelining and concurrent execution

DISA.scala

Main functionality: Distributed instruction scheduling architecture

Key components:

class DISA extends Module {
  val io = IO(new Bundle {
    val inst_in = Flipped(Decoupled(new Instruction))
    val exec_units = Vec(numUnits, new ExecutionUnitIO)
    val completion = Decoupled(new CompletionInfo)
  })

  // Distributed dispatch table
  val dispatch_table = Reg(Vec(numUnits, new DispatchEntry))
  val load_balancer = Module(new LoadBalancer)
}

Scheduling algorithms:

  • Round-robin scheduling for fairness
  • Priority scheduling for critical tasks
  • Dynamic scheduling adapts to load changes

DomainDecoder.scala

Main functionality: Ball domain instruction decoder

Key components:

class DomainDecoder extends Module {
  val io = IO(new Bundle {
    val inst = Input(UInt(32.W))
    val decoded = Output(new DecodedInstruction)
    val valid = Output(Bool())
  })

  // Instruction decode table
  val decode_table = Array(
    MATRIX_OP -> MatrixOpDecoder,
    VECTOR_OP -> VectorOpDecoder,
    IM2COL_OP -> Im2colOpDecoder
  )
}

Decode functionality:

  • Supports multiple instruction formats
  • Microcode expansion for complex instructions
  • Instruction dependency analysis and optimization

Usage

Design Features

  1. Modular architecture: Each accelerator is an independent module, easy to extend and maintain
  2. Unified interface: All accelerators communicate through unified ball domain bus
  3. Flexible scheduling: Supports multiple scheduling strategies, adapts to different computation patterns
  4. Scalability: Easy to add new accelerator types and functionality

Performance Optimization

  1. Pipeline design: Instruction decode, scheduling, execution use pipeline architecture
  2. Concurrent execution: Supports multiple accelerators working simultaneously
  3. Data management: On-chip data caching and coordinated access management
  4. Load balancing: Workload distribution across accelerators

Usage Example

// Create ball domain instance
val ballDomain = LazyModule(new BallDomain)

// Connect to RoCC interface
rocc.cmd <> ballDomain.module.io.rocc.cmd
rocc.resp <> ballDomain.module.io.rocc.resp

// Configure ball domain parameters
ballDomain.module.io.config := ballDomainConfig

Notes

  1. Resource management: Properly allocate computational resources, avoid resource conflicts
  2. Timing constraints: Pay attention to timing relationships and data synchronization between different modules
  3. Power control: Implement dynamic power management, shut down modules when not in use
  4. Debug support: Provide debug interfaces and status monitoring

BBus Ball Domain Bus System

Overview

This directory contains the implementation of Buckyball's ball domain bus system, primarily responsible for managing SRAM resource access by multiple Ball nodes within the ball domain. The bus system is implemented based on BBusNode from framework.blink, providing SRAM resource arbitration and routing functionality.

This directory implements two core components:

  • BallBus: Ball domain bus main module, manages SRAM access by multiple Ball nodes
  • BBusRouter: Bus router, provides routing functionality for Blink interface

Code Structure

bbus/
├── BallBus.scala    - Ball domain bus main module
└── router.scala     - Bus router implementation

File Dependencies

BallBus.scala (Main module)

  • Creates multiple BBusNode instances to manage Ball nodes
  • Connects external SRAM interfaces to each Ball node
  • Implements SRAM resource allocation and arbitration

router.scala (Routing module)

  • Implements routing functionality based on BBusNode
  • Provides Blink protocol interface encapsulation

Module Description

BallBus.scala

Main functionality: Ball domain bus main module, manages SRAM resource access by multiple Ball nodes

Key components:

class BallBus(maxReadBW: Int, maxWriteBW: Int, numBalls: Int) extends LazyModule {
  // Create multiple BBusNode instances
  val ballNodes = Seq.fill(numBalls) {
    new BBusNode(BallParams(sramReadBW = maxReadBW, sramWriteBW = maxWriteBW))
  }

  // External SRAM interfaces
  val io = IO(new Bundle {
    val sramRead = Vec(b.sp_banks, Flipped(new SramReadIO(...)))
    val sramWrite = Vec(b.sp_banks, Flipped(new SramWriteIO(...)))
    val accRead = Vec(b.acc_banks, Flipped(new SramReadIO(...)))
    val accWrite = Vec(b.acc_banks, Flipped(new SramWriteIO(...)))
  })
}

Resource allocation strategy:

  • First sp_banks ports connected to scratchpad SRAM
  • Next acc_banks ports connected to accumulator SRAM
  • Excess ports set to invalid state
  • All Ball nodes share the same SRAM resources
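
The allocation order above can be sketched as a simple index loop: the first sp_banks node ports route to the scratchpad, the next acc_banks ports route to the accumulator, and anything beyond that is tied off with DontCare. The bundle below is a stand-in for SramReadIO, and the parameter names are assumptions rather than the real BallBus code.

import chisel3._

// Simplified read-port stand-in; the real SramReadIO carries request/response channels.
class SimpleReadPort(w: Int) extends Bundle {
  val en   = Output(Bool())
  val addr = Output(UInt(12.W))
  val data = Input(UInt(w.W))
}

// Sketch of the port-allocation order for a single Ball node.
class PortAllocSketch(maxReadBW: Int, spBanks: Int, accBanks: Int, w: Int = 128) extends Module {
  val io = IO(new Bundle {
    val nodeRead = Vec(maxReadBW, Flipped(new SimpleReadPort(w)))  // requests from the Ball node
    val sramRead = Vec(spBanks,  new SimpleReadPort(w))            // toward scratchpad banks
    val accRead  = Vec(accBanks, new SimpleReadPort(w))            // toward accumulator banks
  })

  for (i <- 0 until maxReadBW) {
    if (i < spBanks) {
      io.sramRead(i) <> io.nodeRead(i)            // first sp_banks ports: scratchpad
    } else if (i < spBanks + accBanks) {
      io.accRead(i - spBanks) <> io.nodeRead(i)   // next acc_banks ports: accumulator
    } else {
      io.nodeRead(i) <> DontCare                  // excess ports: invalid / tied off
    }
  }
}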

Inputs/Outputs:

  • Input: SRAM access requests from each Ball node
  • Output: Read/write interfaces connected to external SRAM
  • Edge cases: Handle ports beyond configuration range, set to DontCare

Dependencies: framework.balldomain.blink.BBusNode, framework.builtin.memdomain.mem

router.scala

Main functionality: Bus router, provides routing functionality for Blink protocol interface

Key components:

class BBusRouter extends LazyModule {
  val node = new BBusNode(BallParams(
    sramReadBW = b.sp_banks,
    sramWriteBW = b.sp_banks
  ))

  val io = IO(new Bundle {
    val blink = Flipped(new BlinkBundle(node.edges.in.head))
  })
}

Routing functionality:

  • Implements standard Ball node interface based on BBusNode
  • Provides Blink protocol encapsulation and conversion
  • Supports configurable read/write bandwidth parameters

Inputs/Outputs:

  • Input: Blink protocol interface
  • Output: BBusNode standard interface
  • Edge cases: Depends on validity of node.edges.in.head

Dependencies: framework.balldomain.blink.BlinkBundle, framework.balldomain.blink.BBusNode

Usage

Configuration Parameters

Bus system configuration is controlled by the following parameters:

  • maxReadBW: Maximum read bandwidth (port count)
  • maxWriteBW: Maximum write bandwidth (port count)
  • numBalls: Ball node count
  • b.sp_banks: Scratchpad bank count
  • b.acc_banks: Accumulator bank count

Resource Management

  1. SRAM port allocation: Allocate ports in order of scratchpad first, accumulator second
  2. Multi-Ball sharing: All Ball nodes share the same SRAM resource pool
  3. Unused ports: Ports beyond the configured range are set to an invalid state (DontCare) to save resources

Usage Example

// Create ball domain bus
val ballBus = LazyModule(new BallBus(
  maxReadBW = 8,
  maxWriteBW = 8,
  numBalls = 4
))

// Connect external SRAM
scratchpad.io.read <> ballBus.module.io.sramRead
scratchpad.io.write <> ballBus.module.io.sramWrite
accumulator.io.read <> ballBus.module.io.accRead
accumulator.io.write <> ballBus.module.io.accWrite

Notes

  1. Resource conflicts: Multiple Ball nodes may access the same SRAM resources simultaneously, requiring upper-level coordination
  2. Bandwidth limitations: Actual available bandwidth is limited by configured maximum read/write bandwidth parameters
  3. Port mapping: Ensure SRAM port count matches configuration parameters to avoid out-of-bounds access
  4. Timing constraints: BBusNode timing requirements need to match external SRAM interfaces

Reservation Station & ROB

Overview

This module implements the Reservation Station and Reorder Buffer (ROB) in the Buckyball system for out-of-order execution and instruction scheduling support. The reservation station manages instruction issue and completion, while ROB ensures instructions commit in program order, maintaining precise exception semantics.

File Structure

rs/
├── reservationStation.scala  - Reservation station implementation
└── rob.scala                - Reorder buffer implementation

Core Components

BallReservationStation - Ball Domain Reservation Station

The reservation station is a key component connecting the instruction decoder and execution units, responsible for:

Main functionality:

  • Receives instructions from Ball domain decoder
  • Dispatches to different execution units based on instruction type
  • Manages instruction issue and completion status
  • Generates RoCC responses

Supported execution units:

  • ball1: VecUnit (vector processing unit)
  • ball2: BBFP (floating-point processing unit)
  • ball3: im2col (image processing accelerator)
  • ball4: transpose (matrix transpose accelerator)

Interface design:

class BallReservationStation extends Module {
  val io = IO(new Bundle {
    // Instruction input
    val ball_decode_cmd_i = Flipped(DecoupledIO(new BallDecodeCmd))

    // RoCC response output
    val rs_rocc_o = new Bundle {
      val resp = DecoupledIO(new RoCCResponseBB)
      val busy = Output(Bool())
    }

    // Execution unit interfaces
    val issue_o = new BallIssueInterface    // Issue interface
    val commit_i = new BallCommitInterface  // Commit interface
  })
}

Instruction dispatch logic:

// Dispatch instructions based on bid (Ball ID)
io.issue_o.ball1.valid := rob.io.issue.valid && rob.io.issue.bits.cmd.bid === 1.U  // VecUnit
io.issue_o.ball2.valid := rob.io.issue.valid && rob.io.issue.bits.cmd.bid === 2.U  // BBFP
io.issue_o.ball3.valid := rob.io.issue.valid && rob.io.issue.bits.cmd.bid === 3.U  // im2col
io.issue_o.ball4.valid := rob.io.issue.valid && rob.io.issue.bits.cmd.bid === 4.U  // transpose
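
The matching back-pressure path selects the chosen unit's ready signal back toward the ROB. A hedged sketch, reusing the names from the snippet above (the MuxLookup form and the false.B default are assumptions about the actual code):

// Back-pressure: the ROB head may only issue when the unit selected by bid is ready
rob.io.issue.ready := MuxLookup(rob.io.issue.bits.cmd.bid, false.B, Seq(
  1.U -> io.issue_o.ball1.ready,   // VecUnit
  2.U -> io.issue_o.ball2.ready,   // BBFP
  3.U -> io.issue_o.ball3.ready,   // im2col
  4.U -> io.issue_o.ball4.ready    // transpose
))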

ROB - Reorder Buffer

ROB implements sequential instruction management and out-of-order completion support:

Design features:

  • Uses FIFO queue to maintain instruction order
  • Uses completion status table to track instruction execution status
  • Supports out-of-order completion but in-order issue
  • Provides ROB ID for instruction identification

Core data structures:

class RobEntry extends Bundle {
  val cmd = new BallDecodeCmd           // Instruction content
  val rob_id = UInt(log2Up(rob_entries).W)  // ROB identifier
}

State management:

val robFifo = Module(new Queue(new RobEntry, rob_entries))  // Instruction queue
val robTable = Reg(Vec(rob_entries, Bool()))               // Completion status table
val robIdCounter = RegInit(0.U(log2Up(rob_entries).W))     // ID counter

Workflow

Instruction Allocation Flow

  1. Instruction enqueue: Instructions from decoder enter ROB
  2. Assign ROB ID: Allocate unique ROB ID to each instruction
  3. State initialization: Mark as incomplete in completion status table

when(io.alloc.fire) {
  robIdCounter := robIdCounter + 1.U
  robTable(robIdCounter) := false.B  // Mark as incomplete
}

Instruction Issue Flow

  1. Head check: Check if ROB head instruction is incomplete
  2. Type dispatch: Dispatch instruction to corresponding execution unit based on bid
  3. Ready control: Only issue when target execution unit is ready

val headEntry = robFifo.io.deq.bits
val headCompleted = robTable(headEntry.rob_id)
io.issue.valid := robFifo.io.deq.valid && !headCompleted

Instruction Completion Flow

  1. Completion arbitration: Multiple execution unit completion signals handled by arbiter
  2. State update: Update completion status table based on ROB ID
  3. Queue dequeue: Remove completed head instruction from ROB

val completeArb = Module(new Arbiter(UInt(log2Up(rob_entries).W), 4))
when(io.complete.fire) {
  robTable(io.complete.bits) := true.B  // Mark as completed
}
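
A rough sketch of how the four commit channels could feed that arbiter is shown below; the rob_id field name and the exact placement of this wiring are assumptions for illustration.

// Each execution unit's completion channel drives one arbiter input (rob_id payload)
completeArb.io.in(0).valid := io.commit_i.ball1.valid
completeArb.io.in(0).bits  := io.commit_i.ball1.bits.rob_id
io.commit_i.ball1.ready    := completeArb.io.in(0).ready
// ball2 / ball3 / ball4 connect to in(1) / in(2) / in(3) in the same way;
// the arbitrated output then drives the completion-table update shown above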

Configuration Parameters

Key Configuration Items

  • rob_entries: ROB entry count, affects out-of-order execution window size
  • Execution unit count: Currently supports 4 Ball execution units
  • Arbitration strategy: Uses round-robin arbitration for multiple completion signals

Performance Considerations

  • ROB size: Larger ROB supports more out-of-order execution but increases hardware overhead
  • Issue bandwidth: Currently maximum one instruction issued per cycle
  • Completion bandwidth: Supports multiple instruction completions per cycle

Interface Protocol

BallIssueInterface - Issue Interface

class BallIssueInterface extends Bundle {
  val ball1 = Decoupled(new BallRsIssue)  // VecUnit issue
  val ball2 = Decoupled(new BallRsIssue)  // BBFP issue
  val ball3 = Decoupled(new BallRsIssue)  // im2col issue
  val ball4 = Decoupled(new BallRsIssue)  // transpose issue
}

BallCommitInterface - Commit Interface

class BallCommitInterface extends Bundle {
  val ball1 = Flipped(Decoupled(new BallRsComplete))  // VecUnit commit
  val ball2 = Flipped(Decoupled(new BallRsComplete))  // BBFP commit
  val ball3 = Flipped(Decoupled(new BallRsComplete))  // im2col commit
  val ball4 = Flipped(Decoupled(new BallRsComplete))  // transpose commit
}

Usage Examples

Basic Configuration

// Configure ROB size in CustomBuckyballConfig
class MyBuckyballConfig extends CustomBuckyballConfig {
  override val rob_entries = 16  // 16-entry ROB
}

// Instantiate reservation station
val reservationStation = Module(new BallReservationStation)

Connecting Execution Units

// Connect VecUnit
vecUnit.io.cmd <> reservationStation.io.issue_o.ball1
reservationStation.io.commit_i.ball1 <> vecUnit.io.resp

// Connect BBFP
bbfp.io.cmd <> reservationStation.io.issue_o.ball2
reservationStation.io.commit_i.ball2 <> bbfp.io.resp

Debug and Monitoring

Status Signals

  • io.rs_rocc_o.busy: Reservation station busy status
  • rob.io.empty: ROB empty status
  • rob.io.full: ROB full status

Performance Counters

The following performance counters can be added for monitoring:

  • Instruction issue count
  • Instruction completion count
  • ROB utilization
  • Load distribution across execution units
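
For instance, an issue counter is just a register incremented on each issue fire; the 64-bit width and signal names below are assumptions.

import chisel3._

// Counter idiom for monitoring: count how many instructions have issued.
class IssueCounterExample extends Module {
  val io = IO(new Bundle {
    val issueFire  = Input(Bool())        // asserted for one cycle per issued instruction
    val issueCount = Output(UInt(64.W))
  })
  val cnt = RegInit(0.U(64.W))
  when(io.issueFire) { cnt := cnt + 1.U }
  io.issueCount := cnt
}

The same pattern covers completion counts and per-unit load counters; ROB utilization can be derived from the difference between allocation and completion counts.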

Extension Guide

Adding New Execution Units

  1. Add new issue port in BallIssueInterface
  2. Add corresponding commit port in BallCommitInterface
  3. Add corresponding dispatch and arbitration logic in reservation station
  4. Update completion signal arbiter port count
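
For example, adding a hypothetical fifth unit would touch the following places (a sketch based on the interfaces shown earlier; ball5 and bid value 5 are illustrative):

// 1. BallIssueInterface: add the new issue port
val ball5 = Decoupled(new BallRsIssue)              // new accelerator issue

// 2. BallCommitInterface: add the matching commit port
val ball5 = Flipped(Decoupled(new BallRsComplete))  // new accelerator commit

// 3. Reservation station: dispatch on the new bid value
io.issue_o.ball5.valid := rob.io.issue.valid && rob.io.issue.bits.cmd.bid === 5.U

// 4. Widen the completion arbiter from 4 to 5 inputs
val completeArb = Module(new Arbiter(UInt(log2Up(rob_entries).W), 5))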

Optimization Suggestions

  • Multi-issue support: Can be extended to issue multiple instructions per cycle
  • Dynamic scheduling: Implement more complex scheduling algorithms
  • Load balancing: Perform load balancing across multiple execution units of the same type

Simulation Configurations

This directory contains simulation configurations and interfaces for various simulators, providing unified configuration management for different simulation environments.

Directory Structure

sims/
├── firesim/
│   └── TargetConfigs.scala    - FireSim FPGA simulation configuration
├── verilator/
│   └── Elaborate.scala        - Verilator simulation top-level generation
└── verify/
    └── TargetConfig.scala     - Verification configurations

Verilator Simulation (verilator/)

Elaborate.scala

Top-level generator for Verilator simulation:

object Elaborate extends App {
  // Select Ball type from command line arguments
  val ballType = args.headOption.getOrElse("toy")

  val config = ballType match {
    case "toy" => new ToyBuckyballConfig
    case "vec" => new WithBlink(TargetBall.VecBall)
    case "matrix" => new WithBlink(TargetBall.MatrixBall)
    case "transpose" => new WithBlink(TargetBall.TransposeBall)
    case "im2col" => new WithBlink(TargetBall.Im2colBall)
    case "relu" => new WithBlink(TargetBall.ReluBall)
    case _ => new ToyBuckyballConfig
  }

  val gen = () => LazyModule(new TestHarness()(config)).module

  (new ChiselStage).execute(
    args.tail,  // Remaining args passed to firtool
    Seq(
      ChiselGeneratorAnnotation(gen),
      TargetDirAnnotation("generated-src/verilator")
    )
  )
}

Generation Flow:

  1. Parse command line arguments and configuration
  2. Instantiate Buckyball system module
  3. Generate Verilog RTL code
  4. Output auxiliary files for simulation

Output Files:

  • *.v - Verilog files
  • *.anno.json - FIRRTL annotation files
  • *.fir - FIRRTL intermediate representation

FireSim Simulation (firesim/)

TargetConfigs.scala

Configurations for running on FireSim FPGA platform:

class FireSimBuckyballConfig extends Config(
  new WithDefaultFireSimBridges ++
  new WithDefaultMemModel ++
  new WithFireSimConfigTweaks ++
  new BuckyballConfig
)

Key Configuration Items:

  • Bridge Configuration: UART, BlockDevice, NIC I/O bridges
  • Memory Model: DDR3/DDR4 memory controller configuration
  • Clock Domains: Multi-clock domain management
  • Debug Interface: JTAG and Debug Module configuration

Use Cases:

  • Large-scale system simulation
  • Long-running workload testing
  • Multi-core system performance evaluation
  • I/O-intensive application verification

Verification Configurations (verify/)

TargetConfig.scala

Configurations for single Ball device verification:

sealed trait TargetBall
object TargetBall {
  case object VecBall extends TargetBall
  case object MatrixBall extends TargetBall
  case object TransposeBall extends TargetBall
  case object Im2colBall extends TargetBall
  case object ReluBall extends TargetBall
}

WithBlink Configuration: Empty configuration class for composing with Ball-specific configs

Usage:

# Verify specific Ball device
mill arch.runMain sims.verilator.Elaborate matrix
mill arch.runMain sims.verilator.Elaborate transpose

Build and Usage

Verilator Simulation Build

# Generate Verilog
cd arch
mill arch.runMain sims.verilator.Elaborate [ball_type]

# Build simulator (in sims/verilator directory)
cd ../../sims/verilator
make CONFIG=ToyBuckyball

Available Ball Types:

  • toy: Complete toy system (default)
  • vec: Vector Ball only
  • matrix: Matrix Ball only
  • transpose: Transpose Ball only
  • im2col: Im2col Ball only
  • relu: ReLU Ball only

FireSim Deployment

# Set up FireSim environment
cd firesim
source sourceme-f1-manager.sh

# Build FPGA bitstream
firesim buildbitstream

# Run simulation
firesim runworkload

Debug and Optimization

Verilator Debug

  • Waveform Generation: Use --trace option to generate VCD files
  • Performance Profiling: Use --prof-cfuncs for profiling
  • Coverage: Use --coverage to generate coverage reports

FireSim Debug

  • Printf Debugging: Use printf statements for debug output
  • Assertion Checking: Enable runtime assertion verification
  • Performance Counters: Integrated HPM counters for monitoring

Configuration Parameters

Common Parameters

// Processor core configuration
case object RocketTilesKey extends Field[Seq[RocketTileParams]]

// Memory system configuration
case object MemoryBusKey extends Field[MemoryBusParams]

// Peripheral configuration
case object PeripheryBusKey extends Field[PeripheryBusParams]

Simulation-Specific Parameters

// Verilator simulation parameters
case object VerilatorDRAMKey extends Field[Boolean](false)

// FireSim simulation parameters
case object FireSimBridgesKey extends Field[Seq[BridgeIOAnnotation]]

Extension Development

Adding New Simulator Support

  1. Create new configuration directory (e.g., vcs/)
  2. Implement simulator-specific configuration classes
  3. Add build scripts and Makefiles
  4. Update documentation and test cases
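
As a rough template, a VCS backend could mirror the existing Verilator elaborator with its own target directory. Everything below (the sims.vcs package and output path) is hypothetical and modeled on the Elaborate object shown earlier.

package sims.vcs  // hypothetical new simulator backend

import chisel3.stage.{ChiselGeneratorAnnotation, ChiselStage}
import firrtl.options.TargetDirAnnotation
import freechips.rocketchip.diplomacy.LazyModule

// Hypothetical VCS elaborator: same generator as the Verilator flow,
// emitting RTL into a VCS-specific output directory.
object Elaborate extends App {
  val config = new examples.toy.BuckyballToyConfig
  val gen = () => LazyModule(new chipyard.harness.TestHarness()(config.toInstance)).module

  (new ChiselStage).execute(
    args,
    Seq(
      ChiselGeneratorAnnotation(gen),
      TargetDirAnnotation("generated-src/vcs")
    )
  )
}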

Custom Configuration

class MyCustomConfig extends Config(
  new WithMyCustomParameters ++
  new BuckyballConfig
)

FireSim Simulation Configuration

Overview

This directory contains Buckyball system simulation configuration for the FireSim platform. FireSim is an open-source FPGA-based simulation platform that provides hardware simulation environments, supporting system-level simulation and performance analysis.

File Structure

firesim/
└── TargetConfigs.scala  - FireSim target configuration

Configuration Description

TargetConfigs.scala

This file defines Buckyball system configuration for the FireSim platform:

WithBootROM Configuration:

class WithBootROM extends Config((site, here, up) => {
  case BootROMLocated(x) => {
    // Automatically select BootROM path
    val chipyardBootROM = new File(s"./thirdparty/chipyard/generators/testchipip/bootrom/bootrom.rv${MaxXLen}.img")
    val firesimBootROM = new File(s"./thirdparty/chipyard/target-rtl/chipyard/generators/testchipip/bootrom/bootrom.rv${MaxXLen}.img")

    // Prefer chipyard path, use firesim path if it doesn't exist
    val bootROMPath = if (chipyardBootROM.exists()) {
      chipyardBootROM.getAbsolutePath()
    } else {
      firesimBootROM.getAbsolutePath()
    }
  }
})

FireSimBuckyballToyConfig Configuration:

class FireSimBuckyballToyConfig extends Config(
  new WithBootROM ++                              // BootROM configuration
  new firechip.chip.WithDefaultFireSimBridges ++ // Default FireSim bridges
  new firechip.chip.WithFireSimConfigTweaks ++   // FireSim configuration tweaks
  new examples.toy.BuckyballToyConfig            // Buckyball toy configuration
)

Advanced Configuration

Custom BootROM:

class MyFireSimConfig extends Config(
  new WithBootROM ++
  new MyCustomBuckyballConfig ++
  // Other configurations...
)

Verilator Simulation Configuration

Overview

This directory contains Buckyball system simulation configuration for the Verilator platform. Verilator is an open-source Verilog/SystemVerilog simulator that compiles RTL code into high-performance C++ simulation models, providing a fast functional simulation and verification environment.

File Structure

verilator/
└── Elaborate.scala  - Verilator elaboration configuration

Core Implementation

Elaborate.scala

This file implements the Verilog generation and elaboration process for the Buckyball system:

object Elaborate extends App {
  val config = new examples.toy.BuckyballToyConfig
  val params = config.toInstance

  ChiselStage.emitSystemVerilogFile(
    new chipyard.harness.TestHarness()(config.toInstance),
    firtoolOpts = args,  // command-line arguments are forwarded to firtool
    args = Array.empty
  )
}

Compiler Build Guide

Basic Workload Compilation

To build the workload, follow these steps:

mkdir build && cd build
cmake -G Ninja ..
ninja

Model-Level Testing

To enable model-level testing with specific models and architectures:

mkdir build && cd build
cmake -G Ninja .. \
    -DMODEL="lenet,resnet18,mobilenetv3,bert,stablediffusion,llama2,deepseekr1" \
    -DARCH="gemmini,buckyball"
ninja

Note:

  1. Model downloads for bert, whisper, stable-diffusion, llama2, DeepseekR1 require pre-configured HuggingFace access
  2. whisper is currently not supported
  3. The llama2 model download requires an additional API key or cached credentials

Tile Dialect Refactoring Documentation

Refactoring Background and Goals

The core goal of this refactoring is to introduce a new intermediate layer between Linalg Dialect and Buckyball Dialect - the Tile Dialect - to achieve clearer separation of responsibilities and better code organization. In the original architecture, the conversion from linalg.matmul to hardware instructions was completed in one step through convert-linalg-to-buckyball, which caused Buckyball Dialect to handle both the slicing logic for arbitrary-size matrices and hardware-level memory management and computation scheduling, resulting in overly mixed responsibilities. The new architecture splits the conversion process into two phases: convert-linalg-to-tile and convert-tile-to-buckyball, making each layer have a clear and single responsibility.

New Architecture Design

The entire compilation flow is now divided into three clear layers. First is the Linalg layer, which represents high-level linear algebra operations, such as linalg.matmul representing matrix multiplication of arbitrary size. This layer does not care about hardware constraints. Next is the newly introduced Tile layer, whose core responsibility is to tile arbitrary-size matrix operations into fixed-size blocks that conform to hardware constraints. The Tile layer expresses this high-level tiling intent through the tile.tile_matmul operation. The specific tiling strategy, loop generation, and boundary handling are all implemented in the convert-tile-to-buckyball pass. Finally, the Buckyball layer focuses on hardware-level operations. buckyball.bb_matmul receives pre-tiled fixed-size matrix blocks and is responsible for generating precise hardware instruction sequences, including data movement (mvin/mvout), computation scheduling (mul_warp16), and memory address calculation.

Tile Dialect Design Details

The Tile Dialect defines the TileMatMulOp operation, which accepts three memref parameters representing matrices A, B, and C respectively. The semantics of this operation are: perform multiplication on input matrices of arbitrary size, automatically handling tiling, padding, and loops. In implementation, TileMatMulOp will be converted by the convert-tile-to-buckyball pass into multiple buckyball.bb_matmul operations and corresponding memref.subview operations. This conversion process will consider hardware scratchpad size limitations, warp and lane parallelism constraints, and generate an optimal tiling strategy. The design philosophy of the Tile layer is to provide a platform-independent intermediate representation, allowing upper-layer optimizations to transform matrix operations without understanding specific hardware details.

Buckyball Dialect Simplification

In the new architecture, the Buckyball Dialect has been significantly simplified. The original four operations VecTileMatMulOp, MergeTileMatMulOp, MetaTileMatMulOp, and VecMulWarp16Op have been unified into a single MatMulOp. This simplification is reasonable because the tiling logic has been moved up to the Tile layer, and the Buckyball layer only needs to express the single concept of "performing hardware-level multiplication on a matrix block that already conforms to hardware constraints." The lowering process of buckyball.bb_matmul will directly generate LLVM intrinsics: first load matrices A and B into the scratchpad through Mvin_IntrOp, then generate multiple Mul_Warp16_IntrOp operations based on warp and lane parameters for computation, and finally write the results back to main memory through Mvout_IntrOp. All address calculations and encodings are completed in this lowering process.

Key Implementation Details

When implementing the convert-linalg-to-tile pass, the core logic is very simple: match the linalg.matmul operation and directly replace it with tile.tile_matmul, passing the same three memref operands. The role of this pass is mainly type and semantic conversion, indicating that we have moved from the general linear algebra operation domain into the hardware-oriented tile operation domain.

The convert-tile-to-buckyball pass is the most complex part of the entire refactoring. It needs to extract matrix dimension information (M, K, N) from the operands of tile.tile_matmul, then calculate the optimal tiling strategy based on hardware parameters (dim, warp, lane). For the K dimension, it will tile according to warp size; for M and N dimensions, it will consider scratchpad capacity limitations. Each tile corresponds to a buckyball.bb_matmul operation, and tiles are connected through memref.subview to create matrix views. Special attention should be paid to handling boundary cases: when matrix dimensions cannot be evenly divided by tile size, the actual size of the last tile needs to be calculated to avoid out-of-bounds access.

When implementing BuckyballMatMulLowering, we encountered an important concept in MLIR's type conversion system: OpAdaptor. In conversion patterns, the types of the original operation (such as memref<32x16xi8>) will be converted to LLVM types (such as LLVM struct types) by the TypeConverter during the lowering process. OpAdaptor provides converted values, but we need to obtain type information (such as shape) from the original operation because this static information may no longer exist in the same form after conversion. Therefore, the correct approach is: obtain the original MemRefType from matMulOp.getOperandTypes() to extract shape information for address calculation and loop generation; for actual value operations (such as ExtractAlignedPointerAsIndexOp), use the original memref value, because MLIR's memref operations still require MemRefType.

Another key design decision is: MatMulOp's lowering should directly generate intrinsic operations (Mvin_IntrOp, Mul_Warp16_IntrOp, Mvout_IntrOp), rather than generating MvinOp, MvoutOp and then waiting for them to be lowered. The reason is that in the LLVM lowering stage, the type system has already been converted, and creating high-level Buckyball operations again would cause type mismatch issues. Directly generating intrinsics avoids multiple type conversions and makes the code clearer and more efficient. Referring to the Gemmini dialect implementation, we adopted the same strategy.

Test System

To verify the correctness of the new architecture, we created complete test cases in the bb-tests/workloads/src/OpTest/tile/ directory. Tests are divided into two categories: staged tests and end-to-end tests.

tile-matmul.mlir tests the conversion from Linalg to Tile, verifying that linalg.matmul is correctly converted to tile.tile_matmul. This is the most basic type conversion test. tile-to-buckyball.mlir tests the conversion from Tile to Buckyball, verifying that the tiling logic is correct and that the correct number of buckyball.bb_matmul operations and memref.subview operations are generated. buckyball-to-llvm.mlir tests the conversion from Buckyball MatMulOp to LLVM intrinsics, verifying that the correct sequences of buckyball.intr.bb_mvin, buckyball.intr.bb_mul_warp16, and buckyball.intr.bb_mvout instructions are generated.

end-to-end.mlir is the most important test, testing the complete conversion flow: starting from linalg.matmul, sequentially passing through the three passes -convert-linalg-to-tile, -convert-tile-to-buckyball, -lower-buckyball, and finally generating LLVM intrinsics. This test ensures that each part of the entire pipeline works correctly and that there are no issues with the connections between parts.

Pass Registration and Toolchain Integration

The two newly added passes need to be registered in multiple places. First, register the pass creation functions registerLowerLinalgToTilePass() and registerLowerTileToBuckyballPass() in InitAll.cpp, and also register buddy::tile::TileDialect. In the buddy-opt tool, buddy::tile::TileDialect needs to be added to the dialect registry so that the tool can recognize and parse tile dialect operations. In the CMake build system, the new libraries BuddyTile, LowerLinalgToTilePass, and LowerTileToBuckyballPass need to be added to the link dependencies, ensuring correct dependency relationships.

It is particularly worth noting that in the configureBuckyballLegalizeForExportTarget function in LegalizeForLLVMExport.cpp, we need to add target.addLegalDialect<memref::MemRefDialect>() and target.addLegalDialect<arith::ArithDialect>(), because memref and arith operations will be used during the lowering process of MatMulOp. If these dialects are not marked as legal, the conversion framework will attempt to lower these operations, causing type conversion conflicts.

Agent Workflow

AI assistant workflow in Buckyball framework, providing conversational interaction with AI models.

API Usage

chat

Endpoint: POST /agent/chat

Function: Conversational interaction with AI assistant

Parameters:

  • message [Required] - Message content to send to AI
  • model - AI model to use, default "deepseek-chat"

Examples:

# Basic conversation
bbdev agent --chat "--message 'Hello, can you help me with Buckyball development?'"

# Specify model
bbdev agent --chat "--message 'Explain this Scala code' --model deepseek-chat"

# Code analysis
bbdev agent --chat "--message 'Please analyze this Chisel module and suggest optimizations'"

Response:

{
  "traceId": "unique-trace-id",
  "status": "success"
}

Notes

  • Requires configured AI model API key
  • Responses use streaming output
  • Note message length limits

Compiler Workflow

Compiler build workflow in the Buckyball framework for building the Buckyball compiler toolchain.

API Usage

build

Endpoint: POST /compiler/build

Function: Build Buckyball compiler

Parameters: No specific parameters

Example:

bbdev compiler --build

Response:

{
  "status": 200,
  "body": {
    "success": true,
    "processing": false,
    "return_code": 0
  }
}

Notes

  • Ensure the system has necessary build tools and dependencies

Doc-Agent Workflow

Documentation generation workflow in the Buckyball framework, providing automated code documentation generation functionality.

API Usage Guide

generate

Endpoint: POST /doc/generate

Function: Generate documentation for specified directory

Parameters:

  • target_path [Required] - Target directory path
  • mode [Required] - Generation mode, options: "create", "update"

Example:

# Create new documentation for specified directory
bbdev doc --generate "--target_path arch/src/main/scala/framework --mode create"

# Update existing documentation
bbdev doc --generate "--target_path arch/src/main/scala/framework --mode update"

Response:

{
  "traceId": "unique-trace-id",
  "status": "success",
  "message": "Documentation generated successfully"
}

Supported Document Types

  • RTL hardware documentation
  • Test documentation
  • Script documentation
  • Simulator documentation
  • Workflow documentation

Important Notes

  • Requires AI model API key configuration
  • Generated documentation is automatically integrated into the mdBook system
  • Supports symbolic link management and automatic SUMMARY.md updates

Marshal Workflow

Marshal workflow in the Buckyball framework, used to build and launch the Marshal component.

API Usage Guide

build

Endpoint: POST /marshal/build

Function: Build Marshal component

Parameters: No specific parameters

Example:

bbdev marshal --build

launch

Endpoint: POST /marshal/launch

Function: Launch Marshal service

Parameters: No specific parameters

Example:

bbdev marshal --launch

Typical Workflow

# 1. Build Marshal
bbdev marshal --build

# 2. Launch Marshal service
bbdev marshal --launch

Response Format

All API calls return a unified format:

{
  "status": 200,
  "body": {
    "success": true,
    "processing": false,
    "return_code": 0
  }
}

Sardine Workflow

Sardine workflow in the Buckyball framework for running Sardine-related tasks.

API Usage

run

Endpoint: POST /sardine/run

Function: Run Sardine tasks

Parameters:

  • workload - Specify the workload to run

Example:

# Run specified workload
bbdev sardine --run "--workload /path/to/workload"

# Run default workload
bbdev sardine --run

Response:

{
  "status": 200,
  "body": {
    "success": true,
    "processing": false,
    "return_code": 0
  }
}

UVM Workflow

UVM (Universal Verification Methodology) workflow in the Buckyball framework for building and running UVM verification environments.

API Usage

builddut

Endpoint: POST /uvm/builddut

Function: Build DUT (Design Under Test)

Parameters:

  • jobs - Number of parallel build tasks, default 16

Example:

# Build DUT with default parallelism
bbdev uvm --builddut

# Specify number of parallel tasks
bbdev uvm --builddut "--jobs 8"

build

Endpoint: POST /uvm/build

Function: Build UVM executable

Parameters:

  • jobs - Number of parallel build tasks, default 16

Example:

# Build UVM with default parallelism
bbdev uvm --build

# Specify number of parallel tasks
bbdev uvm --build "--jobs 8"

Typical Workflow

# 1. Build DUT
bbdev uvm --builddut

# 2. Build UVM environment
bbdev uvm --build

Response Format:

{
  "status": 200,
  "body": {
    "success": true,
    "processing": false,
    "return_code": 0
  }
}

Verilator Simulation Workflow

Hardware simulation workflow based on Verilator in the Buckyball framework, providing a complete automation flow from RTL generation to simulation execution. Verilator is a high-performance Verilog simulator that supports fast functional verification and performance analysis.

I. API Usage Guide

run

Endpoint: POST /verilator/run

Function: Execute the complete workflow: clean the build directory, generate Verilog, compile with Verilator into a simulation executable, and run the simulation

Parameters:

  • jobs - Number of parallel compilation tasks
    • Default value: 16
  • binary [Required] - Test binary file path
    • Default value: ""

Example:

# bbdev wrapper
bbdev verilator --run "--jobs 256 --binary ${buckyball}/bb-tests/workloads/build/src/CTest/ctest_mvin_mvout_alternate_test_singlecore-baremetal --batch"

# Raw command
curl -X POST http://localhost:5000/verilator/run -H "Content-Type: application/json" -d '{"jobs": 8, "binary": "/home/user/test.elf"}'

clean

Endpoint: POST /verilator/clean

Function: Clean build folder

Parameters: None

Example:

curl -X POST http://localhost:5000/verilator/clean

verilog

Endpoint: POST /verilator/verilog

Function: Generate Verilog code only, without compiling or running the simulation

Parameters: None

Example:

curl -X POST http://localhost:5000/verilator/verilog -d '{"jobs": 8}'

build

Endpoint: POST /verilator/build

Function: Compile the Verilog and C++ source files into an executable simulation binary

Parameters:

  • jobs - Number of parallel compilation tasks
    • Default value: 16

Example:

curl -X POST http://localhost:5000/verilator/build -d '{"jobs": 16}'

sim

Endpoint: POST /verilator/sim

Function: Run existing simulation executable

Parameters:

  • binary [Required] - Custom test binary file path

Example:

curl -X POST http://localhost:5000/verilator/sim \
  -H "Content-Type: application/json" \
  -d '{"binary": "/home/user/test_program.elf"}'

II. Developer Documentation

Directory Structure

steps/verilator/
├── 00_start_node_noop_step.py      # Workflow entry node definition
├── 00_start_node_noop_step.tsx     # Frontend UI component
├── 01_run_api_step.py              # Complete workflow API entry
├── 01_clean_api_step.py            # Clean API endpoint
├── 01_verilog_api_step.py          # Verilog generation API endpoint
├── 01_build_api_step.py            # Build API endpoint
├── 01_sim_api_step.py              # Simulation API endpoint
├── 02_clean_event_step.py          # Clean build directory
├── 03_verilog_event_step.py        # Verilog code generation
├── 04_build_event_step.py          # Verilator compilation
├── 05_sim_event_step.py            # Simulation execution
├── 99_complete_event_step.py       # Completion handling
├── 99_error_event_step.py          # Error handling
└── README.md                       # This document

Workflow Steps Detailed

1. Entry Node (00_start_node_noop_step.py)

  • Type: noop node
  • Function: Provide UI interface entry point
  • Frontend: "Start Build Verilator" button

2. API Endpoints

  • Complete Workflow API (01_run_api_step.py): /verilator/run → verilator.run
  • Clean API (01_clean_api_step.py): /verilator/clean → verilator.clean
  • Verilog Generation API (01_verilog_api_step.py): /verilator/verilog → verilator.verilog
  • Build API (01_build_api_step.py): /verilator/build → verilator.build
  • Simulation API (01_sim_api_step.py): /verilator/sim → verilator.sim

3. Clean Step (02_clean_event_step.py)

  • Type: event step
  • Subscribes: verilator.run, verilator.clean
  • Emits: verilator.verilog, verilator.complete
  • Function: Delete the build directory; used by the full workflow or as a standalone operation

4. Verilog Generation (03_verilog_event_step.py)

  • Type: event step
  • Subscribes: verilator.verilog
  • Emits: verilator.build, verilator.complete
  • Function: Use mill to generate Verilog code to build directory

5. Verilator Compilation (04_build_event_step.py)

  • Type: event step
  • Subscribes: verilator.build
  • Emits: verilator.sim, verilator.complete
  • Function: Compile Verilog and C++ source files into executable simulation file

6. Simulation Execution (05_sim_event_step.py)

  • Type: event step
  • Subscribes: verilator.sim
  • Emits: verilator.complete
  • Function: Run simulation, supports custom binary parameter

7. Completion Handling (99_complete_event_step.py)

  • Type: event step
  • Subscribes: verilator.complete
  • Function: Print success message, mark workflow as complete

8. Error Handling (99_error_event_step.py)

  • Type: event step
  • Subscribes: verilator.error
  • Function: Print error message, handle workflow exceptions

Workflow Diagram

graph TD;
    API[POST /verilator<br/>Complete Workflow] --> RUN[verilator.run]

    CLEAN_DIRECT[verilator.clean<br/>Single-step Clean] --> CLEAN_STEP[02_clean_event_step]
    VERILOG_DIRECT[verilator.verilog<br/>Single-step Generate] --> VERILOG_STEP[03_verilog_event_step]
    BUILD_DIRECT[verilator.build<br/>Single-step Build] --> BUILD_STEP[04_build_event_step]
    SIM_DIRECT[verilator.sim<br/>Single-step Simulation] --> SIM_STEP[05_sim_event_step]

    RUN --> CLEAN_STEP
    CLEAN_STEP --> |Workflow Mode| VERILOG_STEP
    CLEAN_STEP --> |Single-step Mode| COMPLETE[verilator.complete]

    VERILOG_STEP --> |Workflow Mode| BUILD_STEP
    VERILOG_STEP --> |Single-step Mode| COMPLETE

    BUILD_STEP --> |Workflow Mode| SIM_STEP
    BUILD_STEP --> |Single-step Mode| COMPLETE

    SIM_STEP --> COMPLETE

    COMPLETE --> COMPLETE_STEP[99_complete_event_step]

    CLEAN_STEP -.-> |Error| ERROR[verilator.error]
    VERILOG_STEP -.-> |Error| ERROR
    BUILD_STEP -.-> |Error| ERROR
    SIM_STEP -.-> |Error| ERROR

    ERROR --> ERROR_STEP[99_error_event_step]

    classDef apiNode fill:#e1f5fe
    classDef eventNode fill:#f3e5f5
    classDef stepNode fill:#e8f5e8
    classDef endNode fill:#fff3e0

Workload Workflow

Workload build workflow in Buckyball framework, used to build test workloads and benchmark programs.

API Usage

build

Endpoint: POST /workload/build

Function: Build workload

Parameters:

  • workload - Specify workload name to build

Examples:

# Build specific workload
bbdev workload --build "--workload test_program"

# Build all workloads
bbdev workload --build

Response:

{
  "status": 200,
  "body": {
    "success": true,
    "processing": false,
    "return_code": 0
  }
}

Notes

  • Workload source code is located in the bb-tests/workloads directory
  • Build results typically output to bb-tests/workloads/build directory

Contributors

Thank you to all developers and researchers who have contributed to the Buckyball project.

Core Development Team

The Buckyball project is primarily developed by the DangoSys team, dedicated to building a high-performance domain-specific architecture framework.

Contribution Methods

We welcome contributions of all kinds:

Code Contributions

  • Hardware architecture design and optimization
  • Software toolchain improvements
  • Test cases and benchmark programs
  • Documentation writing and maintenance

Issue Feedback

  • Bug reports and fix suggestions
  • Feature requirements and improvement suggestions
  • Performance optimization suggestions
  • Usage experience feedback

Academic Collaboration

  • Research papers and technical reports
  • Conference presentations and technical sharing
  • Open source community promotion

Participation Guidelines

  1. Fork Project: Create a project branch from GitHub
  2. Local Development: Set up development environment according to documentation
  3. Submit Changes: Follow code standards and commit format
  4. Create PR: Describe changes and test results in detail
  5. Code Review: Cooperate with maintainers to complete code review process

Contact

  • GitHub: DangoSys/buckyball
  • Issues: Report issues through GitHub Issues
  • Discussions: Participate in Slack for discussions

Acknowledgments

Special thanks to the following open source projects and communities:

  • Buddy-Compiler development team
  • Chipyard project
  • RISC-V Foundation
  • All test users and feedback providers