Getting Started

"simtix" (all lowercase) aims to provide a behavioral model of our target hardware.

Basic concepts

Directory structure

include: "Public" headers for simtix
src: All the sources
- kernel-sim: Sources for kernel-sim
- simtix: Sources for simtix
sw: Software to run on kernel-sim
- config: Simulation configurations for kernel-sim
- kernel: The very basic kernel runtime library
- tests: Sources for test programs (RISC-V SPMD programs written in C)
tests: Test programs
- unit: The unit tests

The simtix core library

The core library implements the API defined in the "public" headers, such as class PipelinedSM or class NBHBCache.

Note: We rely heavily on the Pimpl idiom to hide implementation details, so don't be surprised to see a lot of Impl classes in the codebase.

The difference between public headers and private headers is that public headers are to be included by the user of the core library, such as casvp, while the private headers (src/**/*.h) are only included internally.

APIs defined in public headers are organized by component type (e.g., mem.h, sm.h) or by functionality (e.g., statistics.h, trace.h).

Mappings of declarations to implementations:

mem.h -> src/simtix/mem/*
sm.h -> src/simtix/sm/*
system.h -> src/simtix/system/*
clocked.h -> src/simtix/sim/clocked.cc
* -> src/simtix/common/*

All hardware simulations in simtix are driven by Ticks (that's why it's called "simtix").

The global tick entry is defined here. Calling this function moves the simulator one cycle forward. The tick function does the following (see here):

Iterate through all Clocked objects, calling their Tick methods respectively.
Increase global tick counter by 1.
Check whether there is any component busy.
Dump Konata C 1 command.

All the Clocked objects are sorted by their TickPriority.
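The four steps above can be sketched as follows. This is a minimal illustration, not the actual simtix code: the Clocked interface shown here, the TickDriver and Recorder classes, and the rule that a lower priority value ticks first are all assumptions made for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

namespace sketch {

// Hypothetical base class for anything that advances once per cycle.
class Clocked {
 public:
  explicit Clocked(int priority) : priority_(priority) {}
  virtual ~Clocked() = default;
  virtual void Tick() = 0;
  virtual bool busy() const = 0;
  int priority() const { return priority_; }

 private:
  int priority_;  // Stand-in for TickPriority (assumed: lower ticks first).
};

// Hypothetical registry holding the tick order and the global tick counter.
class TickDriver {
 public:
  void Register(Clocked* c) {
    objects_.push_back(c);
    std::stable_sort(objects_.begin(), objects_.end(),
                     [](const Clocked* a, const Clocked* b) {
                       return a->priority() < b->priority();
                     });
  }

  // Moves the simulation one cycle forward.
  // Returns true while at least one component is still busy.
  bool Tick() {
    for (Clocked* c : objects_) c->Tick();  // 1. tick everything in order
    ++cur_tick_;                            // 2. advance the global counter
    bool any_busy = false;                  // 3. check for busy components
    for (const Clocked* c : objects_) any_busy = any_busy || c->busy();
    // 4. The real simulator also dumps a Konata "C 1" command (omitted here).
    return any_busy;
  }

  uint64_t cur_tick() const { return cur_tick_; }

 private:
  std::vector<Clocked*> objects_;
  uint64_t cur_tick_ = 0;
};

// Toy component: has `work` cycles of work and logs its name on every Tick.
class Recorder : public Clocked {
 public:
  Recorder(char name, int priority, int work, std::string* log)
      : Clocked(priority), name_(name), work_(work), log_(log) {}
  void Tick() override {
    if (work_ > 0) --work_;
    log_->push_back(name_);
  }
  bool busy() const override { return work_ > 0; }

 private:
  char name_;
  int work_;
  std::string* log_;
};

// Runs two components until both are idle and returns the tick-order log.
std::string TickOrderDemo() {
  std::string log;
  Recorder a('a', /*priority=*/2, /*work=*/1, &log);
  Recorder b('b', /*priority=*/1, /*work=*/2, &log);
  TickDriver driver;
  driver.Register(&a);
  driver.Register(&b);
  while (driver.Tick()) {
  }
  assert(driver.cur_tick() == 2);
  return log;
}

}  // namespace sketch
```

Although a is registered first, b has the smaller priority value and is therefore ticked first in every cycle.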

Unit tests

We use Catch2 v3 as the unit test framework. (Please refer to its documentation for how to write a test.)

Basically all components (classes) have their own unit tests. We make sure your code is tested before it can be merged into main.

Question: How to know your code is actually tested?
Ans: The coverage report!

A test case will become an item in CTest (the list shown when you run ctest --test-dir build).

Kernel sim

Kernel sim is considered a "user" of the simtix core library. It uses the APIs provided by simtix to construct a simple test environment (single core), which can run the programs provided in the sw directory.

Kernel sim is a command line tool that provides a set of flexible options to change the configuration. Commonly used options are (run kernel-sim -h for more details):

-M: Limit the maximum number of ticks to run
-s: Change the warp scheduler of the SM
-t: Change the number of threads per warp
-w: Change the number of warps per core
-k: Specify the output path of Konata log
-c: Specify the configuration files

Since there are too many configurations available, kernel sim accepts configuration files (using -c) written in TOML which are used to set the parameters of the system.

Kernel sim runs until either no Clocked object is busy anymore or the specified maximum tick limit is reached.

Procedure of kernel sim:

Read configuration file and parse the command line arguments.
Initialize the test system with given parameters.
Load the ELF file (program) into the main memory.
Call simtix::sim::Tick repeatedly to drive the simulation.
Dump the log/stats after the simulation stops.
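The stop condition of the main loop (step 4) can be sketched like this. The helper RunUntilIdle and the assumption that the tick function returns whether anything is still busy are hypothetical; the real signature of simtix::sim::Tick may differ.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>

namespace sketch {

struct RunResult {
  uint64_t ticks;  // number of cycles simulated
  bool hit_limit;  // true if the run stopped because of the tick limit
};

// `tick` stands in for the global tick function. It is assumed to advance one
// cycle and return true while any component is still busy.
RunResult RunUntilIdle(const std::function<bool()>& tick, uint64_t max_ticks) {
  uint64_t ticks = 0;
  bool busy = true;
  while (busy && ticks < max_ticks) {
    busy = tick();
    ++ticks;
  }
  return {ticks, busy};
}

// Fake system that needs `work_cycles` cycles before it becomes idle.
RunResult DriverDemo(uint64_t work_cycles, uint64_t max_ticks) {
  assert(work_cycles > 0);
  uint64_t remaining = work_cycles;
  auto tick = [&remaining]() {
    if (remaining > 0) --remaining;
    return remaining > 0;
  };
  return RunUntilIdle(tick, max_ticks);
}

}  // namespace sketch
```

With 5 cycles of work, DriverDemo(5, 100) stops after 5 ticks because the system is idle, while DriverDemo(5, 3) stops after 3 ticks because of the limit (the role of the -M option).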

Software components

Instr

The Instr classes are the most important components in simtix. They define the behavior of RISC-V instructions in different stages using the following methods:

Decode
Issue
OperandCollect
Execute
Commit

To implement any instruction, the behavior of the instruction in each of these stages must be well defined. (See the many classes defined in src/simtix/sm/Instr*.h.)

For example, R-type instructions (OP):

Determines what operation to perform based on funct3_ & funct7_ when doing Decode.
Reserves rd_ when doing Issue.
Collects rs1_data_ and rs2_data_ when doing OperandCollect.
Runs the decoded operation, such as add_, when doing Execute.
Writes rd_data_ back to the register file and releases rd_ when doing Commit.

An instruction in the SIMT execution model is associated with a warp. Therefore, all the operations mentioned above are performed against a Warp object. The constructor of the Instr object accepts a pointer to the Warp object that the Instr is associated with.

To simulate the execution of a sequence of RISC-V instructions, we:

Fetch an iword (a 32-bit RISC-V instruction) from memory.
Construct an Instr object using that iword.
Call the above methods sequentially.
Repeat the above operations.

With this design, we can simulate different core configurations effectively.
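As a rough illustration of the stage methods and the fetch/construct/call loop, here is a self-contained sketch of an R-type instruction that supports only ADD and SUB. The Warp layout, the InstrOp class, the EncodeR helper and RunAtomically are hypothetical and much simpler than the real classes in src/simtix/sm; only the method names and the field names (rd_, rs1_data_, ...) follow the text above.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

namespace sketch {

constexpr std::size_t kThreads = 4;

// Hypothetical warp state: per-thread registers, a scoreboard, and a PC.
struct Warp {
  std::array<std::array<uint32_t, 32>, kThreads> regs{};
  std::array<bool, 32> reserved{};
  uint32_t pc = 0;
};

// Packs the fields of an R-type (opcode OP = 0x33) instruction into an iword.
constexpr uint32_t EncodeR(uint32_t funct7, uint32_t rs2, uint32_t rs1,
                           uint32_t funct3, uint32_t rd) {
  return (funct7 << 25) | (rs2 << 20) | (rs1 << 15) | (funct3 << 12) |
         (rd << 7) | 0x33u;
}

// Simplified R-type instruction that only understands ADD and SUB.
class InstrOp {
 public:
  InstrOp(Warp* warp, uint32_t iword) : warp_(warp), iword_(iword) {}

  void Decode() {  // pick the operation from funct3/funct7
    rd_ = (iword_ >> 7) & 0x1f;
    rs1_ = (iword_ >> 15) & 0x1f;
    rs2_ = (iword_ >> 20) & 0x1f;
    const uint32_t funct3 = (iword_ >> 12) & 0x7;
    const uint32_t funct7 = (iword_ >> 25) & 0x7f;
    is_sub_ = (funct3 == 0 && funct7 == 0x20);
  }
  void Issue() { warp_->reserved[rd_] = true; }  // reserve rd
  void OperandCollect() {
    for (std::size_t t = 0; t < kThreads; ++t) {
      rs1_data_[t] = warp_->regs[t][rs1_];
      rs2_data_[t] = warp_->regs[t][rs2_];
    }
  }
  void Execute() {
    for (std::size_t t = 0; t < kThreads; ++t) {
      rd_data_[t] = is_sub_ ? rs1_data_[t] - rs2_data_[t]
                            : rs1_data_[t] + rs2_data_[t];
    }
  }
  void Commit() {  // write back and release rd
    for (std::size_t t = 0; t < kThreads; ++t) {
      if (rd_ != 0) warp_->regs[t][rd_] = rd_data_[t];  // x0 stays zero
    }
    warp_->reserved[rd_] = false;
  }

 private:
  Warp* warp_;
  uint32_t iword_;
  uint32_t rd_ = 0, rs1_ = 0, rs2_ = 0;
  bool is_sub_ = false;
  std::array<uint32_t, kThreads> rs1_data_{}, rs2_data_{}, rd_data_{};
};

// Fetch an iword, construct an Instr, call every stage in order, repeat.
void RunAtomically(Warp* warp, const std::vector<uint32_t>& program) {
  assert(warp->pc % 4 == 0);
  while (warp->pc / 4 < program.size()) {
    InstrOp instr(warp, program[warp->pc / 4]);
    instr.Decode();
    instr.Issue();
    instr.OperandCollect();
    instr.Execute();
    instr.Commit();
    warp->pc += 4;
  }
}

Warp InstrDemo() {
  Warp warp;
  for (std::size_t t = 0; t < kThreads; ++t) {
    warp.regs[t][1] = static_cast<uint32_t>(10 + t);  // x1 = 10 + thread id
    warp.regs[t][2] = static_cast<uint32_t>(t);       // x2 = thread id
  }
  RunAtomically(&warp, {EncodeR(0x00, 2, 1, 0, 3),    // add x3, x1, x2
                        EncodeR(0x20, 1, 3, 0, 4)});  // sub x4, x3, x1
  return warp;
}

}  // namespace sketch
```

Calling all five methods back to back, as RunAtomically does, corresponds to executing each instruction atomically. A timing model would instead call the same methods in different cycles.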

SM models

An SM model performs the above operations to simulate the execution of an instruction stream.

To model different microarchitectures, we schedule the invocation of the above Instr methods.

For example, the AtomicSM model serves as an ISS (instruction set simulator), in which all instructions are executed atomically. In this case, AtomicSM::Tick selects a warp suggested by the warp scheduler, fetches an iword to construct an Instr object, and calls all methods of the Instr to finish the execution atomically. See here for more details.

For the PipelinedSM model, things get more complicated. We model the behavior of a pipelined processor, meaning that instructions are propagated between stages. The PipelinedSM determines whether an instruction can be propagated to the next stage by checking the current pipeline status. Also, there are several "stateful" components, such as the I-buffer and the Arbitrator, that affect how Instr objects are propagated.

Note:

The Instr class is decoupled from the clock, i.e., the Tick function. Therefore, for operations that take more than one cycle, such as reading/writing register files or accessing memory, we use a pattern of pushing requests together with a callback function.

When the requests to these "timed" components are finished, the components call the callbacks associated with the requests. A callback usually changes the state of the Instr to mark that the next method can be called.

This availability is reflected in these methods:

CanIssue: Dependency resolved
CanExecute: Done collecting operands
CanCommit: Done execution (memory accesses)
CanRetire: Done writing back results

See here for an InstrLoad example.
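The request/callback pattern can be sketched as follows. TimedRegFile and InstrAdd are hypothetical classes invented for this example; only the method names OperandCollect, CanExecute and Execute come from the text. The instruction never sees the clock: it becomes executable when the timed component has invoked both callbacks.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <functional>
#include <utility>

namespace sketch {

// Hypothetical "timed" component: every read completes after a fixed latency.
class TimedRegFile {
 public:
  using OnDone = std::function<void(uint32_t)>;

  explicit TimedRegFile(uint64_t latency) : latency_(latency) {}
  void Set(std::size_t idx, uint32_t value) { regs_[idx] = value; }

  // Queues a read; `on_done` is invoked with the value once it is ready.
  void PushRead(std::size_t idx, OnDone on_done) {
    pending_.push_back(Request{now_ + latency_, idx, std::move(on_done)});
  }

  void Tick() {
    ++now_;
    while (!pending_.empty() && pending_.front().ready_at <= now_) {
      Request req = std::move(pending_.front());
      pending_.pop_front();
      req.on_done(regs_[req.idx]);
    }
  }

 private:
  struct Request {
    uint64_t ready_at;
    std::size_t idx;
    OnDone on_done;
  };
  uint64_t latency_;
  uint64_t now_ = 0;
  std::array<uint32_t, 32> regs_{};
  std::deque<Request> pending_;
};

// Hypothetical instruction: no Tick method, its state changes via callbacks.
class InstrAdd {
 public:
  InstrAdd(std::size_t rs1, std::size_t rs2) : rs1_(rs1), rs2_(rs2) {}

  void OperandCollect(TimedRegFile* rf) {
    collect_started_ = true;
    outstanding_ = 2;
    rf->PushRead(rs1_, [this](uint32_t v) { rs1_data_ = v; --outstanding_; });
    rf->PushRead(rs2_, [this](uint32_t v) { rs2_data_ = v; --outstanding_; });
  }
  // True once both operand reads have called back.
  bool CanExecute() const { return collect_started_ && outstanding_ == 0; }
  uint32_t Execute() const {
    assert(CanExecute());
    return rs1_data_ + rs2_data_;
  }

 private:
  std::size_t rs1_, rs2_;
  bool collect_started_ = false;
  int outstanding_ = 0;
  uint32_t rs1_data_ = 0, rs2_data_ = 0;
};

struct CallbackResult {
  uint64_t ticks;  // cycles spent waiting for the operands
  uint32_t value;  // result of Execute
};

CallbackResult CallbackDemo() {
  TimedRegFile rf(/*latency=*/2);
  rf.Set(1, 7);
  rf.Set(2, 35);
  InstrAdd instr(1, 2);
  instr.OperandCollect(&rf);
  uint64_t ticks = 0;
  while (!instr.CanExecute()) {  // the SM polls; the register file is clocked
    rf.Tick();
    ++ticks;
  }
  return {ticks, instr.Execute()};
}

}  // namespace sketch
```

In this sketch the owner of the instruction polls CanExecute once per cycle, which is one plausible way for an SM model to decide when to call the next method.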

Queues

Software queues (std::queue, std::deque) are unbounded, meaning that there is no limit on their capacity. However, this is not the case for hardware queues.

As queues are common components in hardware, we provide the following templated queues for use when modeling the hardware:

SizedQueue
- Adds a capacity limit to std::deque so that nothing can be enqueued (Enq) when it is full.
DelayQueue
- In addition to the capacity limit, adds a delay constraint for Deq.
- An element can be dequeued only after it has stayed in the queue for a given delay.
FcfsDelayQueue
- In addition to the capacity limit, adds a delay constraint for Deq.
- An element can be dequeued only after it has stayed at the front of the queue for a given delay (first-come, first-served policy).

Note:

The delay of the arithmetic operations of Instr, including multiplication, division and even floating point operations, is not modeled in the Execute method. Instead, the delay is simulated by pushing the instruction into a DelayQueue, so that it must stay in the queue for a number of cycles.

Operations are usually classified by whether they can be pipelined. For pipelined operations, DelayQueue is used; for non-pipelined operations, FcfsDelayQueue is used.
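The difference between the two delay queues can be sketched as follows. The class names match the text, but the interfaces are assumptions: in this sketch the current tick is passed explicitly to Enq, CanDeq and Deq, which may not be how the real templates track time.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <utility>

namespace sketch {

// Pipelined flavor: every element ages independently from its enqueue tick.
template <typename T>
class DelayQueue {
 public:
  DelayQueue(std::size_t capacity, uint64_t delay)
      : capacity_(capacity), delay_(delay) {}

  bool Enq(T value, uint64_t now) {
    if (q_.size() >= capacity_) return false;  // full: reject the element
    q_.emplace_back(std::move(value), now);
    return true;
  }
  bool CanDeq(uint64_t now) const {
    return !q_.empty() && now - q_.front().second >= delay_;
  }
  T Deq(uint64_t now) {
    assert(CanDeq(now));
    T value = std::move(q_.front().first);
    q_.pop_front();
    return value;
  }
  bool empty() const { return q_.empty(); }

 private:
  std::size_t capacity_;
  uint64_t delay_;
  std::deque<std::pair<T, uint64_t>> q_;  // (value, enqueue tick)
};

// Non-pipelined flavor: the delay starts when an element reaches the front.
template <typename T>
class FcfsDelayQueue {
 public:
  FcfsDelayQueue(std::size_t capacity, uint64_t delay)
      : capacity_(capacity), delay_(delay) {}

  bool Enq(T value, uint64_t now) {
    if (q_.size() >= capacity_) return false;
    if (q_.empty()) front_since_ = now;  // the new element is the front
    q_.push_back(std::move(value));
    return true;
  }
  bool CanDeq(uint64_t now) const {
    return !q_.empty() && now - front_since_ >= delay_;
  }
  T Deq(uint64_t now) {
    assert(CanDeq(now));
    T value = std::move(q_.front());
    q_.pop_front();
    front_since_ = now;  // the next element starts its delay only now
    return value;
  }
  bool empty() const { return q_.empty(); }

 private:
  std::size_t capacity_;
  uint64_t delay_;
  uint64_t front_since_ = 0;
  std::deque<T> q_;
};

// Enqueues two elements at tick 0 and returns the tick at which the queue is
// empty again, dequeuing at most one element per tick.
template <typename Queue>
uint64_t TicksToDrain(Queue& q) {
  uint64_t now = 0;
  q.Enq(1, now);
  q.Enq(2, now);
  while (!q.empty()) {
    ++now;
    if (q.CanDeq(now)) q.Deq(now);
  }
  return now;
}

uint64_t DrainPipelined() {
  DelayQueue<int> q(/*capacity=*/4, /*delay=*/3);
  return TicksToDrain(q);
}

uint64_t DrainNonPipelined() {
  FcfsDelayQueue<int> q(/*capacity=*/4, /*delay=*/3);
  return TicksToDrain(q);
}

}  // namespace sketch
```

With a delay of 3, the DelayQueue releases the two elements on ticks 3 and 4 (the operations overlap, as in a pipelined unit), while the FcfsDelayQueue releases them on ticks 3 and 6 (the second operation starts only after the first one has left).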

InstrPtr

Since an Instr is constructed when it is fetched and destroyed after retirement, the cost of constructing and destructing these objects becomes very expensive and slows down the simulation. To handle this, we use an optimization technique called the Object Pool pattern.

Before the simulation starts, we first allocate a number of Instr objects in the pool. When we need a new Instr, we obtain one directly from the pool. Once we no longer need the Instr, we recycle it back to the pool for later reuse.

To make Instr objects reusable, Instr defines a Reset method that clears all the properties when the object is recycled back to the pool, and a Reinitialize method that reinitializes the fields when the object is allocated.

InstrPtr is a wrapper class for Instr that:

Guarantees the global uniqueness of the Instr instance.
Handles allocations of Instr from InstrPool.
Automatically recycles the instruction back to the pool.
Handles the Konata trace of a single instruction.

InstrPtr acts like a std::unique_ptr: it can only be moved between variables, not copied. This guarantees the uniqueness of an Instr in the processor pipeline.

Note: The pipeline trace of an instruction begins when the constructor of the InstrPtr is invoked and ends when the destructor of the InstrPtr is called.
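A minimal sketch of the pool plus a move-only handle is shown below. It is not the real InstrPtr: the actual class is a dedicated wrapper that also handles the Konata trace, whereas this sketch simply uses std::unique_ptr with a custom deleter. The Instr fields, the Allocate/Recycle method names and the MakeInstr helper are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

namespace sketch {

// Hypothetical, heavily reduced instruction object.
struct Instr {
  uint32_t iword = 0;
  bool executed = false;
  void Reinitialize(uint32_t w) { iword = w; }  // called on allocation
  void Reset() {                                // called on recycling
    iword = 0;
    executed = false;
  }
};

// All Instr objects are created once, before the simulation starts.
class InstrPool {
 public:
  explicit InstrPool(std::size_t n) : storage_(n) {
    for (Instr& i : storage_) free_.push_back(&i);
  }
  Instr* Allocate(uint32_t iword) {
    if (free_.empty()) return nullptr;  // pool exhausted
    Instr* i = free_.back();
    free_.pop_back();
    i->Reinitialize(iword);
    return i;
  }
  void Recycle(Instr* i) {
    i->Reset();
    free_.push_back(i);
  }
  std::size_t available() const { return free_.size(); }

 private:
  std::vector<Instr> storage_;  // never resized, so the pointers stay valid
  std::vector<Instr*> free_;
};

// Deleter that returns the object to its pool instead of freeing it.
struct Recycler {
  InstrPool* pool = nullptr;
  void operator()(Instr* i) const { pool->Recycle(i); }
};

// Stand-in for InstrPtr: move-only, recycles automatically on destruction.
using InstrPtr = std::unique_ptr<Instr, Recycler>;

InstrPtr MakeInstr(InstrPool* pool, uint32_t iword) {
  return InstrPtr(pool->Allocate(iword), Recycler{pool});
}

std::size_t PoolDemo() {
  InstrPool pool(2);
  {
    InstrPtr a = MakeInstr(&pool, 0x13);
    assert(pool.available() == 1);
    InstrPtr b = std::move(a);  // ownership moves; copying would not compile
    assert(!a && b->iword == 0x13);
  }  // b goes out of scope here and the Instr is recycled
  return pool.available();
}

}  // namespace sketch
```

Because the handle cannot be copied, at most one pipeline stage can hold a given Instr at any time, which is the uniqueness property described above.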

MemoryInterface

MemoryInterface is a simplified interface for components that can be read or written with a given address and size. The interface implements a simple protocol that emulates the process of handshaking.

To send a request over a MemoryInterface, the initiator passes a Payload and an OnResp callback to either the Read or the Write method. The method returns true when it accepts the request and false when it rejects it.

The invocation of Read or Write is analogous to the rising edge of the valid signal, while the return value is analogous to the ready signal. The request is considered accepted only when the return value is true.

Once the request is served, the OnResp callback is called with the status (Okay or types of errors) being the only argument.

Similarly, the OnResp callback returns true when the initiator can accept the response and false otherwise.
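The handshake in both directions can be sketched as follows. The ToyMemory class, the Payload fields and the Status values are assumptions made for this example; the real MemoryInterface declarations live in mem.h and may look different. Only the overall shape (Read/Write returning bool, OnResp taking a status and returning bool) follows the description above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <deque>
#include <functional>
#include <utility>
#include <vector>

namespace sketch {

enum class Status { kOkay, kOutOfRange };

struct Payload {
  uint64_t addr;
  std::size_t size;
  uint8_t* data;  // source for writes, destination for reads
};

using OnResp = std::function<bool(Status)>;  // true: response accepted

class ToyMemory {
 public:
  ToyMemory(std::size_t bytes, std::size_t max_pending)
      : mem_(bytes, 0), max_pending_(max_pending) {}

  // Both return false ("not ready") when the request queue is full.
  bool Read(const Payload& p, OnResp cb) { return Push(p, std::move(cb), false); }
  bool Write(const Payload& p, OnResp cb) { return Push(p, std::move(cb), true); }

  // Serves the oldest request. If the initiator refuses the response (OnResp
  // returns false), the same response is offered again on the next tick.
  void Tick() {
    if (pending_.empty()) return;
    Request& req = pending_.front();
    if (!req.served) {
      const Payload& p = req.payload;
      if (p.addr > mem_.size() || p.size > mem_.size() - p.addr) {
        req.status = Status::kOutOfRange;
      } else if (req.is_write) {
        std::memcpy(mem_.data() + p.addr, p.data, p.size);
      } else {
        std::memcpy(p.data, mem_.data() + p.addr, p.size);
      }
      req.served = true;
    }
    if (req.on_resp(req.status)) pending_.pop_front();
  }

 private:
  struct Request {
    Payload payload;
    OnResp on_resp;
    bool is_write;
    bool served;
    Status status;
  };
  bool Push(const Payload& p, OnResp cb, bool is_write) {
    if (pending_.size() >= max_pending_) return false;
    pending_.push_back(Request{p, std::move(cb), is_write, false, Status::kOkay});
    return true;
  }
  std::vector<uint8_t> mem_;
  std::size_t max_pending_;
  std::deque<Request> pending_;
};

struct HandshakeResult {
  bool second_rejected;  // request refused while the queue was full
  int resp_offers;       // how often the read response was offered
  uint32_t value;        // data read back
};

HandshakeResult HandshakeDemo() {
  ToyMemory mem(64, /*max_pending=*/1);
  uint32_t out = 0xdeadbeef, in = 0;
  Payload wr{4, sizeof(out), reinterpret_cast<uint8_t*>(&out)};
  Payload rd{4, sizeof(in), reinterpret_cast<uint8_t*>(&in)};

  bool accepted = mem.Write(wr, [](Status s) { return s == Status::kOkay; });
  assert(accepted);
  // One request is already pending, so this Read is rejected.
  const bool second_rejected = !mem.Read(rd, [](Status) { return true; });
  mem.Tick();  // serves the write; its response is accepted

  int offers = 0;
  // The initiator refuses the first response and accepts the second one.
  accepted = mem.Read(rd, [&offers](Status) { return ++offers == 2; });
  assert(accepted);
  (void)accepted;
  mem.Tick();  // read served, response refused
  mem.Tick();  // response offered again and accepted
  return {second_rejected, offers, in};
}

}  // namespace sketch
```

A rejected request is simply retried by the initiator in a later cycle, just as a hardware master keeps valid asserted until it sees ready.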

Trace system

You may see lots of DPRINTF calls in the codebase. They dump debug information while the simulation is running. The first argument of DPRINTF is the "debug flag", which can be turned on or off when configuring the build with CMake (-D SIMTIX_ENABLE_TRACING=XXX).

The design allows us to turn off unwanted debug messages when running the simulation, which not only makes the output less verbose but also significantly speeds up the simulation.
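To illustrate why disabled flags cost almost nothing, here is a sketch of a flag-guarded print macro. SKETCH_DPRINTF, the flag names kCache and kPipeline, and the counter are invented for this example; the real DPRINTF macro and the way its flags are derived from the CMake option are not shown here and may work differently.

```cpp
#include <cassert>
#include <cstdio>

namespace sketch {

// Hypothetical debug flags. Here they are plain compile-time constants; in a
// real build they would be derived from the build configuration.
namespace flag {
constexpr bool kCache = true;
constexpr bool kPipeline = false;
}  // namespace flag

static int lines_emitted = 0;  // counts messages that were actually printed

}  // namespace sketch

// The condition is a compile-time constant, so an optimizing compiler can
// remove the whole statement, including the formatting, for disabled flags.
#define SKETCH_DPRINTF(FLAG, ...)           \
  do {                                      \
    if (::sketch::flag::FLAG) {             \
      ++::sketch::lines_emitted;            \
      std::fprintf(stderr, "[%s] ", #FLAG); \
      std::fprintf(stderr, __VA_ARGS__);    \
    }                                       \
  } while (0)

namespace sketch {

int DprintfDemo() {
  lines_emitted = 0;
  const int hits = 3;
  SKETCH_DPRINTF(kCache, "hits=%d\n", hits);  // enabled: printed to stderr
  assert(lines_emitted == 1);
  SKETCH_DPRINTF(kPipeline, "stall at tick %d\n", 7);  // disabled: skipped
  return lines_emitted;
}

}  // namespace sketch
```

Only the message guarded by the enabled flag is emitted, so DprintfDemo counts a single line.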

A special trace type is Konata, which dumps Kanata logs that can be rendered in Konata.

Read our paper (must read) for more details!

Statistic system

To profile performance, we need many performance counters, such as cache hit rate or IPC. This information is collected using the statistic system (see statistic.h).

A single performance counter is a stat::Integer or stat::Real. These counters are defined in a stat::Group. Counters that are derived from other counters are stat::Formula, whose expression can be defined using operator= overloading. See the cache example.

A stat::Group can include another stat::Group as subgroup to form a tree-like structure. Finally, we can Tabularize the whole stats from the root node into a TOML table. See how kernel-sim handles this.
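The group tree and the derived counters can be sketched as follows. This is not the stat API: the Integer, Formula and Group types below are simplified stand-ins, the Add* method names are invented, and the formula is given as a lambda rather than as an expression built from overloaded operators.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <map>
#include <ostream>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

namespace sketch {

// Plain counter.
struct Integer {
  uint64_t value = 0;
  Integer& operator++() {
    ++value;
    return *this;
  }
};

// Derived counter, evaluated lazily when the stats are dumped.
struct Formula {
  std::function<double()> eval;
  Formula& operator=(std::function<double()> f) {
    eval = std::move(f);
    return *this;
  }
};

// A named set of counters that may contain subgroups.
class Group {
 public:
  explicit Group(std::string name) : name_(std::move(name)) {}
  void AddInteger(const std::string& n, const Integer* c) { ints_[n] = c; }
  void AddFormula(const std::string& n, const Formula* f) { formulas_[n] = f; }
  void AddSubgroup(const Group* g) { subgroups_.push_back(g); }

  // Writes this group and all subgroups as TOML tables, e.g. [system.l1].
  void Tabularize(std::ostream& os, const std::string& prefix = "") const {
    const std::string path = prefix.empty() ? name_ : prefix + "." + name_;
    os << "[" << path << "]\n";
    for (const auto& kv : ints_) os << kv.first << " = " << kv.second->value << "\n";
    for (const auto& kv : formulas_) os << kv.first << " = " << kv.second->eval() << "\n";
    for (const Group* g : subgroups_) g->Tabularize(os, path);
  }

 private:
  std::string name_;
  std::map<std::string, const Integer*> ints_;
  std::map<std::string, const Formula*> formulas_;
  std::vector<const Group*> subgroups_;
};

std::string StatsDemo() {
  Integer ticks, hits, misses;
  Formula hit_rate;
  hit_rate = [&hits, &misses]() {
    const uint64_t total = hits.value + misses.value;
    return total == 0 ? 0.0 : static_cast<double>(hits.value) / total;
  };

  Group l1("l1");
  l1.AddInteger("hits", &hits);
  l1.AddInteger("misses", &misses);
  l1.AddFormula("hit_rate", &hit_rate);
  Group root("system");
  root.AddInteger("ticks", &ticks);
  root.AddSubgroup(&l1);

  // Pretend a short simulation happened: 10 ticks, 3 hits, 1 miss.
  ticks.value = 10;
  ++hits; ++hits; ++hits;
  ++misses;
  assert(hits.value == 3);

  std::ostringstream os;
  root.Tabularize(os);
  return os.str();
}

}  // namespace sketch
```

Because the formula is evaluated only when the stats are dumped, it always reflects the final values of the counters it depends on.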