Getting Started
"simtix" (all letters are in lowercase) aims to provide a behavioral model of our target hardware.
Basic concepts
Directory structure
- `include`: "Public" headers for simtix
- `src`: All the sources
  - `kernel-sim`: Sources for kernel-sim
  - `simtix`: Sources for simtix
- `sw`: Software to run on kernel-sim
  - `config`: Simulation configurations for kernel-sim
  - `kernel`: The very basic kernel runtime library
  - `tests`: Sources for test programs (RISC-V SPMD programs written in C)
- `tests`: Test programs
  - `unit`: The unit tests
The simtix core library
The core library implements the API defined in the "public" headers, such as `class PipelinedSM` or `class NBHBCache`.
Note: We heavily use Pimpl to hide implementation details, so don't be afraid to see a lot of `Impl`s in the codebase.
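For readers new to the idiom, here is a minimal sketch of Pimpl as it is commonly written in C++. The class `Counter` and its members are hypothetical and are not taken from the simtix headers; the sketch only shows how a public class can forward every call to a hidden `Impl`.

```cpp
#include <memory>

// --- What would live in a public header (hypothetical class) ---
class Counter {
 public:
  Counter();
  ~Counter();  // Defined in the source file, where Impl is a complete type
  void Increment();
  int value() const;

 private:
  class Impl;                   // Forward declaration only
  std::unique_ptr<Impl> impl_;  // Opaque pointer to the hidden state
};

// --- What would live in the private source file ---
class Counter::Impl {
 public:
  void Increment() { ++count_; }
  int value() const { return count_; }

 private:
  int count_ = 0;
};

Counter::Counter() : impl_(std::make_unique<Impl>()) {}
Counter::~Counter() = default;
void Counter::Increment() { impl_->Increment(); }
int Counter::value() const { return impl_->value(); }
```

Because users of the public header only ever see the forward declaration, the private members can change without touching the public interface.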
The difference between public headers and private headers is that public headers are to be included by the user of the core library, such as casvp, while the private headers (`src/**/*.h`) are only included internally.
APIs defined in public headers are classified by component type (e.g., `mem.h`, `sm.h`) or by functionality (e.g., `statistics.h`, `trace.h`).
Mappings of declarations to implementations:
- `mem.h` -> `src/simtix/mem/*`
- `sm.h` -> `src/simtix/sm/*`
- `system.h` -> `src/simtix/system/*`
- `clocked.h` -> `src/simtix/sim/clocked.cc`
- `*` -> `src/simtix/common/*`
All hardware simulations in simtix are driven by Ticks (so that's why it's called "simtix")
The global tick entry is a single function: calling it moves the simulator 1 cycle forward. The tick function does the following:
- Iterate through all `Clocked` objects, calling their `Tick` methods respectively.
- Increase the global tick counter by 1.
- Check whether any component is busy.
- Dump the Konata `C 1` command.
All the `Clocked` objects are sorted by their `TickPriority`.
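The tick procedure can be sketched as follows. This is an illustration under stated assumptions, not the simtix implementation: `Simulator`, `Countdown`, the integer priority, and the "smaller priority ticks first" ordering are all hypothetical, and the Konata dump step is omitted.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the Clocked base class.
class Clocked {
 public:
  explicit Clocked(int priority) : priority_(priority) {}
  virtual ~Clocked() = default;
  virtual void Tick() = 0;
  virtual bool busy() const = 0;
  int priority() const { return priority_; }

 private:
  int priority_;  // Plays the role of TickPriority
};

// Hypothetical registry that owns the global tick counter.
class Simulator {
 public:
  void Register(Clocked* obj) {
    objects_.push_back(obj);
    // Assumption: a smaller priority value is ticked earlier.
    std::stable_sort(objects_.begin(), objects_.end(),
                     [](const Clocked* a, const Clocked* b) {
                       return a->priority() < b->priority();
                     });
  }

  // Advances one cycle; returns true if any component is still busy.
  bool Tick() {
    for (Clocked* obj : objects_) obj->Tick();
    ++cur_tick_;
    return std::any_of(objects_.begin(), objects_.end(),
                       [](const Clocked* obj) { return obj->busy(); });
  }

  std::uint64_t cur_tick() const { return cur_tick_; }

 private:
  std::vector<Clocked*> objects_;
  std::uint64_t cur_tick_ = 0;
};

// Example component that stays busy for a fixed number of cycles.
class Countdown : public Clocked {
 public:
  Countdown(int priority, int cycles) : Clocked(priority), remaining_(cycles) {}
  void Tick() override {
    if (remaining_ > 0) --remaining_;
  }
  bool busy() const override { return remaining_ > 0; }

 private:
  int remaining_;
};
```

A driver can then loop with `while (sim.Tick()) {}`, optionally adding a maximum tick count as a second stop condition.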
Unit tests
We use Catch2 v3 as the unit test framework. (Please refer to the Catch2 documentation for how to write a test.)
Basically all components (classes) have their own unit test. We make sure your code is tested before it can be merged into `main`.
- Question: How to know your code is actually tested?
- Ans: The coverage report!
A test case will become an item in CTest (the list shown when you run `ctest --test-dir build`).
Kernel sim
Kernel sim is considered a "user" of the simtix core library. It uses the APIs provided by simtix to construct a simple test environment (single core), which can run the programs provided in the `sw` directory.
Kernel sim is a command line tool that provides a set of flexible options to change the configuration. Commonly used options are (run `kernel-sim -h` for more details):
- `-M`: Limit the maximum ticks to run
- `-s`: Change the warp scheduler of the SM
- `-t`: Change the number of threads per warp
- `-w`: Change the number of warps per core
- `-k`: Specify the output path of the Konata log
- `-c`: Specify the configuration files
Since there are too many configurations to pass on the command line, kernel sim accepts configuration files (using `-c`) written in TOML, which are used to set the parameters of the system.
Kernel sim runs until either there are no more busy `Clocked` objects or the specified maximum tick limit is reached.
Procedure of kernel sim:
- Read configuration file and parse the command line arguments.
- Initialize the test system with given parameters.
- Load the ELF file (program) into the main memory.
- Call `simtix::sim::Tick` repeatedly to drive the simulation.
- Dump the log/stats after the simulation stops.
Software components
Instr
The `Instr` classes are the most important components in simtix. They define the behavior of RISC-V instructions in different stages using the following methods:
- `Decode`
- `Issue`
- `OperandCollect`
- `Execute`
- `Commit`
To implement any instruction, its behavior in each of these stages must be well defined. (See the many classes defined in `src/simtix/sm/Instr*.h`.)
For example, R-type instructions (`OP`):
- Determine what operation to perform based on `funct3_` and `funct7_` when doing `Decode`.
- Reserve `rd_` when doing `Issue`.
- Collect `rs1_data_` and `rs2_data_` when doing `OperandCollect`.
- Run the decoded operation, such as `add_`, when doing `Execute`.
- Write `rd_data_` back to the register file and release `rd_` when doing `Commit`.
An instruction in the SIMT execution model is associated with a warp. Therefore, all the operations mentioned above are performed against a `Warp` object. The constructor of the `Instr` object accepts a pointer to the `Warp` object that the `Instr` is associated with.
To simulate the execution of a sequence of RISC-V instructions, we:
- Fetch an `iword` (RISC-V 32-bit instruction) from memory.
- Construct an `Instr` object using that `iword`.
- Call the above methods sequentially.
- Repeat the above operations.
With this design, we can simulate different core configurations effectively.
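To make the stage methods concrete, here is a minimal sketch of an R-type instruction that only understands `add` and `sub`. `SketchWarp`, `SketchInstrOp`, and `RunAtomically` are hypothetical names, the warp is reduced to a single thread, and the opcode field is not checked; the bit positions follow the standard RISC-V R-type encoding.

```cpp
#include <array>
#include <cstdint>

// Hypothetical, heavily simplified warp: one thread, 32 registers.
struct SketchWarp {
  std::array<std::uint32_t, 32> regs{};
  std::array<bool, 32> reserved{};  // Scoreboard of pending writes
};

// Sketch of an R-type (OP) instruction supporting only ADD and SUB.
class SketchInstrOp {
 public:
  SketchInstrOp(SketchWarp* warp, std::uint32_t iword)
      : warp_(warp), iword_(iword) {}

  void Decode() {
    rd_ = (iword_ >> 7) & 0x1f;
    funct3_ = (iword_ >> 12) & 0x7;
    rs1_ = (iword_ >> 15) & 0x1f;
    rs2_ = (iword_ >> 20) & 0x1f;
    funct7_ = (iword_ >> 25) & 0x7f;
    // With funct3 == 0: funct7 0x00 selects ADD, funct7 0x20 selects SUB.
    is_sub_ = (funct3_ == 0 && funct7_ == 0x20);
  }
  void Issue() { warp_->reserved[rd_] = true; }
  void OperandCollect() {
    rs1_data_ = warp_->regs[rs1_];
    rs2_data_ = warp_->regs[rs2_];
  }
  void Execute() {
    rd_data_ = is_sub_ ? rs1_data_ - rs2_data_ : rs1_data_ + rs2_data_;
  }
  void Commit() {
    if (rd_ != 0) warp_->regs[rd_] = rd_data_;  // x0 is hard-wired to zero
    warp_->reserved[rd_] = false;
  }

 private:
  SketchWarp* warp_;
  std::uint32_t iword_;
  std::uint32_t rd_ = 0, rs1_ = 0, rs2_ = 0;
  std::uint32_t funct3_ = 0, funct7_ = 0;
  bool is_sub_ = false;
  std::uint32_t rs1_data_ = 0, rs2_data_ = 0, rd_data_ = 0;
};

// Calls every stage back to back, as an instruction set simulator would.
inline void RunAtomically(SketchInstrOp& instr) {
  instr.Decode();
  instr.Issue();
  instr.OperandCollect();
  instr.Execute();
  instr.Commit();
}
```

For instance, `0x002081b3` encodes `add x3, x1, x2` and `0x402081b3` encodes `sub x3, x1, x2`; a pipelined model would call the same five methods, just spread over several cycles.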
SM models
An SM model performs the above operations to simulate the execution of an instruction stream.
To model different microarchitectures, we schedule the invocation of the above `Instr` methods.
For example, the `AtomicSM` model serves as an ISS (instruction set simulator), in which all instructions are executed atomically. In this case, `AtomicSM::Tick` selects a warp suggested by the warp scheduler, fetches an `iword` to construct an `Instr` object, and calls all methods of the `Instr` to finish the execution atomically. See the `AtomicSM` implementation for more details.
For the `PipelinedSM` model, things get more complicated. We model the behavior of a pipelined processor, meaning that instructions are propagated between stages. The `PipelinedSM` determines whether an instruction can be propagated to the next stage by checking the current pipeline status. Also, there are several "stateful" components, such as the I-buffer and the Arbitrator, that affect how the `Instr`s are propagated.
Note:
The `Instr` class is decoupled from the clock, i.e., the `Tick` function. Therefore, for operations that take more than 1 cycle, such as reading/writing register files or accessing the memory, we use a pattern of pushing requests together with a callback function.
When the requests to these "timed" components are finished, the components call the callback associated with the requests. The callback usually changes the state of the `Instr`, marking that the next method can be called.
The availability is reflected in these methods:
- `CanIssue`: Dependency resolved
- `CanExecute`: Done collecting operands
- `CanCommit`: Done execution (memory accesses)
- `CanRetire`: Done writing back results
See `InstrLoad` for an example.
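The request-plus-callback pattern can be sketched as below. Everything here is hypothetical (`SketchTimedRegFile`, `SketchInstr`, the fixed latency, and the single-operand read); the point is only that the instruction never sees the clock and simply flips a flag when the timed component calls it back.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <queue>
#include <utility>

// Hypothetical "timed" register file: a read completes `latency` ticks later.
class SketchTimedRegFile {
 public:
  using Callback = std::function<void(std::uint32_t)>;

  explicit SketchTimedRegFile(std::uint64_t latency) : latency_(latency) {}

  void Write(std::size_t index, std::uint32_t value) { regs_[index] = value; }

  // Queues a read request together with the callback to run on completion.
  void PushRead(std::size_t index, Callback on_done) {
    pending_.push(Request{now_ + latency_, index, std::move(on_done)});
  }

  // Advances one cycle and completes every request that is due.
  void Tick() {
    ++now_;
    while (!pending_.empty() && pending_.front().ready_at <= now_) {
      Request req = std::move(pending_.front());
      pending_.pop();
      req.on_done(regs_[req.index]);
    }
  }

 private:
  struct Request {
    std::uint64_t ready_at;
    std::size_t index;
    Callback on_done;
  };
  std::array<std::uint32_t, 32> regs_{};
  std::uint64_t latency_;
  std::uint64_t now_ = 0;
  std::queue<Request> pending_;
};

// Hypothetical instruction: it has no Tick, only state changed by callbacks.
class SketchInstr {
 public:
  void OperandCollect(SketchTimedRegFile& rf, std::size_t rs1) {
    rf.PushRead(rs1, [this](std::uint32_t data) {
      rs1_data_ = data;
      operands_ready_ = true;  // Execute may be called from now on
    });
  }
  bool CanExecute() const { return operands_ready_; }
  std::uint32_t rs1_data() const { return rs1_data_; }

 private:
  std::uint32_t rs1_data_ = 0;
  bool operands_ready_ = false;
};
```

An SM model would poll `CanExecute()` each cycle and only move the instruction forward once it returns `true`.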
Queues
Software queues (`std::queue`, `std::deque`) are unsized, meaning that there is no limit on their capacity. However, for hardware queues, this is not the case.
As queues are common components in hardware, we have the following templated queues to be used when modeling the hardware:
- `SizedQueue`
  - Adds a capacity limit to `std::deque` so that it cannot be `Enq`ed when it's full.
- `DelayQueue`
  - With the capacity limit, it adds an additional delay constraint for `Deq`.
  - An element can be `Deq`ed only when it has stayed in the queue for a given delay.
- `FcfsDelayQueue`
  - With the capacity limit, it adds an additional delay constraint for `Deq`.
  - An element can be `Deq`ed only when it has stayed at the front of the queue for a given delay (first-come first-served policy).
Note:
The delay of all arithmetic operations of `Instr`, including multiplication, division, and even floating-point operations, is not modeled in the `Execute` method. Instead, the delay is simulated by pushing the instruction into a `DelayQueue`, so that it must stay in the queue for a number of cycles.
Operations are usually classified by whether they can be pipelined. For pipelined operations, `DelayQueue` is used, and for non-pipelined operations, `FcfsDelayQueue` is used.
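The difference between the two delay policies can be sketched with a single hypothetical class. `SketchDelayQueue`, its `fcfs` switch, and the explicit `now` arguments are assumptions made for illustration; the real simtix queues may obtain the current tick differently and are separate templates.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <utility>

// fcfs == false: an element may leave once it has spent `delay` ticks in the
//                queue, so elements overlap their waiting time (pipelined).
// fcfs == true:  the delay only starts counting when the element reaches the
//                front, so elements are served one at a time (non-pipelined).
template <typename T>
class SketchDelayQueue {
 public:
  SketchDelayQueue(std::size_t capacity, std::uint64_t delay, bool fcfs)
      : capacity_(capacity), delay_(delay), fcfs_(fcfs) {}

  // Returns false when the queue is full, like a hardware queue would stall.
  bool Enq(T value, std::uint64_t now) {
    if (items_.size() >= capacity_) return false;
    if (items_.empty()) front_since_ = now;  // New element is at the front
    items_.push_back(Entry{std::move(value), now});
    return true;
  }

  bool CanDeq(std::uint64_t now) const {
    if (items_.empty()) return false;
    const std::uint64_t start = fcfs_ ? front_since_ : items_.front().enq_tick;
    return now - start >= delay_;
  }

  T Deq(std::uint64_t now) {
    assert(CanDeq(now));
    T value = std::move(items_.front().value);
    items_.pop_front();
    front_since_ = now;  // The next element (if any) reaches the front now
    return value;
  }

 private:
  struct Entry {
    T value;
    std::uint64_t enq_tick;
  };
  std::size_t capacity_;
  std::uint64_t delay_;
  bool fcfs_;
  std::uint64_t front_since_ = 0;
  std::deque<Entry> items_;
};
```

With a delay of 3, two elements enqueued at ticks 0 and 1 can leave at ticks 3 and 4 under the pipelined policy, but only at ticks 3 and 6 under the first-come first-served policy.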
InstrPtr
Since an `Instr` is constructed when it is fetched and destroyed after retirement, the cost of constructing and destructing these objects becomes very expensive and slows down the simulation. To handle this, we use an optimization technique called the Object Pool Pattern.
Before the simulation starts, we first allocate a number of `Instr` objects in the pool. When we need a new `Instr`, we can obtain one directly from the pool. After we no longer need the `Instr`, we can recycle it back to the pool for the next use.
To make the `Instr` object reusable, it defines a `Reset` method to clear all the properties when recycling back to the pool and a `Reinitialize` method to reinitialize the fields when allocating.
`InstrPtr` is a wrapper class for `Instr` that:
- Guarantees the global uniqueness of the `Instr` instance.
- Handles allocations of `Instr` from `InstrPool`.
- Automatically recycles the instruction back to the pool.
- Handles the Konata trace of a single instruction.
`InstrPtr` acts like a `std::unique_ptr`, which can only be moved between variables. This nature guarantees the uniqueness of an `Instr` in the processor pipeline.
Note: The pipeline trace of an instruction begins with the invocation of the constructor of the `InstrPtr` and ends when the destructor of the `InstrPtr` is called.
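A minimal sketch of the object pool combined with a move-only handle is shown below. `PooledInstr`, `SketchPool`, and `SketchInstrPtr` are hypothetical stand-ins for `Instr`, `InstrPool`, and `InstrPtr`; tracing is left out, and the real classes are likely richer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical pooled object with the two reuse hooks described above.
struct PooledInstr {
  std::uint32_t iword = 0;
  bool executed = false;
  void Reinitialize(std::uint32_t w) { iword = w; }
  void Reset() {
    iword = 0;
    executed = false;
  }
};

// All objects are allocated once, before the simulation starts.
class SketchPool {
 public:
  explicit SketchPool(std::size_t size) : storage_(size) {
    for (PooledInstr& instr : storage_) free_.push_back(&instr);
  }
  SketchPool(const SketchPool&) = delete;
  SketchPool& operator=(const SketchPool&) = delete;

  PooledInstr* Allocate(std::uint32_t iword) {
    assert(!free_.empty() && "pool exhausted");
    PooledInstr* instr = free_.back();
    free_.pop_back();
    instr->Reinitialize(iword);
    return instr;
  }
  void Recycle(PooledInstr* instr) {
    instr->Reset();
    free_.push_back(instr);
  }
  std::size_t available() const { return free_.size(); }

 private:
  std::vector<PooledInstr> storage_;  // Never resized after construction
  std::vector<PooledInstr*> free_;
};

// Move-only handle: allocates on construction, recycles on destruction.
class SketchInstrPtr {
 public:
  SketchInstrPtr(SketchPool* pool, std::uint32_t iword)
      : pool_(pool), instr_(pool->Allocate(iword)) {}
  SketchInstrPtr(SketchInstrPtr&& other) noexcept
      : pool_(other.pool_), instr_(std::exchange(other.instr_, nullptr)) {}
  SketchInstrPtr& operator=(SketchInstrPtr&& other) noexcept {
    if (this != &other) {
      Release();
      pool_ = other.pool_;
      instr_ = std::exchange(other.instr_, nullptr);
    }
    return *this;
  }
  SketchInstrPtr(const SketchInstrPtr&) = delete;
  SketchInstrPtr& operator=(const SketchInstrPtr&) = delete;
  ~SketchInstrPtr() { Release(); }

  PooledInstr* operator->() const { return instr_; }
  explicit operator bool() const { return instr_ != nullptr; }

 private:
  void Release() {
    if (instr_ != nullptr) pool_->Recycle(instr_);
    instr_ = nullptr;
  }
  SketchPool* pool_;
  PooledInstr* instr_;
};
```

Moving a handle from one pipeline stage to the next leaves the source empty, so at most one stage can hold a given instruction at any time.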
MemoryInterface
`MemoryInterface` is a simplified interface for components that can be read or written with a given address and size. The interface implements a simple protocol that emulates the process of handshaking.
To send a request over a `MemoryInterface`, the initiator passes a `Payload` and an `OnResp` callback to either the `Read` or the `Write` method. The method returns `true` when it accepts the request or `false` when it rejects the request.
The invocation of `Read` or `Write` is analogous to the posedge of the valid signal, while the return value is analogous to the ready signal. Only when the return value is `true` is the request considered accepted.
Once the request is served, the `OnResp` callback is called with the `status` (Okay or types of errors) being the only argument.
Similarly, the `OnResp` callback returns `true` when the initiator can accept the response and `false` otherwise.
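The two-way handshake can be sketched as follows. The types below (`Status`, `Payload`, `OnResp`, `SketchMemory`) reuse names from the text but their fields and signatures are assumptions, not the real `MemoryInterface` declarations; the target holds a single request and retries the response until the initiator accepts it.

```cpp
#include <cstdint>
#include <functional>
#include <utility>

// Hypothetical status and payload types.
enum class Status { kOkay, kError };

struct Payload {
  std::uint64_t addr = 0;
  std::uint32_t data = 0;
};

// Returns true if the initiator accepted the response.
using OnResp = std::function<bool(Status)>;

// Hypothetical target with a single-entry request slot.
class SketchMemory {
 public:
  // Returns true when the request is accepted ("ready"), false otherwise.
  bool Read(Payload* payload, OnResp on_resp) {
    if (busy_) return false;
    busy_ = true;
    payload_ = payload;
    on_resp_ = std::move(on_resp);
    return true;
  }

  // Serves the pending request. If the initiator refuses the response,
  // the request stays pending and the response is retried next tick.
  void Tick() {
    if (!busy_) return;
    payload_->data = 0xabcd;  // Placeholder read data for the sketch
    if (on_resp_(Status::kOkay)) busy_ = false;
  }

 private:
  bool busy_ = false;
  Payload* payload_ = nullptr;
  OnResp on_resp_;
};
```

An initiator that gets `false` back from `Read` would typically keep the request and try again on a later cycle, just as a hardware master holds valid until ready is asserted.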
Trace system
You may see lots of `DPRINTF` in the codebase. These dump debug information when running the simulation. The first argument of `DPRINTF` is the "debug flag", which can be turned on or off when configuring the build with CMake (`-D SIMTIX_ENABLE_TRACING=XXX`).
The design allows us to turn off unwanted debug messages when running the simulation, which not only makes the output less verbose but also significantly speeds up the simulation.
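One common way to get this effect is a macro whose flag is a compile-time constant, sketched below. `SKETCH_DPRINTF` and the `SKETCH_TRACE_*` switches are hypothetical and are not the simtix macros; in a real build the switches would be set by the build system rather than hard-coded. The sketch also assumes the first argument after the flag is a string literal.

```cpp
#include <cstdio>

// Hypothetical per-flag switches (1 = enabled, 0 = disabled).
#ifndef SKETCH_TRACE_Cache
#define SKETCH_TRACE_Cache 1
#endif
#ifndef SKETCH_TRACE_Pipeline
#define SKETCH_TRACE_Pipeline 0
#endif

// The condition is a compile-time constant, so the compiler can drop the
// whole statement for a disabled flag; its arguments are never evaluated.
#define SKETCH_DPRINTF(flag, ...)                         \
  do {                                                    \
    if (SKETCH_TRACE_##flag) {                            \
      std::fprintf(stderr, "[" #flag "] " __VA_ARGS__);   \
    }                                                     \
  } while (0)
```

A call such as `SKETCH_DPRINTF(Cache, "hit addr=%x\n", addr);` prints with a `[Cache]` prefix, while the same call with the `Pipeline` flag produces no output and no run-time work.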
A special trace type is `Konata`, which dumps Kanata logs that can be rendered in Konata.
Read our paper (must read) for more details!
Statistic system
To profile performance, we need lots of performance counters, such as cache hit rate or IPC. This information is collected using the statistic system (see `statistic.h`).
A single performance counter is a `stat::Integer` or `stat::Real`. These counters are defined in a `stat::Group`. Counters that are derived from other counters are `stat::Formula`s, whose expressions can be defined using `operator=` overloading. See the cache example.
A `stat::Group` can include another `stat::Group` as a subgroup to form a tree-like structure. Finally, we can `Tabularize` all the stats from the root node into a TOML table. See how kernel-sim handles this.
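The group tree and derived counters can be sketched as follows. This is not the `stat::` API: `SketchGroup` is hypothetical, formulas are plain callables instead of overloaded expressions, and `Dump` writes simple dotted `key = value` lines rather than building a TOML table object.

```cpp
#include <cstdint>
#include <functional>
#include <map>
#include <ostream>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical statistics group holding counters, formulas, and subgroups.
class SketchGroup {
 public:
  explicit SketchGroup(std::string name) : name_(std::move(name)) {}

  // Returns a reference to a named integer counter, created at zero.
  std::uint64_t& Integer(const std::string& key) { return integers_[key]; }

  // Registers a derived value that is evaluated lazily when dumping.
  void Formula(const std::string& key, std::function<double()> expr) {
    formulas_[key] = std::move(expr);
  }

  void AddSubgroup(SketchGroup* child) { children_.push_back(child); }

  // Walks the tree from this node and writes "path.key = value" lines.
  void Dump(std::ostream& os, const std::string& prefix = "") const {
    const std::string path = prefix.empty() ? name_ : prefix + "." + name_;
    for (const auto& kv : integers_) {
      os << path << "." << kv.first << " = " << kv.second << "\n";
    }
    for (const auto& kv : formulas_) {
      os << path << "." << kv.first << " = " << kv.second() << "\n";
    }
    for (const SketchGroup* child : children_) child->Dump(os, path);
  }

 private:
  std::string name_;
  std::map<std::string, std::uint64_t> integers_;
  std::map<std::string, std::function<double()>> formulas_;
  std::vector<SketchGroup*> children_;
};
```

A cache model could register `hits` and `misses` counters plus a `hit_rate` formula computed from them, attach its group to the system root, and call `Dump` on the root (for example into a `std::ostringstream`) once the simulation stops.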