
Interview Questions

Q1. FIFO depth, given read and write rates for a burst of x writes
Q2. a=0; b=0; c=1; #1 a=c; #1 b=a; (Give waveforms)
Q3. a<=0; b<=0; c<=1; #1 a<=c; #1 b<=a; (Give waveforms)
Q4. a=0; b=0; c=1; a=#1 c; b=#1 a; (Give waveforms)
Q5. a<=0; b<=0; c<=1; a<=#1 c; b<=#1 a; (Give waveforms)
Q6. You have an incoming bit stream that you can't store. You get a new bit at every clock edge; find modulo 5 of the
updated number every time. E.g., if the bitstream is 10111, you find the modulo of 1, then 10, then 101, and so on.
Fibonacci series
Questions on C++, Perl, System Verilog.
Computer Architecture Concepts, Memory Consistency and Cache Coherency, cache configuration.
difference between non-blocking and blocking assignment
How to verify asynchronous fifo?
How to implement a greedy snake game? What data structures represent the snake?
In a certain protocol, why is the ready signal inout instead of out?
About the refresh in DDR2
FSM
System Verilog, Verilog, C, Perl, (also questions about OOP)
Bit operation
Asked to write SystemVerilog constraints for a variety of random stimulus needs
What is verification about? What are the components of design verification? What is
coverage? Coverage types?
setup and hold time
Aptitude-based questions (apples and oranges)
Perl scripting and programming-based questions
Write code for a UVC mimicking a memory. Reactive sequences in UVM
Explain how an out-of-order processor works. How do you implement register renaming? Difference
between an architectural and a physical register file
Verilog code writing, simple hardware design question using muxes and counter that was approached from
different levels of abstraction.
Entirely computer architecture questions, including cache coherency protocols, cache organizations
What is the scope of a static variable? Given multiple scenarios (static variables across files, in
recursion, etc.)
Describe what a virtual function does?
What are some ways for error testing/handling in software?
Computer Architecture stuff: OOO, memory dependencies, Pipelining, Fetch stage, Branch Prediction
System Verilog: coverage and assertion writing
Digital Logic: Implement AND and OR using 2:1 mux
Asked to rate myself in C++, System Verilog
C program to sort array. Binary search vs Linear Search. Time complexity.
How to verify many design scenarios.
Difference between union and struct (C++).
VIPT cache.
What is an isolation cell?
FIFO Depth, SV assertions, Multi-threading and OOP concepts
Random number generations, assertions, constraint
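Q6 above (running modulo 5 of an incoming bit stream) can be computed without storing the bits: appending a bit b to a number n gives 2n + b, so only the current remainder needs to be kept. A minimal sketch, assuming MSB-first bits and the module/signal names shown (they are illustrative, not from the source):

```systemverilog
// Sketch: running modulo-5 of a serial bit stream.
// Appending bit b to number n gives 2n + b, so
// r' = (2*r + b) mod 5 tracks the remainder exactly.
module mod5_stream (
  input  logic       clk, rst_n,
  input  logic       bit_in,     // new MSB-first bit each cycle
  output logic [2:0] remainder   // always in 0..4
);
  always_ff @(posedge clk or negedge rst_n)
    if (!rst_n) remainder <= '0;
    else        remainder <= (2*remainder + bit_in) % 5;
endmodule
```

For the example stream 10111, the remainder sequence is 1, 2, 0, 1, 3 (i.e., 1 mod 5, 10b mod 5, 101b mod 5, ...).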
Bug scenarios http://hwinterview.com/index.php/2016/11/01/bug-scenarios/

Synchronous FIFO verification http://hwinterview.com/index.php/2016/11/13/synchronous-fifo-verification/

SystemVerilog Assertions http://hwinterview.com/index.php/2016/11/04/assertions/

Coverage-driven verification http://hwinterview.com/index.php/2016/10/29/coverage-driven-verification/

Fork join statements http://hwinterview.com/index.php/2016/11/16/fork-join-statements/

Virtual Memory http://hwinterview.com/index.php/2016/11/20/virtual-memory/

Virtual Address Space / Paging http://hwinterview.com/index.php/2016/11/20/virtual-address-space-paging/

Design a Cache addressing scheme http://hwinterview.com/index.php/2016/11/20/design-cache-addressing-scheme/


What are the goals of a verification engineer?
Develop a testplan to define the what, where and how of testing methodology.
Design a reusable and scalable testbench environment to verify the module.
Work with the designer to ensure that the design meets all the specifications through coverage analysis.
Start debugging with the mindset that the testbench is incorrect. Once that is ruled out, then the designer
can be involved in the debug effort.
Automate the checking process.
Gain a good grasp of the design specifications. The verification engineer should not completely trust the
designer to determine if the design has been documented correctly.
Once a reasonable understanding is obtained, suggest re-designing or re-evaluating any particular
logic that is constantly seeing issues. The verification engineer should not be afraid to push for changes, keeping
verification schedules in mind.

What is a testplan?
A testplan is probably the most crucial aspect of the design verification flow. In general, it involves defining the following
aspects:
Engineers involved in designing and verifying the module
Module features to be verified based on the design specification
Environment used for testing (unit/system/emulation)
Schedule
Description of how to go about the task of thoroughly verifying the module
Benefits of a good testplan
A good testplan sets the ground work for focusing on the important features to be verified.
It also provides a framework for evaluating the progress of verification through functional coverage.
Furthermore, it provides a good opportunity to hash out any misunderstandings on features and interfaces. Thus, a
review can be held by all the stakeholders involved in the module (design, block verification, system verification,
architecture) to clearly define the methodology for testing.
What should be in a testplan?
Testbench description: A brief overview (maybe a diagram) of the testbench components used, such as
scoreboards/checkers and agents. A description of the testbench files is also beneficial for anyone new to the
module to grasp the testbench intent.
Features: The testplan should list the feature specifications and map each to a specific coverpoint. It is also crucial
to focus on the interfaces of the block, as these are the usual spots to uncover bugs. Another aspect is to provide
scenarios of how the end product will be used by the system.
How to test? The how of testing should cover the following:
o High risk areas
o Scope of what should be covered in the future
o Assumptions
o Test fail criteria

To achieve successful testing:

1. The testbench must generate proper input stimulus to activate a design error.
2. The testbench must generate proper input stimulus to propagate all effects resulting from the design error to an output port.
3. The testbench must contain a monitor that can detect the design error that was first activated then propagated to a point for
detection.

Circular Buffer Implementation (dual-port RAM storage)

module dualPortRam (
    input  wire        clk,
    // Port A
    input  wire [9:0]  addressA,
    input  wire [15:0] dataInA,
    input  wire        writeEnableA,
    output reg  [15:0] dataOutA,
    // Port B
    input  wire [9:0]  addressB,
    input  wire [15:0] dataInB,
    input  wire        writeEnableB,
    output reg  [15:0] dataOutB
);

reg [15:0] myMemory [0:1023];   // 1024 x 16-bit storage

always @ (posedge clk) begin

    if (writeEnableA) begin
        myMemory[addressA] <= dataInA;
        dataOutA <= dataInA;            // write-first read behavior
    end
    else
        dataOutA <= myMemory[addressA];

    if (writeEnableB) begin
        myMemory[addressB] <= dataInB;
        dataOutB <= dataInB;
    end
    else
        dataOutB <= myMemory[addressB];

end

endmodule

Comparing Gray code pointers to binary pointers

Binary pointers can be used for FIFO design if the pointers are sampled and handshaking control signals are used between the two clock domains to safely pass the sampled binary count values. Some advantages of using binary pointers over Gray code pointers:

The technique of sampling a multi-bit value into a holding register and using synchronized handshaking control signals to pass the multi-bit value into a new clock domain can be used for passing ANY arbitrary multi-bit value across clock domains. This technique can be used to pass FIFO pointers or any multi-bit value.

Each synchronized Gray code pointer requires 2n flip-flops (2 per pointer bit). The sampled multi-bit register requires 2n+4 flip-flops (1 per holding-register bit in each clock domain, 2 flip-flops to synchronize a ready bit and 2 flip-flops to synchronize an acknowledge bit). There is no appreciable difference in the chance that either pointer style would experience metastability.

The sampled multi-bit binary register allows arbitrary pointer changes. Gray code pointers can only increment and decrement.

The sampled multi-bit register technique permits arbitrary FIFO depths, whereas a Gray code pointer requires power-of-2 FIFO depths. If a design required a FIFO depth of at least 132 words, using a standard Gray code pointer would employ a FIFO depth of 256 words. Since most instantiated dual-port RAM blocks are power-of-2 words deep, this may not be an issue.

Using binary pointers makes it easy to calculate almost-empty and almost-full status bits using simple binary arithmetic between the pointer values.

One small disadvantage to using binary pointers over Gray code pointers is: sampling and holding a binary FIFO pointer and then handshaking it across a clock boundary can delay the capture of new samples by at least two clock edges from the receiving clock domain and another two clock edges from the sending clock domain. This latency is generally not a problem, but it will typically add more pessimism to the assertion of full and empty and might require additional FIFO depth to compensate for the added pessimism. Since most FIFOs are typically specified with excess depth, it is not likely that extra registers or a larger dual-port FIFO buffer size would be required.
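For reference, the Gray code pointers discussed above are derived from binary counts with one XOR per bit, so that only a single bit changes per increment. A sketch of both conversions (the 4-bit width is chosen only for illustration):

```systemverilog
// Sketch: binary <-> Gray conversion for FIFO pointers.
function automatic logic [3:0] bin2gray (input logic [3:0] b);
  return b ^ (b >> 1);          // each Gray bit = XOR of adjacent binary bits
endfunction

function automatic logic [3:0] gray2bin (input logic [3:0] g);
  logic [3:0] b;
  b[3] = g[3];                  // MSB passes through unchanged
  for (int i = 2; i >= 0; i--)
    b[i] = b[i+1] ^ g[i];       // each binary bit folds in the next Gray bit
  return b;
endfunction
```

For example, binary 0110 maps to Gray 0101, and converting back recovers 0110.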

ASSERTIONS
An assertion is a statement about a design's intended behavior.
- If a property that is being checked in simulation does not behave the way we expect, the assertion fails.
- If a property that is forbidden in the design happens during simulation, the assertion fails.
- Assertions help capture the designer's interpretation of the specification.
- An assertion describes a property of the design.
- An assertion doesn't help in designing any entity; it only checks the behavior of the design.

assert property (@(posedge clk) $rose(req) |-> ##[1:3] $rose(ack)); In this example, when there is a positive edge on the request (req) signal, make sure that
between 1 and 3 clock cycles later there is a positive edge on the acknowledge (ack) signal. Here the designer knows that the acknowledge signal should go high
within 1 to 3 cycles after the request signal goes high at the positive edge.

Immediate assertions use the keyword assert (not assert property), are placed in procedural code, and are executed as procedural statements.
- Based on simulation event semantics. - The test expression is evaluated just like any other Verilog expression within a procedural block. They are not temporal in
nature and are evaluated immediately. - Must be placed in a procedural block. - Used only with dynamic simulation.
A sample immediate assertion is shown below:
always_comb
begin
a_ia: assert (a && b);
end
Concurrent assertions use the keywords assert property, are placed outside of a procedural block, and are executed once per sample cycle at the end of the
cycle. The sample cycle is typically a posedge clk, and sampling takes place at the end of the clock cycle, just before everything changes on the next posedge clk.
- Based on clock cycles. - The test expression is evaluated at clock edges based on the sampled values of the variables involved. - Variables are sampled in the
Preponed region and assertions are evaluated in the Observed region of the scheduler. - Can be placed in a procedural block, a module, an interface or a program
definition. - Can be used with both static and dynamic verification tools.
A sample concurrent assertion: a_cc: assert property ( @ (posedge clk) not (a && b)); This example shows the result of concurrent assertion a_cc. All
successes are shown with an up arrow and all failures with a down arrow. The key concept in this example is that the property is verified on every
positive edge of the clock, irrespective of whether signal a or signal b changes.

Embedded concurrent assertions are another form of concurrent assertion, added in IEEE Std 1800-2009, and also use the keywords assert property but
are placed inside a clocked always process. Placing the assertion in a clocked always process allows the concurrent assertion to inherit the clocking/sampling
signal from the always process.

Design engineers should create the low-level and simple assertions, while verification engineers should create higher-level and perhaps more complex
assertions.
Where should assertions be used? - Between modules, and between DUT and testbench, to check communication between the modules
and stimulus constraints. - Inside individual modules to verify the design and its corner cases, and to verify the
assumptions.

Assertions should be put in a separate bindfile rather than in the RTL code.
Bindfiles:
How bindfiles work: In general, using bindfiles is indirect instantiation. The engineer binds (indirectly instantiates) one module inside another module
using the bind keyword.
To create a bindfile, declare a module that will encapsulate the assertion code (and other verification code if needed). The module needs access to all of the important
signals in the enclosing file, so all of the ports and internal signals from the enclosing file are declared as inputs to the bindfile.
- The bind command includes the bind keyword followed by the DUT module name:
bind fifo1
- followed by how the bound module would be instantiated if placed directly in the module being bound to:
fifo1_asserts p1 (.*);

When creating bindfiles, it is a good idea to copy the DUT module to a DUT_asserts module, keep all existing input declarations, change all output declarations to input
declarations, and declare all internal signals as input declarations to the bindfile. The bindfile will sample the port and internal signals from the DUT.
It is not required to list all of the DUT signals in the asserts file, only those signals that will be checked by assertions; however, it is highly recommended to add ALL of the DUT signals
to the asserts file because it is common to add more assertions in the future that might require previously unused DUT signals.
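Putting the pieces above together, a minimal bindfile for the fifo1 example might look like this (the port names and the particular assertion are assumptions chosen for illustration, not taken from the source):

```systemverilog
// Sketch of a bindfile for a DUT named fifo1 (signal names assumed).
// Every DUT port and internal signal of interest becomes an input here.
module fifo1_asserts (
  input logic clk, rst_n,
  input logic push, pop,
  input logic full, empty
);
  // Example check: the environment must never push into a full FIFO.
  a_no_push_full: assert property (@(posedge clk) disable iff (!rst_n)
                                   full |-> !push);
endmodule

// Indirect instantiation: insert fifo1_asserts into every fifo1 instance,
// connecting ports by name with the .* shorthand.
bind fifo1 fifo1_asserts p1 (.*);
```

Because the bind is external, the RTL file itself stays free of verification code.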

The SystemVerilog language provides three important benefits over Verilog:
1. Explicit design intent: SystemVerilog introduces several constructs that allow you to explicitly state what type of logic should be generated.
2. Conciseness of expressions: SystemVerilog includes commands that allow you to specify design behavior more concisely than previously possible.
3. A high level of abstraction for design: The SystemVerilog interface construct facilitates inter-module communication.
These benefits enable you to rapidly develop your RTL code, easily maintain your code, and minimize the occurrence of situations
where the RTL code simulates differently than the synthesized netlist. SystemVerilog allows you to design at a high level of abstraction, which results in improved
code readability and portability. Advanced features such as interfaces, concise port naming, explicit hardware constructs, and special data types ease verification
challenges.

Basic Testbench Functionality: The purpose of a testbench is to determine the correctness of the design under test (DUT). The following steps
accomplish this: generate stimulus; apply stimulus to the DUT; capture the response; check for correctness; measure progress against overall verification goals.

Classes: System Verilog provides an object-oriented programming model. System Verilog classes support a single-inheritance model. There is no facility
that permits conformance of a class to multiple functional interfaces, such as the interface feature of Java. System Verilog classes can be type-parameterized,
providing the basic function of C++ templates. However, function templates and template specialization are not supported.
The polymorphism features are similar to those of C++: the programmer may declare a function virtual to let a derived class override it. Encapsulation
and data hiding are accomplished using the local and protected keywords, which must be applied to any item that is to be hidden. By default,
all class properties are public. System Verilog class instances are created with the new keyword. A constructor, denoted by function new, can be defined. System
Verilog supports garbage collection, so there is no facility to explicitly destroy class instances.

Program Block Encapsulates Test Code


A program can call a routine in a module to perform various actions. The routine can set values on internal signals, also known as a back-door load.
Because the current SystemVerilog standard does not define how to force signals from a program block, you need to write a task in the design to do the force, and
then call it from the program.
The Program block provides
- Entry point to test execution - Scope for program-wide data and routines - Race-free interaction between Testbench and design

Why are always blocks not allowed in a program block? In System Verilog, you can put initial blocks in a program, but not always blocks.
This is the opposite of Verilog, for the following reasons: System Verilog programs are closer to a program in C, with one entry point, than to Verilog's many
small blocks of concurrently executing hardware. In a design, an always block might trigger on every positive edge of a clock from the start of simulation. In System
Verilog, a testbench has the steps of initialization, stimulating and responding to the design, and then wrapping up the simulation. An always block that runs
continuously would not fit this model.

The Interface: It is the mechanism to connect the testbench to the DUT, essentially a named bundle of wires (e.g., connecting two hardware blocks with
physical wires). With an interface block we can add new connections easily, there are no missed connections, and port lists are compact. It also carries directional
information in the form of modports (explained below) and clocking blocks.
When the interface instance is used in a program block, the data type for the signals should be logic. The reason is that signals within a program block are
almost always driven within a procedural block (initial). All signals driven within a procedural block must be of type reg, the synonym of which is logic. When a signal is
declared logic, it can also be driven by a continuous assignment statement. This added flexibility of logic is generally desirable. There is an exception to the
above recommendation: if the signal is bi-directional (inout) or has multiple drivers, then it must be a wire (or another wire type).
TIP: Use a wire type in case of multiple drivers. Use the logic type in case of a single driver.

How do we achieve Synchronous Timing for different modules?


Clocking Blocks
- A clocking block assembles signals that are synchronous to a particular clock and makes their timing explicit. A clocking block (CB) specifies clock signals and the
timing and synchronization requirements of various blocks. A CB is helpful in separating the clocking activities of a design from its data-assignment activities and can
be used in testbenches. - A CB assembles all the signals that are sampled or synchronized by a common clock and defines the timing behaviors with respect to the
clock. It is defined by a clocking ... endclocking keyword pair.
- A CB only describes how the inputs and outputs are sampled and synchronized. It does not assign a value to a variable. - Depending on the environment, a
testbench can contain one or more clocking blocks, each containing its own clock plus an arbitrary number of signals.
Clocking Block Example:
clocking cb @(posedge clk);
  default input #1ns output #1ns;
  output reset_n;
  output din;
  output frame_n;
  output valid_n;
  input  dout;
  input  busy_n;
  input  valido_n;
  input  frameo_n;
endclocking : cb

An interface encapsulates the communication between DUT and testbench, including:
- Connectivity (signals): a named bundle of wires; one or more bundles to connect; can be reduced for different tests and devices
- Directional information (modports)
- Timing (clocking blocks)
- Functionality (routines, assertions, initial/always blocks)
It solves many problems with traditional connections:
- Port lists for the connections are compact
- No missed connections
- Easy to add new connections
- New signals in the interface are automatically passed to the test program or module, preventing connection problems
- Changes are easily made in the interface, making it easy to integrate at higher levels
- Most port-declaration work, otherwise duplicated in many modules, is centralized
- Most importantly, it is easy to change if the design changes

An interface cannot contain module instances, only instances of other interfaces.
The advantages of using an interface are as follows: - An interface is ideal for design reuse. When two blocks communicate with a specified
protocol using more than two signals, consider using an interface. - The interface takes the jumble of signals that you declare over and over in every module or
program and puts it in a central location, reducing the possibility of misconnecting signals. - To add a new signal, you just have to declare it once in the interface, not
in higher-level modules, once again reducing errors. - Modports allow a module to easily tap a subset of signals from an interface. We can also specify signal
direction for additional checking.
Modport: This provides direction information for module interface ports and controls the use of tasks and functions within certain modules. The directions of
ports are those seen from the perspective of the module or program. - Modports do not contain vector sizes or data types (a common error), only whether the
connecting module sees a signal as an input, output, inout or ref port.
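A minimal interface-with-modports sketch (the interface and signal names are assumptions for illustration); note that the same signals get opposite directions in the dut and tb modports:

```systemverilog
// Sketch: interface bundling a simple memory-style connection,
// with modports giving each side its own view of signal directions.
interface mem_if (input logic clk);
  logic [9:0]  addr;
  logic [15:0] wdata;
  logic        we;
  logic [15:0] rdata;

  // Directions as seen from the DUT: it receives the request, returns data.
  modport dut (input  addr, wdata, we, output rdata);
  // Directions as seen from the testbench: the mirror image.
  modport tb  (output addr, wdata, we, input  rdata);
endinterface
```

A module then takes the bundled view as a single port, e.g. `module mem (mem_if.dut bus);`, instead of repeating every signal in its port list.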

Module Top: This file carries the top-level image of your whole design, showing all the modules connected to it and the ports being used. The
interface and test programs are instantiated here in the harness files. Looking into a top-level harness file gives a detailed picture of any design: its
functional parameters, interfaces, etc.
Descriptions of some of the intermediate blocks
Environment: Contains the instances of all the verification components; component connectivity and the steps required for execution of each
component are also done here.

Coverage: Checks the completeness of the testbench. It can be improved by the use of assertions, which help check the coverage of the testbench and
generate suitable reports of the test coverage. The concept of coverage gets more involved with functional coverage, covergroups
and coverpoints. With coverage points, we can generate a coverage report of the design and know the strength of the verification.

Transactors: A transactor performs high-level operations, such as splitting burst operations into individual commands or handling a sub-layer protocol in a
layered protocol (the PCI Express Transaction Layer over the PCI Express Data Link Layer, TCP/IP over Ethernet, etc.). It also handles the DUT configuration
operations and provides the necessary information to the coverage model about the stimulus generated. Stimulus generated in the generator is high level, e.g.
"packet with good CRC, length 5, da 8'h0". This high-level stimulus is converted into low-level data using packing; the low-level data is just an array of bits or
bytes. The transactor creates test scenarios and tests for the functionality, and identifies transactions through the interface.

Drivers: The drivers translate the operations produced by the generator into the actual inputs for the design under verification. Generators create inputs at a high
level of abstraction, namely as transactions like a read or write operation. The drivers convert this input into actual design inputs, as defined in the specification of
the design's interface. If the generator generates a read operation, then the read task is called, in which the DUT input pin "read_write" is asserted.

Monitor: The monitor reports protocol violations and identifies all the transactions. Monitors are of two types: passive and active. Passive monitors do not drive
any signals; active monitors can drive the DUT signals. Sometimes this is also referred to as a receiver. The monitor converts the state of the design and its outputs
to a transaction abstraction level so it can be stored in a scoreboard database to be checked later. In short, the monitor converts pin-level activity into high-level
transactions.

Checker: The monitor only monitors the interface protocol. It doesn't check whether the data matches the expected data, as the interface has nothing to do
with the data. The checker converts the low-level data to high-level data and validates it. This operation of converting low-level data to high-level data is called
unpacking, the reverse of packing. For example, data may be collected from all 15 commands of a burst operation and converted into raw data, after which all
the sub-field information is extracted and compared against the expected values. The comparison status is sent to the scoreboard.
The Generator, Agent, Driver, Monitor and Checker are all classes, modelled as transactors. They are instantiated inside the Environment class. For
simplicity, the test is at the top of the hierarchy, as is the program that instantiates the Environment class. The functional coverage definition can be put inside or
outside the Environment class.

Scoreboard: The scoreboard is used to store the expected output of the device under test. It implements the same functionality as the DUT, but using higher-level
constructs. Dynamic data types and dynamic memory allocation in SystemVerilog make it easy to write scoreboards. The scoreboard is also used to keep track of
how many transactions were initiated, and how many of them passed or failed.

Randomization: When considering what to randomize, the first things you may think of are the data fields. These are the easiest to create: just call $random. The
problem is that this approach has a very low payback in terms of bugs found: you only find data-path bugs, perhaps with bit-level mistakes. The test is still inherently
directed. The challenging bugs are in the control logic. As a result, you need to randomize all decision points in your DUT. Wherever control paths diverge,
randomization increases the probability that you'll take a different path in each test case.

Difference between rand and randc? Variables in a class can be declared random using the keywords rand and randc. Dynamic and
associative arrays can also be declared with rand or randc. Variables declared with the rand keyword are standard random variables: their values are uniformly
distributed over their range, and values may repeat. Variables declared with the randc keyword are random-cyclic: the solver iterates through all values of the range
in a random order before repeating any value. randc supports only bit or enumerated data types, and the size is limited.
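A small sketch of the difference (the class and field names are assumptions, not from the source):

```systemverilog
// Sketch: rand vs randc in a random-stimulus class.
class Packet;
  rand  bit [7:0] payload;  // uniformly distributed; values may repeat
  randc bit [1:0] kind;     // cyclic: hits all 4 values before any repeat
  constraint c_payload { payload inside {[1:100]}; }
endclass
```

Usage would be `Packet p = new(); void'(p.randomize());` in a loop: payload can repeat from call to call, while kind is guaranteed to cycle through 0..3 in some random order before any value recurs.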
Semaphores: A semaphore allows you to control access to a resource. Semaphores can be used in a testbench when you have a resource, such as a bus, that
may have multiple requestors from inside the testbench but, as part of the physical design, can only have one driver. In System Verilog, a thread that requests a key
when one is not available always blocks.
There are three basic operations for a semaphore: create a semaphore with one or more keys using the new
method, get one or more keys with get, and return one or more keys with put.
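The three operations (new, get, put) can be sketched as follows (the testbench structure and names are illustrative); the second thread's get blocks until the first thread puts the key back:

```systemverilog
// Sketch: two testbench threads sharing one bus via a semaphore.
module semaphore_demo;
  semaphore bus_sem = new(1);        // new(1): a single key = one bus owner

  task automatic drive_bus (int id);
    bus_sem.get(1);                  // blocks until a key is available
    $display("[%0t] thread %0d owns the bus", $time, id);
    #10;                             // hold the bus for a while
    bus_sem.put(1);                  // return the key for the next requestor
  endtask

  initial fork
    drive_bus(0);
    drive_bus(1);                    // waits until thread 0 releases the key
  join
endmodule
```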

Mailboxes: A mailbox is a communication mechanism that allows messages to be exchanged between processes or threads. Data can be sent to a mailbox by
one process and retrieved by another. Mailbox is a built-in class that provides the following methods: - Create a mailbox: new() - Place a message in a mailbox:
put() - Try to place a message in a mailbox without blocking: try_put() - Retrieve a message from a mailbox: get() or peek() - Try to retrieve a message from a
mailbox without blocking: try_get() or try_peek() - Retrieve the number of messages in the mailbox: num().
E.g., a generator using mailboxes:
task generator (int n, mailbox mbx);
  Transaction t;
  repeat (n) begin
    t = new();
    .....
    mbx.put(t);
  end
endtask

Tasks and Functions


Task and function declarations are similar to those in Verilog, but the following rules hold for SystemVerilog:
Any port defaults to input direction unless explicitly declared otherwise (input, output, inout, ref).
Unless declared, ports are of type logic.
There is no need to use begin..end when more than one statement is used inside a task.
A task can be terminated before endtask by using a return statement.
The wire data type cannot be used in the port list.
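A small task illustrating the rules above (the names and the early-exit condition are illustrative):

```systemverilog
// Sketch: SystemVerilog task rules in action.
task automatic send (bit [7:0] data, ref int count);  // "data" defaults to input direction
  if (count >= 8) return;        // terminate before reaching endtask
  count++;                       // multiple statements, no begin..end required
  $display("sending %0h", data);
endtask
```

Note the task must be automatic to use a ref argument, and the undeclared port type defaults to logic (here overridden explicitly with bit and int).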

What is coverage? Simply put, coverage is a metric we use to measure verification progress and completeness. Coverage
metrics tell us what portion of the design has been activated during simulation (that is, the controllability quality of a
testbench). More importantly, coverage metrics identify portions of the design that were never activated during simulation,
which allows us to adjust our input stimulus to improve verification.

Coverage-driven verification: Coverage-driven verification is a widely used methodology to tackle the growing
complexity of ASIC designs, which add new features and improve performance with every product generation. It typically
involves the following steps:
1. Development of a test plan incorporating the list of features to verify.
2. Creation of a smart environment with configurable parameters, random constrained stimulus, checkers and a
coverage model to track progress.
3. Addition of assertions to catch illegal scenarios.
4. Iteratively run simulations and analyze coverage metrics (code coverage and functional coverage).
Benefits
The coverage-driven approach provides measurable success parameters through coverage metrics. This is crucial, especially with
tough schedules to meet. In addition, using constrained random stimulus eliminates the time spent creating directed tests.

Why not use simple directed tests?


Directed tests are not reusable across multiple environments. They are not scalable, since they require a substantial effort to
develop. Consequently, they are inefficient compared to a constrained random approach. Most of all, it is hard to pinpoint
the completeness of verification through directed testing.

Are there any drawbacks?


The coverage-driven approach requires a significant amount of planning and effort to develop, which may not be necessary for
simple standalone designs. Furthermore, some scenarios are unlikely to occur using constrained random stimulus. It can be
beneficial to write specific directed tests for hard-to-hit coverage holes.

Coverage Classification
The two most common ways to classify coverage are by method of creation (explicit versus implicit) or by origin of source
(specification versus implementation).
For instance, functional coverage is one example of an explicit coverage metric, which is manually defined and then implemented by the
engineer. In contrast, line coverage and expression coverage are two examples of implicit coverage metrics, since their definition and implementation are
automatically derived and extracted from the RTL representation.
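As an example of explicit coverage, a functional coverage model is written by hand as a covergroup. A minimal sketch (the module wrapper, signal names, and bins are assumptions chosen for illustration):

```systemverilog
// Sketch: an explicit functional coverage model, sampled on each clock.
module cov_sketch (
  input logic       clk,
  input logic [7:0] pkt_len,
  input logic [1:0] pkt_kind
);
  covergroup pkt_cov @(posedge clk);
    cp_len  : coverpoint pkt_len {
      bins small = {[1:63]};        // engineer-chosen bins map spec features
      bins large = {[64:255]};
    }
    cp_kind : coverpoint pkt_kind;  // automatic bins, one per value
    len_x_kind : cross cp_len, cp_kind;  // corner cases across both fields
  endgroup

  pkt_cov cov = new();              // instantiate the coverage model
endmodule
```

By contrast, implicit metrics such as line or toggle coverage need no such model; the tool derives them from the RTL automatically.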

Coverage Metrics
There are two primary forms of coverage metrics in production use in industry today and these are:
- Code Coverage Metrics (Implicit coverage)
- Functional Coverage/Assertion Coverage Metrics (Explicit coverage)

Code Coverage Metrics: One of the advantages of code coverage is that it automatically describes the degree to which the source code of
a program has been activated during testing, thus identifying structures in the source code that have not been activated. A key benefit of
code coverage, unlike functional coverage, is that creating the structural coverage model is an automatic process. Hence, integrating code coverage into your
existing simulation flow is easy and does not require a change to either your current design or verification approach.

Limitations:
One limitation of code coverage metrics is that you might achieve 100% code coverage during your regression run, which means that your testbench
provided stimulus that activated all structures within your RTL source code, yet there are still bugs in your design. For example, the input stimulus might have
activated a line of code that contained a bug, yet the testbench did not generate the additional required stimulus that propagates the effects of the bug to some point
in the testbench where it could be detected.
Another limitation of code coverage is that it does not provide an indication on exactly what functionality defined in the specification was actually
tested. For example, you could run into a situation where you achieved 100% code coverage, and then assume you are done. Yet, there could be functionality
defined in the specification that was never tested, or even functionality that had never been implemented! Code coverage metrics will not help you find these situations.

Types of Code Coverage Metrics
Toggle Coverage Toggle coverage is a code coverage metric used to measure the number of times each bit of a register or wire has toggled its value. Although
this is a relatively basic metric, many projects have a testing requirement that all ports and registers, at a minimum, must have experienced a zero-to-one and one-
to-zero transition.
In general, reviewing a toggle coverage analysis report can be overwhelming and of little value if not carefully focused. For example, toggle coverage is
often used for basic connectivity checks between IP blocks. In addition, it can be useful to know that many control structures, such as a one-hot select bus, have
been fully exercised.

Line Coverage Line coverage is a code coverage metric we use to identify which lines of our source code have been executed during simulation. A line coverage
metric report will have a count associated with each line of source code indicating the total number of times the line has executed. The line execution count value is
not only useful for identifying lines of source code that have never executed, but also useful when the engineer feels that a minimum line execution threshold is
required to achieve sufficient testing.
Line coverage analysis will often reveal that a rare condition required to activate a line of code has not occurred due to missing input stimulus.
Alternatively, line coverage analysis might reveal that the data and control flow of the source code prevented the line from executing, either due to a bug in the code
or because it is dead code that is not currently needed under certain IP configurations. For unused or dead code, you might choose to exclude or filter this code
during the coverage recording and reporting steps, which allows you to focus only on the relevant code.

Statement Coverage Statement coverage is a code coverage metric we use to identify which statements within our source code have been executed during
simulation. In general, most engineers find that statement coverage analysis is more useful than line coverage since a statement often spans multiple lines of source
code, or multiple statements can occur on a single line of source code.
A metrics report used for statement coverage analysis will have a count associated with each line of source code indicating the total number of times the
statement has executed. This statement execution count value is not only useful for identifying lines of source code that have never executed, but also useful when
the engineer feels that a minimum statement execution threshold is required to achieve sufficient testing.
Block Coverage Block coverage is a variant on the statement coverage metric which identifies whether a block of code has been executed or not. A block is
defined as a set of statements between conditional statements or within a procedural definition, the key point being that if the block is reached, all the lines within the
block will be executed. This metric prevents engineers from inflating their statement coverage by simply adding more statements to their
code.

Branch Coverage Branch coverage (also referred to as decision coverage) is a code coverage metric that reports whether Boolean expressions tested in control
structures (such as the if, case, while, repeat, forever, for and loop statements) evaluated to both true and false. The entire Boolean expression is considered one
true-or-false predicate regardless of whether it contains logical-and or logical-or operators.

Expression Coverage Expression coverage (sometimes referred to as condition coverage) is a code coverage metric used to determine if each condition
evaluated both to true and false. A condition is a Boolean operand that does not contain logical operators. Hence, expression coverage measures the Boolean
conditions independently of each other.
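The difference between branch and expression coverage can be seen on a single decision. A minimal sketch (the signals a, b, and y are hypothetical, not from the text):

```systemverilog
// Hypothetical decision illustrating branch vs. expression coverage.
// Branch coverage: the whole predicate (a && b) must be observed
// evaluating both true and false.
// Expression coverage: the conditions a and b must each be observed
// at 0 and at 1, independently of one another.
always_comb begin
  if (a && b)
    y = 1'b1;
  else
    y = 1'b0;
end
```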

Focused Expression Coverage Focused Expression Coverage (FEC), which is also referred to as Modified Condition/Decision Coverage (MC/DC), is a code
coverage metric often used by the DO-178B safety-critical software certification standard, as well as the DO-254 formal airborne electronic hardware
certification standard. This metric is stronger than condition and decision coverage. The formal definition of MC/DC as defined by DO-178B is: "Every point of entry
and exit in the program has been invoked at least once, every condition in a decision has taken all possible outcomes at least once, every decision in the program
has taken all possible outcomes at least once, and each condition in a decision has been shown to independently affect that decision's outcome. A condition is
shown to independently affect a decision's outcome by varying just that condition while holding fixed all other possible conditions." [3] It is worth noting that
completely closing Focused Expression Coverage can be non-trivial.

Finite-State Machine Coverage Today's code coverage tools are able to identify finite state machines within the RTL source code. Hence, this makes it
possible to automatically extract FSM code coverage metrics, such as the number of times each state of the state machine was entered,
the number of times the FSM transitioned from one state to each of its neighboring states, and even sequential arc coverage to identify state-visitation transitions.

There are generally three main steps involved in a code coverage flow, which include:
Instrument the RTL code to gather coverage
Run simulation to capture and record coverage metrics
Report and analyze the coverage results

Part of the analysis step is to identify coverage holes, and determine if the coverage hole is due to one of three
conditions:
Missing input stimulus required to activate the uncovered code
A bug in the design (or testbench) that is preventing the input stimulus from activating the uncovered code
Unused code for certain IP configurations or expected unreachable code related during normal operating conditions

Functional Coverage Metrics
The objective of functional verification is to determine if the design requirements, as defined in our specification, are functioning as intended. The objective of
measuring functional coverage is to measure verification progress with respect to the functional requirements of the design. That is, functional coverage helps us
answer the question: Have all specified functional requirements been implemented, and then exercised during simulation?

Benefit:
One of the value propositions of constrained-random stimulus generation is that the simulation environment can automatically generate thousands of tests that would
have normally required a significant amount of manual effort to create as directed tests. However, one of the problems with constrained-random stimulus generation
is that you never know exactly what functionality has been tested without the tedious effort of examining waveforms after a simulation run. Hence, functional
coverage was invented as a measurement to help determine exactly what functionality a simulation regression tested without the need for visual inspection of
waveforms.

For example, functional coverage can be implemented with a mechanism that links to specific requirements defined in a specification. Then, after a
simulation run, it is possible to automatically measure which requirements were checked by a specific directed or constrained-random test as well as automatically
determine which requirements were never tested.

Limitations:
Since functional coverage is not an implicit coverage metric, it cannot be automatically extracted. Hence, this requires the user to manually create the coverage
model. From a high-level, there are two different steps involved in creating a functional coverage model that need to be considered:

1. Identify the functionality or design intent that you want to measure: addressed through verification planning
2. Implementing the machinery to measure the functionality or design intent: coding the machinery for each of the coverage items identified in the verification
planning step (for example, coding a set of SystemVerilog covergroups for each verification objective identified in the verification plan).

Steps to Coverage Implementation


First, the required coverage must be defined, usually by manual analysis of the functional and architectural specification documents along with expert consideration of
the DUT's architecture. RTL implementers are likely to be aware of the relationships between specified activity and the operation of internal blocks, and will be able
to suggest important coverage scenarios that are not necessarily evident from a high-level functional spec.
Next, the verification team must identify how to capture the necessary information: an easy task for activity on a DUT interface, but often much more
challenging for coverage that captures DUT internal state, or timing relationships across multiple interfaces. At this stage it is also important to identify the triggering
and filtering criteria that will be used to determine whether coverage information should or should not be sampled.

Scoreboard and Functional Coverage: The main goal of a verification environment is to reach 100% coverage of the defined functional coverage spec
in the verification plan. Based on functional coverage analysis, the random tests are then constrained to focus on corner cases to complete the functional
check. Coverage is a generic term for measuring progress toward completing design verification. Simulations slowly paint the canvas of the design, as we try to cover all of
the legal combinations. The coverage tools gather information during a simulation and then post process it to produce a coverage report. You can use this report to
look for coverage holes and then modify existing tests or create new ones to fill the holes.

Types of Functional Coverage Metrics The functional behavior of any design, at least as observed from any interface within the
verification environment, consists of both data and temporal components. Hence, from a high-level, there are two main types of functional coverage measurement
we need to consider: cover groups and cover properties.

Cover Group A covergroup is like a user-defined type that encapsulates and specifies the coverage. It can be defined in a package, module, program,
interface, or class; once defined, multiple instances can be created using new(). Parameters to new() enable customization of different instances. In all cases, we must
explicitly instantiate the covergroup to start sampling. If the cover group is defined in a class, you do not make a separate instance name when instantiating it. A cover group comprises
cover points, options, formal arguments, and an optional trigger. It encompasses one or more data points, all of which are sampled at the same time.

The two major parts of functional coverage are the sampled data values and the time when they are sampled. When new values are ready (such as when
a transaction has completed), your testbench triggers the cover group.
To calculate the coverage for a cover point, you first have to determine the total number of possible values, also known as the domain. There may be one value
per bin or multiple values. Coverage is the number of bins that were hit divided by the total number of bins in the domain. A cover point that is a 3-bit variable has the
domain 0:7 and is normally divided into eight bins. If, during simulation, values belonging to seven bins are sampled, the report will show 7/8, or 87.5%, coverage for
this point. All these points are combined to show the coverage for the entire group, and then all the groups are combined to give a coverage percentage for the entire
simulation database.
With respect to functional coverage, the sampling of state values within a design model or on an interface is probably the easiest to understand. We refer
to this form of functional coverage as cover group modeling. It consists of state values observed on buses, groupings of interface control signals, as well as registers.
The point is that the values that are being measured occur at a single explicitly or implicitly sampled point in time. SystemVerilog covergroups are part of the
machinery we typically use to build the functional data coverage models, and the details are discussed in the block level design example and the discussion of the
corresponding example covergroup implementations.
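As a sketch of the machinery described above, a class-based covergroup for a 3-bit cover point might look like the following (the class, label, and signal names are assumptions, not from the text):

```systemverilog
// Sketch: covergroup for a 3-bit cover point (names are hypothetical).
class transaction;
  rand bit [2:0] kind;
endclass

class coverage;
  transaction tr;

  covergroup tr_cg;
    cp_kind: coverpoint tr.kind; // domain 0:7 -> eight automatic bins
  endgroup

  function new();
    tr_cg = new(); // must explicitly instantiate to start sampling
  endfunction

  // The testbench triggers sampling when new values are ready,
  // e.g. when a transaction has completed.
  function void sample(transaction t);
    tr = t;
    tr_cg.sample();
  endfunction
endclass
```

If, over a regression, tr.kind takes on seven of its eight possible values, this cover point reports 7/8, or 87.5%, coverage.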

Cover Property Modeling


With respect to functional coverage, temporal relationships between sequences of events are probably the hardest to reason about. However, ensuring that these
sequences of events are properly tested is important. We use cover property modeling to measure temporal relationships between sequences of events. Probably
the most popular example of cover properties involves the handshaking sequence between control signals on a bus protocol. Other examples include power-state
transition coverage associated with verifying a low-power design. Assertions and coverage properties are part of the machinery that we use to build temporal
coverage models, and are addressed in the bus protocol monitor example.

Assertion Coverage
The term assertion coverage has many meanings in the industry today. For example, some people define assertion coverage as the ratio of number of assertions to
RTL lines of code. However, assertion density is a more accurate term that is often used for this metric. For our discussion, we use the term assertion coverage to
describe an implementation of coverage properties using assertions.

Cross Coverage
Cross coverage is specified between cover points or variables, using the cross construct.
Expressions cannot be used directly in a cross; a coverage point must be explicitly defined first.
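A minimal sketch of the cross construct (clk, kind, and port_id are assumed module-level signals, not from the text):

```systemverilog
// Sketch: crossing two explicitly defined cover points.
bit [3:0] kind;
bit [1:0] port_id;

covergroup cg @(posedge clk);
  cp_kind: coverpoint kind;              // cover point must be defined first
  cp_port: coverpoint port_id;
  kind_x_port: cross cp_kind, cp_port;   // covers all kind/port bin combinations
endgroup

cg cg_inst = new(); // instantiate to start sampling
```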

CONSTRAINTS

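A minimal sketch of SystemVerilog randomization constraints (all names here are hypothetical):

```systemverilog
// Sketch: declarative constraints on random class members.
class bus_tr;
  rand bit [31:0] addr;
  rand bit [7:0]  len;

  // Keep addresses in a window and word-aligned
  constraint c_addr { addr inside {[32'h1000:32'h1FFF]}; addr[1:0] == 0; }
  // Bound the burst length
  constraint c_len  { len > 0; len <= 64; }
endclass

// Usage: randomize() solves all active constraints simultaneously.
// bus_tr tr = new();
// if (!tr.randomize()) $error("randomization failed");
```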
Clocking blocks have been introduced in SystemVerilog to address the problem of specifying the timing and synchronisation requirements of a
design in a testbench.
A clocking block is a set of signals synchronised on a particular clock. It basically separates the time related details from the structural, functional
and procedural elements of a testbench. It helps the designer develop testbenches in terms of transactions and cycles. Clocking blocks can only be
declared inside a module, interface or program.

To see how SystemVerilog's clocking construct works, consider a loadable, up/down binary counter:
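The counter code itself is not reproduced here; a hypothetical reconstruction of the testbench clocking block that the following discussion describes (signal names other than Clock and Q are assumptions):

```systemverilog
// Sketch: testbench clocking block for the counter.
// Directions are with respect to the testbench: the testbench drives the
// counter's control inputs, and samples the counter output Q.
module counter_tb;
  logic Clock;
  logic Load, Enable, UpDn;  // testbench drives these
  logic [7:0] Data;          // load value driven by the testbench
  logic [7:0] Q;             // counter output: a clocking block *input*

  clocking cb_counter @(posedge Clock);
    default input #1step output #4ns; // sample/drive skews
    output Load, Enable, UpDn, Data;  // no widths, just directions
    input  Q;
  endclocking
  // ...
endmodule
```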

The clocking construct is both the declaration and the instance of that declaration. Note that the signal directions in the clocking block within the testbench
are with respect to the testbench. So Q is an output of COUNTER, but a clocking input. Note also that widths are not declared in the clocking block, just the
directions.
The signals in the clocking block cb_counter are synchronised on the posedge of Clock, and by default all signals have a 4ns output (drive)
skew and a #1step input (sample) skew. The skew determines how many time units away from the clock event a signal is to be sampled or driven. Input
skews are implicitly negative (i.e. they always refer to a time before the clock), whereas output skews always refer to a time after the clock.
Clocking Block Drives
Clocking block outputs and inouts can be used to drive values onto their corresponding signals, at a certain clocking event and with the specified skew. An important
point to note is that a drive does not change the clock block input of an inout signal. This is because reading the input always yields the last sampled value, and not
the driven value.

Synchronous signal drives are processed as nonblocking assignments. If multiple synchronous drives are applied to the same clocking block output or inout at the
same simulation time, a run-time error is issued and the conflicting bits are set to X for 4-state ports or 0 for 2-state ports.

Here are some examples using the driving signals from the clocking block cb:

cb.Data[2:0] <= 3'h2; // Drive 3-bit slice of Data in current cycle


##1 cb.Data <= 8'hz; // Wait 1 Clk cycle and then drive Data
##2 cb.Data[1] <= 1; // Wait 2 cycles, then drive bit 1 of Data
cb.Data <= ##1 Int_Data; // Remember the value of Int_Data, and then
// drive Data 1 Clk cycle later
cb.Data[7:4] <= 4'b0101;
cb.Data[7:4] <= 4'b0011; // Error: driven value of Data[7:4] is 4'b0xx1
Clocking Blocks and Interfaces
This is an example presenting multiple clocking blocks using interfaces. A clocking block can use an interface to reduce the amount of code needed to connect the
testbench.

The interface signals will have the same direction as specified in the clocking block when viewed from the testbench side (e.g. modport TestR), and reversed when
viewed from the DUT (i.e. modport Ram). The signal directions in the clocking block within the testbench are with respect to the testbench, while a modport declaration
can describe either direction (i.e. the testbench or the design under test).

interface CtrlBus (input Clock);
  logic RWn;
  // RWn is output, as it is in the clocking block
  modport TestR (output RWn);
  // RWn is input, reversed from the clocking block
  modport Ram (input RWn);
endinterface

clocking cb2 @(posedge CtrlInt.Clock);
  default output #10;
  output RWn = CtrlInt.RWn; // Hierarchical expression
endclocking

Clocking block events


The clocking event of a clocking block can be accessed directly by using the clocking block name, e.g. @(cb) is equivalent to @(posedge Clk). Individual signals from
the clocking block can be accessed using the clocking block name and the dot (.) operator. All events are synchronised to the clocking block.
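For instance, using the clocking block cb and its signal Data from the drive examples above:

```systemverilog
// Sketch: accessing the clocking event and signals of clocking block cb.
initial begin
  @(cb);                    // wait for cb's clocking event, e.g. @(posedge Clk)
  $display("%h", cb.Data);  // read Data as sampled by the clocking block
  @(cb.Data);               // wait for a synchronised change on cb.Data
end
```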

Assertions are primarily used to validate the behaviour of a design. ("Is it working correctly?") They may also be used to provide functional
coverage information for a design ("How good is the test?"). Assertions can be checked dynamically by simulation, or statically by a separate property checker
tool, i.e. a formal verification tool that proves whether or not a design meets its specification. Such tools may require certain assumptions about the design's
behaviour to be specified.
In SystemVerilog there are two kinds of assertions: immediate (assert) and concurrent (assert property). Coverage statements (cover
property) are concurrent and have the same syntax as concurrent assertions, as do assume property statements. Another similar statement,
expect, is used in testbenches; it is a procedural statement that checks that some specified activity occurs. The three types of concurrent assertion
statement and the expect statement make use of sequences and properties that describe the design's temporal behaviour, i.e. behaviour over time, as
defined by one or more clocks.

Immediate Assertions
Immediate assertions are procedural statements and are mainly used in simulation. An assertion is basically a statement that something must be true, similar
to the if statement. The difference is that an if statement does not assert that an expression is true, it simply checks that it is true, e.g.:

if (A == B) ... // Simply checks if A equals B

assert (A == B); // Asserts that A equals B; if not, an error is generated

If the conditional expression of the immediate assert evaluates to X, Z or 0, then the assertion fails and the simulator writes an error message.
An immediate assertion may include a pass statement and/or a fail statement. In our example the pass statement is omitted, so no action is taken when
the assert expression is true. If the pass statement exists:

assert (A == B) $display ("OK. A equals B");

it is executed immediately after the evaluation of the assert expression. The statement associated with an else is called a fail statement and is executed
if the assertion fails:

assert (A == B) $display ("OK. A equals B");

else $error("It's gone wrong");

Note that you can omit the pass statement and still have a fail statement:

assert (A == B) else $error("It's gone wrong");

The failure of an assertion has a severity associated with it. There are three severity system tasks that can be included in the fail statement to specify a
severity level: $fatal, $error (the default severity) and $warning. In addition, the system task $info indicates that the assertion failure carries no
specific severity.
Here are some examples:

ReadCheck: assert (data === correct_data)

else $error("memory read error");

Igt10: assert (I > 10)

else $warning("I is less than or equal to 10");

The pass and fail statements can be any legal SystemVerilog procedural statement. They can be used, for example, to write out a message, set an error
flag, increment a count of errors, or signal a failure to another part of the testbench.

AeqB: assert (a === b)

else begin error_count++; $error("A should equal B"); end

Concurrent Assertions
The behaviour of a design may be specified using statements similar to these:
"The Read and Write signals should never be asserted together."
"A Request should be followed by an Acknowledge occurring no more than two clocks after the Request is asserted."
Concurrent assertions are used to check behaviour such as this. These are statements that assert that specified properties must be true. For example,

assert property (!(Read && Write));

asserts that the expression Read && Write is never true at any point during simulation.
Properties are built using sequences. For example,

assert property (@(posedge Clock) Req |-> ##[1:2] Ack);

where Req is a simple sequence (it's just a boolean expression) and ##[1:2] Ack is a more complex sequence expression, meaning that Ack is true on
the next clock, or on the one following (or both). |-> is the implication operator, so this assertion checks that whenever Req is asserted, Ack must be
asserted on the next clock, or the following clock.
Concurrent assertions like these are checked throughout simulation. They usually appear outside any initial or always blocks in modules, interfaces and
programs. (Concurrent assertions may also be used as statements in initial or always blocks. A concurrent assertion in an initial block is only tested on the
first clock tick.)
The first assertion example above does not contain a clock. Therefore it is checked at every point in the simulation. The second assertion is only checked
when a rising clock edge has occurred; the values of Req and Ack are sampled on the rising edge of Clock.

Implication
The implication construct (|->) allows a user to monitor sequences based on satisfying some criteria, e.g. attach a precondition to a sequence and evaluate
the sequence only if the condition is successful. The left-hand side operand of the implication is called the antecedent sequence expression, while the right-
hand side is called the consequent sequence expression.
If there is no match of the antecedent sequence expression, implication succeeds vacuously by returning true. If there is a match, for each successful match
of the antecedent sequence expression, the consequent sequence expression is separately evaluated, beginning at the end point of the match.
There are two forms of implication: overlapped using operator |->, and non-overlapped using operator |=>.
For overlapped implication, if there is a match for the antecedent sequence expression, then the first element of the consequent sequence expression is
evaluated on the same clock tick.

s1 |-> s2;

In the example above, if the sequence s1 matches, then sequence s2 must also match. If sequence s1 does not match, then the result is true.
For non-overlapped implication, the first element of the consequent sequence expression is evaluated on the next clock tick.

s1 |=> s2;

The expression above is basically equivalent to:

`define true 1

s1 ##1 true |-> s2;

where `true is a boolean expression, used for visual clarity, that always evaluates to true.

Properties and Sequences


In these examples we have been using, the properties being asserted are specified in the assert property statements themselves. Properties may also
be declared separately, for example:

property not_read_and_write;

not (Read && Write);

endproperty

assert property (not_read_and_write);

Complex properties are often built using sequences. Sequences, too, may be declared separately:

sequence request

Req;

endsequence

sequence acknowledge

##[1:2] Ack;

endsequence

property handshake;

@(posedge Clock) request |-> acknowledge;

endproperty

assert property (handshake);

Assertion Clocking
Concurrent assertions (assert property and cover property statements) use a generalised model of a clock and are only evaluated when a clock
tick occurs. (In fact the values of the variables in the property are sampled right at the end of the previous time step.) Everything in between clock ticks is
ignored. This model of execution corresponds to the way an RTL description of a design is interpreted after synthesis.
A clock tick is an atomic moment in time and a clock ticks only once at any simulation time. The clock can actually be a single signal, a gated clock (e.g. (clk
&& GatingSig)) or other more complex expression. When monitoring asynchronous signals, a simulation time step corresponds to a clock tick.
The clock for a property can be specified in several ways:
o Explicitly specified in a sequence:

sequence s;

@(posedge clk) a ##1 b;

endsequence

property p;

a |-> s;

endproperty

assert property (p);

o Explicitly specified in the property:

property p;

@(posedge clk) a ##1 b;

endproperty

assert property (p);

o Explicitly specified in the concurrent assertion:

assert property (@(posedge clk) a ##1 b);

o Inferred from a procedural block:

property p;

a ##1 b;

endproperty

always @(posedge clk) assert property (p);

o From a clocking block (see the Clocking Blocks tutorial):

clocking cb @(posedge clk);

property p;

a ##1 b;

endproperty

endclocking

assert property (cb.p);

o From a default clock (see the Clocking Blocks tutorial):

default clocking cb;
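A minimal sketch of how a default clocking block supplies the clock for an otherwise unclocked assertion (module and signal names are assumptions):

```systemverilog
// Sketch: with a default clocking block in scope, concurrent assertions
// that specify no clock implicitly use its clocking event.
module m (input logic clk, a, b);
  default clocking cb @(posedge clk);
  endclocking

  // Clocked by posedge clk via the default clocking block
  a_then_b: assert property (a |=> b);
endmodule
```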

Handling Asynchronous Resets
In the following example, the disable iff clause allows an asynchronous reset to be specified.

property p1;

@(posedge clk) disable iff (Reset) not b ##1 c;

endproperty

assert property (p1);

The not negates the result of the sequence following it. So, this assertion means that if Reset becomes true at any time during the evaluation of the
sequence, then the attempt for p1 is a success. Otherwise, the sequence b ##1 c must never evaluate to true.

Sequences
A sequence is a list of boolean expressions in a linear order of increasing time. The sequence is true over time if the boolean expressions are true at the
specific clock ticks. The expressions used in sequences are interpreted in the same way as the condition of a procedural if statement.
Here are some simple examples of sequences. The ## operator delays execution by the specified number of clocking events, or clock cycles.

a ##1 b // a must be true on the current clock tick

// and b on the next clock tick

a ##N b // Check b on the Nth clock tick after a

a ##[1:4] b // a must be true on the current clock tick and b

// on some clock tick between the first and fourth

// after the current clock tick

The * operator is used to specify a consecutive repetition of the left-hand side operand.

a ##1 b [*3] ##1 c // Equiv. to a ##1 b ##1 b ##1 b ##1 c

(a ##2 b) [*2] // Equiv. to (a ##2 b ##1 a ##2 b)

(a ##2 b)[*1:3] // Equiv. to (a ##2 b)

// or (a ##2 b ##1 a ##2 b)

// or (a ##2 b ##1 a ##2 b ##1 a ##2 b)

The $ operator can be used to extend a time window to a finite, but unbounded range.

a ##1 b [*1:$] ##1 c // E.g. a b b b b c

The [-> or goto repetition operator specifies a non-consecutive sequence.

a ##1 b[->1:3] ##1 c // E.g. a !b b b !b !b b c

This means a is followed by any number of clocks where b is false, and b is true between one and three times, the last time being the clock before c is true.
The [= or non-consecutive repetition operator is similar to goto repetition, but the expression (b in this example) need not be true in the clock cycle before c is
true.

a ##1 b [=1:3] ##1 c // E.g. a !b b b !b !b b !b !b c

Combining Sequences There are several operators that can be used with sequences:
The binary operator and is used when both operand expressions are expected to succeed, but the end times of the operand expressions can be different.
The end time of the and operation is the end time of the sequence that terminates last. A sequence succeeds (i.e. is true over time) if the boolean expressions
it contains are true at the specific clock ticks.

s1 and s2 // Succeeds if s1 and s2 succeed. The end time is the

// end time of the sequence that terminates last

If s1 and s2 are sampled booleans and not sequences, the expression above succeeds if both s1 and s2 are evaluated to be true.
The binary operator intersect is used when both operand expressions are expected to succeed, and the end times of the operand expressions must be
the same.

s1 intersect s2 // Succeeds if s1 and s2 succeed and if end time of s1 is

// the same with the end time of s2

The operator or is used when at least one of the two operand sequences is expected to match. The sequence matches whenever at least one of the operands
is evaluated to true.

s1 or s2 // Succeeds whenever at least one of two operands s1

// and s2 is evaluated to true

The first_match operator matches only the first match of possibly multiple matches for an evaluation attempt of a sequence expression. This allows all
subsequent matches to be discarded from consideration. In this example:

sequence fms;

first_match(s1 ##[1:2] s2);

endsequence

whichever of the (s1 ##1 s2) and (s1 ##2 s2) matches first becomes the result of sequence fms.
The throughout construct is an abbreviation for writing:

(Expression) [*0:$] intersect SequenceExpr

i.e. Expression throughout SequenceExpr means that Expression must evaluate true at every clock tick during the evaluation
of SequenceExpr.
The within construct is an abbreviation for writing:

(1[*0:$] ##1 SeqExpr1 ##1 1[*0:$]) intersect SeqExpr2

i.e. SeqExpr1 within SeqExpr2 means that SeqExpr1 must occur at least once entirely within SeqExpr2 (both the start and end points
of SeqExpr1 must be between the start and end points of SeqExpr2).

Variables in Sequences and Properties


Variables can be used in sequences and properties. A common use for this occurs in pipelines:

`define true 1
property p_pipe;
logic v;
@(posedge clk) (`true,v=DataIn) ##5 (DataOut === v);
endproperty

In this example, the variable v is assigned the value of DataIn unconditionally on each clock. Five clocks later, DataOut is expected to equal the assigned
value. Each invocation of the property (here there is one invocation on every clock) has its own copy of v. Notice the syntax: the assignment to v is separated
from a sequence expression by a comma, and the sequence expression and variable assignment are enclosed in parentheses.
Coverage Statements
In order to monitor sequences and other behavioural aspects of a design for functional coverage, cover property statements can be used. The syntax of
these is the same as that of assert property. The simulator keeps a count of the number of times the property in the cover property statement holds or
fails. This can be used to determine whether or not certain aspects of the design's functionality have been exercised.

module Amod2(input bit clk);


bit X, Y;
sequence s1;
@(posedge clk) X ##1 Y;
endsequence

CovLabel: cover property (s1);
...
endmodule

SystemVerilog also includes covergroup statements for specifying functional coverage. These are introduced in the Constrained-Random Verification
Tutorial.

Assertion System Functions


SystemVerilog provides a number of system functions, which can be used in assertions.
$rose, $fell and $stable indicate whether or not the value of an expression has changed between two adjacent clock ticks. For example,

assert property
(@(posedge clk) $rose(in) |=> detect);

asserts that if in changes from 0 to 1 between one rising clock and the next, detect must be 1 on the following clock.
This assertion,

assert property
(@(posedge clk) enable == 0 |=> $stable(data));

states that data shouldn't change whilst enable is 0.


The system function $past returns the value of an expression in a previous clock cycle. For example,

assert property

(@(posedge clk) disable iff (reset)

enable |=> q == $past(q+1));

states that q increments, provided reset is low and enable is high.


Note that the argument to $past may be an expression, as shown above.
The system functions $onehot and $onehot0 are used for checking one-hot encoded signals. $onehot(expr) returns true if exactly one bit of expr is
high; $onehot0(expr) returns true if at most one bit of expr is high.

assert property (@(posedge clk) $onehot(state));

There are other system functions.

Binding
We have seen that assertions can be included directly in the source code of the modules in which they apply. They can even be embedded in procedural
code. Alternatively, verification code can be written in a separate program, for example, and that program can then be bound to a specific module or module
instance.
For example, suppose there is a module for which assertions are to be written:

module M (...);
// The design is modelled here
endmodule

The properties, sequences and assertions for the module can be written in a separate program:

program M_assertions(...);
// sequences, properties, assertions for M go here
endprogram

This program can be bound to the module M like this:

bind M M_assertions M_assertions_inst (...);

The syntax and meaning of M_assertions is the same as if the program were instanced in the module itself:

module M (...);
// The design is modelled here
M_assertions M_assertions_inst (...);
endmodule

Universal Verification Methodology (UVM)
What is UVM?
UVM (Universal Verification Methodology) is a verification methodology introduced by Accellera, based on the Open Verification Methodology (OVM). It
enables functional verification through a supporting library of SystemVerilog code.
What are benefits of using UVM?
UVM offers a complete verification environment composed of reusable components and is part of a constrained-random,
coverage-driven methodology. Traditional HDL-based testbenches, by contrast, might wiggle a few input pins and rely on
manual inspection to check correct operation. Even when automated, they offer no quantifiable way to
determine verification progress. Given the complexity of current designs, a purely random (unconstrained) approach is not adequate to
meet tight schedules.
UVM leverages the object-oriented capabilities of SystemVerilog, such as classes, constraints and covergroups, to ease the
difficulty of verifying a complex design.
UVM is primarily simulation based. However, it can also be used alongside assertion-based, emulation or hardware-acceleration
approaches. These other approaches typically use Verilog, SystemVerilog or SystemC at abstraction levels such as
behavioural, gate level or register transfer level.

What is Transaction-Based Verification? How is it done? Where is it used?


What is Transaction-Based Verification (TBV)?
TBV is a concept used in hardware emulation that raises the level of verification abstraction from a wire-level (or pin-
level) interface to the transaction level, allowing runs several million times faster than RTL simulation
It simplifies the communication between the testbench and DUT so a design team can access the full performance
benefit of the emulator
TBV thus helps accelerate SoC verification by offering multiple orders of magnitude improvement in verification
performance
How is TBV carried out?
With TBV, the DUT is loaded onto the emulator, and the testbench resides on a computer. But instead of a wire-level
interface, TBV uses a high-performance transaction-level interface
The communication between the testbench, now working at a protocol level, and the DUT, which still requires a
wire-level interface, is accomplished through what is called a transactor. The transactor converts the high-level
commands from the testbench into the wire-level, protocol-specific sequences required by the DUT, and vice versa
The main point here is that all the wire-level communications are wholly contained within the emulator itself and run
orders of magnitude faster as a result
TBV also eliminates the need for rate adapters and physical interfaces

Another benefit of TBV is that it allows the testbench to stream data to the DUT, which the transactor buffers
automatically. This further speeds up the execution of the testbench.
o With this methodology, it is possible to have multiple transactions active across multiple transactors
o Together, these transactors enable the emulator to process data continuously, which dramatically
increases overall performance towards that of a pure ICE (in-circuit emulation) environment.
A point to note here is that in TBV, the back-end portion of the transactor and the DUT are located within the emulator. This
mandates that they both be written in synthesizable RTL.
Where is TBV used?
TBV can be used throughout the verification flow, from unit (block) level to SoC level. Common applications include:
Verification of large blocks, subsystems or entire SoCs
Driver development
Early hardware/software bring-up (this includes firmware, drivers, and OSs)
Full-chip power analysis and estimation

Arbiter verification
An arbiter is a commonly used design in circuits to control the access to a shared resource among multiple clients.

SOURCE: http://rtlery.com/sites/default/files/queueing_fifos_and_arbiter.png

Arbitration policies

Round Robin This policy is generally used to improve fairness. Fairness generally implies granting all clients a good
chance of running on the shared resource. A particular client will not be considered for arbitration if it has been
serviced and there are other clients having outstanding requests.
Priority This policy guarantees that the important clients run first when the latency or application requirements are
known.
First Come First Serve (FCFS) This is a variation of the priority policy where the priority is granted to the client
making the request first.

Scenarios to verify
Apart from functionally verifying the arbiter algorithm stand-alone, the arbiter should be verified at the application level by writing
assertions. Adding assertions also ensures that the application requirements are met in terms of fairness and performance.

If only a single client requests, that client should be serviced.


In round robin arbiters, a client which has been serviced should not receive a grant again until the other clients having
outstanding requests are serviced at least once. This will ensure that the clients do not suffer from starvation.
In priority arbiters, a client having higher priority should always win arbitration over a lower priority client.
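The round-robin fairness requirement above can be sketched as a small Python model (an illustrative reference model only, not RTL; the grant-pointer scheme and function name are assumptions):

```python
def round_robin_arbiter(requests, last_granted):
    """Grant one of n requesting clients, starting the search just
    after the client that was granted last time. This guarantees a
    serviced client cannot win again while others are still waiting."""
    n = len(requests)
    for i in range(1, n + 1):
        client = (last_granted + i) % n
        if requests[client]:
            return client
    return None  # no client is requesting

# Clients 0 and 2 request; client 0 was granted last, so client 2 wins.
assert round_robin_arbiter([1, 0, 1], last_granted=0) == 2
# After client 2 is granted, the search pointer wraps back to client 0.
assert round_robin_arbiter([1, 0, 1], last_granted=2) == 0
```

A scoreboard in the testbench can run this model alongside the DUT and compare grants cycle by cycle.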

What is REGISTER RENAMING?
Register renaming is a technique deployed in Out-Of-Order Processors (OOO). It eliminates the false data dependencies arising
from the reuse of architectural registers by successive instructions that do not have any real data dependencies between them.
Why use register renaming?
As mentioned earlier, it eliminates false (WAR and WAW) dependencies. A description of false dependencies is here.
How is register renaming implemented?
When possible, the compiler detects distinct instructions and tries to assign them to different registers. However, only a
finite number of register names can be used in the assembly code. Many high-performance CPUs have more physical
registers than may be named directly in the instruction set, so they rename registers in hardware to achieve additional
parallelism. In all renaming schemes, the machine converts the architectural registers referenced in the instruction stream into
tags. Where the architectural registers might be specified by 3 to 5 bits, the tags are usually 6 to 8 bit numbers. Because the
size of a register file generally grows as the square of the number of ports, the rename file is usually physically large and
consumes significant power.
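The tag-conversion step described above can be sketched with a simplified register alias table (RAT) model in Python. This is a toy sketch under stated assumptions: 4 architectural registers, an unbounded supply of physical tags, and no free-list or commit handling:

```python
def rename(instructions, num_arch_regs=4):
    """Map each architectural destination register to a fresh physical
    tag; reads use the latest mapping. Because the two writes to the
    same architectural register get different tags, WAR and WAW
    dependences between them disappear."""
    rat = {r: r for r in range(num_arch_regs)}  # register alias table
    next_tag = num_arch_regs
    renamed = []
    for dst, srcs in instructions:
        srcs = [rat[s] for s in srcs]   # read current mappings first
        rat[dst] = next_tag             # allocate a fresh physical tag
        renamed.append((next_tag, srcs))
        next_tag += 1
    return renamed

# r1<-.. ; r2<-r1 ; r1<-..  (WAW/WAR on r1 in the architectural code)
prog = [(1, []), (2, [1]), (1, [])]
# The two writes to r1 now target distinct physical tags 4 and 6.
assert rename(prog) == [(4, []), (5, [4]), (6, [])]
```

Real hardware additionally reclaims tags via a free list once the renaming instruction commits.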

Superscalar and VLIW processors: In VLIW processors, the compiler decides which instructions (words) to group together
into a Very Long Instruction Word. The onus is on the compiler to group and execute independent
instructions in parallel. Hardware implementation is therefore simplified, leading to lower power consumption.

For superscalar processors, the decision is made by the hardware at run time, so the hardware implementation is more complicated.
Superscalar processors have multiple functional units to execute the same types of instructions in parallel.
Example: 4 adders can execute 4 addition instructions in parallel.
Disadvantages of VLIW:
It requires compiler support, and extensive compiler work is needed to make the best use of the hardware.
It requires the compiler to handle branch prediction and to add recovery code.
VLIW compilers can induce code bloat when there is a lot of dependence among the instructions, which can lead to
functional units executing NOPs.

Direct-Mapped and Fully Associative Caches


Direct-Mapped Caches
It is fast
Design is simple
It has maximum number of conflict misses
The best way to visualize it is as a 1-way associative cache
Fully Associative Caches
It is slow
Design is complex requiring higher number of comparators
There are no conflict misses here
Consumes more area and power
Trade-off
A set-associative cache has fewer conflict misses than a direct-mapped cache and requires fewer comparators than a fully
associative cache. It therefore has advantages over both in terms of power consumption and performance, so to get the best
features of both designs it makes sense to use a set-associative cache.
Additionally, as a rule of thumb when weighing performance against power:
For an n-way set-associative cache, you need n comparators to compare tags within a set.

PIPELINING and its Pros and Cons

What is pipelining?
In general, pipelining refers to a set of data processing elements connected in series and executed in parallel, where the output
of one element is the input of the next one.

How does pipelining affect processor clock?


As the pipeline is made deeper, a particular step may be implemented with simpler circuitry. This in turn, lets the processor
clock run faster.

What are the pros of pipelining?


Increase in instruction throughput (this occurs due to multiple operations being performed simultaneously).
Higher clock frequency as indicated above.

What are the cons of pipelining?


Does not reduce instruction latency (defined as the time to complete a single instruction from start to finish).
In fact, it may increase latency due to the additional overhead of breaking the computation into separate
steps; worse, the pipeline may stall (or even need to be flushed) due to mispredicted branches and exceptions.
A pipelined system typically requires buffer storage to pass on the output of each stage to the next. This buffer
storage requires additional area on the chip and adds to the latency of execution.
Pipelining also makes the static timing analysis of the datapath more complicated since the setup/hold time
requirements of the buffers need to be taken into account during the design of the processor.
It may also require more resources such as processing units, memory, etc. than a regular single-cycle datapath,
because its stages cannot reuse the resources of a previous stage.
May lead to an increase in the time it takes for an instruction to complete.
Hence, pipelining increases throughput at the cost of latency. It is frequently used in today's CPUs,
but avoided in real-time systems, in which latency is a hard constraint.
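The throughput-vs-latency trade-off above can be made concrete with a little arithmetic. A minimal sketch, assuming ideal pipelining (unit-time stages, no stalls, no stage-register overhead):

```python
def pipeline_time(n_instructions, n_stages, stage_delay):
    """Total time for n instructions on an ideal k-stage pipeline:
    k cycles to fill the pipe, then one instruction completes per cycle."""
    cycles = n_stages + (n_instructions - 1)
    return cycles * stage_delay

# Assumed numbers: 5 ns of total work, split into 5 stages of 1 ns each.
unpipelined = 100 * 5                 # one instruction at a time: 500 ns
pipelined = pipeline_time(100, 5, 1)  # 5 + 99 cycles
assert pipelined == 104               # nearly 5x throughput improvement
assert pipeline_time(1, 5, 1) == 5    # single-instruction latency is NOT reduced
```

Note that latch/setup overhead per stage makes the real speedup smaller than the stage count.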

Data Hazards in Pipelining

Questions:
What are the primary types of Register Dependencies/Data Hazards in a Pipelined System?
What are the remedies for each of these hazards?
Which type of data hazard is commonly observed in an in-order pipeline?
Here, we will use the term Read for a Load and Write for a Store operation.

1. Read After Write (RAW): This is observed when a read must be performed from an address/register before an earlier write to it has completed. This is the
hazard commonly observed even in an in-order pipeline. It is also known as True or Flow dependence.
pseudo-Assembly example:

1. R1 <- 10
2. R2 <- R1
Solution: In-order execution of the code sequence (with interlocks) prevents incorrect results, but it typically stalls the
pipeline for a few clocks until the write completes. Data forwarding (bypassing) is therefore the standard optimization for this hazard.

2. Write After Write (WAW): This is observed when you write to the same address/register in consecutive lines of code. It is also known
as Output dependence.
pseudo-Assembly example:

1. R1 <- 10
2. R1 <- 11
Solution: Squashing the earlier write (keeping only the updated value) within a structure like a write/store buffer is a microarchitectural optimization that
prevents multiple stores to the same address. Register renaming, a very efficient technique employed in modern out-of-order systems, can also be
used to prevent this hazard.

3. Write After Read (WAR) This is observed when you need to write to a memory location after a read. Hence, you need to wait till the old
value has been read before performing another write to the same address. It is also known as Anti Dependence or False Dependence.
pseudo-Assembly example:

1. R2 <- R1
2. R1 <- 10
Solution: Similar to WAW hazards, Register Renaming can also be used.
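The three hazard classes above can be detected mechanically from destination/source register sets. A minimal sketch (the tuple encoding of instructions is an assumption for illustration):

```python
def classify_hazards(instructions):
    """Classify register dependences between every ordered pair of
    instructions. Each instruction is (dest_reg, [source_regs])."""
    hazards = []
    for i, (d1, s1) in enumerate(instructions):
        for j in range(i + 1, len(instructions)):
            d2, s2 = instructions[j]
            if d1 in s2:
                hazards.append((i, j, "RAW"))  # j reads what i wrote
            if d2 == d1:
                hazards.append((i, j, "WAW"))  # j rewrites i's destination
            if d2 in s1:
                hazards.append((i, j, "WAR"))  # j writes what i read
    return hazards

# R1 <- 10 ; R2 <- R1 : true (flow) dependence.
assert classify_hazards([(1, []), (2, [1])]) == [(0, 1, "RAW")]
# R2 <- R1 ; R1 <- 10 : anti (WAR) dependence.
assert classify_hazards([(2, [1]), (1, [])]) == [(0, 1, "WAR")]
```

Only RAW is a true dependence; the WAW/WAR cases this detector flags are exactly the ones renaming removes.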

Load Store Queue

Today's processors use mechanisms based on load-store queues to resolve ambiguous dependences and to recover when
a dependence is violated. If you haven't read about dependences in out-of-order processors, check it out here first.

Avoiding WAR and WAW dependencies


Values from store instructions are buffered in a store queue until they retire. Without store buffering,
stores could not execute until all older possibly exception-causing instructions had executed (without causing an
exception) and all previous branch directions were known. Forcing stores to wait until branch directions and exceptions
are known significantly reduces out-of-order aggressiveness, limiting ILP (instruction-level parallelism) and thus
performance.
When a store retires, it writes its value to the memory system. This avoids the WAR and WAW dependences
in which an earlier load would receive an incorrect value from the memory system because a later store was allowed to
execute before it.
Buffering stores until retirement avoids WAW and WAR dependencies but introduces a new issue. Consider the
following scenario: a store executes and buffers its address and data in the store queue. A few instructions later, a
load executes that reads from the same memory address to which the store just wrote. If the load reads its data from
the memory system, it will read an old value that would have been overwritten by the preceding store. The data
obtained by the load will be incorrect.
Store to load (S to L) forwarding To solve the above problem, processors employ a technique called store-to-load
forwarding using the store queue. In addition to buffering stores until retirement, the store queue serves a second
purpose: forwarding data from completed but not-yet-retired (in-flight) stores to later loads. Rather than a simple
FIFO queue, the store queue is really a Content-Addressable Memory (CAM) searched using the memory
address. When a load executes, it searches the store queue for in-flight stores to the same address that
are logically earlier in program order. If a matching store exists, the load obtains its data value from that store
instead of the memory system. If there is no matching store, the load accesses the memory system as usual; any
preceding, matching stores must have already retired and committed their values. This technique allows loads to
obtain correct data if their producer store has completed but not yet retired.
o Multiple stores to the load's memory address may be present in the store queue. To handle this case, the
store queue is priority encoded to select the latest store that is logically earlier than the load in program
order. The latest store can be determined by attaching a timestamp to the instructions as they
are fetched and decoded, or alternatively by knowing the relative position of the load with respect to the
stores within the store queue.
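The forwarding search described above can be sketched in Python. This is a toy model under stated assumptions: real hardware uses a CAM searched in parallel, and here the queue is assumed to hold only stores older than the load, kept in program order (oldest first):

```python
def load_value(addr, store_queue, memory):
    """Store-to-load forwarding: a load searches the in-flight stores
    for the latest store to its address; only if none matches does it
    read the memory system. Entries are (program_order, address, data),
    oldest first, all logically earlier than the load."""
    match = None
    for order, st_addr, data in store_queue:
        if st_addr == addr:
            match = data        # later iterations override: latest store wins
    return match if match is not None else memory[addr]

memory = {0x40: 7}
sq = [(1, 0x40, 11), (3, 0x40, 22)]      # two in-flight stores to 0x40
assert load_value(0x40, sq, memory) == 22  # forwarded from the latest store
assert load_value(0x40, [], memory) == 7   # no match: read the memory system
```

The "latest match wins" scan plays the role of the priority encoding described in the text.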

Detecting RAW dependence violations


Modern out-of-order CPUs use a number of techniques to detect RAW dependence violations, all of which
require tracking in-flight loads from execution until retirement. When a load executes, it accesses the memory
system and/or the store queue to obtain its data value. Then its address and data are buffered in a load queue until
retirement. The load queue is similar in structure and function to the store queue; in fact, in some processors it is
combined with the store queue in a single structure called a load-store queue, or LSQ.
With this technique, the load queue keeps track of all in-flight loads. Similar to the store queue, it is a CAM searched
using the memory access address. When a store executes, it searches the load queue for completed loads
from the same address that are logically later in program order.
If such a matching load exists, it must have executed before the store, and thus read a stale value from the memory
system/store queue; any instructions that used the load's value have also used bad data. If such a violation is
detected, the load is marked as violated in the retirement buffer. The store remains in the store queue and retirement
buffer and retires normally, committing its value to the memory system when it retires.

Solution
When the violated load reaches the retirement point, the processor flushes the pipeline and restarts execution
from the load instruction. At this point, all previous stores have committed their values to the memory system, so the
load instruction will now read the correct value, and any dependent instructions will re-execute
using the correct value.

Instruction-Level Parallelism

What is Instruction-level parallelism (ILP)? A measure of how many instructions a processor can execute simultaneously.
What are the approaches to instruction level parallelism?
Hardware
o Also known as dynamic parallelism
o Processor decides which instructions to execute in parallel at run time
o The Pentium processor implements dynamic parallelism
Software
o Also known as static parallelism
o Compiler decides which instructions to execute in parallel at compile time
o The Itanium processor implements static parallelism
ILP Example: e = a + b //1
f = c + d //2
m = e * f //3
The result of instruction 3 cannot be calculated until the results of instructions 1 and 2 are available, since it depends on them. However, instructions 1 and 2 do
not depend on any other operation and hence can be calculated in parallel.

How do you calculate ILP?


ILP = (Number of Instructions)/(Number of cycles).
There is an ILP of 3/2 in the above example since 3 instructions are executed in 2 cycles (instruction 1 and 2 can be overlapped in the same cycle).
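The calculation above can be sketched as a tiny critical-path model in Python. Assumptions: unit latency per instruction and unlimited issue width, so an instruction's completion cycle is one more than the latest completion cycle among its dependences:

```python
def ilp(instructions):
    """ILP = instructions / cycles. Each instruction is (name, [deps]);
    its cycle is 1 + the latest cycle of anything it depends on."""
    done_cycle = {}
    for name, deps in instructions:
        done_cycle[name] = 1 + max((done_cycle[d] for d in deps), default=0)
    return len(instructions) / max(done_cycle.values())

# e = a + b ; f = c + d ; m = e * f  ->  3 instructions in 2 cycles
prog = [("e", []), ("f", []), ("m", ["e", "f"])]
assert ilp(prog) == 1.5
```

A fully serial chain gives ILP = 1; independent instructions push it towards the issue width.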
What are some common micro-architectural techniques used to exploit ILP?
Instruction Pipelining: Execution of multiple instructions can be partially overlapped.
Superscalar Execution, Very Long Instruction Word (VLIW), and Explicitly Parallel Instruction Computer (EPIC): Multiple execution units
are used to execute multiple instructions in parallel.
Out-of-order execution (OOO): Instructions execute in any order that does not violate data dependencies.
Register Renaming: A technique used to avoid unnecessary serialization of program operations imposed by the reuse of registers by those operations.
Branch Prediction: Used to avoid stalling until branch/control dependencies are resolved.

Translation Lookaside Buffer

What is a Translation Lookaside Buffer?

A Translation Lookaside Buffer (TLB) is essentially a cache of the Page Table.


It resides in the on-chip MMU and helps improve virtual address translation speed.
Today's desktop, laptop and server processors typically include more than one TLB in the MMU.
A TLB is present in virtually any hardware that utilizes paged or segmented virtual memory.

TLB miss handling

The CPU accesses the main memory to do a Page Table Walk (PTW) in case of a TLB miss.
In case of a TLB miss, the best case is that the desired translation entry is present in the page table, but the virtual-to-
physical translation is simply not cached in a TLB. In this case, all that needs to be done is to look up the main-memory page table, find
the requested translation and insert it into the TLB.
The worst case is that the page table walk also fails to find the entry, which leads to a page fault: the page does not exist
in memory and needs to be first brought into memory by doing an IO read operation
from disk. After that, the page table is updated with a Page Table Entry (PTE) reflecting the new page that has just been
brought into memory. The faulting memory operation which originally led to the TLB miss is then retried; it misses in the TLB
again, but this time the Page Table Walk brings the entry into the TLB, eventually leading to a TLB hit.
To do a Page Table Walk, the CPU first reads the Page Table Base Register (PTBR) (CR3 register on x86 for instance) to find the
starting address for the Page Table and looks up the entry in the Page Table by looking up the Virtual Page Number and the Offset
from the virtual address.
Due to the latency involved in accessing a lower level of the memory hierarchy (DRAM or Disk), these operations are time consuming
so a well-functioning TLB is of prime importance.
This sequence of operations also shows that a TLB miss can be more expensive than an instruction or data cache miss, since it
requires not just a load from main memory, but a page walk, involving several loads.
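The miss-handling flow above can be sketched in Python. A toy model under stated assumptions: 4 KiB pages, a single-level page table, and dictionaries standing in for the hardware TLB and the in-memory page table:

```python
PAGE_SIZE = 4096  # assumed 4 KiB pages

def translate(vaddr, tlb, page_table):
    """Translate a virtual address: try the TLB first; on a miss, walk
    the page table and fill the TLB; if the page table has no entry
    either, signal a page fault for the OS to handle."""
    vpn, offset = divmod(vaddr, PAGE_SIZE)
    if vpn in tlb:                    # TLB hit: fast path
        return tlb[vpn] * PAGE_SIZE + offset
    if vpn in page_table:             # TLB miss: page table walk
        tlb[vpn] = page_table[vpn]    # fill the TLB with the translation
        return tlb[vpn] * PAGE_SIZE + offset
    raise RuntimeError("page fault: OS must bring the page into memory")

tlb, pt = {}, {5: 9}                  # VPN 5 maps to physical frame 9
assert translate(5 * PAGE_SIZE + 12, tlb, pt) == 9 * PAGE_SIZE + 12  # miss + walk
assert 5 in tlb                       # subsequent accesses now hit the TLB
```

Real page tables are multi-level, so the "walk" costs several dependent memory loads, which is exactly why the TLB matters.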

Multiple TLBs

Similar to caches, TLBs may have multiple levels.


Today's CPUs typically have multiple TLBs.
For example, a small L1 TLB (potentially fully associative) that is extremely fast and a larger L2 TLB that is
somewhat slower.
TLBs can be unified (one TLB for both instructions and data), while a split TLB configuration (two different TLBs, one for
instructions and the other for data) is also used.

Hardware/Software Managed TLBs


Hardware Managed TLBs:

With hardware managed TLBs, the CPU walks the page tables.
In case of a Page Table Fault, the CPU raises a page fault exception, which the operating system must handle.
With a hardware-managed TLB, the format of the TLB entries is not visible to software, and can change from CPU to CPU without
causing loss of compatibility for the programs.

Software Managed TLBs:


With software-managed TLBs, a TLB miss generates a TLB miss exception, and the OS is responsible for walking the page tables and performing the
translation in software.
The OS loads the translation into the TLB and restarts the program from the instruction that caused the TLB miss.
If the OS finds no valid translation in the page tables, a page fault has occurred and the OS must handle it accordingly.
Instruction Set Architectures (ISA) of CPUs that have software-managed TLBs have instructions that allow loading entries into any slot in the TLB. The
format of the TLB entry is defined as a part of the ISA.
Keep in mind that Page Table Faults are *always* handled by the OS irrespective of the TLB being hardware or software managed.
Tomasulo Algorithm: Today's out-of-order (OOO) processors carry out dynamic scheduling of instructions to efficiently utilize multiple execution
units using Tomasulo's algorithm. It was developed by Robert Tomasulo at IBM in 1967, and first implemented in the IBM System/360 Model 91's floating-point unit,
where it gained its fame.

Key features: The following are key features of Tomasulo's algorithm: reservation stations, a common data bus, distributed hazard detection and execution
control, and dynamic memory disambiguation.
Reservation Stations (RS)
Buffers attached to the functional units that hold instructions stalled on RAW hazards, waiting for their operands to become available.
Source operands can be values, or the names of other reservation station entries or load queue entries (in the case of a memory read) that will produce the value.
o Both operands don't have to be available at the same time.
o When both operand values have been computed, an instruction can be dispatched to its functional unit.
RAW hazards eliminated by forwarding
o Source operand values that are computed after the registers are read are known by the functional unit or load queue that will produce
them.
o Results are immediately forwarded to functional units on the common data bus.
o There is no need to wait for the value to be written into the register file.
WAR and WAW hazards eliminated by using register renaming
o Name-dependent instructions refer to reservation station or load queue locations for their sources, not the registers (as above)
o The last writer to the register updates it
o There are more reservation stations than registers, so this eliminates more name dependences than a compiler can, and exploits more parallelism
Common Data Bus (CDB)
Connects functional units and load queue to reservations stations, registers and the store queue.
Ships results to all hardware that could want an updated value.
Mitigates RAW hazards: consumers do not have to wait until registers are written before obtaining a value.
Distributed hazard detection and execution control
Each reservation station decides when to dispatch instructions to its function unit.
Each hardware data structure entry that needs a value from the common data bus grabs the value itself by snooping.
Reservation stations, store queue entries & registers have a tag saying where their data should come from.
When the tag matches the data producer's tag on the bus, reservation stations, store queue entries and registers grab the data.
Dynamic memory disambiguation
The issue:
o Don't want loads to bypass stores to the same location.
The solution:
o Loads associatively check addresses in the store queue.
o If an address matches, the load grabs the value.
Tomasulo execution stages
Tomasulo works in three stages: issue, execute and write result, assuming the instruction has already been fetched. With the addition of a Re-Order Buffer (ROB), an
additional fourth stage is added, called commit.
Issue
o Issue if no hazard; stall if hazard.
o Read registers for source operands.
Put into reservation stations if values are in them.
If not, put tag of the producing functional unit or load queue.
(renaming the registers to eliminate WAR and WAW hazards)
Execute
o Detect RAW hazards.
o Snoop on CDB for missing operands.
o Dispatch instruction to a functional unit when both operand values are ready.
o Execute the operation.
o Calculate effective address and start memory operation (load/store).
Write Result
o Broadcast result and reservation station tag (ID) on the CDB.
o Reservation stations, registers and store queue entries obtain the
value through snooping.
Advantages of Tomasulo's algorithm compared to scoreboarding
Register renaming in hardware.
Reservation stations for all execution units.
Common Data Bus (CDB) on which computed values are broadcast to all reservation stations that may need them.
These developments avoid unnecessary stalls that would occur with scoreboarding and thus allow for efficient parallel execution and better performance
than scoreboarding.
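The CDB snooping behaviour described above can be sketched in Python. A toy model only; the ("tag", ...) / ("value", ...) operand encoding and the dictionary layout are assumptions for illustration:

```python
def broadcast_on_cdb(tag, value, reservation_stations):
    """Common Data Bus broadcast: every reservation-station operand
    waiting on `tag` snoops the bus and grabs the value, so consumers
    never wait for the register file to be written."""
    for rs in reservation_stations:
        for slot, operand in enumerate(rs["operands"]):
            if operand == ("tag", tag):
                rs["operands"][slot] = ("value", value)

rs_list = [
    {"op": "add", "operands": [("value", 3), ("tag", "MUL1")]},
    {"op": "sub", "operands": [("tag", "MUL1"), ("tag", "ADD2")]},
]
broadcast_on_cdb("MUL1", 42, rs_list)   # MUL1's result appears on the CDB
assert rs_list[0]["operands"] == [("value", 3), ("value", 42)]  # add is now ready
assert rs_list[1]["operands"] == [("value", 42), ("tag", "ADD2")]  # sub still waits
```

Once both of a station's operands hold values, the instruction can dispatch to its functional unit.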
CACHES
Tradeoffs
There are multifaceted tradeoffs considered while designing caches.
1. Caches are based on SRAM technology. SRAMs are much faster than DRAMs. However, their disadvantages include:
1. Lower density compared to DRAMs owing to the use of ~6 transistors per bit compared to 1 transistor per bit used in DRAM.
2. As a result of the lower density, the per bit storage cost is higher in SRAMs compared to DRAMs.
2. Large caches provide a higher hit rate. However, as the size of the cache increases, the latency of the access circuits (comparators) increases drastically.
3. Multi-level caching helps improve the hit rate further, but can be slow: if you don't find data in the first level, you go to the next level, which adds to latency.
4. Typical scenarios in which caches do not improve performance:
1. On every program change, the stale data of the previous program has to be flushed, so you go through a phase of cold misses every time a fresh
program is loaded for execution.
2. A particular workload may lack locality, such as a streaming application (say, streaming a YouTube video online, where you rarely re-read
previous data by seeking backwards). In this case, the benefit of caching drops and the AMAT (Average Memory Access Time) increases drastically,
because cache content is rarely re-used and everything brought into the cache eventually has to be evicted without being re-read.
What are the types of Cache Misses? (4 Cs) In modern High Performance Computer Architecture (HPCA) literature, cache optimizations play a very
important role. Caches are a layer of faster (SRAM) memory added to speed up memory operations, which are traditionally bottlenecked by main
memory (DRAM), also known as lower-level memory.

Types of cache misses


Compulsory: These are also called cold misses, observed when you freshly boot up your system and everything you access is a new request
(not present in the cache). Optimizations such as the use of a sequential/stream prefetcher significantly help reduce cold/compulsory misses.
Capacity: Caches (SRAM) are faster than main memory (DRAM), but they consume considerably more power and area per bit than main memory, and hence
their size is limited. Once the cache fills up, replacement policies come into play. Increasing the total cache capacity
reduces capacity misses.
Conflict: Conflict misses arise when you access a particular set and it is already full, forcing you to replace data in that set.
Increasing the associativity (the number of ways per set) reduces conflict misses.
Coherence: Coherence misses are observed on account of cache-coherence protocols, where coherence traffic/invalidations may be sent between cores.

What are Snooping and Directory Based Cache Coherency Protocols? Cache coherency is of prime importance in modern CPU design. There
are two types of cache coherency protocols, with a trade-off between them in terms of complexity and scalability of implementation.

Snooping-Based Cache Coherency Protocols


It is based on the concept of broadcast: every coherency state transition in one core is broadcast to the other cores (in multicore systems).
It is not scalable; imagine a system with thousands of cores sending out broadcast traffic. You are bound to be bottlenecked by the bus interface bandwidth
here.
It requires more power and area to implement a snoop-based protocol.
Snoop-based protocols are plagued by contention and electrical issues.
The implementation of snoop-based protocols is straightforward.

Directory-Based Cache Coherency Protocols


It is based on point-to-point communication, in contrast to the broadcast used by snoop-based protocols.
The whole idea of making it point-to-point, and of only sending traffic to cores that actually share a cache line, makes it a scalable solution for modern
multiprocessor SoCs.
There can be false positives. This is largely an implementation issue: a core sometimes does not inform others of a clean line being evicted, so the
line remains a (stale) valid entry in other directories and the core still receives coherence traffic for it.
A typical implementation can be thought of as a telephone directory, except that a bit vector, rather than an actual book, keeps track of who
owns (or shares) a particular cache line.
The implementation is relatively complex, depending on the number of cores in the picture.
Directory size is proportional to (No. of Processors) x (No. of Memory Blocks).
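This proportionality can be made concrete with a quick back-of-the-envelope script for a full-map (bit-vector) directory. The core count and memory size below are illustrative assumptions.

```python
# One presence bit per processor for every memory block in a full-map
# bit-vector directory; parameters are illustrative assumptions.
def directory_bits(num_processors, num_memory_blocks):
    return num_processors * num_memory_blocks

# 64 cores, 1 GiB of memory in 64-byte blocks -> 2^24 blocks
num_blocks = (1 << 30) // 64
bits = directory_bits(64, num_blocks)
print(bits // (8 * 1024 * 1024), "MiB of presence bits")  # 128 MiB
```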

Types of Caches based on Construction

Structure of a cache line

tag | flag bits | data

Direct-mapped caches

Each memory location can go in only one entry in the cache


Also called a one-way set-associative cache
Does not need a replacement policy as such, since there is no choice of which entry to evict
This means that if two locations map to the same entry, they may continually knock each other out
A direct-mapped cache needs to be much larger than an associative cache to offer comparable performance
Mapping equation: x = y mod n, where x is the cache line number, y is the (block) address, and n is the number of blocks in the cache
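The mapping equation can be sketched in a few lines. The line count and block size below are illustrative, and the helper name is a hypothetical.

```python
# Direct-mapped placement: x = y mod n, plus the usual offset/index/tag
# split for a byte-addressed cache. Parameters are illustrative.
def direct_mapped_line(addr, num_lines, block_size):
    block_number = addr // block_size
    index = block_number % num_lines   # x = y mod n
    tag = block_number // num_lines
    return index, tag

# Two addresses exactly num_lines * block_size apart land on the same
# line with different tags, so they keep knocking each other out.
i1, t1 = direct_mapped_line(0x1000, num_lines=64, block_size=64)
i2, t2 = direct_mapped_line(0x1000 + 64 * 64, num_lines=64, block_size=64)
assert i1 == i2 and t1 != t2
```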

Set-associative caches

A memory location can be cached in any one of the n ways (or slots) of its set within the cache.
Typically, the least significant bits of the memory location's block address are used as the set index into the cache memory, with n entries (for
example, two for a 2-way cache) at each index.
LRU is typically used as the replacement policy; for a 2-way cache it is especially simple, since only one bit needs to be stored for each pair of entries.

Fully-associative caches

A memory location can be cached in any location within the cache.


A fully-associative cache tests all the possible ways simultaneously, using something like a content addressable memory.
In the common case of finding a hit in the first way tested, a fully-associative cache is as fast as a direct-mapped cache, but it has a much lower
conflict miss rate than a direct-mapped cache.

Advantages of prefetching
1. Reducing effective latency 2. Improving resource utilization 3. Higher confidence of prefetch usage (depending on the workload)
Generally, prefetchers understand and develop a pattern in the way the current workload uses data by applying dynamic learning policies. Therefore, the locality
and access footprint of the workload trains the prefetcher. Also, modern CPUs typically have multiple levels of prefetchers for each cache level in the Memory
Hierarchy. As an additional note, prefetching improves the latency for both instruction and data caches.
Interview Question
Design an L2 prefetcher for the below specification.
Inputs: Current PC, Valid, Hit. Outputs: Prefetch Address, Valid.
Solution: If there is a Hit for the PC being sent into the block, we can increment and send out PC+4 (next line) and PC+8 (next-to-next line). Otherwise, send out just PC+4 (next line).
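The described solution can be sketched as a small reference model. The +4 stride follows the question; the function name and return convention are hypotheticals.

```python
# Next-line L2 prefetcher sketch for the spec above: on a hit, prefetch
# PC+4 and PC+8; on a miss, just PC+4. The function name is a hypothetical.
def l2_prefetch(pc, valid, hit):
    """Return the list of prefetch addresses to issue (possibly empty)."""
    if not valid:
        return []
    if hit:
        return [pc + 4, pc + 8]  # next line and next-to-next line
    return [pc + 4]              # just the next line

assert l2_prefetch(0x100, valid=True, hit=True) == [0x104, 0x108]
assert l2_prefetch(0x100, valid=True, hit=False) == [0x104]
assert l2_prefetch(0x100, valid=False, hit=True) == []
```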

Frequency Divide by 2 logic design
Verilog RTL code for a divide by 2 logic
module clk_div (clk_in, enable,reset, clk_out);
// Port Declaration
input clk_in ;
input reset ;
input enable ;
output clk_out ;
//Port data type declaration-
wire clk_in ;
wire enable ;
//Internal Registers-
reg clk_out ;
//Code Starts Here
always @ (posedge clk_in)
if (reset) begin
clk_out <= 1'b0;
end else if (enable) begin
clk_out <= ! clk_out ;
end
endmodule
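A quick behavioural check of the divide-by-2 logic, modelled in Python with one loop iteration per posedge of clk_in; this is a sketch of the RTL's behaviour, not the RTL itself.

```python
# Model of the clk_div module above: clk_out toggles once per input
# clock edge while enable is high, halving the frequency.
def divide_by_2(num_edges, enable=True):
    clk_out = 0
    trace = []
    for _ in range(num_edges):  # one iteration per posedge clk_in
        if enable:
            clk_out ^= 1        # clk_out <= !clk_out
        trace.append(clk_out)
    return trace

# 8 input edges give 4 full output periods: the output runs at half rate.
print(divide_by_2(8))  # [1, 0, 1, 0, 1, 0, 1, 0]
```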

What's the relationship between voltage and speed? A higher supply voltage gives a transistor more gate overdrive and therefore more drive current, so it
charges and discharges its load capacitance faster and switches on and off more quickly: to first order, delay t = C*V/I, and I grows faster than linearly with V, so raising the voltage lowers the delay (at the cost of power).
Interconnect matters too: V (proportional) R (proportional) length of wire. Treat the length of the wire as a distance: the longer the wire, the larger its resistance, so a higher voltage (a larger voltage difference) is needed to drive the signal to the other end in the same time.

What will happen if the PMOS and NMOS of the CMOS inverter circuit are interchanged with respect to their positions?
Assume that the PMOS and NMOS positions are interchanged.
- A PMOS is a switch which turns on when you apply 0 to its gate.
- An NMOS is a switch which turns on when you apply 1 to its gate.

Since in an NMOS the drain sits at the higher voltage, in our case the drain is connected to VDD and the source becomes the output node.
Apply VDD, i.e. logic 1, to the gate. The NMOS turns ON and the output node charges towards VDD. But you need Vgs >= Vth to keep the NMOS in the ON state.
Currently Vg is at VDD and Vs is charging towards VDD.
Now, when Vs approaches VDD - Vth, you have Vgs = VDD - (VDD - Vth) = Vth. Any extra voltage at Vs would turn the NMOS off, so you never get a
strong 1 (i.e. VDD) at the output. Thus the NMOS passes a weak 1 (VDD - Vth).
You can apply a similar analysis to the PMOS and show that it passes a weak 0 (i.e. Vth).
PS: The circuit would actually not work as an inverter, but as a buffer passing weak 1's and weak 0's.

Why/What is load capacitance in CMOS inverter? Load capacitance in a CMOS circuit is a combination of input capacitance of the following
circuit(s) and the capacitance of the interconnect. (For long interconnects things get more tricky as transmission line effects need to be taken into
consideration)
The effect of load capacitance is that it causes a transient current demand on the inverter output, which in turn causes a number of secondary
effects, two of which are:
- The output has a limited current capability, so this limits the maximum rate of change of the signal, slowing down the edges.
- The transient output current is drawn from the power supply and hence causes spikes on the supply (since the power supply and its
interconnect are non-ideal and have series impedance). This is the reason why decoupling capacitors need to be connected between the power rails
close to the output stage.

Why does increasing transistor size reduce delay in a MOS circuit? Delay in a gate can be simplified as the amount of time it takes to discharge the load
capacitance that the gate (FET) is driving.
I = q/t = C*V/t
t = C*V/I
1) To first order, delay (time) is inversely proportional to drive current, so increasing the drive current will reduce the delay.
2) Increasing the MOS width will increase its drive current.
Therefore, increasing MOS width increases its drive current, which reduces the discharge time of the load (i.e. reduces the delay).

If you want the delay through a gate to be small, you can make the gate bigger, which reduces its electrical fanout (the ratio of load capacitance to input capacitance).
However, keep in mind that other gates must drive this gate's input capacitance Cin, so we cannot make the gate arbitrarily big. You cannot size one gate in isolation; you should
consider the full chain of logic. Typically there is an optimum sizing solution: in the case of a chain of inverters driving a large load capacitance, the optimal electrical
fanout per stage is found to be between 3 and 4.
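The optimal-fanout claim can be checked numerically. Ignoring parasitics, the total delay through N equal stages driving a load C_L from input capacitance C_in is proportional to N * (C_L/C_in)^(1/N); the minimum lands near a per-stage fanout of e ≈ 2.72, and including self-loading parasitics pushes the practical optimum toward the 3-4 quoted above. The 1000x load ratio below is an illustrative assumption.

```python
# Normalized total delay of an N-stage inverter chain, ignoring
# parasitic (self-loading) delay. The 1000x load ratio is illustrative.
def chain_delay(load_ratio, n_stages):
    per_stage_fanout = load_ratio ** (1.0 / n_stages)
    return n_stages * per_stage_fanout

ratio = 1000.0
delays = {n: chain_delay(ratio, n) for n in range(1, 10)}
best_n = min(delays, key=delays.get)
best_fanout = ratio ** (1.0 / best_n)
print(best_n, round(best_fanout, 2))  # optimum fanout lands near e ~ 2.72
```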
How does transistor size affect clock speed? The critical path is the slowest pathway in your chip. In layman's terms, you're only as strong as your
weakest link; the critical path is that weakest link. If you run your clock any faster than your critical path allows, you encounter setup violations along that path, which in
turn contaminates other paths, and the chip malfunctions.

This is where transistor size comes into play. While not directly affecting the clock, the size of your transistors affects path delays. Why? A bigger transistor presents a larger gate capacitance to the stage driving it, so it takes longer to charge up (even though it can source more current itself). When this input loading dominates, it makes your pathways, and therefore your critical path, slower.

Bigger transistors => longer critical path => slower clock

Conversely, as transistor size goes down with process scaling, switching frequency goes up (faster 0-to-1 and 1-to-0 transitions). So high-frequency transistors (well, a whole bunch of them) enable a high-frequency CPU.

Transistors

The four transistor operation modes are:


Saturation: The transistor acts like a short circuit; current flows freely from collector to emitter. Saturation is the fully-on mode of a transistor.
Cut-off: The transistor acts like an open circuit; no current flows from collector to emitter. Cutoff mode is the opposite of saturation; a transistor
in cutoff mode is off.
Active: The current from collector to emitter is proportional to the current flowing into the base. To operate in active mode, a transistor's VBE must
be greater than zero and VBC must be negative.
Reverse-Active: Like active mode, the current is proportional to the base current, but it flows in reverse; current flows from emitter to collector
(not, exactly, the purpose transistors were designed for).

Describe how a multi-bit synchronizer / async FIFO handles the variable delay of each bit!
The standard solution is to encode the pointer into Gray code, where for each increment of the pointer only one bit changes (so the variable delay cannot
cause false empty/full glitches as the bits settle). Note that this assumes a power-of-two depth FIFO; otherwise the Gray code may flip multiple bits when the pointer hits
the non-power-of-2 top and has to wrap to the bottom. Also note that in any standard asynchronous design, it is best not to have any combinational logic before a synchronizer
(always have a flop). The reason is that even if you think you know the design is glitch-proof, the synthesis tool could do a strange optimization and create a circuit that
will glitch when you're not expecting it (and that glitch can get sampled by the synchronizer and result in a false pulse on the other side). This can be avoided by writing
structural RTL and using DC constraints to ensure your glitch-proof circuit ends up in the netlist (though why not just avoid these cases altogether by adding the flop).
Also of note: to create the empty signal, the write pointer has to be converted to Gray code, flopped (see the notes above about combinational logic before
synchronizers), and then sent through synchronizers into the read clock domain. The output of the synchronizers then has to be Gray-decoded so that it can
finally be compared to the read pointer to determine empty. Generating the full signal is just the reverse. For power-of-two FIFOs, just add one extra bit to the pointers to
tell the difference between full and empty (if the pointers are equal including the extra bit, the FIFO is empty; if the extra bits differ but the remaining bits are equal, the FIFO is full).
Enough about FIFOs and async-crossings.
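The single-bit-change property that makes Gray-coded pointers safe to synchronize can be demonstrated directly; this is a standalone Python sketch, not tied to any particular RTL.

```python
# Binary <-> Gray conversion as used for async-FIFO pointers. Adjacent
# Gray codes differ in exactly one bit, so a pointer sampled while
# changing resolves to either the old or the new value, never garbage.
def bin_to_gray(b):
    return b ^ (b >> 1)

def gray_to_bin(g):
    b = 0
    while g:
        b ^= g
        g >>= 1
    return b

for i in range(255):
    diff = bin_to_gray(i) ^ bin_to_gray(i + 1)
    assert diff != 0 and diff & (diff - 1) == 0  # exactly one bit flips
    assert gray_to_bin(bin_to_gray(i)) == i      # round-trip is lossless

# Wrap-around of a power-of-two-depth (16-entry) pointer is also
# single-bit, which is why non-power-of-2 depths are a problem:
wrap = bin_to_gray(15) ^ bin_to_gray(0)
assert wrap & (wrap - 1) == 0
```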

Synchronous FIFO verification

A synchronous First In First Out (commonly referred to as FIFO) or a queue is an array of memory. Generally, it is used when the write and read side logic
operate at the same clock frequency.
Use Case
- to buffer data when the burst write operation is larger than the burst read operation or vice versa
- read operation is delayed with respect to the write

Interfaces
A FIFO typically has the following set of signals
- Clock and reset
- Write and Write data
- Read and Read data
- Read and Write enable
- Full and empty (outputs)
Scenarios to verify
FIFO is a commonly used logic in many designs. The major functional features which have to be verified are
- Single write and read operations as well as data correctness
- FIFO transitioning from empty to non-empty and vice versa
- The transition from non-empty to full and vice versa
- Burst read and burst write operations up to the maximum depth
- Empty to full and back to empty
Error conditions
- Write operation when full: the client should wait for the full signal to go low before issuing more writes. Otherwise, the data in the FIFO could be
overwritten or dropped.
- Read operation when empty: the client should wait for the empty signal to go low before issuing a read. Otherwise, the data read will be garbage.
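A testbench for these scenarios typically keeps a software reference model alongside the DUT: writes are mirrored into the model and read data is compared against it. The sketch below is a generic Python model with hypothetical names, not tied to any particular verification framework.

```python
from collections import deque

# Reference model of a synchronous FIFO for scoreboard-style checking.
class FifoModel:
    def __init__(self, depth):
        self.depth = depth
        self.q = deque()

    def write(self, data):
        # Error condition: a write when full would overwrite/drop data.
        assert len(self.q) < self.depth, "write while full"
        self.q.append(data)

    def read(self):
        # Error condition: a read when empty would return garbage.
        assert self.q, "read while empty"
        return self.q.popleft()

    def full(self):
        return len(self.q) == self.depth

    def empty(self):
        return len(self.q) == 0

# Empty -> non-empty -> full -> empty transitions, with data checking.
m = FifoModel(depth=4)
assert m.empty()
for d in range(4):
    m.write(d)
assert m.full()
assert [m.read() for _ in range(4)] == [0, 1, 2, 3]  # FIFO order
assert m.empty()
```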

Rising Edge Detector (0->1) Verilog Code

Concept: detect a rising edge (a signal transition from logic 0 to logic 1). This can be done in two ways: using an AND gate, or using an XOR gate (which saves an inverter, at the cost of also detecting falling edges).

module positive_edge_detector (
    input signal,
    input clk,
    output detect);

    reg signal_delay; // version of signal delayed by 1 cycle

    always @(posedge clk) begin
        signal_delay <= signal;
    end

    // AND-based: detect is high for one cycle on a 0->1 transition only
    assign detect = signal & ~signal_delay;

    // XOR-based alternative: just XOR the flopped version of the signal
    // with the original. It saves an inverter, but fires on both edges:
    // assign detect = signal ^ signal_delay;
endmodule
Power dissipation in Circuits
Types of Power Dissipation

Static Power (also known as Leakage Power)


o Static Power is due to the leakage current that flows when the transistor is not in the process of switching.
Dynamic Power (also known as Switching Power)
o Dynamic Power is due to the switching / logic toggling of transistors.
Short Circuit Dissipation Power:
o Occurs when both NMOS and PMOS transistors are active for a small period of time in which current will find a path directly from VDD to
ground. Hence, this creates a short-circuit current.
o This is because there is a finite rise/fall time for both pMOS and nMOS during the transition.

Design for power: circuit components

1. Level shifter cells: These cells are used when signals need to cross between different voltage levels. Commonly, different blocks run in
different voltage modes depending on performance and power requirements.
   a. Low-to-high level shifter cells: connect low-voltage domains to high-voltage domains.
   b. High-to-low level shifter cells: connect high-voltage domains to low-voltage domains.
2. Isolation cells: These cells isolate power-collapsed logic from powered-on logic, because power-collapsed logic will output
Xs and these unknown digital values must not propagate into powered-on digital logic. Isolation cells can be of input or output type, and
there are several kinds:
   a. Clamp-low isolation cell: when the clamp signal is asserted, the cell clamps its output to a digital 0.
   b. Clamp-high isolation cell: when the clamp signal is asserted, the cell clamps its output to a digital 1.
   c. Clamp-keeper isolation cell: the cell holds the value present just before the clamp signal was asserted; these behave like
sequential elements.
3. Retention registers: In a power-collapsed state, the registers which are retainable hold their values. These cells are optimized to
work on dual rails, and an always-on rail enables them to hold their values. This is used for the system to recover from a powered-down state; the
state and configuration registers can be retained.
4. Power switches: Power switches can be used to turn the power to a block on or off. They enable power collapse, which saves both leakage
and switching power; this type of architecture is called power gating. Power switches can be PMOS head switches or NMOS
foot switches. They are typically of higher resistance to limit leakage from the voltage rails.
5. LDO (low-dropout regulator): LDOs are used to regulate voltage and are quite stable in operation. They are used for voltage scaling.
6. Voltage rail shifters: These change the output voltage rail between different input rails. One rail may be a higher-voltage rail for high
performance and another a lower-voltage rail for lower power. Since changing the voltage of a rail is slow and its settling time is large,
voltage rail shifters can be used to quickly switch to a lower-voltage rail for lower power or a higher-voltage rail for higher performance.
7. Clock gating cells: To save dynamic power, clock gating cells are used to gate the clock to an idle block. There are also self-gating
clocks, which depend on logic that enables or disables the clock, a kind of feedback loop. Clock power is a significant portion of total
power. Typically, when the D and Q of the flip-flops remain at constant values, the clock is shut off to save power. This is at the cell
level; there are higher-level power-saving architectures on top of it.
8. PMIC: Typically a power management IC is used to supply the voltage rails to the chip. These rails must be stable, low-noise, and within
operating margin. The PMIC takes in power from a battery or power source and uses converters (such as DC-DC buck/boost,
pulse-width modulators, etc.). It also performs voltage scaling and power-source selection, and in some cases it can charge the device
battery.

FIFO depth calculation


A commonly asked interview question is to calculate the depth of a FIFO required for a particular data transfer operation. Since a FIFO is used to
safely buffer data during the transfer, we need to consider the worst case scenario in order to determine the depth.
The need for a FIFO arises when either the read or write operation is slower than the other. The following parameters are required to determine
the minimum FIFO size

Write frequency (freq_write)


Read frequency (freq_read)
Write burst size (B)
For simple scenarios (read slower than write, with reads running throughout the burst), FIFO_Depth = B - B*(freq_read/freq_write)
Example 1
freq_write = 20 MHz, freq_read = 10 MHz, Burst = 80 bytes
Time taken to write 80 bytes (t1) = 80/20 = 4 us
Time taken to read 80 bytes (t2) = 80/10 = 8 us
FIFO_depth = (t2 - t1) * Smaller_freq = 4 * 10 = 40
Using the formula, FIFO_Depth = 80 - 80*(10/20) = 40

Example 2
Same read and write clock frequency, Write burst = 80 bytes in 100 clocks, Read burst = 8 bytes in 10 clocks
If there is no burst overlap and considering the worst case burst,
No. of bytes written in 80 clocks = 80
No. of bytes read in 80 clocks = 8*8 = 64
Hence, FIFO_depth = 80 - 64 = 16
If there is a burst overlap, the maximum write burst can be 160 bytes across 200 clocks and considering the worst case burst,
No. of bytes written in 160 clocks = 160
No. of bytes read in 160 clocks = 8*16 = 128
Hence, FIFO_depth = 160 - 128 = 32
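Example 1 can be cross-checked with a cycle-level occupancy simulation on a common 20 MHz timebase (one write per tick during the burst, one read every second tick); the peak occupancy should match the calculated depth of 40. The helper name is a hypothetical.

```python
# Worst-case FIFO occupancy for a write burst with continuous reads.
# Writes: 1 byte per fast-clock tick; reads: 1 byte every `read_every`
# ticks. read_every=2 matches Example 1 (write 20 MHz, read 10 MHz).
def max_occupancy(burst, read_every=2, ticks=400):
    occ = peak = written = 0
    for t in range(ticks):
        if written < burst:                   # burst write phase
            occ += 1
            written += 1
        if occ > 0 and t % read_every == 1:   # slower read side
            occ -= 1
        peak = max(peak, occ)
    return peak

print(max_occupancy(80))  # 40, matching FIFO_Depth = 80 - 80*(10/20)
```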

// generate 100 Hz pulse chain from 50 MHz
reg [18:0] count_reg = 0;
reg out_100hz = 0;

always @(posedge clk_50mhz or posedge rst_50mhz) begin
    if (rst_50mhz) begin
        count_reg <= 0;
        out_100hz <= 0;
    end else begin
        out_100hz <= 0;
        if (count_reg < 499999) begin
            count_reg <= count_reg + 1;
        end else begin
            count_reg <= 0;
            out_100hz <= 1; // one-cycle pulse every 500000 cycles
        end
    end
end

// generate 100 Hz square wave from 50 MHz
reg [17:0] count_reg = 0;
reg out_100hz = 0;

always @(posedge clk_50mhz or posedge rst_50mhz) begin
    if (rst_50mhz) begin
        count_reg <= 0;
        out_100hz <= 0;
    end else begin
        if (count_reg < 249999) begin
            count_reg <= count_reg + 1;
        end else begin
            count_reg <= 0;
            out_100hz <= ~out_100hz; // toggle every 250000 cycles
        end
    end
end

// generate 10 MHz from 250 MHz
// 25 cycle counter, falling edge interpolated
reg [4:0] count_reg = 0;
reg q0 = 0;
reg q1 = 0;

always @(posedge clk_250mhz or posedge rst_250mhz) begin
    if (rst_250mhz) begin
        count_reg <= 0;
        q0 <= 0;
        q1 <= 0;
    end else begin
        if (count_reg < 24) begin
            count_reg <= count_reg + 1;
        end else begin
            count_reg <= 0;
        end
        q0 <= count_reg < 12;
        q1 <= count_reg < 13;
    end
end

// generate 100 Hz from 50 MHz with a phase accumulator (NCO)
reg [31:0] count_reg = 0;
wire out_100hz = count_reg[31];

always @(posedge clk_50mhz or posedge rst_50mhz) begin
    if (rst_50mhz) begin
        count_reg <= 0;
    end else begin
        count_reg <= count_reg + 8590; // (((100 * 1 << 32) + 50000000/2) / 50000000)
    end
end
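The 32-bit accumulator version above is an NCO (numerically controlled oscillator): each 50 MHz cycle it adds a fixed increment, and the accumulator MSB serves as the output. The increment value and the resulting frequency error can be checked quickly:

```python
# Reproduce the rounded phase increment from the code comment above and
# compute the actual output frequency it yields.
f_clk = 50_000_000
f_out = 100
inc = (f_out * (1 << 32) + f_clk // 2) // f_clk  # round to nearest
actual = f_clk * inc / (1 << 32)
print(inc, actual)  # 8590, about 100.0008 Hz (~8 ppm high)
```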

Summary of SystemVerilog Extensions to Verilog
SystemVerilog adds important new constructs to Verilog-2001, including:
New data types: byte, shortint, int, longint, bit, logic, string, chandle.
Typedef, struct, union, tagged union, enum
Dynamic and associative arrays; queues
Classes
Automatic/static specification on a per variable instance basis
Packages and support for Compilation Units
Extensions to Always blocks for modelling combinational, latched or clocked processes
Jump Statements (return, break and continue)
Extensions to fork-join, disable and wait to support dynamic processes.
Interfaces to encapsulate communication
Clocking blocks to support cycle-based methodologies
Program blocks for describing tests
Randomization and constraints for random and directed-random verification
Procedural and concurrent assertions and Coverage for verification
Enhancements to events and new Mailbox and Semaphore built-in classes for inter-process communication.
The Direct Programming Interface, which allows C functions to be called directly from SystemVerilog (and vice versa) without using the PLI.
Assertions and Coverage Application Programming Interfaces (APIs) and extensions to the Verilog Procedural Interface (VPI); details of these are outside
the scope of the SystemVerilog Golden Reference Guide.

