Gate-level logic simulation is a low-granularity application that is very challenging for parallel processing. Parallel discrete event simulation (PDES) on general-purpose machines can reduce the logic simulation time for large circuits considerably. However, it generates more events than necessary for certain high activity circuits and produces inconsistent speedup over different circuits. This paper presents an optimistic parallel discrete event simulation technique that can reduce unnecessary events and produce more consistent speedup when circuit glitches are not considered. This is possible because, even though glitches contribute to a sizable portion of events during a simulation, they can be ignored most of the times.
The proposed Event-lookahead Time Warp (ETW) algorithm can look ahead,
combine and execute multiple events at each gate optimistically, and recover
from an error by using a rollback mechanism as used in the original Time
Warp algorithm. To reduce the rollback cost of the algorithm and to support
efficient run-time scheduling, a new data structure, called a governor
heap, was proposed and implemented. The governor heap contains a heap
structure and multiple linked lists, one for each gate or flip-flop. By
separating events into short linked lists, the governor heap is very efficient
and scalable in inserting, deleting, and combining events for large-scale
simulations.
A governor heap contains a fixed-size heap and multiple linked lists,
one for each LP (gate/flip-flop). Events for each LP are sorted in a linked
list according to their timestamps. Each linked list has a ``governor''
that has the smallest timestamp for the LP. All the governors in a processor
form the fixed-size heap structure. Each heap node has two values: its
Local Virtual Time (LVT) and the LP number (or gate number). The governors
are ordered in the heap according to their timestamps so that the root
of the heap is for the unprocessed event that has the smallest timestamp
in the processor. The right-hand side of the above Figure shows multiple
linked lists, one for each LP. The horizontal axis represents the simulation
time in LVT and the vertical axis indicates the seven LPs in a processor.
From Table B, it appears that the 16-bit multiplier with the TW algorithm produces a lot more events compared to the other circuit (s38417). With the ETW algorithm, the event-lookahead operation drastically reduces the number of events for the 16-bit multiplier. In either circuit, the ETW gets better performance than the TW algorithm in terms of the total number of events, the ratio of external events, the ratio of erroneous events, the efficiency, and the execution time. In particular, the ETW simulation is five times faster for the multiplier and 20% faster for s38417.
Table B shows the ETW execution time in seconds and speedups of those two circuits with different numbers of SP2 processors. The speedup results for the multiplier are pretty good considering that there are only 2080 gates. It also shows that more processors can be used effectively for the larger circuit (s38417) with ten times more gates.
Execution times and speedups for four different circuits are shown in the following Figure , respectively. For those three large sequential circuits, the speedups are reasonably good. Particularly, the s13207.1 circuit has better speedup when 24 or smaller number of processors are used. A speedup of 12.8 is achieved with 24 processors. However, when 32 processors are used, that reduces to 11.1 and the s38417 circuit with larger number of gates has a better speedup of 13.5. For small number of processors, the obtained execution times are more consistent and the speedups are reasonably good for different circuits.
Figure: Speedups with ETW
The previous results were obtained with 1000 input vectors for each simulation run. In order to reduce the workload imbalance, the amount of rollback, and the event list length, several bounded window techniques can be used by introducing global synchronization periodically. Note that global synchronization, if used too frequently, causes a lot of overhead or blocking among processors and degrades the performance of parallel simulation. Simulation results on the ETW show that, even though global synchronization once every input vector reduces erroneous events to less than 1%, global synchronization once for every 100 input vectors outperforms those with more frequent global synchronization. This is because the ETW with its lower rollback cost allows less frequent global synchronization.
Most previous simulators have limitations on some specific types of
circuits. For example, the parallel discrete event simulation (PDES) does
not perform well for high-activity circuits and the parallel data-driven
simulation does not perform well for low-activity circuits. The proposed
Event-lookahead Time Warp (ETW) technique improves PDES by combining multiple
events and reducing the number of events. The resulting algorithm is very
efficient for different circuits and the improvement is particularly clear
for high-activity circuits. Unlike the compiled-code simulation, it does
not require the static evaluation order of gates and it is therefore applicable
to synchronous circuits with or without feedback loops. In addition, it
can be applied to multi-delay model and it provides timing information
even though glitches are removed.