ETW: Event-lookahead Time Warp




Event-lookahead Time Warp

Gate-level logic simulation is a low-granularity application that is very challenging for parallel processing. Parallel discrete event simulation (PDES) on general-purpose machines can reduce the logic simulation time for large circuits considerably. However, it generates more events than necessary for certain high activity circuits and produces inconsistent speedup over different circuits. This paper presents an optimistic parallel discrete event simulation technique that can reduce unnecessary events and produce more consistent speedup when circuit glitches are not considered. This is possible because, even though glitches contribute to a sizable portion of events during a simulation, they can be ignored most of the times.

The proposed Event-lookahead Time Warp (ETW) algorithm can look ahead, combine and execute multiple events at each gate optimistically, and recover from an error by using a rollback mechanism as used in the original Time Warp algorithm. To reduce the rollback cost of the algorithm and to support efficient run-time scheduling, a new data structure, called a governor heap, was proposed and implemented. The governor heap contains a heap structure and multiple linked lists, one for each gate or flip-flop. By separating events into short linked lists, the governor heap is very efficient and scalable in inserting, deleting, and combining events for large-scale simulations.
 
 


Governor Heap

The proposed Event-lookahead Time Warp (ETW) algorithm can look ahead, combine and execute multiple events at each gate optimistically, and recover from an error by using a rollback mechanism as used in the original Time Warp algorithm. To reduce the rollback cost of the algorithm and to support efficient run-time scheduling, a new data structure, called a governor heap, was proposed and implemented. The governor heap contains a heap structure and multiple linked lists, one for each gate or flip-flop. By separating events into short linked lists, the governor heap is very efficient and scalable in inserting, deleting, and combining events for large-scale simulations.

figure1013

A governor heap contains a fixed-size heap and multiple linked lists, one for each LP (gate/flip-flop). Events for each LP are sorted in a linked list according to their timestamps. Each linked list has a ``governor'' that has the smallest timestamp for the LP. All the governors in a processor form the fixed-size heap structure. Each heap node has two values: its Local Virtual Time (LVT) and the LP number (or gate number). The governors are ordered in the heap according to their timestamps so that the root of the heap is for the unprocessed event that has the smallest timestamp in the processor. The right-hand side of the above Figure shows multiple linked lists, one for each LP. The horizontal axis represents the simulation time in LVT and the vertical axis indicates the seven LPs in a processor.
 



Performance Results

Table A  compares the results of the ETW and the original TW simulation in the parallel environment with 10 processors on the IBM SP2. For both TW and ETW, global synchronization is used once every 100 input vectors. Several factors that influence simulation time are summarized in Table A.  They include table1128
Table A: The Comparison of TW and ETW with 10 processors    and
Table B: ETW Execution Time (in seconds) and Speedups.
 

From Table B,  it appears that the 16-bit multiplier with the TW algorithm produces a lot more events compared to the other circuit (s38417). With the ETW algorithm, the event-lookahead operation drastically reduces the number of events for the 16-bit multiplier. In either circuit, the ETW gets better performance than the TW algorithm in terms of the total number of events, the ratio of external events, the ratio of erroneous events, the efficiency, and the execution time. In particular, the ETW simulation is five times faster for the multiplier and 20% faster for s38417.

Table B shows the ETW execution time in seconds and speedups of those two circuits with different numbers of SP2 processors. The speedup results for the multiplier are pretty good considering that there are only 2080 gates. It also shows that more processors can be used effectively for the larger circuit (s38417) with ten times more gates.

Execution times and speedups for four different circuits are shown in the following Figure , respectively. For those three large sequential circuits, the speedups are reasonably good. Particularly, the s13207.1 circuit has better speedup when 24 or smaller number of processors are used. A speedup of 12.8 is achieved with 24 processors. However, when 32 processors are used, that reduces to 11.1 and the s38417 circuit with larger number of gates has a better speedup of 13.5. For small number of processors, the obtained execution times are more consistent and the speedups are reasonably good for different circuits.

figure1210

Figure: Speedups with ETW

The previous results were obtained with 1000 input vectors for each simulation run. In order to reduce the workload imbalance, the amount of rollback, and the event list length, several bounded window techniques  can be used by introducing global synchronization periodically. Note that global synchronization, if used too frequently, causes a lot of overhead or blocking among processors and degrades the performance of parallel simulation. Simulation results on the ETW show that, even though global synchronization once every input vector reduces erroneous events to less than 1%, global synchronization once for every 100 input vectors outperforms those with more frequent global synchronization. This is because the ETW with its lower rollback cost allows less frequent global synchronization.

Most previous simulators have limitations on some specific types of circuits. For example, the parallel discrete event simulation (PDES) does not perform well for high-activity circuits and the parallel data-driven simulation does not perform well for low-activity circuits. The proposed Event-lookahead Time Warp (ETW) technique improves PDES by combining multiple events and reducing the number of events. The resulting algorithm is very efficient for different circuits and the improvement is particularly clear for high-activity circuits. Unlike the compiled-code simulation, it does not require the static evaluation order of gates and it is therefore applicable to synchronous circuits with or without feedback loops. In addition, it can be applied to multi-delay model and it provides timing information even though glitches are removed.