Concurrency Preserving Partitioning


Concurrency Preserving Partitioning

The proposed iCPP algorithm preserves circuit concurrency by assigning to processors circuit gates that can be evaluated at about the same time. As a result, the improved concurrency preserving partitioning (iCPP) algorithm can provide better load balancing throughout the period of a parallel simulation. This is especially important when the algorithm is used together with a Time Warp simulation where a high degree of concurrency can lead to fewer rollbacks and better performance. This algorithm improves interprocessor communication patterns as messages are dispersed over time during parallel simulation. The result is an efficient heuristic algorithm that obtains a suboptimal solution within a reasonable amount of running time.


 


Concurrency Metric

Since the iCPP algorithm is to preserve concurrency, how to measure concurrency is discussed  and a concurrency metric is developed. In previous works, concurrency was defined as the number of active gates at a given simulation cycle for parallel discrete event logic simulation under the assumption that there is an infinite number of processors. In the same paper, the average parallelism is defined as the average concurrency during the entire simulation. A different definition of concurrency was used by Smith et. al. so to take into account the number of processors. However, both definitions can be used only for a synchronous simulation that performs at lock steps. In an asynchronous parallel simulation as in a Time Warp simulation, it is difficult to measure the concurrency because the ``simulation cycle'' is not clearly defined.

Most asynchronous parallel simulation studies use the performance of real parallel simulations to evaluate partitioning algorithms. Recently, a few works based on the critical path analysis have been performed to estimate the execution time of parallel simulations. However, their approaches require the execution of either a sequential simulation or a parallel simulation. Since the performance of parallel simulation is affected not only by partitioning but also by many other factors such as synchronization algorithms and target architectures, it is desirable to have a concurrency metric for partitioning that is independent of those factors and requires neither parallel simulation nor sequential simulation. Such a metric for concurrency has never been defined previously.

Given a partitioned directed graph, we want to define a concurrency metric that can be evaluated without performing a parallel simulation or a sequential simulation. Since concurrency is related to the number of active gates at a simulation cycle, nodes in a partitioned directed graph need to be classified into linearly ordered levels so that each level can be evaluated at about the same time on all processors. This classification process, called levelization, is based on the distances of individual graph nodes from primary input gates or flip-flops and the processor assignment. Once the levelization procedure has been applied to a directed graph, the concurrency can then be defined.

The inherent parallelism of a circuit depends on its structure, such as the number of primary input gates, the number of flip-flops, and the connectivity of circuit gates. In general, a loosely connected circuit has high concurrency while a tightly connected circuit has relatively low concurrency. Figure  summarizes the concurrency of several circuits after the application of those partitioning algorithms. Figure  displays those results graphically for the circuit s38417 with three partitioning algorithms that produce a high degree of concurrency. It shows that the iCPP algorithm produces better concurrency than MeTis does while the string algorithm is the best in terms of concurrency.
 

figure792
 



Partitioning Results

In Figure A,  partitioning results of the s38584.1 circuit are summarized in terms of the ratio of edge cut. As can be seen from the diagram, curves for different partitioning algorithms exhibit the same trend as the number of processors increases. Another observation is that the iCPP and the MeTis algorithms consistently produce lower ratio of edge cuts than those other four algorithms. As a matter of fact, when bi-partitioning (two-processor partitioning) is performed, the ratio of external events for the iCPP algorithm is 1.5%. This compares favorably with several other partitioning algorithms such as Flip-flop-Clustering (10%), Min-Cut (15%), and Levelizing (20%). These results were reported in  Sporrer et al. Even though the Corolla-clustering algorithm with its 0.6% ratio of external events for bi-partitioning definitely outperforms the iCPP algorithm in this regard, it is relatively slow and does not take concurrency into consideration. Figure B also shows that the ratio of edge cuts with the iCPP algorithm is only 12.5% even when 64 processors are used. For more detail in concurrency metric and partitioning results, please read a paper.

figure802

Figure A: Edge Cut and communication Ratio

figure807
Figure B: Edge Cut Ratios with several partitioning algorithms