1. Introduction
In the design of a synchronous circuit, the clock signal should ideally arrive at all components at the same time. However, because the distance from the clock source differs for each sequential element, timing differences arise among the clock signal paths. A circuit with a large clock skew may fail to function correctly. With the increasing complexity of high-performance chip design and growing process variation, clock network synthesis has become a crucial and difficult problem.
There are three important issues in modern clock network synthesis: clock skew, process variation, and power consumption. Clock tree and clock mesh are the two commonly used approaches for addressing them. A clock tree network has shorter wire length and lower power consumption, but it is harder to bring to timing closure and tolerates process variation less well. A clock mesh network, on the other hand, tolerates process variation better and satisfies the clock skew constraint more easily, but it usually consumes much more power.
Because high-performance circuit design demands tight clock skew constraints, process variation tolerance, and low power, the concept of a hybrid clock network has been proposed in recent years. This style of clock network architecture aims to combine the low power of a clock tree with the easy timing convergence of a clock mesh, achieving timing closure while optimizing power consumption.
In this paper, we propose a new cross-mesh architecture. In a normal hybrid mesh/tree structure, the clock mesh layout is based on the board shape, and the driving buffers are placed at the intersections of the mesh lines. For example, Figure 1(a) is a 2 × 2 uniform clock mesh design, in which the driving buffer is placed at the mesh line intersection to drive the registers within the mesh; its driving range is shown by the gray rectangle. In contrast, our proposed cross-mesh layout is shown in Figure 1(b). The range of a driving buffer changes from the original rectangular area to four triangular areas. We use this mesh architecture to distribute the buffers so as to balance the overall switching capacitance, reducing both the number of registers connected to a sub-tree and the load capacitance of each buffer.
A comparison of the two clock mesh architectures in Figure 1 is given below. We assume that every piece of wire has 1 pf of capacitance and 1 pw of power consumption, that a 1X buffer has 4 pf of capacitance and 4 pw of power consumption, and that a 4X buffer has 10 pf of capacitance and 10 pw of power consumption. In this example, the uniform clock mesh is composed of twelve nets and one 4X buffer, so its total power consumption is 12*1+10 = 22 pw and its total capacitance is 12*1+10 = 22 pf.
On the other hand, the cross clock mesh uses four driving buffers to drive an area of the same size, and each driving buffer on average drives only the triangular area surrounded by the blue line, which is one-fourth of the original area. This change reduces the output capacitance of each driving buffer, and hence timing convergence becomes easier.
There are four 1X buffers and four pieces of wire in the circuit. The total capacitance is 4*1+4*4 = 20 pf, and the power consumption is 4*1+4*4 = 20 pw. This example shows that the cross-mesh has both smaller capacitance and lower power consumption than the uniform mesh.
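The arithmetic of this comparison can be reproduced with a short sketch. The unit costs are the ones assumed in the example above; the function and variable names are hypothetical and only serve the illustration.

```python
# Unit costs assumed in the Figure 1 example. Capacitance (pf) and power (pw)
# are numerically equal in this example, so one table serves both.
WIRE = 1
BUFFER = {"1X": 4, "4X": 10}

def mesh_total(wire_pieces, buffers):
    """Total cost of a mesh: wire pieces plus driving buffers.

    buffers maps a buffer size ("1X" or "4X") to the number of such buffers.
    """
    return wire_pieces * WIRE + sum(BUFFER[size] * n for size, n in buffers.items())

uniform = mesh_total(12, {"4X": 1})   # twelve nets, one 4X buffer
cross = mesh_total(4, {"1X": 4})      # four wire pieces, four 1X buffers
print(uniform, cross)
```

Under these assumptions the cross-mesh total is lower because four small buffers and four wire pieces replace one large buffer and twelve nets.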
In addition to the proposed hybrid clock network architecture, we also propose a clock gating and register clustering algorithm that addresses both clock skew and power consumption, together with an effective solution to the problem of varying operating environments.
The rest of this paper is organized as follows. Section 2 describes related work on clock tree, clock mesh, and hybrid networks. Section 3 illustrates, with an example, our motivation for integrating the cross-mesh architecture and clock gating in the local clock tree. In Section 4, we propose a four-stage methodology for hybrid cross-mesh synthesis, clock gating, and register clustering. Experimental results in Section 5 show the effect of integrating these optimization techniques. Finally, we draw concluding remarks in Section 6.
3. Motivation Example
The example in Figure 1 illustrates how to take the driving buffer and mesh wire into consideration, and how the cross-mesh architecture reduces total capacitance and power consumption. In this section, we further illustrate our motivation for integrating the cross-mesh architecture and clock gating in the local clock tree, and we analyze the capacitance and power consumption of clusters in each kind of sub-tree topology with the example in Figure 2. Starting from the circuit in Figure 2(a), Figure 2(b) and Figure 2(c) illustrate how clock gating and load balancing techniques can reduce the capacitance load on mesh wires, and Figure 2(d) illustrates how we integrate these two optimization methods with the cross-mesh architecture to further reduce the capacitance load. Details of this example are described below.
The switching capacitance and power consumption of a cluster are defined by the following formulas:
$$Cap_{tot} = Cap_{mesh\_wire} + \alpha \left( Cap_{gate} + Cap_{subtree\_wire} + \sum Cap_{reg} \right) \qquad (1)$$
$$Pwr_{tot} = Pwr_{mesh\_wire} + \alpha \left( Pwr_{gate} + Pwr_{subtree\_wire} + \sum Pwr_{reg} \right) \qquad (2)$$
Capmesh_wire is the wire capacitance from the driving buffer to the gate at the beginning of a cluster, Capgate is the capacitance of the cluster gate, Capsubtree_wire is the wire capacitance within a cluster, and Capreg is the capacitance of a register. Because a clock gate controls each cluster, a parameter α is needed to represent the activity ratio of the cluster. If α = 0.5, the cluster is active half of the time during operation. To simplify the problem, we assume α of every gate is 0.5 in this example. The power parameters are defined analogously to the capacitance parameters. For the other parameters, we assume the capacitance and power of a 1X gate are 6 pf and 6 pw, and those of a 4X gate are 14 pf and 14 pw. Similarly, we assume the capacitance and power of each register are 5 pf and 5 pw.
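As a minimal sketch of how these quantities combine, the function below assumes that the cluster total is the mesh-wire term plus α times the gated part (gate, subtree wire, and registers). This form is an assumption inferred from the worked numbers in this section, and `cluster_total` with its parameter names is hypothetical. Because capacitance and power share the same numbers in this example, one function serves both.

```python
GATE = {"1X": 6.0, "4X": 14.0}   # gate capacitance/power assumed in this example
REG = 5.0                         # per-register capacitance/power assumed in this example

def cluster_total(mesh_wire, subtree_wire, gate_size, n_regs, alpha=0.5):
    """Switching capacitance (pf) or power (pw) of one gated cluster.

    Assumed form: the mesh wire always switches, while the gate, the subtree
    wire and the registers are weighted by the activity ratio alpha.
    """
    gated = GATE[gate_size] + subtree_wire + n_regs * REG
    return mesh_wire + alpha * gated

# An eight-register cluster behind a 4X gate, with 10 units of mesh wire
# and 30 units of subtree wire.
half_active = cluster_total(10, 30, "4X", n_regs=8)             # alpha = 0.5
always_on = cluster_total(10, 30, "4X", n_regs=8, alpha=1.0)    # no gating benefit
print(half_active, always_on)
```

Comparing the two calls shows how much of the cluster load the activity ratio removes under this assumed model.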
In Figure 2(a), circuit A is a conventional mesh tree that branches into two clusters. For cluster 1, we assume Capmesh_wire = 5 pf, Pwrmesh_wire = 5 pw, Capsubtree_wire = 5 pf, and Pwrsubtree_wire = 5 pw; therefore Captot = 15.5 pf and Pwrtot = 15.5 pw. For cluster 2, we assume Capmesh_wire = 10 pf, Pwrmesh_wire = 10 pw, Capsubtree_wire = 30 pf, and Pwrsubtree_wire = 30 pw; therefore Captot = 52 pf and Pwrtot = 52 pw. Cluster 2 has eight registers in its tree topology, so it needs a 4X gated cell to drive the clock tree. Overall, the average capacitance is 33.75 pf and the total power consumption is 62.5 pw. Note that Pwrmesh_wire1 is shared by clusters 1 and 2, so it is counted only once in the total power consumption.
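The bookkeeping for circuit A can be sketched as follows. The per-cluster totals used here (15.5 and 52) are assumptions chosen to be consistent with the stated average of 33.75 pf and total of 62.5 pw; the helper names are hypothetical.

```python
def average(values):
    """Arithmetic mean of the per-cluster totals."""
    return sum(values) / len(values)

def total_power(cluster_powers, shared_wire_powers):
    """Sum the cluster powers, then subtract each mesh-wire segment that two
    clusters share so that it is counted only once."""
    return sum(cluster_powers) - sum(shared_wire_powers)

cluster_totals = [15.5, 52.0]   # assumed totals for clusters 1 and 2 of circuit A
shared = [5.0]                  # mesh wire of cluster 1, also on the path to cluster 2

avg_cap = average(cluster_totals)
tot_pwr = total_power(cluster_totals, shared)
print(avg_cap, tot_pwr)
```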
Based on the conventional mesh circuit, we can optimize circuit A in two ways. In the first optimization method, because cluster 2 is too large to be driven by a single gate, we insert another gate G3 that divides cluster 2 into two smaller clusters according to the registers' activity status, as in circuit B in Figure 2(b). This method not only reduces the capacitance driven by each gate (in other words, a large gate can be replaced by a smaller one), but also makes timing closure more likely because the clusters become more similar. In circuit B, we assume Capmesh_wire and Pwrmesh_wire of clusters 1, 2, 3 are 5 pf, 13 pf, 4 pf and 5 pw, 13 pw, 4 pw; Capsubtree_wire and Pwrsubtree_wire are 5 pf, 13 pf, 14 pf and 5 pw, 13 pw, 14 pw. By formulas (1) and (2), Captot of clusters 1, 2, 3 is 15.5 pf, 32.5 pf, 24 pf, and Pwrtot of clusters 1, 2, 3 is 15.5 pw, 32.5 pw, 24 pw. The average capacitance of circuit B is 24 pf and the total power consumption is 63 pw. This optimization significantly reduces the average capacitance of the driving gates at a cost of only 0.5 pw of additional power.
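To see why splitting helps, the sketch below compares the eight-register cluster of circuit A with the clusters of circuit B. It assumes that a cluster total is the mesh-wire term plus α times (gate + subtree wire + registers), that the eight registers are split four and four, and that cluster 1 holds two registers. All three are assumptions chosen to be consistent with the numbers quoted above, and the function name is hypothetical.

```python
ALPHA = 0.5
GATE_1X, GATE_4X, REG = 6.0, 14.0, 5.0   # values assumed in this example

def cluster_total(mesh_wire, subtree_wire, gate, n_regs):
    # Assumed model: mesh wire always switches, the rest is weighted by ALPHA.
    return mesh_wire + ALPHA * (gate + subtree_wire + n_regs * REG)

# Circuit A: one large cluster behind a 4X gate.
before = cluster_total(10, 30, GATE_4X, 8)

# Circuit B: the same registers behind two 1X gates (assumed 4 + 4 split).
after = [cluster_total(13, 13, GATE_1X, 4),
         cluster_total(4, 14, GATE_1X, 4)]

# Average over all three clusters of circuit B (cluster 1 assumed to hold 2 registers).
circuit_b = [cluster_total(5, 5, GATE_1X, 2)] + after
avg_b = sum(circuit_b) / len(circuit_b)
print(before, after, avg_b)
```

Under these assumptions the largest cluster total drops once the 4X gate is replaced by two 1X gates.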
The second optimization method aims at balancing the number of gates across all mesh wires. It is better to distribute the gates evenly over the mesh wires to reduce the maximum loading, since too much loading on a mesh wire induces timing as well as temperature problems in the circuit. For circuit C in Figure 2(c), we assume Capmesh_wire and Pwrmesh_wire of clusters 1, 2, 3 are 5 pf, 13 pf, 8 pf and 5 pw, 13 pw, 8 pw; Capsubtree_wire and Pwrsubtree_wire are 5 pf, 13 pf, 14 pf and 5 pw, 13 pw, 14 pw. By formulas (1) and (2), Captot of clusters 1, 2, 3 is 15.5 pf, 32.5 pf, 28 pf, and Pwrtot of clusters 1, 2, 3 is 15.5 pw, 32.5 pw, 28 pw. The average capacitance of circuit C is 25.3 pf and the total power consumption is 71 pw. Compared to circuit B, both the average capacitance and the total power consumption increase slightly. However, if we focus only on the horizontal mesh wire, circuit C achieves a better average capacitance and total power consumption of 24 pf and 43 pw, because cluster 3 is moved to the y-axis mesh wire.
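The effect of moving cluster 3 can be sketched by summing the cluster totals attached to each mesh wire. The assignment of clusters to wires below is an assumption read from the description (all three clusters on the horizontal wire in circuit B, cluster 3 on the y-axis wire in circuit C), and the helper name is hypothetical.

```python
from collections import defaultdict

def load_per_wire(clusters):
    """clusters: list of (wire_name, cluster_total) pairs.

    Returns the summed cluster load on each mesh wire.
    """
    loads = defaultdict(float)
    for wire, total in clusters:
        loads[wire] += total
    return dict(loads)

# Cluster totals quoted in the text; wire assignment is assumed.
circuit_b = [("x", 15.5), ("x", 32.5), ("x", 24.0)]
circuit_c = [("x", 15.5), ("x", 32.5), ("y", 28.0)]

loads_b = load_per_wire(circuit_b)
loads_c = load_per_wire(circuit_c)
print(loads_b, loads_c)
```

Under this assumed assignment, the heaviest wire in circuit C carries less load than the single loaded wire in circuit B, which is the point of the balancing step.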
Circuit D in Figure 2(d) differs from the two previous methods: we use the proposed cross-mesh structure to build the mesh tree. Although circuit C balances the loading of the mesh wires, the burden on the driving buffers remains the same. To resolve this problem, our method puts driving buffers on both the x-axis mesh and the y-axis mesh, so that the gate cells of the clusters connect to the mesh tree evenly through these driving buffers. We assume Capmesh_wire and Pwrmesh_wire of clusters 1, 2, 3 are 4 pf, 5 pf, 3 pf and 4 pw, 5 pw, 3 pw; Capsubtree_wire and Pwrsubtree_wire are 5 pf, 13 pf, 14 pf and 5 pw, 13 pw, 14 pw; Pwrmesh_wire of buffer 12 and buffer 3 are 8 pw and 5 pw. According to the formulas, Captot of clusters 1, 2, 3 is 14.5 pf, 24.5 pf, 23 pf, and Pwrtot of clusters 1, 2, 3 is 14.5 pw, 24.5 pw, 23 pw. Because there are two driving buffers in the circuit, we analyze the average capacitance and total power consumption for each buffer. For buffer 12, the average capacitance is 19.5 pf and the total power consumption is 47 pw. For buffer 3, the average capacitance and total power consumption are 23 pf and 28 pw. This result shows that our cross-mesh structure can further reduce the capacitance of each cluster. We cannot directly compare the power consumption of the conventional mesh tree and the cross-mesh tree here because their structures differ. Instead, we discuss it in the next paragraph using the power consumption of the whole mesh tree.
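The per-buffer figures for circuit D can be reproduced with the sketch below. It assumes that a buffer's total power is the sum of its clusters' totals plus the mesh-wire power attributed to that buffer; this is an assumption that matches the quoted numbers, and the function name is hypothetical. Cluster capacitance and power are numerically equal in this example, so one list serves both.

```python
def buffer_stats(cluster_totals, mesh_wire_power):
    """Return (average cluster capacitance, total power) for one driving buffer."""
    avg_cap = sum(cluster_totals) / len(cluster_totals)
    tot_pwr = sum(cluster_totals) + mesh_wire_power
    return avg_cap, tot_pwr

# Buffer 12 drives clusters 1 and 2; buffer 3 drives cluster 3.
buffer_12 = buffer_stats([14.5, 24.5], mesh_wire_power=8)
buffer_3 = buffer_stats([23.0], mesh_wire_power=5)
print(buffer_12, buffer_3)
```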
To compare the average output capacitance seen by the driving buffers and the total power consumption of the whole circuit among the four results in Figure 2, we assume the power consumption of a driving buffer is 4 pw and that of a piece of mesh wire is 18 pw. According to the mesh structures in Figure 1, there are twelve pieces of mesh wire in the conventional mesh structure and only four in the cross-mesh structure. Since all circuits in Figure 2 use only one-fourth of the whole mesh circuit, we count four pieces of mesh wire for circuits A to C and two pieces for circuit D. In Table 1, Capavg under Subtree denotes the average capacitance of every cluster, Capavg under Whole Circuit is the average output capacitance of each driving buffer, and Pwrtot is the total power consumption of the whole circuit. The table shows that the cross-mesh structure has much better results than the other three.
Figure 1. (a) Uniform clock mesh. (b) Cross clock mesh.
Figure 2. Motivation example.
Figure 3. Cross mesh design flow.
Figure 4. Example of establishing cross-mesh.
Figure 5. Example of allocating driving buffer.
Figure 6. Flow of register clustering algorithm.
Figure 7. Example of register clustering algorithm.
Figure 8. Example of mesh network connecting algorithm.
Figure 9. Flow of load balancing algorithm.
Figure 10. Example of load balancing algorithm.
Table 1. Result of our motivation example.

| Circuit | Subtree Capavg (pf) | Whole Circuit Capavg (pf) | Whole Circuit Pwrtot (pw) |
|---|---|---|---|
| A (Conventional mesh tree) | 33.75 | 30 | 128.5 |
| B (Conventional mesh tree) | 24 | 31 | 126 |
| C (Conventional mesh tree) | 25.3 | 39 | 126 |
| D (Cross-mesh tree) | 21.2 | 15 | 107 |
Table 2. Amount of clock gates in different cluster constraint.

| Cluster Constraint | #Clock Gates (S9234) | #Clock Gates (S13207) | #Clock Gates (S38584) | #Clock Gates (S35932) |
|---|---|---|---|---|
| 60% | 11 | 27 | 69 | 80 |
| 70% | 9 | 23 | 54 | 66 |
| 80% | 8 | 21 | 46 | 55 |
| 90% | 8 | 20 | 39 | 46 |
Table 3. Clock gates and capacitance after register clustering in typical mode.

Voltage: 1.0V, Temperature: 25℃, Cluster constraint 90%

| Circuit | #Gate (Non Cluster) | #Gate (Cluster) | Capacitance (pF) (Non Cluster) | Capacitance (pF) (Cluster) | Cap. Reduction |
|---|---|---|---|---|---|
| s9234 | 211 | 8 | 0.867 | 0.338 | 61.0% |
| s13207 | 638 | 20 | 3.408 | 0.906 | 73.4% |
| s38584 | 1426 | 39 | 11.746 | 2.086 | 82.2% |
| s35932 | 1728 | 46 | 13.556 | 2.425 | 82.1% |
| Avg. Capacitance Reduction | | | | | 74.7% |
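The reduction column of Table 3 can be checked with a short sketch. It assumes that Cap. Reduction is defined as 1 − Cluster / Non Cluster, which is an assumption that reproduces the listed percentages; the names are hypothetical.

```python
# (Non Cluster, Cluster) capacitance in pF, copied from Table 3.
rows = {
    "s9234": (0.867, 0.338),
    "s13207": (3.408, 0.906),
    "s38584": (11.746, 2.086),
    "s35932": (13.556, 2.425),
}

def reduction(before, after):
    """Fractional capacitance reduction, assumed to be 1 - after / before."""
    return 1 - after / before

raw = {name: reduction(b, a) for name, (b, a) in rows.items()}
reductions = {name: round(r * 100, 1) for name, r in raw.items()}
# Average the unrounded values first, then round once for display.
average = round(sum(raw.values()) / len(raw) * 100, 1)
print(reductions, average)
```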
Table 4. Clock gates and capacitance after register clustering in fast mode.

Voltage: 1.1V, Temperature: -40℃, Cluster constraint 90%

| Circuit | #Gate (Non Cluster) | #Gate (Cluster) | Capacitance (pF) (Non Cluster) | Capacitance (pF) (Cluster) | Cap. Reduction |
|---|---|---|---|---|---|
| s9234 | 211 | 8 | 0.879 | 0.345 | 60.8% |
| s13207 | 638 | 20 | 3.444 | 0.924 | 73.2% |
| s38584 | 1426 | 41 | 11.824 | 2.155 | 81.8% |
| s35932 | 1728 | 47 | 13.651 | 2.484 | 81.8% |
| Avg. Capacitance Reduction | | | | | 74.4% |
Table 5. Clock gates and capacitance after register clustering in slow mode.

Voltage: 0.9V, Temperature: 125℃, Cluster constraint 90%

| Circuit | #Gate (Non Cluster) | #Gate (Cluster) | Capacitance (pF) (Non Cluster) | Capacitance (pF) (Cluster) | Cap. Reduction |
|---|---|---|---|---|---|
| s9234 | 211 | 8 | 0.857 | 0.333 | 61.2% |
| s13207 | 638 | 9 | 3.377 | 0.879 | 74.0% |
| s38584 | 1426 | 40 | 11.676 | 2.064 | 82.3% |
| s35932 | 1728 | 47 | 13.471 | 2.393 | 82.2% |
| Avg. Capacitance Reduction | | | | | 74.9% |
Table 6. Capacitance and skew after optimization in typical mode.

Voltage: 1.0V, Temperature: 25℃

| Circuit | Capacitance (pF) Pre Opt. | Capacitance (pF) Post Opt. | Cap. Ratio | Skew (ps) Pre Opt. | Skew (ps) Post Opt. | Skew Reduction |
|---|---|---|---|---|---|---|
| s9234 | 0.338 | 0.487 | 1.44x | 100.25 | 10.15 | 89.9% |
| s13207 | 0.906 | 1.613 | 1.78x | 402.25 | 9.02 | 97.8% |
| s38584 | 2.086 | 2.582 | 1.24x | 323.51 | 12.43 | 96.2% |
| s35932 | 2.425 | 2.974 | 1.22x | 318.75 | 10.55 | 96.7% |
| Avg. Capacitance Increasing | | | 1.42x | | | |
| Avg. Skew Reduction | | | | | | 95.1% |
Table 7. Capacitance and skew after optimization in fast mode.

Voltage: 1.1V, Temperature: -40℃

| Circuit | Capacitance (pF) Pre Opt. | Capacitance (pF) Post Opt. | Cap. Ratio | Skew (ps) Pre Opt. | Skew (ps) Post Opt. | Skew Reduction |
|---|---|---|---|---|---|---|
| s9234 | 0.345 | 0.493 | 1.43x | 71.89 | 7.27 | 89.9% |
| s13207 | 0.924 | 1.589 | 1.72x | 273.94 | 6.69 | 97.6% |
| s38584 | 2.155 | 2.870 | 1.33x | 317.69 | 6.83 | 97.8% |
| s35932 | 2.484 | 2.983 | 1.20x | 251.65 | 7.08 | 97.2% |
| Avg. Capacitance Increasing | | | 1.42x | | | |
| Avg. Skew Reduction | | | | | | 95.6% |
Table 8. Capacitance and skew after optimization in slow mode.

Voltage: 0.9V, Temperature: 125℃

| Circuit | Capacitance (pF) Pre Opt. | Capacitance (pF) Post Opt. | Cap. Ratio | Skew (ps) Pre Opt. | Skew (ps) Post Opt. | Skew Reduction |
|---|---|---|---|---|---|---|
| s9234 | 0.333 | 0.477 | 1.43x | 140.86 | 14.39 | 89.8% |
| s13207 | 0.879 | 1.479 | 1.68x | 542.28 | 24.66 | 95.4% |
| s38584 | 2.064 | 2.766 | 1.34x | 579.97 | 13.21 | 97.7% |
| s35932 | 2.393 | 3.160 | 1.32x | 574.61 | 16.76 | 97.1% |
| Avg. Capacitance Increasing | | | 1.44x | | | |
| Avg. Skew Reduction | | | | | | 95.0% |
Table 13. Capacitance comparison of cross mesh and uniform mesh.

| Circuit | Capacitance (pF), Cross-Mesh (Our proposed) | Capacitance (pF), Uniform Mesh ([19]) | Capacitance Reduction |
|---|---|---|---|
| s9234 | 0.487 | 0.666 | 26.7% |
| s13207 | 1.613 | 1.792 | 9.9% |
| s38584 | 2.582 | 4.409 | 41.4% |
| s35932 | 2.974 | 4.774 | 37.7% |
| Avg. Capacitance Reduction | | | 28.9% |
Table 14. Skew comparison of cross mesh and uniform mesh.

| Circuit | Skew (ps), Cross-Mesh (Our proposed) | Skew (ps), Uniform Mesh ([19]) | Skew Reduction |
|---|---|---|---|
| s9234 | 10.148 | 35.962 | 71.7% |
| s13207 | 9.017 | 87.816 | 89.7% |
| s38584 | 12.426 | 57.641 | 78.4% |
| s35932 | 10.548 | 57.558 | 81.6% |
| Avg. Skew Reduction | | | 80.4% |
Table 15. Comparison of capacitance and skew for s9234 in non-uniform mesh.

| Circuit | Grid | Skew (ps) | Skew Normalization | Capacitance (pF) | Capacitance Normalization |
|---|---|---|---|---|---|
| s9234 | 3×3 | 65.378 | 1 | 0.503 | 1 |
| s9234 | 4×4 | 46.984 | 0.72 | 0.554 | 1.10 |
| s9234 | 5×5 | 28.428 | 0.43 | 0.601 | 1.20 |
| s9234 | 6×6 | 46.298 | 0.71 | 0.645 | 1.28 |
| s9234 | Cross Mesh | 10.148 | 0.16 | 0.487 | 0.97 |
Table 16. Comparison of capacitance and skew for s13207 in non-uniform mesh.

| Circuit | Grid | Skew (ps) | Skew Normalization | Capacitance (pF) | Capacitance Normalization |
|---|---|---|---|---|---|
| s13207 | 5×5 | 126.881 | 1 | 1.360 | 1 |
| s13207 | 6×6 | 67.638 | 0.53 | 1.512 | 1.11 |
| s13207 | 7×7 | 59.068 | 0.47 | 1.591 | 1.17 |
| s13207 | 8×8 | 62.39 | 0.49 | 1.616 | 1.19 |
| s13207 | Cross Mesh | 9.017 | 0.07 | 1.613 | 1.19 |
Table 17. Comparison of capacitance and skew for s38584 in non-uniform mesh.

| Circuit | Grid | Skew (ps) | Skew Normalization | Capacitance (pF) | Capacitance Normalization |
|---|---|---|---|---|---|
| s38584 | 10×10 | 82.198 | 1 | 3.715 | 1 |
| s38584 | 11×11 | 60.983 | 0.74 | 3.925 | 1.06 |
| s38584 | 12×12 | 55.541 | 0.68 | 4.135 | 1.11 |
| s38584 | 13×13 | 55.09 | 0.67 | 4.074 | 1.10 |
| s38584 | Cross Mesh | 12.426 | 0.15 | 2.582 | 0.70 |
Table 18. Comparison of capacitance and skew for s35932 in non-uniform mesh.

| Circuit | Grid | Skew (ps) | Skew Normalization | Capacitance (pF) | Capacitance Normalization |
|---|---|---|---|---|---|
| s35932 | 12×12 | 64.243 | 1 | 4.207 | 1 |
| s35932 | 13×13 | 61.974 | 0.96 | 4.271 | 1.02 |
| s35932 | 14×14 | 55.178 | 0.86 | 4.485 | 1.07 |
| s35932 | 15×15 | 53.917 | 0.84 | 4.629 | 1.10 |
| s35932 | Cross Mesh | 10.548 | 0.16 | 2.974 | 0.71 |
Table 19. Capacitance comparison of cross mesh and non-uniform mesh.

| Circuit | Capacitance (pF), Cross-Mesh (Our proposed) | Capacitance (pF), Non-Uniform ([27]) | Capacitance Reduction |
|---|---|---|---|
| s9234 | 0.487 | 0.601 | 18.8% |
| s13207 | 1.613 | 1.591 | -1.4% |
| s38584 | 2.582 | 4.074 | 36.6% |
| s35932 | 2.974 | 4.629 | 35.7% |
| Avg. Capacitance Reduction | | | 22.4% |
Table 20. Skew comparison of cross mesh and non-uniform mesh.

| Circuit | Skew (ps), Cross-Mesh (Our proposed) | Skew (ps), Non-Uniform ([27]) | Skew Reduction |
|---|---|---|---|
| s9234 | 10.148 | 28.428 | 64.3% |
| s13207 | 9.017 | 59.068 | 84.7% |
| s38584 | 12.426 | 55.090 | 77.4% |
| s35932 | 10.548 | 53.917 | 80.4% |
| Avg. Skew Reduction | | | 76.7% |