continue to investigate multiple-counter dynamic load balancing schemes under different computing environments and provide guidance for users to select suitable schemes to meet their own needs. This paper starts with an introduction to counter-based dynamic load balancing schemes in Section II and an overview of two main network environments, InfiniBand and Ethernet, in Section III, followed by Section IV on the MATLAB simulation results for multi-counter based dynamic load balancing schemes. Section V presents the actual case studies and performance analysis using an HP cluster computer, NWICEB. Section VI provides guidance for users to select suitable schemes and discusses relevant issues of contingency selection and decision support capabilities in the context of massive contingency analysis. Section VII concludes the paper and suggests future work.

because multiple contingency cases can be easily divided onto multiple processors and communication between different processors is minimal; the emphasis is not on the parallelization itself but on the computational load balancing (task partitioning) to achieve the evenness of execution time for multiple processors.

The framework of parallel contingency analysis is shown in Figure 1 [4]. Each contingency case is essentially a power flow problem.

Proc 0:
(1) Distribute base case Y0 matrix
(2) Perform load balancing (static/dynamic)
(3) Distribute case information to other processors
(4) Perform contingency analysis

Proc 1, Proc 2, ..., Proc N
Other Proc's:
(1) Update Y matrix based on case information: Y = Y0 + ΔY
(2) Perform contingency analysis

Figure 1 Framework of parallel massive contingency analysis

There are two main categories of load balancing schemes: static load balancing and dynamic load balancing. The static load balancing scheme pre-allocates an equal number of cases to each processor, while the dynamic load balancing scheme allocates tasks to processors based on the availability of a processor. With the static load balancing scheme, the overall computational efficiency is determined by the longest execution time among the individual processors. Hence, the computational power is not fully utilized, as many processors are idle while waiting for the last task to finish; with the dynamic load balancing scheme, the computation time on each processor is optimally equalized.

Figure 2 shows the performance comparison of static and dynamic computation load balancing schemes with the WECC full N-1 contingency cases (17,346 cases). Clearly, dynamic load balancing has better linear scalability. Figure 3 shows the evenness of execution time with different load balancing schemes on 32 processors, where the variation of execution time with the dynamic load balancing scheme is much smaller than that of its static counterpart.

[Figure: speedup plot titled "14,000-bus WECC Full N-1 Analysis"; vertical axis: Speedup]
Figure 2 Performance comparison of static and dynamic computation load balancing schemes with WECC full N-1 contingency cases

[Figure: execution time plot; legend: Static Load Balancing, Dynamic Load Balancing]
Figure 3 Evenness of execution time for WECC full N-1 contingency analysis with different computational load balancing schemes

The speedup performance of the dynamic load balancing scheme can be estimated using the following equation:

    Speedup = N_P (t_c + t_io) / [ t_c + t_io + t_cnt + (N_P - 1) t_w / 2 ]    (1)

where N_P is the number of processors, t_cnt is the counter updating time, t_w is the waiting time due to counter congestion, and t_c and t_io are the average computation time and I/O time, respectively. In order to improve speedup performance, the counter updating time t_cnt, as well as the waiting time t_w, needs to be reduced.

The counter update time t_cnt is mainly determined by the network bandwidth and speed.
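Equation (1) can be turned into a quick back-of-the-envelope estimator. The sketch below is illustrative only; the timing values in the usage lines are hypothetical, not measurements from the paper.

```python
def estimated_speedup(n_p, t_c, t_io, t_cnt, t_w):
    """Speedup estimate from equation (1).

    n_p   -- number of processors
    t_c   -- average computation time per case
    t_io  -- average I/O time per case
    t_cnt -- counter updating time
    t_w   -- waiting time due to counter congestion
    """
    # The congestion term (n_p - 1) * t_w / 2 grows with the processor
    # count, which is what ultimately limits scalability.
    return n_p * (t_c + t_io) / (t_c + t_io + t_cnt + (n_p - 1) * t_w / 2)

# With zero counter overhead the model reduces to the ideal speedup n_p.
print(estimated_speedup(8, 1.0, 0.1, 0.0, 0.0))  # 8.0
# Hypothetical non-zero overheads erode the speedup at large n_p.
print(estimated_speedup(512, 1.0, 0.1, 0.001, 0.001))
```

Note that with t_cnt = t_w = 0 the estimate is exactly N_P, so the two overhead terms capture the entire deviation from ideal scaling.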
Minimizing t_cnt usually means selecting a high-performance network connection between processors. The waiting time t_w is due to counter congestion. Though more processors would improve the speedup performance, they also increase the possibility of counter congestion, as shown in (1). Counter congestion will occur when multiple requests arrive at the same time. Therefore, the scalability of dynamic load balancing schemes is likely to be limited by counter congestion.

In order to better manage counter congestion, we propose a multi-counter based dynamic load balancing scheme with task stealing. The framework of this new scheme is illustrated in Figure 4 with two counters. An equal number of cases is pre-allocated to each of the two counter groups. Each group has its own counter (Proc 0 in Figure 4). Inside each group, the dynamic load balancing scheme is applied based on the availability of processors. When the pre-allocated tasks are finished in one group, the counter in this group "steals" tasks from the other group to continue the computation until all tasks are done. By implementing the multi-counter dynamic load balancing scheme, counter congestion can be reduced, and further speedup is expected.

Figure 4 Framework of multi-counter based dynamic load balancing scheme with task stealing

The cost of minimizing counter congestion with multi-counter schemes is the overhead of managing multiple counters. Even though additional counters can reduce counter congestion, it is possible that this overhead would compromise the benefit gained by reducing counter congestion. Therefore, it is important to evaluate the performance of these load balancing schemes and determine under what conditions the multi-counter scheme has superior performance over the single-counter scheme.

III. NETWORK ENVIRONMENT COMPARISONS

As stated earlier, minimizing the counter updating time t_cnt usually means choosing a high-performance network connection among processors. However, due to cost issues, PC-based Ethernet networks dominate the current utility control center environment. In order to better understand the effect of high-performance networks, it is important to study the performance of multi-counter dynamic load balancing schemes under different network environments.

The main network properties are latency and bandwidth. In this paper, we are interested in the latency and bandwidth of two common networks: 1GB/Sec Ethernet and InfiniBand. The typical values of latency and bandwidth in these two networks are listed in Table 1. The latency of an InfiniBand network is approximately 1/20 of that of Ethernet.

TABLE 1 THE TYPICAL VALUES OF LATENCY AND BANDWIDTH IN 1GB/SEC ETHERNET AND INFINIBAND NETWORKS

                   1GB/Sec Ethernet    InfiniBand
  Latency (μSec)   ~30                 ~1-2
  Bandwidth        100~200 MB/Sec      10GB/Sec*

* InfiniBand is a type of communications link between processors and I/O devices that offers throughput of up to 2.5 GB/Sec, and it can achieve 10GB/Sec or higher bandwidth through double-rate and quad-rate techniques.

IV. MATLAB SIMULATION RESULTS

Before the actual case studies are conducted, MATLAB simulations are performed to predict the performance of multi-counter schemes under different simulated computing environments. The advantage of using simulated environments is the flexibility of studying different configurations while reducing the time spent implementing the schemes on actual parallel computers.

The main factors that could affect speedup include: (a) the number of processors, N_P; (b) the number of cases, N; (c) the latency of the network communication; (d) the counter updating time; and (e) the bandwidth of the network. Since the effects of factors (c), (d), and (e) are equivalent in terms of their contribution to the total time, these factors can be treated as one term, t_cnt, for the purpose of the MATLAB simulation. In order to study the sensitivities of the factors N_P, N, and t_cnt, two sets of simulations are studied with the number of processors ranging from 2^0 up to 2^10: (a) different numbers of cases, N = 500, 5000, 20000, with respect to different numbers of processors; and (b) different values of t_cnt, t_cnt = 0.0001, 0.002, and 0.005, with respect to different numbers of processors.

In order to make the simulation data closer to the actual computational time, the actual computational time for WECC full N-1 contingency analysis is studied, based on its histogram, and then used in the MATLAB simulation studies.

Figure 5 shows the speedup performance with different numbers of cases. t_cnt is fixed to 0.0001 for this set of simulations. The horizontal axis is the number of processors in base-2 exponentials, and the vertical axis is the speedup. It is clear that better speedup can be achieved when the number of cases increases. This statement has been confirmed by the case studies in [4]. With 512 processors, the speedup is 462 for 20,000 contingency cases, 503 for 150K cases, and 507 for 300K cases.

The sensitivity of t_cnt is shown in Figure 6. In this simulation, the number of contingency cases is fixed at 20,000. As in Figure 5, the horizontal axis shows the number of processors in base-2 exponentials, and the vertical axis shows the speedup.
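The single-counter scheme and the two-counter scheme with task stealing described above can be prototyped as a small event-driven simulation, in the spirit of the paper's MATLAB studies. The sketch below is a simplified Python illustration (the function names and per-case times are ours, not from the paper); it models only task allocation and a fixed counter-update cost t_cnt, ignoring congestion waits.

```python
import heapq

def makespan_static(case_times, n_procs):
    # Static scheme: pre-allocate an equal number of cases per processor
    # (round-robin); the slowest processor determines the makespan.
    return max(sum(case_times[p::n_procs]) for p in range(n_procs))

def makespan_dynamic(case_times, n_procs, n_counters=1, t_cnt=0.0):
    """Counter-based dynamic scheme: each free processor draws the next
    case from its group's counter, paying t_cnt per draw. With more than
    one counter, a processor whose group is exhausted steals work from
    the fullest remaining group (task stealing)."""
    groups = [list(case_times[g::n_counters]) for g in range(n_counters)]
    # Min-heap of (time available, processor id, counter group).
    heap = [(0.0, p, p % n_counters) for p in range(n_procs)]
    heapq.heapify(heap)
    finish = 0.0
    while heap:
        t, p, g = heapq.heappop(heap)
        if not groups[g]:  # own group done: steal from the fullest group
            g = max(range(n_counters), key=lambda k: len(groups[k]))
        if not groups[g]:  # all counters exhausted: processor retires
            finish = max(finish, t)
            continue
        work = groups[g].pop()
        heapq.heappush(heap, (t + t_cnt + work, p, g))
    return finish

# One long case makes the static split uneven; dynamic allocation hides it.
cases = [1.0] * 7 + [4.0]
print(makespan_static(cases, 4))   # 5.0
print(makespan_dynamic(cases, 4))  # 4.0
```

Sweeping n_procs, n_counters, and t_cnt in such a toy model reproduces the qualitative trade-off studied here: extra counters only pay off once counter traffic, not computation, dominates.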
There are three observations from Figure 6. The first observation is that, for a given number of processors, the larger t_cnt, which represents a low-speed communication network (e.g. an Ethernet network), the less speedup is achieved for both the single-counter and two-counter schemes. The second observation is that, with a larger t_cnt, the two-counter dynamic load balancing scheme shows better performance than the single-counter scheme as the number of processors increases; the larger the number of processors, the greater the advantage of the two-counter scheme. The third observation is that, when the number of processors is relatively small, e.g. less than 128, and with t_cnt = 0.0001 s, which represents a high-speed communication network, the performance of the single-counter scheme is better than that of the two-counter scheme. However, the two-counter scheme has better performance than the single-counter scheme when t_cnt is larger, for numbers of processors less than 128.

[Figure: speedup curves for different values of t_cnt]
Figure 6: Single-counter vs. two-counter comparison for different counter times with respect to different numbers of processors

The third observation is important for most utility/control center users who want to implement massive contingency analysis in their current environments. Currently, most utilities/control centers do not own a large number of processors, and they mostly use an Ethernet network environment. Therefore, the two-counter dynamic load balancing scheme would be useful for their contingency analysis applications.

V. CASE STUDIES OF MASSIVE CONTINGENCY ANALYSIS WITH DIFFERENT COUNTER-BASED SCHEMES

The massive contingency analysis framework with the single-counter and two-counter dynamic computational load balancing schemes is implemented on the NWICEB cluster machine, which has a total of 128 processors with a high-speed InfiniBand communication link. The 14,000-bus WECC power grid model is used as the study model. In order to simulate the Ethernet environment on the NWICEB machine, the counter update operation is executed 20 times more to mimic the low-speed communication; the number 20 is used because the latency of an Ethernet network is approximately 20 times that of InfiniBand.

Three scenarios with different numbers of contingency cases, N = 500, 5000, and 17346, are tested on NWICEB using both the single-counter and two-counter schemes. Two different environments (InfiniBand and simulated Ethernet) are compared. The execution time of all scenarios, excluding disk I/O time for the purpose of eliminating side effects, with the InfiniBand network is listed in Table 2, while Table 3 shows the execution time with the simulated Ethernet network.

TABLE 3 EXECUTION TIME (SECONDS) WITH THE SIMULATED ETHERNET NETWORK*

  Processors  Single N=500  Two N=500  Single N=5000  Two N=5000  Single N=17346  Two N=17346
  2           147.51        144.16     1341.5         1316.4      5098.4          5015.2
  4           83.546        85.655     754.4          764.7       2869.9          2804.6
  8           49.191        48.145     402.45         419.18      1569.5          1532.7
  16          26.236        26.007     207.39         210.18      804.6           778.95
  32          14.12         16.903     109.03         112.7       408.36          399.34
  64          8.1423        8.7142     58.26          60.537      211.03          202.66
  128         5.8304        6.869      29.959         32.098      112.17          108.17

  * Disk I/O time excluded

In Table 2, the execution times with the two-counter scheme are larger than those with the single-counter scheme, which indicates that the two-counter dynamic load balancing scheme does not show better speedup performance than the single-counter scheme under the InfiniBand environment.
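The two-counter benefit can be checked directly from the tabulated execution times. The snippet below uses the N = 17,346 columns of Table 3 (simulated Ethernet, disk I/O excluded) to compute the savings of the two-counter scheme at each processor count.

```python
# Execution times in seconds for N = 17,346 cases, from Table 3
# (simulated Ethernet environment, disk I/O time excluded).
procs  = [2, 4, 8, 16, 32, 64, 128]
single = [5098.4, 2869.9, 1569.5, 804.6, 408.36, 211.03, 112.17]
two    = [5015.2, 2804.6, 1532.7, 778.95, 399.34, 202.66, 108.17]

for p, s, t in zip(procs, single, two):
    # Positive savings: the two-counter scheme finishes earlier.
    print(f"{p:4d} procs: two-counter saves {s - t:6.2f} s ({(s - t) / s:.1%})")
```

In this column the two-counter scheme is faster at every processor count, and the 16-processor row reproduces the roughly 26-second saving discussed in Section V.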
Corresponding to Table 2, the speedup results with the single-counter vs. two-counter schemes under the InfiniBand environment are shown in Figure 7. This result matches the MATLAB simulation in Section IV. The main reason for this phenomenon is that the latency of InfiniBand is low and the communication speed is fast. As such, counter congestion is less likely to happen with a relatively small number of processors. Therefore, the overhead introduced by an additional counter impairs the performance under the testing circumstances. As shown by the MATLAB simulation, the two-counter scheme is more suitable for applications with a larger number of processors; i.e., with a large number of processors, the two-counter scheme will improve the computational performance of massive contingency analysis.

[Figure: speedup curves; vertical axis: Speedup; horizontal axis: The number of processors (1, 2, 4, 8, 16, 32, 64, 128); legend: Single N=500, Two N=500, Single N=5000, Two N=5000, Single N=17346, Two N=17346]
Figure 7: Single-counter vs. two-counter comparison with respect to different numbers of processors under the InfiniBand environment

When the communication speed is relatively low, counter congestion is more likely to happen. As shown in Table 3, when the number of contingency cases is large (N = 17,346), the two-counter scheme can improve the overall performance under the simulated Ethernet environment. Very importantly, this is true when the number of processors is relatively small and the number of contingency cases is large enough. For example, for the full N-1 contingency analysis cases and with 16 processors, the execution time with a single counter is 804.6 seconds, while the time with two counters is 778.95 seconds, which is about 26 seconds less. The speedup with the single-counter vs. two-counter schemes under the simulated Ethernet environment is shown in Figure 8. Figure 8 shows that when N is large (= 17,346), the performance of the two-counter scheme is better than that of the single-counter scheme, while the performance of the two-counter scheme is worse when the number of cases is small. These results match the MATLAB simulation in Section IV, Figure 6.

VI. DISCUSSION

The case studies, as well as the MATLAB simulation results, reveal insights regarding the performance of load balancing schemes, which can serve as guidance for utilities to select suitable counter schemes to implement massive contingency analysis under their computing environments. Considerations for the implementation include:

(a) In the case of a computing environment with an Ethernet network or equivalent, no high-speed networking capabilities, and a large number of processors, the two-counter dynamic scheme is expected to have better performance for the application of massive contingency analysis;

(b) In the case of a computing environment with a high-speed communication network but only a small number of processors, the single-counter dynamic scheme is suggested;

(c) In the case of a computing environment with a high-speed communication network and a dedicated cluster computer with a large number of processors, the performance of the two-counter dynamic scheme will be better than that of the single-counter scheme.

These considerations can be used as guidance for the actual implementation of massive contingency analysis.

[Figure: speedup curves; vertical axis: Speedup; horizontal axis: The number of processors (1, 2, 4, 8, 16, 32, 64, 128); legend: Single N=500, Two N=500, Single N=5000, Two N=5000, Single N=17346, Two N=17346]
Figure 8: Single-counter vs. two-counter comparison with respect to different numbers of processors under the simulated Ethernet environment

As mentioned in the Introduction section, the number of cases increases exponentially as the "x" in "N-x" increases. When N-x contingencies are considered, there are two major issues: a massive number of cases and a massive amount of data. Since the sheer number of contingency cases makes even simplified computation of all cases impractical, solving the first issue requires smart contingency selection methods, as well as high performance computing (HPC) techniques and hardware. The technical challenge of the second issue is how to navigate through the vast volume of data and help grid operators manage the complexity of operations and decide among multiple choices of actions. State-of-the-art industrial tools use tabular forms to present contingency analysis results. When massive "N-x" contingency cases are analyzed and the system is heavily stressed, the tabular method of display is rapidly overloaded, and it is then impossible for an operator to sift through the large amounts of violation data and understand the system situation within several seconds or minutes. Thus the usefulness of massive contingency analysis is undermined and the HPC benefit is diminished. In order to solve the second issue, advanced visualization techniques, as well as human factors, are needed to provide real-time situational awareness
VIII. REFERENCES
IX. BIOGRAPHIES
Zhenyu Huang (M'01, SM’05) received his B. Eng. from Huazhong
University of Science and Technology, Wuhan, China, and Ph.D. from
Tsinghua University, Beijing, China, in 1994 and 1999, respectively. From
1998 to 2002, he conducted research at the University of Hong Kong, McGill
University, and the University of Alberta. He is currently a staff research
engineer at the Pacific Northwest National Laboratory, Richland, WA, and a
licensed professional engineer in the state of Washington. His research
interests include power system stability and control, high-performance
computing applications, and power system signal processing.