Abstract—Mining community structures from the complex network is an important problem across a variety of fields. Many existing
community detection methods detect communities through optimizing a community evaluation function. However, most of these
functions even have high values on random graphs and may fail to detect small communities in the large-scale network (the so-called
resolution limit problem). In this paper, we introduce two novel node-centric community evaluation functions by connecting correlation
analysis with community detection. We will further show that the correlation analysis can provide a novel theoretical framework which
unifies some existing evaluation functions in the context of a correlation-based optimization problem. In this framework, we can mitigate
the resolution limit problem and eliminate the influence of random fluctuations by selecting the right correlation function. Furthermore,
we introduce three key properties used in mining association rule into the context of community detection to help us choose the
appropriate correlation function. Based on our introduced correlation functions, we propose a community detection algorithm called
CBCD. Our proposed algorithm outperforms existing state-of-the-art algorithms on both synthetic benchmark networks and real-world networks.
arXiv:1904.04583v1 [cs.SI] 9 Apr 2019
Index Terms—Complex networks, community detection, correlation analysis, random graph, node-centric function.
1 INTRODUCTION
• Experimental results demonstrate that our method outperforms state-of-the-art methods on real networks and LFR networks.

The organization of this paper is structured as follows: Section 2 discusses the related work. Section 3 introduces several basic definitions of correlation analysis and proves some properties of the proposed metrics. Section 4 describes our corresponding community detection method. Section 5 presents the experimental results of CBCD along with the other methods. Section 6 concludes this paper.

2 RELATED WORK

2.1 Correlation-Based Method
There are already some algorithms in the literature that utilize different types of correlation information hidden behind the network structure to solve the community detection problem. Once the correlation definition is specified, community detection can be cast as an optimization problem that maximizes a correlation-based objective function. The existing correlation-based community detection algorithms can be categorized according to their correlation definitions.

2.1.1 Node-Node Similarity
The pairwise similarity between two nodes can be viewed as a kind of correlation. The goal of community detection based on node-node similarity is to put nodes which are close to each other into the same group. In [32], a random-walk based node similarity was proposed, which can be used in an agglomerative algorithm to efficiently detect the communities in a network. The method in [33] builds on the node similarity proposed in [34] to find communities in an iterative manner.

2.1.2 Node-Node Correlation
A covariance matrix of the network is derived from the incidence matrix in [35], which can be viewed as the unbiased version of the well-known modularity matrix. A correlation matrix is obtained by introducing a re-scaling transformation into the covariance matrix, which significantly outperforms the covariance matrix on the identification of communities. The algorithm in [36] constructs a node-node correlation matrix based on the Laplacian matrix so as to incorporate the features of the NMF (non-negative matrix factorization) method. In [37], MacMahon et al. introduce appropriate correlation-based counterparts of the most popular community detection techniques via a consistent redefinition of null models based on random matrix theory. A correlation clustering [38] based community detection framework is proposed in [39], which unifies modularity, normalized cut, sparsest cut, correlation clustering and cluster deletion by introducing a single resolution parameter λ.

2.1.3 Edge-Community Correlation
In [40] [41], Duan et al. connect modularity with correlation analysis by reformulating modularity's objective function as a correlation function based on the probability that an edge falls into a community under the configuration model. Hence, it can be viewed as a method based on edge-community correlation information.

2.2 Node-Centric Method
Almost all classical community detection metrics implicitly assume that all nodes in a community are equally important. These metrics, viewing the community as a whole, only focus on the total number of edges within the community, the sum of the degrees of all nodes, or the size of the community. The connection strength between each vertex and the community is not involved in these metrics. This may result in missing important structural information, so that two distinct communities cannot be distinguished in certain circumstances. Node-centric methods take into account how densely each vertex connects to the community. This is consistent with the fact that the community is a local structure of the network.

Chakraborty et al. [42] [43] [44] propose a node-centric metric called permanence, which describes the membership strength of a node to a community. The central idea behind permanence is to consider the following two factors: (1) the distribution of the external edges rather than the number of all external connections, and (2) the strength of the internal connectivity instead of the number of all internal edges. The corresponding algorithm maximizes the permanence-based objective function. WCC is another node-centric metric, proposed in [45] [46], which considers the triangle as the basic structure instead of the edge or node. The WCC of a node consists of two parts: isolation and intra-connectivity. The proposed algorithm aims at the optimization of a WCC-based objective function. Focs [47] is a heuristic method that accounts for local connectedness. The local connectedness of a node to a community depends upon two scores of the node with respect to the community: community connectedness and neighborhood connectedness. Focs mainly consists of two phases, a leave phase and an expand phase, which are based on community connectedness and neighborhood connectedness, respectively.

2.3 Summary
Our method is different from all the above existing methods. Although the proposed metrics in our method are node-centric as well, they are derived from the perspective of correlation analysis. Different from the existing correlation-based methods, our method utilizes the correlation between a node and a community based on a 2 × 2 contingency table. Thus, our method can be viewed as a node-community correlation based method. Besides, we choose appropriate correlation functions to mitigate the resolution limit problem under the guidance of [48] [49]. Further analysis is given in Section 3.

3 CORRELATION ANALYSIS IN NETWORK
In this section, we first formalize the community detection problem; then, we introduce two correlation measures and extend them for community structure evaluation. Two new correlation metrics for community detection will be proposed in this section. Next, we describe modularity from the correlation analysis perspective and discuss its relation with our metrics. At last, we discuss the desirable properties for new correlation metrics.
JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 4
Fig. 2. One example graph with 7 nodes and 10 edges, where S is a community of size 5. The community vector Ψ_{S\{5}} and neighbor vector ψ_5 are:

Ψ_{S\{5}} = (e_1, e_2, e_3, e_4, e_6, e_7) = (1, 1, 1, 1, 0, 0)
ψ_5 = (g_1, g_2, g_3, g_4, g_6, g_7) = (1, 1, 0, 1, 1, 1)
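Read directly from the figure, these binary vectors already encode the key quantities used throughout the paper: ω = Ψ·ψ, ϵ = ‖Ψ‖₂² and d_u = ‖ψ‖₂². A minimal sketch in plain Python, with the values copied from Fig. 2:

```python
# Binary vectors from Fig. 2, indexed by the nodes 1, 2, 3, 4, 6, 7 (i.e. V \ {5}).
community = [1, 1, 1, 1, 0, 0]   # Ψ_{S\{5}}: membership in S \ {5}
neighbors = [1, 1, 0, 1, 1, 1]   # ψ_5: adjacency of node 5

omega = sum(a * b for a, b in zip(community, neighbors))  # links between 5 and S
eps = sum(community)                                      # ϵ = |S \ {5}|
d5 = sum(neighbors)                                       # degree of node 5
cosine = omega / (eps * d5) ** 0.5                        # Ψ·ψ / (‖Ψ‖₂‖ψ‖₂)

print(omega, eps, d5, round(cosine, 4))  # 3 4 5 0.6708
```

For binary vectors the dot product counts common 1s, so ω, ϵ and d_u fall out of Ψ and ψ directly; this cosine value reappears in Theorem 7 as the large-N limit of the φ-Coefficient.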
Fig. 4. Node u does not have strong isolation from the rest of the graph, which is circled by the dashed line.

the increment of ω will lead to the increment of ω/d_u when ϵ and d_u are both fixed. In regard to P2, M should increase with the increment of ω/d_u, which means that the isolation of u from the rest of the graph is becoming stronger. In a similar way, for P3, M should monotonically decrease with the increment of d_u when ω and ϵ are both fixed, since ω/d_u will decrease.

To further analyze the relationship between correlation measures and community detection, let us consider the confidence measures of C and G. We have:

confidence(C ⇒ G) = P(G | C) = P(GC)/P(C) = ω/d_u,
confidence(G ⇒ C) = P(C | G) = P(GC)/P(G) = ω/ϵ,

where confidence(C ⇒ G) and confidence(G ⇒ C) are defined as the neighborhood connectedness score (nb-score) of a node and the community connectedness score (com-score) of a node in the Focs algorithm [47], respectively. These two node-based measures are the theoretical basis of the Focs algorithm. This fact tells us that correlation analysis can be appropriately applied to community detection.

3.2.1 Piatetsky-Shapiro's Rule-Interest
Piatetsky-Shapiro's rule-interest [31] measures the difference between the true probability and the expected probability under the assumption of independence between C and G. Piatetsky-Shapiro's rule-interest is appropriate for the task of community detection, and we will show later that modularity can be explained under our framework when the correlation measure is Piatetsky-Shapiro's rule-interest. Our measurement function based on Piatetsky-Shapiro's rule-interest is calculated as:

PS(u, S) = P(CG) − P(C)P(G) = ω/N − ϵd_u/N². (5)

Recalling the aforementioned variables in Table 2, the degree of vertex u is d_u, the size of the set S\{u} is ϵ, and the number of links between vertex u and community S is ω. If we randomly select a node from the node set V\{u}, the probability that the selected node simultaneously connects vertex u and belongs to community S is ω/N. Similarly, a randomly selected node from V\{u} connects vertex u with probability d_u/N and belongs to community S with probability ϵ/N. In the case of a random graph, the expected probability that the chosen node simultaneously connects vertex u and belongs to community S is (ϵ/N)·(d_u/N).

The modularity function has a few variants, but they are all identical in nature. Without loss of generality, we adopt the definition given in Equation (2) as our modularity function. Given a community S, the partial modularity of S is reformulated as:

Q_S = l_S/m − (K_S/(2m))² = l_S/m − (Σ_{j∈S} d_j)/(2m) · (Σ_{i∈S} d_i)/(2m),

where m is the number of edges in graph G and l_S is the number of edges inside community S. If we randomly choose an edge from the edge set E, the probability that the chosen edge is inside community S is l_S/m. Likewise, the probability that one randomly chosen edge has at least one end inside community S is (Σ_{j∈S} d_j)/(2m). Then, the expected probability that the chosen edge is inside community S is (Σ_{j∈S} d_j)/(2m) · (Σ_{i∈S} d_i)/(2m) when G is a random graph. Let H denote a binary random variable where H = 1 if the randomly chosen edge has at least one end inside community S and H = 0 otherwise. Then, the partial modularity Q_S can be rewritten as P(HH) − P(H)P(H). We can find that the partial modularity embodies the same idea as our measurement function in (5), i.e., they are both concrete examples of Piatetsky-Shapiro's rule-interest. However, the derivations of the partial modularity and our function start from two different perspectives. The partial modularity is edge-centric, utilizing the relationship between an edge and a community. Our formulation in (5) is node-centric, utilizing the relationship between a node and a community. For the sake of simplicity, we will use PS to denote our measurement function in (5), and we also call it "node modularity".

Although the PS function can be used for evaluating the strength of the correlation between a node and a community, it is still not clear whether it satisfies the aforementioned three properties. A thorough analysis is essential to confirm the utility of PS in the context of community detection. P2 and P3 are necessary properties that a good community correlation measure should have. Essentially, these two properties can be interpreted as "a higher value of PS(u, S) indicates a stronger correlation between u and S". P1 concerns how far the correlation between u and S is from what it would be in a random graph. For the PS measure, we have the following two theorems, where Theorem 1 proves that P2 and P3 hold and Theorem 2 proves that P1 holds.

Theorem 1. For fixed N > 0, we have 0 < ϵ < N, 0 < d_u < N and 0 < ω ≤ min(ϵ, d_u). Let PS(u, S) be the correlation value between u and S defined in (5). Then, 1) PS(u, S) monotonically increases with the increment of ω when ϵ and d_u are fixed, 2) PS(u, S) monotonically decreases with the increment of ϵ when ω and d_u are fixed, and 3) PS(u, S) monotonically decreases with the increment of d_u when ϵ and ω are fixed.
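Equation (5) is cheap to evaluate, and the monotonicity claimed in Theorem 1 can be sanity-checked numerically before the proof. A small sketch; the numbers are hypothetical, chosen to satisfy ω ≤ min(ϵ, d_u):

```python
def ps(omega, eps, du, N):
    """PS(u, S) = P(CG) - P(C)P(G) = omega/N - eps*du/N**2  (Equation (5))."""
    return omega / N - eps * du / N ** 2

base = ps(omega=4, eps=5, du=6, N=100)  # 0.04 - 0.003 = 0.037
assert ps(5, 5, 6, 100) > base          # 1) increases with omega
assert ps(4, 6, 6, 100) < base          # 2) decreases with eps
assert ps(4, 5, 7, 100) < base          # 3) decreases with d_u
```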
Proof. Let PS(u, S) = J(ω, ϵ, d_u). We directly calculate the partial derivatives of PS(u, S) with respect to ω, ϵ and d_u:

∂J(ω, ϵ, d_u)/∂ω = 1/N > 0,  ∂J(ω, ϵ, d_u)/∂ϵ = −d_u/N² < 0,  ∂J(ω, ϵ, d_u)/∂d_u = −ϵ/N² < 0.

Overall, the monotonicity of PS(u, S) is consistent with the monotonicity of ω, ϵ and d_u respectively.

Theorem 2. Given the E-R model G(n, m) with the probability p = 2m/(n(n − 1)), let PS(u, S) be the correlation value between u and S. Then E[PS(u, S)] = 0 holds for any subgraph S ∈ G_S and node u ∈ S, where G_S is the set of all the subgraphs of G.

Proof. Let X_v be a random variable such that X_v = 1 if v has a link with u, and X_v = 0 otherwise. Clearly, E[X_v] = 1·P(X_v = 1) + 0·P(X_v = 0) = p, d_u = Σ_{i=1}^n X_i and ω = Σ_{v∈S} X_v. By the linearity of expectations,

E[PS(u, S)] = E[ω/N − ϵd_u/N²] = E[ω]/N − ϵ·E[d_u]/N²
= (1/N)·E[Σ_{v∈S} X_v] − (ϵ/N²)·E[Σ_{i=1}^n X_i]
= (1/N)·Σ_{v∈S} E[X_v] − (ϵ/N²)·Σ_{i=1}^n E[X_i]
= ϵp/N − (ϵ/N²)·Np = 0.

Theorem 2 shows that PS can be regarded as the distance between the correlation value of u and S in a real network and the average correlation value of u and S over all the random networks. However, this theoretic result ignores the variance of the PS variable over the random networks. If the distribution of PS values is not strongly peaked, it is very likely that many PS values will far exceed zero even in random networks. Thus, we should investigate the variance of PS values as well. We have the following lemma.

Lemma 1. Given the E-R model G(n, m) with the probability p = 2m/(n(n − 1)), let λ = Np be the expected degree under the E-R model and PS(u, S) the correlation value between u and S. Let G_S be the set of all the subgraphs of G. Then, ∀κ > 0, ∀S ∈ G_S and ∀u ∈ S, we have the following upper bound on the probability that PS(u, S) is no less than √κ:

P[PS(u, S) ≥ √κ] ≤ (1/κ)·[ (ϵ²p² − ϵp² + ϵp)/N² + ϵ²λ(λ − p + 1)/N⁴ − 2ϵ²p(λ + 1)/N³ ].

Proof. Analogously to the proof of Theorem 2, we use sums of the random variables X_i to denote ω and d_u:

d_u = Σ_{i=1}^n X_i,  ω = Σ_{v∈S} X_v.

Let Y = PS(u, S). The second moment of Y can be obtained by the linearity of expectations:

E[Y²] = E[(ω/N − ϵd_u/N²)²] = E[ω²/N² − 2ϵd_uω/N³ + ϵ²d_u²/N⁴]
= (1/N²)·E[ω²] − (2ϵ/N³)·E[d_uω] + (ϵ²/N⁴)·E[d_u²].

Calculating E[ω²], E[d_uω] and E[d_u²] independently, we have

E[ω²] = E[Σ_{i∈S} X_i · Σ_{j∈S} X_j] = Σ_{i∈S} Σ_{j∈S} E[X_iX_j] = ϵ(ϵ − 1)p² + ϵp,
E[d_uω] = E[Σ_{i=1}^n X_i · Σ_{j∈S} X_j] = Σ_{i=1}^n Σ_{j∈S} E[X_iX_j] = ϵNp² + ϵp = ϵλp + ϵp,
E[d_u²] = E[Σ_{i=1}^n X_i · Σ_{j=1}^n X_j] = Σ_{i=1}^n Σ_{j=1}^n E[X_iX_j] = N(N − 1)p² + Np = λ(λ − p + 1).

To obtain the variance of Y, we first apply Theorem 2:

Var[Y] = E[Y²] − (E[Y])² = E[Y²]
= (ϵ(ϵ − 1)p² + ϵp)/N² − 2ϵ²p(λ + 1)/N³ + ϵ²λ(λ − p + 1)/N⁴.

Then, by the Chebyshev inequality, we finish the proof:

P[PS(u, S) ≥ √κ] = P[Y − E[Y] ≥ √κ] ≤ Var[Y]/κ
= (1/κ)·[ (ϵ(ϵ − 1)p² + ϵp)/N² − 2ϵ²p(λ + 1)/N³ + ϵ²λ(λ − p + 1)/N⁴ ].

Most large-scale real networks appear to be sparse [51] [52] [53], in which the number of edges is generally of the order n rather than n² [52]. In addition, sparse graphs are particularly sensitive to random fluctuations [23]. Therefore, we should concentrate more on the performance of PS on sparse graphs. With respect to sparsity under the E-R model, we have the following theorem.

Theorem 3. Given the E-R model G(n, m) with the probability p = 2m/(n(n − 1)) and λ = Np the expected degree under the E-R model, assume λ always remains finite in the limit of infinite size [23]. Let PS(u, S) be the correlation value between u and S. Then, ∀κ > 0, for any subgraph S and ∀u ∈ S, we have:

lim_{N→+∞, p→0} P[PS(u, S) ≥ √κ] = 0.
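Before turning to the proof, the bound of Lemma 1 is easy to evaluate numerically. With κ = 0.0001, N = 280, ϵ = 20 and λ = 8 (the example used in the discussion of Theorem 3), a short computation reproduces the 6.54% figure:

```python
# Evaluate the Chebyshev bound of Lemma 1 for the worked example in the text.
kappa, N, eps, lam = 1e-4, 280, 20, 8
p = lam / N                                    # since lambda = N * p

var_y = (eps * (eps - 1) * p**2 + eps * p) / N**2 \
        - 2 * eps**2 * p * (lam + 1) / N**3 \
        + eps**2 * lam * (lam - p + 1) / N**4  # Var[Y] from the Lemma 1 proof

bound = var_y / kappa                          # P[PS(u, S) >= sqrt(kappa)]
print(f"P[PS(u, S) >= {kappa ** 0.5:.2f}] <= {bound:.4f}")  # <= 0.0654
```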
Proof. We first apply Lemma 1:

P[PS(u, S) ≥ √κ]
≤ (1/κ)·[ (ϵ(ϵ − 1)p² + ϵp)/N² − 2ϵ²p(λ + 1)/N³ + ϵ²λ(λ − p + 1)/N⁴ ]
≤ (1/κ)·[ (ϵ(ϵ − 1)p² + ϵp)/N² + ϵ²λ(λ − p + 1)/N⁴ ]
≤ (1/κ)·[ ((Np)² + pN)/N² + λ(λ − p + 1)/N² ]
= (1/κ)·(2λ² + 2λ − λp)/N².

Then, we take the limit:

lim_{N→+∞, p→0} P[PS(u, S) ≥ √κ] ≤ lim_{N→+∞, p→0} (1/κ)·(2λ² + 2λ − λp)/N² = 0.

Theorem 3 shows that it is difficult to obtain a high PS value in a large sparse random network. This theoretic result indicates that the distribution of PS values is strongly peaked, that is, most PS values are near zero. In fact, even in a small random network, the upper bound introduced in Lemma 1 is often a small value. For example, let κ = 0.0001, N = 280, ϵ = 20 and λ = 8; then we have P[|PS(u, S)| ≥ 0.01] ≤ 6.54%. Overall, a high PS value between u and S in a real sparse network is an indicator that there is a significant correlation between u and S. This provides a theoretical basis for our community detection algorithm.

To formulate a vertex-centric metric of a community, we have:

F(S) = Σ_{u∈S} PS(u, S) = 2l_S/N − ϵ_S K_S/N², (6)

where l_S is the number of edges inside community S and K_S represents the sum of degrees of all vertexes in community S. Then, the objective function for a community partition is:

Γ(P) = Σ_{j=1}^M F(S_j) = Σ_{j=1}^M ( 2l_{S_j}/N − ϵ_{S_j} K_{S_j}/N² ). (7)

3.2.2 φ-Coefficient
The φ-Coefficient [54] is a variant of Pearson's product-moment correlation coefficient for binary variables. In association mining, it is often used to estimate whether there is a non-random pattern. Our measurement function based on the φ-Coefficient is calculated as:

φ(u, S) = (P(CG) − P(C)P(G)) / √(P(C)(1 − P(C))·P(G)(1 − P(G))) = (ωN − ϵd_u)/√(ϵ(N − ϵ)d_u(N − d_u)). (8)

A positive φ(u, S) value indicates that node u has denser intra-connections with community S and stronger isolation from the rest of the graph. Moreover, φ(u, S) has the same range of values as Pearson's product-moment correlation coefficient, i.e., −1 ≤ φ(u, S) ≤ 1. When φ(u, S) = 1, node u has all its links connecting all the members of community S and has no other links with the rest of the graph. The above discussions are subjective and intuitive. To further analyze φ(u, S), we have theoretic results similar to those of PS.

Theorem 4. For fixed N > 0, we have 0 < ϵ < N, 0 < d_u < N and 0 < ω ≤ min(ϵ, d_u). Let φ(u, S) be the φ-Coefficient value between u and S defined in (8). Then, 1) φ(u, S) monotonically increases with the increment of ω when ϵ and d_u are fixed, 2) φ(u, S) monotonically decreases with the increment of ϵ when ω and d_u are fixed, and 3) φ(u, S) monotonically decreases with the increment of d_u when ϵ and ω are fixed.

Proof. Let φ(u, S) = J(ω, ϵ, d_u). We directly calculate the partial derivatives of φ(u, S) with respect to ω, ϵ and d_u:

∂J(ω, ϵ, d_u)/∂ω = N/√(ϵ(N − ϵ)d_u(N − d_u)) > 0,
∂J(ω, ϵ, d_u)/∂ϵ = [d_u(N − d_u)]^(−1/2) · ( −N(ϵ(d_u − 2ω) + ωN) ) / ( 2[ϵ(N − ϵ)]^(3/2) ),
∂J(ω, ϵ, d_u)/∂d_u = [ϵ(N − ϵ)]^(−1/2) · ( −N(d_u(ϵ − 2ω) + ωN) ) / ( 2[d_u(N − d_u)]^(3/2) ).

We only need to focus on the term −N(d_u(ϵ − 2ω) + ωN). This term is obviously negative when ϵ ≥ 2ω. Now assume ϵ < 2ω. Since ω ≤ ϵ, we have 2ω − ϵ ≤ ω. Then,

−N(d_u(ϵ − 2ω) + ωN) = −N(ωN − d_u(2ω − ϵ)) ≤ −N(ωN − d_uω) < −N(ωN − Nω) = 0.

Thus, ∂J(ω, ϵ, d_u)/∂d_u < 0 holds. In a similar way, ∂J(ω, ϵ, d_u)/∂ϵ < 0 holds as well. Overall, the monotonicity of φ(u, S) is consistent with the monotonicity of ω, ϵ and d_u respectively.

Theorem 5. Given the E-R model G(n, m) with the probability p = 2m/(n(n − 1)), let φ(u, S) be the φ-Coefficient value between u and S. For fixed 0 < ϵ < N, E[φ(u, S)] = 0 holds for any subgraph S ∈ G_S and node u ∈ S, where G_S is the set of all the subgraphs of G.

Proof. Firstly, we introduce a random variable:

Y = 1/√(d_u(N − d_u)).

At the same time, let X_v be a random variable such that X_v = 1 if v has a link with u, and X_v = 0 otherwise. Integrating these two random variables, we have:

E[X_vY] = E[Y | X_v = 1]·P(X_v = 1) + 0·P(X_v = 0) = p·E[ 1/√(d_u(N − d_u)) | X_v = 1 ] = E[H_v]p,
where E[H_v] = E[Y | X_v = 1]. Besides, we have d_u = Σ_{i=1}^n X_i and ω = Σ_{v∈S} X_v. Let K = (ϵ(N − ϵ))^(−1/2). By the linearity of expectations,

E[φ(u, S)] = E[ (ωN − ϵd_u)/√(ϵ(N − ϵ)d_u(N − d_u)) ]
= K·E[(ωN − ϵd_u)Y]
= K·E[ (N·Σ_{v∈S} X_v − ϵ·Σ_{i=1}^n X_i)·Y ]
= K·( N·Σ_{v∈S} E[YX_v] − ϵ·Σ_{i=1}^n E[YX_i] )
= K·( E[H_v]·ϵNp − ϵ·E[H_v]·Np ) = 0.

Theorem 4 and Theorem 5 indicate that φ(u, S) is a good community measurement function, since it owns the necessary properties P1, P2 and P3. This is the theoretic basis for employing φ(u, S) in community detection. Now we can take the sum of φ(u, S) over all u ∈ S to formulate a vertex-centric metric:

Φ(S) = Σ_{u∈S} φ(u, S) = Σ_{u∈S} (ω_uN − ϵd_u)/√(ϵ(N − ϵ)d_u(N − d_u)) = (ϵ(N − ϵ))^(−1/2)·Σ_{u∈S} (ω_uN − ϵd_u)/√(d_u(N − d_u)). (9)

Although different definitions of what a community should be have been proposed, some general ideas are widely accepted by most scholars. One of them is that a clique can be regarded as a perfect community. Thus, loosely connected cliques should be separated from each other as different communities. The modularity function may fail to detect modules which are smaller than a certain scale in large networks, which is called the resolution limit problem [28] [29]. A more general example in a real network has been given in [28], which is shown in Fig. 5: there are two cliques of the same size, Cs and Ct, and there is an edge connecting one node u from Cs and another node v from Ct.

Fig. 5. Two equal-sized cliques Cs and Ct connected by a single edge (u, v); nodes c ∈ Cs and o ∈ Ct each have one edge to the rest of the graph.

When the network is large enough, the modularity-based optimization will merge Cs and Ct into a bigger community, since this operation will help increase the value of the modularity function. Apparently, this is counter-intuitive and disturbing. Fortunately, we can prove that such an issue does not occur with Φ(S).

Theorem 6. Given a graph G(n, m) with a subgraph which consists of two equal-sized cliques Cs and Ct, there is only one edge connecting u and v, where u ∈ Cs and v ∈ Ct. Let P = Cs ∪ Ct and I = |Cs| = |Ct|. Two nodes c and o, where c ∈ Cs and o ∈ Ct, each have one edge connecting to the rest of the graph. Assuming 5 ≤ I < n/4, we have Φ(Cs) + Φ(Ct) > Φ(P).

Proof. From I < n/4, we know that ϵ < n/4. Let ∆Φ = Φ(P) − (Φ(Cs) + Φ(Ct)) and T = 2ϵ + 1. Then ∆Φ can be calculated as:

∆Φ = Σ_{i∈P} φ(i, P) − 2Φ(Cs)
= 2·Σ_{i∈Cs} (ω_iN − ϵd_i)/√(d_i(N − d_i)) · [ (T(N − T))^(−1/2) − (ϵ(N − ϵ))^(−1/2) ] + 2·N/√(d_u(N − d_u)) · (T(N − T))^(−1/2).

Let L = N/√(d_u(N − d_u)) · (T(N − T))^(−1/2) and B = Σ_{i∈Cs} (ω_iN − ϵd_i)/√(d_i(N − d_i)). We have:

(1/2)∆Φ = B·( √(ϵ(N − ϵ)) − √(T(N − T)) )/√(T(N − T)·ϵ(N − ϵ)) + L.

Since T = 2ϵ + 1 ≤ n/2, we have √(2ϵ(N − 2ϵ)) < √(T(N − T)). We can obtain:

(1/2)∆Φ < −B·( √(2ϵ(N − 2ϵ)) − √(ϵ(N − ϵ)) )/√(T(N − T)·ϵ(N − ϵ)) + L.

Now we need to analyze B. Firstly, B can be calculated as (every i ∈ Cs satisfies ω_i = ϵ):

B = Σ_{i∈Cs} (ω_iN − ϵd_i)/√(d_i(N − d_i)) = Σ_{i∈Cs} ϵ(N − d_i)/√(d_i(N − d_i)) = Σ_{i∈Cs} ϵ·√((N − d_i)/d_i).

Thus, we have an upper bound of (1/2)∆Φ as:

(1/2)∆Φ < −(ϵ − 1)·√(ϵ(N − ϵ))·( √(2ϵ(N − 2ϵ)) − √(ϵ(N − ϵ)) )/√(ϵ(N − ϵ)T(N − T)) + N/√(ϵ(N − ϵ)T(N − T))
= [ −(ϵ − 1)√(ϵ(N − ϵ))·( √(2ϵ(N − 2ϵ)) − √(ϵ(N − ϵ)) ) + N ] / √(ϵ(N − ϵ)T(N − T))
= [ −ϵ(ϵ − 1)(N − ϵ)·( √(2(2 − N/(N − ϵ))) − 1 ) + N ] / √(ϵ(N − ϵ)T(N − T))
< [ −ϵ(ϵ − 1)(N − ϵ)·( √(4/3) − 1 ) + N ] / √(ϵ(N − ϵ)T(N − T)).

Let g(ϵ) = −ϵ(ϵ − 1)(N − ϵ)·(√(4/3) − 1) + N. Since 4 ≤ ϵ < n/2, g(ϵ) decreases as ϵ increases. Then, we have:

g(ϵ) ≤ g(4) < −1.8(N − 4) + N < 0.

Thus, ∆Φ < 0, which finishes the proof.

Theorem 6 reveals that Φ(S) can avoid merging two very pronounced communities. The reason is that Φ(S) takes into account the local structural information of every vertex. Now we provide a comprehensible interpretation of this point. Firstly, it can be found that all the φ(u, S) values of vertices from Cs are 1 or approximately 1. The value 1 is the maximum value that a vertex can achieve, which indicates that a node completely belongs to a community in the sense that this node connects every member of the community and has no external links. Obviously, the φ(u, S) values will be reduced after merging Cs and Ct if the number of edges across Cs and Ct is too small. If we want to retain the correlation value 1 for each vertex, the vertices from Cs should connect all vertices from Ct. However, this is a far cry from the situation shown in Fig. 6. We can observe that the number of inter-edges across two communities that allows the two communities to merge is affected by the Φ(S) values of the two communities. When the Φ(S) values of two communities are both higher, more inter-edges are required.

One of the main causes of the resolution limit problem is that modularity only considers the whole community rather than every vertex. Besides, it only depends on the number of edges m when the network is sufficiently large. Let Q be the modularity sum of two communities before the mergence and Q́ be the modularity value after the union. Then, we have:

∆Q = Q́ − Q = ∆l/m − K₁K₂/(2m²),

where K₁ is the degree sum of community 1, K₂ is the degree sum of community 2 and ∆l is the difference between the number of intra-edges after merging the two communities and that before the combination. If ∆Q > 0, it requires ∆l > K₁K₂/(2m). As we can see, when m increases, the number of inter-edges needed is reduced. As m tends to infinity, any two communities will be merged even if there is only one edge between them. The reason why this happens is that ∆l only embodies the overall structural information of a community, and modularity misses local structural information. To avoid the resolution limit problem as much as possible, first of all, the community metric should consider every vertex rather than viewing the community as a single unit. Second, the community metric should take full advantage of the local structural information. Then, we have the following theorem.

Theorem 7. Given a graph G(n, m) with a vertex u and a community S, to simplify the notation let Ψ = Ψ_{S\{u}} and ψ = ψ_u, where Ψ is the community vector of S and ψ is the neighbor vector of u. For fixed ϵ > 0, d_u > 0 and ω > 0, we have:

lim_{N→+∞} φ(u, S) = Ψ·ψ/(‖Ψ‖₂‖ψ‖₂),

where ‖·‖₂ is the Euclidean norm.

Proof. We directly calculate the limit of φ(u, S) as N tends to infinity:

lim_{N→+∞} φ(u, S) = lim_{N→+∞} (ωN − ϵd_u)/√(ϵ(N − ϵ)d_u(N − d_u))
= lim_{N→+∞} (ω − ϵd_u/N)/√(ϵ(1 − ϵ/N)·d_u(1 − d_u/N))
= ω/√(ϵd_u) = Σ_{i=1}^N Ψ_iψ_i / ( √(Σ_{i=1}^N Ψ_i²)·√(Σ_{i=1}^N ψ_i²) ) = Ψ·ψ/(‖Ψ‖₂‖ψ‖₂).

Theorem 7 indicates that φ(u, S) is the cosine similarity between Ψ and ψ as N tends to infinity. Then, we can recalculate Φ(S) in the limit:

Φ(S) = Cos(S) = Σ_{u∈S} ω_u/√(ϵd_u). (10)

Obviously, it considers the cosine similarity values between each node and the community. As a result, it is unlikely to perform the community mergence operation in the case of weak inter-connections between two communities. Theorem 6 and Theorem 7 both indicate that Φ(S) can mitigate the resolution limit problem.

3.3 Summary
As a short summary, we would like to present the following remarks. First of all, the use of different correlation functions in our node-centric framework may lead to different but known community evaluation measures. As shown, modularity, node modularity, and the neighborhood connectedness and community connectedness in Focs [47] are all concrete examples of our abstract framework. More importantly, we may explore more correlation functions to obtain more effective community evaluation measures in the future.

4 CORRELATION-BASED COMMUNITY DETECTION
In this section, we propose a Correlation-Based Community Detection (CBCD) algorithm, which is based on the PS measure and the φ-Coefficient. CBCD takes a graph G(V, E) as input and generates a partition of G which is the set of detected
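As a compact illustration of the φ-Coefficient in (8) and its limit behaviour from Theorem 7 (the numeric values here are hypothetical):

```python
def phi(omega, eps, du, N):
    """phi(u, S) = (omega*N - eps*du) / sqrt(eps*(N-eps)*du*(N-du))  (Equation (8))."""
    return (omega * N - eps * du) / (eps * (N - eps) * du * (N - du)) ** 0.5

# A node linked to exactly the eps community members and nothing else
# (omega = eps = d_u) attains the maximum phi = 1.
assert phi(4, 4, 4, 100) == 1.0

# For fixed omega, eps, d_u, phi approaches the cosine form omega/sqrt(eps*d_u)
# as N grows (Theorem 7).
assert abs(phi(3, 4, 5, 10**7) - 3 / (4 * 5) ** 0.5) < 1e-5
```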
JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 11
Algorithm 2 Triangle Counting. (Line 7 to 13). Then, we put node u into community S and
Input: A graph G(V, E). update Γ(P ) with the difference brought by this operation
Output: A array T C such that T C[u] is the number of (Line 14 to 18):
triangles containing u. 2lS 0 S 0 KS 0 2lS S KS
1: Initial T C[u] = 0 for all u, U = ∅ and V = ∅. ∆F = F (S 0 ) − F (S) = − 2
− −
N N N N2
2: for each u ∈ V do
2(lS 0 − lS ) S 0 KS 0 − S KS
3: if deg(u) ≤ β then = −
4: Add u to U . N N2
2lu,S (S + 1)(KS + du ) − S KS
5: else = −
6: Add u to V . N N2
7: end if
2lu,S (S + 1)du + KS
= − .
8: end for N N2
9: for each u ∈ U do In practice, for each node, the partition is modified after
10: for each pair (v, w) of neighbors of u do performing the steps described in Line 7 to 18, which is
11: if (v, w) exist an edge and v < w then to make the algorithm more robust to local maxima. Next,
12: if deg(v) < β and deg(w) < β then we consider the nodes that have already been assigned to a
13: if u < v then community. Line 21 to 39 in Algorithm 3 describes the parti-
14: Increase T C[u], T C[v] and T C[w] by 1. tion refinement step. It refines the partition obtained in local
15: end if search step (Line 5 to 20) using a hill climbing method. In
16: else if deg(v) < β then each iteration, we perform the movements of nodes between
17: if u < v then communities to improve the value of Γ(P ). For each node
18: Increase T C[u], T C[v] and T C[w] by 1. u belonging to a community S , we compute the difference
19: end if brought by removing u from S , ∆F = F (S)−F (S 0 ) (Line 23
20: else if deg(w) < β then
21:   if u < w then
22:     Increase TC[u], TC[v] and TC[w] by 1.
23:   end if
24: else
25:   Increase TC[u], TC[v] and TC[w] by 1.
26: end if
27: end if
28: end for
29: end for
30: Induce a subgraph G(V, E′) by V.
31: for each u ∈ U do
32:   for each pair (v, w) of neighbors of u in G do
33:     if there exists an edge (v, w) and v < w then
34:       if u < v then
35:         Increase TC[u], TC[v] and TC[w] by 1.
36:       end if
37:     end if
38:   end for
39: end for
40: return TC;

nodes that are not contained in any community. According to (9), for any partition P, we have:

Γ(P) = Σ_{u∈V} Σ_{j=1}^{M} δ(P(u), j) [ l_{u,S_j}/N − |S_j| d_u/N² ]
     ≤ Σ_{u∈V} max_{S∈AC(u)} [ l_{u,S}/N − |S| d_u/N² ],   (11)

where AC(u) is the set of communities whose nodes are adjacent to node u. Obviously, the maximum of (10) is easier to find than that of (9) if we do not modify the current partition, since the PS value of each node is then computed independently of the others. Thus, we find a community S for each unassigned node u by maximizing PS(u, S) [...] to Line 25). For a community S_j, we compute the difference brought by adding u to S_j, ΔF_j = F(S′_j) − F(S_j). We choose the community S_i such that ΔF_i > ΔF and ΔF_i − ΔF is maximized. Then, we add u to S_i and remove u from S to update the partition. The local search step and the partition refinement step are performed alternately until every node has been assigned to a community and the objective function Γ(P) converges.

4.3 Community Merging

After Local Optimization Iteration, there may be many small but significant communities. We need a merging operation to find communities of suitable size. The theoretical basis of Community Merging comes from Section 3.2.2. Φ(S) is the criterion used in Community Merging to judge whether two communities should be merged; note that here Φ(S) is only used to implement the merging operation and is not an objective function. Throughout the process, we maintain a max-heap, which contains a set of elements and supports delete and insert operations in O(log n) time. Community Merging is described in Algorithm 4. First, we start off with each community S_i ∈ P being a single node of a graph F, and establish an edge between S_i and S_j if at least one edge links them. F is a weighted graph, in which the weight between communities i and j is ΔW_ij = Φ(S_i ∪ S_j) − Φ(S_i) − Φ(S_j). Then, for each pair (i, j) with ΔW_ij > Th, we put a triad (i, j, ΔW_ij) into the max-heap Max_H. In each iteration, we take the triad (i, j, ΔW_ij) with the maximum ΔW_ij from the top of Max_H and merge community i and community j. The union operation can be implemented with the disjoint-set data structure. Next, we update the graph F and put each new triad (i, k, ΔW_ik) with ΔW_ik > Th into Max_H. We continue these steps until Max_H is empty. The threshold Th controls the size of the communities we find: if Th is too high, the detected communities may be too small; if Th is too low, the detected communities may be too large.
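The greedy merging loop of Community Merging can be sketched as follows. This is a schematic reconstruction, not the authors' code: `gain(i, j)` stands in for ΔW_ij = Φ(S_i ∪ S_j) − Φ(S_i) − Φ(S_j), `Th` is the merging threshold, a disjoint-set structure tracks which community each original community has been merged into, and stale heap entries are discarded lazily.

```python
import heapq

class DSU:
    """Disjoint-set (union-find) with path compression, as suggested in the text."""
    def __init__(self, n):
        self.parent = list(range(n))
    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x
    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra
        return ra

def merge_communities(neighbors, gain, Th):
    """Greedy merging over the community graph F.

    neighbors: dict mapping community id -> set of adjacent community ids.
    gain(i, j): stand-in for Delta W_ij = Phi(S_i u S_j) - Phi(S_i) - Phi(S_j).
    Pairs with gain above Th sit in a max-heap (min-heap over negated gains)
    and are merged greedily until the heap is empty.
    """
    n = max(neighbors) + 1
    dsu = DSU(n)
    heap = []
    for i in neighbors:
        for j in neighbors[i]:
            if i < j:
                g = gain(i, j)
                if g > Th:
                    heapq.heappush(heap, (-g, i, j))
    while heap:
        _, i, j = heapq.heappop(heap)
        ri, rj = dsu.find(i), dsu.find(j)
        if ri == rj:
            continue  # stale entry: these communities were already merged
        k = dsu.union(ri, rj)
        # the merged community inherits the neighbours of both parts
        neighbors[k] = (neighbors.get(ri, set()) | neighbors.get(rj, set())) - {ri, rj, k}
        # re-evaluate gains from the merged community to its neighbours
        for c in list(neighbors[k]):
            rc = dsu.find(c)
            if rc != k:
                g = gain(k, rc)
                if g > Th:
                    heapq.heappush(heap, (-g, k, rc))
    # final community label of each original community id
    return [dsu.find(i) for i in range(n)]
```

In a full implementation `gain` would recompute Φ on the current merged sets; here it is an opaque callback so the heap-plus-union-find skeleton stays visible.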
similarity between the final partition of the algorithm and the actual communities. Many evaluation measures, such as Normalized Mutual Information (NMI) [65], ARI [66] and Purity, have been proposed to evaluate the clustering quality of algorithms. NMI is the most widely accepted and important evaluation measure, since it is more discriminatory and more sensitive to errors in the community detection procedure [65]. We will use NMI as our performance evaluation measure in the experiments. The code for NMI calculation offered by McDaid et al. [67] can be found at https://github.com/aaronmcdaid/Overlapping-NMI.

According to the comparative analysis of community detection algorithms [68], Louvain and Infomap are two of the best classical algorithms in the literature. Thus, we take these two algorithms as competing algorithms. Besides, some novel algorithms proposed in recent years are also compared with CBCD. These algorithms are listed as follows:

[...] affect the interactions conversely. The dynamical system eventually evolves to a steady state. The Attractor algorithm requires a cohesion parameter λ ranging from 0 to 1, which determines how exclusive neighbors affect distances (positive or negative influence). Following [71], we set λ = 0.5. The program can be downloaded from https://github.com/YcheCourseProject/CommunityDetection.

For all experiments, unless stated otherwise, we set the threshold parameter of our algorithm to Th = −2.8 when 0 < |V| < 4000 and Th = −0.43 when 4000 ≤ |V|, corresponding to small and large-scale networks respectively. All experimental results were obtained on a workstation with a 3.5 GHz Intel(R) Xeon(R) E5-1620 v3 CPU and 16.0 GB of RAM. For Louvain, we adopted the lowest partition of the hierarchy, which is stored in the graph.tree file. For Infomap, the number of outer-most loops to run before picking the best solution is set to 10.
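For concreteness, NMI between two flat partitions can be computed as below. This is a self-contained illustrative sketch that normalizes mutual information by the mean of the two entropies; the experiments themselves use the McDaid et al. implementation linked above, whose normalization may differ.

```python
from collections import Counter
from math import log

def nmi(labels_a, labels_b):
    """Normalized mutual information between two partitions of the same node set.

    Illustrative version: I(A;B) / ((H(A) + H(B)) / 2).
    """
    n = len(labels_a)
    ca, cb = Counter(labels_a), Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    # mutual information between the two label assignments
    mi = sum(nij / n * log(n * nij / (ca[a] * cb[b]))
             for (a, b), nij in joint.items())
    # entropies of each partition
    ha = -sum(c / n * log(c / n) for c in ca.values())
    hb = -sum(c / n * log(c / n) for c in cb.values())
    if ha == 0.0 and hb == 0.0:
        return 1.0  # both partitions are trivial (a single community)
    return 2.0 * mi / (ha + hb)
```

Identical partitions score 1 (regardless of how communities are labeled), and independent partitions score 0.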
[Figure: four panels of NMI versus the mixing parameter u (0 to 1), comparing CBCD, Infomap, Louvain, Attractor, SCD and Focs. Panels: (a) average degree = 15, community size 15 to 30; (b) average degree = 15, community size 20 to 50; (c) average degree = 20, community size 15 to 30; (d) average degree = 20, community size 20 to 50.]
Fig. 7. The performance of different algorithms on the LFR benchmark. Each panel shows the NMI of the detected partition as a function of the mixing parameter u.
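LFR benchmark graphs of this kind can be generated with networkx's `LFR_benchmark_graph`. The parameter values below follow the networkx documentation example rather than the exact settings of Fig. 7; the figure's settings (average degree 15 or 20, community sizes 15 to 30 or 20 to 50) can be substituted via `average_degree`, `min_community` and `max_community`.

```python
import networkx as nx

# Generate an LFR (Lancichinetti-Fortunato-Radicchi) benchmark graph.
# tau1 / tau2 are the degree and community-size power-law exponents;
# mu is the mixing parameter u varied on the x-axis of Fig. 7.
n, tau1, tau2, mu = 250, 3, 1.5, 0.1
G = nx.LFR_benchmark_graph(n, tau1, tau2, mu,
                           average_degree=5, min_community=20, seed=10)

# Planted ground-truth communities are stored as a node attribute.
communities = {frozenset(G.nodes[v]["community"]) for v in G}
print(len(G), "nodes,", len(communities), "planted communities")
```

The planted partition recovered here is what NMI is computed against when sweeping mu.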
slightly better than CBCD when the average degree is 15 and u ranges between 0.55 and 0.65. Compared to Louvain, CBCD is always the winner. Let us consider the maximum community size in the partitions obtained by Louvain and CBCD. The relation between the mixing parameter u and the maximum community size of the detected partition is plotted in Fig. 8. The maximum community size of both CBCD and Louvain increases as the mixing parameter u increases. This indicates that the detection algorithms tend to combine two ground-truth communities when u is high. The maximum community size of CBCD is almost always lower than that of Louvain; in particular, when the mixing parameter u ranges between 0.5 and 0.6, the maximum community size of CBCD is significantly lower than that of Louvain. This result shows that CBCD can mitigate the resolution limit problem to some extent.

5.2 Real-World Network

In most real-world networks, nodes have no ground-truth labels, and modularity is typically used to evaluate the quality of detected communities. However, as we have discussed before, modularity is not a good quality measure of communities because of the resolution limit problem. Besides, modularity has been found to follow the same general pattern for different classes of networks [73]. Thus, we only conduct our experiments on several well-known real-world networks with ground-truth communities: Karate (karate) [74], Football (football) [5], Personal Facebook (personal) [75], Political blogs (polblogs) [76], and Books about US politics (polbooks) [77]. The detailed statistics of these networks are given in Table 3, where |V| denotes the number of nodes, |E| the number of edges, dmax the maximal degree, ⟨d⟩ the average degree, and |C| the number of ground-truth communities
[Figure: four panels of maximum community size versus the mixing parameter u (0 to 0.7), comparing Louvain, CBCD and the ground truth. Panels: (a) average degree = 15, community size 15 to 30; (b) average degree = 15, community size 20 to 50; (c) average degree = 20, community size 15 to 30; (d) average degree = 20, community size 20 to 50.]
Fig. 8. The maximum community size in the generated partition as the mixing parameter u ranges from 0 to 0.7. The red curve corresponds to the ground-truth partition.
in the network. The performance of the algorithms on the real-world networks is shown in Table 4, where NC is the number of communities detected by each algorithm.

American college football: The American college football network describes football games between Division IA colleges during the regular season in Fall 2000. It has 115 teams and 631 games between these teams; an edge connects two nodes (teams) if the corresponding teams played a game. The teams were partitioned into 12 conferences (communities). Louvain, Infomap, SCD and Attractor all perform well on the football data set, and SCD achieves the best performance among these algorithms. We have to admit that these algorithms outperform our method on the football data set. Fig. 9 plots the variation of the NMI of the partition detected by CBCD as the threshold Th used in the merging operation increases from −2.8 to 0. As Th increases from −2.8 to −2.2, the NMI increases to 0.773, the maximal NMI value of CBCD on the football data set; it begins to decrease when Th increases further. Note that the number of detected communities increases with Th; it is 10 when Th = −2.2. The quality of Algorithm 4 (Community Merging) is mainly determined by the output of Algorithm 3 (Local Optimization Iteration). Although CBCD can achieve an NMI value of 0.773, it still cannot beat the other algorithms except Focs. This demonstrates that there is still room for improving Algorithm 3.

Zachary's karate club network: Karate is a famous network derived from Zachary's observation of a karate club; it describes the friendship among the members of the club. The network split into two parts because of the divergence between the administrator and the instructor. According to Table 4, CBCD outperforms all other algorithms and achieves an NMI value of 0.840. Two communities are successfully found by CBCD, but our algorithm classifies one node, '10', into the wrong community. We observe that this node has only two edges, each connecting to one of the two communities. It is
TABLE 3
The characteristics of real-world network data sets.

TABLE 4
The performance of different algorithms on the real-world networks.

                          NMI                                              NC
           football  karate  personal  polbooks  polblogs    football  karate  personal  polbooks  polblogs
CBCD       0.734     0.837   0.3639    0.330     0.391       9         2       11        4         6
Infomap    0.833     0.563   0.248     0.293     0.268       12        3       6         5         303
Louvain    0.838     0.259   0.08      0.158     0.212       12        7       17        10        32
SCD        0.840     0.395   0.179     0.116     0.085       14        8       125       23        664
Focs       0.392     0.189   0.171     0.166     0.106       5         1       13        9         19
Attractor  0.833     0.04    0.299     0.315     0.124       12        1       56        7         313
difficult to decide which community it really belongs to; thus, we consider this node a noisy node. In fact, it can make sense to assign node '10' to both communities in the context of overlapping community detection; however, this study mainly focuses on non-overlapping community detection. If we delete node '10' from karate, our CBCD algorithm achieves the perfect performance of NMI = 1: its output exactly matches the ground-truth partition. Infomap achieves the second-best performance. Attractor performs worst because it puts all the members into one community.

Personal Facebook network: Personal is the network that gives the friendship structure of the first author, where each individual (node) is labeled according to the time period when he or she met the first author. Persons are divided into different groups according to their locations.

5.3 Large-Scale Real-World Network

The large-scale real networks provided by [78] all have overlapping ground-truth communities. Although this is out of the scope of our paper, we still run CBCD along with the other algorithms on these networks to test the performance of our algorithm on large-scale real networks. To evaluate community detection algorithms on networks with overlapping community structures, Overlapping Normalized Mutual Information (ONMI) [79] is the major evaluation metric in this section. Since we compare on networks with overlapping structures, we set the OVL parameter of Focs to 0.6 for detecting overlapping communities. We choose two large-scale networks, Amazon and DBLP, to test the performance of the different methods. The specific information of these two networks is given as follows.
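The evaluation protocol used on karate can be illustrated with networkx's built-in data set and Louvain implementation (this is not CBCD itself, whose code is not reproduced here; `louvain_communities` assumes networkx 2.8 or later):

```python
import networkx as nx

G = nx.karate_club_graph()  # 34 members; ground truth in the 'club' node attribute

# Two-community ground truth from Zachary's study (administrator vs. instructor).
truth = [0 if G.nodes[v]["club"] == "Mr. Hi" else 1 for v in G]

# Louvain partition, with a fixed seed for reproducibility.
parts = nx.community.louvain_communities(G, seed=1)
pred = [next(i for i, c in enumerate(parts) if v in c) for v in G]

print("detected", len(parts), "communities")
```

Comparing `pred` against `truth` with an NMI implementation then yields the kind of scores reported in Table 4.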
[Figure: four histograms of the fraction of communities per bin of PS value. Panels: (a) community size = 100, λ = 10, n = 1000; (b) community size = 100, λ = 5, n = 1000; (c) community size = 100, λ = 10, n = 10000; (d) community size = 100, λ = 5, n = 10000.]
Fig. 10. The distribution of the PS values of a community for E-R random networks with different parameters.
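The simulation behind such a figure can be sketched as follows. Since formula (6) is not reproduced in this excerpt, the sketch assumes the node-level score appearing in inequality (11), PS(u, S) = l_{u,S}/N − |S| d_u/N² with N the number of nodes, summed over the members of S; both the community-level aggregation and the normalization are assumptions here. Under the E-R model the two terms cancel in expectation, consistent with the concentration around 0 seen in the figure.

```python
import networkx as nx

def ps_value(G, S):
    """Assumed community-level PS score: sum over u in S of
    l_{u,S}/N - |S| * d_u / N^2, with N = |V|.
    (An assumption; formula (6) of the paper is not shown in this excerpt.)"""
    N = G.number_of_nodes()
    s = set(S)
    total = 0.0
    for u in s:
        l_uS = sum(1 for v in G[u] if v in s)  # links from u into S
        total += l_uS / N - len(s) * G.degree(u) / N ** 2
    return total

# Monte-Carlo over E-R (Gilbert) random graphs: a fixed 100-node community,
# average degree lam, network size n (the paper uses 300 trials; fewer here).
n, lam, trials = 1000, 10, 60
S = range(100)
values = [ps_value(nx.fast_gnp_random_graph(n, lam / (n - 1), seed=t), S)
          for t in range(trials)]
mean = sum(values) / trials
```

Histogramming `values` reproduces the qualitative shape of the panels: a narrow, Gaussian-like spread centered near 0.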
the number of detected communities and ET is the execution time of the various algorithms. For the DBLP network, Focs achieves the best performance and SCD the second-best. Although the performance of CBCD on DBLP is not as good as these two algorithms, CBCD is still better than the others. For the Amazon network, CBCD outperforms the other algorithms; Focs is the second-best performer and Attractor is slightly inferior to Focs. Infomap has the worst performance on both DBLP and Amazon, in contrast with its performance on the small real networks and the LFR networks. Although CBCD is effective at detecting meaningful communities, it has no obvious advantage with respect to execution time. This is what we should focus on in our future work.

5.4 The Distribution of PS Values

In this section, we study the distribution of the PS value, defined in Formula (6), of a community under the E-R model. First, we generate 300 random networks for each specific setting of the parameters of the E-R model, namely the average degree λ and the network size n. Then, we calculate the PS value of a given community S composed of 100 fixed nodes. Consequently, we obtain 300 PS values of the given community derived from 300 different random networks. We divide the PS values into many bins of equal length, where the low (high) bins correspond to the lower (higher) PS values. In Fig. 10, bins are plotted on the x-axis and, for each bin, the fraction of communities whose PS values fall into that bin is plotted on the y-axis. We can observe that the PS value follows a Gaussian-like distribution. The PS values are concentrated near 0, and most values fall into a very narrow interval. This phenomenon corresponds to our theoretical result in Section 3.2.1. Comparing Fig. 10(a) with Fig. 10(b), and Fig. 10(c) with Fig. 10(d), we find that the interval into which most PS values fall becomes shorter as the average degree λ decreases. Besides, this interval sharply shortens when the network size n increases. It is a remarkable fact that the PS value of a community in random networks generated from the E-R model is very small. This demonstrates that the PS value of a community is an effective metric for quantifying the goodness of a community structure.

6 CONCLUSION

In this paper, we introduce two novel node-centric community evaluation functions by connecting correlation analysis with community detection. We further show that correlation analysis provides a novel theoretical framework which unifies some existing quality functions and converts community detection into a correlation-based optimization problem. In this framework, we choose the PS-metric and the φ-coefficient to eliminate the influence of random fluctuations and mitigate the resolution limit problem. Furthermore, we introduce three key properties used in association rule mining into the context of community detection to help choose the appropriate correlation function. A correlation-based community detection algorithm, CBCD, that makes use of the PS-metric and the φ-coefficient is proposed in this paper. Our proposed algorithm outperforms five existing state-of-the-art algorithms on both LFR benchmark networks and real-world networks. In the future, we will investigate more correlation functions and extend our method to overlapping community detection.

ACKNOWLEDGMENTS

This work was partially supported by the Natural Science Foundation of China under Grant No. 61572094.

REFERENCES

[1] V. Spirin and L. A. Mirny, "Protein complexes and functional modules in molecular networks," Proceedings of the National Academy of Sciences of the United States of America, vol. 100, no. 21, pp. 12123-12128, 2003.
[2] S. Papadopoulos, Y. Kompatsiaris, A. Vakali, and P. Spyridonos, "Community detection in social media," Data Mining and Knowledge Discovery, vol. 24, no. 3, pp. 515-554, 2012.
[3] S. L. Feld, "The focused organization of social ties," American Journal of Sociology, vol. 86, no. 5, pp. 1015-1035, 1981.
[4] N. J. Krogan, G. Cagney, H. Yu, G. Zhong, X. Guo, A. Ignatchenko, J. Li, S. Pu, N. Datta, A. P. Tikuisis et al., "Global landscape of protein complexes in the yeast saccharomyces cerevisiae," Nature, vol. 440, no. 7084, p. 637, 2006.
[5] M. Girvan and M. E. Newman, "Community structure in social and biological networks," Proceedings of the National Academy of Sciences of the United States of America, vol. 99, no. 12, pp. 7821-7826, 2002.
[6] S. Fortunato, "Community detection in graphs," Physics Reports, vol. 486, no. 3, pp. 75-174, 2009.
[7] M. E. Newman and M. Girvan, "Finding and evaluating community structure in networks," Physical Review E, vol. 69, no. 2, p. 026113, 2004.
[8] F. Chung, "Spectral graph theory," in Regional Conference Series in Mathematics, 1997, p. 212.
[9] Y.-C. Wei and C.-K. Cheng, "Towards efficient hierarchical designs by ratio cut partitioning," in 1989 IEEE International Conference on Computer-Aided Design (ICCAD-89), Digest of Technical Papers. IEEE, 1989, pp. 298-301.
[10] S. Mancoridis, B. S. Mitchell, C. Rorres, Y. Chen, and E. R. Gansner, "Using automatic clustering to produce high-level system organizations of source code," in Proceedings of the 6th International Workshop on Program Comprehension (IWPC'98). IEEE, 1998, pp. 45-52.
[11] J. Shi and J. Malik, "Normalized cuts and image segmentation," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 8, pp. 888-905, 2000.
[12] M. E. Newman, "Fast algorithm for detecting community structure in networks," Physical Review E, vol. 69, no. 6, p. 066133, 2004.
[13] P. Schuetz and A. Caflisch, "Efficient modularity optimization by multistep greedy algorithm and vertex mover refinement," Physical Review E, vol. 77, no. 4, p. 046112, 2008.
[14] J. M. Pujol, J. Béjar, and J. Delgado, "Clustering algorithm for determining community structure in large networks," Physical Review E, vol. 74, no. 1, p. 016107, 2006.
[15] A. Noack and R. Rotta, "Multi-level algorithms for modularity clustering," in International Symposium on Experimental Algorithms. Springer, 2009, pp. 257-268.
[16] D. A. Spielman and S.-H. Teng, "Nearly-linear time algorithms for graph partitioning, graph sparsification, and solving linear systems," in Proceedings of the Thirty-Sixth Annual ACM Symposium on Theory of Computing. ACM, 2004, pp. 81-90.
[17] R. Andersen and K. J. Lang, "Communities from seed sets," in Proceedings of the 15th International Conference on World Wide Web. ACM, 2006, pp. 223-232.
[18] R. Andersen, F. Chung, and K. Lang, "Local graph partitioning using pagerank vectors," in 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS'06). IEEE, 2006, pp. 475-486.
[19] K. Kloster and D. F. Gleich, "Heat kernel based community detection," in Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2014, pp. 1386-1395.
[20] Z. A. Zhu, S. Lattanzi, and V. S. Mirrokni, "A local algorithm for finding well-connected clusters," in ICML (3), 2013, pp. 396-404.
[21] T. Van Laarhoven and E. Marchiori, "Local network community detection with continuous optimization of conductance and weighted kernel k-means," Journal of Machine Learning Research, vol. 17, no. 147, pp. 1-28, 2016.
[22] A. R. Benson, D. F. Gleich, and J. Leskovec, "Higher-order organization of complex networks," Science, vol. 353, no. 6295, pp. 163-166, 2016.
[23] S. Fortunato and D. Hric, "Community detection in networks: A user guide," Physics Reports, vol. 659, pp. 1-44, 2016.
[24] P. Erdos, "On random graphs," Publicationes Mathematicae, vol. 6, pp. 290-297, 1959.
[25] M. E. Newman, "Mixing patterns in networks," Physical Review E, vol. 67, no. 2, p. 026126, 2003.
[26] B. Bollobás, "A probabilistic proof of an asymptotic formula for the number of labelled regular graphs," European Journal of Combinatorics, vol. 1, no. 4, pp. 311-316, 1980.
[27] M. Molloy and B. Reed, "A critical point for random graphs with a given degree sequence," Random Structures & Algorithms, vol. 6, no. 2-3, pp. 161-180, 1995.
[28] S. Fortunato and M. Barthélemy, "Resolution limit in community detection," Proceedings of the National Academy of Sciences of the United States of America, vol. 104, no. 1, pp. 36-41, 2007.
[29] A. Lancichinetti and S. Fortunato, "Limits of modularity maximization in community detection," Physical Review E, vol. 84, no. 2, p. 066122, 2011.
[30] R. Guimera, M. Sales-Pardo, and L. A. N. Amaral, "Modularity from fluctuations in random graphs and complex networks," Physical Review E, vol. 70, no. 2, p. 025101, 2004.
[31] G. Piatetsky-Shapiro and W. J. Frawley, "Discovery, analysis, and presentation of strong rules," in Knowledge Discovery in Databases, 1991, pp. 229-248.
[32] P. Pons and M. Latapy, "Computing communities in large networks using random walks," in International Symposium on Computer and Information Sciences. Springer, 2005, pp. 284-293.
[33] Y. Pan, D.-H. Li, J.-G. Liu, and J.-Z. Liang, "Detecting community structure in complex networks via node similarity," Physica A: Statistical Mechanics and its Applications, vol. 389, no. 14, pp. 2849-2857, 2010.
[34] T. Zhou, L. Lü, and Y.-C. Zhang, "Predicting missing links via local information," The European Physical Journal B, vol. 71, no. 4, pp. 623-630, 2009.
[35] H.-W. Shen, X.-Q. Cheng, and B.-X. Fang, "Covariance, correlation matrix, and the multiscale community structure of networks," Physical Review E, vol. 82, no. 1, p. 016114, 2010.
[36] M. Zarei, D. Izadi, and K. A. Samani, "Detecting overlapping community structure of networks based on vertex-vertex correlations," Journal of Statistical Mechanics: Theory and Experiment, vol. 2009, no. 11, p. P11013, 2009.
[37] M. MacMahon and D. Garlaschelli, "Community detection for correlation matrices," Physical Review X, vol. 5, p. 021006, Apr 2015. [Online]. Available: https://link.aps.org/doi/10.1103/PhysRevX.5.021006
[38] N. Bansal, A. Blum, and S. Chawla, "Correlation clustering," Machine Learning, vol. 56, no. 1-3, pp. 89-113, 2004.
[39] N. Veldt, D. F. Gleich, and A. Wirth, "A correlation clustering framework for community detection," in Proceedings of the 2018 World Wide Web Conference. International World Wide Web Conferences Steering Committee, 2018, pp. 439-448.
[40] L. Duan, W. N. Street, Y. Liu, and H. Lu, "Community detection in graphs through correlation," in Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2014, pp. 1376-1385.
[41] L. Duan, Y. Liu, W. N. Street, and H. Lu, "Utilizing advances in correlation analysis for community structure detection," Expert Systems with Applications, vol. 84, pp. 74-91, 2017.
[42] T. Chakraborty, S. Srinivasan, N. Ganguly, A. Mukherjee, and S. Bhowmick, "On the permanence of vertices in network communities," in Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2014, pp. 1396-1405.
[43] T. Chakraborty, S. Kumar, N. Ganguly, A. Mukherjee, and S. Bhowmick, "Genperm: a unified method for detecting non-overlapping and overlapping communities," IEEE Transactions on Knowledge and Data Engineering, vol. 28, no. 8, pp. 2101-2114, 2016.
[44] T. Chakraborty, S. Srinivasan, N. Ganguly, A. Mukherjee, and S. Bhowmick, "Permanence and community structure in complex networks," ACM Transactions on Knowledge Discovery from Data (TKDD), vol. 11, no. 2, p. 14, 2016.
[45] A. Prat-Pérez, D. Dominguez-Sal, and J.-L. Larriba-Pey, "High quality, scalable and parallel community detection for large real graphs," in Proceedings of the 23rd International Conference on World Wide Web. ACM, 2014, pp. 225-236.
[46] A. Prat-Pérez, D. Dominguez-Sal, J.-M. Brunat, and J.-L. Larriba-Pey, "Put three and three together: Triangle-driven community detection," ACM Transactions on Knowledge Discovery from Data (TKDD), vol. 10, no. 3, p. 22, 2016.
[47] S. Bandyopadhyay, G. Chowdhary, and D. Sengupta, "Focs: fast overlapped community search," IEEE Transactions on Knowledge and Data Engineering, vol. 27, no. 11, pp. 2974-2985, 2015.
[48] P.-N. Tan, V. Kumar, and J. Srivastava, "Selecting the right objective measure for association analysis," Information Systems, vol. 29, no. 4, pp. 293-313, 2004.
[49] L. Duan, W. N. Street, Y. Liu, S. Xu, and B. Wu, "Selecting the right correlation measure for binary data," ACM Transactions on Knowledge Discovery from Data (TKDD), vol. 9, no. 2, p. 13, 2014.
[50] L. Geng and H. J. Hamilton, "Interestingness measures for data mining: A survey," ACM Computing Surveys (CSUR), vol. 38, no. 3, p. 9, 2006.
[51] X. F. Wang and G. Chen, "Complex networks: small-world, scale-free and beyond," IEEE Circuits and Systems Magazine, vol. 3, no. 1, pp. 6-20, 2003.
[52] R. Albert and A.-L. Barabási, "Statistical mechanics of complex networks," Reviews of Modern Physics, vol. 74, no. 1, p. 47, 2002.
[53] C. I. Del Genio, T. Gross, and K. E. Bassler, "All scale-free networks are sparse," Physical Review Letters, vol. 107, no. 17, p. 178701, 2011.
[54] A. Agresti, Categorical Data Analysis, Second Edition, 2003.
[55] M. S. Granovetter, "The strength of weak ties," in Social Networks. Elsevier, 1977, pp. 347-367.
[56] A. Rapoport, "Spread of information through a population with socio-structural bias: I. assumption of transitivity," The Bulletin of Mathematical Biophysics, vol. 15, no. 4, pp. 523-533, 1953.
[57] G. Kossinets and D. J. Watts, "Empirical analysis of an evolving social network," Science, vol. 311, no. 5757, pp. 88-90, 2006.
[58] M. Al Hasan and V. S. Dave, "Triangle counting in large networks: a review," Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, vol. 8, no. 2, p. e1226, 2018.
[59] A. Itai and M. Rodeh, "Finding a minimum circuit in a graph," SIAM Journal on Computing, vol. 7, no. 4, pp. 413-423, 1978.
[60] T. Schank and D. Wagner, "Finding, counting and listing all triangles in large graphs, an experimental study," in International Workshop on Experimental and Efficient Algorithms. Springer, 2005, pp. 606-609.
[61] N. Alon, R. Yuster, and U. Zwick, "Finding and counting given length cycles," Algorithmica, vol. 17, no. 3, pp. 209-223, 1997.
[62] M. Latapy, "Main-memory triangle computations for very large (sparse (power-law)) graphs," Theoretical Computer Science, vol. 407, no. 1-3, pp. 458-473, 2008.
[63] R. E. Tarjan and J. Van Leeuwen, "Worst-case analysis of set union algorithms," Journal of the ACM (JACM), vol. 31, no. 2, pp. 245-281, 1984.
[64] R. E. Tarjan, "A class of algorithms which require nonlinear time to maintain disjoint sets," Journal of Computer and System Sciences, vol. 18, no. 2, pp. 110-127, 1979.
[65] L. Danon, A. Diaz-Guilera, J. Duch, and A. Arenas, "Comparing community structure identification," Journal of Statistical Mechanics: Theory and Experiment, vol. 2005, no. 09, p. P09008, 2005.
[66] L. Hubert and P. Arabie, "Comparing partitions," Journal of Classification, vol. 2, no. 1, pp. 193-218, 1985.
[67] A. F. McDaid, D. Greene, and N. Hurley, "Normalized mutual information to evaluate overlapping community finding algorithms," arXiv preprint arXiv:1110.2515, 2011.
[68] A. Lancichinetti and S. Fortunato, "Community detection algorithms: a comparative analysis," Physical Review E, vol. 80, no. 5, p. 056117, 2009.
[69] V. D. Blondel, J.-L. Guillaume, R. Lambiotte, and E. Lefebvre, "Fast unfolding of communities in large networks," Journal of Statistical Mechanics: Theory and Experiment, vol. 2008, no. 10, p. P10008, 2008.
[70] M. Rosvall and C. T. Bergstrom, "Maps of random walks on complex networks reveal community structure," Proceedings of the National Academy of Sciences, vol. 105, no. 4, pp. 1118-1123, 2008.