The second category consists of structures / structure variables defined on sub-graphs, most importantly on the egocentric network of each vertex. Examples of this kind are egocentric network density, clustering coefficient and maximal cliques. Typically, two phases are involved in their implementation. The first phase is to find or construct the sub-graphs. This phase usually involves exploring the neighborhood surrounding each vertex by propagating messages along edges / arcs among adjacent vertexes. The result could be kept and shared among structures / structure variables of this kind. The second phase is to calculate structures / structure variables within each sub-graph. As a result, each sub-graph could be processed in parallel. This paradigm makes the implementation ready for scaling out on networks with evenly distributed sub-graph sizes.

The third category consists of structures / structure variables defined on the whole graph. Most of the interesting SNA structures / structure variables belong to this category, for example, k-core, weakly / strongly / bi-connected components, PageRank [19], hyperlink induced topic search [20], minimal spanning tree, breadth first search, and so on. To calculate them, the whole graph needs to be explored in some way. The implementation usually involves multiple rounds of iterations. Although operations done in each round are highly algorithm dependent, most of them could be regarded as generating labels on the fly and propagating them through edges / arcs among vertexes. The output of one round is the input of the next round, and some algorithm dependent termination condition needs to be checked at the end or beginning of each round. Since the iterations may take dozens of rounds, network traffic and job scheduling latency would be more significant than in the second category. We are currently investigating techniques besides compression to mitigate such effects.

Structures / structure variables of the last category are those defined on vertex and edge sets. Compared with the third category, they depend not only on links existing in the graph, but also on links that do not exist in the graph. For example, force-directed network layout algorithms and community discovery algorithms usually need to calculate the repulsive force between each pair of vertexes. Generally speaking, the MapReduce programming model is not suitable for this category, since the “any to any” links usually mean a large volume of intermediate results, and the graph structure could not be leveraged to mitigate such explosion. Nevertheless, approximate algorithms exist for some structures / structure variables of this kind, which usually have specific techniques to reduce the calculation space. The grid variant of the Fruchterman-Reingold network layout algorithm [21] is an example: it leverages the fact that the repulsive force diminishes with distance, so that distant pairs can be ignored, to reduce the number of pairs considered.

3.2 X-RIME Architecture

Figure 1 illustrates the layers that constitute X-RIME, which are enclosed in the dashed box. Like many other Hadoop-based data analytics solutions, the bottom layer of X-RIME is Hadoop HDFS. The raw input data, the internal representation of networks after transformation and cleansing, the intermediate results generated in each step of the algorithms, and the final output of X-RIME are all stored in HDFS as files or directories.

Figure 1 Architecture of X-RIME

Above the HDFS layer are X-RIME’s data models. Since social networks are graphs consisting of nodes (actors) and edges / arcs (relationships), two kinds of models are usually used, namely adjacency list and adjacency matrix. Generally speaking, an adjacency list is more suitable for sparse graphs, where the number of edges is much smaller than the square of the number of nodes. On the other hand, an adjacency matrix is more suitable for dense graphs. Since most social networks today are sparse graphs, an adjacency list could result in a much more compact representation. Moreover, many real world social networks are inherently or most naturally encoded as adjacent node pairs. For example, short message logs in the telecom domain encode a pair of sender and receiver in each record. So, the distributed adjacency list is chosen as the basic data model in X-RIME.

Beyond distributed adjacency lists, we designed a few more data models to accommodate different algorithms and application requirements. For example, quite a few algorithms do not care about the order of arcs attached to each vertex, so we replace lists with sets of incoming and outgoing arcs. To accommodate weighted graphs and algorithms which need to explore networks, we attached labels to vertexes and arcs. Actually, labels are important constructs in X-RIME and are used by many SNA algorithms.
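To make the adjacency-set model and the role of labels more concrete, the following single-process Python sketch builds vertex records from (sender, receiver) pairs and then computes weakly connected components by propagating a label in rounds, in the spirit of the third category. It is an illustration only and not X-RIME's Hadoop implementation: the names Vertex, build_vertices, wcc_round and the label key "component" are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Vertex:
    """One record of a hypothetical adjacency-set data model."""
    vid: str
    out_arcs: set = field(default_factory=set)   # ids this vertex points to
    in_arcs: set = field(default_factory=set)    # ids pointing to this vertex
    labels: dict = field(default_factory=dict)   # free-form labels (weights, marks)


def build_vertices(pairs):
    """Turn (sender, receiver) records into adjacency-set vertex records."""
    vertices = {}
    for src, dst in pairs:
        vertices.setdefault(src, Vertex(src)).out_arcs.add(dst)
        vertices.setdefault(dst, Vertex(dst)).in_arcs.add(src)
    return vertices


def wcc_round(vertices):
    """One propagation round; returns True if any label changed."""
    proposals = defaultdict(list)
    # "Map": each vertex sends its current label along all arcs, ignoring direction.
    for v in vertices.values():
        for neighbour in v.out_arcs | v.in_arcs:
            proposals[neighbour].append(v.labels["component"])
    # "Reduce": each vertex keeps the smallest label it has seen so far.
    changed = False
    for vid, received in proposals.items():
        best = min(received)
        if best < vertices[vid].labels["component"]:
            vertices[vid].labels["component"] = best
            changed = True
    return changed


def weakly_connected_components(vertices):
    """Repeat rounds until no label changes (the termination condition)."""
    for v in vertices.values():
        v.labels["component"] = v.vid
    while wcc_round(vertices):
        pass
    return {v.vid: v.labels["component"] for v in vertices.values()}
```

In this sketch each call to wcc_round plays the role of one MapReduce job, and the check in the while loop corresponds to the termination condition evaluated at the end of each round.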
Routines are provided to transform data among these data models.

As a preprocessing step, raw input data is cleaned and transformed into X-RIME data models before the real analysis begins. This step is specific to the usage scenario and should be implemented by X-RIME users.

Above the data model layer are the Hadoop MapReduce layer and a layer of SNA algorithm implementations. Each step of an SNA algorithm in X-RIME is programmed as a map()/reduce() pair and executed as a MapReduce job. Algorithm specific control flow and data flow between steps are programmed as Hadoop MapReduce clients, which create and submit jobs to the MapReduce runtime. Although the actual flows involved could be sequential, parallel or hybrid, as shown in Figure 1, all such Hadoop MapReduce clients are encapsulated in the same interface, which takes the following arguments:
• An input HDFS file / directory which stores the input network after cleansing and transformation,
• An output HDFS file / directory used to store the final output of the algorithm,
• MapReduce specific parameters, such as mapper / reducer number, etc.
• Algorithm specific parameters.
When the MapReduce client for an SNA algorithm is invoked, the caller is blocked until all steps are finished or some step fails. Currently, X-RIME provides SNA algorithms as a library. We are going to expose Web Service interfaces in the near future.

At the top of Figure 1 are social network aware business intelligence applications. They invoke the X-RIME library when they need to calculate any SNA structure or structure variable. For real world business intelligence applications that need the functionality of data warehouse and data mining as well, X-RIME could be integrated with Hadoop-based data warehouse and data mining solutions. At lower layers, they could share the same HDFS and MapReduce infrastructure. At the library or Web Service layer, they could invoke each other on the fly. In this way, more comprehensive and cost effective BI solutions could be built.

4. Case Studies

X-RIME is the output of an ongoing research project, and has been published as an open source project at xrime.sourceforge.net. In this section, we use an online community as the example to present several usage scenarios of X-RIME. Preliminary performance results will be introduced in Section 5.

The online community is in the form of a bulletin board system (BBS) consisting of a number of boards. Each board has its focused themes and could be regarded as a sub-community. The sizes of sub-communities vary from board to board. Registered users could read, post or reply to articles in any board, which means those sub-communities might overlap with each other. Both the online community and several sub-communities are analyzed in the following part of this section.

In this paper, we regard article authors (identified by registered user ids) as SNA actors, and regard the reply relationship among authors as the SNA relationship. Specifically, if an author A creates a new discussion topic by posting an article, and author B replies to articles in this topic, an SNA relationship is created between A and B, and represented as an arc from node B to node A in the social network. For two authors who reply to the same topic, there is no SNA relationship created between them. Since a user may publish any number of articles in any number of boards, the relationship between any two authors might be created in the contexts of multiple topics and / or multiple boards. For simplicity, MapReduce programs are developed to remove redundancy in the raw data set and generate networks that have no loops and no more than one edge between any two different vertices.

As the disk space of the BBS is limited, administrators of the community and sub-communities choose to delete old articles periodically. As a result, we could not get a complete historical view of those communities. Instead, we use web crawlers to create a snapshot of the whole BBS, which contains articles from the past 3 months or so. There are about 200K nodes and about 1.6M arcs in the social network corresponding to this snapshot.

4.1 Degrees

Although statistics on in and out degrees are the simplest functionality provided by X-RIME, they are still helpful for understanding basic properties of a community. Figure 2 illustrates the distribution of in degree, out degree and their sum in 3 selected boards named Circuit, MilitaryView and Career_POST. Among them, the Circuit board is for circuit technology related topics; the MilitaryView board is for military related topics; and the Career_POST board is for announcements of job opportunities.

As shown in the figure, in all three boards the major part of the population consists of inactive actors, who seldom post or reply to articles and have small degrees. This phenomenon is quite common in all kinds of real world communities, where only a few active actors lead and drive social activities. We can also see many isolated actors (with 0 “in + out” degree), who posted articles and did not get any attention from others. Such actors join and leave the community occasionally, and constitute the most unstable part of the community.
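The degree statistics discussed in Section 4.1 can be sketched in a map/reduce style. The fragment below is a minimal single-process Python illustration and not X-RIME code; map_arc, reduce_degrees and degree_distribution are hypothetical names. Following the conventions stated for the case study, an arc (B, A) means that B replied to a topic started by A, and loops and duplicate arcs are dropped before counting. Keeping one arc per ordered pair is an assumption of this sketch.

```python
from collections import Counter


def map_arc(arc):
    """Map step: one reply arc (replier, topic_author) yields two degree records."""
    replier, author = arc
    yield (replier, "out"), 1
    yield (author, "in"), 1


def reduce_degrees(arcs):
    """Reduce step: sum the records for every (actor, direction) key."""
    counts = Counter()
    for arc in set(arcs):          # assumption: at most one arc per ordered pair
        if arc[0] == arc[1]:       # drop loops (an author replying to themselves)
            continue
        for key, one in map_arc(arc):
            counts[key] += one
    return counts


def degree_distribution(arcs, actors):
    """Number of actors for each total ("in + out") degree."""
    counts = reduce_degrees(arcs)
    totals = (counts[(a, "in")] + counts[(a, "out")] for a in actors)
    return dict(Counter(totals))
```

Actors that posted articles but never received or wrote a reply do not appear in any arc, which is why the list of actors is passed in separately: they show up in the distribution with a total degree of 0.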
Figure 2 Degree distribution in 3 boards (panel (c): Career_POST board)

By comparing the three boards, we can see that the Career_POST community is the loosest one. The dominant and largest degrees of actors in it are much smaller than in the other two communities. This is natural since people come to this board only for information, and no discussion is involved. On the contrary, the Circuit and MilitaryView boards are more like real social groups, where people have topics of common interest for interactions to take place. Of these two boards, the MilitaryView community is the more tightly coupled one, with fewer isolated actors and more interactions among actors.

K-cores and maximal cliques are useful SNA structures when studying the topology and construction of a social network. In particular, k-cores could be used to simplify networks for analysis or visualization purposes. The k value indicates the nesting level of the core within the network. Maximal cliques could be used to find core members and structures in a community, among many other potential usages. X-RIME supports both of them.
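As an illustration of the k-core structure, the following Python sketch computes the core number of every vertex by repeatedly peeling off the vertex with the smallest remaining degree. It is a simple single-machine version for small undirected graphs and makes no claim about how X-RIME implements k-cores on MapReduce; the function names and the dictionary-of-sets input format are assumptions.

```python
def core_numbers(adj):
    """Return {vertex: core number} for an undirected graph.

    adj maps each vertex to the set of its neighbours and is assumed to be
    symmetric (u is in adj[v] if and only if v is in adj[u]).
    """
    degree = {v: len(neighbours) for v, neighbours in adj.items()}
    remaining = set(adj)
    core = {}
    k = 0
    while remaining:
        # Peel off the vertex with the smallest remaining degree.
        v = min(remaining, key=degree.__getitem__)
        k = max(k, degree[v])
        core[v] = k
        remaining.remove(v)
        for n in adj[v]:
            if n in remaining:
                degree[n] -= 1
    return core


def k_core(adj, k):
    """Vertices that belong to the k-core (core number at least k)."""
    return {v for v, c in core_numbers(adj).items() if c >= k}
```

For a triangle a, b, c with one extra vertex d attached only to a, the triangle forms the 2-core while d belongs only to the 1-core, which matches the reading of k as a nesting level.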
few largest k values. Specifically, the k-core with the
largest k value covers almost all maximal cliques with
more than 2 vertexes. This phenomenon is common for
most boards in this community, and could be used to
reduce the search space for maximal strong cliques.
Exceptions to this phenomenon are boards like
Career_POST, where only a few small maximal strong
cliques exist.
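The idea of reducing the search space can be sketched as follows: enumerate maximal cliques only inside a chosen vertex subset, for example the vertices of the k-core with the largest k. The Python fragment below uses the classic Bron-Kerbosch recursion without pivoting on an undirected graph. This is an assumption made for illustration: the strong cliques discussed here may impose additional requirements on arc direction, and maximal_cliques and its within parameter are hypothetical names rather than part of X-RIME.

```python
def maximal_cliques(adj, within=None):
    """Enumerate maximal cliques of an undirected graph.

    adj maps each vertex to the set of its neighbours (assumed symmetric).
    If within is given, the search is restricted to the sub-graph induced
    by those vertices, which shrinks the search space.
    """
    allowed = set(adj) if within is None else set(within)
    sub = {v: adj[v] & allowed for v in allowed}
    found = []

    def expand(clique, candidates, excluded):
        # A clique is maximal when nothing is left to add or to exclude.
        if not candidates and not excluded:
            found.append(frozenset(clique))
            return
        for v in list(candidates):
            expand(clique | {v}, candidates & sub[v], excluded & sub[v])
            candidates = candidates - {v}
            excluded = excluded | {v}

    expand(set(), set(sub), set())
    return found
```

On a triangle a, b, c with a pendant vertex d attached to a, the unrestricted search returns {a, b, c} and {a, d}, while restricting it to the 2-core {a, b, c} returns only the triangle, so the small two-vertex clique is never explored.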
5. Experimental Results

We evaluated the performance of X-RIME on the social network of the online BBS community mentioned in the last section. 7 compute nodes connected via 1Gb Ethernet are used to construct Hadoop clusters of different sizes. Each node contains 4GB main memory with two 1.66GHz Intel Xeon processors, but for our tests, we only enable 1 processor / core on each node for simplicity. Among those nodes, a dedicated node is used as the master node (running NameNode and JobTracker), and the other nodes are used as slaves (running DataNode and TaskTracker). Default configuration is used for those Hadoop clusters.

Figure 8 X-RIME Scales Out

Figure 9 CPU and Network Usage of X-RIME

Figure 8 illustrates performance results of the maximal strong clique and weakly connected component algorithms implemented in X-RIME, which are representatives of the second and third categories of our classification, respectively. The horizontal axis indicates the number of slave nodes in the Hadoop cluster, while the vertical axis indicates the wall-clock time used by the processing. We note that the scalability of both algorithms is quite good here, while the weakly connected component algorithm scales slightly better than the maximal strong clique algorithm. This is because the social network
is sparse and the connectivity of vertexes is imbalanced across the network. This in turn causes a workload imbalance among vertexes, which has a larger impact on the maximal strong clique algorithm than on the weakly connected component algorithm. We are currently investigating heuristics for input splitting in order to balance workload among slave nodes.

Figure 9 illustrates the cluster CPU and network usage extracted with Ganglia Monitoring System [21] when executing these two algorithms. We note that the maximal strong clique algorithm and the weakly connected component algorithm are both compute intensive, and the network bandwidth is not a bottleneck. This means the implementation of both algorithms in X-RIME could scale out to much larger social networks and clusters without upgrading the existing 1Gb network infrastructure.

6. Conclusion and Future Works

In this paper, we have introduced X-RIME, a cloud-based library for large scale social network analysis. An implementation-oriented classification and a layered architecture are proposed to guide the development of SNA algorithms. By sharing the same infrastructure and integrating with existing cloud-based data warehouse and data mining libraries and tools, more comprehensive and cost effective social network aware business intelligence solutions could be built. We have also presented several case studies on an online community to illustrate the usage of X-RIME. The performance of X-RIME is also evaluated with experiments, which show good scalability on large social networks and clusters.

X-RIME is an ongoing research project and an open source project at the same time. Future works include extending the functionality with more SNA structures and structure variables, and enhancing the Hadoop MapReduce framework to better support the implementation of SNA algorithms.

7. References

[1] Wasserman, Stanley, and Faust, Katherine. Social Network Analysis: Methods and Applications. Cambridge: Cambridge University Press, 1994.
[2] Carrington, Peter J., John Scott, and Stanley Wasserman (Eds.). Models and Methods in Social Network Analysis. New York: Cambridge University Press, 2005.
[3] Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. The Google file system. In SOSP, pages 29-43, 2003.
[4] Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. In OSDI, pages 137-150, 2004.
[5] JAQL - Query Language for JavaScript Object Notation (JSON). http://code.google.com/p/jaql/.
[6] Apache Mahout. http://lucene.apache.org/mahout/.
[7] Apache hadoop. http://hadoop.apache.org/core.
[8] Social network analysis software, http://en.wikipedia.org/wiki/Social_network_analysis_software.
[9] Jonathan W. Berry, Bruce Hendrickson, Simon Kahan, and Petr Konecny. Software and Algorithms for Graph Queries on Multithreaded Architectures. In IEEE IPDPS, pages 1-14, 2007.
[10] David A. Bader and Kamesh Madduri. Parallel Algorithms for Evaluating Centrality Indices in Real-world Networks. In ICPP, pages 539-550, 2006.
[11] Guojing Cong and David A. Bader. An Experimental Study of Parallel Biconnected Components Algorithms on Symmetric Multiprocessors (SMPs). In IEEE IPDPS, page 45b, 2005.
[12] Will McLendon III, Bruce Hendrickson, Steven J. Plimpton, and Lawrence Rauchwerger. Finding strongly connected components in distributed graphs. In Journal of Parallel and Distributed Computing, 65(8): 901-910, 2005.
[13] Fábio Protti, Felipe M. G. França, and Jayme Luiz Szwarcfiter. On Computing All Maximal Cliques Distributedly. In Proceedings of the 4th International Symposium on Solving Irregularly Structured Problems in Parallel, pages 37-48, 1997.
[14] Robert G. Gallager, Pierre A. Humblet, and Philip M. Spira. A Distributed Algorithm for Minimum-Weight Spanning Trees. ACM Transactions on Programming Languages and Systems (TOPLAS), 5(1): 66-77, 1983.
[15] Douglas Gregor and Andrew Lumsdaine. The Parallel BGL: A Generic Library for Distributed Graph Computations. In Parallel Object-Oriented Scientific Computing (POOSC), 2005.
[16] Cluster Computing and MapReduce. http://code.google.com/intl/en/edu/submissions/mapreduce-minilecture/listing.html.
[17] Jimmy Lin. Graph Algorithms with MapReduce. http://www.umiacs.umd.edu/~jimmylin/cloud-2008-Spring/Session4.ppt.
[18] Wouter de Nooy, Andrej Mrvar, and Vladimir Batagelj. Exploratory Social Network Analysis with Pajek. New York: Cambridge University Press, 2004.
[19] Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. The PageRank Citation Ranking: Bringing Order to the Web. Technical Report, Stanford InfoLab, 1999.
[20] Jon M. Kleinberg. Authoritative Sources in a Hyperlinked Environment. Journal of ACM, 46(5): 604-632, 1999.
[21] Ganglia Monitoring System. http://ganglia.info/.