
A Comprehensive Study of Twitter Social Networks

Silvia Ciotec, Mihai Dascalu, Stefan Trausan-Matu


Computer Science Department
University Politehnica of Bucharest
Bucharest, Romania
silvia.ciotec@gmail.com, mihai.dascalu@cs.pub.ro, stefan.trausan@cs.pub.ro

Abstract: As most approaches perform social network analysis from a static point of view, our paper is centered on the analysis of the Twitter network, emphasizing its dynamic aspects by using an analytics and visualization-centered application. Our aim is to model the activity and importance of individual users over time, as well as the connection between the recent activity of the entire network and a given subject (for example, trending topics like an event or a celebrity). A user's influence is measured based on his/her followers and retweets, enabling the possibility to classify members of a certain community. Therefore, we shift the perspective towards analyzing the Twitter network as a news-spreading platform by studying the behavior of users, the underlying timelines and relationships.

Keywords: Twitter analysis; Social Network Analysis; retweets; user behavior; interactive visualizations.

I. INTRODUCTION

Twitter has become a popular social networking and micro-blogging service in which users can post and receive short messages of up to 140 characters (named tweets). In other words, basic interactions enable a user to communicate and to be updated via the tweets of the users he/she follows. The service's popularity is reflected in the number of users (500 million registered users in 2012), as well as the number of tweets per day (200 million as of 1st Aug 2011) [1].

One important aspect of this social network is its ability to spread the news and to connect people from all around the world. Our focus was to automatically analyze Twitter's potential to speak about a topic of interest, as well as the behavior of and relationships among its users. Thus, the paper presents statistics about tweets from the geographical point of view, correlated with their distribution among places on the world map. Our application can also extract information about individual users, for example tweeting peak hours, statistics about their followers or communities among a user's followers.

The rest of the document is structured as follows: the second section presents related work, the third and fourth sections describe the tools that were integrated in order to conduct our experiments, the fifth section highlights the experimental data and interpretation of results, while the last section presents conclusions and future work.

II. STATE OF THE ART

Several recent efforts have been made in order to analyze social networks, especially Twitter. Firstly, Lim and Datta [1] address the detection of communities with common interests on Twitter. Their initial focus was to identify celebrities that are representative for an interest category, before detecting communities based on links among the followers of these celebrities. Their study also covers the attributes of these communities and the effects of deepening or of the specialization of interests.

Secondly, Cha, et al. [2] show an in-depth comparison of three measures of influence (in-degree, retweets and mentions) applied to a large volume of data collected from Twitter, consisting of 2 billion follow links among 54 million users who produced a total of 1.7 billion tweets. After investigating topics and the behavior of users in time, their study highlights that the most influential users can impact distinct topics and that this cumulative effect is not gained by chance, but through the concentrated effort of limiting tweets to a single topic.

An interesting approach regarding the Twitter social network consists of a sentiment-driven perspective. As a practical example, Agarwal, et al. [3] built models for classifying tweets based on positive, negative and neutral sentiments. They experimented with a unigram model, a feature-based model and a tree-kernel-based model. Their experiments demonstrated that Twitter-specific features (for example, emoticons or hashtags) add only marginal value to the classifier, while features that combine prior word polarities with their corresponding part-of-speech tags are more relevant for the classification task. The authors also created two publicly available resources: a hand-annotated dictionary that maps emoticons to their polarity and an acronym dictionary collected from the web with English translations of over 5,000 frequently used acronyms.

In terms of visualization, Ediger, et al. [4] analyzed Twitter's public data stream as a graph by using GraphCT (http://trac.research.cc.gatech.edu/graphs/wiki/GraphCT), the Graph Characterization Toolkit for massive graphs representing social network data. Their analysis on graphs of over 60 million vertices and approximately 1.5 billion edges demonstrated that the packaged metrics reveal characteristics of Twitter users' interactions. Also, actors were ranked within conversations, which might help analysts focus attention on smaller, representative data subsets.

From a different perspective, Crymble [5] presents an analysis of how the archival community is using social networking services such as Twitter and Facebook as outreach tools. The study shows that archival organizations overwhelmingly use the services to promote content they have created themselves, whereas archivists promote information they find useful. In all cases, more frequent posting did not correlate to a larger audience [6].

III. TWITTER SOCIAL NETWORK AND INTEGRATED TOOLS

A. Twitter Overview

Twitter is a social network in which registered users can share thoughts and ideas by using small text messages. Twitter permits its users to post and receive messages of up to 140 characters in length. These messages are named tweets and form the basis of social interactions. Also, users can be notified when favorite users have posted messages. In the end, this mesh of users following other users generates the social structure of the Twitter network.

Users can post their own tweets or re-post tweets from other members in a process called retweeting. As an extra functionality, users can reference each other in their tweets (through the use of the key word @<user_name>) or they can mention keywords or key subjects for an easier search by using the # (hashtag), for example #<subject>.

In addition, the availability of celebrities and the multitude of young people, as well as Twitter's nature to make public both the local gossip and the hottest news, create an ideal environment for searching and spreading information about celebrities and details of their lives.

As proof, the 10 most followed users aren't corporations or mass media organizations, but individuals, most of them famous (e.g., Justin Bieber dethroned Lady Gaga as the most-followed Twitter user in January 2013, but ceded the top spot in November 2013 to Katy Perry, who still holds first place among Twitter users at the present moment, July 2014). Furthermore, these celebrities communicate with the millions of users that follow them by using tweets, often published by themselves or by publicists, thus bypassing the traditional mass-media interactions between themselves and their fans.

Together with the conventional celebrities, another class of Twitter users composed of bloggers, writers and journalists has started to occupy a small, but important share of tweets and followers.

Thus, it can be said that Twitter has a whole spectrum of communications that can span from a personal and private level to the traditional mass-media messages. Hence, Twitter can offer an interesting overall context, especially because Twitter, as opposed to TV, radio, printed media and mass-media, permits easy observation of the information flow between its users [7].

B. Twitter API

Twitter offers an API (https://dev.twitter.com/docs/api) that enables data collection functionalities by providing developers with the capability to search for and store user profile information, user connections, tweets and retweets, as well as geographical information for the tweets, if available. This ensures extensibility, as online social network analyses can be performed from the individual level to the online community level. From a technical perspective, the API is based on the REST architecture (Representational State Transfer), a collection of network design principles that defines resources and methods of accessing and using the underlying data. In terms of authentication, OAuth [8] is an open standard that provides 'secure delegated access' to server resources and data on behalf of a resource owner. Nevertheless, the API encompasses multiple limitations, out of which the most problematic are:

1) Per user and per application limits

The rate of usage for the API version 1.1 is taken into account firstly from a per user point of view or, to be more specific, based on an application token (access token). In case a method allows 15 requests per time window, then this method will be granted 15 requests for one token. When application level authentication is used, the imposed limits are determined at a global level, creating an analogous limitation in terms of the requests per window in the name of the user.

2) The 15 minutes time window limit

The time windows of the Twitter API version 1.1 are based on 15-minute intervals with mandatory authentication. There are two initial rates available for performing GET requests: the first one is a group of 15 requests once every 15 minutes and the second one is a group of 180 requests for the same timeframe. More details regarding operations can be found at https://dev.twitter.com/docs/rate-limiting/1.1/limits.

3) Search timeframe limits

The search is limited to 180 requests per 15-minute window. Twitter does not permit an exhaustive search through all possible tweets in its API, but only a search through its most recent posts. With the current version, the results are returned from the latest 6 to 9 days. In order to take full advantage of the API, Twitter4J (http://twitter4j.org) has been selected as the most mature Java interface that makes use of the latest version of the Twitter API.

C. IMDB

The Internet Movie Database, better known as IMDB (http://www.imdb.com/), is a commercial website that hosts an on-line database referring to movies, TV shows, actors, production casts, video games and fictional characters present in mass media specialized in visual entertainment [9]. IMDB does not offer an API for interrogation, but even without these automation capabilities, most data can be downloaded in JSON format (JavaScript Object Notation). In the current experiments, we opted to extract only the exact and popular alternative/stage names, as these turned out to be the most relevant fields that could be retrieved from IMDB.

D. Gephi

Gephi (https://gephi.org/) [10] is an open-source software for network analysis and graphics whose flexible and multi-tasking architecture brings new possibilities of working with complex data sets, while generating representative visual graphs. In a nutshell, Gephi offers broad and easy access to network data, as well as facilities that enable data filtering, navigation, manipulation and clustering.
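
As an illustration of the 15-minute rate-limit windows discussed in the Twitter API subsection, a client can keep a local request budget and wait once it is exhausted. The following is only a minimal sketch under stated assumptions: the class and method names are hypothetical, it is written in Python rather than with Twitter4J, and it models a simple fixed window that starts at the first use, which may differ from how Twitter actually aligns its windows.

```python
import time


class WindowRateLimiter:
    """Fixed-window request budget (hypothetical helper, not a Twitter4J class)."""

    def __init__(self, max_requests=180, window_seconds=15 * 60,
                 clock=time.monotonic):
        # Defaults mirror the limits quoted in the text: 180 requests / 15 minutes.
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.clock = clock
        self.window_start = clock()
        self.used = 0

    def seconds_until_allowed(self):
        """Return 0.0 if a request may be sent now, else the remaining wait."""
        now = self.clock()
        # Assumption: a new window opens once the previous one has fully elapsed.
        if now - self.window_start >= self.window_seconds:
            self.window_start = now
            self.used = 0
        if self.used < self.max_requests:
            return 0.0
        return self.window_seconds - (now - self.window_start)

    def record_request(self):
        """Consume one request from the current window's budget."""
        wait = self.seconds_until_allowed()
        if wait > 0:
            raise RuntimeError("budget exhausted, retry in %.0f s" % wait)
        self.used += 1
```

A collector would call `seconds_until_allowed()` before each search request, sleep for the returned duration if it is non-zero, and then call `record_request()`.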
E. Google Fusion Tables

Fusion Tables (http://tables.googlelabs.com) are on-line data management applications specifically conceived for collaboration, visualization and data publication. In contrast to traditional databases focusing on SQL queries and transaction processing, these toolsets are built for data management and collaboration: combining multiple data sources, data analysis, interrogation, visualization and web publishing. After loading datasets in table form, the results can be filtered, aggregated and visualized using Google Maps or other visualization APIs provided by Google. Also, data from multiple sources can be combined when the corresponding data sets are all connected to the same entities.

IV. IMPLEMENTATION

Our developed application provides a web interface through which a user can search for an actor from IMDB correlated with celebrities' profiles on Twitter. The results are sorted according to relevance, including fan-made profiles, parody profiles, informative type profiles or other related content. The relevance of Twitter results is given by a metric computed as the product of the number of followers and the number of re-tweets. Multiple formulas and approaches have been tried, but the presented heuristic generated the most relevant results.

The advantage of using IMDB is that results are returned even if typos are made in the actor's name. Thus, if the exact spelling of the actor's name is unknown to the user, the most viable alternative is automatically used.

Overall, the application is capable of collecting recent data, stored locally, within the limits imposed by the Twitter API, which allows only up to 180 requests every 15 minutes. Visualization is achieved by integrating the previously described technologies, namely Gephi and the Google Fusion Tables toolset. Thus, the experimental results contain information about the geographical area in which the tweets were posted.

Because not all Twitter users have geo-location activated, not all tweets have geographical information, reducing the data to a limited data set that sometimes can lead to precision loss. The majority of the data sets contain approximately 1,000 posts. For a better visualization of the used data, a heat map of the data sets has been generated (see Fig. 3). Afterwards, the most active hours in the last week and the number of tweets posted can be further analyzed.

Additionally, users were also analyzed from the followers' point of view. Statistics about the followers of a user's followers emphasize the user's importance through followers.

The application generates a graph in GraphML format (http://graphml.graphdrawing.org/) that can be imported in Gephi. Some measurements on the directed user graph containing the user's followers and the connections between them can be performed with Gephi. Firstly, betweenness centrality (i.e., the number of shortest paths from all vertices to all others that pass through the current node) [11, 12] highlights the most important and central nodes within the community. Afterwards, modularity (i.e., the strength of the division of a network into corresponding modules) is used to detect the community structure of our follower graph.

V. EXPERIMENTAL RESULTS AND DISCUSSIONS

A. Tweets Assigned to a Topic

This section presents a dataset that comprises data about Hollywood or European actors, musicians, public persons and events.

1) Actors

Fig. 1 depicts a geographical distribution of tweets that talk about the actress Jennifer Lawrence, a trending movie actress ranked among the top 10 actors according to the STARmeter from IMDB in May 2014. The used subset contains tweets from the week of 24-31 May 2014, out of which only 790 were geo-localized.

Fig. 1. Tweets containing geographical information regarding Jennifer Lawrence as a search term

For a better visualization of the areas where users are more active with regards to a given topic, a heat map of the data set has been generated (see Fig. 2). A heat map is a graphical representation of data where the individual values are cumulated and represented as colors. Thus, the color red expresses the highest intensity of tweets in the rendered region, while light green expresses a lower number of tweets.

Fig. 2. Tweet intensity over the course of one week regarding "Jennifer Lawrence"

The next analyzed subset is about Jim Carrey, another actor from Hollywood, who was rated in the Top 20 actors of the last 20 years on IMDB (the top was last updated on 13th of June 2013). Fig. 3 shows a visualization of approximately 200 tweets about him between 24th of May and 1st of June, 2014. The intensity map presented in Fig. 4 is quite similar to the heat map for Jennifer Lawrence, the most active regions being the United States, United Kingdom and Indonesia. This emphasizes which people are committed to the Hollywood community, but also the places where Twitter is popular.

Fig. 3. Tweets containing geographical information regarding Jim Carrey as a search term

Fig. 4. Tweet intensity over the course of one week regarding "Jim Carrey"

2) Other Celebrities

A geographical visualization of tweets about Bill Gates, one of the most renowned and emblematic figures of the IT&C community, from 24th of May to 1st of June, can be found in Fig. 5. The data is composed of about 500 tweets.

Fig. 5. Tweets containing geographical information regarding Bill Gates as a search term

Fig. 6 shows the heat map associated with the collected data for the Bill Gates search. It can be easily noticed that a wide range of people are talking about him, from all corners of the world, as the IT community is vast and more widely spread. Based on this worldwide popularity, the regions with the most active Twitter users can be pointed out: southern USA, Brazil, Europe (mostly the United Kingdom), Turkey, Indonesia and Japan.

Fig. 6. Tweet intensity over the course of one week regarding "Bill Gates"

The next dataset is dedicated to Stephen Fry, who was born in the UK and lives in London. He also wields a considerable amount of influence through his use of Twitter [13]. The number of gathered tweets is about 60, from 24th of May to 1st of June. Fig. 8 represents the intensity of these tweets, underlining the fact that he is mostly acclaimed in his own country.

Fig. 7. Tweets containing geographical information regarding Stephen Fry as a search term

Fig. 8. Tweet intensity over the course of one week regarding "Stephen Fry"

3) Events

The analyzed event is a Japanese nuclear disaster, Fukushima, which had a severity level of 5 on a 7-level scale. Data was gathered from 24th of May to 1st of June 2014 and consists of about 600 tweets. The intensity of these tweets is represented in Fig. 10. As the event happened in Japan, it is unsurprising that it was tweeted about mostly in Japan.
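
The heat maps in this section were produced with Google Fusion Tables, whose internal rendering is not described here. To illustrate what "cumulating" individual values means, the sketch below counts geo-localized tweets per latitude/longitude grid cell and scales each cell relative to the busiest one. It is a simplified stand-in under our own assumptions, not the Fusion Tables algorithm; the function names and the 5-degree cell size are hypothetical.

```python
import math
from collections import Counter


def bin_tweets(coords, cell_deg=5.0):
    """Count tweets per grid cell; coords is an iterable of (lat, lon) pairs."""
    counts = Counter()
    for lat, lon in coords:
        # Shift to non-negative ranges so floor division yields cell indices.
        row = math.floor((lat + 90.0) / cell_deg)
        col = math.floor((lon + 180.0) / cell_deg)
        counts[(row, col)] += 1
    return counts


def intensity(counts):
    """Scale counts to [0, 1]; 1.0 marks the cell with the most tweets."""
    peak = max(counts.values(), default=0)
    if peak == 0:
        return {}
    return {cell: n / peak for cell, n in counts.items()}
```

A renderer could then map an intensity of 1.0 to red and low intensities to light green, matching the color convention described for Fig. 2.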
Fig. 9. Tweets containing geographical information regarding Fukushima as a search term

Fig. 10. Tweet intensity over the course of one week regarding "Fukushima"

B. Tweets Statistics

Taking into account the analyzed topics and their distributions, the corresponding statistics can reveal meaningful interpretations. The statistics built from the data set rely on UTM (the Universal Transverse Mercator coordinate system) [14]. The UTM conformal projection uses a 2-dimensional Cartesian coordinate system to pinpoint locations all around the globe, independently of their vertical position.

Therefore, the number of tweets is reported for each UTM zone. Jennifer Lawrence, the first search term, has a wide distribution of tweets among the world map zones, as can be seen in Fig. 11. The graphic indicates that the zones with the highest density are zone 31 (mostly the United Kingdom) with 106 tweets and zone 18 (New Jersey, Virginia, Pennsylvania, New York, Delaware and Maryland from the eastern zone of the United States) with 100 tweets.

Fig. 11. Jennifer Lawrence tweet distribution by UTM zones

The data subset for the search term Jim Carrey is smaller than the previous one. Although Fig. 12 presents a wide distribution, the total number of tweets is only 183. The graphic highlights a few zones that are the most important: zones 15 to 19 (eastern US) with around 20 tweets per zone and zones 31 and 32 (mostly UK) with 20 and 13 tweets, respectively. Again, the highest density of tweets corresponds to the eastern part of the US and the UK, therefore highlighting areas in which Twitter is highly adopted.

Fig. 12. Jim Carrey tweet distribution by UTM zones

Our dataset contains 488 tweets that talk about Bill Gates, the founder of Microsoft. Fig. 13 shows that the highest density of tweets is in zones 18 and 19, with about 60 tweets per zone, followed by a high density in zones 31 and 32, with around 30 tweets per zone. Following the same pattern, the zones correspond to the eastern US and the United Kingdom.

Fig. 13. Bill Gates tweet distribution by UTM zones

For Stephen Fry, a distinctly British celebrity, the tweets talking about him are concentrated mostly in the United Kingdom. As the graph shows, the highest density of tweets can be found in zone 31, corresponding to the UK, with 39 tweets.

Fig. 14. Stephen Fry tweet distribution by UTM zones

A completely different perspective is revealed through the analysis of tweets about the Fukushima disaster. From a data set containing about 600 tweets, most are located in zone 55, corresponding to Japan.
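
The per-zone tweet counts used in this subsection can be derived from tweet longitudes alone. The following is a minimal sketch, assuming the standard UTM layout of 60 longitudinal zones of 6 degrees each, numbered eastward from 180°W, and ignoring the special zone adjustments around Norway and Svalbard. The function names are hypothetical and the code is not taken from our application.

```python
import math
from collections import Counter


def utm_zone(lon):
    """Return the UTM longitudinal zone (1..60) for a longitude in degrees."""
    if not -180.0 <= lon <= 180.0:
        raise ValueError("longitude out of range: %r" % lon)
    zone = int(math.floor((lon + 180.0) / 6.0)) + 1
    # Longitude +180 falls on the closing boundary; keep it in the last zone.
    return min(zone, 60)


def tweets_per_zone(longitudes):
    """Count tweets per UTM zone from an iterable of tweet longitudes."""
    return Counter(utm_zone(lon) for lon in longitudes)
```

For example, a tweet posted from New York (longitude of about -74°) falls in zone 18, which agrees with the zone reported above for the eastern United States.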
Fig. 15. Fukushima tweet distribution by UTM zones

As the graphics show, the most popular zone is zone 31, corresponding to the UK, with a standard deviation of 38.65 and an average of 40.6 tweets, followed by the zones from the eastern US, especially zone 19, with a standard deviation of 34.50 and an average of 32 tweets.

C. Tweets analyzed from the user's point of view

Our application is able to determine the most active hours for a user, according to his/her tweets from the past week. Thus, for the user @stephenfry, the hours when he posted the most tweets during the week of 25th of May to 1st of June can be visualized in Fig. 16.

Fig. 16. Peak hours for @stephenfry user

D. User analyzed from the followers' point of view

Additionally, the application can classify the number of followers for each of the user's followers. Therefore, for the user @JoanaJord14, we classified the followers count in 8 slices. Fig. 17 shows that most followers (~30%) of @JoanaJord14 have between 5,000 and 50,000 followers.

Fig. 17. Followers count for @JoanaJord14's followers

E. User Graph and communities

As presented in the fourth section, Gephi was used to build a directed graph of followers for each user. Fig. 18 depicts an example for the user @JoanaJord14, her followers and the connections between her followers. The users are represented as nodes, while friend/follower relationships are represented as edges.

Fig. 18. User Graphs for @JoanaJord14 and @_RuxandraD

The communities are differentiated by colors and the importance of users by the size of the node. The analyzed user is in the center of the graph. The graph is also automatically filtered by keeping solely the most important nodes in terms of betweenness centrality. In this manner, a user can easily identify the communities among his/her followers.

VI. CONCLUSIONS AND FUTURE WORK

As our application allows the accumulation of the most recent data relative to the present moment, the paper presents statistics about different search subjects (celebrities, trending topics or events) according to the geographical localization of the tweets. Thus, areas in which tweeting is frequent can be inspected, regardless of the subject, suggesting that the Twitter network is most frequently used or preferred in certain areas.

Our study confirms that even though Twitter users can be found anywhere on the globe, the active half of the users (the ones that post a tweet at least once a month) are especially localized in five countries: USA, Japan, Indonesia, Great Britain and Brazil (source: http://mashable.com). In addition, areas in which users write in non-Latin scripts (e.g., Russia or China) seldom appear in the collected dataset. Furthermore, our application is also capable of capturing the peak hours in which Twitter users posted during the last week. Starting from all the previously performed analytics, our study represents a confirmation of common user behaviors on social networks and the generated statistics provide valuable insights in terms of usage, distribution and interests.

As future developments, the Twitter profile search will be improved regarding the relevance and diversity of data, the user links and their role within specific user groups. The user search can be extended according to the targeted social network, for academic subjects, or for friends of users with specific interests. The social network analysis can also be improved by adding new methods of research, like presenting the users and their friends/followers in a clustered form or emphasizing the analysis of a user's friends.

From a different perspective, current experiments envision an integration with sentiment analysis tools [15], as in the experiments performed by Martínez-Cámara, et al. [16], as well as creating an interface with our discourse analysis platform ReaderBench [17] for a deeper representation of cohesive links among tweets.

ACKNOWLEDGMENT

This research has been partially supported by the Sectoral Operational Programme Human Resources Development 2007-2013 of the Ministry of European Funds through the Financial Agreement POSDRU/159/1.5/S/134398.

REFERENCES

[1] K. H. Lim and A. Datta, "Finding twitter communities with common interests using following links of celebrities," in 3rd Int. workshop on Modeling social media, Milwaukee, Wisconsin, USA, 2012, pp. 25-32.
[2] M. Cha, H. Haddadi, F. Benevenuto, and K. P. Gummadi, "Measuring User Influence in Twitter: The Million Follower Fallacy," in 4th Int. AAAI Conf. on Weblogs and Social Media (ICWSM), Washington, DC, USA, 2010, pp. 10-17.
[3] A. Agarwal, B. Xie, I. Vovsha, O. Rambow, and R. Passonneau, "Sentiment analysis of Twitter data," in Workshop on Languages in Social Media, Portland, Oregon, 2011, pp. 30-38.
[4] D. Ediger, K. Jiang, J. Riedy, D. A. Bader, and C. Corley, "Massive Social Network Analysis: Mining Twitter for social good," in 39th Int. Conf. on Parallel Processing, San Diego, CA, 2010, pp. 583-593.
[5] A. Crymble, "An Analysis of Twitter and Facebook Use by the Archival Community," Archivaria, vol. 70, pp. 125-151, 2010.
[6] A. Spink, B. J. Jansen, and J. Pedersen, "Searching for people on Web search engines," Journal of Documentation, vol. 60, pp. 266-278, 2004.
[7] S. Wu, J. M. Hofman, W. A. Mason, and D. J. Watts, "Who says what to whom on twitter," in 20th Int. Conf. on World Wide Web, Hyderabad, India, 2011, pp. 705-714.
[8] Internet Engineering Task Force (IETF), "The OAuth 2.0 Authorization Framework (RFC-6749)," 2012.
[9] F. Gao, "Modeling and Interference of the Internet Movie Database," Master thesis, Department of Mathematics and Computer Science, Technische Universiteit Eindhoven, 2011.
[10] M. Bastian, S. Heymann, and M. Jacomy, "Gephi: An open source software for exploring and manipulating networks," in International AAAI Conference on Weblogs and Social Media, San Jose, CA, 2009, pp. 361-362.
[11] U. Brandes, "A faster algorithm for betweenness centrality," Journal of Mathematical Sociology, vol. 25, pp. 163-177, 2001.
[12] L. Freeman, "A set of measures of centrality based on betweenness," Sociometry, vol. 40, pp. 35-41, 1977.
[13] BBC News. (2009). A portrait of the decade [Online]. Available: http://news.bbc.co.uk/2/hi/in_depth/8409040.stm
[14] U.S. Geological Survey, "The Universal Transverse Mercator (UTM) Grid," U.S. Geological Survey, Reston, VA 077-01, 2001.
[15] D. Lupan, M. Dascalu, S. Trausan-Matu, and P. Dessus, "Analyzing emotional states induced by news articles with Latent Semantic Analysis," in 15th Int. Conf. on Artificial Intelligence: Methodology, Systems, Applications (AIMSA 2012), Varna, Bulgaria, 2012, pp. 59-68.
[16] E. Martínez-Cámara, M. T. Martín-Valdivia, L. A. Ureña-López, and A. R. Montejo-Ráez, "Sentiment analysis in Twitter," Natural Language Engineering, pp. 1-28, 2013.
[17] M. Dascalu, Analyzing discourse and text complexity for learning and collaborating, Studies in Computational Intelligence vol. 534. Switzerland: Springer, 2014.