
Systems challenges in online social media

Alan Mislove
College of Computer and Information Science
Northeastern University

February 22nd, 2012, Networks Class

Social networking sites (Web 2.0)

Popular way to connect and share content
  Photos, videos, blogs, profiles, news, status...
  MySpace (275 M), Facebook (500 M)
  Growing exponentially

Incredible amounts of content being shared
  Facebook (7.5 B photos/month)
  YouTube (48 hours of video/min)

[Figure: Facebook users (millions), 2004 to 2010, y-axis from 0 to 500]

22.02.12 Networks Class

Alan Mislove

What's new in Web 2.0?

Web 1.0

Web 2.0


My group's research

Thesis: OSNs (Web 2.0) are fundamentally different from Web 1.0
  Introducing new and unforeseen challenges
  Need new approaches to address these challenges

Leveraging social networks results in better systems
  Due to the increasing integration of systems and social networks
  But, must be backed by measurement and analysis

My group's research is motivated by the effects of this change
  Will give three examples today


Effect 1: Changing patterns of content creation + exchange


(submitted to USENIX '12)

12.06.10 University of Massachusetts, Boston


Pre-2005 Web (a.k.a. Web 1.0)

[Figure: diagram labeled with the networks Telvia, Telecom Italia, Fastweb, and NGI]


Difference 1: Content popularity


[Figure: fraction of requests (y-axis, 0 to 1) versus fraction of documents, ranked from most to least popular (x-axis, 0 to 1). Curves: classic web [1], Facebook photos [2], even distribution.]
[1] Breslau et al., INFOCOM, 1999, [2] Mislove et al., WSDM, 2010

Implication: Caches less effective


Popularity distribution much more even
  Objects have a narrower scope

In classic Web:
  Caching top 10% serves between 55% [1] and 95% [2] of requests
  Success of CDNs, web caches, ...

In online social media:
  Caching top 10% would only serve 27% [3] of requests

[1] Breslau et al., INFOCOM, 1999, [2] Arlitt et al. IEEE Network, 2000, [3] Mislove et al., WSDM, 2010
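
A small calculation can illustrate why a flatter popularity curve hurts caching. The sketch below is not the measurement behind the figures cited on this slide: it assumes a synthetic Zipf-like popularity model, and the function names and exponent values are hypothetical, chosen only for illustration.

```python
def zipf_counts(n_objects, alpha):
    """Synthetic request counts: the object at rank r gets weight 1 / r**alpha.

    A larger alpha means a more skewed (head-heavy) popularity distribution.
    This model is an assumption for illustration, not measured data.
    """
    return [1.0 / rank ** alpha for rank in range(1, n_objects + 1)]


def top_fraction_hit_ratio(request_counts, cache_fraction=0.10):
    """Fraction of requests served if the most popular objects are cached."""
    ranked = sorted(request_counts, reverse=True)
    cached = max(1, int(len(ranked) * cache_fraction))
    return sum(ranked[:cached]) / sum(ranked)


if __name__ == "__main__":
    for alpha in (1.0, 0.3, 0.0):
        ratio = top_fraction_hit_ratio(zipf_counts(1000, alpha))
        print(f"alpha={alpha}: top 10% of objects serve {ratio:.0%} of requests")
```

With a perfectly even distribution (alpha = 0), caching the top 10% of objects serves exactly 10% of requests, which corresponds to the "even distribution" reference line in the popularity plot.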

Difference 2: Content generation

[Figure: diagram labeled with the networks Telvia, Telecom Italia, Fastweb, and NGI]


Implication: Workload change


Significant content creation at the network's edge
  Ease of digital content creation (photos, video)
  Ubiquity of Internet access (cell phone, iPad)

In classic Web:
  Workload was center-to-edge
  Caching, CDNs take load off origin server

In online social media:
  Workload is edge-to-edge
  Significant geographic locality


How is OSN content being delivered?


Web 1.0 centralized architectures dominate
  Akamai, Limelight, Clearway, ...
  Facebook serves much of its own content

Mismatch between infrastructure and workload
  Workload is naturally decentralized
  Every Facebook upload goes via CA

Can we build a workload-matching distribution system?
  Avoid unnecessary, expensive transfers


WebCloud: Decentralized delivery

First step towards decentralized Web content delivery
  Challenge: Web doesn't support decentralization
  Browsers distinct from Web servers

Use novel techniques to allow browser to serve content
  No client-side changes
  Users help serve content they upload
  Result: Scalable, workload-matching architecture

Next: Brief technical discussion



WebCloud design overview


Goal: Move towards more decentralized content exchange
  Keep content exchange at the edge

Want to make it work with today's sites, browsers
  Reason: Users won't install anything
  Can't require users to do anything different

Idea: Introduce a middlebox to allow browsers to communicate

To build WebCloud, need to
  Make client-side changes
  Deploy middleboxes


Client-side changes
Want to turn web browser into web server
  Implement WebCloud in JavaScript
  Add it to the site's pages

Use LocalStorage to store browsed content
  Persistent cache, up to 5 MB/site
  Easily accessed programmatically
  Treated like LRU cache

Use WebSockets/XHR to communicate with middlebox
  Allows bi-directional communication
  Online client is always connected to middlebox
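
The "treated like LRU cache" point can be sketched as a byte-budgeted LRU store. This is an illustration only, written in Python rather than the JavaScript that WebCloud runs in the browser; the class name `ByteBudgetLRU` and its methods are hypothetical, and the 5 MB default simply mirrors the per-site limit quoted on this slide.

```python
from collections import OrderedDict


class ByteBudgetLRU:
    """Least-recently-used cache bounded by total stored bytes (illustrative)."""

    def __init__(self, capacity_bytes=5 * 1024 * 1024):
        self.capacity_bytes = capacity_bytes
        self.used_bytes = 0
        self._items = OrderedDict()  # key -> bytes, oldest entry first

    def get(self, key):
        """Return the cached value and mark it as recently used, or None."""
        if key not in self._items:
            return None
        self._items.move_to_end(key)
        return self._items[key]

    def put(self, key, value):
        """Store value, evicting least-recently-used entries to make room."""
        if len(value) > self.capacity_bytes:
            return False  # object can never fit in the budget
        if key in self._items:
            self.used_bytes -= len(self._items.pop(key))
        while self.used_bytes + len(value) > self.capacity_bytes:
            _, evicted = self._items.popitem(last=False)
            self.used_bytes -= len(evicted)
        self._items[key] = value
        self.used_bytes += len(value)
        return True
```

For example, with a 10-byte budget, storing two 4-byte objects, reading the first, and then storing a third 4-byte object evicts the second, since it is the least recently used.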


Middleboxes
Add redirector proxies in each ISP
  Like Akamai proxy, but doesn't store any content
  Maintains open connections to online web visitors
  Run by OSN provider

Clients connect to proxy
  Inform proxy of locally stored content

Clients request content from proxy
  Proxy checks for other local clients
  Found: fetches content, forwards to requestor
  Not found: fetches content from origin site
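
The request flow on this slide can be sketched as a small in-memory model. This is a simplification under stated assumptions, not the WebCloud implementation: the class `RedirectorProxy`, its method names, and the `fetch_from_origin` callback are hypothetical, and each client's open connection is stood in for by a plain dictionary of that client's locally stored content.

```python
class RedirectorProxy:
    """Tracks which connected clients hold which content; stores none itself."""

    def __init__(self, fetch_from_origin):
        self._fetch_from_origin = fetch_from_origin  # callable: content_id -> bytes
        self._clients = {}  # client_id -> that client's local store (stand-in for a connection)
        self._holders = {}  # content_id -> set of client_ids that reported holding it

    def connect(self, client_id, local_store):
        """A client connects and informs the proxy of its locally stored content."""
        self._clients[client_id] = local_store
        for content_id in local_store:
            self._holders.setdefault(content_id, set()).add(client_id)

    def disconnect(self, client_id):
        """Forget a client that went offline."""
        for content_id in self._clients.pop(client_id, {}):
            self._holders.get(content_id, set()).discard(client_id)

    def request(self, requester_id, content_id):
        """Serve from another local client if one has the content, else from the origin."""
        for holder in self._holders.get(content_id, ()):
            if holder != requester_id:
                return self._clients[holder][content_id], "peer"
        return self._fetch_from_origin(content_id), "origin"
```

The proxy keeps only an index of who holds what, which matches the slide's point that, unlike a conventional caching proxy, it does not store content.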


Putting it all together

[Figure: Client A and Client B connected through the redirector proxy, which connects to the Internet]

Overall, WebCloud serves as a distributed cache


Use content-hashes to ensure integrity

Privacy implications
k-anonymity for viewers
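
The integrity check can be sketched as follows: content fetched from another client is accepted only if its hash matches the hash the requester already trusts. The slide does not say which hash function WebCloud uses, so SHA-256 is an assumption here, and the function names are hypothetical.

```python
import hashlib


def content_hash(data: bytes) -> str:
    """Hex digest identifying a piece of content (SHA-256 is an assumption)."""
    return hashlib.sha256(data).hexdigest()


def verify_content(data: bytes, expected_hash: str) -> bool:
    """Accept peer-served content only if it matches the expected hash."""
    return content_hash(data) == expected_hash


if __name__ == "__main__":
    original = b"photo bytes"
    trusted = content_hash(original)
    print(verify_content(original, trusted))
    print(verify_content(b"tampered bytes", trusted))
```

Because the expected hash comes from a trusted source rather than from the peer, a client that serves modified content is detected and the requester can fall back to the origin site.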

WebCloud applied to real-world site


76% reduction in 95th percentile bandwidth

[Figure: bandwidth (MB/s, 0 to 120) over time (one week, Friday 00:00 to Friday 00:00)]

Top-50 U.S. web site


Simulation based on Akamai logs

Would dramatically reduce bandwidth required


Savings for both site and ISP
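
For readers unfamiliar with the metric, the sketch below shows one way a reduction in 95th percentile bandwidth could be computed from two series of bandwidth samples. It uses the nearest-rank percentile definition, which is an assumption (the slides do not say which definition was used), and the sample values in the usage example are made up for illustration, not taken from the evaluation.

```python
import math


def percentile_nearest_rank(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def percentile_reduction(before, after, pct=95):
    """Relative reduction in the pct-th percentile between two sample series."""
    p_before = percentile_nearest_rank(before, pct)
    p_after = percentile_nearest_rank(after, pct)
    return 1 - p_after / p_before


if __name__ == "__main__":
    # Made-up samples in MB/s, purely to exercise the functions.
    without_offload = [float(x) for x in range(1, 101)]
    with_offload = [x / 4 for x in without_offload]
    print(f"{percentile_reduction(without_offload, with_offload):.0%} reduction")
```

The 95th percentile is commonly used for bandwidth billing, so a reduction in it translates fairly directly into cost savings.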

Summary
Beginnings of shift in patterns of content creation + exchange
  Patterns changing from center-to-edge to edge-to-edge
  Less biased popularity distribution

But, still using centralized delivery architectures

WebCloud: Step towards decentralized Web content delivery
  Users help serve content they create
  Implemented using existing browser features; no client changes

Evaluation demonstrated practicality, efficacy


Effect 2: Changing meaning of accounts/identity

(NSDI '11)


User accounts
Account abstraction now ubiquitous
  Represents one or more people in a computer system
  Encapsulates privileges

Traditionally verified by service operators

Trend: Online services with free accounts
  Not verified by operators

Accounts come with privileges
  Send messages (Gmail)
  Upload content (Facebook)
  Vote (Digg)

Sybils
Free accounts with privileges lead to Sybil attacks [IPTPS 2002]
  Single person creates many accounts

Why?
  Natural: Gain extra privileges
  Incentives set up to encourage this

Examples in the wild
  Maze [ICDCS 2007]
  Digg [NSDI 2009]
  TripAdvisor [NYT, 10/2011]
  Facebook, Gmail [me, others]


Example: Online marketplaces

Auctions

Marketplace
Among most successful Web sites
eBay alone: $62 B in 2010

But, known to suffer from fraud



Identities and reputations


[Figure: feedback profile, with transaction amounts ranging from $1 to $300]

Significant monetary losses
  Recent arrest of user who stole $717 k from 5,000 users
  Used >250 accounts

Bazaar: A new approach


New approach to strengthening user reputations
  Leverages an (existing) risk network
  Focuses on protecting buyers from malicious sellers

Works in conjunction with existing marketplace
  Assumes same feedback system as today
  No additional monetary cost
  No strong identities

Insight: Successful transactions represent shared risk
  Buyer and seller more likely to enter into future transactions


Bazaar's risk network

[Figure: example risk network whose links are labeled with amounts from $1 to $50]

A successful transaction links the two identities


Weighted by amount of transaction

Risk network automatically generated


Users need not even know about it

Estimating risk
[Figure: example risk network between a buyer and a seller, with links of $300, $5, $100, $4000, $50, and $200; max-flow: $5]

Bazaar calculates max-flow between buyer and seller
  If max-flow is lower than the value of the potential transaction, flag it as fraudulent
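
The max-flow check can be sketched with a textbook Edmonds-Karp implementation over an undirected, weighted risk network. This is a minimal illustration of the rule stated on this slide, not Bazaar's actual implementation; the function names, the edge-list representation, and the example network are all hypothetical.

```python
from collections import defaultdict, deque


def max_flow(links, source, sink):
    """Edmonds-Karp max-flow; each link (u, v, amount) has capacity in both directions."""
    if source == sink:
        return float("inf")
    capacity = defaultdict(lambda: defaultdict(float))
    for u, v, amount in links:
        capacity[u][v] += amount
        capacity[v][u] += amount
    flow = 0.0
    while True:
        # Breadth-first search for a shortest path with spare capacity.
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v, spare in capacity[u].items():
                if spare > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return flow
        # Find the bottleneck along the path, then push that much flow.
        bottleneck = float("inf")
        v = sink
        while parent[v] is not None:
            bottleneck = min(bottleneck, capacity[parent[v]][v])
            v = parent[v]
        v = sink
        while parent[v] is not None:
            u = parent[v]
            capacity[u][v] -= bottleneck
            capacity[v][u] += bottleneck
            v = u
        flow += bottleneck


def should_flag(links, buyer, seller, price):
    """Flag the transaction if the buyer-seller max-flow is below its value."""
    return max_flow(links, buyer, seller) < price


if __name__ == "__main__":
    # Hypothetical risk network: two paths from buyer to seller.
    links = [
        ("buyer", "a", 300), ("a", "b", 5), ("b", "seller", 100),
        ("buyer", "c", 50), ("c", "seller", 200),
    ]
    print(max_flow(links, "buyer", "seller"))
    print(should_flag(links, "buyer", "seller", price=100))
```

Intuitively, a seller who creates many fresh accounts gains nothing, because new identities have no links into the rest of the risk network and therefore contribute no flow.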

Summary
Increasing trend of online services with free accounts
  Opens new vector for attack

Focused on reputation manipulation in online marketplaces
  Bazaar: A new approach to strengthening reputations

Evaluated on 10 M auctions from eBay UK
  Would have prevented 164 k of negative feedback
  Only in five categories over 90 days

Currently looking to apply techniques to other domains



Effect 3: Changing requirements of end users

(IMC '11)


Privacy on OSNs
Privacy is a significant issue on OSNs
  Received recent press, research attention

What underlies the privacy debate?
  1. Sites control personal information of millions of users
  2. Users are expected to manage their privacy
     5,830-word privacy policy
     Over 100 different settings
     Default is open-to-the-world (over 800 million users)

16.10.2009 CCIS/COE Retreat


A fundamental shift for users


Prior to OSNs
  Users were largely content consumers

Now, with sites like Facebook
  Users expected to be content creators and managers
  Must enumerate who is able to access every piece of uploaded content
  Avg. 130 friends, 90 pieces of content/month...

What's the extent of the privacy problem?
  So far, most studies anecdotal
  Can we quantify the extent of the privacy problem on Facebook?

Facebook privacy model


Consider Facebook-supported content:
  Photos, Videos, Statuses, Links and Notes

Five sharing granularities:
  Only Me (Me)
  Some Friends (SF)
  All Friends (AF)
  Friends of Friends (FoF)
  Everyone (All)


Measuring desired and actual settings


Design a Facebook survey application
  Collects actual setting for all content
  Selects up to 10 photos
  Asks user about desired privacy setting

Recruit using Amazon Mechanical Turk
  Total of 200 Facebook users
  Pay them each $1
  116,553 actual settings
  1,675 desired settings

Study was conducted under Northeastern IRB protocol #10-10-04



What are the existing privacy settings?


[Figure: fraction of content (0 to 0.6) shared at each privacy setting (Only Me, Some Friends, All Friends, Friends of Friends, Everyone), grouped by content type (Photo, Video, Status, Link, Note); the default setting is marked]

36% of all content shared with the default (visible to all users)
Photos have the most privacy-conscious settings

How do desired and actual settings compare?


907 randomly-selected photos

                    Desired Setting
Actual Setting    Me    SF    AF   FoF   All
Me                 3     5     2     3     2
SF                 3    12    28     3     0
AF                38     2   184    25    42
FoF               16     8    80    15    22
All               46    23   171    56   118

Totals: 332 (37%) actual matches desired; 443 (49%) actual more open than desired; 132 (14%) actual more restrictive than desired

Actual and desired settings mismatch for 63% of photos
  When incorrect, almost always (77%) too open

To what extent are privacy violations caused by poor defaults?
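
The headline percentages for the 907 randomly-selected photos can be re-derived from the actual-versus-desired counts on this slide. The sketch below assumes the five settings form a strict order from least to most open (Me, SF, AF, FoF, All); the function and variable names are hypothetical.

```python
LEVELS = ["Me", "SF", "AF", "FoF", "All"]  # assumed order: least to most open

# COUNTS[i][j]: photos whose actual setting is LEVELS[i] and desired setting is LEVELS[j]
COUNTS = [
    [3, 5, 2, 3, 2],
    [3, 12, 28, 3, 0],
    [38, 2, 184, 25, 42],
    [16, 8, 80, 15, 22],
    [46, 23, 171, 56, 118],
]


def summarize(counts):
    """Return (matching, too_open, too_restrictive) photo counts."""
    matching = too_open = too_restrictive = 0
    for actual, row in enumerate(counts):
        for desired, photos in enumerate(row):
            if actual == desired:
                matching += photos
            elif actual > desired:
                too_open += photos  # actual setting is more open than desired
            else:
                too_restrictive += photos
    return matching, too_open, too_restrictive


if __name__ == "__main__":
    matching, too_open, too_restrictive = summarize(COUNTS)
    total = matching + too_open + too_restrictive
    mismatched = too_open + too_restrictive
    print(f"match: {matching}/{total} = {matching / total:.0%}")
    print(f"mismatch: {mismatched}/{total} = {mismatched / total:.0%}")
    print(f"too open, among mismatches: {too_open / mismatched:.0%}")
```

The diagonal of the matrix holds the photos whose settings match, entries below it are photos shared more openly than desired, and entries above it are photos shared more restrictively than desired.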



What about photos with modified settings?

Additional 768 photos with non-default privacy settings

                    Desired Setting
Actual Setting    Me    SF    AF   FoF   All
Me                 2     6     4     0     4
SF                 2    12    29     8    11
AF                40     8   237    40    69
FoF               39    17   148    45    47
All                0     0     0     0     0

Totals: 296 (39%) actual matches desired; 254 (33%) actual more open than desired; 218 (28%) actual more restrictive than desired

Settings match only for 39% of privacy-modified photos
  Even when user has explicitly changed setting

Take-away: Not just poor defaults
  Users have significant trouble managing their privacy

Can we improve sharing mechanisms?

Can we provide better management tools?


Ease the user's role as content manager

Idea: Leverage the structure of the social network
  Create privacy groups from the user's friends
  Update the groups as the user forms or breaks friendships

Automatically detecting friendlists


Friendlists: Facebook feature similar to Google+ Circles
  Ground truth: meaningful groupings of users for privacy
  Collected 233 friendlists from our 200 AMT users

Do friendlists correspond with the social network?
  Normalized conductance [WSDM10] rates the quality of a community
  Strongly positive values indicate significant community structure

Results on 233 friendlists:
  Over 48% of friendlists correspond to strong communities
  May be inferable from the social network
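
To give a feel for how a group of friends can be scored against the social graph, the sketch below computes the standard (unnormalized) graph conductance of a group. This is a deliberately swapped-in, simpler measure and not the normalized conductance of [WSDM10]: for standard conductance, lower values indicate a tighter community, whereas the slide's normalized metric is read in the opposite direction. The function name and the example graph are hypothetical.

```python
def conductance(edges, group):
    """Standard conductance: cut edges / min(volume inside, volume outside).

    edges is a list of undirected (u, v) friendships; group is a set of users.
    Volume is the sum of node degrees on each side of the cut.
    """
    group = set(group)
    cut = 0
    volume_inside = 0
    volume_outside = 0
    for u, v in edges:
        for node in (u, v):
            if node in group:
                volume_inside += 1
            else:
                volume_outside += 1
        if (u in group) != (v in group):
            cut += 1
    smaller_side = min(volume_inside, volume_outside)
    if smaller_side == 0:
        raise ValueError("group must have edges on both sides of the cut")
    return cut / smaller_side


if __name__ == "__main__":
    # Two triangles of friends joined by a single friendship.
    edges = [(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6), (3, 4)]
    print(conductance(edges, {1, 2, 3}))  # one triangle: a tight group
    print(conductance(edges, {1, 4}))     # an arbitrary pair: a poor group
```

A friendlist that coincides with one of the triangles scores far better than an arbitrary pair of users, which is the kind of signal that could be used to suggest privacy groups automatically.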


Summary
Privacy an important issue on OSNs
  But, to date, no quantification of privacy problem

Developed methodology to measure actual, desired privacy settings
  Deployed to 200 Facebook users from AMT

Findings:
  36% of all content shared with the default settings
  Privacy settings match expectations less than 40% of the time
  Even when users have already modified settings

But, potential to aid users by providing better mechanisms



Conclusion
Social networks and computer systems increasingly integrated
  New way of organizing information
  Leading to new opportunities, challenges

My group's goal: Leverage social networks in systems design
  WebCloud: Addresses challenges with emerging workloads
  Bazaar: Addresses challenges with free accounts
  Privacy: Addresses difference between privacy perception and reality


Questions?
Work done in collaboration with
Ben Adams (MPI-I), Bobby Bhattacharjee (University of Maryland), Meeyoung Cha (KAIST), Peter Druschel (MPI-SWS), Krishna P. Gummadi (MPI-SWS), Andreas Haeberlen (University of Pennsylvania), Ancsa Hannák (Northeastern University), Jonathan Katz (University of Maryland), Hema Swetha Koppula (Yahoo Research India), Sune Lehmann (TU Copenhagen), Yabing Liu (Northeastern University), Arash Molavi (Northeastern University), Jukka-Pekka Onnela (Harvard University), Ansley Post (Google), J. Niels Rosenquist (Harvard Medical School), Neil Spring (University of Maryland), Ravi Sundaram (Northeastern University), Malveeka Tewari (University of California, San Diego), Bimal Viswanath (MPI-SWS), Liang Zhang (Northeastern University), Fangfei Zhou (Northeastern University)
