
Cumulus: Filesystem Backup to the Cloud

Paper: Michael Vrable, Stefan Savage & Geoffrey M. Voelker
Slides: Joe Buck, CMPS 229 - Spring 2010

Introduction

• “The Cloud” is a new-ish platform


• Really a spectrum
• Rethink applications
• Backup


The cloud is new and shiny; we need to rethink solutions in light of its characteristics.
The spectrum runs from thin cloud (S3) to thick cloud (Salesforce.com, Google Docs).
Interesting systems work exists in the asymmetries.
Backup

• Store data off-site


• File or sub-file granularity
• Point-in-time checkpoints
• Efficient backup, restore
• Full vs. incremental

Costs include time, client space, and time to restore.

A restore can be a single file or the whole file system.
Cloud Backup

• Simple interface
• All logic is in the client
• Minimize resource usage & cost


Cumulus takes the thin-cloud approach, using a simple get/put interface.

By going client-heavy, any storage vendor can be used, at the cost of more network traffic.
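A minimal sketch of the storage interface this implies: whole-object get/put plus list/delete, with all backup logic on the client. The LocalStore backend below is illustrative only (not Cumulus's actual code); any vendor exposing these four calls would work.

import os

class LocalStore:
    """Toy thin-cloud backend: each named object is a file under a root dir."""

    def __init__(self, root):
        self.root = root
        os.makedirs(root, exist_ok=True)

    def put(self, name, data):
        # Objects are opaque, write-once blobs; the server never interprets them.
        with open(os.path.join(self.root, name), "wb") as f:
            f.write(data)

    def get(self, name):
        with open(os.path.join(self.root, name), "rb") as f:
            return f.read()

    def list(self):
        return os.listdir(self.root)

    def delete(self, name):
        os.remove(os.path.join(self.root, name))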
Cumulus Backup Format
Example Backup

[Diagram: the Monday snapshot root points to metadata objects (photos/A,
photos/B, mbox, paper), which in turn point to data blocks (photoA, photoB,
mbox1, paper1).]


First day of the week: back up all data.
Example Backup - cont.
Cumulus Backup Format

[Diagram: Monday and Tuesday snapshot roots. Tuesday's root points to new
metadata (mbox', paper') and new data blocks (mbox2, paper2), while sharing
the unchanged metadata (photos/A, photos/B) and data blocks (photoA, photoB)
with Monday's snapshot.]

• Stores filesystem snapshots at multiple points in time
• Data blocks shared within and between snapshots (see the sketch below)
• Minimizes storage and upload bandwidth needed
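A hedged sketch of how that sharing might look as data; the descriptor shape and names here are hypothetical, for illustration. Each snapshot root lists per-file metadata entries that reference data blocks by (segment, start, length), and unchanged files in Tuesday's snapshot reuse Monday's references rather than re-uploading anything.

# Hypothetical snapshot descriptors; block references are (segment, start, length).
monday = {
    "date": "Monday",
    "files": {
        "photos/A": [("seg0", 0, 4096)],
        "mbox":     [("seg0", 4096, 2048)],
    },
}
tuesday = {
    "date": "Tuesday",
    "files": {
        # Unchanged: shared with Monday, nothing new is uploaded.
        "photos/A": monday["files"]["photos/A"],
        # Changed: points at a freshly uploaded block (mbox2).
        "mbox":     [("seg1", 0, 2300)],
    },
}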


Second day: mbox and paper change; the photos do not.

Files are determined to be changed by their mtime & ctime entries. If the
local metadata cache is lost, then all files are treated as new.
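A sketch of that change-detection step, assuming a local cache dict mapping each path to its previous (mtime, ctime, size); the function name and cache shape are hypothetical.

import os

def changed_files(paths, cache):
    """Yield paths that look new or modified relative to the previous snapshot."""
    for path in paths:
        st = os.stat(path)
        sig = (st.st_mtime, st.st_ctime, st.st_size)
        if cache.get(path) != sig:
            yield path        # new or changed (or the cache was lost)
        cache[path] = sig     # remember for the next snapshot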
Example Backup - cont.
Aggregation: Minimizing Per-Block Costs

[Diagram: the same Monday/Tuesday snapshots as before, with the data blocks
now grouped into segments before being stored.]

• May have per-file in addition to per-byte costs
• Protocol overhead: slower backups from more transactions
• Per-file overhead at storage server
• May be exposed as monetary cost by provider
• Cumulus reduces these costs by aggregating blocks into segments before storage

• Aggregation follows from our constraints, but may not be needed in other systems

Data is stored in segments of a fixed size. Segment size needs to optimize
network and storage costs. Data protocols tend to favor larger units of data
transfer (this works well with high latency and doesn't require parallelism
for throughput).
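A rough illustration of aggregation (a sketch under assumed names, not Cumulus's actual code): blocks are appended into a buffer that is flushed as one segment object when it reaches the target size, and each block becomes addressable by a (segment, start, length) triple.

def pack_blocks(blocks, store, segment_size=4 * 1024 * 1024):
    """Pack small data blocks into fixed-target-size segments in `store`."""
    buf, seg_no, refs = bytearray(), 0, []
    for block in blocks:
        offset = len(buf)
        buf.extend(block)
        refs.append(("segment-%d" % seg_no, offset, len(block)))
        if len(buf) >= segment_size:          # segment full: upload it
            store.put("segment-%d" % seg_no, bytes(buf))
            buf, seg_no = bytearray(), seg_no + 1
    if buf:                                   # flush the final partial segment
        store.put("segment-%d" % seg_no, bytes(buf))
    return refs                               # one (segment, start, length) per block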
Repacking

• Snapshots can link to old segments


• Cleaning allows space to be reclaimed
• Client-driven, threshold based


Links look like (segment, start, length) tuples.

Cleaning moves valid data to new segments so space can be reclaimed.
Cleaning thresholds need to balance space savings against data transfer.
Cleaning involves reads and writes, like RAID-5 updates.
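A sketch of client-driven, threshold-based cleaning, assuming the client tracks per-segment utilization (live bytes / total bytes) and the live (start, length) ranges; all names here are hypothetical.

def segments_to_clean(utilization, threshold=0.6):
    """Pick segments whose live-data fraction has fallen below the threshold."""
    return [name for name, frac in utilization.items() if frac < threshold]

def repack_live_data(store, name, live_ranges):
    """Read back the still-referenced (start, length) ranges of one old segment."""
    data = store.get(name)
    return [data[start:start + length] for start, length in live_ranges]

The repacked blocks would be fed back through the segment packer; once new snapshots reference the copies, the old segment is unreferenced and can eventually be deleted.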
Implementation Notes

• ~4,000 lines of C++ & Python


• Segments are the basis
• Data can be packaged after segmenting
• Compression, encryption, indexing, etc.


Segments are the units of operation; a segment can hold parts of files or multiple whole files.
Compression, etc. is applied to segments at the client.
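A sketch of that per-segment filter step. Because the server treats segments as opaque blobs, the client can run each one through compression before upload (the real tool can also chain in encryption, which this sketch omits); only gzip is shown here.

import gzip

def filter_segment(raw_segment):
    """Applied client-side just before upload."""
    return gzip.compress(raw_segment)

def unfilter_segment(stored_segment):
    """Applied client-side after download, during restore."""
    return gzip.decompress(stored_segment)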
Analysis

• What overhead is introduced by the design choices?
• Analyze from a cost perspective
• Quantify the effects of aggregation and tuning


Costs are modeled using current (2009) web-service prices.
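A back-of-the-envelope version of that cost model; the rates below are assumed, roughly 2009-era S3-style prices, for illustration only.

def monthly_cost(stored_gb, uploaded_gb, put_requests,
                 price_storage=0.15,          # assumed $/GB-month stored
                 price_upload=0.10,           # assumed $/GB transferred in
                 price_per_put=0.01 / 1000):  # assumed $ per PUT request
    """Total monthly cost = storage + network + per-request charges."""
    return (stored_gb * price_storage
            + uploaded_gb * price_upload
            + put_requests * price_per_put)

# e.g. ~2 GB stored, ~0.1 GB uploaded, a few hundred PUTs per month
print(monthly_cost(2.0, 0.1, 300))   # ≈ $0.31/month

With rates in this ballpark, storage dominates the total, which is consistent with the simulation results later in the deck.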


Evaluation Traces

                      Fileserver    User
  Duration (days)     157           223
  Entries             26,673,083    122,007
  Files               24,344,167    116,426
  File sizes
    Median            0.996 KB      4.4 KB
    Average           153 KB        21.4 KB
    Maximum           54.1 GB       169 MB
    Total             3.47 TB       2.37 GB
  Update rates
    New data/day      9.50 GB       10.3 MB
    Changed data/day  805 MB        29.9 MB
    Total data/day    10.3 GB       40.2 MB
The analysis is based on traces; most of the numbers come from the User trace.
Evaluation

• Model an ideal backup solution


• All unique data stored on the server
• All new data transferred over the wire
• Compare Cumulus to baseline
• Ignore compression and metadata
Is Cleaning Necessary?

[Figure: "Benefit of Cleaning". Storage utilization vs. time (days) over
~200 days, with cleaning and with no cleaning. Without cleaning, utilization
steadily decreases; cleaning keeps it within a higher range. Exact behavior
depends on the cleaning parameters.]


How Much Data is Transferred?

[Figure: "Data Transfer Measured". Overhead vs. optimal (%) and raw size
(MB/day) as a function of cleaning threshold (0-1), for segment sizes of
16 MB, 4 MB, 1 MB, 512 kB, and 128 kB. Aggregating into larger segments
increases transfer overhead.]

The cleaning threshold is the utilization level below which a segment is cleaned.



What is the Storage Overhead?

[Figure: "Storage Overhead". Overhead vs. optimal (%) and raw size (GB) as a
function of cleaning threshold (0-1), for segment sizes of 16 MB, 4 MB, 1 MB,
512 kB, and 128 kB. Larger segments increase storage overhead.]
What Settings Minimize Total Cost?

[Figure: "Cost". Cost increase vs. optimal (%) as a function of cleaning
threshold (0-1), for segment sizes of 16 MB, 4 MB, 1 MB, 512 kB, and 128 kB.
Cleaning thresholds around 0.4-0.6 and mid-range segment sizes work well.]

There are per-segment charges.
The sweet spot is in the middle.
Simulation Results

• Storage cost was > 75% of total cost
• Cumulus was within 5-10% of ideal


The paper tried to compare against integrated solutions, but I think that's an
apples-to-oranges comparison, as all of their limitations painted them into a corner.
Prototype Results

• The code worked


• Ongoing costs ~ $0.25/month (2 GB)
• “Better” than Jungle Disk & Brackup


Snapshots were restorable.


Jungle Disk & Brackup weren’t tuned for cost.
Summary
• Cumulus is a cost-effective tool for network backup
• Tunable metrics evaluated
• Low-overhead backup is feasible on top of a simple interface
• Limited deduplication
My Thoughts

• Client-side cost?
• Segmentation...


They never seem to quantify the client-side cost of storing the metadata and
block-hash maps.
Segmentation seems like just chunking a tar file. Is it simply automatic
network tuning? Per vendor?
More Material

• Code available
• http://sysnet.ucsd.edu/projects/cumulus/
• FAST ’09 Presentation
• http://www.usenix.org/media/events/fast09/tech/videos/vrable.mov

