PODC Keynote, July 19, 2000
Three Basic Issues
Berkeley Ninja Architecture
Base: Scalable, highly-available platform for persistent-state services
Active Proxy: Bootstraps thin devices into infrastructure, runs mobile code (e.g. IBM Workpad)
[Diagram: workstations & PCs reach the Base directly over the Internet; PDAs, cellphones, pagers, etc. connect through Active Proxies (AP)]

Persistent State is HARD
Classic DS focus on the computation, not the data
– this is WRONG, computation is the easy part
Data centers exist for a reason
– can't have consistency or availability without them
Other locations are for caching only:
– proxies, basestations, set-top boxes, desktops
– phones, PDAs, …
Distributed systems can't ignore location distinctions
BASE:
– Basically Available
– Soft-state
– Eventual consistency
ACID vs. BASE

ACID:
– Strong consistency
– Isolation
– Focus on “commit”
– Nested transactions
– Availability?
– Conservative (pessimistic)
– Difficult evolution (e.g. schema)

BASE:
– Weak consistency: stale data OK
– Availability first
– Best effort
– Approximate answers OK
– Aggressive (optimistic)
– Simpler!
– Faster
– Easier evolution

But I think it’s a spectrum

The CAP Theorem
[Diagram: triangle linking Consistency, Availability, and Tolerance to network Partitions]
Theorem: You can have at most two of these properties for any shared-data system
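As a rough illustration of the theorem (mine, not from the slides), here is a minimal Python sketch of a single replica forced to pick a side when a partition hides its peer; the Replica class and its modes are hypothetical. In "CP" mode it refuses requests (consistent but unavailable), in "AP" mode it answers from possibly stale local state (available but not consistent).

    # Toy sketch (not from the talk): one replica deciding what to do when a
    # network partition hides its peer.  "CP" mode refuses to answer (consistent
    # but unavailable); "AP" mode answers from local, possibly stale, state.

    class Replica:
        def __init__(self, mode):
            self.mode = mode          # "CP" or "AP"
            self.local_value = None   # last value this replica has seen
            self.peer_reachable = True

        def write(self, value):
            if self.peer_reachable:
                self.local_value = value   # pretend the peer also applied it
                return "ok"
            if self.mode == "CP":
                return "error: cannot reach peer, refusing write"
            self.local_value = value       # accept locally, reconcile later
            return "ok (will sync when partition heals)"

        def read(self):
            if self.peer_reachable or self.mode == "AP":
                return self.local_value    # AP: possibly stale during a partition
            return "error: cannot reach peer, refusing read"

    cp, ap = Replica("CP"), Replica("AP")
    for r in (cp, ap):
        r.write("v1")
        r.peer_reachable = False           # the partition happens
        print(r.mode, r.write("v2"), r.read())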
Forfeit Partitions
Examples:
– Single-site databases
– Cluster databases
Traits:
– 2-phase commit
– cache validation protocols

Forfeit Availability
Examples:
– Distributed databases
– Distributed locking
Traits:
– Pessimistic locking
– Make minority partitions unavailable
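A minimal sketch (mine, not from the slides) of the "make minority partitions unavailable" trait; the can_serve helper is hypothetical. A node keeps serving only while it can see a strict majority of the replica group, so at most one side of a partition stays live.

    # Toy sketch (not from the talk): a node only serves requests if it can see
    # a strict majority of the replica group, so at most one partition keeps
    # operating and consistency is preserved at the cost of availability.

    def can_serve(reachable_nodes, group_size):
        """True if this side of the partition holds a strict majority."""
        return reachable_nodes > group_size // 2

    group_size = 5
    for reachable in (5, 3, 2):       # e.g. a 3/2 split after a partition
        status = "serve" if can_serve(reachable, group_size) else "unavailable"
        print(reachable, "reachable ->", status)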
Forfeit Consistency
Examples:
– Coda
– Web caching
– DNS
Traits:
– expirations/leases
– conflict resolution
– optimistic

These Tradeoffs are Real
The whole space is useful
Real internet systems are a careful mixture of ACID and BASE subsystems
– We use ACID for user profiles and logging (for revenue)
But there is almost no work in this area
Symptom of a deeper problem: systems and database communities are separate but overlapping (with distinct vocabulary)
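A small illustrative sketch (not from the slides) of the expirations/leases trait, in the style of DNS TTLs or web-cache leases; the TTLCache class and fetch callback are hypothetical. Reads stay available and may be stale, but never older than the lease.

    # Toy sketch (not from the talk): weakening consistency with expirations.
    # Reads may return stale data, but never data older than the TTL, and the
    # cache keeps answering even when the origin is unreachable.

    import time

    class TTLCache:
        def __init__(self, ttl_seconds, fetch):
            self.ttl = ttl_seconds
            self.fetch = fetch            # function that asks the origin server
            self.entries = {}             # key -> (value, expiry_time)

        def get(self, key):
            value, expires = self.entries.get(key, (None, 0.0))
            if time.time() < expires:
                return value              # possibly stale, but within the lease
            try:
                value = self.fetch(key)   # lease expired: refresh from origin
            except OSError:
                if key in self.entries:
                    return self.entries[key][0]   # origin down: serve stale
                raise
            self.entries[key] = (value, time.time() + self.ttl)
            return value

    cache = TTLCache(ttl_seconds=30, fetch=lambda name: "10.0.0.1")
    print(cache.get("example.com"))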
The Boundary
The interface between two modules
– client/server, peers, libraries, etc…
Basic boundary = the procedure call
– thread traverses the boundary
– two sides are in the same address space
[Diagram: client C calling into server S]

Different Address Spaces
What if the two sides are NOT in the same address space?
– IPC or LRPC
Can’t do pass-by-reference (pointers)
– Most IPC screws this up: pass by value-result
– There are TWO copies of args, not one
What if they share some memory?
– Can pass pointers, but…
– Need synchronization between client/server
– Not all pointers can be passed
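To make the "two copies of args" point concrete, here is a toy Python sketch (not from the slides) of value-result marshalling across an IPC boundary; server_append and rpc_call are made-up helpers standing in for a stub and a skeleton. The server mutates its own copy, and the caller sees the change only through the copied-back reply.

    # Toy sketch (not from the talk): why pointers don't cross an IPC boundary.
    # The stub copies the argument, the server mutates its own copy, and the
    # result is copied back -- value-result semantics, with TWO copies of the
    # argument instead of the one a local call would share.

    import copy

    def server_append(items):
        items.append("added by server")   # mutates the *server's* copy only
        return items

    def rpc_call(func, arg):
        in_copy = copy.deepcopy(arg)      # copy #1: marshalled request
        result = func(in_copy)
        return copy.deepcopy(result)      # copy #2: marshalled reply

    client_list = ["a", "b"]
    reply = rpc_call(server_append, client_list)
    print(client_list)   # ['a', 'b']                      -- caller's copy untouched
    print(reply)         # ['a', 'b', 'added by server']   -- change arrives via the reply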
Multiplexing clients?
Boundary evolution?
Boundary Summary
Conclusions
Data/query * Queries/sec = constant = DQ
– for a given node
– for a given app/OS release
A fault can reduce the capacity (Q), completeness (D), or both
Faults reduce this constant linearly (at best)

Yield: Fraction of Answered Queries
– Related to uptime but measured by queries, not by time
– Drop 1 out of 10 connections => 90% yield
– At full utilization: yield ~ capacity ~ Q
Harvest: Fraction of the Complete Result
– Reflects that some of the data may be missing due to faults
– Replication: maintain D under faults
DQ corollary: harvest * yield ~ constant
– ACID => choose 100% harvest (reduce Q but 100% D)
– Internet => choose 100% yield (available but reduced D)
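A small worked sketch (mine, not from the slides, with made-up numbers) of the DQ bookkeeping: a fixed per-node DQ budget can be spent on harvest (data per answer) or yield (answered queries), and after a fault the product harvest * yield comes out the same whichever way you shed the loss.

    # Toy sketch (not from the talk): a node has a budget DQ = data-per-query *
    # queries-per-second; a fault can be absorbed either as lost yield (fewer
    # queries answered) or lost harvest (less data per answer).

    offered_queries = 50.0           # queries/sec offered by clients
    full_data = 2.0                  # data touched per query when nothing has failed

    def harvest_and_yield(dq, offered, full_data, keep_full_harvest):
        if keep_full_harvest:        # ACID-style: 100% harvest, shed queries
            answered = min(offered, dq / full_data)
            data_per_query = full_data
        else:                        # Internet-style: 100% yield, shed data
            answered = offered
            data_per_query = min(full_data, dq / offered)
        return data_per_query / full_data, answered / offered   # (harvest, yield)

    for dq in (100.0, 50.0):         # healthy node vs. node that lost half its DQ
        for keep in (True, False):
            h, y = harvest_and_yield(dq, offered_queries, full_data, keep)
            print(f"DQ={dq:5.1f} keep_harvest={keep}: harvest={h:.0%} yield={y:.0%}")

With half the DQ gone, the sketch prints 100% harvest at 50% yield or 50% harvest at 100% yield: either way the product drops linearly with the lost DQ.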
Harvest Options
1) Ignore lost nodes
– RPC gives up
– forfeit small part of the database
– reduce D, keep Q
2) Pair up nodes (RAID)
– RPC tries alternate
– survives one fault per pair
– reduce Q, keep D
3) n-member replica groups
Decide when you care...

Replica Groups
With n members:
Each fault reduces Q by 1/n
D stable until nth fault
Added load is 1/(n-1) per fault
– n=2 => double load or 50% capacity
– n=4 => 133% load or 75% capacity
– “load redirection problem”
Disaster tolerance: better have >3 mirrors
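A quick back-of-the-envelope check (not from the slides) of the load-redirection arithmetic: with n replicas, one fault either pushes each survivor to n/(n-1) of its normal load or, if survivors can't absorb it, caps the group at (n-1)/n of its capacity.

    # Toy sketch (not from the talk) of the "load redirection problem": one
    # fault removes 1/n of the group's Q, and that load lands on the n-1
    # survivors, raising each survivor's load by 1/(n-1).

    def after_one_fault(n):
        remaining_capacity = (n - 1) / n      # fraction of the group's Q left
        survivor_load = n / (n - 1)           # each survivor's relative load
        return remaining_capacity, survivor_load

    for n in (2, 4):
        cap, load = after_one_fault(n)
        print(f"n={n}: survivors at {load:.0%} load, or the group capped at {cap:.0%} capacity")

This reproduces the slide's figures: n=2 gives double load or 50% capacity, n=4 gives 133% load or 75% capacity.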
Server Pollution
Evolution
Conclusions
Parallel Disk I/O