Towards Robust Distributed Systems

Dr. Eric A. Brewer
Professor, UC Berkeley
Co-Founder & Chief Scientist, Inktomi

PODC Keynote, July 19, 2000

Inktomi at a Glance

Company Overview
‹ “INKT” on NASDAQ
‹ Founded 1996 out of UC Berkeley
‹ ~700 Employees

Applications
‹ Search Technology
‹ Network Products
‹ Online Shopping
‹ Wireless Systems
Our Perspective

‹ Inktomi builds two distributed systems:
– Global Search Engines
– Distributed Web Caches
‹ Based on scalable cluster & parallel computing technology
‹ But very little use of classic DS research...

“Distributed Systems” don’t work...

‹ There exist working DS:
– Simple protocols: DNS, WWW
– Inktomi search, Content Delivery Networks
– Napster, Verisign, AOL
‹ But these are not classic DS:
– Not distributed objects
– No RPC
– No modularity
– Complex ones are single owner (except phones)

Three Basic Issues

‹ Where is the state?
‹ Consistency vs. Availability
‹ Understanding Boundaries

Where’s the state?
(not all locations are equal)

Santa Clara Cluster

• Very uniform
• No monitors
• No people
• No cables
• Working power
• Working A/C
• Working BW

Delivering High Availability

We kept up the service through:
‹ Crashes & disk failures (weekly)
‹ Database upgrades (daily)
‹ Software upgrades (weekly to monthly)
‹ OS upgrades (twice)
‹ Power outage (several)
‹ Network outages (now have 11 connections)
‹ Physical move of all equipment (twice)

Persistent State is HARD

‹ Classic DS focus on the computation, not the data
– this is WRONG, computation is the easy part
‹ Data centers exist for a reason
– can’t have consistency or availability without them
‹ Other locations are for caching only:
– proxies, basestations, set-top boxes, desktops
– phones, PDAs, …
‹ Distributed systems can’t ignore location distinctions

Berkeley Ninja Architecture

Base: scalable, highly-available platform for persistent-state services
AP (Active Proxy): bootstraps thin devices into the infrastructure, runs mobile code
Clients: workstations & PCs, PDAs, cellphones, pagers, etc. (e.g. IBM Workpad), reaching the Base via APs over the Internet

PODC Keynote, July 19, 2000

Consistency vs. Availability
(ACID vs. BASE)

ACID vs. BASE

‹ DBMS research is about ACID (mostly)
‹ But we forfeit “C” and “I” for availability, graceful degradation, and performance

This tradeoff is fundamental.

BASE:
– Basically Available
– Soft-state
– Eventual consistency

ACID vs. BASE

ACID:
‹ Strong consistency
‹ Isolation
‹ Focus on “commit”
‹ Nested transactions
‹ Availability?
‹ Conservative (pessimistic)
‹ Difficult evolution (e.g. schema)

BASE:
‹ Weak consistency
– stale data OK
‹ Availability first
‹ Best effort
‹ Approximate answers OK
‹ Aggressive (optimistic)
‹ Simpler!
‹ Faster
‹ Easier evolution

But I think it’s a spectrum

The CAP Theorem

Consistency / Availability / Tolerance to network Partitions

Theorem: You can have at most two of these properties for any shared-data system

Forfeit Partitions

Examples
‹ Single-site databases
‹ Cluster databases
‹ LDAP
‹ xFS file system

Traits
‹ 2-phase commit
‹ cache validation protocols

Forfeit Availability

Examples
‹ Distributed databases
‹ Distributed locking
‹ Majority protocols

Traits
‹ Pessimistic locking
‹ Make minority partitions unavailable

Forfeit Consistency

Examples
‹ Coda
‹ Web caching
‹ DNS

Traits
‹ expirations/leases
‹ conflict resolution
‹ optimistic

These Tradeoffs are Real

‹ The whole space is useful
‹ Real internet systems are a careful mixture of ACID and BASE subsystems
– We use ACID for user profiles and logging (for revenue)
‹ But there is almost no work in this area
‹ Symptom of a deeper problem: systems and database communities are separate but overlapping (with distinct vocabulary)
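The "expirations/leases" trait of the forfeit-consistency corner can be sketched as a DNS-style cache: entries are served locally until a TTL elapses, accepting staleness in exchange for availability. This is an illustrative sketch, not Inktomi code; all names (`TtlCache`, `fetch`, the fake clock) are invented for the example.

```python
# Sketch of expiration-based weak consistency (as in DNS or a web cache):
# serve a cached answer until its TTL runs out, then refetch from the origin.
# The clock is injected so the demo below is deterministic.

class TtlCache:
    def __init__(self, ttl, clock):
        self.ttl, self.clock = ttl, clock
        self.store = {}                           # key -> (value, expiry time)

    def get(self, key, fetch):
        now = self.clock()
        hit = self.store.get(key)
        if hit and hit[1] > now:                  # unexpired: answer locally,
            return hit[0]                         # even if the origin changed
        value = fetch(key)                        # expired/missing: ask origin
        self.store[key] = (value, now + self.ttl)
        return value

clock = [0]                                       # fake time
cache = TtlCache(ttl=30, clock=lambda: clock[0])
origin_hits = []
fetch = lambda name: origin_hits.append(name) or "10.0.0.1"

first = cache.get("www.example.com", fetch)       # miss: goes to origin
second = cache.get("www.example.com", fetch)      # hit: served from cache
clock[0] = 31                                     # TTL elapses
third = cache.get("www.example.com", fetch)       # expired: refetched
```

During the 30-unit window, readers may see a stale address if the origin changes; that is exactly the consistency being forfeited for availability.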

CAP Take Homes

‹ Can have consistency & availability within a cluster (foundation of Ninja), but it is still hard in practice
‹ OS/Networking good at BASE/Availability, but terrible at consistency (the RPC hangover)
‹ Databases better at C than Availability
‹ Wide-area databases can’t have both
‹ Disconnected clients can’t have both
‹ All systems are probabilistic…

Understanding Boundaries

The Boundary

‹ The interface between two modules
– client/server, peers, libraries, etc…
‹ Basic boundary = the procedure call
– thread traverses the boundary
– two sides are in the same address space

Different Address Spaces

‹ What if the two sides are NOT in the same address space?
– IPC or LRPC
‹ Can’t do pass-by-reference (pointers)
– Most IPC screws this up: pass by value-result
– There are TWO copies of args, not one
‹ What if they share some memory?
– Can pass pointers, but…
– Need synchronization between client/server
– Not all pointers can be passed

Trust the other side?

‹ What if we don’t trust the other side?
‹ Have to check args, no pointer passing
‹ Kernels get this right:
– copy/check args
– use opaque references (e.g. File Descriptors)
‹ Most systems do not:
– TCP
– Napster
– web browsers

Partial Failure

‹ Can the two sides fail independently?
– RPC, IPC, LRPC
‹ Can’t be transparent (like RPC) !!
‹ New exceptions (other side gone)
‹ Reclaim local resources
– e.g. kernels leak sockets over time => reboot
‹ Can use leases?
– Different new exceptions: lease expired
‹ RPC tries to hide these issues (but fails)
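The "can use leases?" bullet can be sketched concretely: the server grants a resource for a bounded time, and if the client vanishes (the partial-failure case) the resource reclaims itself when the lease expires instead of leaking. A minimal sketch with invented names (`Lease`, `reclaim_expired`, the fake clock); the slide's point is only the mechanism.

```python
# Sketch of lease-based reclamation under partial failure: a dead client
# stops renewing, so its resources free themselves after `duration` time
# units, trading a new exception ("lease expired") for no leaked state.

class Lease:
    def __init__(self, duration, clock):
        self.duration, self.clock = duration, clock
        self.expires = clock() + duration

    def renew(self):                      # healthy clients call this periodically
        self.expires = self.clock() + self.duration

    def expired(self):
        return self.clock() >= self.expires

def reclaim_expired(leases):
    """Server-side sweep: keep only resources whose holder is still renewing."""
    return {name: l for name, l in leases.items() if not l.expired()}

clock = [0]                               # fake time, for a deterministic demo
leases = {"conn-42": Lease(duration=10, clock=lambda: clock[0])}
clock[0] = 5
leases["conn-42"].renew()                 # client still alive: lease now ends at t=15
clock[0] = 14
alive = reclaim_expired(leases)           # t=14 < 15: connection kept
clock[0] = 20
gone = reclaim_expired(leases)            # no renewal since t=5: reclaimed
```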

Multiplexing clients?

‹ Does the server have to:
– deal with high concurrency?
– Say “no” sometimes (graceful degradation)
– Treat clients equally (fairness)
– Bill for resources (and have audit trail)
– Isolate clients’ performance, data, …
‹ These all affect the boundary definition

Boundary evolution?

‹ Can the two sides be updated independently? (NO)
‹ The DLL problem...
‹ Boundaries need versions
‹ Negotiation protocol for upgrade?
‹ Promises of backward compatibility?
‹ Affects naming too (version number)
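One way to read "boundaries need versions" plus "negotiation protocol for upgrade?" is a handshake in which each side advertises the boundary versions it speaks and they settle on the highest common one, failing loudly rather than silently mismatching. This is an illustrative sketch of that idea, not a protocol from the talk.

```python
# Sketch of a version-negotiation handshake for an evolving boundary:
# agree on the newest mutually supported version, or raise an explicit
# error instead of proceeding with incompatible sides.

def negotiate(client_versions, server_versions):
    common = set(client_versions) & set(server_versions)
    if not common:
        raise RuntimeError("no common boundary version")
    return max(common)                 # prefer the newest shared version
```

With versions in the name (as the slide suggests for naming), an old client and an upgraded server can still find version 2 in common while new pairs use version 3.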

Example: protocols vs. APIs

‹ Protocols have been more successful than APIs
‹ Some reasons:
– protocols are pass by value
– protocols designed for partial failure
– not trying to look like local procedure calls
– explicit state machine, rather than call/return (this exposes exceptions well)
‹ Protocols still not good at trust, billing, evolution

Example: XML

‹ XML doesn’t solve any of these issues
‹ It is RPC with an extensible type system
‹ It makes evolution better?
– two sides need to agree on schema
– can ignore stuff you don’t understand
‹ Can mislead us to ignore the real issues

Boundary Summary

‹ We have been very sloppy about boundaries
‹ Leads to fragile systems
‹ Root cause is false transparency: trying to look like local procedure calls
‹ Relatively little work in evolution, federation, client-based resource allocation, failure recovery

Conclusions

‹ Classic Distributed Systems are fragile
‹ Some of the causes:
– focus on computation, not data
– ignoring location distinctions
– poor definitions of consistency/availability goals
– poor understanding of boundaries (RPC in particular)
‹ These are all fixable, but need to be far more common

The DQ Principle

Data/query * Queries/sec = constant = DQ
– for a given node
– for a given app/OS release

‹ A fault can reduce the capacity (Q), completeness (D), or both
‹ Faults reduce this constant linearly (at best)

Harvest & Yield

‹ Yield: Fraction of Answered Queries
– Related to uptime but measured by queries, not by time
– Drop 1 out of 10 connections => 90% yield
– At full utilization: yield ~ capacity ~ Q
‹ Harvest: Fraction of the Complete Result
– Reflects that some of the data may be missing due to faults
– Replication: maintain D under faults
‹ DQ corollary: harvest * yield ~ constant
– ACID => choose 100% harvest (reduce Q but 100% D)
– Internet => choose 100% yield (available but reduced D)
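The DQ corollary above can be sketched numerically: after a fault, a system either keeps 100% harvest and sheds yield (the ACID choice) or keeps 100% yield and sheds harvest (the Internet choice), and the product comes out the same. A minimal sketch; the function and parameter names are invented for illustration.

```python
# Sketch of the DQ corollary: D * Q is roughly constant per node, so the
# two extreme fault policies trade harvest against yield with equal product.

def harvest_yield(nodes_total, nodes_up, keep_full_harvest):
    """Return (harvest, yield) after faults, under the two extreme policies.

    keep_full_harvest=True  -> ACID-style: only complete answers are given
                               (100% D, reduced Q, so reduced yield).
    keep_full_harvest=False -> Internet-style: every query is answered with
                               whatever data survives (100% yield, reduced D).
    """
    alive = nodes_up / nodes_total
    if keep_full_harvest:
        return 1.0, alive          # full harvest, yield drops with capacity
    return alive, 1.0              # full yield, harvest drops with lost data

# 10 of 100 nodes down:
acid = harvest_yield(100, 90, keep_full_harvest=True)     # (1.0, 0.9)
base = harvest_yield(100, 90, keep_full_harvest=False)    # (0.9, 1.0)
```

Either way harvest * yield is 0.9, matching "faults reduce this constant linearly (at best)".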

Harvest Options

1) Ignore lost nodes
– RPC gives up
– forfeit small part of the database
– reduce D, keep Q
2) Pair up nodes (RAID-style)
– RPC tries alternate
– survives one fault per pair
– reduce Q, keep D
3) n-member replica groups

Decide when you care...

Replica Groups

With n members:
‹ Each fault reduces Q by 1/n
‹ D stable until nth fault
‹ Added load is 1/(n-1) per fault
– n=2 => double load or 50% capacity
– n=4 => 133% load or 75% capacity
– “load redirection problem”
‹ Disaster tolerance: better have >3 mirrors
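The "load redirection problem" numbers on this slide follow from simple arithmetic, sketched below: a failed member's share of the load spreads over the n-1 survivors. The function names are invented for illustration.

```python
# Sketch of replica-group load redirection: after `faults` failures in an
# n-member group, each survivor carries n/(n - faults) of its normal load,
# so only (n - faults)/n of total capacity is usable without overload.

def survivor_load(n, faults=1):
    """Load multiplier on each surviving replica."""
    survivors = n - faults
    if survivors <= 0:
        raise ValueError("replica group entirely lost")
    return n / survivors

def usable_capacity(n, faults=1):
    """Fraction of total capacity usable without overloading survivors."""
    return (n - faults) / n

# n=2: double load on the survivor, or run the site at 50% capacity.
# n=4: ~133% load on each survivor, or run at 75% capacity.
```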

Graceful Degradation

‹ Goal: smooth decrease in harvest/yield proportional to faults
– we know DQ drops linearly
‹ Saturation will occur
– high peak/average ratios...
– must reduce harvest or yield (or both)
– must do admission control!!!
‹ One answer: reduce D dynamically
– disaster => redirect load, then reduce D to compensate for extra load

Thinking Probabilistically

‹ Maximize symmetry
– SPMD + simple replication schemes
‹ Make faults independent
– requires thought
– avoid cascading errors/faults
– understand redirected load
– KISS
‹ Use randomness
– makes worst-case and average case the same
– ex: Inktomi spreads data & queries randomly
– Node loss implies a random 1% harvest reduction
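The "reduce D dynamically" answer can be sketched with the DQ constant: under saturation, shrink the fraction of the database each query touches until the fixed DQ capacity covers the offered load, serving partial answers at full yield. An illustrative sketch; `dq_capacity` and `offered_qps` are invented parameter names.

```python
# Sketch of dynamic harvest reduction for graceful degradation: since
# D * Q (= dq_capacity) is fixed, choose D so D * offered_qps fits,
# answering every query (100% yield) with a reduced-harvest result.

def harvest_under_load(dq_capacity, offered_qps, full_d=1.0):
    """Largest harvest fraction D such that D * offered_qps <= dq_capacity."""
    if offered_qps <= 0:
        return full_d                       # idle: serve complete answers
    return min(full_d, dq_capacity / offered_qps)

# Normal load: full harvest.  A 2x overload (e.g. redirected load after a
# disaster) is absorbed by halving D instead of dropping queries.
normal = harvest_under_load(dq_capacity=1000, offered_qps=500)    # 1.0
overload = harvest_under_load(dq_capacity=1000, offered_qps=2000) # 0.5
```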

Server Pollution

‹ Can’t fix all memory leaks
‹ Third-party software leaks memory and sockets
– so does the OS sometimes
‹ Some failures tie up local resources

Solution: planned periodic “bounce”
– Not worth the stress to do any better
– Bounce time is less than 10 seconds
– Nice to remove load first…

Evolution

Three Approaches:
‹ Flash Upgrade
– Fast reboot into new version
– Focus on MTTR (< 10 sec)
– Reduces yield (and uptime)
‹ Rolling Upgrade
– Upgrade nodes one at a time in a “wave”
– Temporary 1/n harvest reduction, 100% yield
– Requires co-existing versions
‹ “Big Flip”
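The rolling-upgrade wave can be sketched in a few lines: one node is down at a time, so harvest dips to (n-1)/n at each step while yield stays at 100%. Names and the "-v2" suffix are invented for illustration.

```python
# Sketch of a rolling upgrade wave: upgrade nodes one at a time, recording
# the harvest fraction while each node is down (yield stays 100% throughout,
# at the cost of old and new versions co-existing during the wave).

def rolling_upgrade(nodes):
    """Upgrade each node in turn; return the harvest seen at each step."""
    n = len(nodes)
    harvest_per_step = []
    for i in range(n):
        harvest_per_step.append((n - 1) / n)   # node i down: 1/n of data missing
        nodes[i] = nodes[i] + "-v2"            # node back up on the new version
    return harvest_per_step

cluster = ["n1", "n2", "n3", "n4"]
harvests = rolling_upgrade(cluster)            # 75% harvest during each step
```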

The Big Flip

‹ Steps:
1) take down 1/2 the nodes
2) upgrade that half
3) flip the “active half” (site upgraded)
4) upgrade second half
5) return to 100%
‹ 50% Harvest, 100% Yield
– or inverse?
‹ No mixed versions
– can replace schema, protocols, ...
‹ Twice used to change physical location

Key New Problems

‹ Unknown but large growth
– Incremental & Absolute scalability
– 1000’s of components
‹ Must be truly highly available
– Hot swap everything (no recovery time allowed)
– No “night”
– Graceful degradation under faults & saturation
‹ Constant evolution (internet time)
– Software will be buggy
– Hardware will fail
– These can’t be emergencies...
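The five Big Flip steps can be sketched directly: half the site is down at any moment (hence 50% harvest or 50% yield, per the slide's "or inverse?"), but the serving half is always on a single version, which is what allows schema and protocol changes. A minimal sketch with invented names; version strings stand in for actual software versions.

```python
# Sketch of the Big Flip: the active half never mixes versions, at the cost
# of running on half the nodes during the flip.

def big_flip(nodes, new_version):
    half = len(nodes) // 2
    active, idle = nodes[:half], nodes[half:]   # 1) take down one half
    idle = [new_version for _ in idle]          # 2) upgrade that half
    active, idle = idle, active                 # 3) flip the "active half"
    idle = [new_version for _ in idle]          # 4) upgrade second half
    return active + idle                        # 5) return to 100%

site = big_flip(["v1", "v1", "v1", "v1"], "v2")
```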

Conclusions

‹ Parallel Programming is very relevant, except…
– historically avoids availability
– no notion of online evolution
– limited notions of graceful degradation (checkpointing)
– best for CPU-bound tasks
‹ Must think probabilistically about everything
– no such thing as a 100% working system
– no such thing as 100% fault tolerance
– partial results are often OK (and better than none)
– Capacity * Completeness == Constant

Conclusions

‹ Winning solution is message-passing clusters
– fine-grain communication => fine-grain exception handling
– don’t want every load/store to deal with partial failure
‹ Key open problems:
– libraries & data structures for HA shared state
– support for replication and partial failure
– better understanding of probabilistic systems
– cleaner support for exceptions (graceful degradation)
– support for split-phase I/O and many concurrent threads
– support for 10,000 threads/node (to avoid FSMs)

Backup slides

New Hard Problems...

‹ Really need to manage disks well
– problems are I/O bound, not CPU bound
‹ Lots of simultaneous connections
– 50Kb/s => at least 2000 connections/node
‹ HAS to be highly available
– no maintenance window, even for upgrades
‹ Continuous evolution
– constant site changes, always small bugs...
– large but unpredictable traffic growth
‹ Graceful degradation under saturation

Parallel Disk I/O

‹ Want 50+ outstanding reads/disk
– Provides disk-head scheduler with many choices
– Trades response time for throughput
‹ Pushes towards a split-phase approach to disks
‹ General trend: each query is a finite-state machine
– split-phase disk/network operations are state transitions
– multiplex many FSMs over small number of threads
– FSM handles state rather than thread stack
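The query-as-FSM idea above can be sketched with Python generators: each split-phase operation suspends the FSM, an "I/O completion" resumes it, and one driver loop serves many queries on a single thread. Everything here (the fake disk, the query shape, the driver) is an invented illustration of the pattern, not Inktomi's engine.

```python
# Sketch of "each query is a finite-state machine": a generator suspends at
# each split-phase operation; completed I/O advances it via send(), and one
# single-threaded driver multiplexes many query FSMs.

def query_fsm(qid, disk):
    key = yield ("read", qid)            # state 1: issue split-phase disk read
    data = disk[key]                     # resumed once the "I/O" completes
    yield ("reply", qid, data)           # state 2: answer the query

def run(num_queries, disk):
    fsms = [query_fsm(q, disk) for q in range(num_queries)]
    replies = []
    for q, fsm in enumerate(fsms):       # driver loop: one thread, many FSMs
        op = next(fsm)                   # start FSM; it suspends at the read
        assert op == ("read", q)
        reply = fsm.send(q % len(disk))  # fake I/O completion resumes the FSM
        replies.append(reply)
    return replies

disk = ["apple", "banana"]               # fake disk blocks
replies = run(3, disk)
```

State lives in the generator frame rather than a thread stack, which is the slide's point about avoiding one thread per outstanding request.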

12

Anda mungkin juga menyukai