Towards Robust Distributed Systems

Dr. Eric A. Brewer
Professor, UC Berkeley
Co-Founder & Chief Scientist, Inktomi

PODC Keynote, July 19, 2000

Inktomi at a Glance

Company Overview
‹ “INKT” on NASDAQ
‹ Founded 1996 out of UC Berkeley
‹ ~700 Employees

Applications
‹ Search Technology
‹ Network Products
‹ Online Shopping
‹ Wireless Systems
Our Perspective

‹ Inktomi builds two distributed systems:
– Global Search Engines
– Distributed Web Caches
‹ Based on scalable cluster & parallel computing technology
‹ But very little use of classic DS research...

“Distributed Systems” don’t work...

‹ There exist working DS:
– Simple protocols: DNS, WWW
– Inktomi search, Content Delivery Networks
– Napster, Verisign, AOL
‹ But these are not classic DS:
– Not distributed objects
– No RPC
– No modularity
– Complex ones are single owner (except phones)

Three Basic Issues

‹ Where is the state?
‹ Consistency vs. Availability
‹ Understanding Boundaries

Where’s the state?
(not all locations are equal)

Santa Clara Cluster

• Very uniform
• No monitors
• No people
• No cables
• Working power
• Working A/C
• Working BW

Delivering High Availability

We kept up the service through:
‹ Crashes & disk failures (weekly)
‹ Database upgrades (daily)
‹ Software upgrades (weekly to monthly)
‹ OS upgrades (twice)
‹ Power outage (several)
‹ Network outages (now have 11 connections)
‹ Physical move of all equipment (twice)

Persistent State is HARD

‹ Classic DS focus on the computation, not the data
– this is WRONG, computation is the easy part
‹ Data centers exist for a reason
– can’t have consistency or availability without them
‹ Other locations are for caching only:
– proxies, basestations, set-top boxes, desktops
– phones, PDAs, …
‹ Distributed systems can’t ignore location distinctions

Berkeley Ninja Architecture

Base: scalable, highly-available platform for persistent-state services
AP (Active Proxy): bootstraps thin devices into the infrastructure, runs mobile code
Clients: workstations & PCs, PDAs, cellphones, pagers, etc. (e.g. IBM Workpad), reaching the Base via APs over the Internet

PODC Keynote, July 19, 2000

Consistency vs. Availability
(ACID vs. BASE)

ACID vs. BASE

‹ DBMS research is about ACID (mostly)
‹ But we forfeit “C” and “I” for availability, graceful degradation, and performance

This tradeoff is fundamental.

BASE:
– Basically Available
– Soft-state
– Eventual consistency

ACID vs. BASE

ACID:
‹ Strong consistency
‹ Isolation
‹ Focus on “commit”
‹ Nested transactions
‹ Availability?
‹ Conservative (pessimistic)
‹ Difficult evolution (e.g. schema)

BASE:
‹ Weak consistency
– stale data OK
‹ Availability first
‹ Best effort
‹ Approximate answers OK
‹ Aggressive (optimistic)
‹ Simpler!
‹ Faster
‹ Easier evolution

But I think it’s a spectrum

The CAP Theorem

Consistency / Availability / Tolerance to network Partitions

Theorem: You can have at most two of these properties for any shared-data system

Forfeit Partitions

Examples
‹ Single-site databases
‹ Cluster databases
‹ LDAP
‹ xFS file system

Traits
‹ 2-phase commit
‹ cache validation protocols

Forfeit Availability

Examples
‹ Distributed databases
‹ Distributed locking
‹ Majority protocols

Traits
‹ Pessimistic locking
‹ Make minority partitions unavailable

Forfeit Consistency

Examples
‹ Coda
‹ Web caching
‹ DNS

Traits
‹ expirations/leases
‹ conflict resolution
‹ optimistic

These Tradeoffs are Real

‹ The whole space is useful
‹ Real internet systems are a careful mixture of ACID and BASE subsystems
– We use ACID for user profiles and logging (for revenue)
‹ But there is almost no work in this area
‹ Symptom of a deeper problem: systems and database communities are separate but overlapping (with distinct vocabulary)
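The "expirations/leases" trait of the forfeit-consistency corner can be sketched as a DNS-style cache: entries are served locally until a TTL elapses, accepting staleness in exchange for availability. This is an illustrative sketch, not Inktomi code; all names (`TtlCache`, `fetch`, the fake clock) are invented for the example.

```python
# Sketch of expiration-based weak consistency (as in DNS or a web cache):
# serve a cached answer until its TTL runs out, then refetch from the origin.
# The clock is injected so the demo below is deterministic.

class TtlCache:
    def __init__(self, ttl, clock):
        self.ttl, self.clock = ttl, clock
        self.store = {}                           # key -> (value, expiry time)

    def get(self, key, fetch):
        now = self.clock()
        hit = self.store.get(key)
        if hit and hit[1] > now:                  # unexpired: answer locally,
            return hit[0]                         # even if the origin changed
        value = fetch(key)                        # expired/missing: ask origin
        self.store[key] = (value, now + self.ttl)
        return value

clock = [0]                                       # fake time
cache = TtlCache(ttl=30, clock=lambda: clock[0])
origin_hits = []
fetch = lambda name: origin_hits.append(name) or "10.0.0.1"

first = cache.get("www.example.com", fetch)       # miss: goes to origin
second = cache.get("www.example.com", fetch)      # hit: served from cache
clock[0] = 31                                     # TTL elapses
third = cache.get("www.example.com", fetch)       # expired: refetched
```

During the 30-unit window, readers may see a stale address if the origin changes; that is exactly the consistency being forfeited for availability.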

CAP Take Homes

‹ Can have consistency & availability within a cluster (foundation of Ninja), but it is still hard in practice
‹ OS/Networking good at BASE/Availability, but terrible at consistency (the RPC hangover)
‹ Databases better at C than Availability
‹ Wide-area databases can’t have both
‹ Disconnected clients can’t have both
‹ All systems are probabilistic…

Understanding Boundaries

The Boundary

‹ The interface between two modules
– client/server, peers, libraries, etc…
‹ Basic boundary = the procedure call
– thread traverses the boundary
– two sides are in the same address space

Different Address Spaces

‹ What if the two sides are NOT in the same address space?
– IPC or LRPC
‹ Can’t do pass-by-reference (pointers)
– Most IPC screws this up: pass by value-result
– There are TWO copies of args, not one
‹ What if they share some memory?
– Can pass pointers, but…
– Need synchronization between client/server
– Not all pointers can be passed

Trust the other side?

‹ What if we don’t trust the other side?
‹ Have to check args, no pointer passing
‹ Kernels get this right:
– copy/check args
– use opaque references (e.g. File Descriptors)
‹ Most systems do not:
– TCP
– Napster
– web browsers

Partial Failure

‹ Can the two sides fail independently?
– RPC, IPC, LRPC
‹ Can’t be transparent (like RPC) !!
‹ New exceptions (other side gone)
‹ Reclaim local resources
– e.g. kernels leak sockets over time => reboot
‹ Can use leases?
– Different new exceptions: lease expired
‹ RPC tries to hide these issues (but fails)
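The "can use leases?" bullet can be sketched concretely: the server grants a resource for a bounded time, and if the client vanishes (the partial-failure case) the resource reclaims itself when the lease expires instead of leaking. A minimal sketch with invented names (`Lease`, `reclaim_expired`, the fake clock); the slide's point is only the mechanism.

```python
# Sketch of lease-based reclamation under partial failure: a dead client
# stops renewing, so its resources free themselves after `duration` time
# units, trading a new exception ("lease expired") for no leaked state.

class Lease:
    def __init__(self, duration, clock):
        self.duration, self.clock = duration, clock
        self.expires = clock() + duration

    def renew(self):                      # healthy clients call this periodically
        self.expires = self.clock() + self.duration

    def expired(self):
        return self.clock() >= self.expires

def reclaim_expired(leases):
    """Server-side sweep: keep only resources whose holder is still renewing."""
    return {name: l for name, l in leases.items() if not l.expired()}

clock = [0]                               # fake time, for a deterministic demo
leases = {"conn-42": Lease(duration=10, clock=lambda: clock[0])}
clock[0] = 5
leases["conn-42"].renew()                 # client still alive: lease now ends at t=15
clock[0] = 14
alive = reclaim_expired(leases)           # t=14 < 15: connection kept
clock[0] = 20
gone = reclaim_expired(leases)            # no renewal since t=5: reclaimed
```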

Multiplexing clients?

‹ Does the server have to:
– deal with high concurrency?
– Say “no” sometimes (graceful degradation)
– Treat clients equally (fairness)
– Bill for resources (and have audit trail)
– Isolate clients’ performance, data, …
‹ These all affect the boundary definition

Boundary evolution?

‹ Can the two sides be updated independently? (NO)
‹ The DLL problem...
‹ Boundaries need versions
‹ Negotiation protocol for upgrade?
‹ Promises of backward compatibility?
‹ Affects naming too (version number)
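One way to read "boundaries need versions" plus "negotiation protocol for upgrade?" is a handshake in which each side advertises the boundary versions it speaks and they settle on the highest common one, failing loudly rather than silently mismatching. This is an illustrative sketch of that idea, not a protocol from the talk.

```python
# Sketch of a version-negotiation handshake for an evolving boundary:
# agree on the newest mutually supported version, or raise an explicit
# error instead of proceeding with incompatible sides.

def negotiate(client_versions, server_versions):
    common = set(client_versions) & set(server_versions)
    if not common:
        raise RuntimeError("no common boundary version")
    return max(common)                 # prefer the newest shared version
```

With versions in the name (as the slide suggests for naming), an old client and an upgraded server can still find version 2 in common while new pairs use version 3.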

Example: protocols vs. APIs

‹ Protocols have been more successful than APIs
‹ Some reasons:
– protocols are pass by value
– protocols designed for partial failure
– not trying to look like local procedure calls
– explicit state machine, rather than call/return (this exposes exceptions well)
‹ Protocols still not good at trust, billing, evolution

Example: XML

‹ XML doesn’t solve any of these issues
‹ It is RPC with an extensible type system
‹ It makes evolution better?
– two sides need to agree on schema
– can ignore stuff you don’t understand
‹ Can mislead us to ignore the real issues

Boundary Summary

‹ We have been very sloppy about boundaries
‹ Leads to fragile systems
‹ Root cause is false transparency: trying to look like local procedure calls
‹ Relatively little work in evolution, federation, client-based resource allocation, failure recovery

Conclusions

‹ Classic Distributed Systems are fragile
‹ Some of the causes:
– focus on computation, not data
– ignoring location distinctions
– poor definitions of consistency/availability goals
– poor understanding of boundaries (RPC in particular)
‹ These are all fixable, but need to be far more common

The DQ Principle

Data/query * Queries/sec = constant = DQ
– for a given node
– for a given app/OS release

‹ A fault can reduce the capacity (Q), completeness (D), or both
‹ Faults reduce this constant linearly (at best)

Harvest & Yield

‹ Yield: Fraction of Answered Queries
– Related to uptime but measured by queries, not by time
– Drop 1 out of 10 connections => 90% yield
– At full utilization: yield ~ capacity ~ Q
‹ Harvest: Fraction of the Complete Result
– Reflects that some of the data may be missing due to faults
– Replication: maintain D under faults
‹ DQ corollary: harvest * yield ~ constant
– ACID => choose 100% harvest (reduce Q but 100% D)
– Internet => choose 100% yield (available but reduced D)
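The DQ corollary above can be sketched numerically: after a fault, a system either keeps 100% harvest and sheds yield (the ACID choice) or keeps 100% yield and sheds harvest (the Internet choice), and the product comes out the same. A minimal sketch; the function and parameter names are invented for illustration.

```python
# Sketch of the DQ corollary: D * Q is roughly constant per node, so the
# two extreme fault policies trade harvest against yield with equal product.

def harvest_yield(nodes_total, nodes_up, keep_full_harvest):
    """Return (harvest, yield) after faults, under the two extreme policies.

    keep_full_harvest=True  -> ACID-style: only complete answers are given
                               (100% D, reduced Q, so reduced yield).
    keep_full_harvest=False -> Internet-style: every query is answered with
                               whatever data survives (100% yield, reduced D).
    """
    alive = nodes_up / nodes_total
    if keep_full_harvest:
        return 1.0, alive          # full harvest, yield drops with capacity
    return alive, 1.0              # full yield, harvest drops with lost data

# 10 of 100 nodes down:
acid = harvest_yield(100, 90, keep_full_harvest=True)     # (1.0, 0.9)
base = harvest_yield(100, 90, keep_full_harvest=False)    # (0.9, 1.0)
```

Either way harvest * yield is 0.9, matching "faults reduce this constant linearly (at best)".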

Harvest Options

1) Ignore lost nodes
– RPC gives up
– forfeit small part of the database
– reduce D, keep Q
2) Pair up nodes (RAID-style)
– RPC tries alternate
– survives one fault per pair
– reduce Q, keep D
3) n-member replica groups

Decide when you care...

Replica Groups

With n members:
‹ Each fault reduces Q by 1/n
‹ D stable until nth fault
‹ Added load is 1/(n-1) per fault
– n=2 => double load or 50% capacity
– n=4 => 133% load or 75% capacity
– “load redirection problem”
‹ Disaster tolerance: better have >3 mirrors
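The "load redirection problem" numbers on this slide follow from simple arithmetic, sketched below: a failed member's share of the load spreads over the n-1 survivors. The function names are invented for illustration.

```python
# Sketch of replica-group load redirection: after `faults` failures in an
# n-member group, each survivor carries n/(n - faults) of its normal load,
# so only (n - faults)/n of total capacity is usable without overload.

def survivor_load(n, faults=1):
    """Load multiplier on each surviving replica."""
    survivors = n - faults
    if survivors <= 0:
        raise ValueError("replica group entirely lost")
    return n / survivors

def usable_capacity(n, faults=1):
    """Fraction of total capacity usable without overloading survivors."""
    return (n - faults) / n

# n=2: double load on the survivor, or run the site at 50% capacity.
# n=4: ~133% load on each survivor, or run at 75% capacity.
```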

Graceful Degradation

‹ Goal: smooth decrease in harvest/yield proportional to faults
– we know DQ drops linearly
‹ Saturation will occur
– high peak/average ratios...
– must reduce harvest or yield (or both)
– must do admission control!!!
‹ One answer: reduce D dynamically
– disaster => redirect load, then reduce D to compensate for extra load

Thinking Probabilistically

‹ Maximize symmetry
– SPMD + simple replication schemes
‹ Make faults independent
– requires thought
– avoid cascading errors/faults
– understand redirected load
– KISS
‹ Use randomness
– makes worst-case and average case the same
– ex: Inktomi spreads data & queries randomly
– Node loss implies a random 1% harvest reduction
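The "reduce D dynamically" answer can be sketched with the DQ constant: under saturation, shrink the fraction of the database each query touches until the fixed DQ capacity covers the offered load, serving partial answers at full yield. An illustrative sketch; `dq_capacity` and `offered_qps` are invented parameter names.

```python
# Sketch of dynamic harvest reduction for graceful degradation: since
# D * Q (= dq_capacity) is fixed, choose D so D * offered_qps fits,
# answering every query (100% yield) with a reduced-harvest result.

def harvest_under_load(dq_capacity, offered_qps, full_d=1.0):
    """Largest harvest fraction D such that D * offered_qps <= dq_capacity."""
    if offered_qps <= 0:
        return full_d                       # idle: serve complete answers
    return min(full_d, dq_capacity / offered_qps)

# Normal load: full harvest.  A 2x overload (e.g. redirected load after a
# disaster) is absorbed by halving D instead of dropping queries.
normal = harvest_under_load(dq_capacity=1000, offered_qps=500)    # 1.0
overload = harvest_under_load(dq_capacity=1000, offered_qps=2000) # 0.5
```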

Server Pollution

‹ Can’t fix all memory leaks
‹ Third-party software leaks memory and sockets
– so does the OS sometimes
‹ Some failures tie up local resources

Solution: planned periodic “bounce”
– Not worth the stress to do any better
– Bounce time is less than 10 seconds
– Nice to remove load first…

Evolution

Three Approaches:
‹ Flash Upgrade
– Fast reboot into new version
– Focus on MTTR (< 10 sec)
– Reduces yield (and uptime)
‹ Rolling Upgrade
– Upgrade nodes one at a time in a “wave”
– Temporary 1/n harvest reduction, 100% yield
– Requires co-existing versions
‹ “Big Flip”
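The rolling-upgrade wave can be sketched in a few lines: one node is down at a time, so harvest dips to (n-1)/n at each step while yield stays at 100%. Names and the "-v2" suffix are invented for illustration.

```python
# Sketch of a rolling upgrade wave: upgrade nodes one at a time, recording
# the harvest fraction while each node is down (yield stays 100% throughout,
# at the cost of old and new versions co-existing during the wave).

def rolling_upgrade(nodes):
    """Upgrade each node in turn; return the harvest seen at each step."""
    n = len(nodes)
    harvest_per_step = []
    for i in range(n):
        harvest_per_step.append((n - 1) / n)   # node i down: 1/n of data missing
        nodes[i] = nodes[i] + "-v2"            # node back up on the new version
    return harvest_per_step

cluster = ["n1", "n2", "n3", "n4"]
harvests = rolling_upgrade(cluster)            # 75% harvest during each step
```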

The Big Flip

‹ Steps:
1) take down 1/2 the nodes
2) upgrade that half
3) flip the “active half” (site upgraded)
4) upgrade second half
5) return to 100%
‹ 50% Harvest, 100% Yield
– or inverse?
‹ No mixed versions
– can replace schema, protocols, ...
‹ Twice used to change physical location

Key New Problems

‹ Unknown but large growth
– Incremental & Absolute scalability
– 1000’s of components
‹ Must be truly highly available
– Hot swap everything (no recovery time allowed)
– No “night”
– Graceful degradation under faults & saturation
‹ Constant evolution (internet time)
– Software will be buggy
– Hardware will fail
– These can’t be emergencies...
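The five Big Flip steps can be sketched directly: half the site is down at any moment (hence 50% harvest or 50% yield, per the slide's "or inverse?"), but the serving half is always on a single version, which is what allows schema and protocol changes. A minimal sketch with invented names; version strings stand in for actual software versions.

```python
# Sketch of the Big Flip: the active half never mixes versions, at the cost
# of running on half the nodes during the flip.

def big_flip(nodes, new_version):
    half = len(nodes) // 2
    active, idle = nodes[:half], nodes[half:]   # 1) take down one half
    idle = [new_version for _ in idle]          # 2) upgrade that half
    active, idle = idle, active                 # 3) flip the "active half"
    idle = [new_version for _ in idle]          # 4) upgrade second half
    return active + idle                        # 5) return to 100%

site = big_flip(["v1", "v1", "v1", "v1"], "v2")
```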

Conclusions

‹ Parallel Programming is very relevant, except…
– historically avoids availability
– no notion of online evolution
– limited notions of graceful degradation (checkpointing)
– best for CPU-bound tasks
‹ Must think probabilistically about everything
– no such thing as a 100% working system
– no such thing as 100% fault tolerance
– partial results are often OK (and better than none)
– Capacity * Completeness == Constant

Conclusions

‹ Winning solution is message-passing clusters
– fine-grain communication => fine-grain exception handling
– don’t want every load/store to deal with partial failure
‹ Key open problems:
– libraries & data structures for HA shared state
– support for replication and partial failure
– better understanding of probabilistic systems
– cleaner support for exceptions (graceful degradation)
– support for split-phase I/O and many concurrent threads
– support for 10,000 threads/node (to avoid FSMs)

Backup slides

New Hard Problems...

‹ Really need to manage disks well
– problems are I/O bound, not CPU bound
‹ Lots of simultaneous connections
– 50Kb/s => at least 2000 connections/node
‹ HAS to be highly available
– no maintenance window, even for upgrades
‹ Continuous evolution
– constant site changes, always small bugs...
– large but unpredictable traffic growth
‹ Graceful degradation under saturation

Parallel Disk I/O

‹ Want 50+ outstanding reads/disk
– Provides disk-head scheduler with many choices
– Trades response time for throughput
‹ Pushes towards a split-phase approach to disks
‹ General trend: each query is a finite-state machine
– split-phase disk/network operations are state transitions
– multiplex many FSMs over small number of threads
– FSM handles state rather than thread stack
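The query-as-FSM idea above can be sketched with Python generators: each split-phase operation suspends the FSM, an "I/O completion" resumes it, and one driver loop serves many queries on a single thread. Everything here (the fake disk, the query shape, the driver) is an invented illustration of the pattern, not Inktomi's engine.

```python
# Sketch of "each query is a finite-state machine": a generator suspends at
# each split-phase operation; completed I/O advances it via send(), and one
# single-threaded driver multiplexes many query FSMs.

def query_fsm(qid, disk):
    key = yield ("read", qid)            # state 1: issue split-phase disk read
    data = disk[key]                     # resumed once the "I/O" completes
    yield ("reply", qid, data)           # state 2: answer the query

def run(num_queries, disk):
    fsms = [query_fsm(q, disk) for q in range(num_queries)]
    replies = []
    for q, fsm in enumerate(fsms):       # driver loop: one thread, many FSMs
        op = next(fsm)                   # start FSM; it suspends at the read
        assert op == ("read", q)
        reply = fsm.send(q % len(disk))  # fake I/O completion resumes the FSM
        replies.append(reply)
    return replies

disk = ["apple", "banana"]               # fake disk blocks
replies = run(3, disk)
```

State lives in the generator frame rather than a thread stack, which is the slide's point about avoiding one thread per outstanding request.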

12

Anda mungkin juga menyukai