
adopting apache cassandra

@ebenhewitt
10.14.10
strange loop
st louis
• i wrote this
agenda
• context
• features
• data model
• api

“If I had asked the people what they
wanted, they would have said ‘faster
horses’.”
--Henry Ford
so it turns out, there’s a lot of
data in the world…
• Google processes 8 EB of data every year
– 24 PB every day
– 1 PB is a quadrillion bytes
– 1 EB is 1024 PB
• eBay
– 50TB of new data every day
• World of Warcraft
– uses 1.3 PB to store the game
• Chevron
– 2TB of data every day
• WalMart’s Customer Database
– 2004: 0.5 petabyte = 500 TB
The movie Avatar required 1PB
storage

…or the equivalent of a single MP3


…if that MP3
was 32 years
long
it ain’t getting any smaller
• 2006: 166 exabytes
• 2010: >1000 exabytes
how do you scale relational
databases?
1. tune queries
2. indexes
3. vertical scaling
– works for a time
– eventually need to add boxes
4. shard
– create a horizontal partition (how to join now?)
– argh
5. denormalize
6. now you have new problems
– data replication, consistency
– master/slave (SPOF)
7. update configuration management
– start doing undesirable things (turn off journaling)
– caching
the no sql value proposition:
• sql sux
• rdbms sux
• throw out
everything you
know
• run around like a
crazy person
“nosql”  “big data”
• mongodb
• couchdb
• tokyo cabinet
• redis
• riak
• what about?
– Poet, Lotus, Xindice
– they’ve been around forever
– rdbms was once the new kid…
what is cassandra?
daughter of Priam & Hecuba

distributed
decentralized
fault tolerant
elastic
durable
database

cassandra.apache.org
innovation at scale

google bigtable (2006)
• consistency model: strong
• data model: sparse map
• clones: hbase, hypertable
• column family, sequential writes,
bloom filters, linear insert performance
• CP

amazon dynamo (2007)
• consistency model: client tune-able
• data model: key-value
• O(1) dht
• clones: riak, voldemort
• symmetric p2p, gossip
• AP
proven
• SimpleGeo: >50 Large EC2 instances
• Digg: 3TB of data
• Facebook stores 150TB of data on 150 nodes
• US Government has 400 nodes for analytics in the
intelligence community, in partnership with Digital
Reasoning
• Used at Twitter, Rackspace, Mahalo, Reddit, …
no free lunch
• no transactions
• no joins
• no ad hoc queries
agenda
• context
• features
• data model
• api
cassandra properties
• tuneably consistent
• durable, fault tolerant
• very fast writes
• highly available
• linear, elastic scalability
• decentralized/symmetric
• ~12 client languages
– Thrift RPC API
• ~automatic provisioning of new nodes
• O(1) dht
• big data
consistency

• consistency
– all clients have same view of data

• availability
– writeable in the face of node failure

• partition tolerance
– processing can continue in the face of
network failure (crashed router, broken
network)

daniel abadi: pacelc

if partition: trade-off availability &
consistency
else (normal condition): trade-off
latency & consistency
write consistency

Level     Description
ZERO      Good luck with that
ANY       1 replica (hints count)
ONE       1 replica; read repair in bkgnd
QUORUM    (N / 2) + 1
ALL       N = replication factor

read consistency

Level     Description
ZERO      Ummm…
ANY       Try ONE instead
ONE       1 replica
QUORUM    Return most recent timestamp after (N / 2) + 1 replicas report
ALL       N = replication factor
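The two tables above share one piece of arithmetic. A minimal sketch in plain Java (not Cassandra code) of the quorum count, plus the R + W > N overlap rule that makes QUORUM reads observe QUORUM writes:

```java
// Sketch only: the quorum arithmetic behind the consistency tables above.
public class QuorumMath {
    // QUORUM = (N / 2) + 1, where N is the replication factor
    static int quorum(int replicationFactor) {
        return (replicationFactor / 2) + 1;
    }

    // A read sees the latest write when read + write replica counts
    // overlap: R + W > N. QUORUM + QUORUM always satisfies this.
    static boolean overlaps(int r, int w, int n) {
        return r + w > n;
    }

    public static void main(String[] args) {
        int n = 3;
        int q = quorum(n); // 2 of 3 replicas
        System.out.println("QUORUM for N=3: " + q);
        System.out.println("QUORUM read + QUORUM write overlap: "
                + overlaps(q, q, n));
    }
}
```

Reading and writing at ONE (1 + 1 = 2, not > 3) gives no such guarantee; that is the tune-able part of the consistency model.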
durability
fast writes: staged event-driven architecture (seda)
• A general-purpose framework for high
concurrency & load conditioning
• Decomposes applications into stages
separated by queues
• Adopt a structured approach to event-
driven concurrency
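The stages-separated-by-queues idea can be sketched with two illustrative stages handing work off through a BlockingQueue. The stage names and transformation here are invented for the example; they are not Cassandra's internal stages:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Sketch of SEDA: each stage runs independently and communicates with the
// next stage only through a queue, so load in one stage doesn't block callers.
public class SedaSketch {
    static final String EOF = "\u0000EOF"; // sentinel to end the stream

    static String[] runPipeline(String[] events) {
        BlockingQueue<String> handoff = new ArrayBlockingQueue<>(16);
        List<String> out = Collections.synchronizedList(new ArrayList<>());

        // stage 2: drain the queue and "write" each parsed event
        Thread writeStage = new Thread(() -> {
            try {
                String e;
                while (!(e = handoff.take()).equals(EOF)) out.add(e);
            } catch (InterruptedException ignored) { }
        });
        writeStage.start();

        try {
            // stage 1: "parse" each event, then hand off via the queue
            for (String e : events) handoff.put(e.toUpperCase());
            handoff.put(EOF);
            writeStage.join();
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
        return out.toArray(new String[0]);
    }

    public static void main(String[] args) {
        for (String s : runPipeline(new String[]{"row1", "row2"}))
            System.out.println("wrote: " + s);
    }
}
```

The bounded queue is the load-conditioning part: a slow downstream stage causes backpressure instead of unbounded memory growth.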
highly available
agenda
• context
• features
• data model
• api
structure

keyspace
• settings (eg, partitioner)

column family…
• settings (eg, comparator, type [Std])

column…
• name
• value
• timestamp
keyspace

• ~= database
• typically one per application
• some settings are configurable only
per keyspace
– partitioner
• configured in XML, in YAML, or via the API
create a keyspace
//Connect to Server
TTransport tr = new TSocket(HOST, PORT);
TFramedTransport tf = new TFramedTransport(tr); //new default
TProtocol proto = new TBinaryProtocol(tf);
Cassandra.Client client = new Cassandra.Client(proto);
tr.open();

//Create Keyspace
KsDef k = new KsDef();
k.setName(keyspaceName);
k.setReplication_factor(1);
k.setStrategy_class
("org.apache.cassandra.locator.RackUnawareStrategy");

List<CfDef> cfDefs = new ArrayList<CfDef>();
k.setCf_defs(cfDefs);

//submit the definition to the cluster
client.system_add_keyspace(k);
partitioner smack-down

Random
• system will use MD5(key) to
distribute data across nodes
• even distribution of keys from one
CF across ranges/nodes

Order Preserving
• key distribution determined by token
• lexicographical ordering
• can specify the token for this node to use
• ‘scrabble’ distribution
• required for range queries
– scan over rows like cursor in index
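What "MD5(key)" buys you can be shown in a few lines. This is an illustration of the idea using the JDK's MessageDigest, not Cassandra's RandomPartitioner class:

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Sketch: the random partitioner hashes the row key; the hash (a token)
// decides which node's range the row lands in.
public class TokenSketch {
    static BigInteger token(String rowKey) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            // interpret the 16-byte digest as a non-negative integer token
            return new BigInteger(1,
                    md5.digest(rowKey.getBytes(StandardCharsets.UTF_8)));
        } catch (NoSuchAlgorithmException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        // adjacent keys hash to far-apart tokens: even distribution,
        // but no meaningful range scans over keys
        System.out.println("user1 -> " + token("user1"));
        System.out.println("user2 -> " + token("user2"));
    }
}
```

This is exactly why the random partitioner balances load and the order-preserving one is required for range queries: hashing destroys key order.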
column family
• group records of similar kind
• CFs are sparse tables
• ex:
– Tweet
– Address
– Customer
– PointOfInterest
column family

keys → columns

key 123: user=eben, nickname=The Situation
key 456: user=alison, icon=…, nickname=…
json-like notation

User {
  123 : { user: eben,
          nickname: The Situation },

  456 : { user: alison,
          icon: …,
          nickname: The Danger Zone }
}
think of cassandra as

row-oriented
• each row is uniquely identifiable by
key
• rows group columns and super columns
a column has 3 parts
1. name
– byte[]
– determines sort order
– used in queries
– indexed
2. value
– byte[]
– you don’t query on column values
3. timestamp
– long (clock)
– last-write-wins conflict resolution
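Last-write-wins falls directly out of the three parts above. A minimal sketch; the tie-breaking choice is the sketch's own simplification, not necessarily Cassandra's:

```java
// Sketch: given two versions of the same column, the version with the
// higher timestamp survives; there is no read-modify-write or locking.
public class LastWriteWins {
    static String resolve(String v1, long ts1, String v2, long ts2) {
        return ts2 > ts1 ? v2 : v1; // ties keep the first version here
    }

    public static void main(String[] args) {
        // two writers race on the same column; the later clock wins
        String winner = resolve("e@old.com", 100L, "e@new.com", 200L);
        System.out.println(winner); // e@new.com
    }
}
```

This is why client clocks matter: a writer with a fast clock can silently shadow a logically later write.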
get started
$ cassandra -f
$ bin/cassandra-cli
cassandra> connect localhost/9160
cassandra> set Keyspace1.Standard1[‘eben’][‘age’]=‘29’
cassandra> set Keyspace1.Standard1[‘eben’][‘email’]=‘e@e.com’
cassandra> get Keyspace1.Standard1[‘eben’][‘age’]
=> (column=616765, value=29, timestamp=…)
column comparators
• byte
• utf8
• long
• timeuuid (version 1)
• lexicaluuid (any, usually version 4)
• <pluggable>
– ex: lat/long
super column

super columns group columns under a
common name
super column

<<SCF>>PointOfInterest

key 10017:
<<SC>>Central Park
desc=Fun to walk in.
<<SC>>Empire State Bldg
phone=212.555.11212
desc=Great view from 102nd floor!

key 63112:
<<SC>>The Loop
phone=314.555.11212
desc=Home of Strange Loop!
super column family

PointOfInterest {
  key: 85255 {
    Phoenix Zoo { phone: 480-555-5555, desc: They have animals here. },
    Spring Training { phone: 623-333-3333, desc: Fun for baseball fans. },
  }, //end phx

  key: 10019 {
    Central Park { desc: Walk around. It's pretty. },
    Empire State Building { phone: 212-777-7777,
                            desc: Great view from 102nd floor. }
  } //end nyc
}

flexible schema: rows need not share the same super columns or columns
about super column families
• sub-column names in a SCF are not
indexed
– top level columns (SCF Name) are always
indexed
• often used for denormalizing data
from standard CFs
rdbms: domain-based
model
what answers do I have?
big query language

cassandra: query-based
model
what questions do I have?
replication
• configurable replication factor
• replica placement strategy
rack unaware → SimpleStrategy
rack aware → OldNetworkTopologyStrategy
data center shard → NetworkTopologyStrategy
agenda
• context
• features
• data model
• api
slice predicate
• data structure describing columns to
return
– SliceRange
• start column name (byte[])
• finish column name (can be empty to stop on
count)
• reverse
• count (like LIMIT)
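How such a predicate behaves can be sketched against a sorted map standing in for a row. Only start, reverse, and count are modeled here; the finish column is omitted for brevity, and the real predicate is a Thrift struct, not this class:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

// Sketch of a SliceRange over one row: walk the sorted column names from
// `start`, optionally in reverse, returning at most `count` of them.
public class SliceSketch {
    static List<String> slice(TreeMap<String, String> row,
                              String start, boolean reversed, int count) {
        NavigableMap<String, String> view =
                reversed ? row.descendingMap() : row;
        List<String> out = new ArrayList<>();
        for (String col : view.keySet()) {
            if (out.size() == count) break; // like LIMIT
            boolean inRange = reversed ? col.compareTo(start) <= 0
                                       : col.compareTo(start) >= 0;
            if (inRange) out.add(col);
        }
        return out;
    }

    public static void main(String[] args) {
        TreeMap<String, String> row = new TreeMap<>();
        row.put("age", ""); row.put("city", "");
        row.put("email", ""); row.put("name", "");
        System.out.println(slice(row, "city", false, 2)); // [city, email]
    }
}
```

An empty start (`""`) begins at the first column, which is why the real API lets start and finish be empty and stop on count alone.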
read api

• get() : Column
– get the Col or SC at given ColPath
COSC cosc = client.get(key, path, CL);

• get_slice() : List<ColumnOrSuperColumn>
– get Cols in one row, specified by SlicePredicate:
List<ColumnOrSuperColumn> results =
client.get_slice(key, parent, predicate, CL);

• multiget_slice() : Map<key, List<CoSC>>
– get slices for list of keys, based on SlicePredicate
Map<byte[],List<ColumnOrSuperColumn>> results =
client.multiget_slice(rowKeys, parent, predicate, CL);

• get_range_slices() : List<KeySlice>
– returns multiple Cols according to a range
– range is startkey, endkey, starttoken, endtoken:
KeyRange keyRange = new KeyRange(); //set start_key/end_key or tokens
List<KeySlice> slices = client.get_range_slices(
    parent, predicate, keyRange, CL);
insert

//cp names the CF to write into; clock supplies the timestamp
ColumnParent cp = new ColumnParent("Standard1");
Clock clock = new Clock(System.currentTimeMillis());

client.insert(userIDKey, cp,
    new Column("name".getBytes(UTF8),
        "George Clinton".getBytes(), clock),
    CL);
delete
String columnFamily = "Standard1";
byte[] key = "k2".getBytes(); //row key

Clock clock = new Clock(System.currentTimeMillis());

ColumnPath colPath = new ColumnPath();
colPath.column_family = columnFamily;
colPath.column = "b".getBytes();

client.remove(key, colPath, clock,
    ConsistencyLevel.ALL);
batch_mutate
Map<byte[], Map<String, List<Mutation>>> mutationMap =
new HashMap<byte[], Map<String, List<Mutation>>>();

//build a Mutation that sets a single column
Mutation mutation = new Mutation();
ColumnOrSuperColumn cosc = new ColumnOrSuperColumn();
cosc.setColumn(new Column("name".getBytes(), "eben".getBytes(), clock));
mutation.setColumn_or_supercolumn(cosc);

List<Mutation> mutationList = new ArrayList<Mutation>();
mutationList.add(mutation);

Map<String, List<Mutation>> m = new HashMap<String, List<Mutation>>();
m.put(columnFamily, mutationList);

//just for this row key, though we could add more
mutationMap.put(key, m);
client.batch_mutate(mutationMap, ConsistencyLevel.ALL);
raw thrift: for masochists

• pycassa (python)
• Telephus (twisted python)
• fauna/cassandra gem (ruby)
• hector (java)
• pelops (java)
• kundera (JPA)
• hectorSharp (C#)
what about… ?

SELECT WHERE
ORDER BY
JOIN ON
GROUP BY
SELECT WHERE
cassandra is an index factory

<<cf>>USER
Key: UserID
Cols: username, email, birth date, city, state

How to support this query?
SELECT * FROM User WHERE city = ‘Scottsdale’

Create a new CF called UserCity:

<<cf>>USERCITY
Key: city
Cols: the UserIDs of users in that city
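The pattern can be sketched with plain maps standing in for the two column families. The class and method names (`UserCityIndex`, `addUser`, `usersIn`) are invented for the example; this is not the Thrift API:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Sketch of the "index factory" pattern: every user write also writes a
// row into a UserCity-style structure keyed by city, so the query becomes
// a single row lookup instead of a scan.
public class UserCityIndex {
    // USER: userID -> columns
    final Map<String, Map<String, String>> user = new HashMap<>();
    // USERCITY: city -> userIDs (as column names; values can be empty)
    final Map<String, Set<String>> userCity = new HashMap<>();

    void addUser(String userId, String username, String city) {
        Map<String, String> cols = new HashMap<>();
        cols.put("username", username);
        cols.put("city", city);
        user.put(userId, cols);
        // denormalize at write time: maintain the index yourself
        userCity.computeIfAbsent(city, c -> new HashSet<>()).add(userId);
    }

    // "SELECT * FROM User WHERE city = ?" becomes one read by row key
    Set<String> usersIn(String city) {
        return userCity.getOrDefault(city, new HashSet<>());
    }
}
```

The cost is paid on the write path (two writes per user) instead of the read path, which is exactly where Cassandra is fastest.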
SELECT WHERE pt 2
• Use an aggregate key
state:city: { user1, user2 }

• Get rows between AZ: & AZ;
for all Arizona users

• Get rows between AZ:Scottsdale &
AZ:Scottsdale1
for all Scottsdale users
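The same ranges can be demonstrated with a sorted map standing in for an order-preserving column family; the sample keys are illustrative:

```java
import java.util.SortedMap;
import java.util.TreeMap;

// Sketch of the aggregate-key trick: row keys are "state:city", and
// half-open key ranges pick out a state or a single city.
public class AggregateKeyScan {
    static TreeMap<String, String[]> sample() {
        TreeMap<String, String[]> rows = new TreeMap<>();
        rows.put("AZ:Phoenix", new String[]{"user3"});
        rows.put("AZ:Scottsdale", new String[]{"user1", "user2"});
        rows.put("MO:StLouis", new String[]{"user4"});
        return rows;
    }

    public static void main(String[] args) {
        TreeMap<String, String[]> rows = sample();
        // ':' (0x3A) sorts just before ';' (0x3B), so this half-open
        // range covers every key starting with "AZ:" — all Arizona users
        SortedMap<String, String[]> arizona = rows.subMap("AZ:", "AZ;");
        // appending '1' ends the range just past the exact key —
        // all Scottsdale users
        SortedMap<String, String[]> scottsdale =
                rows.subMap("AZ:Scottsdale", "AZ:Scottsdale1");
        System.out.println(arizona.keySet());
        System.out.println(scottsdale.keySet());
    }
}
```

The trick only works under the order-preserving partitioner, where row keys are stored in lexicographic order, as the partitioner slide above notes.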
ORDER BY

Columns
are sorted according to
CompareWith or
CompareSubcolumnsWith

Rows
are placed according to their Partitioner:
• Random: MD5 of key
• Order-Preserving: actual key
and are sorted by key, regardless of
partitioner
data
• skinny rows, wide rows (billions of
columns)
• denormalize known queries
– secondary index support in 0.7
• client-side join for others
• 2 caching layers: row, index
is cassandra a good fit?

• sub-millisecond writes
• you need durability
• you have lots of data
> GBs
>= three servers
• growing data over time
• your app is evolving
– startup mode, fluid data
structure
• loose domain data
– “points of interest”
• multi data-center

• your programmers can deal
– documentation
– complexity
– consistency model
– change
– visibility tools
• your operations can deal
– hardware considerations
– can move data
– JMX monitoring
use cases

• jboss.org/infinispan
– data grid cache
• log data stream
• hotelier
– points of interest
– guests
• geospatial
• travel
– segment analytics

With Hadoop!
• BI w/o ETL
• raptr.com
– storage & analytics
for gaming stats
• imagini
– visual quizzes for
publishers
– real time for 100s of
millions of users
coming in 0.7
• secondary indexes
• hadoop improvements
• large row support ( > 2GB)
• dynamic routing around slow nodes
YOU ALREADY
HAVE THE RIGHT
DATABASE TODAY
FOR THE APPLICATION YOU
HAVE TODAY
what would you do
if scale wasn’t a problem?
"An invention has to
make sense in the
world in which it is
finished,
not the world in
which it is started”.

--Ray Kurzweil

@ebenhewitt
cassandra.apache.org
