Managing Big Data in the Enterprise
Introduction
Data volumes are growing much faster than compute power. This growth demands new strategies for processing and analyzing information.

According to IDC [1], the amount of digital information produced in 2011 will be ten times that produced in 2006: 1,800 exabytes. The majority of this data will be "unstructured": complex data poorly suited to management by structured storage systems like relational databases.

Unstructured data comes from many sources and takes many forms: web logs, text files, sensor readings, user-generated content like product reviews or text messages, audio, video and still imagery, and more.

Large volumes of complex data can hide important insights. Are there buying patterns in point-of-sale data that can forecast demand for products at particular stores? Do RFID tag reads show anomalies in the movement of goods during distribution? Do user logs from a web site, or calling records in a mobile network, contain information about relationships among individual customers? Can a collection of nucleotide sequences be assembled into a single gene? Companies that can extract facts like these from the huge volume of data can better control processes and costs, can better predict demand and can build better products.

Dealing with big data requires two things:
- Inexpensive, reliable storage; and
- New tools for analyzing unstructured and structured data.
Reliable Storage: HDFS
Major Internet properties like Google, Amazon, Facebook and Yahoo! have pioneered the use of networks of inexpensive computers for large-scale data storage and processing. HDFS uses these techniques to store enterprise data.

Hadoop includes a fault-tolerant storage system called the Hadoop Distributed File System, or HDFS. HDFS is able to store huge amounts of information, scale up incrementally and survive the failure of significant parts of the storage infrastructure without losing data.

Hadoop creates clusters of machines and coordinates work among them. Clusters can be built with inexpensive computers. If one fails, Hadoop continues to operate the cluster without losing data or interrupting work, by shifting work to the remaining machines in the cluster.

HDFS manages storage on the cluster by breaking incoming files into pieces, called "blocks," and storing each of the blocks redundantly across the pool of servers. In the common case, HDFS stores three complete copies of each file by copying each piece to three different servers:
[Figure: HDFS splits a file into blocks 1 through 5 and stores each block on three different servers.]
What is Hadoop? Big Data in the Enterprise
[...] manages. Because the cluster stores several copies of every block, more clients can read them at the same time without creating bottlenecks.

Other fault-tolerant storage systems are often more expensive than HDFS.

Of course there are many other redundancy techniques, including the various strategies employed by RAID machines. HDFS offers two key advantages over RAID: it requires no special hardware, since it can be built from commodity servers, and it can survive more kinds of failure (a disk, a node on the network or a network interface).

The one obvious objection to HDFS, its consumption of three times the necessary storage space for the files it manages, is not so serious, given the plummeting cost of storage. In addition, HDFS offers some real advantages for data processing, as the next section will show.
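The storage scheme described in this section can be illustrated with a small simulation. The sketch below is not HDFS code and does not use any Hadoop API: the function names, the server names, the tiny block size and the round-robin placement rule are all assumptions chosen for illustration, and real HDFS uses much larger blocks and its own placement policy. It shows only the idea of splitting a file into blocks, copying each block to three different servers, and reading the file back after some servers have failed.

```python
REPLICATION = 3  # copies of each block, as in the common case described above
BLOCK_SIZE = 4   # bytes; an illustrative assumption, real blocks are far larger


def split_into_blocks(data, block_size=BLOCK_SIZE):
    """Break a file's bytes into fixed-size blocks (the last may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_blocks(blocks, servers, replication=REPLICATION):
    """Copy each block to `replication` distinct servers.

    Round-robin placement is a simplifying assumption, not HDFS's policy.
    """
    if replication > len(servers):
        raise ValueError("need at least as many servers as replicas")
    placement = {name: {} for name in servers}
    for block_id, block in enumerate(blocks):
        for r in range(replication):
            server = servers[(block_id + r) % len(servers)]
            placement[server][block_id] = block
    return placement


def read_file(placement, num_blocks, failed=frozenset()):
    """Reassemble the file from any surviving replica of each block."""
    parts = []
    for block_id in range(num_blocks):
        for name, stored in placement.items():
            if name not in failed and block_id in stored:
                parts.append(stored[block_id])
                break
        else:
            raise IOError(f"block {block_id} has no surviving replica")
    return b"".join(parts)


def raw_bytes_used(placement):
    """Total bytes stored across all servers, counting every replica."""
    return sum(len(b) for stored in placement.values() for b in stored.values())


if __name__ == "__main__":
    data = b"point-of-sale records"
    servers = ["s1", "s2", "s3", "s4", "s5"]  # hypothetical server names
    blocks = split_into_blocks(data)
    placement = place_blocks(blocks, servers)
    # Two of five servers fail; every block still has a surviving copy.
    print(read_file(placement, len(blocks), failed={"s1", "s2"}))
    print(raw_bytes_used(placement), "bytes stored for", len(data), "bytes of data")
```

With three replicas on distinct servers, the loss of any two servers in this sketch leaves at least one copy of every block, at the price of storing three times the size of the original file.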
Hadoop for Big Data Analysis
Many popular tools for enterprise data management (relational database systems, for example) are designed to make simple queries run quickly. They use techniques like indexing to examine just a small portion of all the available data in order to answer a question.

Hadoop is designed for large-scale analyses that need to examine all the data in a repository.

Hadoop is a different sort of tool. Hadoop is aimed at problems that require examination of all the available data. For example, text analysis and image processing generally require that every single record be read, and often interpreted in the context of similar records. Hadoop uses a technique called MapReduce to carry out this exhaustive analysis quickly.

In the previous section, we saw that HDFS distributes blocks from a single file among a large number of servers for reliability. Hadoop takes advantage of this data distribution by pushing the work involved in an analysis out to many different servers. Each of the servers runs the analysis on its own block from the file. Results are collated and digested into a single result after each piece has been analyzed.
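The MapReduce pattern described above can be sketched in a few lines of ordinary Python. This is a toy, single-machine illustration and not Hadoop's API: the function names are hypothetical, a thread pool stands in for the servers of a cluster, and each "block" is assumed to be a whole line of text so that no word is cut in half. The example counts words across all blocks, a typical analysis that must read every record.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_phase(block):
    """Analyze one block on its own: emit a (word, 1) pair for every word."""
    return [(word.lower(), 1) for word in block.split()]


def shuffle(mapped_outputs):
    """Collate the per-block results by grouping values under each key."""
    groups = defaultdict(list)
    for pairs in mapped_outputs:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Digest all the values for one key into a single result."""
    return key, sum(values)


def run_job(blocks):
    """Run map on every block in parallel, then shuffle and reduce."""
    # Each worker thread plays the role of a server analyzing its own block.
    with ThreadPoolExecutor() as pool:
        mapped = list(pool.map(map_phase, blocks))
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Hypothetical product-review text, one block per line.
    blocks = [
        "great product great price",
        "poor product",
        "great service",
    ]
    for word, count in sorted(run_job(blocks).items()):
        print(word, count)
```

The map step needs nothing but its own block, which is why the work can be pushed out to whichever server already holds that block. Only the much smaller intermediate results have to be brought together for the reduce step.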
Hadoop takes advantage of HDFS's data distribution strategy to push work out to many nodes in a cluster. This allows analyses to run in parallel and eliminates the bottlenecks imposed by monolithic storage systems.
Summary
Hadoop's MapReduce and HDFS use simple, robust techniques on inexpensive computer systems to deliver very high data availability and to analyze enormous amounts of information quickly. Hadoop offers enterprises a powerful new tool for managing big data.

For more information, please contact Cloudera at:
info@cloudera.com
+1 650 362 0488
http://www.cloudera.com/