
What Is Hadoop?

Managing Big Data in the Enterprise
Introduction
Data volumes are growing much faster than compute power. This growth demands new strategies for processing and analyzing information. According to IDC [1], the amount of digital information produced in 2011 will be ten times that produced in 2006: 1,800 exabytes. The majority of this data will be unstructured, complex data poorly suited to management by structured storage systems like relational databases.

Unstructured data comes from many sources and takes many forms: web logs, text files, sensor readings, user-generated content like product reviews or text messages, audio, video, still imagery and more.

Large volumes of complex data can hide important insights. Are there buying patterns in point-of-sale data that can forecast demand for products at particular stores? Do RFID tag reads show anomalies in the movement of goods during distribution? Do user logs from a web site, or calling records in a mobile network, contain information about relationships among individual customers? Can a collection of nucleotide sequences be assembled into a single gene? Companies that can extract facts like these from the huge volume of data can better control processes and costs, can better predict demand and can build better products.

Dealing with big data requires two things: inexpensive, reliable storage; and new tools for analyzing unstructured and structured data.

Apache Hadoop is a powerful open source software platform that addresses both of these problems. Hadoop is an Apache Software Foundation project. Cloudera offers commercial support and services to Hadoop users.

[1] An Updated Forecast of Worldwide Information Growth Through 2011, IDC, March 2008.

Reliable Storage: HDFS
Major Internet properties like Google, Amazon, Facebook and Yahoo! have pioneered the use of networks of inexpensive computers for large-scale data storage and processing. HDFS uses these techniques to store enterprise data.

Hadoop includes a fault-tolerant storage system called the Hadoop Distributed File System, or HDFS. HDFS is able to store huge amounts of information, scale up incrementally and survive the failure of significant parts of the storage infrastructure without losing data.

Hadoop creates clusters of machines and coordinates work among them. Clusters can be built with inexpensive computers. If one fails, Hadoop continues to operate the cluster without losing data or interrupting work, by shifting work to the remaining machines in the cluster.

HDFS manages storage on the cluster by breaking incoming files into pieces, called blocks, and storing each of the blocks redundantly across the pool of servers. In the common case, HDFS stores three complete copies of each file by copying each piece to three different servers.
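The block-and-replica idea described above can be sketched in a few lines of Python. This is only an illustration, not HDFS code: the function names, the tiny block size and the round-robin placement rule are all assumptions made for the example, and the placement policy HDFS actually uses is more involved.

```python
def split_into_blocks(data: bytes, block_size: int) -> list[bytes]:
    """Break a file's contents into fixed-size blocks (the last may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks: int, servers: list[str], replication: int = 3) -> dict[int, list[str]]:
    """Assign each block to `replication` distinct servers.

    Assumption: simple round-robin placement, chosen only to keep the
    sketch short. It is not the policy a real HDFS cluster applies.
    """
    if replication > len(servers):
        raise ValueError("need at least as many servers as replicas")
    return {
        block: [servers[(block + r) % len(servers)] for r in range(replication)]
        for block in range(num_blocks)
    }


if __name__ == "__main__":
    blocks = split_into_blocks(b"example file contents", block_size=5)
    placement = place_replicas(len(blocks), ["s1", "s2", "s3", "s4", "s5"])
    for block, holders in placement.items():
        print(block, blocks[block], holders)
```

Because every block lands on three different servers, losing one or two machines never removes the last copy of any block.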
Figure 1: HDFS distributes file blocks among servers

HDFS has several useful features. In the very simple example shown, any two servers can fail, and the entire file will still be available. HDFS notices when a block or a node is lost, and creates a new copy of missing data from the replicas it manages. Because the cluster stores several copies of every block, more clients can read them at the same time without creating bottlenecks.

Other fault-tolerant storage systems are often more expensive than HDFS.

Of course there are many other redundancy techniques, including the various strategies employed by RAID machines. HDFS offers two key advantages over RAID: it requires no special hardware, since it can be built from commodity servers, and it can survive more kinds of failure: a disk, a node on the network or a network interface.

The one obvious objection to HDFS, its consumption of three times the necessary storage space for the files it manages, is not so serious, given the plummeting cost of storage. In addition, HDFS offers some real advantages for data processing, as the next section will show.
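The claim that any two servers can fail without losing the file can be checked mechanically. The sketch below assumes a five-server, five-block layout consistent with the simple example (each block stored on three servers); the server names, the exact layout and both helper functions are illustrative assumptions, not part of HDFS.

```python
from itertools import combinations

# Assumed layout: server name -> set of block numbers it stores.
# Every block 1..5 appears on exactly three servers.
LAYOUT = {
    "s1": {2, 4, 5},
    "s2": {1, 3, 4},
    "s3": {2, 3, 4},
    "s4": {1, 2, 5},
    "s5": {1, 3, 5},
}


def file_available(layout, failed):
    """True if every block still has at least one replica on a live server."""
    all_blocks = set().union(*layout.values())
    surviving = set()
    for server, blocks in layout.items():
        if server not in failed:
            surviving |= blocks
    return surviving >= all_blocks


def survives_any_failures(layout, k):
    """True if the file stays available no matter which k servers fail."""
    return all(file_available(layout, set(f)) for f in combinations(layout, k))


def re_replicate(layout, failed, replication=3):
    """Copy under-replicated blocks onto live servers that lack them."""
    alive = {s: set(b) for s, b in layout.items() if s not in failed}
    for block in sorted(set().union(*layout.values())):
        holders = [s for s in alive if block in alive[s]]
        if not holders:
            continue  # every replica was lost, so there is nothing to copy from
        candidates = sorted((s for s in alive if block not in alive[s]),
                            key=lambda s: (len(alive[s]), s))
        for target in candidates[:max(0, replication - len(holders))]:
            alive[target].add(block)
    return alive


if __name__ == "__main__":
    print(survives_any_failures(LAYOUT, 2))
    print(re_replicate(LAYOUT, failed={"s1"}))
```

With three replicas per block, two failures can remove at most two copies of any block, so one always remains; three simultaneous failures can remove all copies of a block, which is why restoring lost replicas promptly matters.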

Hadoop for Big Data Analysis
Many popular tools for enterprise data management (relational database systems, for example) are designed to make simple queries run quickly. They use techniques like indexing to examine just a small portion of all the available data in order to answer a question.

Hadoop is designed for large-scale analyses that need to examine all the data in a repository.

Hadoop is a different sort of tool. Hadoop is aimed at problems that require examination of all the available data. For example, text analysis and image processing generally require that every single record be read, and often interpreted in the context of similar records. Hadoop uses a technique called MapReduce to carry out this exhaustive analysis quickly.

In the previous section, we saw that HDFS distributes blocks from a single file among a large number of servers for reliability. Hadoop takes advantage of this data distribution by pushing the work involved in an analysis out to many different servers. Each of the servers runs the analysis on its own block from the file. Results are collated and digested into a single result after each piece has been analyzed.
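The map-then-collate pattern described above can be illustrated with a word count, a common introductory MapReduce example. The sketch runs in a single Python process and does not use Hadoop's API; the function names are hypothetical, and the blocks are assumed to be whole strings of text, whereas real blocks are byte ranges that may split a record.

```python
from collections import defaultdict


def map_block(block: str) -> list[tuple[str, int]]:
    """Map step: emit a (word, 1) pair for every word in one block."""
    return [(word.lower(), 1) for word in block.split()]


def shuffle(pairs):
    """Group all emitted values by key so each word can be reduced once."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(grouped) -> dict[str, int]:
    """Reduce step: collapse each word's list of ones into a total."""
    return {word: sum(values) for word, values in grouped.items()}


def word_count(blocks: list[str]) -> dict[str, int]:
    mapped = []
    # On a cluster, each block would be mapped on a server that stores it,
    # in parallel; here the blocks are simply processed one after another.
    for block in blocks:
        mapped.extend(map_block(block))
    return reduce_counts(shuffle(mapped))


if __name__ == "__main__":
    print(word_count(["big data", "big cluster"]))
```

The map step needs only its own block, which is what lets the work be spread across servers; only the much smaller intermediate pairs have to be brought together for the final result.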


Hadoop takes advantage of HDFS's data distribution strategy to push work out to many nodes in a cluster. This allows analyses to run in parallel and eliminates the bottlenecks imposed by monolithic storage systems.


Figure 2: Hadoop pushes work out to the data

Running the analysis on the nodes that actually store the data delivers much better performance than reading data over the network from a single centralized server. Hadoop monitors jobs during execution, and will restart work lost due to node failure if necessary. In fact, if a particular node is running very slowly, Hadoop will restart its work on another server with a copy of the data.
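A rough sketch of this scheduling idea follows: each block's task goes to a live server that already stores a replica, and if a server fails the assignment is simply recomputed without it. The layout, server names and the least-loaded tie-breaking rule are assumptions for illustration; Hadoop's actual scheduler is considerably more sophisticated.

```python
# Assumed layout: server name -> set of block numbers it stores.
LAYOUT = {
    "s1": {2, 4, 5},
    "s2": {1, 3, 4},
    "s3": {2, 3, 4},
    "s4": {1, 2, 5},
    "s5": {1, 3, 5},
}


def assign_tasks(layout, failed=frozenset()):
    """Map each block to a live server that holds it, preferring idle servers."""
    load = {server: 0 for server in layout if server not in failed}
    assignment = {}
    for block in sorted(set().union(*layout.values())):
        holders = [s for s in load if block in layout[s]]
        if not holders:
            raise RuntimeError(f"no live replica of block {block}")
        # Pick the holder with the fewest tasks so far (name breaks ties).
        chosen = min(holders, key=lambda s: (load[s], s))
        assignment[block] = chosen
        load[chosen] += 1
    return assignment


if __name__ == "__main__":
    print(assign_tasks(LAYOUT))
    # After s1 fails, its tasks move to other servers holding the same blocks.
    print(assign_tasks(LAYOUT, failed={"s1"}))
```

Because every block has replicas on several servers, a failed or slow node never forces data to be moved before its work can be restarted elsewhere.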

Summary
Hadoop's MapReduce and HDFS use simple, robust techniques on inexpensive computer systems to deliver very high data availability and to analyze enormous amounts of information quickly. Hadoop offers enterprises a powerful new tool for managing big data.

For more information, please contact Cloudera at:
info@cloudera.com
+1 650 362 0488
http://www.cloudera.com/
