What Is Hadoop

What Is Hadoop?
Managing Big Data in the Enterprise
Introduction
Data volumes are growing much faster than compute power. This growth demands new strategies for processing and analyzing information.

According to IDC [1], the amount of digital information produced in 2011 will be ten times that produced in 2006: 1,800 exabytes. The majority of this data will be unstructured, complex data poorly suited to management by structured storage systems like relational databases.

Unstructured data comes from many sources and takes many forms: web logs, text files, sensor readings, user-generated content like product reviews or text messages, audio, video and still imagery, and more.

Large volumes of complex data can hide important insights. Are there buying patterns in point-of-sale data that can forecast demand for products at particular stores? Do RFID tag reads show anomalies in the movement of goods during distribution? Do user logs from a web site, or calling records in a mobile network, contain information about relationships among individual customers? Can a collection of nucleotide sequences be assembled into a single gene? Companies that can extract facts like these from the huge volume of data can better control processes and costs, can better predict demand and can build better products.

Dealing with big data requires two things:
- Inexpensive, reliable storage; and
- New tools for analyzing unstructured and structured data.
Apache Hadoop is a powerful open source software platform that addresses both of these problems. Hadoop is an Apache Software Foundation project. Cloudera offers commercial support and services to Hadoop users.

[1] An Updated Forecast of Worldwide Information Growth Through 2011, IDC, March 2008.
Reliable Storage: HDFS
Major Internet properties like Google, Amazon, Facebook and Yahoo! have pioneered the use of networks of inexpensive computers for large-scale data storage and processing. HDFS uses these techniques to store enterprise data.

Hadoop includes a fault-tolerant storage system called the Hadoop Distributed File System, or HDFS. HDFS is able to store huge amounts of information, scale up incrementally and survive the failure of significant parts of the storage infrastructure without losing data.

Hadoop creates clusters of machines and coordinates work among them. Clusters can be built with inexpensive computers. If one fails, Hadoop continues to operate the cluster without losing data or interrupting work, by shifting work to the remaining machines in the cluster.

HDFS manages storage on the cluster by breaking incoming files into pieces, called blocks, and storing each of the blocks redundantly across the pool of servers. In the common case, HDFS stores three complete copies of each file by copying each piece to three different servers.
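The splitting and replication just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the server names and the round-robin placement rule are hypothetical, the block size is tiny so the output is easy to read, and real HDFS uses far larger blocks and its own placement policy.

```python
def split_into_blocks(data: bytes, block_size: int) -> list[bytes]:
    """Break a file's contents into fixed-size blocks (the last may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks: int, servers: list[str],
                   replication: int = 3) -> dict[int, list[str]]:
    """Assign each block to `replication` distinct servers.

    Round-robin placement is an assumption made for this sketch;
    it is not the policy HDFS itself uses.
    """
    if replication > len(servers):
        raise ValueError("need at least as many servers as replicas")
    return {
        block_id: [servers[(block_id + r) % len(servers)]
                   for r in range(replication)]
        for block_id in range(num_blocks)
    }


# Hypothetical five-server cluster storing one small file.
servers = ["A", "B", "C", "D", "E"]
data = b"abcdefghijklmnopqrstuvwxyz"
blocks = split_into_blocks(data, block_size=6)
placement = place_replicas(len(blocks), servers)

for block_id, holders in placement.items():
    print(f"block {block_id} -> servers {holders}")
```

Because every block lives on three different servers, the file can be reassembled from any server set that still holds at least one copy of each block.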
Figure 1: HDFS distributes file blocks among servers

HDFS has several useful features. In the very simple example shown, any two servers can fail, and the entire file will still be available. HDFS notices when a block or a node is lost, and creates a new copy of missing data from the replicas it manages. Because the cluster stores several copies of every block, more clients can read them at the same time without creating bottlenecks.

Other fault-tolerant storage systems are often more expensive than HDFS.

Of course there are many other redundancy techniques, including the various strategies employed by RAID machines. HDFS offers two key advantages over RAID: it requires no special hardware, since it can be built from commodity servers, and it can survive more kinds of failure: a disk, a node on the network or a network interface.

The one obvious objection to HDFS, its consumption of three times the necessary storage space for the files it manages, is not so serious, given the plummeting cost of storage. In addition, HDFS offers some real advantages for data processing, as the next section will show.
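The fault-tolerance claim above can be checked with a small simulation. The sketch below assumes a hypothetical placement of five blocks on five servers with three replicas each; the helper names and the repair rule are illustrative and are not HDFS code.

```python
from itertools import combinations

# Hypothetical placement: block id -> the three servers holding a replica.
placement = {
    1: ["A", "B", "C"],
    2: ["B", "C", "D"],
    3: ["C", "D", "E"],
    4: ["D", "E", "A"],
    5: ["E", "A", "B"],
}
servers = ["A", "B", "C", "D", "E"]


def available_blocks(placement, failed):
    """Return the blocks that still have at least one replica on a live server."""
    return {block for block, holders in placement.items()
            if any(s not in failed for s in holders)}


def re_replicate(placement, servers, failed, replication=3):
    """Copy under-replicated blocks from surviving replicas to other live servers."""
    live = [s for s in servers if s not in failed]
    repaired = {}
    for block, holders in placement.items():
        survivors = [s for s in holders if s not in failed]
        if not survivors:
            raise RuntimeError(f"block {block} has no surviving replica")
        candidates = [s for s in live if s not in survivors]
        needed = max(0, min(replication, len(live)) - len(survivors))
        repaired[block] = survivors + candidates[:needed]
    return repaired


# Any two servers can fail and every block is still readable.
for pair in combinations(servers, 2):
    assert available_blocks(placement, set(pair)) == set(placement)

# After servers A and B fail, the missing copies are rebuilt on the live servers.
repaired = re_replicate(placement, servers, failed={"A", "B"})
print(repaired)
```

In this placement a third simultaneous failure can make a block unreadable, which is why the lost replicas are rebuilt as soon as a failure is noticed.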
Hadoop for Big Data Analysis
Many popular tools for enterprise data management (relational database systems, for example) are designed to make simple queries run quickly. They use techniques like indexing to examine just a small portion of all the available data in order to answer a question.

Hadoop is designed for large-scale analyses that need to examine all the data in a repository.

Hadoop is a different sort of tool. Hadoop is aimed at problems that require examination of all the available data. For example, text analysis and image processing generally require that every single record be read, and often interpreted in the context of similar records. Hadoop uses a technique called MapReduce to carry out this exhaustive analysis quickly.

In the previous section, we saw that HDFS distributes blocks from a single file among a large number of servers for reliability. Hadoop takes advantage of this data distribution by pushing the work involved in an analysis out to many different servers. Each of the servers runs the analysis on its own block from the file. Results are collated and digested into a single result after each piece has been analyzed.

Hadoop takes advantage of HDFS's data distribution strategy to push work out to many nodes in a cluster. This allows analyses to run in parallel and eliminates the bottlenecks imposed by monolithic storage systems.
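The shape of such an analysis can be shown with a word count, a common teaching example for MapReduce. The sketch below is plain Python that runs in one process, not the Hadoop API; the function names and the sample blocks are made up for illustration. In a real cluster each call to the map function would run on a server that stores the block.

```python
from collections import defaultdict


def map_block(block: str) -> list[tuple[str, int]]:
    """Map step: analyze one block on its own and emit (word, 1) pairs."""
    return [(word.lower(), 1) for word in block.split()]


def shuffle(mapped_outputs) -> dict[str, list[int]]:
    """Collate step: group the values emitted for each key across all blocks."""
    grouped = defaultdict(list)
    for pairs in mapped_outputs:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped


def reduce_counts(grouped) -> dict[str, int]:
    """Reduce step: digest each group of values into a single result."""
    return {key: sum(values) for key, values in grouped.items()}


# Three hypothetical blocks of one text file.
blocks = [
    "big data needs big storage",
    "hadoop stores big data",
    "hadoop analyzes data",
]

mapped = [map_block(block) for block in blocks]  # independent per block
word_counts = reduce_counts(shuffle(mapped))
print(word_counts)
```

The map calls share no state, which is what allows them to run in parallel on different servers; only the grouped results need to be brought together.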
Figure 2: Hadoop pushes work out to the data

Running the analysis on the nodes that actually store the data delivers much better performance than reading data over the network from a single centralized server. Hadoop monitors jobs during execution, and will restart work lost due to node failure if necessary. In fact, if a particular node is running very slowly, Hadoop will restart its work on another server with a copy of the data.
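A rough sketch of this scheduling idea follows. It assumes a hypothetical map of block locations and uses a simple least-loaded rule to choose among the servers that hold a replica; the names and the rule are illustrative and do not describe Hadoop's actual scheduler.

```python
# Hypothetical block locations: block id -> servers holding a replica.
block_locations = {
    1: ["A", "B", "C"],
    2: ["B", "C", "D"],
    3: ["C", "D", "E"],
    4: ["D", "E", "A"],
    5: ["E", "A", "B"],
}


def assign_tasks(block_locations):
    """Give each block's task to the least-loaded server that stores the block."""
    load = {}
    assignment = {}
    for block, holders in sorted(block_locations.items()):
        chosen = min(holders, key=lambda s: (load.get(s, 0), s))
        assignment[block] = chosen
        load[chosen] = load.get(chosen, 0) + 1
    return assignment


def reassign(assignment, block_locations, bad_server):
    """Restart tasks from a failed or slow server on another replica holder."""
    updated = dict(assignment)
    for block, server in assignment.items():
        if server == bad_server:
            alternatives = [s for s in block_locations[block] if s != bad_server]
            updated[block] = alternatives[0]
    return updated


assignment = assign_tasks(block_locations)
print("initial:", assignment)

# Server A fails or runs very slowly: its work moves to another copy of the data.
recovered = reassign(assignment, block_locations, bad_server="A")
print("after reassigning A's work:", recovered)
```

In both the initial and the recovered assignment every task runs on a server that already stores its block, so no block has to be read over the network.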
Summary
Hadoop's MapReduce and HDFS use simple, robust techniques on inexpensive computer systems to deliver very high data availability and to analyze enormous amounts of information quickly. Hadoop offers enterprises a powerful new tool for managing big data.

For more information, please contact Cloudera at:
info@cloudera.com
+1 650 362 0488
http://www.cloudera.com/
