
Assignment 8: Introduction to Hadoop

Tasks:
1. Study Hadoop and MapReduce.
2. Write the record reader routine to read a CSV file and extract information (as described in the following sections).
3. Write the Mapper, Combiner and Reducer routines (those that are necessary) to implement the queries described in the following sections.
4. Write the output format routine to print the results of the queries to a .txt file.
Write the code for these queries in R.

Dataset:
The dataset is a CSV file which contains information about songs. The file contains the following information:
Name of the artists who sang the song (artists separated by a delimiter).
Tags, a list of tags for the song. Example: New York, classic, rock etc. The tags are based on both the genre and the lyrics of the song (tags separated by a delimiter).
Title, the title of the song.
Song id, in alphanumeric format.
To download the dataset, use this link.

Information to extract:
From the file, extract the names of the artists, the title and the tags of each song. Group all this information under the song id. All this work is to be done in the record reader phase.
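
The record reader step above can be sketched as follows. This is only an illustration in Python (the assignment itself asks for R), and it rests on assumptions that are not in the handout: a header row with the column names artists, tags, title and song_id, and ";" as the separator inside the artist and tag fields. The sample rows are made up.

```python
import csv
import io

# Hypothetical layout: a header row with these column names, and ";" separating
# multiple artists or tags inside one field. Adjust both to the real file.
MULTI_VALUE_SEP = ";"


def split_multi(field):
    """Split a multi-valued field and drop empty or whitespace-only parts."""
    return [part.strip() for part in field.split(MULTI_VALUE_SEP) if part.strip()]


def read_records(handle):
    """Yield one (song_id, record) pair per CSV row."""
    for row in csv.DictReader(handle):
        record = {
            "title": row["title"].strip(),
            "artists": split_multi(row["artists"]),
            "tags": split_multi(row["tags"]),
        }
        yield row["song_id"].strip(), record


# Made-up sample data standing in for the real dataset.
SAMPLE_CSV = (
    "artists,tags,title,song_id\n"
    "Artist A;Artist B,rock;classic;New York,Song One,S001\n"
    "Artist A,pop,Song Two,S002\n"
)

records = dict(read_records(io.StringIO(SAMPLE_CSV)))
```

Keying every record by song id here means the later phases never have to parse the CSV again; they only see (song id, record) pairs.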

Queries:
The queries are to be implemented in the Mapper, Combiner and Reducer phases. Some of these phases may be empty, depending on the query.
Query 1:
Print the titles and ids of the songs which have more than NUM_TAGS tags.
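
Query 1 is a pure filter, so one plausible reading is that only the map phase does any work and the combiner and reducer stay empty. A minimal Python sketch of that idea (illustrative only, since the assignment asks for R); the record layout and the sample songs are assumptions:

```python
def map_query1(song_id, record, num_tags):
    """Emit (song_id, title) only for songs with more than num_tags tags."""
    if len(record["tags"]) > num_tags:
        yield song_id, record["title"]


# Made-up records in the shape a record reader might produce.
songs = {
    "S001": {"title": "Song One", "tags": ["rock", "classic", "New York"]},
    "S002": {"title": "Song Two", "tags": ["pop"]},
}

NUM_TAGS = 2  # would normally come from user input
query1 = [
    pair
    for song_id, record in songs.items()
    for pair in map_query1(song_id, record, NUM_TAGS)
]
```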

Query 2:
Print the names of the artists who have sung more than NUM_SONGS songs, along with the song names and song ids. If a song is sung by multiple artists together, consider it separately for each artist.
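
One way to split Query 2 across the phases: the mapper emits one pair per artist of a song, the combiner merges the song lists local to a node, and the reducer keeps only artists with more than NUM_SONGS songs. The Python below is a sketch under those assumptions (the assignment asks for R); the in-memory "splits" merely imitate nodes, and all function names and sample data are hypothetical.

```python
from collections import defaultdict


def map_query2(song_id, record):
    """Emit (artist, [song]) once per artist, so shared songs count for each."""
    for artist in record["artists"]:
        yield artist, [(song_id, record["title"])]


def combine_query2(artist, song_lists):
    """Merge the song lists produced on one node into a single list."""
    yield artist, [song for songs in song_lists for song in songs]


def reduce_query2(artist, song_lists, num_songs):
    """Keep the artist only if the merged list has more than num_songs songs."""
    merged = sorted(song for songs in song_lists for song in songs)
    if len(merged) > num_songs:
        yield artist, merged


def group_by_key(pairs):
    """Stand-in for the shuffle: collect all values that share a key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def run_query2(splits, num_songs):
    combined = []
    for split in splits:  # each split plays the role of one node's input
        mapped = [pair for sid, rec in split for pair in map_query2(sid, rec)]
        for artist, lists in group_by_key(mapped).items():
            combined.extend(combine_query2(artist, lists))
    result = {}
    for artist, lists in sorted(group_by_key(combined).items()):
        for key, songs in reduce_query2(artist, lists, num_songs):
            result[key] = songs
    return result


# Made-up sample data.
splits = [
    [("S001", {"title": "Song One", "artists": ["Artist A", "Artist B"]}),
     ("S002", {"title": "Song Two", "artists": ["Artist A"]})],
    [("S003", {"title": "Song Three", "artists": ["Artist A"]})],
]
query2 = run_query2(splits, num_songs=2)
```

The threshold is applied only in the reducer: a combiner sees just one node's share of an artist's songs, so filtering there could drop artists who pass the limit globally.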
Query 3:
This is an extension of Query 2: implement Query 2 with one more condition, namely to consider only those songs which have more than NUM_TAGS tags. Thus, first filter the songs which have more than NUM_TAGS tags, then implement Query 2 on this data.
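
Since Query 3 is "filter, then Query 2", the tag filter can sit at the top of the mapper so that rejected songs never reach the reducer. A compact Python sketch of that arrangement (illustrative; the assignment asks for R). The names and sample records are hypothetical, and the combiner is omitted for brevity.

```python
from collections import defaultdict


def map_query3(song_id, record, num_tags):
    """Drop songs with too few tags, then emit one pair per artist."""
    if len(record["tags"]) <= num_tags:
        return
    for artist in record["artists"]:
        yield artist, (song_id, record["title"])


def reduce_query3(artist, songs, num_songs):
    """Keep artists with more than num_songs songs among the filtered ones."""
    if len(songs) > num_songs:
        yield artist, sorted(songs)


def run_query3(records, num_tags, num_songs):
    grouped = defaultdict(list)  # stand-in for the shuffle phase
    for song_id, record in records:
        for artist, song in map_query3(song_id, record, num_tags):
            grouped[artist].append(song)
    return {
        key: songs
        for artist in sorted(grouped)
        for key, songs in reduce_query3(artist, grouped[artist], num_songs)
    }


# Made-up sample data.
records = [
    ("S001", {"title": "Song One", "artists": ["Artist A", "Artist B"],
              "tags": ["rock", "classic", "New York"]}),
    ("S002", {"title": "Song Two", "artists": ["Artist A"], "tags": ["pop"]}),
    ("S003", {"title": "Song Three", "artists": ["Artist A"],
              "tags": ["pop", "dance"]}),
    ("S004", {"title": "Song Four", "artists": ["Artist B"],
              "tags": ["rock", "live", "classic"]}),
]
query3 = run_query3(records, num_tags=2, num_songs=1)
```

In this sample Artist A appears on three songs, but only one of them has more than two tags, so Artist A is dropped while Artist B is kept.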
Query 4:
In this query we develop an index for artist names, i.e. given an artist name you shall be able to retrieve the names of all the songs sung by that artist. Example: on input Arjith Singh, print all the songs sung by Arjith Singh.
Important: you shall not search the whole dataset to find the songs sung by Arjith Singh.
Write code to construct an index for artist names. Corresponding to each artist, store all the songs sung by him/her. Since it is an index on artist names, it shall be sorted on artist names. On a query, search in this index and print the results.
Also print the whole index to a separate file once. An index like this is called an inverted index.
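
A sorted inverted index allows lookup by binary search, which is one way to satisfy the "do not search the whole dataset" requirement. The following Python sketch is an illustration only (the assignment asks for R); the function names, the tab-separated file layout and the sample songs are assumptions.

```python
import bisect
from collections import defaultdict


def build_index(records):
    """Return (artists, titles): artist names sorted, titles[i] belongs to artists[i]."""
    postings = defaultdict(list)
    for _song_id, record in records:
        for artist in record["artists"]:
            postings[artist].append(record["title"])
    artists = sorted(postings)
    return artists, [sorted(postings[artist]) for artist in artists]


def lookup(index, artist):
    """Binary-search the sorted artist list instead of scanning the dataset."""
    artists, titles = index
    pos = bisect.bisect_left(artists, artist)
    if pos < len(artists) and artists[pos] == artist:
        return titles[pos]
    return []


def write_index(index, path):
    """Write one line per artist: name, a tab, then the comma-separated titles."""
    artists, titles = index
    with open(path, "w", encoding="utf-8") as out:
        for artist, songs in zip(artists, titles):
            out.write(f"{artist}\t{', '.join(songs)}\n")


# Made-up sample data.
records = [
    ("S001", {"title": "Song One", "artists": ["Arjith Singh", "Artist B"]}),
    ("S002", {"title": "Song Two", "artists": ["Artist B"]}),
    ("S003", {"title": "Song Three", "artists": ["Arjith Singh"]}),
]
index = build_index(records)
found = lookup(index, "Arjith Singh")
```

The index is built once from the dataset; every later query touches only the sorted artist list and the matching entry.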

Input/Output:
Get NUM_TAGS, NUM_SONGS and the artist name as input. Construct a menu first to select the query, then to get the required inputs.
Print the output of each query and the constructed index to .txt files with appropriate formatting and names.
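
The handout does not fix an output layout, so the following is just one possible formatting routine, sketched in Python rather than the required R. The tab-separated layout, the function names and the sample row are all assumptions.

```python
def format_rows(rows):
    """Turn (key, values) pairs into tab-separated text lines."""
    return [f"{key}\t{', '.join(values)}" for key, values in rows]


def write_output(path, rows):
    """Write the formatted lines to a .txt file, one result per line."""
    with open(path, "w", encoding="utf-8") as out:
        for line in format_rows(rows):
            out.write(line + "\n")


# Made-up Query 2 style result: an artist with song names and ids.
sample_rows = [("Artist A", ["Song One (S001)", "Song Two (S002)"])]
lines = format_rows(sample_rows)
```

A call such as write_output("query2.txt", sample_rows) would then produce one file per query; the file name is only an example.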

Deliverables:
R code for the above functionality compressed as .tar.gz, named <YOUR_ROLL_No>.tar.gz.

**************************************************************************************
LOGIC:
Hadoop has the following phases:
1. record reader. // Reading and parsing the input into records.
2. map. // Execute an operation on each record.
3. combiner. // Do reductions local to a node.
4. partitioner. // Shuffling and sorting. Handled by the framework; can't be altered except by providing a custom Partitioner or Comparator.
5. reduce. // Combine the results from the combiners.
6. output format.
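
To make the order of the phases concrete, here is a toy, single-process imitation of the pipeline in Python. It is not how Hadoop is implemented and not the R code the assignment asks for; the run_job helper, the character-sum partitioner and the sample data are all hypothetical. The example job counts the songs of each artist.

```python
from collections import defaultdict


def run_job(splits, mapper, combiner, reducer, num_reducers=2):
    """Run map, combine, partition/sort and reduce over in-memory splits."""
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for split in splits:  # 1. each split holds records from the record reader
        local = defaultdict(list)
        for key, value in split:
            for out_key, out_value in mapper(key, value):  # 2. map
                local[out_key].append(out_value)
        for out_key, values in local.items():
            for c_key, c_value in combiner(out_key, values):  # 3. combiner
                # 4. partitioner: toy deterministic choice, assumes string keys
                target = sum(map(ord, c_key)) % num_reducers
                partitions[target][c_key].append(c_value)
    output = []
    for partition in partitions:
        for key in sorted(partition):  # keys reach a reducer in sorted order
            output.extend(reducer(key, partition[key]))  # 5. reduce
    return output  # 6. an output format routine would write these pairs


def count_mapper(song_id, artists):
    for artist in artists:
        yield artist, 1


def sum_values(artist, counts):
    """Used as both combiner and reducer, since addition is associative."""
    yield artist, sum(counts)


# Made-up sample data: (song id, list of artists), two splits as two "nodes".
splits = [
    [("S001", ["Artist A", "Artist B"]), ("S002", ["Artist A"])],
    [("S003", ["Artist A"])],
]
counts = dict(run_job(splits, count_mapper, sum_values, sum_values))
```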

The assignment covers all the user-editable phases along with three important uses of Hadoop, i.e.

filtering (finding artists with at least 3 vowels in their names)
numerical summarizations (counting the number of songs of an artist)
indexing.
