
Inside Hadoop

Hadoop Distributed File System (HDFS)

- Allows your big data to be stored across an entire cluster in a distributed, reliable manner, and allows your applications to access and analyze that data quickly and reliably.

To save and preview data in HDFS:


Above is the data preview for the 100,000-rating dataset from the MovieLens website. We can also rename the file and download it back to our computer. We used HDFS through a web interface, so this is an HTTP interface to HDFS that lets us view and manipulate all the files in our HDFS file system.

The command line interface: manipulate your HDFS file system from a command line.

lusis-mbp:~ Lusi$ ssh maria_dev@127.0.0.1 -p 2222
maria_dev@127.0.0.1's password:
[maria_dev@sandbox ~]$

I typed in the first line and the account's password, and I am in!

To list the contents of my HDFS home directory for maria_dev, we type the following:

[maria_dev@sandbox ~]$ hadoop fs -ls
Found 3 items
drwxr-xr-x - maria_dev hdfs 0 2017-09-15 15:10 .Trash
drwxr-xr-x - maria_dev hdfs 0 2017-09-15 15:10 files-view
drwxr-xr-x - maria_dev hdfs 0 2017-09-07 20:03 hive

Now we want to create a directory in here to hold the Movielens ratings data.

[maria_dev@sandbox ~]$ hadoop fs -mkdir ml-100k

To make sure the new directory is successfully created, we run the ls command again.

[maria_dev@sandbox ~]$ hadoop fs -ls


Found 4 items
drwxr-xr-x - maria_dev hdfs 0 2017-09-15 15:10 .Trash
drwxr-xr-x - maria_dev hdfs 0 2017-09-15 15:10 files-view
drwxr-xr-x - maria_dev hdfs 0 2017-09-07 20:03 hive
drwxr-xr-x - maria_dev hdfs 0 2017-10-09 18:06 ml-100k

We want to get some data, so we type in wget, a Linux command that retrieves files from the web:


Above is what happened after typing wget followed by the dataset's web address. In detail:



As we can see, u.data has been downloaded. Now we want u.data to be copied from the local file system on this host into HDFS, and let's make sure we actually did that:



To remove the u.data file and then remove the ml-100k directory:



There is so much more we can do in Hadoop.



MapReduce Fundamental Concepts

The MAPPER converts raw source data into key/value pairs. For instance, our MovieLens data will be the input data, and the MAPPER will transform the data into K:V pairs (the key is the user ID and the value is the movie that user rated). If the same key appears several times at this stage, that is fine (e.g. Bob rated several movies).

After identifying the users and their ratings, we shuffle them around and group each user's ratings under that user, e.g. user 136 rated movie 1, movie 2, and movie 3.

Then we proceed to the REDUCER, which processes each key's (user's) values. For instance, user 136 rated three movies, so it is reduced by len(movies) to 136:3.
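The map, shuffle, and reduce stages described above can be sketched in plain Python (this is a simulation of the flow, not Hadoop itself; the sample lines and the tab-separated userID/movieID/rating/timestamp layout are assumed from the u.data format):

```python
from collections import defaultdict

# Sample input lines in the assumed u.data layout:
# userID \t movieID \t rating \t timestamp
raw = [
    "136\t50\t5\t881250949",
    "136\t172\t5\t881250950",
    "136\t133\t1\t881250951",
    "244\t51\t2\t880606923",
]

# MAPPER: emit (userID, movieID) pairs; duplicate keys are fine.
pairs = []
for line in raw:
    userID, movieID, rating, timestamp = line.split("\t")
    pairs.append((userID, movieID))

# SHUFFLE: group all values under their key.
grouped = defaultdict(list)
for user, movie in pairs:
    grouped[user].append(movie)

# REDUCER: count how many movies each user rated.
counts = {user: len(movies) for user, movies in grouped.items()}
print(counts)  # user 136 -> 3, user 244 -> 1
```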

How many of each movie rating exist?

Before doing that, we installed pip, nano, and mrjob.

To view the data in u.data we need to type in less u.data:



We need to get the script that does the MapReduce. Below we open it in the nano text editor: nano RatingsBreakdown.py


With all that settled, we can run the script above locally (no need for Hadoop at all): python
RatingsBreakdown.py u.data. The counts of the ratings are shown below:



The rating results are shown above. It appears that the 4-star rating is the most popular and the 1-star rating is the least popular. It seems people are generally generous when it comes to rating.

Run the script on the Hadoop cluster:



The -r hadoop option tells mrjob that we actually want to use the Hadoop cluster to run the script this time, using the Hadoop installation on the client node that we are running from. It is going to use the resource manager and an application master to actually distribute the job. This command will map everything and then start reducing it.



Here we found the same result as the previous local run: the four-star rating is the most popular. We have now used Hadoop, MapReduce, Hadoop Streaming, Python, and the mrjob package to find the most popular rating.







Rank movies by their popularity

How many times has each movie been rated?
- running the script locally for simplicity
- we are interested in counting up the number of times each movie has been rated
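The counting logic can be sketched in plain Python (this stands in for the mrjob script, which is not reproduced here; the sample lines and u.data field layout are assumed):

```python
from collections import Counter

# Sample lines in the assumed u.data layout:
# userID \t movieID \t rating \t timestamp
lines = [
    "196\t242\t3\t881250949",
    "186\t302\t3\t891717742",
    "22\t242\t1\t878887116",
]

# Count occurrences of the movieID field (index 1) across all lines.
counts = Counter(line.split("\t")[1] for line in lines)
print(counts["242"])  # movie 242 was rated twice in this sample
```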



Below are just some of the movies counted up, e.g. movieID 1010 was rated 44 times:



Sort the movies by their number of ratings so we can find the most popular and least popular movies:

With some help from my instructor the code is as follows:



There are two steps. The first mapper/reducer pair extracts the movieIDs, emits a 1 for each, and sums up the 1s. The movieID is the key and the value is the number 1, which is emitted by the mapper_get_movieIDs function. In the second step, however, the sum of the movie counts becomes the key and the movieID becomes the value, which is done by the reducer_count_movieIDs function. This way, when we pass through the second step, the shuffle and sort phase will automatically sort things by movie count for us. To make sure the list sorts properly, we convert the count to a string and zero-fill it to five digits. The final function, reducer_sorted_output, runs in the second stage. It emits each unique count with the movies associated with that count, so that in the result the movie column appears on the left and the count column on the right.