
Hadoop Cluster Administration

Lab Guide
September 2015, v5.0

For use with the following courses:


ADM 200
ADM 201
ADM 202
ADM 203

This Guide is protected under U.S. and international copyright laws, and is the exclusive property of
MapR Technologies, Inc.
2015, MapR Technologies, Inc. All rights reserved. All other trademarks cited here are the property of
their respective owners.


Get Started
Icons Used in This Guide
This lab guide uses the following icons to draw attention to different types of information:
Note: Additional information that will clarify something, provide additional
details, or help you avoid mistakes.

CAUTION: Details you must read to avoid potentially serious problems.

Q&A: A question posed to the learner during a lab exercise.

Try This! Extra exercises you can complete to strengthen learning.

Lab Requirements
You will need the following to complete the labs for this course:

Access to a physical or virtual cluster with at least 4 nodes. Instructions on setting up virtual
clusters using either Amazon Web Services (AWS) or Google Cloud Platform (GCP) are included
with the course files.
Visit http://doc.mapr.com/display/MapR/Preparing+Each+Node for information on required
specifications for the nodes.

SSH access to the nodes. For Mac users, this is built into the standard terminal program.
Windows users may need to download and install an additional utility for this (such as PuTTY). If
you are using GCP, you can SSH into the nodes from the GCP console.
You can download PuTTY at http://www.putty.org.

Note: Make sure that you can access the nodes via SSH before starting the labs.


Using This Guide


1. You will select one of your nodes to be the master node. Most of the work will be performed from
the master node, and the other nodes will be accessed from there.
2. When command syntax is presented in this guide, any arguments that are enclosed in chevrons,
<like this>, should be substituted with an appropriate value. For example, this:
# cp <source file> <destination file>
might be entered as this:
# cp /etc/ssh/sshd_config /etc/ssh/sshd_config.bak

Note: Sample commands provide guidance, but do not always reflect exactly what you
will see on the screen. For example, if there is output associated with a command, it
may not be shown.

Tips for Using AWS and GCP Clusters


Using AWS Clusters
Use these conventions when working with an AWS cluster for the labs:
AWS Clusters Information
IP Addresses

Each node in your cluster will have both an internal IP address and an external
IP address. To view them, log into the AWS console and click on the node.

Use the external IP address to connect to a node in your cluster from
a node outside the cluster (for example, from a terminal window on
your laptop).

Use the internal IP address to connect from one node in the cluster to
another.

Default <user>

The default AWS user name is ec2-user. Whenever you see <user> in a
command sample in the lab guide, substitute ec2-user.

Log in as root

The user ec2-user has sudo privileges. To log into a node as root, first log in
as ec2-user and then sudo to root:
$ sudo -i

SSH access

To connect to a node in your cluster, use the .pem file that was provided with
the course materials (or that you downloaded when you created your AWS
instances). For example:
ssh -i <.pem file> ec2-user@<external IP address>


Using GCP Clusters


Use these conventions when working with a GCP cluster for the labs:
GCP Clusters Information
IP Addresses

Each node in your cluster will have both an internal IP address and an external
IP address. To determine the IP addresses of your nodes, run this command
from your terminal window (where you installed the Google Cloud SDK):
gcloud compute instances list

To connect to a node from a system outside the cluster, such as your
laptop, use the SSH button in the Google Developer's Console (see the
information below on SSH access).

Use the internal IP address to connect from one node in the cluster to
another.

Default <user>

The default user name will be based on the Google account under which your
project was created. If you are unsure of the default user name, connect to one
of your GCP nodes (see SSH access, below). The login prompt displayed will
include your user name. For example:
[username@node1 ~]$

Log in as root

The default user has sudo privileges. To log into a node as root, first log in as
the default user and then sudo to root:
$ sudo -i

SSH access

To connect to a node in your cluster:


1. Log into the Google Developer's Console and open your project.
2. Navigate to Compute > Compute Engine > VM instances.
3. Click the SSH button to the right of the node you want to connect to.

Lab GS1: SSH Into Your Nodes


Estimated time to complete: 10 minutes

Overview
The purpose of this lab is to make sure you can connect to the nodes you will be using throughout the
course. This is required for all of the remaining labs.


Note: Instructions in this section are specific to the nodes that are used in the classroom
training. If you are in the classroom training, make sure you download the course files to your
system before beginning.
If you are taking the on-demand training, you will need to provide your own nodes, and the
method for connecting to those nodes may differ from the instructions presented here.

If you are using GCP nodes, you can SSH into them directly from the Google
Developer's Console.

If you are using AWS nodes, you should have downloaded the .pem file when you
created your instances. Windows users of AWS nodes will need to convert the .pem
file to a .ppk file: instructions for doing that can be found in the AWS documentation,
last seen here: http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/putty.html.

Connect to your Nodes: Windows with PuTTY


This procedure assumes that you are using PuTTY as your terminal application. Other applications will
work, but you will need to adjust the instructions accordingly.
If you are using a Mac or Linux system, skip to the next section for instructions.
1. Open a PuTTY window. In the Host Name (or IP address) field, enter the external IP address
for the master node in your cluster.
2. In the Category list on the left-hand side, navigate to Connect > SSH > Auth.


3. A new screen will appear when you click Auth in the menu. Browse for the .ppk file (supplied
with the course files for classroom training), and click Open to open a terminal window.

4. Once the terminal window opens up, log in as the default user (ec2-user, for AWS nodes). A
password is not required since you are using the .ppk file to authenticate. Once logged in, you
will see the command prompt, and be able to sudo to the root user.

5. Log out of the node, and repeat for the other nodes in your cluster. You can also open multiple
PuTTY windows to have access to multiple nodes at the same time.


Connect to your Nodes: Mac or Linux


Follow these instructions to SSH into your nodes from a Mac or Linux system.
1. Set permissions on the .pem file (supplied with the course files for classroom training) to 600, if
they are not already set correctly:
$ chmod 600 students07172012.pem
2. Use a terminal window to SSH into the master node in your assigned cluster:
$ ssh -i students07172012.pem <user>@<external IP address>
Make sure to use the external IP address for the node. On AWS clusters, <user> will be ec2-user.

Once logged in, you should see the command prompt. You should also be able to sudo to the
root user:
$ sudo -i
3. Log out of the node, and repeat for the other nodes in your cluster.

Lab GS2: Set Up Passwordless SSH


In the Prepare for Installation labs, you will use clustershell to copy files from the master node to other
nodes in the cluster. For this to work, you must have passwordless SSH set up between your nodes. Set
up passwordless SSH only on the first 3 nodes in your cluster.

Note: Follow the instructions in Appendix A of the lab guide for details on how to set up
passwordless SSH. These instructions were written for the AWS nodes that are used in the
classroom training; this procedure may be different for on-demand training students.
Classroom students: Check with your instructor before proceeding, as passwordless SSH may
have been set up in advance.
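Once passwordless SSH is in place, a quick check from the master node should return each node's
hostname without prompting for a password (a minimal sketch; substitute the internal IP address or
hostname of each of your nodes):
# ssh root@<node internal IP address or hostname> hostname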


Lesson 1: Prepare for Installation


Lab 1.1: Audit the Cluster
Estimated time to complete: 15 minutes

Install the Pre-Install and Post-Install Tools


1. Log in to the master node as root.
2. Download the zip file:
wget http://course-files.mapr.com/ADM200.zip
3. Extract the files from ADM200.zip; this will create two directories (post-install and pre-install).
Verify that the directories exist, and contain files.
# unzip ADM200.zip
# ls pre-install
# ls post-install

Install and Configure Clustershell


1. Install the package clustershell-1.6-1.el6.noarch.rpm, located in the pre-install directory.
# rpm -i pre-install/clustershell-1.6-1.el6.noarch.rpm
2. Edit the /etc/clustershell/groups file to include all the internal IP addresses for the three
nodes in your cluster, separated by spaces or commas. The file should contain just this line:
all: <IP address master node> <IP address node 1> <IP address node 2>

Note: Be sure to use the internal IP addresses for your nodes.

For example:
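(The IP addresses below are hypothetical; substitute the internal IP addresses of your own nodes.)
all: 10.0.0.11 10.0.0.12 10.0.0.13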


3. Test that clush provides passwordless access to all the nodes of the cluster:
# clush -a date

Copy the Scripts


1. Copy the /root/pre-install and /root/post-install directories to all of your nodes:
# clush -a --copy /root/pre-install /root/post-install
2. When that completes, confirm that all of the nodes have a copy of the directories:
# clush -Ba ls -la /root/ | grep pre-install
# clush -Ba ls -la /root/ | grep post-install

Run a Cluster Audit


1. From the master node, run an audit of the nodes in the cluster:
# /root/pre-install/cluster-audit.sh | tee cluster-audit.log

Note: On some OS versions, an SSH bug causes tcgetattr: Invalid argument
errors to be displayed to the screen during the audit. These can be ignored.

2. View the output file to evaluate your cluster hardware. Look for hardware or firmware levels that
are mismatched, or nodes that don't meet the baseline requirements to install Hadoop.


Lab 1.2: Run Pre-Install Tests


Evaluate Network Bandwidth
Estimated time to complete: 5 minutes
1. As root on the master node, type this command to start the network test:
# /root/pre-install/network-test.sh | tee network-test.log
Press enter at the prompt to continue. This runs the RPC test to validate the network
bandwidth. This test will take a few minutes to run.
Note: Results should be about 90% of peak bandwidth. So with a 1GbE network,
expect to see results of about 115MB/sec. With a 10GbE network, look for results
around 1100MB/sec. If you are not seeing results in this range, then you need to
check with your network administrators to verify the connections and firmware.
With virtual clusters, expect to see lower than optimal results.

Evaluate Data Flow


Estimated time to complete: 5 minutes
1. Type the following command to run the stream utility:
# clush -Ba /root/pre-install/memory-test.sh | tee memory-test.log
As with the network performance test, it will take a few minutes to complete.
2. Review the results.
This tests the memory performance of the cluster. The exact bandwidth of memory is highly variable and
is dependent on the speed of the DIMMs, the number of memory channels and, to a lesser degree, the
CPU frequency.

Evaluate Raw Disk Performance


Estimated time to complete: 20 minutes
Caution! This test destroys any existing data on the disks it uses. Make sure the drives do not
have any needed data on them, and that you do not run this test after you have installed MapR
on the cluster.
The first step lets you view the disks that will be used in the test: review the output carefully to
make sure the list contains only intended disks.


1. Type the command below to list the unused disks on each node. These are the disks that IOzone
will run against, so be sure to examine the list carefully.
# clush -ab /root/pre-install/disk-test.sh
2. After you have verified the list of disks is correct, run the command with the --destroy argument:
# clush -ab /root/pre-install/disk-test.sh --destroy

Note: In the lab environment, the test will run for 15-20 minutes, depending on the
number and sizes of the disks on your nodes. In a production environment with a
larger cluster, it can take significantly longer.
The test will generate one output log for each disk on your system. For example:
xvdb-iozone.log
xvdc-iozone.log
xvdd-iozone.log
3. If there are many different drives, the output can be difficult to read. The summIOzone.sh script
creates a summary of the output. Run the script and review the output:
# clush -a '/root/pre-install/summIOzone.sh'
Note: The script assumes that the log files are in the present working directory, so the
script must be run from the directory that contains the log files.
Keep the results of this and the other benchmark tests for post-installation comparison.

Lab 1.3: Plan Service Layout


Estimated time to complete: 10 minutes
You are a system administrator for company ABC, which is just getting started with Hadoop. You will be
installing and configuring a small 3-node cluster for initial deployment, with high availability. The R&D
and Marketing departments will share this cluster.

Q:

What type(s) of nodes (data, control, or control-as-data) will your cluster have?

A:

Since you only have three nodes, you will need to spread the control services (such as
ZooKeeper and CLDB) out over all of the nodes. To do this and still have room for data,
you will need to use control-as-data nodes.


Fill out the chart below to show where the various services will be installed on your cluster. For each
node, indicate which of the following services it will run: Warden, NodeManager, Resource Manager,
HistoryServer, NFS, MFS, CLDB, and ZooKeeper.

Node 1 (in rack A):
Node 2 (in rack B):
Node 3 (in rack C):

Try this! How would you configure a 10-node cluster with the same requirements?

Node 1 (in rack A)


Node 2 (in rack A)
Node 3 (in rack A)
Node 4 (in rack B)
Node 5 (in rack B)
Node 6 (in rack B)
Node 7 (in rack B)
Node 8 (in rack C)
Node 9 (in rack C)
Node 10 (in rack C)


Lesson 2: Install a MapR Cluster


Lab 2.1: Install a MapR Cluster
Estimated time to complete: 45 minutes

Preparation
1. Log in as root on the master node, then download and run the mapr-setup script:
# wget http://package.mapr.com/releases/installer/mapr-setup.sh
# bash ./mapr-setup.sh
This script will prepare the node to use the browser-based installer.
Note: If you see the message, Error: Nothing to do, you can ignore it.

Accept the defaults for the mapr admin user, UID, and GID. Enter mapr as the
password, then re-enter it to confirm.

2. When the script completes, point your browser to the node at port 9443:
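For example (the installer uses HTTPS; substitute the external IP address of your node):
https://<external IP address>:9443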

Note: For virtual clusters, the mapr-setup.sh script may display the internal IP address
of the node. Make sure you use the external IP address of the node.


3. Open the requested URL to continue with the installation. Ignore any warnings you receive about
the connection not being secure, and log into the installer as the user mapr (the password will
also be mapr, unless you changed it when running the mapr-setup.sh script).
4. From the main MapR Installer screen, click Next in the lower right corner to start moving through
the installation process. The first screen is Select Version & Services.

Select Version & Services


The first screen of the installer is Select Version & Services.

1. From the MapR Version pull-down menu, set the MapR Version to 5.0.0.
2. In the Edition field, select Enterprise Database Edition (this is the default).
3. Under License Option, select Add License After Installation Completes.
4. In the Select Services section, the option Data Lake: Common Hadoop Services is selected by
default as the Auto-Provisioning Template. Perform the following actions to review services,
and to set them for the course:
a. Click Show advanced service options.

This displays the services that will be installed with the selected template.


b. Change the Auto-Provisioning Template selection to Data Exploration: Interactive
SQL with Apache Drill to see how it changes the services template. Then select
Operational Analytics: NoSQL database with MapR-DB to see those options.
c. Change the selection to Custom Services and select the following:

HBase/MapR-DB Common
YARN + MapReduce

5. Click Next to advance to the Database Setup screen.

Database Setup
The entries displayed on this screen will depend on which services were selected on the previous screen
(a database needs to be selected for Hue, Oozie, Metrics, or Hive). Since we did not select any services
requiring a database setup, there will be no database to configure.

Click Next to advance to the Set Up the Cluster screen.


Set Up the Cluster


This screen is where you define the MapR Administrator Account, and name your cluster.

1. The MapR Administrator Account section will show the values you entered when you ran the
setup script. In the Password field, enter the password for the mapr user.
2. In the Cluster Name field, enter a name for your cluster. If you are in the instructor-led course,
use the cluster name that was assigned to you in the .hosts file to be sure your cluster name is
not the same as another student's.
3. Click Next to advance to the Configure Nodes screen.

Configure Nodes
This screen is where you define the nodes to include in your cluster, the disks that will be used, and the
authentication method. The hostname of the install node will already be filled in for you.


1. In the Nodes section, enter the fully qualified hostnames of the three nodes that will be in your
cluster, one per line. The hostname of the install node will generally be filled in for you.

Caution! Install the cluster on just your first 3 nodes. The 4th node will be used in a
later lab, and should not have MapR installed on it at this time.

For example:
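(The hostnames below are hypothetical; use the fully qualified hostnames of your own nodes.)
node1.example.com
node2.example.com
node3.example.com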

2. In the Disks section, enter the names of the disks that will be used by the cluster. You can run
this command on a node to verify the disks on your system:
# fdisk -l | grep dev
Enter the disks as a comma-separated list. For example:
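(The device names below are only an illustration; use the unused disks reported on your own nodes,
such as the xvdb, xvdc, and xvdd devices seen in the earlier disk test.)
/dev/xvdb,/dev/xvdc,/dev/xvdd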

3. In Configure Remote Authentication, select SSH Password as the Login Method. Then:

a. Enter root as the SSH Username.
b. Enter mapr as the SSH Password.
c. Leave the SSH Port set to 22.

4. Click Next to advance to the Verify Nodes screen.


Verify Nodes
Verify Nodes verifies that each node can be reached and meets the minimum requirements. When
complete, the node icons will display as green (ready to install), yellow (warning), or red (cannot install).

1. To check the status of a yellow or red node, click on the node icon. A box on the right-hand side
of the screen will appear, with details of any warnings or errors found. For example:

Note: You may see warnings because the nodes do not have swap space. This will
often be the case with virtual AWS or GCP clusters. You can proceed with warnings,
but in a production environment it is best to correct any issues and re-verify the nodes.
2. When ready, click Next to advance to the Configure Service Layout screen.


Configure Service Layout


The Configure Service Layout screen displays the services that will be installed on the nodes. From
here, you can rearrange and add services as well.

1. Click View Node Layout to see which services will be installed on which nodes. After reviewing
the layout, click Close.
2. Click Advanced Configuration and review the screen. Services are divided into logical
groupings; you can change these groupings, or add groups of your own. Make some changes to
the service layout for practice:
a) Look for the DEFAULT group. Drag the NFS service icon to the row below it, to make a
new group.
b) Change the group name to NFS.
c) In the Nodes column of the new group, click the Modify button.

d) Select one or more nodes to be in the NFS group by clicking on the node icons.

e) Click OK. You now have those nodes in the NFS group, so the NFS service will be
installed on them.
f) Click the trashcan icon to the right of the NFS group to delete it. Click OK to verify the
deletion.


g) Scroll up to the top of the page. The NFS service icon is now unprovisioned and must be
placed in a group before the installation can proceed.

h) Drag the NFS service icon back to the DEFAULT group.


3. At the bottom of the page, click Restore Defaults, then OK to confirm. This will restore the
default service layout.
4. Configure the service layout to match what you came up with in the service layout lab for
Company ABC, if it differs from the default layout.

Caution! The service layout you configured for Company ABC does not list all of the
services that will go on a MapR cluster. Do not delete services from this screen that do
not appear on your worksheet; just make sure that the services that DO appear on your
worksheet are laid out the way you intend.

Note: Make note of which node is running the JobHistoryServer, and record its external
IP address. You will need this information later when viewing job history.

5. Drag the Webserver service into the DEFAULT group, so it will be installed on all three nodes.
6. Click Save to save the service layout, then click Install to start the installation process.

Installing MapR
The installation will take approximately 30 minutes to complete on a 3-node lab cluster. Time
required in a production environment will vary based on what is being installed, and the number of nodes.

When the install is complete, click Next. Since you did not install a license at the start of the installation,
a page will appear letting you know that a license must be entered. Click Next to advance to the final
step of the installation.


Lab 2.2: Install a MapR License


1. Launch the MapR Control System (MCS) UI by pointing your browser at the external IP address
of your install node, at port 8443 (or by clicking the link on the last page of the Installer). Ignore
any messages that appear about the connection not being secure, and continue on.
2. Log into the MCS as the user mapr, and accept the license agreement.
3. Click on the Manage Licenses link in the upper right-hand corner of the MCS to open the
licensing window. Then:
a. If you already have a license file, click Add licenses via upload and browse to the
location of the license file. Then click Apply Licenses.
b. If you do not have a license file, click Add licenses via Web. This will prompt you to log
into your mapr.com account (if you have one), or create an account. From there, you can
register your cluster and download a trial license.

4. Some nodes in the cluster may have orange icons in the node heatmap, indicating degraded
service. This is normal since some services were started before the license was applied. You will
typically have to restart CLDB and NFS Gateway services.
To restart any failed services:
a. Find the Services panel on the right side of the dashboard, and note any services that
have failures:

b. Click on the red number in the Fail column to see a list of nodes where the specified
service has failed. Select all the nodes and click the Manage Services button at the top.


c. Find the service you want to restart. From the drop-down list next to the service name,
select Restart. Then click OK.

5. Back at the Services pane, check to see that all the NFS Gateway services are running. You may
not see failures: instead, you may just see no active NFS services:

If they are not running, restart them as you did with the CLDB service.
6. Return to the dashboard. You should see all green node icons (you may have to refresh the
screen).


Lessons Learned
Some of the key takeaways from this lesson are listed below.

The MapR installer will guide you through the installation process, and make sure that
interdependencies are not violated.

The MapR Installer verifies that nodes are ready for installation before proceeding.

Plan your service layout prior to installing the MapR software. In particular:

Make sure that you have identified where the key control services (CLDB, ZooKeeper,
ResourceManager) will be running in the cluster.

Ensure that you have enough instances of the control services to provide high availability
for your organization if it is required.

After installing MapR, apply a license and restart any failed services.


Lesson 3: Verify and Test the Cluster


Lab 3.1: runRWSpeedTest
Estimated time to complete: 10 minutes
The runRWSpeedTest script uses an HDFS API to stress test the IO subsystem.
1. Log into the master node as root.
2. Run the test:
# clush -Ba '/root/post-install/runRWSpeedTest.sh' | tee RWSpeedTest.log
The output provides an estimate of the maximum throughput the I/O subsystem can deliver.
3. Compare the results to the results from the pre-install disk-test.sh output. You should expect
to see about 85 - 90% of the pre-install test results.

Lab 3.2: Run TeraGen & TeraSort


Estimated time to complete: 10 minutes
TeraGen is a MapReduce program that will generate synthetic data. TeraSort samples this data and uses
MapReduce to sort it. These two tests together will challenge the upper limits of a cluster's performance.
1. Log into the master node as root, and create a volume to use with the test:
# maprcli volume create -name benchmarks -mount 1 -path /benchmarks
Note: The on-demand course ADM 201, Configure a MapR Cluster, has more
information on MapR volumes.

2. Verify that the new mount point directory and volume exist:
# hadoop fs -ls /
3. Switch to the mapr user to run the script:
# su mapr
4. Run the TeraGen command:
yarn jar /opt/mapr/hadoop/hadoop-2.7.0/share/hadoop/mapreduce/hadoop-
mapreduce-examples-2.7.0-mapr-1506.jar teragen 500000 /benchmarks/teragen1
This will create 500,000 rows of data.


5. Open the MCS to the dashboard view so you can watch node utilization while the next step
(TeraSort) is running.
a. At the top of the dashboard, set the heatmap to show Disk Space Utilization so you will
see the load on each node. It should be spread relatively evenly across the cluster.
Hotspots suggest a problem with a hard drive or its controller.
b. On the right-hand side of the dashboard, the YARN section will display information on the
job as it is running.

6. Type the following to sort the newly created data:


yarn jar /opt/mapr/hadoop/hadoop-2.7.0/share/hadoop/mapreduce/hadoop-
mapreduce-examples-2.7.0-mapr-1506.jar terasort /benchmarks/teragen1
/benchmarks/terasort1
7. Look at the TeraSort output and analyze how long it takes to perform each step. To drill down in
the results of the TeraSort command:
a. Determine the external IP address of the node that is running the JobHistoryServer (you
recorded this information when you installed the cluster).
b. Point your browser to that node, at port 19888:
https://<external IP address>:19888

Note: If you cannot connect, make sure you are using the external IP
address of the node that is running the JobHistoryServer service.


c. Jobs are listed with the most recent job at the top. Click the Job ID link to see job
details. It will show the number of map and reduce tasks, as well as how many attempts
failed, were killed, or succeeded:

d. To see the results of the map or reduce tasks, click on Map in the Task Type column.
This will show all of the map tasks for that job, their statuses, and the elapsed time.

You can keep drilling down to get detailed information on each task in the job.

Lessons Learned

Running benchmark tests after installation gives you a performance baseline that you can refer to
later, and helps you spot any concerns early on.

Jobs can be monitored with the MCS, or with the JobHistoryServer.


Lesson 4: Configure Cluster Storage


Lab 4.1: Configure Node Topology
Estimated time to complete: 10 minutes
Topology can be changed through the MCS or the command line. In this lab, you will assign each of your
nodes to a separate topology: /data/rack1, /data/rack2, or /data/rack3.
1. In the MCS, navigate to Cluster > Nodes.

2. Select the first node by checking the box next to it.


3. Click Change Topology. The Change Topology dialog box appears.

4. Type in the topology path, /data/rack1, to create and assign the new topology. Then click OK.

5. Repeat the steps to assign /data/rack2 to a different node.


6. Using the maprcli node move command at the command line, assign the /data/rack3
topology to the last node.

For syntax information, enter maprcli node move with no arguments.

To determine the node's server ID, run maprcli node list -columns id.
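For example (a sketch; substitute the server ID reported for your node):
# maprcli node move -serverids <server ID> -topology /data/rack3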


7. Create the /decommissioned topology:


a. Using either the command line or the MCS, move one of the nodes to the
/decommissioned topology to create it.
b. Move the node back to its appropriate topology. The /decommissioned topology will
remain, but have no nodes assigned to it.
8. Verify that the /decommissioned topology was created, and that all of the nodes are assigned to
their correct topologies under /data:
maprcli node topo
maprcli node list -json | grep topo

Lab 4.2: Create Volumes


Estimated time to complete: 30 minutes

Overview
The marketing and R&D departments will be sharing nodes in the cluster, but still want to keep access to
their data isolated. To do this, you will create directories and volumes in the cluster for the departments
and their users.

Add Users and Groups


The table below lists users and groups in the marketing and R&D departments.

Name      Type     UID    GID
mkt       group    --     6000
miner     group    --     6001
sharon    user     600    6000
keith     user     601    6000
rnd       group    --     7000
cobra     group    --     7001
rattler   group    --     7002
jenn      user     700    7000
tucker    user     701    7000
mark      user     702    7000
marje     user     703    7000
porter    user     704    7000


Before they can be assigned volumes or permissions in the cluster, they must exist at the OS level on
each node in the cluster. They must have the same name, UID, and GID on each node.
1. Log into your master node as root.
2. Create UNIX users and groups for all of the entities in the table above. Use clush to facilitate the
operations. For example:
# clush -a
clush> groupadd mkt -g 6000
clush> useradd sharon -g 6000 -u 600
clush>
clush> quit
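A complete set of commands for the table, following the same pattern (one possible sketch; the names,
UIDs, and GIDs come from the table above):
clush> groupadd mkt -g 6000
clush> groupadd miner -g 6001
clush> groupadd rnd -g 7000
clush> groupadd cobra -g 7001
clush> groupadd rattler -g 7002
clush> useradd sharon -g 6000 -u 600
clush> useradd keith -g 6000 -u 601
clush> useradd jenn -g 7000 -u 700
clush> useradd tucker -g 7000 -u 701
clush> useradd mark -g 7000 -u 702
clush> useradd marje -g 7000 -u 703
clush> useradd porter -g 7000 -u 704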

Create Directories and Volumes


Now build a hierarchy to store project data: a set of directories in the cluster (such as /projects) that
contain the dev and prod volumes for each project, as laid out in the table below.

1. Log into the master node as root.


2. Create the parent directories for the volume mount paths shown in the table below. Use the command:
hadoop fs -mkdir -p <path>
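For example, based on the volume mount paths in the table below, the parent directories might be
created like this (a sketch; adjust if your hierarchy differs):
# hadoop fs -mkdir -p /projects/mkt/miner
# hadoop fs -mkdir -p /projects/rnd/cobra
# hadoop fs -mkdir -p /projects/rnd/rattler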
3. Verify that the directories were created in the cluster:
hadoop fs -ls -R /projects
4. The table below shows the accountable entity (AE), mount path, and quotas for each volume.
Use this information to create the dev and prod volumes for each project. Create some of the
volumes using each of the below methods:

The command line (maprcli volume create); see the example after the table

The MCS (navigate to MapR-FS > Volumes and click New Volume)


Volume Name    AE       Mount Path                           Advisory Quota   Hard Quota
miner-dev      miner    /projects/mkt/miner/miner-dev        70 GB            100 GB
miner-prod     miner    /projects/mkt/miner/miner-prod       100 GB           130 GB
cobra-dev      cobra    /projects/rnd/cobra/cobra-dev        70 GB            100 GB
cobra-prod     cobra    /projects/rnd/cobra/cobra-prod       100 GB           130 GB
rattler-dev    rattler  /projects/rnd/rattler/rattler-dev    200 GB           220 GB
rattler-prod   rattler  /projects/rnd/rattler/rattler-prod   250 GB           300 GB
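For example, the miner-dev volume could be created from the command line like this (a sketch; -aetype 1
designates a group accountable entity, and quota sizes use maprcli's <n>G notation):
# maprcli volume create -name miner-dev -path /projects/mkt/miner/miner-dev -ae miner -aetype 1 -advisoryquota 70G -quota 100G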

5. From the UNIX command line on the master node, change the group for each volume to its
respective department:
# hadoop fs -chgrp -R mkt /projects/mkt
# hadoop fs -chgrp -R rnd /projects/rnd
6. Verify that the groups have changed:
# hadoop fs -ls -R /projects/mkt
# hadoop fs -ls -R /projects/rnd

Lessons Learned

At a minimum, configure a rack topology that puts each node into a specific rack. You can create
more specific topologies if you need to segregate data onto specific nodes.

Set up a topology for decommissioned nodes, to facilitate node maintenance and removal.

Volumes are a key management tool; create volumes liberally (and early) to organize your data.

Assign a high-level topology (such as /data) to volumes, unless you need a volume's data to be
on a specific group of nodes.

Use quotas to limit the size of a volume.

Use accountable entities to establish the user or group responsible for the volume's disk usage.

When specifying a mount path for a volume, the parent directories must exist. Use the command
hadoop fs -mkdir -p if needed to create the directories.


Lesson 5: Data Ingestion


Lab 5.1: Load Data
Estimated time to complete: 15 minutes
Follow these steps to copy data from a legacy server to a volume in your cluster, using the NFS protocol.
1. Log into the master node as root, and create a volume:
# maprcli volume create -name NFStest -mount 1 -path /NFStest
2. List the contents of NFStest to see that it is empty:
# hadoop fs -ls /NFStest
3. Log into your fourth node (the one that is not part of the cluster) as root. We'll call this the NFS
node for this lab exercise.
4. On the NFS node, create an input directory as a mount point for the cluster, and verify the
directory exists:
# mkdir -p /mnt/input
# ls /mnt
5. On the NFS node, mount the volume you created on your cluster node:
# mount -o hard,nolock <cluster node IP>:/mapr/<cluster name>/NFStest
/mnt/input
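For example, with a hypothetical cluster node IP address of 10.0.0.11 and a cluster named
my.cluster.com:
# mount -o hard,nolock 10.0.0.11:/mapr/my.cluster.com/NFStest /mnt/input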
6. On the NFS node, copy a group of files from the local file system to the input directory, and verify
that they were copied over:
# cp /etc/*.conf /mnt/input
# ls /mnt/input
7. Log into the master node on your cluster, as root. Verify that the files are there:
# hadoop fs -ls /NFStest
8. On the NFS node, unmount the input directory and verify that it no longer sees the files:
# umount /mnt/input
# ls /mnt/input
9. On the cluster node, verify that the files are still there:
# hadoop fs -ls /NFStest


Lab 5.2: Configure Snapshots


Estimated time to complete: 30 minutes

Create a Scheduled Snapshot


1. In the MCS, navigate to MapR-FS > Volumes.
2. In the list of volumes, click on NFStest to open the volume properties.
3. Expand the Snapshot Scheduling section, and select the Critical data schedule.

4. Click OK. This creates a snapshot schedule for the volume.

Note: The Critical data schedule takes a snapshot every hour, at the top of the hour.
Depending on what time it is when you apply the schedule, it may take up to an hour for
the snapshot to be created.

5. Navigate to MapR-FS > Schedules. You will see that the schedule you selected for the volume
is listed as In use. Click the green checkmark to see information about the schedule including
when the snapshot will expire.

6. Click the drop-down box that shows Hourly, and change it to Every 5 min. Change the Retain
for field to be 1 hour, and click Save Schedule. This changes the Critical data schedule, for
every volume that is using the schedule. Now a scheduled snapshot will be taken every 5
minutes, instead of at the top of the hour.
7. Click the schedule name to see a list of all volumes that are associated with that schedule.


Create a Manual Snapshot


1. In the MCS, navigate back to MapR-FS > Volumes.
2. Check the box to the left of the NFStest volume.

3. At the top of the pane, click Volume Actions.

4. Click New Snapshot. The Create New Snapshot dialog box appears. Enter Manual_1 as the
name of the snapshot, and click OK.

5. Navigate to MapR-FS > Snapshots. The snapshot you took should be listed there.

Restore Data From a Snapshot


1. From the command line, view the list of snapshots:
# maprcli volume snapshot list
2. List the hidden directory that contains the snapshot:
# hadoop fs -ls /NFStest/.snapshot
3. Remove a file from the volume, then list the volume to see that it is gone:
# hadoop fs -rm /NFStest/<file name>
# hadoop fs -ls /NFStest
4. Restore the file from the Manual_1 snapshot:
# hadoop fs -cp /NFStest/.snapshot/Manual_1/<file name> /NFStest
5. Verify that the file has been restored:
# hadoop fs -ls /NFStest/<file name>


6. Since the cluster file system is mounted, you can also use Linux commands to see the status of
the file. Use the ls command to see that the file has been restored:
# ls /mapr/<cluster name>/NFStest/<file name>

Note: Remember that "/" in the hadoop command is the root of the cluster file system,
and the "/" in Linux is the root of the local file system. This is why different paths are
specified for the hadoop fs -ls command and the Linux ls command.

Remove or Preserve a Snapshot


You can view a list of snapshots in the MCS, but you won't be able to see their contents. You can also
preserve a snapshot that is scheduled to expire, or you can delete a snapshot.
1. In the MCS, navigate to MapR-FS > Snapshots.
2. Select the manual snapshot by checking the box to its left. Since there is no expiration date on a
manual snapshot, you do not have the option to preserve it. Click Remove Snapshot to delete it.

3. At the command line, verify that the snapshot is gone:
# hadoop fs -ls /NFStest/.snapshot
4. Select one of the scheduled snapshots, and click Preserve Snapshot.

Preserving it removes the expiration date.


Lab 5.3: Configure a Local Mirror


Estimated time to complete: 30 minutes
1. In the MCS, navigate to MapR-FS > Volumes.
2. Click New Volume. Create a new volume with the following properties:
Volume Type             Local Mirror Volume
Mirror Name             NFStest-mirror
Source Volume Name      NFStest
Mount Path              /NFStest-mirror
Topology                /data
Replication             NS Replication
Mirror Schedule         Normal data

3. Click OK. This creates and mounts the local mirror volume.
4. Navigate to MapR-FS > Mirror Volumes. Verify that your mirror volume appears.

Q:

The Last Mirrored and % Done columns do not contain information. Why not?

A:

Your actions created the mirror volume, but did not start the mirroring process.
If you want the volume to mirror right away, you can manually start the mirror.
Otherwise, it will start at the time specified in the schedule (which you can see
by navigating to MapR-FS > Schedules).

Follow these steps to force the mirror volume to start synchronizing prior to the scheduled time:
1. Navigate to MapR-FS > Mirror Volumes.
2. Check the box next to the mirror volume you just created.
3. Click Volume Actions. From the pull-down menu, select Start Mirroring. This will start the
synchronization process.


Create a Custom Schedule


Three schedules are created by default: Normal data, Important data, and Critical data. You can also
create custom schedules.
1. In the MCS, navigate to MapR-FS > Schedules.
2. Click New Schedule.
3. Name the new schedule Quarterly.
4. In the Schedule Rules section, click the arrow on the first drop-down box. This will show all of
the intervals that can be selected for your schedule.
a. Select Yearly.
b. Set the next field to on the 31st.
c. Set the next field to March.
d. Set the retain time to 1 year.

This sets the action to occur on March 31st each year, and be kept for one year.
5. Click Add Rule to add another rule to your schedule. Add a total of 3 more rules, that will:

Run yearly on June 30th, and be retained for 1 year.

Run yearly on September 30th, and be retained for 1 year.

Run yearly on December 31st, and be retained for 2 years.

6. Click Save Schedule when you have fully defined the schedule. The newly created schedule
appears in the schedule list, and is available for mirrors and snapshots.

Note: Schedules are available to be used for either snapshots or mirrors. If a schedule
is applied to a mirror volume, the retain time is ignored (the mirror will not expire). If the
schedule is used for a snapshot, the snapshot will automatically be deleted when the
retain interval is met.


Lab 5.4: Configure a Remote Mirror


Estimated time to complete: 30 minutes
For this lab, you will need a second cluster. For classroom training, the instructor will pair you with
another student, and you will each create a remote mirror with the other student's cluster. On-demand
training students will need to install a second cluster to perform this lab. You can create a single-node
cluster on the 4th node that you used for the NFS exercise.
Note: To create a remote mirror, the following conditions must be met:

Each cluster must already be up and running.

Each cluster must have a unique name

Every node in each cluster must be able to resolve all nodes in remote clusters, either
through DNS or entries in /etc/hosts.

The MapR user for both the local (source) and remote (destination) clusters must have
the same UID.

You need to have dump permission on the source volume, and restore permissions on
the mirror volumes at the destination cluster.

Edit the Source Cluster Configuration


1. Determine the cluster name and CLDB nodes on the destination cluster, by viewing this file on
the destination cluster:
/opt/mapr/conf/mapr-clusters.conf
2. Log into a node on the source cluster.
3. Edit the /opt/mapr/conf/mapr-clusters.conf file. You should see a line with the source
cluster name and a list of its CLDB nodes. Add a second line to describe the destination cluster,
in the form:
<cluster name> <CLDB node1>:7222 <CLDB node2>:7222 <CLDB node3>:7222
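For example (the cluster names and CLDB hostnames below are hypothetical; the first line, for your own
cluster, will already be present):
cluster-a nodea1:7222 nodea2:7222 nodea3:7222
cluster-b nodeb1:7222 nodeb2:7222 nodeb3:7222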
4. Restart the warden on all of the nodes in the source cluster:
# clush -a 'service mapr-warden restart'

Edit the Destination Cluster Configuration


Perform the same steps to make the destination cluster aware of the source cluster:
1. Log into a node on the destination cluster.
2. Edit the /opt/mapr/conf/mapr-clusters.conf file. Add a line to describe the source cluster, in
the form:
<cluster name> <CLDB node1>:7222 <CLDB node2>:7222 <CLDB node3>:7222
3. Restart the warden on all of the nodes in the destination cluster.


Verify Cluster Configuration


Verify that each cluster has a unique name and is aware of the other cluster:
1. Log on to the MCS of the source cluster.
2. Verify that the name of the source cluster is listed at the top.


3. Click the + symbol next to the name and verify that the destination cluster is listed under
Available Clusters.

4. Click on the link for the destination cluster to open the MCS for the destination cluster.

Note: For an AWS cluster, it will attempt to connect using the internal IP address, which
will fail. You will need to connect using the node's external IP address.

5. Repeat these steps to verify visibility on the destination cluster's MCS.

Create a Remote Mirror Volume with MCS


1. Log into the MCS on the destination cluster.
2. Navigate to MapR-FS > Volumes.
3. Use the New Volume button to set up a Remote Mirror Volume.

Set the Volume Type to Remote Mirror Volume.

Assign a descriptive name as the Mirror Name.


For the Source Volume Name, enter the name of the volume on the source cluster that you
want to mirror.

For the Source Cluster Name, enter the name of the source cluster (once you start typing,
the cluster name should appear so you can select it).

For the Mount Path, enter the path on the destination cluster where the mirror volume will be
mounted.

Under Permissions, make sure that the user mapr has restore permissions.

4. Click OK to create the volume. The volume should appear in your volume list: navigate to
MapR-FS > Mirror Volumes to verify.

Initiate Mirroring to the Destination Cluster


1. In the MCS, select the remote mirror volume you created.
2. Click Volume Actions and select Start Mirroring. Give the volume a few minutes to finish
mirroring.
3. Log into any node on the destination cluster, and list the contents of the destination mirror
volume:
# hadoop fs -ls <remote mirror mount point>
Or, if the volume is mounted via NFS, you can simply use operating system commands:
# ls /mapr/<cluster name>/<remote mirror mount point>
You should see the same contents in the mirror volume as you do in the source volume.
4. Click on the volume name to open volume properties.
5. In the Snapshot Scheduling section, set a Mirror Schedule. This will ensure that the remote
mirror is updated on a regular basis.


Lessons Learned

With the cluster file system mounted, you can use standard Linux commands to copy data into
the cluster.

Snapshots can be created manually, or on a schedule. Snapshots that are created manually do
not have a set expiration date. Snapshots that are created on a schedule will have an expiration
date, but then can be preserved before they expire if you want to keep them longer.

Mirror volumes must be synchronized after they are created. They can be synchronized
manually, or with a schedule.

When a schedule is applied to a mirror volume, the retain time is ignored (data in mirror volumes
does not expire; the mirror is updated with new data each time it is synchronized).

Remote mirrors are set up between two clusters, typically for disaster recovery purposes.

With a local mirror volume, the data is pushed from the source volume to the mirror volume. With
a remote mirror volume, the data is pulled by the remote mirror.


Lesson 6: Monitor Your Cluster


Lab 6.1: Monitor Cluster Health
Estimated time to complete: 10 minutes

Check Cluster Heat Map


1. In the MCS, navigate to Cluster > Dashboard.
The Cluster Heatmap is displayed in the center panel.
2. By default, the cluster heatmap shows the Health view. From the drop-down list at the top of the
heat map, choose other options to see how they impact the node icons:

You can also use the drop-down list to filter by certain alarm types. For example, if you select
Node Alarm Heartbeat Processing Slow, you will see alarms on any nodes that have a slow
heartbeat:

3. Click on a node icon to see more on the node's status. The view shows information on:

The node's performance

MapReduce slots (for MRv1)


Database operations

MapR-FS disks

System disks

Services running on the node

Check for Service Failures


1. On the dashboard of the MCS, look at the Services pane:

The pane lists services running on the cluster. In particular, look for any numbers in the Fail
column.
2. Click a failed service: a screen displays that shows which node has the failed service. From here,
you can start or restart the service.


Lab 6.2: Stop, Start, and Restart Services


Estimated time to complete: 5 minutes
You can start, stop, and restart services through either the MCS, or with the command-line interface.
1. In the MCS, navigate to Cluster > Nodes.
2. Click the checkbox next to one or more nodes, then click Manage Services at the top of the
screen.

3. A list of services displays. From the drop-down list next to a service name, choose to start, stop,
or restart the service on the selected nodes. Then click OK.

4. You can also start and stop services from the command line:
# maprcli node services -<service name> start|stop|restart -nodes <list of nodes>
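For example, this is one way to restart the NFS gateway on a specific node (a sketch; substitute the
hostname of your node):
# maprcli node services -nfs restart -nodes <node hostname>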


Lab 6.3: Perform Maintenance


Estimated time to complete: 20 minutes

Replace a Failed Disk


If a drive goes down, access to the entire storage pool is lost even if the other two disks are fully
operational. Follow these steps to hot-swap a drive and rebuild the storage pool.

Note: You will perform this procedure on a "healthy" disk, since you do not have any failed
disks in your lab cluster. The procedure is the same for a disk that has actually failed.

1. When a drive fails, a Data Under Replicated alarm is raised. Check the alarm in the MCS to
determine which node had a disk failure.
2. If there was an actual disk failure, you would view the logs to determine the cause, to make sure
the disk needs to be replaced. The log files are located at /opt/mapr/logs/faileddisk.log.
3. In the MCS, navigate to Cluster > Nodes.
4. Click the name of the node with the failed drive: the node properties display (for the purposes of
this lab, you can select any of the nodes).
5. Scroll down to MapR-FS and Available Disks.
6. Scroll to the right and check the Storage Pool ID. Make note of all of the disks included in the
same storage pool:

7. Check the box next to the failed disk, and click Remove Disk(s) to MapR-FS.


8. Click OK. All of the disks in the storage pool will be brought offline to the cluster. All of the disks
in the storage pool will be removed from MapR-FS, and the File System column will be empty.
The Used column will show 0%:

9. Replace the failed disk with a new one.


10. Select all the devices that were part of the storage pool, and click Add Disks to MapR-FS. The
disks will be assigned the next available Storage Pool ID, and the File System will once again
show as MapR-FS.

Decommission a Node
Use the /decommissioned topology to take a node offline, either for retirement or to perform extended
maintenance.
1. In the MCS, navigate to Cluster > Nodes.
2. Check the box next to the node you will take offline.
3. Click Change Topology.
4. In the drop-down list, select the /decommissioned topology. Then click OK. The node is moved
to the decommissioned topology. Since the containers on the node belong in a different topology
(such as /data), the system will initiate the process of creating new copies of the data on available
nodes.


Appendix A: Set Up Passwordless SSH


Estimated time to complete: 30 minutes
If passwordless SSH is not already set up on your cluster, follow these instructions. For the instructor-led
course ADM 2000, check with your instructor prior to completing these steps.

Note: These instructions are specific to the classroom training, which uses AWS nodes for the
clusters. If you are taking the on-demand course and supplying your own nodes, you will need
to make adjustments.

Allow Root Logins on All Nodes


Log into each of your nodes as the root user, one at a time, and perform the following steps:
1. Change the password of the root account to mapr. Ignore any cautions about the password
strength.
# passwd
2. Copy the /etc/ssh/sshd_config file to sshd_config.bak, and then edit the file:
# cp /etc/ssh/sshd_config /etc/ssh/sshd_config.bak
# vi /etc/ssh/sshd_config
3. In the /etc/ssh/sshd_config file, comment or uncomment lines to match what is shown below:

PasswordAuthentication yes    (uncomment line, set to yes)
PermitRootLogin yes           (uncomment line, set to yes)

If the file also contains either of these lines, delete them or comment them out:
PermitRootLogin no
PasswordAuthentication no

4. Save the file, and run this command:
# sshd -t
If no output is returned, proceed. If there are any errors, correct them before continuing.
5. Restart the sshd service:
# service sshd restart


Generate and Copy Key Pairs


1. Log in as the root user on the master node, and run the ssh-keygen command to generate
keys for the root user:
# ssh-keygen
When you are prompted for input:

Accept the default directory in which to save the key.

If you are prompted to overwrite an existing file, answer yes.

Don't enter a passphrase.

2. Copy the generated key pair into place on all three nodes (including the master). When prompted
for the password, enter mapr.
# ssh-copy-id root@<master node IP address or hostname>
# ssh-copy-id root@<node1 IP address or hostname>
# ssh-copy-id root@<node2 IP address or hostname>

3. Now ssh from the master node into each of the other nodes as root. You should be able to connect
without specifying a key file or entering a password:
# ssh root@<node internal IP address or hostname>

