
Perform the following steps to import data to HDFS:

1. Use the following command to create a data directory on HDFS:

hadoop fs -mkdir /project/data

This command will create a directory /project/data in the HDFS filesystem.


hduser@nn1:/usr/local/hdpmetaxxcache$ hadoop fs -mkdir /project/data
Warning: $HADOOP_HOME is deprecated.

hduser@nn1:/usr/local/hdpmetaxxcache$
________________________________________________________________

2. Copy the data file from the local directory to HDFS using the following command:

hadoop fs -cp file:///usr/local/hdpmetaxxcache/data /project/data

Alternatively, we can use the put command, which takes a plain local path as the source:


hadoop fs -put /usr/local/hdpmetaxxcache/data /project/data

3. Verify the data file on HDFS with the following command:

hadoop fs -ls /project/data

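4. Move the data file from the local directory to HDFS with the following command:

hadoop fs -moveFromLocal /data/datafile /user/hduser/data

The local copy will be deleted if you use this command. Note that hadoop fs -mv
only moves files within a single filesystem, so it cannot take a local source.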


________________________________________________________________

5. Use the distributed copy to copy the large data file to HDFS:

hadoop distcp file:///usr/local/hdpmetaxxcache/data /project1/data

This command will initiate a MapReduce job with a number of mappers to run the
copy task in parallel.

hduser@nn1:/usr/local/hdpmetaxxcache$ hadoop distcp file:///usr/local/hdpmetaxxcache/data /project1/data
Warning: $HADOOP_HOME is deprecated.

14/07/04 05:04:17 INFO tools.DistCp: srcPaths=[file:/usr/local/hdpmetaxxcache/data]
14/07/04 05:04:17 INFO tools.DistCp: destPath=/project1/data
14/07/04 05:04:18 INFO tools.DistCp: sourcePathsCount=3
14/07/04 05:04:18 INFO tools.DistCp: filesToCopyCount=2
14/07/04 05:04:18 INFO tools.DistCp: bytesToCopyCount=5.2k
14/07/04 05:04:19 INFO mapred.JobClient: Running job: job_201407040414_0003
14/07/04 05:04:20 INFO mapred.JobClient: map 0% reduce 0%
14/07/04 05:04:40 INFO mapred.JobClient: map 49% reduce 0%
14/07/04 05:04:48 INFO mapred.JobClient: Task Id : attempt_201407040414_0003_m_000000_0, Status : FAILED
java.io.IOException: Copied: 0 Skipped: 0 Failed: 2
at org.apache.hadoop.tools.DistCp$CopyFilesMapper.close(DistCp.java:582)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:430)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:366)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:416)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
at org.apache.hadoop.mapred.Child.main(Child.java:249)

14/07/04 05:04:49 INFO mapred.JobClient: map 0% reduce 0%
14/07/04 05:04:55 INFO mapred.JobClient: map 100% reduce 0%
14/07/04 05:04:57 INFO mapred.JobClient: Job complete: job_201407040414_0003
14/07/04 05:04:57 INFO mapred.JobClient: Counters: 23
14/07/04 05:04:57 INFO mapred.JobClient: Job Counters
14/07/04 05:04:57 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=36113
14/07/04 05:04:57 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
14/07/04 05:04:57 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
14/07/04 05:04:57 INFO mapred.JobClient: Launched map tasks=2
14/07/04 05:04:57 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=0
14/07/04 05:04:57 INFO mapred.JobClient: File Input Format Counters
14/07/04 05:04:57 INFO mapred.JobClient: Bytes Read=358
14/07/04 05:04:57 INFO mapred.JobClient: File Output Format Counters
14/07/04 05:04:57 INFO mapred.JobClient: Bytes Written=0
14/07/04 05:04:57 INFO mapred.JobClient: FileSystemCounters
14/07/04 05:04:57 INFO mapred.JobClient: FILE_BYTES_READ=5305
14/07/04 05:04:57 INFO mapred.JobClient: HDFS_BYTES_READ=517
14/07/04 05:04:57 INFO mapred.JobClient: FILE_BYTES_WRITTEN=60312
14/07/04 05:04:57 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=5305
14/07/04 05:04:57 INFO mapred.JobClient: distcp
14/07/04 05:04:57 INFO mapred.JobClient: Files copied=2
14/07/04 05:04:57 INFO mapred.JobClient: Bytes copied=5305
14/07/04 05:04:57 INFO mapred.JobClient: Bytes expected=5305
14/07/04 05:04:57 INFO mapred.JobClient: Map-Reduce Framework
14/07/04 05:04:57 INFO mapred.JobClient: Map input records=2
14/07/04 05:04:57 INFO mapred.JobClient: Physical memory (bytes) snapshot=38006784
14/07/04 05:04:57 INFO mapred.JobClient: Spilled Records=0
14/07/04 05:04:57 INFO mapred.JobClient: CPU time spent (ms)=320
14/07/04 05:04:57 INFO mapred.JobClient: Total committed heap usage (bytes)=16252928
14/07/04 05:04:57 INFO mapred.JobClient: Virtual memory (bytes) snapshot=378900480
14/07/04 05:04:57 INFO mapred.JobClient: Map input bytes=258
14/07/04 05:04:57 INFO mapred.JobClient: Map output records=0
14/07/04 05:04:57 INFO mapred.JobClient: SPLIT_RAW_BYTES=159
hduser@nn1:/usr/local/hdpmetaxxcache$

________________________________________________________________
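
Steps 1 to 3 above can be chained into one script so that a failed command stops
the sequence. The following is a minimal sketch rather than a definitive
implementation: the function names run and import_to_hdfs and the variables
HADOOP_CMD and DRY_RUN are hypothetical, introduced only for this example. With
DRY_RUN=1 (the default here) the script only prints the commands it would run.

```shell
#!/bin/sh
# Sketch: chain the mkdir, put, and ls steps and stop at the first failure.
# HADOOP_CMD and DRY_RUN are assumptions made for this example only.
HADOOP_CMD="${HADOOP_CMD:-hadoop}"
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# import_to_hdfs LOCAL_SRC HDFS_DEST
import_to_hdfs() {
    src="$1"
    dest="$2"
    run "$HADOOP_CMD" fs -mkdir "$dest" || return 1
    run "$HADOOP_CMD" fs -put "$src" "$dest" || return 1
    run "$HADOOP_CMD" fs -ls "$dest"
}

import_to_hdfs /usr/local/hdpmetaxxcache/data /project/data
```

Setting DRY_RUN=0 would execute the commands against the cluster instead of
printing them.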

An earlier run of the same copy into /project/data (job_201407040414_0002) failed
on each task attempt shown and was suspended with Ctrl+Z:

hduser@nn1:/usr/local/hdpmetaxxcache$ hadoop distcp file:///usr/local/hdpmetaxxcache/data /project/data
Warning: $HADOOP_HOME is deprecated.

14/07/04 05:00:02 INFO tools.DistCp: srcPaths=[file:/usr/local/hdpmetaxxcache/data]
14/07/04 05:00:02 INFO tools.DistCp: destPath=/project/data
14/07/04 05:00:03 INFO tools.DistCp: sourcePathsCount=3
14/07/04 05:00:03 INFO tools.DistCp: filesToCopyCount=2
14/07/04 05:00:03 INFO tools.DistCp: bytesToCopyCount=5.2k
14/07/04 05:00:05 INFO mapred.JobClient: Running job: job_201407040414_0002
14/07/04 05:00:06 INFO mapred.JobClient: map 0% reduce 0%
14/07/04 05:00:28 INFO mapred.JobClient: map 49% reduce 0%
14/07/04 05:00:36 INFO mapred.JobClient: Task Id : attempt_201407040414_0002_m_000000_0, Status : FAILED
java.io.IOException: Copied: 0 Skipped: 0 Failed: 2
at org.apache.hadoop.tools.DistCp$CopyFilesMapper.close(DistCp.java:582)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:430)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:366)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:416)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
at org.apache.hadoop.mapred.Child.main(Child.java:249)

14/07/04 05:00:37 INFO mapred.JobClient: map 0% reduce 0%
14/07/04 05:00:54 INFO mapred.JobClient: map 49% reduce 0%
14/07/04 05:01:03 INFO mapred.JobClient: Task Id : attempt_201407040414_0002_m_000000_1, Status : FAILED
java.io.IOException: Copied: 0 Skipped: 0 Failed: 2
at org.apache.hadoop.tools.DistCp$CopyFilesMapper.close(DistCp.java:582)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:430)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:366)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:416)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
at org.apache.hadoop.mapred.Child.main(Child.java:249)

14/07/04 05:01:05 INFO mapred.JobClient: map 0% reduce 0%
14/07/04 05:01:20 INFO mapred.JobClient: map 49% reduce 0%
14/07/04 05:01:28 INFO mapred.JobClient: Task Id : attempt_201407040414_0002_m_000000_2, Status : FAILED
java.io.IOException: Copied: 0 Skipped: 0 Failed: 2
at org.apache.hadoop.tools.DistCp$CopyFilesMapper.close(DistCp.java:582)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:430)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:366)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:416)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
at org.apache.hadoop.mapred.Child.main(Child.java:249)

14/07/04 05:01:29 INFO mapred.JobClient: map 0% reduce 0%

^Z
[1]+ Stopped hadoop distcp file:///usr/local/hdpmetaxxcache/data /project/data
hduser@nn1:/usr/local/hdpmetaxxcache$
________________________________________________________________
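
The transcripts above show task attempts failing with "Copied: 0 Skipped: 0
Failed: 2", so it is worth checking whether a distcp run succeeded before relying
on the copied data. The sketch below assumes that hadoop distcp exits with a
non-zero status when the job fails, and uses the -m option to cap the number of
map tasks. The function name distcp_checked, the HADOOP_CMD variable, and the
default of 2 maps are hypothetical choices made for this example.

```shell
#!/bin/sh
# Sketch: run distcp and report failure through the exit status.
# HADOOP_CMD is a hypothetical override, useful for testing without a cluster.
HADOOP_CMD="${HADOOP_CMD:-hadoop}"

# distcp_checked SRC DEST [MAX_MAPS]
distcp_checked() {
    src="$1"
    dest="$2"
    maps="${3:-2}"   # assumed default; tune for the cluster
    if "$HADOOP_CMD" distcp -m "$maps" "$src" "$dest"; then
        echo "distcp succeeded: $src -> $dest"
    else
        status=$?
        echo "distcp failed with exit status $status" >&2
        return "$status"
    fi
}

# Example call (needs a running cluster):
# distcp_checked file:///usr/local/hdpmetaxxcache/data /project1/data
```

Suspending the client with ^Z, as in the transcript, only stops the local client
process. In Hadoop 1.x a submitted job can be cancelled with hadoop job -kill
followed by the job ID.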

To copy multiple files from the local directory to HDFS, we can use the following
command:
hadoop fs -copyFromLocal src1 src2 data

This command will copy two files src1 and src2 from the local directory to the data
directory on HDFS.

Similarly, we can move files from the local directory to HDFS. The only difference
from the previous command is that the local files are deleted after they are copied.

hadoop fs -moveFromLocal src1 src2 data

This command will move two files, src1 and src2, from the local directory to HDFS.
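
Because moveFromLocal deletes the local files, it can help to make the choice
between the two commands explicit. The following sketch only builds and prints the
command instead of running it. The function name upload_command and its copy/move
mode argument are hypothetical, introduced for this example.

```shell
#!/bin/sh
# Sketch: build a copyFromLocal or moveFromLocal command for several files.
# The function name and the copy/move mode argument are hypothetical.

# upload_command MODE HDFS_DEST LOCAL_SRC...
upload_command() {
    mode="$1"
    dest="$2"
    shift 2
    case "$mode" in
        copy) sub="-copyFromLocal" ;;   # local files are kept
        move) sub="-moveFromLocal" ;;   # local files are deleted
        *) echo "unknown mode: $mode" >&2; return 2 ;;
    esac
    # Refuse to build a command for a source that does not exist locally.
    for f in "$@"; do
        [ -e "$f" ] || { echo "missing local file: $f" >&2; return 1; }
    done
    echo "hadoop fs $sub $* $dest"
}

# Example: prints the command for two files, if they exist locally.
# upload_command copy data src1 src2
```

Checking the sources first means a typo in a file name is caught before any
command reaches the cluster.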

Although the distributed copy can be faster than the simple data importing
commands, it can place a heavy load on the node where the data resides, because
many copy tasks may request data from it at the same time. distcp is more useful
when copying data from one HDFS location to another. For example:

hadoop distcp hdfs:///project1/data/ hdfs:///user/

hduser@nn1:/usr/local/hdpmetaxxcache$ hadoop distcp hdfs:///project1/data/ hdfs:///user/
Warning: $HADOOP_HOME is deprecated.

14/07/04 05:09:21 INFO tools.DistCp: srcPaths=[hdfs:/project1/data]
14/07/04 05:09:21 INFO tools.DistCp: destPath=hdfs:/user
14/07/04 05:09:22 INFO tools.DistCp: sourcePathsCount=11
14/07/04 05:09:22 INFO tools.DistCp: filesToCopyCount=6
14/07/04 05:09:22 INFO tools.DistCp: bytesToCopyCount=66.9k
14/07/04 05:09:24 INFO mapred.JobClient: Running job: job_201407040414_0004
14/07/04 05:09:25 INFO mapred.JobClient: map 0% reduce 0%
14/07/04 05:09:34 INFO mapred.JobClient: map 100% reduce 0%
14/07/04 05:09:36 INFO mapred.JobClient: Job complete: job_201407040414_0004
14/07/04 05:09:36 INFO mapred.JobClient: Counters: 22
14/07/04 05:09:36 INFO mapred.JobClient: Job Counters
14/07/04 05:09:36 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=10495
14/07/04 05:09:36 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
14/07/04 05:09:36 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
14/07/04 05:09:36 INFO mapred.JobClient: Launched map tasks=1
14/07/04 05:09:36 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=0
14/07/04 05:09:36 INFO mapred.JobClient: File Input Format Counters
14/07/04 05:09:36 INFO mapred.JobClient: Bytes Read=1770
14/07/04 05:09:36 INFO mapred.JobClient: File Output Format Counters
14/07/04 05:09:36 INFO mapred.JobClient: Bytes Written=0
14/07/04 05:09:36 INFO mapred.JobClient: FileSystemCounters
14/07/04 05:09:36 INFO mapred.JobClient: HDFS_BYTES_READ=70477
14/07/04 05:09:36 INFO mapred.JobClient: FILE_BYTES_WRITTEN=60289
14/07/04 05:09:36 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=68548
14/07/04 05:09:36 INFO mapred.JobClient: distcp
14/07/04 05:09:36 INFO mapred.JobClient: Files copied=6
14/07/04 05:09:36 INFO mapred.JobClient: Bytes copied=68548
14/07/04 05:09:36 INFO mapred.JobClient: Bytes expected=68548
14/07/04 05:09:36 INFO mapred.JobClient: Map-Reduce Framework
14/07/04 05:09:36 INFO mapred.JobClient: Map input records=10
14/07/04 05:09:36 INFO mapred.JobClient: Physical memory (bytes) snapshot=37609472
14/07/04 05:09:36 INFO mapred.JobClient: Spilled Records=0
14/07/04 05:09:36 INFO mapred.JobClient: CPU time spent (ms)=450
14/07/04 05:09:36 INFO mapred.JobClient: Total committed heap usage (bytes)=16252928
14/07/04 05:09:36 INFO mapred.JobClient: Virtual memory (bytes) snapshot=378376192
14/07/04 05:09:36 INFO mapred.JobClient: Map input bytes=1670
14/07/04 05:09:36 INFO mapred.JobClient: Map output records=0
14/07/04 05:09:36 INFO mapred.JobClient: SPLIT_RAW_BYTES=159
hduser@nn1:/usr/local/hdpmetaxxcache$
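
The guidance above, distcp for HDFS-to-HDFS copies and the simple commands for
local imports, can be expressed as a small dispatch on the source URI. This is a
sketch under that reading, not a rule from Hadoop itself. The function name
pick_copy_command is hypothetical, and it only prints the suggested command.

```shell
#!/bin/sh
# Sketch: choose a copy command from the source URI scheme.
# pick_copy_command is a hypothetical helper; it only prints the command.

# pick_copy_command SRC DEST
pick_copy_command() {
    src="$1"
    dest="$2"
    case "$src" in
        hdfs://*)
            # HDFS-to-HDFS: the data is already spread over the cluster.
            echo "hadoop distcp $src $dest" ;;
        file://*)
            # Local source: strip the scheme and use a plain put.
            echo "hadoop fs -put ${src#file://} $dest" ;;
        *)
            # No scheme: treat the source as a local path.
            echo "hadoop fs -put $src $dest" ;;
    esac
}

pick_copy_command hdfs:///project1/data/ hdfs:///user/
pick_copy_command file:///usr/local/hdpmetaxxcache/data /project/data
```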
