Department of Computer Science and Engineering Sri Bhagawan Mahaveer Jain College of Engineering
Jakkasandra, Kanakapura (T), Ramanagara District - 562 112
June - 2012
CERTIFICATE
This is to certify that students of 8th Semester, Computer Science and Engineering, have carried out the project work in partial fulfillment for the award of the Bachelor's degree in Computer Science and Engineering of Visvesvaraya Technological University, Belgaum, during the year 2011-12. It is certified that all corrections / suggestions indicated for Internal Assessment have been incorporated in the report. This project report has been approved as it satisfies the academic requirements in respect of Project Work prescribed for the Bachelor of Engineering degree.
External Viva:
Name of the Examiners 1. Signature with date
2.
ACKNOWLEDGEMENT
We owe a great debt of gratitude to our professors, who helped us stay well grounded in the real world during the tenure of our engineering program at Sri Bhagawan Mahaveer Jain College of Engineering and helped us attain profound technical skills in the field of Computer Science & Engineering, thereby fulfilling the most cherished goal of our lives: to become Computer Science engineers.
We convey our sincere gratitude to our guide, Asst. Prof. Mr. Chidananda Murthy P, Department of Computer Science & Engineering, for his support, continuing co-operation, valuable suggestions and encouragement during the development of the project.
We are thankful to Asst. Prof. Ms. Pushpa H. G., Head of Department of Computer Science and Engineering, SBMJCE, for her encouragement, inspiration and help throughout the course.
We express our immense gratitude to Dr. Y Vijay Kumar, Principal, SBMJCE, for providing us with this opportunity and inspiration during the tenure of the course.
We would also like to extend our gratitude to all the teachers and staff members of our department and college, who directly or indirectly supported us in our endeavors.
Last but not least, we are very thankful to all our family members and friends for their valuable support during this period of our study.
Thank you...this project would have never reached this point without all of you.
ABSTRACT
The generation of an enormous amount of sequence data from Next-generation Deoxyribonucleic acid (DNA) sequencing machines has placed unprecedented demands on traditional single-processor read-mapping algorithms. To optimize the mapping of next-generation sequence data to the human genome and other reference genomes, for use in a variety of biological analyses including single-nucleotide polymorphism discovery, genotyping and personal genomics, a short read mapping program is modeled to run on a distributed cluster, i.e., the Hadoop architecture. An algorithmic technique called seed-and-extend is used to accelerate the mapping process, making it efficient and reducing the running time from hours to mere minutes for typical jobs involving the mapping of millions of short reads to the human genome.
Our objectives are:
1. To design, simulate and analyze the feasibility of implementing the algorithm on a multi-node cluster on Hadoop.
2. To implement the algorithm on the MapReduce platform, thereby increasing performance by reducing the processing time.
With the results from these objectives, we obtain optimized DNA sequencing on a multi-node Hadoop cluster.
Table of Contents
ACKNOWLEDGEMENT .......... I
ABSTRACT .......... II
Table of Contents .......... III
List of Figures .......... VI
List of Tables .......... VII
Acronym and Abbreviations .......... VIII
Glossary .......... IX
Chapter 1 INTRODUCTION .......... 1
1 INTRODUCTION TO DNA SEQUENCING ON HADOOP .......... 2
  1.1 Importance of DNA Sequencing .......... 2
    DNA and its Sequencing .......... 2
    High Throughput Sequencing or Next-generation Sequencing .......... 3
  1.2 Introduction to Hadoop .......... 4
    Hadoop Framework .......... 4
    Hadoop Architecture .......... 5
    Hadoop File System .......... 5
    Hadoop MapReduce .......... 6
  1.3 Statement of the Problem .......... 6
  1.4 Objective .......... 7
  1.5 Scope .......... 7
  1.6 Literature Survey .......... 7
  1.7 Organization of the Report .......... 8
Chapter 2 REQUIREMENT ANALYSIS .......... 9
2 REQUIREMENTS ANALYSIS OF OPTIMIZING DNA SEQUENCING ON HADOOP .......... 10
  2.1 System Requirements .......... 10
  2.2 Input/Output Requirements .......... 10
    Input Requirements .......... 10
    Output Requirements .......... 11
  2.4 Non-Functional Requirements .......... 12
    Performance .......... 12
    Availability .......... 12
    Reliability .......... 13
    Robust .......... 13
    Scalable .......... 13
Chapter 3 DESIGN .......... 14
3 DESIGN OF OPTIMIZING DNA SEQUENCING ON HADOOP .......... 15
  3.1 Design Considerations .......... 15
  3.2 Assumptions and Dependencies .......... 15
  3.3 General Constraints .......... 15
  3.4 Parameters .......... 16
    3.4.1 Read Mapping .......... 16
    3.4.2 MapReduce .......... 16
    3.4.3 Alignment Filtration .......... 17
  3.5 Seed-and-extend Algorithm .......... 18
  3.6 Landau-Vishkin k-difference Alignment Algorithm .......... 19
  3.7 Data Flow .......... 20
  3.8 Sequence Diagram .......... 22
  3.9 System Overview .......... 23
  3.10 Use Case Diagram .......... 23
  3.11 System Architecture .......... 24
    3.11.1 Hadoop Distributed File System .......... 24
Chapter 4 IMPLEMENTATION .......... 26
4 IMPLEMENTATION OF OPTIMIZING DNA SEQUENCING ON HADOOP .......... 27
  4.1 Execution Flow .......... 27
  4.2 Convert Format .......... 28
  4.3 DNA Sequencing .......... 31
  4.4 Print Sequences .......... 41
Chapter 5 TESTING .......... 42
5 TESTING OF OPTIMIZING DNA SEQUENCING ON HADOOP .......... 43
  5.1 Test Setup .......... 43
  5.2 Testing Scenarios .......... 47
    Test Case 1 .......... 47
    Test Case 2 .......... 48
    Test Case 3 .......... 49
    Test Case 4 .......... 50
Chapter 6 RESULTS .......... 51
6 RESULTS OF OPTIMIZING DNA SEQUENCING ON HADOOP .......... 52
  CASE 1: Running the application on a One Node Hadoop Cluster .......... 52
  CASE 2: Running the application on a Two Node Hadoop Cluster .......... 54
  CASE 3: Running the application on a Four Node Hadoop Cluster .......... 55
CONCLUSION .......... 56
FUTURE ENHANCEMENTS .......... 57
APPENDIX A: DNA and its Sequencing .......... 58
APPENDIX B: MapReduce and RMAP .......... 63
APPENDIX C: Single System Installation .......... 69
APPENDIX D: Multi-Node Installation .......... 75
BIBLIOGRAPHY .......... 80
List of Figures
Figure 1-1 Multi-node Hadoop cluster .......... 5
Figure 2-1 Overview of system .......... 11
Figure 3-1 System Flow .......... 17
Figure 3-2 Hash Table view for Landau-Vishkin Algorithm .......... 18
Figure 3-3 Shuffle and matching of sequences .......... 18
Figure 3-4 Data Flow Overview .......... 20
Figure 3-5 Data Flow of DNA Sequencing .......... 21
Figure 3-6 Sequence Diagram .......... 22
Figure 3-7 DNA Sequencing Overview .......... 23
Figure 3-8 Use Case Diagram .......... 23
Figure 3-9 HDFS Architecture .......... 24
Figure 3-10 Overview of the HDFS .......... 25
Figure 4-1 Flow chart for Sequence of Execution .......... 27
Figure 4-2 Flow chart for Format Convertor .......... 28
Figure 4-3 Flow chart for Convert File .......... 29
Figure 4-4 Flow chart for Save Sequence .......... 30
Figure 4-5 Flow chart for DNA Sequencing .......... 31
Figure 4-6 Flow chart for Map class .......... 32
Figure 4-7 Flow chart for Reduce class .......... 33
Figure 4-8 Flow chart for Aligning the sequences .......... 34
Figure 4-9 Flow chart for Extending the sequences .......... 35
Figure 4-10 Flow chart for Landau-Vishkin algorithm .......... 36
Figure 4-11 Flow chart for k-difference alignment .......... 37
Figure 4-12 Flow chart for k-mismatch alignment .......... 38
Figure 4-13 Flow chart for Filter Alignment .......... 39
Figure 4-14 Flow chart for Filter Reduce class .......... 40
Figure 4-15 Flow chart for Print Sequences .......... 41
Figure 4-16 Flow chart for Printing output .......... 41
Figure 5-1 Snapshot: starting ssh .......... 44
Figure 5-2 Snapshot: starting datanode over the cluster .......... 45
Figure 5-3 Snapshot: starting namenode over the cluster .......... 46
Figure 5-4 Snapshot: Extension conversion of fasta file .......... 47
Figure 5-5 Snapshot: start of sequencing process .......... 48
Figure 5-6 Snapshot: Printing results .......... 49
Figure 6-2 Snapshot: start of sequencer on a single node cluster .......... 52
Figure 6-3 Snapshot: completion of job with displayed total running time .......... 53
Figure 6-5 Snapshot: start of sequencer on a two node cluster .......... 54
Figure 6-6 Snapshot: completion of job with displayed total running time .......... 54
Figure 6-8 Snapshot: start of sequencer on a four node cluster .......... 55
Figure 6-9 Snapshot: completion of job with displayed total running time .......... 55
Figure 0-1 Graphical comparison between total processing time against n active systems .......... 56
List of Tables
Table 3-1 MapReduce Description .......... 16
Table 5-1 Test setup scenario: Initialization of cluster nodes .......... 44
Table 5-2 Test setup scenario: Initialization of datanode .......... 45
Table 5-3 Test setup scenario: Initialization of namenode .......... 46
Table 5-4 Test case 1: Extension converter module .......... 47
Table 5-5 Test case 2: DNA sequencing module .......... 48
Table 5-6 Test case 3: Print Alignment module .......... 49
Table 5-7 Test case 4: Tracking job over web interface .......... 50
Table 5-8 Snapshot: Running and Non-Running Tasks .......... 50
Table 5-9 Snapshot: Running Jobs .......... 50
Table 6-1 Snapshot: cluster summary, single active node over the cluster .......... 52
Table 6-2 Snapshot: cluster summary, two active nodes over the cluster .......... 54
Table 6-3 Snapshot: cluster summary, four active nodes over the cluster .......... 55
Glossary
This section contains definitions of terms used throughout this project.

Base spacing: The number of points from one peak (end) to the next in the matched seed of the sequence.

FASTA format: A standard text-based file format for storing one or more sequences, in which nucleotides are represented using single-letter codes.

Genes: A molecular unit of the DNA of a living organism which performs one function.

Genome: The total DNA contained in each cell of an organism; there are somewhere in the order of a hundred thousand genes. It includes both the genes and the non-coding sequences of the DNA.

Genotyping: The process of determining differences in the genetic make-up (genotype) of an individual by examining the individual's DNA sequence and comparing it to another individual's sequence or a reference sequence.

Indels: Insertion/deletion errors in sequences, introduced while the seeds are formed.

Mutation: Mismatch errors found during seed formation.

Reads: Millions of short sequences of DNA which are taken into account for further sequencing.

Seeds: The substrings found and matched in the reference file and in the reads.

Sequence: A linear series of nucleotide base characters that represent a DNA sequence, displayed in rows from left to right.

SNP: Single-nucleotide polymorphisms are the most common form of genetic variation in humans and a resource for mapping complex genetic traits, as they can alter DNA, RNA and protein sequences at different levels. They vary from one individual to another.

Capillary electrophoresis (CE): Used to separate ionic species by their charge, frictional forces and radius with the use of an applied voltage. It is carried out in the interior of a small capillary filled with an electrolyte.
Chapter 1 INTRODUCTION
2011-2012
Page 1
The DNA replication reaction is run in a test tube in the presence of trace amounts of all nucleotides. Electrophoresis is used to separate the resulting fragments by size, and that is how the sequences are read. In a large-scale sequencing lab, automated DNA sequencers are used, in which the fragments are piped through a tiny glass-fiber capillary during the electrophoresis step; they come out the far end in size order, and their different colors are monitored on the screen as they emerge. The generated fragments on the screen consist of four colors (red, green, blue and yellow), each representing one of the four nucleotides. The actual gel image would be perhaps 3 or 4 meters long and 30 or 40 cm wide. The computer automatically generates the sequence, called a read, from the gel after the fragment is generated. A few of the major uses of DNA sequencing are:
Diagnosing Diseases: Comparing normal sequences to sequences in people with genetic illnesses to determine what parts of the genome are involved in the disease.
Forensic Genetics: Comparing DNA left at a crime scene to the DNA sequence of suspects or victims.
Paternity Tests: Matching the DNA of parents and children to determine how they are related.
Comparisons With Other Genomes: Comparing genomes of different species to help in scientific research and conservation efforts.
Distributed computing is a wide and varied field, but the key distinctions of Hadoop are that it is:
Accessible: Hadoop runs on large clusters of commodity machines or on cloud computing services and hence is easily accessible.
Robust: Because it is intended to run on commodity hardware, Hadoop is architected with the assumption of frequent hardware malfunctions. It can gracefully handle most such failures.
Scalable: Hadoop scales linearly to handle larger data by adding more nodes to the cluster.
Simple: Hadoop allows users to quickly write efficient parallel code.
A small Hadoop cluster[ ] will include a single master and multiple worker nodes. The master node consists of a Job Tracker, Task Tracker, Name Node, and Data Node. A slave or worker node acts as both a Data Node and Task Tracker, though it is possible to have data-only worker nodes, and compute-only worker nodes. The standard startup and shutdown scripts require secure shell to be set up between nodes in the cluster.
for communication; clients use RPC to communicate with each other. HDFS stores large files (an ideal file size is a multiple of 64 MB) across multiple machines. File access can be achieved through the native Java API. With Hadoop, the same data set is divided into smaller (typically 64 MB) blocks that are spread among many machines in the cluster via the Hadoop Distributed File System (HDFS). With a modest degree of replication, the cluster machines can read the data set in parallel and provide a much higher throughput. Such a cluster of commodity machines also turns out to be cheaper than one high-end server. Hadoop uses key/value pairs as its basic data unit, which is flexible enough to work with less-structured data types. In Hadoop, data can originate in any form, but it is eventually transformed into key/value pairs for the processing functions to work on.
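As a small illustration of the block-splitting arithmetic described above, the sketch below computes how many HDFS blocks a file of a given size would occupy. This is a hypothetical helper written for this report; the function name is ours, and only the 64 MB default block size is taken from the text.

```python
import math

BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the default block size mentioned above

def hdfs_block_count(file_size_bytes):
    """Number of HDFS blocks needed to store a file of the given size.

    A file smaller than one block still occupies one (partial) block.
    """
    return max(1, math.ceil(file_size_bytes / BLOCK_SIZE))

# A 200 MB input file occupies 4 blocks (3 full blocks + 1 partial block).
print(hdfs_block_count(200 * 1024 * 1024))  # -> 4
```

Each of these blocks is replicated and spread across the Data Nodes, which is what lets the cluster read a large input in parallel.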
1.4 Objective
DNA sequencing on a single system is a time-consuming and expensive effort with less efficient output. Hence, the objective of this project is to optimize the DNA sequencing process using MapReduce over the Hadoop framework in a distributed system environment, to achieve a high performance gain and a reduction in execution time.
1.5 Scope
The scope of this project is an implementation in which the running time reduces from hours to mere minutes for typical jobs involving the mapping of millions of short reads to the human genome. Running the sequencing on a Hadoop cluster accelerates the process, providing much greater performance.
Hadoop is a good choice for building batch processing systems to process huge amounts of unstructured data. To use Hadoop effectively, the system should process data in parallel. A definite advantage of Hadoop is that it runs on low-requirement hardware and the cluster can easily be scaled horizontally.
[6] All about Hadoop: http://hadoop.apache.org/
[7] Developer's guide to Hadoop: http://developer.yahoo.com/hadoop/
2.4.1 Performance
The main objective of this project is to decrease the total time required for processing a DNA sequence. By extension, decreasing the processing time increases the performance quotient of the system. The proposed system shall, in real time, demonstrate an effective reduction in the execution time of the sequencer and show comparable results over single and distributed systems.
2.4.2 Availability
The system shall at all times be available for the user to utilize. The Hadoop architecture has an inherent mechanism wherein, if a node malfunctions, it is flagged for maintenance and the work assigned to it is sent to another system for completion. Thus the system transcends hardware dependency, as it is no longer completely dependent on the availability of each and every node present within the network.
2.4.3 Reliability
This is the most important feature of the system, since the data obtained from it can potentially be used in a variety of biological analyses, including SNP discovery, genotyping and personal genomics (differences in one person's genome relative to a reference human genome, or comparisons of the genomes of closely related species). Even a single base-pair difference can have a significant biological impact, so researchers require highly sensitive mapping algorithms to analyze the reads.
2.4.4 Robust
Since the system is intended to run on commodity hardware, Hadoop is architected with the assumption of frequent hardware malfunctions. It can gracefully handle such failures without affecting the performance and reliability of the system.
2.4.5 Scalable
Researchers are generating sequence data at an incredible rate and need a highly scalable system to analyze their data. The Hadoop framework is utilized because of its scalability. Over time, any number of systems can be added to the network to handle more task-intensive jobs.
Chapter 3 DESIGN
3.4 Parameters
3.4.1 Read mapping
The sequences are aligned, or mapped, to the reads of the reference genome to find the locations where each exact read occurs in the reference sequence, allowing a small number of differences. The read mapping allows 1-10% of the read length to differ from the reference.
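The 1-10% tolerance above translates directly into a per-read mismatch budget. The following sketch shows how such a budget might be computed; the function name and the truncation-based rounding are our own choices, not taken from the project's implementation.

```python
def mismatch_budget(read_length, error_rate=0.10):
    """Maximum number of differences tolerated for a read of the given
    length, using the 10% upper bound stated in the text.

    Rounding by truncation is an assumption made for this sketch.
    """
    return int(read_length * error_rate)

print(mismatch_budget(36))   # -> 3 differences allowed for a 36 bp read
print(mismatch_budget(100))  # -> 10 differences allowed for a 100 bp read
```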
3.4.2 MapReduce
MapReduce is the software framework which supports the parallel execution of data-intensive applications. The framework automatically provides common services for parallel computing, such as partitioning the input data, scheduling, monitoring, and the inter-machine communication necessary for remote execution.
Table 3-1 MapReduce Description
FUNCTION | INPUT PARAMETER | RETURNS | DESCRIPTION
Map | DNA file records | <seed, merInfo> | Each Map function takes a series of key/value pairs, processes each, and generates zero or more output key/value pairs.
Reduce | <seed, list(merInfo)> | <seed, list(merInfo)> | The Reduce function is called once for each unique key in sorted order; it iterates through the values associated with that key and produces zero or more outputs.
The map function processes each DNA file and emits a sequence of <seed, merInfo> pairs, where seed is the key and merInfo is a tuple (id, position, isRef, isRC, left_flank, right_flank). Pairs with the same seed are merged when passed to the reduce function as input; this is done in the sort-and-shuffle phase. The reduce function accepts all pairs for a given seed, sorts the corresponding merInfo values and emits a <seed, list(merInfo)> pair. The set of all output pairs is then printed. This makes it easy to understand and keep track of the seeds which are sequenced and extended at the found positions.
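The map/shuffle/reduce pipeline described above can be sketched in miniature outside Hadoop. The code below is a simplified, single-machine illustration: merInfo is reduced to an (id, position, isRef) triple, the seed length is arbitrary, and all function names are ours rather than the project's.

```python
from collections import defaultdict

SEED_LEN = 4  # illustrative seed length; the real seed size is a tunable parameter

def map_fn(seq_id, sequence, is_ref):
    """Emit (seed, merInfo) pairs: every fixed-length substring of the
    sequence, keyed by the substring itself (merInfo is simplified here)."""
    for pos in range(len(sequence) - SEED_LEN + 1):
        seed = sequence[pos:pos + SEED_LEN]
        yield seed, (seq_id, pos, is_ref)

def shuffle(pairs):
    """Group values by key, as the sort-and-shuffle phase does."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_fn(seed, mer_infos):
    """Emit a <seed, list(merInfo)> pair with the values sorted."""
    return seed, sorted(mer_infos)

# A seed shared by a reference and a read ends up grouped under one key.
pairs = list(map_fn("ref1", "ACGTACG", True)) + list(map_fn("read1", "GTAC", False))
for seed, infos in sorted(shuffle(pairs).items()):
    print(reduce_fn(seed, infos))
```

The seed "GTAC" occurs in both inputs, so after the shuffle its reference and query occurrences arrive at the same reduce call, which is exactly what lets the extension step compare them.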
The job executed by the cluster shows a significant reduction in execution time and thus the performance varies depending on the number of nodes present in the cluster.
Step 3: The shuffle phase groups all values based on the key of each key-value pair; identify all exact matches with the reference sequences.
Step 4: Search for the optimal alignment; for each match, extend un-gapped alignments.
Step 5: Evaluate the alignment statistically; stop the extension when the k-value exceeds the threshold value (10%).
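The steps above can be sketched as a toy seed-and-extend search. This simplified illustration anchors a read at every exact match of its leading seed, then extends without gaps and abandons an alignment once the mismatch budget is exceeded; the function name and parameters are our own, not the project's.

```python
def seed_and_extend(reference, read, seed_len, max_mismatches):
    """Toy seed-and-extend: anchor the read at every exact occurrence of
    its first seed_len bases in the reference, then extend ungapped,
    counting mismatches and stopping once the budget is exceeded.

    Returns a list of (start_position, mismatch_count) hits.
    """
    seed = read[:seed_len]
    hits = []
    start = reference.find(seed)
    while start != -1:
        mismatches = 0
        ok = True
        for i, base in enumerate(read):
            if start + i >= len(reference):   # ran off the reference
                ok = False
                break
            if reference[start + i] != base:
                mismatches += 1
                if mismatches > max_mismatches:
                    ok = False                 # budget exceeded: abandon
                    break
        if ok:
            hits.append((start, mismatches))
        start = reference.find(seed, start + 1)
    return hits

# The seed "ACGT" occurs twice; only the second anchor extends cleanly.
print(seed_and_extend("ACGTACGTTT", "ACGTTT", 4, 1))  # -> [(4, 0)]
```

Abandoning an extension as soon as the budget is blown is what makes the filtration cheap: most spurious seed hits die after a handful of comparisons.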
Step 1: Get the length of the text and pattern.
Step 2: Loop from position = 0 to the last position of the pattern/text:
    if (text[position] == pattern[position]) then
        match += 2;
    else
        mismatch++;
        if (mismatch > k) then return badAlignment;
Step 3: Loop from i = 0 to i < mismatch:
    what[i] = 0;
distance[mismatch] = match;
what[mismatch] = 2;   // indicates end of matching
Step 4: Set values and return goodAlignment;
The Landau-Vishkin k-difference alignment algorithm is used in the reduce function, where the reference and query alignments obtained are aligned using this string-matching algorithm to obtain good matching alignment records.
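A direct transcription of the k-mismatch pseudocode above into Python might look as follows. This is a sketch for illustration: the tuple return convention is ours, and the what/distance bookkeeping arrays from the pseudocode are omitted for brevity.

```python
def k_mismatch_align(text, pattern, k):
    """Score two equal-position strings as in the pseudocode above:
    +2 per matching position, count mismatches, and reject as soon as
    more than k mismatches are seen.

    Returns (is_good, score, mismatches); the tuple shape is our choice.
    """
    length = min(len(text), len(pattern))   # Step 1: lengths
    match_score = 0
    mismatches = 0
    for position in range(length):          # Step 2: position-wise scan
        if text[position] == pattern[position]:
            match_score += 2
        else:
            mismatches += 1
            if mismatches > k:              # budget exceeded: bad alignment
                return (False, match_score, mismatches)
    return (True, match_score, mismatches)  # Steps 3-4: good alignment

print(k_mismatch_align("ACGTACGT", "ACGTTCGT", 1))  # -> (True, 14, 1)
```

With k = 0 the same pair is rejected at the first mismatch, which mirrors the early-exit behaviour of Step 2 in the pseudocode.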
The data to be processed is spread across different modules starting from the Convert Format module, where the content of the file is converted to byte format. The byte data is processed by the DNA Sequence module and produces data in Output Collector format. This data is now fed to the Print Sequence module, which produces the tabular result.
The complete data flow diagram of the DNA Sequence module is shown in Figure 3-5. The data flows step by step through the different parts of the module.
The data moved from one module to another is distributed widely during the processing of the system. Hence, optimized DNA sequencing is obtained from the overall execution of the program.
Figure 3-6 above shows the sequence of execution that takes place in the system during the period of execution.
The use case diagram shown in figure 3-8 shows the interactions between the modules and the external user accessibility.
Figure 3-9 gives a run-time view of the architecture, showing three types of address spaces: the application, the Name Node and the Data Node. An essential aspect of HDFS is that there are multiple instances of the Data Node. The application incorporates the HDFS client library into its address space; the client library manages all communication from the application to the Name Node and the Data Node. An HDFS cluster consists of a single Name Node, a master server that manages the file system namespace and regulates access to files by clients. In addition, there are a number of Data Nodes, usually one per computer node in the cluster, which manage the storage attached to the nodes that they run on. The Name Node and Data Node are pieces of software designed to run on commodity machines. These machines typically run a Linux operating system (OS). HDFS is built using the Java language; any machine that supports Java can run the Name Node or the Data Node software.
Usage of the Java language means that HDFS can be deployed on a wide range of machines. A typical deployment has a dedicated machine that runs only the Name Node software. Each of the other machines in the cluster runs one instance of the Data Node software. The architecture does not preclude running multiple Data Nodes on the same machine but in a real deployment that is rarely the case.
Chapter 4 IMPLEMENTATION
The first MapReduce program is alignall, which is started after a timer is started in the main function. The alignall module is divided into other modules, which are explained in the following section. After the MapReduce processing, if filter alignments is set to 1, the timer is restarted and the second MapReduce program is executed. The filter module has many sub-modules, which are explained in the following section. After the complete execution of the program, the output is printed and displayed in tabular form.
Figure 4-2 depicts the flow of the Format Converter program; execution begins by specifying the input files as parameters for the file-format conversion process. At the end of the processing, the file has been converted to a binary format.
The convert module reads the file until the end of file: each skip character is removed, and the sequences are converted to uppercase, appended and saved until the next skip character is encountered.
Figure 4-3 depicts the flow of the file-format conversion. At the end of processing, the sequences are read and sent to the save sequence module, which writes the converted format to a file.
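The conversion step described above can be sketched in plain Java. This is an illustrative reconstruction, not the project's actual Converter class: it treats '>' header lines as the skip character, uppercases the sequence lines, and concatenates each record's lines into one sequence.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of the convert step: parse FASTA-formatted text into
// uppercase sequences, one entry per '>' record.
public class FastaConvert {
    public static List<String> parse(String fasta) {
        List<String> sequences = new ArrayList<>();
        StringBuilder current = null;
        for (String line : fasta.split("\n")) {
            if (line.startsWith(">")) {            // skip character: start of a new record
                if (current != null) sequences.add(current.toString());
                current = new StringBuilder();
            } else if (current != null) {
                current.append(line.trim().toUpperCase()); // append, normalized to uppercase
            }
        }
        if (current != null) sequences.add(current.toString());
        return sequences;
    }

    public static void main(String[] args) {
        String fasta = ">read1\nacgt\nACGT\n>read2\nttga\n";
        System.out.println(parse(fasta)); // [ACGTACGT, TTGA]
    }
}
```

In the real module the resulting sequences are handed to the save sequence routine for binary output rather than kept in memory.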
The save sequence module is depicted in Figure 4-4. It reads to the end of the appended sequences, writes them to the binary file in <id, sequence> format, and saves the binary file. Once the reading is complete, the offset value is advanced and the save sequence module is called again with the appropriate offset until the full length of the file has been processed.
The Map class module is shown in Figure 4-6, where the map function is executed in parallel. The module takes the specified sequences and checks whether each sequence is of reference type or query type.
If it is a reference sequence, the sequence is read to the end and the initial left and right flank details are set. The input is checked for repeat sequences; if any are found, the corresponding seed matching the redundant value is obtained. Otherwise, the normal seed sequences are obtained and stored in the intermediate pair for further processing. If the sequence is of query type, the flanks are checked for reverse complements and the corresponding sequence is reverse-complemented in place. The sequences are then processed to the end and checked for any repeat seeds in the seed sequence; if present, the corresponding seed is obtained and stored. Otherwise, the seed sequence is obtained and stored in the intermediate pair for further processing.
The Reduce class module is shown in Figure 4-7. The Reduce function is executed in parallel, and values are read from the intermediate pair while values remain. The results obtained are stored in a variable merIn. The merIn variable is then checked and placed into the respective tuples, after which the align batch module is called for further processing.
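The seeding and reverse-complement operations described above can be illustrated with a small sketch. The seed length and the overlapping-window policy here are illustrative assumptions, not the project's exact parameters; the sketch emits (seed, offset) pairs as the intermediate data and shows the reverse complement used for query flanks.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hedged sketch of the Map phase's seed generation.
public class SeedMapper {
    // Emit (seed, offset) pairs for every window of length seedLen.
    public static List<Map.Entry<String, Integer>> seeds(String sequence, int seedLen) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (int i = 0; i + seedLen <= sequence.length(); i++) {
            out.add(new SimpleEntry<>(sequence.substring(i, i + seedLen), i));
        }
        return out;
    }

    // Reverse complement, applied to query sequences before seeding the reverse strand.
    public static String reverseComplement(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = s.length() - 1; i >= 0; i--) {
            switch (s.charAt(i)) {
                case 'A': sb.append('T'); break;
                case 'T': sb.append('A'); break;
                case 'C': sb.append('G'); break;
                case 'G': sb.append('C'); break;
                default:  sb.append('N'); // unknown bases become N
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(seeds("ACGTAC", 4));        // (ACGT,0) (CGTA,1) (GTAC,2)
        System.out.println(reverseComplement("AACG")); // CGTT
    }
}
```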
The align batch module is shown in Figure 4-8. It first gets the current query tuple, extends it against the reference tuple, and returns the extended full good alignment once the reads are matched within the allowed number of differences.
The obtained alignments are then checked, if filter alignments is specified, to find the unambiguous best alignments based on the number of differences allowed for the sequencing. This processing is performed for all the matched extended tuples to obtain the best alignment, and the second-best alignment when filter alignments is specified.
The extend module, shown in Figure 4-9, obtains the left query tuple, gets the real flank length, and performs the Landau-Vishkin extend function. The returned alignment information is checked: if the aligned length is -1, no alignment is returned. Otherwise, the reference start and the differences encountered are recorded, and the right query tuple is then processed in the same way.
At the end of processing, the full alignment is returned after setting the corresponding reference end position and the number of differences encountered.
The Landau-Vishkin extend module, shown in Figure 4-10, checks the allowed differences; if assigned, the reference and query tuples are obtained and processed for k-difference alignment.
The k-difference alignment module, shown in Figure 4-11, computes the dynamic-programming model for each text and pattern specified for processing. When the dynamic-programming model is ready, it returns the alignment based on the dynamic programming on the fly. The good alignment returned is then processed and stored so that the alignments are written to the output file.
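The k-difference idea can be sketched with a plain dynamic-programming edit distance standing in for the Landau-Vishkin formulation (which achieves the same result more efficiently by tracking only the furthest-reaching diagonals): an alignment is "good" only if pattern and text differ by at most k substitutions, insertions or deletions.

```java
// Illustrative k-difference check: returns the edit distance between a and b
// if it is <= k, otherwise -1 (meaning "no alignment").
public class KDifference {
    public static int align(String a, String b, int k) {
        int n = a.length(), m = b.length();
        int[][] d = new int[n + 1][m + 1];
        for (int i = 0; i <= n; i++) d[i][0] = i;   // all-deletion column
        for (int j = 0; j <= m; j++) d[0][j] = j;   // all-insertion row
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                int sub = d[i - 1][j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
                d[i][j] = Math.min(sub, Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1));
            }
        }
        return d[n][m] <= k ? d[n][m] : -1;         // -1 signals no alignment
    }

    public static void main(String[] args) {
        System.out.println(align("ACGT", "ACGT", 2));  // 0
        System.out.println(align("ACGT", "ACGGT", 1)); // 1 (one gap)
        System.out.println(align("ACGT", "TTTT", 2));  // -1 (3 differences)
    }
}
```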
The k-mismatch alignment module, shown in Figure 4-12, implements the k-mismatch string-matching algorithm of Landau and Vishkin. The module aligns the query sequences against the reference sequences, performing string matching with a specified number of allowed differences, and returns a good alignment if the number of mismatches is less than or equal to the specified limit; otherwise it returns a bad alignment.
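The good/bad decision above amounts to a Hamming-distance test with an early exit. A minimal sketch (the class and method names are ours, not the project's):

```java
// Count mismatches between a read and a same-length reference window; return
// the count if it is <= k (good alignment), otherwise -1 (bad alignment).
public class KMismatch {
    public static int mismatches(String read, String ref, int k) {
        if (read.length() != ref.length()) return -1; // windows must match in length
        int diffs = 0;
        for (int i = 0; i < read.length(); i++) {
            if (read.charAt(i) != ref.charAt(i) && ++diffs > k) {
                return -1;  // exceeded the allowed mismatches: bad alignment
            }
        }
        return diffs;       // good alignment: number of mismatches
    }

    public static void main(String[] args) {
        System.out.println(mismatches("ACGTA", "ACGTA", 2)); // 0
        System.out.println(mismatches("ACGTA", "ACCTA", 2)); // 1
        System.out.println(mismatches("ACGTA", "TTTTT", 2)); // -1
    }
}
```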
The filter alignment module is shown in Figure 4-13. The second MapReduce program is executed if filter alignment is specified in the parameters. First the job configuration is set up for processing; then the filter map class, filter combine class and filter reduce class are set and executed.
The implemented module processes the previously stored result values and obtains the corresponding unambiguous best alignments along with the second-best alignments. At the end of processing, the results are generated and stored in HDFS for further processing as necessary.
The filter reduce module, shown in Figure 4-14, obtains the processed results and reads their values. It compares the current alignment's difference count against the best alignment's and keeps the better of the two as the best alignment. The process is repeated to obtain the second-best alignment, and the final output is written to HDFS.
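The best/second-best selection can be sketched as a single pass over the difference counts of one read's alignments. This is our simplified illustration (real alignments carry positions and strands, not just counts); a tie for best marks the read as ambiguous.

```java
import java.util.Arrays;
import java.util.List;

// Hedged sketch of the filter reduce logic: from all alignments of one read,
// keep the best and second-best difference counts.
public class FilterReduce {
    // Returns {best, secondBest}, or null when two alignments tie for best
    // (the read is then ambiguous and is not reported).
    public static int[] bestTwo(List<Integer> differences) {
        int best = Integer.MAX_VALUE, second = Integer.MAX_VALUE;
        for (int d : differences) {
            if (d < best) { second = best; best = d; }
            else if (d < second) second = d;
        }
        if (second == best) return null; // ambiguous best alignment
        return new int[] { best, second };
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(bestTwo(Arrays.asList(3, 1, 2)))); // [1, 2]
        System.out.println(bestTwo(Arrays.asList(1, 1, 2)));                  // null
    }
}
```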
4.4 Print Sequences
The print sequences module prints the output from raw byte format in textual column form after processing the data obtained from HDFS. As shown in Figure 4-15, the file passed as a parameter is checked; if it is a directory, all the files present in that directory are listed, and finally each file is printed.
The print file module, shown in Figure 4-16, reads the content as <key, value> pairs and prints the records in tabular form until all values have been read.
Chapter 5 TESTING
Test Setup
Before a test case can be executed, the system in which the test scenarios are performed needs to be set up so that the test runs can be carried out properly.
1. Creation of a dedicated connection between the cluster nodes using SSH.
Test setup scenario: Observation scenario
Name of test: Check for initialization of cluster nodes
Description: Proper initialization of the Hadoop cluster
Expected output: Connection to nodes established
Actual output: Connection to nodes established
Remarks: Setup successful
2. Starting the Hadoop based Data Node over the cluster via the Master Node.
Test setup scenario: Observation scenario
Name of test: Check for initialization of the Data Node on all cluster nodes
Description: Proper initialization of the Data Nodes
Expected output: Data Nodes initialized
Actual output: Data Nodes initialized
Remarks: Setup successful
3. Starting the Hadoop Based Task Tracker over the cluster via the Master Node.
Test setup scenario: Observation scenario
Name of test: Check for initialization of the Task Tracker on all cluster nodes
Description: Proper initialization of the Task Tracker
Expected output: Task Tracker initialized
Actual output: Task Tracker initialized
Remarks: Setup successful
Once the system setup is complete, as depicted in the above test setup cases, the test cases can be executed and the results analyzed.
Test case scenario: Observation scenario
Name of test: Testing the Extension Convertor module
Feature being tested: Algorithm
Description: Proper working of the algorithm
Expected output: No errors
Actual output: No errors
Remarks: Test successful
Test case scenario: Observation scenario
Name of test: Testing the DNA Sequencer module
Feature being tested: Algorithm
Description: Proper working of the algorithm
Expected output: No errors
Actual output: No errors
Remarks: Test successful
Test case scenario: Observation scenario
Name of test: Testing the Print Alignment module
Feature being tested: Algorithm
Description: Proper working of the algorithm
Expected output: No errors
Actual output: No errors
Remarks: Test successful
Test case scenario: Observation scenario
Name of test: Tracking of a job over the web interface
Feature being tested: Feature
Description: Proper working of the algorithm
Expected output: No errors
Actual output: No errors
Remarks: Test successful
Chapter 6 RESULTS
Total Time Taken For Completion of Task: 787 seconds or 13 min 7 seconds
Figure 6-23 Snapshot: completion of job with displayed total running time
Total Time Taken For Completion of Task: 403 seconds or 6 min 43 seconds
Figure 6-46 Snapshot: completion of job with displayed total running time
Total Time Taken For Completion of Task: 221 seconds or 3 min 41 seconds
Figure 6-69 Snapshot: completion of job with displayed total running time
CONCLUSION
The runtime of the DNA sequencing algorithm over the Hadoop system is considered for single-node and multi-node configurations. Three different run-time scenarios were performed, and the changes in run time for different numbers of nodes were compared. The runtime results are as follows:
1. Runtime in a single-node Hadoop cluster: 787 seconds (13 min 7 s)
2. Runtime in a multi-node Hadoop cluster with 2 systems: 403 seconds (6 min 43 s)
3. Runtime in a multi-node Hadoop cluster with 4 systems: 221 seconds (3 min 41 s)
Figure 0-1 Graphical comparison of total processing time against the number of active systems
When the run time of the algorithm for the different test cases is plotted against the increasing number of nodes, it is evident that the run time is roughly inversely proportional to the number of active systems in the Hadoop cluster.
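The proportionality can be checked directly from the reported runtimes (787 s, 403 s and 221 s for 1, 2 and 4 nodes). A quick sketch (the helper class is ours, added only for illustration) computes the speedup factors speedup(n) = T(1) / T(n):

```java
// Speedup computed from the runtimes reported above.
public class Speedup {
    public static double speedup(double t1, double tn) {
        return t1 / tn;
    }

    public static void main(String[] args) {
        System.out.printf("2 nodes: %.2fx%n", speedup(787, 403)); // ~1.95x
        System.out.printf("4 nodes: %.2fx%n", speedup(787, 221)); // ~3.56x
    }
}
```

The measured speedups (about 1.95x with 2 nodes and 3.56x with 4 nodes) track the node count closely, which is what "inversely proportional" predicts.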
Thus we can conclude that implementing the algorithm over the Hadoop interface using the MapReduce programming technique yields a substantial reduction in processing time for large datasets, with run time decreasing roughly in proportion to the number of nodes.
FUTURE ENHANCEMENTS
1. Building a complete interactive web interface for the system, i.e., a complete cloud interfacing system.
2. Automating the process of setting up individual systems to be included in the Hadoop cluster.
3. Creating a self-monitoring process that runs in the background and monitors the state of each node, so that in case of node failure it automatically resolves the issue or flags the system for immediate maintenance.
TCGGAGCTG
3. Reduce neither the number of gaps nor the number of mismatches (2 mismatches, 6 matches, 4 gaps):
TCAG-ACGATTG
TC-GGA-GCTG
4. Same as 3, but one base (or gap) moved (1 mismatch, 7 matches, 4 gaps):
TCAG-ACGATTG
TC-GGA-GCT-G
Methods of DNA Sequencing
There are basically four methods of sequencing.
Maxam-Gilbert method (chemical sequencing): This was the first DNA sequencing method developed, based on chemical modification of DNA. It used purified DNA directly, which required radioactive labeling, after which the DNA was treated chemically. The fragments are electrophoresed side by side in gels for size separation. To visualize the fragments, the gel is exposed to X-ray film for autoradiography, yielding a series of dark bands, each corresponding to a radiolabeled DNA fragment, from which the sequence may be inferred. The extensive use of hazardous chemicals and the complexity of the technical procedure made it difficult to scale up.
Sanger di-deoxy method (chain termination): The principles of DNA replication were used by Sanger in developing this process. It required that each read start be cloned to produce single-stranded DNA. It used a single-stranded DNA template and modified nucleotides that terminate DNA strand elongation, resulting in DNA fragments of varying length. The newly synthesized, labeled DNA fragments were then heated and separated by size using gel electrophoresis. The DNA bands are visualized by UV light or autoradiography, and the DNA sequence can be read directly off the X-ray film or gel image; the relative positions of the different bands among the four lanes are used to read the sequence. Because it is more efficient and uses fewer toxic chemicals and lower amounts of radioactivity than the previous method, it rapidly became the method of choice. These methods have greatly simplified DNA sequencing.
Shotgun sequencing: This is a method for determining the sequence of a very large piece of DNA. The basic DNA sequencing reaction can only obtain the sequence of a few hundred nucleotides. The large fragment is shotgun cloned, and then each of the resulting smaller clones (subclones) is sequenced. By finding where the subclones overlap, the sequence of the larger piece becomes apparent. It does not require prior information about the sequence, and it can be used for DNA molecules as large as entire chromosomes. The ends of these fragments overlap and, when aligned properly by a genome assembly program, can be used to reconstruct the complete genome. It yields sequence data quickly, but the task of assembling the fragments can be quite complicated for larger genomes.
For a genome as large as the human genome, it may take many days of CPU time on large-memory, multiprocessor computers to assemble the fragments, and the resulting assembly will usually contain numerous gaps that have to be filled in later.
Primer walking: An alternative to shotgun sequencing is primer walking. Following the initial sequence determination, primed from a region of known sequence, subsequent primers are designed. These primers then serve as sequencing start points, each establishing an additional >500 bp of sequence data. New primers are synthesized for the newly established sequence in the template DNA, and the process continues. The advantage is that extensive sub-cloning is not required. The amount of overlap or coverage required is also decreased because the direction and location of the new sequence are known, substantially decreasing the effort needed to assemble the final sequence. The drawbacks, however, are the amount of time required for each step in the primer walk and the need to design a robust primer for every step. Hence, this method is mainly used to fill gaps in a sequence that has been determined by shotgun cloning.
How Are the Sequences Generated
An automated sequencing gel: a DNA replication reaction is run in a test tube in the presence of trace amounts of all four dideoxy terminator nucleotides. Electrophoresis is used to separate the resulting fragments by size, and that is how the sequence is 'read': the colors march past in order.
In a large-scale sequencing lab, a machine runs the electrophoresis step and monitors the different colors as they come out. Since about 2001, these machines, not surprisingly called automated DNA sequencers, have used capillary electrophoresis, in which the fragments are piped through a tiny glass-fiber capillary during the electrophoresis step and come out the far end in size order. At left is a screen shot of a real fragment of a sequencing gel (this one from an older model of sequencer, but the concepts are identical). The four colors red, green, blue and yellow each represent one of the four nucleotides. The actual gel image would be perhaps 3 or 4 meters long and 30 or 40 cm wide. The sequence does not even have to be 'read' from the gel; the computer does that.
Eg: This is a plot of the colors detected in one 'lane' of a gel (one sample), scanned from smallest fragments to largest. The computer interprets the colors by printing the nucleotide sequence across the top of the plot. This is just a fragment of the entire file, which would span around 900 or so nucleotides of accurate sequence. The sequencer also gives the operator a text file containing just the nucleotide sequence, without the color traces.
BLAST: Basic Local Alignment Search Tool
BLAST is an algorithm for comparing primary biological sequence information, such as the amino-acid sequences of different proteins or the nucleotides of DNA sequences. It enables a researcher to compare a query sequence with a library or database of sequences, and identify library sequences that resemble the query sequence above a certain threshold. Different types of BLAST are available according to the query sequences. For example, following the discovery of a previously unknown gene in the mouse, a scientist will typically perform a BLAST search of the human genome to see if humans carry a similar gene; BLAST will identify sequences in the human genome that resemble the mouse gene based on sequence similarity. It is one of the most widely used bioinformatics programs because it addresses a fundamental problem, and the algorithm emphasizes speed over sensitivity. It is faster, but it cannot "guarantee the optimal alignments of the query and database sequences" as Smith-Waterman does. It searches only for the more significant patterns in the sequences, but with comparable sensitivity. It is also often used as part of other algorithms that require approximate sequence matching.
BLAST takes input sequences in FASTA or GenBank format, while the output may be formatted as HTML, plain text or XML. Sometimes the results are given in a graphical format showing the hits found, a table of sequence identifiers for the hits with scoring-related data, and alignments for the sequence of interest and the hits with their corresponding BLAST scores. The easiest to read and most informative of these is probably the table.
DNA Sequencing Applications and Approaches
DNA sequencing can be used for a variety of applications, including:
- De novo sequencing of genomes
- Detection of variants (SNPs) and mutations
- Biological identification
- Confirmation of clone constructs
- Detection of methylation events
- Gene expression studies
- Detection of copy number variation
It is often said that the goal of sequencing a genome is to obtain information about the complete set of genes in that particular genome. The proportion of a genome that encodes genes may be very small, but it is not always possible (or desirable) to sequence only the coding regions separately. Also, as scientists understand more about the role of noncoding DNA (often referred to as junk DNA), it will become more important to have a complete genome sequence as a background for understanding the genetics and biology of any given organism. A few of the major uses of DNA sequencing are:
- Diagnosing diseases: comparing normal sequences to sequences in people with genetic illnesses to determine which parts of the genome are involved in the disease.
- Forensic genetics: comparing DNA left at a crime scene to the DNA sequences of suspects or victims.
- Paternity tests: matching the DNA of parents and children to determine how they are related.
The Map and Reduce functions of MapReduce are both defined with respect to data structured in (key, value) pairs. Map takes one pair of data with a type in one data domain, and returns a list of pairs in a different domain:

Map(k1, v1) -> list(k2, v2)

The Map function is applied in parallel to every pair in the input dataset. This produces a list of pairs for each call. After that, the MapReduce framework collects all pairs with the same key from all lists and groups them together, thus creating one group for each of the different generated keys. The Reduce function is then applied in parallel to each group, which in turn produces a collection of values in the same domain:

Reduce(k2, list(v2)) -> list(v3)

Each Reduce call typically produces either one value v3 or an empty return, though one call is allowed to return more than one value. The returns of all calls are collected as the desired result list. Thus the MapReduce framework transforms a list of (key, value) pairs into a list of values.

MapReduce allows for distributed processing of the map and reduction operations. The parallelism also offers some possibility of recovering from partial failure of servers or storage during the operation: if one mapper or reducer fails, the work can be rescheduled, assuming the input data is still available. MapReduce achieves reliability by parceling out a number of operations on the set of data to each node in the network. Each node is expected to report back periodically with completed work and status updates. If a node falls silent for longer than that interval, the master node records the node as dead and sends the node's assigned work to other nodes.

The application defines the following functions:
- an input reader
- a Map function
- a partition function
- a compare function
- a Reduce function
- an output writer
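The Map(k1,v1) -> list(k2,v2) and Reduce(k2, list(v2)) -> list(v3) contract above can be demonstrated with a small in-memory simulation. The example job (word counting) is ours, added purely to illustrate the data flow; a real framework distributes the map, shuffle and reduce steps across machines.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory illustration of the MapReduce contract, using word counting.
public class MiniMapReduce {
    // Map: one input line -> list of (word, 1) intermediate pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String w : line.split("\\s+")) {
            if (!w.isEmpty()) out.add(new java.util.AbstractMap.SimpleEntry<>(w, 1));
        }
        return out;
    }

    // Reduce: (word, list of counts) -> total count.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) sum += v;
        return sum;
    }

    public static Map<String, Integer> run(List<String> lines) {
        // Shuffle: group intermediate pairs by key (normally the framework's job).
        Map<String, List<Integer>> groups = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> kv : map(line)) {
                groups.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
            }
        }
        // Reduce each group independently (and, in a real cluster, in parallel).
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> g : groups.entrySet()) {
            result.put(g.getKey(), reduce(g.getKey(), g.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("a b a", "b a"))); // {a=3, b=2}
    }
}
```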
INPUT READER
The input reader divides the input into appropriately sized 'splits' (in practice typically 16 MB to 128 MB), and the framework assigns one split to each Map function. It reads data from stable storage and generates key/value pairs.

Map function: It takes a series of key/value pairs, processes each, and generates zero or more output key/value pairs. The input and output types of the map can be (and often are) different from each other.

Partition function: Each Map function output is allocated to a particular reducer by the application's partition function. The partition function is given the key and the number of reducers and returns the index of the desired reducer. A typical default is to hash the key and take it modulo the number of reducers. Between the map and reduce stages, the data is shuffled (parallel-sorted / exchanged between nodes) in order to move the data from the map node that produced it to the node where it will be reduced. The shuffle can sometimes take longer than the computation time, depending on network bandwidth, CPU speeds, the amount of data produced, and the time taken by the map and reduce computations.

Comparison function: The input for each Reduce is pulled from the machine where the Map ran and sorted using the application's comparison function.

Reduce function: The framework calls the application's Reduce function once for each unique key in sorted order. The Reduce can iterate through the values associated with that key and produce zero or more outputs.

Output writer: The output writer writes the output of the Reduce to stable storage, usually a distributed file system.
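The "hash the key and take it modulo the number of reducers" default mentioned above can be sketched as a standalone method (this mirrors the behaviour of Hadoop's default hash partitioner, written here in plain Java):

```java
// Default-style partition function: key hash modulo reducer count.
public class Partitioner {
    public static int partition(String key, int numReducers) {
        // Mask the sign bit so a negative hashCode still yields a valid index.
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    public static void main(String[] args) {
        for (String k : new String[] { "chr1", "chr2", "chr3" }) {
            System.out.println(k + " -> reducer " + partition(k, 4));
        }
    }
}
```

All pairs with the same key hash to the same reducer index, which is exactly what the grouping step requires.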
RMAP
RMAP is designed to accurately map reads from next-generation sequencing technology. It can map reads with or without error-probability information (quality scores) and supports paired-end read mapping. There are no limitations on the number of mismatches, and it can map more than 8 million reads in an hour at full sensitivity to 2 mismatches. It maps sequencing reads to their genomic locations. The read length must be at most 64 bp and should not be shorter than 20 bp. The user must specify a maximum number of mismatches permitted between a read and the genomic location to which it maps. For example, when mapping reads of length 36 bp with RMAP, it might be desirable to allow up to 3 mismatches in the mapping to account for sequencing errors or single nucleotide polymorphisms (SNPs). For 50 bp reads, it might be desirable to allow, e.g., 5 mismatches.
Specifying the reads
Reads must be specified in one of two ways:
FASTA format sequence file, e.g.:
>1_168_0365_0364
GTTAAAAGTATGTGTGTCCTATGTCCTCAAGA
>1_168_0021_0625
TTTTATACACTTCAAAAAAAAAAAACCCTAGA
......
Solexa sequencing probability score files, in which four numbers per base represent the negative log-transform of the probabilities of the four nucleotides (A, C, G, T) being sequenced at that base position.
For example:
-40 -40 40 -40  40 -40 -40 -40  40 -40 -40 -40  40 -40 -40 -40  -40 -40 -40 40  -40 -40 40 -40  -40 40 -40 -40  -40 -40 -40 40  -40 40 -40 -40  -40 40 -40 -40  -40 -40 40 -40 ......

Each read should have at least a minimum length that is specified by the user as a command-line argument. Reads with length exceeding the specified limit will be truncated at the 3'-end. Any characters other than {A,C,G,T,a,c,g,t} in the reads will automatically be transformed into 'N'. When counting mismatches between a read and some genome location, an 'N' mismatches whatever character it aligns with.
Specifying the genome
The genome can be specified in two ways: as a single chromosome file, or as a directory containing multiple files, one for each chromosome. The chromosome files must be in FASTA format, but must contain only a single FASTA sequence (i.e. only the first line of the file can start with the '>' character). This is the format of the files downloadable from the UCSC Genome Browser. When specifying a directory containing multiple chromosomes, a filename suffix is also required to indicate which files in the directory are to be searched.
How matches are scored
Scoring is simply a count of mismatches, and fewer mismatches are better. For RMAP, the mismatches are counted over all bases used in mapping. If, for a given read, no location in the genome matches the read with fewer than the specified number of mismatches, nothing is reported for that read. If there is a "good" match, the best match is reported, along with the number of mismatches. These matches are reported in BED format:

chromosome  start  end  name  score  strand

For example:

chr1  153728548  153728583  s_2_0100_1  1  +
If more than one genomic sequence matches the read with the fewest number of mismatches, then the read is considered ambiguous, and is not reported. There is an option to have the names of all ambiguous reads reported in a separate file.
Install Sun Java 6 JDK
$ sudo apt-get install sun-java6-jdk
$ sudo update-java-alternatives -s java-6-sun
The full JDK will be placed in /usr/lib/jvm/java-6-sun (this directory is actually a symlink on Ubuntu). After installation, check whether Sun's JDK is correctly set up:
user@ubuntu:~$ java -version
java version "1.6.0_20"
Java(TM) SE Runtime Environment (build 1.6.0_20-b02)
Java HotSpot(TM) Client VM (build 16.3-b01, mixed mode, sharing)
Configuring SSH
Hadoop requires SSH access to manage its nodes, i.e. remote machines plus your local machine if you want to use Hadoop on it. For our single-node setup of Hadoop, we need to configure SSH access to localhost. First, we need to install ssh on our system; then we generate an SSH key for the user.
user@ubuntu:~$ sudo apt-get install ssh
user@ubuntu:~$ su - user
user@ubuntu:~$ ssh-keygen -t dsa -P ""
Generating public/private dsa key pair.
Enter file in which to save the key (/home/user/.ssh/id_dsa):
Created directory '/home/user/.ssh'.
Your identification has been saved in /home/user/.ssh/id_dsa.
Your public key has been saved in /home/user/.ssh/id_dsa.pub.
The key fingerprint is:
9b:82:ea:58:b4:e0:35:d7:ff:19:66:a6:ef:ae:0e:d2 user@ubuntu
The key's randomart image is:
user@ubuntu:~$
The second line will create a DSA key pair with an empty password. Generally, using an empty password is not recommended, but in this case it is needed to unlock the key without your interaction (you don't want to enter the passphrase every time Hadoop interacts with its nodes).
Second, you have to enable SSH access to your local machine with this newly created key.
user@ubuntu:~$ cat $HOME/.ssh/id_dsa.pub >> $HOME/.ssh/authorized_keys
The final step is to test the SSH setup by connecting to your local machine as the user. This step is also needed to save your local machine's host key fingerprint to the user's known_hosts file.
user@ubuntu:~$ ssh localhost
The authenticity of host 'localhost (::1)' can't be established.
RSA key fingerprint is d7:87:25:47:ae:02:00:eb:1d:75:4f:bb:44:f9:36:26.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'localhost' (DSA) to the list of known hosts.
Linux ubuntu 2.6.32-22-generic #33-Ubuntu SMP Wed Apr 28 13:27:30 UTC 2010 i686 GNU/Linux
Ubuntu 10.04 LTS
user@ubuntu:~$
You have to reboot your machine in order for the changes to take effect.
Hadoop Installation
You have to download Hadoop from the Apache Download Mirrors and extract the contents of the Hadoop package to a location of your choice. I picked /usr/local/hadoop.
$ cd /usr/local
$ sudo tar xzf hadoop-0.20.2.tar.gz
$ sudo mv hadoop-0.20.2 hadoop
Update $HOME/.bashrc
Add the following lines to the end of the $HOME/.bashrc file of user. If you use a shell other than bash, you should of course update the appropriate configuration files instead of .bashrc.
# Set Hadoop-related environment variables
export HADOOP_HOME=/usr/local/hadoop
# Set JAVA_HOME (we will also configure JAVA_HOME directly for Hadoop later on)
export JAVA_HOME=/usr/lib/jvm/java-6-sun
# Add Hadoop bin/ directory to PATH
export PATH=$PATH:$HADOOP_HOME/bin
You can repeat this exercise for other users who want to use Hadoop.
Configuration
hadoop-env.sh
The only required environment variable we have to configure for Hadoop is JAVA_HOME. Open conf/hadoop-env.sh in the editor of your choice (if you used the installation path in this tutorial, the full path is /usr/local/hadoop/conf/hadoop-env.sh) and set the JAVA_HOME environment variable to the Sun JDK/JRE 6 directory. Change
# The java implementation to use. Required.
# export JAVA_HOME=/usr/lib/j2sdk1.5-sun
to
# The java implementation to use. Required.
export JAVA_HOME=/usr/lib/jvm/java-6-sun
conf/*-site.xml
In this section, we will configure the directory where Hadoop will store its data files, the network ports it listens to, etc. Our setup will use Hadoop's Distributed File System, HDFS. Now we create the directory and set the required ownerships and permissions:
$ sudo mkdir -p /app/hadoop/tmp
$ sudo chown hduser:hadoop /app/hadoop/tmp
# ...and if you want to tighten up security, chmod from 755 to 750...
$ sudo chmod 750 /app/hadoop/tmp
If you forget to set the required ownerships and permissions, you will see a java.io.IOException when you try to format the name node (in the next section). Add the following snippets between the <configuration> ... </configuration> tags in the respective configuration XML files.
In file conf/core-site.xml:
<!-- In: conf/core-site.xml -->
<property>
  <name>hadoop.tmp.dir</name>
  <value>/app/hadoop/tmp</value>
  <description>A base for other temporary directories.</description>
</property>
<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:54310</value>
  <description>The name of the default file system. A URI whose scheme and authority determine the FileSystem implementation. The uri's scheme determines the config property (fs.SCHEME.impl) naming
2011-2012
Page 72
theFileSystem implementation class. The uri's authority is used to determine the host, port, etc. for a filesystem.</description> </property>
In file conf/mapred-site.xml:
<!-- In: conf/mapred-site.xml -->
<property>
  <name>mapred.job.tracker</name>
  <value>localhost:54311</value>
  <description>The host and port that the MapReduce job tracker runs
  at. If "local", then jobs are run in-process as a single map
  and reduce task.
  </description>
</property>
In file conf/hdfs-site.xml:
<!-- In: conf/hdfs-site.xml -->
<property>
  <name>dfs.replication</name>
  <value>1</value>
  <description>Default block replication.
  The actual number of replications can be specified when the file is created.
  The default is used if replication is not specified in create time.
  </description>
</property>
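A quick way to catch copy-paste mistakes in these snippets is to check that each site file is still well-formed XML. A minimal sketch, written against a scratch copy of core-site.xml and assuming python3 is available on the machine:

```shell
# Sketch: write the core-site.xml snippet to a scratch file and verify
# that it parses as well-formed XML (python3 assumed available).
f=$(mktemp)
cat > "$f" <<'EOF'
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:54310</value>
  </property>
</configuration>
EOF
python3 -c "import xml.dom.minidom; xml.dom.minidom.parse('$f')" \
  && echo "well-formed"
```

The same check works for mapred-site.xml and hdfs-site.xml; `xmllint --noout <file>` is an alternative if libxml2 is installed.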
Running bin/start-all.sh will start up a Name Node, Data Node, Job Tracker and a Task Tracker on your machine.

Stopping your single-node cluster

Run the command bin/stop-all.sh to stop all the daemons on your machine.
Networking

We will assume the IP address 192.168.0.1 to be that of the master machine and 192.168.0.2 that of the slave machine. Change the IP addresses according to your systems.
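These name-to-address mappings go into /etc/hosts below; as a sketch, here they are written to a scratch file (so it can run without root) and read back:

```shell
# Sketch: write the cluster host entries to a scratch hosts file and
# read back the IP for "master" (the real file would be /etc/hosts).
hosts=$(mktemp)
cat > "$hosts" <<'EOF'
192.168.0.1    master
192.168.0.2    slave
EOF
awk '$2 == "master" {print $1}' "$hosts"   # prints 192.168.0.1
```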
Update /etc/hosts on both machines with the following lines:

# /etc/hosts (for master AND slave)
192.168.0.1    master    # <master system IP> <master system name>
192.168.0.2    slave     # <slave system IP>  <slave system name>

SSH access

The user on the master (user@master) must be able to connect a) to its own user account on the master, i.e. ssh master in this context and not necessarily ssh localhost, and b) to the user account on the slave (user@slave) via a password-less SSH login. We just have to add user@master's public SSH key (which should be in $HOME/.ssh/id_dsa.pub) to the authorized_keys file of user@slave (in this user's $HOME/.ssh/authorized_keys).
You can do this manually or use the following SSH command:

user@master:~$ ssh-copy-id -i $HOME/.ssh/id_dsa.pub user@slave
This command will prompt you for the login password for user on slave, then copy the public SSH key for you, creating the correct directory and fixing the permissions as necessary. The final step is to test the SSH setup by connecting with user from the master to the user account on the slave. This step is also needed to save the slave's host key fingerprint to user@master's known_hosts file.
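What ssh-copy-id does on the slave can be sketched locally: append the public key to authorized_keys and fix the directory and file permissions. This simulation uses scratch files and placeholder key material, not a real SSH setup:

```shell
# Simulated ssh-copy-id: scratch .ssh directory, placeholder key material.
sshdir=$(mktemp -d)
pubkey='ssh-dss AAAAB3...placeholder... user@master'   # not a real key
mkdir -p "$sshdir/.ssh"
printf '%s\n' "$pubkey" >> "$sshdir/.ssh/authorized_keys"
# sshd refuses keys with loose permissions, so tighten them:
chmod 700 "$sshdir/.ssh"
chmod 600 "$sshdir/.ssh/authorized_keys"
```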
So, connecting from master to master:

user@master:~$ ssh master
The authenticity of host 'master (192.168.0.1)' can't be established.
RSA key fingerprint is 3b:21:b3:c0:21:5c:7c:54:2f:1e:2d:96:79:eb:7f:95.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'master' (RSA) to the list of known hosts.
Linux master 2.6.20-16-386 #2 Thu Jun 7 20:16:13 UTC 2007 i686
...
user@master:~$

and from master to slave:
user@master:~$ ssh slave
The authenticity of host 'slave (192.168.0.2)' can't be established.
DSA key fingerprint is 74:d7:61:86:db:86:8f:31:90:9c:68:b0:13:88:52:72.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'slave' (DSA) to the list of known hosts.
Ubuntu 10.04
...
user@slave:~$
conf/masters (master only)

The conf/masters file defines the machines on which Hadoop will start secondary Name Nodes in our multi-node cluster. In our case, this is just the master machine. The primary Name Node and the Job Tracker will always be the machines on which you run the bin/start-dfs.sh and bin/start-mapred.sh scripts, respectively.
On the master system, update conf/masters so that it looks like this:

master    <name of master system>

conf/slaves (master only)

The conf/slaves file lists the hosts, one per line, where the Hadoop slave daemons (Data Nodes and Task Trackers) will be run. On the master system, update conf/slaves so that it looks like this:

master    <name of master system>
slave     <name of each slave system, on its own line>

conf/*-site.xml (all machines)

Assuming you configured each machine as described in the single-node cluster installation, you will only have to change a few variables.
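Before moving on to the site files, the conf/masters and conf/slaves files described above can be sketched in a scratch conf directory (the hostnames master and slave are the placeholders used throughout this section):

```shell
# Sketch: create scratch conf/masters and conf/slaves files; the real
# files live under /usr/local/hadoop/conf in this tutorial's layout.
conf=$(mktemp -d)
echo 'master' > "$conf/masters"          # secondary Name Node host
printf 'master\nslave\n' > "$conf/slaves"  # one slave daemon host per line
cat "$conf/slaves"
```

Note that master appears in conf/slaves too: in this setup the master machine also runs a Data Node and a Task Tracker.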
First, we have to change the fs.default.name variable (in conf/core-site.xml), which specifies the Name Node (the HDFS master) host and port. In our case, this is the master machine.

<!-- In: conf/core-site.xml -->
<property>
  <name>fs.default.name</name>
  <value>hdfs://master:54310</value>
  <description>The name of the default file system. A URI whose
  scheme and authority determine the FileSystem implementation. The
  uri's scheme determines the config property (fs.SCHEME.impl) naming
  the FileSystem implementation class. The uri's authority is used to
  determine the host, port, etc. for a filesystem.</description>
</property>

Second, we have to change the mapred.job.tracker variable (in conf/mapred-site.xml), which specifies the Job Tracker (MapReduce master) host and port. Again, this is the master in our case.

<!-- In: conf/mapred-site.xml -->
<property>
  <name>mapred.job.tracker</name>
  <value>master:54311</value>
  <description>The host and port that the MapReduce job tracker runs
  at. If "local", then jobs are run in-process as a single map
  and reduce task.
  </description>
</property>

Third, we change the dfs.replication variable (in conf/hdfs-site.xml), which specifies the default block replication. It defines how many machines a single file's blocks should be replicated to before they become available. The default value of dfs.replication is 3. However, we should always keep this value less than or equal to the number of Data Nodes in the cluster.

<!-- In: conf/hdfs-site.xml -->
<property>
  <name>dfs.replication</name>
  <value>2</value>
  <description>Default block replication.
  The actual number of replications can be specified when the file is created.
  The default is used if replication is not specified in create time.
  </description>
</property>

Starting the multi-node cluster
Starting the cluster is done in two steps. First, the HDFS daemons are started: the Name Node daemon is started on master, and Data Node daemons are started on all slaves. Second, the MapReduce daemons are started: the Job Tracker is started on master, and Task Tracker daemons are started on all slaves.

HDFS daemons

Run the command bin/start-dfs.sh on the master system:

user@master:/usr/local/hadoop$ bin/start-dfs.sh

MapReduce daemons

Run the command bin/start-mapred.sh on the master system:

user@master:/usr/local/hadoop$ bin/start-mapred.sh

Stopping the multi-node cluster

Like starting the cluster, stopping it is done in two steps; the workflow, however, is the opposite of starting.

MapReduce daemons

Run the command bin/stop-mapred.sh on the master system:

user@master:/usr/local/hadoop$ bin/stop-mapred.sh

HDFS daemons

Run the command bin/stop-dfs.sh on the master system:

user@master:/usr/local/hadoop$ bin/stop-dfs.sh
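The ordering above (HDFS first when starting, MapReduce first when stopping) can be captured in a hypothetical wrapper function. The script names echoed here are the real ones under $HADOOP_HOME/bin, but the function itself is only an illustration; it echoes the scripts rather than running them:

```shell
# Hypothetical wrapper illustrating the start/stop ordering; it only
# echoes the scripts that would be run, in the correct order.
cluster() {
  case "$1" in
    start) echo "bin/start-dfs.sh"; echo "bin/start-mapred.sh" ;;
    stop)  echo "bin/stop-mapred.sh"; echo "bin/stop-dfs.sh" ;;
  esac
}
cluster start   # HDFS daemons come up before MapReduce daemons
cluster stop    # MapReduce daemons go down before HDFS daemons
```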