Problems:
If the build fails because zlib.h cannot be found, install the zlib development headers:
sudo apt-get install zlib1g-dev
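A quick way to confirm the header is present before running make is sketched below. The function name find_header and the default include directories are assumptions for illustration, not part of MOSAIK.

```shell
# find_header NAME [DIR...]: print the first path where header NAME exists
# and return 0; return 1 when it is not found in any DIR.
# The default directories are an assumption about a typical Linux layout.
find_header() {
    name=$1
    shift
    # Fall back to common include directories when none are given.
    [ "$#" -gt 0 ] || set -- /usr/include /usr/local/include
    for dir in "$@"; do
        if [ -f "$dir/$name" ]; then
            printf '%s\n' "$dir/$name"
            return 0
        fi
    done
    return 1
}

# Example: warn before building if zlib.h is missing.
# find_header zlib.h >/dev/null || echo "zlib.h not found: install zlib1g-dev" >&2
```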
The workflow consists of supplying sequences in FASTA, FASTQ, Illumina Bustard & Gerald, or
SRF file formats and producing results in the BAM format (a binary format for storing sequence
data). A BAM file (.bam) is the binary version of a SAM file. A SAM file (.sam) is a tab-delimited
text file that contains sequence alignment data.
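Because SAM is tab-delimited text, standard tools can read it directly. The sketch below prints the read name, reference name, position and mapping quality of each alignment; the function name sam_summary is hypothetical, and the column positions follow the standard SAM layout (QNAME, FLAG, RNAME, POS, MAPQ, ...).

```shell
# sam_summary FILE: print QNAME, RNAME, POS and MAPQ for each alignment line.
# Lines starting with "@" are SAM header lines and are skipped.
# Alignment lines carry at least 11 mandatory tab-separated fields.
sam_summary() {
    awk -F '\t' '!/^@/ && NF >= 11 { print $1, $3, $4, $5 }' "$1"
}
```

A BAM file has to be converted to SAM text first, for example with samtools view -h, assuming samtools is installed.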
Build phase:
Step 1: Convert the reference into binary format (e.coli.fa to e.coli.dat)
Step 2: For references larger than 100 million base pairs, build a jump database (e.coli.dat to e.coli.15)
Step 3: Convert the reads to binary format (mate1.fq to read.mkb)
MosaikBuild
MosaikBuild converts various sequence formats into MOSAIK's native read format.
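The two MosaikBuild conversions (steps 1 and 3 of the build phase) might look like the dry-run sketch below, which only prints the commands. The flags -fr, -oa, -q, -out and -st are assumptions recalled from MOSAIK's demo script and should be checked against the tool's built-in help; the function name print_build_commands and the MOSAIK_BIN variable are hypothetical.

```shell
# Dry-run sketch of the build phase: prints the commands instead of running them.
# ASSUMPTION: the MosaikBuild flags (-fr, -oa, -q, -out, -st) are recalled from
# MOSAIK's demo script; confirm them with the tool's help output before use.
MOSAIK_BIN=${MOSAIK_BIN:-../bin}

print_build_commands() {
    ref_fa=$1     # e.g. reference/e.coli.fa
    ref_dat=$2    # e.g. reference/e.coli.dat
    reads_fq=$3   # e.g. fastq/mate1.fq
    reads_mkb=$4  # e.g. fastq/read.mkb
    # Step 1: reference FASTA -> binary reference
    echo "$MOSAIK_BIN/MosaikBuild -fr $ref_fa -oa $ref_dat"
    # Step 3: reads -> binary read archive (Illumina reads assumed)
    echo "$MOSAIK_BIN/MosaikBuild -q $reads_fq -out $reads_mkb -st illumina"
}
```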
Build.sh
# You may need to create the jump database for a large genome (> 100 million base pairs)
#../bin/MosaikJump -ia reference/e.coli.dat -hs 15 -out reference/e.coli.15
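Whether the jump database is needed can be decided by counting the bases in the reference FASTA. This is a minimal sketch; the function names fasta_bases and needs_jump_db are hypothetical, and the 100 million default is the cut-off quoted in these notes.

```shell
# fasta_bases FILE: count sequence characters in a FASTA file,
# ignoring header lines (">...") and whitespace.
fasta_bases() {
    awk '!/^>/ { gsub(/[ \t\r]/, ""); n += length($0) } END { printf "%.0f\n", n }' "$1"
}

# needs_jump_db FILE [THRESHOLD]: succeed when the reference has more than
# THRESHOLD bases (default 100000000).
needs_jump_db() {
    [ "$(fasta_bases "$1")" -gt "${2:-100000000}" ]
}

# Example, using the MosaikJump command from Build.sh:
# if needs_jump_db reference/e.coli.fa; then
#     ../bin/MosaikJump -ia reference/e.coli.dat -hs 15 -out reference/e.coli.15
# fi
```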
Align phase:
Pass the reads (read.mkb) and the reference (e.coli.dat), both in binary format, to
MosaikAligner. For references larger than 100 million base pairs, also pass the jump
database created in the build phase.
MosaikAligner
#!/bin/sh
### Create ../bin/MosaikAligner
# cd ../src
# make
###
# Align the reads
ANN_PATH=../src/networkFile
# You may need to use the jump database for a large genome (> 100 million base pairs)
#../bin/MosaikAligner -in fastq/read.mkb -out fastq/read.mka -ia reference/e.coli.dat \
#    -annpe $ANN_PATH/2.1.26.pe.100.0065.ann -annse $ANN_PATH/2.1.26.se.100.005.ann \
#    -j reference/e.coli.15
Note
1. pe.ann and se.ann are in MOSAIK/src/networkFile/.
2. read.mka.bam is the resulting BAM file; the other BAM files written by the aligner
serve other purposes.
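The align command can be assembled so that -j is added only when a jump database exists. This is a sketch that prints the command rather than running it; the function name aligner_command is hypothetical, while the options and .ann file names are copied from the script in these notes.

```shell
# aligner_command READS OUT REF [JUMP_PREFIX]: print a MosaikAligner command line.
# The -j option is appended only when a jump database prefix is given.
ANN_PATH=${ANN_PATH:-../src/networkFile}

aligner_command() {
    cmd="../bin/MosaikAligner -in $1 -out $2 -ia $3"
    cmd="$cmd -annpe $ANN_PATH/2.1.26.pe.100.0065.ann"
    cmd="$cmd -annse $ANN_PATH/2.1.26.se.100.005.ann"
    if [ -n "${4:-}" ]; then
        cmd="$cmd -j $4"
    fi
    printf '%s\n' "$cmd"
}

# Example with a jump database:
# aligner_command fastq/read.mkb fastq/read.mka reference/e.coli.dat reference/e.coli.15
```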
Final output (gold.sam)
EXTRAS:
What's new?
1. A new neural net for mapping quality (MQ) calibration is introduced. Initial testing
using simulated reads shows that this method improves accuracy compared to the
previous MQ scheme.
2. The overall alignment speed is much quicker now due to a banded Smith-Waterman
algorithm implementation. Longer Roche 454 reads align much quicker than before.
3. A local alignment search option has been added to help rescue mates in
paired-end/mate-pair reads that may be missing due to highly repetitive regions in the
genome.
4. SOLiD support has finally come of age. MOSAIK imports and aligns SOLiD
reads in colorspace, but now seamlessly converts the alignments back into basespace.
No more downstream bioinformatics headaches.
5. Robust support for the BAM alignment file format.
6. The command line parameters have been cleaned up and sensible default
parameters have been chosen. This cuts down the ridiculously long command lines to
simply specifying an input file and an output file in most cases.
MOSAIK is multithreaded. If you have a machine with 8 processors, you can use all 8
processors to align reads faster while using the same memory footprint as when using one
processor.
System Requirements
1. Hardware:
a) 64-bit x86-64 CPUs with SSE instructions.
b) 8 GB of main memory (for a genome as large as the human genome).
c) 8 GB of hard disk space (for a genome as large as the human genome).
2. Software:
a) 64-bit Linux system (kernel >=2.6).
Installation:
1. Download SOAPaligner from http://soap.genomics.org.cn/soapaligner.html
2. In the Linux console, type:
cd <TheDirectoryYouPutTheTarball>
tar zxvf SOAPaligner.tar.gz
cd SOAPaligner
3. The directory contains two executables: 2bwt-builder (formats the reference) and soap
(aligns reads). The same executables are available at
https://github.com/gigascience/bgi-soap2/tree/master/executables/2.21/x86_64
To run SOAPaligner, we need to build index files for the reference genome, and then search
reads against the formatted index files.
STEPS:
1. Format the reference sequence:
<ExecutablePath>/2bwt-builder <FastaPath/YourFasta>
e.g.: ./2bwt-builder reference/human_genome.fa
The directory will then contain 13 index files. They share a prefix made of your FASTA file
name with .index appended, e.g. human_genome.fa.index.
The suffixes are *.amb, *.ann, *.bwt, *.fmv, *.hot, *.lkt, *.pac, *.rev.bwt, *.rev.fmv, *.rev.lkt,
*.rev.pac, *.sa, and *.sai.
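Before aligning, it can help to confirm that all 13 index files were written. The sketch below lists any that are missing for a given index prefix; the function name missing_index_files is hypothetical, and the suffix list is the one given in these notes.

```shell
# missing_index_files PREFIX: print each expected index file that is absent.
# PREFIX is the index prefix, e.g. reference/human_genome.fa.index.
# Prints nothing when all 13 files exist.
missing_index_files() {
    for suffix in amb ann bwt fmv hot lkt pac rev.bwt rev.fmv rev.lkt rev.pac sa sai; do
        [ -f "$1.$suffix" ] || printf '%s\n' "$1.$suffix"
    done
}

# Example:
# missing_index_files reference/human_genome.fa.index
```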
2. Align reads:
e.g.:
./soap -a read/mate1.fq -b read/mate2.fq -D reference/human_genome.fa.index \
    -o <PE_output> -2 <SE_output> -m <min_insert_size> -x <max_insert_size>
NOTE: The -D option accepts only the prefix of your index files, such as
~/human_genome.fa.index.
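A paired-end command with the placeholders filled in can be generated as sketched below. The function name soap_pe_command is hypothetical, and the output names and insert sizes in the example are made-up values; only the options -a, -b, -D, -o, -2, -m and -x come from these notes.

```shell
# soap_pe_command READS_A READS_B INDEX_PREFIX PE_OUT SE_OUT MIN_INS MAX_INS
# Print a paired-end soap command line.
# Fails when the minimum insert size is larger than the maximum.
soap_pe_command() {
    if [ "$6" -gt "$7" ]; then
        echo "min insert size ($6) exceeds max insert size ($7)" >&2
        return 1
    fi
    printf './soap -a %s -b %s -D %s -o %s -2 %s -m %s -x %s\n' \
        "$1" "$2" "$3" "$4" "$5" "$6" "$7"
}

# Example (output names and insert sizes are illustrative only):
# soap_pe_command read/mate1.fq read/mate2.fq reference/human_genome.fa.index pe.out se.out 400 600
```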
3. Options:
SOAPaligner needs about 2 hours to format the reference sequence and build the index tables.
RAM usage depends on the total size of the reference sequence. For the human
reference genome, it occupies about 7 GB of RAM.
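On Linux, the available memory can be compared against the 7 GB figure before starting. This is a sketch under the assumption that /proc/meminfo is present; the function names mem_total_kb and has_ram_gb are hypothetical.

```shell
# mem_total_kb [MEMINFO]: print total RAM in kB from a /proc/meminfo-style
# file (Linux only; the path can be overridden, e.g. for testing).
mem_total_kb() {
    awk '$1 == "MemTotal:" { print $2 }' "${1:-/proc/meminfo}"
}

# has_ram_gb GB [MEMINFO]: succeed when total RAM is at least GB gigabytes
# (1 GB taken as 1024 * 1024 kB).
has_ram_gb() {
    [ "$(mem_total_kb "${2:-/proc/meminfo}")" -ge $(( $1 * 1024 * 1024 )) ]
}

# Example:
# has_ram_gb 7 || echo "less than 7 GB of RAM: human-genome alignment may fail" >&2
```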
Table 1. Performance of aligning 1 million single-end reads (35bp read length) or 1 million read
pairs onto the human reference genome
Future Development