Anda di halaman 1dari 7

Topic/Title: HISAT2 for RNA-seq Reads Alignment Against Human Population

Description: Guide to use HISAT2 for RNA-seq reads alignment against


human population and visualisation of mapped reads. Methods for gene
expression analysis in HISAT2 explained.
70 character description: Guide to HISAT2 for RNA-seq reads alignment
against human population.
Social media description: HISAT2 for processing sequencing data with
analysis and visualization of sequencing data.
Tool ID: 100125
Tool: Fast and sensitive aligner for mapping next-generation sequencing
reads to single reference genome as well as to a human population.
Other tools: Cufflinks, TopHat, BBMap
SEO: hisat2-rna-seq-reads-alignment-human-population
Keywords: HISAT2 for RNA-seq reads alignment against human population,
RNA-Seq reads mapping, HISAT2 and Tophat2,
mapping nextgeneration sequencing reads, align to genome, aligning against human
population, Analysing RNA-Seq data, RNA-Seq bioinformatics tool, RNA-Seq
Data Analysis
Fig Link: https://insidedna.me/tool_page_assets/tutorials/tutorial41/hisat2-rna-seq-readsalignment-human-population-fig2.png
Fig Description: HISAT2 for RNA-seq reads alignment against human
population screen

HISAT2 is a fast alignment program for mapping next-generation sequencing reads


(both DNA and RNA).In one of our tutorials we described how to use TopHat
mapper. In this tutorial we will show how to use HISAT2 for RNA-Seq reads mapping.
We decided to describe alternative alignment tool because HISAT2 is faster, more
computationally efficient and has some interesting features, such as draft SNVs
detection in the reads. Lets try it!
First of all, we need to open terminal window on the InsideDNA platform
(TERMINAL button at the top of InsideDNA homepage).

Type several commands in the command line:


mkdir hisat2
to create working directory
cd hisat2
to move into working directory
Now we need to download several files for further analysis. It is easy to download
them with wget command in the command line. However this will not give us any
information about databases structure, so we will discuss how to make it step by
step.
Human reference genome (hg38 assembly) can be downloaded from UCSC
genome browser (https://genome.ucsc.edu/). On the home page you need to click
on Downloads ->Genome Data button.

You will see a list of different species genomes and associated data. On this page
you need to find Human reference genome (hg38, GRCh38) and click on Data set
by chromosome button.

There you will see list of reference FASTA files. We will download only one
chromosome of reference genome (chr10.fa.gz file). This will make our further
computational steps faster.

Also we need to download file which contains common SNPs observed in human
population. Go one step back and then click on Annotation dataset.

From the list of annotation files choose snp.147.txt.gz (contains dbSNP build 147
data).
Third file that we need is Human RNA-Seq data file which contains raw reads
without coordinates. We will use the data from ENCODE Consortium project page of
UCSC (https://genome.ucsc.edu/ENCODE/). The aim of ENCODE project was to
identify functional elements in the human genome, so one may find the data on
gene expression, regulation, DNA modification, chromatin structure etc. on this
page.

You need to click on Downloads button and then choose one of RNA-Seq data links
(each link corresponds to the place where this experiment was made) in the list.

Each file from the following list (see below) corresponds to experiment on specific
cell line. From this list we need to choose one file in FASTQ format (for example,
wgEncodeCaltechRnaSeqGm12878R1x75dFastqRep1.fastq.gz file).

All of these files must be uploaded in hisat2 directory which we created earlier
(FILES ->Upload files)

We need to extract text files from archive GZ files so run gunzip[filename]


commands for all files that we uploaded except for snp147.txt.gz. We will treat
snp147.txt.gz in a different manner because this file is the heaviest one and it takes
quite a long time to extract snp147.txt file from the archive. Moreover, we dont
need most of the known polymorphisms for further analysis because we use only
one reference chromosome. We will use zrep command to search for string pattern
in compressed snp147.txt file. With the following command we will extract all SNPs
which belong to 10th chromosome:
zgrep$'chr10\t' snp147.txt.gz > /data/userXXX/hisat2/chr10.snps
where$'chr10\t' defines our search pattern (chr10 + TAB); chr10.snps - output
file and userXXX- your user name.
On the next step we need to change variant file format to make it acceptable for
hisat2-buildindexer:
isub -t snps -c 16 -r 60 -e '/srv/dna_tools/hisat22.0.4/hisat2_extract_snps_haplotypes_UCSC.py /data/userXXX/hisat2/chr10.fa
/data/userXXX/hisat2/chr10.snps /data/userXXX/hisat2/snps_list'
where snps_listargument specifies the name of output file (snps_list.snp) where
common variants are written in the new format; chr10.snps input file with
common variants; chr10.fa reference genome file. To see the format of output
variant file type less snps_list.snp.

First column represents SNP identifier (NCBI ID), second column type of variant
(single nucleotide polymorphism, insertion or deletion), third column chromosome,
fourth column positon, last column alternative form. If one wants to obtain such
variant
file
from
VCF
file

use
another
script
hisat2_extract_snps_haplotypes_VCF.py with the same parameters.
Actually, the file with known SNPs (snps_list.snp) is optional for HISAT2; however we
will use it in our pipeline to show one interesting feature of HISAT2 aligner. Prior to
read alignment we need to create an index files for our reference genome. hisat2build command is used for that purpose. It takes not only reference FASTA file as an
input, but also utilizes information about known SNPs to build HISAT2 index.
isub -t his_build -c 16 -r 60 -e '/srv/dna_tools/hisat2-2.0.4/hisat2-build
--snp/data/userXXX/hisat2/snps_list.snp/data/userXXX/hisat2/chr10.fa
/data/userXXX/hisat2/refer'
where--snp parameter specifies input variant file (snps_list.snp); chr10.fa
reference FASTA file; refer - basename of the index files to write. After long step of
HISAT2 index building you will see 8 files refer.n.ht2 where n is a number from 1 to
8 in the working directory (ls command). Now we can run hisat2 aligner itself:
isub -t hisat -c 16 -r 60 -e '/srv/dna_tools/hisat2-2.0.4/hisat2 x
/data/user540/hisat2/refer U
/data/user540/hisat2/wgEncodeCaltechRnaSeqGm12878R1x75dFastqRep1.fastq
-S /data/user540/hisat2/alignment.sam'
wherex refer parameter specifies basename of HISAT2 index; -S alignment.sam
specifies output alignment file (by default output alignment is written in SAM
format) -U specifies input FASTAQ file with unpaired reads. For those who aligns
paired-end reads -1 and -2 arguments are used for FASTAQ files with first and
second mate reads respectively.
Eventually we have an alignment file (alignment.sam) with reads coordinates on
the reference genome to read this file type less alignment.sam. The structure of
SAM
files
was
described
in
one
of
our
tutorials
(https://insidedna.me/tutorials/view/samtools-commands-tutorial-working-sam-bamfiles) and in the SAM format specification section of SAM tools site
(https://samtools.github.io/hts-specs/SAMv1.pdf) so we will skip this format
description.
Finally, I would like to mention that for some reads last column of our alignment file
contains string of the following structure: Zs:Z:<S><SNP ID>.

HISAT2 uses information about known variants and outputs all possible known SNPs
which were observed in read sequences. It doesnt necessarily mean that this SNV
is present in our sample (variant calling algorithms make more accurate
predictions), however, this feature of HISAT2 gives us a draft picture of observed
variants. For example, the string "Zs:Z:1|S|rs3747203,97|S|rs16990981indicates
that second base of the read corresponds to a known SNP (ID: rs3747203). 97 bases
after the third base (the base after the second one), the read at 100th base involves
another known SNP (ID: rs16990981). 'S' indicates a single nucleotide
polymorphism. 'D' and 'I' indicate a deletion and an insertion, respectively.

Anda mungkin juga menyukai