You will see a list of genomes of different species and associated data. On this page,
find the Human reference genome (hg38, GRCh38) and click the "Data set by
chromosome" button.
There you will see a list of reference FASTA files. We will download only one
chromosome of the reference genome (the chr10.fa.gz file), which will make the
subsequent computational steps faster.
We also need to download a file that contains common SNPs observed in the human
population. Go one step back and then click on "Annotation dataset".
From the list of annotation files, choose snp.147.txt.gz (it contains dbSNP build 147
data).
The third file we need is a human RNA-Seq data file that contains raw reads
without coordinates. We will use data from the ENCODE Consortium project page at
UCSC (https://genome.ucsc.edu/ENCODE/). The aim of the ENCODE project was to
identify functional elements in the human genome, so this page offers data on
gene expression, regulation, DNA modification, chromatin structure, etc.
Click the Downloads button and then choose one of the RNA-Seq data links in the list
(each link corresponds to the place where the experiment was performed).
Each file in the list that opens corresponds to an experiment on a specific
cell line. From this list, choose one file in FASTQ format (for example,
wgEncodeCaltechRnaSeqGm12878R1x75dFastqRep1.fastq.gz).
All of these files must be uploaded to the hisat2 directory that we created earlier
(FILES -> Upload files).
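Once the files are in place, it can be useful to check that the FASTQ file is intact
before aligning it. The following is a minimal sketch, assuming the common layout of
exactly four lines per read (header starting with '@', sequence, '+' line, qualities).
The function names and the demo records are made up for illustration.

```python
import gzip
import io


def count_fastq_reads(handle):
    """Count records in a FASTQ text stream, assuming 4 lines per record."""
    n_lines = 0
    for line in handle:
        # Every first line of a 4-line record must be a header starting with '@'.
        if n_lines % 4 == 0 and not line.startswith("@"):
            raise ValueError(f"line {n_lines + 1}: expected a '@' header line")
        n_lines += 1
    if n_lines % 4 != 0:
        raise ValueError("truncated FASTQ: line count is not a multiple of 4")
    return n_lines // 4


def count_reads_in_gz(path):
    """Same check for a gzip-compressed FASTQ file on disk."""
    with gzip.open(path, "rt") as handle:
        return count_fastq_reads(handle)


# Two invented records, only to show the call.
demo = "@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\nIIII\n"
print(count_fastq_reads(io.StringIO(demo)))  # prints 2
```

For the real download you would call count_reads_in_gz() with the path of the
.fastq.gz file.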
The first column is the SNP identifier (NCBI ID), the second column is the type of
variant (single nucleotide polymorphism, insertion or deletion), the third column is
the chromosome, the fourth column is the position, and the last column is the
alternative form. To obtain such a variant file from a VCF file, use another script,
hisat2_extract_snps_haplotypes_VCF.py, with the same parameters.
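The five-column layout described above can be read with a few lines of Python. This is
only a sketch: it assumes the columns are separated by whitespace, and the record name
SnpRecord, the function parse_snp_line and the example line are invented for
illustration (the SNP ID and position are not real data).

```python
from typing import NamedTuple


class SnpRecord(NamedTuple):
    snp_id: str        # column 1: SNP identifier
    variant_type: str  # column 2: type of variant
    chrom: str         # column 3: chromosome
    position: int      # column 4: position
    alt: str           # column 5: alternative form


def parse_snp_line(line):
    """Split one line of the variant file into its five columns."""
    fields = line.split()
    if len(fields) != 5:
        raise ValueError(f"expected 5 columns, got {len(fields)}")
    snp_id, variant_type, chrom, position, alt = fields
    return SnpRecord(snp_id, variant_type, chrom, int(position), alt)


# Invented example line; the values are placeholders, not dbSNP data.
record = parse_snp_line("rs0000001\tsingle\t10\t12345\tA\n")
print(record.snp_id, record.chrom, record.position, record.alt)
```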
Strictly speaking, the file with known SNPs (snps_list.snp) is optional for HISAT2;
however, we will use it in our pipeline to show one interesting feature of the HISAT2
aligner. Prior to read alignment we need to create index files for our reference
genome. The hisat2-build command is used for that purpose. It takes not only the
reference FASTA file as input, but also uses information about known SNPs to build
the HISAT2 index.
isub -t his_build -c 16 -r 60 -e '/srv/dna_tools/hisat2-2.0.4/hisat2-build
--snp /data/userXXX/hisat2/snps_list.snp /data/userXXX/hisat2/chr10.fa
/data/userXXX/hisat2/refer'
where the --snp parameter specifies the input variant file (snps_list.snp), chr10.fa
is the reference FASTA file, and refer is the basename of the index files to write.
After the long step of HISAT2 index building, you will see 8 files named refer.n.ht2,
where n is a number from 1 to 8, in the working directory (use the ls command). Now
we can run the hisat2 aligner itself:
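Instead of checking the ls output by eye, the presence of the eight index files can be
verified programmatically. The sketch below assumes only the naming scheme stated
above (refer.1.ht2 to refer.8.ht2); the helper names are hypothetical, and the demo
uses a temporary directory with empty placeholder files rather than a real index.

```python
import tempfile
from pathlib import Path


def expected_index_files(basename, n_files=8):
    """Names of the index files, e.g. refer.1.ht2 ... refer.8.ht2."""
    return [f"{basename}.{n}.ht2" for n in range(1, n_files + 1)]


def missing_index_files(directory, basename):
    """Return the expected index file names that are absent from directory."""
    directory = Path(directory)
    return [name for name in expected_index_files(basename)
            if not (directory / name).is_file()]


# Demo: create only 7 of the 8 placeholder files and report the missing one.
with tempfile.TemporaryDirectory() as tmp:
    for name in expected_index_files("refer")[:7]:
        (Path(tmp) / name).touch()
    demo_missing = missing_index_files(tmp, "refer")

print(demo_missing)  # prints ['refer.8.ht2']
```

On the server you would call missing_index_files() with the hisat2 directory and the
basename refer; an empty list means all eight files are present.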
isub -t hisat -c 16 -r 60 -e '/srv/dna_tools/hisat2-2.0.4/hisat2 -x
/data/user540/hisat2/refer -U
/data/user540/hisat2/wgEncodeCaltechRnaSeqGm12878R1x75dFastqRep1.fastq
-S /data/user540/hisat2/alignment.sam'
where the -x refer parameter specifies the basename of the HISAT2 index; -U specifies
the input FASTQ file with unpaired reads; and -S alignment.sam specifies the output
alignment file (by default the output alignment is written in SAM format). For
paired-end reads, the -1 and -2 arguments are used for the FASTQ files with the first
and second mate reads, respectively.
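The difference between the unpaired and paired-end invocations can be summarised in a
small helper that assembles the argument list. This is an illustrative sketch, not part
of HISAT2: build_hisat2_command is a hypothetical function, the file names are
placeholders, and it only covers the -x, -U, -1, -2 and -S options discussed here. As a
simplification it accepts either unpaired or paired input, not a mix of both.

```python
import shlex


def build_hisat2_command(index_base, output_sam, unpaired=None,
                         mate1=None, mate2=None, hisat2="hisat2"):
    """Assemble a hisat2 argument list for unpaired or paired-end reads."""
    cmd = [hisat2, "-x", index_base]
    if unpaired is not None and mate1 is None and mate2 is None:
        cmd += ["-U", unpaired]            # single file with unpaired reads
    elif unpaired is None and mate1 is not None and mate2 is not None:
        cmd += ["-1", mate1, "-2", mate2]  # first and second mate files
    else:
        raise ValueError("give either 'unpaired' or both 'mate1' and 'mate2'")
    cmd += ["-S", output_sam]              # output alignment in SAM format
    return cmd


# Placeholder file names, shown as the shell command they correspond to.
single = build_hisat2_command("refer", "alignment.sam", unpaired="reads.fastq")
paired = build_hisat2_command("refer", "alignment.sam",
                              mate1="reads_1.fastq", mate2="reads_2.fastq")
print(shlex.join(single))
print(shlex.join(paired))
```

The resulting string can be placed inside the quotes of the isub -e argument, or the
list can be passed to subprocess.run() on a machine where hisat2 is installed.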
Eventually we have an alignment file (alignment.sam) with read coordinates on
the reference genome. To read this file, type less alignment.sam. The structure of
SAM files was described in one of our tutorials
(https://insidedna.me/tutorials/view/samtools-commands-tutorial-working-sam-bamfiles)
and in the SAM format specification on the SAMtools site
(https://samtools.github.io/hts-specs/SAMv1.pdf), so we will skip the format
description.
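As a quick alternative to paging through the file with less, the alignment records can
be tallied with a short script. The sketch below relies on two facts from the SAM
specification: header lines start with '@', and bit 0x4 of the FLAG field (second
column) marks an unmapped read. The function name and the two demo lines are invented
for illustration. Note that it counts alignment records, so a read reported more than
once is counted more than once.

```python
def summarize_sam(lines):
    """Return (mapped, unmapped) counts of alignment records in SAM text."""
    mapped = unmapped = 0
    for line in lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            raise ValueError("SAM alignment lines have at least 11 fields")
        flag = int(fields[1])
        if flag & 0x4:  # 0x4: segment unmapped
            unmapped += 1
        else:
            mapped += 1
    return mapped, unmapped


# Two made-up records: one aligned to chr10, one unmapped.
demo_sam = [
    "@HD\tVN:1.0\tSO:unsorted\n",
    "read1\t0\tchr10\t100\t60\t4M\t*\t0\t0\tACGT\tIIII\n",
    "read2\t4\t*\t0\t0\t*\t*\t0\t0\tTTGA\tIIII\n",
]
print(summarize_sam(demo_sam))  # prints (1, 1)
```

For the real file, pass an open file handle, for example
summarize_sam(open("alignment.sam")).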
Finally, I would like to mention that for some reads the last column of our alignment
file contains a string of the following structure: Zs:Z:<S><SNP ID>.
HISAT2 uses information about known variants and reports all known SNPs
that were observed in the read sequences. This doesn't necessarily mean that the
variant is present in our sample (variant calling algorithms make more accurate
predictions); however, this feature of HISAT2 gives us a draft picture of the observed
variants. For example, the string "Zs:Z:1|S|rs3747203,97|S|rs16990981" indicates
that the second base of the read corresponds to a known SNP (ID: rs3747203). 97 bases
after the third base (the base after the second one), the read at the 100th base
involves another known SNP (ID: rs16990981). 'S' indicates a single nucleotide
polymorphism. 'D' and 'I' indicate a deletion and an insertion, respectively.
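The offset arithmetic in this example can be reproduced with a short script. The sketch
below is an illustration, not HISAT2 code: the function names are hypothetical, and the
position calculation follows only the rule described above for 'S' entries (each offset
is counted from the base after the previous SNP). How offsets accumulate after 'D' and
'I' entries is not covered here, so the second function rejects them.

```python
def parse_zs_tag(tag):
    """Split a Zs:Z: string into (offset, type, snp_id) tuples."""
    prefix = "Zs:Z:"
    if not tag.startswith(prefix):
        raise ValueError("not a Zs:Z: tag")
    entries = []
    for item in tag[len(prefix):].split(","):
        offset, variant_type, snp_id = item.split("|")
        entries.append((int(offset), variant_type, snp_id))
    return entries


def snp_read_positions(entries):
    """1-based read positions of the SNPs, for tags with 'S' entries only."""
    positions = []
    next_base = 1  # 1-based position of the first base not yet accounted for
    for offset, variant_type, snp_id in entries:
        if variant_type != "S":
            raise ValueError("this sketch handles only 'S' entries")
        position = next_base + offset
        positions.append((position, snp_id))
        next_base = position + 1  # continue counting after the SNP base
    return positions


entries = parse_zs_tag("Zs:Z:1|S|rs3747203,97|S|rs16990981")
print(snp_read_positions(entries))
# prints [(2, 'rs3747203'), (100, 'rs16990981')]
```

The output matches the description above: the first SNP falls on the second base of
the read and the second SNP on the 100th base.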