.. PVAAS documentation master file, created by
sphinx-quickstart on Thu Jan 30 15:40:51 2014.
You can adapt this file completely to your liking, but it should at least
contain the root `toctree` directive.
.. image:: _static/pvaas_logo.png
:height: 300 px
:width: 1400 px
:scale: 35 %
:alt: logo.png
Introduction
====================
Alternative splicing is a precisely regulated process that involves interactions between cis- and trans-acting
regulatory elements. Factors (such as genetic mutation) that disrupt these interactions result in aberrant splicings.
Deep transcriptome sequencing (RNA-seq) has the capability to interrogate both "aberrant splicing variants" and "sequence variants (SNP, indel, etc)" simultaneously, thus provides the unique opportunity
to identify "sequence variants" that directly associated with aberrant splicing. According annotation status, splicings junctions detected from RNA-seq can be divided into 3 categories:
* **Canonical splicing**: both 3' splice site (3'SS) and 5' splice site (5'SS) are known.
* **Semi-canonical splicing**: only one of the 3'SS and 5'SS is known, the other splice site is novel.
* **Novel splicing**: both 3'SS and 5'SS are novel.
We define semi-canonical splicing and novel splicing as **aberrant splicing**, as both of them use spurious splicing sites.
PVAAS (**Program to identify Variants Associated with Aberrant Splicing**) is such program to identify single nucleotide variants and indels that are associated with aberrant splicing.
As shown in example below, G->A mutations are strongly associated with the usage of 5' spurious splicing site.
.. image:: _static/Figure1_concept.png
:height: 350 px
:width: 700 px
:scale: 100 %
:alt: concept.png
Installation
=============
Prerequisites
--------------------------------------------------------
#. `gcc `_
#. `python2.7.* `_
#. `numpy and scipy `_
#. `cython `_
#. `Xcode `_ (MacOS only)
Check dependencies
------------------------------------------------------------
::
# check gcc
$ gcc --version
Configured with: --prefix=/Applications/Xcode.app/Contents/Developer/usr --with-gxx-include-dir=/usr/include/c++/4.2.1
Apple LLVM version 5.0 (clang-500.2.79) (based on LLVM 3.3svn)
Target: x86_64-apple-darwin12.5.0
Thread model: posix
# check python version
$ python --version
Python 2.7.2
# check numpy and scipy
python -c 'import numpy'
python -c 'import scipy'
# check cython
cython --version
Cython version 0.19.2
Download
----------
**PVAAS** program is available `here `_
Install
----------
::
$ tar zxf PVAAS-VERSION.tar.gz
$ cd PVAAS-VERSION
$ python setup.py install
# or you can install PVAAS to a specified location:
$ python setup.py install --root=/some/where
Usage
=======
Usage: pvaas.py [options]
Options:
--version show program's version number and exit
-h, --help show this help message and exit
-i FILE_BAM, --input=FILE_BAM
BAM file. BAM file should be indexed and sorted using
`SAMtools `_.
-g GENE_BED, --gene=GENE_BED
Referencde gene models in `BED
`_
format. Must be standard 12 column. File can be plain
text or gzipped.
-r GENOME_FA, --genome=GENOME_FA
Reference genome sequence file in `FASTA
`_
format. File can be plain text or gzipped. '-r' and
'-v' are mututally exclusive.
-v VARIANT_VCF, --vcf=VARIANT_VCF
`VCF `_ file containing user-defined variant.
File can be plain text or gzipped. '-v' and '-r' are
mututally exclusive.
-o OUTPUT_PREFIX, --output=OUTPUT_PREFIX
Prefix of output file. default=pvaas_output
-q MIN_MAPQ, --mapq=MIN_MAPQ
Minimum `mapping quality
`_. Mapping
quality is Phred-scaled probability of the alignment
being wrong. default=30
-m MIN_SEQQ, --seqq=MIN_SEQQ
Minimum `sequence quality
`_. Sequence
quality is Phread-scaled probability of the base
calling being wrong. default=30
-p P_VAL, --pvalue=P_VAL
Fisher exact test pvalue (adjusted using benjamini-
hochberg procedure) cutoff. default=0.05
-d MIN_DIST_SIZE, --dist=MIN_DIST_SIZE
Minimum distance between variant and the termini of
read. default=5
-c MIN_COVERAGE, --cvg=MIN_COVERAGE
Minimum number of reads supporting one allele.
default=2
-k, --clean Remove intermediate '.sortName.bam'
-f, --vcffilter Turn on VCF filter (only varaint marked with "PASS"
will be used.
Results
=============
PVAAS generates these files
**prefix.spliceFrag.bam**
------------------------------------------------------------
This BAM file contains splice alignments extracted from the original BAM file. For single-end RNA-seq, only the alignment records of spliced reads were extracted;
For pair-end RNA-seq, if one read were spliced, the alignment records of both reads (read1 and read2) would be extracted from the original BAM file for downstream analysis.
Each alignment was annotated using the provided gene model (i.e. BED file specified by '-g'), in particular, three different tags ("KK","KN","NN")
were appended to each splice read (and its mate) as:
* **KK** (Known-Known): Both 5' splice site (5'SS) and 3' splice site (3'SS) are annotated (named canonical splicing). You might see multiple "KK" tags if read was spliced multiple times or both ends were spliced. ( eg: KK:Z:chr1:14829-14969.)
* **KN** (Known-Novel): One splice site is known, the other splice site is novel (named semi-canonical splicing). (eg. KN:Z:chr1:783186-784863).
* **NN** (Novel-Novel): Both splice sites are novel (named novel splicing).
Example:
::
UNC14-SN744:268:C10DDACXX:4:1308:13818:35728/2 163 chr9 14660 50 48M = 14929 457 CTTCCGCTCCTTGAAGCTGGTCTCCACCCACTGCTGGTTCCGTCACCC 8=?D:DDDHAHDBE@HGIBHE@`_.
Only variants with adjusted P value < cutoff (specified by -p) were included.
::
chrom position ref_base AF(all) AF(canonical) AF(semi-canonical) AF(novel) 2x2(semi_canonical) BH_pval(semi-canonical) 2x2(novel) BH_pval(novel)
chr9 96320999 C A19,C149 C147 A19 C2 [[147,0],[0,19]] 1.4001656652e-24 [[147,0],[2,0]] 1.0
Running PVAAS
======================
Download these files
---------------------
* `hg19.fa.gz `_ (Human reference genome in FASTA format)
* `hg19.All.bed12.gz `_ (Human reference gene models in BED format)
* `chr9.sorted.bam `_ (Read alignment (BAM) file)
* `chr9.sorted.bam.bai `_ (Index file)
run PVAAS with de novo SNP calling
-------------------------------------
::
python2.7 pvaas.py -i chr9.sorted.bam -r hg19.fa.gz -g hg19.All.bed12.gz -o output
@ 2014-07-15 11:51:20: Running PVAAS v0.1.4
@ 2014-07-15 11:51:20: Loading reference genome hg19.fa
@ 2014-07-15 11:53:28: Reading reference bed file: hg19.All.bed12
@ 2014-07-15 11:53:52: Sorting chr9.sorted.bam by name (read id)
@ 2014-07-15 11:57:27: Save sorted chr9.sorted.bam to output.sortName.bam
@ 2014-07-15 11:57:27: Extracting and annotating splice read and its mate
@ 2014-07-15 11:59:32: Sorting output.spliceFrag.bam by coordinates
@ 2014-07-15 11:59:51: Indexing output.spliceFrag.bam
@ 2014-07-15 11:59:53: Searching for mutation associated with aberrant splicing ...
@ 2014-07-15 13:32:25: Correcting P values using Benjamini & Hochberg's procedure
NOTE:
#. Reference genome sequences (i.e. hg19.fa.gz) must be provided to invoke SNP calling.
#. PVAAS does not perform indel calling. If users want to associate indels to aberrant splicing, they need to perform indel calling by themselves, then run PVAAS with a VCF file (see below).
run PVAAS with a user defined VCF file
---------------------------------------
::
python2.7 pvaas.py -i chr9.sorted.bam -v test.vcf -g hg19.All.bed12.gz -o output
@ 2014-11-12 12:22:43: Running PVAAS v0.1.5
@ 2014-11-12 12:22:43: Reading reference bed model: hg19.refseq.bed12
@ 2014-11-12 12:22:45: Sorting chr9.sorted.bam by name (read id)
@ 2014-11-12 12:27:19: Save sorted chr9.sorted.bam to output.sortName.bam
@ 2014-11-12 12:27:19: Extracting and annotating splice read and its mate
@ 2014-11-12 12:29:10: Sorting output.spliceFrag.bam by coordinates
@ 2014-11-12 12:29:41: Indexing output.spliceFrag.bam
@ 2014-11-12 12:30:10: Correcting P values using Benjamini & Hochberg's procedure
Example
======================
In below screenshot taken from UCSC genome browser, 'A' is significantly associated with aberrant splicing (i.e. extended 5'SS). "Blue" bars indicate reads mapped to '+' strand, and "red" bars
indicate reads mapped '-' strand.
.. image:: _static/Supp_Figure_1.png
:height: 350 px
:width: 700 px
:scale: 100 %
:alt: concept.png
Contact
==========
Liguo Wang: wangliguo78_AT_gmail.com
.. sectionauthor:: Guido van Rossum