.. PVAAS documentation master file, created by sphinx-quickstart on Thu Jan 30 15:40:51 2014. You can adapt this file completely to your liking, but it should at least contain the root `toctree` directive. .. image:: _static/pvaas_logo.png :height: 300 px :width: 1400 px :scale: 35 % :alt: logo.png Introduction ==================== Alternative splicing is a precisely regulated process that involves interactions between cis- and trans-acting regulatory elements. Factors (such as genetic mutation) that disrupt these interactions result in aberrant splicings. Deep transcriptome sequencing (RNA-seq) has the capability to interrogate both "aberrant splicing variants" and "sequence variants (SNP, indel, etc)" simultaneously, thus provides the unique opportunity to identify "sequence variants" that directly associated with aberrant splicing. According annotation status, splicings junctions detected from RNA-seq can be divided into 3 categories: * **Canonical splicing**: both 3' splice site (3'SS) and 5' splice site (5'SS) are known. * **Semi-canonical splicing**: only one of the 3'SS and 5'SS is known, the other splice site is novel. * **Novel splicing**: both 3'SS and 5'SS are novel. We define semi-canonical splicing and novel splicing as **aberrant splicing**, as both of them use spurious splicing sites. PVAAS (**Program to identify Variants Associated with Aberrant Splicing**) is such program to identify single nucleotide variants and indels that are associated with aberrant splicing. As shown in example below, G->A mutations are strongly associated with the usage of 5' spurious splicing site. .. image:: _static/Figure1_concept.png :height: 350 px :width: 700 px :scale: 100 % :alt: concept.png Installation ============= Prerequisites -------------------------------------------------------- #. `gcc <http://gcc.gnu.org/>`_ #. `python2.7.* <http://www.python.org/getit/releases/2.7/>`_ #. `numpy and scipy <http://numpy.scipy.org/>`_ #. `cython <http://cython.org/>`_ #. `Xcode <https://developer.apple.com/xcode/>`_ (MacOS only) Check dependencies ------------------------------------------------------------ :: # check gcc $ gcc --version Configured with: --prefix=/Applications/Xcode.app/Contents/Developer/usr --with-gxx-include-dir=/usr/include/c++/4.2.1 Apple LLVM version 5.0 (clang-500.2.79) (based on LLVM 3.3svn) Target: x86_64-apple-darwin12.5.0 Thread model: posix # check python version $ python --version Python 2.7.2 # check numpy and scipy python -c 'import numpy' python -c 'import scipy' # check cython cython --version Cython version 0.19.2 Download ---------- **PVAAS** program is available `here <http://sourceforge.net/projects/pvaas/files/?source=navbar>`_ Install ---------- :: $ tar zxf PVAAS-VERSION.tar.gz $ cd PVAAS-VERSION $ python setup.py install # or you can install PVAAS to a specified location: $ python setup.py install --root=/some/where Usage ======= Usage: pvaas.py [options] Options: --version show program's version number and exit -h, --help show this help message and exit -i FILE_BAM, --input=FILE_BAM BAM file. BAM file should be indexed and sorted using `SAMtools <http://samtools.sourceforge.net/>`_. -g GENE_BED, --gene=GENE_BED Referencde gene models in `BED <http://genome.ucsc.edu/FAQ/FAQformat.html#format1>`_ format. Must be standard 12 column. File can be plain text or gzipped. -r GENOME_FA, --genome=GENOME_FA Reference genome sequence file in `FASTA <http://www.ncbi.nlm.nih.gov/BLAST/blastcgihelp.shtml>`_ format. File can be plain text or gzipped. '-r' and '-v' are mututally exclusive. -v VARIANT_VCF, --vcf=VARIANT_VCF `VCF <http://samtools.github.io/hts- specs/VCFv4.2.pdf>`_ file containing user-defined variant. File can be plain text or gzipped. '-v' and '-r' are mututally exclusive. -o OUTPUT_PREFIX, --output=OUTPUT_PREFIX Prefix of output file. default=pvaas_output -q MIN_MAPQ, --mapq=MIN_MAPQ Minimum `mapping quality <http://maq.sourceforge.net/qual.shtml>`_. Mapping quality is Phred-scaled probability of the alignment being wrong. default=30 -m MIN_SEQQ, --seqq=MIN_SEQQ Minimum `sequence quality <http://maq.sourceforge.net/qual.shtml>`_. Sequence quality is Phread-scaled probability of the base calling being wrong. default=30 -p P_VAL, --pvalue=P_VAL Fisher exact test pvalue (adjusted using benjamini- hochberg procedure) cutoff. default=0.05 -d MIN_DIST_SIZE, --dist=MIN_DIST_SIZE Minimum distance between variant and the termini of read. default=5 -c MIN_COVERAGE, --cvg=MIN_COVERAGE Minimum number of reads supporting one allele. default=2 -k, --clean Remove intermediate '.sortName.bam' -f, --vcffilter Turn on VCF filter (only varaint marked with "PASS" will be used. Results ============= PVAAS generates these files **prefix.spliceFrag.bam** ------------------------------------------------------------ This BAM file contains splice alignments extracted from the original BAM file. For single-end RNA-seq, only the alignment records of spliced reads were extracted; For pair-end RNA-seq, if one read were spliced, the alignment records of both reads (read1 and read2) would be extracted from the original BAM file for downstream analysis. Each alignment was annotated using the provided gene model (i.e. BED file specified by '-g'), in particular, three different tags ("KK","KN","NN") were appended to each splice read (and its mate) as: * **KK** (Known-Known): Both 5' splice site (5'SS) and 3' splice site (3'SS) are annotated (named canonical splicing). You might see multiple "KK" tags if read was spliced multiple times or both ends were spliced. ( eg: KK:Z:chr1:14829-14969.) * **KN** (Known-Novel): One splice site is known, the other splice site is novel (named semi-canonical splicing). (eg. KN:Z:chr1:783186-784863). * **NN** (Novel-Novel): Both splice sites are novel (named novel splicing). Example: :: UNC14-SN744:268:C10DDACXX:4:1308:13818:35728/2 163 chr9 14660 50 48M = 14929 457 CTTCCGCTCCTTGAAGCTGGTCTCCACCCACTGCTGGTTCCGTCACCC 8=?D:DDDHAHDBE@HGIBHE@<D9<E3)):C3BD<?0:?######## RG:Z:120904_UNC14-SN744_0268_AC10DDACXX_4_GTGGCC IH:i:1 HI:i:1 NM:i:1 KK:Z:chr9:14940-15080 UNC14-SN744:268:C10DDACXX:4:2201:9246:148959/1 99 chr9 173290 66 48M = 175773 4465 ACCGTTTCTAAGTTCCAGCCACTCTTCATAGAGCTCTCCACCTTGGCT CCCFFFFFHHHHFIJJJJJIJJJJJJJJJIJIJIIJJJIJJJIIIJJJ RG:Z:120904_UNC14-SN744_0268_AC10DDACXX_4_GTGGCC IH:i:1 HI:i:1 NM:i:0 KN:Z:chr9:175784-177718 UNC14-SN744:268:C10DDACXX:4:2103:5415:27906/2 163 chr9 2769934 55 48M = 2770128 7813 CAAATGCTGGGATTACAGATGTGAACCACCATGCCCAGCTAGAGGCTC CCCFFFFFHHGHBHIJJJJJJJJJIJJIJJJJJJJJJJJJGIHIJIIH RG:Z:120904_UNC14-SN744_0268_AC10DDACXX_4_GTGGCC IH:i:1 HI:i:1 NM:i:1 NN:Z:chr9:2770163-2777734 You would see the combination of "KK", "KN" and "NN" if the read (pair) has multiple annotation status. For example, one read is canonical splicing while its mate is semi-canonical splicing. **prefix.rawPval.xls** ----------------------- Spreadsheet (11 columns) containing variants that are potentially associated aberrant splicing. P values were not adjusted and filtered. #. Chromosome name. #. Chromosome position (1-based). #. Reference allele. #. Allele frequency (AF) of **all reads**. All reads were counted without filting. #. Allele frequency (AF) of reads supporting **canonical splicing**. Canonical splicing is splicing event that has two known splice sites (i.e. both 5'SS and 3'SS are known). For example, "A174,G67" means 174 reads having allele 'A', another 67 reads having allele 'G'. Only reads passing filter were counted. #. Allele frequency (AF) of reads supporting **semi-canonical splicing**. Semi-canonical splicing is splicing event that has one known splice site and one novel splice site. Only reads passing filter were counted. #. Allele frequency (AF) of reads supporting **novel splicing**. Novel splicing is splicing event that has two novel splice sites (i.e. both 5'SS and 3'SS are novel). Only reads passing filter were counted. #. 2x2 table for Fisher Exact test (If the variant was significantly associated with semi-canonical splicing). #. Raw P-value of Fisher exact test(If the variant was significantly associated with semi-canonical splicing). #. 2x2 table for Fisher Exact test (If the variant was significantly associated with novel splicing). #. Raw P-value of Fisher exact test(If the variant was significantly associated with novel splicing). Example:: chrom position ref_base AF(all) AF(canonical) AF(semi-canonical) AF(novel) 2x2(semi_canonical) raw_pval(semi-canonical) 2x2(novel) raw_pval(novel) chr9 100823084 T C221,T226 C206,T219 C8,T4 C7,T3 [[219,206],[4,8]] 0.251694519825 [[219,206],[3,7]] 0.212766063103 chr9 37948724 T C33,T38 C30,T35 NA C3,T3 NA NA [[35,30],[3,3]] 1.0 chr9 88631577 G AA3,GA106 AA2,GA103 GA1 AA1,GA2 [[103,2],[1,0]] 1.0 [[103,2],[2,1]] 0.0817805991497 chr9 96320999 C A19,C149 C147 A19 C2 [[147,0],[0,19]] 2.333609442e-25 [[147,0],[2,0]] 1.0 **prefix.adjustedPval.xls** ------------------------------- Format is same as **prefix.rawPval.xls**. But P values were adjusted using `Benjamini-Hochberg procedure <http://en.wikipedia.org/wiki/False_discovery_rate#Benjamini.E2.80.93Hochberg_procedure>`_. Only variants with adjusted P value < cutoff (specified by -p) were included. :: chrom position ref_base AF(all) AF(canonical) AF(semi-canonical) AF(novel) 2x2(semi_canonical) BH_pval(semi-canonical) 2x2(novel) BH_pval(novel) chr9 96320999 C A19,C149 C147 A19 C2 [[147,0],[0,19]] 1.4001656652e-24 [[147,0],[2,0]] 1.0 Running PVAAS ====================== Download these files --------------------- * `hg19.fa.gz <https://drive.google.com/file/d/0BwAUopGWU_kheHRXTl8wRnZTUnc/edit?usp=sharing>`_ (Human reference genome in FASTA format) * `hg19.All.bed12.gz <https://drive.google.com/file/d/0BwAUopGWU_khV2dKcTEza280ajg/edit?usp=sharing>`_ (Human reference gene models in BED format) * `chr9.sorted.bam <https://drive.google.com/file/d/0BwAUopGWU_khQzc3QnlId002dHM/edit?usp=sharing>`_ (Read alignment (BAM) file) * `chr9.sorted.bam.bai <https://drive.google.com/file/d/0BwAUopGWU_khdWJHUlJCNDZjazg/edit?usp=sharing>`_ (Index file) run PVAAS with de novo SNP calling ------------------------------------- :: python2.7 pvaas.py -i chr9.sorted.bam -r hg19.fa.gz -g hg19.All.bed12.gz -o output @ 2014-07-15 11:51:20: Running PVAAS v0.1.4 @ 2014-07-15 11:51:20: Loading reference genome hg19.fa @ 2014-07-15 11:53:28: Reading reference bed file: hg19.All.bed12 @ 2014-07-15 11:53:52: Sorting chr9.sorted.bam by name (read id) @ 2014-07-15 11:57:27: Save sorted chr9.sorted.bam to output.sortName.bam @ 2014-07-15 11:57:27: Extracting and annotating splice read and its mate @ 2014-07-15 11:59:32: Sorting output.spliceFrag.bam by coordinates @ 2014-07-15 11:59:51: Indexing output.spliceFrag.bam @ 2014-07-15 11:59:53: Searching for mutation associated with aberrant splicing ... @ 2014-07-15 13:32:25: Correcting P values using Benjamini & Hochberg's procedure NOTE: #. Reference genome sequences (i.e. hg19.fa.gz) must be provided to invoke SNP calling. #. PVAAS does not perform indel calling. If users want to associate indels to aberrant splicing, they need to perform indel calling by themselves, then run PVAAS with a VCF file (see below). run PVAAS with a user defined VCF file --------------------------------------- :: python2.7 pvaas.py -i chr9.sorted.bam -v test.vcf -g hg19.All.bed12.gz -o output @ 2014-11-12 12:22:43: Running PVAAS v0.1.5 @ 2014-11-12 12:22:43: Reading reference bed model: hg19.refseq.bed12 @ 2014-11-12 12:22:45: Sorting chr9.sorted.bam by name (read id) @ 2014-11-12 12:27:19: Save sorted chr9.sorted.bam to output.sortName.bam @ 2014-11-12 12:27:19: Extracting and annotating splice read and its mate @ 2014-11-12 12:29:10: Sorting output.spliceFrag.bam by coordinates @ 2014-11-12 12:29:41: Indexing output.spliceFrag.bam @ 2014-11-12 12:30:10: Correcting P values using Benjamini & Hochberg's procedure Example ====================== In below screenshot taken from UCSC genome browser, 'A' is significantly associated with aberrant splicing (i.e. extended 5'SS). "Blue" bars indicate reads mapped to '+' strand, and "red" bars indicate reads mapped '-' strand. .. image:: _static/Supp_Figure_1.png :height: 350 px :width: 700 px :scale: 100 % :alt: concept.png Contact ========== Liguo Wang: wangliguo78_AT_gmail.com .. sectionauthor:: Guido van Rossum <guido@python.org>