ANALYSIS SUMMARY
Table of Contents
Sequencing Stats
Putative SNV files
Merged reports
Putative small indel files
Breakdancer output
Nonstarters
Possible deletions
Holt Estimated Allele Frequency Files
Common Causes of False Positives
All files are tab delimited text
Analysis was performed using maq-0.7.1 (Li H., Ruan J. and Durbin R. (2008) Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Res., 18 (11), 1851-1858).
Below is a detailed description of each of the files that you received.
Notes on nomenclature:
We have chosen to use the term Single Nucleotide Variant (SNV) instead of Single Nuceoltide Polymorphism (SNP) because many variants do not fit the strict definition of polymorphism. Many of the programs we use for the analysis use the term SNP. Please consider all occurances of SNP to be equivalent to SNV.
We also use the term multi-alleleic (MA). Multi-alleleic sites could be 1) true heterozygous SNVs in a diploid strain, 2) true multi-alleleic sites from a population, 3) signature of a collapsed repeat in the reference, or 4) signature of sample contamination.
----------------------------------------------------------
SEQUENCING STATS
Excel file that contains the summary statistics for all strains
seq_stats_Organism_proposal#.date.xls
1: org = organism-strain
2: ref_used = reference used for alignment
3: size = genome size
4: avg_depth = average read depth
5: sd_depth = standard deviation of depth
6: min_depth = minimum depth
7: max_depth = maximum depth
8: #SE = number of single end (SE) reads
9: %SE_map = % SE reads mapped to reference
10: #PE = number of paired end (PE) reads
11: %PE_map = percentage PE reads mapped
12: %PE_pairs = percentage of mapped PE reads that mapped as pairs
13: read_len = read length
14: %simSNV = estimate of the percentage of Single Nucleotide Variants (SNVs) found.
Estimated by "mutating" the reference at base 500, 1500, 2500....
to simulate SNVs and calculating % of simulated SNVs identified by the SNV caller.
15: %ref = percentage of the reference covered by reads
16: #hom_SNV = putative homozygous SNVs - These could represent true SNVs or could
represent errors in the reference.
17: #bp_per_SNV = number of base pairs per homozygous SNV
18: #MA_SNV = number of putative multi-alleleic SNVs. Multi-alleleic calls in areas of
excessive depth are likely expanded repeats in the sequenced strain relative
to the reference used.Seq Stats
19: %simIndel = estimate of the percentage of small indels found. Estimated by "mutating"
the reference at base 1000, 2000, 3000, 4000..... to simulate indels and
calculating percentage of simulated indels idenified by the small indel
caller. The indel sizes are equally distributed between insertions and
deletions, and range in size from 1-15bp in size.
20: #hom_indel = number of putative homozygous small indels
21: #MA_indel = number of putative multi-alleleic small indels
22: comments
TOP
----------------------------------------------------------
PUTATIVE SNV FILES
StrainName.snp.aa.txt
1: contig = contig/chromosome/scaffold name
2: pos = genomic position in contig/chromosome/scaffold
3: type = NA = not annotated
NC = non-coding
Syn = synonymous
Non-Syn = non-synonymous
Nonsense = introducing or abolishing a stop codon
3' = 3'UTR
5' = 5'UTR
Int = intronic
Spl = splice junction
4: name = gene name
NC = non-coding
NA = not annotated
5: strand = "+" = gene translated from plus strand
"-" = gene translated from minus strand
NC = non-coding
NA = not annotated
6: ref_nt = reference base
7: cds_nt_pos = cds nucleotide position NC=non-coding, NA=not annotated
8: ref_aa:codon = reference amino acid:reference codon NC=non-coding, NA=not annotated
9: cds_aa_pos = cds codon position NC=non-coding, NA=not annotated
10: cons_qual = phred-like consensus quality (255 max value)
11: depth = read depth (255 max value)
12: avg_hits = the average number of hits in reference for reads covering this
position (essentially the # of times the flanking region appears
in the genome. zero here means there were 3 or more mismatches in
close proximity, and it is too complicated for MAQ to calculate.
For more information please refer to documentation for Maq)
13: cons_base = consensus base
14: 1st_base:count = most common allele count
15: 2nd_base:count = 2nd most common allele count
16: cons_aa:codon = consensus amino acid:consensus codon ( NC=non-coding, NA=not annotated )
TOP
----------------------------------------------------------
MERGED REPORTS
organism_merge.out
1: contig = contig/chromosome/scaffold name
2: pos = genomic position in contig/chromosome/scaffold
3: type = NA = not annotated
NC = non-coding
CDS = coding
3' = 3'UTR
5' = 5'UTR
Int = intronic
Spl = splice junction
4: name = gene name ( NC = non-coding, NA = not annotated )
5: strand = +/- strand the gene is translated from ( NC = non-coding, NA = not annotated )
6: ref_nt = reference base
7: cds_nt_pos = cds nucleotide position ( NC = non-coding NA = not annotated )
8: ref-aa:codon = reference amino acid:reference codon ( NC = non-coding, NA = not annotated )
10: cds_aa_pos = cds codon position NC=non-coding NA=not annotated
11: strain_1 = strain 1 stats seperated by "/" fields described below
12: strain_2 = strain 2 stats seperated by "/" fields described below
13+ etc... one column per additional strain.
Per Strain Stats Column:
SNV/cons_qual/depth/avg_hits/cons_base/1st_base:count/2nd_base:count/cons_aa:codon
SNV = SNV was called in that strain, blank if strain matched reference
cons_qual = phred like score of cons qual
depth = read depth
avg_hits = ave # flanking region is found in ref, indicates uniqueness, want ~1.00
cons_base = consensus base
1st_base:count = most common allele and its count
2nd_base:count = 2nd most common base and its count
cons_aa:codon = consensus amino acid : consensus codon NC=non-coding NA=not annotated
TOP
----------------------------------------------------------
PUTATIVE SMALL INDEL FILES:
StrainName.indelpe.hom.txt
short indel calls that are homozygous
StrainName.indelpe.ma.txt
short indel calls that are multi-alleleic
Size of findable indels depends upon the read length, with deletions being easier to locate. Results from a simulation are shown below.
type read 1bp 2bp 3bp 4bp 5bp 6bp 7bp 8bp 9bp 10bp 11bp 12bp 13bp 14bp 15bp
del 35bp 97% 97% 98% 100% 93% 99% 91% 94% 91% 84% 84% 81% 68% 45% 38%
ins 35bp 98% 98% 99% 96% 92% 91% 0% 0% 0% 0% 0% 0% 0% 0% 0%
del 70bp 97% 97% 97% 99% 96% 99% 93% 97% 91% 93% 93% 93% 95% 100% 93%
ins 70bp 98% 97% 99% 97% 96% 99% 96% 93% 89% 93% 93% 89% 87% 0% 0%
A note on False Positive Calls: We find that if the number of fwd reads plus the number of rev reads is extremely small compared to the average depth across the genome the putative indel is most likely a false positive. Visual confirmation using the alignment file is the best way to confirm this.
1: type = NC=non-coding CDS=coding 3'=3'UTR 5'=5'UTR, Int=intronic NA=not annotated
2: name = gene name, NC=non-coding, NA=not annotated
3: contig = contig/chromosome/scaffold name
4: pos = genomic position in contig/chromosome/scaffold
5: evidence = evidence for indel, highest "*" > "+"
6: depth = read depth (255 max value)
7: size:seq = size-of-indel:sequence-of-indel. A negative number indicates deletion,
a positive number indicates insertion.
For example -3:TTT is a deletion of TTT
8: #fwd = number of forward reads
9: #rev = number of reverse reads
10: up_flank = upstream flanking sequence
11: down_flank = downstream flanking sequence - if deletion, deleted sequence is included in the flanking sequence
12: #w/o_indel = number of reads crossing this site that don't contain the indel
Evidence for indel
" * confirmed by reads from both strands
" + confirmed by reads at least 2 reads from a single strand
TOP
----------------------------------------------------------
BREAKDANCER OUTPUT
Nature Methods 6, 677 - 681 (2009)
This program attempts to identiy large structural variations. Can only be run on Paired End data. Information below is taken from the BreakDancer README. It is known that this program will give many false positive calls especially for genomes that are repetitive. Validation of predicted structural variation is necessary.
StrainName.bd.max.filtered.txt
1: #Chr1 = scaffold name of 1st SV breakpoint
2: Pos1 = coordinate of 1st SV breakpoint
3: Orientation1 = number of reads mapped to the plus (+) or the minus (-) strand in the anchoring regions.
4: Chr2 = scaffold name of 2nd SV breakpoint
5: Pos2 = coordinates of 2nd SV breakpoints.
6: Orientation2 = number of reads mapped to the plus (+) or the minus (-) strand in the anchoring regions.
7: Type = type of SV detected: DEL (deletions),
INS (insertion),
INV (inversion),
ITX (intra-chromosomal translocation),
CTX (inter-chromosomal translocation),
and Unknown.
8: Size = size of the SV in bp. Meaningless for inter-chromosomal translocations.
9: Score = confidence score associated with the prediction. We find that calls that have a score of
90 or higher are likely real.
10: num_Reads = number of reads (((Needs to be described, is this the number of reads that overlap this
region, or the number of reads that support the call?)))
11: num_Reads_lib = can be used to dissect the origin of the supporting read pairs, which is useful in pooled
analysis. For example, one may want to give SVs that are supported by more than one
libraries higher confidence than those detected in only one library. It can also be used
to distinguish somatic events from the germline, i.e., those detected in only the tumor
libraries versus those detected in both the tumor and the normal libraries.
12: Allele_frequency = is currently a placeholder for displaying estimated allele frequency. The allele frequencies
estimated in this version are not accurate and should not be trusted.
13&14: are information useful to reproduce the results.
Example 1:
1 10000 10+0- 2 20000 7+10- CTX -296 99 10 tB|10 1.00 BreakDancerMax-0.0.1 t1
An inter-chromosomal translocation that starts from chr1:10000 and goes into chr2:20000 with 10 supporting read pairs from the library tB and a confidence score of 99.
Example 2:
1 59257 5+1- 1 60164 0+5- DEL 862 99 5 nA|2:tB|1 0.56 BreakDancerMax-0.0.1 c4
A deletion between chr1:59257 and chr1:60164 connected by 5 read pairs, among which 2 in library nA and 1 in library tB support the deletion hypothesis. This deletion is detected by BreakDancerMax-0.0.1 with a separation threshold of 4 s.d.
Example 3:
1 62767 10+0- 1 63126 0+10- INS -13 36 10 NA|10 1.00 BreakDancerMini-0.0.1 q10
An 13 bp insertion detected by BreakDancerMini between chr1:62767 and chr1:63126 with 10 supporting read pairs from a single library 'NA' and a confidence score of 36.
Notes:
Real SV breakpoints are expected to reside within the predicted boundaries with a margin > the read length.
TOP
----------------------------------------------------------
NONSTARTERS
StrainName.nonstarters.txt
These are regions which do not have any reads that start within the reported range, which can indicate locations of structural variation. The false positive rate is very high for projects of depth less than read length and is much more informative when there is depth that is 2X of read length. For 35bp reads the length of the nonstarter (-l) is set to 11bp, for 75bp reads -l is set to 25bp.
1: contig = contig/chromosome/scaffold name
2: last_start_pos = first alignment position of read at start of region
3: next_start_pos = first alignment position of read at end of region
the alignment below would be reported as
ctg 3 16
1234567890123456789012345678901234567890
GTGAACTGCACTGCGTGGCGGCGTTACCACCGCCGCCACA
........................................
.......................G.......... .....
...............................C.. ,,,,,
,,,,g,,,,,,,,,,,,,,,,,,,,,,,,,,,,, ,,,,,
.............................A..... ....
,c,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,, ....
.....................C.T........... ..
.........................
,,,g,,,t,,,,,c,,,,,,,,,,,
........................
,,g,,,,,ac,,c,,,,,,,,,,,
TOP
----------------------------------------------------------
POSSIBLE DELETIONS:
StrainName.depth.under.3.collapsed.txt
Coordinates with under 3 reads of depth. If two ranges are less than 100 bp apart they are combined. These could be deletions, or they could be regions that are highly divergent from the reference and have problems aligning.
e.g. a contig with under 3 reads mapped to bases 1 to 38 will have this line
contig start - end length
scaffold_1 1 - 38 38
TOP
----------------------------------------------------------
HOLT ESTIMATED ALLELE FREQUENCY FILES
StrainName.hap-number.allele.frequency.txt
For analysis of populations we identify putative SNVs and estimate their allele frequencies using the script developed by Holt et al (Bioinformatics(2009)25(16):2074-2075) which was developed to work with pooled samples. Since we are working with an aliquiot from a population and not working with a set number of haplotyes, we set the haplotype to be = (average depth)/10 with an upper limit of 20.
We have evaluated the false positive and false negative rate of this script using simulated populations. These populations were made by taking sequence from different closely related strains of ecoli and pooling the reads. Since these strains had been analyzed as single clones, we were able to evaluate false positive and false negative rates. In general we found that Holt's output had a significantly lower false negative rate when compared the MAQ's cns.final.snp, however Holt's output also had a significantly higher false positive rate.
Total Depth
100X 200X 300X 400X 500X
Holt FPR 2% 1% 2% 35% 78%
MAQ cns.final.snp FPR 1% 1% 1% 1% 1%
Holt FNR 36% 15% 9% 5% 4%
MAQ cns.final.snp FNR 69% 46% 22% 8% 5%
Most of the false negatives were SNVs at low allele frequency. If you want a very low false positive rate, you may want to consider only using the SNVs that are called by both Holt's script and MAQ's cns.final.snp.
For organisms that have stretches of high GC or which may have significant structural variation compared to the reference, the false positive rate can be quite high. It is best to visually verify the putative SNVs to rule out these cases. In general, we assume a call is a false positive if 1) we see many G/C mismatches in the vicinity of the putative SNV, 2) the putative SNV never exists in the middle of a read, and 3) if there is a signature of a large insertion or deletion in the region where the putative SNV is called. A few examples are shown below in the section labeled "Common causes of False Positives".
We have validated the accuracy of the allele frequency estimates using qPCR (Battista personal communication) and have found that >90% of the allele frequency estimates agree with qPCR estimates.
From Holt's documentation:
The script is designed to detect SNPs and estimate their frequency within a pool of DNA sequenced using the Illumina/Solexa Genome Analyzer. It assumes that equal amounts of DNA from a known number of bacterial strains were pooled prior to library preparation and sequencing.
NOTE
This script relies on Maq version 0.6.8 or later.You must edit this script to point to your local installation of maq and maq.pl, which can be downloaded from maq.sourceforge.net. You can read about Maq in an open access paper in Genome Research: Li, Ruan & Durbin, 2008. Genome Res, Nov;18(11):1851-8. http://genome.cshlp.org/content/18/11/1851
This script will:
- use Maq to map Illumina reads to a reference and call SNPs
- filter out SNPs in non-unique regions (optional)
- extract the pileup for each potential SNP called by Maq
- estimate the frequency of each SNP within the pool
Output table columns contain:
[SNP info]
- 1: Pool descriptor
- 2: SNP position
- 3: Reference allele
- 4: Read depth at SNP position
- 5: Same SNP allele using weighted and unweighted proportion? (0=no, 1=yes)
[Unweighted estimates]
- 6: SNP allele
- 7: Proportion
- 8: Standard deviation
- 9: Estimated frequency of SNP within the pool
-10: Estimated frequency of SNP within the pool (integer number of strains)
-11: Lower bound of 95% confidence interval for SNP frequency
-12: Upper bound of 95% confidence interval for SNP frequency
[Weighted estimates]
-13: SNP allele
-14: Proportion
-15: Standard deviation
-16: Estimated frequency of SNP within the pool
-17: Estimated frequency of SNP within the pool (integer number of strains)
-18: Lower bound of 95% confidence interval for SNP frequency
-19: Upper bound of 95% confidence interval for SNP frequency
TOP
----------------------------------------------------------------
COMMON CAUSES OF FALSE POSITIVES:
Below are several samtool views of situations that lead to false positive SNV calls. Unfortunately the ability to see orphan reads was lost through cutting and pasting, therefore I have described what the orphan pattern would look like for each case.
FP SNV calls at the edge of a deletion:
This is an example of a small deletion. A large deletion would have orphan fwd reads upstream of the breakpoint and orphan rev reads downstream of the breakpoint.
ref seq CGGCTGGCAACGTCGCCAGGCACACCGCCAAGCTGGTTGCGCAGGTTTTCAACCTGCTGCCAGGCCCACTGGTTATCCTGATCCTGTTCATACC
cons seq ......................................ATC.T. .C.AAG..........................
. .................................... .C.AAG..........................
. .................................... .C.AAG..........................
, ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,, ,c,aag,,,,,,,,,,,,,,,,,,,,,,,,,,
, ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,, ,c,aag,,,,,,,,,,,,,,,,,,,,,,,,,,
.C ...................................A C.AAG..........................
.. ..................................AT C.AAG..........................
,, ..................................AT C.AAG..........................
,, ..................................AT C.AAG..........................
,, ..................................AT C.AAG..........................
,,, ..................................AT C.AAG..........................
,,, .................................ATC C.AAG..........................
,,,,, ................................ATC. c,aag,,,,,,,,,,,,,,,,,,,,,,,,,,
,,,,, ................................ATC. .AAG..........................
,,,,,, ...............................ATC.T .AAG..........................
,,,,,, ...............................ATC.T ,aag,,,,,,,,,,,,,,,,,,,,,,,,,,
,,,,,, ...............................ATC.T ,aag,,,,,,,,,,,,,,,,,,,,,,,,,,
,,,,,, ...............................ATC.T ,aag,,,,,,,,,,,,,,,,,,,,,,,,,,
.....T. ..............................ATC.T. AAG..........................
....... ..............................ATC.T. AAG..........................
....... AAG..........................
,,,,,,, AAG.....................C....
FP SNV calls at the edge of a large insertion:
If the insertion is very large, you will see orphan fwd reads upstream and orphan rev reads downstream.
ref seq GAGCAGATTGATCAAAAAATTTACCGCACTAGGCCCGTATATTCGTGAAGGTAAGTGCAAAGATAATCGATTCTTTTTCGATTGTCTGGCTGTA
cons seq ............................................S........R........................................
..................... ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,g,,aatg ................................
,,,,,,,,,,,,,,,,,,,,, ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,g,,aatg ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,,, ,,,,,,,,,,,,,,,,,,,,,,,,,,,,g,,aatg, ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
...................... ,,,,,,,,,,,,,,,,,,,,,,,,,,,,g,,aatg, ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,,,, ,,,,,,,,,,,,,,,,,,,,,,,,,,,,g,,aatg, ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,,,, ,,,,,,,,,,,,,,,,,,,,,,,,,,,g,,aatg,c ...............................
....................... ,,,,,,,,,,,,,,,,,,,,,,,,,,,g,,aatg,c ...............................
....................... catca,c,,,,,,,,,,,,,,,,,,,,,,,,,,,,, ,,,,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,,,,, ATCA.C.............................. ..................
........................ ATCA.C.............................. ..................
........................ agca,c,,,,,,,,,,,,,,,,,,,,,,,,,,,,,, ..................
........................ TCA.C............................... .................
......................... a,c,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,, ...............
......................... ,c,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,, ,,,,,,,,,,,,,,
......................... ,c,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,, ,,,,,,,,,,,,,,
......................... ,c,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,, .............
......................... ,c,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,, .............
.......................... .................................... .........
........................... ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,, .........
........................... ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,, .........
........................... ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,, .........
FP SNV calls in high GC regions:
This is an example of getting "stuck on Cs". We also see errors where it gets "stuck on Gs".
ref seq TTGCTGCCTGCGCGCTACGGCTGGCAAGATTGCGCACTTCACCTGCCACTACCGCAAAACCACGCCCCTGCTCTCCGGCGCGAGCCGCTTCCAC
cons seq ........................................M.....................................................
........................ cc,c,c,cc,c,,,,,,,,,,,,,,,,,,,,,,,,, ........................
,,,,,,,,,,,,,,,,,,,,,,,, ,,,c,c,,c,c,,,,,,,,,,,,,,,,,,,,,,,,, ........................
,,,,,,,,,,,,,,,,,,,,,,,, ,,,,,c,,c,c,,,,,,,,,,,,,,,,,,,,,,,,, ,,,c,,,,,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,,,,,, ,c,c,c,,c,c,,,,,,,,,,,,,,,,,,,,,,,,, ,c,c,,,,,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,,,,,,, .................................... ,,g,,,,,,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,,,,,,, ,,c,c,,c,c,,,,,,,,,,,,,,,,,,,,,,,,,, ,,,c,g,,,,g,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,,,,,,,, ,,c,c,cc,c,,,,,,,,,,,,,,,,,,,,,,,,,, ,c,c,,,,,,g,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,,,,,,,, c,c,c,,c,c,,,,,,,,,,,,,,,,,,,,,,,,,, .......................
........................... c,c,c,,c,c,,,,,,,,,,,,,,,,,,,,,,,,,, .......................
........................... c,c,c,,c,c,,,,,,,,,,,,,,,,,,,,,,,,,, .......................
........................... ,,c,c,,c,c,,,,,,,,,,,,,,,,,,,,,,,,,, .......................
........................... ,,c,c,,c,c,,,,,,,,,,,,,,,,,,,,,,,,,, .......................
........................... ,,c,c,,c,c,,,,,,,,,,,,,,,,,,,,,,,,,, ,,c,,,,,,g,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,,,,,,,,, c,c,c,,c,c,,,,,,,,,,,,,,,,,,,,,,,,,, ......................
............................ .................................... ......................
............................ .................................... ......................
............................ ,c,c,,c,c,,,,,,,,,,,,,,,,,,,,,,,,,,, ,c,,,,,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,,,,,,,,,, ,c,c,cc,c,,,,,,,,,,,,,,,,,,,,,,,,,,, ,c,,,,,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,,,,,,,,,, ,,,c,,c,c,,,,,,,,,,,,,,,,,,,,,,,,,,, .....................
............................. .................................... c,g,,,,g,,,,,,,,,,,,,
TOP