ANALYSIS SUMMARY
Table of Contents
Sequencing Stats
Putative SNV files
Putative small indel files
Nonstarters
Possible deletions
Alignment Files
Holt Estimated Allele Frequency Files
Common Causes of False Positives
All files are tab delimited text
Analysis was performed using maq-0.7.1 (Li H., Ruan J. and Durbin R. (2008) Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Res., 18 (11), 1851-1858).
Below is a detailed description of each of the files that you received.
Notes on nomenclature:
We have chosen to use the term Single Nucleotide Variant (SNV) instead of Single Nuceoltide Polymorphism (SNP) because many variants do not fit the strict definition of polymorphism. Many of the programs we use for the analysis use the term SNP. Please consider all occurances of SNP to be equivalent to SNV.
We also use the term multi-alleleic (MA). Multi-alleleic sites could be 1) true heterozygous SNVs in a diploid strain, 2) true multi-alleleic sites from a population, 3) signature of a collapsed repeat in the reference, or 4) signature of sample contamination.
----------------------------------------------------------
SEQUENCING STATS
Excel file that contains the summary statistics for all strains
seq_stats_Organism_proposal#.date.xls
1: org = organism-strain
2: ref_used = reference used for alignment
3: size = genome size
4: avg_depth = average read depth
5: sd_depth = standard deviation of depth
6: min_depth = minimum depth
7: max_depth = maximum depth
8: #SE = number of single end (SE) reads
9: %SE_map = % SE reads mapped to reference
10: #PE = number of paired end (PE) reads
11: %PE_map = percentage PE reads mapped
12: %PE_pairs = percentage of mapped PE reads that mapped as pairs
13: read_len = read length
14: %simSNV = estimate of the percentage of Single Nucleotide Variants (SNVs) found.
Estimated by "mutating" the reference at base 500, 1500, 2500....
to simulate SNVs and calculating % of simulated SNVs identified by the SNV caller.
15: %ref = percentage of the reference covered by reads
16: #hom_SNV = putative homozygous SNVs - These could represent true SNVs or could
represent errors in the reference.
17: #bp_per_SNV = number of base pairs per homozygous SNV
18: #MA_SNV = number of putative multi-alleleic SNVs. Multi-alleleic calls in areas of
excessive depth are likely expanded repeats in the sequenced strain relative
to the reference used.Seq Stats
19: %simIndel = estimate of the percentage of small indels found. Estimated by "mutating"
the reference at base 1000, 2000, 3000, 4000..... to simulate indels and
calculating percentage of simulated indels idenified by the small indel
caller. The indel sizes are equally distributed between insertions and
deletions, and range in size from 1-15bp in size.
20: #hom_indel = number of putative homozygous small indels
21: #MA_indel = number of putative multi-alleleic small indels
22: comments
TOP
----------------------------------------------------------
PUTATIVE SNV FILES
StrainName.snp.aa.txt
1: contig = contig/chromosome/scaffold name
2: pos = genomic position in contig/chromosome/scaffold
3: type = NA = not annotated
NC = non-coding
Syn = synonymous
Non-Syn = non-synonymous
Nonsense = introducing or abolishing a stop codon
3' = 3'UTR
5' = 5'UTR
Int = intronic
Spl = splice junction
4: name = gene name
NC = non-coding
NA = not annotated
5: strand = "+" = gene translated from plus strand
"-" = gene translated from minus strand
NC = non-coding
NA = not annotated
6: ref_nt = reference base
7: cds_nt_pos = cds nucleotide position NC=non-coding, NA=not annotated
8: ref_aa:codon = reference amino acid:reference codon NC=non-coding, NA=not annotated
9: cds_aa_pos = cds codon position NC=non-coding, NA=not annotated
10: cons_qual = phred-like consensus quality (255 max value)
11: depth = read depth (255 max value)
12: avg_hits = the average number of hits in reference for reads covering this
position (essentially the # of times the flanking region appears
in the genome. zero here means there were 3 or more mismatches in
close proximity, and it is too complicated for MAQ to calculate.
For more information please refer to documentation for Maq)
13: cons_base = consensus base
14: 1st_base:count = most common allele count
15: 2nd_base:count = 2nd most common allele count
16: cons_aa:codon = consensus amino acid:consensus codon ( NC=non-coding, NA=not annotated )
TOP
----------------------------------------------------------
PUTATIVE SMALL INDEL FILES:
StrainName.indelpe.hom.txt
short indel calls that are homozygous
StrainName.indelpe.ma.txt
short indel calls that are multi-alleleic
Size of findable indels depends upon the read length, with deletions being easier to locate. Results from a simulation are shown below.
type read 1bp 2bp 3bp 4bp 5bp 6bp 7bp 8bp 9bp 10bp 11bp 12bp 13bp 14bp 15bp
del 35bp 97% 97% 98% 100% 93% 99% 91% 94% 91% 84% 84% 81% 68% 45% 38%
ins 35bp 98% 98% 99% 96% 92% 91% 0% 0% 0% 0% 0% 0% 0% 0% 0%
del 70bp 97% 97% 97% 99% 96% 99% 93% 97% 91% 93% 93% 93% 95% 100% 93%
ins 70bp 98% 97% 99% 97% 96% 99% 96% 93% 89% 93% 93% 89% 87% 0% 0%
A note on False Positive Calls: We find that if the number of fwd reads plus the number of rev reads is extremely small compared to the average depth across the genome the putative indel is most likely a false positive. Visual confirmation using the alignment file is the best way to confirm this.
1: type = NC=non-coding CDS=coding 3'=3'UTR 5'=5'UTR, Int=intronic NA=not annotated
2: name = gene name, NC=non-coding, NA=not annotated
3: contig = contig/chromosome/scaffold name
4: pos = genomic position in contig/chromosome/scaffold
5: evidence = evidence for indel, highest "*" > "+"
6: depth = read depth (255 max value)
7: size:seq = size-of-indel:sequence-of-indel. A negative number indicates deletion,
a positive number indicates insertion.
For example -3:TTT is a deletion of TTT
8: #fwd = number of forward reads
9: #rev = number of reverse reads
10: up_flank = upstream flanking sequence
11: down_flank = downstream flanking sequence - if deletion, deleted sequence is included in the flanking sequence
12: #w/o_indel = number of reads crossing this site that don't contain the indel
Evidence for indel
" * confirmed by reads from both strands
" + confirmed by reads at least 2 reads from a single strand
TOP
----------------------------------------------------------
NONSTARTERS
StrainName.nonstarters.txt
These are regions which do not have any reads that start within the reported range, which can indicate locations of structural variation. The false positive rate is very high for projects of depth less than read length and is much more informative when there is depth that is 2X of read length. For 35bp reads the length of the nonstarter (-l) is set to 11bp, for 75bp reads -l is set to 25bp.
1: contig = contig/chromosome/scaffold name
2: last_start_pos = first alignment position of read at start of region
3: next_start_pos = first alignment position of read at end of region
the alignment below would be reported as
ctg 3 16
1234567890123456789012345678901234567890
GTGAACTGCACTGCGTGGCGGCGTTACCACCGCCGCCACA
........................................
.......................G.......... .....
...............................C.. ,,,,,
,,,,g,,,,,,,,,,,,,,,,,,,,,,,,,,,,, ,,,,,
.............................A..... ....
,c,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,, ....
.....................C.T........... ..
.........................
,,,g,,,t,,,,,c,,,,,,,,,,,
........................
,,g,,,,,ac,,c,,,,,,,,,,,
TOP
----------------------------------------------------------
POSSIBLE DELETIONS:
StrainName.depth.under.3.collapsed.txt
Coordinates with under 3 reads of depth. If two ranges are less than 100 bp apart they are combined. These could be deletions, or they could be regions that are highly divergent from the reference and have problems aligning.
e.g. a contig with under 3 reads mapped to bases 1 to 38 will have this line
contig start - end length
scaffold_1 1 - 38 38
TOP
----------------------------------------------------------
ALIGNMENT FILES:
StrainName.bam
Sequence Alignment/MAP (SAM) and BAM Format
The SAM format is a generic nucleotide alignment format that describes the alignment of query sequences or sequencing reads to a reference sequence or assembly. SAM is a tab-delimited text format. It is easy to understand, easy to parse, and easy to generate and check for errors. However, SAM is a bit slow to parse. A binary equivalent to SAM was developed to deal with this issue and is known as BAM which is used in intensive data processing. BAM is useful in most production pipelines, while SAM may be useful for interconversion with external applications and for exploratory analysis. We provide the BAM files so you can visualize the SNV and structural variation calls. Three of our favorites programs for visualizing the BAM file include samtools (http://samtools.sourceforge.net/), Integrated Genome Viewer (http://www.broadinstitute.org/igv/), and LookSeq (http://www.sanger.ac.uk/resources/software/lookseq/).
TOP
--------------------------------------------------------
HOLT ESTIMATED ALLELE FREQUENCY FILES
StrainName.hap-number.allele.frequency.txt
For analysis of populations we identify putative SNVs and estimate their allele frequencies using the script developed by Holt et al (Bioinformatics(2009)25(16):2074-2075) which was developed to work with pooled samples. Since we are working with an aliquiot from a population and not working with a set number of haplotyes, we set the haplotype to be = (average depth)/10 with an upper limit of 20.
We have evaluated the false positive and false negative rate of this script using simulated populations. These populations were made by taking sequence from different closely related strains of ecoli and pooling the reads. Since these strains had been analyzed as single clones, we were able to evaluate false positive and false negative rates. In general we found that Holt's output had a significantly lower false negative rate when compared the MAQ's cns.final.snp, however Holt's output also had a significantly higher false positive rate.
Total Depth
100X 200X 300X 400X 500X
Holt FPR 2% 1% 2% 35% 78%
MAQ cns.final.snp FPR 1% 1% 1% 1% 1%
Holt FNR 36% 15% 9% 5% 4%
MAQ cns.final.snp FNR 69% 46% 22% 8% 5%
Most of the false negatives were SNVs at low allele frequency. If you want a very low false positive rate, you may want to consider only using the SNVs that are called by both Holt's script and MAQ's cns.final.snp.
For organisms that have stretches of high GC or which may have significant structural variation compared to the reference, the false positive rate can be quite high. It is best to visually verify the putative SNVs to rule out these cases. In general, we assume a call is a false positive if 1) we see many G/C mismatches in the vicinity of the putative SNV, 2) the putative SNV never exists in the middle of a read, and 3) if there is a signature of a large insertion or deletion in the region where the putative SNV is called. A few examples are shown below in the section labeled "Common causes of False Positives".
We have validated the accuracy of the allele frequency estimates using qPCR (Battista personal communication) and have found that >90% of the allele frequency estimates agree with qPCR estimates.
From Holt's documentation:
The script is designed to detect SNPs and estimate their frequency within a pool of DNA sequenced using the Illumina/Solexa Genome Analyzer. It assumes that equal amounts of DNA from a known number of bacterial strains were pooled prior to library preparation and sequencing.
NOTE
This script relies on Maq version 0.6.8 or later.You must edit this script to point to your local installation of maq and maq.pl, which can be downloaded from maq.sourceforge.net. You can read about Maq in an open access paper in Genome Research: Li, Ruan & Durbin, 2008. Genome Res, Nov;18(11):1851-8. http://genome.cshlp.org/content/18/11/1851
This script will:
- use Maq to map Illumina reads to a reference and call SNPs
- filter out SNPs in non-unique regions (optional)
- extract the pileup for each potential SNP called by Maq
- estimate the frequency of each SNP within the pool
Output table columns contain:
[SNP info]
- 1: Pool descriptor
- 2: SNP position
- 3: Reference allele
- 4: Read depth at SNP position
- 5: Same SNP allele using weighted and unweighted proportion? (0=no, 1=yes)
[Unweighted estimates]
- 6: SNP allele
- 7: Proportion
- 8: Standard deviation
- 9: Estimated frequency of SNP within the pool
-10: Estimated frequency of SNP within the pool (integer number of strains)
-11: Lower bound of 95% confidence interval for SNP frequency
-12: Upper bound of 95% confidence interval for SNP frequency
[Weighted estimates]
-13: SNP allele
-14: Proportion
-15: Standard deviation
-16: Estimated frequency of SNP within the pool
-17: Estimated frequency of SNP within the pool (integer number of strains)
-18: Lower bound of 95% confidence interval for SNP frequency
-19: Upper bound of 95% confidence interval for SNP frequency
TOP
----------------------------------------------------------------
COMMON CAUSES OF FALSE POSITIVES:
Below are several samtool views of situations that lead to false positive SNV calls. Unfortunately the ability to see orphan reads was lost through cutting and pasting, therefore I have described what the orphan pattern would look like for each case.
FP SNV calls at the edge of a deletion:
This is an example of a small deletion. A large deletion would have orphan fwd reads upstream of the breakpoint and orphan rev reads downstream of the breakpoint.
ref seq CGGCTGGCAACGTCGCCAGGCACACCGCCAAGCTGGTTGCGCAGGTTTTCAACCTGCTGCCAGGCCCACTGGTTATCCTGATCCTGTTCATACC
cons seq ......................................ATC.T. .C.AAG..........................
. .................................... .C.AAG..........................
. .................................... .C.AAG..........................
, ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,, ,c,aag,,,,,,,,,,,,,,,,,,,,,,,,,,
, ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,, ,c,aag,,,,,,,,,,,,,,,,,,,,,,,,,,
.C ...................................A C.AAG..........................
.. ..................................AT C.AAG..........................
,, ..................................AT C.AAG..........................
,, ..................................AT C.AAG..........................
,, ..................................AT C.AAG..........................
,,, ..................................AT C.AAG..........................
,,, .................................ATC C.AAG..........................
,,,,, ................................ATC. c,aag,,,,,,,,,,,,,,,,,,,,,,,,,,
,,,,, ................................ATC. .AAG..........................
,,,,,, ...............................ATC.T .AAG..........................
,,,,,, ...............................ATC.T ,aag,,,,,,,,,,,,,,,,,,,,,,,,,,
,,,,,, ...............................ATC.T ,aag,,,,,,,,,,,,,,,,,,,,,,,,,,
,,,,,, ...............................ATC.T ,aag,,,,,,,,,,,,,,,,,,,,,,,,,,
.....T. ..............................ATC.T. AAG..........................
....... ..............................ATC.T. AAG..........................
....... AAG..........................
,,,,,,, AAG.....................C....
FP SNV calls at the edge of a large insertion:
If the insertion is very large, you will see orphan fwd reads upstream and orphan rev reads downstream.
ref seq GAGCAGATTGATCAAAAAATTTACCGCACTAGGCCCGTATATTCGTGAAGGTAAGTGCAAAGATAATCGATTCTTTTTCGATTGTCTGGCTGTA
cons seq ............................................S........R........................................
..................... ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,g,,aatg ................................
,,,,,,,,,,,,,,,,,,,,, ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,g,,aatg ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,,, ,,,,,,,,,,,,,,,,,,,,,,,,,,,,g,,aatg, ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
...................... ,,,,,,,,,,,,,,,,,,,,,,,,,,,,g,,aatg, ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,,,, ,,,,,,,,,,,,,,,,,,,,,,,,,,,,g,,aatg, ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,,,, ,,,,,,,,,,,,,,,,,,,,,,,,,,,g,,aatg,c ...............................
....................... ,,,,,,,,,,,,,,,,,,,,,,,,,,,g,,aatg,c ...............................
....................... catca,c,,,,,,,,,,,,,,,,,,,,,,,,,,,,, ,,,,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,,,,, ATCA.C.............................. ..................
........................ ATCA.C.............................. ..................
........................ agca,c,,,,,,,,,,,,,,,,,,,,,,,,,,,,,, ..................
........................ TCA.C............................... .................
......................... a,c,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,, ...............
......................... ,c,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,, ,,,,,,,,,,,,,,
......................... ,c,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,, ,,,,,,,,,,,,,,
......................... ,c,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,, .............
......................... ,c,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,, .............
.......................... .................................... .........
........................... ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,, .........
........................... ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,, .........
........................... ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,, .........
FP SNV calls in high GC regions:
This is an example of getting "stuck on Cs". We also see errors where it gets "stuck on Gs".
ref seq TTGCTGCCTGCGCGCTACGGCTGGCAAGATTGCGCACTTCACCTGCCACTACCGCAAAACCACGCCCCTGCTCTCCGGCGCGAGCCGCTTCCAC
cons seq ........................................M.....................................................
........................ cc,c,c,cc,c,,,,,,,,,,,,,,,,,,,,,,,,, ........................
,,,,,,,,,,,,,,,,,,,,,,,, ,,,c,c,,c,c,,,,,,,,,,,,,,,,,,,,,,,,, ........................
,,,,,,,,,,,,,,,,,,,,,,,, ,,,,,c,,c,c,,,,,,,,,,,,,,,,,,,,,,,,, ,,,c,,,,,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,,,,,, ,c,c,c,,c,c,,,,,,,,,,,,,,,,,,,,,,,,, ,c,c,,,,,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,,,,,,, .................................... ,,g,,,,,,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,,,,,,, ,,c,c,,c,c,,,,,,,,,,,,,,,,,,,,,,,,,, ,,,c,g,,,,g,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,,,,,,,, ,,c,c,cc,c,,,,,,,,,,,,,,,,,,,,,,,,,, ,c,c,,,,,,g,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,,,,,,,, c,c,c,,c,c,,,,,,,,,,,,,,,,,,,,,,,,,, .......................
........................... c,c,c,,c,c,,,,,,,,,,,,,,,,,,,,,,,,,, .......................
........................... c,c,c,,c,c,,,,,,,,,,,,,,,,,,,,,,,,,, .......................
........................... ,,c,c,,c,c,,,,,,,,,,,,,,,,,,,,,,,,,, .......................
........................... ,,c,c,,c,c,,,,,,,,,,,,,,,,,,,,,,,,,, .......................
........................... ,,c,c,,c,c,,,,,,,,,,,,,,,,,,,,,,,,,, ,,c,,,,,,g,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,,,,,,,,, c,c,c,,c,c,,,,,,,,,,,,,,,,,,,,,,,,,, ......................
............................ .................................... ......................
............................ .................................... ......................
............................ ,c,c,,c,c,,,,,,,,,,,,,,,,,,,,,,,,,,, ,c,,,,,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,,,,,,,,,, ,c,c,cc,c,,,,,,,,,,,,,,,,,,,,,,,,,,, ,c,,,,,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,,,,,,,,,, ,,,c,,c,c,,,,,,,,,,,,,,,,,,,,,,,,,,, .....................
............................. .................................... c,g,,,,g,,,,,,,,,,,,,
TOP