ANALYSIS SUMMARY

  Table of Contents
    Sequencing Stats
    Putative SNV files
    Putative small indel files
    Nonstarters
    Possible deletions
    Alignment Files
    Holt Estimated Allele Frequency Files
    Common Causes of False Positives
  

All files are tab delimited text

Analysis was performed using maq-0.7.1 (Li H., Ruan J. and Durbin R. (2008) Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Res., 18 (11), 1851-1858). Below is a detailed description of each of the files that you received. Notes on nomenclature: We have chosen to use the term Single Nucleotide Variant (SNV) instead of Single Nuceoltide Polymorphism (SNP) because many variants do not fit the strict definition of polymorphism. Many of the programs we use for the analysis use the term SNP. Please consider all occurances of SNP to be equivalent to SNV. We also use the term multi-alleleic (MA). Multi-alleleic sites could be 1) true heterozygous SNVs in a diploid strain, 2) true multi-alleleic sites from a population, 3) signature of a collapsed repeat in the reference, or 4) signature of sample contamination.

----------------------------------------------------------

SEQUENCING STATS
Excel file that contains the summary statistics for all strains

seq_stats_Organism_proposal#.date.xls

 1: org         = organism-strain
 2: ref_used    = reference used for alignment
 3: size        = genome size
 4: avg_depth   = average read depth
 5: sd_depth    = standard deviation of depth
 6: min_depth   = minimum depth
 7: max_depth   = maximum depth
 8: #SE         = number of single end (SE) reads
 9: %SE_map     = % SE reads mapped to reference
10: #PE         = number of paired end (PE) reads
11: %PE_map     = percentage PE reads mapped
12: %PE_pairs   = percentage of mapped PE reads that mapped as pairs
13: read_len    = read length
14: %simSNV     = estimate of the percentage of Single Nucleotide Variants (SNVs) found.
                  Estimated by "mutating" the reference at base 500, 1500, 2500.... 
                  to simulate SNVs and calculating % of simulated SNVs identified by the SNV caller.
15: %ref        = percentage of the reference covered by reads
16: #hom_SNV    = putative homozygous SNVs - These could represent true SNVs or could
                  represent errors in the reference.
17: #bp_per_SNV = number of base pairs per homozygous SNV
18: #MA_SNV     = number of putative multi-alleleic SNVs. Multi-alleleic calls in areas of
                  excessive depth are likely expanded repeats in the sequenced strain relative
                  to the reference used.Seq Stats
19: %simIndel   = estimate of the percentage of small indels found. Estimated by "mutating" 
                  the reference at base 1000, 2000, 3000, 4000..... to simulate indels and
                  calculating percentage of simulated indels idenified by the small indel
                  caller. The indel sizes are equally distributed between insertions and
                  deletions, and range in size from 1-15bp in size.
20: #hom_indel  = number of putative homozygous small indels
21: #MA_indel   = number of putative multi-alleleic small indels
22: comments
TOP

----------------------------------------------------------


PUTATIVE SNV FILES 

StrainName.snp.aa.txt 

 1: contig         =  contig/chromosome/scaffold name
 2: pos            = genomic position in contig/chromosome/scaffold
 3: type           = NA        = not annotated 
                     NC        = non-coding 
                     Syn       = synonymous 
                     Non-Syn   = non-synonymous 
                     Nonsense  = introducing or abolishing a stop codon 
                     3'        = 3'UTR 
                     5'        = 5'UTR
                     Int       = intronic 
                     Spl       = splice junction
 4: name           = gene name 
                     NC = non-coding
                     NA = not annotated
 5: strand         = "+" = gene translated from plus strand 
                     "-" = gene translated from minus strand 
                     NC  = non-coding
                     NA  =  not annotated
 6: ref_nt         = reference base
 7: cds_nt_pos     = cds nucleotide position NC=non-coding, NA=not annotated
 8: ref_aa:codon   = reference amino acid:reference codon NC=non-coding, NA=not annotated
 9: cds_aa_pos     = cds codon position NC=non-coding, NA=not annotated
10: cons_qual      = phred-like consensus quality (255 max value)
11: depth          = read depth (255 max value)
12: avg_hits       = the average number of hits in reference for reads covering this
                     position (essentially the # of times the flanking region appears
                     in the genome. zero here means there were 3 or more mismatches in
                     close proximity, and it is too complicated for MAQ to calculate.
                     For more information please refer to documentation for Maq)
13: cons_base      = consensus base
14: 1st_base:count = most common allele count
15: 2nd_base:count = 2nd most common allele count
16: cons_aa:codon  = consensus amino acid:consensus codon ( NC=non-coding, NA=not annotated )
TOP
----------------------------------------------------------


PUTATIVE SMALL INDEL FILES:

StrainName.indelpe.hom.txt
short indel calls that are homozygous

StrainName.indelpe.ma.txt
short indel calls that are multi-alleleic

Size of findable indels depends upon the read length, with deletions being easier to locate. Results from a simulation are shown below. 
type	read	1bp	2bp	3bp	4bp	5bp	6bp	7bp	8bp	9bp	10bp	11bp	12bp	13bp	14bp	15bp
del	35bp	97%	97%	98%	100%	93%	99%	91%	94%	91%	84%	84%	81%	68%	45%	38%
ins	35bp	98%	98%	99%	96%	92%	91%	0%	0%	0%	0%	0%	0%	0%	0%	0%
del	70bp	97%	97%	97%	99%	96%	99%	93%	97%	91%	93%	93%	93%	95%	100%	93%
ins	70bp	98%	97%	99%	97%	96%	99%	96%	93%	89%	93%	93%	89%	87%	0%	0%

A note on False Positive Calls: We find that if the number of fwd reads plus the number of rev reads is extremely small compared to the average depth across the genome the putative indel is most likely a false positive. Visual confirmation using the alignment file is the best way to confirm this.

1: type        = NC=non-coding CDS=coding 3'=3'UTR 5'=5'UTR, Int=intronic NA=not annotated
2: name        = gene name, NC=non-coding, NA=not annotated
3: contig      = contig/chromosome/scaffold name
4: pos         = genomic position in contig/chromosome/scaffold
5: evidence    = evidence for indel, highest "*" > "+"
6: depth       = read depth (255 max value)
7: size:seq    = size-of-indel:sequence-of-indel. A negative number indicates deletion,
                 a positive number indicates insertion. 
                 For example -3:TTT is a deletion of TTT
8: #fwd        = number of forward reads
9: #rev        = number of reverse reads
10: up_flank   = upstream flanking sequence
11: down_flank = downstream flanking sequence - if deletion, deleted sequence is included in the flanking sequence
12: #w/o_indel = number of reads crossing this site that don't contain the indel 

Evidence for indel 
"	* confirmed by reads from both strands
"	+ confirmed by reads at least 2 reads from a single strand
TOP
----------------------------------------------------------


NONSTARTERS

StrainName.nonstarters.txt

These are regions which do not have any reads that start within the reported range, which can indicate locations of structural variation. The false positive rate is very high for projects of depth less than read length and is much more informative when there is depth that is 2X of read length. For 35bp reads the length of the nonstarter (-l) is set to 11bp, for 75bp reads -l is set to 25bp.
 1: contig         = contig/chromosome/scaffold name
 2: last_start_pos = first alignment position of read at start of region
 3: next_start_pos = first alignment position of read at end of region

the alignment below would be reported as
ctg 3 16
 
1234567890123456789012345678901234567890
GTGAACTGCACTGCGTGGCGGCGTTACCACCGCCGCCACA
........................................
.......................G.......... .....
...............................C.. ,,,,,
,,,,g,,,,,,,,,,,,,,,,,,,,,,,,,,,,, ,,,,,
.............................A..... ....
,c,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,, ....
  .....................C.T........... ..
               .........................
               ,,,g,,,t,,,,,c,,,,,,,,,,,
                ........................
                ,,g,,,,,ac,,c,,,,,,,,,,,
TOP
----------------------------------------------------------


POSSIBLE DELETIONS:

StrainName.depth.under.3.collapsed.txt 

Coordinates with under 3 reads of depth. If two ranges are less than 100 bp apart they are combined. These could be deletions, or they could be regions that are highly divergent from the reference and have problems aligning.
e.g. a contig with under 3 reads mapped to bases 1 to 38 will have this line
contig         start    -       end  length
scaffold_1      1       -       38      38 
TOP
----------------------------------------------------------


ALIGNMENT FILES:

StrainName.bam

Sequence Alignment/MAP (SAM) and BAM Format
The SAM format is a generic nucleotide alignment format that describes the alignment of query sequences or sequencing reads to a reference sequence or assembly. SAM is a tab-delimited text format. It is easy to understand, easy to parse, and easy to generate and check for errors. However, SAM is a bit slow to parse. A binary equivalent to SAM was developed to deal with this issue and is known as BAM which is used in intensive data processing. BAM is useful in most production pipelines, while SAM may be useful for interconversion with external applications and for exploratory analysis. We provide the BAM files so you can visualize the SNV and structural variation calls. Three of our favorites programs for visualizing the BAM file include samtools (http://samtools.sourceforge.net/), Integrated Genome Viewer (http://www.broadinstitute.org/igv/), and LookSeq (http://www.sanger.ac.uk/resources/software/lookseq/).
TOP
--------------------------------------------------------



HOLT ESTIMATED ALLELE FREQUENCY FILES

StrainName.hap-number.allele.frequency.txt

For analysis of populations we identify putative SNVs and estimate their allele frequencies using the script developed by Holt et al (Bioinformatics(2009)25(16):2074-2075) which was developed to work with pooled samples. Since we are working with an aliquiot from a population and not working with a set number of haplotyes, we set the haplotype to be = (average depth)/10 with an upper limit of 20.

We have evaluated the false positive and false negative rate of this script using simulated populations. These populations were made by taking sequence from different closely related strains of ecoli and pooling the reads. Since these strains had been analyzed as single clones, we were able to evaluate false positive and false negative rates. In general we found that Holt's output had a significantly lower false negative rate when compared the MAQ's cns.final.snp, however Holt's output also had a significantly higher false positive rate.
			Total Depth
			100X	200X	300X	400X	500X
Holt FPR		2%	1%	2%	35%	78%
MAQ cns.final.snp FPR	1%	1%	1%	1%	1%
Holt FNR		36%	15%	9%	5%	4%
MAQ cns.final.snp FNR	69%	46%	22%	8%	5%

Most of the false negatives were SNVs at low allele frequency. If you want a very low false positive rate, you may want to consider only using the SNVs that are called by both Holt's script and MAQ's cns.final.snp.

For organisms that have stretches of high GC or which may have significant structural variation compared to the reference, the false positive rate can be quite high. It is best to visually verify the putative SNVs to rule out these cases. In general, we assume a call is a false positive if 1) we see many G/C mismatches in the vicinity of the putative SNV, 2) the putative SNV never exists in the middle of a read, and 3) if there is a signature of a large insertion or deletion in the region where the putative SNV is called.  A few examples are shown below in the section labeled "Common causes of False Positives".

We have validated the accuracy of the allele frequency estimates using qPCR (Battista personal communication) and have found that >90% of the allele frequency estimates agree with qPCR estimates.
From Holt's documentation:
The script is designed to detect SNPs and estimate their frequency within a pool of DNA sequenced using the Illumina/Solexa Genome Analyzer. It assumes that equal amounts of DNA from a known number of bacterial strains were pooled prior to library preparation and sequencing.
NOTE 
This script relies on Maq version 0.6.8 or later.You must edit this script to point to your local installation of maq and maq.pl, which can be downloaded from maq.sourceforge.net. You can read about Maq in an open access paper in Genome Research: Li, Ruan & Durbin, 2008. Genome Res, Nov;18(11):1851-8. http://genome.cshlp.org/content/18/11/1851
	
 	This script will:
		- use Maq to map Illumina reads to a reference and call SNPs
		- filter out SNPs in non-unique regions (optional)
		- extract the pileup for each potential SNP called by Maq
		- estimate the frequency of each SNP within the pool
		
	Output table columns contain:
		[SNP info]
		- 1: Pool descriptor
		- 2: SNP position
		- 3: Reference allele
		- 4: Read depth at SNP position
		- 5: Same SNP allele using weighted and unweighted proportion? (0=no, 1=yes)
		[Unweighted estimates]
		- 6: SNP allele
		- 7: Proportion
		- 8: Standard deviation
		- 9: Estimated frequency of SNP within the pool
		-10: Estimated frequency of SNP within the pool (integer number of strains)
		-11: Lower bound of 95% confidence interval for SNP frequency
		-12: Upper bound of 95% confidence interval for SNP frequency
		[Weighted estimates]
		-13: SNP allele
		-14: Proportion
		-15: Standard deviation
		-16: Estimated frequency of SNP within the pool
		-17: Estimated frequency of SNP within the pool (integer number of strains)
		-18: Lower bound of 95% confidence interval for SNP frequency
		-19: Upper bound of 95% confidence interval for SNP frequency
TOP
----------------------------------------------------------------


COMMON CAUSES OF FALSE POSITIVES:

Below are several samtool views of situations that lead to false positive SNV calls. Unfortunately the ability to see orphan reads was lost through cutting and pasting, therefore I have described what the orphan pattern would look like for each case.

FP SNV calls at the edge of a deletion:
This is an example of a small deletion. A large deletion would have orphan fwd reads upstream of the breakpoint and orphan rev reads downstream of the breakpoint.
ref seq		CGGCTGGCAACGTCGCCAGGCACACCGCCAAGCTGGTTGCGCAGGTTTTCAACCTGCTGCCAGGCCCACTGGTTATCCTGATCCTGTTCATACC
cons seq	......................................ATC.T.                  .C.AAG..........................
		. ....................................                        .C.AAG..........................
		. ....................................                        .C.AAG..........................
		, ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,                        ,c,aag,,,,,,,,,,,,,,,,,,,,,,,,,,
		, ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,                        ,c,aag,,,,,,,,,,,,,,,,,,,,,,,,,,
		.C ...................................A                        C.AAG..........................
		..  ..................................AT                       C.AAG..........................
		,,  ..................................AT                       C.AAG..........................
		,,  ..................................AT                       C.AAG..........................
		,,  ..................................AT                       C.AAG..........................
		,,, ..................................AT                       C.AAG..........................
		,,,  .................................ATC                      C.AAG..........................
		,,,,, ................................ATC.                     c,aag,,,,,,,,,,,,,,,,,,,,,,,,,,
		,,,,, ................................ATC.                      .AAG..........................
		,,,,,, ...............................ATC.T                     .AAG..........................
		,,,,,, ...............................ATC.T                     ,aag,,,,,,,,,,,,,,,,,,,,,,,,,,
		,,,,,, ...............................ATC.T                     ,aag,,,,,,,,,,,,,,,,,,,,,,,,,,
		,,,,,, ...............................ATC.T                     ,aag,,,,,,,,,,,,,,,,,,,,,,,,,,
		.....T. ..............................ATC.T.                     AAG..........................
		....... ..............................ATC.T.                     AAG..........................
		.......                                                          AAG..........................
		,,,,,,,                                                          AAG.....................C....

FP SNV calls at the edge of a large insertion: If the insertion is very large, you will see orphan fwd reads upstream and orphan rev reads downstream.
ref seq		GAGCAGATTGATCAAAAAATTTACCGCACTAGGCCCGTATATTCGTGAAGGTAAGTGCAAAGATAATCGATTCTTTTTCGATTGTCTGGCTGTA
cons seq	............................................S........R........................................
		.....................   ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,g,,aatg  ................................
		,,,,,,,,,,,,,,,,,,,,,   ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,g,,aatg  ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
		,,,,,,,,,,,,,,,,,,,,,    ,,,,,,,,,,,,,,,,,,,,,,,,,,,,g,,aatg, ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
		......................   ,,,,,,,,,,,,,,,,,,,,,,,,,,,,g,,aatg, ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
		,,,,,,,,,,,,,,,,,,,,,,   ,,,,,,,,,,,,,,,,,,,,,,,,,,,,g,,aatg, ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
		,,,,,,,,,,,,,,,,,,,,,,    ,,,,,,,,,,,,,,,,,,,,,,,,,,,g,,aatg,c ...............................
		.......................   ,,,,,,,,,,,,,,,,,,,,,,,,,,,g,,aatg,c ...............................
		.......................               catca,c,,,,,,,,,,,,,,,,,,,,,,,,,,,,, ,,,,,,,,,,,,,,,,,,,
		,,,,,,,,,,,,,,,,,,,,,,,                ATCA.C.............................. ..................
		........................               ATCA.C.............................. ..................
		........................               agca,c,,,,,,,,,,,,,,,,,,,,,,,,,,,,,, ..................
		........................                TCA.C............................... .................
		.........................                 a,c,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,, ...............
		.........................                  ,c,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,, ,,,,,,,,,,,,,,
		.........................                  ,c,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,, ,,,,,,,,,,,,,,
		.........................                  ,c,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,  .............
		.........................                  ,c,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,  .............
		..........................                     ....................................  .........
		...........................                    ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,  .........
		...........................                    ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,  .........
		...........................                    ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,  .........

FP SNV calls in high GC regions: This is an example of getting "stuck on Cs". We also see errors where it gets "stuck on Gs".
ref seq		TTGCTGCCTGCGCGCTACGGCTGGCAAGATTGCGCACTTCACCTGCCACTACCGCAAAACCACGCCCCTGCTCTCCGGCGCGAGCCGCTTCCAC
cons seq	........................................M.....................................................
		........................      cc,c,c,cc,c,,,,,,,,,,,,,,,,,,,,,,,,,    ........................
		,,,,,,,,,,,,,,,,,,,,,,,,      ,,,c,c,,c,c,,,,,,,,,,,,,,,,,,,,,,,,,    ........................
		,,,,,,,,,,,,,,,,,,,,,,,,      ,,,,,c,,c,c,,,,,,,,,,,,,,,,,,,,,,,,,    ,,,c,,,,,,,,,,,,,,,,,,,,
		,,,,,,,,,,,,,,,,,,,,,,,,      ,c,c,c,,c,c,,,,,,,,,,,,,,,,,,,,,,,,,    ,c,c,,,,,,,,,,,,,,,,,,,,
		,,,,,,,,,,,,,,,,,,,,,,,,,      ....................................   ,,g,,,,,,,,,,,,,,,,,,,,,
		,,,,,,,,,,,,,,,,,,,,,,,,,      ,,c,c,,c,c,,,,,,,,,,,,,,,,,,,,,,,,,,   ,,,c,g,,,,g,,,,,,,,,,,,,
		,,,,,,,,,,,,,,,,,,,,,,,,,,     ,,c,c,cc,c,,,,,,,,,,,,,,,,,,,,,,,,,,   ,c,c,,,,,,g,,,,,,,,,,,,,
		,,,,,,,,,,,,,,,,,,,,,,,,,,     c,c,c,,c,c,,,,,,,,,,,,,,,,,,,,,,,,,,    .......................
		...........................    c,c,c,,c,c,,,,,,,,,,,,,,,,,,,,,,,,,,    .......................
		...........................    c,c,c,,c,c,,,,,,,,,,,,,,,,,,,,,,,,,,    .......................
		...........................    ,,c,c,,c,c,,,,,,,,,,,,,,,,,,,,,,,,,,    .......................
		...........................    ,,c,c,,c,c,,,,,,,,,,,,,,,,,,,,,,,,,,    .......................
		...........................    ,,c,c,,c,c,,,,,,,,,,,,,,,,,,,,,,,,,,    ,,c,,,,,,g,,,,,,,,,,,,,
		,,,,,,,,,,,,,,,,,,,,,,,,,,,    c,c,c,,c,c,,,,,,,,,,,,,,,,,,,,,,,,,,     ......................
		............................    ....................................    ......................
		............................    ....................................    ......................
		............................    ,c,c,,c,c,,,,,,,,,,,,,,,,,,,,,,,,,,,    ,c,,,,,,,,,,,,,,,,,,,,
		,,,,,,,,,,,,,,,,,,,,,,,,,,,,    ,c,c,cc,c,,,,,,,,,,,,,,,,,,,,,,,,,,,    ,c,,,,,,,,,,,,,,,,,,,,
		,,,,,,,,,,,,,,,,,,,,,,,,,,,,    ,,,c,,c,c,,,,,,,,,,,,,,,,,,,,,,,,,,,     .....................
		.............................    ....................................    c,g,,,,g,,,,,,,,,,,,,

TOP