###################################################################
# Data release of 5 deeply sequenced Arabidopsis thaliana strains.
# (Col-0, Kro-0, C24, Ler-1, Bur-0)
# date: 2011_05_10
#
# Questions? Please contact
# Korbinian Schneeberger
#
#
###################################################################

Data analysis description:

Initially, each of the genomes was analyzed using the short-read
resequencing pipeline SHORE (Ossowski et al, Genome Res, 2008).
Later, we used a reference-guided assembly pipeline to further
assemble the genomes.

We provide resequencing-based SNPs, differences derived from
whole-genome alignments against the reference sequence
(WGA_Variants), reads, and PAV (presence/absence variation) for each
gene for the accessions Bur-0, C24, Kro-0, and Ler-1.

The directory structure is as follows:

|--
    |-- Assemblies
    |-- Marker
    |-- WGA_Variants
    |-- Reads
    `-- PAV

1. Assemblies
-------------

Reference-guided whole-genome assemblies. Two versions, based on
different stringency criteria, are provided. We recommend the
"High_Quality" version, though "Standard" might include more
sequence.

|--
    |-- Assemblies
        |-- High_Quality
        |-- Standard

In addition to the standard and high-quality assemblies for Ler-1,
we provide a de novo assembly generated with ALLPATHS-LG. It is
located at

|-- Ler-1
    `-- Assemblies
        `-- Allpaths_LG

2. Marker
---------

SNP calls resulting from the resequencing analysis. These data have
proven useful for linkage analysis, e.g. in genetic mapping.
This folder contains 4 files, which are described in the following:

|--
    |-- Marker
        |-- .215k.TAIR8.csv
        |-- .SNPs.TAIR8.txt
        |-- .215k.TAIR9.csv
        `-- .SNPs.TAIR9.txt

The TAIR8 and TAIR9 files describe the same information, but with
respect to the different versions of the reference assembly.

.SNPs.TAIRX.txt describes high-quality SNP markers.
.215k.TAIRX.csv describes the base calls at those positions that
were queried with tiling arrays in the study of Suzi Atwell,
published in Nature in 2010. These data can be combined directly
with the data of Suzi Atwell.

For information about the 215k subset please see:
Atwell et al, Genome-wide association study of 107 phenotypes in
Arabidopsis thaliana inbred lines, Nature, 2010.

3. Reads
--------

|--
    |-- Reads

Tarballs of all reads used in this study. Reads are separated by
sequencing library and are provided in fastq format.

4. WGA Variants
---------------

|--
    |-- WGA_Variants
        |-- del.annotation.TAIR8.txt
        |-- hdr.annotation.TAIR8.txt
        |-- ins.annotation.TAIR8.txt
        `-- snp.annotation.TAIR8.txt

The files describe the deletions, insertions and SNPs derived from a
whole-genome alignment against the reference sequence.
An HDR, or highly diverged region, is a region that lies between
conserved regions but could not be aligned, although both genomes
feature sequence there. These regions are therefore neither
insertions nor deletions.

5. PAV
------

Presence/absence variation of alignable genes.

|--
    |-- PAV
        |-- PAV_Genes_.list

PAV_Genes_.list holds a list of presence/absence variation of
alignable genes.

###################################################################
# File format description
###################################################################

1. Assemblies
-------------

Assemblies are provided in fasta format.

2. SNPs
-------

.SNPs.TAIRX.txt
<# of nonrepetitive reads supporting substitution>

.215k.TAIRX.csv
,,

3. Reads
--------

All reads are in fastq format.

4. WGA Variants
---------------

del.annotation.txt

hdr.annotation.txt
TBD

ins.annotation.txt

snp.annotation.txt

5. PAV
------

PAV_Genes_.list
is either equal to or 'NA', if no gene alignment is present.

-------------------------------------------------------------------
Cologne, Germany, 2011
Questions? Korbinian Schneeberger
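
###################################################################
# Example: inspecting an assembly file
###################################################################

The assemblies are provided in fasta format, so a quick sanity check
after download is to list the length of every sequence in a file.
The following is a minimal Python sketch and is not part of the data
release. The function name fasta_lengths, the file name in the usage
comment and the sequence names are hypothetical, and the sketch
assumes an uncompressed, plain-text fasta file.

```python
from typing import Dict, Iterable


def fasta_lengths(lines: Iterable[str]) -> Dict[str, int]:
    """Return {sequence name: sequence length} for fasta-formatted lines."""
    lengths: Dict[str, int] = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Header line: use the first whitespace-delimited token as name.
            parts = line[1:].split()
            name = parts[0] if parts else ""
            lengths[name] = 0
        elif name is not None:
            # Sequence line: add its length to the current record.
            lengths[name] += len(line)
    return lengths


# Usage (hypothetical file name; an open file is an iterable of lines):
#
#     with open("assembly.fasta") as handle:
#         for name, length in fasta_lengths(handle).items():
#             print(name, length, sep="\t")
```

The function takes any iterable of lines rather than a path, so it
works on an open file handle as well as on a list of strings held in
memory.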