###################################################################
# Data release of 80 Arabidopsis thaliana accessions.
# date: 2011_01_28
###################################################################

Data analysis description:

The genomes have been analyzed separately and in a joint approach.
First, each of the genomes was analyzed separately using the short
read pipeline SHORE (Ossowski et al, Genome Res, 2008). Later, the
position-wise presence of alleles was used to instantiate low quality
base calls in other accessions (see Cao et al for further
information).

Within the 'genome_matrix' folder you will find three files. The files
TAIR8_genome_matrix_2010_10_18.txt, TAIR9_genome_matrix_2010_10_18.txt
and TAIR10_genome_matrix_2010_10_18.txt feature the outcome of the
joint approach, a matrix with base calls of all 80 genomes annotated
by their reference positions. Further information about this matrix
and its columns is given in README_GenomeMatrix.

In the folder 'strains' there are 80 different subfolders, one for
each analyzed strain. Each of these folders holds the output of the
separate analysis of that strain, arranged in four subfolders:

|-- /
    |-- TAIR8/
    |-- TAIR9/
    |-- TAIR10/
    `-- reads/

The TAIR8, TAIR9 and TAIR10 folders describe the same analysis but
annotated against different assemblies. TAIR10 and TAIR9 do not differ
in positions, so these two are equivalent. As the short read analysis
was performed against the TAIR8 assembly, we afterwards mapped the
TAIR8 positions to their respective TAIR9 positions. Thus, a minor
fraction of TAIR9 assembly positions do NOT feature base calls (i.e.
the ones introduced into the assembly when upgrading from TAIR8 to
TAIR9).

Short reads in FASTQ format are contained in the reads/ subfolder.

Within each of the TAIR8, TAIR9 and TAIR10 folders are the files
listed and described in the following (11 in TAIR8, 10 each in TAIR9
and TAIR10, which lack SV_deletion_complex.PE.txt.gz):

|--
    |-- TAIR8
    |   |-- SV_deletion.PE.txt.gz
    |   |-- SV_deletion_complex.PE.txt.gz
    |   |-- SV_insertion.PE.txt.gz
    |   |-- SV_inversion.PE.txt.gz
    |   |-- filtered_reference.txt.gz
    |   |-- filtered_variant.txt.gz
    |   |-- inaccessible_regions.txt.gz
    |   |-- inaccessible_regions_low_cov.txt.gz
    |   |-- insertion.txt.gz
    |   |-- working_reference.txt.gz
    |   `-- working_variant.txt.gz
    |-- TAIR9
    |   |-- SV_deletion.PE.txt.gz
    |   |-- SV_insertion.PE.txt.gz
    |   |-- SV_inversion.PE.txt.gz
    |   |-- filtered_reference.txt.gz
    |   |-- filtered_variant.txt.gz
    |   |-- inaccessible_regions.txt.gz
    |   |-- inaccessible_regions_low_cov.txt.gz
    |   |-- insertion.txt.gz
    |   |-- working_reference.txt.gz
    |   `-- working_variant.txt.gz
    `-- TAIR10
        |-- SV_deletion.PE.txt.gz
        |-- SV_insertion.PE.txt.gz
        |-- SV_inversion.PE.txt.gz
        |-- filtered_reference.txt.gz
        |-- filtered_variant.txt.gz
        |-- inaccessible_regions.txt.gz
        |-- inaccessible_regions_low_cov.txt.gz
        |-- insertion.txt.gz
        |-- working_reference.txt.gz
        `-- working_variant.txt.gz

###################################################################
# Strain folders: File content description.
###################################################################

1) Quality Data (filtered_reference, filtered_variant and insertion)

Use this data for high quality annotations of genome differences.
filtered_reference and filtered_variant represent the outcome of the
joint analysis approach.

filtered_reference.txt.gz
    Positions featuring the reference base, used in the
    genome_matrix_2010_03_25.txt, including a quality value.

filtered_variant.txt.gz
    Positions and annotation of SNPs (single nucleotide polymorphisms)
    and 1-3bp deletions detected through short read alignments,
    including a quality value.

insertion.txt.gz
    1-3bp insertions annotated through the short read alignments.
    Note, the algorithm used is different from the algorithm used for
    the 1-3bp deletions.
    Thus, a comparison between the complement of deletions and
    insertions suffers from different recall rates and will hardly
    allow any meaningful conclusions. Insertions have not been
    included in the genome matrix.

2) Working Data (working_reference and working_variant)

These files describe the output of the separate analysis of each of
the accessions. Each call (reference and variant) has a quality value
attached. This quality value ranges from 1 to 40, where 40 specifies
highest quality. Note, these files feature all calls, but calls with
quality values below 20 are hardly reliable. This set of predictions
can be used to adjust sensitivity and specificity by selecting a more
or less stringent quality value threshold.

working_reference.txt.gz
    Positions featuring the reference base.

working_variant.txt.gz
    Positions predicted to be different from the reference, either
    base changes or deletions. (No insertions are found in this file.)

3) SV Data (SV_deletion.PE, SV_deletion_complex.PE, SV_insertion.PE
   and SV_inversion.PE)

Paired-end mapping analysis. Using distance and orientation of the two
reads of each clone, three types of structural variations were
predicted. Later, the deletion predictions were split into two sets.
One contains the deletions where clearly a single deletion event is
described, the other one describes deletion events in regions with
more complex rearrangements.

SV_deletion.PE.txt.gz
    Large deletion predictions. Note, deletions shorter than 10bp have
    a bad selectivity and sensitivity and have only been included for
    completeness. SV deletions have been included in the genome
    matrix. A unique ID value was attached to each prediction in order
    to find the identical deletion in other accessions again. This
    allows for frequency analysis of SV deletions.

SV_deletion_complex.PE.txt.gz
    Large deletion predictions found to be in the context of more
    complex rearrangements.

SV_insertion.PE.txt.gz
    Large insertions. See notes on SV deletions.
    SV insertions have not been included in the genome matrix.

SV_inversion.PE.txt.gz
    Inversion predictions. SV inversions have also not been included
    in the genome matrix.

4) Inaccessible Regions (inaccessible_regions and
   inaccessible_regions_low_cov)

Absence of sufficient short read alignments or presence of merely
repetitive alignments usually prevents base calling in resequencing
analysis. The following two files describe such inaccessible regions,
though based on different classification schemes.

inaccessible_regions.txt.gz
    Describes all regions which did not allow for a base call
    (including SV deletions) independent of the reason. Within the
    genome matrix these regions are annotated as "Z" or "N" (see
    README_GenomeMatrix).

inaccessible_regions_low_cov.txt.gz
    Any region without coverage (aka unsequenced regions).

###################################################################
# File format description
###################################################################

filtered_reference.txt
    <# of nonrepetitive reads supporting substitution>

filtered_variant.txt
    <# of nonrepetitive reads supporting substitution>

insertion.txt

working_reference.txt
    <#N> <# repetitive positions>

working_variant.txt
    <#N> <# repetitive positions>

SV_deletion.PE.txt
    <# read pairs supporting SV call>
    <# positions w/o nonrepetitive core alignment>
    <# positions w/o core alignment>

SV_deletion_complex.PE.txt
    <# read pairs supporting SV call>
    <# positions w/o nonrepetitive core alignment>
    <# positions w/o core alignment>

SV_insertion.PE.txt
    <# read pairs supporting SV call>

SV_inversion.PE.txt
    <# read pairs supporting upstream break>
    <# read pairs supporting downstream break> <> <>

inaccessible_regions.txt

inaccessible_regions_low_cov.txt
    <#N> <# repetitive positions>

-------------------------------------------------------------------
Tuebingen, Germany, 2010

Questions?
Korbinian Schneeberger
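
-------------------------------------------------------------------
Example: selecting working calls by quality value

The Working Data section notes that a quality threshold can be chosen
to trade sensitivity against specificity. The following is only a
minimal sketch of such a filter, not part of the released pipeline.
It assumes the working_* files are tab-separated text; the function
names and the quality_col parameter are hypothetical, and the 0-based
index of the quality column must be supplied by the caller after
checking the file format description.

```python
import gzip
from typing import Iterable, Iterator


def filter_by_quality(lines: Iterable[str], quality_col: int,
                      min_quality: float = 20, sep: str = "\t") -> Iterator[str]:
    """Yield the lines whose quality value is at least min_quality.

    quality_col is the 0-based index of the quality column. It is an
    assumption left to the caller; it is not defined by this sketch.
    """
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.rstrip("\n").split(sep)
        try:
            quality = float(fields[quality_col])
        except (IndexError, ValueError):
            continue  # no usable quality value on this line
        if quality >= min_quality:
            yield line


def filter_file(path: str, quality_col: int,
                min_quality: float = 20) -> Iterator[str]:
    """Apply filter_by_quality to a gzip-compressed text file."""
    with gzip.open(path, "rt") as handle:
        yield from filter_by_quality(handle, quality_col, min_quality)
```

The default threshold of 20 follows the note that calls with quality
values below 20 are hardly reliable. Raising min_quality gives a
stricter set of calls, lowering it a more permissive one.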