###################################################################
# Data release of 5 deeply sequenced Arabidopsis thaliana strains.
# (Col-0, Kro-0, C24, Ler-1, Bur-0)
# date: 2011_05_10
#
# Questions? Please contact 
# Korbinian Schneeberger
# <schneeberger@mpipz.mpg.de>
# 
###################################################################


Data analysis description:

Initially, each of the genomes was analyzed using the short-read
resequencing pipeline SHORE (Ossowski et al, Genome Res, 2008). 
Later, we used a reference-guided assembly pipeline to further 
assemble the genomes.

We provide resequencing-based SNPs, differences derived from
whole-genome alignments against the reference sequence (WGA_Variants), 
reads, and PAV (presence/absence variation) for each gene for the
accessions Bur-0, C24, Kro-0, and Ler-1.

The directory structure is as follows:

  |-- <strain_name>
      |-- Assemblies
      |-- Marker
      |-- WGA_Variants
      |-- Reads
      `-- PAV 


1. Assemblies
-------------
Reference-guided whole-genome assemblies. Two different versions,
based on different stringency criteria, are provided. We recommend
the "High_Quality" version, though "Standard" might include more
sequence.

|-- <strain_name>
    |-- Assemblies
        |-- High_Quality
        |-- Standard

In addition to standard and high quality assemblies for Ler-1,
we provide a de novo assembly generated with ALLPATHS-LG. It's
located at

|-- Ler-1
    `-- Assemblies
        `-- Allpaths_LG


2. Marker
---------
SNP calls resulting from the resequencing analysis.
This data has proven to be useful for linkage analysis, e.g. 
in genetic mapping. 

This folder contains four files, which are described below:

|-- <strain_name>
    |-- Marker
        |-- <strain_name>.215k.TAIR8.csv
        |-- <strain_name>.SNPs.TAIR8.txt
        |-- <strain_name>.215k.TAIR9.csv
        `-- <strain_name>.SNPs.TAIR9.txt


The TAIR8 and TAIR9 files describe the same information, but with
respect to different versions of the reference assembly.

<strain_name>.SNPs.TAIRX.txt describes high quality SNP markers.

<strain_name>.215k.TAIRX.csv describes the base calls at those
positions that were queried with tiling arrays in the study by
Atwell et al. (Nature, 2010). This data can be combined directly
with the data from that study.


For information about the 215k subset please see:
Atwell et al, Genome-wide association study of 107 phenotypes in 
Arabidopsis thaliana inbred lines, Nature, 2010.


3. Reads
--------

|-- <strain_name>
    |-- Reads

Tarballs of all reads used in this study. Reads are separated
by sequencing library and are provided in fastq format.
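
To see which fastq files a tarball contains before unpacking it, a
minimal Python sketch like the following may help. The function name
list_tar_members and the archive path in the usage note are
hypothetical; the README does not state the tarball file names or
their compression.

```python
import tarfile
from typing import List, Tuple


def list_tar_members(path: str) -> List[Tuple[str, int]]:
    """Return (member name, size in bytes) for every regular file in the archive."""
    # Mode "r:*" lets tarfile detect gzip/bzip2/xz compression on its own.
    with tarfile.open(path, "r:*") as archive:
        return [(m.name, m.size) for m in archive.getmembers() if m.isfile()]
```

Usage, with a hypothetical file name:
list_tar_members("Reads/some_library.tar.gz")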


4. WGA Variants
---------------

|-- <strain_name>
    |-- WGA_Variants
        |-- del.annotation.TAIR8.txt
        |-- hdr.annotation.TAIR8.txt
        |-- ins.annotation.TAIR8.txt
        `-- snp.annotation.TAIR8.txt

The files describe the deletions, insertions and SNPs derived 
from a whole-genome alignment against the reference sequence.
HDRs, or highly diverged regions, are regions that lie between
conserved regions but could not be aligned to each other. Because
both genomes contain sequence in these regions, they are neither
insertions nor deletions.

5. PAV
------
Presence/absence variation of alignable genes.

|-- <strain_name>
    |-- PAV
        |-- PAV_Genes_<strain_name>.list

PAV_Genes_<strain_name>.list holds the presence/absence
variation calls for alignable genes.



###################################################################
# File format description
###################################################################


1. Assemblies
-------------
Assemblies are provided in fasta format.
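
As a quick sanity check on an assembly, the following Python sketch
reads fasta-formatted lines and reports the length of each sequence.
The function name read_fasta is hypothetical and not part of this
release.

```python
from typing import Dict, Iterable


def read_fasta(lines: Iterable[str]) -> Dict[str, int]:
    """Return a mapping from sequence name to sequence length."""
    lengths: Dict[str, int] = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header line: the first whitespace-delimited token is the name.
            tokens = line[1:].split()
            name = tokens[0] if tokens else ""
            lengths[name] = 0
        elif name is not None:
            # Sequence line: add its length to the current record.
            lengths[name] += len(line)
    return lengths
```

Usage: pass an open file handle, e.g. read_fasta(open(path)), where
path points to one of the assembly files.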

2. SNPs
-------

<strain_name>.SNPs.TAIRX.txt
	<Sample>
	<Chromosome>
	<Position>
	<Reference base>
	<Substitution base>
	<Quality>
	<# of nonrepetitive reads supporting substitution>
	<Concordance>
	<Avg. # alignments of overlapping reads>
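
The nine fields above could be read in Python along these lines. This
is a sketch under assumptions: it supposes one record per line with
tab-separated columns in the listed order, and it skips lines starting
with "#"; neither the delimiter nor the presence of comment lines is
stated in this README. SnpRecord and parse_snps are hypothetical
names. Only the position is converted to an integer; the remaining
fields are kept as text because their exact encoding is not described
here.

```python
from typing import Iterable, Iterator, NamedTuple


class SnpRecord(NamedTuple):
    sample: str
    chromosome: str
    position: int
    ref_base: str
    sub_base: str
    quality: str
    support: str          # number of nonrepetitive supporting reads
    concordance: str
    avg_alignments: str   # avg. number of alignments of overlapping reads


def parse_snps(lines: Iterable[str], sep: str = "\t") -> Iterator[SnpRecord]:
    """Yield one SnpRecord per data line (assumed delimiter: tab)."""
    for line in lines:
        line = line.rstrip("\r\n")
        if not line or line.startswith("#"):
            continue  # assumption: blank and '#' lines carry no data
        fields = line.split(sep)
        if len(fields) < 9:
            raise ValueError("expected 9 columns, got %d" % len(fields))
        yield SnpRecord(fields[0], fields[1], int(fields[2]), *fields[3:9])
```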

<strain_name>.215k.TAIRX.csv
	<Chromosome>,<Position>,<BaseCall>
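
A minimal Python sketch for loading the 215k base calls into a
dictionary keyed by (chromosome, position) follows. The function name
load_215k_calls is hypothetical. Skipping rows whose position is not
numeric is an assumption that guards against a possible header row;
this README does not say whether one exists.

```python
import csv
from typing import Dict, Iterable, Tuple


def load_215k_calls(lines: Iterable[str]) -> Dict[Tuple[str, int], str]:
    """Map (chromosome, position) to the base call at that position."""
    calls: Dict[Tuple[str, int], str] = {}
    for row in csv.reader(lines):
        # Skip blank lines and any non-numeric (header) row.
        if len(row) < 3 or not row[1].strip().isdigit():
            continue
        calls[(row[0].strip(), int(row[1]))] = row[2].strip()
    return calls
```

Keying by position makes it straightforward to look up the base call
for any array position in another data set that uses the same
reference version.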


3. Reads
--------

All reads are in fastq format.
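
For reference, a fastq record normally spans four lines: a header
starting with "@", the sequence, a separator starting with "+", and
the quality string. The Python sketch below iterates over such
records. It assumes unwrapped four-line records; read_fastq is a
hypothetical name.

```python
from typing import Iterable, Iterator, Tuple


def read_fastq(lines: Iterable[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (read name, sequence, quality string) for each record."""
    it = iter(lines)
    for header in it:
        header = header.rstrip("\r\n")
        if not header:
            continue  # tolerate blank lines between records
        seq = next(it, "").rstrip("\r\n")
        plus = next(it, "").rstrip("\r\n")
        qual = next(it, "").rstrip("\r\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed fastq record near: " + header)
        if len(seq) != len(qual):
            raise ValueError("sequence/quality length mismatch: " + header)
        yield header[1:], seq, qual
```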


4. WGA Variants
---------------

del.annotation.TAIR8.txt
	<Sample>
	<Chromosome>
	<Begin>
	<End>
	<Deleted (reference) allele>
	<Annotation>

hdr.annotation.TAIR8.txt
	TBD
			
ins.annotation.TAIR8.txt
	<Sample>
	<Chromosome>
	<Begin>
	<End>
	<Inserted (novel) allele>
	<Annotation>
		
snp.annotation.TAIR8.txt
	<Sample>
	<Chromosome>
	<Position>
	<Reference allele>
	<New allele>
	<Annotation>	
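
In the deletion, insertion and SNP files the annotation is the sixth
field, so one small Python helper can summarize any of the three. The
sketch below counts how often each annotation value occurs. It
assumes tab-separated columns, which this README does not state, and
count_annotations is a hypothetical name.

```python
from collections import Counter
from typing import Iterable


def count_annotations(lines: Iterable[str], sep: str = "\t") -> Counter:
    """Count occurrences of each value in the sixth (annotation) column."""
    counts: Counter = Counter()
    for line in lines:
        fields = line.rstrip("\r\n").split(sep)
        if len(fields) < 6:
            continue  # skip blank or incomplete lines
        counts[fields[5]] += 1
    return counts
```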


5. PAV
------

PAV_Genes_<strain_name>.list
	<ref_gene_id>
	<strain_gene_id>

<strain_gene_id> is either equal to <ref_gene_id> or 'NA',
if no gene alignment is present.
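
To extract the genes without an alignment in a strain, the list can be
filtered on the second column. The Python sketch below assumes the
two fields are separated by whitespace, which is not stated here;
absent_genes is a hypothetical name.

```python
from typing import Iterable, List


def absent_genes(lines: Iterable[str]) -> List[str]:
    """Return reference gene IDs whose strain column is 'NA'."""
    absent: List[str] = []
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or incomplete lines
        ref_gene_id, strain_gene_id = fields[0], fields[1]
        if strain_gene_id == "NA":
            absent.append(ref_gene_id)
    return absent
```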



-------------------------------------------------------------------
Cologne, Germany, 2011
Questions?
Korbinian Schneeberger <schneeberger@mpipz.mpg.de>

