About the 1001 Genomes Project
Background on Arabidopsis thaliana
The first genome sequence of any plant was from a single inbred strain (accession) of A. thaliana. Its complete release in 2000 was a major milestone for biology. The 120 Mb genome sequence of the Columbia (Col-0) accession propelled A. thaliana to the forefront of efforts to understand the genetic basis of quantitative variation among natural accessions. A particular advantage for such analyses is that locally adapted lines collected from the wild are typically inbred, because the species is predominantly selfing. Natural A. thaliana accessions, which occur throughout much of the northern hemisphere, show tremendous phenotypic variation in physiological, morphological and life history traits, including metabolite content, flowering and germination behaviour, light and stress response, and disease resistance.
The availability of naturally inbred strains enables repeated phenotyping of the same, adapted genotype under diverse controlled conditions, making A. thaliana extremely well suited for studying genotype-environment interactions, a problem of direct and obvious importance not only to evolutionary scientists or plant breeders, but also to human biology, where such experiments are generally not possible.
Past efforts in identifying whole-genome sequence variation in natural accessions
To accelerate the discovery of variants that affect quantitative traits in natural accessions, two previous projects had as their main aim the identification of genome-wide polymorphisms in A. thaliana. A few years ago, Magnus Nordborg (Los Angeles, USA; now at the Gregor Mendel Institute, Vienna) and colleagues (http://walnut.usc.edu/2010) initiated a project in which they dideoxy-sequenced some 1,000 fragments across the genome in 96 accessions. The major conclusions from this and similar, smaller studies by others were that there has been considerable global gene flow, such that most sequence variants are found worldwide, but that there is moderate population structure; both properties are very much reminiscent of humans. Based on the Nordborg et al. data, 20 diverse accessions were selected for much deeper polymorphism discovery using an array-based resequencing approach, spearheaded by Detlef Weigel (Max Planck Institute) in collaboration with Joe Ecker (Salk Institute), Nordborg, Perlegen Sciences, and several other colleagues (Clark et al., Science 2007). Together, almost 10% of all protein-coding genes were found to harbour drastic-effect SNPs such as premature stop codons, or appeared to be deleted (or at least very different in sequence) in at least one accession, while almost 200 SNPs were predicted to lead to longer open reading frames. These findings highlighted the fact that a single reference genome is not sufficient to determine the entire gene complement of a species.
A practical motivation for this study was to enable genome-wide association (GWA) mapping. The progress that human geneticists have made with GWA mapping in the past two years has been nothing short of phenomenal, and bodes very well for applying association mapping to A. thaliana. In the global sample of 20 accessions, chosen for maximum genetic diversity, linkage disequilibrium (LD) decays on average within about 10 kb, similar to humans (Kim et al., Nature Genetics 2007). That average LD in the two species is not so different might seem surprising, given the selfing nature of A. thaliana, but it reflects the fact that outcrossing is not that rare. The results from this enterprise have been used to design a 250k SNP chip for GWA studies, and the genotyping of 1,200 strains is currently underway and should be completed in 2010, in a collaboration between Justin Borevitz and Joy Bergelson (University of Chicago) and Magnus Nordborg (http://walnut.usc.edu/2010/SNPs).
The next step: A single genome is not enough
It has become increasingly clear that it is dangerous to think about "the" genome of a species, even though this is what the initial sequencing papers stated in their titles just a few years ago. The previous emphasis on relatively minor changes between individuals was largely due to the fact that sequence variation has overwhelmingly been studied by PCR- or hybridisation-based methods. Along these lines it is worth reiterating that the often-quoted 1% divergence between humans and chimps turned out to be a red herring. While we differ from our closest living relatives only by about one out of every hundred bases that can be aligned, there is a much larger fraction of our genome(s) that we do not share at all. Of similar importance is the observation that some genes with fundamental effects on life history traits such as flowering are not even functional in the first A. thaliana accession sequenced, and thus would not have been appreciated based on the first genome alone.
We demonstrated the practicality of the 1001 Genomes project in 2008 by showing that even single-end short reads could reveal the majority of sequence changes in A. thaliana accessions (Ossowski et al., 2008). By completing genome sequencing of 1001 accessions, we will not only fill in the gaps between the HapMap tag-SNPs, but also create a resource that is large enough to be used directly for association mapping, ideally narrowing causal variants down to individual nucleotides. We will thus tremendously shorten the time required to link a specific genotype to a particular phenotype.