About the 1001 Genomes Project
Background on Arabidopsis thaliana
The first genome sequence of any plant was from a single inbred strain (accession) of A. thaliana. Its complete release in 2000 was a major milestone for biology. The 120 Mb genome sequence of the Columbia (Col-0) accession propelled A. thaliana to the forefront of efforts to understand the genetic basis of quantitative variation among natural accessions. A particular advantage for such analyses is that locally adapted lines collected from the wild are typically inbred, because the species is predominantly selfing. Natural A. thaliana accessions, which occur throughout much of the northern hemisphere, show tremendous phenotypic variation in physiological, morphological and life history traits, including metabolite content, flowering and germination behaviour, light and stress response, and disease resistance.
The availability of naturally inbred strains enables repeated phenotyping of the same, adapted genotype under diverse controlled conditions. This makes A. thaliana extremely well suited for studying genotype-environment interactions, a problem of direct and obvious importance not only to evolutionary biology and plant breeding, but also to human biology, where such experiments are generally not possible.
Past efforts in identifying whole-genome sequence variation in natural accessions
To accelerate the discovery of variants that affect quantitative traits in natural accessions, two previous projects had as their main aim the identification of genome-wide polymorphisms in A. thaliana. In the first, Magnus Nordborg (now at the Gregor Mendel Institute, Vienna) and colleagues dideoxy-sequenced some 1,000 fragments across the genome of 96 accessions. The major conclusions from this and similar, smaller studies by others were that there has been considerable global gene flow, such that most sequence variants are found worldwide, but that there is moderate population structure. Both properties are very much reminiscent of humans. Based on the Nordborg et al. (2005) data, 20 diverse accessions were selected for much deeper polymorphism discovery using an array-based resequencing approach, spearheaded by Detlef Weigel (Max Planck Institute) in collaboration with Joe Ecker (Salk Institute), Nordborg, Perlegen Sciences, and several other colleagues (Clark et al., Science 2007). Together, almost 10% of all protein-coding genes were found to harbour drastic-effect SNPs such as premature stop codons, or appeared to be deleted (or at least to be very different in sequence) in at least one accession, while almost 200 SNPs were predicted to lead to longer open reading frames. These findings highlighted the fact that a single reference genome is not sufficient to determine the entire gene complement of a species.
A practical motivation for the resequencing study was to enable genome-wide association studies (GWAS). In the global sample of 20 accessions, chosen for maximum genetic diversity, linkage disequilibrium (LD) decays on average within about 10 kb, similar to humans (Kim et al., Nature Genetics 2007). That average LD in the two species is not so different might seem surprising, given the selfing nature of A. thaliana, but it reflects the fact that outcrossing is not that rare. The results from this enterprise were used to design a 250k SNP array, with multiple markers in each haplotype block (Kim et al., 2007). For some phenotypes, such as disease resistance, GWAS was shown to be successful when as few as 96 accessions were genotyped with this array (Atwell et al., 2010). The 250k SNP array was subsequently used to genotype the RegMap collection of 1307 diverse accessions, which not only provided a fantastic GWAS resource, but also revealed new aspects of the species' history (Horton et al., 2012).
The next step: A single genome is not enough
It has become increasingly clear that it is dangerous to think about "the" genome of a species, even though this is what the initial sequencing papers stated in their titles just a few years ago. The previous emphasis on relatively minor changes between individuals was largely due to the fact that sequence variation has overwhelmingly been studied by PCR- or hybridisation-based methods, which detect only variants in sequences that are already known. Along these lines, it is worth reiterating that the often-quoted 1% divergence between humans and chimps turned out to be a red herring. While we differ from our closest living relatives by only about one out of every hundred bases that can be aligned, there is a much larger fraction of our genome(s) that we do not share at all. Of similar importance is the observation that some genes with fundamental effects on life history traits such as flowering are not even functional in the first A. thaliana accession sequenced, and thus would not have been appreciated based on the first genome alone.
There were several motivations for the recently completed first phase of the 1001 Genomes Project: to quantify genome variation in a large and representative sample of accessions; to investigate the demographic history of the species; to identify features that make specific geographic or genetic subsets particularly well suited for forward genetics, field experiments and selection scans; and to provide a powerful GWAS resource. Previous studies had shown that the ability to detect footprints of selection depended greatly on the sample (e.g., Cao et al., 2011; Long et al., 2013; Huber et al., 2014). Similarly, while GWAS have identified common alleles with major effects from as few as 96 accessions (Aranzana et al., 2005; Atwell et al., 2010), a much larger sample is required for most traits. The SNP-genotyped RegMap panel (Horton et al., 2012) provided such a collection, but did not efficiently capture all SNPs and structural variants. Fully sequencing this collection would be of limited benefit, as one could accurately impute the missing data by sequencing a subset. We therefore assembled a set of accessions that overlaps sufficiently with the RegMap panel to allow imputation of variants in all lines. The combined collection constitutes a powerful resource for determining how genetic variation translates into phenotypic variation.