About the 1001 Genomes Project

Background on Arabidopsis thaliana

The first genome sequence of any plant was from a single inbred strain (accession) of A. thaliana. Its complete release in 2000 was a major milestone for biology. The 120 Mb genome sequence of the Columbia (Col-0) accession propelled A. thaliana to the forefront of efforts to understand the genetic basis of quantitative variation among natural accessions. A particular advantage for such analyses is that locally adapted lines collected from the wild are typically inbred, because the species is predominantly selfing. Natural A. thaliana accessions, which occur throughout much of the northern hemisphere, show tremendous phenotypic variation in physiological, morphological and life history traits, including metabolite content, flowering and germination behavior, light and stress response, and disease resistance.

The availability of naturally inbred strains enables repeated phenotyping of the same, adapted genotype under diverse controlled conditions. This makes A. thaliana extremely well suited for studying genotype-environment interactions, a problem of direct and obvious importance not only to evolutionary scientists and plant breeders, but also to researchers in human biology, where such experiments are generally not possible.

Past efforts in identifying whole-genome sequence variation in natural accessions

To accelerate the discovery of variants that affect quantitative traits in natural accessions, two previous projects had as their main aim the identification of genome-wide polymorphisms in A. thaliana. Soon after publication of the first reference genome, Magnus Nordborg (Gregor Mendel Institute, Vienna) and colleagues initiated a project in which they dideoxy-sequenced some 1,000 fragments across the genome in 96 accessions. The major conclusions from this and similar, smaller studies by others were that there has been considerable global gene flow, such that most sequence variants are found worldwide, but that there is moderate population structure. Both properties are very much reminiscent of humans. Based on these data, 20 diverse accessions were selected for much deeper polymorphism discovery using an array-based resequencing approach, spearheaded by Detlef Weigel (Max Planck Institute) in collaboration with Joe Ecker (Salk Institute), Nordborg, Perlegen Sciences, and several other colleagues. Together, almost 10% of all protein-coding genes were found to harbor drastic-effect SNPs such as premature stop codons, or appeared to be deleted (or at least to be very different in sequence), in at least one accession, while almost 200 SNPs were predicted to lead to longer open reading frames. These findings highlighted the fact that a single reference genome is not sufficient to determine the entire gene complement of a species.

A practical motivation for this study was to enable genome-wide association studies (GWAS). In the global sample of 20 accessions, chosen for maximum genetic diversity, linkage disequilibrium (LD) decays over about 10 kb on average, similar to humans. That average LD in the two species is not so different might seem surprising, given the selfing nature of A. thaliana, but it reflects the fact that outcrossing is not that rare. The results from this enterprise were used to design a 250k SNP array, with multiple markers in each haplotype block. For some phenotypes, such as disease resistance, GWAS was shown to be successful when as few as 96 accessions were genotyped with this array. The 250k SNP array was subsequently used to genotype the RegMap collection of 1307 diverse accessions, which not only provided a fantastic GWAS resource, but also revealed new aspects of the species’ history.

The second step: A single genome is not enough

It is generally accepted now that it is dangerous to think about "the" genome of a species, even though this is what the initial sequencing papers stated in their titles just a few years ago. The previous emphasis on relatively minor changes between individuals was largely due to the fact that sequence variation has overwhelmingly been studied by PCR- or hybridization-based methods. Along these lines it is worth reiterating that the often-quoted 1% divergence between humans and chimps turned out to be a red herring. While we differ from our closest living relatives only by about one out of every hundred bases that can be aligned, there is a much larger fraction of our genome(s) that we do not share at all. Of similar importance is the observation that some genes with fundamental effects on life history traits such as flowering are not even functional in the first A. thaliana accession sequenced, and thus would not have been appreciated based on the first genome alone.

There were several motivations for the recently completed first phase of the 1001 Genomes Project: to quantify genome variation in a large and representative sample of accessions; to investigate the demographic history of the species; to identify features that make specific geographic or genetic subsets particularly well suited for forward genetics, field experiments and selection scans; and to provide a powerful GWAS resource. Previous studies had shown that the ability to detect footprints of selection depended greatly on the sample. Similarly, while GWAS have identified common alleles with major effects from as few as 96 accessions, a much larger sample is required for most traits. The SNP-genotyped RegMap panel provided such a collection, but did not efficiently capture all SNPs and structural variants. Fully sequencing this collection would be of limited benefit, as one could accurately impute the missing data by sequencing a subset. We therefore assembled a set of accessions that overlaps the RegMap panel sufficiently to allow imputation of variants in all lines. The combined collection constitutes a powerful resource for determining how genetic variation translates into phenotypic variation.

From resequencing to a collection of complete genomes: The 1001G+ effort

Understanding how genetic variation translates into phenotypic variation, and how this translation depends on the environment, is a major challenge for modern biology. Thanks to advances in technology, it has become possible to start answering this question by sequencing entire populations and connecting this information to phenotypic data, whether this be public health records, crop yield data, or the ability to withstand stress in a controlled experiment or in nature. There is, however, an important aspect that has often been glossed over in all these (often highly publicized) efforts: we are still far from fully describing genetic variation on a population scale. Previous short-read sequencing methods have only supported the accurate discovery of simple variants (single nucleotide and very short insertion/deletion polymorphisms), with results being invariably biased with respect to what is present or missing in the reference genome. Large or complex structural variants, as well as simple variants inside complex variants, have generally been missed completely. It is currently not known how serious this problem is, for the simple reason that finding out requires completely assembling a large number of genomes and comparing the result to data generated using standard methods. This is the objective of the 1001G+ effort. Long-read sequencing has now advanced to a stage where generating nearly complete genomes for large samples is feasible. Building on the success of the 1001 Genomes Project, we are assembling dozens, and hopefully hundreds, of genomes from a diverse collection of Arabidopsis thaliana strains, annotating them with transcriptome and epigenome information, and developing tools to make the results available to the community. This will go a long way toward answering the question of what is hidden in the part of the genome we currently cannot see.