###################################################################################
# Data: Genome assemblies, gene annotation, and genetic variation (including
#       structural variation and small local variations) of seven Arabidopsis 
#       thaliana accessions (An-1, C24, Cvi-0, Eri-1, Kyo, Ler, Sha) can be found
#       for each strain in the “strains” folder, or for bulk download in the 
#       “full_set” folder.
#       Orthologous relationships can be found in the main folder.
# Date: 2019-10-21
###################################################################################

###################################################################################
# Gene orthologs
###################################################################################

Orthogroups.csv

Orthologs were calculated with OrthoFinder using default parameters. The file contains all ortholog clustering groups (orthogroups) across all eight accessions (An-1, C24, Col-0, Cvi-0, Eri-1, Kyo, Ler, Sha).
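
As a starting point for working with Orthogroups.csv, the following is a minimal Python sketch. It assumes the layout written by OrthoFinder releases of that period: a tab-delimited table (despite the .csv extension) with one orthogroup per row, one column per accession, and comma-separated gene IDs in each cell. Please verify this against the actual file before relying on it. The function names and the demo gene IDs are hypothetical.

```python
import csv
import io


def read_orthogroups(handle):
    """Return {orthogroup_id: {accession: [gene_id, ...]}} from an open text handle.

    Assumes a tab-delimited table whose first column is the orthogroup ID
    and whose remaining columns hold comma-separated gene IDs per accession.
    """
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    header = next(reader)
    accessions = header[1:]
    groups = {}
    for row in reader:
        if not row:
            continue
        og_id, cells = row[0], row[1:]
        # Pad short rows so every accession gets a (possibly empty) gene list.
        cells = cells + [""] * (len(accessions) - len(cells))
        groups[og_id] = {
            acc: [g.strip() for g in cell.split(",") if g.strip()]
            for acc, cell in zip(accessions, cells)
        }
    return groups


def single_copy_groups(groups):
    """List orthogroups with exactly one gene in every accession."""
    return [
        og_id
        for og_id, members in groups.items()
        if all(len(genes) == 1 for genes in members.values())
    ]


# Tiny made-up table with two accessions, for illustration only.
demo = (
    "\tAn-1\tCol-0\n"
    "OG0000000\tgeneA1, geneA2\tAT1G01010\n"
    "OG0000001\tgeneA3\tAT1G01020\n"
)
demo_groups = read_orthogroups(io.StringIO(demo))
print(single_copy_groups(demo_groups))
```

For the real file, pass an open handle instead of the demo string, for example open("Orthogroups.csv", newline="").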


###################################################################################
# Genome assemblies
###################################################################################

*.chr.all.v2.0.fasta.gz    

PacBio reads were filtered to remove short (<50 bp) or low-quality (QV<80) reads using the SMRTLink5 package. De novo assembly of each genome was initially performed with three different assembly tools: Falcon, Canu and MECAT. The resulting assemblies were polished with Arrow from the SMRTLink5 package and then further corrected by mapping Illumina short reads with BWA to remove small-scale assembly errors, which were identified with SAMtools. For each genome, the final assembly was based on the Falcon assembly, as these assemblies always showed the highest contiguity. A few contigs were further connected or extended based on whole-genome alignments between the Falcon assembly and the Canu or MECAT assemblies. Contigs were labelled as organellar if they showed alignment identity and coverage both larger than 95% when aligned against the mitochondrial or chloroplast reference sequences. A few contigs aligned to multiple chromosomes and were split if no Illumina short-read alignments supported the conflicting regions. Contigs larger than 20 kb were combined into pseudo-chromosomes according to their alignment positions against the reference sequence, determined with MUMmer4. Contigs with consecutive alignments were concatenated with a stretch of 500 Ns.
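
The last steps described above (organellar labelling and pseudo-chromosome construction) can be illustrated with a simplified Python sketch. This is not the code used to build the released assemblies. It assumes that each contig has already been assigned to a chromosome with a reference start position, and it ignores contig orientation, which a real pipeline would have to handle by reverse-complementing where needed. The function names and input structure are hypothetical; only the thresholds (95%, 20 kb, 500 Ns) come from the text.

```python
def is_organellar(identity, coverage, threshold=95.0):
    """Label a contig as organellar if identity and coverage both exceed 95%."""
    return identity > threshold and coverage > threshold


def build_pseudochromosome(contigs, min_len=20_000, gap_len=500):
    """Join the contigs of one chromosome into a pseudo-chromosome.

    contigs: list of (reference_start, sequence) tuples (hypothetical input).
    Contigs no longer than min_len are dropped, the rest are ordered by
    their reference start and joined with a run of gap_len Ns.
    """
    kept = [(start, seq) for start, seq in contigs if len(seq) > min_len]
    kept.sort(key=lambda item: item[0])
    return ("N" * gap_len).join(seq for _, seq in kept)


# Toy example with scaled-down thresholds so the result is easy to inspect.
toy = [(500, "C" * 30), (10, "A" * 25), (200, "G" * 5)]
print(build_pseudochromosome(toy, min_len=20, gap_len=3))
```

In the toy call, the 5-base contig is dropped and the two remaining contigs are joined in reference order with three Ns between them.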


###################################################################################
# Protein-coding gene annotations and sequences
###################################################################################

*.protein-coding.genes.v2.5.2019-10-09.gff3.gz
*.protein-coding.genes.v2.5.2019-10-09.gene.fasta.gz  (gene genomic sequences)
*.protein-coding.genes.v2.5.2019-10-09.prot.fasta.gz (protein sequences)
*.protein-coding.genes.v2.5.2019-10-09.CDS.fasta.gz (CDS sequences)

Protein-coding genes were annotated based on ab initio gene predictions, protein sequence alignments and RNA-seq data. Three ab initio gene prediction tools were used: Augustus, GlimmerHMM and SNAP. The reference protein sequences from the Araport 11 annotation were aligned to each genome assembly using exonerate with the parameter setting “--percent 70 --minintron 10 --maxintron 60000”. For five accessions (An-1, C24, Cvi-0, Ler-0, and Sha) we downloaded a total of 155 RNA-seq data sets from the NCBI SRA database. RNA-seq reads were mapped to the corresponding genome using HISAT2 and then assembled into transcripts using StringTie (both with default parameters). All lines of evidence were integrated into consensus gene models using EVidenceModeler.
The resulting gene models were further evaluated and updated using the Araport 11 annotation. First, for each of the seven genomes, the predicted gene and protein sequences were aligned to the reference sequence, and all reference gene and protein sequences were aligned to each of the other seven genomes, using BLAST. Then, potentially mis-annotated genes, including mis-merged genes (two or more genes annotated as a single gene), mis-split genes (one gene annotated as two or more genes) and unannotated genes, were identified from the alignments using in-house Python scripts. Mis-annotated or unannotated genes were corrected or added by incorporating the open reading frames generated by ab initio predictions or by protein sequence alignment using Scipio.
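
The three categories of problematic gene models can be illustrated with a small Python sketch. The in-house scripts are not part of this data set, so the code below is only an assumption about how such a classification might look: it takes pairs of (predicted gene, reference gene) that were judged to match from the alignments and flags predicted genes matching several reference genes as mis-merged, reference genes matched by several predicted genes as mis-split, and reference genes without any match as unannotated. The function name, input structure and gene IDs are hypothetical, and real criteria would also need alignment coverage and identity cut-offs.

```python
from collections import defaultdict


def classify_gene_models(hits, reference_genes):
    """Classify gene models from (predicted_gene, reference_gene) match pairs.

    Returns a dict with sorted lists of mis-merged predicted genes,
    mis-split reference genes and unannotated reference genes.
    """
    pred_to_ref = defaultdict(set)
    ref_to_pred = defaultdict(set)
    for pred, ref in hits:
        pred_to_ref[pred].add(ref)
        ref_to_pred[ref].add(pred)

    return {
        # One predicted gene spanning two or more reference genes.
        "mis_merged": sorted(p for p, refs in pred_to_ref.items() if len(refs) > 1),
        # One reference gene covered by two or more predicted genes.
        "mis_split": sorted(r for r, preds in ref_to_pred.items() if len(preds) > 1),
        # Reference genes with no matching predicted gene at all.
        "unannotated": sorted(r for r in reference_genes if r not in ref_to_pred),
    }


# Made-up identifiers, for illustration only.
demo_hits = [("p1", "A"), ("p1", "B"), ("p2", "C"), ("p3", "C"), ("p4", "D")]
demo_result = classify_gene_models(demo_hits, ["A", "B", "C", "D", "E"])
print(demo_result)
```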


###################################################################################
# Genetic variation
###################################################################################

*.syri.out.gz

All assemblies were aligned to the reference sequence (TAIR10) using nucmer from the MUMmer4 toolbox with parameter setting “-max -l 40 -g 90 -b 100 -c 200”. The resulting alignments were further filtered for alignment length (>100 bp) and identity (>90%). Structural rearrangements and local variations were identified using SyRI (https://github.com/schneebergerlab/syri).

For details of the file format, please check https://schneebergerlab.github.io/syri/fileformat.html
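
For a quick look at a *.syri.out.gz file, a minimal Python sketch is given here. It assumes the 12-column, tab-separated layout described on the SyRI file format page (reference chromosome, start, end, reference sequence, query sequence, query chromosome, start, end, unique ID, parent ID, annotation type, copy status). The column names used as dictionary keys are our own labels, not names defined by SyRI, and the demo records are made up.

```python
import gzip
from collections import Counter

# Labels for the 12 columns; the names are illustrative, not defined by SyRI.
SYRI_COLUMNS = [
    "ref_chr", "ref_start", "ref_end", "ref_seq", "qry_seq",
    "qry_chr", "qry_start", "qry_end", "id", "parent_id",
    "type", "copy_status",
]


def parse_syri(lines):
    """Yield one dict per well-formed record from an iterable of text lines."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != len(SYRI_COLUMNS):
            continue  # skip blank or malformed lines
        yield dict(zip(SYRI_COLUMNS, fields))


def count_types(lines):
    """Count records per annotation type (e.g. SYN, INV, SNP)."""
    return Counter(record["type"] for record in parse_syri(lines))


def count_types_in_file(path):
    """Same as count_types, but reads a gzip-compressed syri.out file."""
    with gzip.open(path, "rt") as handle:
        return count_types(handle)


# Two made-up records, for illustration only.
demo_lines = [
    "Chr1\t100\t200\t-\t-\tChr1\t110\t210\tSYN1\t-\tSYN\t-\n",
    "Chr1\t150\t150\tA\tT\tChr1\t160\t160\tSNP1\tSYN1\tSNP\t-\n",
]
demo_counts = count_types(demo_lines)
print(demo_counts)
```

For a downloaded file, call count_types_in_file with its path, for example count_types_in_file("An-1.syri.out.gz") (the exact file name is an assumption based on the *.syri.out.gz pattern).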

Example of commands used:
  nucmer --maxmatch -l 40 -g 90 -c 100 -b 200 -t 20 Col.fasta An-1.fasta
  delta-filter -m -i 90 -l 100 out.delta > out_m_i90_l100.delta
  show-coords -THrd out_m_i90_l100.delta > out_m_i90_l100.coords
  syri -c out_m_i90_l100.coords -d out_m_i90_l100.delta -r Col.fasta -q An-1.fasta --nc 5 --all -k

-----------------------------------------------------------------

If you have any questions, please contact Wen-Biao Jiao <jiao@mpipz.mpg.de>