Comparative Evaluation of Assembly Reconciliation Tools

Results

Quality Statistics

To obtain quality statistics for the resulting assemblies, we ran QUAST 2.3 with the following command:


    python ${PATH_TO}/quast-2.3/quast.py  -o $output_dir  -L --est-ref-size $genome_size --gage $result_assembly -R $reference

Gene Coverage

To estimate gene coverage, first create a BLAST database from the output assembly, then align the genes against that database. This can be done with the following BLAST commands:


    makeblastdb -in $Assembly -out databaseBLAST -dbtype nucl -parse_seqids
    blastn -query $gene_list_file -out output.blast.txt -db databaseBLAST -num_threads $N

To obtain the gene sequences for each species, we used the reference genomes and the corresponding annotations:

Staphylococcus aureus subsp. aureus USA300 TCH1516, (2844 Genes).
Rhodobacter sphaeroides KD131, (4474 Genes).
Homo sapiens GRCh38.p2 (ftp release 80), (2289 Genes).

You may download our gene list files. A Perl script to calculate the total percentage of gene coverage is available from the GitHub repository.
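As an illustration of how a total gene coverage percentage could be derived from the BLAST results, here is a minimal Python sketch. It is not the Perl script mentioned above. It assumes blastn was run with tabular output, `-outfmt "6 qseqid sseqid pident qstart qend qlen"`, and it counts a gene as covered when a single hit spans at least 90% of the gene at 95% identity or more. Both thresholds, and the function name `percent_genes_covered`, are hypothetical choices made for this example.

```python
def percent_genes_covered(blast_lines, total_genes,
                          min_identity=95.0, min_fraction=0.9):
    """Percentage of genes with at least one sufficiently long, similar hit.

    blast_lines: iterable of tab-separated lines with the columns
                 qseqid, sseqid, pident, qstart, qend, qlen (assumed format).
    total_genes: number of genes in the query file; genes without any hit
                 do not appear in the BLAST output, so this must be supplied.
    """
    covered = set()
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        qseqid, _sseqid, pident, qstart, qend, qlen = line.split("\t")[:6]
        # Number of query (gene) bases spanned by this hit.
        span = abs(int(qend) - int(qstart)) + 1
        if float(pident) >= min_identity and span / int(qlen) >= min_fraction:
            covered.add(qseqid)
    return 100.0 * len(covered) / total_genes


if __name__ == "__main__":
    # Made-up rows for illustration only.
    example = [
        "geneA\tctg1\t99.5\t1\t1000\t1000",
        "geneB\tctg2\t98.0\t1\t400\t1000",
        "geneB\tctg3\t97.0\t1\t950\t1000",
        "geneC\tctg1\t80.0\t1\t1000\t1000",
    ]
    print(percent_genes_covered(example, total_genes=4))
```

With real data, the lines would come from the BLAST output file, for example `percent_genes_covered(open("output.blast.txt"), total_genes=2844)`. A stricter definition could merge several partial hits per gene before applying the threshold.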

GAGE Statistics - No Reference

For Bombus_impatiens, for which no reference is available, we used the E-size statistic provided by GAGE. The script can be downloaded from the GAGE website. We used the following command to run the script:


    java GetFastaStats -o -min 500 -genomeSize <Genome Expected Size> $result_assembly
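For readers unfamiliar with the E-size statistic, the following Python sketch shows how it can be computed from contig lengths. It follows the definition given in the GAGE paper, E = sum(L_c^2) / G, where L_c is the length of contig c and G is the genome size. This is not the GetFastaStats code. The function names are hypothetical, and the assumption that contigs shorter than the minimum length are simply dropped before the sum is ours.

```python
def contig_lengths(fasta_lines):
    """Return the sequence length of each record in a FASTA file."""
    lengths = []
    current = None  # length of the record being read, None before the first header
    for line in fasta_lines:
        line = line.strip()
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif line and current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def e_size(lengths, genome_size, min_length=500):
    """E-size: expected length of the contig containing a random genome base.

    Contigs shorter than min_length are ignored (assumed to mirror -min 500).
    """
    kept = [n for n in lengths if n >= min_length]
    return sum(n * n for n in kept) / genome_size


if __name__ == "__main__":
    # Made-up contig lengths for illustration only.
    print(e_size([1000, 2000, 300], genome_size=4000))
```

With a real assembly, the lengths would be read from the FASTA file, for example `e_size(contig_lengths(open(path)), genome_size)`, where `path` and `genome_size` stand in for the assembly file and the expected genome size.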

Synthetic Data Statistics

To assess the correctness of the merged assemblies, we aligned the flawed synthetic input and each of the resulting assemblies to the reference, and visualized the alignments as a colored bar plot. Pairwise alignments and the visualization were generated by an R script that uses DECIPHER, an R Bioconductor package.


    #!/usr/bin/env Rscript
    args = commandArgs(trailingOnly=TRUE)

    # Adapted From: http://decipher.cee.wisc.edu/AlignSynteny.html
    # load the DECIPHER library in R
    library(DECIPHER)
    library(RColorBrewer)

    # specify the path to each FASTA file (in quotes)
    # each genome must be given a unique identifier here
    # for example: Genome1, Genome2, etc.

    fas <- c(Genome1="<>",
             Genome2="<>",
             Genome3="<>")  # add further genomes as needed

    # specify where to create the new sequence database
    db <- dbConnect(SQLite(), ":memory:")

    # load the sequences from the file in a loop
    for (i in seq_along(fas)) {
        Seqs2DB(as.character(fas[i]), "FASTA", db, names(fas[i]))
    }

    # map the syntenic regions between each genome pair
    synteny <- FindSynteny(db, minScore=500, processors = NULL, kmer=101L)

    # write a bar plot of adjacent pairs to barplot.eps
    postscript("barplot.eps")
    par(mar=c(5,9,4,5))
    plot(synteny, colorRamp = colorRampPalette(brewer.pal(10, "PRGn")), labels = abbreviate(rownames(synteny), 12), cex.lab = 1.5)
    dev.off()

    dbDisconnect(db)

References:

Gurevich, A., Saveliev, V., Vyahhi, N. & Tesler, G. QUAST: quality assessment tool for genome assemblies. Bioinformatics 29, 1072–1075 (2013).
Wright, E. S. Using DECIPHER v2.0 to Analyze Big Biological Sequence Data in R. The R Journal 8(1), 352–359 (2016).