Results
Quality Statistics
To obtain quality statistics for the resulting assemblies, we ran QUAST 2.3 using the following command:
python ${PATH_TO}/quast-2.3/quast.py -o $output_dir -L --est-ref-size $genome_size --gage $result_assembly -R $reference
Gene Coverage
To estimate gene coverage, first create a BLAST database from the output assembly, then align the genes against that database. This can be done with the following BLAST commands:
makeblastdb -in $Assembly -out databaseBLAST -dbtype nucl -parse_seqids
blastn -query $gene_list_file -out output.blast.txt -db databaseBLAST -num_threads $N
To obtain the gene sequences for each species, we used the following reference genomes and their corresponding annotations:
- Staphylococcus aureus subsp. aureus USA300 TCH1516 (2844 genes).
- Rhodobacter sphaeroides KD131 (4474 genes).
- Homo sapiens GRCh38.p2, ftp release 80 (2289 genes).
You may download our gene list files. A Perl script that calculates the total percentage of gene coverage is available from the GitHub repository.
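The coverage calculation can be sketched in Python as follows. This is not the Perl script mentioned above; it is a minimal illustration under several assumptions: that blastn is rerun with `-outfmt 6` so the hits are tab-separated with the default columns, that a gene counts as covered when a single hit spans at least 90% of its length at 95% identity or more (both thresholds are hypothetical), and that gene lengths are supplied separately. The function names and example data are invented for illustration.

```python
import csv
import io


def covered_genes(blast_tsv, gene_lengths, min_identity=95.0, min_fraction=0.9):
    """Return the set of genes with at least one qualifying BLAST hit.

    blast_tsv    -- text in BLAST tabular format (-outfmt 6, default columns:
                    qseqid sseqid pident length mismatch gapopen
                    qstart qend sstart send evalue bitscore)
    gene_lengths -- dict mapping gene ID to gene length in bases
    The thresholds are assumptions, not values taken from the original study.
    """
    covered = set()
    for row in csv.reader(io.StringIO(blast_tsv), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        gene = row[0]
        if gene not in gene_lengths:
            continue
        identity = float(row[2])
        # Number of query (gene) bases spanned by this hit.
        query_span = abs(int(row[7]) - int(row[6])) + 1
        if identity >= min_identity and query_span >= min_fraction * gene_lengths[gene]:
            covered.add(gene)
    return covered


def gene_coverage_percent(blast_tsv, gene_lengths, **thresholds):
    """Percentage of genes in gene_lengths that have a qualifying hit."""
    if not gene_lengths:
        return 0.0
    hits = covered_genes(blast_tsv, gene_lengths, **thresholds)
    return 100.0 * len(hits) / len(gene_lengths)


# Invented example: geneA is fully aligned, geneB only partially, geneC has no hit.
example_hits = "\n".join([
    "geneA\tcontig1\t99.5\t1000\t5\t0\t1\t1000\t200\t1199\t0.0\t1800",
    "geneB\tcontig2\t98.0\t300\t6\t0\t1\t300\t50\t349\t1e-150\t520",
])
example_lengths = {"geneA": 1000, "geneB": 900, "geneC": 1200}
print(gene_coverage_percent(example_hits, example_lengths))
```

A single-hit criterion is the simplest choice; genes split across several contigs would need their hits merged per gene before the threshold is applied.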
GAGE Statistics - No Reference
For Bombus impatiens, we used the E-size statistic provided by GAGE. The script can be downloaded from the GAGE website. We ran it with the following command:
java GetFastaStats -o -min 500 -genomeSize <Genome Expected Size> $result_assembly
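For readers unfamiliar with the statistic, the E-size defined by GAGE is the expected length of the contig (or scaffold) containing a randomly chosen base of the genome, computed as the sum of squared contig lengths divided by the genome size. The Python sketch below illustrates that formula only; it is not the GetFastaStats implementation. The function names are hypothetical, and treating the 500-base cutoff as inclusive is an assumption.

```python
def fasta_lengths(fasta_text):
    """Return the sequence lengths found in FASTA-formatted text."""
    lengths = []
    current = None
    for line in fasta_text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif line and current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def e_size(contig_lengths, genome_size, min_length=500):
    """E-size = sum(L_c ** 2) / G over contigs of at least min_length bases.

    genome_size is the expected genome size G; min_length mirrors the
    -min 500 option (assumed here to be an inclusive cutoff).
    """
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    kept = [n for n in contig_lengths if n >= min_length]
    return sum(n * n for n in kept) / genome_size


# Invented example: the 400-base contig falls below the cutoff and is ignored.
print(e_size([1000, 2000, 400], genome_size=4000))
```

Unlike N50, the E-size changes with every contig length in the assembly, which makes it more sensitive to small differences between assemblies.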
Synthetic Data Statistics
To assess the correctness of the merged assemblies, we aligned the flawed synthetic inputs and each of the resulting assemblies to the reference, and visualized the alignments as colored bar plots. Pairwise alignments and visualizations were generated with an R script that uses DECIPHER, an R Bioconductor package.
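The kind of summary behind such a bar plot can be illustrated with a short Python sketch. It does not use DECIPHER and is not the R script described above. It assumes the pairwise alignment is already available as two equal-length gapped strings, labels each column as a match, mismatch, insertion, or deletion relative to the reference, and collapses consecutive identical labels into segments that could each be drawn as one colored bar. All names here are hypothetical.

```python
from itertools import groupby


def column_states(ref_aln, qry_aln, gap="-"):
    """Label each alignment column relative to the reference sequence."""
    if len(ref_aln) != len(qry_aln):
        raise ValueError("aligned sequences must have equal length")
    states = []
    for r, q in zip(ref_aln.upper(), qry_aln.upper()):
        if r == gap and q == gap:
            continue  # column carries no information
        if r == gap:
            states.append("insertion")  # base present only in the assembly
        elif q == gap:
            states.append("deletion")   # base missing from the assembly
        elif r == q:
            states.append("match")
        else:
            states.append("mismatch")
    return states


def segments(states):
    """Collapse consecutive identical labels into (label, length) pairs."""
    return [(label, len(list(group))) for label, group in groupby(states)]


# Invented example alignment of an assembly against a reference.
reference = "ACGT-ACGT"
assembly = "ACCTTAC-T"
print(segments(column_states(reference, assembly)))
```

Each (label, length) pair maps directly to one colored segment of a bar, so an assembly that reproduces the reference correctly appears as a single long "match" segment.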
References:
- Gurevich, A., Saveliev, V., Vyahhi, N. & Tesler, G. QUAST: quality assessment tool for genome assemblies. Bioinformatics 29, 1072–1075 (2013).
- Wright, E. S. Using DECIPHER v2.0 to analyze big biological sequence data in R. The R Journal 8, 352–359 (2016).