Data sets
GAGE data set
We used the assemblies produced in the Genome Assembly Gold-Standard Evaluations (GAGE) study, which evaluated multiple assemblers on its own data sets.
The assemblies are publicly available from the results tab of the GAGE website; the reads are also available.
Synthetic data set
We used the Saccharomyces cerevisiae strain S288c reference genome (accession GCF_000146045.2) and introduced four types of misassembly into it: deletions, insertions, inversions, and translocations.
To produce synthetic reads, we used ART, a next-generation sequencing read simulator. Single-end reads were generated with the following command:
${PATH_TO}/art_bin_VanillaIceCream/art_illumina -sam -i ${PATH_TO}/Saccharomyces_cerevisiae.fasta -l 51 -f 15 -o yeast
Paired-end reads were generated with this command:
${PATH_TO}/art_bin_VanillaIceCream/art_illumina -sam -i ${PATH_TO}/Saccharomyces_cerevisiae.fasta -l 51 -f 15 -p -m 450 -s 10 -o yeast
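In the commands above, -l 51 sets the read length, -f 15 the fold coverage, and, for paired-end runs, -m 450 and -s 10 the mean and standard deviation of the fragment size. As a rough sanity check on the simulator's output, the number of reads implied by a given coverage can be estimated as coverage × genome length / read length. The sketch below assumes that coverage means total sequenced bases divided by genome length; the function names and the toy FASTA are hypothetical and not part of ART.

```shell
# Count sequence bases in a FASTA file (header lines starting with ">" are skipped).
genome_bases() {
  awk '!/^>/ { n += length($0) } END { print n + 0 }' "$1"
}

# Approximate read count for a fold coverage: coverage * bases / read_length.
# Arguments: FASTA path, fold coverage, read length (integer arithmetic, rounds down).
expected_reads() {
  echo $(( $2 * $(genome_bases "$1") / $3 ))
}

# Hypothetical toy genome, used only to make the sketch runnable.
toy=$(mktemp)
printf '>chrA\nACGTACGTAC\nGGCC\n>chrB\nTTTTAAAACC\n' > "$toy"
expected_reads "$toy" 15 51
rm -f "$toy"
```

For a genome of roughly 12 Mb, such as S. cerevisiae, 15-fold coverage with 51 bp reads corresponds to about 3.5 million reads under this assumption.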
Synthetic assemblies were generated using RSVSim, a Bioconductor package that simulates structural variations.
The flawed synthetic assemblies were generated with the following R script.
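#!/usr/bin/env Rscript
library("RSVSim")   # simulateSV(); Biostrings functions come in as a dependency
library("numbers")  # nextPrime()
# $(PATH_TO) is a placeholder; replace it with the directory that holds the FASTA file.
filepath <- "$(PATH_TO)/Saccharomyces_cerevisiae_chrs.fasta"
chrs <- readDNAStringSet(filepath, format="fasta",
                         nrec=-17L, skip=0L, seek.first.rec=TRUE, use.names=TRUE)
yeast_genome <- DNAStringSet(x=chrs, start=NA, end=NA, width=NA, use.names=TRUE)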
# Deletions: one deletion on chr04 per run, sizes 500 to 5000 bp in steps of 500.
primeNum <- 173
for (size in seq(500, 5000, by=500)) {
  dirName <- paste("DeL", size, sep = "_")
  if (! file.exists(dirName))
    dir.create(file.path(getwd(), dirName))
  del_sim <- simulateSV(output=dirName, genome=yeast_genome, chrs="chr04", dels=1, sizeDels=size,
                        bpSeqSize=101, seed=primeNum, verbose=FALSE)
  primeNum <- nextPrime(primeNum)  # use a different prime seed for each run
}
# Insertions: one insertion involving chr04 and chr01 per run, sizes 50 to 2000 bp in steps of 150.
# percCopiedIns sets the fraction of copy-and-paste (duplicating) insertions.
primeNum <- 173
for (size in seq(50, 2000, by=150)) {
  dirName <- paste("Ins", size, sep = "_")
  if (! file.exists(dirName))
    dir.create(file.path(getwd(), dirName))
  ins_sim <- simulateSV(output=dirName, genome=yeast_genome, chrs=c("chr04", "chr01"), ins=1, sizeIns=size,
                        percCopiedIns=0.25, bpSeqSize=101, seed=primeNum, verbose=FALSE)
  primeNum <- nextPrime(primeNum)
}
# Inversions: one inversion on chr04 per run, sizes 500 to 5000 bp in steps of 500.
primeNum <- 173
for (size in seq(500, 5000, by=500)) {
  dirName <- paste("Inv", size, sep = "_")
  if (! file.exists(dirName))
    dir.create(file.path(getwd(), dirName))
  inv_sim <- simulateSV(output=dirName, genome=yeast_genome, chrs="chr04", invs=1, sizeInvs=size,
                        bpSeqSize=101, seed=primeNum, verbose=FALSE)
  primeNum <- nextPrime(primeNum)
}
# Translocations: one translocation between chr04 and chr01 per run, ten runs with different seeds.
primeNum <- 173
for (i in 1:10) {
  dirName <- paste("Trans", i, sep = "_")
  if (! file.exists(dirName))
    dir.create(file.path(getwd(), dirName))
  trans_sim <- simulateSV(output=dirName, genome=yeast_genome, chrs=c("chr04", "chr01"), trans=1,
                          bpSeqSize=101, seed=primeNum, verbose=FALSE)
  primeNum <- nextPrime(primeNum)
}
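Each loop in the script restarts at seed 173 and advances to the next prime after every run, so the output directories and their seeds are fully determined by the loop ranges. The sketch below lists that schedule without running R, which can help when matching an output directory to the seed that produced it. The helper names are hypothetical, and the sketch assumes that nextPrime returns the smallest prime strictly greater than its argument.

```shell
# Succeed if $1 is prime (trial division; adequate for small seeds).
is_prime() {
  [ "$1" -ge 2 ] || return 1
  d=2
  while [ $((d * d)) -le "$1" ]; do
    [ $(($1 % d)) -ne 0 ] || return 1
    d=$((d + 1))
  done
  return 0
}

# Print the smallest prime strictly greater than $1.
next_prime() {
  n=$(($1 + 1))
  until is_prime "$n"; do n=$((n + 1)); done
  echo "$n"
}

# Print "<prefix>_<label> <seed>" for each label, with seeds starting at 173.
seed_schedule() {
  prefix=$1; shift
  seed=173
  for label in "$@"; do
    echo "${prefix}_${label} ${seed}"
    seed=$(next_prime "$seed")
  done
}

# Deletion runs: sizes 500 to 5000 in steps of 500, as in the first loop.
seed_schedule DeL $(seq 500 500 5000)
```

The same function covers the other loops, for example `seed_schedule Ins $(seq 50 150 2000)` for the insertion runs or `seed_schedule Trans $(seq 1 10)` for the translocation runs.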
References:
- Salzberg, S. L. et al. GAGE: A critical evaluation of genome assemblies and assembly algorithms. Genome Res. 22, 557–567 (2012).
- Huang, W., Li, L., Myers, J. R. & Marth, G. T. ART: a next-generation sequencing read simulator. Bioinformatics 28, 593–594 (2012).
- Bartenhagen, C. RSVSim: an R/Bioconductor package for the simulation of structural variations. R package version 1.14.0 (2015).