github download link: (simrlls)
This post has been modified for the updated v.0.0.8 script.
The program simrrls can be used to simulate raw (fastq format) sequence data on an input topology under the coalescent in a manner that emulates restriction-site associated DNA, with slight variations for different data types (e.g., RADseq, ddRAD, GBS, paired-end data).
It was originally developed to generate data for bug testing pyrad and to create example data sets for use in tutorials. However, I can imagine other uses for it, such as investigating the effects of missing data, insufficient sequencing coverage and error rates on assembly results.
To use the program you will need to first install the Egglib Python module, used here to perform coalescent simulations.
Calling the script – simrrls
Datatypes– RAD is the default type for which a barcode and restriction site overhang are attached to the left side of single end reads. Depending on the data type selected simrrls will put cut sites and barcodes in the proper ends of sequences so that they emulate RAD, GBS, or ddRAD data, single or paired-end. The choice also effects how locus dropout occurs (i.e., one or two cutters present).
Allelic Dropout– There are two ways in which allelic dropout (mutation-disruption) can occur, each of which can be toggled. The first is mutations to the cut site recognition sequence (-mc). If this is turned on then an allele will be dropped if a mutation occurs within the length of the restriction recognition sites determined by the cutters (-c1 and -c2) and the data type (-f) which determines whether one or two cutters are used (e.g., rad is a single cutter, ddrad is double-cutter). The other form of mutation disruption occurs when mutations give rise to new cut sites within the length of a selected fragment (-ms). Here the total fragment length is determined by (-l) and the max insert length (-i2), and the probability of disruption by the cutters (-c1 and -c2), and the substitution rate (-N and -u).
Sequencing coverage and error– Another source of missing data in RADseq data sets is low sequencing coverage. You can set the sampling rate as the number of reads that are sampled from each haplotype. The mean (-dm) and standard deviation (-ds). To examine the effect sequencing rate has on base calls, etc., you could combine low coverage sequencing with a high error rate (-e).
Topology– Data are simulated under a Jukes-Cantor model with equal base frequencies. If there is no user-supplied topology (-t) then simrrls uses a default 12 tip topology, shown below. Branch lengths are in coalescent units. The outgroup taxon in the tree “X” is not included in the output, but is used to polarize mutations relative to an outgroup for creating indels.
Output – Two data files are created using the output name
prefix (-o), which if it were ‘testrad’ would create
testrad_R1_.fastq.gz
and testrad_barcodes.txt
. The first is a
compressed fastq file with sequence data and the
latter is a text file mapping barcodes to sample names.
If you selected paired-end data it would create
two sequence files, one with “_R1_”
in the name and the other with “_R2_”.
With these fastq data and a barcode map the data can then be
assembled in pyrad.
Examples –
Modified population parameters:
Modified sequencing parameters:
Modified library type (In this case allowing paired-end reads overlap):
Modified topology:
More info – Check out the git repository: (simrlls.py)