This workflow outlines analysis of data from mapping bisulfite treated DNA (BS-seq) to the oyster genome.
Initially sequencing reads, in this case from oyster sperm, are mapped to the genome using BS-MAP.
./bsmap -a /Volumes/betty/filtered_174gm_A_NoIndex_L006_R1.fastq.gz -b /Volumes/betty/filtered_174gm_A_NoIndex_L006_R2.fastq.gz -d /Volumes/betty/oyster_v9.fa -o /Volumes/betty/BiGO_Betty_plain.sam -p 4
Total number of aligned reads:
pairs: 90102067 (53%)
single a: 17033002 (9.9%)
single b: 15975991 (9.3%)
Resulting SAM file then subjected to methratio python script
python methratio.py -d /Volumes/betty/oyster_v9.fa -u -z -g -o /Volumes/betty/BiGO_betty_plain_methratio_v1.txt -s /Volumes/Bay3/Software/BSMAP/bsmap-2.73/samtools /Volumes/betty/BiGO_Betty_plain.sam
total 167774088 valid mappings, 127192776 covered cytosines, average coverage: 13.23 fold.
This file was uploaded to SQLShare using python script
python singleupload.py firstname.lastname@example.org c8APIKEYAPIKEYAPIKEY5c15c /Volumes/web/cnidarian/BiGO_betty_plain_methratio_v1.txt
Interested in is how many CpG loci is there information for?
The following query will get all lines where CG are in the 3rd and 4th position.
SELECT * FROM [email@example.com].[BiGO_betty_plain_methratio_v1.txt] betty where context like '__CG_' --_=single character wildcard
To count the number of instances in this query
How to convert methratio file to gff format
In this case will make GFF only of CpGs, where there is at least 10x coverage.
SELECT chr as seqname, 'methratio' as source, 'CpG' as feature, pos as start, pos + 1 as [end], ratio as score, strand, '.' as frame, '.' as attribute FROM [firstname.lastname@example.org]. [BiGO_betty_plain_methratio_v1.txt] betty where context like '__CG_' --_=single character wildcard and CT_Count > 9`