parse_bam
- parse_bam(fileName: str, sampleName: str, outDir: str, bedFile: Optional[str] = None, basemod: str = 'A+CG', center: bool = False, windowSize: int = 1000, region: Optional[str] = None, threshA: int = 129, threshC: int = 129, extractAllBases: bool = False, cores: Optional[int] = None) None
- fileName
name of bam file with Mm and Ml tags
- sampleName
name of sample for output SQL table name labelling. Valid names contain [
a-zA-Z0-9_
].- outDir
directory where SQL database is stored
- bedFile
name of bed file that defines regions of interest over which to extract mod calls. The bed file either defines regions over which to extract mod calls OR defines regions (likely motifs) over which to center positions for mod calls and then parse_bam extracts mod calls over a window flanking that region defined in by
windowSize
. Optional 4th column in bed file to specify strand of region of interest as+
or-
. Default is to consider regions as all+
. NB. ThebedFile
andregion
parameters are mutually exclusive; specify one or the other.- basemod
One of the following:
'A'
- extract mA only'CG'
- extract mCpG only'A+CG'
- extract mA and mCpG
- center
One of the following:
'True'
- report positions with respect to center of motif window (+/- windowSize); only valid with bed file input'False'
- report positions in original reference space
- windowSize
window size around center point of feature of interest to plot (+/-); only mods within this window are stored; only used if center=True; still, only reads that span the regions defined in the bed file will be included; default is 1,000 bp
- region
single region over which to extract base mods, rather than specifying many windows in bedFile; format is chr:start-end. NB. The
bedFile
andregion
parameters are mutually exclusive; specify one or the other.- threshA
threshold above which to call an A base methylated; default is 129
- threshC
threshold above which to call a C base methylated; default is 129
- extractAllBases
One of the following:
'True'
- Store all base mod calls, regardles of methylation probability threshold. Bases stored are those that can have a modification call (A, CG, or both depending onbasemod
parameter) and are sequenced bases, not all bases in the reference.'False'
- Only modifications above specified threshold are stored
- cores
number of cores over which to parallelize; default is all available
Valid argument combinations for
bedFile
,center
, andwindowSize
are below. Regions of interest generally fall into two categories: small motifs at which to center analysis (usecenter
= True) or full windows of interest (do not specifycenter
orwindowSize
).bedFile
–> extract all modified bases in regions defined in bed filebedfile
+center
–> extract all modified bases in regions defined in bed file, report positions relative to region centers and extract base modificiations within default windowSize of 1kbbedfile
+center
+windowSize
–> extract all modified bases in regions defined in bed file, report positions relative to region centers and extract base modifications within flanking +/- windowSizeregion
–> extract all modified bases in single region
Example
For regions defined by
bedFile
:>>> dm.parse_bam("dimelo/test/data/mod_mappings_subset.bam", "test", "dimelo/dimelo_test", bedFile="dimelo/test/data/test.bed", basemod="A+CG", center=True, windowSize=500, threshA=190, threshC=190, extractAllBases=False, cores=8)
For single region defined with
region
:>>> dm.parse_bam("dimelo/test/data/mod_mappings_subset.bam", "test", "dimelo/dimelo_test", region="chr1:2907273-2909473", basemod="A+CG", threshA=190, threshC=190, cores=8)
Return
Returns a SQL database in the specified output directory. Database can be converted into pandas dataframe with:
>>> fileName = "dimelo/test/data/mod_mappings_subset.bam" >>> sampleName = "test" >>> outDir = "dimelo/dimelo_test" >>> all_data = pd.read_sql("SELECT * from methylationByBase_" + sampleName, sqlite3.connect(outDir + "/" + fileName.split("/")[-1].replace(".bam", "") + ".db")) >>> aggregate_counts = pd.read_sql("SELECT * from methylationAggregate_" + sampleName, sqlite3.connect(outDir + "/" + fileName.split("/")[-1].replace(".bam", "") + ".db"))
Each database contains these two tables with columns listed below:
- methylationByBase_sampleName
id(read_name:pos)
read_name
chr
pos
prob
mod
- methylationAggregate_sampleName
id(pos:mod)
pos
mod
methylated_bases
total_bases
When running parse_bam with a region defined, a summary bed file is also produced to support visualizing aggregate data with any genome browser tool. The columns of this bed file are chr, start, end, methylated_bases, total_bases.
For example, to take a summary output bed and create a file with fraction of modified bases with a window size of 100 bp for visualization with the WashU browser, you could run the below commands in terminal:
bedtools makewindows -g ref_genome.chromsizes.txt -w 100 > ref_genome_windows.100.bp.bed
bedtools map -a ref_genome_windows.100.bp.bed -b outDir/fileName_sampleName_chr_start_end_A.bed -c 4,5 -o sum,sum -null 0 | awk -v "OFS=\t" '{if($5>0){print $1,$2,$3,$4/$5}else{print $1,$2,$3,$5}}' > outDir/fileName_sampleName_chr_start_end_A.100.bed
bgzip outDir/fileName_sampleName_chr_start_end_A.100.bed
tabix -f -p bed outDir/fileName_sampleName_chr_start_end_A.100.bed.gz
dimelo-parse-bam - CLI interface
Parse a bam file into DiMeLo database tables
dimelo-parse-bam [-h] -f FILENAME -s SAMPLENAME -o OUTDIR (-b BEDFILE | -r REGION)
[-m {A,CG,A+CG}] [-A THRESHA] [-C THRESHC] [-e] [-p CORES] [-c]
[-w WINDOWSIZE]
dimelo-parse-bam optional arguments
-b
BEDFILE
,--bedFile
BEDFILE
- name of bed file that defines regions of interest over which to extract mod calls (default:None
)-r
REGION
,--region
REGION
- single region over which to extract base mods, e.g."chr1:1-100000"
(default:None
)-m
BASEMOD
,--basemod
BASEMOD
- which base modifications to extract (default:A+CG
)-A
THRESHA
,--threshA
THRESHA
- threshold above which to call an A base methylated (default:129
)-C
THRESHC
,--threshC
THRESHC
- threshold above which to call a C base methylated (default:129
)-e
,--extractAllBases
- store all base mod calls, regardless of methylation probability threshold-p
CORES
,--cores
CORES
- number of cores over which to parallelize (default:None
)-c
,--center
- report positions with respect to center of motif window; only valid with bed file input-w
WINDOWSIZE
,--windowSize
WINDOWSIZE
- window size around center point of feature of interest to plot (+/-); only mods within this window are stored (default: 1000 bp) (default:1000
)
dimelo-parse-bam required arguments
-f
FILENAME
,--fileName
FILENAME
- name of bam file with Mm and Ml tags (default:None
)-s
SAMPLENAME
,--sampleName
SAMPLENAME
- name of sample for output SQL table name labelling (default:None
)-o
OUTDIR
,--outDir
OUTDIR
- directory where SQL database is stored (default:None
)