parse_bam

parse_bam(fileName: str, sampleName: str, outDir: str, bedFile: Optional[str] = None, basemod: str = 'A+CG', center: bool = False, windowSize: int = 1000, region: Optional[str] = None, threshA: int = 129, threshC: int = 129, extractAllBases: bool = False, cores: Optional[int] = None) None
fileName

name of bam file with Mm and Ml tags

sampleName

name of sample for output SQL table name labelling. Valid names contain [a-zA-Z0-9_].

outDir

directory where SQL database is stored

bedFile

name of bed file that defines regions of interest over which to extract mod calls. The bed file either defines regions over which to extract mod calls OR defines regions (likely motifs) over which to center positions for mod calls and then parse_bam extracts mod calls over a window flanking that region defined in by windowSize. Optional 4th column in bed file to specify strand of region of interest as + or -. Default is to consider regions as all +. NB. The bedFile and region parameters are mutually exclusive; specify one or the other.

basemod

One of the following:

  • 'A' - extract mA only

  • 'CG' - extract mCpG only

  • 'A+CG' - extract mA and mCpG

center

One of the following:

  • 'True' - report positions with respect to center of motif window (+/- windowSize); only valid with bed file input

  • 'False' - report positions in original reference space

windowSize

window size around center point of feature of interest to plot (+/-); only mods within this window are stored; only used if center=True; still, only reads that span the regions defined in the bed file will be included; default is 1,000 bp

region

single region over which to extract base mods, rather than specifying many windows in bedFile; format is chr:start-end. NB. The bedFile and region parameters are mutually exclusive; specify one or the other.

threshA

threshold above which to call an A base methylated; default is 129

threshC

threshold above which to call a C base methylated; default is 129

extractAllBases

One of the following:

  • 'True' - Store all base mod calls, regardles of methylation probability threshold. Bases stored are those that can have a modification call (A, CG, or both depending on basemod parameter) and are sequenced bases, not all bases in the reference.

  • 'False' - Only modifications above specified threshold are stored

cores

number of cores over which to parallelize; default is all available

Valid argument combinations for bedFile, center, and windowSize are below. Regions of interest generally fall into two categories: small motifs at which to center analysis (use center = True) or full windows of interest (do not specify center or windowSize).

  • bedFile –> extract all modified bases in regions defined in bed file

  • bedfile + center –> extract all modified bases in regions defined in bed file, report positions relative to region centers and extract base modificiations within default windowSize of 1kb

  • bedfile + center + windowSize –> extract all modified bases in regions defined in bed file, report positions relative to region centers and extract base modifications within flanking +/- windowSize

  • region –> extract all modified bases in single region

Example

For regions defined by bedFile:

>>> dm.parse_bam("dimelo/test/data/mod_mappings_subset.bam", "test", "dimelo/dimelo_test", bedFile="dimelo/test/data/test.bed", basemod="A+CG", center=True, windowSize=500, threshA=190, threshC=190, extractAllBases=False, cores=8)

For single region defined with region:

>>> dm.parse_bam("dimelo/test/data/mod_mappings_subset.bam", "test", "dimelo/dimelo_test", region="chr1:2907273-2909473", basemod="A+CG", threshA=190, threshC=190, cores=8)

Return

Returns a SQL database in the specified output directory. Database can be converted into pandas dataframe with:

>>> fileName = "dimelo/test/data/mod_mappings_subset.bam"
>>> sampleName = "test"
>>> outDir = "dimelo/dimelo_test"
>>> all_data = pd.read_sql("SELECT * from methylationByBase_" + sampleName, sqlite3.connect(outDir + "/" + fileName.split("/")[-1].replace(".bam", "") + ".db"))
>>> aggregate_counts = pd.read_sql("SELECT * from methylationAggregate_" + sampleName, sqlite3.connect(outDir + "/" + fileName.split("/")[-1].replace(".bam", "") + ".db"))

Each database contains these two tables with columns listed below:

  1. methylationByBase_sampleName
    • id(read_name:pos)

    • read_name

    • chr

    • pos

    • prob

    • mod

  2. methylationAggregate_sampleName
    • id(pos:mod)

    • pos

    • mod

    • methylated_bases

    • total_bases

When running parse_bam with a region defined, a summary bed file is also produced to support visualizing aggregate data with any genome browser tool. The columns of this bed file are chr, start, end, methylated_bases, total_bases.

For example, to take a summary output bed and create a file with fraction of modified bases with a window size of 100 bp for visualization with the WashU browser, you could run the below commands in terminal:

  • bedtools makewindows -g ref_genome.chromsizes.txt -w 100 > ref_genome_windows.100.bp.bed

  • bedtools map -a ref_genome_windows.100.bp.bed -b outDir/fileName_sampleName_chr_start_end_A.bed -c 4,5 -o sum,sum -null 0 | awk -v "OFS=\t" '{if($5>0){print $1,$2,$3,$4/$5}else{print $1,$2,$3,$5}}' > outDir/fileName_sampleName_chr_start_end_A.100.bed

  • bgzip outDir/fileName_sampleName_chr_start_end_A.100.bed

  • tabix -f -p bed outDir/fileName_sampleName_chr_start_end_A.100.bed.gz

dimelo-parse-bam - CLI interface

Parse a bam file into DiMeLo database tables

dimelo-parse-bam [-h] -f FILENAME -s SAMPLENAME -o OUTDIR (-b BEDFILE | -r REGION)
                 [-m {A,CG,A+CG}] [-A THRESHA] [-C THRESHC] [-e] [-p CORES] [-c]
                 [-w WINDOWSIZE]

dimelo-parse-bam optional arguments

  • -h, --help - show this help message and exit

  • -b BEDFILE, --bedFile BEDFILE - name of bed file that defines regions of interest over which to extract mod calls (default: None)

  • -r REGION, --region REGION - single region over which to extract base mods, e.g. "chr1:1-100000" (default: None)

  • -m BASEMOD, --basemod BASEMOD - which base modifications to extract (default: A+CG)

  • -A THRESHA, --threshA THRESHA - threshold above which to call an A base methylated (default: 129)

  • -C THRESHC, --threshC THRESHC - threshold above which to call a C base methylated (default: 129)

  • -e, --extractAllBases - store all base mod calls, regardless of methylation probability threshold

  • -p CORES, --cores CORES - number of cores over which to parallelize (default: None)

  • -c, --center - report positions with respect to center of motif window; only valid with bed file input

  • -w WINDOWSIZE, --windowSize WINDOWSIZE - window size around center point of feature of interest to plot (+/-); only mods within this window are stored (default: 1000 bp) (default: 1000)

dimelo-parse-bam required arguments

  • -f FILENAME, --fileName FILENAME - name of bam file with Mm and Ml tags (default: None)

  • -s SAMPLENAME, --sampleName SAMPLENAME - name of sample for output SQL table name labelling (default: None)

  • -o OUTDIR, --outDir OUTDIR - directory where SQL database is stored (default: None)