CpGCpHhMM
hidden Markov model for detecting CpG-CpH differentially methylated regions
Introduction
Methylated non-CpGs (mCpHs) in mammalian cells yield weak enrichment signals and colocalize with methylated CpGs (mCpGs). The mCpHs are cell type-specific and associated with epigenetic regulation, although their dependency on mCpGs remains to be elucidated. We developed a hidden Markov model (HMM) that systematically detects genomic regions in which CpG and CpH are differentially methylated, providing an opportunity to infer the functional importance of non-CpG methylation.
An empirical HMM was designed to detect the differentially methylated regions (DMRs) of CpG and CpH (CpG-CpH DMRs) in each human sample. Specifically, the whole genome was segmented into 180 bp-long bins, and the emission probability E for each state (P, N, or U) was calculated at each bin: P, positive correlation between CpH and CpG methylation; N, negative correlation between CpH and CpG methylation; U, uncorrelated. E was calculated only for bins in which the number of reads aligned at CpGs and CpHs was >10. In addition, to ensure the continuity of the Markov model, the genome was split wherever a run of consecutive undetected bins was longer than 100,000 bp, and the HMM was applied to each segment separately. The state transition probabilities were estimated using an expectation-maximization (EM) algorithm that repeats the EM steps until the difference between the previous and current transition probabilities is <5e-4 for all state transitions. Then, the Viterbi algorithm, which finds an optimal path among the states, was applied, and each bin was reassigned to the P-, N-, or U-state. Consecutive bins were linked if they were in the same state and the distance between them was <3 bins (540 bp). Finally, the N-state regions were defined as CpG-CpH DMRs. More information is available in Lee et al. (2020).
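The Viterbi step described above can be illustrated with a minimal Python sketch. It is not the tool's own code: the function name, the argument layout, and the transition values in the usage example are assumptions made for illustration. It takes per-bin emission probabilities ordered as (P, N, U) and log-scaled transition and initial probabilities, and returns the most likely state path.

```python
import math

STATES = ("P", "N", "U")  # state order assumed for this sketch


def viterbi(emissions, log_trans, log_init):
    """Return the most likely state path for a run of bins.

    emissions: list of (e_P, e_N, e_U) tuples, one per bin
    log_trans: 3x3 nested list, log_trans[prev][cur] in natural log
    log_init:  list of 3 log initial probabilities
    """
    if not emissions:
        return []
    n = len(STATES)

    def safe_log(p):
        # Zero probabilities map to -inf instead of raising ValueError.
        return math.log(p) if p > 0 else float("-inf")

    score = [log_init[s] + safe_log(emissions[0][s]) for s in range(n)]
    back = []  # back[t][s] = best previous state for state s at bin t+1
    for emit in emissions[1:]:
        new_score, pointers = [], []
        for s in range(n):
            prev = max(range(n), key=lambda r: score[r] + log_trans[r][s])
            pointers.append(prev)
            new_score.append(score[prev] + log_trans[prev][s] + safe_log(emit[s]))
        score = new_score
        back.append(pointers)

    # Trace back from the best final state.
    state = max(range(n), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return [STATES[s] for s in path]


# Hypothetical parameters: "sticky" transitions and a uniform start.
log_trans = [[math.log(0.9 if i == j else 0.05) for j in range(3)] for i in range(3)]
log_init = [math.log(1 / 3)] * 3
# The middle bin weakly favours N, but its neighbours strongly favour P.
print(viterbi([(0.8, 0.1, 0.1), (0.3, 0.4, 0.3), (0.8, 0.1, 0.1)], log_trans, log_init))
# prints ['P', 'P', 'P']
```

The example shows why decoding can differ from picking the top emission probability per bin: the isolated N-leaning bin is absorbed into the surrounding P-state run because two state switches cost more than the small emission gain.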
Instruction
Input file
Gzip-compressed file with five tab-separated columns.
CHR: chromosome number (1..19,X,Y)
BP: position of the cytosine
CXX: cytosine context. CG, CHH, or CHG (both CHH and CHG are recognized as CH)
mC_read: methylated read count mapped at this position
totalC_read: total read count mapped at this position
Example:
CHR BP CXX mC_read totalC_read
19 60001 CHH 0 10
19 60006 CHG 0 17
19 60120 CG 2 2
...
[Download an example gz file]
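As a rough illustration of how a file in this format could be read and summed into 180 bp bins, here is a minimal Python sketch. The function name bin_counts and the bin-start convention (position rounded down to a multiple of 180) are assumptions for illustration; the tool's own binning convention is not stated on this page.

```python
import gzip
from collections import defaultdict

BIN_SIZE = 180  # bin length in bp, as described in the Introduction


def bin_counts(path, bin_size=BIN_SIZE):
    """Sum methylated and total read counts per (chromosome, bin start).

    Assumption: a bin starts at (BP // bin_size) * bin_size.
    CHH and CHG are both counted as CH, as the input description states.
    """
    bins = defaultdict(lambda: {"mCGn": 0, "CGn": 0, "mCHn": 0, "CHn": 0})
    with gzip.open(path, "rt") as handle:
        for line in handle:
            fields = line.split()
            # Skip the header, blank lines, and anything malformed.
            if len(fields) != 5 or fields[0] == "CHR":
                continue
            chrom, bp, context, m_reads, total_reads = fields
            if context == "CG":
                kind = "CG"
            elif context in ("CHH", "CHG"):
                kind = "CH"
            else:
                continue  # unknown context
            start = (int(bp) // bin_size) * bin_size
            counts = bins[(chrom, start)]
            counts["m" + kind + "n"] += int(m_reads)
            counts[kind + "n"] += int(total_reads)
    return dict(bins)
```

Applied to the three example rows above, the two CH rows at positions 60001 and 60006 would fall into one bin and the CG row at 60120 into the next.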
Output
output_dir/gblock/
pos: starting position of 180bp bin
mCGn: sum of methylated read count aligned at CG in the bin
CGn: sum of total read count aligned at CG in the bin
mCHn: sum of methylated read count aligned at CH in the bin
CHn: sum of total read count aligned at CH in the bin
refCGn: number of CG in the bin in hg19
refCHn: number of CH in the bin in hg19
output_dir/emit/
pos: starting position of 180bp bin
mCGlv: average methylation level at CG
refCGn: number of CG in the bin in hg19
mCHlv: average methylation level at CH
refCHn: number of CH in the bin in hg19
e_p: probability that the bin belongs to the P-state
e_n: probability that the bin belongs to the N-state
e_i: probability that the bin belongs to the I-state
max_stat: state of top probability (0:P, 1:N, 2:I)
output_dir/initProb/
randomly set initial probability of P-, N-, and I-state
output_dir/TRN/
log-scaled transition rates between states. column order: P, N, I, row order: P, N, I
output_dir/viterbi/
pos: starting position of 180bp bin
mCGlv: average methylation level at CG
refCGn: number of CG in the bin in hg19
mCHlv: average methylation level at CH
refCHn: number of CH in the bin in hg19
state: state designated by Viterbi decoding (0:P-, 1:N-, 2:I-state)
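The Introduction describes linking same-state bins into regions after Viterbi decoding. A minimal Python sketch of that post-processing step follows. The function name link_bins, the in-memory (pos, state) input, and the half-open region end are assumptions; in particular, "distance <3 bins" is interpreted here as a difference of less than 540 bp between bin start positions, which may not match the tool exactly.

```python
BIN_SIZE = 180       # bin length in bp
MAX_LINK_BINS = 3    # link if start positions differ by < 3 bins (540 bp)


def link_bins(calls, target_state=1, bin_size=BIN_SIZE, max_link_bins=MAX_LINK_BINS):
    """Merge bins of one state into regions.

    calls: iterable of (pos, state) for one chromosome, sorted by pos,
           with state coded as in the viterbi output (0:P, 1:N, 2:I).
    Returns a list of (start, end) tuples; end is exclusive (assumption).
    """
    regions = []  # each entry: [region start, start of last linked bin]
    for pos, state in calls:
        if state != target_state:
            continue
        if regions and pos - regions[-1][1] < max_link_bins * bin_size:
            regions[-1][1] = pos          # extend the current region
        else:
            regions.append([pos, pos])    # open a new region
    return [(start, last + bin_size) for start, last in regions]


calls = [(0, 1), (180, 1), (360, 0), (540, 1), (1260, 1), (1440, 2)]
print(link_bins(calls))  # prints [(0, 720), (1260, 1440)]
```

With the default target_state=1 the result corresponds to N-state regions, which the Introduction defines as CpG-CpH DMRs.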
output_dir/statistics/
Number of bins detected as P-, N-, and I-state by top emission probability and by Viterbi decoding
Reference
1. Jong-Hun Lee, Yutaka Saito, Sung-Joon Park, Kenta Nakai, "Existence and possible roles of independent non-CpG methylation in the mammalian brain", DNA Research dsaa020 (2020)