Methylation analysis with Methyl-IT. Part 3

An example of methylation analysis with simulated datasets

Part 3: Estimation of an optimal cutpoint for the methylation signal

 Methylation analysis with Methyl-IT is illustrated on simulated datasets of methylated and unmethylated read counts with relatively high average of  methylation levels: 0.15 and 0.286 for control and treatment groups, respectively. Herein, the detection of the methylation signal is confronted as a signal detection problem. A first look on the estimation of an optimal cutoff point for the methylation signal is covered.

1. Background

Normally, there is a spontaneous variability in the control group. This is a consequence of the random fluctuations, or noise, inherent to the methylation process. The stochasticity of the the methylation process is derives from the stochasticity inherent in biochemical processes. There are fundamental physical reasons to acknowledge that biochemical processes are subject to noise and fluctuations [1,2]. So, regardless constant environment, statistically significant methylation changes can be found in control population with probability greater than zero and proportional to a Boltzmann factor [3].

Natural signals and those generated by human technology are not free of noise and, as mentioned above, the methylation signal is no exception. Only signal detection based approaches are designed to filter out the signal from the noise, in natural and in human generated signals.

The need for the application of (what is now known as) signal detection in cancer research was pointed out by Youden in the midst of the last century [4]. Here, the application of signal detection approach was performed according with the standard practice in current implementations of clinical diagnostic test [5-7]. That is, optimal cutoff values of the methylation signal were estimated on the receiver operating characteristic curves (ROCs) and applied to identify DMPs. The decision of whether a DMP detected by Fisher’s exact test (or any other statistical test implemented in Methyl-IT) is taken based on the optimal cutoff value.

2. Cutpoint for the spontaneous variability in the control group

In Methyl-IT function estimateCutPoint  is used in the estimation of the optimal cutoff (cutpoint) value to distinguish signal from noise. There are also another available approaches that will be covered in a further post.

# Cutpoint estimation for FT approach
cut.ft = estimateCutPoint(LR = ft.tv, simple = TRUE, 
                            control.names = control.nam, 
                            treatment.names = treatment.nam,
                            div.col = 7L, verbose = FALSE)

# Cutpoint estimation for the FT approach using the ECDF critical value
cut.ft.hd = estimateCutPoint(LR = ft.hd, simple = TRUE,
                            control.names = control.nam, 
                            treatment.names = treatment.nam,
                            div.col = 7L, verbose = FALSE)

cut.emd = estimateCutPoint(LR = DMP.ecdf, simple = TRUE,
                            control.names = control.nam, 
                            treatment.names = treatment.nam,
                            div.col = 7L, verbose = FALSE)

# Cutpoint estimation for the Weibull 2-parameter distribution approach
cut.wb = estimateCutPoint(LR = DMPs.wb, simple = TRUE,
                            control.names = control.nam, 
                            treatment.names = treatment.nam,
                            div.col = 7L, verbose = FALSE)
# Cutpoint estimation for the Gamma 2-parameter distribution approach
cut.g2p = estimateCutPoint(LR = DMPs.g2p, simple = TRUE,
                            control.names = control.nam, 
                            treatment.names = treatment.nam,
                            div.col = 7L, verbose = FALSE)

# Control cutpoint to define TRUE negatives and TRUE positives
cuts <- data.frame(cut.ft = cut.ft$cutpoint, cut.ft.hd = cut.ft.hd$cutpoint, 
                   cut.ecdf = cut.emd$cutpoint, cut.wb = cut.wb$cutpoint,
                   cut.g2p = cut.g2p$cutpoint)
cuts
##      cut.ft cut.ft.hd cut.ecdf   cut.wb   cut.g2p
## 1 0.9847716 0.9847716 0.987013 0.988024 0.9847716

Now, with high probability true DMP can be selected with Methyl-IT function selectDIMP.

ft.DMPs <- selectDIMP(ft.tv, div.col = 7L, cutpoint = 0.9847716, absolute = TRUE)
ft.hd.DMPs <- selectDIMP(ft.hd, div.col = 7L, cutpoint = 0.9847716, absolute = TRUE)
emd.DMPs <- selectDIMP(DMP.ecdf, div.col = 7L, cutpoint = 0.9847716, absolute = TRUE)
wb.DMPs <- selectDIMP(DMPs.wb, div.col = 7L, cutpoint = 0.9847716, absolute = TRUE)
g2p.DMPs <- selectDIMP(DMPs.g2p, div.col = 7L, cutpoint = 0.9847716, absolute = TRUE)

A summary table with the number of detected DMPs by each approach:

data.frame(ft = unlist(lapply(ft.DMPs, length)), ft.hd = unlist(lapply(ft.hd.DMPs, length)),
ecdf = unlist(lapply(DMP.ecdf, length)), Weibull = unlist(lapply(wb.DMPs, length)),
Gamma = unlist(lapply(g2p.DMPs, length)))
## ft ft.hd ecdf Weibull Gamma ## C1 889 589 56 578 688 ## C2 893 608 58 593 708 ## C3 933 623 57 609 723 ## T1 1846 1231 107 773 1087 ## T2 1791 1182 112 775 1073 ## T3 1771 1178 116 816 1093

Nevertheless, we should evaluate the classification performance as given in the next section.

3. Evaluation of DMP classification

As shown above, DMPs are found in the control population as well. Hence, it is important to know whether a DMP is the resulting effect of the treatment or just spontaneously occurred in the control sample as well. In particular, the confrontation of this issue is extremely important when methylation analysis is intended to be used as part of a diagnostic clinical test and a decision making in biotechnology industry.

Methyl-IT function is used here, to evaluate the classification of DMPs into one of the two classes, control and treatment. Several classifiers are available to be used with this function (see the help/manual for this function or type ?evaluateDIMPclass in R console).

To evaluate the classification performances, for each methylation analysis approach, we show the results with the best available classifier. Here, the best results were found with a logistic model and a quadratic discriminant analysis (QDA) based on principal component (PC).

3.1. Evaluation of Fisher’s exact test DMP classification

ft.class = evaluateDIMPclass(LR = ft.DMPs, control.names = control.nam,
                           treatment.names = treatment.nam,
                           column = c(hdiv = TRUE, TV = TRUE, 
                                      wprob = TRUE, pos = TRUE),
                           classifier = "logistic", interaction = "wprob:TV", 
                           output = "conf.mat", prop = 0.6, pval.col = 11L 
)

## Model: treat ~ hdiv + TV + logP + pos + TV:logP
## $Performance
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   CT   TT
##         CT  135   29
##         TT  618 1511
##                                           
##                Accuracy : 0.7178          
##                  95% CI : (0.6989, 0.7362)
##     No Information Rate : 0.6716          
##     P-Value [Acc > NIR] : 1.015e-06       
##                                           
##                   Kappa : 0.2005          
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9812          
##             Specificity : 0.1793          
##          Pos Pred Value : 0.7097          
##          Neg Pred Value : 0.8232          
##              Prevalence : 0.6716          
##          Detection Rate : 0.6590          
##    Detection Prevalence : 0.9285          
##       Balanced Accuracy : 0.5802          
##                                           
##        'Positive' Class : TT              
##                                           
## 
## $FDR
## [1] 0.2902771
## 
## $model
## 
## Call:  glm(formula = formula, family = binomial(link = "logit"), data = dt)
## 
## Coefficients:
## (Intercept)         hdiv           TV         logP          pos  
##   4.199e+01   -1.451e-01   -4.367e+01    1.092e+00   -6.602e-04  
##     TV:logP  
##  -1.546e+00  
## 
## Degrees of Freedom: 3437 Total (i.e. Null);  3432 Residual
## Null Deviance:       4353 
## Residual Deviance: 4067  AIC: 4079

3.2. Evaluation DMP classification derived from Fisher’s exact test and ECDF critical value

ft.hd.class = evaluateDIMPclass(LR = ft.hd.DMPs, control.names = control.nam,
                           treatment.names = treatment.nam,
                           column = c(hdiv = TRUE, TV = TRUE, 
                                      wprob = TRUE, pos = TRUE),
                           classifier = "logistic", interaction = "wprob:TV",
                           pval.col = 11L, output = "conf.mat", prop = 0.6
)
## Model: treat ~ hdiv + TV + logP + pos + TV:logP
ft.hd.class
## $Performance
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  CT  TT
##         CT 174  18
##         TT 325 980
##                                          
##                Accuracy : 0.7709         
##                  95% CI : (0.7487, 0.792)
##     No Information Rate : 0.6667         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.3908         
##  Mcnemar's Test P-Value : < 2.2e-16      
##                                          
##             Sensitivity : 0.9820         
##             Specificity : 0.3487         
##          Pos Pred Value : 0.7510         
##          Neg Pred Value : 0.9062         
##              Prevalence : 0.6667         
##          Detection Rate : 0.6546         
##    Detection Prevalence : 0.8717         
##       Balanced Accuracy : 0.6653         
##                                          
##        'Positive' Class : TT             
##                                          
## 
## $FDR
## [1] 0.2490421
## 
## $model
## 
## Call:  glm(formula = formula, family = binomial(link = "logit"), data = dt)
## 
## Coefficients:
## (Intercept)         hdiv           TV         logP          pos  
##    229.1917      -0.1727    -232.0026       4.4204       0.1808  
##     TV:logP  
##     -4.9700  
## 
## Degrees of Freedom: 2241 Total (i.e. Null);  2236 Residual
## Null Deviance:       2854 
## Residual Deviance: 2610  AIC: 2622

3.3. Evaluation ECDF based DMP classification

ecdf.class = evaluateDIMPclass(LR = DMP.ecdf, control.names = control.nam,
                           treatment.names = treatment.nam,
                           column = c(hdiv = TRUE, TV = TRUE, 
                                      wprob = TRUE, pos = TRUE),
                           classifier = "pca.qda", n.pc = 4, pval.col = 10L,
                           center = TRUE, scale = TRUE,
                           output = "conf.mat", prop = 0.6
)
ecdf.class
## $Performance
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  CT  TT
##         CT  72 145
##         TT   4   4
##                                           
##                Accuracy : 0.3378          
##                  95% CI : (0.2763, 0.4036)
##     No Information Rate : 0.6622          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : -0.0177         
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.02685         
##             Specificity : 0.94737         
##          Pos Pred Value : 0.50000         
##          Neg Pred Value : 0.33180         
##              Prevalence : 0.66222         
##          Detection Rate : 0.01778         
##    Detection Prevalence : 0.03556         
##       Balanced Accuracy : 0.48711         
##                                           
##        'Positive' Class : TT              
##                                           
## 
## $FDR
## [1] 0.5
## 
## $model
## $qda
## Call:
## qda(ind.coord, grouping = data[resp][, 1], tol = tol, method = method)
## 
## Prior probabilities of groups:
##        CT        TT 
## 0.3373134 0.6626866 
## 
## Group means:
##             PC1         PC2         PC3         PC4
## CT  0.013261015  0.06134001  0.04739951 -0.09566950
## TT -0.006749976 -0.03122262 -0.02412678  0.04869664
## 
## $pca
## Standard deviations (1, .., p=4):
## [1] 1.1635497 1.0098710 0.9744562 0.8226468
## 
## Rotation (n x k) = (4 x 4):
##             PC1        PC2        PC3        PC4
## hdiv -0.6692087  0.2408009  0.0214317 -0.7026488
## TV   -0.3430242 -0.4422200  0.8043031  0.1996806
## logP  0.6439784 -0.1703322  0.3451435 -0.6611768
## pos  -0.1406627 -0.8470202 -0.4832319 -0.1710485
## 
## attr(,"class")
## [1] "pcaQDA"

3.4. Evaluation of Weibull based DMP classification

ws.class = evaluateDIMPclass(LR = wb.DMPs, control.names = control.nam,
                           treatment.names = treatment.nam,
                           column = c(hdiv = TRUE, TV = TRUE, 
                                      wprob = TRUE, pos = TRUE),
                           classifier = "pca.qda", n.pc = 4,  pval.col = 10L,
                           center = TRUE, scale = TRUE,
                           output = "conf.mat", prop = 0.6
)
ws.class
## $Performance
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  CT  TT
##         CT 512   0
##         TT   0 725
##                                     
##                Accuracy : 1         
##                  95% CI : (0.997, 1)
##     No Information Rate : 0.5861    
##     P-Value [Acc > NIR] : < 2.2e-16 
##                                     
##                   Kappa : 1         
##  Mcnemar's Test P-Value : NA        
##                                     
##             Sensitivity : 1.0000    
##             Specificity : 1.0000    
##          Pos Pred Value : 1.0000    
##          Neg Pred Value : 1.0000    
##              Prevalence : 0.5861    
##          Detection Rate : 0.5861    
##    Detection Prevalence : 0.5861    
##       Balanced Accuracy : 1.0000    
##                                     
##        'Positive' Class : TT        
##                                     
## 
## $FDR
## [1] 0
## 
## $model
## $qda
## Call:
## qda(ind.coord, grouping = data[resp][, 1], tol = tol, method = method)
## 
## Prior probabilities of groups:
##        CT        TT 
## 0.4133837 0.5866163 
## 
## Group means:
##              PC1         PC2         PC3        PC4
## CT  0.0006129907  0.02627154  0.01436693 -0.3379167
## TT -0.0004319696 -0.01851334 -0.01012426  0.2381272
## 
## $pca
## Standard deviations (1, .., p=4):
## [1] 1.3864692 1.0039443 0.9918014 0.2934775
## 
## Rotation (n x k) = (4 x 4):
##              PC1         PC2         PC3           PC4
## hdiv -0.70390059 -0.01853114 -0.06487288  0.7070870321
## TV   -0.08754381  0.61403346  0.78440944  0.0009100769
## logP  0.70393865  0.01695660  0.06446889  0.7071255955
## pos  -0.03647491 -0.78888021  0.61346321 -0.0007020890
## 
## attr(,"class")
## [1] "pcaQDA"

3.5. Evaluation of Gamma based DMP classification

g2p.class = evaluateDIMPclass(LR = g2p.DMPs, control.names = control.nam,
                           treatment.names = treatment.nam,
                           column = c(hdiv = TRUE, TV = TRUE, 
                                      wprob = TRUE, pos = TRUE),
                           classifier = "pca.qda", n.pc = 4, pval.col = 10L, 
                           center = TRUE, scale = TRUE,
                           output = "conf.mat", prop = 0.6
)
g2p.class
## $Performance
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  CT  TT
##         CT 597   0
##         TT   0 970
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9976, 1)
##     No Information Rate : 0.619      
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 1          
##  Mcnemar's Test P-Value : NA         
##                                      
##             Sensitivity : 1.000      
##             Specificity : 1.000      
##          Pos Pred Value : 1.000      
##          Neg Pred Value : 1.000      
##              Prevalence : 0.619      
##          Detection Rate : 0.619      
##    Detection Prevalence : 0.619      
##       Balanced Accuracy : 1.000      
##                                      
##        'Positive' Class : TT         
##                                      
## 
## $FDR
## [1] 0
## 
## $model
## $qda
## Call:
## qda(ind.coord, grouping = data[resp][, 1], tol = tol, method = method)
## 
## Prior probabilities of groups:
##        CT        TT 
## 0.3810132 0.6189868 
## 
## Group means:
##            PC1          PC2           PC3        PC4
## CT -0.09825234  0.009217742 -0.0015298812 -0.3376429
## TT  0.06047858 -0.005673919  0.0009417082  0.2078338
## 
## $pca
## Standard deviations (1, .., p=4):
## [1] 1.3934010 1.0005617 0.9910414 0.2741290
## 
## Rotation (n x k) = (4 x 4):
##              PC1         PC2         PC3           PC4
## hdiv -0.70097482 -0.01600430 -0.09106340  0.7071673216
## TV   -0.13109213  0.26685824  0.95477779 -0.0009560507
## logP  0.70085340  0.01201179  0.09357872  0.7070383630
## pos  -0.01592678 -0.96352803  0.26711394 -0.0031966410
## 
## attr(,"class")
## [1] "pcaQDA"

3.6. Summary of DMP classification performance

For the current simulated dataset, the best classification performance was obtained for the approach of DMP detection based on a 2-parameter gamma probability distribution model for the Hellinger divergence of methylation levels. DMPs from treatment can be distinguished from control DMPs with very high accuracy. The second best approach was obtained for the 2-parameter Weibull probability distribution model.

Obviously, for practical application, we do not need to go through all these steps. Herein, we just illustrate the need for a knowledge on the probability distributions of the signal plus noise in control and in treatment groups. The available approaches for methylation analysis are not designed to evaluate the natural variation in the control population. As a matter of fact, the phrase “natural variation” itself implies the concept of probability distribution of the variable under study. Although the probability distribution of the variable measured is objective and, as such, it does not depend on the model assumptions, the selection of the best fitted model notably improves the accuracy of our predictions.

4. Conclusions Summary

Herein, an illustrative example of methylation analysis with Methyl-IT have been presented. Whatever could be the statistical test/approach used to identify DMPs, the analysis with simulated datasets, where the average of methylation levels in the control samples is relatively high, indicates the need for the application of signal detection based approaches.

4.1. Concluding remarks

The simplest suggested steps to follow for a methylation analysis with Methyl-IT are:

  1. To estimate a reference virtual sample from a reference group by using function poolFromGRlist. Notice that several statistics are available to estimate the virtual samples, i.e., mean, median, sum. For experiments limited by the number of sample, at least, try the estimation of the virtual sample from the control group. Alternatively, the whole reference group can be used in pairwise comparisons with control and treatment groups (computationally expensive).
  2. To estimate information divergence using function estimateDivergence
  1. To perform the estimation of the cumulative density function of the Hellinger divergence of methylation levels using function nonlinearFitDist.
  1. To get the potential DMPs using function getPotentialDIMP.
  1. To estimate the optimal cutpoint using function estimateCutPoint.
  1. To retrieve DMPs with function selectDIMP.
  1. To evaluate the classificatio performance using function evaluateDIMPclass.

As shown here, alternative analysis is possible by using Fisher’s exact test (FT). Whether FT would be better than the approach summarized above will depend on the dataset under study. The approaches with Root Mean Square Test (RMST) and Hellinger divergence test (HDT) are also possible with function rmstGR. In these cases, we can proceed as suggested for FT. In general, RMST and HDT yield better results (not discussed here) than FT.

References

  1. Samoilov, Michael S., Gavin Price, and Adam P. Arkin. 2006. “From fluctuations to phenotypes: the physiology of noise.” Science’s STKE : Signal Transduction Knowledge Environment 2006 (366). doi:10.1126/stke.3662006re17.
  2. Eldar, Avigdor, and Michael B. Elowitz. 2010. “Functional roles for noise in genetic circuits.” Nature Publishing Group. doi:10.1038/nature09326.
  3. Sanchez, Robersy, and Sally A. Mackenzie. 2016. “Information Thermodynamics of Cytosine DNA Methylation.” Edited by Barbara Bardoni. PLOS ONE 11 (3). Public Library of Science: e0150427. doi:10.1371/journal.pone.0150427.
  4. Youden WJ. Index for rating diagnostic tests. Cancer, 1950, 3:32–5. doi:10.1002/1097-0142(1950)3:1<32::AID-CNCR2820030106>3.0.CO;2-3.
  5. Carter, Jane V., Jianmin Pan, Shesh N. Rai, and Susan Galandiuk. 2016. “ROC-ing along: Evaluation and interpretation of receiver operating characteristic curves.” Surgery 159 (6). Mosby: 1638–45. doi:10.1016/j.surg.2015.12.029.
  6. López-Ratón, Mónica, Maria Xosé Rodríguez-Álvarez, Carmen Cadarso-Suárez, Francisco Gude-Sampedro, and Others. 2014. “OptimalCutpoints: an R package for selecting optimal cutpoints in diagnostic tests.” Journal of Statistical Software 61 (8). Foundation for Open Access Statistics: 1–36. https://www.jstatsoft.org/article/view/v061i08.
  7. Hippenstiel, Ralph D. 2001. Detection theory: applications and digital signal processing. CRC Press.

Methylation analysis with Methyl-IT. Part 2

An example of methylation analysis with simulated datasets

Part 2: Potential DMPs from the methylation signal

 Methylation analysis with Methyl-IT is illustrated on simulated datasets of methylated and unmethylated read counts with relatively high average of  methylation levels: 0.15 and 0.286 for control and treatment groups, respectively. In this part, potential differentially methylated positions are estimated following different approaches.

1. Background

Only a signal detection approach can detect with high probability real DMPs. Any statistical test (like e.g. Fisher’s exact test) not based on signal detection requires for further analysis to distinguish DMPs that naturally can occur in the control group from those DMPs induced by a treatment. The analysis here is a continuation of Part 1.

2. Potential DMPs from the methylation signal using empirical distribution

As suggested from the empirical density graphics (above), the critical values $H_{\alpha=0.05}$ and $TV_{d_{\alpha=0.05}}$ can be used as cutpoints to select potential DMPs. After setting $dist.name = “ECDF”$ and $tv.cut = 0.926$ in Methyl-IT function getPotentialDIMP, potential DMPs are estimated using the empirical cummulative distribution function (ECDF) and the critical value $TV_{d_{\alpha=0.05}}=0.926$.

DMP.ecdf <- getPotentialDIMP(LR = divs, div.col = 9L, tv.cut = 0.926, tv.col = 7,
                            alpha = 0.05, dist.name = "ECDF")

3. Potential DMPs detected with Fisher’s exact test

In Methyl-IT Fisher’s exact test (FT) is implemented in function FisherTest. In the current case, a pairwise group application of FT to each cytosine site is performed. The differences between the group means of read counts of methylated and unmethylated cytosines at each site are used for testing (pooling.stat=”mean”). Notice that only cytosine sites with critical values $TV_d$> 0.926 are tested (tv.cut = 0.926).

ft = FisherTest(LR = divs, tv.cut = 0.926,
                     pAdjustMethod = "BH",  pooling.stat = "mean", 
                     pvalCutOff = 0.05, num.cores = 4L,
                     verbose = FALSE, saveAll = FALSE) 

ft.tv <- getPotentialDIMP(LR = ft, div.col = 9L, dist.name = "None",
                          tv.cut = 0.926, tv.col = 7, alpha = 0.05)

There is not a one-to-one mapping between $TV$ and $HD$. However, at each cytosine site $i$, these information divergences hold the inequality:

$TV(p^{tt}_i,p^{ct}_i)\leq \sqrt{2}H_d(p^{tt}_i,p^{ct}_i)$  [1].

where $H_d(p^{tt}_i,p^{ct}_i) = \sqrt{\frac{H(p^{tt}_i,p^{ct}_i)}w}$ is the Hellinger distance and $H(p^{tt}_i,p^{ct}_i)$ is given by Eq. 1 in part 1.
So, potential DMPs detected with FT can be constrained with the critical value $H^{TT}_{\alpha=0.05}\geq114.5$

4. Potential DMPs detected with Weibull 2-parameters model

Potential DMPs can be estimated using the critical values derived from the fitted Weibull 2-parameters models, which are obtained after the non-linear fit of the theoretical model on the genome-wide $HD$ values for each individual sample using Methyl-IT function nonlinearFitDist [2]. As before, only cytosine sites with critical values $TV>0.926$ are considered DMPs. Notice that, it is always possible to use any other values of $HD$ and $TV$ as critical values, but whatever could be the value it will affect the final accuracy of the classification performance of DMPs into two groups, DMPs from control and DNPs from treatment (see below). So, it is important to do an good choices of the critical values.

nlms.wb <- nonlinearFitDist(divs, column = 9L, verbose = FALSE, num.cores = 6L)
# Potential DMPs from 'Weibull2P' model
DMPs.wb <- getPotentialDIMP(LR = divs, nlms = nlms.wb, div.col = 9L, 
                            tv.cut = 0.926, tv.col = 7, alpha = 0.05, 
                            dist.name = "Weibull2P")
nlms.wb$T1 
##         Estimate   Std. Error  t value Pr(>|t|))      Adj.R.Square
## shape  0.5413711 0.0003964435 1365.570         0 0.991666592250838
## scale 19.4097502 0.0155797315 1245.833         0                  
##                     rho       R.Cross.val              DEV
## shape 0.991666258901194 0.996595712743823 34.7217494754823
## scale                                                     
##                     AIC               BIC     COV.shape     COV.scale
## shape -221720.747067975 -221694.287733122  1.571674e-07 -1.165129e-06
## scale                                     -1.165129e-06  2.427280e-04
##       COV.mu     n
## shape     NA 50000
## scale     NA 50000

5. Potential DMPs detected with Gamma 2-parameters model

As in the case of Weibull 2-parameters model, potential DMPs can be estimated using the critical values derived from the fitted Gamma 2-parameters models and only cytosine sites with critical values $TV_d > 0.926$ are considered DMPs.

nlms.g2p <- nonlinearFitDist(divs, column = 9L, verbose = FALSE, num.cores = 6L,
                            dist.name = "Gamma2P")
# Potential DMPs from 'Gamma2P' model
DMPs.g2p <- getPotentialDIMP(LR = divs, nlms = nlms.g2p,  div.col = 9L, 
                             tv.cut = 0.926, tv.col = 7, alpha = 0.05, 
                             dist.name = "Gamma2P")
nlms.g2p$T1
##         Estimate   Std. Error  t value Pr(>|t|))      Adj.R.Square
## shape  0.3866249 0.0001480347 2611.717         0 0.999998194156282
## scale 76.1580083 0.0642929555 1184.547         0                  
##                     rho       R.Cross.val                 DEV
## shape 0.999998194084045 0.998331895911125 0.00752417919133131
## scale                                                        
##                    AIC               BIC     COV.alpha     COV.scale
## shape -265404.29138371 -265369.012270572  2.191429e-08 -8.581717e-06
## scale                                    -8.581717e-06  4.133584e-03
##       COV.mu    df
## shape     NA 49998
## scale     NA 49998

Summary table:

data.frame(ft = unlist(lapply(ft, length)), ft.hd = unlist(lapply(ft.hd, length)),
ecdf = unlist(lapply(DMPs.hd, length)), Weibull = unlist(lapply(DMPs.wb, length)),
Gamma = unlist(lapply(DMPs.g2p, length)))
##      ft ft.hd ecdf Weibull Gamma
## C1 1253   773   63     756   935
## C2 1221   776   62     755   925
## C3 1280   786   64     768   947
## T1 2504  1554  126     924  1346
## T2 2464  1532  124     942  1379
## T3 2408  1477  121     979  1354

6. Density graphic with a new critical value

The graphics for the empirical (in black) and Gamma (in blue) densities distributions of Hellinger divergence of methylation levels for sample T1 are shown below. The 2-parameter gamma model is build by using the parameters estimated in the non-linear fit of $H$ values from sample T1. The critical values estimated from the 2-parameter gamma distribution $H^{\Gamma}_{\alpha=0.05}=124$ is more ‘conservative’ than the critical value based on the empirical distribution $H^{Emp}_{\alpha=0.05}=114.5$. That is, in accordance with the empirical distribution, for a methylation change to be considered a signal its $H$ value must be $H\geq114.5$, while according with the 2-parameter gamma model any cytosine carrying a signal must hold $H\geq124$.

suppressMessages(library(ggplot2))

# Some information for graphic
dt <- data[data$sample == "T1", ]
coef <- nlms.g2p$T1$Estimate # Coefficients from the non-linear fit
dgamma2p <- function(x) dgamma(x, shape = coef[1], scale = coef[2])
qgamma2p <- function(x) qgamma(x, shape = coef[1], scale = coef[2])

# 95% quantiles 
q95 <- qgamma2p(0.95) # Gamma model based quantile
emp.q95 = quantile(divs$T1$hdiv, 0.95) # Empirical quantile

# Density plot with ggplot
ggplot(dt, aes(x = HD)) + 
  geom_density(alpha = 0.05, bw = 0.2, position = "identity", na.rm = TRUE,
               size = 0.4) + xlim(c(0, 150)) +
  stat_function(fun = dgamma2p, colour = "blue") +
  xlab(expression(bolditalic("Hellinger divergence (HD)"))) + 
  ylab(expression(bolditalic("Density"))) +
  ggtitle("Empirical and Gamma densities distributions of Hellinger divergence (T1)") +
  geom_vline(xintercept = emp.q95, color = "black", linetype = "dashed", size = 0.4) +
  annotate(geom = "text", x = emp.q95 - 20, y = 0.16, size = 5,
           label = 'bolditalic(HD[alpha == 0.05]^Emp==114.5)',
           family = "serif", color = "black", parse = TRUE) +
  geom_vline(xintercept = q95, color = "blue", linetype = "dashed", size = 0.4) +
  annotate(geom = "text", x = q95 + 9, y = 0.14, size = 5,
           label = 'bolditalic(HD[alpha == 0.05]^Gamma==124)',
           family = "serif", color = "blue", parse = TRUE) +
  theme(
    axis.text.x  = element_text( face = "bold", size = 12, color="black",
                                 margin = margin(1,0,1,0, unit = "pt" )),
    axis.text.y  = element_text( face = "bold", size = 12, color="black", 
                                 margin = margin( 0,0.1,0,0, unit = "mm")),
    axis.title.x = element_text(face = "bold", size = 13,
                                color="black", vjust = 0 ),
    axis.title.y = element_text(face = "bold", size = 13,
                                color="black", vjust = 0 ),
    legend.title = element_blank(),
    legend.margin = margin(c(0.3, 0.3, 0.3, 0.3), unit = 'mm'),
    legend.box.spacing = unit(0.5, "lines"),
    legend.text = element_text(face = "bold", size = 12, family = "serif")

References

  1. Steerneman, Ton, K. Behnen, G. Neuhaus, Julius R. Blum, Pramod K. Pathak, Wassily Hoeffding, J. Wolfowitz, et al. 1983. “On the total variation and Hellinger distance between signed measures; an application to product measures.” Proceedings of the American Mathematical Society 88 (4). Springer-Verlag, Berlin-New York: 684–84. doi:10.1090/S0002-9939-1983-0702299-0.
  2. Sanchez, Robersy, and Sally A. Mackenzie. 2016. “Information Thermodynamics of Cytosine DNA Methylation.” Edited by Barbara Bardoni. PLOS ONE 11 (3). Public Library of Science: e0150427. doi:10.1371/journal.pone.0150427.

Methylation analysis with Methyl-IT. Part 1

An example of methylation analysis with simulated datasets

Part 1: Methylation signal

 Methylation analysis with Methyl-IT is illustrated on simulated datasets of methylated and unmethylated read counts with relatively high average of  methylation levels: 0.15 and 0.286 for control and treatment groups, respectively. The main Methyl-IT downstream analysis is presented alongside the application of Fisher’s exact test. The importance of a signal detection step is shown.

1. Background

Methyl-IT R package offers a methylome analysis approach based on information thermodynamics (IT) and signal detection. Methyl-IT approach confront detection of differentially methylated cytosine as a signal detection problem. This approach was designed to discriminate methylation regulatory signal from background noise induced by molecular stochastic fluctuations. Methyl-IT R package is not limited to the IT approach but also includes Fisher’s exact test (FT), Root-mean-square statistic (RMST) or Hellinger divergence (HDT) tests. Herein, we will show that a signal detection step is also required for FT, RMST, and HDT as well.

2. Data generation

For the current example on methylation analysis with Methyl-IT we will use simulated data. Read count matrices of methylated and unmethylated cytosine are generated with Methyl-IT function simulateCounts. Function simulateCounts randomly generates prior methylation levels using Beta distribution function. The expected mean of methylation levels that we would like to have can be estimated using the auxiliary function:

bmean <- function(alpha, beta) alpha/(alpha + beta)
alpha.ct <- 0.09
alpha.tt <- 0.2
c(control.group = bmean(alpha.ct, 0.5), treatment.group = bmean(alpha.tt, 0.5), 
  mean.diff = bmean(alpha.tt, 0.5) - bmean(alpha.ct, 0.5)) 
##   control.group treatment.group       mean.diff 
##       0.1525424       0.2857143       0.1331719

This simple function uses the α and β (shape2) parameters from the Beta distribution function to compute the expected value of methylation levels. In the current case, we expect to have a difference of methylation levels about 0.133 between the control and the treatment.

2.1. Simulation

Methyl-IT function simulateCounts will be used to generate the datasets, which will include three group of samples: reference, control, and treatment.

suppressMessages(library(MethylIT))

# The number of cytosine sites to generate
sites = 50000 
# Set a seed for pseudo-random number generation
set.seed(124)
control.nam <- c("C1", "C2", "C3")
treatment.nam <- c("T1", "T2", "T3")

# Reference group 
ref0 = simulateCounts(num.samples = 4, sites = sites, alpha = alpha.ct, beta = 0.5,
                      size = 50, theta = 4.5, sample.ids = c("R1", "R2", "R3"))
# Control group
ctrl = simulateCounts(num.samples = 3, sites = sites, alpha = alpha.ct, beta = 0.5,
                      size = 50, theta = 4.5, sample.ids = control.nam)
# Treatment group
treat = simulateCounts(num.samples = 3, sites = sites, alpha = alpha.tt, beta = 0.5,
                       size = 50, theta = 4.5, sample.ids = treatment.nam)

Notice that reference and control groups of samples are not identical but belong to the same population.

2.2. Divergences of methylation levels

The estimation of the divergences of methylation levels is required to proceed with the application of signal detection basic approach. The information divergence is estimated here using the function estimateDivergence. For each cytosine site, methylation levels are estimated according to the formulas: $p_i={n_i}^{mC_j}/({n_i}^{mC_j}+{n_i}^{C_j})$, where ${n_i}^{mC_j}$ and ${n_i}^{C_j}$ are the number of methylated and unmethylated cytosines at site $i$.

If a Bayesian correction of counts is selected in function estimateDivergence, then methylated read counts are modeled by a beta-binomial distribution in a Bayesian framework, which accounts for the biological and sampling variations [1,2,3]. In our case we adopted the Bayesian approach suggested in reference [4] (Chapter 3).

Two types of information divergences are estimated: TV, total variation (TV, absolute value of methylation levels) and Hellinger divergence (H). TV is computed according to the formula: $TV=|p_{tt}-p_{ct}|$ and H:

$H(\hat p_{ij},\hat p_{ir}) = w_i[(\sqrt{\hat p_{ij}} – \sqrt{\hat p_{ir}})^2+(\sqrt{1-\hat p_{ij}} – \sqrt{1-\hat p_{ir}})^2]$  (1)

where $w_i = 2 \frac{m_{ij} m_{ir}}{m_{ij} + m_{ir}}$, $m_{ij} = {n_i}^{mC_j}+{n_i}^{uC_j}+1$, $m_{ir} = {n_i}^{mC_r}+{n_i}^{uC_r}+1$ and $j \in {\{c,t}\}$

The equation for Hellinger divergence is given in reference [5], but
any other information theoretical divergences could be used as well. Divergences are estimated for control and treatment groups in respect to a virtual sample, which is created applying function poolFromGRlist on the reference group.

.

# Reference sample
ref = poolFromGRlist(ref0, stat = "mean", num.cores = 4L, verbose = FALSE)

# Methylation level divergences
DIVs <- estimateDivergence(ref = ref, indiv = c(ctrl, treat), Bayesian = TRUE, 
                           num.cores = 6L, percentile = 1, verbose = FALSE)

The mean of methylation levels differences is:

unlist(lapply(DIVs, function(x) mean(mcols(x[, 7])[,1])))
##            C1            C2            C3            T1            T2 
## -0.0009820776 -0.0014922009 -0.0022257725  0.1358867135  0.1359160219 
##            T3 
##  0.1309217360

3. Methylation signal

Likewise for any other signal in nature, the analysis of methylation signal requires for the knowledge of its probability distribution. In the current case, the signal is represented in terms of the Hellinger divergence of methylation levels (H).

divs = DIVs[order(names(DIVs))]

# To remove hd == 0 to estimate. The methylation signal only is given for  
divs = lapply(divs, function(div) div[ abs(div$hdiv) > 0 ])
names(divs) <- names(DIVs)

# Data frame with the Hellinger divergences from both groups of samples samples 
l = c(); for (k in 1:length(divs)) l = c(l, length(divs[[k]]))
data <- data.frame(H = c(abs(divs$C1$hdiv), abs(divs$C2$hdiv), abs(divs$C3$hdiv),
                           abs(divs$T1$hdiv), abs(divs$T2$hdiv), abs(divs$T3$hdiv)),
                   sample = c(rep("C1", l[1]), rep("C2", l[2]), rep("C3", l[3]),
                              rep("T1", l[4]), rep("T2", l[5]), rep("T3", l[6]))
)

Empirical critical values for the probability distribution of H and TV can be obtained using quantile function from the R package stats.

critical.val <- do.call(rbind, lapply(divs, function(x) {
  hd.95 = quantile(x$hdiv, 0.95)
  tv.95 = quantile(x$TV, 0.95)
  return(c(tv = tv.95, hd = hd.95))
}))

critical.val
##       tv.95%    hd.95%
## C1 0.7893927  81.47256
## C2 0.7870469  80.95873
## C3 0.7950869  81.27145
## T1 0.9261629 113.73798
## T2 0.9240506 114.45228
## T3 0.9212163 111.54258

3.1. Density estimation

The kernel density estimation yields the empirical density shown in the graphics:

suppressMessages(library(ggplot2))

# Some information for graphic
crit.val.ct <- max(critical.val[c("C1", "C2", "C3"), 2]) # 81.5
crit.val.tt <- min(critical.val[c("T1", "T2", "T3"), 2]) # 111.5426

# Density plot with ggplot
ggplot(data, aes(x = H, colour = sample, fill = sample)) + 
  geom_density(alpha = 0.05, bw = 0.2, position = "identity", na.rm = TRUE,
               size = 0.4) + xlim(c(0, 125)) +   
  xlab(expression(bolditalic("Hellinger divergence (H)"))) + 
  ylab(expression(bolditalic("Density"))) +
  ggtitle("Density distribution for control and treatment") +
  geom_vline(xintercept = crit.val.ct, color = "red", linetype = "dashed", size = 0.4) +
  annotate(geom = "text", x = crit.val.ct-2, y = 0.3, size = 5,
           label = 'bolditalic(H[alpha == 0.05]^CT==81.5)',
           family = "serif", color = "red", parse = TRUE) +
  geom_vline(xintercept = crit.val.tt, color = "blue", linetype = "dashed", size = 0.4) +
  annotate(geom = "text", x = crit.val.tt -2, y = 0.2, size = 5,
           label = 'bolditalic(H[alpha == 0.05]^TT==114.5)',
           family = "serif", color = "blue", parse = TRUE) +
  theme(
    axis.text.x  = element_text( face = "bold", size = 12, color="black",
                                 margin = margin(1,0,1,0, unit = "pt" )),
    axis.text.y  = element_text( face = "bold", size = 12, color="black", 
                                 margin = margin( 0,0.1,0,0, unit = "mm")),
    axis.title.x = element_text(face = "bold", size = 13,
                                color="black", vjust = 0 ),
    axis.title.y = element_text(face = "bold", size = 13,
                                color="black", vjust = 0 ),
    legend.title = element_blank(),
    legend.margin = margin(c(0.3, 0.3, 0.3, 0.3), unit = 'mm'),
    legend.box.spacing = unit(0.5, "lines"),
    legend.text = element_text(face = "bold", size = 12, family = "serif")
  )
The above graphic shows that with high probability the methylation signal induced by the treatment has H values $H^{TT}_{\alpha=0.05}\geq114.5$. According to the critical value estimated for the differences of methylation levels, the methylation signal holds $TV^{TT}_{\alpha=0.05}\geq0.926$. Notice that most of the methylation changes are not signal but noise (found to the left of the critical values). This situation is typical for all the natural and technologically generated signals. Assuming that the background methylation variation is consistent with a Poisson process and that methylation changes conform to the second law of thermodynamics, the Hellinger divergence of methylation levels follows a Weibull distribution probability or some member of the generalized gamma distribution family [6].

References

  1. Hebestreit, Katja, Martin Dugas, and Hans-Ulrich Klein. 2013. “Detection of significantly differentially methylated regions in targeted bisulfite sequencing data.” Bioinformatics (Oxford, England) 29 (13): 1647–53. doi:10.1093/bioinformatics/btt263.
  2. Hebestreit, Katja, Martin Dugas, and Hans-Ulrich Klein. 2013. “Detection of significantly differentially methylated regions in targeted bisulfite sequencing data.” Bioinformatics (Oxford, England) 29 (13): 1647–53. doi:10.1093/bioinformatics/btt263.
  3. Dolzhenko, Egor, and Andrew D Smith. 2014. “Using beta-binomial regression for high-precision differential methylation analysis in multifactor whole-genome bisulfite sequencing experiments.” BMC Bioinformatics 15 (1). BioMed Central: 215. doi:10.1186/1471-2105-15-215.
  4. Baldi, Pierre, and Soren Brunak. 2001. Bioinformatics: the machine learning approach. Second. Cambridge: MIT Press.
  5. Basu, A., A. Mandal, and L. Pardo. 2010. “Hypothesis testing for two discrete populations based on the Hellinger distance.” Statistics & Probability Letters 80 (3-4). Elsevier B.V.: 206–14. doi:10.1016/j.spl.2009.10.008.
  6. Sanchez R, Mackenzie SA. Information Thermodynamics of Cytosine DNA Methylation. PLoS One, 2016, 11:e0150427.