# Fisher's exact test failure can lead to biased results

Robersy Sanchez

Department of Biology. Pennsylvania State University, University Park, PA 16802

Email: rus547@psu.edu

Fisher's exact test is a statistical significance test used in the analysis of contingency tables. Although this test is routinely used even though, it has been full of with controversy for over 80 years.Herein, the case of its application analyzed is scrutinized with specific examples.

# Overview

The statistical significance of the difference between two bisulfite sequence from control and treatment groups at each CG site can be evaluated with Fisher's exact test. This is a statistical test used to determine if there are nonrandom associations between two categorical variables.

Let there exist two such (categorical) variables $X$ and $Y$, where $X$ stands for two groups of individuals: control and treatment, and $Y$ be a two states variable denoting the methylation status, carrying the number of times that a cytosine site is found methylated ($^{m}CG$) and non-methylated ($CG$), respectively.

This information can be summarized in a $2 \times 2$ table, a $2 \times 2$ matrix in which the entries $a_{ij}$ represent the number of observations in which $x=i$ and $y=j$. Calculate the row and column sums $R_i$ and $C_j$, respectively, and the total sum:

$N=\sum_iR_i=\sum_jC_j$
of the matrix:

$Y = ^mCG$ $Y = CG$ $R_i$
Control $a_{11}$ $a_12$ $a_{11}+a_{12}$
Treatment $a_{21}$ $a_22$ $a_{21}+a_{22}$
$C_i$ $a_{11}+a_{21}$ $a_{12}+a_{22}$ $a_{11}+a_{12}+a_{21}+a_{22} = N$

Then the conditional probability of getting the actual matrix, given the particular row and column sums, is given by the formula:

$P_{cutoff}=\frac{R_1!R_2!}{N!\prod_{i,j}a_{ij}!}C_1!C_2!$

# Example 1

Let's consider the following hypthetical case of methylation at a given cytosine site found in the comparison of control and treatment groups:

case1 <- matrix(c(5, 14, 15, 12),  nrow = 2,
dimnames = list(Group = c("Ctrl", "Treat"),
Meth.status = c("mCG", "CG")))
case1

##        Meth.status
## Group   mCG CG
##   Ctrl    5 15
##   Treat  14 12


That is, the cytosine site was found methylated 5 times from 20 counts in the control group and 14 out of 26 in the treatment. This accounts for methylation levels about 0.28 and 0.53 in the control and and the treatment groups, respectively, which correspond to a value of 0.25 (50%) of methylation levels difference.

## Proportions
case1/rowSums(case1)

##        Meth.status
## Group         mCG        CG
##   Ctrl  0.2500000 0.7500000
##   Treat 0.5384615 0.4615385


Fisher's exact test found not difference between these group!

fisher.test(case1)

##
## 	Fisher's Exact Test for Count Data
##
## data:  case1
## p-value = 0.07158
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  0.06352714 1.18168682
## sample estimates:
## odds ratio
##   0.293905


Considering the direction of the changes seems to be more sensitive to the magnitude of methylation changes, statistically significant at a significance level of $\alpha = 0.05$.

fisher.test(case1, alternative = "less")

##
## 	Fisher's Exact Test for Count Data
##
## data:  case1
## p-value = 0.04666
## alternative hypothesis: true odds ratio is less than 1
## 95 percent confidence interval:
##  0.0000000 0.9804613
## sample estimates:
## odds ratio
##   0.293905


To realize how difficult would be the interpretation of this result in a real concrete scenario, let's consider a basketball team where a top player finished a game scoring 14 field-goal out of 26 shooting. Base on Fisher's exact test, does it make sense to say that another player who made 6 out of 15 field-goals performed as well as the best player?

# Bootstrap tests

Alternative testings are possible by means of bootstrap re-sampling from the set (population) of all the matrices with the same row and column sums $R_i$ and $C_j$. The analyses will be accomplished with function tableBoots from the R packge named usefr.

## Hellinger test statistic

Hellinger test statistic has been proposed to compare discrete probability distributions (1). Basu et all have shown that the Hellinger divergence, as reported in 1, has asymptotically has a chi-square distribution with one degree of freedom under the null hypothesis. This a good property that makes this statistic Hellinger statistic suitable for bootstrap's test of two discrete populations.

library(usefr)
tableBoots(x = case1, stat = "hd", num.permut = 1999)

## [1] 0.0445


## Root Mean Square statistic

The Root-Mean-Square statistic has been also proposed to test differences between two discrete probability distributions (2)

library(usefr)
tableBoots(x = case1, stat = "rmst", num.permut = 1999)

## [1] 0.05


## Chi-square test

The $\chi^2$ statistic fails

library(usefr)
tableBoots(x = case1, stat = "chisq", num.permut = 1999)

## [1] 0.059


Assuming that expected discrete probability distribution (DPD) are:

p <- case1[1,]/sum(case1[1,])
p

##  mCG   CG
## 0.25 0.75


Then, we can test whether the treatment departs from the expected DPD:

chisq.test(x= case1[2,], p = p,
simulate.p.value = TRUE, B = 2e3)

##
## 	Chi-squared test for given probabilities with simulated p-value (based
## 	on 2000 replicates)
##
## data:  case1[2, ]
## X-squared = 11.538, df = NA, p-value = 0.001499


## The example 1 is not unique

The failure of Fisher's exact test is evident in the comparisons given below:

res <- lapply(seq(1,78,1), function(k) {
datos <- matrix(c(5, k+1, 15, k),  nrow = 2,
dimnames = list(Group = c("Ctrl", "Treat"),
Meth.status = c("mCG", "CG")))
datos
x <- datos/rowSums(datos)
x <- x[2,1] - x[1,1]
ft <- fisher.test(datos)
hd <- tableBoots(x = datos, stat = "hd", num.permut = 1999)
rmst <- tableBoots(x = datos, stat = "rmst", num.permut = 1999)
chisq <- tableBoots(x = datos, stat = "chisq", num.permut = 1999)

chisq.p <- chisq.test(x= datos[2,], p = datos[1,]/sum(datos[1,]),
simulate.p.value = TRUE, B = 2e3)$p.value c(meth.diff = x, FT.p.value = ft$p.value,
HD.p.value = hd,
RMST = rmst,
chisq = chisq,
chisq.p = chisq.p)
})
do.call(rbind, res)

##       meth.diff FT.p.value HD.p.value   RMST  chisq      chisq.p
##  [1,] 0.4166667 0.20948617     0.0670 0.1060 0.1235 0.1604197901
##  [2,] 0.3500000 0.28326746     0.1135 0.1130 0.1390 0.1009495252
##  [3,] 0.3214286 0.17506841     0.1140 0.1030 0.1220 0.0744627686
##  [4,] 0.3055556 0.20467538     0.1310 0.0965 0.1160 0.0454772614
##  [5,] 0.2954545 0.13181103     0.1120 0.1110 0.1130 0.0389805097
##  [6,] 0.2884615 0.14213458     0.0930 0.0895 0.0990 0.0204897551
##  [7,] 0.2833333 0.15672314     0.0980 0.0710 0.0900 0.0224887556
##  [8,] 0.2794118 0.10138414     0.0810 0.0795 0.0890 0.0114942529
##  [9,] 0.2763158 0.10534027     0.0775 0.0675 0.0895 0.0109945027
## [10,] 0.2738095 0.11089814     0.0750 0.0665 0.0685 0.0069965017
## [11,] 0.2717391 0.11750154     0.0705 0.0685 0.0735 0.0044977511
## [12,] 0.2700000 0.07769381     0.0665 0.0560 0.0705 0.0044977511
## [13,] 0.2685185 0.07887317     0.0620 0.0625 0.0700 0.0029985007
## [14,] 0.2672414 0.08061039     0.0530 0.0555 0.0595 0.0014992504
## [15,] 0.2661290 0.08276810     0.0690 0.0510 0.0585 0.0019990005
## [16,] 0.2651515 0.08523730     0.0580 0.0540 0.0465 0.0019990005
## [17,] 0.2642857 0.08793188     0.0600 0.0475 0.0505 0.0009995002
## [18,] 0.2635135 0.09078398     0.0575 0.0560 0.0575 0.0004997501
## [19,] 0.2628205 0.09374027     0.0510 0.0480 0.0575 0.0004997501
## [20,] 0.2621951 0.06031279     0.0585 0.0520 0.0470 0.0009995002
## [21,] 0.2616279 0.06080833     0.0585 0.0555 0.0515 0.0004997501
## [22,] 0.2611111 0.06141565     0.0455 0.0470 0.0510 0.0004997501
## [23,] 0.2606383 0.06211375     0.0520 0.0455 0.0500 0.0004997501
## [24,] 0.2602041 0.06288506     0.0395 0.0470 0.0555 0.0004997501
## [25,] 0.2598039 0.06371480     0.0470 0.0465 0.0505 0.0004997501
## [26,] 0.2594340 0.06459058     0.0580 0.0460 0.0550 0.0004997501
## [27,] 0.2590909 0.06550200     0.0490 0.0405 0.0480 0.0004997501
## [28,] 0.2587719 0.06644033     0.0385 0.0425 0.0360 0.0004997501
## [29,] 0.2584746 0.06739822     0.0430 0.0460 0.0440 0.0004997501
## [30,] 0.2581967 0.06836952     0.0420 0.0455 0.0435 0.0004997501
## [31,] 0.2579365 0.06934906     0.0420 0.0370 0.0495 0.0004997501
## [32,] 0.2576923 0.07033254     0.0420 0.0375 0.0450 0.0004997501
## [33,] 0.2574627 0.07131633     0.0340 0.0430 0.0465 0.0004997501
## [34,] 0.2572464 0.07229742     0.0435 0.0375 0.0445 0.0004997501
## [35,] 0.2570423 0.04633836     0.0360 0.0425 0.0525 0.0004997501
## [36,] 0.2568493 0.04646593     0.0340 0.0305 0.0405 0.0004997501
## [37,] 0.2566667 0.04660925     0.0490 0.0335 0.0435 0.0004997501
## [38,] 0.2564935 0.04676622     0.0415 0.0400 0.0350 0.0004997501
## [39,] 0.2563291 0.04693498     0.0330 0.0410 0.0415 0.0004997501
## [40,] 0.2561728 0.04711391     0.0390 0.0330 0.0465 0.0004997501
## [41,] 0.2560241 0.04730155     0.0400 0.0365 0.0435 0.0004997501
## [42,] 0.2558824 0.04749662     0.0405 0.0335 0.0405 0.0004997501
## [43,] 0.2557471 0.04769797     0.0395 0.0350 0.0330 0.0004997501
## [44,] 0.2556180 0.04790460     0.0380 0.0375 0.0365 0.0004997501
## [45,] 0.2554945 0.04811561     0.0360 0.0370 0.0445 0.0004997501
## [46,] 0.2553763 0.04833021     0.0300 0.0390 0.0375 0.0004997501
## [47,] 0.2552632 0.04854768     0.0400 0.0345 0.0335 0.0004997501
## [48,] 0.2551546 0.04876741     0.0315 0.0380 0.0370 0.0004997501
## [49,] 0.2550505 0.04898884     0.0350 0.0310 0.0355 0.0004997501
## [50,] 0.2549505 0.04921147     0.0370 0.0330 0.0470 0.0004997501
## [51,] 0.2548544 0.04943486     0.0405 0.0450 0.0435 0.0004997501
## [52,] 0.2547619 0.04965863     0.0345 0.0370 0.0320 0.0004997501
## [53,] 0.2546729 0.04988242     0.0335 0.0410 0.0335 0.0004997501
## [54,] 0.2545872 0.05010595     0.0355 0.0350 0.0370 0.0004997501
## [55,] 0.2545045 0.05032893     0.0300 0.0345 0.0425 0.0004997501
## [56,] 0.2544248 0.05055114     0.0290 0.0330 0.0365 0.0004997501
## [57,] 0.2543478 0.05077235     0.0385 0.0320 0.0375 0.0004997501
## [58,] 0.2542735 0.05099239     0.0400 0.0345 0.0320 0.0004997501
## [59,] 0.2542017 0.05121109     0.0345 0.0295 0.0345 0.0004997501
## [60,] 0.2541322 0.05142831     0.0335 0.0355 0.0325 0.0004997501
## [61,] 0.2540650 0.05164393     0.0330 0.0370 0.0370 0.0004997501
## [62,] 0.2540000 0.05185784     0.0320 0.0300 0.0360 0.0004997501
## [63,] 0.2539370 0.05206995     0.0370 0.0370 0.0365 0.0004997501
## [64,] 0.2538760 0.05228018     0.0325 0.0335 0.0375 0.0004997501
## [65,] 0.2538168 0.05248845     0.0300 0.0335 0.0315 0.0004997501
## [66,] 0.2537594 0.05269472     0.0280 0.0305 0.0350 0.0004997501
## [67,] 0.2537037 0.05289893     0.0270 0.0355 0.0365 0.0004997501
## [68,] 0.2536496 0.05310105     0.0320 0.0340 0.0380 0.0004997501
## [69,] 0.2535971 0.05330104     0.0405 0.0330 0.0390 0.0004997501
## [70,] 0.2535461 0.05349888     0.0290 0.0315 0.0355 0.0004997501
## [71,] 0.2534965 0.05369454     0.0260 0.0355 0.0310 0.0004997501
## [72,] 0.2534483 0.05388802     0.0300 0.0295 0.0340 0.0004997501
## [73,] 0.2534014 0.05407930     0.0360 0.0340 0.0315 0.0004997501
## [74,] 0.2533557 0.05426839     0.0290 0.0335 0.0305 0.0004997501
## [75,] 0.2533113 0.05445527     0.0285 0.0340 0.0290 0.0004997501
## [76,] 0.2532680 0.05463995     0.0335 0.0325 0.0260 0.0004997501
## [77,] 0.2532258 0.05482244     0.0295 0.0300 0.0370 0.0004997501
## [78,] 0.2531847 0.03529458     0.0285 0.0405 0.0310 0.0004997501


# Extreme situations found in methylation data sets

Unfortunate situations like that one shown below are not rare in methylation data. The minimum coverage (total counts) in the example below is 5.

case2 <- matrix(c(3, 90, 2, 10),  nrow = 2,
dimnames = list(Group = c("Ctrl", "Treat"),
Meth.status = c("mCG", "CG")))
case2

##        Meth.status
## Group   mCG CG
##   Ctrl    3  2
##   Treat  90 10


The small numbers of count in the control sample yields a bias estimation of the methylation level:

## Proportions
case2/rowSums(case2)

##        Meth.status
## Group   mCG  CG
##   Ctrl  0.6 0.4
##   Treat 0.9 0.1


There is not enough experimental data to ensure that 3/(3+2) = 0.6 is an unbiased estimated value of the methylation levels at the given cytosine site. Fisher's exact test is insensitive to this experimental fact.

fisher.test(case2)

##
## 	Fisher's Exact Test for Count Data
##
## data:  case2
## p-value = 0.09893
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  0.01735485 2.28134136
## sample estimates:
## odds ratio
##  0.1716671

set.seed(1)
st <- tableBoots(x = case2, stat = "hd", num.permut = 1999)
st

## [1] 0.0355


# References

1. Basu A, Mandal A, Pardo L. Hypothesis testing for two discrete populations based on the Hellinger distance. Stat Probab Lett. Elsevier B.V.; 2010;80: 206–214. doi:10.1016/j.spl.2009.10.008.
2. Perkins W, Tygert M, Ward R. Some deficiencies of χ2 and classical exact tests of significance. Appl Comput Harmon Anal. Elsevier Inc.; 2014;36: 361–386. doi:10.1016/j.acha.2013.06.002.

# Fitting Mixture Distributions

## 1. Background

Mixture Distributions were introduced in a Basically, it is said that a distribution $f(x)$ is a mixture of k components distributions $f_1(x), …, f_k(x)$ if:

$f(x) = \sum_{i=1}^k \pi_i f_i(x)$

where $\pi_i$ are the so called mixing weights, $0 \le \pi_i \le 1$, and $\pi_1 + … + \pi_k = 1$. Here, new data points from distribution will be generated in the standard way: first to pick a distribution, with probabilities given by the mixing weights, and then to generate one observation according to that distribution. More information about mixture distribution can be read in Wikipedia.

Herein, we will show how to fit numerical data to a mixture of probability distributions model.

## 2. Generating random variables from a mixture of Gamma and Weibull distributions

To generate from a mixture distribution the R package usefr will be used.

library(usefr) set.seed(123) # set a seed for random generation # ========= A mixture of three distributions ========= phi = c(3/10, 7/10) # Mixture proportions # --------------------------------------------------------- # === Named vector of the corresponding distribution function parameters # must be provided args <- list(gamma = c(shape = 2, scale = 0.1),               weibull = c(shape = 3, scale = 0.5)) # ------------------------------------------------------------ # ===== Sampling from the specified mixture distribution ==== X <- rmixtdistr(n = 1e5, phi = phi , arg = args)

### 2.1. The histogram of the mixture distribution

The graphics for the simulated dataset and the corresponding theoretical mixture distribution

hist(X, 90, freq = FALSE, las = 1, family = "serif",
panel.first={points(0, 0, pch=16, cex=1e6, col="grey95")
grid(col="white", lty = 1)}, family = "serif", col = "cyan1",
border = "deepskyblue", xlim = c(0, 1.5))
x1 <- seq(-4, 1.5, by = 0.001)
lines(x1, dmixtdistr(x1, phi = phi, arg = args), col = "red")

The nonlinear fit of this dataset is NOT straightforward!

## 3. Nonlinear fit of the random generated dataset

The nonlinear fit of the random generated data set will be accomplished with the function

FIT <- fitMixDist(X, args = list(gamma = c(shape = NULL, scale = NULL),weibull = c(shape = NULL, scale = NULL)), npoints = 200, usepoints = 1000)
fitting ...|========================================================================================================| 100%*** Performing nonlinear regression model crossvalidation...Warning messages:1: In dgamma(c(0.005, 0.015, 0.025, 0.035, 0.045, 0.055, 0.065, 0.075, :NaNs produced2: In dgamma(c(0.005, 0.015, 0.025, 0.035, 0.045, 0.055, 0.065, 0.075, :NaNs produced3: In dgamma(c(0.005, 0.015, 0.025, 0.035, 0.045, 0.055, 0.065, 0.075, :NaNs produced
summary(FIT$fit) #> Parameters: #> Estimate Std. Error t value Pr(>|t|) #> gamma.shape 1.828989 0.021444 85.29 <2e-16 *** #> gamma.scale 6.058377 0.108738 55.72 <2e-16 *** #> weibull.shape 3.449296 0.025578 134.85 <2e-16 *** #> weibull.scale 0.507976 0.001461 347.66 <2e-16 *** #> --- #> Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 #> Residual standard error: 0.02403 on 71 degrees of freedom #> Number of iterations to termination: 27 #> Reason for termination: Relative error in the sum of squares is at most ftol'.  ## 3.1. Graphics of the simulated dataset and the corresponding theoretical mixture distribution hist(X, 90, freq = FALSE, las = 1, family = "serif", panel.first={points(0, 0, pch=16, cex=1e6, col="grey95")grid(col="white", lty = 1)}, family = "serif", col = "seagreen1",border = "deepskyblue", xlim = c(0, 1.5), cex.lab = 1.2)x1 <- seq(-4, 10, by = 0.001)lines(x1, dmixtdistr(x1, phi = FIT$phi, arg = FIT$args), col = "red")mtext("Histogram of Gamma & Weibull Mixture Distributions.", cex = 1.4, font = 3, family = "serif") ## 3.2. Bootstrap goodness-of-fit test A bootstrap goodness-of-fit (GOF) test is performed with function mcgoftest. The parameter values are taken from the previous fitted mixture distribution model. Notice the particular way to set up the list of parameters. The Null hypothesis is that the data set follows a mixture of Gamma and Weibull distributions with the estimated parameter values. pars <- c(list(phi = FIT$phi), arg = list(FIT$args))mcgoftest(varobj = X, distr = "mixtdistr", pars = pars, num.sampl = 999, sample.size = 99999, stat = "chisq", num.cores = 4, breaks = 200, seed = 123) #> *** Monte Carlo GoF testing based on Pearson's Chi-squared statistic ( parametric approach ) ... #> Chisq mc_p.value sample.size num.sampl #> 815.1484 0.0010​ 99999.0000 999.0000 The GOF rejected the null hypothesis. In particular, the computation yielded several warnings about the fitting of Gamma distribution. Nevertheless, we can use the previous estimated parameter values to “guess” better starting values for the fitting algorithm: FIT <- fitMixDist(X, args = list(gamma = c(shape = 1.8, scale = 6),weibull = c(shape = 3, scale = 0.5)), npoints = 200, usepoints = 1000)summary(FIT$fit)
#> fitting ...
#>  |======================================================================================================================| 100%
#> *** Performing nonlinear regression model  crossvalidation...
#> Warning messages:
#> 1: In dgamma(c(0.005, 0.015, 0.025, 0.035, 0.045, 0.055, 0.065, 0.075,  :
#>  NaNs produced
#> 2: In dgamma(c(0.005, 0.015, 0.025, 0.035, 0.045, 0.055, 0.065, 0.075,  :
#>  NaNs produced
#> 3: In dgamma(c(0.005, 0.015, 0.025, 0.035, 0.045, 0.055, 0.065, 0.075,  :
#>  NaNs produced
#> Parameters:
#>               Estimate Std. Error t value Pr(>|t|)
#> gamma.shape   2.0050960  0.0265535   75.51   <2e-16 ***
#> gamma.scale   0.0953489  0.0017654   54.01   <2e-16 ***
#> weibull.shape 2.9221485  0.0127682  228.86   <2e-16 ***
#> weibull.scale 0.4957061  0.0007669  646.41   <2e-16 ***
#> ---
#> Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

#> Residual standard error: 0.02681 on 145 degrees of freedom
#> Number of iterations to termination: 17
#> Reason for termination: Relative error in the sum of squares is at most ftol'. 

The GOF test is now repeated with new estimated parameter values:

pars <- c(list(phi = FIT$phi), arg = list(FIT$args))mcgoftest(varobj = X, distr = "mixtdistr", pars = pars, num.sampl = 999,          sample.size = 99999, stat = "chisq", num.cores = 4, breaks = 200, seed = 123)
*** Monte Carlo GoF testing based on Pearson's Chi-squared statistic ( parametric approach )  ...
Chisq  mc_p.value sample.size   num.sampl
111.3769      0.9150  99999.0000    999.0000 

That is, for the last estimated parameters, there is not enough statistical reasons to reject the null hypothesis. Although the corresponding histogram looks pretty similar to the previous ones, the numbers indicate that there small fitting differences that our eyes cannot detect in the graphics.

# Sampling from a Mixture of Distributions

## Sampling from a Mixture of Distributions

It is said that a distribution $f(x)$ is a mixture of k components distributions $f_1(x), …, f_k(x)$ if:

$f(x) = \sum_{i=1}^k \pi_i f_i(x)$

where $\pi_i$ are the so called mixing weights, $0 \le \pi_i \le 1$, and $\pi_1 + … + \pi_k = 1$. Here, new data points from distribution will be generated in the standard way: first to pick a distribution, with probabilities given by the mixing weights, and then to generate one observation according to that distribution. More information about mixture distribution can be read in Wikipedia.

# 1. Generating random variables from a mixture of normal distributions

To generate from a mixture distribution the R package usefr will be used.

library(usefr)set.seed(123) # set a seed for random generation# ========= A mixture of three distributions =========phi = c(7/10, 3/10) # Mixture proportions# ---------------------------------------------------------

# === Named vector of the corresponding distribution function parameters
# must be provided
args <- list(norm = c(mean = 1, sd = 1), norm = c(mean = 5, sd = 1))# ------------------------------------------------------------
# ===== Sampling from the specified mixture distribution ====
x <- rmixtdistr(n = 1e5, pi = pi , arg = args)# ------------------------------------------------------------
# === The graphics for the simulated dataset and the corresponding theoretical
# mixture distribution
par(bg = "gray98", mar = c(3, 4, 2, 1) )
hist(x, 90, freq = FALSE, las = 1, family = "serif", col = rgb(0, 0, 1, 0.2), border = "deepskyblue")
x1 <- seq(-4, 10, by = 0.001)
lines(x1, dmixtdistr(x1, phi = phi, arg = args), col = "red")

# 2. Mixture of Weibull and Gamma distributions

Mixture of normal distributions is what most frequently we see online and in paper. Let’s see the mixture of Weibull and Gamma distributions.

set.seed(123) # set a seed for random generation # ==== A mixture of three distributions ===== pi = c(7/10, 3/10) # Mixture proportions
# ---------------------------------------------------------
# === Named vector of the corresponding distribution function parameters # must be provided args <- list(gamma = c(shape = 20, scale = 1/15), weibull = c(shape = 3, scale = 0.5)) # --------------------------------------------------------- # === Sampling from the specified mixture distribution ==== x <- rmixtdistr(n = 1e5, pi = pi , arg = args) # --------------------------------------------------------- # === The graphics for the simulated dataset and the corresponding theoretical # mixture distribution par(bg = "gray98", mar = c(3, 4, 2, 1) ) hist(x, 90, freq = FALSE, las = 1, family = "serif", col = "cyan1", border = "deepskyblue") x1 <- seq(-4, 10, by = 0.001) lines(x1, dmixtdistr(x1, pi = pi, arg = args), col = "red")

# 3. Mixture of Gamma, Weibull, and Log-Normal distributions

set.seed(123) # set a seed for random generation# =============== A mixture of three distributions ========================pi = c(5/10, 3/10, 2/10) # Mixture proportions# --------------------------------------------------------------------------# ==== Named vector of the corresponding distribution function parameters# must be providedargs <- list(gamma = c(shape = 20, scale = 1/10),             weibull = c(shape =  4, scale = 0.8),             lnorm = c(meanlog = 1.2, sdlog = 0.08))# --------------------------------------------------------------------------# ======= Sampling from the specified mixture distribution =======x <- rmixtdistr(n = 1e5, pi = pi , arg = args)# --------------------------------------------------------------------------# The graphics for the simulated dataset and the corresponding theoretical# mixture distributionpar(bg = "gray98",  mar = c(3, 4, 2, 1) )hist(x, 90, freq = FALSE, las = 1, family = "serif", col = "plum1", border = "violet")x1 <- seq(-4, 10, by = 0.001)lines(x1, dmixtdistr(x1, pi = pi, arg = args), col = "red")

# Goodness of fit with large Sample size

## 1. Background

The goodness of fit (GOF) tests frequently fail with real datasets when the sample size goes beyond 100. This issue is critical when working with experimental data where enviromental random noise cannot be prevented. Fortunately, permutation and Monte Carlo approaches for GOF could help to confront the issue. Herein, two examples with simulated data are presented.

## 2. Install and loading the R packages needed for the computation

To run a Monte Carlo (MC) approach for GOF we will install the R package usefr and its dependences from GitHub:

source("https://bioconductor.org/biocLite.R")
biocLite('BiocParallel')
install.packages(c("minpack.lm", "numDeriv", "devtools", "ggplot2", "gridExtra"),
dependencies=TRUE)
devtools::install_git("https://github.com/genomaths/usefr.git")

Next, to load the packages that are needed for this example:

suppressMessages(library(ggplot2))
suppressMessages(library(gridExtra))
suppressMessages(library(usefr))

## 2. Example 1

Let us simulate random samples from a specified Normal and Weibull distributions. To make reproducible this example, we set a seed.

set.seed(1)
x1 = rnorm(10000, mean = 1.5, sd = 2)
x2 = rweibull(10000, shape = 0.5, scale = 1.2)
dt <- data.frame(x1 = x1, x2 = x2)

## 2.1 The histograms and density plots of the given Normal and Weibull distributions

p1 <- ggplot(data = dt, aes(x1)) +
geom_histogram(data = dt, aes(y=..density..), binwidth = 1,
colour = "black", fill = "skyblue", na.rm=TRUE) +
stat_function(fun = dnorm, n = 101, col = "red",
args = list(mean = 1.5, sd = 2), size = 1) +
theme_gray(base_family = "serif", base_size = 14) +
annotate(geom = "text", x = 7, y = 0.16, size = 6,
label = 'bolditalic(N(1.5,2))',
family = "serif", color = "blue", parse = TRUE)

p2 <- ggplot(data = dt, aes(x2)) +
geom_histogram(data = dt, aes(y=..density..), binwidth = 1,
colour = "black", fill = "skyblue", na.rm=TRUE) +
xlim(0,20) + ylim(0,0.23) +
stat_function(fun = dweibull, n = 101, col = "red",
args = list(shape = 0.5, scale = 1.2), size = 1) +
theme_gray(base_family = "serif", base_size = 14) +
annotate(geom = "text", x = 10, y = 0.18, size = 6,
label = 'bolditalic(W(0.5, 1.2))',
family = "serif", color = "blue", parse = TRUE)
grid.arrange(p1, p2, nrow = 1)

The bell shape, typical of the normal distribution, is clear. Many people are not familiar with the Weibull distribution. The decay observed in the Weibull distribution presented in the figure is typically found for the for information divergences of DNA methylation levels, in humans and in plants [1].

## 3. The Monte Carlo GOF test

The test is performed with function mcgoftest from the R package usefr

mcgoftest(x1, cdf = pnorm, pars = c(1.5, 2), num.sampl = 500,
sample.size = 1000, num.cores = 1)
## *** Monte Carlo GoF testing based on Kolmogorov-Smirnov statistic ...##   KS.stat.D      mc_p.value KS.stat.p.value     sample.size       num.sampl ##   0.2534276       0.6007984       0.0000000    1000.0000000     500.0000000 

Kolmogorov-Smirnov (KS) test rejects the Null Hypothesis: KS_p.value $\simeq$ 0. According with KS test there is not enough statistical evidence to support the null hypothesis that the observed values of variable x1 follow a normal distribution with mean 1.5 and standard deviation 2. This looks paradoxical if we take into account that this sample was generated using the theoretical distribution. The fact is that computer algorithms to generate random numbers are not perfect and, so far, the numerical algorithms can only generate pseudo-random numbers, which mostly follow the theoretical distribution. The Monte Carlo sampling approach, however, does not reject the Null Hypothesis: mc_p.value = 0.5908184. However, the failure of KS test would be the result of more deep mathematical-philosophical issues.

The testing for the Weibull distribution yields similar results:

mcgoftest(x2, cdf = pweibull, pars = c(shape = 0.5, scale = 1.2), num.sampl = 500,
sample.size = 1000, num.cores = 1)
## *** Monte Carlo GoF testing based on Kolmogorov-Smirnov statistic ...##      KS.stat.D      mc_p.value KS.stat.p.value     sample.size       num.sampl ##      0.1947641       0.8323353       0.0000000    1000.0000000     500.0000000​

MC KS test does not reject the null hypothesis that variable x comes from Weibull(x|shape = 0.5, scale = 1.2), while the standard Kolmogorov-Smirnov test rejects the Null Hypothesis.

## References

1. Sanchez, Robersy, and Sally A. Mackenzie. 2016. “Information Thermodynamics of Cytosine DNA Methylation.” Edited by Barbara Bardoni. PLOS ONE 11 (3). Public Library of Science: e0150427. doi:10.1371/journal.pone.0150427.