Stochastic simulations of large numbers of Gene Ontology Enrichment Analyses (GOEAs) generate simulated values of FDR, sensitivity, and specificity for GOEAs run using GOATOOLS.
Two categories of simulations are contained herein:
- Preparatory: Hypotheses and multiple-test simulations; elements include:
  - FDR or FWER simulations only
- Consequent: Gene Ontology Enrichment Analysis (GOEA) simulations; elements include:
  - Fisher's exact test
  - Benjamini/Hochberg FDR multiple-test correction
  - Gene ontology associations
All simulations shown use alpha=0.05.
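The core test in each GOEA is Fisher's exact test on a 2x2 contingency table (annotated vs. unannotated genes in the study group vs. the population). A minimal pure-Python sketch of the right-tailed (enrichment) P-value via the hypergeometric distribution; the gene counts below are illustrative, not taken from the simulations:

```python
from math import comb

def fisher_enrichment_pvalue(study_hits, study_n, pop_hits, pop_n):
    """Right-tailed Fisher's exact test: P(X >= study_hits), where X is the
    number of annotated genes drawn in a sample of study_n genes from a
    population of pop_n genes, pop_hits of which are annotated."""
    total = comb(pop_n, study_n)
    upper = min(study_n, pop_hits)
    return sum(
        comb(pop_hits, k) * comb(pop_n - pop_hits, study_n - k)
        for k in range(study_hits, upper + 1)
    ) / total

# Illustrative counts: 10 of 64 study genes annotated vs. 100 of 17,000 population genes
pval = fisher_enrichment_pvalue(10, 64, 100, 17000)
```

GOATOOLS applies this test to every GO ID in the association, which is why the multiple-test correction described below is needed.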
The first simulation results contained unacceptably high FDRs for 3 out of 20 gene groups, those containing 64 or 124 genes (Panels A3 and A4).
The simulations used randomly generated gene lists containing varying percentages of True Null (general population) genes and Non-True-Null (Humoral Response) genes.
High numbers of false positives were often associated with:
- GO IDs that are associated with an unusually high number of genes
- GO IDs that are purified, rather than enriched
If the simulations were re-run using only the enriched results in reports and figures, the simulations solidly PASSED, with FDRs very close to zero.
If the simulations were re-run with ~30 of the 17,000+ GO IDs stripped out of the association, the simulations also solidly PASSED. These 30 GO IDs were chosen to be purged because each was associated with more than 1,000 genes.
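Purging broadly annotated GO terms from an association can be sketched as below, where `assoc` maps each gene to its set of GO IDs and `max_genes` corresponds to the 1,000-gene threshold described above (the helper name and toy data are illustrative, not the repository's code):

```python
from collections import Counter

def purge_broad_goids(assoc, max_genes=1000):
    """Return a copy of the association with GO IDs that annotate
    more than max_genes genes removed."""
    counts = Counter(go for gos in assoc.values() for go in gos)
    broad = {go for go, n in counts.items() if n > max_genes}
    return {gene: gos - broad for gene, gos in assoc.items()}

# Toy association with a low threshold to demonstrate the purge
assoc = {
    "geneA": {"GO:0000001", "GO:0000002"},
    "geneB": {"GO:0000001", "GO:0000003"},
    "geneC": {"GO:0000001"},
}
purged = purge_broad_goids(assoc, max_genes=2)  # GO:0000001 annotates 3 genes
```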
In a stress test that randomized the associations of True-Null genes prior to simulation, the "enriched-gene" simulations FAILED, but the "30-GO-IDs-purged" simulations PASSED.
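The stress test's randomization can be sketched as follows: the GO-ID sets of the True-Null genes are permuted among those genes, breaking any real gene-to-GO relationship while preserving the overall annotation distribution. The function and gene names below are illustrative:

```python
import random

def shuffle_truenull_assoc(assoc, truenull_genes, rng=None):
    """Randomly reassign the GO-ID sets of True-Null genes among themselves."""
    rng = rng or random.Random(0)
    genes = [g for g in truenull_genes if g in assoc]
    goid_sets = [assoc[g] for g in genes]
    rng.shuffle(goid_sets)
    shuffled = dict(assoc)
    shuffled.update(zip(genes, goid_sets))
    return shuffled

assoc = {"g1": {"GO:1"}, "g2": {"GO:2"}, "g3": {"GO:3"}, "g4": {"GO:4"}}
shuffled = shuffle_truenull_assoc(assoc, ["g1", "g2", "g3"])  # g4 is untouched
```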
This simulation FAILS due to FDRs > 0.05 (A2: 64 and 124 genes; A3: 124 genes), but is close to passing.
Simulations of the underlying Benjamini/Hochberg multiple-test correction are a subset of the GOEA simulations. The inputs to these simulations are hypothesis test results (P-values), whereas the inputs to the GOEA simulations are gene lists containing various percentages of enriched genes.
The simulation inputs are groups of hypothesis test results (P-values) tagged as either True Nulls or Non-True Nulls:
- True Nulls are P-values randomly chosen from a uniform distribution of values between 0.0 and 1.0.
- Non-True Nulls (a.k.a. False Nulls) are P-values randomly chosen from a uniform distribution of values between:
  - 0.00 and 0.01 => Extremely different from the population.
  - 0.00 and 0.03 => Moderately different from the population.
  - 0.00 and 0.05 => Minimally different from the population.
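One simulated group under these assumptions can be sketched as follows: True-Null P-values are drawn uniformly from [0, 1], Non-True-Null P-values from [0, max_sig] (0.01, 0.03, or 0.05), and Benjamini/Hochberg is applied at alpha=0.05. This is a pure-Python BH step-up procedure, not the GOATOOLS implementation, and the group sizes are illustrative:

```python
import random

def benjamini_hochberg(pvals, alpha=0.05):
    """Return a reject (discovery) flag per P-value using the BH step-up procedure."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    max_k = 0  # largest rank k with p_(k) <= (k/m) * alpha
    for rank, idx in enumerate(order, start=1):
        if pvals[idx] <= rank / m * alpha:
            max_k = rank
    reject = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_k:
            reject[idx] = True
    return reject

rng = random.Random(0)
num_true_null, num_non_true_null, max_sig = 96, 32, 0.01  # "Extreme" range
pvals = ([rng.uniform(0.0, 1.0) for _ in range(num_true_null)] +
         [rng.uniform(0.0, max_sig) for _ in range(num_non_true_null)])
is_true_null = [True] * num_true_null + [False] * num_non_true_null
reject = benjamini_hochberg(pvals)
num_fp = sum(r and t for r, t in zip(reject, is_true_null))
fdr = num_fp / max(sum(reject), 1)  # false-discovery proportion for this one group
```

The simulated FDR reported in the figures is the mean of this false-discovery proportion over many such groups.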
- The worst (highest) simulated FDR means are equal to alpha (0.05) for all simulation sets with 100% True Nulls (A1, B1, and C1).
- As the percentage of True Nulls drops, the FDR also drops:
  - row 1, with 100% True Nulls, has the highest mean FDR (0.05), while
  - row 4, with 0% True Nulls, has the lowest mean FDR (0.012).
- The simulated mean FDRs are the same across all study group sizes. For example, in A2,
  - the leftmost boxplot (blue dots), showing the groups of 4 tested hypotheses, has the same mean FDR as
  - the rightmost boxplot (red dots), showing the groups of 128 tested hypotheses.
- 100% of Non-True Nulls are discovered if their uncorrected P-value is Extreme (below 0.01 when alpha is 0.05) (A2-A4).
- Moderate Non-True-Null P-values (B2-B3) are discovered less frequently than Extreme P-values (A2-A3).
- Minimal Non-True-Null P-values (C2-C4) are discovered even less frequently than Moderate P-values (B2-B4).
- Smaller groups of tested hypotheses (B3, blue bar) have more discoveries than larger groups (B3, red bar).
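The discovery rates above are sensitivity values: the fraction of Non-True Nulls that are correctly discovered. For one simulated group, sensitivity can be computed as in this sketch (names illustrative):

```python
def sensitivity(reject, is_true_null):
    """Fraction of Non-True Nulls discovered: TP / (TP + FN)."""
    num_non_true_null = sum(1 for t in is_true_null if not t)
    if num_non_true_null == 0:
        return 0.0
    num_tp = sum(r and not t for r, t in zip(reject, is_true_null))
    return num_tp / num_non_true_null

# Example: four hypotheses, two Non-True Nulls, one of which is discovered
sens = sensitivity([True, True, False, False], [True, False, True, False])
```

Specificity is computed analogously over the True Nulls (TN / (TN + FP)).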
Copyright (C) 2016-present, DV Klopfenstein, Haibao Tang. All rights reserved.