SpiderLearner Quick Start
Installing and Loading the ensembleGGM Package
Begin by installing and loading the devtools package, then use the install_github function to install ensembleGGM from GitHub as follows:
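A minimal sketch of the installation, assuming the repository location given at the end of this guide (katehoffshutta/ensembleGGM):

```r
# Install devtools from CRAN if needed, then pull ensembleGGM from GitHub
if (!requireNamespace("devtools", quietly = TRUE))
  install.packages("devtools")
library(devtools)
install_github("katehoffshutta/ensembleGGM")
library(ensembleGGM)
```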
Loading Example Data
Next, load the example data. Note that you will need to install the affy and curatedOvarianData packages from Bioconductor.
if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")

# Uncomment these lines to install the affy and curatedOvarianData packages
# (installation takes time, so re-comment them afterward)
# BiocManager::install("affy")
# BiocManager::install("curatedOvarianData")

library(affy)
library(curatedOvarianData)
Extracting and Standardizing Example Data
For illustration, we extract an example dataset (GSE32062.GPL6480_eset) from the curatedOvarianData package and select a subset of genes related to ovarian carcinoma based on the Human Phenotype Ontology (Köhler et al. 2021):
standardize = function(x){return((x-mean(x))/sd(x))}
data(GSE32062.GPL6480_eset)
lateStage = exprs(GSE32062.GPL6480_eset)
# Extract a subset of genes related to ovarian carcinoma
# based on the Human Phenotype Ontology
# See https://hpo.jax.org/app/browse/term/HP:0025318
lateStageSmall = lateStage[c(1680,1681,3027,4564,8930,12243,12245,13694,13695,13701,13979,16082,16875,17980),]
lateStageSmall = t(lateStageSmall)
names(lateStageSmall) = colnames(lateStageSmall)
lateStageSmall = apply(lateStageSmall,2,standardize)
head(lateStageSmall)
## BRCA1 BRCA2 CDKN2A DMPK KRAS PALB2
## GSM794865 -0.6028802 -1.1376194 0.2887462 -0.03206431 -0.8508601 -1.3167310
## GSM794866 -0.1237956 0.7974723 -1.0629278 -0.89750567 -0.3283184 0.8708805
## GSM794867 0.9028595 -0.7873664 1.8230450 -0.50683914 -0.2154519 -0.3221758
## GSM794868 1.2158611 -0.5421297 0.6851006 0.04632488 -1.2685259 -0.9095274
## GSM794869 -0.3436371 -0.2762188 0.7401463 -0.22931356 0.6232824 0.5098349
## GSM794870 0.4111565 0.5665514 -1.2593233 1.30307811 0.6981820 -0.3248203
## PALLD PTCH1 PTCH2 PTEN RAD51C
## GSM794865 1.91246451 -0.09995631 0.28531547 0.2503601 -1.19871005
## GSM794866 -1.50509530 1.34329261 0.82345264 -0.8031427 -1.06716226
## GSM794867 -1.83488960 0.31336992 0.26316065 -0.6280457 0.10481385
## GSM794868 -0.00508719 -0.98669751 0.07568094 0.1233805 -0.09129142
## GSM794869 0.20471774 0.35693082 0.37732193 0.3551492 0.02784237
## GSM794870 1.53759831 -0.97097949 -1.22914835 1.1162712 0.17535862
## SMAD4 SUFU TP53
## GSM794865 -0.04516039 0.4075326 0.3881490
## GSM794866 -0.63087621 0.8761307 0.1602356
## GSM794867 -1.04189988 -1.2253780 0.7926124
## GSM794868 0.18237901 0.6599834 1.0135767
## GSM794869 -0.05990108 -0.2091264 0.9350137
## GSM794870 -0.11255515 0.8103267 -0.6199303
Instantiating the SpiderLearner and Adding Candidates
Instantiate a SpiderLearner object with the SpiderLearner$new() function, and add candidates as desired:
s = SpiderLearner$new()
apple = HugeEBICCandidate$new(gamma = 0)
banana = HugeEBICCandidate$new(gamma = 0.5)
clementine = HugeRICCandidate$new()
date = HGlassoCandidate$new()
elderberry = MLECandidate$new()
fraise = HugeStARSCandidate$new(thres = 0.05)
grape = HugeStARSCandidate$new(thres = 0.1)
honeydew = QGraphEBICCandidate$new(gamma = 0)
icewine = QGraphEBICCandidate$new(gamma = 0.5)
candidates = list(apple,
                  banana,
                  clementine,
                  date,
                  elderberry,
                  fraise,
                  grape,
                  honeydew,
                  icewine)

for(candidate in candidates)
{
  s$addCandidate(candidate)
}
Running the SpiderLearner
Here is the syntax for running the model. Output is suppressed here for space.
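A sketch of the call; the fold-count argument name K is an assumption, so check the package documentation for the exact signature:

```r
# Fit the SpiderLearner ensemble on the standardized data
# with 10-fold cross-validation (K is an assumed argument name)
slResults = s$runSpiderLearner(lateStageSmall, K = 10)
```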
There are two ways to access the results. The first is the object we've saved as slResults, which is returned by the runSpiderLearner function. The results are also stored in the SpiderLearner object itself and can be accessed with the getResults function. Note that getResults returns only the most recent set of results; therefore, if you wish to change your library with addCandidate or deleteCandidate and then run the model again, you should save the results as a separate object each time.
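Both access paths can be inspected side by side by listing the element names of each (a sketch; getResults is assumed to be called on the SpiderLearner object itself):

```r
names(slResults)       # results returned by runSpiderLearner
names(s$getResults())  # the same results, stored in the object
```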
## [1] "foldsNets" "fullModels" "optTheta"
## [4] "simpleMeanNetwork" "weights"
## [1] "foldsNets" "fullModels" "optTheta"
## [4] "simpleMeanNetwork" "weights"
Investigating SpiderLearner Results
A good starting point for investigating results is to look at the weights of each candidate method.
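The weights can be retrieved with the getWeights() function used later in this guide, presumably like so:

```r
# Inspect how much each candidate contributes to the ensemble
s$getWeights()
```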
## method weight
## 1 ebic_0 7.609961e-01
## 2 ebic_0.5 3.705783e-09
## 3 ric 1.554953e-08
## 4 hglasso 5.487284e-08
## 5 mle 2.261278e-01
## 6 stars_0.05 3.705798e-09
## 7 stars_0.1 3.118281e-09
## 8 qgraph_ebic_0 1.287595e-02
## 9 qgraph_ebic_0.5 4.009458e-08
We can plot the GGM for the SpiderLearner ensemble model using the plotSpiderLearner function:
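For example (a sketch; plotSpiderLearner is assumed to be a method on the SpiderLearner object, and its arguments may differ):

```r
# Plot the ensemble GGM
s$plotSpiderLearner()
```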
We can also plot the GGM corresponding to any of the candidate methods using the plotCandidate function with the method identifier as an argument; for example, here is the MLE:
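A sketch, assuming plotCandidate takes the identifier used in the weights table:

```r
# Plot a single candidate's GGM; "mle" is the identifier
# shown in the weights table above
s$plotCandidate("mle")
```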
The adjacency matrix of the estimated GGM can also be accessed with the getGGM function:
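For example (a sketch; getGGM is assumed to take no arguments):

```r
# Extract the matrix of estimated partial correlations
adjMatrix = s$getGGM()
adjMatrix[1:5, 1:5]
```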
## BRCA1 BRCA2 CDKN2A DMPK KRAS
## BRCA1 0.000000000 0.0089647641 0.01646969 0.1137168716 0.07360712
## BRCA2 0.008964764 0.0000000000 0.01408362 0.0009387959 0.17306543
## CDKN2A 0.016469693 0.0140836240 0.00000000 0.0360062990 -0.01496000
## DMPK 0.113716872 0.0009387959 0.03600630 0.0000000000 0.03968742
## KRAS 0.073607123 0.1730654308 -0.01496000 0.0396874153 0.00000000
The \(i,j^{th}\) entry in this matrix represents the estimated partial correlation between the \(i^{th}\) and \(j^{th}\) variables in this dataset.1
Running More Ensembles
It is straightforward to change the library, the number of folds, or the dataset and run SpiderLearner again using the same object. For example, we can remove the hub graphical lasso and the MLE as candidate methods using the following syntax:
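A sketch, assuming deleteCandidate takes the method identifier shown in the weights table:

```r
# Remove candidates by identifier; "hglasso" and "mle" are the
# identifiers shown in the weights table above
s$deleteCandidate("hglasso")
s$deleteCandidate("mle")
```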
We can check what’s in our library now with the printLibrary function:
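For example (assuming printLibrary takes no arguments):

```r
# List the identifiers of the candidates currently in the library
s$printLibrary()
```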
## [1] "ebic_0" "ebic_0.5" "ric" "stars_0.05"
## [5] "stars_0.1" "qgraph_ebic_0" "qgraph_ebic_0.5"
Finally, we can run our model again. Say that this time, we want to use 5 folds; we can modify that parameter here as well.
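A sketch of the re-run; as before, the fold-count argument name K is an assumption:

```r
# Re-fit with 5 folds and save the results as a separate object
slResults5 = s$runSpiderLearner(lateStageSmall, K = 5)
```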
Now, when we use the getWeights() function, we will get the results for our latest analysis:
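For example:

```r
s$getWeights()
```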
## method weight
## 1 ebic_0 9.788951e-01
## 2 ebic_0.5 4.260849e-08
## 3 ric 8.836936e-08
## 4 stars_0.05 4.260849e-08
## 5 stars_0.1 5.089110e-08
## 6 qgraph_ebic_0 2.110453e-02
## 7 qgraph_ebic_0.5 1.270060e-07
Contact Us / Contribute
This package is new, and any and all suggestions are welcome. You can use GitHub to raise issues, contribute, or communicate with us about the package:
https://github.com/katehoffshutta/ensembleGGM
In particular, we would love to add more GGM estimation methods as Candidate objects, and we welcome contributions in that area.
References
Köhler, Sebastian, Michael Gargano, Nicolas Matentzoglu, Leigh C Carmody, David Lewis-Smith, Nicole A Vasilevsky, Daniel Danis, et al. 2021. “The Human Phenotype Ontology in 2021.” Nucleic Acids Research 49 (D1): D1207–D1217.
Rolfs, Benjamin T, and Bala Rajaratnam. 2013. “A Note on the Lack of Symmetry in the Graphical Lasso.” Computational Statistics & Data Analysis 57 (1): 429–34.
Note that there is a known lack of symmetry in the graphical lasso-estimated precision matrix (Rolfs and Rajaratnam 2013) and, consequently, in the matrix of partial correlations estimated by SpiderLearner. In the ensembleGGM package, we address this by averaging the \(i,j^{th}\) and \(j,i^{th}\) entries of the adjacency matrix to obtain a symmetric matrix, consistent with the fact that partial correlation should be symmetric.