DINGO Tutorial
Package installation and loading
To install all the packages needed for this tutorial, run this chunk. It is set to eval=F
so that it doesn’t reinstall every time you knit.
install.packages("dplyr")
install.packages("huge")
install.packages("iDINGO")
install.packages("igraph")
install.packages("matrixcalc")
install.packages("rmdformats")
The next chunk loads the libraries and defines a few basic functions we will use.
Data preprocessing
These data are from a metabolomics study conducted on HAPO study participants (Group and others 2002). Deidentified metabolomics measurements are available for 51 metabolites on 1600 women in four ancestry groups.
Load the data from Raji’s website:
data = read.csv("https://raw.githubusercontent.com/raji-lab/ENAR2021-Tutorial/main/Data/hapo_metabolomics_2020.csv") %>% filter(anc_gp %in% c("ag1","ag2"))
dim(data)
## [1] 800 54
## [1] 119
We have some missing data. First, see if any metabolite is missing in more than 25 percent of the samples, and if so, exclude it from the analysis.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000000 0.000000 0.000000 0.002755 0.002188 0.030000
No metabolites need to be excluded; the highest missingness is about 3.68 percent. Since this is an illustrative example, let’s not worry too much about imputing, but just use mean imputation on those missing values.
completeData = data
mxIndices = grep("mt",names(data))
completeData[,mxIndices] = apply(data[,mxIndices],2,impute)
Finally, we standardize the metabolites.
Running DINGO to get a point estimate of the global and local networks
At this point, we are ready to run the DINGO algorithm with the dingo
function from the iDINGO
package (Ha, Baladandayuthapani, and Do 2015)(Class et al. 2018). This should take about a minute. Because it is relatively slow, there’s a timer here for reference.
set.seed(42)
start_time_fast = Sys.time()
ag = ifelse(cleanData$anc_gp == "ag1",1,2)
model = dingo(dat = cleanData[,mxIndices],
x = ag,
diff.score = F,
verbose = T,
cores = 1)
## Step 1 of DINGO is finished at Thu Mar 10 17:38:53 2022
## Step 2 of DINGO is finished at Thu Mar 10 17:39:22 2022
## Time difference of 44.89071 secs
Browse the output:
## [1] "boot.diff" "diff.score" "genepair" "levels.x" "P"
## [6] "p.val" "Psi" "Q" "R1" "R2"
## [11] "rho" "step.times"
Take a look at the global component:
## [1] 51 51
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0.00000000 0.0304542 0.06583722 -0.09146267 0.0000000
## [2,] 0.02452015 0.0000000 0.34819236 0.08819875 0.0000000
## [3,] 0.02836992 0.1863643 0.00000000 -0.06145703 0.4028492
## [4,] -0.11639153 0.1394026 -0.18147846 0.00000000 0.1012357
## [5,] 0.00000000 0.0000000 0.42369064 0.03605565 0.0000000
And the local components:
## [1] 1275
## [1] 0.06401273 0.01978540 0.39047613 -0.24106757 0.27047173
## [1] 1275
## [1] 0.06264400 0.02290508 0.39128236 -0.24295443 0.26856958
Why aren’t these adjacency matrices like model$P
was? These are edge weights corresponding to the edges from the model$genepair
output variable:
## gene1 gene2
## 1 mt1_1 mt1_2
## 2 mt1_1 mt1_3
## 3 mt1_2 mt1_3
## 4 mt1_1 mt1_4
## 5 mt1_2 mt1_4
An easier way to see this is to look at a data frame of all of them:
dingoDF = data.frame("gene1"=model$genepair$gene1,
"gene2"=model$genepair$gene2,
"R1"=model$R1,
"R2"=model$R2)
head(dingoDF)
## gene1 gene2 R1 R2
## 1 mt1_1 mt1_2 0.06401273 0.06264400
## 2 mt1_1 mt1_3 0.01978540 0.02290508
## 3 mt1_2 mt1_3 0.39047613 0.39128236
## 4 mt1_1 mt1_4 -0.24106757 -0.24295443
## 5 mt1_2 mt1_4 0.27047173 0.26856958
## 6 mt1_3 mt1_4 -0.21654715 -0.21224948
Why would we want to do this? This format allows the user to organize attributes of each edge into a data frame, so we can add things like z-scores and p-values once we have them. This would be hard to do with an adjacency matrix.
Inference with bootstrap
What we have looked at so far is just a point estimate. If we look at the inferential attributes of the output, they’re empty:
## NULL
## NULL
To get the bootstrap differences, scores, and p-values for each edge, we have to run DINGO with the bootstrap option. Because this is time consuming, it’s not being run here, but the code is included below. We’ll just load the output from an .rda
file.
# set.seed(42)
# start_time_fast = Sys.time()
#
# ag = ifelse(cleanData$anc_gp == "ag1",1,2)
# modBoot = dingo(dat = cleanData[,mxIndices],
# x = ag,
# diff.score = T,
# B = 30,
# verbose = T,
# cores = 3)
#
# end_time_fast = Sys.time()
#
# total_time_fast = end_time_fast - start_time_fast
# total_time_fast
# save(modBoot,file="dingoResults.rda")
Now, we have the variables we need for inference:
## [1] 0.12328385 -0.14405408 -0.02869068 0.14093235 0.10098032
## [1] 0.8371529 0.6700149 0.8779664 0.8047253 0.8785260
dingoDF = data.frame("gene1"=modBoot$genepair$gene1,
"gene2"=modBoot$genepair$gene2,
"R1"=modBoot$R1,
"R2"=modBoot$R2,
"diff.value"=modBoot$R1 - modBoot$R2,
"diff.score"=modBoot$diff.score,
"p.val"=modBoot$p.val)
head(dingoDF)
## gene1 gene2 R1 R2 diff.value diff.score p.val
## 1 mt1_1 mt1_2 0.06401273 0.06264400 0.0013687354 0.12328385 0.8371529
## 2 mt1_1 mt1_3 0.01978540 0.02290508 -0.0031196832 -0.14405408 0.6700149
## 3 mt1_2 mt1_3 0.39047613 0.39128236 -0.0008062385 -0.02869068 0.8779664
## 4 mt1_1 mt1_4 -0.24106757 -0.24295443 0.0018868556 0.14093235 0.8047253
## 5 mt1_2 mt1_4 0.27047173 0.26856958 0.0019021468 0.10098032 0.8785260
## 6 mt1_3 mt1_4 -0.21654715 -0.21224948 -0.0042976659 -0.17872937 0.6114126
To determine which of the 1275 possible edges differs significantly between ancestry group 1 and ancestry group 2, we will use these p-values.
There’s a lot of edges below the usual 0.05 cutoff, as would be expected due to the multiple hypothesis testing problem!
## [1] 139
We can look at a simple Bonferroni correction and see 55 edges remain:
## [1] 55
To look at a graph of only those edges that differ between ancestry groups according to this threshold, we can filter the data frame above and select only the columns needed for an edge list: node names and an edge weight.
edgeList = dingoDF %>%
filter(p.val < 0.05/1275) %>%
select(gene1, gene2, diff.value) %>%
rename(weight = diff.value)
We can then use the igraph
package to make a graph object from this edgelist and plot it (Csardi, Nepusz, and others 2006).
myGraph = graph_from_edgelist(as.matrix(edgeList[,1:2]),directed=F)
E(myGraph)$weight = edgeList$weight
plot(myGraph,
layout = layout_with_dh)
Some hub structure is immediately apparent in the plot above. We can tweak a few more parameters of the visualization to make a plot that also shows the edge weights and the sign of the difference between ancestry group 1 and ancestry group 2:
plot(myGraph,
layout = layout_with_dh,
vertex.size = degree(myGraph),
edge.width = 20*abs(E(myGraph)$weight), # 20 is arbitrary
edge.color = ifelse(E(myGraph)$weight > 0, "red","blue"))
If we were continuing analysis on this dataset, we could be motivated to look into these three hub metabolites mt3_1
, mt3_7
, and mt3_10
, which seem to be central players in the network-level metabolomic differences between ancestry group 1 and 2.
Bonus: review the algebra
To understand the convolution step of DINGO, it may be helpful to see it actually put it together with the different parts of the model output:
PsiInv = solve(model$Psi)
x = c(1,1)
Q = model$Q
kappa = matrix.trace(Q %*% x %*% t(x) %*% t(Q) %*% PsiInv)
L = PsiInv - 1/(1+kappa)*PsiInv %*% Q %*% x %*% t(x) %*% t(Q) %*% PsiInv
G = model$P
N = t(diag(nrow(G)) - G) %*% L %*% (diag(nrow(G)) - G)
partialCorrelationMatrix = -cov2cor(N)[1:5,1]
model$R1[1:5] # this is the same thing as the partialCorrelationMatrix, but in edge list form
## [1] 0.06401273 0.01978540 0.39047613 -0.24106757 0.27047173
References
Class, Caleb A, Min Jin Ha, Veerabhadran Baladandayuthapani, and Kim-Anh Do. 2018. “IDINGO—Integrative Differential Network Analysis in Genomics with Shiny Application.” Bioinformatics 34 (7): 1243–5.
Csardi, Gabor, Tamas Nepusz, and others. 2006. “The Igraph Software Package for Complex Network Research.” InterJournal, Complex Systems 1695 (5): 1–9.
Group, HAPO Study Cooperative Research, and others. 2002. “The Hyperglycemia and Adverse Pregnancy Outcome (Hapo) Study.” International Journal of Gynecology & Obstetrics 78 (1): 69–77.
Ha, Min Jin, Veerabhadran Baladandayuthapani, and Kim-Anh Do. 2015. “DINGO: Differential Network Analysis in Genomics.” Bioinformatics 31 (21): 3413–20.