DINGO Tutorial

Package installation and loading

To install all the packages needed for this tutorial, run this chunk. It is set to eval=F so that it doesn’t reinstall every time you knit.

install.packages("dplyr")
install.packages("huge")
install.packages("iDINGO")
install.packages("igraph")
install.packages("matrixcalc")
install.packages("rmdformats")

The next chunk loads the libraries and defines a few basic functions we will use.

library(dplyr)
library(huge)
library(iDINGO)
library(igraph)
library(matrixcalc)
library(rmdformats)

impute = function(x)
{
  imputeVal = mean(x,na.rm=T) 
  x[is.na(x)] = imputeVal
  return(x)
}

standardizeMetabolite = function(x)
{
  x = (x - mean(x))/sd(x)
  return(x)
}

Data preprocessing

These data are from a metabolomics study conducted on HAPO study participants (Group and others 2002). Deidentified metabolomics measurements are available for 51 metabolites on 1600 women in four ancestry groups.

Load the data from Raji’s website:

data = read.csv("https://raw.githubusercontent.com/raji-lab/ENAR2021-Tutorial/main/Data/hapo_metabolomics_2020.csv") %>% filter(anc_gp %in% c("ag1","ag2"))
dim(data)
## [1] 800  54
sum(is.na(data))
## [1] 119

We have some missing data. First, see if any metabolite is missing in more than 25 percent of the samples, and if so, exclude it from the analysis.

mxMissingProp = apply(data,2,function(x){sum(is.na(x))/800})
summary(mxMissingProp)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## 0.000000 0.000000 0.000000 0.002755 0.002188 0.030000

No metabolites need to be excluded; the highest missingness is about 3.68 percent. Since this is an illustrative example, let’s not worry too much about imputing, but just use mean imputation on those missing values.

completeData = data
mxIndices = grep("mt",names(data))
completeData[,mxIndices] = apply(data[,mxIndices],2,impute)

Finally, we standardize the metabolites.

cleanData = completeData
completeData[,mxIndices] = apply(cleanData[,mxIndices],2,standardizeMetabolite)

Running DINGO to get a point estimate of the global and local networks

At this point, we are ready to run the DINGO algorithm with the dingo function from the iDINGO package (Ha, Baladandayuthapani, and Do 2015)(Class et al. 2018). This should take about a minute. Because it is relatively slow, there’s a timer here for reference.

set.seed(42)
start_time_fast = Sys.time()

ag = ifelse(cleanData$anc_gp == "ag1",1,2)
model = dingo(dat = cleanData[,mxIndices],
            x = ag,
            diff.score = F, 
            verbose = T,
            cores = 1) 
## Step 1 of DINGO is finished at Thu Mar 10 17:38:53 2022 
## Step 2 of DINGO is finished at Thu Mar 10 17:39:22 2022
end_time_fast = Sys.time()

total_time_fast = end_time_fast - start_time_fast
total_time_fast
## Time difference of 44.89071 secs

Browse the output:

ls(model)
##  [1] "boot.diff"  "diff.score" "genepair"   "levels.x"   "P"         
##  [6] "p.val"      "Psi"        "Q"          "R1"         "R2"        
## [11] "rho"        "step.times"

Take a look at the global component:

dim(model$P)
## [1] 51 51
model$P[1:5,1:5]
##             [,1]      [,2]        [,3]        [,4]      [,5]
## [1,]  0.00000000 0.0304542  0.06583722 -0.09146267 0.0000000
## [2,]  0.02452015 0.0000000  0.34819236  0.08819875 0.0000000
## [3,]  0.02836992 0.1863643  0.00000000 -0.06145703 0.4028492
## [4,] -0.11639153 0.1394026 -0.18147846  0.00000000 0.1012357
## [5,]  0.00000000 0.0000000  0.42369064  0.03605565 0.0000000

And the local components:

length(model$R1)
## [1] 1275
model$R1[1:5]
## [1]  0.06401273  0.01978540  0.39047613 -0.24106757  0.27047173
length(model$R2)
## [1] 1275
model$R2[1:5]
## [1]  0.06264400  0.02290508  0.39128236 -0.24295443  0.26856958

Why aren’t these adjacency matrices like model$P was? These are edge weights corresponding to the edges from the model$genepair output variable:

model$genepair[1:5,]
##   gene1 gene2
## 1 mt1_1 mt1_2
## 2 mt1_1 mt1_3
## 3 mt1_2 mt1_3
## 4 mt1_1 mt1_4
## 5 mt1_2 mt1_4

An easier way to see this is to look at a data frame of all of them:

dingoDF = data.frame("gene1"=model$genepair$gene1,
                     "gene2"=model$genepair$gene2,
                     "R1"=model$R1,
                     "R2"=model$R2)
head(dingoDF)
##   gene1 gene2          R1          R2
## 1 mt1_1 mt1_2  0.06401273  0.06264400
## 2 mt1_1 mt1_3  0.01978540  0.02290508
## 3 mt1_2 mt1_3  0.39047613  0.39128236
## 4 mt1_1 mt1_4 -0.24106757 -0.24295443
## 5 mt1_2 mt1_4  0.27047173  0.26856958
## 6 mt1_3 mt1_4 -0.21654715 -0.21224948

Why would we want to do this? This format allows the user to organize attributes of each edge into a data frame, so we can add things like z-scores and p-values once we have them. This would be hard to do with an adjacency matrix.

Inference with bootstrap

What we have looked at so far is just a point estimate. If we look at the inferential attributes of the output, they’re empty:

model$diff.score
## NULL
model$p.val
## NULL

To get the bootstrap differences, scores, and p-values for each edge, we have to run DINGO with the bootstrap option. Because this is time consuming, it’s not being run here, but the code is included below. We’ll just load the output from an .rda file.

# set.seed(42)
# start_time_fast = Sys.time()
# 
# ag = ifelse(cleanData$anc_gp == "ag1",1,2)
# modBoot = dingo(dat = cleanData[,mxIndices],
#             x = ag,
#             diff.score = T,
#             B = 30,
#             verbose = T,
#             cores = 3) 
# 
# end_time_fast = Sys.time()
# 
# total_time_fast = end_time_fast - start_time_fast
# total_time_fast
# save(modBoot,file="dingoResults.rda")
load("dingoResults.rda")

Now, we have the variables we need for inference:

modBoot$diff.score[1:5]
## [1]  0.12328385 -0.14405408 -0.02869068  0.14093235  0.10098032
modBoot$p.val[1:5]
## [1] 0.8371529 0.6700149 0.8779664 0.8047253 0.8785260
dingoDF = data.frame("gene1"=modBoot$genepair$gene1,
                     "gene2"=modBoot$genepair$gene2,
                     "R1"=modBoot$R1,
                     "R2"=modBoot$R2,
                     "diff.value"=modBoot$R1 - modBoot$R2,
                     "diff.score"=modBoot$diff.score,
                     "p.val"=modBoot$p.val)
head(dingoDF)
##   gene1 gene2          R1          R2    diff.value  diff.score     p.val
## 1 mt1_1 mt1_2  0.06401273  0.06264400  0.0013687354  0.12328385 0.8371529
## 2 mt1_1 mt1_3  0.01978540  0.02290508 -0.0031196832 -0.14405408 0.6700149
## 3 mt1_2 mt1_3  0.39047613  0.39128236 -0.0008062385 -0.02869068 0.8779664
## 4 mt1_1 mt1_4 -0.24106757 -0.24295443  0.0018868556  0.14093235 0.8047253
## 5 mt1_2 mt1_4  0.27047173  0.26856958  0.0019021468  0.10098032 0.8785260
## 6 mt1_3 mt1_4 -0.21654715 -0.21224948 -0.0042976659 -0.17872937 0.6114126

To determine which of the 1275 possible edges differs significantly between ancestry group 1 and ancestry group 2, we will use these p-values.

hist(dingoDF$p.val,breaks=100)

There’s a lot of edges below the usual 0.05 cutoff, as would be expected due to the multiple hypothesis testing problem!

sum(dingoDF$p.val < 0.05)
## [1] 139

We can look at a simple Bonferroni correction and see 55 edges remain:

sum(dingoDF$p.val < 0.05/1275)
## [1] 55

To look at a graph of only those edges that differ between ancestry groups according to this threshold, we can filter the data frame above and select only the columns needed for an edge list: node names and an edge weight.

edgeList = dingoDF %>% 
  filter(p.val < 0.05/1275) %>% 
  select(gene1, gene2, diff.value) %>%
  rename(weight = diff.value)

We can then use the igraph package to make a graph object from this edgelist and plot it (Csardi, Nepusz, and others 2006).

myGraph = graph_from_edgelist(as.matrix(edgeList[,1:2]),directed=F)
E(myGraph)$weight = edgeList$weight
plot(myGraph, 
     layout = layout_with_dh)

Some hub structure is immediately apparent in the plot above. We can tweak a few more parameters of the visualization to make a plot that also shows the edge weights and the sign of the difference between ancestry group 1 and ancestry group 2:

plot(myGraph, 
     layout = layout_with_dh,
     vertex.size = degree(myGraph),
     edge.width = 20*abs(E(myGraph)$weight), # 20 is arbitrary
     edge.color = ifelse(E(myGraph)$weight > 0, "red","blue"))

If we were continuing analysis on this dataset, we could be motivated to look into these three hub metabolites mt3_1, mt3_7, and mt3_10, which seem to be central players in the network-level metabolomic differences between ancestry group 1 and 2.

Bonus: review the algebra

To understand the convolution step of DINGO, it may be helpful to see it actually put it together with the different parts of the model output:

PsiInv = solve(model$Psi)
x = c(1,1)
Q = model$Q
kappa = matrix.trace(Q %*% x %*% t(x) %*% t(Q) %*% PsiInv)
L = PsiInv - 1/(1+kappa)*PsiInv %*% Q %*% x %*% t(x) %*% t(Q) %*% PsiInv
G = model$P
N = t(diag(nrow(G)) - G) %*% L %*% (diag(nrow(G)) - G)
partialCorrelationMatrix = -cov2cor(N)[1:5,1]
model$R1[1:5] # this is the same thing as the partialCorrelationMatrix, but in edge list form
## [1]  0.06401273  0.01978540  0.39047613 -0.24106757  0.27047173

References

Class, Caleb A, Min Jin Ha, Veerabhadran Baladandayuthapani, and Kim-Anh Do. 2018. “IDINGO—Integrative Differential Network Analysis in Genomics with Shiny Application.” Bioinformatics 34 (7): 1243–5.

Csardi, Gabor, Tamas Nepusz, and others. 2006. “The Igraph Software Package for Complex Network Research.” InterJournal, Complex Systems 1695 (5): 1–9.

Group, HAPO Study Cooperative Research, and others. 2002. “The Hyperglycemia and Adverse Pregnancy Outcome (Hapo) Study.” International Journal of Gynecology & Obstetrics 78 (1): 69–77.

Ha, Min Jin, Veerabhadran Baladandayuthapani, and Kim-Anh Do. 2015. “DINGO: Differential Network Analysis in Genomics.” Bioinformatics 31 (21): 3413–20.