1 Install and load necessary packages

The following packages are used for the error-prone, time-to-event analysis presented here:

library(icensmis)
library(icRSF)

There are also some general packages we will need:

library(dplyr)
library(ggplot2)
library(knitr)
library(MASS)

2 Simulate error-prone time-to-event data

The icRSF package provides a function called simout for simulating error-prone time-to-event data. We will use this function to simulate data for this tutorial. To learn more about this function, run the line below in an interactive R session.

?icRSF::simout

Let’s assume a simple model in which we have a study of 1000 participants, half of whom are assigned a treatment and half of whom are kept as controls.

set.seed(2024)
trt_groups = matrix(rbinom(n=1000,size=1,prob = 0.5),ncol=1)
head(trt_groups)
##      [,1]
## [1,]    1
## [2,]    0
## [3,]    1
## [4,]    1
## [5,]    0
## [6,]    1

In our model, tests are conducted every two years for an 8-year period.

test_times = c(2,4,6,8)

The sensitivity and specificity of the self report are set to be 0.7 and 0.95, respectively:

sens = 0.7
spec = 0.95

We assume that 60 percent of the subjects will experience the event by the end of the study.

noevent = 0.4

Let’s assume our treatment is successful and set a negative coefficient for the treatment variable:

betas = c(-0.5)

Finally assume that after the first positive result, no further tests are given (“No Test after First Positive” or NTFP design).

design = "NTFP"

Using these variables, we simulate data with simout:

icmis_data = simout(Xmat = trt_groups,
                    testtimes = test_times,
                    sensitivity = sens,
                    specificity = spec,
                    betas = betas,
                    noevent = noevent,
                    design = design)

For further analysis, we add the information about treatment groups to the simulated data using a left_join.

trt_groups_df = data.frame(trt_groups)
trt_groups_df$ID = 1:nrow(trt_groups)

icmis_data_with_trt = left_join(icmis_data,trt_groups_df,by="ID")

3 Fit the modified Cox PH model with icmis

This section uses the models described in Gu, Ma, and Balasubramanian (2015). We can run icmis on the simulated data; we expect to recover a coefficient of about -0.5 for our treatment group based on the specification of beta above.

mod_fit <- icmis(subject = ID, 
                 testtime = time, 
                 result = result, 
                 data = icmis_data_with_trt,
                 sensitivity = sens, 
                 specificity = spec, 
                 formula = ~ factor(trt_groups),
                 param = 3,
                 control = list(maxit = 1000))

And compare to what we would recover if we assumed near-perfect sensitivity and specificity:

mod_fit_perf <- icmis(subject = ID, 
                 testtime = time, 
                 result = result, 
                 data = icmis_data_with_trt,
                 sensitivity = 0.999, 
                 specificity = 0.999, 
                 formula = ~ factor(trt_groups),
                 param = 3,
                 control = list(maxit = 1000))

Compare the coefficients estimated with the two methods:

kable(mod_fit$coefficient)
coefficient SE z p.value
factor(trt_groups)1 -0.5495611 0.1240429 -4.43041 9.4e-06
kable(mod_fit_perf$coefficient)
coefficient SE z p.value
factor(trt_groups)1 -0.3947894 0.087659 -4.503695 6.7e-06

Generate a forest plot of the estimated log hazard ratios and corresponding confidence intervals:

In addition to the regression coefficients \(\beta\), icmis estimates the survival function at each time point, \(\mathbf{\theta} = \theta_1, \dots, \theta_J\):

mod_fit$survival
##   time      surv     low95    high95
## 1    2 0.7875695 0.7374511 0.8292387
## 2    4 0.6193853 0.5620463 0.6714855
## 3    6 0.5280839 0.4697821 0.5829645
## 4    8 0.4388299 0.3802235 0.4958202

The plot_surv function can be used to make a graph from this output:

plot_surv(mod_fit)

We can compare this to the survival function estimated by the naive approach that assumes near-perfect sensitivity and specificity:

mod_fit_perf$survival
##   time      surv     low95    high95
## 1    2 0.8111321 0.7798133 0.8384650
## 2    4 0.6269462 0.5864172 0.6646908
## 3    6 0.4989580 0.4556948 0.5406401
## 4    8 0.3956140 0.3524478 0.4384191

Plotting the two together to compare:

plot_df = data.frame("time"=mod_fit$survival$time,
                     "icmis"=mod_fit$survival$surv,
                     "naive"=mod_fit_perf$survival$surv)
plot(plot_df$time,plot_df$icmis,type="s",
     xlab = "Time", ylab = "Survival(Time)",col="blue",lwd=3, main = "Naive Method vs. icmis",ylim=c(0,1))
lines(plot_df$time,plot_df$naive,type="s",col="red",lwd=3, lty=2)
legend("bottomleft", c("icmis","naive"),col=c("blue","red"), lty=c(1,2),lwd=c(3,3))

4 Design a study with icpower

This section applies the methods described in Gu and Balasubramanian (2016). The specified incidence and hazard ratios we chose for the study design are referenced from Tsadok et al. (2012) and Eliassen et al. (2016).

First, we provide an estimate for the survival function at each year based on a pre-specified annual cumulative incidence (often from prior studies):

getSurvivals = function(annualCumulativeIncidence,visitsPerYear,years)
{
  incidencePerVisit = annualCumulativeIncidence/visitsPerYear
  return(1-c(1:(years*visitsPerYear))*incidencePerVisit)
}

mySurv = getSurvivals(0.016, 1, 10)

Next, we plug this set of survival functions into the icpower function.

power80 = icpower(survivals = mySurv,
                  HR = 1.14,
                  sensitivity = 0.81,
                  specificity = 0.995,
                  power = 0.8,
                  rho = 0.5,
                  alpha = 0.05,
                  pmiss = 0.02,
                  pcensor = 0.01,
                  design = "MCAR" )

The output of icpower gives us the total sample size and the number of individuals in each group needed to achieve the desired power:

power80$result
##       N   N1   N2 power
## 1 12232 6116 6116   0.8

5 Perform high-dimensional variable selection with icRSF

This section demonstrates the methods described in Xu et al. (2018).

To begin, we will simulate a high-dimensional predictor set consisting of \(P=1000\) variables measured on \(N=500\) individuals.

library(MASS)
Xmat = MASS::mvrnorm(n=500,mu=rep(0,1000),Sigma=diag(1000))
colnames(Xmat) = paste0("Pred",1:1000)
Xmat[1:5,1:5]
##            Pred1      Pred2       Pred3       Pred4     Pred5
## [1,]  0.09107837 -0.8988530 -0.06404411 -0.89378671 0.6104948
## [2,]  0.18604260  1.2180332  1.14530263 -1.30165841 1.1851611
## [3,]  0.16707252  0.6721093 -1.00955358 -0.22864163 0.3027679
## [4,] -2.25449482 -0.4802918 -1.13358163 -0.01505257 0.1621946
## [5,] -0.42154141 -0.0628793 -0.24244102 -1.46962161 0.3401176

Of these 1000 variables, we select the first 50 to have a true association with survival, setting the log hazard ratio equal to 1:

beta = rep(0, 1000)
beta[1:50]=1

Now we simulate error-prone data corresponding to these covariates. Let’s keep the sensitivity and specificity from the stroke example as our test characteristics and test over a 10 year period.

simdata = simout(Xmat, 
                  testtimes=1:10, 
                  sensitivity=0.8, 
                  specificity=0.995, 
                  noevent=0.9, 
                  betas=beta, 
                  design='NTFP')

Next, we run variable importance calculations on the simulated data. To save time, after running this chunk one time, the results are saved to a file, this chunk is set to eval=F, and for later analysis the results are loaded from file.

rsf_res = icrsf(data=simdata, 
              subject=ID, 
              testtimes=time, 
              result=result, 
              sensitivity=0.8,
              specificity=0.995, 
              Xmat=Xmat, 
              root.size=10, 
              ntree=100, 
              ns=sqrt(ncol(Xmat)), 
              node=8)

save(rsf_res,file="rsf_res.rda")

Load the RSF results:

load("rsf_res.rda")

We see high variable importance for the first 50 predictors, the set of predictors that we established as truly associated with survival in our simulation.

boxplot(rsf_res[-c(1:50)],rsf_res[1:50],names = c("Other Predictors", "True Predictors"),main="RSF Variable Importance",ylab="Variable Importance Score")

6 Perform high-dimensional variable selection with Bayesian Variable Selection

First, we define a helper function to link together several pieces of the Bayesian Variable Selection (BVS) process.

bvs_fit = function(data, # outcome
                    Xmat, # covariates matrix
                    sensitivity, 
                    specificity, 
                    b, om1, om2, # hyperparameters
                    niter, # number of iterations
                    psample, # probability of updating coefficient
                    initsurv, # rate of event free at the study end
                    nreport = 10000, # every how many iterations to output parameters
                    nburn #  burn-in
                    ){
  if (niter < 10000) stop("Too small niter (at least 10000)")
  Dm <- icensmis:::dmat(data$ID, data$time, data$result, 
                        sensitivity, specificity, 1)
  fit <- icensmis:::bayesmc(Dm, Xmat, b, om1, om2, niter,
                            psample, initsurv, nreport,
                            icensmis:::fitsurv)
  icensmis:::gamma_mean(fit, nburn) 
}

Next, we apply this function to our simulated data. As in the RSF example, this chunk is set to eval=F and results are loaded from file to save time in the demonstration.

bvs_result = bvs_fit(data = simdata,
                      Xmat = Xmat,
                      sensitivity = 0.8, 
                      specificity = 0.995,
                      b = 1, 
                      om1 = 5, 
                      om2 = 100, 
                      niter = 50000, 
                      psample = 0.3, 
                      initsurv = 0.9,
                      nreport = 1000, 
                      nburn = 10000)
save(bvs_result,file="bvs_result.rda")

Load in the BVS result and see what the estimated values of the true predictors are.

load("bvs_result.rda")
df = data.frame(Metabolites = colnames(Xmat), 
                 Score = bvs_result, 
                 Rank = rank(-bvs_result),
                 TruePredictor = ifelse(1:ncol(Xmat) %in% 1:50,1,0))
head(df)
##   Metabolites    Score Rank TruePredictor
## 1       Pred1 0.008275  196             1
## 2       Pred2 0.270850   24             1
## 3       Pred3 0.000000  624             1
## 4       Pred4 0.058375   67             1
## 5       Pred5 0.622525   12             1
## 6       Pred6 1.000000    3             1

We can see that the variables we simulated to have a true association with the time-to-event have a higher variable importance score than the other predictors.

boxplot(df$Score ~ df$TruePredictor, names = c("Other Predictors","True Predictors"), ylab = "Variable Importance Score",xlab="",main="Bayesian Variable Selection")

References

Eliassen, Bent-Martin, Marita Melhus, Grethe S Tell, Kristin Benjaminsen Borch, Tonje Braaten, Ann Ragnhild Broderstad, and Sidsel Graff-Iversen. 2016. “Validity of Self-Reported Myocardial Infarction and Stroke in Regions with Sami and Norwegian Populations: The SAMINOR 1 Survey and the CVDNOR Project.” BMJ Open 6 (11): e012717.
Gu, Xiangdong, and Raji Balasubramanian. 2016. “Study Design for Non-Recurring, Time-to-Event Outcomes in the Presence of Error-Prone Diagnostic Tests or Self-Reports.” Statistics in Medicine 35 (22): 3961–75.
Gu, Xiangdong, Yunsheng Ma, and Raji Balasubramanian. 2015. “Semiparametric Time to Event Models in the Presence of Error-Prone, Self-Reported Outcomes—with Application to the Women’s Health Initiative.” The Annals of Applied Statistics 9 (2): 714.
Tsadok, Meytal Avgil, Cynthia A Jackevicius, Elham Rahme, Karin H Humphries, Hassan Behlouli, and Louise Pilote. 2012. “Sex Differences in Stroke Risk Among Older Patients with Recently Diagnosed Atrial Fibrillation.” Jama 307 (18): 1952–58.
Xu, Hui, Xiangdong Gu, Mahlet G Tadesse, and Raji Balasubramanian. 2018. “A Modified Random Survival Forests Algorithm for High Dimensional Predictors and Self-Reported Outcomes.” Journal of Computational and Graphical Statistics 27 (4): 763–72.