This document shows all code, intermediate outcomes, and final outcomes belonging to the manuscript “Simulating Progression-Free and Overall Survival for First-Line Doublet Chemotherapy With or Without Bevacizumab in Metastatic Colorectal Cancer Patients Based on Real-World Registry Data” by K Degeling et al. in PharmacoEconomics (available here). Using real world data from the Treatment of Recurrent and Advanced Colorectal Cancer registry, two parametric survival models and a logistic regression model are estimated to populate a discrete event simulation model. Please read the manuscript before reading through this document, as the manuscript contains information essential for understanding decisions made in the code that is included in this document.

Please appropriately cite the work above if you use any code provided in this document for your own analyses.

This document contains the following sections:

Please note that the number of bootstrap iterations and simulation runs performed is set to a lower number for illustration purposes due to the required runtime to perform the 10,000 iterations/simulations that are performed for the final analyses presented in the manuscript. Hence, results and plots in this document may be different from those reported in the manuscript. Results presented in the manuscript should be considered as the final and definitive results at all times.

Initialization

To intialize the data analysis, several actions are performed. Most importantly, the data are loaded from an encrypted hard drive, as are the input matrices defined by clinicians to recode the treatments as observed in the unprocessed data.

# Clear Workspace and plots, and clear console
rm(list=ls()); gc(); if(!is.null(dev.list())) dev.off(); cat("\014");

# Load packages
library(dplyr);
library(readxl);
library(stringr);

# Set the locale so R handles dates as English regardless of location
Sys.setlocale("LC_TIME", "English");

# Load the raw TRACC data for the patient characteristics and first line treatment
df <- read_excel("G:/Files/mCRC Targeting Melbourne/CRC MODELING DATA 20181105.xlsx", 
                 col_types = c("numeric", "date", "text", "text",
                               "text", "text", "text", "date", "date", 
                               "date", "date", "numeric", "text", 
                               "text", "text", "text", "text", "text", 
                               "text", "text", "text", "text", "text", 
                               "text", "text", "text", "text", "text", 
                               "text", "text", "numeric", "text", 
                               "text", "text", "text", "text", "text", 
                               "text", "text", "text", "text", "text", 
                               "text", "text", "text", "text", "text", 
                               "text", "text", "text", "numeric", 
                               "text", "text", "text", "text", "date", 
                               "text", "text", "text", "date", "text", 
                               "text", "text", "text", "text", "date",
                               "numeric","text",
                               "text", "text", "text", "text", "date", 
                               "date", "date", "text", "text", "text", 
                               "text", "text", "text", "text", "numeric", 
                               "text", "numeric", "text", "numeric", 
                               "text", "text", "text", "numeric", 
                               "text", "numeric", "text", "numeric", 
                               "text", "numeric", "text", "numeric", 
                               "text", "numeric", "text", "numeric", 
                               "text", "numeric", "text", "numeric", 
                               "text", "numeric", "text", "text", 
                               "numeric", "text", "text"));

# Replace spaces in column names with dots/points to handle columns more easily
colnames(df) <- str_replace_all(string=colnames(df), pattern=" ", replace=".");

# Load the chemo and biologic classification by Hui-Li & Koen
chemo.class <- read_excel("G:/Files/mCRC Targeting Melbourne/Chemotherapy Classification 20181123.xlsx");
biolo.class <- read_excel("G:/Files/mCRC Targeting Melbourne/Biological Classification 20181123.xlsx");

Step 1: Cleaning the TRACC Data

Before all models can be fitted, the data need to be cleaned and the appropriate patient population is to be selected. To this end, several steps are performed:

  1. Initial data cleaning and patient selection
  2. Cleaning of patient characteristics and treatment history
  3. Cleaning of clinical outcomes for first-line treatment
  4. Cleaning of clinical outcomes for further treatment lines
  5. Final patient selection

Step 1.1: Initial data cleaning and patient selection

In the first step of the data cleaning, the patient selection is performed. All data required for this selection are recoded and cleaned in this step as well. Patients included are those …

  • who received chemotherapy with or without bevacizumab
  • treated with a palliative treatment intent
  • with a treatment start date after 1 January 2009 and before 1 June 2017
# Inclusion is based on whether the first line of palliative treatment included
# doublet chemotherapy with or without bevacizumab. Patients for whom treatment
# intent was indicated as curative or potentially curative, are excluded.

# Only patients with recorded first-line chemotherapy treatment are considered for inclusion
df <- df %>%
  filter(!is.na(REGIMEN.1L)) %>%
  filter(TREATMENTINTENT=="Palliative only")

# Recode first-line treatment information for inclusion
df <- df %>%
  mutate(L1.Regimen = REGIMEN.1L) %>%
  mutate(L1.Biologic = BIOLOGIC.1L) %>%
  mutate(L1.Start.Date = as.character(START.DATE.1L))

# In the TRACC registry, all regimen in a treatment line need to be recoded first. 
# This has been manually done in an Excel file, which has been loaded in the global 
# environment as object 'chemo.class' and will now be used to recode the regimen.
# The 'chemo.class' object contains a column 'Recorded.Regimen' with all uncoded
# treatments and a columns 'Classification.G0' with the recoded treatments. The
# column 'L1.Chemo.Eligible' is added to indicate whether a patient received doublet
# chemotherapy and, hence, is eligible for inclusion.
df <- df %>%
  left_join(select(chemo.class, Recorded.Regimen, Classification.G0), by=c("L1.Regimen" = "Recorded.Regimen")) %>%
  rename(L1.Chemo.Class = Classification.G0) %>%
  mutate(L1.Chemo.Class = ifelse(is.na(L1.Chemo.Class), "None", L1.Chemo.Class)) %>%
  mutate(L1.Chemo.Eligible = ifelse(L1.Chemo.Class=="Doublet", T, F))

# Since only patients who had a doublet chemotherapy with or without bevacizumab,
# so not with another additional agent like cetuximab, are include, it need to be
# checked whether patients had bevacizumab yes or no, and whether they are
# eligible based on any other biologicals. This is done based on the first
# reported biological
df <- df %>%
  left_join(select(biolo.class, Recorded.Biologicals, Classification.G0), by=c("L1.Biologic" = "Recorded.Biologicals")) %>%
  rename(L1.Biologic.Class=Classification.G0) %>%
  mutate(L1.Biologic.Class = ifelse(is.na(L1.Biologic.Class), "None", L1.Biologic.Class)) %>%
  mutate(L1.Biologic.Eligible = ifelse(L1.Biologic.Class %in% c("None", "Bevacizumab"), T, F)) %>%
  mutate(L1.Bevacizumab = ifelse(L1.Biologic.Class=="Bevacizumab", T, F))

# Final selection of patients is performed based on whether they had ...
# 1) doublet chemotherapy with or without bevacizumab and no other biological
# 2) treatment start date after 2009-01-01
# 3) treatment start date before 2017-06-01
df<- df %>%
  filter(L1.Chemo.Eligible) %>%
  filter(L1.Biologic.Eligible) %>%
  filter(L1.Start.Date >= as.Date("2009-01-01")) %>%
  filter(L1.Start.Date < as.Date("2017-06-01"))

# Allocate patients to the appropriate treatment arm
df <- df %>%
  mutate(L1.Arm.ChemBev = ifelse(L1.Bevacizumab, T, F))

Step 1.2: Cleaning of patient characteristics and treatment history

The second step in cleaning the data is making sure all variables are recoded in relevant categories and all observations are categorized in one of these defined categories. The following variables are extracted from the data:

  • Age at treatment start
  • Gender
  • Hospital type (private or public)
  • Hospital location (located in Victoria or not)
  • Stage at initial diagnosis
  • Primary tumour site
  • Presence of peritoneal metastases
  • Presence of liver metastases
  • Number of metastatic sites
  • ECOG performance status
  • Presence of comorbidities
  • Charlson comorbidity index
  • CEA level
  • RAS mutation status
  • BRAF mutation status
  • Treatment intent
  • Treatment start year
df <- df %>%
  
  # Age in years at treatment start
  mutate(Age = floor(as.numeric(difftime(L1.Start.Date, DOB, units="weeks")/52))) %>%         

  # Gender is female or not
  mutate(Gender.Female = ifelse(GENDER=="F", T, F)) %>%                                       
  
  # Hospital type is private or not
  mutate(Hospital.Private = ifelse(TLOCATIONTYPE=="Private", T, F)) %>%
  
  # Hospital is located in Victoria or not
  mutate(Hospital.Victoria = ifelse(STATE=="VIC", T, F)) %>%
  
  # Stage of disease at initial diagnosis
  mutate(Stage.Diagnosis = ifelse(STAGE.AT.DIAGNOSIS=="Unknown", NA, NA))%>%
  mutate(Stage.Diagnosis = ifelse(STAGE.AT.DIAGNOSIS=="Locally advanced (rectal only)", 0, Stage.Diagnosis)) %>%
  mutate(Stage.Diagnosis = ifelse(STAGE.AT.DIAGNOSIS=="Stage I", 1, Stage.Diagnosis)) %>%
  mutate(Stage.Diagnosis = ifelse(STAGE.AT.DIAGNOSIS=="Stage II", 2, Stage.Diagnosis)) %>%
  mutate(Stage.Diagnosis = ifelse(STAGE.AT.DIAGNOSIS=="Stage III", 3, Stage.Diagnosis)) %>%
  mutate(Stage.Diagnosis = ifelse(STAGE.AT.DIAGNOSIS=="Stage IV", 4, Stage.Diagnosis)) %>%
  
  # Primary tumour site
  mutate(Primary.Site = ifelse(PRIMARYSITE=="Occult", NA, NA)) %>%
  mutate(Primary.Site = ifelse(PRIMARYSITE=="Colon", "Colon", Primary.Site)) %>%
  mutate(Primary.Site = ifelse(PRIMARYSITE=="Colon NOS", "Colon", Primary.Site)) %>%
  mutate(Primary.Site = ifelse(PRIMARYSITE=="Left colon", "Left", Primary.Site)) %>%
  mutate(Primary.Site = ifelse(PRIMARYSITE=="Right colon", "Right", Primary.Site)) %>%
  mutate(Primary.Site = ifelse(PRIMARYSITE=="Rectum", "Rectum", Primary.Site)) %>%
  mutate(Primary.Site = ifelse(PRIMARYSITE=="Multiple primaries", "Colon", Primary.Site)) %>%
  
  # Specific metastatic sites
  mutate(Metastases.Liver = ifelse(is.na(LIVER.METS), F, T)) %>%
  mutate(Metastases.Peritoneum = ifelse(is.na(PERITONEUM.OMENTUM.METS), F, T)) %>%
  
  # Number of metastatic sites
  mutate(Metastases.Number = NUMBER.OF.METS) %>%
  
  # ECOG Performance Status
  mutate(L1.ECOG = as.numeric(str_sub(ECOG.First.Line, 1, 1))) %>%
  
  # Whether the patient has comorbidities or not
  mutate(Comorbidities = ifelse(NO.COMORBIDITIES=="Yes", FALSE, TRUE)) %>%
  
  # Recode the comorbidities
  mutate(MYOCARDIAL.INFARCTION = ifelse(is.na(MYOCARDIAL.INFARCTION) | MYOCARDIAL.INFARCTION=="No", 0, 1)) %>%
  mutate(CONGESTIVE.HEART.FAILURE = ifelse(is.na(CONGESTIVE.HEART.FAILURE) | CONGESTIVE.HEART.FAILURE=="No", 0, 1)) %>%
  mutate(PERIPHERAL.VASCULAR.DISEASE = ifelse(is.na(PERIPHERAL.VASCULAR.DISEASE) | PERIPHERAL.VASCULAR.DISEASE=="No", 0, 1)) %>%
  mutate(CEREBROVASCULAR.DISEASE = ifelse(is.na(CEREBROVASCULAR.DISEASE) | CEREBROVASCULAR.DISEASE=="No", 0, 1)) %>%
  mutate(DEMENTIA = ifelse(is.na(DEMENTIA) | DEMENTIA=="No", 0, 1)) %>%
  mutate(COPD = ifelse(is.na(COPD) | COPD=="No", 0, 1)) %>%
  mutate(CONNECTIVE.TISSUE.DISEASE = ifelse(is.na(CONNECTIVE.TISSUE.DISEASE) | CONNECTIVE.TISSUE.DISEASE=="No", 0, 1)) %>%
  mutate(PEPTIC.ULCER.DISEASE = ifelse(is.na(PEPTIC.ULCER.DISEASE) | PEPTIC.ULCER.DISEASE=="No", 0, 1)) %>%
  mutate(MILD.LIVER.DISEASE = ifelse(is.na(MILD.LIVER.DISEASE) | MILD.LIVER.DISEASE=="No", 0, 1)) %>%
  mutate(DIABETES.NO.COMPS = ifelse(is.na(DIABETES.NO.COMPS) | DIABETES.NO.COMPS=="No", 0, 1)) %>%
  mutate(DIABETES.COMPS = ifelse(is.na(DIABETES.COMPS) | DIABETES.COMPS=="No", 0, 1)) %>%
  mutate(HEMIPLEGIA.STROKE = ifelse(is.na(HEMIPLEGIA.STROKE) | HEMIPLEGIA.STROKE=="No", 0, 1)) %>%
  mutate(RENAL.DISEASE.MOD.SEVERE = ifelse(is.na(RENAL.DISEASE.MOD.SEVERE) | RENAL.DISEASE.MOD.SEVERE=="No", 0, 1)) %>%
  mutate(SECOND.TUMOUR.NO.METS = ifelse(is.na(SECOND.TUMOUR.NO.METS) | SECOND.TUMOUR.NO.METS=="No", 0, 1)) %>%
  mutate(LYMPHOMA.MYELOMA = ifelse(is.na(LYMPHOMA.MYELOMA) | LYMPHOMA.MYELOMA=="No", 0, 1)) %>%
  mutate(LEUKAEMIA = ifelse(is.na(LEUKAEMIA) | LEUKAEMIA=="No", 0, 1)) %>%
  mutate(LIVER.DIS.MOD.SEVERE = ifelse(is.na(LIVER.DIS.MOD.SEVERE) | LIVER.DIS.MOD.SEVERE=="No", 0, 1)) %>%
  mutate(IMMUNE.DEFICIENCY = 0) %>%
  mutate(SECOND.MET.SOLID.TUMOUR = ifelse(is.na(SECOND.MET.SOLID.TUMOUR) | SECOND.MET.SOLID.TUMOUR=="No", 0, 1)) %>%
  
  # Charlson comorbidity score (NOT CORRECTED FOR AGE!!!)
  mutate(Charlson =  
           (1 * MYOCARDIAL.INFARCTION) + 
           (1 * CONGESTIVE.HEART.FAILURE) + 
           (1 * PERIPHERAL.VASCULAR.DISEASE) + 
           (1 * CEREBROVASCULAR.DISEASE) + 
           (1 * DEMENTIA) + 
           (1 * COPD) + 
           (1 * CONNECTIVE.TISSUE.DISEASE) + 
           (1 * PEPTIC.ULCER.DISEASE) + 
           (1 * MILD.LIVER.DISEASE) + 
           (1 * DIABETES.NO.COMPS) + 
           (2 * DIABETES.COMPS) + 
           (2 * HEMIPLEGIA.STROKE) + 
           (2 * RENAL.DISEASE.MOD.SEVERE) + 
           (2 * SECOND.TUMOUR.NO.METS) + 
           (2 * LYMPHOMA.MYELOMA) + 
           (2 * LEUKAEMIA) + 
           (3 * LIVER.DIS.MOD.SEVERE) + 
           (6 * IMMUNE.DEFICIENCY) +
           (6 * SECOND.MET.SOLID.TUMOUR)
  ) %>%
  
  # CEA Level
  mutate(CEA.Level = CEALEVEL) %>%

  # RAS status
  mutate(RAS.Mutated = ifelse(RAS.STATUS=="Pending", NA, NA)) %>%
  mutate(RAS.Mutated = ifelse(RAS.STATUS=="Unknown", NA, RAS.Mutated)) %>%
  mutate(RAS.Mutated = ifelse(RAS.STATUS=="Mutated", T, RAS.Mutated)) %>%
  mutate(RAS.Mutated = ifelse(RAS.STATUS=="Wild-type", F, RAS.Mutated)) %>%
  
  # BRAF status
  mutate(BRAF.Mutated = ifelse(BRAFSTATUS=="Pending", NA, NA)) %>%
  mutate(BRAF.Mutated = ifelse(BRAFSTATUS=="Unknown", NA, BRAF.Mutated)) %>%
  mutate(BRAF.Mutated = ifelse(BRAFSTATUS=="Mutated", T, BRAF.Mutated)) %>%
  mutate(BRAF.Mutated = ifelse(BRAFSTATUS=="Wild-type", F, BRAF.Mutated)) %>%
  
  # Treatment intent
  mutate(Treatment.Intent = ifelse(TREATMENTINTENT=="Curative resection occurred", "Curative", NA)) %>%
  mutate(Treatment.Intent = ifelse(TREATMENTINTENT=="Curative resection planned", "Curative", Treatment.Intent)) %>%
  mutate(Treatment.Intent = ifelse(TREATMENTINTENT=="Potentially curative resection (if good response to treatment)", "Potentially", Treatment.Intent)) %>%
  mutate(Treatment.Intent = ifelse(TREATMENTINTENT=="Palliative only", "Palliative", Treatment.Intent)) %>%
  
  # Treatment start
  mutate(Treatment.Start = format(as.Date(L1.Start.Date), "%Y"))

Step 1.3: Cleaning of clinical outcomes for first-line treatment

Clinical outcomes for the first line of treatment are extrated from the data in this third step of the data cleaning. In the manucript, only progression-free survival is considered as clinical outcome for first-line treatment. First-line progression is defined as physician-defined progression (so a patient will be able to receive further treatment or follow up) or death.

# First, the treatment stop date, date of progression, and date of death need to be extraced from the data
df<- df %>%  
  mutate(L1.Stop.Date = as.character(STOP.DATE.1L)) %>%
  mutate(L1.Progression.Date = as.character(PROGRESSION.DATE.First.Line)) %>%
  mutate(Death.Date = as.character(DATEOFDEATH))

# The progression-free survival (PFS) and overall survival (OS) are calculated using the
# treatment start date as origin
df <- df %>% 
  mutate(L1.PFS.Days = as.numeric(difftime(L1.Progression.Date, L1.Start.Date, units="days"))) %>%
  mutate(L1.OS.Days = as.numeric(difftime(Death.Date, L1.Start.Date, units="days"))) %>%
  mutate(L1.CENS.Days = as.numeric(difftime(L1.Stop.Date, L1.Start.Date, units="days"))) %>%
  mutate(L1.CENS.Days = ifelse(is.na(L1.CENS.Days), as.numeric(difftime(DATEOFREVIEW, L1.Start.Date, units="days")), L1.CENS.Days))


# Determine which of the events occured to the patient with regard to first line progression:
# 0) Censored:      when neither progression or death is recorded
# 1) Progression:   when progression is recorded
# 2) Death:         when progression is not recorded but death is
df <- df %>%
  mutate(L1.PFS.Events = ifelse(!is.na(L1.OS.Days) & (L1.OS.Days<=L1.PFS.Days | is.na(L1.PFS.Days)), "Death", NA)) %>%
  mutate(L1.PFS.Events = ifelse(!is.na(L1.PFS.Days) & (L1.PFS.Days<L1.OS.Days | is.na(L1.OS.Days)), "Progression", L1.PFS.Events)) %>%
  mutate(L1.PFS.Events = ifelse(is.na(L1.PFS.Days) & is.na(L1.OS.Days), "Censored", L1.PFS.Events))

# Determine whether an event occured or the patient was censored and the correspoding time
df <- df %>% 
  mutate(L1.PFS.Event = ifelse(L1.PFS.Events=="Censored", 0, 1)) %>%
  mutate(L1.PFS.Time = ifelse(L1.PFS.Events=="Progression", L1.PFS.Days, NA)) %>%
  mutate(L1.PFS.Time = ifelse(L1.PFS.Events=="Death", L1.OS.Days, L1.PFS.Time)) %>%
  mutate(L1.PFS.Time = ifelse(L1.PFS.Events=="Censored", L1.CENS.Days, L1.PFS.Time))

# Since a negative PFS or OS is not possible, patients with a non-positive time-to-event
# are excluded from the analysis, as their data is assumed to be incorrect
df <- df %>%
  mutate(L1.PFS.Include = ifelse(!is.na(L1.PFS.Time) & L1.PFS.Time > 0, T, F));

Step 1.4: Cleaning of clinical outcomes for further treatment lines

With regard to the clinical outcomes of treatment lines or follow up after first-line progression, only overall survival is considered. Overall survival is extracted from the data in this fourth step of the data cleaning.

# Time till death after progression
df <- df %>% 
  mutate(LX.OS.Days = as.numeric(difftime(DATEOFDEATH, L1.Progression.Date, units="days"))) %>%
  mutate(LX.CENS.Days = as.numeric(difftime(DATEOFREVIEW, L1.Progression.Date, units="days")))

# Determine whether the patient was censored:
# 0) Censored:      when death is not recorded
# 1) Death:         when death is recorded
df <- df %>%
  mutate(LX.OS.Events = ifelse(!is.na(LX.OS.Days), "Death", NA)) %>%
  mutate(LX.OS.Events = ifelse(is.na(LX.OS.Days) & !is.na(LX.CENS.Days), "Censored", LX.OS.Events)) %>%
  mutate(LX.OS.Events = ifelse(L1.PFS.Events=="Progression", LX.OS.Events, NA))

# Determine whether death occured or the patient was censored and the correspoding time
df <- df %>% 
  mutate(LX.OS.Event = ifelse(LX.OS.Events=="Censored", 0, 1)) %>%
  mutate(LX.OS.Time = ifelse(LX.OS.Events=="Death", LX.OS.Days, NA)) %>%
  mutate(LX.OS.Time = ifelse(LX.OS.Events=="Censored", LX.CENS.Days, LX.OS.Time))

# Since a negative PFS or OS is not possible, patients with a non-positive time-to-event
# are excluded from the analysis, as their data is assumed to be incorrect
df <- df %>%
  mutate(LX.OS.Include = ifelse(L1.PFS.Include & !is.na(LX.OS.Time) & LX.OS.Time > 0, T, F))

Step 1.5: Final patient selection

Patients with a negative progression-free survival or overall survival are excluded from the sample, as it is assumed that these are data entry errors. Finally, for the included patients, an overview table of their characteristics is generated.

# Only include patients who at least are included for first-line treatment
df <- df %>%
  filter(L1.PFS.Include)

# Only the new colomns need to be maintained
df <- df %>%
  select(L1.Regimen:LX.OS.Include)

# General function to get an overview table of the patient characteristics
getTable <- function(data, column.class=NULL, columns.names=NULL, columns.discrete=NULL, chisq.simulate=T, chisq.nsim=10000, export=F, export.file=NULL) {
  
  if(is.null(columns.names)) {
    columns.names <- c("Age", "Gender.Female", "Hospital.Private", "Stage.Diagnosis", "Primary.Site", 
                       "Metastases.Number", "Metastases.Liver", "Metastases.Peritoneum", "L1.ECOG",
                       "Charlson", "CEA.Level", "RAS.Mutated", "BRAF.Mutated");
    columns.discrete <- c(F, T, T, T, T, T, T, T, T, T, F, T, T);
  }
  
  if(is.null(column.class)) {
    column.class <- "L1.Arm.ChemBev";
  }
  
  if(length(columns.names)!=length(columns.discrete)) stop("OBJECTS 'columns.names' AND 'columns.discrete' SHOULD HAVE EQUAL LENGTHS!!!");
  
  classification <- unlist(data[, column.class]);
  if(length(unique(classification))!=2) stop("NEEDS A CLASSIFICATION THAT RESULTS IN TWO CLASSES!!!")
  
  out.discrete <- matrix(c(table(classification), NA, NA, NA), nrow=1);
  rownames(out.discrete) <- "n";
  colnames(out.discrete) <- c(rep(names(table(classification)),2 ), "p-value");
  for(column in columns.names) {
    
    variable <- unlist(data[, column]);
    
    if(columns.discrete[which(columns.names==column)[1]]) {
      
      variable[is.na(variable)] <- "Missings";
      
      counts <- table(variable, classification);
      rownames(counts) <- str_c(column, "-", rownames(counts));
      
      percentages <- t(t(counts) / out.discrete[1, 1:2])*100;
      
      chisq <- chisq.test(counts, simulate.p.value = chisq.simulate, B = chisq.nsim)
      chisq.p <- c(chisq$p.value, rep(NA, nrow(counts)-1));
      
      out.discrete <- rbind(out.discrete, cbind(counts, percentages, chisq.p));
      
    } else {
      
      means <- tapply(variable, classification, mean, na.rm=T);
      sd <- tapply(variable, classification, sd, na.rm=T);
      NA.counts <- tapply(is.na(variable), classification, sum);
      NA.percentages <- tapply(is.na(variable), classification, mean)*100;
      
      ttest <- t.test(x = variable[classification==unique(classification[1])],
                      y = variable[classification!=unique(classification[1])],
                      alteranative = "two-sided", paired = F);
      
      out <- rbind(c(means, sd, ttest$p.value), 
                   c(NA.counts, NA.percentages, NA));
      rownames(out) <- str_c(column, c("-MeanSD", "-Missings"));
      
      out.discrete <- rbind(out.discrete, out);
      
    }    
  }
  
  colnames(out.discrete)[1:2] <- str_c(str_c(column.class, "==", colnames(out.discrete)[1:2]), " (#/Mean)");
  colnames(out.discrete)[3:4] <- str_c(str_c(column.class, "==", colnames(out.discrete)[3:4]), " (%/SD)");
  
  if(export) write.table(out.discrete, file=export.file, sep=";");
  
  return(out.discrete)
}

# Export Characteristics Tables
Table1.L1.PFS <- getTable(data=filter(df, L1.PFS.Include), export=F);
Table1.LX.OS <- getTable(data=filter(df, LX.OS.Include), export=F);

Overview table for patients included for first-line treatment:

L1.Arm.ChemBev==FALSE (#/Mean) L1.Arm.ChemBev==TRUE (#/Mean) L1.Arm.ChemBev==FALSE (%/SD) L1.Arm.ChemBev==TRUE (%/SD) p-value
n 191.00000 676.00000 NA NA NA
Age-MeanSD 63.17801 62.54882 13.9234652 12.145134 0.5714513
Age-Missings 0.00000 0.00000 0.0000000 0.000000 NA
Gender.Female-FALSE 118.00000 407.00000 61.7801047 60.207101 0.7425257
Gender.Female-TRUE 73.00000 269.00000 38.2198953 39.792899 NA
Hospital.Private-FALSE 115.00000 337.00000 60.2094241 49.852071 0.0121988
Hospital.Private-TRUE 76.00000 339.00000 39.7905759 50.147929 NA
Stage.Diagnosis-0 8.00000 35.00000 4.1884817 5.177515 0.2915708
Stage.Diagnosis-1 0.00000 11.00000 0.0000000 1.627219 NA
Stage.Diagnosis-2 10.00000 35.00000 5.2356021 5.177515 NA
Stage.Diagnosis-3 21.00000 95.00000 10.9947644 14.053254 NA
Stage.Diagnosis-4 151.00000 491.00000 79.0575916 72.633136 NA
Stage.Diagnosis-Missings 1.00000 9.00000 0.5235602 1.331361 NA
Primary.Site-Colon 13.00000 26.00000 6.8062827 3.846154 0.3540646
Primary.Site-Left 73.00000 261.00000 38.2198953 38.609468 NA
Primary.Site-Missings 3.00000 11.00000 1.5706806 1.627219 NA
Primary.Site-Rectum 51.00000 165.00000 26.7015707 24.408284 NA
Primary.Site-Right 51.00000 213.00000 26.7015707 31.508876 NA
Metastases.Number-1 91.00000 256.00000 47.6439791 37.869822 0.1606839
Metastases.Number-2 55.00000 254.00000 28.7958115 37.573965 NA
Metastases.Number-3 33.00000 123.00000 17.2774869 18.195266 NA
Metastases.Number-4 9.00000 36.00000 4.7120419 5.325444 NA
Metastases.Number-5 3.00000 6.00000 1.5706806 0.887574 NA
Metastases.Number-6 0.00000 1.00000 0.0000000 0.147929 NA
Metastases.Liver-FALSE 64.00000 211.00000 33.5078534 31.213018 0.5971403
Metastases.Liver-TRUE 127.00000 465.00000 66.4921466 68.786982 NA
Metastases.Peritoneum-FALSE 138.00000 491.00000 72.2513089 72.633136 0.9280072
Metastases.Peritoneum-TRUE 53.00000 185.00000 27.7486911 27.366864 NA
L1.ECOG-0 51.00000 284.00000 26.7015707 42.011834 0.0001000
L1.ECOG-1 95.00000 344.00000 49.7382199 50.887574 NA
L1.ECOG-2 34.00000 43.00000 17.8010471 6.360947 NA
L1.ECOG-3 11.00000 5.00000 5.7591623 0.739645 NA
Charlson-0 127.00000 487.00000 66.4921466 72.041420 0.0014999
Charlson-1 27.00000 125.00000 14.1361257 18.491124 NA
Charlson-2 22.00000 41.00000 11.5183246 6.065089 NA
Charlson-3 7.00000 18.00000 3.6649215 2.662722 NA
Charlson-4 2.00000 2.00000 1.0471204 0.295858 NA
Charlson-5 3.00000 1.00000 1.5706806 0.147929 NA
Charlson-6 1.00000 1.00000 0.5235602 0.147929 NA
Charlson-7 0.00000 1.00000 0.0000000 0.147929 NA
Charlson-8 2.00000 0.00000 1.0471204 0.000000 NA
CEA.Level-MeanSD 725.91907 478.48598 5254.2478968 2048.838683 0.5459979
CEA.Level-Missings 19.00000 61.00000 9.9476440 9.023669 NA
RAS.Mutated-FALSE 72.00000 317.00000 37.6963351 46.893491 0.0001000
RAS.Mutated-Missings 63.00000 112.00000 32.9842932 16.568047 NA
RAS.Mutated-TRUE 56.00000 247.00000 29.3193717 36.538461 NA
BRAF.Mutated-FALSE 52.00000 266.00000 27.2251309 39.349112 0.0085991
BRAF.Mutated-Missings 127.00000 369.00000 66.4921466 54.585799 NA
BRAF.Mutated-TRUE 12.00000 41.00000 6.2827225 6.065089 NA

Overview table for patients included for further treatmen lines:

L1.Arm.ChemBev==FALSE (#/Mean) L1.Arm.ChemBev==TRUE (#/Mean) L1.Arm.ChemBev==FALSE (%/SD) L1.Arm.ChemBev==TRUE (%/SD) p-value
n 156.00000 562.00000 NA NA NA
Age-MeanSD 62.87179 62.32028 14.0789103 12.2110342 0.6567444
Age-Missings 0.00000 0.00000 0.0000000 0.0000000 NA
Gender.Female-FALSE 96.00000 341.00000 61.5384615 60.6761566 0.8505149
Gender.Female-TRUE 60.00000 221.00000 38.4615385 39.3238434 NA
Hospital.Private-FALSE 93.00000 272.00000 59.6153846 48.3985765 0.0153985
Hospital.Private-TRUE 63.00000 290.00000 40.3846154 51.6014235 NA
Stage.Diagnosis-0 4.00000 28.00000 2.5641026 4.9822064 0.2659734
Stage.Diagnosis-1 0.00000 8.00000 0.0000000 1.4234875 NA
Stage.Diagnosis-2 9.00000 28.00000 5.7692308 4.9822064 NA
Stage.Diagnosis-3 18.00000 82.00000 11.5384615 14.5907473 NA
Stage.Diagnosis-4 124.00000 407.00000 79.4871795 72.4199288 NA
Stage.Diagnosis-Missings 1.00000 9.00000 0.6410256 1.6014235 NA
Primary.Site-Colon 10.00000 18.00000 6.4102564 3.2028470 0.2894711
Primary.Site-Left 61.00000 216.00000 39.1025641 38.4341637 NA
Primary.Site-Missings 1.00000 11.00000 0.6410256 1.9572954 NA
Primary.Site-Rectum 38.00000 133.00000 24.3589744 23.6654804 NA
Primary.Site-Right 46.00000 184.00000 29.4871795 32.7402135 NA
Metastases.Number-1 70.00000 218.00000 44.8717949 38.7900356 0.6216378
Metastases.Number-2 50.00000 214.00000 32.0512821 38.0782918 NA
Metastases.Number-3 28.00000 95.00000 17.9487179 16.9039146 NA
Metastases.Number-4 7.00000 31.00000 4.4871795 5.5160142 NA
Metastases.Number-5 1.00000 4.00000 0.6410256 0.7117438 NA
Metastases.Liver-FALSE 53.00000 172.00000 33.9743590 30.6049822 0.4386561
Metastases.Liver-TRUE 103.00000 390.00000 66.0256410 69.3950178 NA
Metastases.Peritoneum-FALSE 109.00000 402.00000 69.8717949 71.5302491 0.6921308
Metastases.Peritoneum-TRUE 47.00000 160.00000 30.1282051 28.4697509 NA
L1.ECOG-0 39.00000 230.00000 25.0000000 40.9252669 0.0001000
L1.ECOG-1 83.00000 291.00000 53.2051282 51.7793594 NA
L1.ECOG-2 28.00000 37.00000 17.9487179 6.5836299 NA
L1.ECOG-3 6.00000 4.00000 3.8461538 0.7117438 NA
Charlson-0 106.00000 406.00000 67.9487179 72.2419929 0.0057994
Charlson-1 22.00000 106.00000 14.1025641 18.8612100 NA
Charlson-2 16.00000 32.00000 10.2564103 5.6939502 NA
Charlson-3 5.00000 14.00000 3.2051282 2.4911032 NA
Charlson-4 2.00000 1.00000 1.2820513 0.1779359 NA
Charlson-5 2.00000 1.00000 1.2820513 0.1779359 NA
Charlson-6 1.00000 1.00000 0.6410256 0.1779359 NA
Charlson-7 0.00000 1.00000 0.0000000 0.1779359 NA
Charlson-8 2.00000 0.00000 1.2820513 0.0000000 NA
CEA.Level-MeanSD 844.29064 488.46878 5796.9883585 2031.4895195 0.4746236
CEA.Level-Missings 15.00000 52.00000 9.6153846 9.2526690 NA
RAS.Mutated-FALSE 63.00000 269.00000 40.3846154 47.8647687 0.0001000
RAS.Mutated-Missings 50.00000 75.00000 32.0512821 13.3451957 NA
RAS.Mutated-TRUE 43.00000 218.00000 27.5641026 38.7900356 NA
BRAF.Mutated-FALSE 41.00000 217.00000 26.2820513 38.6120996 0.0146985
BRAF.Mutated-Missings 105.00000 309.00000 67.3076923 54.9822064 NA
BRAF.Mutated-TRUE 10.00000 36.00000 6.4102564 6.4056940 NA

Step 2: Multiple Imputation of Missing Data

After cleaning the data in Step 1, the missing data are addressed in Step 2. Two steps are performed in order to do so:

  1. Addressing missings in the RAS status and BRAF status
  2. Performing multiple imputation by chained equations

Step 2.1: addressing missings in the RAS status and BRAF status

As explained in the manuscript, missing data for the RAS mutation status and BRAF mutation status may be partly addressed by assuming these mutations to be mutatual exclusive. For example, this indicates that if a RAS mutation is known for a patient with a missing BRAF status, the BRAF status can be assumed to be ‘wild-type’.

# Address missing values in Mutations
# Potentially 175 missing values can be addressed:
# - 1 for RAS
# - 142 for BRAF
table(list(RAS=df$RAS.Mutated, BRAF=df$BRAF.Mutated), useNA="always");
##        BRAF
## RAS     FALSE TRUE <NA>
##   FALSE   158   51  180
##   TRUE    160    1  142
##   <NA>      0    1  174
# If patients have a KRAS mutation and the RAS status is unknown,
# we assume that they will not have a RAS mutation. Similarly, if
# the BRAF status is unknown and the patient has a RAS mutation,
# we assume that the patient does not have a BRAF mutation.
df <- df %>%
  mutate(RAS.Mutated = ifelse(is.na(RAS.Mutated) & BRAF.Mutated, F, RAS.Mutated)) %>%
  mutate(BRAF.Mutated = ifelse(is.na(BRAF.Mutated) & RAS.Mutated, F, BRAF.Mutated))

# Oberve new distribution of mutations and missing values
table(list(RAS=df$RAS.Mutated, BRAF=df$BRAF.Mutated), useNA="always");
##        BRAF
## RAS     FALSE TRUE <NA>
##   FALSE   158   52  180
##   TRUE    302    1    0
##   <NA>      0    0  174

Step 2.2: performing multiple imputation by chained equations

After addressing part of the missing values for the RAS mutation status and BRAF mutation status, remaining missing data for these and the other variables are addressed by performing multiple imputation by chained equations. Missing values are addressed for the following variables (imputation method used):

  • Stage at diagnosis (multinomial logit)
  • Primary tumor site (multinomial logit)
  • CEA level (Bayesian linear regression)
  • RAS mutation status (logistic regression)
  • BRAF mutation status (logistic regression)

The following variables were additionally used as predictors:

  • Age at treatment start
  • Gender
  • Hospital type (private of public)
  • Presence of liver metastases
  • Presence of peritoneal metastases
  • Number of metastatic sites
  • ECOG performance status
  • Charlson comorbidity index

After performing the multiple imputation, post-processing is performed while extracting the imputed datasets to make sure RAS mutation status and BRAF mutation status remain mutually exclusive.

# Load required package
library(mice);

# List the included columns
mice.cols.included <- c("Age", "Gender.Female", "Hospital.Private", "Metastases.Liver", "Metastases.Peritoneum", "Metastases.Number", "L1.ECOG", "Charlson", "Primary.Site", "Stage.Diagnosis", "CEA.Level", "RAS.Mutated", "BRAF.Mutated");

# Select the data used for the multiple imputation
df.mice <- df %>%
  select(mice.cols.included)

# Make sure the type of all columns is correct
df.mice <- df.mice %>%
  mutate(Age = as.numeric(Age)) %>% 
  mutate(Gender.Female = as.logical(Gender.Female)) %>% 
  mutate(Hospital.Private = as.logical(Hospital.Private)) %>% 
  mutate(Stage.Diagnosis = as.factor(Stage.Diagnosis)) %>% 
  mutate(Primary.Site = as.factor(Primary.Site)) %>% 
  mutate(Metastases.Liver = as.logical(Metastases.Liver)) %>% 
  mutate(Metastases.Peritoneum = as.logical(Metastases.Peritoneum)) %>% 
  mutate(Metastases.Number = as.numeric(Metastases.Number)) %>% 
  mutate(Charlson = as.numeric(Charlson)) %>% 
  mutate(CEA.Level = as.numeric(CEA.Level)) %>% 
  mutate(RAS.Mutated = as.logical(RAS.Mutated)) %>% 
  mutate(BRAF.Mutated = as.logical(BRAF.Mutated)) %>% 
  mutate(L1.ECOG = as.factor(L1.ECOG))


# Define predictor matrix
predictor.matrix <- matrix(1, length(mice.cols.included), length(mice.cols.included)) - diag(1, length(mice.cols.included), length(mice.cols.included));
colnames(predictor.matrix) <- rownames(predictor.matrix) <- mice.cols.included;

# Perform multiple imputation
m <- 25;              # Number of imputations
set.seed(20181211);   # Random seed for reproduction
mice.out <- mice(data=df.mice, m=m, maxit=20, predictorMatrix=predictor.matrix, printFlag=F,
                 method=c(Age="", 
                          Gender.Female="", 
                          Hospital.Private="", 
                          Metastases.Liver="", 
                          Metastases.Peritoneum="", 
                          Metastases.Number="", 
                          L1.ECOG="", 
                          Charlson="", 
                          Primary.Site="polyreg",
                          Stage.Diagnosis="polyreg", 
                          CEA.Level="norm", 
                          RAS.Mutated="logreg", 
                          BRAF.Mutated="logreg")
);


# Extract the imputed datasets
count.BRAF.and.RAS <- 0;    # Counter to track how much post-processing is required
datasets <- lapply(1:m, function(i.dataset) {
  
  # Get the imputed columns
  df.imputed <- complete(mice.out, action=i.dataset);
  cols.imputed <- colnames(df.imputed);
  
  # Replace the colums in the original dataset by the imputed columns
  df.new <- df;
  df.new[, cols.imputed] <- df.imputed; 
  
  # Correct when RAS and BRAF mutations conflict
  count.BRAF.and.RAS <<- count.BRAF.and.RAS + sum(df.new$BRAF.Mutated & df.new$RAS.Mutated);
  df.new$BRAF.Mutated <- ifelse(df.new$RAS.Mutated==1 & df.new$BRAF.Mutated==1, 0, df.new$BRAF.Mutated);
  
  # Return the newly completed datasets
  df.new;
  
});

# Check number of post-processed mutations
prob.BRAF.and.RAS <- (count.BRAF.and.RAS/m) / nrow(datasets[[1]]) * 100;
prob.BRAF.and.RAS;
## [1] 0.3460208

Step 3: Univariable Analysis

Initial analyses were performed to determine which distribution types would be used to parameterize the survival models and which transformations would be used for each variable in the multivariable modeling. To this end, the following steps were performed as part of this univariable modeling:

  1. Calculating the overlap of the competing events for first-line progression-free survival (Model L1-Time)
  2. Selecting the distribution for progression-free survival (Model L1-Time)
  3. Selecting the distribution for overal survival following first-line progression (Model LX-Time)
  4. Selecting the transformation for progression-free survival (Model L1-Time)
  5. Selecting the transformation for progression-free survival (Model L1-Event)
  6. Selecting the transformation for overal survival following first-line progression (Model LX-Time)

The following transformations were considered:

Step 3.1: calculating the overlap of the competing events for first-line progression-free survival (Model L1-Time)

In modeling first-line progression-free survival, there are two competing events that need to be accounted for: physician-defined progression and death. To determine which modeling approach will be implemented to account for these two competing events, the overlap of the corresponding time-to-event distributions is important to know. It was decided that competing events will be handled by using the ‘Unimodal Distribution and Regression model (UDR)’ approach based on the strategy of selecting the time-to-event first based on a joint time-to-event distribution, and then select the event corresponding to that time.

# function for calculating the overlap between the distribution for time-to-progression and time-to-death
getOverlap <- function(data, n.bins=512, digits=3) {

  library(flux);

  # Get events
  events <- sort(unique(data$event));
  n.events <- length(events);

  # Get event-specific times
  times <- lapply(events, function(e) subset(data, event==e)$time );

  # Density settings
  min <- min(data$time);
  max <- max(data$time);
  n <- n.bins;
  x <- seq(from=min, to=max, length.out=n);

  # Get event-specific densities
  y <- sapply(1:n.events, function(i1) density(times[[i1]], from=min, to=max, n=n)$y );

  # Get event=specific density overlap
  yo <- sapply(1:n.events, function(i2) {

    sapply(1:n, function(j){

      own <- y[j,i2];
      other <- max(y[j,which(c(1:n.events)!=i2)]);

      if(own<=0) {
        return(0)
      } else
        return(min(own,other))

    })

  });

  # Calculate event-specific overlap
  o <- sapply(1:n.events, function(i3) {

    (auc(x=x, y=yo[,i3]) / auc(x=x, y=y[,i3])) *100

  });

  return(round(mean(o), digits=digits));
}

# Calculate the distribution overlap
getOverlap(data=data.frame(time = df$L1.PFS.Time[df$L1.PFS.Events!="Censored"], event = df$L1.PFS.Events[df$L1.PFS.Events!="Censored"]))
## [1] 72.457
# Visualize distribution overlap
plot(density(df$L1.PFS.Time[df$L1.PFS.Events!="Censored"]), xlab="Time-to-Event in Days", ylab=NA, main="Density Plot", las=1, lwd=2, xlim=c(0,2500), ylim=c(0,0.0025))
lines(density(df$L1.PFS.Time[df$L1.PFS.Events=="Progression"]), lwd=2, col=3)
lines(density(df$L1.PFS.Time[df$L1.PFS.Events=="Death"]), lwd=2, col=2)
lines(density(df$L1.PFS.Time[df$L1.PFS.Events=="Censored"]), lwd=2, col=4)
legend("topright", lty=1, lwd=2, col=c(1,3,2, 4), c(str_c("Combined [excl. censored patients](n=", sum(df$L1.PFS.Events!="Censored"),")"), str_c("Progression (n=", sum(df$L1.PFS.Events=="Progression"),")"), str_c("Death (n=", sum(df$L1.PFS.Events=="Death"),")"), str_c("Censored (n = ", sum(df$L1.PFS.Events=="Censored"),")")))

Step 3.2: selecting the distribution for progression-free survival (Model L1-Time)

A distribution type is selected for survival Model L1-Time that represents the two competing events applicable to first-line progression-free survival (physcian-defined progression and death) in one unimodal survival model. Selection was based on the likelihood and assessment of quantile-quantile plots and Kaplan-Meier plots. The following distributions were considered:

  • Exponential
  • Weibull
  • Logistic
  • Log-normal
# Load packages
library(survival);  # Fit Survival Models
library(EnvStats);  # Q-Q Plot for censored data

# Distributions considered
dist.considered <- c("exponential", 
                     "weibull", 
                     "logistic", 
                     "lognormal");

# Data for first-line
df.L1 <- df;

# Select distribution
dist.performance.L1 <- sapply(dist.considered, function(dist) {
  
  # Fit survival model for distribution
  model <- survreg(formula=Surv(time=L1.PFS.Time, event=L1.PFS.Event) ~ 1, data=df.L1, dist=dist);
  
  # Return the likelihood
  return(model$loglik[1]);
  
});

# Print the results and which distribution would be selected based on the likelihood
dist.selected.L1 <- dist.considered[which.max(dist.performance.L1)];
print(dist.performance.L1);
## exponential     weibull    logistic   lognormal 
##   -5522.443   -5483.372   -5646.513   -5512.667
print(dist.selected.L1);
## [1] "weibull"
# Plot the survival curves
for(i in 1:length(dist.considered)) {
  
  # Get distribution type
  dist <- dist.considered[i];
  
  # Fit survival model for distribution
  model <- survreg(formula=Surv(time=L1.PFS.Time, event=L1.PFS.Event) ~ 1, data=df.L1, dist=dist);
  
  # If it is the first distribution considered, plot kaplan-meier
  if(i==1) plot(survfit(Surv(time=L1.PFS.Time, event=L1.PFS.Event) ~ 1, data=df.L1), conf.int=F, xlab="Time to Death or Progression (days)", lwd=2);
  
  # Plot fitted distribution
  lines(y=seq(.99, .01, by=-.01), x=predict(model, type="quantile", p=seq(.01, .99, by=.01))[1,], col=rainbow(length(dist.considered))[i], lwd=2)
  
  # Add the legend if it is the last distribution
  if(i==length(dist.considered)) legend("right", legend=dist.considered, col=rainbow(length(dist.considered)), lty=1, lwd=2);
  
} 

# Plot the Q-Q Plots curves
par(mfrow=c(2,2));
for(i in 1:length(dist.considered)) {
  
  # Get distribution type
  dist <- dist.considered[i];
  
  # Fit survival model for distribution
  model <- survreg(formula=Surv(time=L1.PFS.Time, event=L1.PFS.Event) ~ 1, data=df.L1, dist=dist);
  
  # Plot the Q-Q Plot
  if(dist=="exponential") qqPlotCensored(x=df.L1$L1.PFS.Time, censored=df.L1$L1.PFS.Event==0, censoring.side="right", 
                                         equal.axes=T, main=paste0("Q-Q Plot for the ", dist, " distribution"),
                                         distribution="exp", param.list=list(rate=1/exp(model$coefficients)));
  if(dist=="weibull") qqPlotCensored(x=df.L1$L1.PFS.Time, censored=df.L1$L1.PFS.Event==0, censoring.side="right", 
                                     equal.axes=T, main=paste0("Q-Q Plot for the ", dist, " distribution"), 
                                     distribution="weibull", param.list=list(shape=1/model$scale, scale=exp(model$coefficients)));
  if(dist=="logistic") qqPlotCensored(x=df.L1$L1.PFS.Time, censored=df.L1$L1.PFS.Event==0, censoring.side="right", 
                                      equal.axes=T, main=paste0("Q-Q Plot for the ", dist, " distribution"), 
                                      distribution="logis", param.list=list(location=model$coefficients, scale=model$scale));
  if(dist=="lognormal") qqPlotCensored(x=df.L1$L1.PFS.Time, censored=df.L1$L1.PFS.Event==0, censoring.side="right", 
                                       equal.axes=T, main=paste0("Q-Q Plot for the ", dist, " distribution"), 
                                       distribution="lnorm", param.list=list(meanlog=model$coefficients, sdlog=model$scale));
}

Step 3.3: selecting the distribution for overal survival following first-line progression (Model LX-Time)

A distribution type is selected for survival Model LX-Time that represents overall survival after first-line progression. Selection was based on the likelihood and assessment of quantile-quantile plots and Kaplan-Meier plots. As for Model L1-Time, the following distributions were considered:

  • Exponential
  • Weibull
  • Logistic
  • Log-normal
# Data for first-line
df.LX <- filter(df, LX.OS.Include);

# Select distribution
dist.performance.LX <- sapply(dist.considered, function(dist) {
  
  # Fit survival model for distribution
  model <- survreg(formula=Surv(time=LX.OS.Time, event=LX.OS.Event) ~ 1, data=df.LX, dist=dist);
  
  # Return the likelihood
  return(model$loglik[1]);
  
});

# Print the results and which distribution would be selected based on the likelihood
dist.selected.LX <- dist.considered[which.max(dist.performance.LX)];
print(dist.performance.LX);
## exponential     weibull    logistic   lognormal 
##   -4207.651   -4207.648   -4430.506   -4245.090
print(dist.selected.LX);
## [1] "weibull"
# Plot the survival curves
for(i in 1:length(dist.considered)) {
  
  # Get distribution type
  dist <- dist.considered[i];
  
  # Fit survival model for distribution
  model <- survreg(formula=Surv(time=LX.OS.Time, event=LX.OS.Event) ~ 1, data=df.LX, dist=dist);
  
  # If it is the first distribution considered, plot kaplan-meier
  if(i==1) plot(survfit(Surv(time=LX.OS.Time, event=LX.OS.Event) ~ 1, data=df.LX), conf.int=F, xlab="Overall Survival (days)", lwd=2);
  
  # Plot fitted distribution
  lines(y=seq(.99, .01, by=-.01), x=predict(model, type="quantile", p=seq(.01, .99, by=.01))[1,], col=rainbow(length(dist.considered))[i], lwd=2)
  
  # Add the legend if it is the last distribution
  if(i==length(dist.considered)) legend("right", legend=dist.considered, col=rainbow(length(dist.considered)), lty=1, lwd=2);
  
} 

# Plot the Q-Q Plots curves
par(mfrow=c(2,2));
for(i in 1:length(dist.considered)) {
  
  # Get distribution type
  dist <- dist.considered[i];
  
  # Fit survival model for distribution
  model <- survreg(formula=Surv(time=LX.OS.Time, event=LX.OS.Event) ~ 1, data=df.LX, dist=dist);
  
  # Plot the Q-Q Plot
  if(dist=="exponential") qqPlotCensored(x=df.LX$LX.OS.Time, censored=df.LX$LX.OS.Event==0, censoring.side="right", 
                                         equal.axes=T, main=paste0("Q-Q Plot for the ", dist, " distribution"),
                                         distribution="exp", param.list=list(rate=1/exp(model$coefficients)));
  if(dist=="weibull") qqPlotCensored(x=df.LX$LX.OS.Time, censored=df.LX$LX.OS.Event==0, censoring.side="right", 
                                     equal.axes=T, main=paste0("Q-Q Plot for the ", dist, " distribution"), 
                                     distribution="weibull", param.list=list(shape=1/model$scale, scale=exp(model$coefficients)));
  if(dist=="logistic") qqPlotCensored(x=df.LX$LX.OS.Time, censored=df.LX$LX.OS.Event==0, censoring.side="right", 
                                      equal.axes=T, main=paste0("Q-Q Plot for the ", dist, " distribution"), 
                                      distribution="logis", param.list=list(location=model$coefficients, scale=model$scale));
  if(dist=="lognormal") qqPlotCensored(x=df.LX$LX.OS.Time, censored=df.LX$LX.OS.Event==0, censoring.side="right", 
                                       equal.axes=T, main=paste0("Q-Q Plot for the ", dist, " distribution"), 
                                       distribution="lnorm", param.list=list(meanlog=model$coefficients, sdlog=model$scale));
}

Step 3.4: selecting the transformation for progression-free survival (Model L1-Time)

The best-performing transformation was selected for each variable and survival model separately by fitting individual models for each candidate transformation and comparing performance based on Wald test p-values. Univariable models were also fitted for variables without multiple candidate transformations to record p-values. All p-values were calculated for each of 25 imputed datasets, and for variables with multiple candidate transformations, transformations’ p-value ranks were determined for each imputed dataset. Final selection of the transformation to be used was based on the average transformation rank over all 25 imputed datasets.

# Load package
library(aod);   # Wald-test

# Candidate variables, including their importance and potential transformations
vars.information.L1 <- list(
  
  # Age at treatment start
  Age = list(Transformations = c(Log = "log(Age)", SQRT = "sqrt(Age)", Poly1 = "poly(Age, 1)", Poly2 = "poly(Age, 2)", Poly3 = "poly(Age, 3)", Poly4 = "poly(Age, 4)")),
  
  # Gender
  Gender.Female = list(Transformations = c(Binary.FemaleYN = "Gender.Female")),
  
  # Hospital Type
  Hospital.Private = list(Transformations = c(Binary.PrivateYN = "Hospital.Private")),
  
  # Stage at Initial Diagnosis
  Stage.Diagnosis = list(Transformations = c(Binary.Stage4YN = "as.factor(as.character(Stage.Diagnosis)=='4')")),
  
  # Primary Tumour Site
  Primary.Site = list(Transformations = c(Factor.GroupColMul = "as.factor(ifelse(Primary.Site %in% c('Colon','Multiple'), 'ColonMultiple', Primary.Site))", Factor.GroupColMul.GroupLefRec = "as.factor(ifelse(Primary.Site %in% c('Colon','Multiple'), 'ColonMultiple', ifelse(Primary.Site %in% c('Left','Rectum'), 'LeftRectum', Primary.Site)))")),
  
  # Number of Metastases
  Metastases.Number = list(Transformations = c(Factor.Group3plus = "as.factor(ifelse(Metastases.Number>=3, 3, Metastases.Number))")),
  
  # Liver Metastases
  Metastases.Liver = list(Transformations = c(Binary = "Metastases.Liver")),
  
  # Peritoneum Metastases
  Metastases.Peritoneum = list(Transformations = c(Binary = "Metastases.Peritoneum")),
  
  # ECOG Performance Status
  L1.ECOG = list(Transformations = c(Factor.Group23 = "as.factor(ifelse(L1.ECOG==0, 0, ifelse(L1.ECOG==1, 1, 2)))", Binary.2plusYN = "as.factor(ifelse(L1.ECOG %in% c(0,1), F, T))")),
  
  # Charlson Score
  Charlson = list(Transformations = c(Factor.Group2plus = "as.factor(ifelse(Charlson>=2, 2, Charlson))", Factor.Group01.Group2plus = "as.factor(ifelse(Charlson %in% c(0,1), 1, ifelse(Charlson>=2, 2, Charlson)))", Binary.1plusYN = "as.factor(ifelse(Charlson>=1, T, F))")),
  
  # CEA Level
  CEA.Level = list(Transformations = c(Binary.200plus = "as.factor(ifelse(Metastases.Number<=2, T, F) * ifelse(CEA.Level>=200, T, F))")),
  
  # RAS Status
  RAS.Mutated = list(Transformations = c(Binary.MutatedYN = "RAS.Mutated")),
  
  # BRAF Status
  BRAF.Mutated = list(Transformations = c(Binary.MutatedYN = "BRAF.Mutated"))
  
);

# Set three levels of significance
significance.levels <- c(0.1, 0.05, 0.01);

# For each imputed dataset, select distribution and perform univariable analysis
univariable.L1 <- lapply(1:length(datasets), function(dataset) {
  
  # Retrieve datasets
  data.surv.L1 <- datasets[[dataset]];

  # For each variable
  vars.univariable <- lapply(vars.information.L1, function(var) {
    
    # Consider all potential transformations
    sapply(var$Transformations, function(var.form) {
      
      # Define formula to be fitted
      f <- as.formula(str_c("Surv(time=L1.PFS.Time, event=L1.PFS.Event) ~ ", var.form, collapse=""));
      
      # Fit model
      fit <- survreg(formula=f, data=data.surv.L1, dist=dist.selected.L1);
      
      # Perform Wald-test for significance
      test <- wald.test(Sigma=fit$var, b=c(fit$coefficients, fit$scale), Terms=c(2:length(fit$coefficients)))
      
      # Determine significance
      pvalue <- test$result$chi2["P"];
      significance <- "";
      if(pvalue<max(significance.levels)) {
        significance <- str_c(rep("*", tail(which(pvalue<significance.levels), n=1)), collapse="");
      }
      
      # Return transformation and model performance
      c(trans=f, ll=fit$loglik[2], pvalue=pvalue, sign=significance)
      
    })
    
  });
  
  # Return outcomes
  return(list(vars.univariable=vars.univariable));
  
})


# Add summary for variables to vars.information object
for(var in names(vars.information.L1)) {
  
  # Get names of transformations
  transformation.names <- names(vars.information.L1[[var]]$Transformations);
  
  # Extract data from univariable object
  loglik <- as.data.frame(
    sapply(transformation.names, function(trans) 
      sapply(univariable.L1, function(univar)
        univar$vars.univariable[[var]][2,trans]
      )
    )
  )
  pvalues <- as.data.frame(
    sapply(transformation.names, function(trans) 
      sapply(univariable.L1, function(univar)
        univar$vars.univariable[[var]][3,trans]
      )
    )
  )
  significance <- as.data.frame(
    sapply(transformation.names, function(trans) 
      sapply(univariable.L1, function(univar)
        univar$vars.univariable[[var]][4,trans]
      )
    )
  )
  
  # Transform List object to Matrix
  pvalues <- matrix(unlist(pvalues), ncol=ncol(pvalues), byrow=F)
  colnames(pvalues) <- transformation.names;
  
  # Rank based on pvalues
  ranks <- matrix(apply(pvalues, 1, rank, ties.method="average"), ncol=ncol(pvalues), byrow=T);
  colnames(ranks) <- colnames(pvalues);
  avgrank <- apply(ranks, 2, mean);
  
  # Select best performing transformation
  selected <- names(avgrank)[which.min(avgrank)];
  selected.transformation <- vars.information.L1[[var]]$Transformations[selected];
  selected.avgpvalue <- mean(pvalues[, selected]);
  selected.significance <- "";
  if(selected.avgpvalue<max(significance.levels)){
    selected.significance <- str_c(rep("*", tail(which(selected.avgpvalue<significance.levels), n=1)), collapse="");
  }
  
  # Save to vars.information object
  vars.information.L1[[var]]$LogLik <- loglik;
  vars.information.L1[[var]]$Pvalues <- pvalues;
  vars.information.L1[[var]]$Significance <- significance;
  vars.information.L1[[var]]$Ranks <- ranks;
  vars.information.L1[[var]]$AvgRank <- avgrank;
  
  vars.information.L1[[var]]$SelectedTransformation <- selected.transformation;
  vars.information.L1[[var]]$SelectedPvalue <- selected.avgpvalue;
  vars.information.L1[[var]]$SelectedSignificance <- selected.significance;
  
}


# Generate summary Table
vars.univariable.L1 <- t(sapply(vars.information.L1, function(var) c(
  trans = names(var$SelectedTransformation),
  pvalue = var$SelectedPvalue,
  sign = var$SelectedSignificance
))); 

# View results
kable(vars.univariable.L1);
trans pvalue sign
Age Poly2 0.00543870740945696 ***
Gender.Female Binary.FemaleYN 0.0413078627543801 **
Hospital.Private Binary.PrivateYN 0.132221927655926
Stage.Diagnosis Binary.Stage4YN 0.0180820041944674 **
Primary.Site Factor.GroupColMul.GroupLefRec 0.00371107490900566 ***
Metastases.Number Factor.Group3plus 0.40180927502595
Metastases.Liver Binary 0.0548355494244619 *
Metastases.Peritoneum Binary 0.108781784235471
L1.ECOG Factor.Group23 8.5890183854076e-12 ***
Charlson Binary.1plusYN 0.164886584017105
CEA.Level Binary.200plus 0.00551021390960015 ***
RAS.Mutated Binary.MutatedYN 0.208343809880177
BRAF.Mutated Binary.MutatedYN 0.152022282845375

Step 3.5: selecting the transformation for progression-free survival (Model L1-Event)

To distinguish between the two competing events of physisician-defined progression death for first-line progression-free survival, a logistic regression model is needed (Model L1-Event). For this model, all best-performing transformations indentified for corresponding Model L1-Time were used, so only the best transformation for the additional variable relating to the time-to-event needs to be selected.

# Define a binary variable 'Death' to represent whether first-line progression refered to Death (1), Physician-defined progression (0), or the patient was censored (NA)
Death <- ifelse(datasets[[1]]$L1.PFS.Events=="Death", 1, ifelse(datasets[[1]]$L1.PFS.Events=="Progression", 0, NA));

# Fit the different transformations
summary(glm(Death ~ 1))
## 
## Call:
## glm(formula = Death ~ 1)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.1023  -0.1023  -0.1023  -0.1023   0.8977  
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.10234    0.01065    9.61   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.09198216)
## 
##     Null deviance: 74.506  on 810  degrees of freedom
## Residual deviance: 74.506  on 810  degrees of freedom
##   (56 observations deleted due to missingness)
## AIC: 369.34
## 
## Number of Fisher Scoring iterations: 2
summary(glm(Death ~ sqrt(datasets[[1]]$L1.PFS.Time)))
## 
## Call:
## glm(formula = Death ~ sqrt(datasets[[1]]$L1.PFS.Time))
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -0.21793  -0.11542  -0.09612  -0.07337   0.95884  
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)   
## (Intercept)                     0.029326   0.027825   1.054  0.29222   
## sqrt(datasets[[1]]$L1.PFS.Time) 0.004473   0.001576   2.838  0.00465 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.09118778)
## 
##     Null deviance: 74.506  on 810  degrees of freedom
## Residual deviance: 73.771  on 809  degrees of freedom
##   (56 observations deleted due to missingness)
## AIC: 363.31
## 
## Number of Fisher Scoring iterations: 2
summary(glm(Death ~ log(datasets[[1]]$L1.PFS.Time)))
## 
## Call:
## glm(formula = Death ~ log(datasets[[1]]$L1.PFS.Time))
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -0.12443  -0.10784  -0.10247  -0.09366   0.93415  
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)
## (Intercept)                     0.04526    0.06323   0.716    0.474
## log(datasets[[1]]$L1.PFS.Time)  0.01058    0.01155   0.916    0.360
## 
## (Dispersion parameter for gaussian family taken to be 0.09200045)
## 
##     Null deviance: 74.506  on 810  degrees of freedom
## Residual deviance: 74.428  on 809  degrees of freedom
##   (56 observations deleted due to missingness)
## AIC: 370.5
## 
## Number of Fisher Scoring iterations: 2
summary(glm(Death ~ poly(datasets[[1]]$L1.PFS.Time, 1)))
## 
## Call:
## glm(formula = Death ~ poly(datasets[[1]]$L1.PFS.Time, 1))
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -0.35633  -0.11251  -0.08696  -0.06513   0.95045  
## 
## Coefficients:
##                                    Estimate Std. Error t value Pr(>|t|)
## (Intercept)                         0.10236    0.01054   9.710  < 2e-16
## poly(datasets[[1]]$L1.PFS.Time, 1)  1.31589    0.31269   4.208 2.86e-05
##                                       
## (Intercept)                        ***
## poly(datasets[[1]]$L1.PFS.Time, 1) ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.09012303)
## 
##     Null deviance: 74.506  on 810  degrees of freedom
## Residual deviance: 72.910  on 809  degrees of freedom
##   (56 observations deleted due to missingness)
## AIC: 353.78
## 
## Number of Fisher Scoring iterations: 2
summary(glm(Death ~ poly(datasets[[1]]$L1.PFS.Time, 2)))
## 
## Call:
## glm(formula = Death ~ poly(datasets[[1]]$L1.PFS.Time, 2))
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -0.62677  -0.09385  -0.08514  -0.08335   0.91693  
## 
## Coefficients:
##                                     Estimate Std. Error t value Pr(>|t|)
## (Intercept)                           0.1026     0.0105   9.764  < 2e-16
## poly(datasets[[1]]$L1.PFS.Time, 2)1   1.3072     0.3115   4.196 3.02e-05
## poly(datasets[[1]]$L1.PFS.Time, 2)2   0.8118     0.3048   2.664  0.00789
##                                        
## (Intercept)                         ***
## poly(datasets[[1]]$L1.PFS.Time, 2)1 ***
## poly(datasets[[1]]$L1.PFS.Time, 2)2 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.08944915)
## 
##     Null deviance: 74.506  on 810  degrees of freedom
## Residual deviance: 72.275  on 808  degrees of freedom
##   (56 observations deleted due to missingness)
## AIC: 348.69
## 
## Number of Fisher Scoring iterations: 2
summary(glm(Death ~ poly(datasets[[1]]$L1.PFS.Time, 3)))
## 
## Call:
## glm(formula = Death ~ poly(datasets[[1]]$L1.PFS.Time, 3))
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -0.48879  -0.10220  -0.07718  -0.06743   0.93404  
## 
## Coefficients:
##                                     Estimate Std. Error t value Pr(>|t|)
## (Intercept)                          0.10303    0.01045   9.856  < 2e-16
## poly(datasets[[1]]$L1.PFS.Time, 3)1  1.33858    0.31025   4.315  1.8e-05
## poly(datasets[[1]]$L1.PFS.Time, 3)2  0.83827    0.30347   2.762  0.00587
## poly(datasets[[1]]$L1.PFS.Time, 3)3 -0.90747    0.30755  -2.951  0.00326
##                                        
## (Intercept)                         ***
## poly(datasets[[1]]$L1.PFS.Time, 3)1 ***
## poly(datasets[[1]]$L1.PFS.Time, 3)2 ** 
## poly(datasets[[1]]$L1.PFS.Time, 3)3 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.08860409)
## 
##     Null deviance: 74.506  on 810  degrees of freedom
## Residual deviance: 71.504  on 807  degrees of freedom
##   (56 observations deleted due to missingness)
## AIC: 341.99
## 
## Number of Fisher Scoring iterations: 2

Step 3.6: selecting the transformation for overal survival following first-line progression (Model LX-Time)

Selection of the best-performing transformation for Model LX-Time was done in the same way as for Model L1-Tim. The only difference is that an additional variable representing the first-line progression-free survival was included.

# Candidate variables, including their importance and potential transformations
vars.information.LX <- list(
  
  # Progression-free survival
  PFS = list(Transformations = c(Log = "log(L1.PFS.Time)", SQRT = "sqrt(L1.PFS.Time)", Poly1 = "poly(L1.PFS.Time, 1)", Poly2 = "poly(L1.PFS.Time, 2)", Poly3 = "poly(L1.PFS.Time, 3)")),
  
  # Age at treatment start
  Age = list(Transformations = c(Log = "log(Age)", SQRT = "sqrt(Age)", Poly1 = "poly(Age, 1)",  Poly2 = "poly(Age, 2)",  Poly3 = "poly(Age, 3)")),
  
  # Gender
  Gender.Female = list(Transformations = c(Binary.FemaleYN = "Gender.Female")),
  
  # Hospital Type
  Hospital.Private = list(Transformations = c(Binary.PrivateYN = "Hospital.Private")),
  
  # Stage at Initial Diagnosis
  Stage.Diagnosis = list(Transformations = c(Binary.Stage4YN = "as.factor(as.character(Stage.Diagnosis)=='4')")),
  
  # Primary Tumour Site
  Primary.Site = list(Transformations = c(Factor.GroupColMul = "as.factor(ifelse(Primary.Site %in% c('Colon','Multiple'), 'ColonMultiple', Primary.Site))", Factor.GroupColMul.GroupLefRec = "as.factor(ifelse(Primary.Site %in% c('Colon','Multiple'), 'ColonMultiple', ifelse(Primary.Site %in% c('Left','Rectum'), 'LeftRectum', Primary.Site)))")),
  
  # Number of Metastases
  Metastases.Number = list(Transformations = c(Factor.Group3plus = "as.factor(ifelse(Metastases.Number>=3, 3, Metastases.Number))")),
  
  # Liver Metastases
  Metastases.Liver = list(Transformations = c(Binary = "Metastases.Liver")),
  
  # Peritoneum Metastases
  Metastases.Peritoneum = list(Transformations = c(Binary = "Metastases.Peritoneum")),
  
  # ECOG Performance Status
  L1.ECOG = list(Transformations = c(Factor.Group23 = "as.factor(ifelse(L1.ECOG==0, 0, ifelse(L1.ECOG==1, 1, 2)))", Binary.2plusYN = "as.factor(ifelse(L1.ECOG %in% c(0,1), F, T))")),
  
  # Charlson Score
  Charlson = list(Transformations = c(Factor.Group2plus = "as.factor(ifelse(Charlson>=2, 2, Charlson))", Factor.Group01.Group2plus = "as.factor(ifelse(Charlson %in% c(0,1), 1, ifelse(Charlson>=2, 2, Charlson)))", Binary.1plusYN = "as.factor(ifelse(Charlson>=1, T, F))")),
  
  # CEA Level
  CEA.Level = list(Transformations = c(Binary.200plus = "as.factor(ifelse(Metastases.Number<=2, T, F) * ifelse(CEA.Level>=200, T, F))")),
  
  # RAS Status
  RAS.Mutated = list(Transformations = c(Binary.MutatedYN = "RAS.Mutated")),
  
  # BRAF Status
  BRAF.Mutated = list(Transformations = c(Binary.MutatedYN = "BRAF.Mutated"))
  
);


# For each imputed dataset, select distribution and perform univariable analysis
univariable.LX <- lapply(1:length(datasets), function(dataset) {
  
  # Retrieve datasets
  data.surv <- datasets[[dataset]];
  data.surv.LX <- filter(data.surv, LX.OS.Include);
  
  # For each variable
  vars.univariable <- lapply(vars.information.LX, function(var) {
    
    # Consider all potential transformations
    sapply(var$Transformations, function(var.form) {
      
      # Define formula to be fitted
      f <- as.formula(str_c("Surv(time=LX.OS.Time, event=LX.OS.Event) ~ ", var.form, collapse=""));
      
      # Fit model
      fit <- survreg(formula=f, data=data.surv.LX, dist=dist.selected.LX);
      
      # Perform Wald-test for significance
      test <- wald.test(Sigma=fit$var, b=c(fit$coefficients, fit$scale), Terms=c(2:length(fit$coefficients)))
      
      # Determine significance
      pvalue <- test$result$chi2["P"];
      significance <- "";
      if(pvalue<max(significance.levels)) {
        significance <- str_c(rep("*", tail(which(pvalue<significance.levels), n=1)), collapse="");
      }
      
      # Return transformation and model performance
      c(trans=f, ll=fit$loglik[2], pvalue=pvalue, sign=significance)
      
    })
    
  });
  
  # Return outcomes
  return(list(vars.univariable=vars.univariable));
  
})


# Add summary for variables to vars.information object
for(var in names(vars.information.LX)) {
  
  # Get names of transformations
  transformation.names <- names(vars.information.LX[[var]]$Transformations);
  
  # Extract data from univariable object
  loglik <- as.data.frame(
    sapply(transformation.names, function(trans) 
      sapply(univariable.LX, function(univar)
        univar$vars.univariable[[var]][2,trans]
      )
    )
  )
  pvalues <- as.data.frame(
    sapply(transformation.names, function(trans) 
      sapply(univariable.LX, function(univar)
        univar$vars.univariable[[var]][3,trans]
      )
    )
  )
  significance <- as.data.frame(
    sapply(transformation.names, function(trans) 
      sapply(univariable.LX, function(univar)
        univar$vars.univariable[[var]][4,trans]
      )
    )
  )
  
  # Transform List object to Matrix
  pvalues <- matrix(unlist(pvalues), ncol=ncol(pvalues), byrow=F)
  colnames(pvalues) <- transformation.names;
  
  # Rank based on pvalues
  ranks <- matrix(apply(pvalues, 1, rank, ties.method="average"), ncol=ncol(pvalues), byrow=T);
  colnames(ranks) <- colnames(pvalues);
  avgrank <- apply(ranks, 2, mean);
  
  # Select best performing transformation
  selected <- names(avgrank)[which.min(avgrank)];
  selected.transformation <- vars.information.LX[[var]]$Transformations[selected];
  selected.avgpvalue <- mean(pvalues[, selected]);
  selected.significance <- "";
  if(selected.avgpvalue<max(significance.levels)){
    selected.significance <- str_c(rep("*", tail(which(selected.avgpvalue<significance.levels), n=1)), collapse="");
  }
  
  # Save to vars.information object
  vars.information.LX[[var]]$LogLik <- loglik;
  vars.information.LX[[var]]$Pvalues <- pvalues;
  vars.information.LX[[var]]$Significance <- significance;
  vars.information.LX[[var]]$Ranks <- ranks;
  vars.information.LX[[var]]$AvgRank <- avgrank;
  
  vars.information.LX[[var]]$SelectedTransformation <- selected.transformation;
  vars.information.LX[[var]]$SelectedPvalue <- selected.avgpvalue;
  vars.information.LX[[var]]$SelectedSignificance <- selected.significance;
  
}


# Generate summary Table
vars.univariable.LX <- t(sapply(vars.information.LX, function(var) c(
  trans = names(var$SelectedTransformation),
  pvalue = var$SelectedPvalue,
  sign = var$SelectedSignificance
))); 

# View results
kable(vars.univariable.LX);
trans pvalue sign
PFS Log 0 ***
Age Poly2 1.47682807410954e-05 ***
Gender.Female Binary.FemaleYN 0.0293104401172259 **
Hospital.Private Binary.PrivateYN 0.00854917648739073 ***
Stage.Diagnosis Binary.Stage4YN 0.464603901073672
Primary.Site Factor.GroupColMul.GroupLefRec 0.0240492505778438 **
Metastases.Number Factor.Group3plus 0.678845101916153
Metastases.Liver Binary 0.719065221171095
Metastases.Peritoneum Binary 0.0120556554881923 **
L1.ECOG Factor.Group23 0 ***
Charlson Binary.1plusYN 0.912530732728782
CEA.Level Binary.200plus 0.267061210740859
RAS.Mutated Binary.MutatedYN 0.0110552668651976 **
BRAF.Mutated Binary.MutatedYN 0.0330294095990516 **

Step 4: Multivariable Modeling of Survival

Multivariable (survival) models are fitted to reflect patient heterogeneity. A variable selection procedure is implemented that considers both clinical relevance and statistical performance. In total, the multivariable modeling exists of three steps:

  1. Defining functions used for fitting the multivariable models
  2. Defining the input for the multivariable modeling process
  3. Fitting the multivariable models

Step 4.1: defining functions used for fitting the multivariable models

Several custom functions are defined to perform the variable selection procedure that considers both clinical relevance and statistical performance. Please find an explanation of the functions in the R code below.

# The 'getSurvDataset.L1' and 'getSurvDataset.LX' functions take the cleaned data and transform these in a data.frame 
# dat can be used to estimate (survival) models. This is necessary, because the cleaned data needs to be transformed 
# according to the selected transformations and treatment interaction terms need to be calculated accordingly
getSurvDataset.L1 <- function(data) {
  
  data <- data %>% 
    
    ## Single variables
    
    # Event
    mutate(Event = L1.PFS.Event) %>%
    
    # Events
    mutate(Events = L1.PFS.Events) %>%
    
    # Events.01
    mutate(Events.01 = ifelse(Events=="Death", 1, NA)) %>%
    mutate(Events.01 = ifelse(Events=="Progression", 0, Events.01)) %>%
    
    # L1.PFS.Time = "Time",
    mutate(Time = L1.PFS.Time) %>%
    mutate(Time1 = L1.PFS.Time) %>%
    mutate(Time2 = L1.PFS.Time^2) %>%
    mutate(Time3 = L1.PFS.Time^3) %>%
    mutate(Time4 = L1.PFS.Time^4) %>%
    
    # Arm.ChemBev = "ChemBev",
    mutate(ChemBev = as.numeric(as.logical(L1.Arm.ChemBev))) %>%
    
    #  Age = "Age1 + Age2",
    mutate(Age1 = Age) %>%
    mutate(Age2 = Age1^2) %>%
    
    #  Gender.Female = "Female.YN",
    mutate(Female.YN = as.numeric(as.logical(Gender.Female))) %>%
    
    #  Hospital.Private = "Private.YN",
    mutate(Private.YN = as.numeric(as.logical(Hospital.Private))) %>%
    
    #  Stage.Diagnosis = "Stage4.YN",
    mutate(Stage4.YN = ifelse(as.character(Stage.Diagnosis)=="4", 1, 0)) %>%
    
    #  Primary.Site = "Site.Left + Site.Right + Site.Rectum",
    mutate(Site.LeftRectum = ifelse(Primary.Site %in% c("Left","Rectum"), 1, 0)) %>%
    mutate(Site.Right = ifelse(Primary.Site=="Right", 1, 0)) %>%
    
    #  Metastases.Number = "Metastases.2 + Metastases.3plus",
    mutate(Metastases.2 = ifelse(as.numeric(as.character(Metastases.Number))==2, 1, 0)) %>%
    mutate(Metastases.3plus = ifelse(as.numeric(as.character(Metastases.Number))>=3, 1, 0)) %>%
    
    #  Metastases.Liver = "Liver.YN",
    mutate(Liver.YN = as.numeric(as.logical(Metastases.Liver))) %>%
    
    #  Metastases.Peritoneum = "Peritoneum.YN",
    mutate(Peritoneum.YN = as.numeric(as.logical(Metastases.Peritoneum))) %>%
    
    #  L1.ECOG = "ECOG.1 + ECOG.23",
    mutate(ECOG.1 = ifelse(as.numeric(as.character(L1.ECOG))==1, 1, 0)) %>%
    mutate(ECOG.23 = ifelse(as.numeric(as.character(L1.ECOG))>=2, 1, 0)) %>%
    
    #  Charlson = "Charlson.1 + Charlson.2 + Charlson.3plus",
    mutate(Charlson.1plus = ifelse(as.numeric(as.character(Charlson))>=1, 1, 0)) %>%
    
    #  CEA.Level = "CEA.200plus",
    mutate(CEA.200plus = ifelse(as.numeric(as.character(Metastases.Number))<=2, 1, 0) * ifelse(as.numeric(CEA.Level)>=200, 1, 0)) %>%
    
    #  RAS.Mutated = "MutantRAS.YN",
    mutate(MutantRAS.YN = as.numeric(as.logical(RAS.Mutated))) %>%
    
    #  BRAF.Mutated = "MutantBRAF.YN",
    mutate(MutantBRAF.YN = as.numeric(as.logical(BRAF.Mutated))) %>%
    
    
    ## Interactions
    mutate(Time1.Int = Time1 * ChemBev) %>%
    mutate(Time2.Int = Time2 * ChemBev) %>%
    mutate(Time3.Int = Time3 * ChemBev) %>%
    mutate(Time4.Int = Time4 * ChemBev) %>%
    mutate(Age1.Int = Age1 * ChemBev) %>%
    mutate(Age2.Int = Age2 * ChemBev) %>%
    mutate(Female.YN.Int = Female.YN * ChemBev) %>%
    mutate(Private.YN.Int = Private.YN * ChemBev) %>%
    mutate(Stage4.YN.Int = Stage4.YN * ChemBev) %>%
    mutate(Site.LeftRectum.Int = Site.LeftRectum * ChemBev) %>%
    mutate(Site.Right.Int = Site.Right * ChemBev) %>%
    mutate(Metastases.2.Int = Metastases.2 * ChemBev) %>%
    mutate(Metastases.3plus.Int = Metastases.3plus * ChemBev) %>%
    mutate(Liver.YN.Int = Liver.YN * ChemBev) %>%
    mutate(Peritoneum.YN.Int = Peritoneum.YN * ChemBev) %>%
    mutate(ECOG.1.Int = ECOG.1 * ChemBev) %>%
    mutate(ECOG.23.Int = ECOG.23 * ChemBev) %>%
    mutate(Charlson.1plus.Int = Charlson.1plus * ChemBev) %>%
    mutate(CEA.200plus.Int = CEA.200plus * ChemBev) %>%
    mutate(MutantRAS.YN.Int = MutantRAS.YN * ChemBev) %>%
    mutate(MutantBRAF.YN.Int = MutantBRAF.YN * ChemBev)
  
  
  # Only the newly created columns need to be returned
  return(select(data, Event:MutantBRAF.YN.Int))
  
}
getSurvDataset.LX <- function(data) {
  
  data <- data %>% 
    
    ## Single variables
    
    # L1.OS.Event = Event,
    mutate(Event = LX.OS.Event) %>%
    
    # L1.OS.Time = "Time",
    mutate(Time = LX.OS.Time) %>%
    
    # L1.PFS.Time = "PFS.Time",
    mutate(PFS.Time = log(L1.PFS.Time)) %>%
    
    # Arm.ChemBev = "ChemBev",
    mutate(ChemBev = as.numeric(as.logical(L1.Arm.ChemBev))) %>%
    
    #  Age = "Age1 + Age2",
    mutate(Age1 = as.numeric(Age)-mean(Age)) %>%
    mutate(Age2 = as.numeric(Age1^2)) %>%
    
    #  Gender.Female = "Female.YN",
    mutate(Female.YN = as.numeric(as.logical(Gender.Female))) %>%
    
    #  Hospital.Private = "Private.YN",
    mutate(Private.YN = as.numeric(as.logical(Hospital.Private))) %>%
    
    #  Stage.Diagnosis = "Stage4.YN",
    mutate(Stage4.YN = ifelse(as.numeric(Stage.Diagnosis)==4, 1, 0)) %>%
    
    #  Primary.Site = "Site.Left + Site.Right + Site.Rectum",
    mutate(Site.LeftRectum = ifelse(Primary.Site %in% c("Left","Rectum"), 1, 0)) %>%
    mutate(Site.Right = ifelse(Primary.Site=="Right", 1, 0)) %>%
    
    #  Metastases.Number = "Metastases.2 + Metastases.3plus",
    mutate(Metastases.2 = ifelse(as.numeric(Metastases.Number)==2, 1, 0)) %>%
    mutate(Metastases.3plus = ifelse(as.numeric(Metastases.Number)>=3, 1, 0)) %>%
    
    #  Metastases.Liver = "Liver.YN",
    mutate(Liver.YN = as.numeric(as.logical(Metastases.Liver))) %>%
    
    #  Metastases.Peritoneum = "Peritoneum.YN",
    mutate(Peritoneum.YN = as.numeric(as.logical(Metastases.Peritoneum))) %>%
    
    #  L1.ECOG = "ECOG.1 + ECOG.23",
    mutate(ECOG.1 = ifelse(as.numeric(as.character(L1.ECOG))==1, 1, 0)) %>%
    mutate(ECOG.23 = ifelse(as.numeric(as.character(L1.ECOG))>=2, 1, 0)) %>%
    
    #  Charlson = "Charlson.3plus.YN",
    mutate(Charlson.1plus.YN = ifelse(as.numeric(Charlson)>=1, 1, 0)) %>%
    
    #  CEA.Level = "CEA.200plus",
    mutate(CEA.200plus = ifelse(as.numeric(Metastases.Number)<=2, 1, 0) * ifelse(as.numeric(CEA.Level)>=200, 1, 0)) %>%
    
    #  RAS.Mutated = "MutantRAS.YN",
    mutate(MutantRAS.YN = as.numeric(as.logical(RAS.Mutated))) %>%
    
    #  BRAF.Mutated = "MutantBRAF.YN",
    mutate(MutantBRAF.YN = as.numeric(as.logical(BRAF.Mutated))) %>%
    
    
    ## Interactions
    mutate(PFS.Time.Int = PFS.Time * ChemBev) %>%
    mutate(Age1.Int = Age1 * ChemBev) %>%
    mutate(Age2.Int = Age2 * ChemBev) %>%
    mutate(Female.YN.Int = Female.YN * ChemBev) %>%
    mutate(Private.YN.Int = Private.YN * ChemBev) %>%
    mutate(Stage4.YN.Int = Stage4.YN * ChemBev) %>%
    mutate(Site.LeftRectum.Int = Site.LeftRectum * ChemBev) %>%
    mutate(Site.Right.Int = Site.Right * ChemBev) %>%
    mutate(Metastases.2.Int = Metastases.2 * ChemBev) %>%
    mutate(Metastases.3plus.Int = Metastases.3plus * ChemBev) %>%
    mutate(Liver.YN.Int = Liver.YN * ChemBev) %>%
    mutate(Peritoneum.YN.Int = Peritoneum.YN * ChemBev) %>%
    mutate(ECOG.1.Int = ECOG.1 * ChemBev) %>%
    mutate(ECOG.23.Int = ECOG.23 * ChemBev) %>%
    mutate(Charlson.1plus.YN.Int = Charlson.1plus.YN * ChemBev) %>%
    mutate(CEA.200plus.Int = CEA.200plus * ChemBev) %>%
    mutate(MutantRAS.YN.Int = MutantRAS.YN * ChemBev) %>%
    mutate(MutantBRAF.YN.Int = MutantBRAF.YN * ChemBev)
  
  # Only the newly created columns need to be returned
  return(select(data, Event:MutantBRAF.YN.Int))
  
}

# The 'avgSurvivalModel' and 'avgLogisticModel' functions estimate (survival) models based on a number of datasets and
# combine these separate models into one model using Rubin's rule so that a p-value can be calculated for all models
# combined in the variable selection procedure
avgSurvivalModel <- function(formula, datasets, distribution, conf.int=F) {
  
  # Fit models for each dataset
  models <- lapply(datasets, function(dataset) {
    
    model <- survreg(formula=as.formula(formula), data=dataset, dist=distribution);
    
    list(coef=c(model$coefficients, 'Log(Scale)'=log(model$scale)), sigma=model$var);
    
  })
  
  # Coefficient Estimates
  m <- length(models);
  Q <- sapply(models, function(model) model$coef);
  Qm <- matrix(rowMeans(Q), nrow=1);
  
  # Within Variance
  U <- lapply(models, function(model) model$sigma);
  Um <- Reduce('+', U) / m;
  
  # Between variance
  B <- lapply(1:m, function(i) {
    
    Q <- matrix(Q[,i], nrow=1);
    
    t((Q-Qm)) %*% (Q-Qm) / (m-1)
    
  });
  Bm <- Reduce('+', B);
  
  # Total Variance
  Tm <- Um + (1+(1/m))*Bm;
  
  # Save the Total Variance-Covariance matrix
  S <- Tm;
  
  # Provide proper format and naming
  Qm <- as.numeric(Qm);
  names(Qm) <- rownames(Q);
  
  # Confidence intervals for coefficients
  if(conf.int) {
    
    # Extract diagonals of variance matrices to get variance vectors
    Um <- diag(Um);
    Bm <- diag(Bm);
    Tm <- diag(Tm);
    
    # Degrees of freedom
    rm <- (1+(1/m)) * Bm / Um;
    v <- (m-1)*(1+(1/rm))^2;
    
    # Determine 95% confidence interval bounds
    lower <- Qm + qt(p=0.025, df=v)*sqrt(Tm);
    upper <- Qm + qt(p=0.975, df=v)*sqrt(Tm);
    
  }
  
  if(conf.int) {
    return(list(coef=Qm, sigma=S, lower=lower, upper=upper))
  } else {
    return(list(coef=Qm, sigma=S))
  }
}
avgLogisticModel <- function(formula, datasets, conf.int=F) {
  
  # Fit models for each dataset
  models <- lapply(datasets, function(dataset) {
    
    model <- glm(formula=as.formula(formula), data=dataset, family=binomial(link="logit"));
    
    list(coef=c(model$coefficients), sigma=vcov(model));
    
  })
  
  # Coefficient Estimates
  m <- length(models);
  Q <- sapply(models, function(model) model$coef);
  Qm <- matrix(rowMeans(Q), nrow=1);
  
  # Within Variance
  U <- lapply(models, function(model) model$sigma);
  Um <- Reduce('+', U) / m;
  
  # Between variance
  B <- lapply(1:m, function(i) {
    
    Q <- matrix(Q[,i], nrow=1);
    
    t((Q-Qm)) %*% (Q-Qm) / (m-1)
    
  });
  Bm <- Reduce('+', B);
  
  # Total Variance
  Tm <- Um + (1+(1/m))*Bm;
  
  # Save the Total Variance-Covariance matrix
  S <- Tm;
  
  # Provide proper format and naming
  Qm <- as.numeric(Qm);
  names(Qm) <- rownames(Q);
  
  # Confidence intervals for coefficients
  if(conf.int) {
    
    # Extract diagonals of variance matrices to get variance vectors
    Um <- diag(Um);
    Bm <- diag(Bm);
    Tm <- diag(Tm);
    
    # Degrees of freedom
    rm <- (1+(1/m)) * Bm / Um;
    v <- (m-1)*(1+(1/rm))^2;
    
    # Determine 95% confidence interval bounds
    lower <- Qm + qt(p=0.025, df=v)*sqrt(Tm);
    upper <- Qm + qt(p=0.975, df=v)*sqrt(Tm);
    
  }
  
  if(conf.int) {
    return(list(coef=Qm, sigma=S, lower=lower, upper=upper))
  } else {
    return(list(coef=Qm, sigma=S))
  }
  
}

# The 'survivalBackward', 'survivalForward', and 'logisticForward' functions perform a backward or forward selection
# procedure based on a set of basis variables and a set of candidate variables
survivalBackward <- function(datasets, time, event, vars.include, vars.selection, vars.formulas, pvalue, dist) {
  
  # Extract all used columns from the provided input
  columns.used <- sapply(c(vars.include, vars.selection), function(var) {
    formula.complete <- vars.formulas[var];
    formula.splitted <- str_split(string=formula.complete, pattern=fixed("+"), simplify=T);
    formula.nospaces <- str_trim(string=formula.splitted);
  });
  
  # Objects to record process
  vars.unselected <- c();
  continue <- TRUE;
  
  # Backward selection
  while(continue) {
    
    # Get formula according to the remaining backward selection candidates
    formula.current <- str_c(
      str_c(c("Surv(", time, ", ", event, ") ~ "), collapse=""),  
      str_c(vars.formulas[c(vars.include, vars.selection)], collapse=" + ")
    );
    
    # Fit model with backward candidates
    model.current <- avgSurvivalModel(formula=formula.current, datasets=datasets, distribution=dist);
    
    # Get pvalue for each variable
    coefficient.indexes <- lapply(vars.selection, function(var) which(names(model.current$coef) %in% columns.used[[var]]));
    names(coefficient.indexes) <- vars.selection;
    pvalues <- sapply(vars.selection, function(var) {
      testresult <- wald.test(Sigma = model.current$sigma,
                              b = model.current$coef,
                              Terms = coefficient.indexes[[var]]
      )$result$chi2["P"];
      
      unname(testresult)
    }); # sapply pvalues 

    # Get highes p value
    pvalue.max <- max(pvalues);
    
    # If the highest p value is higher than the threshold, remove this one
    if(pvalue.max > pvalue) {
      
      # Get the name of the variable with the highest pvalue
      pvalue.index <- which.max(pvalues);
      
      # Temporary list of selection candidates without the selected variable
      vars.selection.tmp <- vars.selection[-pvalue.index];
      
      # Store unselected candidates
      vars.unselected <- c(vars.unselected, vars.selection[pvalue.index]);
      
      # If no backward selection candidates remain, stop backward selection
      if(length(vars.selection.tmp)<1) {
        
        vars.selection <- vars.selection.tmp;
        continue <- F;
        
      } else {
        
        # Update backward selection candidates and unselected candidates from temporary vectors
        vars.selection <- vars.selection.tmp;
        
      }
      
    } else {
      
      # If no pvalue higher than thershold, stop backward selection
      continue <- F;
      
    }
  } # while continue
  
  # Return outcomes
  return(list(included = c(vars.include, vars.selection),
              excluded = vars.unselected))
  
}
survivalForward <- function(datasets, time, event, vars.include, vars.selection, vars.formulas, pvalue, dist) {
  
  # Extract all used columns from the provided input
  columns.used <- sapply(c(vars.include, vars.selection), function(var) {
    formula.complete <- vars.formulas[var];
    formula.splitted <- str_split(string=formula.complete, pattern=fixed("+"), simplify=T);
    formula.nospaces <- str_trim(string=formula.splitted);
  });

  # Objects to record process
  continue <- TRUE;
  
  # Backward selection
  while(continue) {
    
    # Get pvalue for each variable
    pvalues <- sapply(vars.selection, function(var) {
      
      # Get formula according to the current forward selection candidates
      formula.current <- str_c(
        str_c(c("Surv(", time, ", ", event, ") ~ "), collapse=""),  
        str_c(vars.formulas[c(vars.include, var)], collapse=" + ")
      );
      
      # Fit model with backward candidates
      model.current <- avgSurvivalModel(formula=formula.current, datasets=datasets, distribution=dist);
      
      
      
      # Get pvalue for this variable
      coefficient.indexes <- which(names(model.current$coef) %in% columns.used[[var]])
      test.result <- wald.test(Sigma = model.current$sigma,
                               b = model.current$coef,
                               Terms = coefficient.indexes
      )$result$chi2["P"]
      
      unname(test.result)
      
    })

    # Get lowest p value
    pvalue.min <- min(pvalues);
    
    # If the highest p value is lower than the threshold, add this one
    if(pvalue.min < pvalue) {
      
      # Get the name of the variable with the lowest pvalue
      pvalue.index <- which.min(pvalues);
      
      # Update variable lists
      vars.include <- c(vars.include, vars.selection[pvalue.index]);
      vars.selection <- vars.selection[-pvalue.index];
      
      # If no forward selection candidates remain, stop backward selection
      if(length(vars.selection)<1) {
        
        continue <- F;
        
      } 
      
    } else {
      
      # If no pvalue higher than thershold, stop forward selection
      continue <- F;
      
    }
  } # while !stop
  
  
  # Return outcomes
  return(list(included = vars.include,
              excluded = vars.selection))
  
}
logisticForward <- function(datasets, events, vars.include, vars.selection, vars.formulas, pvalue) {
  
  # Extract all used columns from the provided input
  columns.used <- sapply(c(vars.include, vars.selection), function(var) {
    formula.complete <- vars.formulas[var];
    formula.splitted <- str_split(string=formula.complete, pattern=fixed("+"), simplify=T);
    formula.nospaces <- str_trim(string=formula.splitted);
  });
  
  # Objects to record process
  continue <- TRUE;
  
  # Backward selection
  while(continue) {
    
    # Get pvalue for each variable
    pvalues <- sapply(vars.selection, function(var) {
      
      # Get formula according to the current forward selection candidates
      formula.current <- str_c(
        str_c(c(events, " ~ "), collapse=""),  
        str_c(vars.formulas[c(vars.include, var)], collapse=" + ")
      );
      
      # Fit model with backward candidates
      model.current <- avgLogisticModel(formula=formula.current, datasets=datasets);
      
      
      
      # Get pvalue for this variable
      coefficient.indexes <- which(names(model.current$coef) %in% columns.used[[var]])
      test.result <- wald.test(Sigma = model.current$sigma,
                               b = model.current$coef,
                               Terms = coefficient.indexes
      )$result$chi2["P"]
      
      unname(test.result)
      
    })

    # Get lowest p value
    pvalue.min <- min(pvalues);
    
    # If the highest p value is lower than the threshold, add this one
    if(pvalue.min < pvalue) {
      
      # Get the name of the variable with the lowest pvalue
      pvalue.index <- which.min(pvalues);
      
      # Update variable lists
      vars.include <- c(vars.include, vars.selection[pvalue.index]);
      vars.selection <- vars.selection[-pvalue.index];
      
      # If no forward selection candidates remain, stop backward selection
      if(length(vars.selection)<1) {
        
        continue <- F;
        
      } 
      
    } else {
      
      # If no pvalue higher than thershold, stop forward selection
      continue <- F;
      
    }
  } # while !stop
  
  
  # Return outcomes
  return(list(included = vars.include,
              excluded = vars.selection))
  
}

# The 'fitLogisticModel' and 'fitSurvivalModel' functions are high-level functions that call the different selection
# procedures and return the final model based on the selected set of variables
fitLogisticModel <- function(datasets, events, pvalue, vars.include, vars.selection, vars.formulas) {
  
  result.forward <- logisticForward(datasets = datasets, events = events, vars.include = vars.include, vars.selection = vars.selection, vars.formulas = vars.formulas, pvalue = pvalue);
  
  formula.final <- str_c(
    str_c(c(events, " ~ "), collapse=""),  
    str_c(vars.formulas[result.forward$included], collapse=" + ")
  );
  model.final <- avgLogisticModel(formula = formula.final, datasets = datasets, conf.int = T)
  
  return(list(coef = model.final$coef, sigma = model.final$sigma, lower = model.final$lower, upper = model.final$upper, included = result.forward$included))
  
}
fitSurvivalModel <- function(datasets, time, event, dist, vars.include, vars.backward, vars.forward, vars.formulas, pvalue.backward, pvalue.forward, pvalue.interaction) {
  
  result.backward <- survivalBackward(datasets = datasets, time = time, event = event, dist = dist, vars.include = vars.include, vars.selection = vars.backward, vars.formulas = vars.formulas, pvalue = pvalue.backward)
  
  result.forward <- survivalForward(datasets = datasets, time = time, event = event, dist = dist, vars.include = result.backward$included, vars.selection = c(vars.forward, result.backward$excluded), vars.formulas = vars.formulas, pvalue = pvalue.forward)
  
  vars.include.interaction <- result.forward$included;
  vars.backward.interaction <- str_c(vars.include.interaction, ".Int");
  vars.backward.interaction <- vars.backward.interaction[-which(vars.backward.interaction=="Arm.ChemBev.Int")];
  
  result.interaction <- survivalBackward(datasets = datasets, time = time, event = event, dist = dist, vars.include = vars.include.interaction, vars.selection = vars.backward.interaction, vars.formulas = vars.formulas, pvalue = pvalue.interaction)
  
  formula.final <- str_c(
    str_c(c("Surv(", time, ", ", event, ") ~ "), collapse=""),  
    str_c(vars.formulas[result.interaction$included], collapse=" + ")
  );
  model.final <- avgSurvivalModel(formula = formula.final, datasets = datasets, distribution = "weibull", conf.int = T)
  
  return(list(coef = model.final$coef, sigma = model.final$sigma, lower = model.final$lower, upper = model.final$upper, included = result.interaction$included))
  
}

Step 4.2: defining the input for the multivariable modeling process

The selection procedures require specific input, which is defined here:

  • The formula for each variable, i.e. which columns in the dataset are used
  • The clinical relevance of the variables, i.e. whether they should be always included or are subject to a backward or forward selection procedure
  • Datasets in the appropriate format
# List object that contains an overview of all variables and the columns in the data that 
# they use for first-line progression-free survival (L1)
vars.formulas.L1 <- list(
  
  # Single variables
  L1.PFS.Time = "Time1 + Time2",
  Arm.ChemBev = "ChemBev",
  Age = "Age1 + Age2",
  Gender.Female = "Female.YN",
  Hospital.Private = "Private.YN",
  Stage.Diagnosis = "Stage4.YN",
  Primary.Site = "Site.LeftRectum + Site.Right",
  Metastases.Number = "Metastases.2 + Metastases.3plus",
  Metastases.Liver = "Liver.YN",
  Metastases.Peritoneum = "Peritoneum.YN",
  L1.ECOG = "ECOG.1 + ECOG.23",
  Charlson = "Charlson.1plus",
  CEA.Level = "CEA.200plus",
  RAS.Mutated = "MutantRAS.YN",
  BRAF.Mutated = "MutantBRAF.YN",
  
  
  # Interactions with treatment
  L1.PFS.Time.Int = "Time1.Int + Time2.Int",
  Age.Int = "Age1.Int + Age2.Int",
  Gender.Female.Int = "Female.YN.Int",
  Hospital.Private.Int = "Private.YN.Int",
  Stage.Diagnosis.Int = "Stage4.YN.Int",
  Primary.Site.Int = "Site.LeftRectum.Int + Site.Right.Int",
  Metastases.Number.Int = "Metastases.2.Int + Metastases.3plus.Int",
  Metastases.Liver.Int = "Liver.YN.Int",
  Metastases.Peritoneum.Int = "Peritoneum.YN.Int",
  L1.ECOG.Int = "ECOG.1.Int + ECOG.23.Int",
  Charlson.Int = "Charlson.1plus.Int",
  CEA.Level.Int = "CEA.200plus.Int",
  RAS.Mutated.Int = "MutantRAS.YN.Int",
  BRAF.Mutated.Int = "MutantBRAF.YN.Int"
);

# List object that contains an overview of all variables and the columns in the data that 
# they use for overall survival after first-line progression (LX)
vars.formulas.LX <- list(
  
  # Single variables
  PFS = "PFS.Time",
  Arm.ChemBev = "ChemBev",
  Age = "Age1 + Age2",
  Gender.Female = "Female.YN",
  Hospital.Private = "Private.YN",
  Stage.Diagnosis = "Stage4.YN",
  Primary.Site = "Site.LeftRectum + Site.Right",
  Metastases.Number = "Metastases.2 + Metastases.3plus",
  Metastases.Liver = "Liver.YN",
  Metastases.Peritoneum = "Peritoneum.YN",
  L1.ECOG = "ECOG.1 + ECOG.23",
  Charlson = "Charlson.1plus.YN",
  CEA.Level = "CEA.200plus",
  RAS.Mutated = "MutantRAS.YN",
  BRAF.Mutated = "MutantBRAF.YN",

  # Interactions with treatment
  PFS.Int = "PFS.Time.Int",
  Age.Int = "Age1.Int + Age2.Int",
  Gender.Female.Int = "Female.YN.Int",
  Hospital.Private.Int = "Private.YN.Int",
  Stage.Diagnosis.Int = "Stage4.YN.Int",
  Primary.Site.Int = "Site.LeftRectum.Int + Site.Right.Int",
  Metastases.Number.Int = "Metastases.2.Int + Metastases.3plus.Int",
  Metastases.Liver.Int = "Liver.YN.Int",
  Metastases.Peritoneum.Int = "Peritoneum.YN.Int",
  L1.ECOG.Int = "ECOG.1.Int + ECOG.23.Int",
  Charlson.Int = "Charlson.1plus.YN.Int",
  CEA.Level.Int = "CEA.200plus.Int",
  RAS.Mutated.Int = "MutantRAS.YN.Int",
  BRAF.Mutated.Int = "MutantBRAF.YN.Int"
);


# Indicate which variables should be included and which are subject to selection (backward vs. forward)
vars.include.L1 <- c("Arm.ChemBev", "Age", "L1.ECOG", "BRAF.Mutated");
vars.backward.L1 <- c("Stage.Diagnosis", "Primary.Site", "Metastases.Number", "Metastases.Liver",
                      "Metastases.Peritoneum", "Charlson");
vars.forward.L1 <- c("Gender.Female", "Hospital.Private", "CEA.Level", "RAS.Mutated");

# Indicate which variables should be included and which are subject to selection (backward vs. forward)
vars.include.LX <- c("Arm.ChemBev", "Age", "Primary.Site", "Metastases.Peritoneum", "L1.ECOG", "BRAF.Mutated");
vars.backward.LX <- c("PFS","Stage.Diagnosis", "Metastases.Number", "Metastases.Liver", "Charlson", "CEA.Level");
vars.forward.LX <- c("Gender.Female", "Hospital.Private", "RAS.Mutated");

# Transform cleaned datasets into appropriate format
datasets.surv.L1 <- lapply(datasets, function(dataset) getSurvDataset.L1(dataset));
datasets.surv.LX <- lapply(datasets, function(dataset) getSurvDataset.LX(filter(dataset, LX.OS.Include)));

Step 4.3: fitting the multivariable models

After defining all functions and input required for fitting the (survival) models, all procedures can be performed.

Model L1-Time

# Fit Model L1-Time
Model.L1.Time <- fitSurvivalModel(datasets = datasets.surv.L1, time = "Time", event = "Event", dist = "weibull", vars.include = vars.include.L1, vars.backward = vars.backward.L1, vars.forward = vars.forward.L1, vars.formulas = vars.formulas.L1, pvalue.backward = 0.3, pvalue.forward = 0.2, pvalue.interaction = 0.3);

# Save and print Model L1-Time
save(x=Model.L1.Time, file="object Model.L1.Time 20190201.RData");
write.table(cbind(Coef=Model.L1.Time$coef, Lower=Model.L1.Time$lower, Upper=Model.L1.Time$upper), file="Model L1-Time 20190201.txt", sep=";");
cbind(Coefficient=Model.L1.Time$coef, Lower.Bound=Model.L1.Time$lower, Upper.Bound=Model.L1.Time$upper);
##                   Coefficient   Lower.Bound  Upper.Bound
## (Intercept)      4.5352456627  3.6607544949  5.409736831
## ChemBev          0.4903821144  0.2176428890  0.763121340
## Age1             0.0467445556  0.0193879515  0.074101160
## Age2            -0.0003818097 -0.0006122064 -0.000151413
## ECOG.1          -0.4420102362 -0.7086076563 -0.175412816
## ECOG.23         -0.4466416971 -0.7601591259 -0.133124268
## MutantBRAF.YN   -0.1586860032 -0.3946605642  0.077288558
## Site.LeftRectum -0.0320105967 -0.2807102482  0.216689055
## Site.Right      -0.2048020078 -0.4594543905  0.049850375
## Liver.YN        -0.1006873497 -0.2196800253  0.018305326
## Peritoneum.YN   -0.0967462377 -0.2154643846  0.021971909
## CEA.200plus     -0.0349843979 -0.3386554012  0.268686605
## Private.YN       0.3951723972  0.1640512674  0.626293527
## MutantRAS.YN    -0.0958525980 -0.2360036442  0.044298448
## ECOG.1.Int       0.2336980952 -0.0577437169  0.525139907
## ECOG.23.Int     -0.0886445670 -0.4773567723  0.300067638
## CEA.200plus.Int -0.2021634520 -0.5296258343  0.125298930
## Private.YN.Int  -0.4110088123 -0.6653954483 -0.156622176
## Log(Scale)      -0.3324565000 -0.3850521961 -0.279860804

Model L1-Event

# Fit Model L1-Event
vars.all.L1 <- c(vars.include.L1, vars.backward.L1, vars.forward.L1);
vars.all.L1 <- vars.all.L1[vars.all.L1!="Arm.ChemBev"];
Model.L1.Event <- fitLogisticModel(datasets = datasets.surv.L1, events = "Events.01", vars.include = c("L1.PFS.Time", "Arm.ChemBev"),  vars.selection = c(vars.all.L1, str_c(vars.all.L1, ".Int"), "L1.PFS.Time.Int"), vars.formulas = vars.formulas.L1, pvalue = 0.20);

# Save and print Model L1-Event
save(x=Model.L1.Event, file="object Model.L1.Event 20190201.RData")
write.table(cbind(Coef=Model.L1.Event$coef, Lower=Model.L1.Event$lower, Upper=Model.L1.Event$upper), file="Model L1-Event 20190201.txt", sep=";");
cbind(Coefficient=Model.L1.Event$coef, Lower.Bound=Model.L1.Event$lower, Upper.Bound=Model.L1.Event$upper);
##                       Coefficient   Lower.Bound   Upper.Bound
## (Intercept)         -2.069855e+00 -3.011333e+00 -1.128376e+00
## Time1               -1.871496e-04 -8.169673e-03  7.795374e-03
## Time2               -2.561821e-06 -1.577255e-05  1.064891e-05
## ChemBev              1.738231e-01 -1.229186e+00  1.576832e+00
## Charlson.1plus       6.461158e-01  1.593857e-01  1.132846e+00
## Site.LeftRectum.Int -1.462309e+00 -2.459083e+00 -4.655346e-01
## Site.Right.Int      -1.469024e+00 -2.529350e+00 -4.086988e-01
## Time1.Int            2.267294e-03 -6.025885e-03  1.056047e-02
## Time2.Int            2.566552e-06 -1.072477e-05  1.585787e-05

Model LX-Time

# Fit Model LX-Time
Model.LX.Time <- fitSurvivalModel(datasets = datasets.surv.LX, time = "Time", event = "Event", dist = "weibull", vars.include = vars.include.LX, vars.backward = vars.backward.LX, vars.forward = vars.forward.LX, vars.formulas = vars.formulas.LX, pvalue.backward = 0.3, pvalue.forward = 0.2, pvalue.interaction = 0.3)

# Save and print Model LX-Time
save(x=Model.LX.Time, file="object Model.LX.Time 20190201.RData")
write.table(cbind(Coef=Model.LX.Time$coef, Lower=Model.LX.Time$lower, Upper=Model.LX.Time$upper), file="Model LX-Time 20190201.txt", sep=";");
cbind(Coefficient=Model.LX.Time$coef, Lower.Bound=Model.LX.Time$lower, Upper.Bound=Model.LX.Time$upper);
##                     Coefficient   Lower.Bound   Upper.Bound
## (Intercept)        3.2891795888  2.2620579298  4.316301e+00
## ChemBev            0.8818385546 -0.2812537861  2.044931e+00
## Age1              -0.0251070916 -0.0370240389 -1.319014e-02
## Age2              -0.0005670460 -0.0011710130  3.692101e-05
## Site.LeftRectum   -0.1579432364 -0.5383176339  2.224312e-01
## Site.Right        -0.2365318069 -0.6211753023  1.481117e-01
## Peritoneum.YN      0.1513874151 -0.1773574308  4.801323e-01
## ECOG.1            -0.2762104282 -0.6659711458  1.135503e-01
## ECOG.23           -1.2022780055 -1.6308052807 -7.737507e-01
## MutantBRAF.YN      0.2745640877 -0.3312597292  8.803879e-01
## PFS.Time           0.5751723494  0.4054450540  7.448996e-01
## Stage4.YN         -0.5243148245 -0.9757145591 -7.291509e-02
## Charlson.1plus.YN  0.1044981510 -0.0565984893  2.655948e-01
## MutantRAS.YN       0.1091776800 -0.3022287934  5.205842e-01
## Private.YN         0.1053037877 -0.0458828408  2.564904e-01
## Age1.Int           0.0174056852  0.0034559505  3.135542e-02
## Age2.Int           0.0004030007 -0.0003232005  1.129202e-03
## Peritoneum.YN.Int -0.2386910341 -0.6130196513  1.356376e-01
## ECOG.1.Int        -0.0930659837 -0.5165916222  3.304597e-01
## ECOG.23.Int        0.3115761369 -0.2210811948  8.442335e-01
## MutantBRAF.YN.Int -0.6795392885 -1.3768905224  1.781195e-02
## PFS.Time.Int      -0.1486793160 -0.3474887968  5.013016e-02
## Stage4.YN.Int      0.4904170757 -0.0221555921  1.002990e+00
## MutantRAS.YN.Int  -0.4071664687 -0.8555996074  4.126667e-02
## Log(Scale)        -0.1525148212 -0.2153760094 -8.965363e-02

Step 5: Model Validation

Internal validation is performed to assess the performanc of the models. This exists of five steps:

  1. Defining functions used to perform the validation
  2. Graphical assessment of internal validity
  3. Calculating the apparent model performance
  4. Performing bootstrapping to calculate the optimism
  5. Correcting the apparent performance for the optimism

For survival models Model L1-Time and Model LX-Time, several validation plots were assessed and model discrimination was assessed by Uno et al.’s modification of Harrel’s C-statistic to account for censoring. Two measures were used to assess survival model calibration: 1) the regression slope of the predicted Weibull scale parameter in a Cox-regression model, and 2) Demler et al.’s Greenwood-Nam-D’Agostino statistic (GND), which is a modification of Hosmer-Lemeshow’s statistic by Nam and D’Agostino and accounts for censoring using Greenwood’s variance estimator. Both C-statistic and GND are assessed at specific timepoints, which were selected to be at 0.5, 1, and 2 years. Additionally, GND requires a number of groups to be specified in which patients are stratified according to their predicted risk, for which 5 and 10 were chosen to assess impact of this arbitrary decision.

Internal validity of logistic regression Model L1-Event was assessed by excluding patients for which no event was observed and using Harrel’s C-statistic for model discrimination. To assess calibration, the regression slope of the predicted response of Model L1-Event in a logistic regression model and the Brier score were used.

Bootstrapping was performed to determine the optimism and obtain sets of models to perform the probabilistic sensitivity analysis for the discrete event simulation model.

Step 5.1: defining functions used to perform the validation

Several custom functions are defined to make predictions using the fitted models and to calculate the statistics used. Please find an explanation of the functions in the R code below.

# The 'predictSurvmod', 'predictSurvmod.response', 'predictSurvmod.lp', 'predictSurvmod.p', 'predictLogreg',
# and 'predictLogreg.lp' are custom functions to make predictions using the fitted models
predictSurvmod <- function(model, newdata) {
  
  # Add a dummy column for the intercept
  newdata <- cbind(newdata, '(Intercept)'=1);
  
  # Ge the Weibull Shape parameter from the model
  Shape <- 1/exp(model["Log(Scale)"]);
  model <- model[-which(names(model) == "Log(Scale)")];
  
  # Subset a dataframe to enable matrix multiplication and check
  newdata.sub <- as.matrix(newdata[, names(model), drop=F]);
  
  # Get the scale parameter for the survival model
  Scale <- exp(newdata.sub %*% model);

  # Predict the times
  p <- matrix(runif(nrow(newdata.sub)), ncol=1);
  t <- qweibull(p=p, shape=Shape, scale=Scale);

  # Return predicted times
  return(t);
  
  
}
predictSurvmod.response <- function(model, newdata) {
  
  # Add a dummy column for the intercept
  newdata <- cbind(newdata, '(Intercept)'=1);
  
  # Ge the Weibull Shape parameter from the model
  Shape <- 1/exp(model["Log(Scale)"]);
  model <- model[-which(names(model) == "Log(Scale)")];
  
  # Subset a dataframe to enable matrix multiplication and check
  newdata.sub <- as.matrix(newdata[, names(model), drop=F]);
  
  # Get the scale parameter for the survival model
  Scale <- exp(newdata.sub %*% model);
  
  # Return predicted times
  return(Scale);
  
  
}
predictSurvmod.lp <- function(model, newdata) {
  
  # Add a dummy column for the intercept
  newdata <- cbind(newdata, '(Intercept)'=1);
  
  # Ge the Weibull Shape parameter from the model
  Shape <- 1/exp(model["Log(Scale)"]);
  model <- model[-which(names(model) == "Log(Scale)")];
  
  # Subset a dataframe to enable matrix multiplication and check
  newdata.sub <- as.matrix(newdata[, names(model), drop=F]);
  
  # Get the scale parameter for the survival model
  LP <- newdata.sub %*% model;
  
  # Return predicted times
  return(LP);
  
  
}
predictSurvmod.p <- function(model, newdata, q) {
  
  # Add a dummy column for the intercept
  newdata <- cbind(newdata, '(Intercept)'=1);
  
  # Ge the Weibull Shape parameter from the model
  Shape <- 1/exp(model["Log(Scale)"]);
  model <- model[-which(names(model) == "Log(Scale)")];
  
  # Subset a dataframe to enable matrix multiplication and check
  newdata.sub <- as.matrix(newdata[, names(model), drop=F]);
  
  # Get the scale parameter for the survival model
  Scale <- exp(newdata.sub %*% model);

  # Predict the times
  q <- matrix(q, nrow=nrow(newdata), ncol=length(q), byrow=T);
  p <- pweibull(q=q, shape=Shape, scale=Scale)
  
  # Return predicted times
  return(p);
  
  
}
predictLogreg <- function(model, newdata) {
  
  # Add a dummy column for the intercept
  newdata <- cbind(newdata, '(Intercept)'=1);

  # Subset a dataframe to enable matrix multiplication and check
  newdata.sub <- as.matrix(newdata[, names(model), drop=F]);
  
  # Calculate the probabilities
  p.tmp <- exp(newdata.sub %*% model);
  p <- p.tmp / (p.tmp + 1);
  
  # Return probabilities
  return(as.vector(p));
  
}
predictLogreg.lp <- function(model, newdata) {
  
  # Add a dummy column for the intercept
  newdata <- cbind(newdata, '(Intercept)'=1);
  
  # Subset a dataframe to enable matrix multiplication and check
  newdata.sub <- as.matrix(newdata[, names(model), drop=F]);
  
  # Calculate the probabilities
  p.tmp <- newdata.sub %*% model;
  
  # Return probabilities
  return(p.tmp);
  
}

# The 'testGND' is used to calculate the Greenwood-Nam-D'Agostino statistic as proposed by the authors and
# with some additional error handlers included and an option to plot an illustration
testGND <- function(model, data, time, n.bin=5, min.event=5, plot.graph=F, y.max=1) {
  
  time.in.days <- time*365.25;
  
  # Predict event probability for each individual
  pt <- predictSurvmod.p(model=model, newdata=data, q=time.in.days);
  
  # Bin individuals according to risk of event
  pt.q <- quantile(pt, seq(0, 1, length.out=n.bin+1));
  b <- sapply(pt, function(p) ifelse(p==max(pt.q), n.bin, which(p<pt.q)[1]-1));
  
  # Calculate average probability for each group
  ptg <- tapply(pt, b, mean);
  
  # Determine Kaplan-Meier estimates
  KMg <- t(sapply(1:n.bin, function(bin) {
    
    # Get subset of data for the bin
    data.sub <- data[b==bin, ];
    
    # Check whether sufficient events are observed
    if(sum(data.sub$Event==1)<min.event) stop("Insufficient events observed!");
    
    # Estimate Kaplan-Meier for the bin
    KM <- survfit(Surv(Time, Event) ~ 1, data=data.sub, type="kaplan-meier");
    
    # Extract the estimates:
    # p       event probability (1 - km survival estimate)
    # var     variance of km survival estimate
    # nrisk   number at risk
    if(time.in.days<min(KM$time)) {
      p <- 1;
      var <- 0;
    } else {
      p <- KM$surv[tail(which(KM$time<=time.in.days), n=1)];
      p <- ifelse(p==0, KM$surv[tail(which(KM$time<=time.in.days), n=1)-1], p);
      err <- KM$std.err[tail(which(KM$time<=time.in.days), n=1)];
      err <- ifelse(err==Inf, KM$std.err[tail(which(KM$time<=time.in.days), n=1)-1], err);
      var <- (err * p)^2;
    }
    
    
    # Return estimates
    c(p=1-p, var=var)
    
  }))
  
  
  
  # Calculate test statistic
  GND.chisq.sep <- (KMg[,"p"]-ptg)^2 / KMg[,"var"];
  GND.chisq <- sum(GND.chisq.sep);
  GND.df <- n.bin - 1;
  GND.p <- 1 - pchisq(q=GND.chisq, df=GND.df);
  
  if(plot.graph) {
    
    barplot(rbind(ptg, KMg[, "p"]), beside=T, ylim=c(0,y.max), main=paste0("Event before ",time, "-year (p = ",round(GND.p,3),")"), ylab="Probability", names.arg=str_c("n=",table(b)), las=1, legend.text=c("Survival Model", "Kaplan-Meier"), args.legend=list(x="top", inset=c(0,-0.0), ncol=2, bty='n'))
    
  }
  
  # Return results
  return(c(chisq=GND.chisq, df=GND.df, p=GND.p));
  
}

# The 'getPerformanceSurvmod' and 'getPerformanceLogreg' functions are high-level functions that call all 
# functions used to calculate statistics used to assess the models' performance
getPerformanceSurvmod <- function(model, data.train, data.test, timepoints=c(0.5, 1, 2)) {
  
  library(survAUC);
  
  # c-statistic
  c.statistic <- sapply(timepoints, function(tt) {
    
    # Make negative because a higher scale indicates a higher time-to-event, i.e. lower risk
    pred.lp <- -predictSurvmod.lp(model=model, newdata=data.test);
    surv.train <- Surv(time=data.train$Time, event=data.train$Event);
    surv.test <- Surv(time=data.test$Time, event=data.test$Event);
    
    UnoC(Surv.rsp=surv.train, Surv.rsp.new=surv.test, lpnew=pred.lp, time=tt*365.25);
    
  });
  names(c.statistic) <- str_c("c.statistic_", timepoints, "-year");
  
  # Regression Slope
  data.test$scale <- predictSurvmod.response(model, data.test);
  slope.scale <- exp(coxph(Surv(Time, Event) ~ scale, data=data.test)$coefficients);

  # Hosmer-Lemeshow
  HL.5bins <- sapply(timepoints, function(tt) {
    testGND(model=model, data=data.test, time=tt, n.bin=5, plot=F)["chisq"]
  });
  HL.10bins <- sapply(timepoints, function(tt) {
    testGND(model=model, data=data.test, time=tt, n.bin=10, plot=F)["chisq"]
  });
  names(HL.5bins) <- str_c("HL.5bins_", timepoints, "-year");
  names(HL.10bins) <- str_c("HL.10bins_", timepoints, "-year");
  
  # Return all performance measures
  return(c(c.statistic, Slope.Scale = unname(slope.scale), HL.5bins, HL.10bins))
  
}
getPerformanceLogreg <- function(model, data) {
  
  data.sub <- data[data$Events!="Censored", ];
  
  obs <- ifelse(data.sub$Events=="Death", 1,
                ifelse(data.sub$Events=="Progression", 0, NA));
  
  pred.prob <- predictLogreg(model=model, newdata=data.sub);
  pred.lp <- predictLogreg.lp(model=model, newdata=data.sub);
  
  # c-statistic
  pairs <- expand.grid(1:length(obs),
                       1:length(obs));
  c_n <- sum(ifelse(obs[pairs[,1]]==1 & obs[pairs[,2]]==0, 1, 0) * ifelse(pred.prob[pairs[,1]]>pred.prob[pairs[,2]], 1, 0));
  c_d <- sum(ifelse(obs[pairs[,1]]==1 & obs[pairs[,2]]==0, 1, 0));
  c.statistic <- c_n/c_d;
  
  # Regression slope
  tmp.model <- glm(obs ~ pred.lp, family=binomial(link="logit"));
  regression.slope <- tmp.model$coefficients["pred.lp"];
  
  # Brier score
  Brier.score <- sum((pred.prob-obs)^2)/length(obs);
  
  # Return all performance measures
  return(c(C.Statistic = c.statistic, Regression.Slope = unname(regression.slope), Brier.Score = Brier.score))
  
}

Step 5.2: graphical assessment of internal validity

Both Kaplan-Meier plots for different patient subgroups and illustrations of the Greenwood-Nam-D’Agostino statistic were used to graphically assess the performance of the survival models.

Model L1-Time - Kaplan-Meier Plots

# Custom function to plot predited vs. observed Kaplan-Meier curves for both treatment strategies
plotKM <- function(model, data, class.column, time.max=1500, time.step=10, add.main=NULL) {
  
  km.fit <- survfit(as.formula(str_c("Surv(Time, Event) ~ ",class.column)), data=data, type="kaplan-meier")
  
  classes <- sort(unlist(unique(data[, class.column])));
  n.classes <- length(classes);
  n.obs.classes <- table(data[, class.column]);
  
  q.values <- seq(1, time.max, time.step);
  
  sim <- sapply(classes, function(class) {
    surv.mat <- 1 - predictSurvmod.p(model, data[data[, class.column]==as.character(class), ], q=q.values);
    surv.mean <- apply(surv.mat, 2, mean);
    
    surv.mean
  })
  
  cols <- rainbow(n.classes);
  if(is.null(add.main)) {
    main.text <- class.column;
  } else {
    main.text <- str_c(class.column, "\n(", add.main,")")
  }
  plot(km.fit, col=cols, lty=1, lwd=2, xlim=c(0,time.max), main=main.text, xlab="Time to Progression or Death (Days)")
  for(i in 1:n.classes) {
    lines(x=q.values, y=sim[,i], col=cols[i], lty=2, lwd=2)
  }
  legend("right", legend=c(str_c(classes, " (obs) [n=",n.obs.classes,"]"), str_c(classes, " (sim)")),
         col=c(cols, cols), lty=c(rep(1, n.classes), rep(2, n.classes)), lwd=2)
  
}

# Model L1-Time
set.seed(20181211);
random.dataset <- sample(1:length(datasets), 1);
data <- datasets.surv.L1[[random.dataset]];

# Kaplan-Meier curves for the following groups
# - The complete cohort
# - Patient with a left-sided primary tumour or rectum primary tumour
# - Patient with a right-sided primary tumour
# - Patients with an primary tumour in their colon but not otherwise specified (NOS) or unkown primary tumour site
par(mfrow=c(2,2));
plotKM(model=Model.L1.Time$coef, data=data, class.column="ChemBev")
plotKM(model=Model.L1.Time$coef, data=filter(data, Site.LeftRectum==1), class.column="ChemBev", add.main="Left or Rectum")
plotKM(model=Model.L1.Time$coef, data=filter(data, Site.Right==1), class.column="ChemBev", add.main="Right")
plotKM(model=Model.L1.Time$coef, data=filter(data, Site.LeftRectum==0, Site.Right==0), class.column="ChemBev", add.main="Other")

# Kaplan-Meier curves for the following groups
# - Female patients
# - Male patients
# - Patients with metastatic disease at diagnosis
# - Patients without metastatic disease at diagnosis
par(mfrow=c(2,2));
plotKM(model=Model.L1.Time$coef, data=filter(data, Female.YN==1), class.column="ChemBev", add.main="Female")
plotKM(model=Model.L1.Time$coef, data=filter(data, Female.YN==0), class.column="ChemBev", add.main="Male")
plotKM(model=Model.L1.Time$coef, data=filter(data, Stage4.YN==1), class.column="ChemBev", add.main="Stage 4 at Diagnosis")
plotKM(model=Model.L1.Time$coef, data=filter(data, Stage4.YN==0), class.column="ChemBev", add.main="NOT Stage 4 at Diagnosis")

# Kaplan-Meier curves for the following groups
# - Patients with an ECOG Performance Status of 2 or 3
# - Patients with an ECOG Performance Status of 1
# - Patients with an ECOG Performance Status of 0
par(mfrow=c(2,2));
plotKM(model=Model.L1.Time$coef, data=filter(data, ECOG.23==1), class.column="ChemBev", add.main="ECOG 23")
plotKM(model=Model.L1.Time$coef, data=filter(data, ECOG.1==1), class.column="ChemBev", add.main="ECOG 1")
plotKM(model=Model.L1.Time$coef, data=filter(data, ECOG.23==0, ECOG.1==0), class.column="ChemBev", add.main="ECOG 0")

Model L1-Time - GND Test Illustration

# Graphical representation of the Greenwood-Nam-D'Agostino statistic with 5 risk groups
par(mfrow=c(2,2));
testGND(model=Model.L1.Time$coef, data=data, time=0.5, n.bin=5, plot.graph=T);
##     chisq        df         p 
## 4.4991559 4.0000000 0.3426476
testGND(model=Model.L1.Time$coef, data=data, time=1, n.bin=5, plot.graph=T);
##     chisq        df         p 
## 2.4457211 4.0000000 0.6543809
testGND(model=Model.L1.Time$coef, data=data, time=2, n.bin=5, plot.graph=T);
##     chisq        df         p 
## 2.0226615 4.0000000 0.7315906

# Graphical representation of the Greenwood-Nam-D'Agostino statistic with 10 risk groups
par(mfrow=c(2,2));
testGND(model=Model.L1.Time$coef, data=data, time=0.5, n.bin=10, plot.graph=T);
##     chisq        df         p 
## 7.1396005 9.0000000 0.6225876
testGND(model=Model.L1.Time$coef, data=data, time=1, n.bin=10, plot.graph=T);
##     chisq        df         p 
## 8.1382488 9.0000000 0.5202764
testGND(model=Model.L1.Time$coef, data=data, time=2, n.bin=10, plot.graph=T);
##     chisq        df         p 
## 5.2879510 9.0000000 0.8085192

# Graphical representation of the Greenwood-Nam-D'Agostino statistic as included in the manuscript
png(file="GND Model L1-Time 20190201.png", width=12, height=6, units="in", res=96);
par(mfrow=c(1,2));
testGND(model=Model.L1.Time$coef, data=data, time=1, n.bin=5, plot.graph=T);
##     chisq        df         p 
## 2.4457211 4.0000000 0.6543809
testGND(model=Model.L1.Time$coef, data=data, time=1, n.bin=10, plot.graph=T);
##     chisq        df         p 
## 8.1382488 9.0000000 0.5202764
dev.off();
## quartz_off_screen 
##                 2

Model LX-Time - Kaplan-Meier Plots

# Model LX-Time
data <- datasets.surv.LX[[random.dataset]];

# Kaplan-Meier curves for the following groups
# - The complete cohort
# - Patient with a left-sided primary tumour or rectum primary tumour
# - Patient with a right-sided primary tumour
# - Patients with an primary tumour in their colon but not otherwise specified (NOS) or unkown primary tumour site
par(mfrow=c(2,2));
plotKM(model=Model.LX.Time$coef, data=data, class.column="ChemBev")
plotKM(model=Model.LX.Time$coef, data=filter(data, Site.LeftRectum==1), class.column="ChemBev", add.main="Left or Rectum")
plotKM(model=Model.LX.Time$coef, data=filter(data, Site.Right==1), class.column="ChemBev", add.main="Right")
plotKM(model=Model.LX.Time$coef, data=filter(data, Site.LeftRectum==0, Site.Right==0), class.column="ChemBev", add.main="Other")

# Kaplan-Meier curves for the following groups
# - Female patients
# - Male patients
# - Patients with metastatic disease at diagnosis
# - Patients without metastatic disease at diagnosis
par(mfrow=c(2,2));
plotKM(model=Model.LX.Time$coef, data=filter(data, Female.YN==1), class.column="ChemBev", add.main="Female")
plotKM(model=Model.LX.Time$coef, data=filter(data, Female.YN==0), class.column="ChemBev", add.main="Male")
plotKM(model=Model.LX.Time$coef, data=filter(data, Stage4.YN==1), class.column="ChemBev", add.main="Stage 4 at Diagnosis")
plotKM(model=Model.LX.Time$coef, data=filter(data, Stage4.YN==0), class.column="ChemBev", add.main="NOT Stage 4 at Diagnosis")

# Kaplan-Meier curves for the following groups
# - Patients with an ECOG Performance Status of 2 or 3
# - Patients with an ECOG Performance Status of 1
# - Patients with an ECOG Performance Status of 0
par(mfrow=c(2,2));
plotKM(model=Model.LX.Time$coef, data=filter(data, ECOG.23==1), class.column="ChemBev", add.main="ECOG 23")
plotKM(model=Model.LX.Time$coef, data=filter(data, ECOG.1==1), class.column="ChemBev", add.main="ECOG 1")
plotKM(model=Model.LX.Time$coef, data=filter(data, ECOG.23==0, ECOG.1==0), class.column="ChemBev", add.main="ECOG 0")

Model LX-Time - GND Test Illustration

# Graphical representation of the Greenwood-Nam-D'Agostino statistic with 5 risk groups
par(mfrow=c(2,2));
testGND(model=Model.LX.Time$coef, data=data, time=0.5, n.bin=5, plot.graph=T);
##      chisq         df          p 
## 8.62654164 4.00000000 0.07114301
testGND(model=Model.LX.Time$coef, data=data, time=1, n.bin=5, plot.graph=T);
##     chisq        df         p 
## 2.8534511 4.0000000 0.5826418
testGND(model=Model.LX.Time$coef, data=data, time=2, n.bin=5, plot.graph=T);
##     chisq        df         p 
## 7.4139126 4.0000000 0.1155658

# Graphical representation of the Greenwood-Nam-D'Agostino statistic with 10 risk groups
par(mfrow=c(2,2));
testGND(model=Model.LX.Time$coef, data=data, time=0.5, n.bin=10, plot.graph=T);
##       chisq          df           p 
## 16.73744515  9.00000000  0.05298987
testGND(model=Model.LX.Time$coef, data=data, time=1, n.bin=10, plot.graph=T);
##     chisq        df         p 
## 4.2555980 9.0000000 0.8937987
testGND(model=Model.LX.Time$coef, data=data, time=2, n.bin=10, plot.graph=T);
##     chisq        df         p 
## 9.5251554 9.0000000 0.3902742

# Graphical representation of the Greenwood-Nam-D'Agostino statistic as included in the manuscript
png(file="GND Model LX-Time 20190201.png", width=12, height=6, units="in", res=96);
par(mfrow=c(1,2));
testGND(model=Model.LX.Time$coef, data=data, time=1, n.bin=5, plot.graph=T);
##     chisq        df         p 
## 2.8534511 4.0000000 0.5826418
testGND(model=Model.LX.Time$coef, data=data, time=1, n.bin=10, plot.graph=T);
##     chisq        df         p 
## 4.2555980 9.0000000 0.8937987
dev.off();
## quartz_off_screen 
##                 2

Step 5.3: calculating the apparent model performance

Using the previously defined functions, the uncorrected, biased apparent performance of the models is calculated, so that these can be corrected for the optimism as estimated from the bootstrapping later on. The apparent performance is calculated for each imputed datasets separately and averaged to obtain a single apparent performance outcome.

# General settings
timepoints <- c(0.5, 1, 2);

# Model L1-Event
Dapp.Model.L1.Time <- rowMeans(
  sapply(datasets.surv.L1, function(dataset) {
    getPerformanceSurvmod(model = Model.L1.Time$coef, data.train = dataset, data.test = dataset, timepoints = timepoints)
  })
);

# Observe apparent performance
Dapp.Model.L1.Time;
## c.statistic_0.5-year   c.statistic_1-year   c.statistic_2-year 
##            0.6591775            0.6348088            0.6287258 
##          Slope.Scale    HL.5bins_0.5-year      HL.5bins_1-year 
##            0.9958962            4.0929190            4.3369164 
##      HL.5bins_2-year   HL.10bins_0.5-year     HL.10bins_1-year 
##            0.9944051            7.4060601           10.6726626 
##     HL.10bins_2-year 
##            5.5885418
# Model L1-Event
Dapp.Model.L1.Event <- rowMeans(
  sapply(datasets.surv.L1, function(dataset) {
    getPerformanceLogreg(model = Model.L1.Event$coef, data = dataset)
  })
);

# Observe apparent performance
Dapp.Model.L1.Event;
##      C.Statistic Regression.Slope      Brier.Score 
##       0.68958824       0.99974239       0.08674998
# Model LX-Event
Dapp.Model.LX.Time <- rowMeans(
  sapply(datasets.surv.LX, function(dataset) {
    getPerformanceSurvmod(model = Model.LX.Time$coef, data.train = dataset, data.test = dataset, timepoints = timepoints)
  })
);

# Observe apparent performance
Dapp.Model.LX.Time;
## c.statistic_0.5-year   c.statistic_1-year   c.statistic_2-year 
##            0.7174054            0.6933501            0.6844447 
##          Slope.Scale    HL.5bins_0.5-year      HL.5bins_1-year 
##            0.9971058            8.1409540            1.5974814 
##      HL.5bins_2-year   HL.10bins_0.5-year     HL.10bins_1-year 
##            4.9123090           15.9033123            6.6292335 
##     HL.10bins_2-year 
##            7.6692321

Step 5.4: performing bootstrapping to calculate the optimism

Bootstrapping was performed to assess the optimism of the (survival) models and to obtain sets of correlared model parameters to be used to perform a probabilistic sensitivity analysis for the discrete event simulation. In each bootstrap iteration, the following actions are performed:

  • Generate a bootstrap sample of patient identification numbers
  • For each of the imputed datasets:
    • Get the patients selected in the bootstrap sample from the imputed dataset
    • Transform the data in the appropriate format for analysis
    • Fit the (survival) models according to the defined structures
    • Assess the performance and optimism of the models
    • Return the results for the imputed dataset
  • Combine the results for the separate imputed datasets into one average model and set of outcomes
# Setting for the bootstrapping
n.boot <- 10000;         # number of bootstrap iterations
set.seed(20181211);     # random number seed for reproduction

# Define the model structure that needs to be fitted for the models in each bootstrap iteration
Model.L1.Time.formula <- str_c(
  str_c(c("Surv(Time, Event) ~ "), collapse=""),  
  str_c(vars.formulas.L1[Model.L1.Time$included], collapse=" + ")
);
Model.L1.Event.formula <- str_c(
  str_c(c("Events.01 ~ "), collapse=""),  
  str_c(vars.formulas.L1[Model.L1.Event$included], collapse=" + ")
);
Model.LX.Time.formula <- str_c(
  str_c(c("Surv(Time, Event) ~ "), collapse=""),  
  str_c(vars.formulas.LX[Model.LX.Time$included], collapse=" + ")
);

# Set up a multithreaded computation cluster
library(parallel);
library(doSNOW);
cl <- makeCluster(detectCores());
clusterExport(cl, ls());

# Perform the bootstrap
out.bootstrap <- t(parSapply(cl, 1:n.boot, function(i.boot) {
  
  # Load packages
  library(dplyr);
  library(stringr);
  library(survival);
  library(aod)
  library(survAUC);
  
  # Initialize an output object and an object to track whether any errors occur
  out <- NULL;
  count.errors <- -1;
  
  while(is.null(out)) {
    
    # An error handler is used to make sure that an a set of outcomes is obtained for each bootstrap iteration
    # If an error occurs, this is recorded, so that it is known whether this occurs and, if so, how often
    out <- tryCatch({
      
      # Increase the error counter (starts at -1 so if no errors occur the counter will be zero)
      count.errors <- count.errors + 1;
      
      # Bootstrap sample of patients
      boot.sample <- sample(x=1:nrow(datasets[[1]]), size=nrow(datasets[[1]]), replace=T);
      
      # For each imputed dataset
      out <- t(sapply(1:length(datasets), function(i.dataset) {
        
        ## Get bootstrap sample
        # Get datasets from imputed dataset
        dataset <- datasets[[i.dataset]];
        dataset.L1 <- dataset;
        dataset.LX <- filter(dataset, LX.OS.Include);
        boot.dataset.L1 <- dataset[boot.sample, , drop=F];
        boot.dataset.LX <- filter(boot.dataset.L1, LX.OS.Include);
        
        # Transform data in appropriate format
        dataset.surv.L1 <- getSurvDataset.L1(data=dataset.L1);
        dataset.surv.LX <- getSurvDataset.LX(data=dataset.LX);
        boot.dataset.surv.L1 <- getSurvDataset.L1(data=boot.dataset.L1);
        boot.dataset.surv.LX <- getSurvDataset.LX(data=boot.dataset.LX);
        
        
        ## Fit the models
        # Model L1-Time
        boot.Model.L1.Time <- survreg(formula = as.formula(Model.L1.Time.formula), data=boot.dataset.surv.L1, dist="weibull");
        boot.survmod.L1 <- c(boot.Model.L1.Time$coefficients, 'Log(Scale)'=log(boot.Model.L1.Time$scale));
        
        # Model L1-Event
        boot.Model.L1.Event <- glm(formula = as.formula(Model.L1.Event.formula), data=boot.dataset.surv.L1, family=binomial(link="logit"));
        boot.logreg.L1 <- boot.Model.L1.Event$coefficients;
        
        # Model LX-Time
        boot.Model.LX.Time <- survreg(formula = as.formula(Model.LX.Time.formula), data=boot.dataset.surv.LX, dist="weibull");
        boot.survmod.LX <- c(boot.Model.LX.Time$coefficients, 'Log(Scale)'=log(boot.Model.LX.Time$scale));
        
        
        ## Performance Model L1-Time
        # Performance in bootstrap sample
        perf.surv.L1.boot <- getPerformanceSurvmod(model = boot.survmod.L1, data.train = boot.dataset.surv.L1, data.test = boot.dataset.surv.L1, timepoints = timepoints)
        
        # Performance in original (imputed) dataset
        perf.surv.L1.origin <- getPerformanceSurvmod(model = boot.survmod.L1, data.train = boot.dataset.surv.L1, data.test = dataset.surv.L1, timepoints = timepoints); 
        
        # Calculate the optimism
        optimism.surv.L1 <- perf.surv.L1.boot - perf.surv.L1.origin;
        
        # Set names for later analyses
        names(perf.surv.L1.boot) <- str_c("BOOT_SURVMOD_L1_", names(perf.surv.L1.boot));
        names(perf.surv.L1.origin) <- str_c("ORG_SURVMOD_L1_", names(perf.surv.L1.origin));
        names(optimism.surv.L1) <- str_c("OPT_SURVMOD_L1_", names(optimism.surv.L1));
        names(boot.survmod.L1) <- str_c("SURVMOD_L1_", names(boot.survmod.L1));
        
        
        ## Performance Model L1-Event
        # Performance in bootstrap sample
        perf.logreg.L1.boot <- getPerformanceLogreg(model = boot.logreg.L1, data = boot.dataset.surv.L1)
        
        # Performance in original (imputed) dataset
        perf.logreg.L1.origin <- getPerformanceLogreg(model = boot.logreg.L1, data = dataset.surv.L1); 
        
        # Calculate the optimism
        optimism.logreg.L1 <- perf.logreg.L1.boot - perf.logreg.L1.origin;
        
        # Set names for later analyses
        names(perf.logreg.L1.boot) <- str_c("BOOT_EVENTMOD_L1_", names(perf.logreg.L1.boot));
        names(perf.logreg.L1.origin) <- str_c("ORG_EVENTMOD_L1_", names(perf.logreg.L1.origin));
        names(optimism.logreg.L1) <- str_c("OPT_EVENTMOD_L1_", names(optimism.logreg.L1));
        names(boot.logreg.L1) <- str_c("EVENTMOD_L1_", names(boot.logreg.L1));
        
        
        ## Performance Model LX-Time
        # Performance in bootstrap sample
        perf.surv.LX.boot <- getPerformanceSurvmod(model = boot.survmod.LX, data.train = boot.dataset.surv.LX, data.test = boot.dataset.surv.LX, timepoints = timepoints)
        
        # Performance in original (imputed) dataset
        perf.surv.LX.origin <- getPerformanceSurvmod(model = boot.survmod.LX, data.train = boot.dataset.surv.LX, data.test = dataset.surv.LX, timepoints = timepoints); 
        
                # Calculate the optimism
        optimism.surv.LX <- perf.surv.LX.boot - perf.surv.LX.origin;
        
        # Set names for later analyses
        names(perf.surv.LX.boot) <- str_c("BOOT_SURVMOD_LX_", names(perf.surv.LX.boot));
        names(perf.surv.LX.origin) <- str_c("ORG_SURVMOD_LX_", names(perf.surv.LX.origin));
        names(optimism.surv.LX) <- str_c("OPT_SURVMOD_LX_", names(optimism.surv.LX));
        names(boot.survmod.LX) <- str_c("SURVMOD_LX_", names(boot.survmod.LX));
        
        ## Return models and performance
        c(boot.survmod.L1, perf.surv.L1.boot, perf.surv.L1.origin, optimism.surv.L1, boot.logreg.L1, perf.logreg.L1.boot, perf.logreg.L1.origin, optimism.logreg.L1, boot.survmod.LX, perf.surv.LX.boot, perf.surv.LX.origin, optimism.surv.LX, Errors = count.errors)
        
      })) # sapply datasets
      
      # Apply Rubin's rule to obtain average models and outcomes
      out <- colMeans(out);
      
      # Check whether there are infeasible results for any of the GND tests and, if so, give
      # an error so that another bootstrap sample is performed
      #if(any(out[str_detect(string=names(out), pattern="HL")]==Inf)) {
      if(any(out==Inf)) stop("An error occured in the analysis of the bootstrap sample!");
      #}
    
      # Return the 'object'
      out;
      
    # Error handler for the tryCatch function  
    }, error=function(e) NULL)
  
  } # While
  
  # Return the outcomes for the bootstrap iteration
  out;
  
})); # parSapply nboot

# Stop the defined computation cluster
stopCluster(cl);

# Print five iteration of the bootstrap
head(out.bootstrap, n=5);
##      SURVMOD_L1_(Intercept) SURVMOD_L1_ChemBev SURVMOD_L1_Age1
## [1,]               3.981999          0.7133633      0.04901094
## [2,]               4.700113          0.4052946      0.04082126
## [3,]               4.620920          0.6174053      0.03743284
## [4,]               4.518186          0.5050754      0.04693260
## [5,]               4.863937          0.4679972      0.03160581
##      SURVMOD_L1_Age2 SURVMOD_L1_ECOG.1 SURVMOD_L1_ECOG.23
## [1,]   -0.0003953975        -0.4172791         -0.7096706
## [2,]   -0.0003497074        -0.5041710         -0.2891771
## [3,]   -0.0003100127        -0.3154448         -0.6469685
## [4,]   -0.0003717848        -0.1261577         -0.3197438
## [5,]   -0.0002592091        -0.4779499         -0.2849362
##      SURVMOD_L1_MutantBRAF.YN SURVMOD_L1_Site.LeftRectum
## [1,]              -0.18668733                  0.2851306
## [2,]              -0.10763923                  0.1840829
## [3,]              -0.10634250                  0.1204314
## [4,]              -0.08786323                 -0.2165154
## [5,]              -0.18082350                  0.2201003
##      SURVMOD_L1_Site.Right SURVMOD_L1_Liver.YN SURVMOD_L1_Peritoneum.YN
## [1,]            0.05037495         -0.08705661               -0.1502843
## [2,]           -0.04295241         -0.20918879               -0.0965782
## [3,]           -0.02865888         -0.14946688               -0.1496728
## [4,]           -0.35799352         -0.16156255               -0.1867249
## [5,]            0.07600494         -0.20936823               -0.1616124
##      SURVMOD_L1_CEA.200plus SURVMOD_L1_Private.YN SURVMOD_L1_MutantRAS.YN
## [1,]            -0.02250040             0.5601233             -0.06317179
## [2,]             0.02959098             0.4218272             -0.05558826
## [3,]             0.04848758             0.4890554             -0.02016390
## [4,]            -0.37707645             0.3980278             -0.12519125
## [5,]            -0.03467023             0.4405518             -0.10982494
##      SURVMOD_L1_ECOG.1.Int SURVMOD_L1_ECOG.23.Int
## [1,]           0.268692508             0.17451063
## [2,]           0.403546316             0.07013651
## [3,]           0.057770700             0.11115989
## [4,]          -0.001833973            -0.14401697
## [5,]           0.244993040            -0.32615021
##      SURVMOD_L1_CEA.200plus.Int SURVMOD_L1_Private.YN.Int
## [1,]                 -0.2969168                -0.6098874
## [2,]                 -0.3712344                -0.4717239
## [3,]                 -0.3673050                -0.4517613
## [4,]                  0.3443316                -0.3426054
## [5,]                 -0.1190150                -0.4710576
##      SURVMOD_L1_Log(Scale) BOOT_SURVMOD_L1_c.statistic_0.5-year
## [1,]            -0.3352906                            0.6971802
## [2,]            -0.3162845                            0.6075005
## [3,]            -0.3872189                            0.6787228
## [4,]            -0.3924422                            0.6314717
## [5,]            -0.4041236                            0.6488232
##      BOOT_SURVMOD_L1_c.statistic_1-year BOOT_SURVMOD_L1_c.statistic_2-year
## [1,]                          0.6630135                          0.6527020
## [2,]                          0.6154472                          0.6124971
## [3,]                          0.6543885                          0.6424432
## [4,]                          0.6148917                          0.6109928
## [5,]                          0.6363253                          0.6308419
##      BOOT_SURVMOD_L1_Slope.Scale BOOT_SURVMOD_L1_HL.5bins_0.5-year
## [1,]                   0.9958290                          5.211445
## [2,]                   0.9962760                          5.708340
## [3,]                   0.9959867                         10.338772
## [4,]                   0.9953009                          4.618231
## [5,]                   0.9958087                          6.050799
##      BOOT_SURVMOD_L1_HL.5bins_1-year BOOT_SURVMOD_L1_HL.5bins_2-year
## [1,]                        6.985530                       11.005641
## [2,]                        4.814476                        5.175752
## [3,]                       14.221859                        5.976124
## [4,]                        9.888893                        7.126047
## [5,]                        6.660566                        3.508903
##      BOOT_SURVMOD_L1_HL.10bins_0.5-year BOOT_SURVMOD_L1_HL.10bins_1-year
## [1,]                           23.43496                         28.94024
## [2,]                           17.57629                         17.19086
## [3,]                           24.89114                         24.99794
## [4,]                           13.89146                         24.25595
## [5,]                           15.40319                         15.69896
##      BOOT_SURVMOD_L1_HL.10bins_2-year ORG_SURVMOD_L1_c.statistic_0.5-year
## [1,]                         20.76857                           0.6603587
## [2,]                         13.96084                           0.6362329
## [3,]                         15.23128                           0.6589743
## [4,]                         14.40160                           0.6437835
## [5,]                         10.52725                           0.6513928
##      ORG_SURVMOD_L1_c.statistic_1-year ORG_SURVMOD_L1_c.statistic_2-year
## [1,]                         0.6318032                         0.6247304
## [2,]                         0.6160703                         0.6114792
## [3,]                         0.6359020                         0.6303478
## [4,]                         0.6172235                         0.6116596
## [5,]                         0.6269510                         0.6210518
##      ORG_SURVMOD_L1_Slope.Scale ORG_SURVMOD_L1_HL.5bins_0.5-year
## [1,]                  0.9968068                         3.085940
## [2,]                  0.9965162                         2.079462
## [3,]                  0.9966184                         4.145054
## [4,]                  0.9959796                         4.621218
## [5,]                  0.9964070                         4.844827
##      ORG_SURVMOD_L1_HL.5bins_1-year ORG_SURVMOD_L1_HL.5bins_2-year
## [1,]                      10.950854                       7.054199
## [2,]                       7.846517                       2.283721
## [3,]                      13.404311                       3.649663
## [4,]                       4.739600                       9.011875
## [5,]                       6.753159                       3.259901
##      ORG_SURVMOD_L1_HL.10bins_0.5-year ORG_SURVMOD_L1_HL.10bins_1-year
## [1,]                          7.978417                        17.86005
## [2,]                          6.379822                        11.97190
## [3,]                          8.151804                        20.61193
## [4,]                          9.711431                        10.76515
## [5,]                          8.845025                        10.56049
##      ORG_SURVMOD_L1_HL.10bins_2-year OPT_SURVMOD_L1_c.statistic_0.5-year
## [1,]                       10.072965                         0.036821541
## [2,]                        7.287994                        -0.028732415
## [3,]                        7.852532                         0.019748480
## [4,]                       12.752327                        -0.012311846
## [5,]                        6.122629                        -0.002569534
##      OPT_SURVMOD_L1_c.statistic_1-year OPT_SURVMOD_L1_c.statistic_2-year
## [1,]                      0.0312102343                       0.027971551
## [2,]                     -0.0006230617                       0.001017893
## [3,]                      0.0184864365                       0.012095387
## [4,]                     -0.0023317608                      -0.000666834
## [5,]                      0.0093743284                       0.009790143
##      OPT_SURVMOD_L1_Slope.Scale OPT_SURVMOD_L1_HL.5bins_0.5-year
## [1,]              -0.0009777518                      2.125504929
## [2,]              -0.0002402122                      3.628878019
## [3,]              -0.0006317983                      6.193717317
## [4,]              -0.0006787829                     -0.002987206
## [5,]              -0.0005982845                      1.205971312
##      OPT_SURVMOD_L1_HL.5bins_1-year OPT_SURVMOD_L1_HL.5bins_2-year
## [1,]                    -3.96532395                      3.9514412
## [2,]                    -3.03204090                      2.8920317
## [3,]                     0.81754863                      2.3264606
## [4,]                     5.14929344                     -1.8858279
## [5,]                    -0.09259345                      0.2490013
##      OPT_SURVMOD_L1_HL.10bins_0.5-year OPT_SURVMOD_L1_HL.10bins_1-year
## [1,]                         15.456548                       11.080198
## [2,]                         11.196472                        5.218963
## [3,]                         16.739334                        4.386004
## [4,]                          4.180025                       13.490804
## [5,]                          6.558168                        5.138470
##      OPT_SURVMOD_L1_HL.10bins_2-year EVENTMOD_L1_(Intercept)
## [1,]                       10.695601               -1.222529
## [2,]                        6.672845               -3.429277
## [3,]                        7.378746               -1.896369
## [4,]                        1.649275               -1.982959
## [5,]                        4.404618               -1.695852
##      EVENTMOD_L1_Time1 EVENTMOD_L1_Time2 EVENTMOD_L1_ChemBev
## [1,]     -0.0127019951      9.717204e-06          -1.1545663
## [2,]      0.0040775680     -8.194758e-06           1.4111030
## [3,]     -0.0007586047     -1.140798e-06          -0.1703071
## [4,]     -0.0007956200     -1.050032e-07           0.8586870
## [5,]     -0.0025616104     -9.145091e-07          -0.4207733
##      EVENTMOD_L1_Charlson.1plus EVENTMOD_L1_Site.LeftRectum.Int
## [1,]                  1.0387673                      -1.5291757
## [2,]                  0.7663630                      -1.5997984
## [3,]                  0.5972719                      -0.7651645
## [4,]                  0.4055901                      -2.3030617
## [5,]                  0.7016559                      -1.6592355
##      EVENTMOD_L1_Site.Right.Int EVENTMOD_L1_Time1.Int
## [1,]                 -1.0183295           0.015737858
## [2,]                 -1.8408813          -0.001153155
## [3,]                 -0.7035978           0.001664369
## [4,]                 -2.0915155           0.002074461
## [5,]                 -2.0244781           0.005974437
##      EVENTMOD_L1_Time2.Int BOOT_EVENTMOD_L1_C.Statistic
## [1,]         -1.017203e-05                    0.7513534
## [2,]          7.602548e-06                    0.7160946
## [3,]          1.754595e-06                    0.6314291
## [4,]          1.165385e-06                    0.7175743
## [5,]          7.297080e-08                    0.7066423
##      BOOT_EVENTMOD_L1_Regression.Slope BOOT_EVENTMOD_L1_Brier.Score
## [1,]                                 1                   0.08621135
## [2,]                                 1                   0.07125345
## [3,]                                 1                   0.09863529
## [4,]                                 1                   0.07825127
## [5,]                                 1                   0.08003173
##      ORG_EVENTMOD_L1_C.Statistic ORG_EVENTMOD_L1_Regression.Slope
## [1,]                   0.6742480                        0.6229088
## [2,]                   0.6632729                        0.7894793
## [3,]                   0.6854978                        1.1844405
## [4,]                   0.6894995                        0.7169080
## [5,]                   0.6806799                        0.8191654
##      ORG_EVENTMOD_L1_Brier.Score OPT_EVENTMOD_L1_C.Statistic
## [1,]                  0.08796597                  0.07710542
## [2,]                  0.08816621                  0.05282169
## [3,]                  0.08741610                 -0.05406867
## [4,]                  0.08821141                  0.02807471
## [5,]                  0.08674404                  0.02596248
##      OPT_EVENTMOD_L1_Regression.Slope OPT_EVENTMOD_L1_Brier.Score
## [1,]                        0.3770911                -0.001754618
## [2,]                        0.2105207                -0.016912762
## [3,]                       -0.1844404                 0.011219198
## [4,]                        0.2830920                -0.009960139
## [5,]                        0.1808346                -0.006712313
##      SURVMOD_LX_(Intercept) SURVMOD_LX_ChemBev SURVMOD_LX_Age1
## [1,]               3.562227          0.7824871     -0.02828989
## [2,]               3.998186          0.2554781     -0.02341365
## [3,]               3.569794          0.4655880     -0.02833868
## [4,]               2.313700          2.0443733     -0.02091674
## [5,]               2.598641          0.6808260     -0.03057557
##      SURVMOD_LX_Age2 SURVMOD_LX_Site.LeftRectum SURVMOD_LX_Site.Right
## [1,]   -0.0011449823                 -0.1087073            -0.1413447
## [2,]   -0.0002210010                 -0.2235832            -0.3898415
## [3,]   -0.0011554675                 -0.3559879            -0.3880295
## [4,]   -0.0009223096                 -0.1158416            -0.2002665
## [5,]   -0.0006766529                  0.2135378             0.1598420
##      SURVMOD_LX_Peritoneum.YN SURVMOD_LX_ECOG.1 SURVMOD_LX_ECOG.23
## [1,]               -0.0249536         0.1549188         -1.0220102
## [2,]                0.2214340        -0.5808265         -1.4354497
## [3,]                0.1059064        -0.4217989         -1.3702082
## [4,]                0.5151359        -0.1775224         -1.3387347
## [5,]                0.0503954        -0.1520670         -0.9341271
##      SURVMOD_LX_MutantBRAF.YN SURVMOD_LX_PFS.Time SURVMOD_LX_Stage4.YN
## [1,]               0.70546169           0.4707842           -0.3944380
## [2,]               0.19919089           0.5095575           -0.5856043
## [3,]              -0.32472287           0.6052857           -0.6042184
## [4,]               0.03480724           0.7434121           -0.3734194
## [5,]               0.14462485           0.6508931           -0.9527968
##      SURVMOD_LX_Charlson.1plus.YN SURVMOD_LX_MutantRAS.YN
## [1,]                   0.09992622              0.17053839
## [2,]                   0.04779290              0.03353796
## [3,]                  -0.07911071              0.09179549
## [4,]                   0.13386652              0.22966125
## [5,]                   0.07126643              0.05809591
##      SURVMOD_LX_Private.YN SURVMOD_LX_Age1.Int SURVMOD_LX_Age2.Int
## [1,]            0.15158631          0.01786526        7.877169e-04
## [2,]            0.07598565          0.01288963       -3.442888e-05
## [3,]            0.03818174          0.02456055        1.056388e-03
## [4,]            0.20289242          0.01560396        1.133192e-03
## [5,]            0.14046066          0.02742750        6.063385e-04
##      SURVMOD_LX_Peritoneum.YN.Int SURVMOD_LX_ECOG.1.Int
## [1,]                  -0.12031487           -0.46908724
## [2,]                  -0.21624553            0.27065371
## [3,]                  -0.15013080           -0.03996639
## [4,]                  -0.57248756           -0.03035211
## [5,]                  -0.01832955           -0.21184432
##      SURVMOD_LX_ECOG.23.Int SURVMOD_LX_MutantBRAF.YN.Int
## [1,]              0.2905964                   -1.2068901
## [2,]              0.5655256                   -0.5433808
## [3,]              0.3297124                   -0.3063589
## [4,]              0.4628835                   -0.3725404
## [5,]             -0.1904882                   -0.4076959
##      SURVMOD_LX_PFS.Time.Int SURVMOD_LX_Stage4.YN.Int
## [1,]             -0.09350060                0.3643138
## [2,]             -0.09965267                0.7427876
## [3,]             -0.09376370                0.5268879
## [4,]             -0.40790633                0.3279707
## [5,]             -0.13930390                0.7655971
##      SURVMOD_LX_MutantRAS.YN.Int SURVMOD_LX_Log(Scale)
## [1,]                  -0.3972561            -0.1596703
## [2,]                  -0.2979609            -0.1807637
## [3,]                  -0.4433107            -0.2072936
## [4,]                  -0.4738601            -0.2232146
## [5,]                  -0.2130537            -0.1556238
##      BOOT_SURVMOD_LX_c.statistic_0.5-year
## [1,]                            0.7175332
## [2,]                            0.7182717
## [3,]                            0.7255471
## [4,]                            0.6871814
## [5,]                            0.7255819
##      BOOT_SURVMOD_LX_c.statistic_1-year BOOT_SURVMOD_LX_c.statistic_2-year
## [1,]                          0.6836472                          0.6742734
## [2,]                          0.6979305                          0.6871862
## [3,]                          0.7094436                          0.6946134
## [4,]                          0.6827085                          0.6717903
## [5,]                          0.6989814                          0.6822703
##      BOOT_SURVMOD_LX_Slope.Scale BOOT_SURVMOD_LX_HL.5bins_0.5-year
## [1,]                   0.9970078                         11.326984
## [2,]                   0.9970459                         14.218835
## [3,]                   0.9971341                          7.908079
## [4,]                   0.9965562                          5.302841
## [5,]                   0.9973610                         10.901772
##      BOOT_SURVMOD_LX_HL.5bins_1-year BOOT_SURVMOD_LX_HL.5bins_2-year
## [1,]                       12.103363                        5.984211
## [2,]                        6.049089                        8.935838
## [3,]                        9.671642                        4.928537
## [4,]                        8.257890                        8.517921
## [5,]                        5.580362                        6.485760
##      BOOT_SURVMOD_LX_HL.10bins_0.5-year BOOT_SURVMOD_LX_HL.10bins_1-year
## [1,]                           28.80884                         23.00415
## [2,]                           29.92394                         18.69299
## [3,]                           33.21597                         18.72455
## [4,]                           15.67308                         17.15724
## [5,]                           18.91509                         15.45737
##      BOOT_SURVMOD_LX_HL.10bins_2-year ORG_SURVMOD_LX_c.statistic_0.5-year
## [1,]                         14.52087                           0.7113699
## [2,]                         27.31791                           0.7036708
## [3,]                         17.52691                           0.7093891
## [4,]                         17.04953                           0.7103136
## [5,]                         23.34100                           0.7129121
##      ORG_SURVMOD_LX_c.statistic_1-year ORG_SURVMOD_LX_c.statistic_2-year
## [1,]                         0.6827227                         0.6739350
## [2,]                         0.6817569                         0.6723853
## [3,]                         0.6838889                         0.6729765
## [4,]                         0.6869212                         0.6750528
## [5,]                         0.6827392                         0.6700449
##      ORG_SURVMOD_LX_Slope.Scale ORG_SURVMOD_LX_HL.5bins_0.5-year
## [1,]                  0.9971391                         6.048096
## [2,]                  0.9973047                         4.313220
## [3,]                  0.9977269                        11.834210
## [4,]                  0.9973201                         6.336112
## [5,]                  0.9975677                         2.847844
##      ORG_SURVMOD_LX_HL.5bins_1-year ORG_SURVMOD_LX_HL.5bins_2-year
## [1,]                       3.723671                       3.430626
## [2,]                       3.913788                       3.848451
## [3,]                      13.579960                       6.704817
## [4,]                       6.530022                       5.451594
## [5,]                       9.312232                      10.132961
##      ORG_SURVMOD_LX_HL.10bins_0.5-year ORG_SURVMOD_LX_HL.10bins_1-year
## [1,]                         12.759155                        8.371971
## [2,]                         10.075726                        8.675773
## [3,]                         19.169984                       17.097972
## [4,]                         15.517688                       15.131077
## [5,]                          7.252339                       12.878907
##      ORG_SURVMOD_LX_HL.10bins_2-year OPT_SURVMOD_LX_c.statistic_0.5-year
## [1,]                        10.21783                         0.006163227
## [2,]                        10.92730                         0.014600836
## [3,]                        16.42205                         0.016158051
## [4,]                        11.36277                        -0.023132143
## [5,]                        19.21853                         0.012669796
##      OPT_SURVMOD_LX_c.statistic_1-year OPT_SURVMOD_LX_c.statistic_2-year
## [1,]                      0.0009244992                      0.0003384089
## [2,]                      0.0161735932                      0.0148009260
## [3,]                      0.0255546730                      0.0216369479
## [4,]                     -0.0042126971                     -0.0032624571
## [5,]                      0.0162421948                      0.0122253280
##      OPT_SURVMOD_LX_Slope.Scale OPT_SURVMOD_LX_HL.5bins_0.5-year
## [1,]              -0.0001312901                         5.278888
## [2,]              -0.0002588117                         9.905615
## [3,]              -0.0005928405                        -3.926131
## [4,]              -0.0007639492                        -1.033271
## [5,]              -0.0002066949                         8.053928
##      OPT_SURVMOD_LX_HL.5bins_1-year OPT_SURVMOD_LX_HL.5bins_2-year
## [1,]                       8.379692                       2.553585
## [2,]                       2.135301                       5.087387
## [3,]                      -3.908318                      -1.776280
## [4,]                       1.727868                       3.066327
## [5,]                      -3.731871                      -3.647200
##      OPT_SURVMOD_LX_HL.10bins_0.5-year OPT_SURVMOD_LX_HL.10bins_1-year
## [1,]                        16.0496825                       14.632183
## [2,]                        19.8482130                       10.017212
## [3,]                        14.0459844                        1.626581
## [4,]                         0.1553928                        2.026166
## [5,]                        11.6627504                        2.578462
##      OPT_SURVMOD_LX_HL.10bins_2-year Errors
## [1,]                        4.303047      0
## [2,]                       16.390608      0
## [3,]                        1.104856      0
## [4,]                        5.686761      0
## [5,]                        4.122471      0

Step 5.5: correcting the apparent performance for optimism

After performing the bootstrapping, the apparent performance is corrected for the optimism. To do so, the following actions are performed for each statistic:

  1. Extract the apparent performance for the statistic from the corresponding object
  2. Extract the mean optimism over all bootstrap iterations from the corresponding object
  3. Calculate the nearly unbiased optimism-corrected performance
  4. For the GND: calculate the p-values corresponding to the outcome
## Model L1-Time

# Extract all relevant columns from the 'out.bootstrap' object and calculate the mean over all iterations
O.Model.L1.Time <- colMeans(out.bootstrap[, str_detect(string=colnames(out.bootstrap), pattern="OPT_SURVMOD_L1")]);

# C-statistic
Dapp.Model.L1.Time.c <- Dapp.Model.L1.Time[str_detect(string=names(Dapp.Model.L1.Time), pattern="c.statistic")];
O.Model.L1.Time.c <- O.Model.L1.Time[str_detect(string=names(O.Model.L1.Time), pattern="c.statistic")];
rbind(Dapp.Model.L1.Time.c,
      O.Model.L1.Time.c,
      corrected = Dapp.Model.L1.Time.c - O.Model.L1.Time.c);
##                      c.statistic_0.5-year c.statistic_1-year
## Dapp.Model.L1.Time.c           0.65917752         0.63480879
## O.Model.L1.Time.c              0.01175686         0.01173835
## corrected                      0.64742065         0.62307044
##                      c.statistic_2-year
## Dapp.Model.L1.Time.c         0.62872577
## O.Model.L1.Time.c            0.01208758
## corrected                    0.61663820
# Slope
Dapp.Model.L1.Time.slope <- Dapp.Model.L1.Time[str_detect(string=names(Dapp.Model.L1.Time), pattern="Slope")];
O.Model.L1.Time.slope <- O.Model.L1.Time[str_detect(string=names(O.Model.L1.Time), pattern="Slope")];
rbind(Dapp.Model.L1.Time.slope, O.Model.L1.Time.slope, corrected = Dapp.Model.L1.Time.slope - O.Model.L1.Time.slope);
##                            Slope.Scale
## Dapp.Model.L1.Time.slope  0.9958962063
## O.Model.L1.Time.slope    -0.0005810449
## corrected                 0.9964772512
# GND HL 
Dapp.Model.L1.Time.hl <- Dapp.Model.L1.Time[str_detect(string=names(Dapp.Model.L1.Time), pattern="HL")];
O.Model.L1.Time.hl <- O.Model.L1.Time[str_detect(string=names(O.Model.L1.Time), pattern="HL")];
rbind(Dapp.Model.L1.Time.hl, O.Model.L1.Time.hl, corrected = Dapp.Model.L1.Time.hl + O.Model.L1.Time.hl);
##                       HL.5bins_0.5-year HL.5bins_1-year HL.5bins_2-year
## Dapp.Model.L1.Time.hl          4.092919        4.336916       0.9944051
## O.Model.L1.Time.hl             4.637893        1.256029       2.9860626
## corrected                      8.730812        5.592946       3.9804677
##                       HL.10bins_0.5-year HL.10bins_1-year HL.10bins_2-year
## Dapp.Model.L1.Time.hl            7.40606        10.672663         5.588542
## O.Model.L1.Time.hl              11.21292         6.928614         7.157294
## corrected                       18.61898        17.601276        12.745836
1-c(pchisq(q=c(Dapp.Model.L1.Time.hl + O.Model.L1.Time.hl)[1:3], df=4), pchisq(q=c(Dapp.Model.L1.Time.hl + O.Model.L1.Time.hl)[4:6], df=9))
##  HL.5bins_0.5-year    HL.5bins_1-year    HL.5bins_2-year 
##         0.06819159         0.23167947         0.40865571 
## HL.10bins_0.5-year   HL.10bins_1-year   HL.10bins_2-year 
##         0.02863475         0.04009156         0.17444819
## Model L1-Event

# Extract all relevant columns from the 'out.bootstrap' object and calculate the mean over all iterations
O.Model.L1.Event <- colMeans(out.bootstrap[, str_detect(string=colnames(out.bootstrap), pattern="OPT_EVENTMOD_L1")]);

# C-statistic
Dapp.Model.L1.Event.c <- Dapp.Model.L1.Event[str_detect(string=names(Dapp.Model.L1.Event), pattern="C.Statistic")];
O.Model.L1.Event.c <- O.Model.L1.Event[str_detect(string=names(O.Model.L1.Event), pattern="C.Statistic")];
rbind(Dapp.Model.L1.Event.c, O.Model.L1.Event.c, corrected = Dapp.Model.L1.Event.c - O.Model.L1.Event.c);
##                       C.Statistic
## Dapp.Model.L1.Event.c  0.68958824
## O.Model.L1.Event.c     0.03202666
## corrected              0.65756159
# Slope
Dapp.Model.L1.Event.slope <- Dapp.Model.L1.Event[str_detect(string=names(Dapp.Model.L1.Event), pattern="Slope")];
O.Model.L1.Event.slope <- O.Model.L1.Event[str_detect(string=names(O.Model.L1.Event), pattern="Slope")];
rbind(Dapp.Model.L1.Event.slope, O.Model.L1.Event.slope, corrected = Dapp.Model.L1.Event.slope - O.Model.L1.Event.slope);
##                           Regression.Slope
## Dapp.Model.L1.Event.slope        0.9997424
## O.Model.L1.Event.slope           0.1796652
## corrected                        0.8200772
# Brier
Dapp.Model.L1.Event.brier <- Dapp.Model.L1.Event[str_detect(string=names(Dapp.Model.L1.Event), pattern="Brier")];
O.Model.L1.Event.brier <- O.Model.L1.Event[str_detect(string=names(O.Model.L1.Event), pattern="Brier")];
rbind(Dapp.Model.L1.Event.brier, O.Model.L1.Event.brier, corrected = Dapp.Model.L1.Event.brier - O.Model.L1.Event.brier);
##                            Brier.Score
## Dapp.Model.L1.Event.brier  0.086749976
## O.Model.L1.Event.brier    -0.002987089
## corrected                  0.089737065
## Model LX-Time

# Extract all relevant columns from the 'out.bootstrap' object and calculate the mean over all iterations
O.Model.LX.Time <- colMeans(out.bootstrap[, str_detect(string=colnames(out.bootstrap), pattern="OPT_SURVMOD_LX")]);

# C-statistic
Dapp.Model.LX.Time.c <- Dapp.Model.LX.Time[str_detect(string=names(Dapp.Model.LX.Time), pattern="c.statistic")];
O.Model.LX.Time.c <- O.Model.LX.Time[str_detect(string=names(O.Model.LX.Time), pattern="c.statistic")];
rbind(Dapp.Model.LX.Time.c, O.Model.LX.Time.c, corrected = Dapp.Model.LX.Time.c - O.Model.LX.Time.c);
##                      c.statistic_0.5-year c.statistic_1-year
## Dapp.Model.LX.Time.c           0.71740538         0.69335005
## O.Model.LX.Time.c              0.01264523         0.01364425
## corrected                      0.70476016         0.67970580
##                      c.statistic_2-year
## Dapp.Model.LX.Time.c         0.68444465
## O.Model.LX.Time.c            0.01450578
## corrected                    0.66993887
# Slope
Dapp.Model.LX.Time.slope <- Dapp.Model.LX.Time[str_detect(string=names(Dapp.Model.LX.Time), pattern="Slope")];
O.Model.LX.Time.slope <- O.Model.LX.Time[str_detect(string=names(O.Model.LX.Time), pattern="Slope")];
rbind(Dapp.Model.LX.Time.slope, O.Model.LX.Time.slope, corrected = Dapp.Model.LX.Time.slope - O.Model.LX.Time.slope);
##                            Slope.Scale
## Dapp.Model.LX.Time.slope  0.9971058368
## O.Model.LX.Time.slope    -0.0004747626
## corrected                 0.9975805993
# GND HL 
Dapp.Model.LX.Time.hl <- Dapp.Model.LX.Time[str_detect(string=names(Dapp.Model.LX.Time), pattern="HL")];
O.Model.LX.Time.hl <- O.Model.LX.Time[str_detect(string=names(O.Model.LX.Time), pattern="HL")];
rbind(Dapp.Model.LX.Time.hl, O.Model.LX.Time.hl, corrected = Dapp.Model.LX.Time.hl + O.Model.LX.Time.hl);
##                       HL.5bins_0.5-year HL.5bins_1-year HL.5bins_2-year
## Dapp.Model.LX.Time.hl          8.140954       1.5974814       4.9123090
## O.Model.LX.Time.hl             3.811204      -0.5324031       0.7049291
## corrected                     11.952158       1.0650783       5.6172380
##                       HL.10bins_0.5-year HL.10bins_1-year HL.10bins_2-year
## Dapp.Model.LX.Time.hl           15.90331         6.629234         7.669232
## O.Model.LX.Time.hl              10.96412         4.661620         5.729678
## corrected                       26.86743        11.290854        13.398910
1-c(pchisq(q=c(Dapp.Model.LX.Time.hl + O.Model.LX.Time.hl)[1:3], df=4), pchisq(q=c(Dapp.Model.LX.Time.hl + O.Model.LX.Time.hl)[4:6], df=9))
##  HL.5bins_0.5-year    HL.5bins_1-year    HL.5bins_2-year 
##        0.017710603        0.899772603        0.229614753 
## HL.10bins_0.5-year   HL.10bins_1-year   HL.10bins_2-year 
##        0.001471154        0.256298079        0.145371007

Step 6: Discrete Event Simulation

Now that all (survival) models have been fitted and correlated sets of model coefficients (parameters) are obtained by performing the boostrapping procedure, a probabilistic sensitivity analysis can be performed with the discrete event simulation. A base-case analysis can be performed by using the models that are estimated from the imputed datasets (Step 4: Multivariable Modeling of Survival). In order to validate the discrete event simulation, first some custom functions will be defined in Step 6.1, after which the validation is perfomed in Step 6.2. Hence, Step 6 exists of the following steps:

  1. Defining functions used for validating the discrete event simulation
  2. Performing the validation of the discrete event simulation

Step 6.1: defining functions used for validating the discrete event simulation

Several custom functions are defined to run the simulation and validate it. Please find an explanation of the functions in the R code below.

# The 'getSurvDataset.L1' and 'getSurvDataset.LX' functions do no longer need to also
# include the outcomes of the patients, as these are unknown for the simulation model
# since it needs to simulate them, so the shorter 'getData.L1' and 'getData.LX'
# functions are defined
getData.L1 <- function(data) {
  
  data <- data %>% 
    
    ## Single variables
    
    # Arm.ChemBev = "ChemBev",
    mutate(ChemBev = as.numeric(as.logical(L1.Arm.ChemBev))) %>%
    
    #  Age = "Age1 + Age2",
    mutate(Age1 = Age) %>%
    mutate(Age2 = Age1^2) %>%
    
    #  Hospital.Private = "Private.YN",
    mutate(Private.YN = as.numeric(as.logical(Hospital.Private))) %>%
    
    #  Stage.Diagnosis = "Stage4.YN",
    mutate(Stage4.YN = ifelse(as.character(Stage.Diagnosis)=="4", 1, 0)) %>%
    
    #  Primary.Site = "Site.Left + Site.Right + Site.Rectum",
    mutate(Site.LeftRectum = ifelse(Primary.Site %in% c("Left","Rectum"), 1, 0)) %>%
    mutate(Site.Right = ifelse(Primary.Site=="Right", 1, 0)) %>%
    
    #  Metastases.Number = "Metastases.2 + Metastases.3plus",
    mutate(Metastases.2 = ifelse(as.numeric(as.character(Metastases.Number))==2, 1, 0)) %>%
    mutate(Metastases.3plus = ifelse(as.numeric(as.character(Metastases.Number))>=3, 1, 0)) %>%
    
    #  Metastases.Liver = "Liver.YN",
    mutate(Liver.YN = as.numeric(as.logical(Metastases.Liver))) %>%
    
    #  Metastases.Peritoneum = "Peritoneum.YN",
    mutate(Peritoneum.YN = as.numeric(as.logical(Metastases.Peritoneum))) %>%
    
    #  L1.ECOG = "ECOG.1 + ECOG.23",
    mutate(ECOG.1 = ifelse(as.numeric(as.character(L1.ECOG))==1, 1, 0)) %>%
    mutate(ECOG.23 = ifelse(as.numeric(as.character(L1.ECOG))>=2, 1, 0)) %>%
    
    #  Charlson = "Charlson.1 + Charlson.2 + Charlson.3plus",
    mutate(Charlson.1plus = ifelse(as.numeric(as.character(Charlson))>=1, 1, 0)) %>%
    
    #  CEA.Level = "CEA1 + CEA2",
    mutate(CEA.200plus = ifelse(as.numeric(as.character(Metastases.Number))<=2, 1, 0) * ifelse(as.numeric(CEA.Level)>=200, 1, 0)) %>%
    
    #  RAS.Mutated = "MutantRAS.YN",
    mutate(MutantRAS.YN = as.numeric(as.logical(RAS.Mutated))) %>%
    
    #  BRAF.Mutated = "MutantBRAF.YN",
    mutate(MutantBRAF.YN = as.numeric(as.logical(BRAF.Mutated))) %>%
    
    
    ## Interactions
    mutate(Private.YN.Int = Private.YN * ChemBev) %>%
    mutate(Site.LeftRectum.Int = Site.LeftRectum * ChemBev) %>%
    mutate(Site.Right.Int = Site.Right * ChemBev) %>%
    mutate(ECOG.1.Int = ECOG.1 * ChemBev) %>%
    mutate(ECOG.23.Int = ECOG.23 * ChemBev) %>%
    mutate(CEA.200plus.Int = CEA.200plus * ChemBev)
  
  
  # Only the newly created columns need to be returned
  return(select(data, ChemBev:CEA.200plus.Int))
  
}
getData.LX <- function(data) {
  
  data <- data %>% 
    
    ## Single variables
    
    # L1.PFS.Time = "PFS.Time",
    mutate(PFS.Time = log(L1.PFS.Time)) %>%
    
    # Arm.ChemBev = "ChemBev",
    mutate(ChemBev = as.numeric(as.logical(L1.Arm.ChemBev))) %>%
    
    #  Age = "Age1 + Age2",
    mutate(Age1 = as.numeric(Age)-mean(Age)) %>%
    mutate(Age2 = as.numeric(Age1^2)) %>%
    
    #  Hospital.Private = "Private.YN",
    mutate(Private.YN = as.numeric(as.logical(Hospital.Private))) %>%
    
    #  Stage.Diagnosis = "Stage4.YN",
    mutate(Stage4.YN = ifelse(as.numeric(Stage.Diagnosis)==4, 1, 0)) %>%
    
    #  Primary.Site = "Site.Left + Site.Right + Site.Rectum",
    mutate(Site.LeftRectum = ifelse(Primary.Site %in% c("Left","Rectum"), 1, 0)) %>%
    mutate(Site.Right = ifelse(Primary.Site=="Right", 1, 0)) %>%
    
    #  Metastases.Peritoneum = "Peritoneum.YN",
    mutate(Peritoneum.YN = as.numeric(as.logical(Metastases.Peritoneum))) %>%
    
    #  L1.ECOG = "ECOG.1 + ECOG.23",
    mutate(ECOG.1 = ifelse(as.numeric(as.character(L1.ECOG))==1, 1, 0)) %>%
    mutate(ECOG.23 = ifelse(as.numeric(as.character(L1.ECOG))>=2, 1, 0)) %>%
    
    #  Charlson = "Charlson.3plus.YN",
    mutate(Charlson.1plus.YN = ifelse(as.numeric(Charlson)>=1, 1, 0)) %>%
    
    #  RAS.Mutated = "MutantRAS.YN",
    mutate(MutantRAS.YN = as.numeric(as.logical(RAS.Mutated))) %>%
    
    #  BRAF.Mutated = "MutantBRAF.YN",
    mutate(MutantBRAF.YN = as.numeric(as.logical(BRAF.Mutated))) %>%
    
    
    ## Interactions
    mutate(PFS.Time.Int = PFS.Time * ChemBev) %>%
    mutate(Age1.Int = Age1 * ChemBev) %>%
    mutate(Age2.Int = Age2 * ChemBev) %>%
    mutate(Stage4.YN.Int = Stage4.YN * ChemBev) %>%
    mutate(Peritoneum.YN.Int = Peritoneum.YN * ChemBev) %>%
    mutate(ECOG.1.Int = ECOG.1 * ChemBev) %>%
    mutate(ECOG.23.Int = ECOG.23 * ChemBev) %>%
    mutate(MutantRAS.YN.Int = MutantRAS.YN * ChemBev) %>%
    mutate(MutantBRAF.YN.Int = MutantBRAF.YN * ChemBev)
  
  # Only the newly created columns need to be returned
  return(select(data, PFS.Time:MutantBRAF.YN.Int))
  
}

# The 'predictSurvmod' and 'predictLogreg' functions are re-written so that they can
# predict the outcomes in different formats in stead of having different functions for
# different formats. The new functions are 'predSurv' and 'predLogr'.
predSurv <- function(model, newdata, type="random") {
  
  # Add a dummy column for the intercept
  newdata <- cbind(newdata, '(Intercept)'=1);
  
  # Ge the Weibull Shape parameter from the model
  Shape <- 1/exp(model["Log(Scale)"]);
  model <- model[-which(names(model) == "Log(Scale)")];
  
  # Subset a dataframe to enable matrix multiplication and check
  newdata.sub <- as.matrix(newdata[, names(model), drop=F]);
  
  # Get the scale parameter for the survival model
  Scale <- exp(newdata.sub %*% model);
  
  if(type=="random") {
  
    # Predict the times
    p <- matrix(runif(nrow(newdata.sub)), ncol=1);
    t <- qweibull(p=p, shape=Shape, scale=Scale);
    
    # Return predicted times
    return(as.numeric(t));
    
  } else if(type=="response") {
    
    # Return scale parameter
    return(Scale)
    
  }
  
  
}
predLogr <- function(model, newdata) {
  
  # Add a dummy column for the intercept
  newdata <- cbind(newdata, '(Intercept)'=1);
  
  # Subset a dataframe to enable matrix multiplication and check
  newdata.sub <- as.matrix(newdata[, names(model), drop=F]);
  
  # Calculate the probabilities
  p.tmp <- exp(newdata.sub %*% model);
  p <- p.tmp / (p.tmp + 1);
  
  # Return probabilities
  return(as.vector(p));
  
}

# The 'runDES' function performs one run with the DES based on a specific patient 
# (sub)population and a set of model parameters for each (survival) model
runDES <- function(patients, nsim, PFS.Time.Model, PFS.Event.Model, OS.Time.Model) {

  
  ## INITIALIZATION

  # Sample from patients
  df <- sample_n(patients, size=nsim, replace=T);

  
  ## PART 1: SIMULATE WHETHER PROGRESSION OR DEATH OCCURS

  # Put data in right format
  df.PFS <- getData.L1(data=df);

  # Predict time to progression or death
  df.PFS <- df.PFS %>% mutate(Time = predSurv(model=PFS.Time.Model, newdata=df.PFS))%>%
    mutate(Time1 = Time) %>%
    mutate(Time2 = Time^2) %>%
    mutate(Time1.Int = Time1 * ChemBev) %>%
    mutate(Time2.Int = Time2 * ChemBev);

  # Predict which event is associated with the time
  df.PFS <- df.PFS %>%
    mutate(PFS.Prob = predLogr(model=PFS.Event.Model, newdata=df.PFS)) %>%
    mutate(L1.PFS.Event = ifelse(runif(nrow(df.PFS)) < PFS.Prob, "Death", "Progression"));

  # Save to original patient object
  df.PFS <- mutate(df.PFS, L1.PFS.Time = Time);
  df.PFS <- select(df.PFS, L1.PFS.Time, L1.PFS.Event);
  df <- cbind(df, df.PFS);


  ## PART 2: SIMULATE TIME TILL DEATH FOR THOSE WHO PROGRESSED

  # Put data in right format
  df.OS <- getData.LX(data=df);

  # Predict time to death
  df <- df %>% mutate(OS.Time = ifelse(L1.PFS.Event=="Death", NA, predSurv(model=OS.Time.Model, newdata=df.OS)))

  # Rename variables
  df <- df %>%
    mutate(OS.Time = as.numeric(OS.Time)) %>%
    mutate(PFS.Time = L1.PFS.Time) %>%
    mutate(PFS.Event = L1.PFS.Event) %>%
    mutate(PFS = PFS.Time) %>%
    mutate(OS = PFS.Time + ifelse(is.na(OS.Time), 0, OS.Time));

  # Return simulation data.frame
  return(df)

}

# The 'predictKM' extracts a survival probabilty at a specific timepoint 't' from a fitted Kaplan-Meier
# object 'km' using a stepwise function
predictKM <- function(km, t) {
  
  km.fun <- stepfun(km$time, c(1, km$surv));
  
  pred <- km.fun(t);
  
  return(pred)
}

# The 'validateDES' function performs a probabilistic sensitivity analysis for the DES and a specific
# patient (sub)population, and plots the validation plots and median time-to-event outcomes
validateDES <- function(output=NULL, patients=NULL, patients.times=NULL, nsim=NULL, nrun=NULL, models=NULL, timemax=c(2000,2500,2000), timestep=10, freecores=0, seed=20181212) {
  
  # If the result of a previous 'validateDES' call is provide, the probabilistic sensitivity analysis 
  # is not performed again, but only the plots etc. are generated.
  runsim <- ifelse(is.null(output), T, F);
  
  # The 'timemax' and 'timestep' inputs are used to determine for what time horizon the plots should ne generated
  times <- list(seq(from=0, to=timemax[1], by=timestep),
                seq(from=0, to=timemax[2], by=timestep),
                seq(from=0, to=timemax[3], by=timestep));
  
  # Only if a simulation is to be performed, this will be done
  if(runsim) {
    
    # Check whether sufficient sets of parameters are provided to perform the requested number of runs
    if(nrun>nrow(models)) stop(str_c("nrun can be max ",nrow(models), "!"));
    
    # Set the random number seed for reproduction
    set.seed(seed);
    
    # Set up a multithreaded computation cluster
    library(parallel);
    library(doSNOW);
    cl <- makeCluster(detectCores()-freecores);
    clusterExport(cl, c("times", "patients", "patients.times",
                        "runDES", "predSurv", "predLogr",
                        "getData.L1", "getData.LX",
                        "predictKM"));
    
    # Perform the simulation
    sim <- parLapply(cl, 1:nrun, function(i.run) {
      
      # Load the required packages
      library(dplyr);
      library(tidyr);
      library(stringr);
      library(survival);
      
      # Obtain the coefficients for the (survival) models
      coefs <- models[i.run, ];
      
      # Model L1-Time
      survmod.L1 <- coefs[str_detect(names(coefs), "SURVMOD_L1_")];
      names(survmod.L1) <- str_remove(names(survmod.L1), "SURVMOD_L1_");

      # Model L1-Event
      eventmod.L1 <- coefs[str_detect(names(coefs), "EVENTMOD_L1_")];
      names(eventmod.L1) <- str_remove(names(eventmod.L1), "EVENTMOD_L1_");
      
      # Model LX-Time
      survmod.LX <- coefs[str_detect(names(coefs), "SURVMOD_LX_")];
      names(survmod.LX) <- str_remove(names(survmod.LX), "SURVMOD_LX_");
      
      # Simulation for the ChemOnly (control) strategy
      sim.chemonly <- runDES(patients = filter(patients, L1.Arm.ChemBev==F), nsim = nsim, PFS.Time.Model = survmod.L1, PFS.Event.Model = eventmod.L1, OS.Time.Model = survmod.LX);
      
      # Simulation for the ChemBev (experimental) strategy
      sim.chembev <- runDES(patients = filter(patients, L1.Arm.ChemBev==T), nsim = nsim, PFS.Time.Model = survmod.L1, PFS.Event.Model = eventmod.L1, OS.Time.Model = survmod.LX);
      
      # Fit Kaplan-Meier objects for all time-to-events and both strategies to generate the plots at a later stage
      km.L1.PFS.sim.chemonly <- survfit(Surv(PFS.Time, PFS.Event=="Progression") ~ 1, data=sim.chemonly, type="kaplan-meier");
      km.L1.PFS.sim.chembev <- survfit(Surv(PFS.Time, PFS.Event=="Progression") ~ 1, data=sim.chembev, type="kaplan-meier");
      km.L1.OS.sim.chemonly <- survfit(Surv(PFS.Time, PFS.Event=="Death") ~ 1, data=sim.chemonly, type="kaplan-meier");
      km.L1.OS.sim.chembev <- survfit(Surv(PFS.Time, PFS.Event=="Death") ~ 1, data=sim.chembev, type="kaplan-meier");
      km.LX.sim.chemonly <- survfit(Surv(OS.Time) ~ 1, data=sim.chemonly, type="kaplan-meier");
      km.LX.sim.chembev <- survfit(Surv(OS.Time) ~ 1, data=sim.chembev, type="kaplan-meier");
      
      # Use the custom 'predictKM' function to get the curves for the defined timeframe
      plot.L1.PFS.sim.chemonly <- predictKM(km = km.L1.PFS.sim.chemonly, t = times[[1]]);
      plot.L1.PFS.sim.chembev <- predictKM(km = km.L1.PFS.sim.chembev, t = times[[1]]);
      plot.L1.OS.sim.chemonly <- predictKM(km = km.L1.OS.sim.chemonly, t = times[[2]]);
      plot.L1.OS.sim.chembev <- predictKM(km = km.L1.OS.sim.chembev, t = times[[2]]);
      plot.LX.sim.chemonly <- predictKM(km = km.LX.sim.chemonly, t = times[[3]]);
      plot.LX.sim.chembev <- predictKM(km = km.LX.sim.chembev, t = times[[3]]);
      
      # Extract the median time-to-events
      median.L1.PFS.sim.chemonly <- summary(km.L1.PFS.sim.chemonly)$table['median'];
      median.L1.PFS.sim.chembev <- summary(km.L1.PFS.sim.chembev)$table['median'];
      median.L1.OS.sim.chemonly <- summary(km.L1.OS.sim.chemonly)$table['median'];
      median.L1.OS.sim.chembev <- summary(km.L1.OS.sim.chembev)$table['median'];
      median.LX.sim.chemonly <- summary(km.LX.sim.chemonly)$table['median'];
      median.LX.sim.chembev <- summary(km.LX.sim.chembev)$table['median'];
      
      # Extract the event incidences from the simulation objects
      prob.L1.death.chemonly <- mean(sim.chemonly$PFS.Event=="Death");
      prob.L1.death.chembev <- mean(sim.chembev$PFS.Event=="Death");
      
      # Return all objects in a list because of the different formats
      list(km = list(km.L1.PFS.sim.chemonly = plot.L1.PFS.sim.chemonly, km.L1.PFS.sim.chembev = plot.L1.PFS.sim.chembev, km.L1.OS.sim.chemonly = plot.L1.OS.sim.chemonly, km.L1.OS.sim.chembev = plot.L1.OS.sim.chembev, km.LX.sim.chemonly = plot.LX.sim.chemonly, km.LX.sim.chembev = plot.LX.sim.chembev), median = c(median.L1.PFS.sim.chemonly = median.L1.PFS.sim.chemonly, median.L1.PFS.sim.chembev = median.L1.PFS.sim.chembev, median.L1.OS.sim.chemonly = median.L1.OS.sim.chemonly, median.L1.OS.sim.chembev = median.L1.OS.sim.chembev, median.LX.sim.chemonly = median.LX.sim.chemonly, median.LX.sim.chembev = median.LX.sim.chembev), p = c(prob.L1.death.chemonly = prob.L1.death.chemonly, prob.L1.death.chembev = prob.L1.death.chembev))
      
    })
    
    # Stop the computation cluster
    stopCluster(cl);
  
  # If no simulation is to be performed because a previous outcome object is provide,
  # the correct objects need to be extracted from the object.
  } else {
    
    # Kaplan-Meier curves etc.
    sim <- output$sim;
    
    # Observed data
    patients <- output$patients;
    patients.times <- output$patients.times;
    
  }
  
  # Transform all Kaplan-Meier curves into one matrix
  km.L1.PFS.sim.chemonly <- sapply(sim, function(s) s$km$km.L1.PFS.sim.chemonly);
  km.L1.PFS.sim.chembev <- sapply(sim, function(s) s$km$km.L1.PFS.sim.chembev);
  km.L1.OS.sim.chemonly <- sapply(sim, function(s) s$km$km.L1.OS.sim.chemonly);
  km.L1.OS.sim.chembev <- sapply(sim, function(s) s$km$km.L1.OS.sim.chembev);
  km.LX.sim.chemonly <- sapply(sim, function(s) s$km$km.LX.sim.chemonly);
  km.LX.sim.chembev <- sapply(sim, function(s) s$km$km.LX.sim.chembev);

  # Calculate the mean curves and the corresponding 95% Confidence Intervals
  km.L1.PFS.sim.chemonly.curves <- apply(km.L1.PFS.sim.chemonly, 1, function(row) c(low = quantile(row, 0.025), mean = mean(row), high = quantile(row, 0.975)))
  km.L1.PFS.sim.chembev.curves <- apply(km.L1.PFS.sim.chembev, 1, function(row) c(low = quantile(row, 0.025), mean = mean(row), high = quantile(row, 0.975)))
  km.L1.OS.sim.chemonly.curves <- apply(km.L1.OS.sim.chemonly, 1, function(row) c(low = quantile(row, 0.025), mean = mean(row), high = quantile(row, 0.975)))
  km.L1.OS.sim.chembev.curves <- apply(km.L1.OS.sim.chembev, 1, function(row) c(low = quantile(row, 0.025), mean = mean(row), high = quantile(row, 0.975)))
  km.LX.sim.chemonly.curves <- apply(km.LX.sim.chemonly, 1, function(row) c(low = quantile(row, 0.025), mean = mean(row), high = quantile(row, 0.975)))
  km.LX.sim.chembev.curves <- apply(km.LX.sim.chembev, 1, function(row) c(low = quantile(row, 0.025), mean = mean(row), high = quantile(row, 0.975)))
 
  # Kaplan-Meier curves for the observed data from the 'patients' input object
  km.L1.PFS.obs.chemonly <- survfit(Surv(L1.PFS.Time, L1.PFS.Events=="Progression")~1, data=filter(patients.times, L1.Arm.ChemBev==F), type="kaplan-meier");
  km.L1.PFS.obs.chembev <- survfit(Surv(L1.PFS.Time, L1.PFS.Events=="Progression")~1, data=filter(patients.times, L1.Arm.ChemBev==T), type="kaplan-meier");
  km.L1.OS.obs.chemonly <- survfit(Surv(L1.PFS.Time, L1.PFS.Events=="Death")~1, data=filter(patients.times, L1.Arm.ChemBev==F), type="kaplan-meier");
  km.L1.OS.obs.chembev <- survfit(Surv(L1.PFS.Time, L1.PFS.Events=="Death")~1, data=filter(patients.times, L1.Arm.ChemBev==T), type="kaplan-meier");
  km.LX.obs.chemonly <- survfit(Surv(LX.OS.Time, LX.OS.Events=="Death")~1, data=filter(patients.times, L1.Arm.ChemBev==F), type="kaplan-meier");
  km.LX.obs.chembev <- survfit(Surv(LX.OS.Time, LX.OS.Events=="Death")~1, data=filter(patients.times, L1.Arm.ChemBev==T), type="kaplan-meier");
  
  # Plots
  par(mfrow=c(1,3));
  lwds <- c(2,2,2,2);
  cols <- c("black", "red", "green", "blue");
  plot(main=str_c("First-Line Progression (Event is Progression)\n", "(ChemOnly=",sum(km.L1.PFS.obs.chemonly$n.event),", ChemBev=",sum(km.L1.PFS.obs.chembev$n.event),")"), xlab="Time to Progression (Days)", km.L1.PFS.obs.chemonly, col=cols[1], lwd=1, conf.int=T, las=1, xlim=c(0,timemax[[1]]));
  lines(km.L1.PFS.obs.chemonly, col=cols[1], lwd=lwds[1], conf.int=F);
  lines(km.L1.PFS.obs.chembev, col=cols[2], lwd=1, conf.int=T);
  lines(km.L1.PFS.obs.chembev, col=cols[2], lwd=lwds[2], conf.int=F);
  lines(x=times[[1]], y=km.L1.PFS.sim.chemonly.curves[2,], col=cols[3], lwd=lwds[3]);
  lines(x=times[[1]], y=km.L1.PFS.sim.chemonly.curves[1,], col=cols[3], lwd=1, lty=2);
  lines(x=times[[1]], y=km.L1.PFS.sim.chemonly.curves[3,], col=cols[3], lwd=1, lty=2);
  lines(x=times[[1]], y=km.L1.PFS.sim.chembev.curves[2,], col=cols[4], lwd=lwds[4]);
  lines(x=times[[1]], y=km.L1.PFS.sim.chembev.curves[1,], col=cols[4], lwd=1, lty=2);
  lines(x=times[[1]], y=km.L1.PFS.sim.chembev.curves[3,], col=cols[4], lwd=1, lty=2);
  legend("topright", c("ChemOnly (obs)", "ChemBev (obs)", "ChemOnly (sim)", "ChemBev (sim)"), col=cols, lwd=lwds, lty=1, bty="n")
  
  plot(main=str_c("First-Line Progression (Event is Death)\n", "(ChemOnly=",sum(km.L1.OS.obs.chemonly$n.event),", ChemBev=",sum(km.L1.OS.obs.chembev$n.event),")"), xlab="Time to Death (Days)", km.L1.OS.obs.chemonly, col=cols[1], lwd=1, conf.int=T, las=1, xlim=c(0,timemax[[2]]));
  lines(km.L1.OS.obs.chemonly, col=cols[1], lwd=lwds[1], conf.int=F);
  lines(km.L1.OS.obs.chembev, col=cols[2], lwd=1, conf.int=T);
  lines(km.L1.OS.obs.chembev, col=cols[2], lwd=lwds[2], conf.int=F);
  lines(x=times[[2]], y=km.L1.OS.sim.chemonly.curves[2,], col=cols[3], lwd=lwds[3]);
  lines(x=times[[2]], y=km.L1.OS.sim.chemonly.curves[1,], col=cols[3], lwd=1, lty=2);
  lines(x=times[[2]], y=km.L1.OS.sim.chemonly.curves[3,], col=cols[3], lwd=1, lty=2);
  lines(x=times[[2]], y=km.L1.OS.sim.chembev.curves[2,], col=cols[4], lwd=lwds[4]);
  lines(x=times[[2]], y=km.L1.OS.sim.chembev.curves[1,], col=cols[4], lwd=1, lty=2);
  lines(x=times[[2]], y=km.L1.OS.sim.chembev.curves[3,], col=cols[4], lwd=1, lty=2);
  legend("topright", c("ChemOnly (obs)", "ChemBev (obs)", "ChemOnly (sim)", "ChemBev (sim)"), col=cols, lwd=lwds, lty=1, bty="n")
  
  plot(main=str_c("Death after Progression\n", "(ChemOnly=",sum(km.LX.obs.chemonly$n.event),", ChemBev=",sum(km.LX.obs.chembev$n.event),")"), xlab="Time to Death (Days)", km.LX.obs.chemonly, col=cols[1], lwd=1, conf.int=T, las=1, xlim=c(0,timemax[[3]]));
  lines(km.LX.obs.chemonly, col=cols[1], lwd=lwds[1], conf.int=F);
  lines(km.LX.obs.chembev, col=cols[2], lwd=1, conf.int=T);
  lines(km.LX.obs.chembev, col=cols[2], lwd=lwds[2], conf.int=F);
  lines(x=times[[3]], y=km.LX.sim.chemonly.curves[2,], col=cols[3], lwd=lwds[3]);
  lines(x=times[[3]], y=km.LX.sim.chemonly.curves[1,], col=cols[3], lwd=1, lty=2);
  lines(x=times[[3]], y=km.LX.sim.chemonly.curves[3,], col=cols[3], lwd=1, lty=2);
  lines(x=times[[3]], y=km.LX.sim.chembev.curves[2,], col=cols[4], lwd=lwds[4]);
  lines(x=times[[3]], y=km.LX.sim.chembev.curves[1,], col=cols[4], lwd=1, lty=2);
  lines(x=times[[3]], y=km.LX.sim.chembev.curves[3,], col=cols[4], lwd=1, lty=2);
  legend("topright", c("ChemOnly (obs)", "ChemBev (obs)", "ChemOnly (sim)", "ChemBev (sim)"), col=cols, lwd=lwds, lty=1, bty="n")
  
  # Extract Median
  medians.sim <- sapply(sim, function(s) s$median);
  medians.sim.low <- apply(medians.sim, 1, quantile, probs=0.025, na.rm=T);
  medians.sim.mean <- apply(medians.sim, 1, mean, na.rm=T);
  medians.sim.high <- apply(medians.sim, 1, quantile, probs=0.975, na.rm=T);
  
  print(cbind(medians.sim.low, medians.sim.mean, medians.sim.high))
  print(km.L1.PFS.obs.chemonly);
  print(km.L1.PFS.obs.chembev);
  print(km.L1.OS.obs.chemonly);
  print(km.L1.OS.obs.chembev);
  print(km.LX.obs.chemonly);
  print(km.LX.obs.chembev);
  
  # Probability of death
  prob.sim <- sapply(sim, function(s) s$p);
  prob.sim.low <- apply(prob.sim, 1, quantile, probs=0.025);
  prob.sim.mean <- apply(prob.sim, 1, mean);
  prob.sim.high <- apply(prob.sim, 1, quantile, probs=0.975);
  print(cbind(prob.sim.low, prob.sim.mean, prob.sim.high))
  
  count.obs <- table(patients.times$L1.PFS.Events, patients.times$L1.Arm.ChemBev);
  prob.obs <- apply(count.obs, 2, function(col) col/sum(col));
  print(prob.obs)
  
  # Only return the results if a new simulation is performed
  if(runsim) return(list(sim=sim, patients=patients, patients.times=patients.times));
  
}

Step 6.2: performing the validation of the discrete event simulation

After the custom functions have been defined, the validation of the discrete event simulation can be performed. Unfortunately, the numerical results of the validation cannot be presented in a nicer way because they are printed from without the ‘validateDES’ function.

# The sets of correlated model parameters need to be extracted from the 'out.bootstrap' object
psa.models <- out.bootstrap[, !str_detect(string=colnames(out.bootstrap), pattern="BOOT")];
psa.models <- psa.models[, !str_detect(string=colnames(psa.models), pattern="ORG")];
psa.models <- psa.models[, !str_detect(string=colnames(psa.models), pattern="OPT")];

# Only select the columns that are required to perform a simulation
dataset <- datasets[[1]];
patients <- dataset %>% 
  select(L1.Arm.ChemBev, Age, Hospital.Private, Primary.Site, Stage.Diagnosis, Metastases.Number, Metastases.Liver, Metastases.Peritoneum, L1.ECOG, Charlson, CEA.Level, RAS.Mutated, BRAF.Mutated);

# Additionally provide the observed time-to-events for validation purposes
patients.times <- datasets[[1]] %>% 
  select(L1.PFS.Time, L1.PFS.Events, LX.OS.Time, LX.OS.Events, L1.Arm.ChemBev, Age, Hospital.Private, Primary.Site, Stage.Diagnosis, Metastases.Number, Metastases.Liver, Metastases.Peritoneum, L1.ECOG, Charlson, CEA.Level, RAS.Mutated, BRAF.Mutated);

# Call the 'validateDES' function to perform the simulations and generate all plots
png(file="Validation DES 20190201.png", width=12, height=6, units="in", res=96);
val.cohort <- validateDES(patients = patients, patients.times = patients.times, nsim = 20000, nrun = 10000,  models = psa.models);
##                                   medians.sim.low medians.sim.mean
## median.L1.PFS.sim.chemonly.median        171.4950         198.8957
## median.L1.PFS.sim.chembev.median         289.1615         307.8619
## median.L1.OS.sim.chemonly.median         724.7450        1606.9122
## median.L1.OS.sim.chembev.median         1013.3768        1171.1054
## median.LX.sim.chemonly.median            161.1872         199.6297
## median.LX.sim.chembev.median             242.6702         266.4501
##                                   medians.sim.high
## median.L1.PFS.sim.chemonly.median         228.2406
## median.L1.PFS.sim.chembev.median          327.7214
## median.L1.OS.sim.chemonly.median         2623.2981
## median.L1.OS.sim.chembev.median          1392.2564
## median.LX.sim.chemonly.median             241.8250
## median.LX.sim.chembev.median              291.9795
## Call: survfit(formula = Surv(L1.PFS.Time, L1.PFS.Events == "Progression") ~ 
##     1, data = filter(patients.times, L1.Arm.ChemBev == F), type = "kaplan-meier")
## 
##       n  events  median 0.95LCL 0.95UCL 
##     191     157     171     155     210 
## Call: survfit(formula = Surv(L1.PFS.Time, L1.PFS.Events == "Progression") ~ 
##     1, data = filter(patients.times, L1.Arm.ChemBev == T), type = "kaplan-meier")
## 
##       n  events  median 0.95LCL 0.95UCL 
##     676     571     304     280     329 
## Call: survfit(formula = Surv(L1.PFS.Time, L1.PFS.Events == "Death") ~ 
##     1, data = filter(patients.times, L1.Arm.ChemBev == F), type = "kaplan-meier")
## 
##       n  events  median 0.95LCL 0.95UCL 
##     191      21      NA      NA      NA 
## Call: survfit(formula = Surv(L1.PFS.Time, L1.PFS.Events == "Death") ~ 
##     1, data = filter(patients.times, L1.Arm.ChemBev == T), type = "kaplan-meier")
## 
##       n  events  median 0.95LCL 0.95UCL 
##     676      62    1055     902      NA 
## Call: survfit(formula = Surv(LX.OS.Time, LX.OS.Events == "Death") ~ 
##     1, data = filter(patients.times, L1.Arm.ChemBev == F), type = "kaplan-meier")
## 
##    34 observations deleted due to missingness 
##       n  events  median 0.95LCL 0.95UCL 
##     157     141     193     153     243 
## Call: survfit(formula = Surv(LX.OS.Time, LX.OS.Events == "Death") ~ 
##     1, data = filter(patients.times, L1.Arm.ChemBev == T), type = "kaplan-meier")
## 
##    105 observations deleted due to missingness 
##       n  events  median 0.95LCL 0.95UCL 
##     571     467     270     237     308 
##                        prob.sim.low prob.sim.mean prob.sim.high
## prob.L1.death.chemonly   0.07064875     0.1163047       0.16565
## prob.L1.death.chembev    0.07859750     0.1027043       0.12870
##              
##                    FALSE       TRUE
##   Censored    0.06806283 0.06360947
##   Death       0.10994764 0.09171598
##   Progression 0.82198953 0.84467456
dev.off();
## png 
##   2

Step 7: Exploratory Analyses

After the discrete event simulation has been validated, an exploratory analysis is performed in which patients’ treatment is targeted using personalized predictions of their first-line progression-free survival using Model L1-Time. To achieve this, Step 7 exists out of the following four steps:

  1. Defining functions that are used to run the discrete event simulation
  2. Performing the personalized treatment targeting
  3. Performing the simulations based on the personalized treatment decisions
  4. Exploring the results of the simulations

Step 7.1: defining functions that are used to run the discrete event simulation

Several custom functions are defined to run the simulation. Also, some earlier defined functions are re-written as they do not need to be as extensive as they needed to be for the data analysis. Please find an explanation of the functions in the R code below.

# The 'runPSA' function is a wrapper function that repeatedly calls the 'runDES' function
# for a patient (sub)population and different sets of parameter values
runPSA <- function(patients=NULL, nsim=NULL, nrun=NULL, models=NULL, freecores=0, seed=20181212) {
  
  # Check whether sufficient sets of parameters are provided to perform the requested number of runs
  if(nrun>nrow(models)) stop(str_c("nrun can be max ",nrow(models), "!"));
  
  # Set the random sumber seed for reproduction
  set.seed(seed);
  
  # Multithreading
  library(parallel);
  library(doSNOW);
  cl <- makeCluster(detectCores()-freecores);
  clusterExport(cl, c("patients",  
                      "runDES", "predSurv", "predLogr",
                      "getData.L1", "getData.LX"));
    
  # Perform the probabilistic sensitivity analysis
  psa.out <- parSapply(cl, 1:nrun, function(i.run) {
    
    # Load packages
    library(dplyr);
    library(tidyr);
    library(stringr);
    library(survival);
    
    # Extract the model coefficients from the 'models' object with all sets
    coefs <- models[i.run, ];
    
    # Model L1-Time
    survmod.L1 <- coefs[str_detect(names(coefs), "SURVMOD_L1_")];
    names(survmod.L1) <- str_remove(names(survmod.L1), "SURVMOD_L1_");
    
    # Model L1-Event
    eventmod.L1 <- coefs[str_detect(names(coefs), "EVENTMOD_L1_")];
    names(eventmod.L1) <- str_remove(names(eventmod.L1), "EVENTMOD_L1_");
    
    # Model LX-Time
    survmod.LX <- coefs[str_detect(names(coefs), "SURVMOD_LX_")];
    names(survmod.LX) <- str_remove(names(survmod.LX), "SURVMOD_LX_");
    
    # Run the DES
    sim <- runDES(patients = patients,
                  nsim = nsim,
                  PFS.Time.Model = survmod.L1,
                  PFS.Event.Model = eventmod.L1,
                  OS.Time.Model = survmod.LX);

    # For now we are only interested in the median first-line progression-free survival
    median(sim$PFS.Time)
    
  })
  
  # Stop the multithreading
  stopCluster(cl);
  
  # Return the median first-line progression-free survival outcomes
  return(psa.out)
  
}

Step 7.2: performing the personalized treatment targeting

To perform a simulation to estimate the impact of personalized treatment decisions, first it needs to be determined what is expected to be the best strategy (ChemOnly or ChemBev) for each patient individually. To make these predictions, Model L1-Time is used to make a prediction whether first-line progression-free survival is expected to be higher for the ChemOnly strategy or for the ChemBev strategy.

# Transform the data to the required format
patients.chemonly <- getData.L1(data = patients %>% mutate(L1.Arm.ChemBev=F));
patients.chembev <- getData.L1(data = patients %>% mutate(L1.Arm.ChemBev=T));

# Determine the Weibull scale parameter for each patient individually (a higher value for the scale 
# parameter indicates a higher time-to-event, i.e. better first-line progression-free survival)
response.chemonly <- predSurv(model = Model.L1.Time$coef, newdata = patients.chemonly, type = "response");
response.chembev <- predSurv(model = Model.L1.Time$coef, newdata = patients.chembev, type = "response");

# Set the treatment strategy indicator 'L1.Arm.ChemBev' to the treatment strategy associated with 
# the highest expected first-line progression-free survival
patients <- patients %>%
  mutate(L1.Arm.ChemBev = if_else(response.chembev > response.chemonly, T, F));

# Observe how the treatment were provided as observed in the data
table(dataset$L1.Arm.ChemBev);
## 
## FALSE  TRUE 
##   191   676
# # Observe how the treatment are provided according to the personalized treatment decisions
table(patients$L1.Arm.ChemBev);
## 
## FALSE  TRUE 
##    74   793
# Number of different treatment decisions
n.differenttreatment <- sum(ifelse(dataset$L1.Arm.ChemBev != patients$L1.Arm.ChemBev, 1, 0));
n.differenttreatment;
## [1] 219
n.differenttreatment / nrow(dataset);
## [1] 0.2525952
# Number who swithed from ChemOnly to ChemBev
n.ChemOnly.ChemBev <- sum(ifelse(!dataset$L1.Arm.ChemBev & patients$L1.Arm.ChemBev, 1, 0));
n.ChemOnly.ChemBev;
## [1] 168
n.ChemOnly.ChemBev / n.differenttreatment;
## [1] 0.7671233
# Number who swithed from ChemBev to ChemOnly
n.ChemBev.ChemOnly <- sum(ifelse(dataset$L1.Arm.ChemBev & !patients$L1.Arm.ChemBev, 1, 0));
n.ChemBev.ChemOnly;
## [1] 51
n.ChemBev.ChemOnly / n.differenttreatment;
## [1] 0.2328767

Step 7.3: performing the simulations based on the personalized treatment decisions

After determining the treatment each patient should receive based on the personalized treatment decisions informed by the estimated first-line progression-free survival, a probabilistic sensitivity analysis can be performed to estimate the impact.

# Perform a simulation for the whole patient population
sim.out <- runPSA(patients = patients, nsim = 20000, nrun = 10000, models = psa.models, freecores = 0, seed = 20190104)

# Determine which patients switched treatment
switched <- which(dataset$L1.Arm.ChemBev != patients$L1.Arm.ChemBev);
patients <- patients[switched, , drop=F];

# Perform a simulation only for the 219 patients who switched treatment
sim.out.switched <- runPSA(patients = patients, nsim = 20000, nrun = 10000, models = psa.models, freecores = 0, seed = 20190104)

Step 7.4: exploring the results of the simulations

The final step of this study is to assess the impact of this exploratory analysis based on personalized treatment decisions.

## For the complete patient population

# Mean median PFS and upper and lower bound over the probabilistic sensitivity analysis
mean(sim.out);
## [1] 288.0516
quantile(sim.out, probs = c(0.025, 0.975));
##     2.5%    97.5% 
## 270.2137 306.6571
# Median PFS as observed from the data
survfit(formula = Surv(L1.PFS.Time, L1.PFS.Event) ~ 1, data = dataset, type = "kaplan-meier");
## Call: survfit(formula = Surv(L1.PFS.Time, L1.PFS.Event) ~ 1, data = dataset, 
##     type = "kaplan-meier")
## 
##       n  events  median 0.95LCL 0.95UCL 
##     867     811     265     248     280
## For the patients who switched treatment based on the personalized treatment decision

# Mean median PFS and upper and lower bound over the probabilistic sensitivity analysis
mean(sim.out.switched);
## [1] 269.4719
quantile(sim.out.switched, probs = c(0.025, 0.975));
##     2.5%    97.5% 
## 246.3946 293.4646
# Median PFS as observed from the data
survfit(formula = Surv(L1.PFS.Time, L1.PFS.Event) ~ 1, data = dataset[switched, ], type = "kaplan-meier");
## Call: survfit(formula = Surv(L1.PFS.Time, L1.PFS.Event) ~ 1, data = dataset[switched, 
##     ], type = "kaplan-meier")
## 
##       n  events  median 0.95LCL 0.95UCL 
##     219     205     175     156     199