This document shows all code, intermediate outcomes, and final outcomes belonging to the manuscript “Simulating Progression-Free and Overall Survival for First-Line Doublet Chemotherapy With or Without Bevacizumab in Metastatic Colorectal Cancer Patients Based on Real-World Registry Data” by K Degeling et al. in PharmacoEconomics (available here). Using real world data from the Treatment of Recurrent and Advanced Colorectal Cancer registry, two parametric survival models and a logistic regression model are estimated to populate a discrete event simulation model. Please read the manuscript before reading through this document, as the manuscript contains information essential for understanding decisions made in the code that is included in this document.

Please appropriately cite the work above if you use any code provided in this document for your own analyses.

This document contains the following sections:

Initalization
Step 1: Cleaning the TRACC Data
Step 2: Multiple Imputation of Missing Data
Step 3: Univariable Analysis
Step 4: Multivariable Modeling of Survival
Step 5: Model Validation
Step 6: Discrete Event Simulation
Step 7: Exploratory Analyses

Please note that the number of bootstrap iterations and simulation runs performed is set to a lower number for illustration purposes due to the required runtime to perform the 10,000 iterations/simulations that are performed for the final analyses presented in the manuscript. Hence, results and plots in this document may be different from those reported in the manuscript. Results presented in the manuscript should be considered as the final and definitive results at all times.

Initialization

To intialize the data analysis, several actions are performed. Most importantly, the data are loaded from an encrypted hard drive, as are the input matrices defined by clinicians to recode the treatments as observed in the unprocessed data.

# Clear Workspace and plots, and clear console
rm(list=ls()); gc(); if(!is.null(dev.list())) dev.off(); cat("\014");

# Load packages
library(dplyr);
library(readxl);
library(stringr);

# Set the locale so R handles dates as English regardless of location
Sys.setlocale("LC_TIME", "English");

# Load the raw TRACC data for the patient characteristics and first line treatment
df <- read_excel("G:/Files/mCRC Targeting Melbourne/CRC MODELING DATA 20181105.xlsx", 
                 col_types = c("numeric", "date", "text", "text",
                               "text", "text", "text", "date", "date", 
                               "date", "date", "numeric", "text", 
                               "text", "text", "text", "text", "text", 
                               "text", "text", "text", "text", "text", 
                               "text", "text", "text", "text", "text", 
                               "text", "text", "numeric", "text", 
                               "text", "text", "text", "text", "text", 
                               "text", "text", "text", "text", "text", 
                               "text", "text", "text", "text", "text", 
                               "text", "text", "text", "numeric", 
                               "text", "text", "text", "text", "date", 
                               "text", "text", "text", "date", "text", 
                               "text", "text", "text", "text", "date",
                               "numeric","text",
                               "text", "text", "text", "text", "date", 
                               "date", "date", "text", "text", "text", 
                               "text", "text", "text", "text", "numeric", 
                               "text", "numeric", "text", "numeric", 
                               "text", "text", "text", "numeric", 
                               "text", "numeric", "text", "numeric", 
                               "text", "numeric", "text", "numeric", 
                               "text", "numeric", "text", "numeric", 
                               "text", "numeric", "text", "numeric", 
                               "text", "numeric", "text", "text", 
                               "numeric", "text", "text"));

# Replace spaces in column names with dots/points to handle columns more easily
colnames(df) <- str_replace_all(string=colnames(df), pattern=" ", replace=".");

# Load the chemo and biologic classification by Hui-Li & Koen
chemo.class <- read_excel("G:/Files/mCRC Targeting Melbourne/Chemotherapy Classification 20181123.xlsx");
biolo.class <- read_excel("G:/Files/mCRC Targeting Melbourne/Biological Classification 20181123.xlsx");

Step 1: Cleaning the TRACC Data

Before all models can be fitted, the data need to be cleaned and the appropriate patient population is to be selected. To this end, several steps are performed:

Initial data cleaning and patient selection
Cleaning of patient characteristics and treatment history
Cleaning of clinical outcomes for first-line treatment
Cleaning of clinical outcomes for further treatment lines
Final patient selection

Step 1.1: Initial data cleaning and patient selection

In the first step of the data cleaning, the patient selection is performed. All data required for this selection are recoded and cleaned in this step as well. Patients included are those …

who received chemotherapy with or without bevacizumab
treated with a palliative treatment intent
with a treatment start date after 1 January 2009 and before 1 June 2017

# Inclusion is based on whether the first line of palliative treatment included
# doublet chemotherapy with or without bevacizumab. Patients for whom treatment
# intent was indicated as curative or potentially curative, are excluded.

# Only patients with recorded first-line chemotherapy treatment are considered for inclusion
df <- df %>%
  filter(!is.na(REGIMEN.1L)) %>%
  filter(TREATMENTINTENT=="Palliative only")

# Recode first-line treatment information for inclusion
df <- df %>%
  mutate(L1.Regimen = REGIMEN.1L) %>%
  mutate(L1.Biologic = BIOLOGIC.1L) %>%
  mutate(L1.Start.Date = as.character(START.DATE.1L))

# In the TRACC registry, all regimen in a treatment line need to be recoded first. 
# This has been manually done in an Excel file, which has been loaded in the global 
# environment as object 'chemo.class' and will now be used to recode the regimen.
# The 'chemo.class' object contains a column 'Recorded.Regimen' with all uncoded
# treatments and a columns 'Classification.G0' with the recoded treatments. The
# column 'L1.Chemo.Eligible' is added to indicate whether a patient received doublet
# chemotherapy and, hence, is eligible for inclusion.
df <- df %>%
  left_join(select(chemo.class, Recorded.Regimen, Classification.G0), by=c("L1.Regimen" = "Recorded.Regimen")) %>%
  rename(L1.Chemo.Class = Classification.G0) %>%
  mutate(L1.Chemo.Class = ifelse(is.na(L1.Chemo.Class), "None", L1.Chemo.Class)) %>%
  mutate(L1.Chemo.Eligible = ifelse(L1.Chemo.Class=="Doublet", T, F))

# Since only patients who had a doublet chemotherapy with or without bevacizumab,
# so not with another additional agent like cetuximab, are include, it need to be
# checked whether patients had bevacizumab yes or no, and whether they are
# eligible based on any other biologicals. This is done based on the first
# reported biological
df <- df %>%
  left_join(select(biolo.class, Recorded.Biologicals, Classification.G0), by=c("L1.Biologic" = "Recorded.Biologicals")) %>%
  rename(L1.Biologic.Class=Classification.G0) %>%
  mutate(L1.Biologic.Class = ifelse(is.na(L1.Biologic.Class), "None", L1.Biologic.Class)) %>%
  mutate(L1.Biologic.Eligible = ifelse(L1.Biologic.Class %in% c("None", "Bevacizumab"), T, F)) %>%
  mutate(L1.Bevacizumab = ifelse(L1.Biologic.Class=="Bevacizumab", T, F))

# Final selection of patients is performed based on whether they had ...
# 1) doublet chemotherapy with or without bevacizumab and no other biological
# 2) treatment start date after 2009-01-01
# 3) treatment start date before 2017-06-01
df<- df %>%
  filter(L1.Chemo.Eligible) %>%
  filter(L1.Biologic.Eligible) %>%
  filter(L1.Start.Date >= as.Date("2009-01-01")) %>%
  filter(L1.Start.Date < as.Date("2017-06-01"))

# Allocate patients to the appropriate treatment arm
df <- df %>%
  mutate(L1.Arm.ChemBev = ifelse(L1.Bevacizumab, T, F))

Step 1.2: Cleaning of patient characteristics and treatment history

The second step in cleaning the data is making sure all variables are recoded in relevant categories and all observations are categorized in one of these defined categories. The following variables are extracted from the data:

Age at treatment start
Gender
Hospital type (private or public)
Hospital location (located in Victoria or not)
Stage at initial diagnosis
Primary tumour site
Presence of peritoneal metastases
Presence of liver metastases
Number of metastatic sites
ECOG performance status
Presence of comorbidities
Charlson comorbidity index
CEA level
RAS mutation status
BRAF mutation status
Treatment intent
Treatment start year

df <- df %>%
  
  # Age in years at treatment start
  mutate(Age = floor(as.numeric(difftime(L1.Start.Date, DOB, units="weeks")/52))) %>%         

  # Gender is female or not
  mutate(Gender.Female = ifelse(GENDER=="F", T, F)) %>%                                       
  
  # Hospital type is private or not
  mutate(Hospital.Private = ifelse(TLOCATIONTYPE=="Private", T, F)) %>%
  
  # Hospital is located in Victoria or not
  mutate(Hospital.Victoria = ifelse(STATE=="VIC", T, F)) %>%
  
  # Stage of disease at initial diagnosis
  mutate(Stage.Diagnosis = ifelse(STAGE.AT.DIAGNOSIS=="Unknown", NA, NA))%>%
  mutate(Stage.Diagnosis = ifelse(STAGE.AT.DIAGNOSIS=="Locally advanced (rectal only)", 0, Stage.Diagnosis)) %>%
  mutate(Stage.Diagnosis = ifelse(STAGE.AT.DIAGNOSIS=="Stage I", 1, Stage.Diagnosis)) %>%
  mutate(Stage.Diagnosis = ifelse(STAGE.AT.DIAGNOSIS=="Stage II", 2, Stage.Diagnosis)) %>%
  mutate(Stage.Diagnosis = ifelse(STAGE.AT.DIAGNOSIS=="Stage III", 3, Stage.Diagnosis)) %>%
  mutate(Stage.Diagnosis = ifelse(STAGE.AT.DIAGNOSIS=="Stage IV", 4, Stage.Diagnosis)) %>%
  
  # Primary tumour site
  mutate(Primary.Site = ifelse(PRIMARYSITE=="Occult", NA, NA)) %>%
  mutate(Primary.Site = ifelse(PRIMARYSITE=="Colon", "Colon", Primary.Site)) %>%
  mutate(Primary.Site = ifelse(PRIMARYSITE=="Colon NOS", "Colon", Primary.Site)) %>%
  mutate(Primary.Site = ifelse(PRIMARYSITE=="Left colon", "Left", Primary.Site)) %>%
  mutate(Primary.Site = ifelse(PRIMARYSITE=="Right colon", "Right", Primary.Site)) %>%
  mutate(Primary.Site = ifelse(PRIMARYSITE=="Rectum", "Rectum", Primary.Site)) %>%
  mutate(Primary.Site = ifelse(PRIMARYSITE=="Multiple primaries", "Colon", Primary.Site)) %>%
  
  # Specific metastatic sites
  mutate(Metastases.Liver = ifelse(is.na(LIVER.METS), F, T)) %>%
  mutate(Metastases.Peritoneum = ifelse(is.na(PERITONEUM.OMENTUM.METS), F, T)) %>%
  
  # Number of metastatic sites
  mutate(Metastases.Number = NUMBER.OF.METS) %>%
  
  # ECOG Performance Status
  mutate(L1.ECOG = as.numeric(str_sub(ECOG.First.Line, 1, 1))) %>%
  
  # Whether the patient has comorbidities or not
  mutate(Comorbidities = ifelse(NO.COMORBIDITIES=="Yes", FALSE, TRUE)) %>%
  
  # Recode the comorbidities
  mutate(MYOCARDIAL.INFARCTION = ifelse(is.na(MYOCARDIAL.INFARCTION) | MYOCARDIAL.INFARCTION=="No", 0, 1)) %>%
  mutate(CONGESTIVE.HEART.FAILURE = ifelse(is.na(CONGESTIVE.HEART.FAILURE) | CONGESTIVE.HEART.FAILURE=="No", 0, 1)) %>%
  mutate(PERIPHERAL.VASCULAR.DISEASE = ifelse(is.na(PERIPHERAL.VASCULAR.DISEASE) | PERIPHERAL.VASCULAR.DISEASE=="No", 0, 1)) %>%
  mutate(CEREBROVASCULAR.DISEASE = ifelse(is.na(CEREBROVASCULAR.DISEASE) | CEREBROVASCULAR.DISEASE=="No", 0, 1)) %>%
  mutate(DEMENTIA = ifelse(is.na(DEMENTIA) | DEMENTIA=="No", 0, 1)) %>%
  mutate(COPD = ifelse(is.na(COPD) | COPD=="No", 0, 1)) %>%
  mutate(CONNECTIVE.TISSUE.DISEASE = ifelse(is.na(CONNECTIVE.TISSUE.DISEASE) | CONNECTIVE.TISSUE.DISEASE=="No", 0, 1)) %>%
  mutate(PEPTIC.ULCER.DISEASE = ifelse(is.na(PEPTIC.ULCER.DISEASE) | PEPTIC.ULCER.DISEASE=="No", 0, 1)) %>%
  mutate(MILD.LIVER.DISEASE = ifelse(is.na(MILD.LIVER.DISEASE) | MILD.LIVER.DISEASE=="No", 0, 1)) %>%
  mutate(DIABETES.NO.COMPS = ifelse(is.na(DIABETES.NO.COMPS) | DIABETES.NO.COMPS=="No", 0, 1)) %>%
  mutate(DIABETES.COMPS = ifelse(is.na(DIABETES.COMPS) | DIABETES.COMPS=="No", 0, 1)) %>%
  mutate(HEMIPLEGIA.STROKE = ifelse(is.na(HEMIPLEGIA.STROKE) | HEMIPLEGIA.STROKE=="No", 0, 1)) %>%
  mutate(RENAL.DISEASE.MOD.SEVERE = ifelse(is.na(RENAL.DISEASE.MOD.SEVERE) | RENAL.DISEASE.MOD.SEVERE=="No", 0, 1)) %>%
  mutate(SECOND.TUMOUR.NO.METS = ifelse(is.na(SECOND.TUMOUR.NO.METS) | SECOND.TUMOUR.NO.METS=="No", 0, 1)) %>%
  mutate(LYMPHOMA.MYELOMA = ifelse(is.na(LYMPHOMA.MYELOMA) | LYMPHOMA.MYELOMA=="No", 0, 1)) %>%
  mutate(LEUKAEMIA = ifelse(is.na(LEUKAEMIA) | LEUKAEMIA=="No", 0, 1)) %>%
  mutate(LIVER.DIS.MOD.SEVERE = ifelse(is.na(LIVER.DIS.MOD.SEVERE) | LIVER.DIS.MOD.SEVERE=="No", 0, 1)) %>%
  mutate(IMMUNE.DEFICIENCY = 0) %>%
  mutate(SECOND.MET.SOLID.TUMOUR = ifelse(is.na(SECOND.MET.SOLID.TUMOUR) | SECOND.MET.SOLID.TUMOUR=="No", 0, 1)) %>%
  
  # Charlson comorbidity score (NOT CORRECTED FOR AGE!!!)
  mutate(Charlson =  
           (1 * MYOCARDIAL.INFARCTION) + 
           (1 * CONGESTIVE.HEART.FAILURE) + 
           (1 * PERIPHERAL.VASCULAR.DISEASE) + 
           (1 * CEREBROVASCULAR.DISEASE) + 
           (1 * DEMENTIA) + 
           (1 * COPD) + 
           (1 * CONNECTIVE.TISSUE.DISEASE) + 
           (1 * PEPTIC.ULCER.DISEASE) + 
           (1 * MILD.LIVER.DISEASE) + 
           (1 * DIABETES.NO.COMPS) + 
           (2 * DIABETES.COMPS) + 
           (2 * HEMIPLEGIA.STROKE) + 
           (2 * RENAL.DISEASE.MOD.SEVERE) + 
           (2 * SECOND.TUMOUR.NO.METS) + 
           (2 * LYMPHOMA.MYELOMA) + 
           (2 * LEUKAEMIA) + 
           (3 * LIVER.DIS.MOD.SEVERE) + 
           (6 * IMMUNE.DEFICIENCY) +
           (6 * SECOND.MET.SOLID.TUMOUR)
  ) %>%
  
  # CEA Level
  mutate(CEA.Level = CEALEVEL) %>%

  # RAS status
  mutate(RAS.Mutated = ifelse(RAS.STATUS=="Pending", NA, NA)) %>%
  mutate(RAS.Mutated = ifelse(RAS.STATUS=="Unknown", NA, RAS.Mutated)) %>%
  mutate(RAS.Mutated = ifelse(RAS.STATUS=="Mutated", T, RAS.Mutated)) %>%
  mutate(RAS.Mutated = ifelse(RAS.STATUS=="Wild-type", F, RAS.Mutated)) %>%
  
  # BRAF status
  mutate(BRAF.Mutated = ifelse(BRAFSTATUS=="Pending", NA, NA)) %>%
  mutate(BRAF.Mutated = ifelse(BRAFSTATUS=="Unknown", NA, BRAF.Mutated)) %>%
  mutate(BRAF.Mutated = ifelse(BRAFSTATUS=="Mutated", T, BRAF.Mutated)) %>%
  mutate(BRAF.Mutated = ifelse(BRAFSTATUS=="Wild-type", F, BRAF.Mutated)) %>%
  
  # Treatment intent
  mutate(Treatment.Intent = ifelse(TREATMENTINTENT=="Curative resection occurred", "Curative", NA)) %>%
  mutate(Treatment.Intent = ifelse(TREATMENTINTENT=="Curative resection planned", "Curative", Treatment.Intent)) %>%
  mutate(Treatment.Intent = ifelse(TREATMENTINTENT=="Potentially curative resection (if good response to treatment)", "Potentially", Treatment.Intent)) %>%
  mutate(Treatment.Intent = ifelse(TREATMENTINTENT=="Palliative only", "Palliative", Treatment.Intent)) %>%
  
  # Treatment start
  mutate(Treatment.Start = format(as.Date(L1.Start.Date), "%Y"))

Step 1.3: Cleaning of clinical outcomes for first-line treatment

Clinical outcomes for the first line of treatment are extrated from the data in this third step of the data cleaning. In the manucript, only progression-free survival is considered as clinical outcome for first-line treatment. First-line progression is defined as physician-defined progression (so a patient will be able to receive further treatment or follow up) or death.

# First, the treatment stop date, date of progression, and date of death need to be extraced from the data
df<- df %>%  
  mutate(L1.Stop.Date = as.character(STOP.DATE.1L)) %>%
  mutate(L1.Progression.Date = as.character(PROGRESSION.DATE.First.Line)) %>%
  mutate(Death.Date = as.character(DATEOFDEATH))

# The progression-free survival (PFS) and overall survival (OS) are calculated using the
# treatment start date as origin
df <- df %>% 
  mutate(L1.PFS.Days = as.numeric(difftime(L1.Progression.Date, L1.Start.Date, units="days"))) %>%
  mutate(L1.OS.Days = as.numeric(difftime(Death.Date, L1.Start.Date, units="days"))) %>%
  mutate(L1.CENS.Days = as.numeric(difftime(L1.Stop.Date, L1.Start.Date, units="days"))) %>%
  mutate(L1.CENS.Days = ifelse(is.na(L1.CENS.Days), as.numeric(difftime(DATEOFREVIEW, L1.Start.Date, units="days")), L1.CENS.Days))


# Determine which of the events occured to the patient with regard to first line progression:
# 0) Censored:      when neither progression or death is recorded
# 1) Progression:   when progression is recorded
# 2) Death:         when progression is not recorded but death is
df <- df %>%
  mutate(L1.PFS.Events = ifelse(!is.na(L1.OS.Days) & (L1.OS.Days<=L1.PFS.Days | is.na(L1.PFS.Days)), "Death", NA)) %>%
  mutate(L1.PFS.Events = ifelse(!is.na(L1.PFS.Days) & (L1.PFS.Days<L1.OS.Days | is.na(L1.OS.Days)), "Progression", L1.PFS.Events)) %>%
  mutate(L1.PFS.Events = ifelse(is.na(L1.PFS.Days) & is.na(L1.OS.Days), "Censored", L1.PFS.Events))

# Determine whether an event occured or the patient was censored and the correspoding time
df <- df %>% 
  mutate(L1.PFS.Event = ifelse(L1.PFS.Events=="Censored", 0, 1)) %>%
  mutate(L1.PFS.Time = ifelse(L1.PFS.Events=="Progression", L1.PFS.Days, NA)) %>%
  mutate(L1.PFS.Time = ifelse(L1.PFS.Events=="Death", L1.OS.Days, L1.PFS.Time)) %>%
  mutate(L1.PFS.Time = ifelse(L1.PFS.Events=="Censored", L1.CENS.Days, L1.PFS.Time))

# Since a negative PFS or OS is not possible, patients with a non-positive time-to-event
# are excluded from the analysis, as their data is assumed to be incorrect
df <- df %>%
  mutate(L1.PFS.Include = ifelse(!is.na(L1.PFS.Time) & L1.PFS.Time > 0, T, F));

Step 1.4: Cleaning of clinical outcomes for further treatment lines

With regard to the clinical outcomes of treatment lines or follow up after first-line progression, only overall survival is considered. Overall survival is extracted from the data in this fourth step of the data cleaning.

# Time till death after progression
df <- df %>% 
  mutate(LX.OS.Days = as.numeric(difftime(DATEOFDEATH, L1.Progression.Date, units="days"))) %>%
  mutate(LX.CENS.Days = as.numeric(difftime(DATEOFREVIEW, L1.Progression.Date, units="days")))

# Determine whether the patient was censored:
# 0) Censored:      when death is not recorded
# 1) Death:         when death is recorded
df <- df %>%
  mutate(LX.OS.Events = ifelse(!is.na(LX.OS.Days), "Death", NA)) %>%
  mutate(LX.OS.Events = ifelse(is.na(LX.OS.Days) & !is.na(LX.CENS.Days), "Censored", LX.OS.Events)) %>%
  mutate(LX.OS.Events = ifelse(L1.PFS.Events=="Progression", LX.OS.Events, NA))

# Determine whether death occured or the patient was censored and the correspoding time
df <- df %>% 
  mutate(LX.OS.Event = ifelse(LX.OS.Events=="Censored", 0, 1)) %>%
  mutate(LX.OS.Time = ifelse(LX.OS.Events=="Death", LX.OS.Days, NA)) %>%
  mutate(LX.OS.Time = ifelse(LX.OS.Events=="Censored", LX.CENS.Days, LX.OS.Time))

# Since a negative PFS or OS is not possible, patients with a non-positive time-to-event
# are excluded from the analysis, as their data is assumed to be incorrect
df <- df %>%
  mutate(LX.OS.Include = ifelse(L1.PFS.Include & !is.na(LX.OS.Time) & LX.OS.Time > 0, T, F))

Step 1.5: Final patient selection

Patients with a negative progression-free survival or overall survival are excluded from the sample, as it is assumed that these are data entry errors. Finally, for the included patients, an overview table of their characteristics is generated.

# Only include patients who at least are included for first-line treatment
df <- df %>%
  filter(L1.PFS.Include)

# Only the new colomns need to be maintained
df <- df %>%
  select(L1.Regimen:LX.OS.Include)

# General function to get an overview table of the patient characteristics
getTable <- function(data, column.class=NULL, columns.names=NULL, columns.discrete=NULL, chisq.simulate=T, chisq.nsim=10000, export=F, export.file=NULL) {
  
  if(is.null(columns.names)) {
    columns.names <- c("Age", "Gender.Female", "Hospital.Private", "Stage.Diagnosis", "Primary.Site", 
                       "Metastases.Number", "Metastases.Liver", "Metastases.Peritoneum", "L1.ECOG",
                       "Charlson", "CEA.Level", "RAS.Mutated", "BRAF.Mutated");
    columns.discrete <- c(F, T, T, T, T, T, T, T, T, T, F, T, T);
  }
  
  if(is.null(column.class)) {
    column.class <- "L1.Arm.ChemBev";
  }
  
  if(length(columns.names)!=length(columns.discrete)) stop("OBJECTS 'columns.names' AND 'columns.discrete' SHOULD HAVE EQUAL LENGTHS!!!");
  
  classification <- unlist(data[, column.class]);
  if(length(unique(classification))!=2) stop("NEEDS A CLASSIFICATION THAT RESULTS IN TWO CLASSES!!!")
  
  out.discrete <- matrix(c(table(classification), NA, NA, NA), nrow=1);
  rownames(out.discrete) <- "n";
  colnames(out.discrete) <- c(rep(names(table(classification)),2 ), "p-value");
  for(column in columns.names) {
    
    variable <- unlist(data[, column]);
    
    if(columns.discrete[which(columns.names==column)[1]]) {
      
      variable[is.na(variable)] <- "Missings";
      
      counts <- table(variable, classification);
      rownames(counts) <- str_c(column, "-", rownames(counts));
      
      percentages <- t(t(counts) / out.discrete[1, 1:2])*100;
      
      chisq <- chisq.test(counts, simulate.p.value = chisq.simulate, B = chisq.nsim)
      chisq.p <- c(chisq$p.value, rep(NA, nrow(counts)-1));
      
      out.discrete <- rbind(out.discrete, cbind(counts, percentages, chisq.p));
      
    } else {
      
      means <- tapply(variable, classification, mean, na.rm=T);
      sd <- tapply(variable, classification, sd, na.rm=T);
      NA.counts <- tapply(is.na(variable), classification, sum);
      NA.percentages <- tapply(is.na(variable), classification, mean)*100;
      
      ttest <- t.test(x = variable[classification==unique(classification[1])],
                      y = variable[classification!=unique(classification[1])],
                      alteranative = "two-sided", paired = F);
      
      out <- rbind(c(means, sd, ttest$p.value), 
                   c(NA.counts, NA.percentages, NA));
      rownames(out) <- str_c(column, c("-MeanSD", "-Missings"));
      
      out.discrete <- rbind(out.discrete, out);
      
    }    
  }
  
  colnames(out.discrete)[1:2] <- str_c(str_c(column.class, "==", colnames(out.discrete)[1:2]), " (#/Mean)");
  colnames(out.discrete)[3:4] <- str_c(str_c(column.class, "==", colnames(out.discrete)[3:4]), " (%/SD)");
  
  if(export) write.table(out.discrete, file=export.file, sep=";");
  
  return(out.discrete)
}

# Export Characteristics Tables
Table1.L1.PFS <- getTable(data=filter(df, L1.PFS.Include), export=F);
Table1.LX.OS <- getTable(data=filter(df, LX.OS.Include), export=F);

Overview table for patients included for first-line treatment:

	L1.Arm.ChemBev==FALSE (#/Mean)	L1.Arm.ChemBev==TRUE (#/Mean)	L1.Arm.ChemBev==FALSE (%/SD)	L1.Arm.ChemBev==TRUE (%/SD)	p-value
n	191.00000	676.00000	NA	NA	NA
Age-MeanSD	63.17801	62.54882	13.9234652	12.145134	0.5714513
Age-Missings	0.00000	0.00000	0.0000000	0.000000	NA
Gender.Female-FALSE	118.00000	407.00000	61.7801047	60.207101	0.7425257
Gender.Female-TRUE	73.00000	269.00000	38.2198953	39.792899	NA
Hospital.Private-FALSE	115.00000	337.00000	60.2094241	49.852071	0.0121988
Hospital.Private-TRUE	76.00000	339.00000	39.7905759	50.147929	NA
Stage.Diagnosis-0	8.00000	35.00000	4.1884817	5.177515	0.2915708
Stage.Diagnosis-1	0.00000	11.00000	0.0000000	1.627219	NA
Stage.Diagnosis-2	10.00000	35.00000	5.2356021	5.177515	NA
Stage.Diagnosis-3	21.00000	95.00000	10.9947644	14.053254	NA
Stage.Diagnosis-4	151.00000	491.00000	79.0575916	72.633136	NA
Stage.Diagnosis-Missings	1.00000	9.00000	0.5235602	1.331361	NA
Primary.Site-Colon	13.00000	26.00000	6.8062827	3.846154	0.3540646
Primary.Site-Left	73.00000	261.00000	38.2198953	38.609468	NA
Primary.Site-Missings	3.00000	11.00000	1.5706806	1.627219	NA
Primary.Site-Rectum	51.00000	165.00000	26.7015707	24.408284	NA
Primary.Site-Right	51.00000	213.00000	26.7015707	31.508876	NA
Metastases.Number-1	91.00000	256.00000	47.6439791	37.869822	0.1606839
Metastases.Number-2	55.00000	254.00000	28.7958115	37.573965	NA
Metastases.Number-3	33.00000	123.00000	17.2774869	18.195266	NA
Metastases.Number-4	9.00000	36.00000	4.7120419	5.325444	NA
Metastases.Number-5	3.00000	6.00000	1.5706806	0.887574	NA
Metastases.Number-6	0.00000	1.00000	0.0000000	0.147929	NA
Metastases.Liver-FALSE	64.00000	211.00000	33.5078534	31.213018	0.5971403
Metastases.Liver-TRUE	127.00000	465.00000	66.4921466	68.786982	NA
Metastases.Peritoneum-FALSE	138.00000	491.00000	72.2513089	72.633136	0.9280072
Metastases.Peritoneum-TRUE	53.00000	185.00000	27.7486911	27.366864	NA
L1.ECOG-0	51.00000	284.00000	26.7015707	42.011834	0.0001000
L1.ECOG-1	95.00000	344.00000	49.7382199	50.887574	NA
L1.ECOG-2	34.00000	43.00000	17.8010471	6.360947	NA
L1.ECOG-3	11.00000	5.00000	5.7591623	0.739645	NA
Charlson-0	127.00000	487.00000	66.4921466	72.041420	0.0014999
Charlson-1	27.00000	125.00000	14.1361257	18.491124	NA
Charlson-2	22.00000	41.00000	11.5183246	6.065089	NA
Charlson-3	7.00000	18.00000	3.6649215	2.662722	NA
Charlson-4	2.00000	2.00000	1.0471204	0.295858	NA
Charlson-5	3.00000	1.00000	1.5706806	0.147929	NA
Charlson-6	1.00000	1.00000	0.5235602	0.147929	NA
Charlson-7	0.00000	1.00000	0.0000000	0.147929	NA
Charlson-8	2.00000	0.00000	1.0471204	0.000000	NA
CEA.Level-MeanSD	725.91907	478.48598	5254.2478968	2048.838683	0.5459979
CEA.Level-Missings	19.00000	61.00000	9.9476440	9.023669	NA
RAS.Mutated-FALSE	72.00000	317.00000	37.6963351	46.893491	0.0001000
RAS.Mutated-Missings	63.00000	112.00000	32.9842932	16.568047	NA
RAS.Mutated-TRUE	56.00000	247.00000	29.3193717	36.538461	NA
BRAF.Mutated-FALSE	52.00000	266.00000	27.2251309	39.349112	0.0085991
BRAF.Mutated-Missings	127.00000	369.00000	66.4921466	54.585799	NA
BRAF.Mutated-TRUE	12.00000	41.00000	6.2827225	6.065089	NA

Overview table for patients included for further treatmen lines:

	L1.Arm.ChemBev==FALSE (#/Mean)	L1.Arm.ChemBev==TRUE (#/Mean)	L1.Arm.ChemBev==FALSE (%/SD)	L1.Arm.ChemBev==TRUE (%/SD)	p-value
n	156.00000	562.00000	NA	NA	NA
Age-MeanSD	62.87179	62.32028	14.0789103	12.2110342	0.6567444
Age-Missings	0.00000	0.00000	0.0000000	0.0000000	NA
Gender.Female-FALSE	96.00000	341.00000	61.5384615	60.6761566	0.8505149
Gender.Female-TRUE	60.00000	221.00000	38.4615385	39.3238434	NA
Hospital.Private-FALSE	93.00000	272.00000	59.6153846	48.3985765	0.0153985
Hospital.Private-TRUE	63.00000	290.00000	40.3846154	51.6014235	NA
Stage.Diagnosis-0	4.00000	28.00000	2.5641026	4.9822064	0.2659734
Stage.Diagnosis-1	0.00000	8.00000	0.0000000	1.4234875	NA
Stage.Diagnosis-2	9.00000	28.00000	5.7692308	4.9822064	NA
Stage.Diagnosis-3	18.00000	82.00000	11.5384615	14.5907473	NA
Stage.Diagnosis-4	124.00000	407.00000	79.4871795	72.4199288	NA
Stage.Diagnosis-Missings	1.00000	9.00000	0.6410256	1.6014235	NA
Primary.Site-Colon	10.00000	18.00000	6.4102564	3.2028470	0.2894711
Primary.Site-Left	61.00000	216.00000	39.1025641	38.4341637	NA
Primary.Site-Missings	1.00000	11.00000	0.6410256	1.9572954	NA
Primary.Site-Rectum	38.00000	133.00000	24.3589744	23.6654804	NA
Primary.Site-Right	46.00000	184.00000	29.4871795	32.7402135	NA
Metastases.Number-1	70.00000	218.00000	44.8717949	38.7900356	0.6216378
Metastases.Number-2	50.00000	214.00000	32.0512821	38.0782918	NA
Metastases.Number-3	28.00000	95.00000	17.9487179	16.9039146	NA
Metastases.Number-4	7.00000	31.00000	4.4871795	5.5160142	NA
Metastases.Number-5	1.00000	4.00000	0.6410256	0.7117438	NA
Metastases.Liver-FALSE	53.00000	172.00000	33.9743590	30.6049822	0.4386561
Metastases.Liver-TRUE	103.00000	390.00000	66.0256410	69.3950178	NA
Metastases.Peritoneum-FALSE	109.00000	402.00000	69.8717949	71.5302491	0.6921308
Metastases.Peritoneum-TRUE	47.00000	160.00000	30.1282051	28.4697509	NA
L1.ECOG-0	39.00000	230.00000	25.0000000	40.9252669	0.0001000
L1.ECOG-1	83.00000	291.00000	53.2051282	51.7793594	NA
L1.ECOG-2	28.00000	37.00000	17.9487179	6.5836299	NA
L1.ECOG-3	6.00000	4.00000	3.8461538	0.7117438	NA
Charlson-0	106.00000	406.00000	67.9487179	72.2419929	0.0057994
Charlson-1	22.00000	106.00000	14.1025641	18.8612100	NA
Charlson-2	16.00000	32.00000	10.2564103	5.6939502	NA
Charlson-3	5.00000	14.00000	3.2051282	2.4911032	NA
Charlson-4	2.00000	1.00000	1.2820513	0.1779359	NA
Charlson-5	2.00000	1.00000	1.2820513	0.1779359	NA
Charlson-6	1.00000	1.00000	0.6410256	0.1779359	NA
Charlson-7	0.00000	1.00000	0.0000000	0.1779359	NA
Charlson-8	2.00000	0.00000	1.2820513	0.0000000	NA
CEA.Level-MeanSD	844.29064	488.46878	5796.9883585	2031.4895195	0.4746236
CEA.Level-Missings	15.00000	52.00000	9.6153846	9.2526690	NA
RAS.Mutated-FALSE	63.00000	269.00000	40.3846154	47.8647687	0.0001000
RAS.Mutated-Missings	50.00000	75.00000	32.0512821	13.3451957	NA
RAS.Mutated-TRUE	43.00000	218.00000	27.5641026	38.7900356	NA
BRAF.Mutated-FALSE	41.00000	217.00000	26.2820513	38.6120996	0.0146985
BRAF.Mutated-Missings	105.00000	309.00000	67.3076923	54.9822064	NA
BRAF.Mutated-TRUE	10.00000	36.00000	6.4102564	6.4056940	NA

Step 2: Multiple Imputation of Missing Data

After cleaning the data in Step 1, the missing data are addressed in Step 2. Two steps are performed in order to do so:

Addressing missings in the RAS status and BRAF status
Performing multiple imputation by chained equations

Step 2.1: addressing missings in the RAS status and BRAF status

As explained in the manuscript, missing data for the RAS mutation status and BRAF mutation status may be partly addressed by assuming these mutations to be mutatual exclusive. For example, this indicates that if a RAS mutation is known for a patient with a missing BRAF status, the BRAF status can be assumed to be ‘wild-type’.

# Address missing values in Mutations
# Potentially 175 missing values can be addressed:
# - 1 for RAS
# - 142 for BRAF
table(list(RAS=df$RAS.Mutated, BRAF=df$BRAF.Mutated), useNA="always");

##        BRAF
## RAS     FALSE TRUE <NA>
##   FALSE   158   51  180
##   TRUE    160    1  142
##   <NA>      0    1  174

# If patients have a KRAS mutation and the RAS status is unknown,
# we assume that they will not have a RAS mutation. Similarly, if
# the BRAF status is unknown and the patient has a RAS mutation,
# we assume that the patient does not have a BRAF mutation.
df <- df %>%
  mutate(RAS.Mutated = ifelse(is.na(RAS.Mutated) & BRAF.Mutated, F, RAS.Mutated)) %>%
  mutate(BRAF.Mutated = ifelse(is.na(BRAF.Mutated) & RAS.Mutated, F, BRAF.Mutated))

# Oberve new distribution of mutations and missing values
table(list(RAS=df$RAS.Mutated, BRAF=df$BRAF.Mutated), useNA="always");

##        BRAF
## RAS     FALSE TRUE <NA>
##   FALSE   158   52  180
##   TRUE    302    1    0
##   <NA>      0    0  174

Step 2.2: performing multiple imputation by chained equations

After addressing part of the missing values for the RAS mutation status and BRAF mutation status, remaining missing data for these and the other variables are addressed by performing multiple imputation by chained equations. Missing values are addressed for the following variables (imputation method used):

Stage at diagnosis (multinomial logit)
Primary tumor site (multinomial logit)
CEA level (Bayesian linear regression)
RAS mutation status (logistic regression)
BRAF mutation status (logistic regression)

The following variables were additionally used as predictors:

Age at treatment start
Gender
Hospital type (private of public)
Presence of liver metastases
Presence of peritoneal metastases
Number of metastatic sites
ECOG performance status
Charlson comorbidity index

After performing the multiple imputation, post-processing is performed while extracting the imputed datasets to make sure RAS mutation status and BRAF mutation status remain mutually exclusive.

# Load required package
library(mice);

# List the included columns
mice.cols.included <- c("Age", "Gender.Female", "Hospital.Private", "Metastases.Liver", "Metastases.Peritoneum", "Metastases.Number", "L1.ECOG", "Charlson", "Primary.Site", "Stage.Diagnosis", "CEA.Level", "RAS.Mutated", "BRAF.Mutated");

# Select the data used for the multiple imputation
df.mice <- df %>%
  select(mice.cols.included)

# Make sure the type of all columns is correct
df.mice <- df.mice %>%
  mutate(Age = as.numeric(Age)) %>% 
  mutate(Gender.Female = as.logical(Gender.Female)) %>% 
  mutate(Hospital.Private = as.logical(Hospital.Private)) %>% 
  mutate(Stage.Diagnosis = as.factor(Stage.Diagnosis)) %>% 
  mutate(Primary.Site = as.factor(Primary.Site)) %>% 
  mutate(Metastases.Liver = as.logical(Metastases.Liver)) %>% 
  mutate(Metastases.Peritoneum = as.logical(Metastases.Peritoneum)) %>% 
  mutate(Metastases.Number = as.numeric(Metastases.Number)) %>% 
  mutate(Charlson = as.numeric(Charlson)) %>% 
  mutate(CEA.Level = as.numeric(CEA.Level)) %>% 
  mutate(RAS.Mutated = as.logical(RAS.Mutated)) %>% 
  mutate(BRAF.Mutated = as.logical(BRAF.Mutated)) %>% 
  mutate(L1.ECOG = as.factor(L1.ECOG))


# Define predictor matrix
predictor.matrix <- matrix(1, length(mice.cols.included), length(mice.cols.included)) - diag(1, length(mice.cols.included), length(mice.cols.included));
colnames(predictor.matrix) <- rownames(predictor.matrix) <- mice.cols.included;

# Perform multiple imputation
m <- 25;              # Number of imputations
set.seed(20181211);   # Random seed for reproduction
mice.out <- mice(data=df.mice, m=m, maxit=20, predictorMatrix=predictor.matrix, printFlag=F,
                 method=c(Age="", 
                          Gender.Female="", 
                          Hospital.Private="", 
                          Metastases.Liver="", 
                          Metastases.Peritoneum="", 
                          Metastases.Number="", 
                          L1.ECOG="", 
                          Charlson="", 
                          Primary.Site="polyreg",
                          Stage.Diagnosis="polyreg", 
                          CEA.Level="norm", 
                          RAS.Mutated="logreg", 
                          BRAF.Mutated="logreg")
);


# Extract the imputed datasets
count.BRAF.and.RAS <- 0;    # Counter to track how much post-processing is required
datasets <- lapply(1:m, function(i.dataset) {
  
  # Get the imputed columns
  df.imputed <- complete(mice.out, action=i.dataset);
  cols.imputed <- colnames(df.imputed);
  
  # Replace the colums in the original dataset by the imputed columns
  df.new <- df;
  df.new[, cols.imputed] <- df.imputed; 
  
  # Correct when RAS and BRAF mutations conflict
  count.BRAF.and.RAS <<- count.BRAF.and.RAS + sum(df.new$BRAF.Mutated & df.new$RAS.Mutated);
  df.new$BRAF.Mutated <- ifelse(df.new$RAS.Mutated==1 & df.new$BRAF.Mutated==1, 0, df.new$BRAF.Mutated);
  
  # Return the newly completed datasets
  df.new;
  
});

# Check number of post-processed mutations
prob.BRAF.and.RAS <- (count.BRAF.and.RAS/m) / nrow(datasets[[1]]) * 100;
prob.BRAF.and.RAS;

## [1] 0.3460208

Step 3: Univariable Analysis

Initial analyses were performed to determine which distribution types would be used to parameterize the survival models and which transformations would be used for each variable in the multivariable modeling. To this end, the following steps were performed as part of this univariable modeling:

Calculating the overlap of the competing events for first-line progression-free survival (Model L1-Time)
Selecting the distribution for progression-free survival (Model L1-Time)
Selecting the distribution for overal survival following first-line progression (Model LX-Time)
Selecting the transformation for progression-free survival (Model L1-Time)
Selecting the transformation for progression-free survival (Model L1-Event)
Selecting the transformation for overal survival following first-line progression (Model LX-Time)

The following transformations were considered:

Age at treatment start
- Logarithm
- Square root
- 1^st order polynomial
- 2^nd order polynomial
- 3^rd order polynomial
Gender
- Binary: Female (yes) vs. Male (no)
Hospital Type
- Binary: Private (yes) vs. Public (no)
Stage at Initial Diagnosis
- Binary: Stage IV (yes) vs. Lower Stage (no)
Primary Tumour Site
- Categorical: Group(NOS and Multiple Primaries)
- Categorical: Group(Left Colon and Rectum) & Group(NOS and Multiple Primaries)
Number of Metastatic Sites
- Categorical: Group(3 or more)
Presence of Liver Metastases
- Binary: Yes vs. No
Presence of Peritoneal Metastases
- Binary: Yes vs. No
ECOG Performance Status
- Categorical: Group(2 and 3)
- Binary: 2 or more (yes) vs. 1 or 2 (no)
Charlson Comorbidity Index
- Categorical: Group(2 or more)
- Categorical: Group(0 and 1) & Group(2 or more)
- Binary: 1 or more (yes) vs. 0 (no)
CEA Level
- Binary: 200 or more (yes) vs. less than 200 (no)
RAS Mutation Status
- Binary: Mutated (yes) vs. Wild-type (no)
BRAF Mutation Status
- Binary: Mutated (yes) vs. Wild-type (no)
First-line Progression-free Survival (only applicable to Model L1-Event and Model LX-Time)
- Logarithm
- Square root
- 1^st order polynomial
- 2^nd order polynomial
- 3^rd order polynomial

Step 3.1: calculating the overlap of the competing events for first-line progression-free survival (Model L1-Time)

In modeling first-line progression-free survival, there are two competing events that need to be accounted for: physician-defined progression and death. To determine which modeling approach will be implemented to account for these two competing events, the overlap of the corresponding time-to-event distributions is important to know. It was decided that competing events will be handled by using the ‘Unimodal Distribution and Regression model (UDR)’ approach based on the strategy of selecting the time-to-event first based on a joint time-to-event distribution, and then select the event corresponding to that time.

# function for calculating the overlap between the distribution for time-to-progression and time-to-death
getOverlap <- function(data, n.bins=512, digits=3) {

  library(flux);

  # Get events
  events <- sort(unique(data$event));
  n.events <- length(events);

  # Get event-specific times
  times <- lapply(events, function(e) subset(data, event==e)$time );

  # Density settings
  min <- min(data$time);
  max <- max(data$time);
  n <- n.bins;
  x <- seq(from=min, to=max, length.out=n);

  # Get event-specific densities
  y <- sapply(1:n.events, function(i1) density(times[[i1]], from=min, to=max, n=n)$y );

  # Get event=specific density overlap
  yo <- sapply(1:n.events, function(i2) {

    sapply(1:n, function(j){

      own <- y[j,i2];
      other <- max(y[j,which(c(1:n.events)!=i2)]);

      if(own<=0) {
        return(0)
      } else
        return(min(own,other))

    })

  });

  # Calculate event-specific overlap
  o <- sapply(1:n.events, function(i3) {

    (auc(x=x, y=yo[,i3]) / auc(x=x, y=y[,i3])) *100

  });

  return(round(mean(o), digits=digits));
}

# Calculate the distribution overlap
getOverlap(data=data.frame(time = df$L1.PFS.Time[df$L1.PFS.Events!="Censored"], event = df$L1.PFS.Events[df$L1.PFS.Events!="Censored"]))

## [1] 72.457

# Visualize distribution overlap
plot(density(df$L1.PFS.Time[df$L1.PFS.Events!="Censored"]), xlab="Time-to-Event in Days", ylab=NA, main="Density Plot", las=1, lwd=2, xlim=c(0,2500), ylim=c(0,0.0025))
lines(density(df$L1.PFS.Time[df$L1.PFS.Events=="Progression"]), lwd=2, col=3)
lines(density(df$L1.PFS.Time[df$L1.PFS.Events=="Death"]), lwd=2, col=2)
lines(density(df$L1.PFS.Time[df$L1.PFS.Events=="Censored"]), lwd=2, col=4)
legend("topright", lty=1, lwd=2, col=c(1,3,2, 4), c(str_c("Combined [excl. censored patients](n=", sum(df$L1.PFS.Events!="Censored"),")"), str_c("Progression (n=", sum(df$L1.PFS.Events=="Progression"),")"), str_c("Death (n=", sum(df$L1.PFS.Events=="Death"),")"), str_c("Censored (n = ", sum(df$L1.PFS.Events=="Censored"),")")))

Step 3.2: selecting the distribution for progression-free survival (Model L1-Time)

A distribution type is selected for survival Model L1-Time that represents the two competing events applicable to first-line progression-free survival (physcian-defined progression and death) in one unimodal survival model. Selection was based on the likelihood and assessment of quantile-quantile plots and Kaplan-Meier plots. The following distributions were considered:

Exponential
Weibull
Logistic
Log-normal

# Load packages
library(survival);  # Fit Survival Models
library(EnvStats);  # Q-Q Plot for censored data

# Distributions considered
dist.considered <- c("exponential", 
                     "weibull", 
                     "logistic", 
                     "lognormal");

# Data for first-line
df.L1 <- df;

# Select distribution
dist.performance.L1 <- sapply(dist.considered, function(dist) {
  
  # Fit survival model for distribution
  model <- survreg(formula=Surv(time=L1.PFS.Time, event=L1.PFS.Event) ~ 1, data=df.L1, dist=dist);
  
  # Return the likelihood
  return(model$loglik[1]);
  
});

# Print the results and which distribution would be selected based on the likelihood
dist.selected.L1 <- dist.considered[which.max(dist.performance.L1)];
print(dist.performance.L1);

## exponential     weibull    logistic   lognormal 
##   -5522.443   -5483.372   -5646.513   -5512.667

print(dist.selected.L1);

## [1] "weibull"

# Plot the survival curves
for(i in 1:length(dist.considered)) {
  
  # Get distribution type
  dist <- dist.considered[i];
  
  # Fit survival model for distribution
  model <- survreg(formula=Surv(time=L1.PFS.Time, event=L1.PFS.Event) ~ 1, data=df.L1, dist=dist);
  
  # If it is the first distribution considered, plot kaplan-meier
  if(i==1) plot(survfit(Surv(time=L1.PFS.Time, event=L1.PFS.Event) ~ 1, data=df.L1), conf.int=F, xlab="Time to Death or Progression (days)", lwd=2);
  
  # Plot fitted distribution
  lines(y=seq(.99, .01, by=-.01), x=predict(model, type="quantile", p=seq(.01, .99, by=.01))[1,], col=rainbow(length(dist.considered))[i], lwd=2)
  
  # Add the legend if it is the last distribution
  if(i==length(dist.considered)) legend("right", legend=dist.considered, col=rainbow(length(dist.considered)), lty=1, lwd=2);
  
}

# Plot the Q-Q Plots curves
par(mfrow=c(2,2));
for(i in 1:length(dist.considered)) {
  
  # Get distribution type
  dist <- dist.considered[i];
  
  # Fit survival model for distribution
  model <- survreg(formula=Surv(time=L1.PFS.Time, event=L1.PFS.Event) ~ 1, data=df.L1, dist=dist);
  
  # Plot the Q-Q Plot
  if(dist=="exponential") qqPlotCensored(x=df.L1$L1.PFS.Time, censored=df.L1$L1.PFS.Event==0, censoring.side="right", 
                                         equal.axes=T, main=paste0("Q-Q Plot for the ", dist, " distribution"),
                                         distribution="exp", param.list=list(rate=1/exp(model$coefficients)));
  if(dist=="weibull") qqPlotCensored(x=df.L1$L1.PFS.Time, censored=df.L1$L1.PFS.Event==0, censoring.side="right", 
                                     equal.axes=T, main=paste0("Q-Q Plot for the ", dist, " distribution"), 
                                     distribution="weibull", param.list=list(shape=1/model$scale, scale=exp(model$coefficients)));
  if(dist=="logistic") qqPlotCensored(x=df.L1$L1.PFS.Time, censored=df.L1$L1.PFS.Event==0, censoring.side="right", 
                                      equal.axes=T, main=paste0("Q-Q Plot for the ", dist, " distribution"), 
                                      distribution="logis", param.list=list(location=model$coefficients, scale=model$scale));
  if(dist=="lognormal") qqPlotCensored(x=df.L1$L1.PFS.Time, censored=df.L1$L1.PFS.Event==0, censoring.side="right", 
                                       equal.axes=T, main=paste0("Q-Q Plot for the ", dist, " distribution"), 
                                       distribution="lnorm", param.list=list(meanlog=model$coefficients, sdlog=model$scale));
}

Step 3.3: selecting the distribution for overal survival following first-line progression (Model LX-Time)

A distribution type is selected for survival Model LX-Time that represents overall survival after first-line progression. Selection was based on the likelihood and assessment of quantile-quantile plots and Kaplan-Meier plots. As for Model L1-Time, the following distributions were considered:

Exponential
Weibull
Logistic
Log-normal

# Data for first-line
df.LX <- filter(df, LX.OS.Include);

# Select distribution
dist.performance.LX <- sapply(dist.considered, function(dist) {
  
  # Fit survival model for distribution
  model <- survreg(formula=Surv(time=LX.OS.Time, event=LX.OS.Event) ~ 1, data=df.LX, dist=dist);
  
  # Return the likelihood
  return(model$loglik[1]);
  
});

# Print the results and which distribution would be selected based on the likelihood
dist.selected.LX <- dist.considered[which.max(dist.performance.LX)];
print(dist.performance.LX);

## exponential     weibull    logistic   lognormal 
##   -4207.651   -4207.648   -4430.506   -4245.090

print(dist.selected.LX);

## [1] "weibull"

# Plot the survival curves
for(i in 1:length(dist.considered)) {
  
  # Get distribution type
  dist <- dist.considered[i];
  
  # Fit survival model for distribution
  model <- survreg(formula=Surv(time=LX.OS.Time, event=LX.OS.Event) ~ 1, data=df.LX, dist=dist);
  
  # If it is the first distribution considered, plot kaplan-meier
  if(i==1) plot(survfit(Surv(time=LX.OS.Time, event=LX.OS.Event) ~ 1, data=df.LX), conf.int=F, xlab="Overall Survival (days)", lwd=2);
  
  # Plot fitted distribution
  lines(y=seq(.99, .01, by=-.01), x=predict(model, type="quantile", p=seq(.01, .99, by=.01))[1,], col=rainbow(length(dist.considered))[i], lwd=2)
  
  # Add the legend if it is the last distribution
  if(i==length(dist.considered)) legend("right", legend=dist.considered, col=rainbow(length(dist.considered)), lty=1, lwd=2);
  
}

# Plot the Q-Q Plots curves
par(mfrow=c(2,2));
for(i in 1:length(dist.considered)) {
  
  # Get distribution type
  dist <- dist.considered[i];
  
  # Fit survival model for distribution
  model <- survreg(formula=Surv(time=LX.OS.Time, event=LX.OS.Event) ~ 1, data=df.LX, dist=dist);
  
  # Plot the Q-Q Plot
  if(dist=="exponential") qqPlotCensored(x=df.LX$LX.OS.Time, censored=df.LX$LX.OS.Event==0, censoring.side="right", 
                                         equal.axes=T, main=paste0("Q-Q Plot for the ", dist, " distribution"),
                                         distribution="exp", param.list=list(rate=1/exp(model$coefficients)));
  if(dist=="weibull") qqPlotCensored(x=df.LX$LX.OS.Time, censored=df.LX$LX.OS.Event==0, censoring.side="right", 
                                     equal.axes=T, main=paste0("Q-Q Plot for the ", dist, " distribution"), 
                                     distribution="weibull", param.list=list(shape=1/model$scale, scale=exp(model$coefficients)));
  if(dist=="logistic") qqPlotCensored(x=df.LX$LX.OS.Time, censored=df.LX$LX.OS.Event==0, censoring.side="right", 
                                      equal.axes=T, main=paste0("Q-Q Plot for the ", dist, " distribution"), 
                                      distribution="logis", param.list=list(location=model$coefficients, scale=model$scale));
  if(dist=="lognormal") qqPlotCensored(x=df.LX$LX.OS.Time, censored=df.LX$LX.OS.Event==0, censoring.side="right", 
                                       equal.axes=T, main=paste0("Q-Q Plot for the ", dist, " distribution"), 
                                       distribution="lnorm", param.list=list(meanlog=model$coefficients, sdlog=model$scale));
}

Step 3.4: selecting the transformation for progression-free survival (Model L1-Time)

The best-performing transformation was selected for each variable and survival model separately by fitting individual models for each candidate transformation and comparing performance based on Wald test p-values. Univariable models were also fitted for variables without multiple candidate transformations to record p-values. All p-values were calculated for each of 25 imputed datasets, and for variables with multiple candidate transformations, transformations’ p-value ranks were determined for each imputed dataset. Final selection of the transformation to be used was based on the average transformation rank over all 25 imputed datasets.

# Load package
library(aod);   # Wald-test

# Candidate variables, including their importance and potential transformations
vars.information.L1 <- list(
  
  # Age at treatment start
  Age = list(Transformations = c(Log = "log(Age)", SQRT = "sqrt(Age)", Poly1 = "poly(Age, 1)", Poly2 = "poly(Age, 2)", Poly3 = "poly(Age, 3)", Poly4 = "poly(Age, 4)")),
  
  # Gender
  Gender.Female = list(Transformations = c(Binary.FemaleYN = "Gender.Female")),
  
  # Hospital Type
  Hospital.Private = list(Transformations = c(Binary.PrivateYN = "Hospital.Private")),
  
  # Stage at Initial Diagnosis
  Stage.Diagnosis = list(Transformations = c(Binary.Stage4YN = "as.factor(as.character(Stage.Diagnosis)=='4')")),
  
  # Primary Tumour Site
  Primary.Site = list(Transformations = c(Factor.GroupColMul = "as.factor(ifelse(Primary.Site %in% c('Colon','Multiple'), 'ColonMultiple', Primary.Site))", Factor.GroupColMul.GroupLefRec = "as.factor(ifelse(Primary.Site %in% c('Colon','Multiple'), 'ColonMultiple', ifelse(Primary.Site %in% c('Left','Rectum'), 'LeftRectum', Primary.Site)))")),
  
  # Number of Metastases
  Metastases.Number = list(Transformations = c(Factor.Group3plus = "as.factor(ifelse(Metastases.Number>=3, 3, Metastases.Number))")),
  
  # Liver Metastases
  Metastases.Liver = list(Transformations = c(Binary = "Metastases.Liver")),
  
  # Peritoneum Metastases
  Metastases.Peritoneum = list(Transformations = c(Binary = "Metastases.Peritoneum")),
  
  # ECOG Performance Status
  L1.ECOG = list(Transformations = c(Factor.Group23 = "as.factor(ifelse(L1.ECOG==0, 0, ifelse(L1.ECOG==1, 1, 2)))", Binary.2plusYN = "as.factor(ifelse(L1.ECOG %in% c(0,1), F, T))")),
  
  # Charlson Score
  Charlson = list(Transformations = c(Factor.Group2plus = "as.factor(ifelse(Charlson>=2, 2, Charlson))", Factor.Group01.Group2plus = "as.factor(ifelse(Charlson %in% c(0,1), 1, ifelse(Charlson>=2, 2, Charlson)))", Binary.1plusYN = "as.factor(ifelse(Charlson>=1, T, F))")),
  
  # CEA Level
  CEA.Level = list(Transformations = c(Binary.200plus = "as.factor(ifelse(Metastases.Number<=2, T, F) * ifelse(CEA.Level>=200, T, F))")),
  
  # RAS Status
  RAS.Mutated = list(Transformations = c(Binary.MutatedYN = "RAS.Mutated")),
  
  # BRAF Status
  BRAF.Mutated = list(Transformations = c(Binary.MutatedYN = "BRAF.Mutated"))
  
);

# Set three levels of significance
significance.levels <- c(0.1, 0.05, 0.01);

# For each imputed dataset, select distribution and perform univariable analysis
univariable.L1 <- lapply(1:length(datasets), function(dataset) {
  
  # Retrieve datasets
  data.surv.L1 <- datasets[[dataset]];

  # For each variable
  vars.univariable <- lapply(vars.information.L1, function(var) {
    
    # Consider all potential transformations
    sapply(var$Transformations, function(var.form) {
      
      # Define formula to be fitted
      f <- as.formula(str_c("Surv(time=L1.PFS.Time, event=L1.PFS.Event) ~ ", var.form, collapse=""));
      
      # Fit model
      fit <- survreg(formula=f, data=data.surv.L1, dist=dist.selected.L1);
      
      # Perform Wald-test for significance
      test <- wald.test(Sigma=fit$var, b=c(fit$coefficients, fit$scale), Terms=c(2:length(fit$coefficients)))
      
      # Determine significance
      pvalue <- test$result$chi2["P"];
      significance <- "";
      if(pvalue<max(significance.levels)) {
        significance <- str_c(rep("*", tail(which(pvalue<significance.levels), n=1)), collapse="");
      }
      
      # Return transformation and model performance
      c(trans=f, ll=fit$loglik[2], pvalue=pvalue, sign=significance)
      
    })
    
  });
  
  # Return outcomes
  return(list(vars.univariable=vars.univariable));
  
})


# Add summary for variables to vars.information object
for(var in names(vars.information.L1)) {
  
  # Get names of transformations
  transformation.names <- names(vars.information.L1[[var]]$Transformations);
  
  # Extract data from univariable object
  loglik <- as.data.frame(
    sapply(transformation.names, function(trans) 
      sapply(univariable.L1, function(univar)
        univar$vars.univariable[[var]][2,trans]
      )
    )
  )
  pvalues <- as.data.frame(
    sapply(transformation.names, function(trans) 
      sapply(univariable.L1, function(univar)
        univar$vars.univariable[[var]][3,trans]
      )
    )
  )
  significance <- as.data.frame(
    sapply(transformation.names, function(trans) 
      sapply(univariable.L1, function(univar)
        univar$vars.univariable[[var]][4,trans]
      )
    )
  )
  
  # Transform List object to Matrix
  pvalues <- matrix(unlist(pvalues), ncol=ncol(pvalues), byrow=F)
  colnames(pvalues) <- transformation.names;
  
  # Rank based on pvalues
  ranks <- matrix(apply(pvalues, 1, rank, ties.method="average"), ncol=ncol(pvalues), byrow=T);
  colnames(ranks) <- colnames(pvalues);
  avgrank <- apply(ranks, 2, mean);
  
  # Select best performing transformation
  selected <- names(avgrank)[which.min(avgrank)];
  selected.transformation <- vars.information.L1[[var]]$Transformations[selected];
  selected.avgpvalue <- mean(pvalues[, selected]);
  selected.significance <- "";
  if(selected.avgpvalue<max(significance.levels)){
    selected.significance <- str_c(rep("*", tail(which(selected.avgpvalue<significance.levels), n=1)), collapse="");
  }
  
  # Save to vars.information object
  vars.information.L1[[var]]$LogLik <- loglik;
  vars.information.L1[[var]]$Pvalues <- pvalues;
  vars.information.L1[[var]]$Significance <- significance;
  vars.information.L1[[var]]$Ranks <- ranks;
  vars.information.L1[[var]]$AvgRank <- avgrank;
  
  vars.information.L1[[var]]$SelectedTransformation <- selected.transformation;
  vars.information.L1[[var]]$SelectedPvalue <- selected.avgpvalue;
  vars.information.L1[[var]]$SelectedSignificance <- selected.significance;
  
}


# Generate summary Table
vars.univariable.L1 <- t(sapply(vars.information.L1, function(var) c(
  trans = names(var$SelectedTransformation),
  pvalue = var$SelectedPvalue,
  sign = var$SelectedSignificance
))); 

# View results
kable(vars.univariable.L1);

	trans	pvalue	sign
Age	Poly2	0.00543870740945696	***
Gender.Female	Binary.FemaleYN	0.0413078627543801	**
Hospital.Private	Binary.PrivateYN	0.132221927655926
Stage.Diagnosis	Binary.Stage4YN	0.0180820041944674	**
Primary.Site	Factor.GroupColMul.GroupLefRec	0.00371107490900566	***
Metastases.Number	Factor.Group3plus	0.40180927502595
Metastases.Liver	Binary	0.0548355494244619	*
Metastases.Peritoneum	Binary	0.108781784235471
L1.ECOG	Factor.Group23	8.5890183854076e-12	***
Charlson	Binary.1plusYN	0.164886584017105
CEA.Level	Binary.200plus	0.00551021390960015	***
RAS.Mutated	Binary.MutatedYN	0.208343809880177
BRAF.Mutated	Binary.MutatedYN	0.152022282845375

Step 3.5: selecting the transformation for progression-free survival (Model L1-Event)

To distinguish between the two competing events of physisician-defined progression death for first-line progression-free survival, a logistic regression model is needed (Model L1-Event). For this model, all best-performing transformations indentified for corresponding Model L1-Time were used, so only the best transformation for the additional variable relating to the time-to-event needs to be selected.

# Define a binary variable 'Death' to represent whether first-line progression refered to Death (1), Physician-defined progression (0), or the patient was censored (NA)
Death <- ifelse(datasets[[1]]$L1.PFS.Events=="Death", 1, ifelse(datasets[[1]]$L1.PFS.Events=="Progression", 0, NA));

# Fit the different transformations
summary(glm(Death ~ 1))

## 
## Call:
## glm(formula = Death ~ 1)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.1023  -0.1023  -0.1023  -0.1023   0.8977  
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.10234    0.01065    9.61   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.09198216)
## 
##     Null deviance: 74.506  on 810  degrees of freedom
## Residual deviance: 74.506  on 810  degrees of freedom
##   (56 observations deleted due to missingness)
## AIC: 369.34
## 
## Number of Fisher Scoring iterations: 2

summary(glm(Death ~ sqrt(datasets[[1]]$L1.PFS.Time)))

## 
## Call:
## glm(formula = Death ~ sqrt(datasets[[1]]$L1.PFS.Time))
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -0.21793  -0.11542  -0.09612  -0.07337   0.95884  
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)   
## (Intercept)                     0.029326   0.027825   1.054  0.29222   
## sqrt(datasets[[1]]$L1.PFS.Time) 0.004473   0.001576   2.838  0.00465 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.09118778)
## 
##     Null deviance: 74.506  on 810  degrees of freedom
## Residual deviance: 73.771  on 809  degrees of freedom
##   (56 observations deleted due to missingness)
## AIC: 363.31
## 
## Number of Fisher Scoring iterations: 2

summary(glm(Death ~ log(datasets[[1]]$L1.PFS.Time)))

## 
## Call:
## glm(formula = Death ~ log(datasets[[1]]$L1.PFS.Time))
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -0.12443  -0.10784  -0.10247  -0.09366   0.93415  
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)
## (Intercept)                     0.04526    0.06323   0.716    0.474
## log(datasets[[1]]$L1.PFS.Time)  0.01058    0.01155   0.916    0.360
## 
## (Dispersion parameter for gaussian family taken to be 0.09200045)
## 
##     Null deviance: 74.506  on 810  degrees of freedom
## Residual deviance: 74.428  on 809  degrees of freedom
##   (56 observations deleted due to missingness)
## AIC: 370.5
## 
## Number of Fisher Scoring iterations: 2

summary(glm(Death ~ poly(datasets[[1]]$L1.PFS.Time, 1)))

## 
## Call:
## glm(formula = Death ~ poly(datasets[[1]]$L1.PFS.Time, 1))
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -0.35633  -0.11251  -0.08696  -0.06513   0.95045  
## 
## Coefficients:
##                                    Estimate Std. Error t value Pr(>|t|)
## (Intercept)                         0.10236    0.01054   9.710  < 2e-16
## poly(datasets[[1]]$L1.PFS.Time, 1)  1.31589    0.31269   4.208 2.86e-05
##                                       
## (Intercept)                        ***
## poly(datasets[[1]]$L1.PFS.Time, 1) ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.09012303)
## 
##     Null deviance: 74.506  on 810  degrees of freedom
## Residual deviance: 72.910  on 809  degrees of freedom
##   (56 observations deleted due to missingness)
## AIC: 353.78
## 
## Number of Fisher Scoring iterations: 2

summary(glm(Death ~ poly(datasets[[1]]$L1.PFS.Time, 2)))

## 
## Call:
## glm(formula = Death ~ poly(datasets[[1]]$L1.PFS.Time, 2))
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -0.62677  -0.09385  -0.08514  -0.08335   0.91693  
## 
## Coefficients:
##                                     Estimate Std. Error t value Pr(>|t|)
## (Intercept)                           0.1026     0.0105   9.764  < 2e-16
## poly(datasets[[1]]$L1.PFS.Time, 2)1   1.3072     0.3115   4.196 3.02e-05
## poly(datasets[[1]]$L1.PFS.Time, 2)2   0.8118     0.3048   2.664  0.00789
##                                        
## (Intercept)                         ***
## poly(datasets[[1]]$L1.PFS.Time, 2)1 ***
## poly(datasets[[1]]$L1.PFS.Time, 2)2 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.08944915)
## 
##     Null deviance: 74.506  on 810  degrees of freedom
## Residual deviance: 72.275  on 808  degrees of freedom
##   (56 observations deleted due to missingness)
## AIC: 348.69
## 
## Number of Fisher Scoring iterations: 2

summary(glm(Death ~ poly(datasets[[1]]$L1.PFS.Time, 3)))

## 
## Call:
## glm(formula = Death ~ poly(datasets[[1]]$L1.PFS.Time, 3))
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -0.48879  -0.10220  -0.07718  -0.06743   0.93404  
## 
## Coefficients:
##                                     Estimate Std. Error t value Pr(>|t|)
## (Intercept)                          0.10303    0.01045   9.856  < 2e-16
## poly(datasets[[1]]$L1.PFS.Time, 3)1  1.33858    0.31025   4.315  1.8e-05
## poly(datasets[[1]]$L1.PFS.Time, 3)2  0.83827    0.30347   2.762  0.00587
## poly(datasets[[1]]$L1.PFS.Time, 3)3 -0.90747    0.30755  -2.951  0.00326
##                                        
## (Intercept)                         ***
## poly(datasets[[1]]$L1.PFS.Time, 3)1 ***
## poly(datasets[[1]]$L1.PFS.Time, 3)2 ** 
## poly(datasets[[1]]$L1.PFS.Time, 3)3 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.08860409)
## 
##     Null deviance: 74.506  on 810  degrees of freedom
## Residual deviance: 71.504  on 807  degrees of freedom
##   (56 observations deleted due to missingness)
## AIC: 341.99
## 
## Number of Fisher Scoring iterations: 2

Step 3.6: selecting the transformation for overal survival following first-line progression (Model LX-Time)

Selection of the best-performing transformation for Model LX-Time was done in the same way as for Model L1-Tim. The only difference is that an additional variable representing the first-line progression-free survival was included.

# Candidate variables, including their importance and potential transformations
vars.information.LX <- list(
  
  # Progression-free survival
  PFS = list(Transformations = c(Log = "log(L1.PFS.Time)", SQRT = "sqrt(L1.PFS.Time)", Poly1 = "poly(L1.PFS.Time, 1)", Poly2 = "poly(L1.PFS.Time, 2)", Poly3 = "poly(L1.PFS.Time, 3)")),
  
  # Age at treatment start
  Age = list(Transformations = c(Log = "log(Age)", SQRT = "sqrt(Age)", Poly1 = "poly(Age, 1)",  Poly2 = "poly(Age, 2)",  Poly3 = "poly(Age, 3)")),
  
  # Gender
  Gender.Female = list(Transformations = c(Binary.FemaleYN = "Gender.Female")),
  
  # Hospital Type
  Hospital.Private = list(Transformations = c(Binary.PrivateYN = "Hospital.Private")),
  
  # Stage at Initial Diagnosis
  Stage.Diagnosis = list(Transformations = c(Binary.Stage4YN = "as.factor(as.character(Stage.Diagnosis)=='4')")),
  
  # Primary Tumour Site
  Primary.Site = list(Transformations = c(Factor.GroupColMul = "as.factor(ifelse(Primary.Site %in% c('Colon','Multiple'), 'ColonMultiple', Primary.Site))", Factor.GroupColMul.GroupLefRec = "as.factor(ifelse(Primary.Site %in% c('Colon','Multiple'), 'ColonMultiple', ifelse(Primary.Site %in% c('Left','Rectum'), 'LeftRectum', Primary.Site)))")),
  
  # Number of Metastases
  Metastases.Number = list(Transformations = c(Factor.Group3plus = "as.factor(ifelse(Metastases.Number>=3, 3, Metastases.Number))")),
  
  # Liver Metastases
  Metastases.Liver = list(Transformations = c(Binary = "Metastases.Liver")),
  
  # Peritoneum Metastases
  Metastases.Peritoneum = list(Transformations = c(Binary = "Metastases.Peritoneum")),
  
  # ECOG Performance Status
  L1.ECOG = list(Transformations = c(Factor.Group23 = "as.factor(ifelse(L1.ECOG==0, 0, ifelse(L1.ECOG==1, 1, 2)))", Binary.2plusYN = "as.factor(ifelse(L1.ECOG %in% c(0,1), F, T))")),
  
  # Charlson Score
  Charlson = list(Transformations = c(Factor.Group2plus = "as.factor(ifelse(Charlson>=2, 2, Charlson))", Factor.Group01.Group2plus = "as.factor(ifelse(Charlson %in% c(0,1), 1, ifelse(Charlson>=2, 2, Charlson)))", Binary.1plusYN = "as.factor(ifelse(Charlson>=1, T, F))")),
  
  # CEA Level
  CEA.Level = list(Transformations = c(Binary.200plus = "as.factor(ifelse(Metastases.Number<=2, T, F) * ifelse(CEA.Level>=200, T, F))")),
  
  # RAS Status
  RAS.Mutated = list(Transformations = c(Binary.MutatedYN = "RAS.Mutated")),
  
  # BRAF Status
  BRAF.Mutated = list(Transformations = c(Binary.MutatedYN = "BRAF.Mutated"))
  
);


# For each imputed dataset, select distribution and perform univariable analysis
univariable.LX <- lapply(1:length(datasets), function(dataset) {
  
  # Retrieve datasets
  data.surv <- datasets[[dataset]];
  data.surv.LX <- filter(data.surv, LX.OS.Include);
  
  # For each variable
  vars.univariable <- lapply(vars.information.LX, function(var) {
    
    # Consider all potential transformations
    sapply(var$Transformations, function(var.form) {
      
      # Define formula to be fitted
      f <- as.formula(str_c("Surv(time=LX.OS.Time, event=LX.OS.Event) ~ ", var.form, collapse=""));
      
      # Fit model
      fit <- survreg(formula=f, data=data.surv.LX, dist=dist.selected.LX);
      
      # Perform Wald-test for significance
      test <- wald.test(Sigma=fit$var, b=c(fit$coefficients, fit$scale), Terms=c(2:length(fit$coefficients)))
      
      # Determine significance
      pvalue <- test$result$chi2["P"];
      significance <- "";
      if(pvalue<max(significance.levels)) {
        significance <- str_c(rep("*", tail(which(pvalue<significance.levels), n=1)), collapse="");
      }
      
      # Return transformation and model performance
      c(trans=f, ll=fit$loglik[2], pvalue=pvalue, sign=significance)
      
    })
    
  });
  
  # Return outcomes
  return(list(vars.univariable=vars.univariable));
  
})


# Add summary for variables to vars.information object
for(var in names(vars.information.LX)) {
  
  # Get names of transformations
  transformation.names <- names(vars.information.LX[[var]]$Transformations);
  
  # Extract data from univariable object
  loglik <- as.data.frame(
    sapply(transformation.names, function(trans) 
      sapply(univariable.LX, function(univar)
        univar$vars.univariable[[var]][2,trans]
      )
    )
  )
  pvalues <- as.data.frame(
    sapply(transformation.names, function(trans) 
      sapply(univariable.LX, function(univar)
        univar$vars.univariable[[var]][3,trans]
      )
    )
  )
  significance <- as.data.frame(
    sapply(transformation.names, function(trans) 
      sapply(univariable.LX, function(univar)
        univar$vars.univariable[[var]][4,trans]
      )
    )
  )
  
  # Transform List object to Matrix
  pvalues <- matrix(unlist(pvalues), ncol=ncol(pvalues), byrow=F)
  colnames(pvalues) <- transformation.names;
  
  # Rank based on pvalues
  ranks <- matrix(apply(pvalues, 1, rank, ties.method="average"), ncol=ncol(pvalues), byrow=T);
  colnames(ranks) <- colnames(pvalues);
  avgrank <- apply(ranks, 2, mean);
  
  # Select best performing transformation
  selected <- names(avgrank)[which.min(avgrank)];
  selected.transformation <- vars.information.LX[[var]]$Transformations[selected];
  selected.avgpvalue <- mean(pvalues[, selected]);
  selected.significance <- "";
  if(selected.avgpvalue<max(significance.levels)){
    selected.significance <- str_c(rep("*", tail(which(selected.avgpvalue<significance.levels), n=1)), collapse="");
  }
  
  # Save to vars.information object
  vars.information.LX[[var]]$LogLik <- loglik;
  vars.information.LX[[var]]$Pvalues <- pvalues;
  vars.information.LX[[var]]$Significance <- significance;
  vars.information.LX[[var]]$Ranks <- ranks;
  vars.information.LX[[var]]$AvgRank <- avgrank;
  
  vars.information.LX[[var]]$SelectedTransformation <- selected.transformation;
  vars.information.LX[[var]]$SelectedPvalue <- selected.avgpvalue;
  vars.information.LX[[var]]$SelectedSignificance <- selected.significance;
  
}


# Generate summary Table
vars.univariable.LX <- t(sapply(vars.information.LX, function(var) c(
  trans = names(var$SelectedTransformation),
  pvalue = var$SelectedPvalue,
  sign = var$SelectedSignificance
))); 

# View results
kable(vars.univariable.LX);

	trans	pvalue	sign
PFS	Log	0	***
Age	Poly2	1.47682807410954e-05	***
Gender.Female	Binary.FemaleYN	0.0293104401172259	**
Hospital.Private	Binary.PrivateYN	0.00854917648739073	***
Stage.Diagnosis	Binary.Stage4YN	0.464603901073672
Primary.Site	Factor.GroupColMul.GroupLefRec	0.0240492505778438	**
Metastases.Number	Factor.Group3plus	0.678845101916153
Metastases.Liver	Binary	0.719065221171095
Metastases.Peritoneum	Binary	0.0120556554881923	**
L1.ECOG	Factor.Group23	0	***
Charlson	Binary.1plusYN	0.912530732728782
CEA.Level	Binary.200plus	0.267061210740859
RAS.Mutated	Binary.MutatedYN	0.0110552668651976	**
BRAF.Mutated	Binary.MutatedYN	0.0330294095990516	**

Step 4: Multivariable Modeling of Survival

Multivariable (survival) models are fitted to reflect patient heterogeneity. A variable selection procedure is implemented that considers both clinical relevance and statistical performance. In total, the multivariable modeling exists of three steps:

Defining functions used for fitting the multivariable models
Defining the input for the multivariable modeling process
Fitting the multivariable models

Step 4.1: defining functions used for fitting the multivariable models

Several custom functions are defined to perform the variable selection procedure that considers both clinical relevance and statistical performance. Please find an explanation of the functions in the R code below.

# The 'getSurvDataset.L1' and 'getSurvDataset.LX' functions take the cleaned data and transform these in a data.frame 
# dat can be used to estimate (survival) models. This is necessary, because the cleaned data needs to be transformed 
# according to the selected transformations and treatment interaction terms need to be calculated accordingly
getSurvDataset.L1 <- function(data) {
  
  data <- data %>% 
    
    ## Single variables
    
    # Event
    mutate(Event = L1.PFS.Event) %>%
    
    # Events
    mutate(Events = L1.PFS.Events) %>%
    
    # Events.01
    mutate(Events.01 = ifelse(Events=="Death", 1, NA)) %>%
    mutate(Events.01 = ifelse(Events=="Progression", 0, Events.01)) %>%
    
    # L1.PFS.Time = "Time",
    mutate(Time = L1.PFS.Time) %>%
    mutate(Time1 = L1.PFS.Time) %>%
    mutate(Time2 = L1.PFS.Time^2) %>%
    mutate(Time3 = L1.PFS.Time^3) %>%
    mutate(Time4 = L1.PFS.Time^4) %>%
    
    # Arm.ChemBev = "ChemBev",
    mutate(ChemBev = as.numeric(as.logical(L1.Arm.ChemBev))) %>%
    
    #  Age = "Age1 + Age2",
    mutate(Age1 = Age) %>%
    mutate(Age2 = Age1^2) %>%
    
    #  Gender.Female = "Female.YN",
    mutate(Female.YN = as.numeric(as.logical(Gender.Female))) %>%
    
    #  Hospital.Private = "Private.YN",
    mutate(Private.YN = as.numeric(as.logical(Hospital.Private))) %>%
    
    #  Stage.Diagnosis = "Stage4.YN",
    mutate(Stage4.YN = ifelse(as.character(Stage.Diagnosis)=="4", 1, 0)) %>%
    
    #  Primary.Site = "Site.Left + Site.Right + Site.Rectum",
    mutate(Site.LeftRectum = ifelse(Primary.Site %in% c("Left","Rectum"), 1, 0)) %>%
    mutate(Site.Right = ifelse(Primary.Site=="Right", 1, 0)) %>%
    
    #  Metastases.Number = "Metastases.2 + Metastases.3plus",
    mutate(Metastases.2 = ifelse(as.numeric(as.character(Metastases.Number))==2, 1, 0)) %>%
    mutate(Metastases.3plus = ifelse(as.numeric(as.character(Metastases.Number))>=3, 1, 0)) %>%
    
    #  Metastases.Liver = "Liver.YN",
    mutate(Liver.YN = as.numeric(as.logical(Metastases.Liver))) %>%
    
    #  Metastases.Peritoneum = "Peritoneum.YN",
    mutate(Peritoneum.YN = as.numeric(as.logical(Metastases.Peritoneum))) %>%
    
    #  L1.ECOG = "ECOG.1 + ECOG.23",
    mutate(ECOG.1 = ifelse(as.numeric(as.character(L1.ECOG))==1, 1, 0)) %>%
    mutate(ECOG.23 = ifelse(as.numeric(as.character(L1.ECOG))>=2, 1, 0)) %>%
    
    #  Charlson = "Charlson.1 + Charlson.2 + Charlson.3plus",
    mutate(Charlson.1plus = ifelse(as.numeric(as.character(Charlson))>=1, 1, 0)) %>%
    
    #  CEA.Level = "CEA.200plus",
    mutate(CEA.200plus = ifelse(as.numeric(as.character(Metastases.Number))<=2, 1, 0) * ifelse(as.numeric(CEA.Level)>=200, 1, 0)) %>%
    
    #  RAS.Mutated = "MutantRAS.YN",
    mutate(MutantRAS.YN = as.numeric(as.logical(RAS.Mutated))) %>%
    
    #  BRAF.Mutated = "MutantBRAF.YN",
    mutate(MutantBRAF.YN = as.numeric(as.logical(BRAF.Mutated))) %>%
    
    
    ## Interactions
    mutate(Time1.Int = Time1 * ChemBev) %>%
    mutate(Time2.Int = Time2 * ChemBev) %>%
    mutate(Time3.Int = Time3 * ChemBev) %>%
    mutate(Time4.Int = Time4 * ChemBev) %>%
    mutate(Age1.Int = Age1 * ChemBev) %>%
    mutate(Age2.Int = Age2 * ChemBev) %>%
    mutate(Female.YN.Int = Female.YN * ChemBev) %>%
    mutate(Private.YN.Int = Private.YN * ChemBev) %>%
    mutate(Stage4.YN.Int = Stage4.YN * ChemBev) %>%
    mutate(Site.LeftRectum.Int = Site.LeftRectum * ChemBev) %>%
    mutate(Site.Right.Int = Site.Right * ChemBev) %>%
    mutate(Metastases.2.Int = Metastases.2 * ChemBev) %>%
    mutate(Metastases.3plus.Int = Metastases.3plus * ChemBev) %>%
    mutate(Liver.YN.Int = Liver.YN * ChemBev) %>%
    mutate(Peritoneum.YN.Int = Peritoneum.YN * ChemBev) %>%
    mutate(ECOG.1.Int = ECOG.1 * ChemBev) %>%
    mutate(ECOG.23.Int = ECOG.23 * ChemBev) %>%
    mutate(Charlson.1plus.Int = Charlson.1plus * ChemBev) %>%
    mutate(CEA.200plus.Int = CEA.200plus * ChemBev) %>%
    mutate(MutantRAS.YN.Int = MutantRAS.YN * ChemBev) %>%
    mutate(MutantBRAF.YN.Int = MutantBRAF.YN * ChemBev)
  
  
  # Only the newly created columns need to be returned
  return(select(data, Event:MutantBRAF.YN.Int))
  
}
getSurvDataset.LX <- function(data) {
  
  data <- data %>% 
    
    ## Single variables
    
    # L1.OS.Event = Event,
    mutate(Event = LX.OS.Event) %>%
    
    # L1.OS.Time = "Time",
    mutate(Time = LX.OS.Time) %>%
    
    # L1.PFS.Time = "PFS.Time",
    mutate(PFS.Time = log(L1.PFS.Time)) %>%
    
    # Arm.ChemBev = "ChemBev",
    mutate(ChemBev = as.numeric(as.logical(L1.Arm.ChemBev))) %>%
    
    #  Age = "Age1 + Age2",
    mutate(Age1 = as.numeric(Age)-mean(Age)) %>%
    mutate(Age2 = as.numeric(Age1^2)) %>%
    
    #  Gender.Female = "Female.YN",
    mutate(Female.YN = as.numeric(as.logical(Gender.Female))) %>%
    
    #  Hospital.Private = "Private.YN",
    mutate(Private.YN = as.numeric(as.logical(Hospital.Private))) %>%
    
    #  Stage.Diagnosis = "Stage4.YN",
    mutate(Stage4.YN = ifelse(as.numeric(Stage.Diagnosis)==4, 1, 0)) %>%
    
    #  Primary.Site = "Site.Left + Site.Right + Site.Rectum",
    mutate(Site.LeftRectum = ifelse(Primary.Site %in% c("Left","Rectum"), 1, 0)) %>%
    mutate(Site.Right = ifelse(Primary.Site=="Right", 1, 0)) %>%
    
    #  Metastases.Number = "Metastases.2 + Metastases.3plus",
    mutate(Metastases.2 = ifelse(as.numeric(Metastases.Number)==2, 1, 0)) %>%
    mutate(Metastases.3plus = ifelse(as.numeric(Metastases.Number)>=3, 1, 0)) %>%
    
    #  Metastases.Liver = "Liver.YN",
    mutate(Liver.YN = as.numeric(as.logical(Metastases.Liver))) %>%
    
    #  Metastases.Peritoneum = "Peritoneum.YN",
    mutate(Peritoneum.YN = as.numeric(as.logical(Metastases.Peritoneum))) %>%
    
    #  L1.ECOG = "ECOG.1 + ECOG.23",
    mutate(ECOG.1 = ifelse(as.numeric(as.character(L1.ECOG))==1, 1, 0)) %>%
    mutate(ECOG.23 = ifelse(as.numeric(as.character(L1.ECOG))>=2, 1, 0)) %>%
    
    #  Charlson = "Charlson.3plus.YN",
    mutate(Charlson.1plus.YN = ifelse(as.numeric(Charlson)>=1, 1, 0)) %>%
    
    #  CEA.Level = "CEA.200plus",
    mutate(CEA.200plus = ifelse(as.numeric(Metastases.Number)<=2, 1, 0) * ifelse(as.numeric(CEA.Level)>=200, 1, 0)) %>%
    
    #  RAS.Mutated = "MutantRAS.YN",
    mutate(MutantRAS.YN = as.numeric(as.logical(RAS.Mutated))) %>%
    
    #  BRAF.Mutated = "MutantBRAF.YN",
    mutate(MutantBRAF.YN = as.numeric(as.logical(BRAF.Mutated))) %>%
    
    
    ## Interactions
    mutate(PFS.Time.Int = PFS.Time * ChemBev) %>%
    mutate(Age1.Int = Age1 * ChemBev) %>%
    mutate(Age2.Int = Age2 * ChemBev) %>%
    mutate(Female.YN.Int = Female.YN * ChemBev) %>%
    mutate(Private.YN.Int = Private.YN * ChemBev) %>%
    mutate(Stage4.YN.Int = Stage4.YN * ChemBev) %>%
    mutate(Site.LeftRectum.Int = Site.LeftRectum * ChemBev) %>%
    mutate(Site.Right.Int = Site.Right * ChemBev) %>%
    mutate(Metastases.2.Int = Metastases.2 * ChemBev) %>%
    mutate(Metastases.3plus.Int = Metastases.3plus * ChemBev) %>%
    mutate(Liver.YN.Int = Liver.YN * ChemBev) %>%
    mutate(Peritoneum.YN.Int = Peritoneum.YN * ChemBev) %>%
    mutate(ECOG.1.Int = ECOG.1 * ChemBev) %>%
    mutate(ECOG.23.Int = ECOG.23 * ChemBev) %>%
    mutate(Charlson.1plus.YN.Int = Charlson.1plus.YN * ChemBev) %>%
    mutate(CEA.200plus.Int = CEA.200plus * ChemBev) %>%
    mutate(MutantRAS.YN.Int = MutantRAS.YN * ChemBev) %>%
    mutate(MutantBRAF.YN.Int = MutantBRAF.YN * ChemBev)
  
  # Only the newly created columns need to be returned
  return(select(data, Event:MutantBRAF.YN.Int))
  
}

# The 'avgSurvivalModel' and 'avgLogisticModel' functions estimate (survival) models based on a number of datasets and
# combine these separate models into one model using Rubin's rule so that a p-value can be calculated for all models
# combined in the variable selection procedure
avgSurvivalModel <- function(formula, datasets, distribution, conf.int=F) {
  
  # Fit models for each dataset
  models <- lapply(datasets, function(dataset) {
    
    model <- survreg(formula=as.formula(formula), data=dataset, dist=distribution);
    
    list(coef=c(model$coefficients, 'Log(Scale)'=log(model$scale)), sigma=model$var);
    
  })
  
  # Coefficient Estimates
  m <- length(models);
  Q <- sapply(models, function(model) model$coef);
  Qm <- matrix(rowMeans(Q), nrow=1);
  
  # Within Variance
  U <- lapply(models, function(model) model$sigma);
  Um <- Reduce('+', U) / m;
  
  # Between variance
  B <- lapply(1:m, function(i) {
    
    Q <- matrix(Q[,i], nrow=1);
    
    t((Q-Qm)) %*% (Q-Qm) / (m-1)
    
  });
  Bm <- Reduce('+', B);
  
  # Total Variance
  Tm <- Um + (1+(1/m))*Bm;
  
  # Save the Total Variance-Covariance matrix
  S <- Tm;
  
  # Provide proper format and naming
  Qm <- as.numeric(Qm);
  names(Qm) <- rownames(Q);
  
  # Confidence intervals for coefficients
  if(conf.int) {
    
    # Extract diagonals of variance matrices to get variance vectors
    Um <- diag(Um);
    Bm <- diag(Bm);
    Tm <- diag(Tm);
    
    # Degrees of freedom
    rm <- (1+(1/m)) * Bm / Um;
    v <- (m-1)*(1+(1/rm))^2;
    
    # Determine 95% confidence interval bounds
    lower <- Qm + qt(p=0.025, df=v)*sqrt(Tm);
    upper <- Qm + qt(p=0.975, df=v)*sqrt(Tm);
    
  }
  
  if(conf.int) {
    return(list(coef=Qm, sigma=S, lower=lower, upper=upper))
  } else {
    return(list(coef=Qm, sigma=S))
  }
}
avgLogisticModel <- function(formula, datasets, conf.int=F) {
  
  # Fit models for each dataset
  models <- lapply(datasets, function(dataset) {
    
    model <- glm(formula=as.formula(formula), data=dataset, family=binomial(link="logit"));
    
    list(coef=c(model$coefficients), sigma=vcov(model));
    
  })
  
  # Coefficient Estimates
  m <- length(models);
  Q <- sapply(models, function(model) model$coef);
  Qm <- matrix(rowMeans(Q), nrow=1);
  
  # Within Variance
  U <- lapply(models, function(model) model$sigma);
  Um <- Reduce('+', U) / m;
  
  # Between variance
  B <- lapply(1:m, function(i) {
    
    Q <- matrix(Q[,i], nrow=1);
    
    t((Q-Qm)) %*% (Q-Qm) / (m-1)
    
  });
  Bm <- Reduce('+', B);
  
  # Total Variance
  Tm <- Um + (1+(1/m))*Bm;
  
  # Save the Total Variance-Covariance matrix
  S <- Tm;
  
  # Provide proper format and naming
  Qm <- as.numeric(Qm);
  names(Qm) <- rownames(Q);
  
  # Confidence intervals for coefficients
  if(conf.int) {
    
    # Extract diagonals of variance matrices to get variance vectors
    Um <- diag(Um);
    Bm <- diag(Bm);
    Tm <- diag(Tm);
    
    # Degrees of freedom
    rm <- (1+(1/m)) * Bm / Um;
    v <- (m-1)*(1+(1/rm))^2;
    
    # Determine 95% confidence interval bounds
    lower <- Qm + qt(p=0.025, df=v)*sqrt(Tm);
    upper <- Qm + qt(p=0.975, df=v)*sqrt(Tm);
    
  }
  
  if(conf.int) {
    return(list(coef=Qm, sigma=S, lower=lower, upper=upper))
  } else {
    return(list(coef=Qm, sigma=S))
  }
  
}

# The 'survivalBackward', 'survivalForward', and 'logisticForward' functions perform a backward or forward selection
# procedure based on a set of basis variables and a set of candidate variables
survivalBackward <- function(datasets, time, event, vars.include, vars.selection, vars.formulas, pvalue, dist) {
  
  # Extract all used columns from the provided input
  columns.used <- sapply(c(vars.include, vars.selection), function(var) {
    formula.complete <- vars.formulas[var];
    formula.splitted <- str_split(string=formula.complete, pattern=fixed("+"), simplify=T);
    formula.nospaces <- str_trim(string=formula.splitted);
  });
  
  # Objects to record process
  vars.unselected <- c();
  continue <- TRUE;
  
  # Backward selection
  while(continue) {
    
    # Get formula according to the remaining backward selection candidates
    formula.current <- str_c(
      str_c(c("Surv(", time, ", ", event, ") ~ "), collapse=""),  
      str_c(vars.formulas[c(vars.include, vars.selection)], collapse=" + ")
    );
    
    # Fit model with backward candidates
    model.current <- avgSurvivalModel(formula=formula.current, datasets=datasets, distribution=dist);
    
    # Get pvalue for each variable
    coefficient.indexes <- lapply(vars.selection, function(var) which(names(model.current$coef) %in% columns.used[[var]]));
    names(coefficient.indexes) <- vars.selection;
    pvalues <- sapply(vars.selection, function(var) {
      testresult <- wald.test(Sigma = model.current$sigma,
                              b = model.current$coef,
                              Terms = coefficient.indexes[[var]]
      )$result$chi2["P"];
      
      unname(testresult)
    }); # sapply pvalues 

    # Get highes p value
    pvalue.max <- max(pvalues);
    
    # If the highest p value is higher than the threshold, remove this one
    if(pvalue.max > pvalue) {
      
      # Get the name of the variable with the highest pvalue
      pvalue.index <- which.max(pvalues);
      
      # Temporary list of selection candidates without the selected variable
      vars.selection.tmp <- vars.selection[-pvalue.index];
      
      # Store unselected candidates
      vars.unselected <- c(vars.unselected, vars.selection[pvalue.index]);
      
      # If no backward selection candidates remain, stop backward selection
      if(length(vars.selection.tmp)<1) {
        
        vars.selection <- vars.selection.tmp;
        continue <- F;
        
      } else {
        
        # Update backward selection candidates and unselected candidates from temporary vectors
        vars.selection <- vars.selection.tmp;
        
      }
      
    } else {
      
      # If no pvalue higher than thershold, stop backward selection
      continue <- F;
      
    }
  } # while continue
  
  # Return outcomes
  return(list(included = c(vars.include, vars.selection),
              excluded = vars.unselected))
  
}
survivalForward <- function(datasets, time, event, vars.include, vars.selection, vars.formulas, pvalue, dist) {
  
  # Extract all used columns from the provided input
  columns.used <- sapply(c(vars.include, vars.selection), function(var) {
    formula.complete <- vars.formulas[var];
    formula.splitted <- str_split(string=formula.complete, pattern=fixed("+"), simplify=T);
    formula.nospaces <- str_trim(string=formula.splitted);
  });

  # Objects to record process
  continue <- TRUE;
  
  # Backward selection
  while(continue) {
    
    # Get pvalue for each variable
    pvalues <- sapply(vars.selection, function(var) {
      
      # Get formula according to the current forward selection candidates
      formula.current <- str_c(
        str_c(c("Surv(", time, ", ", event, ") ~ "), collapse=""),  
        str_c(vars.formulas[c(vars.include, var)], collapse=" + ")
      );
      
      # Fit model with backward candidates
      model.current <- avgSurvivalModel(formula=formula.current, datasets=datasets, distribution=dist);
      
      
      
      # Get pvalue for this variable
      coefficient.indexes <- which(names(model.current$coef) %in% columns.used[[var]])
      test.result <- wald.test(Sigma = model.current$sigma,
                               b = model.current$coef,
                               Terms = coefficient.indexes
      )$result$chi2["P"]
      
      unname(test.result)
      
    })

    # Get lowest p value
    pvalue.min <- min(pvalues);
    
    # If the highest p value is lower than the threshold, add this one
    if(pvalue.min < pvalue) {
      
      # Get the name of the variable with the lowest pvalue
      pvalue.index <- which.min(pvalues);
      
      # Update variable lists
      vars.include <- c(vars.include, vars.selection[pvalue.index]);
      vars.selection <- vars.selection[-pvalue.index];
      
      # If no forward selection candidates remain, stop backward selection
      if(length(vars.selection)<1) {
        
        continue <- F;
        
      } 
      
    } else {
      
      # If no pvalue higher than thershold, stop forward selection
      continue <- F;
      
    }
  } # while !stop
  
  
  # Return outcomes
  return(list(included = vars.include,
              excluded = vars.selection))
  
}
logisticForward <- function(datasets, events, vars.include, vars.selection, vars.formulas, pvalue) {
  
  # Extract all used columns from the provided input
  columns.used <- sapply(c(vars.include, vars.selection), function(var) {
    formula.complete <- vars.formulas[var];
    formula.splitted <- str_split(string=formula.complete, pattern=fixed("+"), simplify=T);
    formula.nospaces <- str_trim(string=formula.splitted);
  });
  
  # Objects to record process
  continue <- TRUE;
  
  # Backward selection
  while(continue) {
    
    # Get pvalue for each variable
    pvalues <- sapply(vars.selection, function(var) {
      
      # Get formula according to the current forward selection candidates
      formula.current <- str_c(
        str_c(c(events, " ~ "), collapse=""),  
        str_c(vars.formulas[c(vars.include, var)], collapse=" + ")
      );
      
      # Fit model with backward candidates
      model.current <- avgLogisticModel(formula=formula.current, datasets=datasets);
      
      
      
      # Get pvalue for this variable
      coefficient.indexes <- which(names(model.current$coef) %in% columns.used[[var]])
      test.result <- wald.test(Sigma = model.current$sigma,
                               b = model.current$coef,
                               Terms = coefficient.indexes
      )$result$chi2["P"]
      
      unname(test.result)
      
    })

    # Get lowest p value
    pvalue.min <- min(pvalues);
    
    # If the highest p value is lower than the threshold, add this one
    if(pvalue.min < pvalue) {
      
      # Get the name of the variable with the lowest pvalue
      pvalue.index <- which.min(pvalues);
      
      # Update variable lists
      vars.include <- c(vars.include, vars.selection[pvalue.index]);
      vars.selection <- vars.selection[-pvalue.index];
      
      # If no forward selection candidates remain, stop backward selection
      if(length(vars.selection)<1) {
        
        continue <- F;
        
      } 
      
    } else {
      
      # If no pvalue higher than thershold, stop forward selection
      continue <- F;
      
    }
  } # while !stop
  
  
  # Return outcomes
  return(list(included = vars.include,
              excluded = vars.selection))
  
}

# The 'fitLogisticModel' and 'fitSurvivalModel' functions are high-level functions that call the different selection
# procedures and return the final model based on the selected set of variables
fitLogisticModel <- function(datasets, events, pvalue, vars.include, vars.selection, vars.formulas) {
  
  result.forward <- logisticForward(datasets = datasets, events = events, vars.include = vars.include, vars.selection = vars.selection, vars.formulas = vars.formulas, pvalue = pvalue);
  
  formula.final <- str_c(
    str_c(c(events, " ~ "), collapse=""),  
    str_c(vars.formulas[result.forward$included], collapse=" + ")
  );
  model.final <- avgLogisticModel(formula = formula.final, datasets = datasets, conf.int = T)
  
  return(list(coef = model.final$coef, sigma = model.final$sigma, lower = model.final$lower, upper = model.final$upper, included = result.forward$included))
  
}
fitSurvivalModel <- function(datasets, time, event, dist, vars.include, vars.backward, vars.forward, vars.formulas, pvalue.backward, pvalue.forward, pvalue.interaction) {
  
  result.backward <- survivalBackward(datasets = datasets, time = time, event = event, dist = dist, vars.include = vars.include, vars.selection = vars.backward, vars.formulas = vars.formulas, pvalue = pvalue.backward)
  
  result.forward <- survivalForward(datasets = datasets, time = time, event = event, dist = dist, vars.include = result.backward$included, vars.selection = c(vars.forward, result.backward$excluded), vars.formulas = vars.formulas, pvalue = pvalue.forward)
  
  vars.include.interaction <- result.forward$included;
  vars.backward.interaction <- str_c(vars.include.interaction, ".Int");
  vars.backward.interaction <- vars.backward.interaction[-which(vars.backward.interaction=="Arm.ChemBev.Int")];
  
  result.interaction <- survivalBackward(datasets = datasets, time = time, event = event, dist = dist, vars.include = vars.include.interaction, vars.selection = vars.backward.interaction, vars.formulas = vars.formulas, pvalue = pvalue.interaction)
  
  formula.final <- str_c(
    str_c(c("Surv(", time, ", ", event, ") ~ "), collapse=""),  
    str_c(vars.formulas[result.interaction$included], collapse=" + ")
  );
  model.final <- avgSurvivalModel(formula = formula.final, datasets = datasets, distribution = "weibull", conf.int = T)
  
  return(list(coef = model.final$coef, sigma = model.final$sigma, lower = model.final$lower, upper = model.final$upper, included = result.interaction$included))
  
}

Step 4.2: defining the input for the multivariable modeling process

The selection procedures require specific input, which is defined here:

The formula for each variable, i.e. which columns in the dataset are used
The clinical relevance of the variables, i.e. whether they should be always included or are subject to a backward or forward selection procedure
Datasets in the appropriate format

# List object that contains an overview of all variables and the columns in the data that 
# they use for first-line progression-free survival (L1)
vars.formulas.L1 <- list(
  
  # Single variables
  L1.PFS.Time = "Time1 + Time2",
  Arm.ChemBev = "ChemBev",
  Age = "Age1 + Age2",
  Gender.Female = "Female.YN",
  Hospital.Private = "Private.YN",
  Stage.Diagnosis = "Stage4.YN",
  Primary.Site = "Site.LeftRectum + Site.Right",
  Metastases.Number = "Metastases.2 + Metastases.3plus",
  Metastases.Liver = "Liver.YN",
  Metastases.Peritoneum = "Peritoneum.YN",
  L1.ECOG = "ECOG.1 + ECOG.23",
  Charlson = "Charlson.1plus",
  CEA.Level = "CEA.200plus",
  RAS.Mutated = "MutantRAS.YN",
  BRAF.Mutated = "MutantBRAF.YN",
  
  
  # Interactions with treatment
  L1.PFS.Time.Int = "Time1.Int + Time2.Int",
  Age.Int = "Age1.Int + Age2.Int",
  Gender.Female.Int = "Female.YN.Int",
  Hospital.Private.Int = "Private.YN.Int",
  Stage.Diagnosis.Int = "Stage4.YN.Int",
  Primary.Site.Int = "Site.LeftRectum.Int + Site.Right.Int",
  Metastases.Number.Int = "Metastases.2.Int + Metastases.3plus.Int",
  Metastases.Liver.Int = "Liver.YN.Int",
  Metastases.Peritoneum.Int = "Peritoneum.YN.Int",
  L1.ECOG.Int = "ECOG.1.Int + ECOG.23.Int",
  Charlson.Int = "Charlson.1plus.Int",
  CEA.Level.Int = "CEA.200plus.Int",
  RAS.Mutated.Int = "MutantRAS.YN.Int",
  BRAF.Mutated.Int = "MutantBRAF.YN.Int"
);

# List object that contains an overview of all variables and the columns in the data that 
# they use for overall survival after first-line progression (LX)
vars.formulas.LX <- list(
  
  # Single variables
  PFS = "PFS.Time",
  Arm.ChemBev = "ChemBev",
  Age = "Age1 + Age2",
  Gender.Female = "Female.YN",
  Hospital.Private = "Private.YN",
  Stage.Diagnosis = "Stage4.YN",
  Primary.Site = "Site.LeftRectum + Site.Right",
  Metastases.Number = "Metastases.2 + Metastases.3plus",
  Metastases.Liver = "Liver.YN",
  Metastases.Peritoneum = "Peritoneum.YN",
  L1.ECOG = "ECOG.1 + ECOG.23",
  Charlson = "Charlson.1plus.YN",
  CEA.Level = "CEA.200plus",
  RAS.Mutated = "MutantRAS.YN",
  BRAF.Mutated = "MutantBRAF.YN",

  # Interactions with treatment
  PFS.Int = "PFS.Time.Int",
  Age.Int = "Age1.Int + Age2.Int",
  Gender.Female.Int = "Female.YN.Int",
  Hospital.Private.Int = "Private.YN.Int",
  Stage.Diagnosis.Int = "Stage4.YN.Int",
  Primary.Site.Int = "Site.LeftRectum.Int + Site.Right.Int",
  Metastases.Number.Int = "Metastases.2.Int + Metastases.3plus.Int",
  Metastases.Liver.Int = "Liver.YN.Int",
  Metastases.Peritoneum.Int = "Peritoneum.YN.Int",
  L1.ECOG.Int = "ECOG.1.Int + ECOG.23.Int",
  Charlson.Int = "Charlson.1plus.YN.Int",
  CEA.Level.Int = "CEA.200plus.Int",
  RAS.Mutated.Int = "MutantRAS.YN.Int",
  BRAF.Mutated.Int = "MutantBRAF.YN.Int"
);


# Indicate which variables should be included and which are subject to selection (backward vs. forward)
vars.include.L1 <- c("Arm.ChemBev", "Age", "L1.ECOG", "BRAF.Mutated");
vars.backward.L1 <- c("Stage.Diagnosis", "Primary.Site", "Metastases.Number", "Metastases.Liver",
                      "Metastases.Peritoneum", "Charlson");
vars.forward.L1 <- c("Gender.Female", "Hospital.Private", "CEA.Level", "RAS.Mutated");

# Indicate which variables should be included and which are subject to selection (backward vs. forward)
vars.include.LX <- c("Arm.ChemBev", "Age", "Primary.Site", "Metastases.Peritoneum", "L1.ECOG", "BRAF.Mutated");
vars.backward.LX <- c("PFS","Stage.Diagnosis", "Metastases.Number", "Metastases.Liver", "Charlson", "CEA.Level");
vars.forward.LX <- c("Gender.Female", "Hospital.Private", "RAS.Mutated");

# Transform cleaned datasets into appropriate format
datasets.surv.L1 <- lapply(datasets, function(dataset) getSurvDataset.L1(dataset));
datasets.surv.LX <- lapply(datasets, function(dataset) getSurvDataset.LX(filter(dataset, LX.OS.Include)));

Step 4.3: fitting the multivariable models

After defining all functions and input required for fitting the (survival) models, all procedures can be performed.

Model L1-Time

# Fit Model L1-Time
Model.L1.Time <- fitSurvivalModel(datasets = datasets.surv.L1, time = "Time", event = "Event", dist = "weibull", vars.include = vars.include.L1, vars.backward = vars.backward.L1, vars.forward = vars.forward.L1, vars.formulas = vars.formulas.L1, pvalue.backward = 0.3, pvalue.forward = 0.2, pvalue.interaction = 0.3);

# Save and print Model L1-Time
save(x=Model.L1.Time, file="object Model.L1.Time 20190201.RData");
write.table(cbind(Coef=Model.L1.Time$coef, Lower=Model.L1.Time$lower, Upper=Model.L1.Time$upper), file="Model L1-Time 20190201.txt", sep=";");
cbind(Coefficient=Model.L1.Time$coef, Lower.Bound=Model.L1.Time$lower, Upper.Bound=Model.L1.Time$upper);

##                   Coefficient   Lower.Bound  Upper.Bound
## (Intercept)      4.5352456627  3.6607544949  5.409736831
## ChemBev          0.4903821144  0.2176428890  0.763121340
## Age1             0.0467445556  0.0193879515  0.074101160
## Age2            -0.0003818097 -0.0006122064 -0.000151413
## ECOG.1          -0.4420102362 -0.7086076563 -0.175412816
## ECOG.23         -0.4466416971 -0.7601591259 -0.133124268
## MutantBRAF.YN   -0.1586860032 -0.3946605642  0.077288558
## Site.LeftRectum -0.0320105967 -0.2807102482  0.216689055
## Site.Right      -0.2048020078 -0.4594543905  0.049850375
## Liver.YN        -0.1006873497 -0.2196800253  0.018305326
## Peritoneum.YN   -0.0967462377 -0.2154643846  0.021971909
## CEA.200plus     -0.0349843979 -0.3386554012  0.268686605
## Private.YN       0.3951723972  0.1640512674  0.626293527
## MutantRAS.YN    -0.0958525980 -0.2360036442  0.044298448
## ECOG.1.Int       0.2336980952 -0.0577437169  0.525139907
## ECOG.23.Int     -0.0886445670 -0.4773567723  0.300067638
## CEA.200plus.Int -0.2021634520 -0.5296258343  0.125298930
## Private.YN.Int  -0.4110088123 -0.6653954483 -0.156622176
## Log(Scale)      -0.3324565000 -0.3850521961 -0.279860804

Model L1-Event

# Fit Model L1-Event
vars.all.L1 <- c(vars.include.L1, vars.backward.L1, vars.forward.L1);
vars.all.L1 <- vars.all.L1[vars.all.L1!="Arm.ChemBev"];
Model.L1.Event <- fitLogisticModel(datasets = datasets.surv.L1, events = "Events.01", vars.include = c("L1.PFS.Time", "Arm.ChemBev"),  vars.selection = c(vars.all.L1, str_c(vars.all.L1, ".Int"), "L1.PFS.Time.Int"), vars.formulas = vars.formulas.L1, pvalue = 0.20);

# Save and print Model L1-Event
save(x=Model.L1.Event, file="object Model.L1.Event 20190201.RData")
write.table(cbind(Coef=Model.L1.Event$coef, Lower=Model.L1.Event$lower, Upper=Model.L1.Event$upper), file="Model L1-Event 20190201.txt", sep=";");
cbind(Coefficient=Model.L1.Event$coef, Lower.Bound=Model.L1.Event$lower, Upper.Bound=Model.L1.Event$upper);

##                       Coefficient   Lower.Bound   Upper.Bound
## (Intercept)         -2.069855e+00 -3.011333e+00 -1.128376e+00
## Time1               -1.871496e-04 -8.169673e-03  7.795374e-03
## Time2               -2.561821e-06 -1.577255e-05  1.064891e-05
## ChemBev              1.738231e-01 -1.229186e+00  1.576832e+00
## Charlson.1plus       6.461158e-01  1.593857e-01  1.132846e+00
## Site.LeftRectum.Int -1.462309e+00 -2.459083e+00 -4.655346e-01
## Site.Right.Int      -1.469024e+00 -2.529350e+00 -4.086988e-01
## Time1.Int            2.267294e-03 -6.025885e-03  1.056047e-02
## Time2.Int            2.566552e-06 -1.072477e-05  1.585787e-05

Model LX-Time

# Fit Model LX-Time
Model.LX.Time <- fitSurvivalModel(datasets = datasets.surv.LX, time = "Time", event = "Event", dist = "weibull", vars.include = vars.include.LX, vars.backward = vars.backward.LX, vars.forward = vars.forward.LX, vars.formulas = vars.formulas.LX, pvalue.backward = 0.3, pvalue.forward = 0.2, pvalue.interaction = 0.3)

# Save and print Model LX-Time
save(x=Model.LX.Time, file="object Model.LX.Time 20190201.RData")
write.table(cbind(Coef=Model.LX.Time$coef, Lower=Model.LX.Time$lower, Upper=Model.LX.Time$upper), file="Model LX-Time 20190201.txt", sep=";");
cbind(Coefficient=Model.LX.Time$coef, Lower.Bound=Model.LX.Time$lower, Upper.Bound=Model.LX.Time$upper);

##                     Coefficient   Lower.Bound   Upper.Bound
## (Intercept)        3.2891795888  2.2620579298  4.316301e+00
## ChemBev            0.8818385546 -0.2812537861  2.044931e+00
## Age1              -0.0251070916 -0.0370240389 -1.319014e-02
## Age2              -0.0005670460 -0.0011710130  3.692101e-05
## Site.LeftRectum   -0.1579432364 -0.5383176339  2.224312e-01
## Site.Right        -0.2365318069 -0.6211753023  1.481117e-01
## Peritoneum.YN      0.1513874151 -0.1773574308  4.801323e-01
## ECOG.1            -0.2762104282 -0.6659711458  1.135503e-01
## ECOG.23           -1.2022780055 -1.6308052807 -7.737507e-01
## MutantBRAF.YN      0.2745640877 -0.3312597292  8.803879e-01
## PFS.Time           0.5751723494  0.4054450540  7.448996e-01
## Stage4.YN         -0.5243148245 -0.9757145591 -7.291509e-02
## Charlson.1plus.YN  0.1044981510 -0.0565984893  2.655948e-01
## MutantRAS.YN       0.1091776800 -0.3022287934  5.205842e-01
## Private.YN         0.1053037877 -0.0458828408  2.564904e-01
## Age1.Int           0.0174056852  0.0034559505  3.135542e-02
## Age2.Int           0.0004030007 -0.0003232005  1.129202e-03
## Peritoneum.YN.Int -0.2386910341 -0.6130196513  1.356376e-01
## ECOG.1.Int        -0.0930659837 -0.5165916222  3.304597e-01
## ECOG.23.Int        0.3115761369 -0.2210811948  8.442335e-01
## MutantBRAF.YN.Int -0.6795392885 -1.3768905224  1.781195e-02
## PFS.Time.Int      -0.1486793160 -0.3474887968  5.013016e-02
## Stage4.YN.Int      0.4904170757 -0.0221555921  1.002990e+00
## MutantRAS.YN.Int  -0.4071664687 -0.8555996074  4.126667e-02
## Log(Scale)        -0.1525148212 -0.2153760094 -8.965363e-02

Step 5: Model Validation

Internal validation is performed to assess the performanc of the models. This exists of five steps:

Defining functions used to perform the validation
Graphical assessment of internal validity
Calculating the apparent model performance
Performing bootstrapping to calculate the optimism
Correcting the apparent performance for the optimism

For survival models Model L1-Time and Model LX-Time, several validation plots were assessed and model discrimination was assessed by Uno et al.’s modification of Harrel’s C-statistic to account for censoring. Two measures were used to assess survival model calibration: 1) the regression slope of the predicted Weibull scale parameter in a Cox-regression model, and 2) Demler et al.’s Greenwood-Nam-D’Agostino statistic (GND), which is a modification of Hosmer-Lemeshow’s statistic by Nam and D’Agostino and accounts for censoring using Greenwood’s variance estimator. Both C-statistic and GND are assessed at specific timepoints, which were selected to be at 0.5, 1, and 2 years. Additionally, GND requires a number of groups to be specified in which patients are stratified according to their predicted risk, for which 5 and 10 were chosen to assess impact of this arbitrary decision.

Internal validity of logistic regression Model L1-Event was assessed by excluding patients for which no event was observed and using Harrel’s C-statistic for model discrimination. To assess calibration, the regression slope of the predicted response of Model L1-Event in a logistic regression model and the Brier score were used.

Bootstrapping was performed to determine the optimism and obtain sets of models to perform the probabilistic sensitivity analysis for the discrete event simulation model.

Step 5.1: defining functions used to perform the validation

Several custom functions are defined to make predictions using the fitted models and to calculate the statistics used. Please find an explanation of the functions in the R code below.

# The 'predictSurvmod', 'predictSurvmod.response', 'predictSurvmod.lp', 'predictSurvmod.p', 'predictLogreg',
# and 'predictLogreg.lp' are custom functions to make predictions using the fitted models
predictSurvmod <- function(model, newdata) {
  
  # Add a dummy column for the intercept
  newdata <- cbind(newdata, '(Intercept)'=1);
  
  # Ge the Weibull Shape parameter from the model
  Shape <- 1/exp(model["Log(Scale)"]);
  model <- model[-which(names(model) == "Log(Scale)")];
  
  # Subset a dataframe to enable matrix multiplication and check
  newdata.sub <- as.matrix(newdata[, names(model), drop=F]);
  
  # Get the scale parameter for the survival model
  Scale <- exp(newdata.sub %*% model);

  # Predict the times
  p <- matrix(runif(nrow(newdata.sub)), ncol=1);
  t <- qweibull(p=p, shape=Shape, scale=Scale);

  # Return predicted times
  return(t);
  
  
}
predictSurvmod.response <- function(model, newdata) {
  
  # Add a dummy column for the intercept
  newdata <- cbind(newdata, '(Intercept)'=1);
  
  # Ge the Weibull Shape parameter from the model
  Shape <- 1/exp(model["Log(Scale)"]);
  model <- model[-which(names(model) == "Log(Scale)")];
  
  # Subset a dataframe to enable matrix multiplication and check
  newdata.sub <- as.matrix(newdata[, names(model), drop=F]);
  
  # Get the scale parameter for the survival model
  Scale <- exp(newdata.sub %*% model);
  
  # Return predicted times
  return(Scale);
  
  
}
predictSurvmod.lp <- function(model, newdata) {
  
  # Add a dummy column for the intercept
  newdata <- cbind(newdata, '(Intercept)'=1);
  
  # Ge the Weibull Shape parameter from the model
  Shape <- 1/exp(model["Log(Scale)"]);
  model <- model[-which(names(model) == "Log(Scale)")];
  
  # Subset a dataframe to enable matrix multiplication and check
  newdata.sub <- as.matrix(newdata[, names(model), drop=F]);
  
  # Get the scale parameter for the survival model
  LP <- newdata.sub %*% model;
  
  # Return predicted times
  return(LP);
  
  
}
predictSurvmod.p <- function(model, newdata, q) {
  
  # Add a dummy column for the intercept
  newdata <- cbind(newdata, '(Intercept)'=1);
  
  # Ge the Weibull Shape parameter from the model
  Shape <- 1/exp(model["Log(Scale)"]);
  model <- model[-which(names(model) == "Log(Scale)")];
  
  # Subset a dataframe to enable matrix multiplication and check
  newdata.sub <- as.matrix(newdata[, names(model), drop=F]);
  
  # Get the scale parameter for the survival model
  Scale <- exp(newdata.sub %*% model);

  # Predict the times
  q <- matrix(q, nrow=nrow(newdata), ncol=length(q), byrow=T);
  p <- pweibull(q=q, shape=Shape, scale=Scale)
  
  # Return predicted times
  return(p);
  
  
}
predictLogreg <- function(model, newdata) {
  
  # Add a dummy column for the intercept
  newdata <- cbind(newdata, '(Intercept)'=1);

  # Subset a dataframe to enable matrix multiplication and check
  newdata.sub <- as.matrix(newdata[, names(model), drop=F]);
  
  # Calculate the probabilities
  p.tmp <- exp(newdata.sub %*% model);
  p <- p.tmp / (p.tmp + 1);
  
  # Return probabilities
  return(as.vector(p));
  
}
predictLogreg.lp <- function(model, newdata) {
  
  # Add a dummy column for the intercept
  newdata <- cbind(newdata, '(Intercept)'=1);
  
  # Subset a dataframe to enable matrix multiplication and check
  newdata.sub <- as.matrix(newdata[, names(model), drop=F]);
  
  # Calculate the probabilities
  p.tmp <- newdata.sub %*% model;
  
  # Return probabilities
  return(p.tmp);
  
}

# The 'testGND' is used to calculate the Greenwood-Nam-D'Agostino statistic as proposed by the authors and
# with some additional error handlers included and an option to plot an illustration
testGND <- function(model, data, time, n.bin=5, min.event=5, plot.graph=F, y.max=1) {
  
  time.in.days <- time*365.25;
  
  # Predict event probability for each individual
  pt <- predictSurvmod.p(model=model, newdata=data, q=time.in.days);
  
  # Bin individuals according to risk of event
  pt.q <- quantile(pt, seq(0, 1, length.out=n.bin+1));
  b <- sapply(pt, function(p) ifelse(p==max(pt.q), n.bin, which(p<pt.q)[1]-1));
  
  # Calculate average probability for each group
  ptg <- tapply(pt, b, mean);
  
  # Determine Kaplan-Meier estimates
  KMg <- t(sapply(1:n.bin, function(bin) {
    
    # Get subset of data for the bin
    data.sub <- data[b==bin, ];
    
    # Check whether sufficient events are observed
    if(sum(data.sub$Event==1)<min.event) stop("Insufficient events observed!");
    
    # Estimate Kaplan-Meier for the bin
    KM <- survfit(Surv(Time, Event) ~ 1, data=data.sub, type="kaplan-meier");
    
    # Extract the estimates:
    # p       event probability (1 - km survival estimate)
    # var     variance of km survival estimate
    # nrisk   number at risk
    if(time.in.days<min(KM$time)) {
      p <- 1;
      var <- 0;
    } else {
      p <- KM$surv[tail(which(KM$time<=time.in.days), n=1)];
      p <- ifelse(p==0, KM$surv[tail(which(KM$time<=time.in.days), n=1)-1], p);
      err <- KM$std.err[tail(which(KM$time<=time.in.days), n=1)];
      err <- ifelse(err==Inf, KM$std.err[tail(which(KM$time<=time.in.days), n=1)-1], err);
      var <- (err * p)^2;
    }
    
    
    # Return estimates
    c(p=1-p, var=var)
    
  }))
  
  
  
  # Calculate test statistic
  GND.chisq.sep <- (KMg[,"p"]-ptg)^2 / KMg[,"var"];
  GND.chisq <- sum(GND.chisq.sep);
  GND.df <- n.bin - 1;
  GND.p <- 1 - pchisq(q=GND.chisq, df=GND.df);
  
  if(plot.graph) {
    
    barplot(rbind(ptg, KMg[, "p"]), beside=T, ylim=c(0,y.max), main=paste0("Event before ",time, "-year (p = ",round(GND.p,3),")"), ylab="Probability", names.arg=str_c("n=",table(b)), las=1, legend.text=c("Survival Model", "Kaplan-Meier"), args.legend=list(x="top", inset=c(0,-0.0), ncol=2, bty='n'))
    
  }
  
  # Return results
  return(c(chisq=GND.chisq, df=GND.df, p=GND.p));
  
}

# The 'getPerformanceSurvmod' and 'getPerformanceLogreg' functions are high-level functions that call all 
# functions used to calculate statistics used to assess the models' performance
getPerformanceSurvmod <- function(model, data.train, data.test, timepoints=c(0.5, 1, 2)) {
  
  library(survAUC);
  
  # c-statistic
  c.statistic <- sapply(timepoints, function(tt) {
    
    # Make negative because a higher scale indicates a higher time-to-event, i.e. lower risk
    pred.lp <- -predictSurvmod.lp(model=model, newdata=data.test);
    surv.train <- Surv(time=data.train$Time, event=data.train$Event);
    surv.test <- Surv(time=data.test$Time, event=data.test$Event);
    
    UnoC(Surv.rsp=surv.train, Surv.rsp.new=surv.test, lpnew=pred.lp, time=tt*365.25);
    
  });
  names(c.statistic) <- str_c("c.statistic_", timepoints, "-year");
  
  # Regression Slope
  data.test$scale <- predictSurvmod.response(model, data.test);
  slope.scale <- exp(coxph(Surv(Time, Event) ~ scale, data=data.test)$coefficients);

  # Hosmer-Lemeshow
  HL.5bins <- sapply(timepoints, function(tt) {
    testGND(model=model, data=data.test, time=tt, n.bin=5, plot=F)["chisq"]
  });
  HL.10bins <- sapply(timepoints, function(tt) {
    testGND(model=model, data=data.test, time=tt, n.bin=10, plot=F)["chisq"]
  });
  names(HL.5bins) <- str_c("HL.5bins_", timepoints, "-year");
  names(HL.10bins) <- str_c("HL.10bins_", timepoints, "-year");
  
  # Return all performance measures
  return(c(c.statistic, Slope.Scale = unname(slope.scale), HL.5bins, HL.10bins))
  
}
getPerformanceLogreg <- function(model, data) {
  
  data.sub <- data[data$Events!="Censored", ];
  
  obs <- ifelse(data.sub$Events=="Death", 1,
                ifelse(data.sub$Events=="Progression", 0, NA));
  
  pred.prob <- predictLogreg(model=model, newdata=data.sub);
  pred.lp <- predictLogreg.lp(model=model, newdata=data.sub);
  
  # c-statistic
  pairs <- expand.grid(1:length(obs),
                       1:length(obs));
  c_n <- sum(ifelse(obs[pairs[,1]]==1 & obs[pairs[,2]]==0, 1, 0) * ifelse(pred.prob[pairs[,1]]>pred.prob[pairs[,2]], 1, 0));
  c_d <- sum(ifelse(obs[pairs[,1]]==1 & obs[pairs[,2]]==0, 1, 0));
  c.statistic <- c_n/c_d;
  
  # Regression slope
  tmp.model <- glm(obs ~ pred.lp, family=binomial(link="logit"));
  regression.slope <- tmp.model$coefficients["pred.lp"];
  
  # Brier score
  Brier.score <- sum((pred.prob-obs)^2)/length(obs);
  
  # Return all performance measures
  return(c(C.Statistic = c.statistic, Regression.Slope = unname(regression.slope), Brier.Score = Brier.score))
  
}

Step 5.2: graphical assessment of internal validity

Both Kaplan-Meier plots for different patient subgroups and illustrations of the Greenwood-Nam-D’Agostino statistic were used to graphically assess the performance of the survival models.

Model L1-Time - Kaplan-Meier Plots

# Custom function to plot predited vs. observed Kaplan-Meier curves for both treatment strategies
plotKM <- function(model, data, class.column, time.max=1500, time.step=10, add.main=NULL) {
  
  km.fit <- survfit(as.formula(str_c("Surv(Time, Event) ~ ",class.column)), data=data, type="kaplan-meier")
  
  classes <- sort(unlist(unique(data[, class.column])));
  n.classes <- length(classes);
  n.obs.classes <- table(data[, class.column]);
  
  q.values <- seq(1, time.max, time.step);
  
  sim <- sapply(classes, function(class) {
    surv.mat <- 1 - predictSurvmod.p(model, data[data[, class.column]==as.character(class), ], q=q.values);
    surv.mean <- apply(surv.mat, 2, mean);
    
    surv.mean
  })
  
  cols <- rainbow(n.classes);
  if(is.null(add.main)) {
    main.text <- class.column;
  } else {
    main.text <- str_c(class.column, "\n(", add.main,")")
  }
  plot(km.fit, col=cols, lty=1, lwd=2, xlim=c(0,time.max), main=main.text, xlab="Time to Progression or Death (Days)")
  for(i in 1:n.classes) {
    lines(x=q.values, y=sim[,i], col=cols[i], lty=2, lwd=2)
  }
  legend("right", legend=c(str_c(classes, " (obs) [n=",n.obs.classes,"]"), str_c(classes, " (sim)")),
         col=c(cols, cols), lty=c(rep(1, n.classes), rep(2, n.classes)), lwd=2)
  
}

# Model L1-Time
set.seed(20181211);
random.dataset <- sample(1:length(datasets), 1);
data <- datasets.surv.L1[[random.dataset]];

# Kaplan-Meier curves for the following groups
# - The complete cohort
# - Patient with a left-sided primary tumour or rectum primary tumour
# - Patient with a right-sided primary tumour
# - Patients with an primary tumour in their colon but not otherwise specified (NOS) or unkown primary tumour site
par(mfrow=c(2,2));
plotKM(model=Model.L1.Time$coef, data=data, class.column="ChemBev")
plotKM(model=Model.L1.Time$coef, data=filter(data, Site.LeftRectum==1), class.column="ChemBev", add.main="Left or Rectum")
plotKM(model=Model.L1.Time$coef, data=filter(data, Site.Right==1), class.column="ChemBev", add.main="Right")
plotKM(model=Model.L1.Time$coef, data=filter(data, Site.LeftRectum==0, Site.Right==0), class.column="ChemBev", add.main="Other")

# Kaplan-Meier curves for the following groups
# - Female patients
# - Male patients
# - Patients with metastatic disease at diagnosis
# - Patients without metastatic disease at diagnosis
par(mfrow=c(2,2));
plotKM(model=Model.L1.Time$coef, data=filter(data, Female.YN==1), class.column="ChemBev", add.main="Female")
plotKM(model=Model.L1.Time$coef, data=filter(data, Female.YN==0), class.column="ChemBev", add.main="Male")
plotKM(model=Model.L1.Time$coef, data=filter(data, Stage4.YN==1), class.column="ChemBev", add.main="Stage 4 at Diagnosis")
plotKM(model=Model.L1.Time$coef, data=filter(data, Stage4.YN==0), class.column="ChemBev", add.main="NOT Stage 4 at Diagnosis")

# Kaplan-Meier curves for the following groups
# - Patients with an ECOG Performance Status of 2 or 3
# - Patients with an ECOG Performance Status of 1
# - Patients with an ECOG Performance Status of 0
par(mfrow=c(2,2));
plotKM(model=Model.L1.Time$coef, data=filter(data, ECOG.23==1), class.column="ChemBev", add.main="ECOG 23")
plotKM(model=Model.L1.Time$coef, data=filter(data, ECOG.1==1), class.column="ChemBev", add.main="ECOG 1")
plotKM(model=Model.L1.Time$coef, data=filter(data, ECOG.23==0, ECOG.1==0), class.column="ChemBev", add.main="ECOG 0")

Model L1-Time - GND Test Illustration

# Graphical representation of the Greenwood-Nam-D'Agostino statistic with 5 risk groups
par(mfrow=c(2,2));
testGND(model=Model.L1.Time$coef, data=data, time=0.5, n.bin=5, plot.graph=T);

##     chisq        df         p 
## 4.4991559 4.0000000 0.3426476

testGND(model=Model.L1.Time$coef, data=data, time=1, n.bin=5, plot.graph=T);

##     chisq        df         p 
## 2.4457211 4.0000000 0.6543809

testGND(model=Model.L1.Time$coef, data=data, time=2, n.bin=5, plot.graph=T);

##     chisq        df         p 
## 2.0226615 4.0000000 0.7315906

# Graphical representation of the Greenwood-Nam-D'Agostino statistic with 10 risk groups
par(mfrow=c(2,2));
testGND(model=Model.L1.Time$coef, data=data, time=0.5, n.bin=10, plot.graph=T);

##     chisq        df         p 
## 7.1396005 9.0000000 0.6225876

testGND(model=Model.L1.Time$coef, data=data, time=1, n.bin=10, plot.graph=T);

##     chisq        df         p 
## 8.1382488 9.0000000 0.5202764

testGND(model=Model.L1.Time$coef, data=data, time=2, n.bin=10, plot.graph=T);

##     chisq        df         p 
## 5.2879510 9.0000000 0.8085192

# Graphical representation of the Greenwood-Nam-D'Agostino statistic as included in the manuscript
png(file="GND Model L1-Time 20190201.png", width=12, height=6, units="in", res=96);
par(mfrow=c(1,2));
testGND(model=Model.L1.Time$coef, data=data, time=1, n.bin=5, plot.graph=T);

##     chisq        df         p 
## 2.4457211 4.0000000 0.6543809

testGND(model=Model.L1.Time$coef, data=data, time=1, n.bin=10, plot.graph=T);

##     chisq        df         p 
## 8.1382488 9.0000000 0.5202764

dev.off();

## quartz_off_screen 
##                 2

Model LX-Time - Kaplan-Meier Plots

# Model LX-Time
data <- datasets.surv.LX[[random.dataset]];

# Kaplan-Meier curves for the following groups
# - The complete cohort
# - Patient with a left-sided primary tumour or rectum primary tumour
# - Patient with a right-sided primary tumour
# - Patients with an primary tumour in their colon but not otherwise specified (NOS) or unkown primary tumour site
par(mfrow=c(2,2));
plotKM(model=Model.LX.Time$coef, data=data, class.column="ChemBev")
plotKM(model=Model.LX.Time$coef, data=filter(data, Site.LeftRectum==1), class.column="ChemBev", add.main="Left or Rectum")
plotKM(model=Model.LX.Time$coef, data=filter(data, Site.Right==1), class.column="ChemBev", add.main="Right")
plotKM(model=Model.LX.Time$coef, data=filter(data, Site.LeftRectum==0, Site.Right==0), class.column="ChemBev", add.main="Other")

# Kaplan-Meier curves for the following groups
# - Female patients
# - Male patients
# - Patients with metastatic disease at diagnosis
# - Patients without metastatic disease at diagnosis
par(mfrow=c(2,2));
plotKM(model=Model.LX.Time$coef, data=filter(data, Female.YN==1), class.column="ChemBev", add.main="Female")
plotKM(model=Model.LX.Time$coef, data=filter(data, Female.YN==0), class.column="ChemBev", add.main="Male")
plotKM(model=Model.LX.Time$coef, data=filter(data, Stage4.YN==1), class.column="ChemBev", add.main="Stage 4 at Diagnosis")
plotKM(model=Model.LX.Time$coef, data=filter(data, Stage4.YN==0), class.column="ChemBev", add.main="NOT Stage 4 at Diagnosis")

# Kaplan-Meier curves for the following groups
# - Patients with an ECOG Performance Status of 2 or 3
# - Patients with an ECOG Performance Status of 1
# - Patients with an ECOG Performance Status of 0
par(mfrow=c(2,2));
plotKM(model=Model.LX.Time$coef, data=filter(data, ECOG.23==1), class.column="ChemBev", add.main="ECOG 23")
plotKM(model=Model.LX.Time$coef, data=filter(data, ECOG.1==1), class.column="ChemBev", add.main="ECOG 1")
plotKM(model=Model.LX.Time$coef, data=filter(data, ECOG.23==0, ECOG.1==0), class.column="ChemBev", add.main="ECOG 0")

Model LX-Time - GND Test Illustration

# Graphical representation of the Greenwood-Nam-D'Agostino statistic with 5 risk groups
par(mfrow=c(2,2));
testGND(model=Model.LX.Time$coef, data=data, time=0.5, n.bin=5, plot.graph=T);

##      chisq         df          p 
## 8.62654164 4.00000000 0.07114301

testGND(model=Model.LX.Time$coef, data=data, time=1, n.bin=5, plot.graph=T);

##     chisq        df         p 
## 2.8534511 4.0000000 0.5826418

testGND(model=Model.LX.Time$coef, data=data, time=2, n.bin=5, plot.graph=T);

##     chisq        df         p 
## 7.4139126 4.0000000 0.1155658

# Graphical representation of the Greenwood-Nam-D'Agostino statistic with 10 risk groups
par(mfrow=c(2,2));
testGND(model=Model.LX.Time$coef, data=data, time=0.5, n.bin=10, plot.graph=T);

##       chisq          df           p 
## 16.73744515  9.00000000  0.05298987

testGND(model=Model.LX.Time$coef, data=data, time=1, n.bin=10, plot.graph=T);

##     chisq        df         p 
## 4.2555980 9.0000000 0.8937987

testGND(model=Model.LX.Time$coef, data=data, time=2, n.bin=10, plot.graph=T);

##     chisq        df         p 
## 9.5251554 9.0000000 0.3902742

# Graphical representation of the Greenwood-Nam-D'Agostino statistic as included in the manuscript
png(file="GND Model LX-Time 20190201.png", width=12, height=6, units="in", res=96);
par(mfrow=c(1,2));
testGND(model=Model.LX.Time$coef, data=data, time=1, n.bin=5, plot.graph=T);

##     chisq        df         p 
## 2.8534511 4.0000000 0.5826418

testGND(model=Model.LX.Time$coef, data=data, time=1, n.bin=10, plot.graph=T);

##     chisq        df         p 
## 4.2555980 9.0000000 0.8937987

dev.off();

## quartz_off_screen 
##                 2

Step 5.3: calculating the apparent model performance

Using the previously defined functions, the uncorrected, biased apparent performance of the models is calculated, so that these can be corrected for the optimism as estimated from the bootstrapping later on. The apparent performance is calculated for each imputed datasets separately and averaged to obtain a single apparent performance outcome.

# General settings
timepoints <- c(0.5, 1, 2);

# Model L1-Event
Dapp.Model.L1.Time <- rowMeans(
  sapply(datasets.surv.L1, function(dataset) {
    getPerformanceSurvmod(model = Model.L1.Time$coef, data.train = dataset, data.test = dataset, timepoints = timepoints)
  })
);

# Observe apparent performance
Dapp.Model.L1.Time;

## c.statistic_0.5-year   c.statistic_1-year   c.statistic_2-year 
##            0.6591775            0.6348088            0.6287258 
##          Slope.Scale    HL.5bins_0.5-year      HL.5bins_1-year 
##            0.9958962            4.0929190            4.3369164 
##      HL.5bins_2-year   HL.10bins_0.5-year     HL.10bins_1-year 
##            0.9944051            7.4060601           10.6726626 
##     HL.10bins_2-year 
##            5.5885418

# Model L1-Event
Dapp.Model.L1.Event <- rowMeans(
  sapply(datasets.surv.L1, function(dataset) {
    getPerformanceLogreg(model = Model.L1.Event$coef, data = dataset)
  })
);

# Observe apparent performance
Dapp.Model.L1.Event;

##      C.Statistic Regression.Slope      Brier.Score 
##       0.68958824       0.99974239       0.08674998

# Model LX-Event
Dapp.Model.LX.Time <- rowMeans(
  sapply(datasets.surv.LX, function(dataset) {
    getPerformanceSurvmod(model = Model.LX.Time$coef, data.train = dataset, data.test = dataset, timepoints = timepoints)
  })
);

# Observe apparent performance
Dapp.Model.LX.Time;

## c.statistic_0.5-year   c.statistic_1-year   c.statistic_2-year 
##            0.7174054            0.6933501            0.6844447 
##          Slope.Scale    HL.5bins_0.5-year      HL.5bins_1-year 
##            0.9971058            8.1409540            1.5974814 
##      HL.5bins_2-year   HL.10bins_0.5-year     HL.10bins_1-year 
##            4.9123090           15.9033123            6.6292335 
##     HL.10bins_2-year 
##            7.6692321

Step 5.4: performing bootstrapping to calculate the optimism

Bootstrapping was performed to assess the optimism of the (survival) models and to obtain sets of correlared model parameters to be used to perform a probabilistic sensitivity analysis for the discrete event simulation. In each bootstrap iteration, the following actions are performed:

Generate a bootstrap sample of patient identification numbers
For each of the imputed datasets:
- Get the patients selected in the bootstrap sample from the imputed dataset
- Transform the data in the appropriate format for analysis
- Fit the (survival) models according to the defined structures
- Assess the performance and optimism of the models
- Return the results for the imputed dataset
Combine the results for the separate imputed datasets into one average model and set of outcomes

# Setting for the bootstrapping
n.boot <- 10000;         # number of bootstrap iterations
set.seed(20181211);     # random number seed for reproduction

# Define the model structure that needs to be fitted for the models in each bootstrap iteration
Model.L1.Time.formula <- str_c(
  str_c(c("Surv(Time, Event) ~ "), collapse=""),  
  str_c(vars.formulas.L1[Model.L1.Time$included], collapse=" + ")
);
Model.L1.Event.formula <- str_c(
  str_c(c("Events.01 ~ "), collapse=""),  
  str_c(vars.formulas.L1[Model.L1.Event$included], collapse=" + ")
);
Model.LX.Time.formula <- str_c(
  str_c(c("Surv(Time, Event) ~ "), collapse=""),  
  str_c(vars.formulas.LX[Model.LX.Time$included], collapse=" + ")
);

# Set up a multithreaded computation cluster
library(parallel);
library(doSNOW);
cl <- makeCluster(detectCores());
clusterExport(cl, ls());

# Perform the bootstrap
out.bootstrap <- t(parSapply(cl, 1:n.boot, function(i.boot) {
  
  # Load packages
  library(dplyr);
  library(stringr);
  library(survival);
  library(aod)
  library(survAUC);
  
  # Initialize an output object and an object to track whether any errors occur
  out <- NULL;
  count.errors <- -1;
  
  while(is.null(out)) {
    
    # An error handler is used to make sure that an a set of outcomes is obtained for each bootstrap iteration
    # If an error occurs, this is recorded, so that it is known whether this occurs and, if so, how often
    out <- tryCatch({
      
      # Increase the error counter (starts at -1 so if no errors occur the counter will be zero)
      count.errors <- count.errors + 1;
      
      # Bootstrap sample of patients
      boot.sample <- sample(x=1:nrow(datasets[[1]]), size=nrow(datasets[[1]]), replace=T);
      
      # For each imputed dataset
      out <- t(sapply(1:length(datasets), function(i.dataset) {
        
        ## Get bootstrap sample
        # Get datasets from imputed dataset
        dataset <- datasets[[i.dataset]];
        dataset.L1 <- dataset;
        dataset.LX <- filter(dataset, LX.OS.Include);
        boot.dataset.L1 <- dataset[boot.sample, , drop=F];
        boot.dataset.LX <- filter(boot.dataset.L1, LX.OS.Include);
        
        # Transform data in appropriate format
        dataset.surv.L1 <- getSurvDataset.L1(data=dataset.L1);
        dataset.surv.LX <- getSurvDataset.LX(data=dataset.LX);
        boot.dataset.surv.L1 <- getSurvDataset.L1(data=boot.dataset.L1);
        boot.dataset.surv.LX <- getSurvDataset.LX(data=boot.dataset.LX);
        
        
        ## Fit the models
        # Model L1-Time
        boot.Model.L1.Time <- survreg(formula = as.formula(Model.L1.Time.formula), data=boot.dataset.surv.L1, dist="weibull");
        boot.survmod.L1 <- c(boot.Model.L1.Time$coefficients, 'Log(Scale)'=log(boot.Model.L1.Time$scale));
        
        # Model L1-Event
        boot.Model.L1.Event <- glm(formula = as.formula(Model.L1.Event.formula), data=boot.dataset.surv.L1, family=binomial(link="logit"));
        boot.logreg.L1 <- boot.Model.L1.Event$coefficients;
        
        # Model LX-Time
        boot.Model.LX.Time <- survreg(formula = as.formula(Model.LX.Time.formula), data=boot.dataset.surv.LX, dist="weibull");
        boot.survmod.LX <- c(boot.Model.LX.Time$coefficients, 'Log(Scale)'=log(boot.Model.LX.Time$scale));
        
        
        ## Performance Model L1-Time
        # Performance in bootstrap sample
        perf.surv.L1.boot <- getPerformanceSurvmod(model = boot.survmod.L1, data.train = boot.dataset.surv.L1, data.test = boot.dataset.surv.L1, timepoints = timepoints)
        
        # Performance in original (imputed) dataset
        perf.surv.L1.origin <- getPerformanceSurvmod(model = boot.survmod.L1, data.train = boot.dataset.surv.L1, data.test = dataset.surv.L1, timepoints = timepoints); 
        
        # Calculate the optimism
        optimism.surv.L1 <- perf.surv.L1.boot - perf.surv.L1.origin;
        
        # Set names for later analyses
        names(perf.surv.L1.boot) <- str_c("BOOT_SURVMOD_L1_", names(perf.surv.L1.boot));
        names(perf.surv.L1.origin) <- str_c("ORG_SURVMOD_L1_", names(perf.surv.L1.origin));
        names(optimism.surv.L1) <- str_c("OPT_SURVMOD_L1_", names(optimism.surv.L1));
        names(boot.survmod.L1) <- str_c("SURVMOD_L1_", names(boot.survmod.L1));
        
        
        ## Performance Model L1-Event
        # Performance in bootstrap sample
        perf.logreg.L1.boot <- getPerformanceLogreg(model = boot.logreg.L1, data = boot.dataset.surv.L1)
        
        # Performance in original (imputed) dataset
        perf.logreg.L1.origin <- getPerformanceLogreg(model = boot.logreg.L1, data = dataset.surv.L1); 
        
        # Calculate the optimism
        optimism.logreg.L1 <- perf.logreg.L1.boot - perf.logreg.L1.origin;
        
        # Set names for later analyses
        names(perf.logreg.L1.boot) <- str_c("BOOT_EVENTMOD_L1_", names(perf.logreg.L1.boot));
        names(perf.logreg.L1.origin) <- str_c("ORG_EVENTMOD_L1_", names(perf.logreg.L1.origin));
        names(optimism.logreg.L1) <- str_c("OPT_EVENTMOD_L1_", names(optimism.logreg.L1));
        names(boot.logreg.L1) <- str_c("EVENTMOD_L1_", names(boot.logreg.L1));
        
        
        ## Performance Model LX-Time
        # Performance in bootstrap sample
        perf.surv.LX.boot <- getPerformanceSurvmod(model = boot.survmod.LX, data.train = boot.dataset.surv.LX, data.test = boot.dataset.surv.LX, timepoints = timepoints)
        
        # Performance in original (imputed) dataset
        perf.surv.LX.origin <- getPerformanceSurvmod(model = boot.survmod.LX, data.train = boot.dataset.surv.LX, data.test = dataset.surv.LX, timepoints = timepoints); 
        
                # Calculate the optimism
        optimism.surv.LX <- perf.surv.LX.boot - perf.surv.LX.origin;
        
        # Set names for later analyses
        names(perf.surv.LX.boot) <- str_c("BOOT_SURVMOD_LX_", names(perf.surv.LX.boot));
        names(perf.surv.LX.origin) <- str_c("ORG_SURVMOD_LX_", names(perf.surv.LX.origin));
        names(optimism.surv.LX) <- str_c("OPT_SURVMOD_LX_", names(optimism.surv.LX));
        names(boot.survmod.LX) <- str_c("SURVMOD_LX_", names(boot.survmod.LX));
        
        ## Return models and performance
        c(boot.survmod.L1, perf.surv.L1.boot, perf.surv.L1.origin, optimism.surv.L1, boot.logreg.L1, perf.logreg.L1.boot, perf.logreg.L1.origin, optimism.logreg.L1, boot.survmod.LX, perf.surv.LX.boot, perf.surv.LX.origin, optimism.surv.LX, Errors = count.errors)
        
      })) # sapply datasets
      
      # Apply Rubin's rule to obtain average models and outcomes
      out <- colMeans(out);
      
      # Check whether there are infeasible results for any of the GND tests and, if so, give
      # an error so that another bootstrap sample is performed
      #if(any(out[str_detect(string=names(out), pattern="HL")]==Inf)) {
      if(any(out==Inf)) stop("An error occured in the analysis of the bootstrap sample!");
      #}
    
      # Return the 'object'
      out;
      
    # Error handler for the tryCatch function  
    }, error=function(e) NULL)
  
  } # While
  
  # Return the outcomes for the bootstrap iteration
  out;
  
})); # parSapply nboot

# Stop the defined computation cluster
stopCluster(cl);

# Print five iteration of the bootstrap
head(out.bootstrap, n=5);

##      SURVMOD_L1_(Intercept) SURVMOD_L1_ChemBev SURVMOD_L1_Age1
## [1,]               3.981999          0.7133633      0.04901094
## [2,]               4.700113          0.4052946      0.04082126
## [3,]               4.620920          0.6174053      0.03743284
## [4,]               4.518186          0.5050754      0.04693260
## [5,]               4.863937          0.4679972      0.03160581
##      SURVMOD_L1_Age2 SURVMOD_L1_ECOG.1 SURVMOD_L1_ECOG.23
## [1,]   -0.0003953975        -0.4172791         -0.7096706
## [2,]   -0.0003497074        -0.5041710         -0.2891771
## [3,]   -0.0003100127        -0.3154448         -0.6469685
## [4,]   -0.0003717848        -0.1261577         -0.3197438
## [5,]   -0.0002592091        -0.4779499         -0.2849362
##      SURVMOD_L1_MutantBRAF.YN SURVMOD_L1_Site.LeftRectum
## [1,]              -0.18668733                  0.2851306
## [2,]              -0.10763923                  0.1840829
## [3,]              -0.10634250                  0.1204314
## [4,]              -0.08786323                 -0.2165154
## [5,]              -0.18082350                  0.2201003
##      SURVMOD_L1_Site.Right SURVMOD_L1_Liver.YN SURVMOD_L1_Peritoneum.YN
## [1,]            0.05037495         -0.08705661               -0.1502843
## [2,]           -0.04295241         -0.20918879               -0.0965782
## [3,]           -0.02865888         -0.14946688               -0.1496728
## [4,]           -0.35799352         -0.16156255               -0.1867249
## [5,]            0.07600494         -0.20936823               -0.1616124
##      SURVMOD_L1_CEA.200plus SURVMOD_L1_Private.YN SURVMOD_L1_MutantRAS.YN
## [1,]            -0.02250040             0.5601233             -0.06317179
## [2,]             0.02959098             0.4218272             -0.05558826
## [3,]             0.04848758             0.4890554             -0.02016390
## [4,]            -0.37707645             0.3980278             -0.12519125
## [5,]            -0.03467023             0.4405518             -0.10982494
##      SURVMOD_L1_ECOG.1.Int SURVMOD_L1_ECOG.23.Int
## [1,]           0.268692508             0.17451063
## [2,]           0.403546316             0.07013651
## [3,]           0.057770700             0.11115989
## [4,]          -0.001833973            -0.14401697
## [5,]           0.244993040            -0.32615021
##      SURVMOD_L1_CEA.200plus.Int SURVMOD_L1_Private.YN.Int
## [1,]                 -0.2969168                -0.6098874
## [2,]                 -0.3712344                -0.4717239
## [3,]                 -0.3673050                -0.4517613
## [4,]                  0.3443316                -0.3426054
## [5,]                 -0.1190150                -0.4710576
##      SURVMOD_L1_Log(Scale) BOOT_SURVMOD_L1_c.statistic_0.5-year
## [1,]            -0.3352906                            0.6971802
## [2,]            -0.3162845                            0.6075005
## [3,]            -0.3872189                            0.6787228
## [4,]            -0.3924422                            0.6314717
## [5,]            -0.4041236                            0.6488232
##      BOOT_SURVMOD_L1_c.statistic_1-year BOOT_SURVMOD_L1_c.statistic_2-year
## [1,]                          0.6630135                          0.6527020
## [2,]                          0.6154472                          0.6124971
## [3,]                          0.6543885                          0.6424432
## [4,]                          0.6148917                          0.6109928
## [5,]                          0.6363253                          0.6308419
##      BOOT_SURVMOD_L1_Slope.Scale BOOT_SURVMOD_L1_HL.5bins_0.5-year
## [1,]                   0.9958290                          5.211445
## [2,]                   0.9962760                          5.708340
## [3,]                   0.9959867                         10.338772
## [4,]                   0.9953009                          4.618231
## [5,]                   0.9958087                          6.050799
##      BOOT_SURVMOD_L1_HL.5bins_1-year BOOT_SURVMOD_L1_HL.5bins_2-year
## [1,]                        6.985530                       11.005641
## [2,]                        4.814476                        5.175752
## [3,]                       14.221859                        5.976124
## [4,]                        9.888893                        7.126047
## [5,]                        6.660566                        3.508903
##      BOOT_SURVMOD_L1_HL.10bins_0.5-year BOOT_SURVMOD_L1_HL.10bins_1-year
## [1,]                           23.43496                         28.94024
## [2,]                           17.57629                         17.19086
## [3,]                           24.89114                         24.99794
## [4,]                           13.89146                         24.25595
## [5,]                           15.40319                         15.69896
##      BOOT_SURVMOD_L1_HL.10bins_2-year ORG_SURVMOD_L1_c.statistic_0.5-year
## [1,]                         20.76857                           0.6603587
## [2,]                         13.96084                           0.6362329
## [3,]                         15.23128                           0.6589743
## [4,]                         14.40160                           0.6437835
## [5,]                         10.52725                           0.6513928
##      ORG_SURVMOD_L1_c.statistic_1-year ORG_SURVMOD_L1_c.statistic_2-year
## [1,]                         0.6318032                         0.6247304
## [2,]                         0.6160703                         0.6114792
## [3,]                         0.6359020                         0.6303478
## [4,]                         0.6172235                         0.6116596
## [5,]                         0.6269510                         0.6210518
##      ORG_SURVMOD_L1_Slope.Scale ORG_SURVMOD_L1_HL.5bins_0.5-year
## [1,]                  0.9968068                         3.085940
## [2,]                  0.9965162                         2.079462
## [3,]                  0.9966184                         4.145054
## [4,]                  0.9959796                         4.621218
## [5,]                  0.9964070                         4.844827
##      ORG_SURVMOD_L1_HL.5bins_1-year ORG_SURVMOD_L1_HL.5bins_2-year
## [1,]                      10.950854                       7.054199
## [2,]                       7.846517                       2.283721
## [3,]                      13.404311                       3.649663
## [4,]                       4.739600                       9.011875
## [5,]                       6.753159                       3.259901
##      ORG_SURVMOD_L1_HL.10bins_0.5-year ORG_SURVMOD_L1_HL.10bins_1-year
## [1,]                          7.978417                        17.86005
## [2,]                          6.379822                        11.97190
## [3,]                          8.151804                        20.61193
## [4,]                          9.711431                        10.76515
## [5,]                          8.845025                        10.56049
##      ORG_SURVMOD_L1_HL.10bins_2-year OPT_SURVMOD_L1_c.statistic_0.5-year
## [1,]                       10.072965                         0.036821541
## [2,]                        7.287994                        -0.028732415
## [3,]                        7.852532                         0.019748480
## [4,]                       12.752327                        -0.012311846
## [5,]                        6.122629                        -0.002569534
##      OPT_SURVMOD_L1_c.statistic_1-year OPT_SURVMOD_L1_c.statistic_2-year
## [1,]                      0.0312102343                       0.027971551
## [2,]                     -0.0006230617                       0.001017893
## [3,]                      0.0184864365                       0.012095387
## [4,]                     -0.0023317608                      -0.000666834
## [5,]                      0.0093743284                       0.009790143
##      OPT_SURVMOD_L1_Slope.Scale OPT_SURVMOD_L1_HL.5bins_0.5-year
## [1,]              -0.0009777518                      2.125504929
## [2,]              -0.0002402122                      3.628878019
## [3,]              -0.0006317983                      6.193717317
## [4,]              -0.0006787829                     -0.002987206
## [5,]              -0.0005982845                      1.205971312
##      OPT_SURVMOD_L1_HL.5bins_1-year OPT_SURVMOD_L1_HL.5bins_2-year
## [1,]                    -3.96532395                      3.9514412
## [2,]                    -3.03204090                      2.8920317
## [3,]                     0.81754863                      2.3264606
## [4,]                     5.14929344                     -1.8858279
## [5,]                    -0.09259345                      0.2490013
##      OPT_SURVMOD_L1_HL.10bins_0.5-year OPT_SURVMOD_L1_HL.10bins_1-year
## [1,]                         15.456548                       11.080198
## [2,]                         11.196472                        5.218963
## [3,]                         16.739334                        4.386004
## [4,]                          4.180025                       13.490804
## [5,]                          6.558168                        5.138470
##      OPT_SURVMOD_L1_HL.10bins_2-year EVENTMOD_L1_(Intercept)
## [1,]                       10.695601               -1.222529
## [2,]                        6.672845               -3.429277
## [3,]                        7.378746               -1.896369
## [4,]                        1.649275               -1.982959
## [5,]                        4.404618               -1.695852
##      EVENTMOD_L1_Time1 EVENTMOD_L1_Time2 EVENTMOD_L1_ChemBev
## [1,]     -0.0127019951      9.717204e-06          -1.1545663
## [2,]      0.0040775680     -8.194758e-06           1.4111030
## [3,]     -0.0007586047     -1.140798e-06          -0.1703071
## [4,]     -0.0007956200     -1.050032e-07           0.8586870
## [5,]     -0.0025616104     -9.145091e-07          -0.4207733
##      EVENTMOD_L1_Charlson.1plus EVENTMOD_L1_Site.LeftRectum.Int
## [1,]                  1.0387673                      -1.5291757
## [2,]                  0.7663630                      -1.5997984
## [3,]                  0.5972719                      -0.7651645
## [4,]                  0.4055901                      -2.3030617
## [5,]                  0.7016559                      -1.6592355
##      EVENTMOD_L1_Site.Right.Int EVENTMOD_L1_Time1.Int
## [1,]                 -1.0183295           0.015737858
## [2,]                 -1.8408813          -0.001153155
## [3,]                 -0.7035978           0.001664369
## [4,]                 -2.0915155           0.002074461
## [5,]                 -2.0244781           0.005974437
##      EVENTMOD_L1_Time2.Int BOOT_EVENTMOD_L1_C.Statistic
## [1,]         -1.017203e-05                    0.7513534
## [2,]          7.602548e-06                    0.7160946
## [3,]          1.754595e-06                    0.6314291
## [4,]          1.165385e-06                    0.7175743
## [5,]          7.297080e-08                    0.7066423
##      BOOT_EVENTMOD_L1_Regression.Slope BOOT_EVENTMOD_L1_Brier.Score
## [1,]                                 1                   0.08621135
## [2,]                                 1                   0.07125345
## [3,]                                 1                   0.09863529
## [4,]                                 1                   0.07825127
## [5,]                                 1                   0.08003173
##      ORG_EVENTMOD_L1_C.Statistic ORG_EVENTMOD_L1_Regression.Slope
## [1,]                   0.6742480                        0.6229088
## [2,]                   0.6632729                        0.7894793
## [3,]                   0.6854978                        1.1844405
## [4,]                   0.6894995                        0.7169080
## [5,]                   0.6806799                        0.8191654
##      ORG_EVENTMOD_L1_Brier.Score OPT_EVENTMOD_L1_C.Statistic
## [1,]                  0.08796597                  0.07710542
## [2,]                  0.08816621                  0.05282169
## [3,]                  0.08741610                 -0.05406867
## [4,]                  0.08821141                  0.02807471
## [5,]                  0.08674404                  0.02596248
##      OPT_EVENTMOD_L1_Regression.Slope OPT_EVENTMOD_L1_Brier.Score
## [1,]                        0.3770911                -0.001754618
## [2,]                        0.2105207                -0.016912762
## [3,]                       -0.1844404                 0.011219198
## [4,]                        0.2830920                -0.009960139
## [5,]                        0.1808346                -0.006712313
##      SURVMOD_LX_(Intercept) SURVMOD_LX_ChemBev SURVMOD_LX_Age1
## [1,]               3.562227          0.7824871     -0.02828989
## [2,]               3.998186          0.2554781     -0.02341365
## [3,]               3.569794          0.4655880     -0.02833868
## [4,]               2.313700          2.0443733     -0.02091674
## [5,]               2.598641          0.6808260     -0.03057557
##      SURVMOD_LX_Age2 SURVMOD_LX_Site.LeftRectum SURVMOD_LX_Site.Right
## [1,]   -0.0011449823                 -0.1087073            -0.1413447
## [2,]   -0.0002210010                 -0.2235832            -0.3898415
## [3,]   -0.0011554675                 -0.3559879            -0.3880295
## [4,]   -0.0009223096                 -0.1158416            -0.2002665
## [5,]   -0.0006766529                  0.2135378             0.1598420
##      SURVMOD_LX_Peritoneum.YN SURVMOD_LX_ECOG.1 SURVMOD_LX_ECOG.23
## [1,]               -0.0249536         0.1549188         -1.0220102
## [2,]                0.2214340        -0.5808265         -1.4354497
## [3,]                0.1059064        -0.4217989         -1.3702082
## [4,]                0.5151359        -0.1775224         -1.3387347
## [5,]                0.0503954        -0.1520670         -0.9341271
##      SURVMOD_LX_MutantBRAF.YN SURVMOD_LX_PFS.Time SURVMOD_LX_Stage4.YN
## [1,]               0.70546169           0.4707842           -0.3944380
## [2,]               0.19919089           0.5095575           -0.5856043
## [3,]              -0.32472287           0.6052857           -0.6042184
## [4,]               0.03480724           0.7434121           -0.3734194
## [5,]               0.14462485           0.6508931           -0.9527968
##      SURVMOD_LX_Charlson.1plus.YN SURVMOD_LX_MutantRAS.YN
## [1,]                   0.09992622              0.17053839
## [2,]                   0.04779290              0.03353796
## [3,]                  -0.07911071              0.09179549
## [4,]                   0.13386652              0.22966125
## [5,]                   0.07126643              0.05809591
##      SURVMOD_LX_Private.YN SURVMOD_LX_Age1.Int SURVMOD_LX_Age2.Int
## [1,]            0.15158631          0.01786526        7.877169e-04
## [2,]            0.07598565          0.01288963       -3.442888e-05
## [3,]            0.03818174          0.02456055        1.056388e-03
## [4,]            0.20289242          0.01560396        1.133192e-03
## [5,]            0.14046066          0.02742750        6.063385e-04
##      SURVMOD_LX_Peritoneum.YN.Int SURVMOD_LX_ECOG.1.Int
## [1,]                  -0.12031487           -0.46908724
## [2,]                  -0.21624553            0.27065371
## [3,]                  -0.15013080           -0.03996639
## [4,]                  -0.57248756           -0.03035211
## [5,]                  -0.01832955           -0.21184432
##      SURVMOD_LX_ECOG.23.Int SURVMOD_LX_MutantBRAF.YN.Int
## [1,]              0.2905964                   -1.2068901
## [2,]              0.5655256                   -0.5433808
## [3,]              0.3297124                   -0.3063589
## [4,]              0.4628835                   -0.3725404
## [5,]             -0.1904882                   -0.4076959
##      SURVMOD_LX_PFS.Time.Int SURVMOD_LX_Stage4.YN.Int
## [1,]             -0.09350060                0.3643138
## [2,]             -0.09965267                0.7427876
## [3,]             -0.09376370                0.5268879
## [4,]             -0.40790633                0.3279707
## [5,]             -0.13930390                0.7655971
##      SURVMOD_LX_MutantRAS.YN.Int SURVMOD_LX_Log(Scale)
## [1,]                  -0.3972561            -0.1596703
## [2,]                  -0.2979609            -0.1807637
## [3,]                  -0.4433107            -0.2072936
## [4,]                  -0.4738601            -0.2232146
## [5,]                  -0.2130537            -0.1556238
##      BOOT_SURVMOD_LX_c.statistic_0.5-year
## [1,]                            0.7175332
## [2,]                            0.7182717
## [3,]                            0.7255471
## [4,]                            0.6871814
## [5,]                            0.7255819
##      BOOT_SURVMOD_LX_c.statistic_1-year BOOT_SURVMOD_LX_c.statistic_2-year
## [1,]                          0.6836472                          0.6742734
## [2,]                          0.6979305                          0.6871862
## [3,]                          0.7094436                          0.6946134
## [4,]                          0.6827085                          0.6717903
## [5,]                          0.6989814                          0.6822703
##      BOOT_SURVMOD_LX_Slope.Scale BOOT_SURVMOD_LX_HL.5bins_0.5-year
## [1,]                   0.9970078                         11.326984
## [2,]                   0.9970459                         14.218835
## [3,]                   0.9971341                          7.908079
## [4,]                   0.9965562                          5.302841
## [5,]                   0.9973610                         10.901772
##      BOOT_SURVMOD_LX_HL.5bins_1-year BOOT_SURVMOD_LX_HL.5bins_2-year
## [1,]                       12.103363                        5.984211
## [2,]                        6.049089                        8.935838
## [3,]                        9.671642                        4.928537
## [4,]                        8.257890                        8.517921
## [5,]                        5.580362                        6.485760
##      BOOT_SURVMOD_LX_HL.10bins_0.5-year BOOT_SURVMOD_LX_HL.10bins_1-year
## [1,]                           28.80884                         23.00415
## [2,]                           29.92394                         18.69299
## [3,]                           33.21597                         18.72455
## [4,]                           15.67308                         17.15724
## [5,]                           18.91509                         15.45737
##      BOOT_SURVMOD_LX_HL.10bins_2-year ORG_SURVMOD_LX_c.statistic_0.5-year
## [1,]                         14.52087                           0.7113699
## [2,]                         27.31791                           0.7036708
## [3,]                         17.52691                           0.7093891
## [4,]                         17.04953                           0.7103136
## [5,]                         23.34100                           0.7129121
##      ORG_SURVMOD_LX_c.statistic_1-year ORG_SURVMOD_LX_c.statistic_2-year
## [1,]                         0.6827227                         0.6739350
## [2,]                         0.6817569                         0.6723853
## [3,]                         0.6838889                         0.6729765
## [4,]                         0.6869212                         0.6750528
## [5,]                         0.6827392                         0.6700449
##      ORG_SURVMOD_LX_Slope.Scale ORG_SURVMOD_LX_HL.5bins_0.5-year
## [1,]                  0.9971391                         6.048096
## [2,]                  0.9973047                         4.313220
## [3,]                  0.9977269                        11.834210
## [4,]                  0.9973201                         6.336112
## [5,]                  0.9975677                         2.847844
##      ORG_SURVMOD_LX_HL.5bins_1-year ORG_SURVMOD_LX_HL.5bins_2-year
## [1,]                       3.723671                       3.430626
## [2,]                       3.913788                       3.848451
## [3,]                      13.579960                       6.704817
## [4,]                       6.530022                       5.451594
## [5,]                       9.312232                      10.132961
##      ORG_SURVMOD_LX_HL.10bins_0.5-year ORG_SURVMOD_LX_HL.10bins_1-year
## [1,]                         12.759155                        8.371971
## [2,]                         10.075726                        8.675773
## [3,]                         19.169984                       17.097972
## [4,]                         15.517688                       15.131077
## [5,]                          7.252339                       12.878907
##      ORG_SURVMOD_LX_HL.10bins_2-year OPT_SURVMOD_LX_c.statistic_0.5-year
## [1,]                        10.21783                         0.006163227
## [2,]                        10.92730                         0.014600836
## [3,]                        16.42205                         0.016158051
## [4,]                        11.36277                        -0.023132143
## [5,]                        19.21853                         0.012669796
##      OPT_SURVMOD_LX_c.statistic_1-year OPT_SURVMOD_LX_c.statistic_2-year
## [1,]                      0.0009244992                      0.0003384089
## [2,]                      0.0161735932                      0.0148009260
## [3,]                      0.0255546730                      0.0216369479
## [4,]                     -0.0042126971                     -0.0032624571
## [5,]                      0.0162421948                      0.0122253280
##      OPT_SURVMOD_LX_Slope.Scale OPT_SURVMOD_LX_HL.5bins_0.5-year
## [1,]              -0.0001312901                         5.278888
## [2,]              -0.0002588117                         9.905615
## [3,]              -0.0005928405                        -3.926131
## [4,]              -0.0007639492                        -1.033271
## [5,]              -0.0002066949                         8.053928
##      OPT_SURVMOD_LX_HL.5bins_1-year OPT_SURVMOD_LX_HL.5bins_2-year
## [1,]                       8.379692                       2.553585
## [2,]                       2.135301                       5.087387
## [3,]                      -3.908318                      -1.776280
## [4,]                       1.727868                       3.066327
## [5,]                      -3.731871                      -3.647200
##      OPT_SURVMOD_LX_HL.10bins_0.5-year OPT_SURVMOD_LX_HL.10bins_1-year
## [1,]                        16.0496825                       14.632183
## [2,]                        19.8482130                       10.017212
## [3,]                        14.0459844                        1.626581
## [4,]                         0.1553928                        2.026166
## [5,]                        11.6627504                        2.578462
##      OPT_SURVMOD_LX_HL.10bins_2-year Errors
## [1,]                        4.303047      0
## [2,]                       16.390608      0
## [3,]                        1.104856      0
## [4,]                        5.686761      0
## [5,]                        4.122471      0

Step 5.5: correcting the apparent performance for optimism

After performing the bootstrapping, the apparent performance is corrected for the optimism. To do so, the following actions are performed for each statistic:

Extract the apparent performance for the statistic from the corresponding object
Extract the mean optimism over all bootstrap iterations from the corresponding object
Calculate the nearly unbiased optimism-corrected performance
For the GND: calculate the p-values corresponding to the outcome

## Model L1-Time

# Extract all relevant columns from the 'out.bootstrap' object and calculate the mean over all iterations
O.Model.L1.Time <- colMeans(out.bootstrap[, str_detect(string=colnames(out.bootstrap), pattern="OPT_SURVMOD_L1")]);

# C-statistic
Dapp.Model.L1.Time.c <- Dapp.Model.L1.Time[str_detect(string=names(Dapp.Model.L1.Time), pattern="c.statistic")];
O.Model.L1.Time.c <- O.Model.L1.Time[str_detect(string=names(O.Model.L1.Time), pattern="c.statistic")];
rbind(Dapp.Model.L1.Time.c,
      O.Model.L1.Time.c,
      corrected = Dapp.Model.L1.Time.c - O.Model.L1.Time.c);

##                      c.statistic_0.5-year c.statistic_1-year
## Dapp.Model.L1.Time.c           0.65917752         0.63480879
## O.Model.L1.Time.c              0.01175686         0.01173835
## corrected                      0.64742065         0.62307044
##                      c.statistic_2-year
## Dapp.Model.L1.Time.c         0.62872577
## O.Model.L1.Time.c            0.01208758
## corrected                    0.61663820

# Slope
Dapp.Model.L1.Time.slope <- Dapp.Model.L1.Time[str_detect(string=names(Dapp.Model.L1.Time), pattern="Slope")];
O.Model.L1.Time.slope <- O.Model.L1.Time[str_detect(string=names(O.Model.L1.Time), pattern="Slope")];
rbind(Dapp.Model.L1.Time.slope, O.Model.L1.Time.slope, corrected = Dapp.Model.L1.Time.slope - O.Model.L1.Time.slope);

##                            Slope.Scale
## Dapp.Model.L1.Time.slope  0.9958962063
## O.Model.L1.Time.slope    -0.0005810449
## corrected                 0.9964772512

# GND HL 
Dapp.Model.L1.Time.hl <- Dapp.Model.L1.Time[str_detect(string=names(Dapp.Model.L1.Time), pattern="HL")];
O.Model.L1.Time.hl <- O.Model.L1.Time[str_detect(string=names(O.Model.L1.Time), pattern="HL")];
rbind(Dapp.Model.L1.Time.hl, O.Model.L1.Time.hl, corrected = Dapp.Model.L1.Time.hl + O.Model.L1.Time.hl);

##                       HL.5bins_0.5-year HL.5bins_1-year HL.5bins_2-year
## Dapp.Model.L1.Time.hl          4.092919        4.336916       0.9944051
## O.Model.L1.Time.hl             4.637893        1.256029       2.9860626
## corrected                      8.730812        5.592946       3.9804677
##                       HL.10bins_0.5-year HL.10bins_1-year HL.10bins_2-year
## Dapp.Model.L1.Time.hl            7.40606        10.672663         5.588542
## O.Model.L1.Time.hl              11.21292         6.928614         7.157294
## corrected                       18.61898        17.601276        12.745836

1-c(pchisq(q=c(Dapp.Model.L1.Time.hl + O.Model.L1.Time.hl)[1:3], df=4), pchisq(q=c(Dapp.Model.L1.Time.hl + O.Model.L1.Time.hl)[4:6], df=9))

##  HL.5bins_0.5-year    HL.5bins_1-year    HL.5bins_2-year 
##         0.06819159         0.23167947         0.40865571 
## HL.10bins_0.5-year   HL.10bins_1-year   HL.10bins_2-year 
##         0.02863475         0.04009156         0.17444819

## Model L1-Event

# Extract all relevant columns from the 'out.bootstrap' object and calculate the mean over all iterations
O.Model.L1.Event <- colMeans(out.bootstrap[, str_detect(string=colnames(out.bootstrap), pattern="OPT_EVENTMOD_L1")]);

# C-statistic
Dapp.Model.L1.Event.c <- Dapp.Model.L1.Event[str_detect(string=names(Dapp.Model.L1.Event), pattern="C.Statistic")];
O.Model.L1.Event.c <- O.Model.L1.Event[str_detect(string=names(O.Model.L1.Event), pattern="C.Statistic")];
rbind(Dapp.Model.L1.Event.c, O.Model.L1.Event.c, corrected = Dapp.Model.L1.Event.c - O.Model.L1.Event.c);

##                       C.Statistic
## Dapp.Model.L1.Event.c  0.68958824
## O.Model.L1.Event.c     0.03202666
## corrected              0.65756159

# Slope
Dapp.Model.L1.Event.slope <- Dapp.Model.L1.Event[str_detect(string=names(Dapp.Model.L1.Event), pattern="Slope")];
O.Model.L1.Event.slope <- O.Model.L1.Event[str_detect(string=names(O.Model.L1.Event), pattern="Slope")];
rbind(Dapp.Model.L1.Event.slope, O.Model.L1.Event.slope, corrected = Dapp.Model.L1.Event.slope - O.Model.L1.Event.slope);

##                           Regression.Slope
## Dapp.Model.L1.Event.slope        0.9997424
## O.Model.L1.Event.slope           0.1796652
## corrected                        0.8200772

# Brier
Dapp.Model.L1.Event.brier <- Dapp.Model.L1.Event[str_detect(string=names(Dapp.Model.L1.Event), pattern="Brier")];
O.Model.L1.Event.brier <- O.Model.L1.Event[str_detect(string=names(O.Model.L1.Event), pattern="Brier")];
rbind(Dapp.Model.L1.Event.brier, O.Model.L1.Event.brier, corrected = Dapp.Model.L1.Event.brier - O.Model.L1.Event.brier);

##                            Brier.Score
## Dapp.Model.L1.Event.brier  0.086749976
## O.Model.L1.Event.brier    -0.002987089
## corrected                  0.089737065

## Model LX-Time

# Extract all relevant columns from the 'out.bootstrap' object and calculate the mean over all iterations
O.Model.LX.Time <- colMeans(out.bootstrap[, str_detect(string=colnames(out.bootstrap), pattern="OPT_SURVMOD_LX")]);

# C-statistic
Dapp.Model.LX.Time.c <- Dapp.Model.LX.Time[str_detect(string=names(Dapp.Model.LX.Time), pattern="c.statistic")];
O.Model.LX.Time.c <- O.Model.LX.Time[str_detect(string=names(O.Model.LX.Time), pattern="c.statistic")];
rbind(Dapp.Model.LX.Time.c, O.Model.LX.Time.c, corrected = Dapp.Model.LX.Time.c - O.Model.LX.Time.c);

##                      c.statistic_0.5-year c.statistic_1-year
## Dapp.Model.LX.Time.c           0.71740538         0.69335005
## O.Model.LX.Time.c              0.01264523         0.01364425
## corrected                      0.70476016         0.67970580
##                      c.statistic_2-year
## Dapp.Model.LX.Time.c         0.68444465
## O.Model.LX.Time.c            0.01450578
## corrected                    0.66993887

# Slope
Dapp.Model.LX.Time.slope <- Dapp.Model.LX.Time[str_detect(string=names(Dapp.Model.LX.Time), pattern="Slope")];
O.Model.LX.Time.slope <- O.Model.LX.Time[str_detect(string=names(O.Model.LX.Time), pattern="Slope")];
rbind(Dapp.Model.LX.Time.slope, O.Model.LX.Time.slope, corrected = Dapp.Model.LX.Time.slope - O.Model.LX.Time.slope);

##                            Slope.Scale
## Dapp.Model.LX.Time.slope  0.9971058368
## O.Model.LX.Time.slope    -0.0004747626
## corrected                 0.9975805993

# GND HL 
Dapp.Model.LX.Time.hl <- Dapp.Model.LX.Time[str_detect(string=names(Dapp.Model.LX.Time), pattern="HL")];
O.Model.LX.Time.hl <- O.Model.LX.Time[str_detect(string=names(O.Model.LX.Time), pattern="HL")];
rbind(Dapp.Model.LX.Time.hl, O.Model.LX.Time.hl, corrected = Dapp.Model.LX.Time.hl + O.Model.LX.Time.hl);

##                       HL.5bins_0.5-year HL.5bins_1-year HL.5bins_2-year
## Dapp.Model.LX.Time.hl          8.140954       1.5974814       4.9123090
## O.Model.LX.Time.hl             3.811204      -0.5324031       0.7049291
## corrected                     11.952158       1.0650783       5.6172380
##                       HL.10bins_0.5-year HL.10bins_1-year HL.10bins_2-year
## Dapp.Model.LX.Time.hl           15.90331         6.629234         7.669232
## O.Model.LX.Time.hl              10.96412         4.661620         5.729678
## corrected                       26.86743        11.290854        13.398910

1-c(pchisq(q=c(Dapp.Model.LX.Time.hl + O.Model.LX.Time.hl)[1:3], df=4), pchisq(q=c(Dapp.Model.LX.Time.hl + O.Model.LX.Time.hl)[4:6], df=9))

##  HL.5bins_0.5-year    HL.5bins_1-year    HL.5bins_2-year 
##        0.017710603        0.899772603        0.229614753 
## HL.10bins_0.5-year   HL.10bins_1-year   HL.10bins_2-year 
##        0.001471154        0.256298079        0.145371007

Step 6: Discrete Event Simulation

Now that all (survival) models have been fitted and correlated sets of model coefficients (parameters) are obtained by performing the boostrapping procedure, a probabilistic sensitivity analysis can be performed with the discrete event simulation. A base-case analysis can be performed by using the models that are estimated from the imputed datasets (Step 4: Multivariable Modeling of Survival). In order to validate the discrete event simulation, first some custom functions will be defined in Step 6.1, after which the validation is perfomed in Step 6.2. Hence, Step 6 exists of the following steps:

Defining functions used for validating the discrete event simulation
Performing the validation of the discrete event simulation

Step 6.1: defining functions used for validating the discrete event simulation

Several custom functions are defined to run the simulation and validate it. Please find an explanation of the functions in the R code below.

# The 'getSurvDataset.L1' and 'getSurvDataset.LX' functions do no longer need to also
# include the outcomes of the patients, as these are unknown for the simulation model
# since it needs to simulate them, so the shorter 'getData.L1' and 'getData.LX'
# functions are defined
getData.L1 <- function(data) {
  
  data <- data %>% 
    
    ## Single variables
    
    # Arm.ChemBev = "ChemBev",
    mutate(ChemBev = as.numeric(as.logical(L1.Arm.ChemBev))) %>%
    
    #  Age = "Age1 + Age2",
    mutate(Age1 = Age) %>%
    mutate(Age2 = Age1^2) %>%
    
    #  Hospital.Private = "Private.YN",
    mutate(Private.YN = as.numeric(as.logical(Hospital.Private))) %>%
    
    #  Stage.Diagnosis = "Stage4.YN",
    mutate(Stage4.YN = ifelse(as.character(Stage.Diagnosis)=="4", 1, 0)) %>%
    
    #  Primary.Site = "Site.Left + Site.Right + Site.Rectum",
    mutate(Site.LeftRectum = ifelse(Primary.Site %in% c("Left","Rectum"), 1, 0)) %>%
    mutate(Site.Right = ifelse(Primary.Site=="Right", 1, 0)) %>%
    
    #  Metastases.Number = "Metastases.2 + Metastases.3plus",
    mutate(Metastases.2 = ifelse(as.numeric(as.character(Metastases.Number))==2, 1, 0)) %>%
    mutate(Metastases.3plus = ifelse(as.numeric(as.character(Metastases.Number))>=3, 1, 0)) %>%
    
    #  Metastases.Liver = "Liver.YN",
    mutate(Liver.YN = as.numeric(as.logical(Metastases.Liver))) %>%
    
    #  Metastases.Peritoneum = "Peritoneum.YN",
    mutate(Peritoneum.YN = as.numeric(as.logical(Metastases.Peritoneum))) %>%
    
    #  L1.ECOG = "ECOG.1 + ECOG.23",
    mutate(ECOG.1 = ifelse(as.numeric(as.character(L1.ECOG))==1, 1, 0)) %>%
    mutate(ECOG.23 = ifelse(as.numeric(as.character(L1.ECOG))>=2, 1, 0)) %>%
    
    #  Charlson = "Charlson.1 + Charlson.2 + Charlson.3plus",
    mutate(Charlson.1plus = ifelse(as.numeric(as.character(Charlson))>=1, 1, 0)) %>%
    
    #  CEA.Level = "CEA1 + CEA2",
    mutate(CEA.200plus = ifelse(as.numeric(as.character(Metastases.Number))<=2, 1, 0) * ifelse(as.numeric(CEA.Level)>=200, 1, 0)) %>%
    
    #  RAS.Mutated = "MutantRAS.YN",
    mutate(MutantRAS.YN = as.numeric(as.logical(RAS.Mutated))) %>%
    
    #  BRAF.Mutated = "MutantBRAF.YN",
    mutate(MutantBRAF.YN = as.numeric(as.logical(BRAF.Mutated))) %>%
    
    
    ## Interactions
    mutate(Private.YN.Int = Private.YN * ChemBev) %>%
    mutate(Site.LeftRectum.Int = Site.LeftRectum * ChemBev) %>%
    mutate(Site.Right.Int = Site.Right * ChemBev) %>%
    mutate(ECOG.1.Int = ECOG.1 * ChemBev) %>%
    mutate(ECOG.23.Int = ECOG.23 * ChemBev) %>%
    mutate(CEA.200plus.Int = CEA.200plus * ChemBev)
  
  
  # Only the newly created columns need to be returned
  return(select(data, ChemBev:CEA.200plus.Int))
  
}
getData.LX <- function(data) {
  
  data <- data %>% 
    
    ## Single variables
    
    # L1.PFS.Time = "PFS.Time",
    mutate(PFS.Time = log(L1.PFS.Time)) %>%
    
    # Arm.ChemBev = "ChemBev",
    mutate(ChemBev = as.numeric(as.logical(L1.Arm.ChemBev))) %>%
    
    #  Age = "Age1 + Age2",
    mutate(Age1 = as.numeric(Age)-mean(Age)) %>%
    mutate(Age2 = as.numeric(Age1^2)) %>%
    
    #  Hospital.Private = "Private.YN",
    mutate(Private.YN = as.numeric(as.logical(Hospital.Private))) %>%
    
    #  Stage.Diagnosis = "Stage4.YN",
    mutate(Stage4.YN = ifelse(as.numeric(Stage.Diagnosis)==4, 1, 0)) %>%
    
    #  Primary.Site = "Site.Left + Site.Right + Site.Rectum",
    mutate(Site.LeftRectum = ifelse(Primary.Site %in% c("Left","Rectum"), 1, 0)) %>%
    mutate(Site.Right = ifelse(Primary.Site=="Right", 1, 0)) %>%
    
    #  Metastases.Peritoneum = "Peritoneum.YN",
    mutate(Peritoneum.YN = as.numeric(as.logical(Metastases.Peritoneum))) %>%
    
    #  L1.ECOG = "ECOG.1 + ECOG.23",
    mutate(ECOG.1 = ifelse(as.numeric(as.character(L1.ECOG))==1, 1, 0)) %>%
    mutate(ECOG.23 = ifelse(as.numeric(as.character(L1.ECOG))>=2, 1, 0)) %>%
    
    #  Charlson = "Charlson.3plus.YN",
    mutate(Charlson.1plus.YN = ifelse(as.numeric(Charlson)>=1, 1, 0)) %>%
    
    #  RAS.Mutated = "MutantRAS.YN",
    mutate(MutantRAS.YN = as.numeric(as.logical(RAS.Mutated))) %>%
    
    #  BRAF.Mutated = "MutantBRAF.YN",
    mutate(MutantBRAF.YN = as.numeric(as.logical(BRAF.Mutated))) %>%
    
    
    ## Interactions
    mutate(PFS.Time.Int = PFS.Time * ChemBev) %>%
    mutate(Age1.Int = Age1 * ChemBev) %>%
    mutate(Age2.Int = Age2 * ChemBev) %>%
    mutate(Stage4.YN.Int = Stage4.YN * ChemBev) %>%
    mutate(Peritoneum.YN.Int = Peritoneum.YN * ChemBev) %>%
    mutate(ECOG.1.Int = ECOG.1 * ChemBev) %>%
    mutate(ECOG.23.Int = ECOG.23 * ChemBev) %>%
    mutate(MutantRAS.YN.Int = MutantRAS.YN * ChemBev) %>%
    mutate(MutantBRAF.YN.Int = MutantBRAF.YN * ChemBev)
  
  # Only the newly created columns need to be returned
  return(select(data, PFS.Time:MutantBRAF.YN.Int))
  
}

# The 'predictSurvmod' and 'predictLogreg' functions are re-written so that they can
# predict the outcomes in different formats in stead of having different functions for
# different formats. The new functions are 'predSurv' and 'predLogr'.
predSurv <- function(model, newdata, type="random") {
  
  # Add a dummy column for the intercept
  newdata <- cbind(newdata, '(Intercept)'=1);
  
  # Ge the Weibull Shape parameter from the model
  Shape <- 1/exp(model["Log(Scale)"]);
  model <- model[-which(names(model) == "Log(Scale)")];
  
  # Subset a dataframe to enable matrix multiplication and check
  newdata.sub <- as.matrix(newdata[, names(model), drop=F]);
  
  # Get the scale parameter for the survival model
  Scale <- exp(newdata.sub %*% model);
  
  if(type=="random") {
  
    # Predict the times
    p <- matrix(runif(nrow(newdata.sub)), ncol=1);
    t <- qweibull(p=p, shape=Shape, scale=Scale);
    
    # Return predicted times
    return(as.numeric(t));
    
  } else if(type=="response") {
    
    # Return scale parameter
    return(Scale)
    
  }
  
  
}
predLogr <- function(model, newdata) {
  
  # Add a dummy column for the intercept
  newdata <- cbind(newdata, '(Intercept)'=1);
  
  # Subset a dataframe to enable matrix multiplication and check
  newdata.sub <- as.matrix(newdata[, names(model), drop=F]);
  
  # Calculate the probabilities
  p.tmp <- exp(newdata.sub %*% model);
  p <- p.tmp / (p.tmp + 1);
  
  # Return probabilities
  return(as.vector(p));
  
}

# The 'runDES' function performs one run with the DES based on a specific patient 
# (sub)population and a set of model parameters for each (survival) model
runDES <- function(patients, nsim, PFS.Time.Model, PFS.Event.Model, OS.Time.Model) {

  
  ## INITIALIZATION

  # Sample from patients
  df <- sample_n(patients, size=nsim, replace=T);

  
  ## PART 1: SIMULATE WHETHER PROGRESSION OR DEATH OCCURS

  # Put data in right format
  df.PFS <- getData.L1(data=df);

  # Predict time to progression or death
  df.PFS <- df.PFS %>% mutate(Time = predSurv(model=PFS.Time.Model, newdata=df.PFS))%>%
    mutate(Time1 = Time) %>%
    mutate(Time2 = Time^2) %>%
    mutate(Time1.Int = Time1 * ChemBev) %>%
    mutate(Time2.Int = Time2 * ChemBev);

  # Predict which event is associated with the time
  df.PFS <- df.PFS %>%
    mutate(PFS.Prob = predLogr(model=PFS.Event.Model, newdata=df.PFS)) %>%
    mutate(L1.PFS.Event = ifelse(runif(nrow(df.PFS)) < PFS.Prob, "Death", "Progression"));

  # Save to original patient object
  df.PFS <- mutate(df.PFS, L1.PFS.Time = Time);
  df.PFS <- select(df.PFS, L1.PFS.Time, L1.PFS.Event);
  df <- cbind(df, df.PFS);


  ## PART 2: SIMULATE TIME TILL DEATH FOR THOSE WHO PROGRESSED

  # Put data in right format
  df.OS <- getData.LX(data=df);

  # Predict time to death
  df <- df %>% mutate(OS.Time = ifelse(L1.PFS.Event=="Death", NA, predSurv(model=OS.Time.Model, newdata=df.OS)))

  # Rename variables
  df <- df %>%
    mutate(OS.Time = as.numeric(OS.Time)) %>%
    mutate(PFS.Time = L1.PFS.Time) %>%
    mutate(PFS.Event = L1.PFS.Event) %>%
    mutate(PFS = PFS.Time) %>%
    mutate(OS = PFS.Time + ifelse(is.na(OS.Time), 0, OS.Time));

  # Return simulation data.frame
  return(df)

}

# The 'predictKM' extracts a survival probabilty at a specific timepoint 't' from a fitted Kaplan-Meier
# object 'km' using a stepwise function
predictKM <- function(km, t) {
  
  km.fun <- stepfun(km$time, c(1, km$surv));
  
  pred <- km.fun(t);
  
  return(pred)
}

# The 'validateDES' function performs a probabilistic sensitivity analysis for the DES and a specific
# patient (sub)population, and plots the validation plots and median time-to-event outcomes
validateDES <- function(output=NULL, patients=NULL, patients.times=NULL, nsim=NULL, nrun=NULL, models=NULL, timemax=c(2000,2500,2000), timestep=10, freecores=0, seed=20181212) {
  
  # If the result of a previous 'validateDES' call is provide, the probabilistic sensitivity analysis 
  # is not performed again, but only the plots etc. are generated.
  runsim <- ifelse(is.null(output), T, F);
  
  # The 'timemax' and 'timestep' inputs are used to determine for what time horizon the plots should ne generated
  times <- list(seq(from=0, to=timemax[1], by=timestep),
                seq(from=0, to=timemax[2], by=timestep),
                seq(from=0, to=timemax[3], by=timestep));
  
  # Only if a simulation is to be performed, this will be done
  if(runsim) {
    
    # Check whether sufficient sets of parameters are provided to perform the requested number of runs
    if(nrun>nrow(models)) stop(str_c("nrun can be max ",nrow(models), "!"));
    
    # Set the random number seed for reproduction
    set.seed(seed);
    
    # Set up a multithreaded computation cluster
    library(parallel);
    library(doSNOW);
    cl <- makeCluster(detectCores()-freecores);
    clusterExport(cl, c("times", "patients", "patients.times",
                        "runDES", "predSurv", "predLogr",
                        "getData.L1", "getData.LX",
                        "predictKM"));
    
    # Perform the simulation
    sim <- parLapply(cl, 1:nrun, function(i.run) {
      
      # Load the required packages
      library(dplyr);
      library(tidyr);
      library(stringr);
      library(survival);
      
      # Obtain the coefficients for the (survival) models
      coefs <- models[i.run, ];
      
      # Model L1-Time
      survmod.L1 <- coefs[str_detect(names(coefs), "SURVMOD_L1_")];
      names(survmod.L1) <- str_remove(names(survmod.L1), "SURVMOD_L1_");

      # Model L1-Event
      eventmod.L1 <- coefs[str_detect(names(coefs), "EVENTMOD_L1_")];
      names(eventmod.L1) <- str_remove(names(eventmod.L1), "EVENTMOD_L1_");
      
      # Model LX-Time
      survmod.LX <- coefs[str_detect(names(coefs), "SURVMOD_LX_")];
      names(survmod.LX) <- str_remove(names(survmod.LX), "SURVMOD_LX_");
      
      # Simulation for the ChemOnly (control) strategy
      sim.chemonly <- runDES(patients = filter(patients, L1.Arm.ChemBev==F), nsim = nsim, PFS.Time.Model = survmod.L1, PFS.Event.Model = eventmod.L1, OS.Time.Model = survmod.LX);
      
      # Simulation for the ChemBev (experimental) strategy
      sim.chembev <- runDES(patients = filter(patients, L1.Arm.ChemBev==T), nsim = nsim, PFS.Time.Model = survmod.L1, PFS.Event.Model = eventmod.L1, OS.Time.Model = survmod.LX);
      
      # Fit Kaplan-Meier objects for all time-to-events and both strategies to generate the plots at a later stage
      km.L1.PFS.sim.chemonly <- survfit(Surv(PFS.Time, PFS.Event=="Progression") ~ 1, data=sim.chemonly, type="kaplan-meier");
      km.L1.PFS.sim.chembev <- survfit(Surv(PFS.Time, PFS.Event=="Progression") ~ 1, data=sim.chembev, type="kaplan-meier");
      km.L1.OS.sim.chemonly <- survfit(Surv(PFS.Time, PFS.Event=="Death") ~ 1, data=sim.chemonly, type="kaplan-meier");
      km.L1.OS.sim.chembev <- survfit(Surv(PFS.Time, PFS.Event=="Death") ~ 1, data=sim.chembev, type="kaplan-meier");
      km.LX.sim.chemonly <- survfit(Surv(OS.Time) ~ 1, data=sim.chemonly, type="kaplan-meier");
      km.LX.sim.chembev <- survfit(Surv(OS.Time) ~ 1, data=sim.chembev, type="kaplan-meier");
      
      # Use the custom 'predictKM' function to get the curves for the defined timeframe
      plot.L1.PFS.sim.chemonly <- predictKM(km = km.L1.PFS.sim.chemonly, t = times[[1]]);
      plot.L1.PFS.sim.chembev <- predictKM(km = km.L1.PFS.sim.chembev, t = times[[1]]);
      plot.L1.OS.sim.chemonly <- predictKM(km = km.L1.OS.sim.chemonly, t = times[[2]]);
      plot.L1.OS.sim.chembev <- predictKM(km = km.L1.OS.sim.chembev, t = times[[2]]);
      plot.LX.sim.chemonly <- predictKM(km = km.LX.sim.chemonly, t = times[[3]]);
      plot.LX.sim.chembev <- predictKM(km = km.LX.sim.chembev, t = times[[3]]);
      
      # Extract the median time-to-events
      median.L1.PFS.sim.chemonly <- summary(km.L1.PFS.sim.chemonly)$table['median'];
      median.L1.PFS.sim.chembev <- summary(km.L1.PFS.sim.chembev)$table['median'];
      median.L1.OS.sim.chemonly <- summary(km.L1.OS.sim.chemonly)$table['median'];
      median.L1.OS.sim.chembev <- summary(km.L1.OS.sim.chembev)$table['median'];
      median.LX.sim.chemonly <- summary(km.LX.sim.chemonly)$table['median'];
      median.LX.sim.chembev <- summary(km.LX.sim.chembev)$table['median'];
      
      # Extract the event incidences from the simulation objects
      prob.L1.death.chemonly <- mean(sim.chemonly$PFS.Event=="Death");
      prob.L1.death.chembev <- mean(sim.chembev$PFS.Event=="Death");
      
      # Return all objects in a list because of the different formats
      list(km = list(km.L1.PFS.sim.chemonly = plot.L1.PFS.sim.chemonly, km.L1.PFS.sim.chembev = plot.L1.PFS.sim.chembev, km.L1.OS.sim.chemonly = plot.L1.OS.sim.chemonly, km.L1.OS.sim.chembev = plot.L1.OS.sim.chembev, km.LX.sim.chemonly = plot.LX.sim.chemonly, km.LX.sim.chembev = plot.LX.sim.chembev), median = c(median.L1.PFS.sim.chemonly = median.L1.PFS.sim.chemonly, median.L1.PFS.sim.chembev = median.L1.PFS.sim.chembev, median.L1.OS.sim.chemonly = median.L1.OS.sim.chemonly, median.L1.OS.sim.chembev = median.L1.OS.sim.chembev, median.LX.sim.chemonly = median.LX.sim.chemonly, median.LX.sim.chembev = median.LX.sim.chembev), p = c(prob.L1.death.chemonly = prob.L1.death.chemonly, prob.L1.death.chembev = prob.L1.death.chembev))
      
    })
    
    # Stop the computation cluster
    stopCluster(cl);
  
  # If no simulation is to be performed because a previous outcome object is provide,
  # the correct objects need to be extracted from the object.
  } else {
    
    # Kaplan-Meier curves etc.
    sim <- output$sim;
    
    # Observed data
    patients <- output$patients;
    patients.times <- output$patients.times;
    
  }
  
  # Transform all Kaplan-Meier curves into one matrix
  km.L1.PFS.sim.chemonly <- sapply(sim, function(s) s$km$km.L1.PFS.sim.chemonly);
  km.L1.PFS.sim.chembev <- sapply(sim, function(s) s$km$km.L1.PFS.sim.chembev);
  km.L1.OS.sim.chemonly <- sapply(sim, function(s) s$km$km.L1.OS.sim.chemonly);
  km.L1.OS.sim.chembev <- sapply(sim, function(s) s$km$km.L1.OS.sim.chembev);
  km.LX.sim.chemonly <- sapply(sim, function(s) s$km$km.LX.sim.chemonly);
  km.LX.sim.chembev <- sapply(sim, function(s) s$km$km.LX.sim.chembev);

  # Calculate the mean curves and the corresponding 95% Confidence Intervals
  km.L1.PFS.sim.chemonly.curves <- apply(km.L1.PFS.sim.chemonly, 1, function(row) c(low = quantile(row, 0.025), mean = mean(row), high = quantile(row, 0.975)))
  km.L1.PFS.sim.chembev.curves <- apply(km.L1.PFS.sim.chembev, 1, function(row) c(low = quantile(row, 0.025), mean = mean(row), high = quantile(row, 0.975)))
  km.L1.OS.sim.chemonly.curves <- apply(km.L1.OS.sim.chemonly, 1, function(row) c(low = quantile(row, 0.025), mean = mean(row), high = quantile(row, 0.975)))
  km.L1.OS.sim.chembev.curves <- apply(km.L1.OS.sim.chembev, 1, function(row) c(low = quantile(row, 0.025), mean = mean(row), high = quantile(row, 0.975)))
  km.LX.sim.chemonly.curves <- apply(km.LX.sim.chemonly, 1, function(row) c(low = quantile(row, 0.025), mean = mean(row), high = quantile(row, 0.975)))
  km.LX.sim.chembev.curves <- apply(km.LX.sim.chembev, 1, function(row) c(low = quantile(row, 0.025), mean = mean(row), high = quantile(row, 0.975)))
 
  # Kaplan-Meier curves for the observed data from the 'patients' input object
  km.L1.PFS.obs.chemonly <- survfit(Surv(L1.PFS.Time, L1.PFS.Events=="Progression")~1, data=filter(patients.times, L1.Arm.ChemBev==F), type="kaplan-meier");
  km.L1.PFS.obs.chembev <- survfit(Surv(L1.PFS.Time, L1.PFS.Events=="Progression")~1, data=filter(patients.times, L1.Arm.ChemBev==T), type="kaplan-meier");
  km.L1.OS.obs.chemonly <- survfit(Surv(L1.PFS.Time, L1.PFS.Events=="Death")~1, data=filter(patients.times, L1.Arm.ChemBev==F), type="kaplan-meier");
  km.L1.OS.obs.chembev <- survfit(Surv(L1.PFS.Time, L1.PFS.Events=="Death")~1, data=filter(patients.times, L1.Arm.ChemBev==T), type="kaplan-meier");
  km.LX.obs.chemonly <- survfit(Surv(LX.OS.Time, LX.OS.Events=="Death")~1, data=filter(patients.times, L1.Arm.ChemBev==F), type="kaplan-meier");
  km.LX.obs.chembev <- survfit(Surv(LX.OS.Time, LX.OS.Events=="Death")~1, data=filter(patients.times, L1.Arm.ChemBev==T), type="kaplan-meier");
  
  # Plots
  par(mfrow=c(1,3));
  lwds <- c(2,2,2,2);
  cols <- c("black", "red", "green", "blue");
  plot(main=str_c("First-Line Progression (Event is Progression)\n", "(ChemOnly=",sum(km.L1.PFS.obs.chemonly$n.event),", ChemBev=",sum(km.L1.PFS.obs.chembev$n.event),")"), xlab="Time to Progression (Days)", km.L1.PFS.obs.chemonly, col=cols[1], lwd=1, conf.int=T, las=1, xlim=c(0,timemax[[1]]));
  lines(km.L1.PFS.obs.chemonly, col=cols[1], lwd=lwds[1], conf.int=F);
  lines(km.L1.PFS.obs.chembev, col=cols[2], lwd=1, conf.int=T);
  lines(km.L1.PFS.obs.chembev, col=cols[2], lwd=lwds[2], conf.int=F);
  lines(x=times[[1]], y=km.L1.PFS.sim.chemonly.curves[2,], col=cols[3], lwd=lwds[3]);
  lines(x=times[[1]], y=km.L1.PFS.sim.chemonly.curves[1,], col=cols[3], lwd=1, lty=2);
  lines(x=times[[1]], y=km.L1.PFS.sim.chemonly.curves[3,], col=cols[3], lwd=1, lty=2);
  lines(x=times[[1]], y=km.L1.PFS.sim.chembev.curves[2,], col=cols[4], lwd=lwds[4]);
  lines(x=times[[1]], y=km.L1.PFS.sim.chembev.curves[1,], col=cols[4], lwd=1, lty=2);
  lines(x=times[[1]], y=km.L1.PFS.sim.chembev.curves[3,], col=cols[4], lwd=1, lty=2);
  legend("topright", c("ChemOnly (obs)", "ChemBev (obs)", "ChemOnly (sim)", "ChemBev (sim)"), col=cols, lwd=lwds, lty=1, bty="n")
  
  plot(main=str_c("First-Line Progression (Event is Death)\n", "(ChemOnly=",sum(km.L1.OS.obs.chemonly$n.event),", ChemBev=",sum(km.L1.OS.obs.chembev$n.event),")"), xlab="Time to Death (Days)", km.L1.OS.obs.chemonly, col=cols[1], lwd=1, conf.int=T, las=1, xlim=c(0,timemax[[2]]));
  lines(km.L1.OS.obs.chemonly, col=cols[1], lwd=lwds[1], conf.int=F);
  lines(km.L1.OS.obs.chembev, col=cols[2], lwd=1, conf.int=T);
  lines(km.L1.OS.obs.chembev, col=cols[2], lwd=lwds[2], conf.int=F);
  lines(x=times[[2]], y=km.L1.OS.sim.chemonly.curves[2,], col=cols[3], lwd=lwds[3]);
  lines(x=times[[2]], y=km.L1.OS.sim.chemonly.curves[1,], col=cols[3], lwd=1, lty=2);
  lines(x=times[[2]], y=km.L1.OS.sim.chemonly.curves[3,], col=cols[3], lwd=1, lty=2);
  lines(x=times[[2]], y=km.L1.OS.sim.chembev.curves[2,], col=cols[4], lwd=lwds[4]);
  lines(x=times[[2]], y=km.L1.OS.sim.chembev.curves[1,], col=cols[4], lwd=1, lty=2);
  lines(x=times[[2]], y=km.L1.OS.sim.chembev.curves[3,], col=cols[4], lwd=1, lty=2);
  legend("topright", c("ChemOnly (obs)", "ChemBev (obs)", "ChemOnly (sim)", "ChemBev (sim)"), col=cols, lwd=lwds, lty=1, bty="n")
  
  plot(main=str_c("Death after Progression\n", "(ChemOnly=",sum(km.LX.obs.chemonly$n.event),", ChemBev=",sum(km.LX.obs.chembev$n.event),")"), xlab="Time to Death (Days)", km.LX.obs.chemonly, col=cols[1], lwd=1, conf.int=T, las=1, xlim=c(0,timemax[[3]]));
  lines(km.LX.obs.chemonly, col=cols[1], lwd=lwds[1], conf.int=F);
  lines(km.LX.obs.chembev, col=cols[2], lwd=1, conf.int=T);
  lines(km.LX.obs.chembev, col=cols[2], lwd=lwds[2], conf.int=F);
  lines(x=times[[3]], y=km.LX.sim.chemonly.curves[2,], col=cols[3], lwd=lwds[3]);
  lines(x=times[[3]], y=km.LX.sim.chemonly.curves[1,], col=cols[3], lwd=1, lty=2);
  lines(x=times[[3]], y=km.LX.sim.chemonly.curves[3,], col=cols[3], lwd=1, lty=2);
  lines(x=times[[3]], y=km.LX.sim.chembev.curves[2,], col=cols[4], lwd=lwds[4]);
  lines(x=times[[3]], y=km.LX.sim.chembev.curves[1,], col=cols[4], lwd=1, lty=2);
  lines(x=times[[3]], y=km.LX.sim.chembev.curves[3,], col=cols[4], lwd=1, lty=2);
  legend("topright", c("ChemOnly (obs)", "ChemBev (obs)", "ChemOnly (sim)", "ChemBev (sim)"), col=cols, lwd=lwds, lty=1, bty="n")
  
  # Extract Median
  medians.sim <- sapply(sim, function(s) s$median);
  medians.sim.low <- apply(medians.sim, 1, quantile, probs=0.025, na.rm=T);
  medians.sim.mean <- apply(medians.sim, 1, mean, na.rm=T);
  medians.sim.high <- apply(medians.sim, 1, quantile, probs=0.975, na.rm=T);
  
  print(cbind(medians.sim.low, medians.sim.mean, medians.sim.high))
  print(km.L1.PFS.obs.chemonly);
  print(km.L1.PFS.obs.chembev);
  print(km.L1.OS.obs.chemonly);
  print(km.L1.OS.obs.chembev);
  print(km.LX.obs.chemonly);
  print(km.LX.obs.chembev);
  
  # Probability of death
  prob.sim <- sapply(sim, function(s) s$p);
  prob.sim.low <- apply(prob.sim, 1, quantile, probs=0.025);
  prob.sim.mean <- apply(prob.sim, 1, mean);
  prob.sim.high <- apply(prob.sim, 1, quantile, probs=0.975);
  print(cbind(prob.sim.low, prob.sim.mean, prob.sim.high))
  
  count.obs <- table(patients.times$L1.PFS.Events, patients.times$L1.Arm.ChemBev);
  prob.obs <- apply(count.obs, 2, function(col) col/sum(col));
  print(prob.obs)
  
  # Only return the results if a new simulation is performed
  if(runsim) return(list(sim=sim, patients=patients, patients.times=patients.times));
  
}

Step 6.2: performing the validation of the discrete event simulation

After the custom functions have been defined, the validation of the discrete event simulation can be performed. Unfortunately, the numerical results of the validation cannot be presented in a nicer way because they are printed from without the ‘validateDES’ function.

# The sets of correlated model parameters need to be extracted from the 'out.bootstrap' object
psa.models <- out.bootstrap[, !str_detect(string=colnames(out.bootstrap), pattern="BOOT")];
psa.models <- psa.models[, !str_detect(string=colnames(psa.models), pattern="ORG")];
psa.models <- psa.models[, !str_detect(string=colnames(psa.models), pattern="OPT")];

# Only select the columns that are required to perform a simulation
dataset <- datasets[[1]];
patients <- dataset %>% 
  select(L1.Arm.ChemBev, Age, Hospital.Private, Primary.Site, Stage.Diagnosis, Metastases.Number, Metastases.Liver, Metastases.Peritoneum, L1.ECOG, Charlson, CEA.Level, RAS.Mutated, BRAF.Mutated);

# Additionally provide the observed time-to-events for validation purposes
patients.times <- datasets[[1]] %>% 
  select(L1.PFS.Time, L1.PFS.Events, LX.OS.Time, LX.OS.Events, L1.Arm.ChemBev, Age, Hospital.Private, Primary.Site, Stage.Diagnosis, Metastases.Number, Metastases.Liver, Metastases.Peritoneum, L1.ECOG, Charlson, CEA.Level, RAS.Mutated, BRAF.Mutated);

# Call the 'validateDES' function to perform the simulations and generate all plots
png(file="Validation DES 20190201.png", width=12, height=6, units="in", res=96);
val.cohort <- validateDES(patients = patients, patients.times = patients.times, nsim = 20000, nrun = 10000,  models = psa.models);

##                                   medians.sim.low medians.sim.mean
## median.L1.PFS.sim.chemonly.median        171.4950         198.8957
## median.L1.PFS.sim.chembev.median         289.1615         307.8619
## median.L1.OS.sim.chemonly.median         724.7450        1606.9122
## median.L1.OS.sim.chembev.median         1013.3768        1171.1054
## median.LX.sim.chemonly.median            161.1872         199.6297
## median.LX.sim.chembev.median             242.6702         266.4501
##                                   medians.sim.high
## median.L1.PFS.sim.chemonly.median         228.2406
## median.L1.PFS.sim.chembev.median          327.7214
## median.L1.OS.sim.chemonly.median         2623.2981
## median.L1.OS.sim.chembev.median          1392.2564
## median.LX.sim.chemonly.median             241.8250
## median.LX.sim.chembev.median              291.9795
## Call: survfit(formula = Surv(L1.PFS.Time, L1.PFS.Events == "Progression") ~ 
##     1, data = filter(patients.times, L1.Arm.ChemBev == F), type = "kaplan-meier")
## 
##       n  events  median 0.95LCL 0.95UCL 
##     191     157     171     155     210 
## Call: survfit(formula = Surv(L1.PFS.Time, L1.PFS.Events == "Progression") ~ 
##     1, data = filter(patients.times, L1.Arm.ChemBev == T), type = "kaplan-meier")
## 
##       n  events  median 0.95LCL 0.95UCL 
##     676     571     304     280     329 
## Call: survfit(formula = Surv(L1.PFS.Time, L1.PFS.Events == "Death") ~ 
##     1, data = filter(patients.times, L1.Arm.ChemBev == F), type = "kaplan-meier")
## 
##       n  events  median 0.95LCL 0.95UCL 
##     191      21      NA      NA      NA 
## Call: survfit(formula = Surv(L1.PFS.Time, L1.PFS.Events == "Death") ~ 
##     1, data = filter(patients.times, L1.Arm.ChemBev == T), type = "kaplan-meier")
## 
##       n  events  median 0.95LCL 0.95UCL 
##     676      62    1055     902      NA 
## Call: survfit(formula = Surv(LX.OS.Time, LX.OS.Events == "Death") ~ 
##     1, data = filter(patients.times, L1.Arm.ChemBev == F), type = "kaplan-meier")
## 
##    34 observations deleted due to missingness 
##       n  events  median 0.95LCL 0.95UCL 
##     157     141     193     153     243 
## Call: survfit(formula = Surv(LX.OS.Time, LX.OS.Events == "Death") ~ 
##     1, data = filter(patients.times, L1.Arm.ChemBev == T), type = "kaplan-meier")
## 
##    105 observations deleted due to missingness 
##       n  events  median 0.95LCL 0.95UCL 
##     571     467     270     237     308 
##                        prob.sim.low prob.sim.mean prob.sim.high
## prob.L1.death.chemonly   0.07064875     0.1163047       0.16565
## prob.L1.death.chembev    0.07859750     0.1027043       0.12870
##              
##                    FALSE       TRUE
##   Censored    0.06806283 0.06360947
##   Death       0.10994764 0.09171598
##   Progression 0.82198953 0.84467456

dev.off();

## png 
##   2

Step 7: Exploratory Analyses

After the discrete event simulation has been validated, an exploratory analysis is performed in which patients’ treatment is targeted using personalized predictions of their first-line progression-free survival using Model L1-Time. To achieve this, Step 7 exists out of the following four steps:

Defining functions that are used to run the discrete event simulation
Performing the personalized treatment targeting
Performing the simulations based on the personalized treatment decisions
Exploring the results of the simulations

Step 7.1: defining functions that are used to run the discrete event simulation

Several custom functions are defined to run the simulation. Also, some earlier defined functions are re-written as they do not need to be as extensive as they needed to be for the data analysis. Please find an explanation of the functions in the R code below.

# The 'runPSA' function is a wrapper function that repeatedly calls the 'runDES' function
# for a patient (sub)population and different sets of parameter values
runPSA <- function(patients=NULL, nsim=NULL, nrun=NULL, models=NULL, freecores=0, seed=20181212) {
  
  # Check whether sufficient sets of parameters are provided to perform the requested number of runs
  if(nrun>nrow(models)) stop(str_c("nrun can be max ",nrow(models), "!"));
  
  # Set the random sumber seed for reproduction
  set.seed(seed);
  
  # Multithreading
  library(parallel);
  library(doSNOW);
  cl <- makeCluster(detectCores()-freecores);
  clusterExport(cl, c("patients",  
                      "runDES", "predSurv", "predLogr",
                      "getData.L1", "getData.LX"));
    
  # Perform the probabilistic sensitivity analysis
  psa.out <- parSapply(cl, 1:nrun, function(i.run) {
    
    # Load packages
    library(dplyr);
    library(tidyr);
    library(stringr);
    library(survival);
    
    # Extract the model coefficients from the 'models' object with all sets
    coefs <- models[i.run, ];
    
    # Model L1-Time
    survmod.L1 <- coefs[str_detect(names(coefs), "SURVMOD_L1_")];
    names(survmod.L1) <- str_remove(names(survmod.L1), "SURVMOD_L1_");
    
    # Model L1-Event
    eventmod.L1 <- coefs[str_detect(names(coefs), "EVENTMOD_L1_")];
    names(eventmod.L1) <- str_remove(names(eventmod.L1), "EVENTMOD_L1_");
    
    # Model LX-Time
    survmod.LX <- coefs[str_detect(names(coefs), "SURVMOD_LX_")];
    names(survmod.LX) <- str_remove(names(survmod.LX), "SURVMOD_LX_");
    
    # Run the DES
    sim <- runDES(patients = patients,
                  nsim = nsim,
                  PFS.Time.Model = survmod.L1,
                  PFS.Event.Model = eventmod.L1,
                  OS.Time.Model = survmod.LX);

    # For now we are only interested in the median first-line progression-free survival
    median(sim$PFS.Time)
    
  })
  
  # Stop the multithreading
  stopCluster(cl);
  
  # Return the median first-line progression-free survival outcomes
  return(psa.out)
  
}

Step 7.2: performing the personalized treatment targeting

To perform a simulation to estimate the impact of personalized treatment decisions, first it needs to be determined what is expected to be the best strategy (ChemOnly or ChemBev) for each patient individually. To make these predictions, Model L1-Time is used to make a prediction whether first-line progression-free survival is expected to be higher for the ChemOnly strategy or for the ChemBev strategy.

# Transform the data to the required format
patients.chemonly <- getData.L1(data = patients %>% mutate(L1.Arm.ChemBev=F));
patients.chembev <- getData.L1(data = patients %>% mutate(L1.Arm.ChemBev=T));

# Determine the Weibull scale parameter for each patient individually (a higher value for the scale 
# parameter indicates a higher time-to-event, i.e. better first-line progression-free survival)
response.chemonly <- predSurv(model = Model.L1.Time$coef, newdata = patients.chemonly, type = "response");
response.chembev <- predSurv(model = Model.L1.Time$coef, newdata = patients.chembev, type = "response");

# Set the treatment strategy indicator 'L1.Arm.ChemBev' to the treatment strategy associated with 
# the highest expected first-line progression-free survival
patients <- patients %>%
  mutate(L1.Arm.ChemBev = if_else(response.chembev > response.chemonly, T, F));

# Observe how the treatment were provided as observed in the data
table(dataset$L1.Arm.ChemBev);

## 
## FALSE  TRUE 
##   191   676

# # Observe how the treatment are provided according to the personalized treatment decisions
table(patients$L1.Arm.ChemBev);

## 
## FALSE  TRUE 
##    74   793

# Number of different treatment decisions
n.differenttreatment <- sum(ifelse(dataset$L1.Arm.ChemBev != patients$L1.Arm.ChemBev, 1, 0));
n.differenttreatment;

## [1] 219

n.differenttreatment / nrow(dataset);

## [1] 0.2525952

# Number who swithed from ChemOnly to ChemBev
n.ChemOnly.ChemBev <- sum(ifelse(!dataset$L1.Arm.ChemBev & patients$L1.Arm.ChemBev, 1, 0));
n.ChemOnly.ChemBev;

## [1] 168

n.ChemOnly.ChemBev / n.differenttreatment;

## [1] 0.7671233

# Number who swithed from ChemBev to ChemOnly
n.ChemBev.ChemOnly <- sum(ifelse(dataset$L1.Arm.ChemBev & !patients$L1.Arm.ChemBev, 1, 0));
n.ChemBev.ChemOnly;

## [1] 51

n.ChemBev.ChemOnly / n.differenttreatment;

## [1] 0.2328767

Step 7.3: performing the simulations based on the personalized treatment decisions

After determining the treatment each patient should receive based on the personalized treatment decisions informed by the estimated first-line progression-free survival, a probabilistic sensitivity analysis can be performed to estimate the impact.

# Perform a simulation for the whole patient population
sim.out <- runPSA(patients = patients, nsim = 20000, nrun = 10000, models = psa.models, freecores = 0, seed = 20190104)

# Determine which patients switched treatment
switched <- which(dataset$L1.Arm.ChemBev != patients$L1.Arm.ChemBev);
patients <- patients[switched, , drop=F];

# Perform a simulation only for the 219 patients who switched treatment
sim.out.switched <- runPSA(patients = patients, nsim = 20000, nrun = 10000, models = psa.models, freecores = 0, seed = 20190104)

Step 7.4: exploring the results of the simulations

The final step of this study is to assess the impact of this exploratory analysis based on personalized treatment decisions.

## For the complete patient population

# Mean median PFS and upper and lower bound over the probabilistic sensitivity analysis
mean(sim.out);

## [1] 288.0516

quantile(sim.out, probs = c(0.025, 0.975));

##     2.5%    97.5% 
## 270.2137 306.6571

# Median PFS as observed from the data
survfit(formula = Surv(L1.PFS.Time, L1.PFS.Event) ~ 1, data = dataset, type = "kaplan-meier");

## Call: survfit(formula = Surv(L1.PFS.Time, L1.PFS.Event) ~ 1, data = dataset, 
##     type = "kaplan-meier")
## 
##       n  events  median 0.95LCL 0.95UCL 
##     867     811     265     248     280

## For the patients who switched treatment based on the personalized treatment decision

# Mean median PFS and upper and lower bound over the probabilistic sensitivity analysis
mean(sim.out.switched);

## [1] 269.4719

quantile(sim.out.switched, probs = c(0.025, 0.975));

##     2.5%    97.5% 
## 246.3946 293.4646

# Median PFS as observed from the data
survfit(formula = Surv(L1.PFS.Time, L1.PFS.Event) ~ 1, data = dataset[switched, ], type = "kaplan-meier");

## Call: survfit(formula = Surv(L1.PFS.Time, L1.PFS.Event) ~ 1, data = dataset[switched, 
##     ], type = "kaplan-meier")
## 
##       n  events  median 0.95LCL 0.95UCL 
##     219     205     175     156     199

Simulating progression-free survival and overall survival for first-line doublet chemotherapy with or without bevacizumab in metastatic colorectal cancer patients based on real world registry data

K. Degeling, PhD

17 July 2020