This document shows all code, intermediate outcomes, and final outcomes belonging to the manuscript “Simulating Progression-Free and Overall Survival for First-Line Doublet Chemotherapy With or Without Bevacizumab in Metastatic Colorectal Cancer Patients Based on Real-World Registry Data” by K Degeling et al. in PharmacoEconomics (available here). Using real world data from the Treatment of Recurrent and Advanced Colorectal Cancer registry, two parametric survival models and a logistic regression model are estimated to populate a discrete event simulation model. Please read the manuscript before reading through this document, as the manuscript contains information essential for understanding decisions made in the code that is included in this document.

Please appropriately cite the work above if you use any code provided in this document for your own analyses.

This document contains the following sections:

Please note that the number of bootstrap iterations and simulation runs performed is set to a lower number for illustration purposes due to the required runtime to perform the 10,000 iterations/simulations that are performed for the final analyses presented in the manuscript. Hence, results and plots in this document may be different from those reported in the manuscript. Results presented in the manuscript should be considered as the final and definitive results at all times.

Initialization

To intialize the data analysis, several actions are performed. Most importantly, the data are loaded from an encrypted hard drive, as are the input matrices defined by clinicians to recode the treatments as observed in the unprocessed data.

# Clear Workspace and plots, and clear console
rm(list=ls()); gc(); if(!is.null(dev.list())) dev.off(); cat("\014");

# Load packages
library(dplyr);
library(readxl);
library(stringr);

# Set the locale so R handles dates as English regardless of location
Sys.setlocale("LC_TIME", "English");

# Load the raw TRACC data for the patient characteristics and first line treatment
df <- read_excel("G:/Files/mCRC Targeting Melbourne/CRC MODELING DATA 20181105.xlsx", 
                 col_types = c("numeric", "date", "text", "text",
                               "text", "text", "text", "date", "date", 
                               "date", "date", "numeric", "text", 
                               "text", "text", "text", "text", "text", 
                               "text", "text", "text", "text", "text", 
                               "text", "text", "text", "text", "text", 
                               "text", "text", "numeric", "text", 
                               "text", "text", "text", "text", "text", 
                               "text", "text", "text", "text", "text", 
                               "text", "text", "text", "text", "text", 
                               "text", "text", "text", "numeric", 
                               "text", "text", "text", "text", "date", 
                               "text", "text", "text", "date", "text", 
                               "text", "text", "text", "text", "date",
                               "numeric","text",
                               "text", "text", "text", "text", "date", 
                               "date", "date", "text", "text", "text", 
                               "text", "text", "text", "text", "numeric", 
                               "text", "numeric", "text", "numeric", 
                               "text", "text", "text", "numeric", 
                               "text", "numeric", "text", "numeric", 
                               "text", "numeric", "text", "numeric", 
                               "text", "numeric", "text", "numeric", 
                               "text", "numeric", "text", "numeric", 
                               "text", "numeric", "text", "text", 
                               "numeric", "text", "text"));

# Replace spaces in column names with dots/points to handle columns more easily
colnames(df) <- str_replace_all(string=colnames(df), pattern=" ", replace=".");

# Load the chemo and biologic classification by Hui-Li & Koen
chemo.class <- read_excel("G:/Files/mCRC Targeting Melbourne/Chemotherapy Classification 20181123.xlsx");
biolo.class <- read_excel("G:/Files/mCRC Targeting Melbourne/Biological Classification 20181123.xlsx");

Step 1: Cleaning the TRACC Data

Before all models can be fitted, the data need to be cleaned and the appropriate patient population is to be selected. To this end, several steps are performed:

  1. Initial data cleaning and patient selection
  2. Cleaning of patient characteristics and treatment history
  3. Cleaning of clinical outcomes for first-line treatment
  4. Cleaning of clinical outcomes for further treatment lines
  5. Final patient selection

Step 1.1: Initial data cleaning and patient selection

In the first step of the data cleaning, the patient selection is performed. All data required for this selection are recoded and cleaned in this step as well. Patients included are those …

  • who received chemotherapy with or without bevacizumab
  • treated with a palliative treatment intent
  • with a treatment start date after 1 January 2009 and before 1 June 2017
# Inclusion is based on whether the first line of palliative treatment included
# doublet chemotherapy with or without bevacizumab. Patients for whom treatment
# intent was indicated as curative or potentially curative, are excluded.

# Only patients with recorded first-line chemotherapy treatment are considered for inclusion
df <- df %>%
  filter(!is.na(REGIMEN.1L)) %>%
  filter(TREATMENTINTENT=="Palliative only")

# Recode first-line treatment information for inclusion
df <- df %>%
  mutate(L1.Regimen = REGIMEN.1L) %>%
  mutate(L1.Biologic = BIOLOGIC.1L) %>%
  mutate(L1.Start.Date = as.character(START.DATE.1L))

# In the TRACC registry, all regimen in a treatment line need to be recoded first. 
# This has been manually done in an Excel file, which has been loaded in the global 
# environment as object 'chemo.class' and will now be used to recode the regimen.
# The 'chemo.class' object contains a column 'Recorded.Regimen' with all uncoded
# treatments and a columns 'Classification.G0' with the recoded treatments. The
# column 'L1.Chemo.Eligible' is added to indicate whether a patient received doublet
# chemotherapy and, hence, is eligible for inclusion.
df <- df %>%
  left_join(select(chemo.class, Recorded.Regimen, Classification.G0), by=c("L1.Regimen" = "Recorded.Regimen")) %>%
  rename(L1.Chemo.Class = Classification.G0) %>%
  mutate(L1.Chemo.Class = ifelse(is.na(L1.Chemo.Class), "None", L1.Chemo.Class)) %>%
  mutate(L1.Chemo.Eligible = ifelse(L1.Chemo.Class=="Doublet", T, F))

# Since only patients who had a doublet chemotherapy with or without bevacizumab,
# so not with another additional agent like cetuximab, are include, it need to be
# checked whether patients had bevacizumab yes or no, and whether they are
# eligible based on any other biologicals. This is done based on the first
# reported biological
df <- df %>%
  left_join(select(biolo.class, Recorded.Biologicals, Classification.G0), by=c("L1.Biologic" = "Recorded.Biologicals")) %>%
  rename(L1.Biologic.Class=Classification.G0) %>%
  mutate(L1.Biologic.Class = ifelse(is.na(L1.Biologic.Class), "None", L1.Biologic.Class)) %>%
  mutate(L1.Biologic.Eligible = ifelse(L1.Biologic.Class %in% c("None", "Bevacizumab"), T, F)) %>%
  mutate(L1.Bevacizumab = ifelse(L1.Biologic.Class=="Bevacizumab", T, F))

# Final selection of patients is performed based on whether they had ...
# 1) doublet chemotherapy with or without bevacizumab and no other biological
# 2) treatment start date after 2009-01-01
# 3) treatment start date before 2017-06-01
df<- df %>%
  filter(L1.Chemo.Eligible) %>%
  filter(L1.Biologic.Eligible) %>%
  filter(L1.Start.Date >= as.Date("2009-01-01")) %>%
  filter(L1.Start.Date < as.Date("2017-06-01"))

# Allocate patients to the appropriate treatment arm
df <- df %>%
  mutate(L1.Arm.ChemBev = ifelse(L1.Bevacizumab, T, F))

Step 1.2: Cleaning of patient characteristics and treatment history

The second step in cleaning the data is making sure all variables are recoded in relevant categories and all observations are categorized in one of these defined categories. The following variables are extracted from the data:

  • Age at treatment start
  • Gender
  • Hospital type (private or public)
  • Hospital location (located in Victoria or not)
  • Stage at initial diagnosis
  • Primary tumour site
  • Presence of peritoneal metastases
  • Presence of liver metastases
  • Number of metastatic sites
  • ECOG performance status
  • Presence of comorbidities
  • Charlson comorbidity index
  • CEA level
  • RAS mutation status
  • BRAF mutation status
  • Treatment intent
  • Treatment start year
df <- df %>%
  
  # Age in years at treatment start
  mutate(Age = floor(as.numeric(difftime(L1.Start.Date, DOB, units="weeks")/52))) %>%         

  # Gender is female or not
  mutate(Gender.Female = ifelse(GENDER=="F", T, F)) %>%                                       
  
  # Hospital type is private or not
  mutate(Hospital.Private = ifelse(TLOCATIONTYPE=="Private", T, F)) %>%
  
  # Hospital is located in Victoria or not
  mutate(Hospital.Victoria = ifelse(STATE=="VIC", T, F)) %>%
  
  # Stage of disease at initial diagnosis
  mutate(Stage.Diagnosis = ifelse(STAGE.AT.DIAGNOSIS=="Unknown", NA, NA))%>%
  mutate(Stage.Diagnosis = ifelse(STAGE.AT.DIAGNOSIS=="Locally advanced (rectal only)", 0, Stage.Diagnosis)) %>%
  mutate(Stage.Diagnosis = ifelse(STAGE.AT.DIAGNOSIS=="Stage I", 1, Stage.Diagnosis)) %>%
  mutate(Stage.Diagnosis = ifelse(STAGE.AT.DIAGNOSIS=="Stage II", 2, Stage.Diagnosis)) %>%
  mutate(Stage.Diagnosis = ifelse(STAGE.AT.DIAGNOSIS=="Stage III", 3, Stage.Diagnosis)) %>%
  mutate(Stage.Diagnosis = ifelse(STAGE.AT.DIAGNOSIS=="Stage IV", 4, Stage.Diagnosis)) %>%
  
  # Primary tumour site
  mutate(Primary.Site = ifelse(PRIMARYSITE=="Occult", NA, NA)) %>%
  mutate(Primary.Site = ifelse(PRIMARYSITE=="Colon", "Colon", Primary.Site)) %>%
  mutate(Primary.Site = ifelse(PRIMARYSITE=="Colon NOS", "Colon", Primary.Site)) %>%
  mutate(Primary.Site = ifelse(PRIMARYSITE=="Left colon", "Left", Primary.Site)) %>%
  mutate(Primary.Site = ifelse(PRIMARYSITE=="Right colon", "Right", Primary.Site)) %>%
  mutate(Primary.Site = ifelse(PRIMARYSITE=="Rectum", "Rectum", Primary.Site)) %>%
  mutate(Primary.Site = ifelse(PRIMARYSITE=="Multiple primaries", "Colon", Primary.Site)) %>%
  
  # Specific metastatic sites
  mutate(Metastases.Liver = ifelse(is.na(LIVER.METS), F, T)) %>%
  mutate(Metastases.Peritoneum = ifelse(is.na(PERITONEUM.OMENTUM.METS), F, T)) %>%
  
  # Number of metastatic sites
  mutate(Metastases.Number = NUMBER.OF.METS) %>%
  
  # ECOG Performance Status
  mutate(L1.ECOG = as.numeric(str_sub(ECOG.First.Line, 1, 1))) %>%
  
  # Whether the patient has comorbidities or not
  mutate(Comorbidities = ifelse(NO.COMORBIDITIES=="Yes", FALSE, TRUE)) %>%
  
  # Recode the comorbidities
  mutate(MYOCARDIAL.INFARCTION = ifelse(is.na(MYOCARDIAL.INFARCTION) | MYOCARDIAL.INFARCTION=="No", 0, 1)) %>%
  mutate(CONGESTIVE.HEART.FAILURE = ifelse(is.na(CONGESTIVE.HEART.FAILURE) | CONGESTIVE.HEART.FAILURE=="No", 0, 1)) %>%
  mutate(PERIPHERAL.VASCULAR.DISEASE = ifelse(is.na(PERIPHERAL.VASCULAR.DISEASE) | PERIPHERAL.VASCULAR.DISEASE=="No", 0, 1)) %>%
  mutate(CEREBROVASCULAR.DISEASE = ifelse(is.na(CEREBROVASCULAR.DISEASE) | CEREBROVASCULAR.DISEASE=="No", 0, 1)) %>%
  mutate(DEMENTIA = ifelse(is.na(DEMENTIA) | DEMENTIA=="No", 0, 1)) %>%
  mutate(COPD = ifelse(is.na(COPD) | COPD=="No", 0, 1)) %>%
  mutate(CONNECTIVE.TISSUE.DISEASE = ifelse(is.na(CONNECTIVE.TISSUE.DISEASE) | CONNECTIVE.TISSUE.DISEASE=="No", 0, 1)) %>%
  mutate(PEPTIC.ULCER.DISEASE = ifelse(is.na(PEPTIC.ULCER.DISEASE) | PEPTIC.ULCER.DISEASE=="No", 0, 1)) %>%
  mutate(MILD.LIVER.DISEASE = ifelse(is.na(MILD.LIVER.DISEASE) | MILD.LIVER.DISEASE=="No", 0, 1)) %>%
  mutate(DIABETES.NO.COMPS = ifelse(is.na(DIABETES.NO.COMPS) | DIABETES.NO.COMPS=="No", 0, 1)) %>%
  mutate(DIABETES.COMPS = ifelse(is.na(DIABETES.COMPS) | DIABETES.COMPS=="No", 0, 1)) %>%
  mutate(HEMIPLEGIA.STROKE = ifelse(is.na(HEMIPLEGIA.STROKE) | HEMIPLEGIA.STROKE=="No", 0, 1)) %>%
  mutate(RENAL.DISEASE.MOD.SEVERE = ifelse(is.na(RENAL.DISEASE.MOD.SEVERE) | RENAL.DISEASE.MOD.SEVERE=="No", 0, 1)) %>%
  mutate(SECOND.TUMOUR.NO.METS = ifelse(is.na(SECOND.TUMOUR.NO.METS) | SECOND.TUMOUR.NO.METS=="No", 0, 1)) %>%
  mutate(LYMPHOMA.MYELOMA = ifelse(is.na(LYMPHOMA.MYELOMA) | LYMPHOMA.MYELOMA=="No", 0, 1)) %>%
  mutate(LEUKAEMIA = ifelse(is.na(LEUKAEMIA) | LEUKAEMIA=="No", 0, 1)) %>%
  mutate(LIVER.DIS.MOD.SEVERE = ifelse(is.na(LIVER.DIS.MOD.SEVERE) | LIVER.DIS.MOD.SEVERE=="No", 0, 1)) %>%
  mutate(IMMUNE.DEFICIENCY = 0) %>%
  mutate(SECOND.MET.SOLID.TUMOUR = ifelse(is.na(SECOND.MET.SOLID.TUMOUR) | SECOND.MET.SOLID.TUMOUR=="No", 0, 1)) %>%
  
  # Charlson comorbidity score (NOT CORRECTED FOR AGE!!!)
  mutate(Charlson =  
           (1 * MYOCARDIAL.INFARCTION) + 
           (1 * CONGESTIVE.HEART.FAILURE) + 
           (1 * PERIPHERAL.VASCULAR.DISEASE) + 
           (1 * CEREBROVASCULAR.DISEASE) + 
           (1 * DEMENTIA) + 
           (1 * COPD) + 
           (1 * CONNECTIVE.TISSUE.DISEASE) + 
           (1 * PEPTIC.ULCER.DISEASE) + 
           (1 * MILD.LIVER.DISEASE) + 
           (1 * DIABETES.NO.COMPS) + 
           (2 * DIABETES.COMPS) + 
           (2 * HEMIPLEGIA.STROKE) + 
           (2 * RENAL.DISEASE.MOD.SEVERE) + 
           (2 * SECOND.TUMOUR.NO.METS) + 
           (2 * LYMPHOMA.MYELOMA) + 
           (2 * LEUKAEMIA) + 
           (3 * LIVER.DIS.MOD.SEVERE) + 
           (6 * IMMUNE.DEFICIENCY) +
           (6 * SECOND.MET.SOLID.TUMOUR)
  ) %>%
  
  # CEA Level
  mutate(CEA.Level = CEALEVEL) %>%

  # RAS status
  mutate(RAS.Mutated = ifelse(RAS.STATUS=="Pending", NA, NA)) %>%
  mutate(RAS.Mutated = ifelse(RAS.STATUS=="Unknown", NA, RAS.Mutated)) %>%
  mutate(RAS.Mutated = ifelse(RAS.STATUS=="Mutated", T, RAS.Mutated)) %>%
  mutate(RAS.Mutated = ifelse(RAS.STATUS=="Wild-type", F, RAS.Mutated)) %>%
  
  # BRAF status
  mutate(BRAF.Mutated = ifelse(BRAFSTATUS=="Pending", NA, NA)) %>%
  mutate(BRAF.Mutated = ifelse(BRAFSTATUS=="Unknown", NA, BRAF.Mutated)) %>%
  mutate(BRAF.Mutated = ifelse(BRAFSTATUS=="Mutated", T, BRAF.Mutated)) %>%
  mutate(BRAF.Mutated = ifelse(BRAFSTATUS=="Wild-type", F, BRAF.Mutated)) %>%
  
  # Treatment intent
  mutate(Treatment.Intent = ifelse(TREATMENTINTENT=="Curative resection occurred", "Curative", NA)) %>%
  mutate(Treatment.Intent = ifelse(TREATMENTINTENT=="Curative resection planned", "Curative", Treatment.Intent)) %>%
  mutate(Treatment.Intent = ifelse(TREATMENTINTENT=="Potentially curative resection (if good response to treatment)", "Potentially", Treatment.Intent)) %>%
  mutate(Treatment.Intent = ifelse(TREATMENTINTENT=="Palliative only", "Palliative", Treatment.Intent)) %>%
  
  # Treatment start
  mutate(Treatment.Start = format(as.Date(L1.Start.Date), "%Y"))

Step 1.3: Cleaning of clinical outcomes for first-line treatment

Clinical outcomes for the first line of treatment are extrated from the data in this third step of the data cleaning. In the manucript, only progression-free survival is considered as clinical outcome for first-line treatment. First-line progression is defined as physician-defined progression (so a patient will be able to receive further treatment or follow up) or death.

# First, the treatment stop date, date of progression, and date of death need to be extraced from the data
df<- df %>%  
  mutate(L1.Stop.Date = as.character(STOP.DATE.1L)) %>%
  mutate(L1.Progression.Date = as.character(PROGRESSION.DATE.First.Line)) %>%
  mutate(Death.Date = as.character(DATEOFDEATH))

# The progression-free survival (PFS) and overall survival (OS) are calculated using the
# treatment start date as origin
df <- df %>% 
  mutate(L1.PFS.Days = as.numeric(difftime(L1.Progression.Date, L1.Start.Date, units="days"))) %>%
  mutate(L1.OS.Days = as.numeric(difftime(Death.Date, L1.Start.Date, units="days"))) %>%
  mutate(L1.CENS.Days = as.numeric(difftime(L1.Stop.Date, L1.Start.Date, units="days"))) %>%
  mutate(L1.CENS.Days = ifelse(is.na(L1.CENS.Days), as.numeric(difftime(DATEOFREVIEW, L1.Start.Date, units="days")), L1.CENS.Days))


# Determine which of the events occured to the patient with regard to first line progression:
# 0) Censored:      when neither progression or death is recorded
# 1) Progression:   when progression is recorded
# 2) Death:         when progression is not recorded but death is
df <- df %>%
  mutate(L1.PFS.Events = ifelse(!is.na(L1.OS.Days) & (L1.OS.Days<=L1.PFS.Days | is.na(L1.PFS.Days)), "Death", NA)) %>%
  mutate(L1.PFS.Events = ifelse(!is.na(L1.PFS.Days) & (L1.PFS.Days<L1.OS.Days | is.na(L1.OS.Days)), "Progression", L1.PFS.Events)) %>%
  mutate(L1.PFS.Events = ifelse(is.na(L1.PFS.Days) & is.na(L1.OS.Days), "Censored", L1.PFS.Events))

# Determine whether an event occured or the patient was censored and the correspoding time
df <- df %>% 
  mutate(L1.PFS.Event = ifelse(L1.PFS.Events=="Censored", 0, 1)) %>%
  mutate(L1.PFS.Time = ifelse(L1.PFS.Events=="Progression", L1.PFS.Days, NA)) %>%
  mutate(L1.PFS.Time = ifelse(L1.PFS.Events=="Death", L1.OS.Days, L1.PFS.Time)) %>%
  mutate(L1.PFS.Time = ifelse(L1.PFS.Events=="Censored", L1.CENS.Days, L1.PFS.Time))

# Since a negative PFS or OS is not possible, patients with a non-positive time-to-event
# are excluded from the analysis, as their data is assumed to be incorrect
df <- df %>%
  mutate(L1.PFS.Include = ifelse(!is.na(L1.PFS.Time) & L1.PFS.Time > 0, T, F));

Step 1.4: Cleaning of clinical outcomes for further treatment lines

With regard to the clinical outcomes of treatment lines or follow up after first-line progression, only overall survival is considered. Overall survival is extracted from the data in this fourth step of the data cleaning.

# Time till death after progression
df <- df %>% 
  mutate(LX.OS.Days = as.numeric(difftime(DATEOFDEATH, L1.Progression.Date, units="days"))) %>%
  mutate(LX.CENS.Days = as.numeric(difftime(DATEOFREVIEW, L1.Progression.Date, units="days")))

# Determine whether the patient was censored:
# 0) Censored:      when death is not recorded
# 1) Death:         when death is recorded
df <- df %>%
  mutate(LX.OS.Events = ifelse(!is.na(LX.OS.Days), "Death", NA)) %>%
  mutate(LX.OS.Events = ifelse(is.na(LX.OS.Days) & !is.na(LX.CENS.Days), "Censored", LX.OS.Events)) %>%
  mutate(LX.OS.Events = ifelse(L1.PFS.Events=="Progression", LX.OS.Events, NA))

# Determine whether death occured or the patient was censored and the correspoding time
df <- df %>% 
  mutate(LX.OS.Event = ifelse(LX.OS.Events=="Censored", 0, 1)) %>%
  mutate(LX.OS.Time = ifelse(LX.OS.Events=="Death", LX.OS.Days, NA)) %>%
  mutate(LX.OS.Time = ifelse(LX.OS.Events=="Censored", LX.CENS.Days, LX.OS.Time))

# Since a negative PFS or OS is not possible, patients with a non-positive time-to-event
# are excluded from the analysis, as their data is assumed to be incorrect
df <- df %>%
  mutate(LX.OS.Include = ifelse(L1.PFS.Include & !is.na(LX.OS.Time) & LX.OS.Time > 0, T, F))

Step 1.5: Final patient selection

Patients with a negative progression-free survival or overall survival are excluded from the sample, as it is assumed that these are data entry errors. Finally, for the included patients, an overview table of their characteristics is generated.

# Only include patients who at least are included for first-line treatment
df <- df %>%
  filter(L1.PFS.Include)

# Only the new colomns need to be maintained
df <- df %>%
  select(L1.Regimen:LX.OS.Include)

# General function to get an overview table of the patient characteristics
getTable <- function(data, column.class=NULL, columns.names=NULL, columns.discrete=NULL, chisq.simulate=T, chisq.nsim=10000, export=F, export.file=NULL) {
  
  if(is.null(columns.names)) {
    columns.names <- c("Age", "Gender.Female", "Hospital.Private", "Stage.Diagnosis", "Primary.Site", 
                       "Metastases.Number", "Metastases.Liver", "Metastases.Peritoneum", "L1.ECOG",
                       "Charlson", "CEA.Level", "RAS.Mutated", "BRAF.Mutated");
    columns.discrete <- c(F, T, T, T, T, T, T, T, T, T, F, T, T);
  }
  
  if(is.null(column.class)) {
    column.class <- "L1.Arm.ChemBev";
  }
  
  if(length(columns.names)!=length(columns.discrete)) stop("OBJECTS 'columns.names' AND 'columns.discrete' SHOULD HAVE EQUAL LENGTHS!!!");
  
  classification <- unlist(data[, column.class]);
  if(length(unique(classification))!=2) stop("NEEDS A CLASSIFICATION THAT RESULTS IN TWO CLASSES!!!")
  
  out.discrete <- matrix(c(table(classification), NA, NA, NA), nrow=1);
  rownames(out.discrete) <- "n";
  colnames(out.discrete) <- c(rep(names(table(classification)),2 ), "p-value");
  for(column in columns.names) {
    
    variable <- unlist(data[, column]);
    
    if(columns.discrete[which(columns.names==column)[1]]) {
      
      variable[is.na(variable)] <- "Missings";
      
      counts <- table(variable, classification);
      rownames(counts) <- str_c(column, "-", rownames(counts));
      
      percentages <- t(t(counts) / out.discrete[1, 1:2])*100;
      
      chisq <- chisq.test(counts, simulate.p.value = chisq.simulate, B = chisq.nsim)
      chisq.p <- c(chisq$p.value, rep(NA, nrow(counts)-1));
      
      out.discrete <- rbind(out.discrete, cbind(counts, percentages, chisq.p));
      
    } else {
      
      means <- tapply(variable, classification, mean, na.rm=T);
      sd <- tapply(variable, classification, sd, na.rm=T);
      NA.counts <- tapply(is.na(variable), classification, sum);
      NA.percentages <- tapply(is.na(variable), classification, mean)*100;
      
      ttest <- t.test(x = variable[classification==unique(classification[1])],
                      y = variable[classification!=unique(classification[1])],
                      alteranative = "two-sided", paired = F);
      
      out <- rbind(c(means, sd, ttest$p.value), 
                   c(NA.counts, NA.percentages, NA));
      rownames(out) <- str_c(column, c("-MeanSD", "-Missings"));
      
      out.discrete <- rbind(out.discrete, out);
      
    }    
  }
  
  colnames(out.discrete)[1:2] <- str_c(str_c(column.class, "==", colnames(out.discrete)[1:2]), " (#/Mean)");
  colnames(out.discrete)[3:4] <- str_c(str_c(column.class, "==", colnames(out.discrete)[3:4]), " (%/SD)");
  
  if(export) write.table(out.discrete, file=export.file, sep=";");
  
  return(out.discrete)
}

# Export Characteristics Tables
Table1.L1.PFS <- getTable(data=filter(df, L1.PFS.Include), export=F);
Table1.LX.OS <- getTable(data=filter(df, LX.OS.Include), export=F);

Overview table for patients included for first-line treatment:

L1.Arm.ChemBev==FALSE (#/Mean) L1.Arm.ChemBev==TRUE (#/Mean) L1.Arm.ChemBev==FALSE (%/SD) L1.Arm.ChemBev==TRUE (%/SD) p-value
n 191.00000 676.00000 NA NA NA
Age-MeanSD 63.17801 62.54882 13.9234652 12.145134 0.5714513
Age-Missings 0.00000 0.00000 0.0000000 0.000000 NA
Gender.Female-FALSE 118.00000 407.00000 61.7801047 60.207101 0.7425257
Gender.Female-TRUE 73.00000 269.00000 38.2198953 39.792899 NA
Hospital.Private-FALSE 115.00000 337.00000 60.2094241 49.852071 0.0121988
Hospital.Private-TRUE 76.00000 339.00000 39.7905759 50.147929 NA
Stage.Diagnosis-0 8.00000 35.00000 4.1884817 5.177515 0.2915708
Stage.Diagnosis-1 0.00000 11.00000 0.0000000 1.627219 NA
Stage.Diagnosis-2 10.00000 35.00000 5.2356021 5.177515 NA
Stage.Diagnosis-3 21.00000 95.00000 10.9947644 14.053254 NA
Stage.Diagnosis-4 151.00000 491.00000 79.0575916 72.633136 NA
Stage.Diagnosis-Missings 1.00000 9.00000 0.5235602 1.331361 NA
Primary.Site-Colon 13.00000 26.00000 6.8062827 3.846154 0.3540646
Primary.Site-Left 73.00000 261.00000 38.2198953 38.609468 NA
Primary.Site-Missings 3.00000 11.00000 1.5706806 1.627219 NA
Primary.Site-Rectum 51.00000 165.00000 26.7015707 24.408284 NA
Primary.Site-Right 51.00000 213.00000 26.7015707 31.508876 NA
Metastases.Number-1 91.00000 256.00000 47.6439791 37.869822 0.1606839
Metastases.Number-2 55.00000 254.00000 28.7958115 37.573965 NA
Metastases.Number-3 33.00000 123.00000 17.2774869 18.195266 NA
Metastases.Number-4 9.00000 36.00000 4.7120419 5.325444 NA
Metastases.Number-5 3.00000 6.00000 1.5706806 0.887574 NA
Metastases.Number-6 0.00000 1.00000 0.0000000 0.147929 NA
Metastases.Liver-FALSE 64.00000 211.00000 33.5078534 31.213018 0.5971403
Metastases.Liver-TRUE 127.00000 465.00000 66.4921466 68.786982 NA
Metastases.Peritoneum-FALSE 138.00000 491.00000 72.2513089 72.633136 0.9280072
Metastases.Peritoneum-TRUE 53.00000 185.00000 27.7486911 27.366864 NA
L1.ECOG-0 51.00000 284.00000 26.7015707 42.011834 0.0001000
L1.ECOG-1 95.00000 344.00000 49.7382199 50.887574 NA
L1.ECOG-2 34.00000 43.00000 17.8010471 6.360947 NA
L1.ECOG-3 11.00000 5.00000 5.7591623 0.739645 NA
Charlson-0 127.00000 487.00000 66.4921466 72.041420 0.0014999
Charlson-1 27.00000 125.00000 14.1361257 18.491124 NA
Charlson-2 22.00000 41.00000 11.5183246 6.065089 NA
Charlson-3 7.00000 18.00000 3.6649215 2.662722 NA
Charlson-4 2.00000 2.00000 1.0471204 0.295858 NA
Charlson-5 3.00000 1.00000 1.5706806 0.147929 NA
Charlson-6 1.00000 1.00000 0.5235602 0.147929 NA
Charlson-7 0.00000 1.00000 0.0000000 0.147929 NA
Charlson-8 2.00000 0.00000 1.0471204 0.000000 NA
CEA.Level-MeanSD 725.91907 478.48598 5254.2478968 2048.838683 0.5459979
CEA.Level-Missings 19.00000 61.00000 9.9476440 9.023669 NA
RAS.Mutated-FALSE 72.00000 317.00000 37.6963351 46.893491 0.0001000
RAS.Mutated-Missings 63.00000 112.00000 32.9842932 16.568047 NA
RAS.Mutated-TRUE 56.00000 247.00000 29.3193717 36.538461 NA
BRAF.Mutated-FALSE 52.00000 266.00000 27.2251309 39.349112 0.0085991
BRAF.Mutated-Missings 127.00000 369.00000 66.4921466 54.585799 NA
BRAF.Mutated-TRUE 12.00000 41.00000 6.2827225 6.065089 NA

Overview table for patients included for further treatmen lines:

L1.Arm.ChemBev==FALSE (#/Mean) L1.Arm.ChemBev==TRUE (#/Mean) L1.Arm.ChemBev==FALSE (%/SD) L1.Arm.ChemBev==TRUE (%/SD) p-value
n 156.00000 562.00000 NA NA NA
Age-MeanSD 62.87179 62.32028 14.0789103 12.2110342 0.6567444
Age-Missings 0.00000 0.00000 0.0000000 0.0000000 NA
Gender.Female-FALSE 96.00000 341.00000 61.5384615 60.6761566 0.8505149
Gender.Female-TRUE 60.00000 221.00000 38.4615385 39.3238434 NA
Hospital.Private-FALSE 93.00000 272.00000 59.6153846 48.3985765 0.0153985
Hospital.Private-TRUE 63.00000 290.00000 40.3846154 51.6014235 NA
Stage.Diagnosis-0 4.00000 28.00000 2.5641026 4.9822064 0.2659734
Stage.Diagnosis-1 0.00000 8.00000 0.0000000 1.4234875 NA
Stage.Diagnosis-2 9.00000 28.00000 5.7692308 4.9822064 NA
Stage.Diagnosis-3 18.00000 82.00000 11.5384615 14.5907473 NA
Stage.Diagnosis-4 124.00000 407.00000 79.4871795 72.4199288 NA
Stage.Diagnosis-Missings 1.00000 9.00000 0.6410256 1.6014235 NA
Primary.Site-Colon 10.00000 18.00000 6.4102564 3.2028470 0.2894711
Primary.Site-Left 61.00000 216.00000 39.1025641 38.4341637 NA
Primary.Site-Missings 1.00000 11.00000 0.6410256 1.9572954 NA
Primary.Site-Rectum 38.00000 133.00000 24.3589744 23.6654804 NA
Primary.Site-Right 46.00000 184.00000 29.4871795 32.7402135 NA
Metastases.Number-1 70.00000 218.00000 44.8717949 38.7900356 0.6216378
Metastases.Number-2 50.00000 214.00000 32.0512821 38.0782918 NA
Metastases.Number-3 28.00000 95.00000 17.9487179 16.9039146 NA
Metastases.Number-4 7.00000 31.00000 4.4871795 5.5160142 NA
Metastases.Number-5 1.00000 4.00000 0.6410256 0.7117438 NA
Metastases.Liver-FALSE 53.00000 172.00000 33.9743590 30.6049822 0.4386561
Metastases.Liver-TRUE 103.00000 390.00000 66.0256410 69.3950178 NA
Metastases.Peritoneum-FALSE 109.00000 402.00000 69.8717949 71.5302491 0.6921308
Metastases.Peritoneum-TRUE 47.00000 160.00000 30.1282051 28.4697509 NA
L1.ECOG-0 39.00000 230.00000 25.0000000 40.9252669 0.0001000
L1.ECOG-1 83.00000 291.00000 53.2051282 51.7793594 NA
L1.ECOG-2 28.00000 37.00000 17.9487179 6.5836299 NA
L1.ECOG-3 6.00000 4.00000 3.8461538 0.7117438 NA
Charlson-0 106.00000 406.00000 67.9487179 72.2419929 0.0057994
Charlson-1 22.00000 106.00000 14.1025641 18.8612100 NA
Charlson-2 16.00000 32.00000 10.2564103 5.6939502 NA
Charlson-3 5.00000 14.00000 3.2051282 2.4911032 NA
Charlson-4 2.00000 1.00000 1.2820513 0.1779359 NA
Charlson-5 2.00000 1.00000 1.2820513 0.1779359 NA
Charlson-6 1.00000 1.00000 0.6410256 0.1779359 NA
Charlson-7 0.00000 1.00000 0.0000000 0.1779359 NA
Charlson-8 2.00000 0.00000 1.2820513 0.0000000 NA
CEA.Level-MeanSD 844.29064 488.46878 5796.9883585 2031.4895195 0.4746236
CEA.Level-Missings 15.00000 52.00000 9.6153846 9.2526690 NA
RAS.Mutated-FALSE 63.00000 269.00000 40.3846154 47.8647687 0.0001000
RAS.Mutated-Missings 50.00000 75.00000 32.0512821 13.3451957 NA
RAS.Mutated-TRUE 43.00000 218.00000 27.5641026 38.7900356 NA
BRAF.Mutated-FALSE 41.00000 217.00000 26.2820513 38.6120996 0.0146985
BRAF.Mutated-Missings 105.00000 309.00000 67.3076923 54.9822064 NA
BRAF.Mutated-TRUE 10.00000 36.00000 6.4102564 6.4056940 NA

Step 2: Multiple Imputation of Missing Data

After cleaning the data in Step 1, the missing data are addressed in Step 2. Two steps are performed in order to do so:

  1. Addressing missings in the RAS status and BRAF status
  2. Performing multiple imputation by chained equations

Step 2.1: addressing missings in the RAS status and BRAF status

As explained in the manuscript, missing data for the RAS mutation status and BRAF mutation status may be partly addressed by assuming these mutations to be mutatual exclusive. For example, this indicates that if a RAS mutation is known for a patient with a missing BRAF status, the BRAF status can be assumed to be ‘wild-type’.

# Address missing values in Mutations
# Potentially 175 missing values can be addressed:
# - 1 for RAS
# - 142 for BRAF
table(list(RAS=df$RAS.Mutated, BRAF=df$BRAF.Mutated), useNA="always");
##        BRAF
## RAS     FALSE TRUE <NA>
##   FALSE   158   51  180
##   TRUE    160    1  142
##   <NA>      0    1  174
# If patients have a KRAS mutation and the RAS status is unknown,
# we assume that they will not have a RAS mutation. Similarly, if
# the BRAF status is unknown and the patient has a RAS mutation,
# we assume that the patient does not have a BRAF mutation.
df <- df %>%
  mutate(RAS.Mutated = ifelse(is.na(RAS.Mutated) & BRAF.Mutated, F, RAS.Mutated)) %>%
  mutate(BRAF.Mutated = ifelse(is.na(BRAF.Mutated) & RAS.Mutated, F, BRAF.Mutated))

# Oberve new distribution of mutations and missing values
table(list(RAS=df$RAS.Mutated, BRAF=df$BRAF.Mutated), useNA="always");
##        BRAF
## RAS     FALSE TRUE <NA>
##   FALSE   158   52  180
##   TRUE    302    1    0
##   <NA>      0    0  174

Step 2.2: performing multiple imputation by chained equations

After addressing part of the missing values for the RAS mutation status and BRAF mutation status, remaining missing data for these and the other variables are addressed by performing multiple imputation by chained equations. Missing values are addressed for the following variables (imputation method used):

  • Stage at diagnosis (multinomial logit)
  • Primary tumor site (multinomial logit)
  • CEA level (Bayesian linear regression)
  • RAS mutation status (logistic regression)
  • BRAF mutation status (logistic regression)

The following variables were additionally used as predictors:

  • Age at treatment start
  • Gender
  • Hospital type (private of public)
  • Presence of liver metastases
  • Presence of peritoneal metastases
  • Number of metastatic sites
  • ECOG performance status
  • Charlson comorbidity index

After performing the multiple imputation, post-processing is performed while extracting the imputed datasets to make sure RAS mutation status and BRAF mutation status remain mutually exclusive.

# Load required package
library(mice);

# List the included columns
mice.cols.included <- c("Age", "Gender.Female", "Hospital.Private", "Metastases.Liver", "Metastases.Peritoneum", "Metastases.Number", "L1.ECOG", "Charlson", "Primary.Site", "Stage.Diagnosis", "CEA.Level", "RAS.Mutated", "BRAF.Mutated");

# Select the data used for the multiple imputation
df.mice <- df %>%
  select(mice.cols.included)

# Make sure the type of all columns is correct
df.mice <- df.mice %>%
  mutate(Age = as.numeric(Age)) %>% 
  mutate(Gender.Female = as.logical(Gender.Female)) %>% 
  mutate(Hospital.Private = as.logical(Hospital.Private)) %>% 
  mutate(Stage.Diagnosis = as.factor(Stage.Diagnosis)) %>% 
  mutate(Primary.Site = as.factor(Primary.Site)) %>% 
  mutate(Metastases.Liver = as.logical(Metastases.Liver)) %>% 
  mutate(Metastases.Peritoneum = as.logical(Metastases.Peritoneum)) %>% 
  mutate(Metastases.Number = as.numeric(Metastases.Number)) %>% 
  mutate(Charlson = as.numeric(Charlson)) %>% 
  mutate(CEA.Level = as.numeric(CEA.Level)) %>% 
  mutate(RAS.Mutated = as.logical(RAS.Mutated)) %>% 
  mutate(BRAF.Mutated = as.logical(BRAF.Mutated)) %>% 
  mutate(L1.ECOG = as.factor(L1.ECOG))


# Define predictor matrix
predictor.matrix <- matrix(1, length(mice.cols.included), length(mice.cols.included)) - diag(1, length(mice.cols.included), length(mice.cols.included));
colnames(predictor.matrix) <- rownames(predictor.matrix) <- mice.cols.included;

# Perform multiple imputation
m <- 25;              # Number of imputations
set.seed(20181211);   # Random seed for reproduction
mice.out <- mice(data=df.mice, m=m, maxit=20, predictorMatrix=predictor.matrix, printFlag=F,
                 method=c(Age="", 
                          Gender.Female="", 
                          Hospital.Private="", 
                          Metastases.Liver="", 
                          Metastases.Peritoneum="", 
                          Metastases.Number="", 
                          L1.ECOG="", 
                          Charlson="", 
                          Primary.Site="polyreg",
                          Stage.Diagnosis="polyreg", 
                          CEA.Level="norm", 
                          RAS.Mutated="logreg", 
                          BRAF.Mutated="logreg")
);


# Extract the imputed datasets
count.BRAF.and.RAS <- 0;    # Counter to track how much post-processing is required
datasets <- lapply(1:m, function(i.dataset) {
  
  # Get the imputed columns
  df.imputed <- complete(mice.out, action=i.dataset);
  cols.imputed <- colnames(df.imputed);
  
  # Replace the colums in the original dataset by the imputed columns
  df.new <- df;
  df.new[, cols.imputed] <- df.imputed; 
  
  # Correct when RAS and BRAF mutations conflict
  count.BRAF.and.RAS <<- count.BRAF.and.RAS + sum(df.new$BRAF.Mutated & df.new$RAS.Mutated);
  df.new$BRAF.Mutated <- ifelse(df.new$RAS.Mutated==1 & df.new$BRAF.Mutated==1, 0, df.new$BRAF.Mutated);
  
  # Return the newly completed datasets
  df.new;
  
});

# Check number of post-processed mutations
prob.BRAF.and.RAS <- (count.BRAF.and.RAS/m) / nrow(datasets[[1]]) * 100;
prob.BRAF.and.RAS;
## [1] 0.3460208

Step 3: Univariable Analysis

Initial analyses were performed to determine which distribution types would be used to parameterize the survival models and which transformations would be used for each variable in the multivariable modeling. To this end, the following steps were performed as part of this univariable modeling:

  1. Calculating the overlap of the competing events for first-line progression-free survival (Model L1-Time)
  2. Selecting the distribution for progression-free survival (Model L1-Time)
  3. Selecting the distribution for overal survival following first-line progression (Model LX-Time)
  4. Selecting the transformation for progression-free survival (Model L1-Time)
  5. Selecting the transformation for progression-free survival (Model L1-Event)
  6. Selecting the transformation for overal survival following first-line progression (Model LX-Time)

The following transformations were considered:

Step 3.1: calculating the overlap of the competing events for first-line progression-free survival (Model L1-Time)

In modeling first-line progression-free survival, there are two competing events that need to be accounted for: physician-defined progression and death. To determine which modeling approach will be implemented to account for these two competing events, the overlap of the corresponding time-to-event distributions is important to know. It was decided that competing events will be handled by using the ‘Unimodal Distribution and Regression model (UDR)’ approach based on the strategy of selecting the time-to-event first based on a joint time-to-event distribution, and then select the event corresponding to that time.

# function for calculating the overlap between the distribution for time-to-progression and time-to-death
getOverlap <- function(data, n.bins=512, digits=3) {

  library(flux);

  # Get events
  events <- sort(unique(data$event));
  n.events <- length(events);

  # Get event-specific times
  times <- lapply(events, function(e) subset(data, event==e)$time );

  # Density settings
  min <- min(data$time);
  max <- max(data$time);
  n <- n.bins;
  x <- seq(from=min, to=max, length.out=n);

  # Get event-specific densities
  y <- sapply(1:n.events, function(i1) density(times[[i1]], from=min, to=max, n=n)$y );

  # Get event=specific density overlap
  yo <- sapply(1:n.events, function(i2) {

    sapply(1:n, function(j){

      own <- y[j,i2];
      other <- max(y[j,which(c(1:n.events)!=i2)]);

      if(own<=0) {
        return(0)
      } else
        return(min(own,other))

    })

  });

  # Calculate event-specific overlap
  o <- sapply(1:n.events, function(i3) {

    (auc(x=x, y=yo[,i3]) / auc(x=x, y=y[,i3])) *100

  });

  return(round(mean(o), digits=digits));
}

# Calculate the distribution overlap
getOverlap(data=data.frame(time = df$L1.PFS.Time[df$L1.PFS.Events!="Censored"], event = df$L1.PFS.Events[df$L1.PFS.Events!="Censored"]))
## [1] 72.457
# Visualize distribution overlap
plot(density(df$L1.PFS.Time[df$L1.PFS.Events!="Censored"]), xlab="Time-to-Event in Days", ylab=NA, main="Density Plot", las=1, lwd=2, xlim=c(0,2500), ylim=c(0,0.0025))
lines(density(df$L1.PFS.Time[df$L1.PFS.Events=="Progression"]), lwd=2, col=3)
lines(density(df$L1.PFS.Time[df$L1.PFS.Events=="Death"]), lwd=2, col=2)
lines(density(df$L1.PFS.Time[df$L1.PFS.Events=="Censored"]), lwd=2, col=4)
legend("topright", lty=1, lwd=2, col=c(1,3,2, 4), c(str_c("Combined [excl. censored patients](n=", sum(df$L1.PFS.Events!="Censored"),")"), str_c("Progression (n=", sum(df$L1.PFS.Events=="Progression"),")"), str_c("Death (n=", sum(df$L1.PFS.Events=="Death"),")"), str_c("Censored (n = ", sum(df$L1.PFS.Events=="Censored"),")")))

Step 3.2: selecting the distribution for progression-free survival (Model L1-Time)

A distribution type is selected for survival Model L1-Time that represents the two competing events applicable to first-line progression-free survival (physcian-defined progression and death) in one unimodal survival model. Selection was based on the likelihood and assessment of quantile-quantile plots and Kaplan-Meier plots. The following distributions were considered:

  • Exponential
  • Weibull
  • Logistic
  • Log-normal
# Load packages
library(survival);  # Fit Survival Models
library(EnvStats);  # Q-Q Plot for censored data

# Distributions considered
dist.considered <- c("exponential", 
                     "weibull", 
                     "logistic", 
                     "lognormal");

# Data for first-line
df.L1 <- df;

# Select distribution
dist.performance.L1 <- sapply(dist.considered, function(dist) {
  
  # Fit survival model for distribution
  model <- survreg(formula=Surv(time=L1.PFS.Time, event=L1.PFS.Event) ~ 1, data=df.L1, dist=dist);
  
  # Return the likelihood
  return(model$loglik[1]);
  
});

# Print the results and which distribution would be selected based on the likelihood
dist.selected.L1 <- dist.considered[which.max(dist.performance.L1)];
print(dist.performance.L1);
## exponential     weibull    logistic   lognormal 
##   -5522.443   -5483.372   -5646.513   -5512.667
print(dist.selected.L1);
## [1] "weibull"
# Plot the survival curves
for(i in 1:length(dist.considered)) {
  
  # Get distribution type
  dist <- dist.considered[i];
  
  # Fit survival model for distribution
  model <- survreg(formula=Surv(time=L1.PFS.Time, event=L1.PFS.Event) ~ 1, data=df.L1, dist=dist);
  
  # If it is the first distribution considered, plot kaplan-meier
  if(i==1) plot(survfit(Surv(time=L1.PFS.Time, event=L1.PFS.Event) ~ 1, data=df.L1), conf.int=F, xlab="Time to Death or Progression (days)", lwd=2);
  
  # Plot fitted distribution
  lines(y=seq(.99, .01, by=-.01), x=predict(model, type="quantile", p=seq(.01, .99, by=.01))[1,], col=rainbow(length(dist.considered))[i], lwd=2)
  
  # Add the legend if it is the last distribution
  if(i==length(dist.considered)) legend("right", legend=dist.considered, col=rainbow(length(dist.considered)), lty=1, lwd=2);
  
} 

# Plot the Q-Q Plots curves
par(mfrow=c(2,2));
for(i in 1:length(dist.considered)) {
  
  # Get distribution type
  dist <- dist.considered[i];
  
  # Fit survival model for distribution
  model <- survreg(formula=Surv(time=L1.PFS.Time, event=L1.PFS.Event) ~ 1, data=df.L1, dist=dist);
  
  # Plot the Q-Q Plot
  if(dist=="exponential") qqPlotCensored(x=df.L1$L1.PFS.Time, censored=df.L1$L1.PFS.Event==0, censoring.side="right", 
                                         equal.axes=T, main=paste0("Q-Q Plot for the ", dist, " distribution"),
                                         distribution="exp", param.list=list(rate=1/exp(model$coefficients)));
  if(dist=="weibull") qqPlotCensored(x=df.L1$L1.PFS.Time, censored=df.L1$L1.PFS.Event==0, censoring.side="right", 
                                     equal.axes=T, main=paste0("Q-Q Plot for the ", dist, " distribution"), 
                                     distribution="weibull", param.list=list(shape=1/model$scale, scale=exp(model$coefficients)));
  if(dist=="logistic") qqPlotCensored(x=df.L1$L1.PFS.Time, censored=df.L1$L1.PFS.Event==0, censoring.side="right", 
                                      equal.axes=T, main=paste0("Q-Q Plot for the ", dist, " distribution"), 
                                      distribution="logis", param.list=list(location=model$coefficients, scale=model$scale));
  if(dist=="lognormal") qqPlotCensored(x=df.L1$L1.PFS.Time, censored=df.L1$L1.PFS.Event==0, censoring.side="right", 
                                       equal.axes=T, main=paste0("Q-Q Plot for the ", dist, " distribution"), 
                                       distribution="lnorm", param.list=list(meanlog=model$coefficients, sdlog=model$scale));
}

Step 3.3: selecting the distribution for overal survival following first-line progression (Model LX-Time)

A distribution type is selected for survival Model LX-Time that represents overall survival after first-line progression. Selection was based on the likelihood and assessment of quantile-quantile plots and Kaplan-Meier plots. As for Model L1-Time, the following distributions were considered:

  • Exponential
  • Weibull
  • Logistic
  • Log-normal
# Data for first-line
df.LX <- filter(df, LX.OS.Include);

# Select distribution
dist.performance.LX <- sapply(dist.considered, function(dist) {
  
  # Fit survival model for distribution
  model <- survreg(formula=Surv(time=LX.OS.Time, event=LX.OS.Event) ~ 1, data=df.LX, dist=dist);
  
  # Return the likelihood
  return(model$loglik[1]);
  
});

# Print the results and which distribution would be selected based on the likelihood
dist.selected.LX <- dist.considered[which.max(dist.performance.LX)];
print(dist.performance.LX);
## exponential     weibull    logistic   lognormal 
##   -4207.651   -4207.648   -4430.506   -4245.090
print(dist.selected.LX);
## [1] "weibull"
# Plot the survival curves
for(i in 1:length(dist.considered)) {
  
  # Get distribution type
  dist <- dist.considered[i];
  
  # Fit survival model for distribution
  model <- survreg(formula=Surv(time=LX.OS.Time, event=LX.OS.Event) ~ 1, data=df.LX, dist=dist);
  
  # If it is the first distribution considered, plot kaplan-meier
  if(i==1) plot(survfit(Surv(time=LX.OS.Time, event=LX.OS.Event) ~ 1, data=df.LX), conf.int=F, xlab="Overall Survival (days)", lwd=2);
  
  # Plot fitted distribution
  lines(y=seq(.99, .01, by=-.01), x=predict(model, type="quantile", p=seq(.01, .99, by=.01))[1,], col=rainbow(length(dist.considered))[i], lwd=2)
  
  # Add the legend if it is the last distribution
  if(i==length(dist.considered)) legend("right", legend=dist.considered, col=rainbow(length(dist.considered)), lty=1, lwd=2);
  
} 

# Plot the Q-Q Plots curves
par(mfrow=c(2,2));
for(i in 1:length(dist.considered)) {
  
  # Get distribution type
  dist <- dist.considered[i];
  
  # Fit survival model for distribution
  model <- survreg(formula=Surv(time=LX.OS.Time, event=LX.OS.Event) ~ 1, data=df.LX, dist=dist);
  
  # Plot the Q-Q Plot
  if(dist=="exponential") qqPlotCensored(x=df.LX$LX.OS.Time, censored=df.LX$LX.OS.Event==0, censoring.side="right", 
                                         equal.axes=T, main=paste0("Q-Q Plot for the ", dist, " distribution"),
                                         distribution="exp", param.list=list(rate=1/exp(model$coefficients)));
  if(dist=="weibull") qqPlotCensored(x=df.LX$LX.OS.Time, censored=df.LX$LX.OS.Event==0, censoring.side="right", 
                                     equal.axes=T, main=paste0("Q-Q Plot for the ", dist, " distribution"), 
                                     distribution="weibull", param.list=list(shape=1/model$scale, scale=exp(model$coefficients)));
  if(dist=="logistic") qqPlotCensored(x=df.LX$LX.OS.Time, censored=df.LX$LX.OS.Event==0, censoring.side="right", 
                                      equal.axes=T, main=paste0("Q-Q Plot for the ", dist, " distribution"), 
                                      distribution="logis", param.list=list(location=model$coefficients, scale=model$scale));
  if(dist=="lognormal") qqPlotCensored(x=df.LX$LX.OS.Time, censored=df.LX$LX.OS.Event==0, censoring.side="right", 
                                       equal.axes=T, main=paste0("Q-Q Plot for the ", dist, " distribution"), 
                                       distribution="lnorm", param.list=list(meanlog=model$coefficients, sdlog=model$scale));
}

Step 3.4: selecting the transformation for progression-free survival (Model L1-Time)

The best-performing transformation was selected for each variable and survival model separately by fitting individual models for each candidate transformation and comparing performance based on Wald test p-values. Univariable models were also fitted for variables without multiple candidate transformations to record p-values. All p-values were calculated for each of 25 imputed datasets, and for variables with multiple candidate transformations, transformations’ p-value ranks were determined for each imputed dataset. Final selection of the transformation to be used was based on the average transformation rank over all 25 imputed datasets.

# Load package
library(aod);   # Wald-test

# Candidate variables, including their importance and potential transformations
vars.information.L1 <- list(
  
  # Age at treatment start
  Age = list(Transformations = c(Log = "log(Age)", SQRT = "sqrt(Age)", Poly1 = "poly(Age, 1)", Poly2 = "poly(Age, 2)", Poly3 = "poly(Age, 3)", Poly4 = "poly(Age, 4)")),
  
  # Gender
  Gender.Female = list(Transformations = c(Binary.FemaleYN = "Gender.Female")),
  
  # Hospital Type
  Hospital.Private = list(Transformations = c(Binary.PrivateYN = "Hospital.Private")),
  
  # Stage at Initial Diagnosis
  Stage.Diagnosis = list(Transformations = c(Binary.Stage4YN = "as.factor(as.character(Stage.Diagnosis)=='4')")),
  
  # Primary Tumour Site
  Primary.Site = list(Transformations = c(Factor.GroupColMul = "as.factor(ifelse(Primary.Site %in% c('Colon','Multiple'), 'ColonMultiple', Primary.Site))", Factor.GroupColMul.GroupLefRec = "as.factor(ifelse(Primary.Site %in% c('Colon','Multiple'), 'ColonMultiple', ifelse(Primary.Site %in% c('Left','Rectum'), 'LeftRectum', Primary.Site)))")),
  
  # Number of Metastases
  Metastases.Number = list(Transformations = c(Factor.Group3plus = "as.factor(ifelse(Metastases.Number>=3, 3, Metastases.Number))")),
  
  # Liver Metastases
  Metastases.Liver = list(Transformations = c(Binary = "Metastases.Liver")),
  
  # Peritoneum Metastases
  Metastases.Peritoneum = list(Transformations = c(Binary = "Metastases.Peritoneum")),
  
  # ECOG Performance Status
  L1.ECOG = list(Transformations = c(Factor.Group23 = "as.factor(ifelse(L1.ECOG==0, 0, ifelse(L1.ECOG==1, 1, 2)))", Binary.2plusYN = "as.factor(ifelse(L1.ECOG %in% c(0,1), F, T))")),
  
  # Charlson Score
  Charlson = list(Transformations = c(Factor.Group2plus = "as.factor(ifelse(Charlson>=2, 2, Charlson))", Factor.Group01.Group2plus = "as.factor(ifelse(Charlson %in% c(0,1), 1, ifelse(Charlson>=2, 2, Charlson)))", Binary.1plusYN = "as.factor(ifelse(Charlson>=1, T, F))")),
  
  # CEA Level
  CEA.Level = list(Transformations = c(Binary.200plus = "as.factor(ifelse(Metastases.Number<=2, T, F) * ifelse(CEA.Level>=200, T, F))")),
  
  # RAS Status
  RAS.Mutated = list(Transformations = c(Binary.MutatedYN = "RAS.Mutated")),
  
  # BRAF Status
  BRAF.Mutated = list(Transformations = c(Binary.MutatedYN = "BRAF.Mutated"))
  
);

# Set three levels of significance
significance.levels <- c(0.1, 0.05, 0.01);

# For each imputed dataset, select distribution and perform univariable analysis
univariable.L1 <- lapply(1:length(datasets), function(dataset) {
  
  # Retrieve datasets
  data.surv.L1 <- datasets[[dataset]];

  # For each variable
  vars.univariable <- lapply(vars.information.L1, function(var) {
    
    # Consider all potential transformations
    sapply(var$Transformations, function(var.form) {
      
      # Define formula to be fitted
      f <- as.formula(str_c("Surv(time=L1.PFS.Time, event=L1.PFS.Event) ~ ", var.form, collapse=""));
      
      # Fit model
      fit <- survreg(formula=f, data=data.surv.L1, dist=dist.selected.L1);
      
      # Perform Wald-test for significance
      test <- wald.test(Sigma=fit$var, b=c(fit$coefficients, fit$scale), Terms=c(2:length(fit$coefficients)))
      
      # Determine significance
      pvalue <- test$result$chi2["P"];
      significance <- "";
      if(pvalue<max(significance.levels)) {
        significance <- str_c(rep("*", tail(which(pvalue<significance.levels), n=1)), collapse="");
      }
      
      # Return transformation and model performance
      c(trans=f, ll=fit$loglik[2], pvalue=pvalue, sign=significance)
      
    })
    
  });
  
  # Return outcomes
  return(list(vars.univariable=vars.univariable));
  
})


# Add summary for variables to vars.information object
for(var in names(vars.information.L1)) {
  
  # Get names of transformations
  transformation.names <- names(vars.information.L1[[var]]$Transformations);
  
  # Extract data from univariable object
  loglik <- as.data.frame(
    sapply(transformation.names, function(trans) 
      sapply(univariable.L1, function(univar)
        univar$vars.univariable[[var]][2,trans]
      )
    )
  )
  pvalues <- as.data.frame(
    sapply(transformation.names, function(trans) 
      sapply(univariable.L1, function(univar)
        univar$vars.univariable[[var]][3,trans]
      )
    )
  )
  significance <- as.data.frame(
    sapply(transformation.names, function(trans) 
      sapply(univariable.L1, function(univar)
        univar$vars.univariable[[var]][4,trans]
      )
    )
  )
  
  # Transform List object to Matrix
  pvalues <- matrix(unlist(pvalues), ncol=ncol(pvalues), byrow=F)
  colnames(pvalues) <- transformation.names;
  
  # Rank based on pvalues
  ranks <- matrix(apply(pvalues, 1, rank, ties.method="average"), ncol=ncol(pvalues), byrow=T);
  colnames(ranks) <- colnames(pvalues);
  avgrank <- apply(ranks, 2, mean);
  
  # Select best performing transformation
  selected <- names(avgrank)[which.min(avgrank)];
  selected.transformation <- vars.information.L1[[var]]$Transformations[selected];
  selected.avgpvalue <- mean(pvalues[, selected]);
  selected.significance <- "";
  if(selected.avgpvalue<max(significance.levels)){
    selected.significance <- str_c(rep("*", tail(which(selected.avgpvalue<significance.levels), n=1)), collapse="");
  }
  
  # Save to vars.information object
  vars.information.L1[[var]]$LogLik <- loglik;
  vars.information.L1[[var]]$Pvalues <- pvalues;
  vars.information.L1[[var]]$Significance <- significance;
  vars.information.L1[[var]]$Ranks <- ranks;
  vars.information.L1[[var]]$AvgRank <- avgrank;
  
  vars.information.L1[[var]]$SelectedTransformation <- selected.transformation;
  vars.information.L1[[var]]$SelectedPvalue <- selected.avgpvalue;
  vars.information.L1[[var]]$SelectedSignificance <- selected.significance;
  
}


# Generate summary Table
vars.univariable.L1 <- t(sapply(vars.information.L1, function(var) c(
  trans = names(var$SelectedTransformation),
  pvalue = var$SelectedPvalue,
  sign = var$SelectedSignificance
))); 

# View results
kable(vars.univariable.L1);
trans pvalue sign
Age Poly2 0.00543870740945696 ***
Gender.Female Binary.FemaleYN 0.0413078627543801 **
Hospital.Private Binary.PrivateYN 0.132221927655926
Stage.Diagnosis Binary.Stage4YN 0.0180820041944674 **
Primary.Site Factor.GroupColMul.GroupLefRec 0.00371107490900566 ***
Metastases.Number Factor.Group3plus 0.40180927502595
Metastases.Liver Binary 0.0548355494244619 *
Metastases.Peritoneum Binary 0.108781784235471
L1.ECOG Factor.Group23 8.5890183854076e-12 ***
Charlson Binary.1plusYN 0.164886584017105
CEA.Level Binary.200plus 0.00551021390960015 ***
RAS.Mutated Binary.MutatedYN 0.208343809880177
BRAF.Mutated Binary.MutatedYN 0.152022282845375

Step 3.5: selecting the transformation for progression-free survival (Model L1-Event)

To distinguish between the two competing events of physisician-defined progression death for first-line progression-free survival, a logistic regression model is needed (Model L1-Event). For this model, all best-performing transformations indentified for corresponding Model L1-Time were used, so only the best transformation for the additional variable relating to the time-to-event needs to be selected.

# Define a binary variable 'Death' to represent whether first-line progression refered to Death (1), Physician-defined progression (0), or the patient was censored (NA)
Death <- ifelse(datasets[[1]]$L1.PFS.Events=="Death", 1, ifelse(datasets[[1]]$L1.PFS.Events=="Progression", 0, NA));

# Fit the different transformations
summary(glm(Death ~ 1))
## 
## Call:
## glm(formula = Death ~ 1)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.1023  -0.1023  -0.1023  -0.1023   0.8977  
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.10234    0.01065    9.61   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.09198216)
## 
##     Null deviance: 74.506  on 810  degrees of freedom
## Residual deviance: 74.506  on 810  degrees of freedom
##   (56 observations deleted due to missingness)
## AIC: 369.34
## 
## Number of Fisher Scoring iterations: 2
summary(glm(Death ~ sqrt(datasets[[1]]$L1.PFS.Time)))
## 
## Call:
## glm(formula = Death ~ sqrt(datasets[[1]]$L1.PFS.Time))
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -0.21793  -0.11542  -0.09612  -0.07337   0.95884  
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)   
## (Intercept)                     0.029326   0.027825   1.054  0.29222   
## sqrt(datasets[[1]]$L1.PFS.Time) 0.004473   0.001576   2.838  0.00465 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.09118778)
## 
##     Null deviance: 74.506  on 810  degrees of freedom
## Residual deviance: 73.771  on 809  degrees of freedom
##   (56 observations deleted due to missingness)
## AIC: 363.31
## 
## Number of Fisher Scoring iterations: 2
summary(glm(Death ~ log(datasets[[1]]$L1.PFS.Time)))
## 
## Call:
## glm(formula = Death ~ log(datasets[[1]]$L1.PFS.Time))
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -0.12443  -0.10784  -0.10247  -0.09366   0.93415  
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)
## (Intercept)                     0.04526    0.06323   0.716    0.474
## log(datasets[[1]]$L1.PFS.Time)  0.01058    0.01155   0.916    0.360
## 
## (Dispersion parameter for gaussian family taken to be 0.09200045)
## 
##     Null deviance: 74.506  on 810  degrees of freedom
## Residual deviance: 74.428  on 809  degrees of freedom
##   (56 observations deleted due to missingness)
## AIC: 370.5
## 
## Number of Fisher Scoring iterations: 2
summary(glm(Death ~ poly(datasets[[1]]$L1.PFS.Time, 1)))
## 
## Call:
## glm(formula = Death ~ poly(datasets[[1]]$L1.PFS.Time, 1))
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -0.35633  -0.11251  -0.08696  -0.06513   0.95045  
## 
## Coefficients:
##                                    Estimate Std. Error t value Pr(>|t|)
## (Intercept)                         0.10236    0.01054   9.710  < 2e-16
## poly(datasets[[1]]$L1.PFS.Time, 1)  1.31589    0.31269   4.208 2.86e-05
##                                       
## (Intercept)                        ***
## poly(datasets[[1]]$L1.PFS.Time, 1) ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.09012303)
## 
##     Null deviance: 74.506  on 810  degrees of freedom
## Residual deviance: 72.910  on 809  degrees of freedom
##   (56 observations deleted due to missingness)
## AIC: 353.78
## 
## Number of Fisher Scoring iterations: 2
summary(glm(Death ~ poly(datasets[[1]]$L1.PFS.Time, 2)))
## 
## Call:
## glm(formula = Death ~ poly(datasets[[1]]$L1.PFS.Time, 2))
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -0.62677  -0.09385  -0.08514  -0.08335   0.91693  
## 
## Coefficients:
##                                     Estimate Std. Error t value Pr(>|t|)
## (Intercept)                           0.1026     0.0105   9.764  < 2e-16
## poly(datasets[[1]]$L1.PFS.Time, 2)1   1.3072     0.3115   4.196 3.02e-05
## poly(datasets[[1]]$L1.PFS.Time, 2)2   0.8118     0.3048   2.664  0.00789
##                                        
## (Intercept)                         ***
## poly(datasets[[1]]$L1.PFS.Time, 2)1 ***
## poly(datasets[[1]]$L1.PFS.Time, 2)2 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.08944915)
## 
##     Null deviance: 74.506  on 810  degrees of freedom
## Residual deviance: 72.275  on 808  degrees of freedom
##   (56 observations deleted due to missingness)
## AIC: 348.69
## 
## Number of Fisher Scoring iterations: 2
summary(glm(Death ~ poly(datasets[[1]]$L1.PFS.Time, 3)))
## 
## Call:
## glm(formula = Death ~ poly(datasets[[1]]$L1.PFS.Time, 3))
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -0.48879  -0.10220  -0.07718  -0.06743   0.93404  
## 
## Coefficients:
##                                     Estimate Std. Error t value Pr(>|t|)
## (Intercept)                          0.10303    0.01045   9.856  < 2e-16
## poly(datasets[[1]]$L1.PFS.Time, 3)1  1.33858    0.31025   4.315  1.8e-05
## poly(datasets[[1]]$L1.PFS.Time, 3)2  0.83827    0.30347   2.762  0.00587
## poly(datasets[[1]]$L1.PFS.Time, 3)3 -0.90747    0.30755  -2.951  0.00326
##                                        
## (Intercept)                         ***
## poly(datasets[[1]]$L1.PFS.Time, 3)1 ***
## poly(datasets[[1]]$L1.PFS.Time, 3)2 ** 
## poly(datasets[[1]]$L1.PFS.Time, 3)3 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.08860409)
## 
##     Null deviance: 74.506  on 810  degrees of freedom
## Residual deviance: 71.504  on 807  degrees of freedom
##   (56 observations deleted due to missingness)
## AIC: 341.99
## 
## Number of Fisher Scoring iterations: 2

Step 3.6: selecting the transformation for overal survival following first-line progression (Model LX-Time)

Selection of the best-performing transformation for Model LX-Time was done in the same way as for Model L1-Tim. The only difference is that an additional variable representing the first-line progression-free survival was included.

# Candidate variables, including their importance and potential transformations
vars.information.LX <- list(
  
  # Progression-free survival
  PFS = list(Transformations = c(Log = "log(L1.PFS.Time)", SQRT = "sqrt(L1.PFS.Time)", Poly1 = "poly(L1.PFS.Time, 1)", Poly2 = "poly(L1.PFS.Time, 2)", Poly3 = "poly(L1.PFS.Time, 3)")),
  
  # Age at treatment start
  Age = list(Transformations = c(Log = "log(Age)", SQRT = "sqrt(Age)", Poly1 = "poly(Age, 1)",  Poly2 = "poly(Age, 2)",  Poly3 = "poly(Age, 3)")),
  
  # Gender
  Gender.Female = list(Transformations = c(Binary.FemaleYN = "Gender.Female")),
  
  # Hospital Type
  Hospital.Private = list(Transformations = c(Binary.PrivateYN = "Hospital.Private")),
  
  # Stage at Initial Diagnosis
  Stage.Diagnosis = list(Transformations = c(Binary.Stage4YN = "as.factor(as.character(Stage.Diagnosis)=='4')")),
  
  # Primary Tumour Site
  Primary.Site = list(Transformations = c(Factor.GroupColMul = "as.factor(ifelse(Primary.Site %in% c('Colon','Multiple'), 'ColonMultiple', Primary.Site))", Factor.GroupColMul.GroupLefRec = "as.factor(ifelse(Primary.Site %in% c('Colon','Multiple'), 'ColonMultiple', ifelse(Primary.Site %in% c('Left','Rectum'), 'LeftRectum', Primary.Site)))")),
  
  # Number of Metastases
  Metastases.Number = list(Transformations = c(Factor.Group3plus = "as.factor(ifelse(Metastases.Number>=3, 3, Metastases.Number))")),
  
  # Liver Metastases
  Metastases.Liver = list(Transformations = c(Binary = "Metastases.Liver")),
  
  # Peritoneum Metastases
  Metastases.Peritoneum = list(Transformations = c(Binary = "Metastases.Peritoneum")),
  
  # ECOG Performance Status
  L1.ECOG = list(Transformations = c(Factor.Group23 = "as.factor(ifelse(L1.ECOG==0, 0, ifelse(L1.ECOG==1, 1, 2)))", Binary.2plusYN = "as.factor(ifelse(L1.ECOG %in% c(0,1), F, T))")),
  
  # Charlson Score
  Charlson = list(Transformations = c(Factor.Group2plus = "as.factor(ifelse(Charlson>=2, 2, Charlson))", Factor.Group01.Group2plus = "as.factor(ifelse(Charlson %in% c(0,1), 1, ifelse(Charlson>=2, 2, Charlson)))", Binary.1plusYN = "as.factor(ifelse(Charlson>=1, T, F))")),
  
  # CEA Level
  CEA.Level = list(Transformations = c(Binary.200plus = "as.factor(ifelse(Metastases.Number<=2, T, F) * ifelse(CEA.Level>=200, T, F))")),
  
  # RAS Status
  RAS.Mutated = list(Transformations = c(Binary.MutatedYN = "RAS.Mutated")),
  
  # BRAF Status
  BRAF.Mutated = list(Transformations = c(Binary.MutatedYN = "BRAF.Mutated"))
  
);


# For each imputed dataset, select distribution and perform univariable analysis
univariable.LX <- lapply(1:length(datasets), function(dataset) {
  
  # Retrieve datasets
  data.surv <- datasets[[dataset]];
  data.surv.LX <- filter(data.surv, LX.OS.Include);
  
  # For each variable
  vars.univariable <- lapply(vars.information.LX, function(var) {
    
    # Consider all potential transformations
    sapply(var$Transformations, function(var.form) {
      
      # Define formula to be fitted
      f <- as.formula(str_c("Surv(time=LX.OS.Time, event=LX.OS.Event) ~ ", var.form, collapse=""));
      
      # Fit model
      fit <- survreg(formula=f, data=data.surv.LX, dist=dist.selected.LX);
      
      # Perform Wald-test for significance
      test <- wald.test(Sigma=fit$var, b=c(fit$coefficients, fit$scale), Terms=c(2:length(fit$coefficients)))
      
      # Determine significance
      pvalue <- test$result$chi2["P"];
      significance <- "";
      if(pvalue<max(significance.levels)) {
        significance <- str_c(rep("*", tail(which(pvalue<significance.levels), n=1)), collapse="");
      }
      
      # Return transformation and model performance
      c(trans=f, ll=fit$loglik[2], pvalue=pvalue, sign=significance)
      
    })
    
  });
  
  # Return outcomes
  return(list(vars.univariable=vars.univariable));
  
})


# Add summary for variables to vars.information object
for(var in names(vars.information.LX)) {
  
  # Get names of transformations
  transformation.names <- names(vars.information.LX[[var]]$Transformations);
  
  # Extract data from univariable object
  loglik <- as.data.frame(
    sapply(transformation.names, function(trans) 
      sapply(univariable.LX, function(univar)
        univar$vars.univariable[[var]][2,trans]
      )
    )
  )
  pvalues <- as.data.frame(
    sapply(transformation.names, function(trans) 
      sapply(univariable.LX, function(univar)
        univar$vars.univariable[[var]][3,trans]
      )
    )
  )
  significance <- as.data.frame(
    sapply(transformation.names, function(trans) 
      sapply(univariable.LX, function(univar)
        univar$vars.univariable[[var]][4,trans]
      )
    )
  )
  
  # Transform List object to Matrix
  pvalues <- matrix(unlist(pvalues), ncol=ncol(pvalues), byrow=F)
  colnames(pvalues) <- transformation.names;
  
  # Rank based on pvalues
  ranks <- matrix(apply(pvalues, 1, rank, ties.method="average"), ncol=ncol(pvalues), byrow=T);
  colnames(ranks) <- colnames(pvalues);
  avgrank <- apply(ranks, 2, mean);
  
  # Select best performing transformation
  selected <- names(avgrank)[which.min(avgrank)];
  selected.transformation <- vars.information.LX[[var]]$Transformations[selected];
  selected.avgpvalue <- mean(pvalues[, selected]);
  selected.significance <- "";
  if(selected.avgpvalue<max(significance.levels)){
    selected.significance <- str_c(rep("*", tail(which(selected.avgpvalue<significance.levels), n=1)), collapse="");
  }
  
  # Save to vars.information object
  vars.information.LX[[var]]$LogLik <- loglik;
  vars.information.LX[[var]]$Pvalues <- pvalues;
  vars.information.LX[[var]]$Significance <- significance;
  vars.information.LX[[var]]$Ranks <- ranks;
  vars.information.LX[[var]]$AvgRank <- avgrank;
  
  vars.information.LX[[var]]$SelectedTransformation <- selected.transformation;
  vars.information.LX[[var]]$SelectedPvalue <- selected.avgpvalue;
  vars.information.LX[[var]]$SelectedSignificance <- selected.significance;
  
}


# Generate summary Table
vars.univariable.LX <- t(sapply(vars.information.LX, function(var) c(
  trans = names(var$SelectedTransformation),
  pvalue = var$SelectedPvalue,
  sign = var$SelectedSignificance
))); 

# View results
kable(vars.univariable.LX);
trans pvalue sign
PFS Log 0 ***
Age Poly2 1.47682807410954e-05 ***
Gender.Female Binary.FemaleYN 0.0293104401172259 **
Hospital.Private Binary.PrivateYN 0.00854917648739073 ***
Stage.Diagnosis Binary.Stage4YN 0.464603901073672
Primary.Site Factor.GroupColMul.GroupLefRec 0.0240492505778438 **
Metastases.Number Factor.Group3plus 0.678845101916153
Metastases.Liver Binary 0.719065221171095
Metastases.Peritoneum Binary 0.0120556554881923 **
L1.ECOG Factor.Group23 0 ***
Charlson Binary.1plusYN 0.912530732728782
CEA.Level Binary.200plus 0.267061210740859
RAS.Mutated Binary.MutatedYN 0.0110552668651976 **
BRAF.Mutated Binary.MutatedYN 0.0330294095990516 **

Step 4: Multivariable Modeling of Survival

Multivariable (survival) models are fitted to reflect patient heterogeneity. A variable selection procedure is implemented that considers both clinical relevance and statistical performance. In total, the multivariable modeling exists of three steps:

  1. Defining functions used for fitting the multivariable models
  2. Defining the input for the multivariable modeling process
  3. Fitting the multivariable models

Step 4.1: defining functions used for fitting the multivariable models

Several custom functions are defined to perform the variable selection procedure that considers both clinical relevance and statistical performance. Please find an explanation of the functions in the R code below.

# The 'getSurvDataset.L1' and 'getSurvDataset.LX' functions take the cleaned data and transform these in a data.frame 
# dat can be used to estimate (survival) models. This is necessary, because the cleaned data needs to be transformed 
# according to the selected transformations and treatment interaction terms need to be calculated accordingly
getSurvDataset.L1 <- function(data) {
  
  data <- data %>% 
    
    ## Single variables
    
    # Event
    mutate(Event = L1.PFS.Event) %>%
    
    # Events
    mutate(Events = L1.PFS.Events) %>%
    
    # Events.01
    mutate(Events.01 = ifelse(Events=="Death", 1, NA)) %>%
    mutate(Events.01 = ifelse(Events=="Progression", 0, Events.01)) %>%
    
    # L1.PFS.Time = "Time",
    mutate(Time = L1.PFS.Time) %>%
    mutate(Time1 = L1.PFS.Time) %>%
    mutate(Time2 = L1.PFS.Time^2) %>%
    mutate(Time3 = L1.PFS.Time^3) %>%
    mutate(Time4 = L1.PFS.Time^4) %>%
    
    # Arm.ChemBev = "ChemBev",
    mutate(ChemBev = as.numeric(as.logical(L1.Arm.ChemBev))) %>%
    
    #  Age = "Age1 + Age2",
    mutate(Age1 = Age) %>%
    mutate(Age2 = Age1^2) %>%
    
    #  Gender.Female = "Female.YN",
    mutate(Female.YN = as.numeric(as.logical(Gender.Female))) %>%
    
    #  Hospital.Private = "Private.YN",
    mutate(Private.YN = as.numeric(as.logical(Hospital.Private))) %>%
    
    #  Stage.Diagnosis = "Stage4.YN",
    mutate(Stage4.YN = ifelse(as.character(Stage.Diagnosis)=="4", 1, 0)) %>%
    
    #  Primary.Site = "Site.Left + Site.Right + Site.Rectum",
    mutate(Site.LeftRectum = ifelse(Primary.Site %in% c("Left","Rectum"), 1, 0)) %>%
    mutate(Site.Right = ifelse(Primary.Site=="Right", 1, 0)) %>%
    
    #  Metastases.Number = "Metastases.2 + Metastases.3plus",
    mutate(Metastases.2 = ifelse(as.numeric(as.character(Metastases.Number))==2, 1, 0)) %>%
    mutate(Metastases.3plus = ifelse(as.numeric(as.character(Metastases.Number))>=3, 1, 0)) %>%
    
    #  Metastases.Liver = "Liver.YN",
    mutate(Liver.YN = as.numeric(as.logical(Metastases.Liver))) %>%
    
    #  Metastases.Peritoneum = "Peritoneum.YN",
    mutate(Peritoneum.YN = as.numeric(as.logical(Metastases.Peritoneum))) %>%
    
    #  L1.ECOG = "ECOG.1 + ECOG.23",
    mutate(ECOG.1 = ifelse(as.numeric(as.character(L1.ECOG))==1, 1, 0)) %>%
    mutate(ECOG.23 = ifelse(as.numeric(as.character(L1.ECOG))>=2, 1, 0)) %>%
    
    #  Charlson = "Charlson.1 + Charlson.2 + Charlson.3plus",
    mutate(Charlson.1plus = ifelse(as.numeric(as.character(Charlson))>=1, 1, 0)) %>%
    
    #  CEA.Level = "CEA.200plus",
    mutate(CEA.200plus = ifelse(as.numeric(as.character(Metastases.Number))<=2, 1, 0) * ifelse(as.numeric(CEA.Level)>=200, 1, 0)) %>%
    
    #  RAS.Mutated = "MutantRAS.YN",
    mutate(MutantRAS.YN = as.numeric(as.logical(RAS.Mutated))) %>%
    
    #  BRAF.Mutated = "MutantBRAF.YN",
    mutate(MutantBRAF.YN = as.numeric(as.logical(BRAF.Mutated))) %>%
    
    
    ## Interactions
    mutate(Time1.Int = Time1 * ChemBev) %>%
    mutate(Time2.Int = Time2 * ChemBev) %>%
    mutate(Time3.Int = Time3 * ChemBev) %>%
    mutate(Time4.Int = Time4 * ChemBev) %>%
    mutate(Age1.Int = Age1 * ChemBev) %>%
    mutate(Age2.Int = Age2 * ChemBev) %>%
    mutate(Female.YN.Int = Female.YN * ChemBev) %>%
    mutate(Private.YN.Int = Private.YN * ChemBev) %>%
    mutate(Stage4.YN.Int = Stage4.YN * ChemBev) %>%
    mutate(Site.LeftRectum.Int = Site.LeftRectum * ChemBev) %>%
    mutate(Site.Right.Int = Site.Right * ChemBev) %>%
    mutate(Metastases.2.Int = Metastases.2 * ChemBev) %>%
    mutate(Metastases.3plus.Int = Metastases.3plus * ChemBev) %>%
    mutate(Liver.YN.Int = Liver.YN * ChemBev) %>%
    mutate(Peritoneum.YN.Int = Peritoneum.YN * ChemBev) %>%
    mutate(ECOG.1.Int = ECOG.1 * ChemBev) %>%
    mutate(ECOG.23.Int = ECOG.23 * ChemBev) %>%
    mutate(Charlson.1plus.Int = Charlson.1plus * ChemBev) %>%
    mutate(CEA.200plus.Int = CEA.200plus * ChemBev) %>%
    mutate(MutantRAS.YN.Int = MutantRAS.YN * ChemBev) %>%
    mutate(MutantBRAF.YN.Int = MutantBRAF.YN * ChemBev)
  
  
  # Only the newly created columns need to be returned
  return(select(data, Event:MutantBRAF.YN.Int))
  
}
getSurvDataset.LX <- function(data) {
  
  data <- data %>% 
    
    ## Single variables
    
    # L1.OS.Event = Event,
    mutate(Event = LX.OS.Event) %>%
    
    # L1.OS.Time = "Time",
    mutate(Time = LX.OS.Time) %>%
    
    # L1.PFS.Time = "PFS.Time",
    mutate(PFS.Time = log(L1.PFS.Time)) %>%
    
    # Arm.ChemBev = "ChemBev",
    mutate(ChemBev = as.numeric(as.logical(L1.Arm.ChemBev))) %>%
    
    #  Age = "Age1 + Age2",
    mutate(Age1 = as.numeric(Age)-mean(Age)) %>%
    mutate(Age2 = as.numeric(Age1^2)) %>%
    
    #  Gender.Female = "Female.YN",
    mutate(Female.YN = as.numeric(as.logical(Gender.Female))) %>%
    
    #  Hospital.Private = "Private.YN",
    mutate(Private.YN = as.numeric(as.logical(Hospital.Private))) %>%
    
    #  Stage.Diagnosis = "Stage4.YN",
    mutate(Stage4.YN = ifelse(as.numeric(Stage.Diagnosis)==4, 1, 0)) %>%
    
    #  Primary.Site = "Site.Left + Site.Right + Site.Rectum",
    mutate(Site.LeftRectum = ifelse(Primary.Site %in% c("Left","Rectum"), 1, 0)) %>%
    mutate(Site.Right = ifelse(Primary.Site=="Right", 1, 0)) %>%
    
    #  Metastases.Number = "Metastases.2 + Metastases.3plus",
    mutate(Metastases.2 = ifelse(as.numeric(Metastases.Number)==2, 1, 0)) %>%
    mutate(Metastases.3plus = ifelse(as.numeric(Metastases.Number)>=3, 1, 0)) %>%
    
    #  Metastases.Liver = "Liver.YN",
    mutate(Liver.YN = as.numeric(as.logical(Metastases.Liver))) %>%
    
    #  Metastases.Peritoneum = "Peritoneum.YN",
    mutate(Peritoneum.YN = as.numeric(as.logical(Metastases.Peritoneum))) %>%
    
    #  L1.ECOG = "ECOG.1 + ECOG.23",
    mutate(ECOG.1 = ifelse(as.numeric(as.character(L1.ECOG))==1, 1, 0)) %>%
    mutate(ECOG.23 = ifelse(as.numeric(as.character(L1.ECOG))>=2, 1, 0)) %>%
    
    #  Charlson = "Charlson.3plus.YN",
    mutate(Charlson.1plus.YN = ifelse(as.numeric(Charlson)>=1, 1, 0)) %>%
    
    #  CEA.Level = "CEA.200plus",
    mutate(CEA.200plus = ifelse(as.numeric(Metastases.Number)<=2, 1, 0) * ifelse(as.numeric(CEA.Level)>=200, 1, 0)) %>%
    
    #  RAS.Mutated = "MutantRAS.YN",
    mutate(MutantRAS.YN = as.numeric(as.logical(RAS.Mutated))) %>%
    
    #  BRAF.Mutated = "MutantBRAF.YN",
    mutate(MutantBRAF.YN = as.numeric(as.logical(BRAF.Mutated))) %>%
    
    
    ## Interactions
    mutate(PFS.Time.Int = PFS.Time * ChemBev) %>%
    mutate(Age1.Int = Age1 * ChemBev) %>%
    mutate(Age2.Int = Age2 * ChemBev) %>%
    mutate(Female.YN.Int = Female.YN * ChemBev) %>%
    mutate(Private.YN.Int = Private.YN * ChemBev) %>%
    mutate(Stage4.YN.Int = Stage4.YN * ChemBev) %>%
    mutate(Site.LeftRectum.Int = Site.LeftRectum * ChemBev) %>%
    mutate(Site.Right.Int = Site.Right * ChemBev) %>%
    mutate(Metastases.2.Int = Metastases.2 * ChemBev) %>%
    mutate(Metastases.3plus.Int = Metastases.3plus * ChemBev) %>%
    mutate(Liver.YN.Int = Liver.YN * ChemBev) %>%
    mutate(Peritoneum.YN.Int = Peritoneum.YN * ChemBev) %>%
    mutate(ECOG.1.Int = ECOG.1 * ChemBev) %>%
    mutate(ECOG.23.Int = ECOG.23 * ChemBev) %>%
    mutate(Charlson.1plus.YN.Int = Charlson.1plus.YN * ChemBev) %>%
    mutate(CEA.200plus.Int = CEA.200plus * ChemBev) %>%
    mutate(MutantRAS.YN.Int = MutantRAS.YN * ChemBev) %>%
    mutate(MutantBRAF.YN.Int = MutantBRAF.YN * ChemBev)
  
  # Only the newly created columns need to be returned
  return(select(data, Event:MutantBRAF.YN.Int))
  
}

# The 'avgSurvivalModel' and 'avgLogisticModel' functions estimate (survival) models based on a number of datasets and
# combine these separate models into one model using Rubin's rule so that a p-value can be calculated for all models
# combined in the variable selection procedure
avgSurvivalModel <- function(formula, datasets, distribution, conf.int=F) {
  
  # Fit models for each dataset
  models <- lapply(datasets, function(dataset) {
    
    model <- survreg(formula=as.formula(formula), data=dataset, dist=distribution);
    
    list(coef=c(model$coefficients, 'Log(Scale)'=log(model$scale)), sigma=model$var);
    
  })
  
  # Coefficient Estimates
  m <- length(models);
  Q <- sapply(models, function(model) model$coef);
  Qm <- matrix(rowMeans(Q), nrow=1);
  
  # Within Variance
  U <- lapply(models, function(model) model$sigma);
  Um <- Reduce('+', U) / m;
  
  # Between variance
  B <- lapply(1:m, function(i) {
    
    Q <- matrix(Q[,i], nrow=1);
    
    t((Q-Qm)) %*% (Q-Qm) / (m-1)
    
  });
  Bm <- Reduce('+', B);
  
  # Total Variance
  Tm <- Um + (1+(1/m))*Bm;
  
  # Save the Total Variance-Covariance matrix
  S <- Tm;
  
  # Provide proper format and naming
  Qm <- as.numeric(Qm);
  names(Qm) <- rownames(Q);
  
  # Confidence intervals for coefficients
  if(conf.int) {
    
    # Extract diagonals of variance matrices to get variance vectors
    Um <- diag(Um);
    Bm <- diag(Bm);
    Tm <- diag(Tm);
    
    # Degrees of freedom
    rm <- (1+(1/m)) * Bm / Um;
    v <- (m-1)*(1+(1/rm))^2;
    
    # Determine 95% confidence interval bounds
    lower <- Qm + qt(p=0.025, df=v)*sqrt(Tm);
    upper <- Qm + qt(p=0.975, df=v)*sqrt(Tm);
    
  }
  
  if(conf.int) {
    return(list(coef=Qm, sigma=S, lower=lower, upper=upper))
  } else {
    return(list(coef=Qm, sigma=S))
  }
}
avgLogisticModel <- function(formula, datasets, conf.int=F) {
  
  # Fit models for each dataset
  models <- lapply(datasets, function(dataset) {
    
    model <- glm(formula=as.formula(formula), data=dataset, family=binomial(link="logit"));
    
    list(coef=c(model$coefficients), sigma=vcov(model));
    
  })
  
  # Coefficient Estimates
  m <- length(models);
  Q <- sapply(models, function(model) model$coef);
  Qm <- matrix(rowMeans(Q), nrow=1);
  
  # Within Variance
  U <- lapply(models, function(model) model$sigma);
  Um <- Reduce('+', U) / m;
  
  # Between variance
  B <- lapply(1:m, function(i) {
    
    Q <- matrix(Q[,i], nrow=1);
    
    t((Q-Qm)) %*% (Q-Qm) / (m-1)
    
  });
  Bm <- Reduce('+', B);
  
  # Total Variance
  Tm <- Um + (1+(1/m))*Bm;
  
  # Save the Total Variance-Covariance matrix
  S <- Tm;
  
  # Provide proper format and naming
  Qm <- as.numeric(Qm);
  names(Qm) <- rownames(Q);
  
  # Confidence intervals for coefficients
  if(conf.int) {
    
    # Extract diagonals of variance matrices to get variance vectors
    Um <- diag(Um);
    Bm <- diag(Bm);
    Tm <- diag(Tm);
    
    # Degrees of freedom
    rm <- (1+(1/m)) * Bm / Um;
    v <- (m-1)*(1+(1/rm))^2;
    
    # Determine 95% confidence interval bounds
    lower <- Qm + qt(p=0.025, df=v)*sqrt(Tm);
    upper <- Qm + qt(p=0.975, df=v)*sqrt(Tm);
    
  }
  
  if(conf.int) {
    return(list(coef=Qm, sigma=S, lower=lower, upper=upper))
  } else {
    return(list(coef=Qm, sigma=S))
  }
  
}

# The 'survivalBackward', 'survivalForward', and 'logisticForward' functions perform a backward or forward selection
# procedure based on a set of basis variables and a set of candidate variables
survivalBackward <- function(datasets, time, event, vars.include, vars.selection, vars.formulas, pvalue, dist) {
  
  # Extract all used columns from the provided input
  columns.used <- sapply(c(vars.include, vars.selection), function(var) {
    formula.complete <- vars.formulas[var];
    formula.splitted <- str_split(string=formula.complete, pattern=fixed("+"), simplify=T);
    formula.nospaces <- str_trim(string=formula.splitted);
  });
  
  # Objects to record process
  vars.unselected <- c();
  continue <- TRUE;
  
  # Backward selection
  while(continue) {
    
    # Get formula according to the remaining backward selection candidates
    formula.current <- str_c(
      str_c(c("Surv(", time, ", ", event, ") ~ "), collapse=""),  
      str_c(vars.formulas[c(vars.include, vars.selection)], collapse=" + ")
    );
    
    # Fit model with backward candidates
    model.current <- avgSurvivalModel(formula=formula.current, datasets=datasets, distribution=dist);
    
    # Get pvalue for each variable
    coefficient.indexes <- lapply(vars.selection, function(var) which(names(model.current$coef) %in% columns.used[[var]]));
    names(coefficient.indexes) <- vars.selection;
    pvalues <- sapply(vars.selection, function(var) {
      testresult <- wald.test(Sigma = model.current$sigma,
                              b = model.current$coef,
                              Terms = coefficient.indexes[[var]]
      )$result$chi2["P"];
      
      unname(testresult)
    }); # sapply pvalues 

    # Get highes p value
    pvalue.max <- max(pvalues);
    
    # If the highest p value is higher than the threshold, remove this one
    if(pvalue.max > pvalue) {
      
      # Get the name of the variable with the highest pvalue
      pvalue.index <- which.max(pvalues);
      
      # Temporary list of selection candidates without the selected variable
      vars.selection.tmp <- vars.selection[-pvalue.index];
      
      # Store unselected candidates
      vars.unselected <- c(vars.unselected, vars.selection[pvalue.index]);
      
      # If no backward selection candidates remain, stop backward selection
      if(length(vars.selection.tmp)<1) {
        
        vars.selection <- vars.selection.tmp;
        continue <- F;
        
      } else {
        
        # Update backward selection candidates and unselected candidates from temporary vectors
        vars.selection <- vars.selection.tmp;
        
      }
      
    } else {
      
      # If no pvalue higher than thershold, stop backward selection
      continue <- F;
      
    }
  } # while continue
  
  # Return outcomes
  return(list(included = c(vars.include, vars.selection),
              excluded = vars.unselected))
  
}
survivalForward <- function(datasets, time, event, vars.include, vars.selection, vars.formulas, pvalue, dist) {
  
  # Extract all used columns from the provided input
  columns.used <- sapply(c(vars.include, vars.selection), function(var) {
    formula.complete <- vars.formulas[var];
    formula.splitted <- str_split(string=formula.complete, pattern=fixed("+"), simplify=T);
    formula.nospaces <- str_trim(string=formula.splitted);
  });

  # Objects to record process
  continue <- TRUE;
  
  # Backward selection
  while(continue) {
    
    # Get pvalue for each variable
    pvalues <- sapply(vars.selection, function(var) {
      
      # Get formula according to the current forward selection candidates
      formula.current <- str_c(
        str_c(c("Surv(", time, ", ", event, ") ~ "), collapse=""),  
        str_c(vars.formulas[c(vars.include, var)], collapse=" + ")
      );
      
      # Fit model with backward candidates
      model.current <- avgSurvivalModel(formula=formula.current, datasets=datasets, distribution=dist);
      
      
      
      # Get pvalue for this variable
      coefficient.indexes <- which(names(model.current$coef) %in% columns.used[[var]])
      test.result <- wald.test(Sigma = model.current$sigma,
                               b = model.current$coef,
                               Terms = coefficient.indexes
      )$result$chi2["P"]
      
      unname(test.result)
      
    })

    # Get lowest p value
    pvalue.min <- min(pvalues);
    
    # If the highest p value is lower than the threshold, add this one
    if(pvalue.min < pvalue) {
      
      # Get the name of the variable with the lowest pvalue
      pvalue.index <- which.min(pvalues);
      
      # Update variable lists
      vars.include <- c(vars.include, vars.selection[pvalue.index]);
      vars.selection <- vars.selection[-pvalue.index];
      
      # If no forward selection candidates remain, stop backward selection
      if(length(vars.selection)<1) {
        
        continue <- F;
        
      } 
      
    } else {
      
      # If no pvalue higher than thershold, stop forward selection
      continue <- F;
      
    }
  } # while !stop
  
  
  # Return outcomes
  return(list(included = vars.include,
              excluded = vars.selection))
  
}
logisticForward <- function(datasets, events, vars.include, vars.selection, vars.formulas, pvalue) {
  
  # Extract all used columns from the provided input
  columns.used <- sapply(c(vars.include, vars.selection), function(var) {
    formula.complete <- vars.formulas[var];
    formula.splitted <- str_split(string=formula.complete, pattern=fixed("+"), simplify=T);
    formula.nospaces <- str_trim(string=formula.splitted);
  });
  
  # Objects to record process
  continue <- TRUE;
  
  # Backward selection
  while(continue) {
    
    # Get pvalue for each variable
    pvalues <- sapply(vars.selection, function(var) {
      
      # Get formula according to the current forward selection candidates
      formula.current <- str_c(
        str_c(c(events, " ~ "), collapse=""),  
        str_c(vars.formulas[c(vars.include, var)], collapse=" + ")
      );
      
      # Fit model with backward candidates
      model.current <- avgLogisticModel(formula=formula.current, datasets=datasets);
      
      
      
      # Get pvalue for this variable
      coefficient.indexes <- which(names(model.current$coef) %in% columns.used[[var]])
      test.result <- wald.test(Sigma = model.current$sigma,
                               b = model.current$coef,
                               Terms = coefficient.indexes
      )$result$chi2["P"]
      
      unname(test.result)
      
    })

    # Get lowest p value
    pvalue.min <- min(pvalues);
    
    # If the highest p value is lower than the threshold, add this one
    if(pvalue.min < pvalue) {
      
      # Get the name of the variable with the lowest pvalue
      pvalue.index <- which.min(pvalues);
      
      # Update variable lists
      vars.include <- c(vars.include, vars.selection[pvalue.index]);
      vars.selection <- vars.selection[-pvalue.index];
      
      # If no forward selection candidates remain, stop backward selection
      if(length(vars.selection)<1) {
        
        continue <- F;
        
      } 
      
    } else {
      
      # If no pvalue higher than thershold, stop forward selection
      continue <- F;
      
    }
  } # while !stop
  
  
  # Return outcomes
  return(list(included = vars.include,
              excluded = vars.selection))
  
}

# The 'fitLogisticModel' and 'fitSurvivalModel' functions are high-level functions that call the different selection
# procedures and return the final model based on the selected set of variables
fitLogisticModel <- function(datasets, events, pvalue, vars.include, vars.selection, vars.formulas) {
  
  result.forward <- logisticForward(datasets = datasets, events = events, vars.include = vars.include, vars.selection = vars.selection, vars.formulas = vars.formulas, pvalue = pvalue);
  
  formula.final <- str_c(
    str_c(c(events, " ~ "), collapse=""),  
    str_c(vars.formulas[result.forward$included], collapse=" + ")
  );
  model.final <- avgLogisticModel(formula = formula.final, datasets = datasets, conf.int = T)
  
  return(list(coef = model.final$coef, sigma = model.final$sigma, lower = model.final$lower, upper = model.final$upper, included = result.forward$included))
  
}
fitSurvivalModel <- function(datasets, time, event, dist, vars.include, vars.backward, vars.forward, vars.formulas, pvalue.backward, pvalue.forward, pvalue.interaction) {
  
  result.backward <- survivalBackward(datasets = datasets, time = time, event = event, dist = dist, vars.include = vars.include, vars.selection = vars.backward, vars.formulas = vars.formulas, pvalue = pvalue.backward)
  
  result.forward <- survivalForward(datasets = datasets, time = time, event = event, dist = dist, vars.include = result.backward$included, vars.selection = c(vars.forward, result.backward$excluded), vars.formulas = vars.formulas, pvalue = pvalue.forward)
  
  vars.include.interaction <- result.forward$included;
  vars.backward.interaction <- str_c(vars.include.interaction, ".Int");
  vars.backward.interaction <- vars.backward.interaction[-which(vars.backward.interaction=="Arm.ChemBev.Int")];
  
  result.interaction <- survivalBackward(datasets = datasets, time = time, event = event, dist = dist, vars.include = vars.include.interaction, vars.selection = vars.backward.interaction, vars.formulas = vars.formulas, pvalue = pvalue.interaction)
  
  formula.final <- str_c(
    str_c(c("Surv(", time, ", ", event, ") ~ "), collapse=""),  
    str_c(vars.formulas[result.interaction$included], collapse=" + ")
  );
  model.final <- avgSurvivalModel(formula = formula.final, datasets = datasets, distribution = "weibull", conf.int = T)
  
  return(list(coef = model.final$coef, sigma = model.final$sigma, lower = model.final$lower, upper = model.final$upper, included = result.interaction$included))
  
}

Step 4.2: defining the input for the multivariable modeling process

The selection procedures require specific input, which is defined here:

  • The formula for each variable, i.e. which columns in the dataset are used
  • The clinical relevance of the variables, i.e. whether they should be always included or are subject to a backward or forward selection procedure
  • Datasets in the appropriate format
# List object that contains an overview of all variables and the columns in the data that 
# they use for first-line progression-free survival (L1)
vars.formulas.L1 <- list(
  
  # Single variables
  L1.PFS.Time = "Time1 + Time2",
  Arm.ChemBev = "ChemBev",
  Age = "Age1 + Age2",
  Gender.Female = "Female.YN",
  Hospital.Private = "Private.YN",
  Stage.Diagnosis = "Stage4.YN",
  Primary.Site = "Site.LeftRectum + Site.Right",
  Metastases.Number = "Metastases.2 + Metastases.3plus",
  Metastases.Liver = "Liver.YN",
  Metastases.Peritoneum = "Peritoneum.YN",
  L1.ECOG = "ECOG.1 + ECOG.23",
  Charlson = "Charlson.1plus",
  CEA.Level = "CEA.200plus",
  RAS.Mutated = "MutantRAS.YN",
  BRAF.Mutated = "MutantBRAF.YN",
  
  
  # Interactions with treatment
  L1.PFS.Time.Int = "Time1.Int + Time2.Int",
  Age.Int = "Age1.Int + Age2.Int",
  Gender.Female.Int = "Female.YN.Int",
  Hospital.Private.Int = "Private.YN.Int",
  Stage.Diagnosis.Int = "Stage4.YN.Int",
  Primary.Site.Int = "Site.LeftRectum.Int + Site.Right.Int",
  Metastases.Number.Int = "Metastases.2.Int + Metastases.3plus.Int",
  Metastases.Liver.Int = "Liver.YN.Int",
  Metastases.Peritoneum.Int = "Peritoneum.YN.Int",
  L1.ECOG.Int = "ECOG.1.Int + ECOG.23.Int",
  Charlson.Int = "Charlson.1plus.Int",
  CEA.Level.Int = "CEA.200plus.Int",
  RAS.Mutated.Int = "MutantRAS.YN.Int",
  BRAF.Mutated.Int = "MutantBRAF.YN.Int"
);

# List object that contains an overview of all variables and the columns in the data that 
# they use for overall survival after first-line progression (LX)
vars.formulas.LX <- list(
  
  # Single variables
  PFS = "PFS.Time",
  Arm.ChemBev = "ChemBev",
  Age = "Age1 + Age2",
  Gender.Female = "Female.YN",
  Hospital.Private = "Private.YN",
  Stage.Diagnosis = "Stage4.YN",
  Primary.Site = "Site.LeftRectum + Site.Right",
  Metastases.Number = "Metastases.2 + Metastases.3plus",
  Metastases.Liver = "Liver.YN",
  Metastases.Peritoneum = "Peritoneum.YN",
  L1.ECOG = "ECOG.1 + ECOG.23",
  Charlson = "Charlson.1plus.YN",
  CEA.Level = "CEA.200plus",
  RAS.Mutated = "MutantRAS.YN",
  BRAF.Mutated = "MutantBRAF.YN",

  # Interactions with treatment
  PFS.Int = "PFS.Time.Int",
  Age.Int = "Age1.Int + Age2.Int",
  Gender.Female.Int = "Female.YN.Int",
  Hospital.Private.Int = "Private.YN.Int",
  Stage.Diagnosis.Int = "Stage4.YN.Int",
  Primary.Site.Int = "Site.LeftRectum.Int + Site.Right.Int",
  Metastases.Number.Int = "Metastases.2.Int + Metastases.3plus.Int",
  Metastases.Liver.Int = "Liver.YN.Int",
  Metastases.Peritoneum.Int = "Peritoneum.YN.Int",
  L1.ECOG.Int = "ECOG.1.Int + ECOG.23.Int",
  Charlson.Int = "Charlson.1plus.YN.Int",
  CEA.Level.Int = "CEA.200plus.Int",
  RAS.Mutated.Int = "MutantRAS.YN.Int",
  BRAF.Mutated.Int = "MutantBRAF.YN.Int"
);


# Indicate which variables should be included and which are subject to selection (backward vs. forward)
vars.include.L1 <- c("Arm.ChemBev", "Age", "L1.ECOG", "BRAF.Mutated");
vars.backward.L1 <- c("Stage.Diagnosis", "Primary.Site", "Metastases.Number", "Metastases.Liver",
                      "Metastases.Peritoneum", "Charlson");
vars.forward.L1 <- c("Gender.Female", "Hospital.Private", "CEA.Level", "RAS.Mutated");

# Indicate which variables should be included and which are subject to selection (backward vs. forward)
vars.include.LX <- c("Arm.ChemBev", "Age", "Primary.Site", "Metastases.Peritoneum", "L1.ECOG", "BRAF.Mutated");
vars.backward.LX <- c("PFS","Stage.Diagnosis", "Metastases.Number", "Metastases.Liver", "Charlson", "CEA.Level");
vars.forward.LX <- c("Gender.Female", "Hospital.Private", "RAS.Mutated");

# Transform cleaned datasets into appropriate format
datasets.surv.L1 <- lapply(datasets, function(dataset) getSurvDataset.L1(dataset));
datasets.surv.LX <- lapply(datasets, function(dataset) getSurvDataset.LX(filter(dataset, LX.OS.Include)));

Step 4.3: fitting the multivariable models

After defining all functions and input required for fitting the (survival) models, all procedures can be performed.

Model L1-Time

# Fit Model L1-Time
Model.L1.Time <- fitSurvivalModel(datasets = datasets.surv.L1, time = "Time", event = "Event", dist = "weibull", vars.include = vars.include.L1, vars.backward = vars.backward.L1, vars.forward = vars.forward.L1, vars.formulas = vars.formulas.L1, pvalue.backward = 0.3, pvalue.forward = 0.2, pvalue.interaction = 0.3);

# Save and print Model L1-Time
save(x=Model.L1.Time, file="object Model.L1.Time 20190201.RData");
write.table(cbind(Coef=Model.L1.Time$coef, Lower=Model.L1.Time$lower, Upper=Model.L1.Time$upper), file="Model L1-Time 20190201.txt", sep=";");
cbind(Coefficient=Model.L1.Time$coef, Lower.Bound=Model.L1.Time$lower, Upper.Bound=Model.L1.Time$upper);
##                   Coefficient   Lower.Bound  Upper.Bound
## (Intercept)      4.5352456627  3.6607544949  5.409736831
## ChemBev          0.4903821144  0.2176428890  0.763121340
## Age1             0.0467445556  0.0193879515  0.074101160
## Age2            -0.0003818097 -0.0006122064 -0.000151413
## ECOG.1          -0.4420102362 -0.7086076563 -0.175412816
## ECOG.23         -0.4466416971 -0.7601591259 -0.133124268
## MutantBRAF.YN   -0.1586860032 -0.3946605642  0.077288558
## Site.LeftRectum -0.0320105967 -0.2807102482  0.216689055
## Site.Right      -0.2048020078 -0.4594543905  0.049850375
## Liver.YN        -0.1006873497 -0.2196800253  0.018305326
## Peritoneum.YN   -0.0967462377 -0.2154643846  0.021971909
## CEA.200plus     -0.0349843979 -0.3386554012  0.268686605
## Private.YN       0.3951723972  0.1640512674  0.626293527
## MutantRAS.YN    -0.0958525980 -0.2360036442  0.044298448
## ECOG.1.Int       0.2336980952 -0.0577437169  0.525139907
## ECOG.23.Int     -0.0886445670 -0.4773567723  0.300067638
## CEA.200plus.Int -0.2021634520 -0.5296258343  0.125298930
## Private.YN.Int  -0.4110088123 -0.6653954483 -0.156622176
## Log(Scale)      -0.3324565000 -0.3850521961 -0.279860804

Model L1-Event

# Fit Model L1-Event
vars.all.L1 <- c(vars.include.L1, vars.backward.L1, vars.forward.L1);
vars.all.L1 <- vars.all.L1[vars.all.L1!="Arm.ChemBev"];
Model.L1.Event <- fitLogisticModel(datasets = datasets.surv.L1, events = "Events.01", vars.include = c("L1.PFS.Time", "Arm.ChemBev"),  vars.selection = c(vars.all.L1, str_c(vars.all.L1, ".Int"), "L1.PFS.Time.Int"), vars.formulas = vars.formulas.L1, pvalue = 0.20);

# Save and print Model L1-Event
save(x=Model.L1.Event, file="object Model.L1.Event 20190201.RData")
write.table(cbind(Coef=Model.L1.Event$coef, Lower=Model.L1.Event$lower, Upper=Model.L1.Event$upper), file="Model L1-Event 20190201.txt", sep=";");
cbind(Coefficient=Model.L1.Event$coef, Lower.Bound=Model.L1.Event$lower, Upper.Bound=Model.L1.Event$upper);
##                       Coefficient   Lower.Bound   Upper.Bound
## (Intercept)         -2.069855e+00 -3.011333e+00 -1.128376e+00
## Time1               -1.871496e-04 -8.169673e-03  7.795374e-03
## Time2               -2.561821e-06 -1.577255e-05  1.064891e-05
## ChemBev              1.738231e-01 -1.229186e+00  1.576832e+00
## Charlson.1plus       6.461158e-01  1.593857e-01  1.132846e+00
## Site.LeftRectum.Int -1.462309e+00 -2.459083e+00 -4.655346e-01
## Site.Right.Int      -1.469024e+00 -2.529350e+00 -4.086988e-01
## Time1.Int            2.267294e-03 -6.025885e-03  1.056047e-02
## Time2.Int            2.566552e-06 -1.072477e-05  1.585787e-05

Model LX-Time

# Fit Model LX-Time
Model.LX.Time <- fitSurvivalModel(datasets = datasets.surv.LX, time = "Time", event = "Event", dist = "weibull", vars.include = vars.include.LX, vars.backward = vars.backward.LX, vars.forward = vars.forward.LX, vars.formulas = vars.formulas.LX, pvalue.backward = 0.3, pvalue.forward = 0.2, pvalue.interaction = 0.3)

# Save and print Model LX-Time
save(x=Model.LX.Time, file="object Model.LX.Time 20190201.RData")
write.table(cbind(Coef=Model.LX.Time$coef, Lower=Model.LX.Time$lower, Upper=Model.LX.Time$upper), file="Model LX-Time 20190201.txt", sep=";");
cbind(Coefficient=Model.LX.Time$coef, Lower.Bound=Model.LX.Time$lower, Upper.Bound=Model.LX.Time$upper);
##                     Coefficient   Lower.Bound   Upper.Bound
## (Intercept)        3.2891795888  2.2620579298  4.316301e+00
## ChemBev            0.8818385546 -0.2812537861  2.044931e+00
## Age1              -0.0251070916 -0.0370240389 -1.319014e-02
## Age2              -0.0005670460 -0.0011710130  3.692101e-05
## Site.LeftRectum   -0.1579432364 -0.5383176339  2.224312e-01
## Site.Right        -0.2365318069 -0.6211753023  1.481117e-01
## Peritoneum.YN      0.1513874151 -0.1773574308  4.801323e-01
## ECOG.1            -0.2762104282 -0.6659711458  1.135503e-01
## ECOG.23           -1.2022780055 -1.6308052807 -7.737507e-01
## MutantBRAF.YN      0.2745640877 -0.3312597292  8.803879e-01
## PFS.Time           0.5751723494  0.4054450540  7.448996e-01
## Stage4.YN         -0.5243148245 -0.9757145591 -7.291509e-02
## Charlson.1plus.YN  0.1044981510 -0.0565984893  2.655948e-01
## MutantRAS.YN       0.1091776800 -0.3022287934  5.205842e-01
## Private.YN         0.1053037877 -0.0458828408  2.564904e-01
## Age1.Int           0.0174056852  0.0034559505  3.135542e-02
## Age2.Int           0.0004030007 -0.0003232005  1.129202e-03
## Peritoneum.YN.Int -0.2386910341 -0.6130196513  1.356376e-01
## ECOG.1.Int        -0.0930659837 -0.5165916222  3.304597e-01
## ECOG.23.Int        0.3115761369 -0.2210811948  8.442335e-01
## MutantBRAF.YN.Int -0.6795392885 -1.3768905224  1.781195e-02
## PFS.Time.Int      -0.1486793160 -0.3474887968  5.013016e-02
## Stage4.YN.Int      0.4904170757 -0.0221555921  1.002990e+00
## MutantRAS.YN.Int  -0.4071664687 -0.8555996074  4.126667e-02
## Log(Scale)        -0.1525148212 -0.2153760094 -8.965363e-02

Step 5: Model Validation

Internal validation is performed to assess the performanc of the models. This exists of five steps:

  1. Defining functions used to perform the validation
  2. Graphical assessment of internal validity
  3. Calculating the apparent model performance
  4. Performing bootstrapping to calculate the optimism
  5. Correcting the apparent performance for the optimism

For survival models Model L1-Time and Model LX-Time, several validation plots were assessed and model discrimination was assessed by Uno et al.’s modification of Harrel’s C-statistic to account for censoring. Two measures were used to assess survival model calibration: 1) the regression slope of the predicted Weibull scale parameter in a Cox-regression model, and 2) Demler et al.’s Greenwood-Nam-D’Agostino statistic (GND), which is a modification of Hosmer-Lemeshow’s statistic by Nam and D’Agostino and accounts for censoring using Greenwood’s variance estimator. Both C-statistic and GND are assessed at specific timepoints, which were selected to be at 0.5, 1, and 2 years. Additionally, GND requires a number of groups to be specified in which patients are stratified according to their predicted risk, for which 5 and 10 were chosen to assess impact of this arbitrary decision.

Internal validity of logistic regression Model L1-Event was assessed by excluding patients for which no event was observed and using Harrel’s C-statistic for model discrimination. To assess calibration, the regression slope of the predicted response of Model L1-Event in a logistic regression model and the Brier score were used.

Bootstrapping was performed to determine the optimism and obtain sets of models to perform the probabilistic sensitivity analysis for the discrete event simulation model.

Step 5.1: defining functions used to perform the validation

Several custom functions are defined to make predictions using the fitted models and to calculate the statistics used. Please find an explanation of the functions in the R code below.

# The 'predictSurvmod', 'predictSurvmod.response', 'predictSurvmod.lp', 'predictSurvmod.p', 'predictLogreg',
# and 'predictLogreg.lp' are custom functions to make predictions using the fitted models
predictSurvmod <- function(model, newdata) {
  
  # Add a dummy column for the intercept
  newdata <- cbind(newdata, '(Intercept)'=1);
  
  # Ge the Weibull Shape parameter from the model
  Shape <- 1/exp(model["Log(Scale)"]);
  model <- model[-which(names(model) == "Log(Scale)")];
  
  # Subset a dataframe to enable matrix multiplication and check
  newdata.sub <- as.matrix(newdata[, names(model), drop=F]);
  
  # Get the scale parameter for the survival model
  Scale <- exp(newdata.sub %*% model);

  # Predict the times
  p <- matrix(runif(nrow(newdata.sub)), ncol=1);
  t <- qweibull(p=p, shape=Shape, scale=Scale);

  # Return predicted times
  return(t);
  
  
}
predictSurvmod.response <- function(model, newdata) {
  
  # Add a dummy column for the intercept
  newdata <- cbind(newdata, '(Intercept)'=1);
  
  # Ge the Weibull Shape parameter from the model
  Shape <- 1/exp(model["Log(Scale)"]);
  model <- model[-which(names(model) == "Log(Scale)")];
  
  # Subset a dataframe to enable matrix multiplication and check
  newdata.sub <- as.matrix(newdata[, names(model), drop=F]);
  
  # Get the scale parameter for the survival model
  Scale <- exp(newdata.sub %*% model);
  
  # Return predicted times
  return(Scale);
  
  
}
predictSurvmod.lp <- function(model, newdata) {
  
  # Add a dummy column for the intercept
  newdata <- cbind(newdata, '(Intercept)'=1);
  
  # Ge the Weibull Shape parameter from the model
  Shape <- 1/exp(model["Log(Scale)"]);
  model <- model[-which(names(model) == "Log(Scale)")];
  
  # Subset a dataframe to enable matrix multiplication and check
  newdata.sub <- as.matrix(newdata[, names(model), drop=F]);
  
  # Get the scale parameter for the survival model
  LP <- newdata.sub %*% model;
  
  # Return predicted times
  return(LP);
  
  
}
predictSurvmod.p <- function(model, newdata, q) {
  
  # Add a dummy column for the intercept
  newdata <- cbind(newdata, '(Intercept)'=1);
  
  # Ge the Weibull Shape parameter from the model
  Shape <- 1/exp(model["Log(Scale)"]);
  model <- model[-which(names(model) == "Log(Scale)")];
  
  # Subset a dataframe to enable matrix multiplication and check
  newdata.sub <- as.matrix(newdata[, names(model), drop=F]);
  
  # Get the scale parameter for the survival model
  Scale <- exp(newdata.sub %*% model);

  # Predict the times
  q <- matrix(q, nrow=nrow(newdata), ncol=length(q), byrow=T);
  p <- pweibull(q=q, shape=Shape, scale=Scale)
  
  # Return predicted times
  return(p);
  
  
}
predictLogreg <- function(model, newdata) {
  
  # Add a dummy column for the intercept
  newdata <- cbind(newdata, '(Intercept)'=1);

  # Subset a dataframe to enable matrix multiplication and check
  newdata.sub <- as.matrix(newdata[, names(model), drop=F]);
  
  # Calculate the probabilities
  p.tmp <- exp(newdata.sub %*% model);
  p <- p.tmp / (p.tmp + 1);
  
  # Return probabilities
  return(as.vector(p));
  
}
predictLogreg.lp <- function(model, newdata) {
  
  # Add a dummy column for the intercept
  newdata <- cbind(newdata, '(Intercept)'=1);
  
  # Subset a dataframe to enable matrix multiplication and check
  newdata.sub <- as.matrix(newdata[, names(model), drop=F]);
  
  # Calculate the probabilities
  p.tmp <- newdata.sub %*% model;
  
  # Return probabilities
  return(p.tmp);
  
}

# The 'testGND' is used to calculate the Greenwood-Nam-D'Agostino statistic as proposed by the authors and
# with some additional error handlers included and an option to plot an illustration
testGND <- function(model, data, time, n.bin=5, min.event=5, plot.graph=F, y.max=1) {
  
  time.in.days <- time*365.25;
  
  # Predict event probability for each individual
  pt <- predictSurvmod.p(model=model, newdata=data, q=time.in.days);
  
  # Bin individuals according to risk of event
  pt.q <- quantile(pt, seq(0, 1, length.out=n.bin+1));
  b <- sapply(pt, function(p) ifelse(p==max(pt.q), n.bin, which(p<pt.q)[1]-1));
  
  # Calculate average probability for each group
  ptg <- tapply(pt, b, mean);
  
  # Determine Kaplan-Meier estimates
  KMg <- t(sapply(1:n.bin, function(bin) {
    
    # Get subset of data for the bin
    data.sub <- data[b==bin, ];
    
    # Check whether sufficient events are observed
    if(sum(data.sub$Event==1)<min.event) stop("Insufficient events observed!");
    
    # Estimate Kaplan-Meier for the bin
    KM <- survfit(Surv(Time, Event) ~ 1, data=data.sub, type="kaplan-meier");
    
    # Extract the estimates:
    # p       event probability (1 - km survival estimate)
    # var     variance of km survival estimate
    # nrisk   number at risk
    if(time.in.days<min(KM$time)) {
      p <- 1;
      var <- 0;
    } else {
      p <- KM$surv[tail(which(KM$time<=time.in.days), n=1)];
      p <- ifelse(p==0, KM$surv[tail(which(KM$time<=time.in.days), n=1)-1], p);
      err <- KM$std.err[tail(which(KM$time<=time.in.days), n=1)];
      err <- ifelse(err==Inf, KM$std.err[tail(which(KM$time<=time.in.days), n=1)-1], err);
      var <- (err * p)^2;
    }
    
    
    # Return estimates
    c(p=1-p, var=var)
    
  }))
  
  
  
  # Calculate test statistic
  GND.chisq.sep <- (KMg[,"p"]-ptg)^2 / KMg[,"var"];
  GND.chisq <- sum(GND.chisq.sep);
  GND.df <- n.bin - 1;
  GND.p <- 1 - pchisq(q=GND.chisq, df=GND.df);
  
  if(plot.graph) {
    
    barplot(rbind(ptg, KMg[, "p"]), beside=T, ylim=c(0,y.max), main=paste0("Event before ",time, "-year (p = ",round(GND.p,3),")"), ylab="Probability", names.arg=str_c("n=",table(b)), las=1, legend.text=c("Survival Model", "Kaplan-Meier"), args.legend=list(x="top", inset=c(0,-0.0), ncol=2, bty='n'))
    
  }
  
  # Return results
  return(c(chisq=GND.chisq, df=GND.df, p=GND.p));
  
}

# The 'getPerformanceSurvmod' and 'getPerformanceLogreg' functions are high-level functions that call all 
# functions used to calculate statistics used to assess the models' performance
getPerformanceSurvmod <- function(model, data.train, data.test, timepoints=c(0.5, 1, 2)) {
  
  library(survAUC);
  
  # c-statistic
  c.statistic <- sapply(timepoints, function(tt) {
    
    # Make negative because a higher scale indicates a higher time-to-event, i.e. lower risk
    pred.lp <- -predictSurvmod.lp(model=model, newdata=data.test);
    surv.train <- Surv(time=data.train$Time, event=data.train$Event);
    surv.test <- Surv(time=data.test$Time, event=data.test$Event);
    
    UnoC(Surv.rsp=surv.train, Surv.rsp.new=surv.test, lpnew=pred.lp, time=tt*365.25);
    
  });
  names(c.statistic) <- str_c("c.statistic_", timepoints, "-year");
  
  # Regression Slope
  data.test$scale <- predictSurvmod.response(model, data.test);
  slope.scale <- exp(coxph(Surv(Time, Event) ~ scale, data=data.test)$coefficients);

  # Hosmer-Lemeshow
  HL.5bins <- sapply(timepoints, function(tt) {
    testGND(model=model, data=data.test, time=tt, n.bin=5, plot=F)["chisq"]
  });
  HL.10bins <- sapply(timepoints, function(tt) {
    testGND(model=model, data=data.test, time=tt, n.bin=10, plot=F)["chisq"]
  });
  names(HL.5bins) <- str_c("HL.5bins_", timepoints, "-year");
  names(HL.10bins) <- str_c("HL.10bins_", timepoints, "-year");
  
  # Return all performance measures
  return(c(c.statistic, Slope.Scale = unname(slope.scale), HL.5bins, HL.10bins))
  
}
getPerformanceLogreg <- function(model, data) {
  
  data.sub <- data[data$Events!="Censored", ];
  
  obs <- ifelse(data.sub$Events=="Death", 1,
                ifelse(data.sub$Events=="Progression", 0, NA));
  
  pred.prob <- predictLogreg(model=model, newdata=data.sub);
  pred.lp <- predictLogreg.lp(model=model, newdata=data.sub);
  
  # c-statistic
  pairs <- expand.grid(1:length(obs),
                       1:length(obs));
  c_n <- sum(ifelse(obs[pairs[,1]]==1 & obs[pairs[,2]]==0, 1, 0) * ifelse(pred.prob[pairs[,1]]>pred.prob[pairs[,2]], 1, 0));
  c_d <- sum(ifelse(obs[pairs[,1]]==1 & obs[pairs[,2]]==0, 1, 0));
  c.statistic <- c_n/c_d;
  
  # Regression slope
  tmp.model <- glm(obs ~ pred.lp, family=binomial(link="logit"));
  regression.slope <- tmp.model$coefficients["pred.lp"];
  
  # Brier score
  Brier.score <- sum((pred.prob-obs)^2)/length(obs);
  
  # Return all performance measures
  return(c(C.Statistic = c.statistic, Regression.Slope = unname(regression.slope), Brier.Score = Brier.score))
  
}

Step 5.2: graphical assessment of internal validity

Both Kaplan-Meier plots for different patient subgroups and illustrations of the Greenwood-Nam-D’Agostino statistic were used to graphically assess the performance of the survival models.

Model L1-Time - Kaplan-Meier Plots

# Custom function to plot predited vs. observed Kaplan-Meier curves for both treatment strategies
plotKM <- function(model, data, class.column, time.max=1500, time.step=10, add.main=NULL) {
  
  km.fit <- survfit(as.formula(str_c("Surv(Time, Event) ~ ",class.column)), data=data, type="kaplan-meier")
  
  classes <- sort(unlist(unique(data[, class.column])));
  n.classes <- length(classes);
  n.obs.classes <- table(data[, class.column]);
  
  q.values <- seq(1, time.max, time.step);
  
  sim <- sapply(classes, function(class) {
    surv.mat <- 1 - predictSurvmod.p(model, data[data[, class.column]==as.character(class), ], q=q.values);
    surv.mean <- apply(surv.mat, 2, mean);
    
    surv.mean
  })
  
  cols <- rainbow(n.classes);
  if(is.null(add.main)) {
    main.text <- class.column;
  } else {
    main.text <- str_c(class.column, "\n(", add.main,")")
  }
  plot(km.fit, col=cols, lty=1, lwd=2, xlim=c(0,time.max), main=main.text, xlab="Time to Progression or Death (Days)")
  for(i in 1:n.classes) {
    lines(x=q.values, y=sim[,i], col=cols[i], lty=2, lwd=2)
  }
  legend("right", legend=c(str_c(classes, " (obs) [n=",n.obs.classes,"]"), str_c(classes, " (sim)")),
         col=c(cols, cols), lty=c(rep(1, n.classes), rep(2, n.classes)), lwd=2)
  
}

# Model L1-Time
set.seed(20181211);
random.dataset <- sample(1:length(datasets), 1);
data <- datasets.surv.L1[[random.dataset]];

# Kaplan-Meier curves for the following groups
# - The complete cohort
# - Patient with a left-sided primary tumour or rectum primary tumour
# - Patient with a right-sided primary tumour
# - Patients with an primary tumour in their colon but not otherwise specified (NOS) or unkown primary tumour site
par(mfrow=c(2,2));
plotKM(model=Model.L1.Time$coef, data=data, class.column="ChemBev")
plotKM(model=Model.L1.Time$coef, data=filter(data, Site.LeftRectum==1), class.column="ChemBev", add.main="Left or Rectum")
plotKM(model=Model.L1.Time$coef, data=filter(data, Site.Right==1), class.column="ChemBev", add.main="Right")
plotKM(model=Model.L1.Time$coef, data=filter(data, Site.LeftRectum==0, Site.Right==0), class.column="ChemBev", add.main="Other")