This document shows all code, intermediate outcomes, and final outcomes belonging to the manuscript “Simulating progression-free survival and overall survival for first-line doublet chemotherapy with or without bevacizumab in metastatic colorectal cancer patients based on real world registry data” that was published in JOURNAL: … . Using real world data from the Treatment of Recurrent and Advanced Colorectal Cancer registry, two parametric survival models and a logistic regression model are estimated to populate a discrete event simulation model. Please read the manuscript before reading through this document, as the manuscript contains information essential for understanding decisions made in the code that is included in this document.

Please appropriately cite the work above if you use any code provided in this document for your own analyses.

This document contains the following sections:

Please note that the number of bootstrap iterations and simulation runs performed is set to a lower number for illustration purposes due to the required runtime to perform the 10,000 iterations/simulations that are performed for the final analyses presented in the manuscript. Hence, results and plots in this document may be different from those reported in the manuscript. Results presented in the manuscript should be considered as the final and definitive results at all times.

Initialization

To intialize the data analysis, several actions are performed. Most importantly, the data are loaded from an encrypted hard drive, as are the input matrices defined by clinicians to recode the treatments as observed in the unprocessed data.

# Clear Workspace and plots, and clear console
rm(list=ls()); gc(); if(!is.null(dev.list())) dev.off(); cat("\014");

# Load packages
library(dplyr);
library(readxl);
library(stringr);

# Set the locale so R handles dates as English regardless of location
Sys.setlocale("LC_TIME", "English");

# Load the raw TRACC data for the patient characteristics and first line treatment
df <- read_excel("G:/Files/mCRC Targeting Melbourne/CRC MODELING DATA 20181105.xlsx", 
                 col_types = c("numeric", "date", "text", "text",
                               "text", "text", "text", "date", "date", 
                               "date", "date", "numeric", "text", 
                               "text", "text", "text", "text", "text", 
                               "text", "text", "text", "text", "text", 
                               "text", "text", "text", "text", "text", 
                               "text", "text", "numeric", "text", 
                               "text", "text", "text", "text", "text", 
                               "text", "text", "text", "text", "text", 
                               "text", "text", "text", "text", "text", 
                               "text", "text", "text", "numeric", 
                               "text", "text", "text", "text", "date", 
                               "text", "text", "text", "date", "text", 
                               "text", "text", "text", "text", "date",
                               "numeric","text",
                               "text", "text", "text", "text", "date", 
                               "date", "date", "text", "text", "text", 
                               "text", "text", "text", "text", "numeric", 
                               "text", "numeric", "text", "numeric", 
                               "text", "text", "text", "numeric", 
                               "text", "numeric", "text", "numeric", 
                               "text", "numeric", "text", "numeric", 
                               "text", "numeric", "text", "numeric", 
                               "text", "numeric", "text", "numeric", 
                               "text", "numeric", "text", "text", 
                               "numeric", "text", "text"));

# Replace spaces in column names with dots/points to handle columns more easily
colnames(df) <- str_replace_all(string=colnames(df), pattern=" ", replace=".");

# Load the chemo and biologic classification by Hui-Li & Koen
chemo.class <- read_excel("G:/Files/mCRC Targeting Melbourne/Chemotherapy Classification 20181123.xlsx");
biolo.class <- read_excel("G:/Files/mCRC Targeting Melbourne/Biological Classification 20181123.xlsx");

Step 1: Cleaning the TRACC Data

Before all models can be fitted, the data need to be cleaned and the appropriate patient population is to be selected. To this end, several steps are performed:

  1. Initial data cleaning and patient selection
  2. Cleaning of patient characteristics and treatment history
  3. Cleaning of clinical outcomes for first-line treatment
  4. Cleaning of clinical outcomes for further treatment lines
  5. Final patient selection

Step 1.1: Initial data cleaning and patient selection

In the first step of the data cleaning, the patient selection is performed. All data required for this selection are recoded and cleaned in this step as well. Patients included are those …

  • who received chemotherapy with or without bevacizumab
  • treated with a palliative treatment intent
  • with a treatment start date after 1 January 2009 and before 1 June 2017
# Inclusion is based on whether the first line of palliative treatment included
# doublet chemotherapy with or without bevacizumab. Patients for whom treatment
# intent was indicated as curative or potentially curative, are excluded.

# Only patients with recorded first-line chemotherapy treatment are considered for inclusion
df <- df %>%
  filter(!is.na(REGIMEN.1L)) %>%
  filter(TREATMENTINTENT=="Palliative only")

# Recode first-line treatment information for inclusion
df <- df %>%
  mutate(L1.Regimen = REGIMEN.1L) %>%
  mutate(L1.Biologic = BIOLOGIC.1L) %>%
  mutate(L1.Start.Date = as.character(START.DATE.1L))

# In the TRACC registry, all regimen in a treatment line need to be recoded first. 
# This has been manually done in an Excel file, which has been loaded in the global 
# environment as object 'chemo.class' and will now be used to recode the regimen.
# The 'chemo.class' object contains a column 'Recorded.Regimen' with all uncoded
# treatments and a columns 'Classification.G0' with the recoded treatments. The
# column 'L1.Chemo.Eligible' is added to indicate whether a patient received doublet
# chemotherapy and, hence, is eligible for inclusion.
df <- df %>%
  left_join(select(chemo.class, Recorded.Regimen, Classification.G0), by=c("L1.Regimen" = "Recorded.Regimen")) %>%
  rename(L1.Chemo.Class = Classification.G0) %>%
  mutate(L1.Chemo.Class = ifelse(is.na(L1.Chemo.Class), "None", L1.Chemo.Class)) %>%
  mutate(L1.Chemo.Eligible = ifelse(L1.Chemo.Class=="Doublet", T, F))

# Since only patients who had a doublet chemotherapy with or without bevacizumab,
# so not with another additional agent like cetuximab, are include, it need to be
# checked whether patients had bevacizumab yes or no, and whether they are
# eligible based on any other biologicals. This is done based on the first
# reported biological
df <- df %>%
  left_join(select(biolo.class, Recorded.Biologicals, Classification.G0), by=c("L1.Biologic" = "Recorded.Biologicals")) %>%
  rename(L1.Biologic.Class=Classification.G0) %>%
  mutate(L1.Biologic.Class = ifelse(is.na(L1.Biologic.Class), "None", L1.Biologic.Class)) %>%
  mutate(L1.Biologic.Eligible = ifelse(L1.Biologic.Class %in% c("None", "Bevacizumab"), T, F)) %>%
  mutate(L1.Bevacizumab = ifelse(L1.Biologic.Class=="Bevacizumab", T, F))

# Final selection of patients is performed based on whether they had ...
# 1) doublet chemotherapy with or without bevacizumab and no other biological
# 2) treatment start date after 2009-01-01
# 3) treatment start date before 2017-06-01
df<- df %>%
  filter(L1.Chemo.Eligible) %>%
  filter(L1.Biologic.Eligible) %>%
  filter(L1.Start.Date >= as.Date("2009-01-01")) %>%
  filter(L1.Start.Date < as.Date("2017-06-01"))

# Allocate patients to the appropriate treatment arm
df <- df %>%
  mutate(L1.Arm.ChemBev = ifelse(L1.Bevacizumab, T, F))

Step 1.2: Cleaning of patient characteristics and treatment history

The second step in cleaning the data is making sure all variables are recoded in relevant categories and all observations are categorized in one of these defined categories. The following variables are extracted from the data:

  • Age at treatment start
  • Gender
  • Hospital type (private or public)
  • Hospital location (located in Victoria or not)
  • Stage at initial diagnosis
  • Primary tumour site
  • Presence of peritoneal metastases
  • Presence of liver metastases
  • Number of metastatic sites
  • ECOG performance status
  • Presence of comorbidities
  • Charlson comorbidity index
  • CEA level
  • RAS mutation status
  • BRAF mutation status
  • Treatment intent
  • Treatment start year
df <- df %>%
  
  # Age in years at treatment start
  mutate(Age = floor(as.numeric(difftime(L1.Start.Date, DOB, units="weeks")/52))) %>%         

  # Gender is female or not
  mutate(Gender.Female = ifelse(GENDER=="F", T, F)) %>%                                       
  
  # Hospital type is private or not
  mutate(Hospital.Private = ifelse(TLOCATIONTYPE=="Private", T, F)) %>%
  
  # Hospital is located in Victoria or not
  mutate(Hospital.Victoria = ifelse(STATE=="VIC", T, F)) %>%
  
  # Stage of disease at initial diagnosis
  mutate(Stage.Diagnosis = ifelse(STAGE.AT.DIAGNOSIS=="Unknown", NA, NA))%>%
  mutate(Stage.Diagnosis = ifelse(STAGE.AT.DIAGNOSIS=="Locally advanced (rectal only)", 0, Stage.Diagnosis)) %>%
  mutate(Stage.Diagnosis = ifelse(STAGE.AT.DIAGNOSIS=="Stage I", 1, Stage.Diagnosis)) %>%
  mutate(Stage.Diagnosis = ifelse(STAGE.AT.DIAGNOSIS=="Stage II", 2, Stage.Diagnosis)) %>%
  mutate(Stage.Diagnosis = ifelse(STAGE.AT.DIAGNOSIS=="Stage III", 3, Stage.Diagnosis)) %>%
  mutate(Stage.Diagnosis = ifelse(STAGE.AT.DIAGNOSIS=="Stage IV", 4, Stage.Diagnosis)) %>%
  
  # Primary tumour site
  mutate(Primary.Site = ifelse(PRIMARYSITE=="Occult", NA, NA)) %>%
  mutate(Primary.Site = ifelse(PRIMARYSITE=="Colon", "Colon", Primary.Site)) %>%
  mutate(Primary.Site = ifelse(PRIMARYSITE=="Colon NOS", "Colon", Primary.Site)) %>%
  mutate(Primary.Site = ifelse(PRIMARYSITE=="Left colon", "Left", Primary.Site)) %>%
  mutate(Primary.Site = ifelse(PRIMARYSITE=="Right colon", "Right", Primary.Site)) %>%
  mutate(Primary.Site = ifelse(PRIMARYSITE=="Rectum", "Rectum", Primary.Site)) %>%
  mutate(Primary.Site = ifelse(PRIMARYSITE=="Multiple primaries", "Colon", Primary.Site)) %>%
  
  # Specific metastatic sites
  mutate(Metastases.Liver = ifelse(is.na(LIVER.METS), F, T)) %>%
  mutate(Metastases.Peritoneum = ifelse(is.na(PERITONEUM.OMENTUM.METS), F, T)) %>%
  
  # Number of metastatic sites
  mutate(Metastases.Number = NUMBER.OF.METS) %>%
  
  # ECOG Performance Status
  mutate(L1.ECOG = as.numeric(str_sub(ECOG.First.Line, 1, 1))) %>%
  
  # Whether the patient has comorbidities or not
  mutate(Comorbidities = ifelse(NO.COMORBIDITIES=="Yes", FALSE, TRUE)) %>%
  
  # Recode the comorbidities
  mutate(MYOCARDIAL.INFARCTION = ifelse(is.na(MYOCARDIAL.INFARCTION) | MYOCARDIAL.INFARCTION=="No", 0, 1)) %>%
  mutate(CONGESTIVE.HEART.FAILURE = ifelse(is.na(CONGESTIVE.HEART.FAILURE) | CONGESTIVE.HEART.FAILURE=="No", 0, 1)) %>%
  mutate(PERIPHERAL.VASCULAR.DISEASE = ifelse(is.na(PERIPHERAL.VASCULAR.DISEASE) | PERIPHERAL.VASCULAR.DISEASE=="No", 0, 1)) %>%
  mutate(CEREBROVASCULAR.DISEASE = ifelse(is.na(CEREBROVASCULAR.DISEASE) | CEREBROVASCULAR.DISEASE=="No", 0, 1)) %>%
  mutate(DEMENTIA = ifelse(is.na(DEMENTIA) | DEMENTIA=="No", 0, 1)) %>%
  mutate(COPD = ifelse(is.na(COPD) | COPD=="No", 0, 1)) %>%
  mutate(CONNECTIVE.TISSUE.DISEASE = ifelse(is.na(CONNECTIVE.TISSUE.DISEASE) | CONNECTIVE.TISSUE.DISEASE=="No", 0, 1)) %>%
  mutate(PEPTIC.ULCER.DISEASE = ifelse(is.na(PEPTIC.ULCER.DISEASE) | PEPTIC.ULCER.DISEASE=="No", 0, 1)) %>%
  mutate(MILD.LIVER.DISEASE = ifelse(is.na(MILD.LIVER.DISEASE) | MILD.LIVER.DISEASE=="No", 0, 1)) %>%
  mutate(DIABETES.NO.COMPS = ifelse(is.na(DIABETES.NO.COMPS) | DIABETES.NO.COMPS=="No", 0, 1)) %>%
  mutate(DIABETES.COMPS = ifelse(is.na(DIABETES.COMPS) | DIABETES.COMPS=="No", 0, 1)) %>%
  mutate(HEMIPLEGIA.STROKE = ifelse(is.na(HEMIPLEGIA.STROKE) | HEMIPLEGIA.STROKE=="No", 0, 1)) %>%
  mutate(RENAL.DISEASE.MOD.SEVERE = ifelse(is.na(RENAL.DISEASE.MOD.SEVERE) | RENAL.DISEASE.MOD.SEVERE=="No", 0, 1)) %>%
  mutate(SECOND.TUMOUR.NO.METS = ifelse(is.na(SECOND.TUMOUR.NO.METS) | SECOND.TUMOUR.NO.METS=="No", 0, 1)) %>%
  mutate(LYMPHOMA.MYELOMA = ifelse(is.na(LYMPHOMA.MYELOMA) | LYMPHOMA.MYELOMA=="No", 0, 1)) %>%
  mutate(LEUKAEMIA = ifelse(is.na(LEUKAEMIA) | LEUKAEMIA=="No", 0, 1)) %>%
  mutate(LIVER.DIS.MOD.SEVERE = ifelse(is.na(LIVER.DIS.MOD.SEVERE) | LIVER.DIS.MOD.SEVERE=="No", 0, 1)) %>%
  mutate(IMMUNE.DEFICIENCY = 0) %>%
  mutate(SECOND.MET.SOLID.TUMOUR = ifelse(is.na(SECOND.MET.SOLID.TUMOUR) | SECOND.MET.SOLID.TUMOUR=="No", 0, 1)) %>%
  
  # Charlson comorbidity score (NOT CORRECTED FOR AGE!!!)
  mutate(Charlson =  
           (1 * MYOCARDIAL.INFARCTION) + 
           (1 * CONGESTIVE.HEART.FAILURE) + 
           (1 * PERIPHERAL.VASCULAR.DISEASE) + 
           (1 * CEREBROVASCULAR.DISEASE) + 
           (1 * DEMENTIA) + 
           (1 * COPD) + 
           (1 * CONNECTIVE.TISSUE.DISEASE) + 
           (1 * PEPTIC.ULCER.DISEASE) + 
           (1 * MILD.LIVER.DISEASE) + 
           (1 * DIABETES.NO.COMPS) + 
           (2 * DIABETES.COMPS) + 
           (2 * HEMIPLEGIA.STROKE) + 
           (2 * RENAL.DISEASE.MOD.SEVERE) + 
           (2 * SECOND.TUMOUR.NO.METS) + 
           (2 * LYMPHOMA.MYELOMA) + 
           (2 * LEUKAEMIA) + 
           (3 * LIVER.DIS.MOD.SEVERE) + 
           (6 * IMMUNE.DEFICIENCY) +
           (6 * SECOND.MET.SOLID.TUMOUR)
  ) %>%
  
  # CEA Level
  mutate(CEA.Level = CEALEVEL) %>%

  # RAS status
  mutate(RAS.Mutated = ifelse(RAS.STATUS=="Pending", NA, NA)) %>%
  mutate(RAS.Mutated = ifelse(RAS.STATUS=="Unknown", NA, RAS.Mutated)) %>%
  mutate(RAS.Mutated = ifelse(RAS.STATUS=="Mutated", T, RAS.Mutated)) %>%
  mutate(RAS.Mutated = ifelse(RAS.STATUS=="Wild-type", F, RAS.Mutated)) %>%
  
  # BRAF status
  mutate(BRAF.Mutated = ifelse(BRAFSTATUS=="Pending", NA, NA)) %>%
  mutate(BRAF.Mutated = ifelse(BRAFSTATUS=="Unknown", NA, BRAF.Mutated)) %>%
  mutate(BRAF.Mutated = ifelse(BRAFSTATUS=="Mutated", T, BRAF.Mutated)) %>%
  mutate(BRAF.Mutated = ifelse(BRAFSTATUS=="Wild-type", F, BRAF.Mutated)) %>%
  
  # Treatment intent
  mutate(Treatment.Intent = ifelse(TREATMENTINTENT=="Curative resection occurred", "Curative", NA)) %>%
  mutate(Treatment.Intent = ifelse(TREATMENTINTENT=="Curative resection planned", "Curative", Treatment.Intent)) %>%
  mutate(Treatment.Intent = ifelse(TREATMENTINTENT=="Potentially curative resection (if good response to treatment)", "Potentially", Treatment.Intent)) %>%
  mutate(Treatment.Intent = ifelse(TREATMENTINTENT=="Palliative only", "Palliative", Treatment.Intent)) %>%
  
  # Treatment start
  mutate(Treatment.Start = format(as.Date(L1.Start.Date), "%Y"))

Step 1.3: Cleaning of clinical outcomes for first-line treatment

Clinical outcomes for the first line of treatment are extrated from the data in this third step of the data cleaning. In the manucript, only progression-free survival is considered as clinical outcome for first-line treatment. First-line progression is defined as physician-defined progression (so a patient will be able to receive further treatment or follow up) or death.

# First, the treatment stop date, date of progression, and date of death need to be extraced from the data
df<- df %>%  
  mutate(L1.Stop.Date = as.character(STOP.DATE.1L)) %>%
  mutate(L1.Progression.Date = as.character(PROGRESSION.DATE.First.Line)) %>%
  mutate(Death.Date = as.character(DATEOFDEATH))

# The progression-free survival (PFS) and overall survival (OS) are calculated using the
# treatment start date as origin
df <- df %>% 
  mutate(L1.PFS.Days = as.numeric(difftime(L1.Progression.Date, L1.Start.Date, units="days"))) %>%
  mutate(L1.OS.Days = as.numeric(difftime(Death.Date, L1.Start.Date, units="days"))) %>%
  mutate(L1.CENS.Days = as.numeric(difftime(L1.Stop.Date, L1.Start.Date, units="days"))) %>%
  mutate(L1.CENS.Days = ifelse(is.na(L1.CENS.Days), as.numeric(difftime(DATEOFREVIEW, L1.Start.Date, units="days")), L1.CENS.Days))


# Determine which of the events occured to the patient with regard to first line progression:
# 0) Censored:      when neither progression or death is recorded
# 1) Progression:   when progression is recorded
# 2) Death:         when progression is not recorded but death is
df <- df %>%
  mutate(L1.PFS.Events = ifelse(!is.na(L1.OS.Days) & (L1.OS.Days<=L1.PFS.Days | is.na(L1.PFS.Days)), "Death", NA)) %>%
  mutate(L1.PFS.Events = ifelse(!is.na(L1.PFS.Days) & (L1.PFS.Days<L1.OS.Days | is.na(L1.OS.Days)), "Progression", L1.PFS.Events)) %>%
  mutate(L1.PFS.Events = ifelse(is.na(L1.PFS.Days) & is.na(L1.OS.Days), "Censored", L1.PFS.Events))

# Determine whether an event occured or the patient was censored and the correspoding time
df <- df %>% 
  mutate(L1.PFS.Event = ifelse(L1.PFS.Events=="Censored", 0, 1)) %>%
  mutate(L1.PFS.Time = ifelse(L1.PFS.Events=="Progression", L1.PFS.Days, NA)) %>%
  mutate(L1.PFS.Time = ifelse(L1.PFS.Events=="Death", L1.OS.Days, L1.PFS.Time)) %>%
  mutate(L1.PFS.Time = ifelse(L1.PFS.Events=="Censored", L1.CENS.Days, L1.PFS.Time))

# Since a negative PFS or OS is not possible, patients with a non-positive time-to-event
# are excluded from the analysis, as their data is assumed to be incorrect
df <- df %>%
  mutate(L1.PFS.Include = ifelse(!is.na(L1.PFS.Time) & L1.PFS.Time > 0, T, F));

Step 1.4: Cleaning of clinical outcomes for further treatment lines

With regard to the clinical outcomes of treatment lines or follow up after first-line progression, only overall survival is considered. Overall survival is extracted from the data in this fourth step of the data cleaning.

# Time till death after progression
df <- df %>% 
  mutate(LX.OS.Days = as.numeric(difftime(DATEOFDEATH, L1.Progression.Date, units="days"))) %>%
  mutate(LX.CENS.Days = as.numeric(difftime(DATEOFREVIEW, L1.Progression.Date, units="days")))

# Determine whether the patient was censored:
# 0) Censored:      when death is not recorded
# 1) Death:         when death is recorded
df <- df %>%
  mutate(LX.OS.Events = ifelse(!is.na(LX.OS.Days), "Death", NA)) %>%
  mutate(LX.OS.Events = ifelse(is.na(LX.OS.Days) & !is.na(LX.CENS.Days), "Censored", LX.OS.Events)) %>%
  mutate(LX.OS.Events = ifelse(L1.PFS.Events=="Progression", LX.OS.Events, NA))

# Determine whether death occured or the patient was censored and the correspoding time
df <- df %>% 
  mutate(LX.OS.Event = ifelse(LX.OS.Events=="Censored", 0, 1)) %>%
  mutate(LX.OS.Time = ifelse(LX.OS.Events=="Death", LX.OS.Days, NA)) %>%
  mutate(LX.OS.Time = ifelse(LX.OS.Events=="Censored", LX.CENS.Days, LX.OS.Time))

# Since a negative PFS or OS is not possible, patients with a non-positive time-to-event
# are excluded from the analysis, as their data is assumed to be incorrect
df <- df %>%
  mutate(LX.OS.Include = ifelse(L1.PFS.Include & !is.na(LX.OS.Time) & LX.OS.Time > 0, T, F))

Step 1.5: Final patient selection

Patients with a negative progression-free survival or overall survival are excluded from the sample, as it is assumed that these are data entry errors. Finally, for the included patients, an overview table of their characteristics is generated.

# Only include patients who at least are included for first-line treatment
df <- df %>%
  filter(L1.PFS.Include)

# Only the new colomns need to be maintained
df <- df %>%
  select(L1.Regimen:LX.OS.Include)

# General function to get an overview table of the patient characteristics
getTable <- function(data, column.class=NULL, columns.names=NULL, columns.discrete=NULL, chisq.simulate=T, chisq.nsim=10000, export=F, export.file=NULL) {
  
  if(is.null(columns.names)) {
    columns.names <- c("Age", "Gender.Female", "Hospital.Private", "Stage.Diagnosis", "Primary.Site", 
                       "Metastases.Number", "Metastases.Liver", "Metastases.Peritoneum", "L1.ECOG",
                       "Charlson", "CEA.Level", "RAS.Mutated", "BRAF.Mutated");
    columns.discrete <- c(F, T, T, T, T, T, T, T, T, T, F, T, T);
  }
  
  if(is.null(column.class)) {
    column.class <- "L1.Arm.ChemBev";
  }
  
  if(length(columns.names)!=length(columns.discrete)) stop("OBJECTS 'columns.names' AND 'columns.discrete' SHOULD HAVE EQUAL LENGTHS!!!");
  
  classification <- unlist(data[, column.class]);
  if(length(unique(classification))!=2) stop("NEEDS A CLASSIFICATION THAT RESULTS IN TWO CLASSES!!!")
  
  out.discrete <- matrix(c(table(classification), NA, NA, NA), nrow=1);
  rownames(out.discrete) <- "n";
  colnames(out.discrete) <- c(rep(names(table(classification)),2 ), "p-value");
  for(column in columns.names) {
    
    variable <- unlist(data[, column]);
    
    if(columns.discrete[which(columns.names==column)[1]]) {
      
      variable[is.na(variable)] <- "Missings";
      
      counts <- table(variable, classification);
      rownames(counts) <- str_c(column, "-", rownames(counts));
      
      percentages <- t(t(counts) / out.discrete[1, 1:2])*100;
      
      chisq <- chisq.test(counts, simulate.p.value = chisq.simulate, B = chisq.nsim)
      chisq.p <- c(chisq$p.value, rep(NA, nrow(counts)-1));
      
      out.discrete <- rbind(out.discrete, cbind(counts, percentages, chisq.p));
      
    } else {
      
      means <- tapply(variable, classification, mean, na.rm=T);
      sd <- tapply(variable, classification, sd, na.rm=T);
      NA.counts <- tapply(is.na(variable), classification, sum);
      NA.percentages <- tapply(is.na(variable), classification, mean)*100;
      
      ttest <- t.test(x = variable[classification==unique(classification[1])],
                      y = variable[classification!=unique(classification[1])],
                      alteranative = "two-sided", paired = F);
      
      out <- rbind(c(means, sd, ttest$p.value), 
                   c(NA.counts, NA.percentages, NA));
      rownames(out) <- str_c(column, c("-MeanSD", "-Missings"));
      
      out.discrete <- rbind(out.discrete, out);
      
    }    
  }
  
  colnames(out.discrete)[1:2] <- str_c(str_c(column.class, "==", colnames(out.discrete)[1:2]), " (#/Mean)");
  colnames(out.discrete)[3:4] <- str_c(str_c(column.class, "==", colnames(out.discrete)[3:4]), " (%/SD)");
  
  if(export) write.table(out.discrete, file=export.file, sep=";");
  
  return(out.discrete)
}

# Export Characteristics Tables
Table1.L1.PFS <- getTable(data=filter(df, L1.PFS.Include), export=F);
Table1.LX.OS <- getTable(data=filter(df, LX.OS.Include), export=F);

Overview table for patients included for first-line treatment:

L1.Arm.ChemBev==FALSE (#/Mean) L1.Arm.ChemBev==TRUE (#/Mean) L1.Arm.ChemBev==FALSE (%/SD) L1.Arm.ChemBev==TRUE (%/SD) p-value
n 191.00000 676.00000 NA NA NA
Age-MeanSD 63.17801 62.54882 13.9234652 12.145134 0.5714513
Age-Missings 0.00000 0.00000 0.0000000 0.000000 NA
Gender.Female-FALSE 118.00000 407.00000 61.7801047 60.207101 0.7425257
Gender.Female-TRUE 73.00000 269.00000 38.2198953 39.792899 NA
Hospital.Private-FALSE 115.00000 337.00000 60.2094241 49.852071 0.0121988
Hospital.Private-TRUE 76.00000 339.00000 39.7905759 50.147929 NA
Stage.Diagnosis-0 8.00000 35.00000 4.1884817 5.177515 0.2915708
Stage.Diagnosis-1 0.00000 11.00000 0.0000000 1.627219 NA
Stage.Diagnosis-2 10.00000 35.00000 5.2356021 5.177515 NA
Stage.Diagnosis-3 21.00000 95.00000 10.9947644 14.053254 NA
Stage.Diagnosis-4 151.00000 491.00000 79.0575916 72.633136 NA
Stage.Diagnosis-Missings 1.00000 9.00000 0.5235602 1.331361 NA
Primary.Site-Colon 13.00000 26.00000 6.8062827 3.846154 0.3540646
Primary.Site-Left 73.00000 261.00000 38.2198953 38.609468 NA
Primary.Site-Missings 3.00000 11.00000 1.5706806 1.627219 NA
Primary.Site-Rectum 51.00000 165.00000 26.7015707 24.408284 NA
Primary.Site-Right 51.00000 213.00000 26.7015707 31.508876 NA
Metastases.Number-1 91.00000 256.00000 47.6439791 37.869822 0.1606839
Metastases.Number-2 55.00000 254.00000 28.7958115 37.573965 NA
Metastases.Number-3 33.00000 123.00000 17.2774869 18.195266 NA
Metastases.Number-4 9.00000 36.00000 4.7120419 5.325444 NA
Metastases.Number-5 3.00000 6.00000 1.5706806 0.887574 NA
Metastases.Number-6 0.00000 1.00000 0.0000000 0.147929 NA
Metastases.Liver-FALSE 64.00000 211.00000 33.5078534 31.213018 0.5971403
Metastases.Liver-TRUE 127.00000 465.00000 66.4921466 68.786982 NA
Metastases.Peritoneum-FALSE 138.00000 491.00000 72.2513089 72.633136 0.9280072
Metastases.Peritoneum-TRUE 53.00000 185.00000 27.7486911 27.366864 NA
L1.ECOG-0 51.00000 284.00000 26.7015707 42.011834 0.0001000
L1.ECOG-1 95.00000 344.00000 49.7382199 50.887574 NA
L1.ECOG-2 34.00000 43.00000 17.8010471 6.360947 NA
L1.ECOG-3 11.00000 5.00000 5.7591623 0.739645 NA
Charlson-0 127.00000 487.00000 66.4921466 72.041420 0.0014999
Charlson-1 27.00000 125.00000 14.1361257 18.491124 NA
Charlson-2 22.00000 41.00000 11.5183246 6.065089 NA
Charlson-3 7.00000 18.00000 3.6649215 2.662722 NA
Charlson-4 2.00000 2.00000 1.0471204 0.295858 NA
Charlson-5 3.00000 1.00000 1.5706806 0.147929 NA
Charlson-6 1.00000 1.00000 0.5235602 0.147929 NA
Charlson-7 0.00000 1.00000 0.0000000 0.147929 NA
Charlson-8 2.00000 0.00000 1.0471204 0.000000 NA
CEA.Level-MeanSD 725.91907 478.48598 5254.2478968 2048.838683 0.5459979
CEA.Level-Missings 19.00000 61.00000 9.9476440 9.023669 NA
RAS.Mutated-FALSE 72.00000 317.00000 37.6963351 46.893491 0.0001000
RAS.Mutated-Missings 63.00000 112.00000 32.9842932 16.568047 NA
RAS.Mutated-TRUE 56.00000 247.00000 29.3193717 36.538461 NA
BRAF.Mutated-FALSE 52.00000 266.00000 27.2251309 39.349112 0.0085991
BRAF.Mutated-Missings 127.00000 369.00000 66.4921466 54.585799 NA
BRAF.Mutated-TRUE 12.00000 41.00000 6.2827225 6.065089 NA

Overview table for patients included for further treatmen lines:

L1.Arm.ChemBev==FALSE (#/Mean) L1.Arm.ChemBev==TRUE (#/Mean) L1.Arm.ChemBev==FALSE (%/SD) L1.Arm.ChemBev==TRUE (%/SD) p-value
n 156.00000 562.00000 NA NA NA
Age-MeanSD 62.87179 62.32028 14.0789103 12.2110342 0.6567444
Age-Missings 0.00000 0.00000 0.0000000 0.0000000 NA
Gender.Female-FALSE 96.00000 341.00000 61.5384615 60.6761566 0.8505149
Gender.Female-TRUE 60.00000 221.00000 38.4615385 39.3238434 NA
Hospital.Private-FALSE 93.00000 272.00000 59.6153846 48.3985765 0.0153985
Hospital.Private-TRUE 63.00000 290.00000 40.3846154 51.6014235 NA
Stage.Diagnosis-0 4.00000 28.00000 2.5641026 4.9822064 0.2659734
Stage.Diagnosis-1 0.00000 8.00000 0.0000000 1.4234875 NA
Stage.Diagnosis-2 9.00000 28.00000 5.7692308 4.9822064 NA
Stage.Diagnosis-3 18.00000 82.00000 11.5384615 14.5907473 NA
Stage.Diagnosis-4 124.00000 407.00000 79.4871795 72.4199288 NA
Stage.Diagnosis-Missings 1.00000 9.00000 0.6410256 1.6014235 NA
Primary.Site-Colon 10.00000 18.00000 6.4102564 3.2028470 0.2894711
Primary.Site-Left 61.00000 216.00000 39.1025641 38.4341637 NA
Primary.Site-Missings 1.00000 11.00000 0.6410256 1.9572954 NA
Primary.Site-Rectum 38.00000 133.00000 24.3589744 23.6654804 NA
Primary.Site-Right 46.00000 184.00000 29.4871795 32.7402135 NA
Metastases.Number-1 70.00000 218.00000 44.8717949 38.7900356 0.6216378
Metastases.Number-2 50.00000 214.00000 32.0512821 38.0782918 NA
Metastases.Number-3 28.00000 95.00000 17.9487179 16.9039146 NA
Metastases.Number-4 7.00000 31.00000 4.4871795 5.5160142 NA
Metastases.Number-5 1.00000 4.00000 0.6410256 0.7117438 NA
Metastases.Liver-FALSE 53.00000 172.00000 33.9743590 30.6049822 0.4386561
Metastases.Liver-TRUE 103.00000 390.00000 66.0256410 69.3950178 NA
Metastases.Peritoneum-FALSE 109.00000 402.00000 69.8717949 71.5302491 0.6921308
Metastases.Peritoneum-TRUE 47.00000 160.00000 30.1282051 28.4697509 NA
L1.ECOG-0 39.00000 230.00000 25.0000000 40.9252669 0.0001000
L1.ECOG-1 83.00000 291.00000 53.2051282 51.7793594 NA
L1.ECOG-2 28.00000 37.00000 17.9487179 6.5836299 NA
L1.ECOG-3 6.00000 4.00000 3.8461538 0.7117438 NA
Charlson-0 106.00000 406.00000 67.9487179 72.2419929 0.0057994
Charlson-1 22.00000 106.00000 14.1025641 18.8612100 NA
Charlson-2 16.00000 32.00000 10.2564103 5.6939502 NA
Charlson-3 5.00000 14.00000 3.2051282 2.4911032 NA
Charlson-4 2.00000 1.00000 1.2820513 0.1779359 NA
Charlson-5 2.00000 1.00000 1.2820513 0.1779359 NA
Charlson-6 1.00000 1.00000 0.6410256 0.1779359 NA
Charlson-7 0.00000 1.00000 0.0000000 0.1779359 NA
Charlson-8 2.00000 0.00000 1.2820513 0.0000000 NA
CEA.Level-MeanSD 844.29064 488.46878 5796.9883585 2031.4895195 0.4746236
CEA.Level-Missings 15.00000 52.00000 9.6153846 9.2526690 NA
RAS.Mutated-FALSE 63.00000 269.00000 40.3846154 47.8647687 0.0001000
RAS.Mutated-Missings 50.00000 75.00000 32.0512821 13.3451957 NA
RAS.Mutated-TRUE 43.00000 218.00000 27.5641026 38.7900356 NA
BRAF.Mutated-FALSE 41.00000 217.00000 26.2820513 38.6120996 0.0146985
BRAF.Mutated-Missings 105.00000 309.00000 67.3076923 54.9822064 NA
BRAF.Mutated-TRUE 10.00000 36.00000 6.4102564 6.4056940 NA

Step 2: Multiple Imputation of Missing Data

After cleaning the data in Step 1, the missing data are addressed in Step 2. Two steps are performed in order to do so:

  1. Addressing missings in the RAS status and BRAF status
  2. Performing multiple imputation by chained equations

Step 2.1: addressing missings in the RAS status and BRAF status

As explained in the manuscript, missing data for the RAS mutation status and BRAF mutation status may be partly addressed by assuming these mutations to be mutatual exclusive. For example, this indicates that if a RAS mutation is known for a patient with a missing BRAF status, the BRAF status can be assumed to be ‘wild-type’.

# Address missing values in Mutations
# Potentially 175 missing values can be addressed:
# - 1 for RAS
# - 142 for BRAF
table(list(RAS=df$RAS.Mutated, BRAF=df$BRAF.Mutated), useNA="always");
##        BRAF
## RAS     FALSE TRUE <NA>
##   FALSE   158   51  180
##   TRUE    160    1  142
##   <NA>      0    1  174
# If patients have a KRAS mutation and the RAS status is unknown,
# we assume that they will not have a RAS mutation. Similarly, if
# the BRAF status is unknown and the patient has a RAS mutation,
# we assume that the patient does not have a BRAF mutation.
df <- df %>%
  mutate(RAS.Mutated = ifelse(is.na(RAS.Mutated) & BRAF.Mutated, F, RAS.Mutated)) %>%
  mutate(BRAF.Mutated = ifelse(is.na(BRAF.Mutated) & RAS.Mutated, F, BRAF.Mutated))

# Oberve new distribution of mutations and missing values
table(list(RAS=df$RAS.Mutated, BRAF=df$BRAF.Mutated), useNA="always");
##        BRAF
## RAS     FALSE TRUE <NA>
##   FALSE   158   52  180
##   TRUE    302    1    0
##   <NA>      0    0  174

Step 2.2: performing multiple imputation by chained equations

After addressing part of the missing values for the RAS mutation status and BRAF mutation status, remaining missing data for these and the other variables are addressed by performing multiple imputation by chained equations. Missing values are addressed for the following variables (imputation method used):

  • Stage at diagnosis (multinomial logit)
  • Primary tumor site (multinomial logit)
  • CEA level (Bayesian linear regression)
  • RAS mutation status (logistic regression)
  • BRAF mutation status (logistic regression)

The following variables were additionally used as predictors:

  • Age at treatment start
  • Gender
  • Hospital type (private of public)
  • Presence of liver metastases
  • Presence of peritoneal metastases
  • Number of metastatic sites
  • ECOG performance status
  • Charlson comorbidity index

After performing the multiple imputation, post-processing is performed while extracting the imputed datasets to make sure RAS mutation status and BRAF mutation status remain mutually exclusive.

# Load required package
library(mice);

# List the included columns
mice.cols.included <- c("Age", "Gender.Female", "Hospital.Private", "Metastases.Liver", "Metastases.Peritoneum", "Metastases.Number", "L1.ECOG", "Charlson", "Primary.Site", "Stage.Diagnosis", "CEA.Level", "RAS.Mutated", "BRAF.Mutated");

# Select the data used for the multiple imputation
df.mice <- df %>%
  select(mice.cols.included)

# Make sure the type of all columns is correct
df.mice <- df.mice %>%
  mutate(Age = as.numeric(Age)) %>% 
  mutate(Gender.Female = as.logical(Gender.Female)) %>% 
  mutate(Hospital.Private = as.logical(Hospital.Private)) %>% 
  mutate(Stage.Diagnosis = as.factor(Stage.Diagnosis)) %>% 
  mutate(Primary.Site = as.factor(Primary.Site)) %>% 
  mutate(Metastases.Liver = as.logical(Metastases.Liver)) %>% 
  mutate(Metastases.Peritoneum = as.logical(Metastases.Peritoneum)) %>% 
  mutate(Metastases.Number = as.numeric(Metastases.Number)) %>% 
  mutate(Charlson = as.numeric(Charlson)) %>% 
  mutate(CEA.Level = as.numeric(CEA.Level)) %>% 
  mutate(RAS.Mutated = as.logical(RAS.Mutated)) %>% 
  mutate(BRAF.Mutated = as.logical(BRAF.Mutated)) %>% 
  mutate(L1.ECOG = as.factor(L1.ECOG))


# Define predictor matrix
predictor.matrix <- matrix(1, length(mice.cols.included), length(mice.cols.included)) - diag(1, length(mice.cols.included), length(mice.cols.included));
colnames(predictor.matrix) <- rownames(predictor.matrix) <- mice.cols.included;

# Perform multiple imputation
m <- 25;              # Number of imputations
set.seed(20181211);   # Random seed for reproduction
mice.out <- mice(data=df.mice, m=m, maxit=20, predictorMatrix=predictor.matrix, printFlag=F,
                 method=c(Age="", 
                          Gender.Female="", 
                          Hospital.Private="", 
                          Metastases.Liver="", 
                          Metastases.Peritoneum="", 
                          Metastases.Number="", 
                          L1.ECOG="", 
                          Charlson="", 
                          Primary.Site="polyreg",
                          Stage.Diagnosis="polyreg", 
                          CEA.Level="norm", 
                          RAS.Mutated="logreg", 
                          BRAF.Mutated="logreg")
);


# Extract the imputed datasets
count.BRAF.and.RAS <- 0;    # Counter to track how much post-processing is required
datasets <- lapply(1:m, function(i.dataset) {
  
  # Get the imputed columns
  df.imputed <- complete(mice.out, action=i.dataset);
  cols.imputed <- colnames(df.imputed);
  
  # Replace the colums in the original dataset by the imputed columns
  df.new <- df;
  df.new[, cols.imputed] <- df.imputed; 
  
  # Correct when RAS and BRAF mutations conflict
  count.BRAF.and.RAS <<- count.BRAF.and.RAS + sum(df.new$BRAF.Mutated & df.new$RAS.Mutated);
  df.new$BRAF.Mutated <- ifelse(df.new$RAS.Mutated==1 & df.new$BRAF.Mutated==1, 0, df.new$BRAF.Mutated);
  
  # Return the newly completed datasets
  df.new;
  
});

# Check number of post-processed mutations
prob.BRAF.and.RAS <- (count.BRAF.and.RAS/m) / nrow(datasets[[1]]) * 100;
prob.BRAF.and.RAS;
## [1] 0.3460208

Step 3: Univariable Analysis

Initial analyses were performed to determine which distribution types would be used to parameterize the survival models and which transformations would be used for each variable in the multivariable modeling. To this end, the following steps were performed as part of this univariable modeling:

  1. Calculating the overlap of the competing events for first-line progression-free survival (Model L1-Time)
  2. Selecting the distribution for progression-free survival (Model L1-Time)
  3. Selecting the distribution for overal survival following first-line progression (Model LX-Time)
  4. Selecting the transformation for progression-free survival (Model L1-Time)
  5. Selecting the transformation for progression-free survival (Model L1-Event)
  6. Selecting the transformation for overal survival following first-line progression (Model LX-Time)

The following transformations were considered:

Step 3.1: calculating the overlap of the competing events for first-line progression-free survival (Model L1-Time)

In modeling first-line progression-free survival, there are two competing events that need to be accounted for: physician-defined progression and death. To determine which modeling approach will be implemented to account for these two competing events, the overlap of the corresponding time-to-event distributions is important to know. It was decided that competing events will be handled by using the ‘Unimodal Distribution and Regression model (UDR)’ approach based on the strategy of selecting the time-to-event first based on a joint time-to-event distribution, and then select the event corresponding to that time.

# function for calculating the overlap between the distribution for time-to-progression and time-to-death
getOverlap <- function(data, n.bins=512, digits=3) {

  library(flux);

  # Get events
  events <- sort(unique(data$event));
  n.events <- length(events);

  # Get event-specific times
  times <- lapply(events, function(e) subset(data, event==e)$time );

  # Density settings
  min <- min(data$time);
  max <- max(data$time);
  n <- n.bins;
  x <- seq(from=min, to=max, length.out=n);

  # Get event-specific densities
  y <- sapply(1:n.events, function(i1) density(times[[i1]], from=min, to=max, n=n)$y );

  # Get event=specific density overlap
  yo <- sapply(1:n.events, function(i2) {

    sapply(1:n, function(j){

      own <- y[j,i2];
      other <- max(y[j,which(c(1:n.events)!=i2)]);

      if(own<=0) {
        return(0)
      } else
        return(min(own,other))

    })

  });

  # Calculate event-specific overlap
  o <- sapply(1:n.events, function(i3) {

    (auc(x=x, y=yo[,i3]) / auc(x=x, y=y[,i3])) *100

  });

  return(round(mean(o), digits=digits));
}

# Calculate the distribution overlap
getOverlap(data=data.frame(time = df$L1.PFS.Time[df$L1.PFS.Events!="Censored"], event = df$L1.PFS.Events[df$L1.PFS.Events!="Censored"]))
## [1] 72.457
# Visualize distribution overlap
plot(density(df$L1.PFS.Time[df$L1.PFS.Events!="Censored"]), xlab="Time-to-Event in Days", ylab=NA, main="Density Plot", las=1, lwd=2, xlim=c(0,2500), ylim=c(0,0.0025))
lines(density(df$L1.PFS.Time[df$L1.PFS.Events=="Progression"]), lwd=2, col=3)
lines(density(df$L1.PFS.Time[df$L1.PFS.Events=="Death"]), lwd=2, col=2)
lines(density(df$L1.PFS.Time[df$L1.PFS.Events=="Censored"]), lwd=2, col=4)
legend("topright", lty=1, lwd=2, col=c(1,3,2, 4), c(str_c("Combined [excl. censored patients](n=", sum(df$L1.PFS.Events!="Censored"),")"), str_c("Progression (n=", sum(df$L1.PFS.Events=="Progression"),")"), str_c("Death (n=", sum(df$L1.PFS.Events=="Death"),")"), str_c("Censored (n = ", sum(df$L1.PFS.Events=="Censored"),")")))

Step 3.2: selecting the distribution for progression-free survival (Model L1-Time)

A distribution type is selected for survival Model L1-Time that represents the two competing events applicable to first-line progression-free survival (physcian-defined progression and death) in one unimodal survival model. Selection was based on the likelihood and assessment of quantile-quantile plots and Kaplan-Meier plots. The following distributions were considered:

  • Exponential
  • Weibull
  • Logistic
  • Log-normal
# Load packages
library(survival);  # Fit Survival Models
library(EnvStats);  # Q-Q Plot for censored data

# Distributions considered
dist.considered <- c("exponential", 
                     "weibull", 
                     "logistic", 
                     "lognormal");

# Data for first-line
df.L1 <- df;

# Select distribution
dist.performance.L1 <- sapply(dist.considered, function(dist) {
  
  # Fit survival model for distribution
  model <- survreg(formula=Surv(time=L1.PFS.Time, event=L1.PFS.Event) ~ 1, data=df.L1, dist=dist);
  
  # Return the likelihood
  return(model$loglik[1]);
  
});

# Print the results and which distribution would be selected based on the likelihood
dist.selected.L1 <- dist.considered[which.max(dist.performance.L1)];
print(dist.performance.L1);
## exponential     weibull    logistic   lognormal 
##   -5522.443   -5483.372   -5646.513   -5512.667
print(dist.selected.L1);
## [1] "weibull"
# Plot the survival curves
for(i in 1:length(dist.considered)) {
  
  # Get distribution type
  dist <- dist.considered[i];
  
  # Fit survival model for distribution
  model <- survreg(formula=Surv(time=L1.PFS.Time, event=L1.PFS.Event) ~ 1, data=df.L1, dist=dist);
  
  # If it is the first distribution considered, plot kaplan-meier
  if(i==1) plot(survfit(Surv(time=L1.PFS.Time, event=L1.PFS.Event) ~ 1, data=df.L1), conf.int=F, xlab="Time to Death or Progression (days)", lwd=2);
  
  # Plot fitted distribution
  lines(y=seq(.99, .01, by=-.01), x=predict(model, type="quantile", p=seq(.01, .99, by=.01))[1,], col=rainbow(length(dist.considered))[i], lwd=2)
  
  # Add the legend if it is the last distribution
  if(i==length(dist.considered)) legend("right", legend=dist.considered, col=rainbow(length(dist.considered)), lty=1, lwd=2);
  
} 

# Plot the Q-Q Plots curves
par(mfrow=c(2,2));
for(i in 1:length(dist.considered)) {
  
  # Get distribution type
  dist <- dist.considered[i];
  
  # Fit survival model for distribution
  model <- survreg(formula=Surv(time=L1.PFS.Time, event=L1.PFS.Event) ~ 1, data=df.L1, dist=dist);
  
  # Plot the Q-Q Plot
  if(dist=="exponential") qqPlotCensored(x=df.L1$L1.PFS.Time, censored=df.L1$L1.PFS.Event==0, censoring.side="right", 
                                         equal.axes=T, main=paste0("Q-Q Plot for the ", dist, " distribution"),
                                         distribution="exp", param.list=list(rate=1/exp(model$coefficients)));
  if(dist=="weibull") qqPlotCensored(x=df.L1$L1.PFS.Time, censored=df.L1$L1.PFS.Event==0, censoring.side="right", 
                                     equal.axes=T, main=paste0("Q-Q Plot for the ", dist, " distribution"), 
                                     distribution="weibull", param.list=list(shape=1/model$scale, scale=exp(model$coefficients)));
  if(dist=="logistic") qqPlotCensored(x=df.L1$L1.PFS.Time, censored=df.L1$L1.PFS.Event==0, censoring.side="right", 
                                      equal.axes=T, main=paste0("Q-Q Plot for the ", dist, " distribution"), 
                                      distribution="logis", param.list=list(location=model$coefficients, scale=model$scale));
  if(dist=="lognormal") qqPlotCensored(x=df.L1$L1.PFS.Time, censored=df.L1$L1.PFS.Event==0, censoring.side="right", 
                                       equal.axes=T, main=paste0("Q-Q Plot for the ", dist, " distribution"), 
                                       distribution="lnorm", param.list=list(meanlog=model$coefficients, sdlog=model$scale));
}

Step 3.3: selecting the distribution for overal survival following first-line progression (Model LX-Time)

A distribution type is selected for survival Model LX-Time that represents overall survival after first-line progression. Selection was based on the likelihood and assessment of quantile-quantile plots and Kaplan-Meier plots. As for Model L1-Time, the following distributions were considered:

  • Exponential
  • Weibull
  • Logistic
  • Log-normal
# Data for first-line
df.LX <- filter(df, LX.OS.Include);

# Select distribution
dist.performance.LX <- sapply(dist.considered, function(dist) {
  
  # Fit survival model for distribution
  model <- survreg(formula=Surv(time=LX.OS.Time, event=LX.OS.Event) ~ 1, data=df.LX, dist=dist);
  
  # Return the likelihood
  return(model$loglik[1]);
  
});

# Print the results and which distribution would be selected based on the likelihood
dist.selected.LX <- dist.considered[which.max(dist.performance.LX)];
print(dist.performance.LX);
## exponential     weibull    logistic   lognormal 
##   -4207.651   -4207.648   -4430.506   -4245.090
print(dist.selected.LX);
## [1] "weibull"
# Plot the survival curves
for(i in 1:length(dist.considered)) {
  
  # Get distribution type
  dist <- dist.considered[i];
  
  # Fit survival model for distribution
  model <- survreg(formula=Surv(time=LX.OS.Time, event=LX.OS.Event) ~ 1, data=df.LX, dist=dist);
  
  # If it is the first distribution considered, plot kaplan-meier
  if(i==1) plot(survfit(Surv(time=LX.OS.Time, event=LX.OS.Event) ~ 1, data=df.LX), conf.int=F, xlab="Overall Survival (days)", lwd=2);
  
  # Plot fitted distribution
  lines(y=seq(.99, .01, by=-.01), x=predict(model, type="quantile", p=seq(.01, .99, by=.01))[1,], col=rainbow(length(dist.considered))[i], lwd=2)
  
  # Add the legend if it is the last distribution
  if(i==length(dist.considered)) legend("right", legend=dist.considered, col=rainbow(length(dist.considered)), lty=1, lwd=2);
  
} 

# Plot the Q-Q Plots curves
par(mfrow=c(2,2));
for(i in 1:length(dist.considered)) {
  
  # Get distribution type
  dist <- dist.considered[i];
  
  # Fit survival model for distribution
  model <- survreg(formula=Surv(time=LX.OS.Time, event=LX.OS.Event) ~ 1, data=df.LX, dist=dist);
  
  # Plot the Q-Q Plot
  if(dist=="exponential") qqPlotCensored(x=df.LX$LX.OS.Time, censored=df.LX$LX.OS.Event==0, censoring.side="right", 
                                         equal.axes=T, main=paste0("Q-Q Plot for the ", dist, " distribution"),
                                         distribution="exp", param.list=list(rate=1/exp(model$coefficients)));
  if(dist=="weibull") qqPlotCensored(x=df.LX$LX.OS.Time, censored=df.LX$LX.OS.Event==0, censoring.side="right", 
                                     equal.axes=T, main=paste0("Q-Q Plot for the ", dist, " distribution"), 
                                     distribution="weibull", param.list=list(shape=1/model$scale, scale=exp(model$coefficients)));
  if(dist=="logistic") qqPlotCensored(x=df.LX$LX.OS.Time, censored=df.LX$LX.OS.Event==0, censoring.side="right", 
                                      equal.axes=T, main=paste0("Q-Q Plot for the ", dist, " distribution"), 
                                      distribution="logis", param.list=list(location=model$coefficients, scale=model$scale));
  if(dist=="lognormal") qqPlotCensored(x=df.LX$LX.OS.Time, censored=df.LX$LX.OS.Event==0, censoring.side="right", 
                                       equal.axes=T, main=paste0("Q-Q Plot for the ", dist, " distribution"), 
                                       distribution="lnorm", param.list=list(meanlog=model$coefficients, sdlog=model$scale));
}

Step 3.4: selecting the transformation for progression-free survival (Model L1-Time)

The best-performing transformation was selected for each variable and survival model separately by fitting individual models for each candidate transformation and comparing performance based on Wald test p-values. Univariable models were also fitted for variables without multiple candidate transformations to record p-values. All p-values were calculated for each of 25 imputed datasets, and for variables with multiple candidate transformations, transformations’ p-value ranks were determined for each imputed dataset. Final selection of the transformation to be used was based on the average transformation rank over all 25 imputed datasets.

# Load package
library(aod);   # Wald-test

# Candidate variables, including their importance and potential transformations
vars.information.L1 <- list(
  
  # Age at treatment start
  Age = list(Transformations = c(Log = "log(Age)", SQRT = "sqrt(Age)", Poly1 = "poly(Age, 1)", Poly2 = "poly(Age, 2)", Poly3 = "poly(Age, 3)", Poly4 = "poly(Age, 4)")),
  
  # Gender
  Gender.Female = list(Transformations = c(Binary.FemaleYN = "Gender.Female")),
  
  # Hospital Type
  Hospital.Private = list(Transformations = c(Binary.PrivateYN = "Hospital.Private")),
  
  # Stage at Initial Diagnosis
  Stage.Diagnosis = list(Transformations = c(Binary.Stage4YN = "as.factor(as.character(Stage.Diagnosis)=='4')")),
  
  # Primary Tumour Site
  Primary.Site = list(Transformations = c(Factor.GroupColMul = "as.factor(ifelse(Primary.Site %in% c('Colon','Multiple'), 'ColonMultiple', Primary.Site))", Factor.GroupColMul.GroupLefRec = "as.factor(ifelse(Primary.Site %in% c('Colon','Multiple'), 'ColonMultiple', ifelse(Primary.Site %in% c('Left','Rectum'), 'LeftRectum', Primary.Site)))")),
  
  # Number of Metastases
  Metastases.Number = list(Transformations = c(Factor.Group3plus = "as.factor(ifelse(Metastases.Number>=3, 3, Metastases.Number))")),
  
  # Liver Metastases
  Metastases.Liver = list(Transformations = c(Binary = "Metastases.Liver")),
  
  # Peritoneum Metastases
  Metastases.Peritoneum = list(Transformations = c(Binary = "Metastases.Peritoneum")),
  
  # ECOG Performance Status
  L1.ECOG = list(Transformations = c(Factor.Group23 = "as.factor(ifelse(L1.ECOG==0, 0, ifelse(L1.ECOG==1, 1, 2)))", Binary.2plusYN = "as.factor(ifelse(L1.ECOG %in% c(0,1), F, T))")),
  
  # Charlson Score
  Charlson = list(Transformations = c(Factor.Group2plus = "as.factor(ifelse(Charlson>=2, 2, Charlson))", Factor.Group01.Group2plus = "as.factor(ifelse(Charlson %in% c(0,1), 1, ifelse(Charlson>=2, 2, Charlson)))", Binary.1plusYN = "as.factor(ifelse(Charlson>=1, T, F))")),
  
  # CEA Level
  CEA.Level = list(Transformations = c(Binary.200plus = "as.factor(ifelse(Metastases.Number<=2, T, F) * ifelse(CEA.Level>=200, T, F))")),
  
  # RAS Status
  RAS.Mutated = list(Transformations = c(Binary.MutatedYN = "RAS.Mutated")),
  
  # BRAF Status
  BRAF.Mutated = list(Transformations = c(Binary.MutatedYN = "BRAF.Mutated"))
  
);

# Set three levels of significance
significance.levels <- c(0.1, 0.05, 0.01);

# For each imputed dataset, select distribution and perform univariable analysis
univariable.L1 <- lapply(1:length(datasets), function(dataset) {
  
  # Retrieve datasets
  data.surv.L1 <- datasets[[dataset]];

  # For each variable
  vars.univariable <- lapply(vars.information.L1, function(var) {
    
    # Consider all potential transformations
    sapply(var$Transformations, function(var.form) {
      
      # Define formula to be fitted
      f <- as.formula(str_c("Surv(time=L1.PFS.Time, event=L1.PFS.Event) ~ ", var.form, collapse=""));
      
      # Fit model
      fit <- survreg(formula=f, data=data.surv.L1, dist=dist.selected.L1);
      
      # Perform Wald-test for significance
      test <- wald.test(Sigma=fit$var, b=c(fit$coefficients, fit$scale), Terms=c(2:length(fit$coefficients)))
      
      # Determine significance
      pvalue <- test$result$chi2["P"];
      significance <- "";
      if(pvalue<max(significance.levels)) {
        significance <- str_c(rep("*", tail(which(pvalue<significance.levels), n=1)), collapse="");
      }
      
      # Return transformation and model performance
      c(trans=f, ll=fit$loglik[2], pvalue=pvalue, sign=significance)
      
    })
    
  });
  
  # Return outcomes
  return(list(vars.univariable=vars.univariable));
  
})


# Add summary for variables to vars.information object
for(var in names(vars.information.L1)) {
  
  # Get names of transformations
  transformation.names <- names(vars.information.L1[[var]]$Transformations);
  
  # Extract data from univariable object
  loglik <- as.data.frame(
    sapply(transformation.names, function(trans) 
      sapply(univariable.L1, function(univar)
        univar$vars.univariable[[var]][2,trans]
      )
    )
  )
  pvalues <- as.data.frame(
    sapply(transformation.names, function(trans) 
      sapply(univariable.L1, function(univar)
        univar$vars.univariable[[var]][3,trans]
      )
    )
  )
  significance <- as.data.frame(
    sapply(transformation.names, function(trans) 
      sapply(univariable.L1, function(univar)
        univar$vars.univariable[[var]][4,trans]
      )
    )
  )
  
  # Transform List object to Matrix
  pvalues <- matrix(unlist(pvalues), ncol=ncol(pvalues), byrow=F)
  colnames(pvalues) <- transformation.names;
  
  # Rank based on pvalues
  ranks <- matrix(apply(pvalues, 1, rank, ties.method="average"), ncol=ncol(pvalues), byrow=T);
  colnames(ranks) <- colnames(pvalues);
  avgrank <- apply(ranks, 2, mean);
  
  # Select best performing transformation
  selected <- names(avgrank)[which.min(avgrank)];
  selected.transformation <- vars.information.L1[[var]]$Transformations[selected];
  selected.avgpvalue <- mean(pvalues[, selected]);
  selected.significance <- "";
  if(selected.avgpvalue<max(significance.levels)){
    selected.significance <- str_c(rep("*", tail(which(selected.avgpvalue<significance.levels), n=1)), collapse="");
  }
  
  # Save to vars.information object
  vars.information.L1[[var]]$LogLik <- loglik;
  vars.information.L1[[var]]$Pvalues <- pvalues;
  vars.information.L1[[var]]$Significance <- significance;
  vars.information.L1[[var]]$Ranks <- ranks;
  vars.information.L1[[var]]$AvgRank <- avgrank;
  
  vars.information.L1[[var]]$SelectedTransformation <- selected.transformation;
  vars.information.L1[[var]]$SelectedPvalue <- selected.avgpvalue;
  vars.information.L1[[var]]$SelectedSignificance <- selected.significance;
  
}


# Generate summary Table
vars.univariable.L1 <- t(sapply(vars.information.L1, function(var) c(
  trans = names(var$SelectedTransformation),
  pvalue = var$SelectedPvalue,
  sign = var$SelectedSignificance
))); 

# View results
kable(vars.univariable.L1);
trans pvalue sign
Age Poly2 0.00543870740945696 ***
Gender.Female Binary.FemaleYN 0.0413078627543801 **
Hospital.Private Binary.PrivateYN 0.132221927655926
Stage.Diagnosis Binary.Stage4YN 0.0180820041944674 **
Primary.Site Factor.GroupColMul.GroupLefRec 0.00371107490900566 ***
Metastases.Number Factor.Group3plus 0.40180927502595
Metastases.Liver Binary 0.0548355494244619 *
Metastases.Peritoneum Binary 0.108781784235471
L1.ECOG Factor.Group23 8.5890183854076e-12 ***
Charlson Binary.1plusYN 0.164886584017105
CEA.Level Binary.200plus 0.00551021390960015 ***
RAS.Mutated Binary.MutatedYN 0.208343809880177
BRAF.Mutated Binary.MutatedYN 0.152022282845375

Step 3.5: selecting the transformation for progression-free survival (Model L1-Event)

To distinguish between the two competing events of physisician-defined progression death for first-line progression-free survival, a logistic regression model is needed (Model L1-Event). For this model, all best-performing transformations indentified for corresponding Model L1-Time were used, so only the best transformation for the additional variable relating to the time-to-event needs to be selected.

# Define a binary variable 'Death' to represent whether first-line progression refered to Death (1), Physician-defined progression (0), or the patient was censored (NA)
Death <- ifelse(datasets[[1]]$L1.PFS.Events=="Death", 1, ifelse(datasets[[1]]$L1.PFS.Events=="Progression", 0, NA));

# Fit the different transformations
summary(glm(Death ~ 1))
## 
## Call:
## glm(formula = Death ~ 1)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.1023  -0.1023  -0.1023  -0.1023   0.8977  
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.10234    0.01065    9.61   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.09198216)
## 
##     Null deviance: 74.506  on 810  degrees of freedom
## Residual deviance: 74.506  on 810  degrees of freedom
##   (56 observations deleted due to missingness)
## AIC: 369.34
## 
## Number of Fisher Scoring iterations: 2
summary(glm(Death ~ sqrt(datasets[[1]]$L1.PFS.Time)))
## 
## Call:
## glm(formula = Death ~ sqrt(datasets[[1]]$L1.PFS.Time))
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -0.21793  -0.11542  -0.09612  -0.07337   0.95884  
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)   
## (Intercept)                     0.029326   0.027825   1.054  0.29222   
## sqrt(datasets[[1]]$L1.PFS.Time) 0.004473   0.001576   2.838  0.00465 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.09118778)
## 
##     Null deviance: 74.506  on 810  degrees of freedom
## Residual deviance: 73.771  on 809  degrees of freedom
##   (56 observations deleted due to missingness)
## AIC: 363.31
## 
## Number of Fisher Scoring iterations: 2
summary(glm(Death ~ log(datasets[[1]]$L1.PFS.Time)))
## 
## Call:
## glm(formula = Death ~ log(datasets[[1]]$L1.PFS.Time))
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -0.12443  -0.10784  -0.10247  -0.09366   0.93415  
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)
## (Intercept)                     0.04526    0.06323   0.716    0.474
## log(datasets[[1]]$L1.PFS.Time)  0.01058    0.01155   0.916    0.360
## 
## (Dispersion parameter for gaussian family taken to be 0.09200045)
## 
##     Null deviance: 74.506  on 810  degrees of freedom
## Residual deviance: 74.428  on 809  degrees of freedom
##   (56 observations deleted due to missingness)
## AIC: 370.5
## 
## Number of Fisher Scoring iterations: 2
summary(glm(Death ~ poly(datasets[[1]]$L1.PFS.Time, 1)))
## 
## Call:
## glm(formula = Death ~ poly(datasets[[1]]$L1.PFS.Time, 1))
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -0.35633  -0.11251  -0.08696  -0.06513   0.95045  
## 
## Coefficients:
##                                    Estimate Std. Error t value Pr(>|t|)
## (Intercept)                         0.10236    0.01054   9.710  < 2e-16
## poly(datasets[[1]]$L1.PFS.Time, 1)  1.31589    0.31269   4.208 2.86e-05
##                                       
## (Intercept)                        ***
## poly(datasets[[1]]$L1.PFS.Time, 1) ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.09012303)
## 
##     Null deviance: 74.506  on 810  degrees of freedom
## Residual deviance: 72.910  on 809  degrees of freedom
##   (56 observations deleted due to missingness)
## AIC: 353.78
## 
## Number of Fisher Scoring iterations: 2
summary(glm(Death ~ poly(datasets[[1]]$L1.PFS.Time, 2)))
## 
## Call:
## glm(formula = Death ~ poly(datasets[[1]]$L1.PFS.Time, 2))
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -0.62677  -0.09385  -0.08514  -0.08335   0.91693  
## 
## Coefficients:
##                                     Estimate Std. Error t value Pr(>|t|)
## (Intercept)                           0.1026     0.0105   9.764  < 2e-16
## poly(datasets[[1]]$L1.PFS.Time, 2)1   1.3072     0.3115   4.196 3.02e-05
## poly(datasets[[1]]$L1.PFS.Time, 2)2   0.8118     0.3048   2.664  0.00789
##                                        
## (Intercept)                         ***
## poly(datasets[[1]]$L1.PFS.Time, 2)1 ***
## poly(datasets[[1]]$L1.PFS.Time, 2)2 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.08944915)
## 
##     Null deviance: 74.506  on 810  degrees of freedom
## Residual deviance: 72.275  on 808  degrees of freedom
##   (56 observations deleted due to missingness)
## AIC: 348.69
## 
## Number of Fisher Scoring iterations: 2
summary(glm(Death ~ poly(datasets[[1]]$L1.PFS.Time, 3)))
## 
## Call:
## glm(formula = Death ~ poly(datasets[[1]]$L1.PFS.Time, 3))
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -0.48879  -0.10220  -0.07718  -0.06743   0.93404  
## 
## Coefficients:
##                                     Estimate Std. Error t value Pr(>|t|)
## (Intercept)                          0.10303    0.01045   9.856  < 2e-16
## poly(datasets[[1]]$L1.PFS.Time, 3)1  1.33858    0.31025   4.315  1.8e-05
## poly(datasets[[1]]$L1.PFS.Time, 3)2  0.83827    0.30347   2.762  0.00587
## poly(datasets[[1]]$L1.PFS.Time, 3)3 -0.90747    0.30755  -2.951  0.00326
##                                        
## (Intercept)                         ***
## poly(datasets[[1]]$L1.PFS.Time, 3)1 ***
## poly(datasets[[1]]$L1.PFS.Time, 3)2 ** 
## poly(datasets[[1]]$L1.PFS.Time, 3)3 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.08860409)
## 
##     Null deviance: 74.506  on 810  degrees of freedom
## Residual deviance: 71.504  on 807  degrees of freedom
##   (56 observations deleted due to missingness)
## AIC: 341.99
## 
## Number of Fisher Scoring iterations: 2

Step 3.6: selecting the transformation for overal survival following first-line progression (Model LX-Time)

Selection of the best-performing transformation for Model LX-Time was done in the same way as for Model L1-Tim. The only difference is that an additional variable representing the first-line progression-free survival was included.

# Candidate variables, including their importance and potential transformations
vars.information.LX <- list(
  
  # Progression-free survival
  PFS = list(Transformations = c(Log = "log(L1.PFS.Time)", SQRT = "sqrt(L1.PFS.Time)", Poly1 = "poly(L1.PFS.Time, 1)", Poly2 = "poly(L1.PFS.Time, 2)", Poly3 = "poly(L1.PFS.Time, 3)")),
  
  # Age at treatment start
  Age = list(Transformations = c(Log = "log(Age)", SQRT = "sqrt(Age)", Poly1 = "poly(Age, 1)",  Poly2 = "poly(Age, 2)",  Poly3 = "poly(Age, 3)")),
  
  # Gender
  Gender.Female = list(Transformations = c(Binary.FemaleYN = "Gender.Female")),
  
  # Hospital Type
  Hospital.Private = list(Transformations = c(Binary.PrivateYN = "Hospital.Private")),
  
  # Stage at Initial Diagnosis
  Stage.Diagnosis = list(Transformations = c(Binary.Stage4YN = "as.factor(as.character(Stage.Diagnosis)=='4')")),
  
  # Primary Tumour Site
  Primary.Site = list(Transformations = c(Factor.GroupColMul = "as.factor(ifelse(Primary.Site %in% c('Colon','Multiple'), 'ColonMultiple', Primary.Site))", Factor.GroupColMul.GroupLefRec = "as.factor(ifelse(Primary.Site %in% c('Colon','Multiple'), 'ColonMultiple', ifelse(Primary.Site %in% c('Left','Rectum'), 'LeftRectum', Primary.Site)))")),
  
  # Number of Metastases
  Metastases.Number = list(Transformations = c(Factor.Group3plus = "as.factor(ifelse(Metastases.Number>=3, 3, Metastases.Number))")),
  
  # Liver Metastases
  Metastases.Liver = list(Transformations = c(Binary = "Metastases.Liver")),
  
  # Peritoneum Metastases
  Metastases.Peritoneum = list(Transformations = c(Binary = "Metastases.Peritoneum")),
  
  # ECOG Performance Status
  L1.ECOG = list(Transformations = c(Factor.Group23 = "as.factor(ifelse(L1.ECOG==0, 0, ifelse(L1.ECOG==1, 1, 2)))", Binary.2plusYN = "as.factor(ifelse(L1.ECOG %in% c(0,1), F, T))")),
  
  # Charlson Score
  Charlson = list(Transformations = c(Factor.Group2plus = "as.factor(ifelse(Charlson>=2, 2, Charlson))", Factor.Group01.Group2plus = "as.factor(ifelse(Charlson %in% c(0,1), 1, ifelse(Charlson>=2, 2, Charlson)))", Binary.1plusYN = "as.factor(ifelse(Charlson>=1, T, F))")),
  
  # CEA Level
  CEA.Level = list(Transformations = c(Binary.200plus = "as.factor(ifelse(Metastases.Number<=2, T, F) * ifelse(CEA.Level>=200, T, F))")),
  
  # RAS Status
  RAS.Mutated = list(Transformations = c(Binary.MutatedYN = "RAS.Mutated")),
  
  # BRAF Status
  BRAF.Mutated = list(Transformations = c(Binary.MutatedYN = "BRAF.Mutated"))
  
);


# For each imputed dataset, select distribution and perform univariable analysis
univariable.LX <- lapply(1:length(datasets), function(dataset) {
  
  # Retrieve datasets
  data.surv <- datasets[[dataset]];
  data.surv.LX <- filter(data.surv, LX.OS.Include);
  
  # For each variable
  vars.univariable <- lapply(vars.information.LX, function(var) {
    
    # Consider all potential transformations
    sapply(var$Transformations, function(var.form) {
      
      # Define formula to be fitted
      f <- as.formula(str_c("Surv(time=LX.OS.Time, event=LX.OS.Event) ~ ", var.form, collapse=""));
      
      # Fit model
      fit <- survreg(formula=f, data=data.surv.LX, dist=dist.selected.LX);
      
      # Perform Wald-test for significance
      test <- wald.test(Sigma=fit$var, b=c(fit$coefficients, fit$scale), Terms=c(2:length(fit$coefficients)))
      
      # Determine significance
      pvalue <- test$result$chi2["P"];
      significance <- "";
      if(pvalue<max(significance.levels)) {
        significance <- str_c(rep("*", tail(which(pvalue<significance.levels), n=1)), collapse="");
      }
      
      # Return transformation and model performance
      c(trans=f, ll=fit$loglik[2], pvalue=pvalue, sign=significance)
      
    })
    
  });
  
  # Return outcomes
  return(list(vars.univariable=vars.univariable));
  
})


# Add summary for variables to vars.information object
for(var in names(vars.information.LX)) {
  
  # Get names of transformations
  transformation.names <- names(vars.information.LX[[var]]$Transformations);
  
  # Extract data from univariable object
  loglik <- as.data.frame(
    sapply(transformation.names, function(trans) 
      sapply(univariable.LX, function(univar)
        univar$vars.univariable[[var]][2,trans]
      )
    )
  )
  pvalues <- as.data.frame(
    sapply(transformation.names, function(trans) 
      sapply(univariable.LX, function(univar)
        univar$vars.univariable[[var]][3,trans]
      )
    )
  )
  significance <- as.data.frame(
    sapply(transformation.names, function(trans) 
      sapply(univariable.LX, function(univar)
        univar$vars.univariable[[var]][4,trans]
      )
    )
  )
  
  # Transform List object to Matrix
  pvalues <- matrix(unlist(pvalues), ncol=ncol(pvalues), byrow=F)
  colnames(pvalues) <- transformation.names;
  
  # Rank based on pvalues
  ranks <- matrix(apply(pvalues, 1, rank, ties.method="average"), ncol=ncol(pvalues), byrow=T);
  colnames(ranks) <- colnames(pvalues);
  avgrank <- apply(ranks, 2, mean);
  
  # Select best performing transformation
  selected <- names(avgrank)[which.min(avgrank)];
  selected.transformation <- vars.information.LX[[var]]$Transformations[selected];
  selected.avgpvalue <- mean(pvalues[, selected]);
  selected.significance <- "";
  if(selected.avgpvalue<max(significance.levels)){
    selected.significance <- str_c(rep("*", tail(which(selected.avgpvalue<significance.levels), n=1)), collapse="");
  }
  
  # Save to vars.information object
  vars.information.LX[[var]]$LogLik <- loglik;
  vars.information.LX[[var]]$Pvalues <- pvalues;
  vars.information.LX[[var]]$Significance <- significance;
  vars.information.LX[[var]]$Ranks <- ranks;
  vars.information.LX[[var]]$AvgRank <- avgrank;
  
  vars.information.LX[[var]]$SelectedTransformation <- selected.transformation;
  vars.information.LX[[var]]$SelectedPvalue <- selected.avgpvalue;