Location
Jackman Law Building, University of Toronto
Dates
  • May 3, 2019 from 8:00am to 4:00pm

Regression Modelling Strategies
Instructor: Frank Harrell, Vanderbilt University

Regression models are frequently used to develop diagnostic, prognostic, and health resource utilization models in clinical, health services, outcomes, pharmacoeconomic, and epidemiologic research, and in a multitude of non-health-related areas. Regression models are also used to adjust for patient heterogeneity in randomized clinical trials, to obtain tests that are more powerful and valid than unadjusted treatment comparisons. Models must be flexible enough to fit nonlinear and non-additive relationships, but unless the sample size is enormous, the approach to modeling must avoid common problems with data mining or data dredging that result in overfitting and a failure of the predictive model to validate on new subjects. All standard regression models have assumptions that must be verified for the model to have power to test hypotheses and for it to be able to predict accurately. Of the principal assumptions (linearity, additivity, distributional), this short course will emphasize methods for assessing and satisfying the first two. Practical but powerful tools are presented for validating model assumptions and presenting model results. This course provides methods for estimating the shape of the relationship between predictors and response.
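As an illustration of estimating the shape of a predictor-response relationship, the sketch below builds a restricted cubic spline basis for a single predictor. It uses the standard truncated-power form, which is constrained to be linear beyond the outer knots. The function name `rcs_basis` and the knot locations are hypothetical choices for illustration and are not taken from the course materials. Software implementations often rescale the nonlinear terms; no rescaling is applied here.

```python
def rcs_basis(x, knots):
    """Return restricted cubic spline basis values for a single number x.

    The result is [x, term_1, ..., term_{k-2}] for k knots, so a model
    with k knots spends k - 1 degrees of freedom on this predictor.
    """
    t = sorted(knots)
    k = len(t)
    if k < 3 or len(set(t)) != k:
        raise ValueError("need at least 3 distinct knots")

    def cube_plus(u):
        # Truncated cubic: u^3 when u is positive, otherwise 0.
        return u ** 3 if u > 0 else 0.0

    span = t[-1] - t[-2]  # distance between the last two knots
    terms = [float(x)]    # the linear term
    for j in range(k - 2):
        terms.append(
            cube_plus(x - t[j])
            - cube_plus(x - t[-2]) * (t[-1] - t[j]) / span
            + cube_plus(x - t[-1]) * (t[-2] - t[j]) / span
        )
    return terms


# Hypothetical knot locations, chosen only for illustration.
example_knots = [1, 2, 3, 4, 5]
example_row = rcs_basis(3.5, example_knots)  # 4 columns for 5 knots
```

Each row of basis values can be used as the columns for that predictor in an ordinary regression fit. A test of whether the coefficients of the nonlinear terms are all zero is then a test of linearity.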

The first part of the course presents the following elements of multivariable predictive modeling for a single response variable: using regression splines to relax linearity assumptions, perils of variable selection and overfitting, where to spend degrees of freedom, shrinkage, imputation of missing data, data reduction, and interaction surfaces. A default overall modeling strategy is then described, with an eye towards “safe data mining”. This is followed by methods for understanding models graphically (e.g., using nomograms) and for using resampling to estimate a model’s likely performance on new data.
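One resampling approach to estimating performance on new data is the optimism bootstrap, sketched below for a one-predictor least-squares line with R² as the performance measure. This is an assumption about which resampling method is meant, since the description above does not name one. The function names, the default number of bootstrap repetitions, and the choice of R² are all hypothetical choices made for illustration.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def r_squared(model, xs, ys):
    """R^2 of a fixed (intercept, slope) model evaluated on (xs, ys)."""
    intercept, slope = model
    mean_y = sum(ys) / len(ys)
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    sst = sum((y - mean_y) ** 2 for y in ys)
    return 1 - sse / sst


def optimism_corrected_r2(xs, ys, n_boot=200, seed=0):
    """Return (apparent R^2, mean optimism, optimism-corrected R^2)."""
    rng = random.Random(seed)
    n = len(xs)
    apparent = r_squared(fit_line(xs, ys), xs, ys)
    optimism = []
    for _ in range(n_boot):
        # Draw n subjects with replacement and refit on the resample.
        idx = [rng.randrange(n) for _ in range(n)]
        bx = [xs[i] for i in idx]
        by = [ys[i] for i in idx]
        if len(set(bx)) < 2 or len(set(by)) < 2:
            continue  # degenerate resample: the fit or R^2 is undefined
        model = fit_line(bx, by)
        # Optimism: performance on the resample minus performance of the
        # same fitted model on the original data.
        optimism.append(r_squared(model, bx, by) - r_squared(model, xs, ys))
    mean_optimism = sum(optimism) / len(optimism)
    return apparent, mean_optimism, apparent - mean_optimism
```

With a single pre-specified predictor the estimated optimism is usually small. When the modeling process includes variable selection or other data-driven steps, every such step has to be repeated inside the loop for each resample for the correction to be honest.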

Participants should have a good working knowledge of multiple regression. The following articles are suggested reading in advance: Harrell, Lee, Mark: Stat in Med 15:361-387, 1996; Spanos, Harrell, Durack: JAMA 262:2700-2707, 1989. See http://fharrell.com/links for more background information and resources.