Applied Machine Learning for Health Data

Course Number
CHL5230H
Series
5200 (Biostatistics)
Format
Lecture
Course Instructor(s)
Mohammad Kaviul Khan

Course Description

Data Science is a rapidly emerging field that is increasingly applied across industry, academia, and government. Health Data Science refers to the use of Data Science methods and principles to address large, complex, real-world health data and problems. Examples include health administrative data, electronic health records, and clinical registries, which can also be linked with patient-reported outcomes, genomic data, and laboratory data, among others.

This course provides an introduction to Data Science and its applications in population health and public health outcomes, with a focus on Data Science analytics methods. The content will emphasize statistical approaches for supervised and unsupervised learning. Topics include linear regression, penalized regression methods (ridge, LASSO, elastic net), classification techniques such as logistic regression and linear discriminant analysis, decision trees, random forests, gradient boosting machines, and XGBoost. Additional topics include training error, test error, and cross-validation; principal component analysis; stochastic gradient descent; k-means clustering; and nearest neighbor methods.

All statistical analysis will be conducted in R, with computational support provided during lectures and office hours. While some theoretical background will be covered, the primary focus will be on hands-on, practical applications using large health datasets.

Course Objectives

By the end of the course, students will be able to:

  • understand what is meant by machine learning and data science;
  • distinguish the different types of machine learning by how they work and the tasks they accomplish;
  • fit machine learning models to data, and obtain and interpret the results;
  • apply these methods to real data using the statistical software R, with appropriate justification for their use;
  • interpret data analysis results in clear, non-technical language;
  • critically appraise, to some degree, the appropriateness of machine learning methodology used in published research.

Methods of Assessment

Assignment (1) 15%
Project Presentation (Group Activity) 15%
Project Report (Group Activity) 30%
Final Exam 30%
Participation 10%

General Requirements

Students should have taken a graduate-level course in statistics, be familiar with basic concepts of statistics and probability, and have a good understanding of regression. Previous experience writing scripts for data analysis (using R, SAS, or similar software) or other programming experience is preferred but not required.