Location
Health Sciences Building, 155 College Street
Format
In-Person
Dates
  • November 27, 2025 from 12:10pm to 1:00pm

Presented by the DLSPH Biostatistics Division …

The Biostatistics Seminar Series presents:

“Sample Size Calculation for Training Ensemble Machine Learning Models on Health Data” by Dr. Nicholas Mitsakakis, University of Toronto & CHEO Research Institute

Abstract: Machine learning (ML) models are increasingly used in clinical research, yet most studies lack validated methods for determining adequate sample sizes, often relying on outdated heuristics. This study introduces an empirically derived sample size calculator tailored for ensemble ML models—Random Forests, LightGBM, and XGBoost—trained on tabular health data. Our method introduces the concept of certainty curves, which estimate the probability that a model trained on a given sample size achieves a target ROC-AUC relative to the optimal model trained on the full population. Using simulations across 13 large health datasets, we trained over 89,000 models and built a predictive calculator using dataset characteristics like class imbalance, entropy, and degrees of freedom. Compared to existing methods, our calculator showed significantly lower error rates, providing a robust solution for ML study design, regulatory submissions, and adherence to reporting guidelines. R code is available to facilitate implementation in future research.
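As a rough illustration of the certainty-curve idea in the abstract, the Python sketch below estimates, for each candidate sample size, the share of repeated training runs whose ROC-AUC reaches a target defined relative to the full-population model. This is not the speaker's calculator or R code: the function names, the `target_ratio` parameter, the choice to express the target as a fraction of the full-population AUC, and the AUC values are all assumptions made for illustration.

```python
from typing import Dict, Optional, Sequence


def certainty_curve(
    aucs_by_n: Dict[int, Sequence[float]],
    full_auc: float,
    target_ratio: float = 0.95,
) -> Dict[int, float]:
    """Empirical certainty per sample size.

    For each sample size n, return the fraction of repeated runs whose
    ROC-AUC is at least target_ratio * full_auc (an assumed definition
    of "target relative to the optimal model").
    """
    threshold = target_ratio * full_auc
    curve = {}
    for n in sorted(aucs_by_n):
        aucs = aucs_by_n[n]
        hits = sum(1 for auc in aucs if auc >= threshold)
        curve[n] = hits / len(aucs)
    return curve


def min_sample_size(curve: Dict[int, float], certainty: float = 0.9) -> Optional[int]:
    """Smallest sample size whose estimated certainty meets the level, else None."""
    for n in sorted(curve):
        if curve[n] >= certainty:
            return n
    return None


# Made-up AUCs from four hypothetical training runs per sample size.
example_aucs = {
    500: [0.70, 0.74, 0.78, 0.72],
    1000: [0.77, 0.79, 0.75, 0.80],
    2000: [0.80, 0.81, 0.79, 0.82],
}

curve = certainty_curve(example_aucs, full_auc=0.82, target_ratio=0.95)
recommended_n = min_sample_size(curve, certainty=0.9)
```

In practice the per-run AUCs would come from repeatedly subsampling a large dataset and training a model at each size. The calculator described in the abstract goes further by predicting such curves from dataset characteristics, which this sketch does not attempt.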

For Dr. Mitsakakis’s biosketch, please see https://www.dlsph.utoronto.ca/faculty-profile/mitsakakis-nicholas/