- Location: Health Sciences Building, 155 College Street
- Series/Type: Alumni Event, DLSPH Event, Faculty/Staff Event, Student Event
- Format: In-Person
- Dates: November 27, 2025 from 12:10pm to 1:00pm
Presented by the DLSPH Biostatistics Division …
The Biostatistics Seminar Series presents:
“Sample Size Calculation for Training Ensemble Machine Learning Models on Health Data” by Dr. Nicholas Mitsakakis, University of Toronto & CHEO Research Institute
Abstract: Machine learning (ML) models are increasingly used in clinical research, yet most studies lack validated methods for determining adequate sample sizes, often relying on outdated heuristics. This study introduces an empirically derived sample size calculator tailored for ensemble ML models—Random Forests, LightGBM, and XGBoost—trained on tabular health data. Our method introduces the concept of certainty curves, which estimate the probability that a model trained on a given sample size achieves a target ROC-AUC relative to the optimal model trained on the full population. Using simulations across 13 large health datasets, we trained over 89,000 models and built a predictive calculator using dataset characteristics like class imbalance, entropy, and degrees of freedom. Compared to existing methods, our calculator showed significantly lower error rates, providing a robust solution for ML study design, regulatory submissions, and adherence to reporting guidelines. R code is available to facilitate implementation in future research.
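To make the certainty-curve idea in the abstract concrete, here is a minimal Python sketch. It is not the speaker's R code or calculator. It assumes that "achieving the target" means a model's ROC-AUC falls within a chosen tolerance of the AUC of a model trained on all available data, and it estimates certainty as the share of repeated training runs that meet that target at each sample size. The function names, the tolerance, and the toy AUC values are all hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Optional


def certainty_curve(
    aucs_by_n: Dict[int, List[float]],
    full_auc: float,
    tolerance: float,
) -> Dict[int, float]:
    """Empirical certainty per sample size.

    For each sample size n, return the fraction of repeated training runs
    whose ROC-AUC is at least (full_auc - tolerance). This definition of
    the target is an assumption made for this sketch.
    """
    target = full_auc - tolerance
    return {
        n: sum(auc >= target for auc in aucs) / len(aucs)
        for n, aucs in sorted(aucs_by_n.items())
    }


def minimum_sample_size(
    curve: Dict[int, float], certainty: float
) -> Optional[int]:
    """Smallest sample size whose empirical certainty reaches the requested level."""
    for n in sorted(curve):
        if curve[n] >= certainty:
            return n
    return None  # no evaluated sample size is large enough


# Toy AUC values from four hypothetical training runs per sample size.
# These numbers are invented for illustration and are not study results.
toy_aucs = {
    200: [0.70, 0.74, 0.78, 0.72],
    500: [0.76, 0.79, 0.80, 0.78],
    1000: [0.79, 0.80, 0.81, 0.80],
    2000: [0.81, 0.80, 0.82, 0.81],
}

curve = certainty_curve(toy_aucs, full_auc=0.82, tolerance=0.025)
recommended_n = minimum_sample_size(curve, certainty=0.9)
```

In this toy setup, `curve` maps each sample size to the share of runs that came within 0.025 of the full-data AUC, and `recommended_n` is the smallest evaluated size at which at least 90% of runs did so. The calculator described in the abstract goes further by predicting such curves from dataset characteristics rather than from repeated training runs.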
For Dr. Mitsakakis’s biosketch, please see https://www.dlsph.utoronto.ca/faculty-profile/mitsakakis-nicholas/