Biostatistics Seminar Series with Dr. Nicholas Mitsakakis on ML for Sample Size Calculation

Location

Health Sciences Building, 155 College Street

Format

In-Person

Dates

November 27, 2025 from 12:10pm to 1:00pm

Presented by the DLSPH Biostatistics Division …

The Biostatistics Seminar Series presents:

“Sample Size Calculation for Training Ensemble Machine Learning Models on Health Data” by Dr. Nicholas Mitsakakis, University of Toronto & CHEO Research Institute

Abstract: Machine learning (ML) models are increasingly used in clinical research, yet most studies lack validated methods for determining adequate sample sizes, often relying on outdated heuristics. This study introduces an empirically derived sample size calculator tailored for ensemble ML models—Random Forests, LightGBM, and XGBoost—trained on tabular health data. Our method introduces the concept of certainty curves, which estimate the probability that a model trained on a given sample size achieves a target ROC-AUC relative to the optimal model trained on the full population. Using simulations across 13 large health datasets, we trained over 89,000 models and built a predictive calculator using dataset characteristics like class imbalance, entropy, and degrees of freedom. Compared to existing methods, our calculator showed significantly lower error rates, providing a robust solution for ML study design, regulatory submissions, and adherence to reporting guidelines. R code is available to facilitate implementation in future research.
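To make the certainty-curve idea in the abstract concrete, here is a minimal Python sketch. It is not the speaker's calculator or the R code mentioned above. It assumes that ROC-AUC values from repeated training runs at each sample size are already available, and it interprets "target ROC-AUC relative to the optimal model" as a fixed fraction of a reference AUC. The function names `certainty_curve` and `smallest_sufficient_n`, the `relative_target` parameter, and the example AUC values are all hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Optional


def certainty_curve(
    aucs_by_n: Dict[int, List[float]],
    reference_auc: float,
    relative_target: float = 0.95,
) -> Dict[int, float]:
    """Empirical certainty per sample size.

    For each sample size n, return the share of repeated training runs whose
    ROC-AUC reached at least relative_target * reference_auc.
    (Assumption: "relative to the optimal model" is read as a ratio.)
    """
    threshold = relative_target * reference_auc
    curve: Dict[int, float] = {}
    for n, aucs in sorted(aucs_by_n.items()):
        if not aucs:
            raise ValueError(f"no AUC values for sample size {n}")
        hits = sum(1 for auc in aucs if auc >= threshold)
        curve[n] = hits / len(aucs)
    return curve


def smallest_sufficient_n(
    curve: Dict[int, float], certainty: float = 0.8
) -> Optional[int]:
    """Smallest sample size whose empirical certainty meets the requested level."""
    for n in sorted(curve):
        if curve[n] >= certainty:
            return n
    return None  # no evaluated sample size was sufficient


# Made-up AUC values for illustration only (not results from the talk).
example_aucs = {
    100: [0.70, 0.74, 0.78, 0.72],
    500: [0.79, 0.80, 0.76, 0.81],
    2000: [0.81, 0.82, 0.80, 0.82],
}
curve = certainty_curve(example_aucs, reference_auc=0.82, relative_target=0.95)
print(curve)
print(smallest_sufficient_n(curve, certainty=0.8))
```

In this reading, a study designer would pick the smallest sample size at which the curve crosses the desired certainty level. The calculator described in the abstract goes further by predicting such curves from dataset characteristics rather than from repeated training runs.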

For Dr. Mitsakakis’s biosketch, please see https://www.dlsph.utoronto.ca/faculty-profile/mitsakakis-nicholas/
