Preprint Article / Version 2 / Preserved in Portico / Not peer-reviewed

Can a Transparent Machine Learning Algorithm Predict Better than Its Black-Box Counterparts? A Benchmarking Study using 110 Diverse Datasets

Version 1 : Received: 31 May 2024 / Approved: 4 June 2024 / Online: 7 June 2024 (07:43:04 CEST)
Version 2 : Received: 27 June 2024 / Approved: 27 June 2024 / Online: 27 June 2024 (11:07:41 CEST)

How to cite: Peterson, R. A.; McGrath, M.; Cavanaugh, J. E. Can a Transparent Machine Learning Algorithm Predict Better than Its Black-Box Counterparts? A Benchmarking Study using 110 Diverse Datasets. Preprints 2024, 2024060478. https://doi.org/10.20944/preprints202406.0478.v2

Abstract

We developed a novel machine learning (ML) algorithm with the goal of producing transparent models (i.e., models understandable by humans) while also flexibly accounting for nonlinearity and interactions. Our method is based on ranked sparsity, and it allows for flexibility and user control in varying the degree of opacity of black-box machine learning methods. The main tenet of ranked sparsity is that an algorithm should be more skeptical of higher-order polynomials and interactions a priori than of main effects, so the inclusion of these more complex terms should require a higher level of evidence. In this work, we put our new ranked sparsity algorithm (as implemented in the open-source R package `sparseR`) to the test in a predictive model "bakeoff" (i.e., a benchmarking study of ML algorithms applied "out of the box," with no special tuning). Algorithms were trained on a diverse set of simulated and real-world data sets from the Penn Machine Learning Benchmarks database, addressing both regression and binary classification problems. We evaluated the extent to which our human-centered algorithm can attain predictive accuracy that rivals popular black-box approaches such as neural networks, random forests, and support vector machines, while also producing more interpretable models. Using out-of-bag error as a meta-outcome, we describe the properties of data sets in which human-centered approaches can perform as well as or better than black-box approaches. We find that interpretable approaches predicted optimally or within 5% of the optimal method in most real-world data sets. We provide a more in-depth comparison of random forests to interpretable methods in several case studies, including exemplars in which the algorithms performed similarly and several cases in which interpretable methods underperformed. This work provides a strong rationale for including human-centered, transparent algorithms such as ours in predictive modeling applications.
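To make the penalty-ordering idea concrete, the sketch below illustrates the main tenet of ranked sparsity using the general-purpose `glmnet` package rather than the authors' `sparseR` implementation: interaction terms are assigned larger penalty factors than main effects, so they require stronger evidence to enter the model. The twofold penalty factor is an illustrative assumption, not the weighting scheme `sparseR` actually uses.

```r
# A minimal sketch of the ranked-sparsity tenet via glmnet's penalty.factor,
# NOT the authors' sparseR implementation: higher-order terms are penalized
# more heavily a priori, so they must show stronger evidence to be selected.
library(glmnet)

# Design matrices: main effects only, and main effects plus pairwise interactions
X_main <- model.matrix(mpg ~ .,   data = mtcars)[, -1]
X_full <- model.matrix(mpg ~ .^2, data = mtcars)[, -1]

# Flag the interaction columns (those not among the main effects)
is_interaction <- !(colnames(X_full) %in% colnames(X_main))

# Illustrative choice: penalize interactions twice as hard as main effects
# (sparseR derives its weights differently; this factor is an assumption)
pf <- ifelse(is_interaction, 2, 1)

# Cross-validated lasso with term-specific penalty factors
set.seed(1)
cv_fit <- cv.glmnet(X_full, mtcars$mpg, penalty.factor = pf)

# Inspect which terms survive at the 1-SE lambda
coef(cv_fit, s = "lambda.1se")
```

In practice, `sparseR` wraps this kind of differential penalization of polynomials and interactions behind a formula interface; consult the package documentation for its exact weighting scheme.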

Keywords

model selection; feature selection; lasso; explainable machine learning

Subject

Computer Science and Mathematics, Artificial Intelligence and Machine Learning
