Abstract
Background: Digital clinical measures collected via digital sensing technologies such as smartphones, smartwatches, wearables, ingestibles, and implantables are increasingly used by individuals and clinicians to capture health outcomes or behavioral and physiological characteristics. Time series classification (TSC) is commonly used in modeling digital clinical measures. While deep learning models for TSC are common and powerful, some fundamental challenges remain. This review presents non-deep learning models that are commonly used for time series classification in biomedical applications and that achieve high performance. Objective: We performed a systematic review to characterize the techniques used in time series classification of digital clinical measures throughout all stages of data processing and model building. Methods: We searched the PubMed, Institute of Electrical and Electronics Engineers (IEEE), Web of Science, and SCOPUS databases using a range of search terms to retrieve peer-reviewed articles reporting academic research on digital clinical measures in the five-year period between June 2016 and June 2021. We identified and categorized research studies based on the types of classification algorithms and sensor input types. Results: We found 452 papers in total across the four databases. After removing duplicates and irrelevant papers, 135 articles remained for detailed review and data extraction. Among these, feeding features engineered with time series methods into widely used machine learning classifiers was the most commonly used technique and also the one that most frequently achieved the best performance metrics (77 out of 135 articles). Statistical modeling algorithms (24 out of 135 articles) were the second most common and also the second-best-performing classification technique.
Wavelet-based classification models (8 out of 135 articles) were also common. Electroencephalogram (29 out of 135 articles) was the most common input data type. Accuracy was the most commonly reported performance metric, appearing in 67.65% of articles. In this review, we summarize signal pre-processing methods, feature engineering and selection methods, time series models, and model interpretation approaches. Importantly, we found that about 50% of the papers report only one performance metric, which may give a skewed view of overall performance. Conclusion: Although high time series classification performance has been achieved for digital clinical, physiological, or biomedical measures, no standard benchmark datasets, modeling methods, or reporting methodology exist. There is no single widely used method for time series model development or feature interpretation; many different methods have proven successful.
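
To illustrate the concern raised in the Results about reporting a single performance metric, the following minimal sketch compares accuracy with sensitivity, specificity, and balanced accuracy on an imbalanced binary problem. The labels, the always-negative classifier, and the helper names (`confusion_counts`, `summary_metrics`) are hypothetical and chosen only for illustration; they are not drawn from any of the reviewed articles.

```python
# Hypothetical, illustrative data: 95 negative and 5 positive cases.
y_true = [0] * 95 + [1] * 5
# A trivial classifier that always predicts the majority (negative) class.
y_pred = [0] * 100


def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels coded as 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def summary_metrics(y_true, y_pred):
    """Compute several complementary metrics from the confusion counts."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / len(y_true)
    # Guard against division by zero when a class is absent.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    balanced_accuracy = (sensitivity + specificity) / 2
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balanced_accuracy": balanced_accuracy,
    }


results = summary_metrics(y_true, y_pred)
for name, value in results.items():
    print(f"{name}: {value:.2f}")
```

In this contrived case the classifier reaches high accuracy while detecting none of the positive cases, which is the kind of discrepancy that a single reported metric can hide.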