Workflow for the Supervised Learning of Chemical Data: Efficient Data Reduction-Multivariate Curve Resolution (EDR-MCR)

NCJ Number
Journal: Analytical Chemistry, Volume 93, Issue 12, 2021, Pages 5020-5027
Length: 8 pages

The reported project tested how well the efficient data reduction (EDR) and supervised multivariate curve resolution (MCR) methods discriminate between the constituents of two benchmark and two high-dimensional data sets.



A new method, termed efficient data reduction-multivariate curve resolution (EDR-MCR), has been devised for the classification of high-dimensional data. The method couples EDR and MCR as a new strategy for data splitting, variable selection, and supervised classification of high-dimensional data. Before classification, it reduces data dimensionality and selects the training set using principal component analysis (PCA) and convex geometry. The reduced data are then categorized using an MCR model, in which numerical constraints are imposed to resolve the data into classes and readily interpretable pure-component signal weights. The results of the current project were compared with those obtained using other data splitting methods, including iterative random selection (IRS) and Kennard–Stone (KS), and other discrimination methods, including partial least-squares-discriminant analysis (PLS-DA) and the ensemble-learning frameworks of linear discriminant analysis (LDA), k-nearest neighbors (KNN), classification and regression trees (CART), and support vector machine (SVM). Overall, EDR gave results comparable to those of the other data splitting methods despite the small size of the training sets it created. Compared with other commonly used supervised techniques, the proposed MCR approach has the advantages of speed of implementation, fewer parameters to tune, flexibility in the analysis of data characterized by low sample numbers and class imbalances, improved accuracy from the inclusion of additional system information in the form of numerical constraints, and the ability to resolve pure-component signal weights. (publisher abstract modified)
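
The idea of selecting a training set with PCA and convex geometry can be illustrated with a minimal sketch. This is not the published EDR algorithm, whose details are not given in the abstract. It assumes only that samples are projected onto the first two principal components and that the samples lying on the convex hull of that score plot are kept as the training set. The function names (pca_scores, hull_indices, select_training_samples) and the choice of two components are hypothetical, chosen for illustration.

```python
import numpy as np


def pca_scores(X, n_components=2):
    """Project the rows of X onto the leading principal components (via SVD)."""
    Xc = X - X.mean(axis=0)  # mean-center each variable
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T  # shape: (n_samples, n_components)


def _cross(o, a, b):
    """z-component of (a - o) x (b - o); positive for a counter-clockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def hull_indices(points):
    """Indices of the 2-D convex hull vertices (Andrew's monotone chain)."""
    order = sorted(range(len(points)), key=lambda i: (points[i][0], points[i][1]))

    def half(seq):
        chain = []
        for i in seq:
            # Drop points that would make a clockwise or collinear turn.
            while len(chain) >= 2 and _cross(points[chain[-2]], points[chain[-1]], points[i]) <= 0:
                chain.pop()
            chain.append(i)
        return chain[:-1]  # last point is the first point of the other half

    return half(order) + half(order[::-1])


def select_training_samples(X):
    """Assumed selection rule: keep samples on the hull of the PC1-PC2 scores."""
    return sorted(hull_indices(pca_scores(X, 2)))


# Toy data: four corner samples surround three interior samples.
X = np.array([[0, 0, 0], [4, 0, 0], [4, 4, 0], [0, 4, 0],
              [2, 2, 0], [1, 2, 0], [3, 2, 0]], dtype=float)
print(select_training_samples(X))
```

In this toy case the hull consists of the four corner samples, so only a small subset of the data is retained for training, which mirrors the small training sets the abstract attributes to EDR. A real data set would need more than two components and a hull routine that works in higher dimensions.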


Date Published: January 1, 2021