NCJRS Virtual Library

The Virtual Library houses over 235,000 criminal justice resources, including all known OJP works.

In Silico Created Fire Debris Data for Machine Learning

NCJ Number: 310024
Journal: Forensic Chemistry, Volume: 42, Dated: March 2025, Pages: 100633
Author(s): Michael E. Sigman; Mary R. Williams; Larry Tang; Slun Booppasiri; Nikhil Prakash
Date Published: March 2025
Length: 9 pages
Annotation

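In this article, the authors investigate the use of computationally generated (in silico) fire debris data to train a machine learning method that classifies GC–MS data as positive or negative for ignitable liquid residue.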

Abstract

This work examines the in silico preparation of computed fire debris data for training a machine learning method to classify gas chromatography–mass spectrometry (GC–MS) data as positive or negative for ignitable liquid residue (ILR). The authors report the outcome of validation tests on a set of laboratory-generated fire debris samples with known ground truth. A set of 240,000 total ion chromatograms (TIC) and total ion spectra (TIS) for fire debris (FD) samples was calculated in silico (IS). The IS FD sample set was balanced: 50% of the samples contained both ILR and substrate pyrolysis (SUB) contributions, and the remaining 50% contained only SUB components. The ignitable liquids incorporated into the samples containing ILR were digitally evaporated to simulate the weathering observed in experimental fire debris. The IS FD sample TIS were analyzed by principal component analysis (PCA) with centering and variance scaling, retaining 90% of the variance. A set of 1,117 experimental FD samples was projected into the IS FD PCA model. The recovered experimental FD TIS were compared to the TIS before projection by calculating the residual mean squared error (RMSE) for each sample, as a test of how well the IS FD samples represent experimental samples. The range of the RMSE was [0.012, 0.127] and the median RMSE was 0.029. Experimental FD samples whose recovered TIS had the larger RMSE values were not well represented by the IS FD samples. The IS FD samples were randomly split into balanced sets for machine learning (ML) training (90%) and validation (10%). An XGBoost ML method, trained on the IS FD training data, was validated on the IS FD validation data, giving a receiver operating characteristic (ROC) curve with an area under the curve (AUC) of 0.978. Validation of the model against the experimental FD data gave a lower ROC AUC of 0.845. Limiting the experimental data to samples in the lowest quartile of RMSE values increased the ROC AUC to 0.90. (Published Abstract Provided)
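
The PCA representation test described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' code. It assumes a NumPy-only PCA computed by singular value decomposition, small synthetic stand-in spectra, and an error measured in the original intensity units after undoing the scaling. The function names fit_pca and reconstruction_rmse, the array sizes, and the random data are all hypothetical.

```python
import numpy as np


def fit_pca(train, var_kept=0.90):
    """Fit PCA with centering and variance scaling on `train` (rows = samples).

    Keeps the smallest number of components whose cumulative explained
    variance reaches `var_kept`.
    """
    mean = train.mean(axis=0)
    std = train.std(axis=0, ddof=1)
    std[std == 0] = 1.0  # guard against constant channels
    z = (train - mean) / std
    # Squared singular values are proportional to the variance per component.
    _, s, vt = np.linalg.svd(z, full_matrices=False)
    explained = np.cumsum(s**2) / np.sum(s**2)
    k = min(int(np.searchsorted(explained, var_kept)) + 1, len(s))
    return mean, std, vt[:k]


def reconstruction_rmse(samples, mean, std, components):
    """Project samples into the PCA model, recover them, and return one
    error value per sample (assumption: error in original units)."""
    z = (samples - mean) / std
    scores = z @ components.T                      # projection
    recovered = scores @ components * std + mean   # back-transformation
    return np.sqrt(np.mean((samples - recovered) ** 2, axis=1))


# Synthetic stand-ins only; these are not fire debris spectra.
rng = np.random.default_rng(0)
basis = rng.random((5, 40))                 # 5 hypothetical source spectra
in_silico = rng.random((500, 5)) @ basis    # stand-in for the IS FD TIS
mean, std, comps = fit_pca(in_silico, var_kept=0.90)

# "Experimental" stand-ins: one group built from the same sources,
# one group that the in silico set does not span.
well_represented = rng.random((10, 5)) @ basis
poorly_represented = rng.random((10, 40)) * 3.0

rmse_good = reconstruction_rmse(well_represented, mean, std, comps)
rmse_bad = reconstruction_rmse(poorly_represented, mean, std, comps)
print("components kept:", comps.shape[0])
print("median error, well represented:", np.median(rmse_good))
print("median error, poorly represented:", np.median(rmse_bad))
```

In this sketch, samples drawn from the same sources as the training set are recovered with a small error, while samples outside that space are recovered poorly. This mirrors the abstract's use of the per-sample error to flag experimental samples that the in silico set does not represent well.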