Data science and machine learning algorithms are sometimes viewed as the only tools needed to analyze large datasets. Yet concepts from classical statistics remain critical in such settings. Massive data are rarely independent, outlier-free, or homogeneous: clusters, subdomains of observations, multiplicity of tests, and hidden trends are common and require statistical thinking, robust methods, and insightful displays. Sampling methodology, along with survey design and analysis, is essential in our current statistical framework for ensuring valid inferences with quantifiable uncertainties. This paper discusses several datasets in which statistical analysis uncovered subtle biases and discrepancies that would otherwise have remained hidden in these seemingly trustworthy, data-rich sources. Until a new statistical framework is developed to generate valid inferences from non-randomized, highly dependent clustered data, these examples demonstrate that statistical thinking, statistical methods, and informative displays remain critical for ensuring valid analyses and the communication of justified conclusions from “Big Data.”