Water Quality Data Harmonization Pipeline

ETL

water quality

environmental science

A production-grade R pipeline that standardizes multi-lab water quality data with inconsistent naming conventions, mixed units, and varying detection limit notation into clean, analysis-ready datasets.

Author

Ian Gault

Published

January 1, 2025


Environmental monitoring programs often aggregate data from multiple laboratories, each with its own submission conventions — analyte names, units, detection limit notation, and quality flags rarely align out of the box. This pipeline addresses that problem end-to-end.

What it does

The pipeline ingests raw multi-lab water quality submissions and produces a clean, standardized dataset ready for analysis or reporting. Key steps include:

Analyte name standardization — resolves lab-specific submission codes to a common naming convention
Fraction parsing — extracts dissolved/total fractionation from analyte suffixes
Classification — groups analytes into analysis categories (Nutrients, Major Ions, Metals, etc.)
Water type detection — distinguishes freshwater from seawater samples to apply appropriate thresholds
Unit conversion — converts mixed units with correct scaling of associated detection limits
Detection limit handling — flags and substitutes values below detection using consistent conventions
QAQC checks — automated screening for field blanks, field duplicates, holding time exceedances, and metals ratio verification
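
To make three of the steps above concrete (fraction parsing, unit conversion, and detection limit handling), here is a minimal base R sketch. It is not the pipeline's actual code: the column names, the "-D"/"-T" suffix convention, the "<" notation for non-detects, the function names, and the choice to substitute half the detection limit are all assumptions made for illustration.

```r
# Hypothetical raw submissions; column names and values are illustrative only.
raw <- data.frame(
  analyte = c("Copper-D", "Copper-T", "Nitrate", "Zinc-T"),
  result  = c("<0.5", "1.2", "0.04", "<0.002"),
  unit    = c("ug/L", "ug/L", "mg/L", "mg/L"),
  stringsAsFactors = FALSE
)

# Fraction parsing: assume a trailing "-D" or "-T" marks dissolved or total.
parse_fraction <- function(analyte) {
  has_suffix <- grepl("-[DT]$", analyte)
  suffix <- sub("^.*-([DT])$", "\\1", analyte)
  data.frame(
    analyte_base = sub("-[DT]$", "", analyte),
    fraction = ifelse(!has_suffix, NA_character_,
                      ifelse(suffix == "D", "Dissolved", "Total")),
    stringsAsFactors = FALSE
  )
}

# Detection limit parsing: assume non-detects are reported as "<limit".
parse_result <- function(result) {
  censored <- grepl("^\\s*<", result)
  value <- as.numeric(sub("^\\s*<\\s*", "", result))
  data.frame(
    value = value,
    censored = censored,
    detection_limit = ifelse(censored, value, NA_real_)
  )
}

# Conversion factors to a common unit (mg/L).
to_mg_per_l <- c("mg/L" = 1, "ug/L" = 1e-3, "ng/L" = 1e-6)

harmonize <- function(raw, substitute_fraction = 0.5) {
  factor <- unname(to_mg_per_l[raw$unit])
  if (anyNA(factor)) stop("Unrecognized unit in submission")

  out <- cbind(parse_fraction(raw$analyte), parse_result(raw$result))

  # Scale the detection limit by the same factor as the value, so the two
  # stay comparable after conversion.
  out$value <- out$value * factor
  out$detection_limit <- out$detection_limit * factor
  out$unit <- "mg/L"

  # Substitute censored values with a fraction of the detection limit
  # (one common convention, assumed here).
  out$value <- ifelse(out$censored,
                      out$detection_limit * substitute_fraction,
                      out$value)
  out
}

harmonized <- harmonize(raw)
print(harmonized)
```

The point to notice is that the detection limit is scaled together with the reported value. Converting only the value would leave a limit in µg/L attached to a result in mg/L, and any later comparison against that limit would be off by a factor of 1000.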

Design

The pipeline is organized into modular R functions (harmonize.R, units.R, qaqc.R, database.R, utils.R) and documented via an R Markdown vignette. Synthetic test data is included so the vignette runs standalone without proprietary lab submissions.
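
As a rough illustration of how synthetic data lets a check run without real lab submissions, the sketch below builds a tiny made-up dataset and applies a metals ratio check to it. The site identifiers, column names, function name, and the 20% tolerance are hypothetical, and reading "metals ratio verification" as a dissolved-versus-total comparison is an assumption rather than something taken from qaqc.R.

```r
set.seed(42)  # reproducible synthetic values

# Hypothetical anonymized site identifiers and a single metal, for illustration.
sites <- sprintf("SITE-%02d", 1:5)
total <- round(runif(length(sites), min = 1, max = 10), 2)  # total copper, ug/L

# Dissolved as a fraction of total; the fourth ratio is deliberately > 1
# to seed a QAQC failure.
ratio <- c(0.6, 0.8, 0.9, 1.4, 0.7)

synthetic <- data.frame(
  site = sites,
  analyte = "Copper",
  total = total,
  dissolved = round(total * ratio, 2),
  unit = "ug/L",
  stringsAsFactors = FALSE
)

# Metals ratio check: flag samples where dissolved exceeds total by more
# than a tolerance. The 20% tolerance is an assumed value.
check_metals_ratio <- function(df, tolerance = 0.2) {
  df$ratio_flag <- df$dissolved > df$total * (1 + tolerance)
  df
}

checked <- check_metals_ratio(synthetic)
print(checked[checked$ratio_flag, c("site", "total", "dissolved")])
```

Seeding a known failure into the synthetic data gives the check a predictable row to catch, which is useful both in a vignette and in unit tests.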

All program names and site identifiers in the repository have been anonymized.