Water Quality Data Harmonization Pipeline

R
ETL
water quality
environmental science
A production-grade R pipeline for standardizing multi-lab water quality data with inconsistent naming conventions, mixed units, and detection limit notations into clean, analysis-ready datasets.
Author: Ian Gault

Published: January 1, 2025


Environmental monitoring programs often aggregate data from multiple laboratories, each with its own submission conventions. Analyte names, units, detection limit notation, and quality flags rarely align out of the box. This pipeline addresses that problem from raw submission through to a QA/QC-screened dataset.

What it does

The pipeline ingests raw multi-lab water quality submissions and produces a clean, standardized dataset ready for analysis or reporting. Key steps include:

  • Analyte name standardization — resolves lab-specific submission codes to a common naming convention
  • Fraction parsing — extracts dissolved/total fractionation from analyte suffixes
  • Classification — groups analytes into analysis categories (Nutrients, Major Ions, Metals, etc.)
  • Water type detection — distinguishes freshwater from seawater samples to apply appropriate thresholds
  • Unit conversion — converts mixed units with correct scaling of associated detection limits
  • Detection limit handling — flags and substitutes values below detection using consistent conventions
  • QAQC checks — automated screening for field blanks, field duplicates, holding time exceedances, and metals ratio verification
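
The harmonization steps above can be illustrated with a minimal base R sketch. It is not the repository's code: the column names, the lab codes, the lookup table, the mg/L target unit, and the half-detection-limit substitution rule are all assumptions chosen for the example.

```r
# Hypothetical raw submissions; codes, columns, and values are made up.
raw <- data.frame(
  lab_code = c("Cu-T", "cu_d", "NO3_D", "CU_T"),
  result   = c("1.2", "<0.5", "250", "<0.001"),
  unit     = c("ug/L", "ug/L", "ug/L", "mg/L"),
  stringsAsFactors = FALSE
)

# Fraction parsing: a trailing -D/_D or -T/_T suffix marks the fraction.
has_suffix <- grepl("[-_][DdTt]$", raw$lab_code)
suffix <- ifelse(has_suffix,
                 toupper(substring(raw$lab_code, nchar(raw$lab_code))),
                 NA_character_)
fraction <- unname(c(D = "Dissolved", T = "Total")[suffix])

# Name standardization: strip the suffix, then map to a common name
# through an (assumed) lookup table.
base_code <- toupper(sub("[-_][DdTt]$", "", raw$lab_code))
name_map  <- c(CU = "Copper", NO3 = "Nitrate")
analyte   <- unname(name_map[base_code])

# Detection limit notation: a leading "<" marks a censored result whose
# number is the detection limit itself.
below_dl <- grepl("^<", raw$result)
value    <- as.numeric(sub("^<", "", raw$result))

# Unit conversion to mg/L. The detection limit is scaled by the same
# factor as the result so the two stay comparable.
to_mgL <- unname(c("mg/L" = 1, "ug/L" = 0.001)[raw$unit])
stopifnot(!anyNA(to_mgL))  # fail loudly on an unmapped unit
value_mgL <- value * to_mgL

clean <- data.frame(
  analyte             = analyte,
  fraction            = fraction,
  below_dl            = below_dl,
  detection_limit_mgL = ifelse(below_dl, value_mgL, NA_real_),
  # Assumed convention: substitute half the detection limit when censored.
  result_mgL          = ifelse(below_dl, value_mgL / 2, value_mgL),
  stringsAsFactors    = FALSE
)
print(clean)
```

Keeping the `below_dl` flag and the scaled detection limit alongside the substituted value means a different substitution rule can be applied later without returning to the raw submissions.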

Design

The pipeline is organized into modular R scripts (harmonize.R, units.R, qaqc.R, database.R, utils.R) and documented in an R Markdown vignette. Synthetic test data is included so the vignette runs standalone, without proprietary lab submissions.
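
As a rough illustration of how synthetic data can exercise QA/QC checks without real lab submissions, the sketch below builds a tiny made-up dataset and runs two screens in base R. The data, the column names, the `rpd()` helper, and both thresholds are hypothetical. The sketch also assumes that the metals ratio check compares dissolved against total concentrations, which the description does not spell out.

```r
# Synthetic, made-up copper results in mg/L (not the repository's test data).
samples <- data.frame(
  site     = c("S1", "S1", "S2", "S2"),
  analyte  = "Copper",
  fraction = c("Total", "Dissolved", "Total", "Dissolved"),
  result   = c(0.010, 0.008, 0.004, 0.006),
  stringsAsFactors = FALSE
)

# Metals ratio screen: pair dissolved and total results for each
# site/analyte and flag pairs where dissolved clearly exceeds total.
keep   <- c("site", "analyte", "result")
total  <- samples[samples$fraction == "Total", keep]
diss   <- samples[samples$fraction == "Dissolved", keep]
paired <- merge(total, diss, by = c("site", "analyte"),
                suffixes = c("_total", "_dissolved"))
paired$ratio <- paired$result_dissolved / paired$result_total
ratio_tolerance   <- 1.2  # assumed tolerance for analytical noise
paired$flag_ratio <- paired$ratio > ratio_tolerance

# Field duplicate screen: relative percent difference (RPD) between a
# sample and its duplicate, expressed in percent.
rpd <- function(a, b) abs(a - b) / ((a + b) / 2) * 100
rpd_limit     <- 20  # assumed acceptance limit, in percent
duplicate_rpd <- rpd(0.010, 0.013)
flag_duplicate <- duplicate_rpd > rpd_limit

print(paired)
print(flag_duplicate)
```

In this synthetic set, site S2 is the one built to trip the ratio flag, since its dissolved result is higher than its total result.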

All program names and site identifiers in the repository have been anonymized.