<?xml version="1.0" encoding="UTF-8"?>
<rss  xmlns:atom="http://www.w3.org/2005/Atom" 
      xmlns:media="http://search.yahoo.com/mrss/" 
      xmlns:content="http://purl.org/rss/1.0/modules/content/" 
      xmlns:dc="http://purl.org/dc/elements/1.1/" 
      version="2.0">
<channel>
<title>Ian Gault</title>
<link>https://iangault.github.io/blog/</link>
<atom:link href="https://iangault.github.io/blog/index.xml" rel="self" type="application/rss+xml"/>
<description></description>
<generator>quarto-1.7.33</generator>
<lastBuildDate>Tue, 13 Jan 2026 08:00:00 GMT</lastBuildDate>
<item>
  <title>NADA2: Accurate environmental monitoring starts with handling censored data</title>
  <dc:creator>Ian Gault</dc:creator>
  <link>https://iangault.github.io/blog/posts/NADA2/</link>
  <description><![CDATA[ 




<p>With industrialization, humans have influenced the environment in which they live. We synthesize chemicals for specific uses and extract natural resources with chemical byproducts as waste. Exposure to chemical contaminants is unavoidable; even in environments that appear pristine and are free of human influence, such as regions of the Arctic, persistent synthetic chemicals can be detected by laboratory methods.<sup>1</sup></p>
<p>The fields of analytical chemistry and toxicology converge to measure concentrations of chemical contaminants in the environment, in our food, water, and consumer products, and in our bodies, and to assess their associated health effects. With the backlog of chemical contaminants we are exposed to, many of whose toxicology is unknown, and the thousands of chemicals we encounter in our lives, methods in toxicology have been changing to keep up with the public health concern.<sup>2–4</sup> However, at its root, effective chemical measurements through laboratory testing are needed to assess risk.</p>
<p>Each method for measuring chemical concentrations has a lower limit, below which the instrument is not sensitive enough to detect true differences in concentrations. This is an inherent artifact of analytical instruments, though technology is continuously improving and pushing the detection limits for trace chemicals lower and lower.</p>
<p>Environmental chemistry data typically follow a log-normal distribution,<sup>5</sup> i.e., a right-skewed distribution in which a few “hits” from hot spots with high-concentration observations sit well above the majority of the dataset. Observations below the detection limit (&lt;DL) are known as left-censored data. Replacing the &lt;DL values with a constant cuts off the lower tail of the distribution, as shown in Figure 1.</p>
<p><img src="https://iangault.github.io/blog/posts/NADA2/lognormal_left_censored.png" class="img-fluid quarto-figure quarto-figure-center" style="width:90.0%" alt="Figure 1: Left-censored data."><br>
<em>Note: Example figure was generated with ChatGPT5</em></p>
<p>Common practice in environmental monitoring is to impute observations below the detection limit (&lt;DL) to either 1 × DL or 1/2 × DL, because the data exists, but is not observable.<sup>6</sup> This has traditionally been practical and was an accepted bias in summary statistics used in chemical evaluation. However, by doing so, you lose the data’s variability, which informs inferential techniques such as confidence intervals and hypothesis testing. And, depending on the distribution of the data and the substitution made, this can either overestimate or underestimate risk, because it collapses a range of values into a single guess.</p>
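<p>To make the effect of substitution concrete, the short sketch below uses a made-up sample (five detected values and four non-detects sharing a hypothetical detection limit of 0.5; none of these numbers come from a real dataset) and compares the mean and standard deviation under three substitution choices. It is only an illustration of how strongly the summary statistics depend on an arbitrary constant.</p>
<div class="cell">
<div class="sourceCode cell-code" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"># Hypothetical sample: five detected values and four non-detects below DL = 0.5
detection_limit &lt;- 0.5
detected &lt;- c(0.6, 0.9, 1.4, 2.2, 5.1)
n_censored &lt;- 4

# Candidate constants used to replace every non-detect
substitutions &lt;- c(zero = 0, half_dl = detection_limit / 2, full_dl = detection_limit)

# Mean and standard deviation of the sample under each substitution
summary_by_choice &lt;- sapply(substitutions, function(value) {
  x &lt;- c(detected, rep(value, n_censored))
  c(mean = mean(x), sd = sd(x))
})
round(summary_by_choice, 3)</code></pre></div>
</div>
<p>In this toy sample the mean rises and the standard deviation shrinks as the substituted constant moves from 0 toward the DL, even though the underlying measurements have not changed.</p>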
<p>Another option, arguably even worse, would be to remove the censored data completely. This discards valuable information about when concentrations are low and artificially inflates the summary of detected concentrations. Contaminated environments that are of concern to surrounding life are usually compared using summary statistics, such as upper bounds on chemical concentrations (or lower bounds for toxic effects), so either practice can bias the statistics used in chemical risk assessment.</p>
<p>Overall, neither of these two approaches uses modern, probability-based methods for handling censored data. Multiple options now exist that can be incorporated into your workflow, contributing to more accurate, measured downstream analyses.</p>
<p>Three families of methods exist: parametric methods, which assume a particular distribution for the data; non-parametric methods, which make no distributional assumptions and instead rank observations; and semi-parametric methods, which combine both, assuming a distribution while also ranking observations.<sup>7</sup></p>
<p>NADA2 is an R package that provides tools to handle censored data.<sup>8</sup> It can incorporate the distribution of the data and provide a probabilistic representation of concentrations that are &lt;DL, thereby avoiding the fixed, direct substitution methods. It is aware of the censoring of the data, allowing for more explicit incorporation of uncertainty and unbiased summary statistics.</p>
<p>Most commonly, a semi-parametric approach called Regression on Order Statistics (ROS) is used.<sup>8</sup> Given that environmental chemistry data is often log-normal, ROS applies a log transformation and ranks observations with their expected normal quantiles. A regression is then used to estimate where censored observations would plausibly fall in the lower tail of the distribution. Here, the censored data is not replaced; instead, values are inferred from the distribution to avoid bias in how the data is used. In other words, the probability distribution is conditional on whether the data is censored or not, as described below:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0AX_i%20=%20%5Cbegin%7Bcases%7D%20x_i,%20&amp;%20%5Ctext%7Bif%20detected%7D%20%5C%5C%20%3C%20DL_i,%20&amp;%20%5Ctext%7Bif%20censored%7D%20%5Cend%7Bcases%7D%0A"> For censored observations: <img src="https://latex.codecogs.com/png.latex?%0AP(X%20%5Cle%20x%20%5Cmid%20X%20%3C%20DL)%0A"> For detected observations: <img src="https://latex.codecogs.com/png.latex?%0AP(X%20%5Cle%20x%20%5Cmid%20X%20%5Cge%20DL_%7B%5Cmin%7D)%0A"> <em>Note: Equations generated by ChatGPT5</em></p>
<p>This is what’s going on under the hood for the following example code:</p>
<div class="cell">
<div class="sourceCode cell-code" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb1-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Toy dataframe</span></span>
<span id="cb1-2">df <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">data.frame</span>(</span>
<span id="cb1-3">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">concentration =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.2</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.5</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.0</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.4</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">2.0</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.3</span>),</span>
<span id="cb1-4">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">censored =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">TRUE</span>, <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">TRUE</span>, <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">FALSE</span>, <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">TRUE</span>, <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">FALSE</span>, <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">TRUE</span>)</span>
<span id="cb1-5">)</span>
<span id="cb1-6"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Create the log-normal estimated distribution</span></span>
<span id="cb1-7">estimated_distribution <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> NADA2<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ros</span>(df<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>concentration, df<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>censored)</span>
<span id="cb1-8"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Calculate unbiased summary statistics</span></span>
<span id="cb1-9"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mean</span>(estimated_distribution)</span>
<span id="cb1-10"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">median</span>(estimated_distribution)</span>
<span id="cb1-11"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Test to evaluate guideline exceedances for an example contaminated site based on a censor-aware concentration point estimate</span></span>
<span id="cb1-12">point_estimate <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">quantile</span>(estimated_distribution, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.95</span>)</span>
<span id="cb1-13">guideline <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.35</span></span>
<span id="cb1-14">exceedance <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> point_estimate <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span> guideline</span></code></pre></div>
</div>
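<p>For intuition, the ROS steps can be sketched by hand in base R for the simplest case of a single detection limit. This is a simplified illustration under stated assumptions, not the package's implementation: the sample values and the detection limit of 0.5 are made up, and real datasets with several detection limits need a more careful calculation of plotting positions.</p>
<div class="cell">
<div class="sourceCode cell-code" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"># Hypothetical sample with one detection limit (DL = 0.5)
detection_limit &lt;- 0.5
detected &lt;- sort(c(0.6, 0.9, 1.4, 2.2, 5.1))
n_detected &lt;- length(detected)
n_censored &lt;- 4
n_total &lt;- n_detected + n_censored

# Plotting positions: non-detects share the probability mass below the DL,
# detected values share the mass above it
p_below &lt;- n_censored / n_total
pp_censored &lt;- p_below * seq_len(n_censored) / (n_censored + 1)
pp_detected &lt;- p_below + (1 - p_below) * seq_len(n_detected) / (n_detected + 1)

# Regress log concentration on normal quantiles, using detected values only
fit &lt;- lm(log(detected) ~ qnorm(pp_detected))
b &lt;- unname(coef(fit))

# Place the non-detects along the fitted line in the lower tail
modelled_censored &lt;- exp(b[1] + b[2] * qnorm(pp_censored))

# Summary statistics use modelled non-detects plus the observed detections
ros_sample &lt;- c(modelled_censored, detected)
mean(ros_sample)
median(ros_sample)</code></pre></div>
</div>
<p>The modelled values are used only to compute summary statistics for the whole sample; they are not estimates of what any individual non-detect actually was.</p>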
<p>The semi-parametric ROS approach is best used when the distribution of data may not fully fit a log-normal distribution, which is often the case in real-world scenarios. It is also reasonably defensible for a high degree of censored data, with a frequency of 30%-50%.<sup>7</sup> It is effective for summary statistics, trends, and guideline comparisons; however, it is not a predictive model.</p>
<p>The choice of approach depends on the assessment’s goal. If a predictive model is needed, such as estimating concentrations of chemicals in a lake based on the current outflow and concentrations, and a distribution assumption is defensible, then a parametric method would be needed. In this case, NADA provides parametric methods that assign a distribution and use Maximum Likelihood Estimation (MLE) to predict concentrations.<sup>7,8</sup> An example of the code is below:</p>
<div class="cell">
<div class="sourceCode cell-code" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb2-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># </span><span class="al" style="color: #AD0000;
background-color: null;
font-style: inherit;">NOTE</span><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">: this function is from the older package NADA</span></span>
<span id="cb2-2">parametric_distribution <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> NADA<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">cenmle</span>(df<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>concentration, df<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>censored, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">dist =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"lognormal"</span>)</span></code></pre></div>
</div>
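<p>The idea behind a censored MLE can be sketched in base R. In this hedged illustration (made-up data, a single hypothetical detection limit of 0.5, and not the code used inside NADA), each detected value contributes its log-normal density to the likelihood, while each non-detect contributes the probability of falling below the detection limit.</p>
<div class="cell">
<div class="sourceCode cell-code" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"># Hypothetical sample: five detected values and four non-detects below DL = 0.5
detection_limit &lt;- 0.5
detected &lt;- c(0.6, 0.9, 1.4, 2.2, 5.1)
n_censored &lt;- 4

# Negative log-likelihood; sigma is optimised on the log scale to keep it positive
neg_log_lik &lt;- function(par) {
  mu &lt;- par[1]
  sigma &lt;- exp(par[2])
  # Detected values: log-normal density at the observed concentration
  ll_detected &lt;- sum(dlnorm(detected, meanlog = mu, sdlog = sigma, log = TRUE))
  # Non-detects: probability that a value falls below the detection limit
  ll_censored &lt;- n_censored *
    plnorm(detection_limit, meanlog = mu, sdlog = sigma, log.p = TRUE)
  -(ll_detected + ll_censored)
}

# Start from the detected-only estimates and minimise with Nelder-Mead
start &lt;- c(mean(log(detected)), log(sd(log(detected))))
mle &lt;- optim(start, neg_log_lik)
mu_hat &lt;- mle$par[1]
sigma_hat &lt;- exp(mle$par[2])

# Mean of the fitted log-normal distribution
exp(mu_hat + sigma_hat^2 / 2)</code></pre></div>
</div>
<p>Because the non-detects pull probability mass toward the lower tail, the fitted log-scale mean ends up below the value you would get from the detected observations alone.</p>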
<p>Non-parametric methods, such as the Kaplan-Meier estimator, make minimal distributional assumptions; this can be suitable depending on how your data looks, allowing you to still extract valuable information. However, their representations of left-censored data are poorer compared to semi-parametric and parametric methods, with limited extrapolation from observed values and no smoothing.<sup>9</sup> Therefore, they are less sensitive.</p>
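<p>As a rough sketch of the non-parametric route, the Kaplan-Meier estimator from the survival package can be applied to left-censored concentrations by first "flipping" them into right-censored values. The data, the flipping constant, and the variable names below are illustrative assumptions, and the survival package is assumed to be installed.</p>
<div class="cell">
<div class="sourceCode cell-code" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r">library(survival)

# Hypothetical sample: non-detects are recorded at their detection limit (0.5)
concentration &lt;- c(0.5, 0.5, 0.5, 0.5, 0.6, 0.9, 1.4, 2.2, 5.1)
censored &lt;- c(TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE)

# Flip the scale so that left-censored values become right-censored
flip_constant &lt;- max(concentration) + 1
flipped &lt;- flip_constant - concentration

# Kaplan-Meier fit; the event indicator is TRUE for detected values
km_fit &lt;- survfit(Surv(flipped, !censored) ~ 1)

# Flip back: survival on the flipped scale estimates P(X &lt; x) on the original scale
km_table &lt;- data.frame(
  concentration = flip_constant - km_fit$time,
  prob_below = km_fit$surv
)
km_table[order(km_table$concentration), ]</code></pre></div>
</div>
<p>The estimate is a step function that stops at the lowest detected value, which is the limited extrapolation described above: it says what fraction of the sample lies below that value, but nothing about how those concentrations are spread out.</p>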
<p>The incorporation of NADA2 or similar packages into an environmental data science workflow is not too much additional work, but does require a shift in perspective. There may be an initial learning curve to understand what’s happening, but the coding is minimal and defensible. Environmental scientists may be resistant to turning to a coding-based workflow and to changing their approach based on what has always been done or what has previously been accepted, but the benefits outweigh the initial discomfort.</p>
<p>This is needed because some chemical contaminants have very low recommended safe doses that are approaching the detection limit of our best laboratory methods. As a result, small biases in our imputed values, which may be minor when chemical concentrations are well above the detection limit, can have a much greater impact as measured concentrations approach the detection limit. As environmental scientists, we now have the tools to do better: to accurately represent the data in our studies, make inferences, and have measured remediation strategies and protections for the environment and the life within it.</p>
<section id="references" class="level3">
<h3 class="anchored" data-anchor-id="references">References</h3>
<p><em>Image taken from: A global agreement to tame chemical pollution | United Nations. https://www.un.org/en/climatechange/global-agreement-tame-chemical-pollution. Accessed January 17, 2026.</em></p>
<ol type="1">
<li>Persistent Organic Pollutants: A Global Issue, A Global Response | US EPA. https://www.epa.gov/international-cooperation/persistent-organic-pollutants-global-issue-global-response. Accessed January 13, 2026.</li>
<li>Krewski D, Westphal M, Andersen ME, et al.&nbsp;A framework for the next generation of risk science. Environ Health Perspect. 2014;122(8):796-805. doi:10.1289/ehp.1307260</li>
<li>Bell SM, Chang X, Wambaugh JF, et al.&nbsp;In vitro to in vivo extrapolation for high throughput prioritization and decision making. Toxicol In Vitro. 2018;47:213-227. doi:10.1016/j.tiv.2017.11.016</li>
<li>Tox21. Toxicology in the 21st Century. https://tox21.gov/overview/about-tox21/. Accessed January 8, 2025.</li>
<li>Andersson A. Mechanisms for log normal concentration distributions in the environment. Sci Rep. 2021;11(1):16418. doi:10.1038/s41598-021-96010-6</li>
<li>Mihalache OA, Dall’Asta C. Left-censored data and where to find them: Current implications in mycotoxin-related risk assessment, legislative and economic impacts. Trends Food Sci Technol. 2023;136:112-119. doi:10.1016/J.TIFS.2023.04.011</li>
<li>Holbert C. How to Calculate Summary Statistics for Left-Censored Data. https://www.cfholbert.com/blog/summary-statistics-censored-data/. Published 2022. Accessed January 17, 2026.</li>
<li>NADA2 package - RDocumentation. https://www.rdocumentation.org/packages/NADA2/versions/2.0.1. Accessed January 8, 2026.</li>
<li>Wey A, Connett J, Rudser K. Combining parametric, semi-parametric, and non-parametric survival models with stacked survival models. Biostatistics. 2015;16(3):537. doi:10.1093/BIOSTATISTICS/KXV001</li>
</ol>


</section>

 ]]></description>
  <category>code</category>
  <category>analysis</category>
  <guid>https://iangault.github.io/blog/posts/NADA2/</guid>
  <pubDate>Tue, 13 Jan 2026 08:00:00 GMT</pubDate>
  <media:content url="https://iangault.github.io/blog/posts/NADA2/polluted_water.jpg" medium="image" type="image/jpeg"/>
</item>
</channel>
</rss>
