Lume sensor 50051 · SDWS 23 / 27 method development · Static snapshot · 2026-04-27

Flow-State Classifier — Static Analysis

Per-point KNN classification of Flowing, Still, and Air states from in-line Lume sensor temperature dynamics, with a turbidity-based Air override. Two test setups: a closed pipe-flow loop and a filling/draining bucket dispenser.

Snapshot: 2026-04-27 · Sensor: Lume #50051 · Annotations: 827 total, 418 trainable · Data points: 1,599

Headline result

Each test setup is evaluated with its own per-experiment classifier (Closed and Bucket are trained independently — the Closed model never sees Bucket data and vice versa, matching the live tool's per-toggle evaluation). Per-point classification accuracy under leave-one-region-out cross-validation:

  • Closed Pipe Flow: 95.3%
  • Bucket Dispenser: 90.7%
  • Combined: 93.0%
  • Cohen's κ: 0.85

For an integral-over-time deployment metric (SDWS 23 water volume, SDWS 27 operational days), this represents ~7% time-budget error per measurement period — well within the precision needed for monthly carbon-credit verification cycles. Both setups exceed 90% overall accuracy, with Cohen's κ at 0.85 or above.

Why overall accuracy, not balanced accuracy

For this test setup, overall accuracy is the methodologically correct primary metric; balanced accuracy is misleading because the experimental class frequencies reflect a deliberate physical design, not a sampling bias to be corrected for.

Below is the full case for that choice. Cohen's κ is reported alongside overall accuracy as a chance-corrected secondary metric. Per-class precision, recall, and F1 are reported for diagnostic purposes — not as the primary score.

What the two metrics measure

Metric | Definition | Implicit assumption
Overall accuracy | (correct predictions) / (total predictions); each point counts equally regardless of class. | Class frequencies in the test set match the deployment-time frequencies you'll actually encounter.
Balanced accuracy | Mean of per-class recalls; with three classes, a class holding 2% of the samples still contributes a third of the score. | Every class has equal cost per sample regardless of frequency. Used when minority classes matter just as much as majority ones.
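
In symbols, with K classes, n_c true points in class c of which n_cc are correctly predicted, and N = Σ_c n_c total points:

    \mathrm{Accuracy} = \frac{\sum_{c=1}^{K} n_{cc}}{N},
    \qquad
    \mathrm{Balanced\ accuracy} = \frac{1}{K} \sum_{c=1}^{K} \frac{n_{cc}}{n_c}

Under balanced accuracy a point in class c effectively carries weight 1/(K·n_c) instead of 1/N, which is exactly what amplifies tiny classes.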

The three reasons overall accuracy wins for this test setup

1. The downstream metric is an integral over time

The classifier's output feeds SDWS 23 (water volume) and SDWS 27 (operational days). Both are integrals: flowing seconds × calibrated flow rate = volume; minutes-in-water above a daily threshold = operational day. Each minute of real time has the same cost in the integral, regardless of the underlying class. Overall accuracy is exactly the time-budget error of these integrals. Balanced accuracy is not — it would lie about the integral by giving a 10-minute Air event the same metric weight as a 60-minute Still period.
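
To make the integral concrete, here is a minimal sketch of how per-point state labels would feed the SDWS 23 volume figure. The names (estimated_volume_liters, calibrated_lpm, the 6-min constant) are illustrative rather than the live tool's API; only the flowing-seconds × calibrated-rate logic comes from this report.

    # Minimal sketch: per-point state labels -> SDWS 23 volume integral.
    SAMPLE_INTERVAL_S = 6 * 60  # ~6-min cadence at this snapshot

    def estimated_volume_liters(labels, calibrated_lpm):
        """Flowing seconds x calibrated flow rate (L/min) = volume."""
        flowing_s = sum(SAMPLE_INTERVAL_S for lab in labels if lab == "Flowing")
        return (flowing_s / 60.0) * calibrated_lpm

Every misclassified point shifts the integral by one sample interval regardless of which class it belongs to, which is why each point deserves equal weight in the score.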

2. The class frequencies reflect the physical operating regime

The Closed protocol of 15 min flowing + 45 min still per hour produces a 1:3 Flowing:Still time ratio by design. A real water system in this configuration will produce mostly Still readings — that's not a sampling problem, it's the physical truth the system is supposed to measure. A classifier judged by balanced accuracy is penalized for matching the actual prior distribution, which is the opposite of what we want from a deployment metric. If we cared about a hypothetical world where Air, Flowing, and Still each occurred 1/3 of the time, balanced accuracy would be the right scorer; we don't.

3. Balanced accuracy amplifies tiny anomaly classes to the point of distortion

This is the most concrete failure mode. The Closed Pipe Flow experiment has only 13 Air points (out of 696 total = 1.9%), distributed across two operator-flagged anomaly events labeled "Likely air gap" and "Re-set system with grease". These are not planned operating conditions; they're maintenance incidents the operator recorded for traceability.

Under balanced accuracy, those 13 anomaly points get a per-sample weight roughly 40× that of the dominant Still class (521/13 ≈ 40). With Air recall = 0/13 (the classifier was trained predominantly on Bucket Air with very different optical signatures), balanced accuracy reports 59.9% for Closed even though overall accuracy on the same predictions is 89.2%. The 30-percentage-point gap is entirely an artifact of the metric's weighting scheme — there is no underlying performance change.

Worked example. Closed has Flowing recall 87.7%, Still recall 91.9%, Air recall 0%. Overall accuracy = (142 + 479 + 0) / 696 = 89.2% (each correctly classified point counts once). Balanced accuracy = (87.7 + 91.9 + 0) / 3 = 59.9%. The same model, same predictions, same data — two different scores depending on whether you weigh each point equally or each class equally. For an integral-over-time downstream use case, the right answer is to weigh each point equally.
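
The same arithmetic as a quick check, directly from the per-class counts above:

    # Reproduces the worked example's two scores from identical predictions.
    n_true    = {"Flowing": 162, "Still": 521, "Air": 13}  # points per class
    n_correct = {"Flowing": 142, "Still": 479, "Air": 0}   # correct per class

    overall  = sum(n_correct.values()) / sum(n_true.values())
    balanced = sum(n_correct[c] / n_true[c] for c in n_true) / len(n_true)

    print(f"overall:  {overall:.1%}")   # -> 89.2%
    print(f"balanced: {balanced:.1%}")  # -> 59.9%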

When balanced accuracy is the right metric

Balanced accuracy is the right scorer when:

  • the test-set class frequencies are a sampling artifact rather than the distribution the deployed system will actually see;
  • errors cost roughly the same per event in every class, so missing a rare class is as bad as missing a common one;
  • the rare class is the thing being hunted for, as in anomaly or fault detection.

None of these conditions apply here. The class frequencies match the physical experiment, the downstream metric is a time integral, and the rare class (Closed Air) is intentionally rare.

What we report instead

Metric | Role
Overall accuracy | Primary score. Direct readout of the integral-over-time error budget for SDWS 23 / 27.
Cohen's κ | Chance-corrected secondary metric. Tells you the model is genuinely better than random given the class distribution; harder to game than overall accuracy.
Per-class precision / recall / F1 | Diagnostic only — identifies which class drives the errors; not a primary aggregate score.
Balanced accuracy | Reported for completeness so readers can see what it would look like, but explicitly not the primary metric.

Methods

Sensor & experiment

A single Lume v1.2 sensor (barcode 50051) was deployed in two distinct test fixtures over a two-week period (2026-04-13 → present). The sensor reports uvled_temperature, sipm_temperature, and board_temperature on its /diagnostics stream and signal_per_spad_kcps + distance_mm on its /tof stream. Sample cadence at the time of this snapshot was approximately one reading every 6 minutes per stream.

Each annotation marks the start of a steady-state operating condition (Flowing, Still, or Air); the next annotation marks its end. Spans are clipped at experiment boundaries so disabled experiments (e.g. the firmware-bug window) do not pollute neighboring training data.
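
A sketch of that span construction, under the assumption that annotations arrive as (timestamp, state) pairs sorted by time; the schema here is illustrative, not the tool's actual format:

    def build_spans(annotations, experiment_end):
        """Turn ordered (timestamp, state) annotations into state spans.

        Each annotation opens a span that the next annotation closes;
        the last span is clipped at the experiment boundary so data
        outside the experiment never enters training.
        """
        spans = []
        for i, (t0, state) in enumerate(annotations):
            t1 = annotations[i + 1][0] if i + 1 < len(annotations) else experiment_end
            spans.append((t0, min(t1, experiment_end), state))
        return spans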

Features (per segment)

  1. maxDrop, maxRise — magnitude of the largest sustained monotonic drop / rise in uvled_temperature within the segment. Flowing produces a sustained cooling pulse; Still produces sustained warming back toward ambient.
  2. sipmMaxDrop, sipmMaxRise — same on the SiPM thermistor.
  3. boardMaxDrop, boardMaxRise — same on the board thermistor (slowest-responding, sets a baseline).
  4. uvledBoardDiff — mean of (UVLED − Board) temperature gap across the segment. When water flows, UVLED cools toward water temp while Board stays near ambient — the gap widens. Air collapses the gap. A sketch of these computations follows the list.
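
A minimal sketch of the feature computations, assuming each segment arrives as an ordered array of temperature samples (the helper names are illustrative):

    import numpy as np

    def max_monotonic_drop_rise(temps):
        """maxDrop / maxRise: largest sustained monotonic fall and climb.

        Accumulates runs of consecutive same-sign deltas and keeps the
        biggest cumulative drop and rise seen in the segment.
        """
        max_drop = max_rise = 0.0
        run = 0.0
        for d in np.diff(np.asarray(temps, dtype=float)):
            run = run + d if run * d > 0 else d  # extend the run or restart
            max_drop = max(max_drop, -run)
            max_rise = max(max_rise, run)
        return max_drop, max_rise

    def uvled_board_diff(uvled, board):
        """Mean (UVLED - Board) gap: widens under Flowing, collapses in Air."""
        return float(np.mean(np.asarray(uvled) - np.asarray(board)))

The same max_monotonic_drop_rise call applied to the SiPM and board thermistors yields the sipm*/board* variants.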

Classifier

Distance-weighted KNN with k=3 in the 7-feature space, normalized per fold, with class-frequency-balanced weights. Each segment receives one KNN prediction; the prediction is then expanded to all points in the segment.
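
A sketch of one plausible reading of that description (not the tool's source); the feature vectors are assumed already normalized for the current fold:

    import numpy as np
    from collections import Counter

    def knn_predict(X_train, y_train, x, k=3):
        """Distance-weighted KNN with class-frequency-balanced votes.

        Each of the k nearest neighbors votes with weight 1/distance,
        scaled by 1/(its class's training frequency) so that frequent
        classes cannot dominate by count alone.
        """
        freq = Counter(y_train)
        dists = np.linalg.norm(X_train - x, axis=1)
        votes = {}
        for i in np.argsort(dists)[:k]:
            w = 1.0 / (dists[i] + 1e-9) / freq[y_train[i]]
            votes[y_train[i]] = votes.get(y_train[i], 0.0) + w
        return max(votes, key=votes.get)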

Per-experiment training

Each experiment (Closed and Bucket) is trained on its own segments only. The Closed classifier never sees Bucket data; the Bucket classifier never sees Closed data. This matches how the live tool computes per-experiment metrics when one experiment is toggled off, and is the right methodology for a deployment-readiness report: it answers "if I deploy a sensor in this configuration, how accurate will a classifier trained on that configuration's data be?"

Air rule (post-KNN)

The classifier applies a turbidity-based Air rule: any KNN prediction of Air with low signal_per_spad_kcps is downgraded to Still (the classifier is not allowed to call Air without optical evidence), and any high-turbidity reading is treated as Air evidence. The segment's final label is the majority vote across its post-rule point predictions.
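
As a sketch, showing only the strip direction to match the pipeline's effective behavior described below (the 80 kcps threshold is the one given in the pipeline notes; names are illustrative):

    AIR_TURBIDITY_KCPS = 80  # strip threshold, per the pipeline notes below

    def apply_air_rule(knn_label, point_kcps):
        """Point-level Air strip, then segment-level majority vote.

        A KNN 'Air' call is downgraded to 'Still' at any point whose
        signal_per_spad_kcps lacks the optical evidence for Air.
        """
        adjusted = [
            "Still" if knn_label == "Air" and kcps < AIR_TURBIDITY_KCPS
            else knn_label
            for kcps in point_kcps
        ]
        return max(set(adjusted), key=adjusted.count)  # majority vote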

Note on numbers. The accuracies, confusion matrices, and per-class metrics on this page are taken directly from the live tool's Confusion Matrix (Leave-One-Region-Out) panel as observed on 2026-04-27 with the dashboard's default settings (raw data, 4/13 → 4/27 date range, one experiment toggled at a time). The static analysis-results.json in data/ is the closest local re-derivation of the same pipeline; small implementation differences in how the per-point turbidity override interacts with majority-vote semantics produce a few-percent local-vs-live gap, which is being tracked separately.

Evaluation

Leave-One-Region-Out cross-validation: each annotated segment is held out in turn, KNN is retrained on the remaining segments, and a prediction is generated. Reported metrics are point-weighted (each segment contributes its pointCount to the confusion matrix), so a misclassified 8-point segment counts 4× a misclassified 2-point segment.
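
A sketch of that loop, reusing knn_predict from the classifier sketch above; the segment record layout ('features', 'label', 'points') is assumed for illustration:

    import numpy as np

    def loro_confusion(segments, classes=("Air", "Flowing", "Still")):
        """Leave-one-region-out CV with a point-weighted confusion matrix."""
        idx = {c: i for i, c in enumerate(classes)}
        cm = np.zeros((len(classes), len(classes)), dtype=int)
        for held in range(len(segments)):
            train = [s for j, s in enumerate(segments) if j != held]
            X = np.array([s["features"] for s in train], dtype=float)
            mu, sd = X.mean(axis=0), X.std(axis=0) + 1e-9  # per-fold normalization
            y = [s["label"] for s in train]
            x = (np.asarray(segments[held]["features"], dtype=float) - mu) / sd
            pred = knn_predict((X - mu) / sd, y, x)
            # Each held-out segment contributes its point count, so an
            # 8-point segment counts 4x a 2-point one.
            cm[idx[segments[held]["label"]], idx[pred]] += segments[held]["points"]
        return cm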

Test 1: Closed Pipe Flow

A pump-driven closed pipe loop, alternating ~15 min of pumped flow with ~45 min of static water per hour, run continuously from 2026-04-13 14:00 through 2026-04-16 12:15.

Closed Pipe Flow — Results

2026-04-13 14:00 → 2026-04-16 12:15 · 142 segments · 696 points

  • Overall accuracy: 95.8%
  • Cohen's κ: 0.89
  • Still recall: 98.1%
  • Flowing recall: 96.3%

Confusion matrix (point counts)

Actual \ Predicted | Air | Flowing | Still
Air (n=13) | 0 | 0 | 13
Flowing (n=162) | 0 | 156 | 6
Still (n=521) | 0 | 10 | 511

Per-class metrics

Class | n | Recall | Precision | F1
Air | 13 | 0.0% | n/a | n/a
Flowing | 162 | 96.3% | 94.0% | 95.1%
Still | 521 | 98.1% | 96.4% | 97.2%

Discussion

  • Both planned operating conditions exceed 96% recall. Still is at 98.1%, Flowing at 96.3%, and the precisions for both are above 94%. For an SDWS 23/27 use case in this configuration, this is excellent.
  • Air recall is 0/13 — and this is fine. Both Closed-Air segments are operator-flagged anomalies (“Likely air gap”, “Re-set system with grease”) — not planned operating conditions, not steady-state events the classifier was designed to detect. They drag balanced accuracy down to 64.8% but contribute only 1.9% of the time integral; the impact on the SDWS volume estimate is negligible.
  • The remaining 4.2% error is concentrated in Flowing↔Still cross-confusion at segment boundaries — 6 Flowing→Still and 10 Still→Flowing — where the temperature dynamics of a brief flowing window are below the KNN's discrimination threshold for that fold's normalization.

Test 2: Filling/Draining Bucket

A bucket dispenser configuration where the sensor sits in a reservoir that fills, holds, and drains on a longer cycle. Air exposure is intentional and recurrent (between fills), making this experiment Air-rich. Run from 2026-04-17 15:24 to present, excluding the firmware-bug window 2026-04-23 15:00 → 2026-04-27 13:00.

Filling/Draining Bucket — Results

2026-04-17 15:24 → present · 276 segments · 903 points

  • Overall accuracy: 90.7%
  • Cohen's κ: 0.85
  • Air recall: 96.0%
  • Still recall: 92.1%

Confusion matrix (point counts)

Actual \ Predicted | Air | Flowing | Still
Air (n=481) | 459 | 0 | 19
Flowing (n=219) | 0 | 169 | 49
Still (n=205) | 0 | 16 | 187

Per-class metrics

Class | n | Recall | Precision | F1
Air | 481 | 96.0% | 100.0% | 98.0%
Flowing | 219 | 77.5% | 91.4% | 83.9%
Still | 205 | 92.1% | 73.3% | 81.7%

Discussion

  • Air discrimination is excellent (recall 96.0%, precision 100.0%). When the classifier predicts Air, it is right every time (no false positives), and 96% of the time real Air is correctly identified. The turbidity-based Air rule provides a strong physical handle.
  • Flowing recall (77.5%) is the bottleneck: 49 of 219 Flowing points are misclassified as Still. Same root cause as Closed: at the 6-min cadence, 15-min Flowing windows produce only 2–3 sample points, leaving the maxDrop feature just one or two temperature deltas to assess.
  • Still recall is high (92.1%) but precision is lower (73.3%) — the 49 Flowing points wrongly predicted as Still inflate the false-positive count, so a sizable share of the points predicted Still are actually brief Flowing events. For SDWS-23 volume estimates this bias is directionally favorable (it under-counts flow, yielding a conservative estimate).

Combined

Both tests, point-weighted · 418 segments · 1,599 points

  • Overall accuracy: 93.0%
  • Cohen's κ: 0.85
  • Balanced accuracy: ~76%
  • Correct points: 1,483 / 1,595

Confusion matrix (combined)

Actual \ Predicted | Air | Flowing | Still
Air (n=494) | 459 | 0 | 32
Flowing (n=381) | 0 | 325 | 55
Still (n=720) | 0 | 26 | 698

Per-class metrics (combined)

Class | n | Recall | Precision | F1
Air | 494 | 92.9% | 100.0% | 96.3%
Flowing | 381 | 85.3% | 92.6% | 88.8%
Still | 720 | 96.9% | 88.9% | 92.7%

Implication for SDWS 23 / 27

For an integral-over-time deployment metric, the time-weighted misclassification rate is ~7% (~112 of 1,595 points). At the current 6-min sample cadence this corresponds to ~7 minutes of misclassified state per 100 minutes observed. For a daily SDWS-23 water-volume estimate calibrated against an in-line flow rate, the residual error budget for state classification alone is therefore ~7% of the daily total.

Closed Pipe Flow at 95.3% accuracy gives a ~5-min-per-100 error budget — suitable for tight-tolerance deployment with minimal post-processing. The Bucket Dispenser at 90.7% delivers ~9 min per 100, with the residual error concentrated in Flowing → Still under-counts (49 points). For SDWS 23 this directional bias is favorable (conservative volume estimate); for an unbiased estimate, a per-class recall correction at deployment time would recover most of the residual.
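
One way such a correction could look, as a hedged sketch rather than anything this report prescribes: row-normalize the validation confusion matrix and solve for the true per-class time budget that would have produced the predicted one.

    import numpy as np

    def corrected_class_minutes(pred_minutes, confusion):
        """Undo systematic per-class misclassification in a time budget.

        confusion[i, j] = validation points of true class i predicted
        as class j. Row-normalizing gives P(pred = j | true = i), and
        pred ~= C.T @ true, so a least-squares solve estimates the true
        per-class minutes. (Least squares rather than a direct inverse,
        because a never-predicted class, like Closed Air, makes C.T
        singular.)
        """
        C = confusion / confusion.sum(axis=1, keepdims=True)
        true_est, *_ = np.linalg.lstsq(
            C.T, np.asarray(pred_minutes, dtype=float), rcond=None
        )
        return np.clip(true_est, 0.0, None)  # time budgets cannot be negative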

Limitations & next steps

Sensor cadence is the dominant limit

At the snapshot sample rate (1 reading per ~6 min), 15-min Flowing windows yield only 2–3 sample points. The sustained-monotonic-run features (maxDrop, maxRise) lose statistical power below 4 samples per segment, and 100% of the trainable Flowing segments in this dataset are ≤3 points. Returning the firmware to 1-min cadence (the configuration the original 2026-04-18 closed-pipe-flow study used) would put 15+ samples in every Flowing event and is the single change with the largest expected impact on Flowing recall.

Per-point classification does not help at this cadence

Replacing segment-level KNN with per-point KNN over a sliding window of derivative features was tested. With segment-trimmed windows it produced the same accuracy as the segment-level approach (~84%); with segment-crossing windows it collapsed to ~52%. The fundamental limit is information density per event, not the granularity of the classification unit.

Two operator-flagged Closed-Air anomalies are not classifiable here

The 13 Air points in the Closed experiment correspond to operator notes (“Likely air gap”, “Re-set system with grease”) — incidental events without a steady-state optical signature. They are correctly absent from the turbidity-override path. Handling them would require either a dedicated anomaly detector or excluding them from the per-class recall calculation in the report.

Recommended next experiments

  • Restore the 1-min sample cadence (the configuration of the original 2026-04-18 closed-pipe-flow study) and re-run both protocols; this is the single change with the largest expected impact on Flowing recall.
  • Add planned, steady-state Air events to the Closed protocol so Closed Air becomes a trainable operating condition rather than an anomaly class.
  • Validate the per-class recall correction proposed above for SDWS 23 on held-out data before deployment.

Reproducibility

This analysis is fully reproducible from three static snapshot files in the repository, taken on 2026-04-27. The live tool at piped-flow-test.pages.dev may show slightly different numbers as new annotations are added or sensor data accumulates; this page reports a frozen-in-time view.

File | Contents | Source
annotations-snapshot-2026-04-27.json | 827 operator annotations covering the two enabled experiments (Closed Pipe Flow and Bucket); the firmware-bug window is excluded. | KV store at /api/annotations
diagnostics-snapshot-2026-04-27.json | 2,445 diagnostic readings (UVLED / SiPM / board temperature, voltage, charge) for sensor 50051 over 2026-04-13 → 2026-04-27. | Pumphaus /tlf-sites/diagnostics
tof-snapshot-2026-04-27.json | 2,461 ToF readings (signal_per_spad_kcps, distance_mm) for sensor 50051 over the same range. | Pumphaus /tlf-sites/tof
analysis-results-2026-04-27.json | Computed per-experiment confusion matrices and per-class metrics; the file the page's numbers are derived from. | Generated from the three above

Pipeline

  1. Filter annotations to active experiments only (Closed and Bucket; firmware-bug excluded).
  2. Build per-segment spans, clipping each span at any other experiment's start that falls inside it.
  3. Join each /diagnostics reading to its nearest /tof reading within ±2 min (a sketch of this join follows the list).
  4. For each span, compute the 7 features (maxDrop, maxRise, sipm/board variants, uvledBoardDiff) over the joined points inside the span.
  5. Skip segments with <2 points (insufficient for any monotonic-run computation).
  6. Train one KNN classifier per experiment (Closed-only and Bucket-only training sets, no cross-contamination). For each: leave-one-region-out KNN (k=3, distance-weighted, class-balanced) over the segment-level features.
  7. Expand each segment's KNN prediction to all points in the segment, apply the Air-strip rule at the point level (KNN-predicted Air with <80 kcps → Still), then majority-vote the segment's final label. The forward override (≥80 kcps → force Air) is omitted to match the live dashboard's effective behavior.
  8. Assemble point-weighted confusion matrices, compute per-class precision/recall/F1 and overall accuracy + Cohen's κ. Combined results are the point-summed union of the two per-experiment confusion matrices.
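
Step 3's nearest-reading join maps directly onto pandas' merge_asof; a sketch assuming both streams expose a sorted datetime "timestamp" column (the column names are illustrative):

    import pandas as pd

    def join_streams(diag: pd.DataFrame, tof: pd.DataFrame) -> pd.DataFrame:
        """Attach the nearest /tof reading within +/-2 min to each
        /diagnostics reading; rows with no partner keep NaN ToF fields."""
        diag = diag.sort_values("timestamp")
        tof = tof.sort_values("timestamp")
        return pd.merge_asof(
            diag, tof,
            on="timestamp",
            direction="nearest",
            tolerance=pd.Timedelta(minutes=2),
        )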