DA-07 Hardware Field Notes
A running log of bugs found while debugging the DA-07 module against real hardware, with root causes and fix locations. The simulator can mask whole bug classes (see each entry's "why the simulator missed it"), so when a tab misbehaves on hardware but tests are green, check this file first — the pattern is probably already named here.
Companion docs: HARDWARE-VERIFICATION.md (the flagged-protocol checklist),
DA-07 SERVICE-TOOL-ICD.md (the wire protocol), VB6-MIGRATION-PLAYBOOK.md
(general VB6 traps).
2026-06-12 — Steady-state capture: ~P decoded, ~H verified, names are write-only
Setup: second capture session (tests/da07/fixtures/capture-2026-06-12-steady-state.txt,
with "> "-prefixed outbound lines), run for several minutes after load, including a
Tag-name write and two device toggles through the new write queue.
Findings:
- ~P = per-channel CT sensor serial (P + device + channel + 16-hex), matching the E-frame serial tails byte-for-byte. Now decoded (ChannelSerial) and applied to the channel model. It is not the custom-name frame BL-E5 part 2 hoped for.
- Custom channel names are write-only. The ~D field-11 name write went out and was Z1-ACKed, but no frame ever reports a name back, even after Refresh. A typed Tag name therefore reverts on Refresh, by protocol, same as the legacy. BL-E5 closed.
- ~H layout verified (68 real frames): counters/buffered/time decode sensibly and the 16 per-device status nibbles matched the live UI (devices 2,3 COM, rest OK). Indicator triples are still unobserved (no active alarm groups on the test station).
- Write queue confirmed on hardware: 26 writes → 25 ACKs, with one observed idle-retransmit recovering a dropped frame; both device toggles stuck after Refresh. The new ACK-paced queue drains faster than the legacy's one-per-poll.
- Channels-tab column widths didn't refit on device switch (UI, sim-reproducible): TableTab auto-fits only when the row count changes, but a device switch swaps content at the same row count, so Serial/Model stayed sized for the previous pod's (empty) values. Fixed: explicit refit after device switch (_refit_for_device_switch).
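The ~P layout from the first finding can be sketched as a small decoder plus the byte-for-byte cross-check against the E-frame tail. This is a sketch under assumptions: the device and channel field widths (two decimal digits each) are guesses, not taken from the ICD, and DecodedP, decode_p_frame and matches_e_tail are hypothetical names, not the project's ChannelSerial decoder.

```python
import re
from dataclasses import dataclass

# Assumed layout: "~P" + 2-digit device + 2-digit channel + 16 hex serial.
# The notes only say "P + device + channel + 16-hex"; the 2-digit widths are a guess.
P_FRAME = re.compile(r"~P(?P<device>\d{2})(?P<channel>\d{2})(?P<serial>[0-9A-F]{16})")


@dataclass(frozen=True)
class DecodedP:
    """Hypothetical container, not the project's real ChannelSerial type."""
    device: int
    channel: int
    serial: str  # 8-byte CT probe serial, two hex characters per byte


def decode_p_frame(frame: str) -> DecodedP:
    match = P_FRAME.fullmatch(frame.strip())
    if match is None:
        raise ValueError(f"not a ~P frame: {frame!r}")
    return DecodedP(int(match["device"]), int(match["channel"]), match["serial"])


def matches_e_tail(e_frame: str, serial: str) -> bool:
    """True when the ~P serial equals the trailing hex chars of the channel's E-frame."""
    return e_frame.strip().endswith(serial)
```

Under these assumed widths, decode_p_frame("~P0103B321281B04CB9CEF") yields device 1, channel 3 and serial B321281B04CB9CEF, the same tail that appears in the BL-E5 capture entry.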
2026-06-12 — Devices-tab STATUS lags ~10 s after Refresh (known behavior, not a bug)
Symptom: after a Refresh on real hardware, the Devices tab renders its rows quickly but the STATUS column stays "—" for roughly ten seconds before filling in.
Root cause — station firmware design, nothing tool-side: STATUS has exactly one
source, the ~H operational-statistics frame (one status nibble per device slot,
controller._apply_status). Per the ICD (§4.3/§5.H) the firmware only sends ~H
after the entire refresh sequence (config → settings → devices → averages →
current → names → device details) has drained and it settles into steady-state
SVC_POLL, where ~H repeats once per second. ~H is not solicitable — there
is no command that requests one early. Device rows appear early (the ~D frames),
then the station spends the rest of the load streaming at 9600 baud, one frame per
ACK; only then does the first ~H arrive. Both real captures confirm the refresh
stream itself contains no ~H. Our ACKs are immediate (_send_link, bypassing the
write queue), so the pace is entirely the station's. The legacy VB6 behaved
identically but hid it behind its modal progress dialog, which only closed on the
first ~H (Case "H" … CloseProgress, Main.frm).
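Because ~H is the only source, the STATUS column is a pure function of "have we seen an ~H yet, and what nibbles did it carry". A minimal sketch, assuming the 16 status nibbles arrive as a 16-character hex field already sliced out of the ~H frame. The nibble-to-label table is hypothetical (the real codes are in the ICD §5.H and are not reproduced in these notes), and decode_status_nibbles/status_column are illustrative names, not controller._apply_status itself.

```python
from typing import List, Mapping, Optional

SLOTS = 16
PLACEHOLDER = "\u2014"  # em dash: what STATUS shows until the first ~H arrives

# Hypothetical code table for illustration only; take real values from the ICD.
EXAMPLE_CODES: Mapping[int, str] = {0x0: "OK", 0x1: "COM"}


def decode_status_nibbles(field: str) -> List[int]:
    """One hex character (one nibble) per device slot."""
    if len(field) != SLOTS:
        raise ValueError(f"expected {SLOTS} nibbles, got {len(field)}")
    return [int(ch, 16) for ch in field]  # raises ValueError on non-hex input


def status_column(nibbles: Optional[List[int]],
                  codes: Mapping[int, str] = EXAMPLE_CODES) -> List[str]:
    if nibbles is None:
        # No ~H yet: nothing tool-side can fill the column, hence the ~10 s lag.
        return [PLACEHOLDER] * SLOTS
    return [codes.get(n, f"?{n:X}") for n in nibbles]
```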
Status: accepted as-is (2026-06-12). Deliberately not in BACKLOG.md — Andy reviewed and chose to leave it alone for now. If it ever bothers users enough to revisit, the candidate mitigations (perception-only; the wire-level wait cannot be shortened) were:
- Carry last-known per-slot statuses across controller.refresh() instead of blanking them, rendered dimmed ("off" status role) until the first ~H reconfirms. This gives an instant repopulate but shows stale data for the gap, which is why it should never render full-strength.
- Cosmetic only: a clearer "pending" placeholder than "—" while loading.
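For whoever revisits this, the first (unadopted) mitigation reduces to a small state rule: refresh marks values stale instead of clearing them, and the first ~H clears the stale flag. A hypothetical sketch only; StatusCell and CarriedStatuses are illustrative names, and the renderer is assumed to map stale=True to the dimmed "off" status role.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class StatusCell:
    text: Optional[str] = None
    stale: bool = False  # True -> render dimmed ("off" role), never full-strength


class CarriedStatuses:
    """Hypothetical: keep last-known statuses across refresh(), marked stale."""

    def __init__(self, slots: int = 16) -> None:
        self.cells = [StatusCell() for _ in range(slots)]

    def on_refresh(self) -> None:
        # Instead of blanking, keep the old text but flag it as unconfirmed.
        for cell in self.cells:
            cell.stale = cell.text is not None

    def on_status_frame(self, statuses: List[str]) -> None:
        # The first ~H after refresh replaces every value and clears the flag.
        for cell, text in zip(self.cells, statuses):
            cell.text, cell.stale = text, False

    def render(self, placeholder: str = "pending") -> List[Tuple[str, bool]]:
        return [(c.text if c.text is not None else placeholder, c.stale)
                for c in self.cells]
```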
2026-06-12 — Burst writes silently dropped by the station ("only the second toggle stuck")
Symptom: toggle device A's Active off, wait ~2 s, toggle device B's off, then Refresh — A came back ON (its writes never landed on the station); only B stuck.
Root cause: the controller transmitted commands back-to-back, but the station
processes one inbound frame at a time and silently drops the rest of a burst.
The 2026-06-12 capture proves it: 16 burst channel writes drew only 2 Z1 ACKs.
The legacy never burst — MakeCommand only queued, and each inbound Z from the
station popped exactly one command onto the wire (Main.frm SendCommand); its
"Working n" status caption was that queue's depth.
Fix — outbound command queue (controller._send/_pump/_handle_poll):
one command in flight; Z1 confirms and advances; Z0 retransmits; a Z2 idle
while unconfirmed means the frame was dropped → retransmit (safe: every queued
command is idempotent — an improvement over the legacy, which lost silent drops),
capped at 3 transmissions then dropped with an errorOccurred so a dead station
can't wedge the queue. Link frames (Z1 data-frame ACKs, Z2 idles) bypass the
queue — they are the handshake. The simulator now Z1-ACKs every command frame
like the real station (it previously applied bursts perfectly, which is exactly
why this class of bug was invisible in sim). pendingWritesChanged(depth) drives a
footer WORKING dot (the modern equivalent of the legacy "Working n" caption);
regression suite: tests/da07/test_write_queue.py.
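The queue rules above fit in a small state machine. This is a minimal sketch, assuming inbound link codes arrive as the strings "Z1"/"Z0"/"Z2" and that the 3-transmission cap applies to both NAK and idle retransmits; WriteQueue, send and on_link are hypothetical names that only loosely mirror controller._send/_pump/_handle_poll. Our own link frames (data-frame ACKs, idles) never enter it.

```python
from collections import deque
from typing import Callable, Deque, Optional

MAX_TRANSMISSIONS = 3


class WriteQueue:
    """One command in flight, paced by the station's Z1/Z0/Z2 link frames."""

    def __init__(self, transmit: Callable[[str], None],
                 on_error: Callable[[str], None],
                 on_depth: Callable[[int], None] = lambda depth: None) -> None:
        self._transmit, self._on_error, self._on_depth = transmit, on_error, on_depth
        self._queue: Deque[str] = deque()
        self._in_flight: Optional[str] = None
        self._tries = 0
        self._last_depth = 0

    @property
    def depth(self) -> int:
        return len(self._queue) + (1 if self._in_flight is not None else 0)

    def send(self, frame: str) -> None:
        self._queue.append(frame)
        self._pump()
        self._emit_depth()

    def on_link(self, code: str) -> None:
        """Feed every inbound Z frame: Z1 = ACK, Z0 = NAK, Z2 = idle."""
        if self._in_flight is None:
            return
        if code == "Z1":                      # confirmed: advance
            self._in_flight = None
            self._pump()
        elif code in ("Z0", "Z2"):            # NAK, or idle while unconfirmed
            if self._tries < MAX_TRANSMISSIONS:
                self._transmit_in_flight()    # safe: queued commands are idempotent
            else:                             # give up so a dead station can't wedge us
                dropped, self._in_flight = self._in_flight, None
                self._on_error(f"no ACK after {MAX_TRANSMISSIONS} tries: {dropped!r}")
                self._pump()
        self._emit_depth()

    def _pump(self) -> None:
        if self._in_flight is None and self._queue:
            self._in_flight = self._queue.popleft()
            self._tries = 0
            self._transmit_in_flight()

    def _transmit_in_flight(self) -> None:
        self._tries += 1
        self._transmit(self._in_flight)

    def _emit_depth(self) -> None:            # drives the WORKING dot
        if self.depth != self._last_depth:
            self._last_depth = self.depth
            self._on_depth(self.depth)
```

The design choice this illustrates: pacing comes from the station's replies, not from a timer, so a burst can never outrun the station's one-frame-at-a-time processing.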
Capture hook upgrade: CIM_DA07_CAPTURE now records outbound frames too,
prefixed "> " (inbound lines stay bare) — so the next hardware session can see
both sides of the handshake.
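With both directions in one file, checks like "16 burst channel writes drew only 2 Z1 ACKs" become mechanical. A sketch assuming one frame per line, no timestamps, ACK frames written as "~Z1", and our own link frames starting with "~Z" (all assumptions about the capture format; read_capture and handshake_summary are hypothetical helpers).

```python
from collections import Counter
from typing import Iterable, Iterator, Tuple


def read_capture(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (direction, frame): "out" for "> "-prefixed lines, "in" otherwise."""
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue
        if line.startswith("> "):
            yield "out", line[2:]
        else:
            yield "in", line


def handshake_summary(lines: Iterable[str]) -> Counter:
    counts: Counter = Counter()
    for direction, frame in read_capture(lines):
        if direction == "out" and not frame.startswith("~Z"):
            counts["commands_out"] += 1   # our link frames are not commands
        elif direction == "in" and frame.startswith("~Z1"):
            counts["acks_in"] += 1
    return counts
```

Typical use would be handshake_summary(open(path, encoding="utf-8")) on a fixture such as tests/da07/fixtures/capture-2026-06-12-steady-state.txt, provided its lines match the assumed format.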
Verify on hardware next session: repeat the two-device toggle test (toggle A,
wait, toggle B, Refresh — BOTH must stick), and watch the WORKING dot drain. Note
the design assumes the station Z1s every accepted command (observed for writes);
if some commands are never ACKed they will retransmit ×3 and surface one error.
2026-06-12 — Tiny "popup window" flashing on every write (suite-wide kit bug)
Symptom: a small, empty, native-decorated window (app icon + min/max/close,
label-sized) flashed on top of the app on every settings write — 8× on a Devices-tab
Active toggle (one per channel write). Screenshot: docs/samples/DA-07 flashing pop-up.png. Not DA-07-specific and not the VB6 "Working" indicator — it fired on
any grid rebuild in any module; the optimistic-apply change (below) just multiplied
the rebuilds that exposed it.
Root cause: SummaryStrip.set_summary (core/ui/kit/summary_strip.py) replaced
its count labels with label.setParent(None) while the labels were visible —
reparenting a visible widget to None promotes it to a top-level window, which Windows
shows with full native decoration until the deferred deleteLater runs. Fix: hide
the dying label and let deleteLater collect it while still parented (one line +
regression test tests/core/kit/test_summary_strip.py::test_set_summary_never_orphans_a_visible_label).
How it was found — the CIM_UI_SPY diagnostic (keep for next time): the flash
could not be reproduced by any synthetic probe (offscreen platforms swallow it, and
QTest clicks missed the toggle hotspot), but an env-gated window spy
(core/ui/window_spy.py, hooked in shell/app.py) run in the user's live session
logged 84 ghost windows with full creation stack traces pointing at the exact line.
Usage:
```powershell
$env:CIM_UI_SPY="$PWD\ui-spy.log"; .venv\Scripts\python -m cim_suite.shell.app --simulate
```
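The Qt-independent half of such a spy is an env gate plus a stack dump. A minimal sketch, assuming only that the spy is off unless CIM_UI_SPY names a log file; make_spy_logger is a hypothetical name, and the Qt side (noticing that a new top-level window appeared and calling the logger) is deliberately not shown, since that is whatever core/ui/window_spy.py actually does.

```python
import os
import traceback
from datetime import datetime
from typing import Callable, Mapping, Optional

ENV_VAR = "CIM_UI_SPY"


def make_spy_logger(env: Mapping[str, str] = os.environ) -> Optional[Callable[[str], None]]:
    """Return a logger only when CIM_UI_SPY names a log file; None otherwise,
    so normal sessions pay nothing for the diagnostic."""
    path = env.get(ENV_VAR)
    if not path:
        return None

    def log_window(description: str) -> None:
        # Record the call stack at the moment the hook fires, minus this frame.
        stack = "".join(traceback.format_stack(limit=25)[:-1])
        with open(path, "a", encoding="utf-8") as fh:
            fh.write(f"{datetime.now().isoformat()} {description}\n{stack}\n")

    return log_window
```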
Lesson: when a UI glitch reproduces for the user but not for instrumented probes, instrument the user's own session instead of approximating it.
2026-06-12 — Tag column truncation confirmed + first real capture (BL-E5 part 1)
Setup: same session as the stale-model entry below. Ran an instrumented refresh
against the station with CIM_DA07_CAPTURE — the repo's first real DA-07 capture,
saved as tests/da07/fixtures/capture-2026-06-12-refresh.txt (157 frames: A/B/C/D/E/F/G/M/Z).
Findings:
- BL-E5 confirmed byte-for-byte. A CT channel's ~E frame ends …[alarm byte]B321281B04CB9CEF: the alarm byte, then the 8-byte probe serial, with no disp field and no name field. The phantom disp read ate B3, leaving 21281B04CB9CEF in the Tag column (the user-reported "first two characters cut off"). The legacy VB6 had the same phantom read (a second PullBase1 into its hidden Disp column), so it truncated too. Fixed per BL-E5 part 1: disp removed everywhere; Tag shows name → serial → catalog default.
- No ~H frame in the whole load, the second real capture in a row without one (the load finishes via the idle-settle timer, not the ~H fast path). Yet the STATUS column does populate eventually on hardware, so ~H must arrive later, periodically, in the steady state, which is exactly what made the stale-model revert (below) fire every second or two.
- No ~P frame even with CT sensors attached, so BL-E5 part 2 (where a custom channel name echoes back) is still blocked on a capture. Until then, a written Tag name lives only in the local model and reverts on Refresh (the legacy did the same).
- The CS-31 catalog entry (A2C0831CS-31) carries no default channel names. CT channels are identified by probe serial, hence serial-in-Tag is the correct legacy-faithful display.
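The truncation is plain byte arithmetic on the tail after the alarm byte. A sketch assuming the alarm byte has already been consumed and the remaining 16 hex characters are passed in; parse_ct_tail_legacy, parse_ct_tail and tag_text are hypothetical names, and "CT 1" in any usage is an invented catalog default (CS-31 itself has none).

```python
from typing import Optional, Tuple

SERIAL_HEX = 16  # 8-byte probe serial, two hex characters per byte


def parse_ct_tail_legacy(after_alarm: str) -> Tuple[str, str]:
    """Pre-fix read: a phantom 1-byte disp field eats the first two characters."""
    return after_alarm[:2], after_alarm[2:]


def parse_ct_tail(after_alarm: str) -> str:
    """BL-E5 part 1: no disp field, so everything after the alarm byte is the serial."""
    if len(after_alarm) != SERIAL_HEX:
        raise ValueError(f"expected {SERIAL_HEX} hex chars, got {len(after_alarm)}")
    return after_alarm


def tag_text(name: Optional[str], serial: Optional[str],
             catalog_default: Optional[str]) -> str:
    """Tag column fallback: name -> serial -> catalog default."""
    for candidate in (name, serial, catalog_default):
        if candidate:
            return candidate
    return ""
```

With the captured tail, the legacy read returns ("B3", "21281B04CB9CEF"), which is exactly the truncated Tag the user saw.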
2026-06-12 — Edits revert within seconds once live frames flow ("stale-model rebuild")
Setup: first DA-07 hardware session, with one pod attached. Reported on the Devices tab: the Active switch toggles, then snaps back within ~1–2 s, but only after the STATUS column populates. Same on the Channels tab Active column, and (predicted, same mechanism) on every editable field of the Devices / Channels / Station / Alarm tabs.
Root cause: the controller's set_* methods wrote the command to the wire
but never updated the local models, and the DA-07 protocol has no settings
echo — a real station never re-sends a value you just wrote. (The legacy VB6
had no such problem because its grid was the model: Grid2_AfterEdit sent the
command and the grid simply kept the edited cell.) On real hardware the station
streams periodic frames in the steady state:
| inbound frame | controller signal(s) | tabs that rebuild |
|---|---|---|
| ~H realtime status | devicesChanged + alarmsChanged | Devices, Alarm (and Channels' device combo) |
| ~G inputs / ~F averages | channelsChanged | Channels |
Each signal rebuilds the grid from the stale model, visually reverting the
edit. That's why toggles "worked" before STATUS populated: no ~H yet → no
rebuilds → the checkbox kept its widget-local state (the model was wrong the
whole time).
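The mechanism reproduces with a toy view: the widget flips locally, the setter touches only the wire, and the next periodic rebuild re-renders from the untouched model. A hypothetical sketch; DeviceModel, DevicesTabView, make_pre_fix_setter and the frame text are illustrative stand-ins, not the real Qt grid or encoder.

```python
from typing import Callable, Dict


class DeviceModel:
    def __init__(self) -> None:
        self.active: Dict[int, bool] = {1: True}


class DevicesTabView:
    """Toy stand-in for the Devices grid: rebuild() re-renders from the model."""

    def __init__(self, model: DeviceModel) -> None:
        self.model = model
        self.checkbox = dict(model.active)  # widget-local state

    def user_toggles(self, device: int, value: bool,
                     setter: Callable[[int, bool], None]) -> None:
        self.checkbox[device] = value       # the widget flips immediately
        setter(device, value)

    def rebuild(self) -> None:              # runs on every devicesChanged
        self.checkbox = dict(self.model.active)


def make_pre_fix_setter(transmit: Callable[[str], None]) -> Callable[[int, bool], None]:
    def set_device_active(device: int, value: bool) -> None:
        transmit(f"set active {device}={int(value)}")  # hypothetical frame; model untouched
    return set_device_active
```

Toggling device 1 off leaves checkbox[1] False only until the first ~H-driven rebuild, which puts it back to True.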
Why the simulator missed it: the sim only sends ~H once, at the end of a
refresh — never periodically — so nothing ever rebuilt the Devices tab after
load. (Channels-tab live ticks would have shown it, but nobody toggled Active
in live sim mode and the tests always called refresh() between write and
assert, which re-reads the sim's state and hides the gap.)
Fix — "optimistic apply" (2026-06-12): every
controller setter now applies the value to its local model immediately after
sending, and emits the matching *Changed signal. The station remains the
source of truth — the next Refresh overwrites local state with whatever the
station actually stored.
- domain/controller.py: the whole "settings writes" section (set_station_setting, _set_device_field, set_device_type, _set_channel_field, _set_limit, set_alarm_*, remove_device).
- domain/models.py: DeviceTable.remove, ChannelTable.remove_device.
- Regression tests: tests/da07/test_optimistic_writes.py (includes UI-level tests that replay the exact symptom: toggle, then deliver a periodic ~H/~G, assert no revert).
The general rule this leaves behind: any outbound DA-07 mutation must update the local model in the same call, because nothing inbound will. If a new setter is added and its edit "reverts after a second or two" on hardware, the optimistic apply was forgotten.
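The rule in shape: send, apply to the local model, emit, all in one call, while Refresh still overwrites. A minimal sketch with a tiny hand-rolled Signal standing in for Qt's; ControllerSketch, on_refresh_value and the frame text are hypothetical, not the real controller API.

```python
from typing import Callable, Dict, List


class Signal:
    """Minimal stand-in for a Qt signal (the real controller uses Qt signals)."""

    def __init__(self) -> None:
        self._slots: List[Callable[[], None]] = []

    def connect(self, slot: Callable[[], None]) -> None:
        self._slots.append(slot)

    def emit(self) -> None:
        for slot in self._slots:
            slot()


class ControllerSketch:
    def __init__(self, transmit: Callable[[str], None]) -> None:
        self._transmit = transmit
        self.device_active: Dict[int, bool] = {}
        self.devicesChanged = Signal()

    def set_device_active(self, device: int, value: bool) -> None:
        self._transmit(f"set active {device}={int(value)}")  # hypothetical frame text
        self.device_active[device] = value  # optimistic apply, in the same call
        self.devicesChanged.emit()

    def on_refresh_value(self, device: int, value: bool) -> None:
        # The station stays the source of truth: Refresh overwrites local state.
        self.device_active[device] = value
        self.devicesChanged.emit()
```

A periodic ~H that triggers devicesChanged now re-renders the written value instead of reverting it; only a Refresh reporting a different stored value changes it back.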
Consequences / things this changes:
- The §5.6 "pending → echoed" write-feedback marker on the tabs now resolves on the next rebuild (≈1 s on hardware via ~H/~G), not on a true echo; the DA-07 simply has no per-write confirmation. A NAK (Z0) still triggers a resend of the last frame only.
- set_device_active writes one frame per channel; on a NAK only the last frame is resent (pre-existing limitation, unchanged).
Open questions for the next hardware session:
- Confirm the write actually landed on the station: toggle Active, wait, hit Refresh; the value must survive the full reload (this proves the ~D write frame itself is accepted, which the revert previously made impossible to observe). If it does NOT survive, there is a second bug in the write frames themselves (protocol/encoder.py::set_channel_field etc.) that was hidden behind this one.
- DA-12 may have the same latent bug. Its set_* methods also send without updating models (modules/da12/domain/controller.py), and its sim also applies writes without echoing. DA-12 was assumed to re-stream full A sensor records unsolicited (which would refresh config naturally); verify on real DA-12 hardware whether edits revert when C/I value frames rebuild the Sensors tab. If they do, port the same optimistic-apply pattern.
(Add new entries above this line, newest first, dated, with: symptom → root cause → why the sim missed it → fix locations → open questions.)