cimtechniques-service-suite/docs/DA07-FIELD-NOTES.md
andy d83a61e65f docs(da07): record the post-Refresh STATUS lag as accepted firmware behavior
The ~H operational-statistics frame is the only source of device status and
is not solicitable; it arrives once per second only after the full refresh
sequence drains into SVC_POLL. Reviewed and accepted as-is; candidate
perception-only mitigations noted in the entry instead of BACKLOG.md.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 11:07:38 -04:00


DA-07 Hardware Field Notes

A running log of bugs found while debugging the DA-07 module against real hardware, with root causes and fix locations. The simulator can mask whole bug classes (see each entry's "why the simulator missed it"), so when a tab misbehaves on hardware but tests are green, check this file first — the pattern is probably already named here.

Companion docs: HARDWARE-VERIFICATION.md (the flagged-protocol checklist), DA-07 SERVICE-TOOL-ICD.md (the wire protocol), VB6-MIGRATION-PLAYBOOK.md (general VB6 traps).


2026-06-12 — Steady-state capture: ~P decoded, ~H verified, names are write-only

Setup: second capture session (tests/da07/fixtures/capture-2026-06-12-steady-state.txt, with "> "-prefixed outbound lines), run for several minutes after load, including a Tag-name write and two device toggles through the new write queue.

Findings:

  • ~P = per-channel CT sensor serial (P + device + channel + 16-hex), matching the E-frame serial tails byte-for-byte. Now decoded (ChannelSerial) and applied to the channel model. It is not the custom-name frame BL-E5 part 2 hoped for.
  • Custom channel names are write-only. The ~D field-11 name write went out and was Z1-ACKed, but no frame ever reports a name back — even after Refresh. A typed Tag name therefore reverts on Refresh, by protocol, same as the legacy. BL-E5 closed.
  • ~H layout verified (68 real frames): counters/buffered/time decode sensibly and the 16 per-device status nibbles matched the live UI (devices 2,3 COM, rest OK). Indicator triples still unobserved (no active alarm groups on the test station).
  • Write queue confirmed on hardware: 26 writes → 25 ACKs, one observed idle-retransmit recovering a dropped frame; both device toggles stuck after Refresh. The new ACK-paced queue drains faster than the legacy's one-per-poll.
  • Channels-tab column widths didn't refit on device switch (UI, sim-reproducible): TableTab auto-fits only when the row count changes, but a device switch swaps content at the same row count — Serial/Model stayed sized for the previous pod's (empty) values. Fixed: explicit refit after device switch (_refit_for_device_switch).

2026-06-12 — Devices-tab STATUS lags ~10 s after Refresh (known behavior, not a bug)

Symptom: after a Refresh on real hardware, the Devices tab renders its rows quickly but the STATUS column stays "—" for roughly ten seconds before filling in.

Root cause — station firmware design, nothing tool-side: STATUS has exactly one source, the ~H operational-statistics frame (one status nibble per device slot, controller._apply_status). Per the ICD (§4.3/§5.H) the firmware only sends ~H after the entire refresh sequence (config → settings → devices → averages → current → names → device details) has drained and it settles into steady-state SVC_POLL, where ~H repeats once per second. ~H is not solicitable — there is no command that requests one early. Device rows appear early (the ~D frames), then the station spends the rest of the load streaming at 9600 baud, one frame per ACK; only then does the first ~H arrive. Both real captures confirm the refresh stream itself contains no ~H. Our ACKs are immediate (_send_link, bypassing the write queue), so the pace is entirely the station's. The legacy VB6 behaved identically but hid it behind its modal progress dialog, which only closed on the first ~H (Case "H" … CloseProgress, Main.frm).

Status: accepted as-is (2026-06-12). Deliberately not in BACKLOG.md — Andy reviewed and chose to leave it alone for now. If it ever bothers users enough to revisit, the candidate mitigations (perception-only; the wire-level wait cannot be shortened) were:

  1. Carry last-known per-slot statuses across controller.refresh() instead of blanking them, rendered dimmed ("off" status role) until the first ~H reconfirms — instant repopulate, but shows stale data for the gap, which is why it should never render full-strength.
  2. Cosmetic only: a clearer "pending" placeholder than "—" while loading.
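If mitigation 1 is ever revisited, the idea could look roughly like the sketch below. It is an illustration only, not code from the repo: StatusCache, SlotStatus, begin_refresh and the "normal" role name are hypothetical, and only the "off" role and the dash placeholder come from the notes above.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

PLACEHOLDER = "\u2014"  # the dash the STATUS column shows while it has no data


@dataclass
class SlotStatus:
    text: str        # last status text seen for this slot
    confirmed: bool  # True once a ~H since the last refresh has reported it


class StatusCache:
    """Hypothetical holder for per-slot statuses that survives a refresh."""

    def __init__(self) -> None:
        self._slots: Dict[int, SlotStatus] = {}

    def apply_status(self, statuses: Dict[int, str]) -> None:
        # Called when a ~H frame has been decoded: these values are fresh.
        for slot, text in statuses.items():
            self._slots[slot] = SlotStatus(text, confirmed=True)

    def begin_refresh(self) -> None:
        # Instead of blanking, keep the last-known text but mark it stale.
        for status in self._slots.values():
            status.confirmed = False

    def render(self, slot: int) -> Tuple[str, str]:
        # Returns (text, role); stale or missing values always use "off".
        status = self._slots.get(slot)
        if status is None:
            return PLACEHOLDER, "off"
        return status.text, ("normal" if status.confirmed else "off")
```

The point of the confirmed flag is that a stale value can never be rendered at full strength: only apply_status sets it, and every refresh clears it.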

2026-06-12 — Burst writes silently dropped by the station ("only the second toggle stuck")

Symptom: toggle device A's Active off, wait ~2 s, toggle device B's off, then Refresh — A came back ON (its writes never landed on the station); only B stuck.

Root cause: the controller transmitted commands back-to-back, but the station processes one inbound frame at a time and silently drops the rest of a burst. The 2026-06-12 capture proves it: 16 burst channel writes drew only 2 Z1 ACKs. The legacy never burst — MakeCommand only queued, and each inbound Z from the station popped exactly one command onto the wire (Main.frm SendCommand); its "Working n" status caption was that queue's depth.

Fix: an outbound command queue (controller._send/_pump/_handle_poll) that keeps one command in flight. Z1 confirms and advances; Z0 retransmits; a Z2 idle while a command is unconfirmed means the frame was dropped, so it is retransmitted. Retransmission is safe because every queued command is idempotent (an improvement over the legacy, which lost silently dropped frames), and it is capped at 3 transmissions, after which the command is dropped with an errorOccurred so a dead station can't wedge the queue. Link frames (Z1 data-frame ACKs, Z2 idles) bypass the queue because they are the handshake. The simulator now Z1-ACKs every command frame like the real station (it previously applied bursts perfectly, which is exactly why this class of bug was invisible in sim). pendingWritesChanged(depth) drives a footer WORKING dot (the modern counterpart of the legacy "Working n" caption); regression suite: tests/da07/test_write_queue.py.
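The queue rules above can be sketched roughly as follows. This is a minimal illustration, not the code in controller._send/_pump/_handle_poll: the class name WriteQueue, its method names and the two callbacks are hypothetical, link frames are reduced to the bare strings "Z0", "Z1" and "Z2", and it assumes the 3-transmission cap applies to Z0 retransmits as well as to Z2 idles.

```python
from collections import deque
from typing import Callable, Deque, Optional

MAX_TRANSMISSIONS = 3  # cap named in the entry above


class WriteQueue:
    """Hypothetical one-in-flight command queue paced by station ACKs."""

    def __init__(self, transmit: Callable[[str], None],
                 on_error: Callable[[str], None]) -> None:
        self._transmit = transmit          # puts one command on the wire
        self._on_error = on_error          # stands in for errorOccurred
        self._pending: Deque[str] = deque()
        self._in_flight: Optional[str] = None
        self._attempts = 0

    @property
    def depth(self) -> int:
        # Value a pendingWritesChanged-style signal would report.
        return len(self._pending) + (1 if self._in_flight is not None else 0)

    def send(self, command: str) -> None:
        self._pending.append(command)
        self._pump()

    def handle_link(self, frame: str) -> None:
        if self._in_flight is None:
            return                         # nothing unconfirmed: plain idle
        if frame == "Z1":                  # ACK: confirm and advance
            self._in_flight = None
            self._pump()
        elif frame in ("Z0", "Z2"):        # NAK, or idle while unconfirmed
            self._retransmit()

    def _pump(self) -> None:
        # Start the next command only when nothing is in flight.
        if self._in_flight is None and self._pending:
            self._in_flight = self._pending.popleft()
            self._attempts = 1
            self._transmit(self._in_flight)

    def _retransmit(self) -> None:
        if self._attempts >= MAX_TRANSMISSIONS:
            dropped, self._in_flight = self._in_flight, None
            self._on_error(f"no ACK after {MAX_TRANSMISSIONS} transmissions: {dropped}")
            self._pump()                   # a dead command must not wedge the queue
        else:
            self._attempts += 1
            self._transmit(self._in_flight)
```

Because a command is only ever retransmitted verbatim, this scheme relies on the idempotence noted above: a duplicate that does arrive must leave the station in the same state.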

Capture hook upgrade: CIM_DA07_CAPTURE now records outbound frames too, prefixed "> " (inbound lines stay bare) — so the next hardware session can see both sides of the handshake.
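For reading such a capture back, a small helper along these lines may be handy. It is a sketch that assumes one frame per line and relies only on the "> " prefix convention described above; split_capture is a hypothetical name, and the frame strings in the test data are placeholders rather than real wire frames.

```python
from typing import Iterable, List, Tuple

OUTBOUND_PREFIX = "> "  # outbound lines carry this prefix; inbound lines are bare


def split_capture(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Return (direction, frame) pairs, where direction is "out" or "in"."""
    result: List[Tuple[str, str]] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        if line.startswith(OUTBOUND_PREFIX):
            result.append(("out", line[len(OUTBOUND_PREFIX):]))
        else:
            result.append(("in", line))
    return result
```

Counting the "out" entries against the inbound Z1 lines is then a quick way to compare writes sent with ACKs received.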

Verify on hardware next session: repeat the two-device toggle test (toggle A, wait, toggle B, Refresh — BOTH must stick), and watch the WORKING dot drain. Note the design assumes the station Z1s every accepted command (observed for writes); if some commands are never ACKed they will retransmit ×3 and surface one error.


2026-06-12 — Tiny "popup window" flashing on every write (suite-wide kit bug)

Symptom: a small, empty, native-decorated window (app icon + min/max/close, label-sized) flashed on top of the app on every settings write — 8× on a Devices-tab Active toggle (one per channel write). Screenshot: docs/samples/DA-07 flashing pop-up.png. Not DA-07-specific and not the VB6 "Working" indicator — it fired on any grid rebuild in any module; the optimistic-apply change (below) just multiplied the rebuilds that exposed it.

Root cause: SummaryStrip.set_summary (core/ui/kit/summary_strip.py) replaced its count labels with label.setParent(None) while the labels were visible. Reparenting a visible widget to None promotes it to a top-level window, which Windows shows with full native decoration until the deferred deleteLater runs. Fix: hide the dying label and let deleteLater collect it while still parented (one line + regression test tests/core/kit/test_summary_strip.py::test_set_summary_never_orphans_a_visible_label).

How it was found — the CIM_UI_SPY diagnostic (keep for next time): the flash could not be reproduced by any synthetic probe (offscreen platforms swallow it, and QTest clicks missed the toggle hotspot), but an env-gated window spy (core/ui/window_spy.py, hooked in shell/app.py) run in the user's live session logged 84 ghost windows with full creation stack traces pointing at the exact line. Usage:

$env:CIM_UI_SPY="$PWD\ui-spy.log"; .venv\Scripts\python -m cim_suite.shell.app --simulate

Lesson: when a UI glitch reproduces for the user but not for instrumented probes, instrument the user's own session instead of approximating it.


2026-06-12 — Tag column truncation confirmed + first real capture (BL-E5 part 1)

Setup: same session as the stale-model entry below. Ran an instrumented refresh against the station with CIM_DA07_CAPTURE — the repo's first real DA-07 capture, saved as tests/da07/fixtures/capture-2026-06-12-refresh.txt (157 frames: A/B/C/D/E/F/G/M/Z).

Findings:

  • BL-E5 confirmed byte-for-byte. A CT channel's ~E frame ends …[alarm byte]B321281B04CB9CEF — alarm byte then the 8-byte probe serial, no disp field, no name field. The phantom disp read ate B3, leaving 21281B04CB9CEF in the Tag column (the user-reported "first two characters cut off"). The legacy VB6 had the same phantom read (second PullBase1 into its hidden Disp column), so it truncated too. Fixed per BL-E5 part 1: disp removed everywhere; Tag shows name → serial → catalog default.
  • No ~H frame in the whole load — second real capture in a row without one (the load finishes via the idle-settle timer, not the ~H fast path). Yet the STATUS column does populate eventually on hardware, so ~H must arrive later, in the steady state — periodic, which is exactly what made the stale-model revert (below) fire every second or two.
  • No ~P frame even with CT sensors attached, so BL-E5 part 2 (where a custom channel name echoes back) is still blocked on a capture. Until then, a written Tag name lives only in the local model and reverts on Refresh (legacy did the same).
  • The CS-31 catalog entry (A2C0831CS-31) carries no default channel names — CT channels are identified by probe serial, hence serial-in-Tag is the correct legacy-faithful display.
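The truncation in the first finding can be reproduced with a toy cursor-based reader. This sketch is not the real decoder: pull, serial_with_phantom_disp and serial_fixed are hypothetical names, and it assumes the phantom disp read consumed exactly one byte (two hex characters), which is what the observed "B3" loss suggests.

```python
from typing import Tuple

# The 8-byte probe serial at the end of the ~E frame, as quoted above.
SERIAL_TAIL = "B321281B04CB9CEF"


def pull(buf: str, pos: int, width: int) -> Tuple[str, int]:
    """Read `width` characters at `pos`; return the field and the new position."""
    return buf[pos:pos + width], pos + width


def serial_with_phantom_disp(tail: str) -> str:
    # Old behaviour: a disp field that does not exist is read first,
    # so it swallows the first two characters of the serial.
    _disp, pos = pull(tail, 0, 2)
    return tail[pos:]


def serial_fixed(tail: str) -> str:
    # Fixed behaviour: no disp field, the serial is the full 16 hex characters.
    serial, _pos = pull(tail, 0, 16)
    return serial
```

With the quoted tail, the phantom read leaves 21281B04CB9CEF (14 characters), matching the user-reported truncation, while the fixed read returns all 16.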

2026-06-12 — Edits revert within seconds once live frames flow ("stale-model rebuild")

Setup: first DA-07 with one pod attached. Reported on the Devices tab: the Active switch toggles, then snaps back within ~1-2 s, but only after the STATUS column populates. Same on the Channels tab Active column, and (predicted, same mechanism) on every editable field of the Devices / Channels / Station / Alarm tabs.

Root cause: the controller's set_* methods wrote the command to the wire but never updated the local models, and the DA-07 protocol has no settings echo — a real station never re-sends a value you just wrote. (The legacy VB6 had no such problem because its grid was the model: Grid2_AfterEdit sent the command and the grid simply kept the edited cell.) On real hardware the station streams periodic frames in the steady state:

| Inbound frame | Controller signal(s) | Tabs that rebuild |
|---|---|---|
| ~H realtime status | devicesChanged + alarmsChanged | Devices, Alarm (and Channels' device combo) |
| ~G inputs / ~F averages | channelsChanged | Channels |

Each signal rebuilds the grid from the stale model, visually reverting the edit. That's why toggles "worked" before STATUS populated: no ~H yet → no rebuilds → the checkbox kept its widget-local state (the model was wrong the whole time).

Why the simulator missed it: the sim only sends ~H once, at the end of a refresh — never periodically — so nothing ever rebuilt the Devices tab after load. (Channels-tab live ticks would have shown it, but nobody toggled Active in live sim mode and the tests always called refresh() between write and assert, which re-reads the sim's state and hides the gap.)

Fix — "optimistic apply" (2026-06-12): every controller setter now applies the value to its local model immediately after sending, and emits the matching *Changed signal. The station remains the source of truth — the next Refresh overwrites local state with whatever the station actually stored.

  • domain/controller.py: the whole "settings writes" section (set_station_setting, _set_device_field, set_device_type, _set_channel_field, _set_limit, set_alarm_*, remove_device).
  • domain/models.py: DeviceTable.remove, ChannelTable.remove_device.
  • Regression tests: tests/da07/test_optimistic_writes.py (includes UI-level tests that replay the exact symptom: toggle, then deliver a periodic ~H/~G, assert no revert).

The general rule this leaves behind: any outbound DA-07 mutation must update the local model in the same call, because nothing inbound will. If a new setter is added and its edit "reverts after a second or two" on hardware, the optimistic apply was forgotten.
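A stripped-down model of the rule, for illustration only. ToyController, simulate_toggle and the SET-ACTIVE string are hypothetical stand-ins (the real setters, signals and wire format live in domain/controller.py and the encoder); the sketch only shows why a rebuild from an un-updated model looks like a revert.

```python
from typing import Callable, Dict, List


class ToyController:
    """Hypothetical controller: one editable field, one change signal."""

    def __init__(self, send: Callable[[str], None], optimistic: bool) -> None:
        self._send = send
        self._optimistic = optimistic
        self.active: Dict[int, bool] = {}            # local model
        self.listeners: List[Callable[[], None]] = []

    def _emit_devices_changed(self) -> None:
        for listener in list(self.listeners):
            listener()

    def set_device_active(self, slot: int, value: bool) -> None:
        # Placeholder command text, not the real wire format.
        self._send(f"SET-ACTIVE {slot} {int(value)}")
        if self._optimistic:
            # Nothing inbound will echo the write, so update the model now.
            self.active[slot] = value
            self._emit_devices_changed()

    def on_periodic_frame(self) -> None:
        # A periodic ~H/~G carries no settings; it only triggers a rebuild.
        self._emit_devices_changed()


def simulate_toggle(optimistic: bool) -> bool:
    """Toggle slot 1 off, deliver one periodic frame, return what the grid shows."""
    controller = ToyController(send=lambda _cmd: None, optimistic=optimistic)
    controller.active[1] = True
    shown: Dict[int, bool] = {}
    controller.listeners.append(lambda: shown.update(controller.active))
    controller.set_device_active(1, False)
    controller.on_periodic_frame()
    return shown[1]
```

Without the optimistic apply, simulate_toggle(False) returns True (the grid is rebuilt from the stale model and the edit appears to revert); with it, simulate_toggle(True) returns False and the edit holds.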

Consequences / things this changes:

  • The §5.6 "pending → echoed" write-feedback marker on the tabs now resolves on the next rebuild (≈1 s on hardware via ~H/~G), not on a true echo — the DA-07 simply has no per-write confirmation. A NAK (Z0) still triggers a resend of the last frame only.
  • set_device_active writes one frame per channel; on a NAK only the last frame is resent (pre-existing limitation, unchanged).

Open questions for the next hardware session:

  • Confirm the write actually landed on the station: toggle Active, wait, hit Refresh — the value must survive the full reload (proves the ~D write frame itself is accepted, which the revert previously made impossible to observe). If it does NOT survive, there is a second bug in the write frames themselves (protocol/encoder.py::set_channel_field etc.) that was hidden behind this one.
  • DA-12 may have the same latent bug. Its set_* methods also send without updating models (modules/da12/domain/controller.py), and its sim also applies writes without echoing. DA-12 was assumed to re-stream full A sensor records unsolicited (which would refresh config naturally) — verify on real DA-12 hardware whether edits revert when C/I value frames rebuild the Sensors tab. If they do, port the same optimistic-apply pattern.

(Add new entries above this line, newest first, dated, with: symptom → root cause → why the sim missed it → fix locations → open questions.)