What Can Actually Go Wrong During an OTA
The reason OTA on cellular devices is hard isn't the protocol — HTTPS to a CDN, fetch a binary, flash it, reboot. The reason it's hard is that every one of those steps can be interrupted by power loss, signal loss, or a kernel panic mid-write, and the device has to come back. On a small fleet you can afford a 0.5% bricking rate and replace units; on a large industrial deployment you cannot. We work backwards from the failure modes.
The eight failure scenarios we plan for on every OTA pipeline:
- Power loss during download — modem disconnect or VCC dip mid-fetch.
- Power loss after download, before verify — image is in flash but unauthenticated.
- Power loss after verify, before swap — image is verified, active-bank pointer not yet flipped.
- Power loss during swap — bank pointer is being written when power dies.
- Power loss after swap, during first boot — new firmware is running but hasn't yet committed.
- Soft failure after swap — new firmware boots but crashes during init.
- Signal loss mid-download — partial image, need resume.
- Server-side malformed image — bad signature, wrong version, truncated.
"Every OTA design starts with a happy-path flow that fits on a slide. The real design is the eight asterisks that come after — the failure cases that each take a week to think through and a month to test." — Pioneer Horizon firmware team
A/B Partition Layout — The Foundation
A/B partitioning is the only design pattern we ship for OTA on field-deployed devices. Single-bank, in-place update (the so-called "wear-and-pray" pattern) is fundamentally unsafe — a power loss during in-place write leaves the device with neither an old nor a new image to boot. We see this pattern in older codebases and we replace it as a matter of policy.
The layout
On a 1 MB flash STM32H7, we typically partition as follows:
- 0x0800_0000 – 0x0801_FFFF (128 KB) — immutable bootloader. Signed at factory, RDP-protected, never updated in the field.
- 0x0802_0000 – 0x0802_FFFF (64 KB) — boot configuration page. Contains the active-bank pointer, version counters, rollback fuse mirror, and a CRC.
- 0x0803_0000 – 0x0807_FFFF (320 KB) — Slot A firmware.
- 0x0808_0000 – 0x080C_FFFF (320 KB) — Slot B firmware.
- 0x080D_0000 – 0x080F_FFFF (192 KB) — non-volatile data: calibration, logs, user settings.
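A layout like this is easy to get subtly wrong when a region is resized, so it can help to check it mechanically. The following is a minimal sketch in Python, not our production tooling. The partition names and the check_layout helper are hypothetical, and only the addresses come from the list above.

```python
# Hypothetical model of the partition map above; names are illustrative only.
FLASH_BASE = 0x0800_0000
FLASH_SIZE = 1024 * 1024  # 1 MB part, as described in the text

PARTITIONS = [
    # (name, first address, last address inclusive)
    ("bootloader",  0x0800_0000, 0x0801_FFFF),
    ("boot_config", 0x0802_0000, 0x0802_FFFF),
    ("slot_a",      0x0803_0000, 0x0807_FFFF),
    ("slot_b",      0x0808_0000, 0x080C_FFFF),
    ("nv_data",     0x080D_0000, 0x080F_FFFF),
]


def check_layout(partitions, base=FLASH_BASE, total=FLASH_SIZE):
    """Return {name: size in KB}; raise ValueError on gaps, overlaps or overrun."""
    expected_start = base
    sizes_kb = {}
    for name, first, last in partitions:
        if first != expected_start:
            raise ValueError(
                f"{name}: starts at {first:#x}, expected {expected_start:#x}"
            )
        if last < first:
            raise ValueError(f"{name}: end address is below start address")
        sizes_kb[name] = (last - first + 1) // 1024
        expected_start = last + 1
    if expected_start != base + total:
        raise ValueError("partitions do not cover the flash exactly")
    return sizes_kb


if __name__ == "__main__":
    for name, kb in check_layout(PARTITIONS).items():
        print(f"{name:12s} {kb:4d} KB")
```

Running a check like this in CI catches a slot that silently grows into its neighbour before it reaches a linker script.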
The boot configuration page — the single source of truth
Everything about which firmware is active is encoded in this one page. Specifically:
- active_slot — 0 or 1, indicates which slot the bootloader should jump to.
- pending_slot — the slot whose firmware was just installed and is being trial-booted.
- boot_attempts — number of attempted boots of pending_slot so far.
- committed — has pending_slot reported success?
- version_a, version_b — monotonic version counter for each slot.
- crc — over all of the above.
Critical detail: this page must be written atomically. A power loss mid-write leaves it with a bad CRC, and the bootloader treats a bad-CRC config page as "fall back to last known good," which is the safe behaviour. We deliberately size the structure to the STM32H7's flash-write granularity (one flash word is 256 bits = 32 bytes), so the CRC sits in the same word as the fields it protects and the whole record goes out in a single flash-word write.
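To make the "everything in one 32-byte word, CRC included" idea concrete, here is a host-side Python model of such a record. It is a sketch under assumptions. The field widths, the padding, the 0xFF "no pending slot" value and the choice of CRC-32 (via zlib) are ours for illustration, and the text above does not specify any of them.

```python
import struct
import zlib

# Assumed field widths: four 1-byte fields, two 32-bit version counters,
# 16 bytes of padding, then a 32-bit CRC. Total: 32 bytes (one flash word).
_BODY = struct.Struct("<BBBBII16x")
_CRC = struct.Struct("<I")
RECORD_SIZE = _BODY.size + _CRC.size  # 28 + 4 = 32
NO_PENDING = 0xFF  # hypothetical marker for "no trial boot in progress"


def pack_config(active_slot, pending_slot, boot_attempts, committed,
                version_a, version_b):
    """Serialise the fields and append a CRC over everything before it."""
    body = _BODY.pack(active_slot, pending_slot, boot_attempts,
                      int(committed), version_a, version_b)
    return body + _CRC.pack(zlib.crc32(body))


def unpack_config(record):
    """Return the fields as a dict, or None if the size or CRC is wrong."""
    if len(record) != RECORD_SIZE:
        return None
    body = record[:_BODY.size]
    (stored_crc,) = _CRC.unpack(record[_BODY.size:])
    if zlib.crc32(body) != stored_crc:
        return None  # caller falls back to last known good
    active, pending, attempts, committed, ver_a, ver_b = _BODY.unpack(body)
    return {
        "active_slot": active,
        "pending_slot": pending,
        "boot_attempts": attempts,
        "committed": bool(committed),
        "version_a": ver_a,
        "version_b": ver_b,
    }
```

A reader that gets None back never has to guess which fields of a torn record are trustworthy, which is the point of keeping the CRC in the same write unit as the data.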
Rollback Timer Placement and Trial-Boot Logic
The trial-boot pattern is what separates "OTA that bricks devices" from "OTA you can trust on 10,000 units." The new firmware doesn't immediately become permanent — it has to prove itself within a time window, or the bootloader rolls back automatically.
The state machine
The boot configuration page starts in state COMMITTED with one slot active. When OTA installs new firmware into the other slot:
- Bootloader-side: pending_slot is set, boot_attempts = 0, committed = false. Reboot.
- Bootloader sees pending_slot set and committed = false. It increments boot_attempts and jumps to pending_slot.
- New firmware runs. It must call ota_commit() within 60 seconds of boot, typically after it has confirmed network connectivity, sensor reads are healthy, and no critical fault has fired.
- If ota_commit() is called, committed = true is written and active_slot is set to pending_slot. The OTA is now permanent.
- If ota_commit() isn't called within 60 seconds, a watchdog tied to the rollback timer fires and resets the device. The bootloader retries the pending slot on each reset; once it sees boot_attempts > 2 on boot, it falls back to the previous slot.
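The steps above can be modelled on a host in a few lines, which is handy for unit-testing the decision logic before it goes anywhere near a bootloader. This is a simplified sketch. The BootConfig class and function names other than ota_commit() are hypothetical, and clearing pending_slot on rollback is our assumption rather than something the steps spell out.

```python
from dataclasses import dataclass
from typing import Optional

MAX_BOOT_ATTEMPTS = 2  # roll back once boot_attempts > 2, as in the steps above


@dataclass
class BootConfig:
    active_slot: int = 0
    pending_slot: Optional[int] = None  # None: no trial boot in progress
    boot_attempts: int = 0
    committed: bool = True


def ota_install(cfg: BootConfig, slot: int) -> None:
    """Mark a freshly written slot as pending a trial boot."""
    cfg.pending_slot = slot
    cfg.boot_attempts = 0
    cfg.committed = False


def bootloader_select(cfg: BootConfig) -> int:
    """Run on every reset; returns the slot the bootloader jumps to."""
    if cfg.pending_slot is None or cfg.committed:
        return cfg.active_slot
    if cfg.boot_attempts > MAX_BOOT_ATTEMPTS:
        # Trial boot failed too often: abandon it (assumed cleanup).
        cfg.pending_slot = None
        cfg.boot_attempts = 0
        cfg.committed = True
        return cfg.active_slot
    cfg.boot_attempts += 1
    return cfg.pending_slot


def ota_commit(cfg: BootConfig) -> None:
    """Called by the new firmware once it has proven itself."""
    cfg.active_slot = cfg.pending_slot
    cfg.committed = True


if __name__ == "__main__":
    cfg = BootConfig(active_slot=0)
    ota_install(cfg, slot=1)
    # New firmware never commits: three trial boots, then rollback to slot 0.
    print([bootloader_select(cfg) for _ in range(4)])
```

In this model the fourth reset is the one that rolls back, because the counter is checked before it is incremented.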
Why 60 seconds, why three attempts
60 seconds is enough for the modem to register on the network, the time-sync to settle, the application to read its sensors once, and the cloud to acknowledge a heartbeat. Less than that and we false-positive on legitimately slow cellular regions. More than that and we expose a long brick window if a hardware fault prevents ota_commit() from ever being called.
Three boot attempts handle the case where the new firmware boots, almost completes initialisation, then hits a hardware-dependent crash (e.g., the modem doesn't respond on the second cellular site). If all three attempts fail, we can be reasonably confident that the failure is in the firmware, not a transient.
The rollback counter — anti-downgrade
Separately from the trial-boot, we enforce version_pending ≥ version_active. Downgrades are blocked unless explicitly signed with a downgrade-marker flag. This is the anti-rollback mechanism that pairs with the secure-boot work in our STM32 secure boot article.
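The rule itself is small enough to state as code. A minimal sketch follows, where install_allowed and its downgrade_marker_signed parameter are hypothetical names. Verifying the marker's signature is out of scope here and is assumed to have happened before this check.

```python
def install_allowed(version_pending: int, version_active: int,
                    downgrade_marker_signed: bool = False) -> bool:
    """Anti-downgrade gate: accept equal or newer versions unconditionally.

    An older version is accepted only when the image carries a downgrade
    marker whose signature has already been verified (assumed upstream).
    """
    if version_pending >= version_active:
        return True
    return downgrade_marker_signed
```

Keeping this check separate from the trial-boot counter means a rollback to the previous slot is never mistaken for a downgrade request.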
The Test Matrix — Brown-outs at Every Stage
No amount of code review certifies an OTA pipeline. The only way we know an OTA is safe is by inducing brown-outs at every point in the cycle and observing recovery. Our standard certification matrix runs 64 brown-out tests across 16 boards. It takes about 9 hours of wall-clock time and runs in our HIL fixture.
The brown-out injection points
- During the initial HTTP GET — 8 different offsets in the download stream.
- Between download and signature verification — single trigger.
- During signature verification — 4 offsets in the verify routine.
- Between verify and slot write — single trigger.
- During flash erase of the target slot — sector-by-sector, 10 sectors.
- During flash write of the target slot — 8 different offsets covering header, code, vector table, trailer.
- Between slot write and boot config update — single trigger (this is the most dangerous window).
- During boot config update — bit-level granularity, all bits of the word.
- During the first boot — at 4 offsets: pre-init, mid-init, post-init pre-commit, mid-commit.
What we measure
- Pass — the device recovers to a valid firmware image (either old or new) within 2 reboots.
- Soft fail — the device recovers but needs more than 2 reboots (acceptable as a one-off; treated as a failure if it occurs in more than 2% of trials).
- Hard fail — the device bricks. Any single hard fail at any injection point gates release.
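These three outcomes translate directly into a release gate. The sketch below shows one way to score a run. The function names and the trial tuple format are hypothetical, and applying the 2% soft-fail limit per injection point (rather than across the whole run) is our assumption.

```python
from collections import Counter


def classify(recovered: bool, reboots: int) -> str:
    """Map one trial to pass / soft_fail / hard_fail."""
    if not recovered:
        return "hard_fail"
    return "pass" if reboots <= 2 else "soft_fail"


def release_blockers(trials, soft_fail_limit=0.02):
    """trials: iterable of (injection_point, recovered, reboots).

    Returns a list of (injection_point, reason) entries that gate release.
    """
    per_point = {}
    for point, recovered, reboots in trials:
        per_point.setdefault(point, Counter())[classify(recovered, reboots)] += 1

    blockers = []
    for point, counts in per_point.items():
        total = sum(counts.values())
        if counts["hard_fail"] > 0:
            blockers.append((point, "hard_fail"))  # any brick gates release
        elif counts["soft_fail"] / total > soft_fail_limit:
            blockers.append((point, "soft_fail_rate"))
    return blockers
```

An empty list means the run is clean; anything else names the injection point to investigate.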
How we trigger brown-outs
We use a programmable power supply (Rigol DP832) driven by Python over USB. It drops VCC to 0 V for a configurable window (currently 10–500 ms in 10 ms steps) when the firmware itself asserts a trigger pin, so the brown-out fires at exactly the intended instruction. The same fixture also exercises modem-disconnect scenarios by toggling the RF-kill pin on the modem.
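The control loop for a fixture like this can be sketched without committing to a particular instrument library. In the example below, send and wait_for_trigger are hypothetical callables that you would back with your own USB transport and GPIO reader. The ":OUTP CH1,OFF" / ":OUTP CH1,ON" strings follow the SCPI output command form as we understand it for this supply family and should be checked against the programming guide for your unit.

```python
import time


def window_sweep_ms(start=10, stop=500, step=10):
    """Brown-out durations to try: 10 ms to 500 ms in 10 ms steps."""
    return list(range(start, stop + 1, step))


def brown_out(send, wait_for_trigger, window_ms, channel="CH1",
              sleep=time.sleep):
    """Cut the supply output for window_ms once the firmware raises its trigger.

    send:             callable taking one SCPI command string (assumed transport)
    wait_for_trigger: callable that blocks until the trigger pin is asserted
    sleep:            injectable for testing; host-side sleep is coarse, so
                      treat short windows as approximate
    """
    wait_for_trigger()
    send(f":OUTP {channel},OFF")
    sleep(window_ms / 1000.0)
    send(f":OUTP {channel},ON")


if __name__ == "__main__":
    # Dry run with a fake transport that just prints the commands.
    for ms in window_sweep_ms()[:3]:
        brown_out(print, lambda: None, ms, sleep=lambda s: None)
```

Injecting the transport and the sleep function keeps the sequencing logic testable on a laptop with no supply attached.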
The first time we ran this matrix on a customer's existing OTA implementation, we found 11 hard-fail injection points. They had been running OTAs in production for a year — the bricks just hadn't statistically piled up yet. They were three months from a 50,000-unit deployment that would have surfaced the rate.
Operational Hygiene — The Last 5%
The code is one half. The OTA service that fronts the fleet is the other half. Five operational practices that pay back over the life of a deployment:
Staged rollouts, by default
A new firmware version goes to 1% of the fleet first. Wait 24 hours. If error telemetry stays within bounds, expand to 10%. Wait 24 hours. If still clean, full rollout. The cost is two days of release calendar. The benefit is that a regression that bricks 30% of units only bricks 0.3% of the fleet before you stop the rollout.
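The exposure arithmetic and the cohort selection can both be sketched briefly. This example assumes devices are assigned to stages by hashing a device ID, which is a common approach but not something the text above prescribes. The function names and the ID format are hypothetical.

```python
import hashlib

STAGES = [0.01, 0.10, 1.00]  # 1%, 10%, full fleet, as described above


def cohort_fraction(device_id: str) -> float:
    """Map a device ID to a stable value in [0, 1) using a hash."""
    digest = hashlib.sha256(device_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") / 2**32


def in_stage(device_id: str, stage_fraction: float) -> bool:
    """True if this device receives the update at the given stage.

    Because the cohort value is fixed per device, each stage is a superset
    of the previous one.
    """
    return cohort_fraction(device_id) < stage_fraction


def fleet_exposure(stage_fraction: float, brick_rate: float) -> float:
    """Fraction of the whole fleet lost if the rollout stops at this stage."""
    return stage_fraction * brick_rate


if __name__ == "__main__":
    # A regression that bricks 30% of units, caught at the 1% stage.
    print(f"{fleet_exposure(0.01, 0.30):.1%} of the fleet affected")
```

A stable hash means a device that was in the 1% cohort for one release stays an early adopter for the next, which may or may not be what you want; salting the hash with the release ID is one way to rotate cohorts.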
Telemetry on every boot
Every device reports — at minimum — its current firmware version, boot reason (cold/warm/watchdog/fault), uptime, and boot count. If a new version starts producing more watchdog-reset boot reasons than the previous version, the staged rollout halts automatically.
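A halt rule of this kind can be expressed as a comparison of watchdog-reset rates between the new and previous versions. The sketch below is one possible form. The report dictionary shape, the margin and the min_reports threshold are assumptions added so the rule does not fire on a handful of samples.

```python
def watchdog_rate(boot_reports) -> float:
    """Share of boots whose reported boot_reason is 'watchdog'."""
    if not boot_reports:
        return 0.0
    watchdog_boots = sum(1 for r in boot_reports if r["boot_reason"] == "watchdog")
    return watchdog_boots / len(boot_reports)


def should_halt(new_reports, baseline_reports, margin=0.0, min_reports=20) -> bool:
    """Halt the rollout if the new version resets via watchdog more often.

    margin:      tolerated increase in rate before halting (assumed knob)
    min_reports: do not decide until this many new-version boots are seen
    """
    if len(new_reports) < min_reports:
        return False
    return watchdog_rate(new_reports) > watchdog_rate(baseline_reports) + margin
```

In practice the same comparison would likely be repeated per hardware revision or region, since a fleet-wide average can hide a regression that only affects one cohort.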
Sign everything, including downgrades
Production firmware is signed. Test firmware is signed with a different key. Downgrades require a separately signed downgrade-marker. There is no "unsigned shortcut" — even in support scenarios, recovery firmware is signed by a recovery key that has its own rotation lifecycle.
Backwards compatibility on the boot config schema
If you change the boot config layout between firmware versions, you must support the old layout for migration. We've shipped fleets where a bootloader update assumed a new layout and bricked devices still running old firmware. Version your schema, support N-1 at minimum.
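A versioned loader makes the N-1 rule explicit. The following sketch is illustrative only. The schema numbers, the schema_version field and the idea that version 2 added boot_attempts are invented for the example and do not describe any real layout history.

```python
CURRENT_SCHEMA = 2  # hypothetical current boot config schema number


def migrate_v1_to_v2(old: dict) -> dict:
    """Hypothetical v1 -> v2 step: v2 adds boot_attempts, defaulting to 0."""
    new = dict(old)
    new["schema_version"] = 2
    new.setdefault("boot_attempts", 0)
    return new


MIGRATIONS = {1: migrate_v1_to_v2}


def load_boot_config(fields: dict) -> dict:
    """Accept the current schema or N-1; refuse anything else explicitly."""
    version = fields.get("schema_version")
    if version == CURRENT_SCHEMA:
        return dict(fields)
    if version == CURRENT_SCHEMA - 1:
        return MIGRATIONS[version](fields)
    raise ValueError(f"unsupported boot config schema: {version!r}")
```

Refusing unknown versions loudly is deliberate: a bootloader that guesses at an unfamiliar layout is how the bricking scenario described above happens.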
Practice the recovery drill quarterly
The OTA you've never had to roll back is the OTA you can't roll back. Once a quarter, we deliberately push a known-bad version (signed, but designed to fail commit) to a test slice of the fleet, watch the rollback fire automatically, and confirm the recovery path works.
If you're planning an OTA deployment and want a pre-mortem on your design before silicon goes out, send us the proposed architecture — we'll run the matrix mentally and come back with the gaps within a week.