Unit Testing Embedded C on the Host

Why Host-Side Tests Beat Hardware-Loop Tests Most of the Time

The orthodoxy in embedded C testing used to be "the only real test is the test on the target." It's still true that you can't certify your final product without on-target validation — but if every test runs on hardware, your CI loop is slow, brittle, and impossible to scale across a multi-product portfolio. We've moved most of our firmware test mass to host-side execution. The result: a unit-test suite that runs in 4 seconds, gates every PR, and catches roughly 80% of the regressions we used to find a week later on a development board.

The core insight is that most firmware bugs are not hardware bugs. They are:

Logic bugs — off-by-one in a state machine, incorrect lookup in a calibration table, missing default branch in a switch.
Lifecycle bugs — uninitialised variables, double-free in a custom allocator, use-after-free in a callback.
Concurrency bugs — ISR-to-task race conditions, missing memory barriers.
Protocol-encoding bugs — bad checksums, misaligned struct packing, endianness errors in network code.

Nearly all of these reproduce on a host CPU when you have the right abstraction over the hardware. The main exception is a barrier bug that depends on the target's memory ordering, which a strongly ordered x86 host can hide. The bugs that actually need silicon (timing-sensitive interrupt latencies, peripheral quirks, electrical issues) are a much smaller set, and they belong in a smaller, slower hardware-in-the-loop (HIL) test suite that runs nightly rather than per-PR.

"A CI loop that takes 40 minutes is a CI loop nobody runs locally. A 4-second test suite gets run by the engineer before they push, and that's where defects get caught cheapest." — Pioneer Horizon firmware team

Wrapping the HAL — The Single Most Important Refactor

Production firmware that calls HAL_I2C_Master_Transmit directly from business logic is untestable on a host — the symbol doesn't exist, the peripheral doesn't exist, and the call signature is wedded to a specific vendor's HAL. Before any host-side testing is possible, you need a clean abstraction layer. We call it the portable HAL, or PHAL.

Shape of the abstraction

The PHAL is a thin interface — a few dozen functions — that exposes everything firmware needs from the hardware in a vendor-neutral way:

phal_i2c_xfer(bus, addr, tx, tx_len, rx, rx_len, timeout_ms)
phal_spi_xfer(bus, tx, rx, len)
phal_gpio_write(pin, level) / phal_gpio_read(pin)
phal_timer_now_us() — monotonic microsecond clock
phal_flash_write(addr, buf, len) / phal_flash_erase(sector)

On target, each call dispatches to the vendor HAL (CubeHAL, Zephyr driver, etc.). On host, each call dispatches to a mock or simulator we control from test code. The business logic — state machines, calibration routines, packet encoders — sees only the PHAL.
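A minimal sketch of that dispatch, assuming link-time substitution: one header, with a host implementation linked into test binaries and a target implementation linked into firmware. The return conventions, the file split, and the alarm module are illustrative assumptions; only the phal_gpio_* names come from the list above.

```c
#include <assert.h>
#include <stdint.h>

/* ---- phal.h (hypothetical): the only hardware header business logic includes ---- */
int phal_gpio_write(uint8_t pin, int level);   /* assumed: returns 0 on success */
int phal_gpio_read(uint8_t pin);

/* ---- phal_host.c (hypothetical): linked into test binaries only.
 * The target build would link a different file that forwards to the vendor HAL. */
#define HOST_PIN_COUNT 32u
static int host_pins[HOST_PIN_COUNT];

int phal_gpio_write(uint8_t pin, int level) {
    if (pin >= HOST_PIN_COUNT) return -1;
    host_pins[pin] = level ? 1 : 0;
    return 0;
}

int phal_gpio_read(uint8_t pin) {
    return pin < HOST_PIN_COUNT ? host_pins[pin] : 0;
}

/* ---- alarm.c (hypothetical): business logic, written against the PHAL only ---- */
#define ALARM_PIN   7u
#define ALARM_ON_C  85   /* illustrative thresholds */
#define ALARM_OFF_C 80

typedef struct { int active; } alarm_t;

void alarm_update(alarm_t *a, int temp_c) {
    if (!a->active && temp_c >= ALARM_ON_C)      a->active = 1;
    else if (a->active && temp_c <= ALARM_OFF_C) a->active = 0;
    phal_gpio_write(ALARM_PIN, a->active);
}

/* ---- test_alarm.c: runs on the host, no hardware involved ---- */
void test_alarm_stays_on_inside_hysteresis_band(void) {
    alarm_t a = { 0 };
    alarm_update(&a, 85);   /* trips the alarm */
    alarm_update(&a, 82);   /* below the trip point, above the release point */
    assert(phal_gpio_read(ALARM_PIN) == 1);
}
```

The alarm module never learns which implementation it was linked against, which is the property that makes everything above the PHAL host-testable.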

Common objections

Engineers push back with two arguments. First: "this adds a function-call layer of overhead." In practice, the compiler inlines most PHAL calls when the wrappers are visible to it (static inline functions in a header, or link-time optimisation), and where it doesn't, the cost is a few cycles per peripheral op, negligible compared to the I2C bus turnaround itself. Second: "we already wrote against CubeHAL." Refactoring is real work; we usually do it incrementally, starting with the modules that have the most logic and the least direct hardware coupling (packet encoders, state machines), then working outward.

The payoff: every module above PHAL is testable on a Linux dev box, in a container, in CI, in under a second.

Mock vs Simulate — Choosing the Right Fake

"Mock" and "simulate" are often used interchangeably, but they answer different questions. A mock asserts that the right calls were made; a simulator computes what the response should be. We use both, and getting the choice right per peripheral is the difference between tests that catch bugs and tests that catch no bugs but make CI pass.

Mock when the behaviour is opaque

For an I2C bus driver test, the question is: "given this state machine input, does the driver issue the correct sequence of I2C transactions?" We don't care what the slave does — we care that the master behaves correctly. A mock records every phal_i2c_xfer call and lets the test assert: "the driver issued these 5 transactions in this order, with these addresses, these payloads."

We use a hand-rolled mock framework (~200 LOC of C) rather than CMock. It's vendored, debuggable, and we control the API. CMock generates a lot of code we don't read.
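To make the idea concrete, here is a minimal sketch of a recording mock for phal_i2c_xfer. It is not our framework: the types, the return convention (0 for success), the mock_* helpers, and the sensor_init routine with its register values are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* ---- Mock: records every call and returns a scripted status ---- */
#define MOCK_MAX_CALLS 16
#define MOCK_MAX_TX    8

typedef struct {
    uint8_t bus, addr;
    uint8_t tx[MOCK_MAX_TX];
    size_t  tx_len, rx_len;
} i2c_call_t;

static i2c_call_t mock_calls[MOCK_MAX_CALLS];
static size_t     mock_call_count;
static int        mock_status[MOCK_MAX_CALLS];   /* per-call return value, 0 = OK */

void mock_i2c_reset(void) {
    memset(mock_calls, 0, sizeof mock_calls);
    memset(mock_status, 0, sizeof mock_status);
    mock_call_count = 0;
}

/* Host-side replacement for the PHAL call (signature is assumed). */
int phal_i2c_xfer(uint8_t bus, uint8_t addr, const uint8_t *tx, size_t tx_len,
                  uint8_t *rx, size_t rx_len, uint32_t timeout_ms) {
    (void)timeout_ms;
    assert(mock_call_count < MOCK_MAX_CALLS && tx_len <= MOCK_MAX_TX);
    i2c_call_t *c = &mock_calls[mock_call_count];
    c->bus = bus; c->addr = addr; c->tx_len = tx_len; c->rx_len = rx_len;
    if (tx_len) memcpy(c->tx, tx, tx_len);
    if (rx_len) memset(rx, 0, rx_len);           /* opaque slave: reads return zeros */
    return mock_status[mock_call_count++];
}

/* Returns 1 if the recorded calls equal `want` exactly, in order. */
int mock_i2c_matches(const i2c_call_t *want, size_t n) {
    if (n != mock_call_count) return 0;
    for (size_t i = 0; i < n; i++) {
        const i2c_call_t *got = &mock_calls[i];
        if (want[i].bus != got->bus || want[i].addr != got->addr ||
            want[i].tx_len != got->tx_len || want[i].rx_len != got->rx_len ||
            memcmp(want[i].tx, got->tx, got->tx_len) != 0) return 0;
    }
    return 1;
}

/* ---- Code under test: hypothetical sensor bring-up sequence ---- */
#define SENSOR_ADDR 0x1Du
int sensor_init(uint8_t bus) {
    static const uint8_t reset[]  = { 0x20, 0x80 };   /* made-up control register */
    static const uint8_t enable[] = { 0x20, 0x07 };
    int rc = phal_i2c_xfer(bus, SENSOR_ADDR, reset, sizeof reset, NULL, 0, 10);
    if (rc != 0) return rc;
    return phal_i2c_xfer(bus, SENSOR_ADDR, enable, sizeof enable, NULL, 0, 10);
}

void test_sensor_init_issues_reset_then_enable(void) {
    const i2c_call_t want[] = {
        { .bus = 1, .addr = SENSOR_ADDR, .tx = { 0x20, 0x80 }, .tx_len = 2 },
        { .bus = 1, .addr = SENSOR_ADDR, .tx = { 0x20, 0x07 }, .tx_len = 2 },
    };
    mock_i2c_reset();
    sensor_init(1);
    assert(mock_i2c_matches(want, 2));
}

void test_sensor_init_stops_after_failed_reset(void) {
    mock_i2c_reset();
    mock_status[0] = -1;                         /* fail the first transaction */
    sensor_init(1);
    assert(mock_call_count == 1);
}
```

Comparing the whole recorded sequence in one helper keeps each test to a single assertion while still checking order, addresses, and payloads.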

Simulate when the behaviour matters to the test

For testing a sensor calibration routine that reads from an accelerometer and applies a temperature compensation curve, you need real-looking data on the bus. We simulate the accelerometer: a host-side struct that holds register state, responds to I2C reads with realistic values, accepts I2C writes that change its state. The simulator can be parametric — "return acceleration data corresponding to a 1g step input with this noise profile" — which lets us test signal-processing code against synthetic stimuli that would be impossible to generate on real hardware.
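A simulator of this kind can be sketched as a register file sitting behind the host-side phal_i2c_xfer. Everything specific below is an assumption for illustration: the register map, the 1 mg per LSB scale, the linear 2 mg per degree drift model, and the accel_read_z_mg routine do not describe a real part or our production code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* ---- Simulated accelerometer: register state behind an I2C address ---- */
#define ACC_ADDR    0x19u
#define REG_TEMP    0x0Cu   /* int8: degrees C relative to 25 C (hypothetical) */
#define REG_OUT_Z_L 0x2Cu   /* int16 little-endian, 1 LSB = 1 mg (hypothetical) */
#define ACC_REG_COUNT 64u

typedef struct { uint8_t regs[ACC_REG_COUNT]; uint8_t ptr; } acc_sim_t;
static acc_sim_t sim;

/* Test-side knob: load the stimulus the firmware should observe. */
void acc_sim_set(int16_t z_mg, int8_t temp_offset_c) {
    memset(&sim, 0, sizeof sim);
    sim.regs[REG_OUT_Z_L]     = (uint8_t)((uint16_t)z_mg & 0xFFu);
    sim.regs[REG_OUT_Z_L + 1] = (uint8_t)((uint16_t)z_mg >> 8);
    sim.regs[REG_TEMP]        = (uint8_t)temp_offset_c;
}

/* Host PHAL (assumed signature): the first TX byte selects a register, further
 * TX bytes are written, RX bytes are read back with auto-increment. */
int phal_i2c_xfer(uint8_t bus, uint8_t addr, const uint8_t *tx, size_t tx_len,
                  uint8_t *rx, size_t rx_len, uint32_t timeout_ms) {
    (void)bus; (void)timeout_ms;
    if (addr != ACC_ADDR) return -1;             /* no device at this address */
    if (tx_len > 0) {
        sim.ptr = (uint8_t)(tx[0] % ACC_REG_COUNT);
        for (size_t i = 1; i < tx_len; i++) {
            sim.regs[sim.ptr] = tx[i];
            sim.ptr = (uint8_t)((sim.ptr + 1u) % ACC_REG_COUNT);
        }
    }
    for (size_t i = 0; i < rx_len; i++) {
        rx[i] = sim.regs[sim.ptr];
        sim.ptr = (uint8_t)((sim.ptr + 1u) % ACC_REG_COUNT);
    }
    return 0;
}

/* ---- Code under test: read Z and remove an assumed 2 mg/degree drift ---- */
int accel_read_z_mg(uint8_t bus, int32_t *out_mg) {
    uint8_t reg = REG_OUT_Z_L, raw[2], t;
    if (phal_i2c_xfer(bus, ACC_ADDR, &reg, 1, raw, 2, 10) != 0) return -1;
    reg = REG_TEMP;
    if (phal_i2c_xfer(bus, ACC_ADDR, &reg, 1, &t, 1, 10) != 0) return -1;
    int16_t z = (int16_t)(uint16_t)(raw[0] | (raw[1] << 8));
    *out_mg = (int32_t)z - 2 * (int8_t)t;
    return 0;
}

void test_accel_z_is_temperature_compensated(void) {
    int32_t mg = 0;
    acc_sim_set(1020, 10);   /* 1 g plus 20 mg of drift at 10 degrees above reference */
    assert(accel_read_z_mg(0, &mg) == 0 && mg == 1000);
}
```

A parametric version would replace acc_sim_set with a generator that refreshes the output registers on every read, for example from a step function plus a seeded noise source.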

The split we settle on

Drivers — mocked. The test of an I2C driver is "did it issue the right transactions?"
Sensor stacks — simulated. The test of a Kalman filter is "did it produce the right state estimate from this input stream?"
Communication code — mostly simulated. We have a small TCP-loopback simulator that emulates the modem's AT command set and the upstream MQTT broker.
Flash and storage — simulated. A host-side flash simulator with configurable sector size, write granularity, and induced bit-error rates lets us test wear-leveling and ECC.
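As an illustration of the storage case, a flash simulator can be quite small. The sketch below assumes NOR-style semantics (erase sets a sector to 0xFF, a write can only clear bits), a fixed hypothetical geometry, and a deterministic single-bit flip in place of a configurable error rate; the flash_sim_* helpers and the PHAL types are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SIM_FLASH_SIZE   4096u
#define SIM_SECTOR_SIZE  1024u   /* hypothetical geometry */
#define SIM_SECTOR_COUNT (SIM_FLASH_SIZE / SIM_SECTOR_SIZE)

static uint8_t  flash_mem[SIM_FLASH_SIZE];
static uint32_t erase_count[SIM_SECTOR_COUNT];

/* Fresh part: fully erased, no wear recorded. */
void flash_sim_reset(void) {
    memset(flash_mem, 0xFF, sizeof flash_mem);
    memset(erase_count, 0, sizeof erase_count);
}

int phal_flash_erase(uint32_t sector) {
    if (sector >= SIM_SECTOR_COUNT) return -1;
    memset(&flash_mem[sector * SIM_SECTOR_SIZE], 0xFF, SIM_SECTOR_SIZE);
    erase_count[sector]++;                       /* lets tests inspect wear */
    return 0;
}

int phal_flash_write(uint32_t addr, const uint8_t *buf, size_t len) {
    if (addr > SIM_FLASH_SIZE || len > SIM_FLASH_SIZE - addr) return -1;
    for (size_t i = 0; i < len; i++)
        flash_mem[addr + i] &= buf[i];           /* NOR: bits go 1 -> 0 only */
    return 0;
}

/* Test-side inspection and fault injection. */
const uint8_t *flash_sim_peek(uint32_t addr)  { return &flash_mem[addr]; }
uint32_t flash_sim_erase_count(uint32_t sector) { return erase_count[sector]; }
void flash_sim_flip_bit(uint32_t addr, unsigned bit) {
    flash_mem[addr] ^= (uint8_t)(1u << bit);
}

void test_write_without_erase_cannot_set_bits(void) {
    const uint8_t first = 0x0F, second = 0xF0;
    flash_sim_reset();
    phal_flash_write(0, &first, 1);
    phal_flash_write(0, &second, 1);             /* overwrite without erasing */
    assert(*flash_sim_peek(0) == 0x00);
}
```

The AND in phal_flash_write is what catches storage code that forgets to erase before rewriting, and the per-sector erase counter gives a wear-leveling test something to assert on.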

The CI Matrix — What Runs When

Splitting test mass across speed tiers is where CI becomes practical engineering. Below is what we ship by default; the numbers come from a recent industrial telemetry product:

Tier 1 — Per-commit (under 30 seconds)

All host-side unit tests — 412 tests, ~4 seconds.
Static analysis — clang-tidy, cppcheck — ~12 seconds.
Compile against three target configs (debug, release, size-optimised) — ~14 seconds.

Gates every PR. Engineer feedback loop is under a minute. Catches roughly 80% of regressions.

Tier 2 — Per-merge (under 10 minutes)

Tier 1 tests, plus —
Integration tests on QEMU for ARM Cortex-M, which runs the whole firmware binary on an emulated Cortex-M machine on a Linux host, driven by a Python test harness over a virtual UART. ~6 minutes.
Coverage analysis (gcov + lcov), with a hard fail at <75% line coverage on changed files.
Build the OTA image and verify signature against a test key.

Gates merge to main. Catches most of the remaining structural regressions.

Tier 3 — Nightly HIL (45–90 minutes)

Tier 1 + Tier 2 tests, plus —
Hardware-in-the-loop on a dedicated test fixture: 6 boards on a rack, controllable power supply, I2C/SPI bus monitors, a programmable noise injector on power rails.
Full OTA cycle including signature verification on real silicon.
Power-glitch test suite (32 induced brown-outs at 4 ms increments during boot).

Catches the silicon-specific bugs the host simulator can't reproduce.

Tier 4 — Pre-release (multi-day)

72-hour soak test on 12 units across temperature corners (-20°C, +25°C, +70°C).
Power-cycle endurance — 10,000 reboots back-to-back.
Modem reconnect endurance — 5,000 forced reconnects.

Gates a version tag entering production. The OTA-power-loss matrix from our OTA article lives here.

Test Discipline — What We've Learned From Bad Tests

Writing tests is easy. Writing tests that don't lie is hard. Below are the rules we ended up with after several years of fixing tests that passed while bugs shipped.

Tests must fail before they pass

Every test starts life by failing — either because the feature doesn't exist yet (TDD) or because we've introduced a deliberate regression to confirm the test would catch it. A test that has never been seen to fail is a test we don't trust.

One assertion per test, named after the behaviour

Test names like test_i2c_driver_handles_nack_on_address_with_three_retries_then_returns_io_error read badly but document the contract. We accept the verbosity in exchange for the documentation value.

No sleeps, no real-time dependencies

Tests that depend on sleep() or wall-clock time are flaky by construction. Our PHAL's phal_timer_now_us is settable in test mode — tests advance time deterministically. We've removed every sleep call from test code as a matter of policy.
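A minimal sketch of a settable clock, assuming the test build provides its own phal_timer_now_us. The phal_test_* helpers and the link watchdog are hypothetical names used for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* ---- Host PHAL clock: time only moves when the test says so ---- */
static uint64_t fake_now_us;

uint64_t phal_timer_now_us(void)            { return fake_now_us; }
void     phal_test_set_time_us(uint64_t t)  { fake_now_us = t; }
void     phal_test_advance_us(uint64_t dt)  { fake_now_us += dt; }

/* ---- Code under test: hypothetical link watchdog ---- */
typedef struct {
    uint64_t last_rx_us;
    uint64_t timeout_us;
} link_wd_t;

void link_wd_feed(link_wd_t *w) {
    w->last_rx_us = phal_timer_now_us();
}

int link_wd_expired(const link_wd_t *w) {
    return phal_timer_now_us() - w->last_rx_us >= w->timeout_us;
}

/* A two-second timeout, checked to the microsecond, without waiting. */
void test_link_wd_expires_exactly_at_timeout(void) {
    link_wd_t w = { .timeout_us = 2000000 };
    phal_test_set_time_us(0);
    link_wd_feed(&w);
    phal_test_advance_us(1999999);
    int before = link_wd_expired(&w);
    phal_test_advance_us(1);
    int after = link_wd_expired(&w);
    assert(!before && after);
}
```

Because the clock is a plain variable, the boundary condition (one microsecond before the timeout versus exactly at it) is reproducible on every run, which a sleep-based test cannot guarantee.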

Coverage is necessary but not sufficient

75% line coverage as a hard CI floor catches obvious omissions, but a test can hit every line without actually verifying behaviour. We review tests at PR time for the question "what bug would this test fail to catch?" — and we fail the PR if the answer is "everything."
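The gap between coverage and verification is easy to show with a small sketch. The checksum routine and both tests below are illustrative, not taken from a real codebase.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Code under test: two's-complement checksum, chosen so that the bytes of a
 * frame plus its checksum sum to zero modulo 256. */
uint8_t checksum8(const uint8_t *buf, size_t len) {
    uint8_t sum = 0;
    for (size_t i = 0; i < len; i++)
        sum = (uint8_t)(sum + buf[i]);
    return (uint8_t)(0u - sum);
}

/* Executes every line of checksum8, so line coverage reports 100%,
 * yet it would still pass if the function returned a constant. */
void test_checksum_runs(void) {
    const uint8_t frame[] = { 0x10, 0x20, 0x30 };
    (void)checksum8(frame, sizeof frame);
}

/* Identical coverage, but the assertion pins the contract. */
void test_checksum_is_twos_complement_of_byte_sum(void) {
    const uint8_t frame[] = { 0x10, 0x20, 0x30 };   /* byte sum = 0x60 */
    assert(checksum8(frame, sizeof frame) == 0xA0);
}
```

Both tests contribute the same lines to the coverage report; only the second one fails when the arithmetic is wrong, which is the distinction the review question is meant to surface.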

Flaky tests are a P0 incident

One flaky test erodes trust in the entire suite. A test that fails intermittently is either fixed within a week or deleted. We've never regretted deletion; we've often regretted tolerance.

If you'd like a walk-through of how this CI matrix wires into your existing CubeIDE / VS Code / Bazel workflow, send us your repository — we'll do a one-hour audit and come back with a concrete migration plan.