I2C Driver Bus-Hang Recovery — Detection and Field Hardening

What an I2C Bus Hang Looks Like in the Field

I2C bus hangs are the bug that most embedded engineers eventually encounter and that most of them fix wrong. The symptom is a sensor that worked yesterday returning -EIO today, indefinitely, until the device reboots. The cause is almost always the same: a slave device was reset or glitched mid-transaction while it was driving SDA low (to acknowledge a byte, or to send a 0 data bit during a read), and now it's stuck holding the line. The master, seeing SDA low when it tries to issue a START condition, treats the bus as busy and never starts the transaction.

What makes this so painful in production:

It happens at a rate below 0.1% per power cycle in clean lab conditions, so QA misses it.
It happens at 2–5% per month in the field, on noisy power rails or in ESD-prone environments.
The customer-visible failure is "the device is silent" — there's no helpful log because the I2C call hangs the task that was supposed to write the log.
Rebooting fixes it, so support closes the ticket as "transient." The pattern only emerges from fleet telemetry.

"We had a customer fleet returning 0.8% monthly RMA on 'unresponsive sensor.' Three months of investigation. The root cause was an I2C slave that held SDA low after a brown-out, and the master driver had no recovery path. Fix was 80 lines of code." — Pioneer Horizon firmware team

The good news: the fix is well-understood and isn't device-specific. Any I2C driver shipping to the field should include detection and recovery as a default behaviour, not an opt-in feature.

Detecting the Hang Without Hanging

The first design question is: how does the driver know the bus is hung, without itself hanging on the call that detects it? Many vendor HAL blocking calls (HAL_I2C_Master_Transmit and its relatives) wait on the bus, and when they are called without a finite timeout they can block indefinitely on a hang. That's the bug pattern that leaks the symptom up to the application as a "frozen task."

Three detection strategies, ordered by reliability

Pre-transaction bus check — before issuing START, read the current state of SDA and SCL as GPIOs. If SDA is low while SCL is high (the bus-idle state requires both high), the bus is hung. Return an error immediately rather than attempting the transaction. This is the cheapest check and catches most hangs at the point they're first observed.
Per-transaction timeout — every I2C transaction call carries a timeout (we default to 50 ms for sensor reads, longer for slow EEPROMs). If the call doesn't complete in that window, the driver aborts, resets the I2C peripheral, and returns an error. This catches hangs that happen mid-transaction.
Bus-busy state in the peripheral status register — most MCU I2C peripherals expose a "bus busy" flag. Polled at driver-init time, it tells you the bus is in an unexpected state before you've issued any transaction. Useful but vendor-quirky.
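
The pre-transaction check reduces to classifying two sampled pin levels. A minimal sketch in C, assuming the driver can read SDA and SCL as GPIO inputs; the type and function names here (i2c_line_sample_t, i2c_bus_classify, i2c_bus_may_start) are hypothetical and not part of any vendor HAL:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Levels sampled from the pins as GPIO inputs; true means the line reads high. */
typedef struct {
    bool sda_high;
    bool scl_high;
} i2c_line_sample_t;

typedef enum {
    I2C_BUS_IDLE,       /* SDA high, SCL high: safe to issue START */
    I2C_BUS_SDA_STUCK,  /* SDA low, SCL high: the hang signature */
    I2C_BUS_SCL_LOW     /* SCL low: clock stretching or a hardware fault */
} i2c_bus_state_t;

/* Classify one sample of the bus lines (illustrative names). */
i2c_bus_state_t i2c_bus_classify(const i2c_line_sample_t *s)
{
    assert(s != NULL);
    if (!s->scl_high) {
        return I2C_BUS_SCL_LOW;
    }
    return s->sda_high ? I2C_BUS_IDLE : I2C_BUS_SDA_STUCK;
}

/* Gate for the transaction path: only an idle bus may proceed to START. */
bool i2c_bus_may_start(const i2c_line_sample_t *s)
{
    return i2c_bus_classify(s) == I2C_BUS_IDLE;
}
```

Keeping the classification separate from the pin reads means the same logic can run on a host build with canned samples.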

The timeout-must-exist rule

Every I2C call in production firmware has a timeout. Always. No exceptions. We grep our codebase for raw HAL_I2C_Master_* calls and reject any PR that doesn't go through our PHAL wrapper, because the PHAL wrapper enforces a timeout. The 50 ms default has caught every hang we've seen in the field, and it is roughly 10x the worst-case clean-bus transaction time on a 400 kHz I2C bus with our largest sensor reads, so a healthy transaction should never trip it.
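
Two small helpers illustrate the arithmetic behind a timeout budget. This is a sketch under stated assumptions, not the PHAL wrapper itself: the function names are hypothetical, the tick is assumed to be a free-running 32-bit millisecond counter, and the transfer estimate counts nine SCL clocks per byte while ignoring START/STOP overhead and clock stretching.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Wraparound-safe elapsed-time test on a free-running millisecond tick.
 * Unsigned subtraction gives the right answer across a counter rollover. */
bool i2c_timed_out(uint32_t start_ms, uint32_t now_ms, uint32_t timeout_ms)
{
    return (uint32_t)(now_ms - start_ms) >= timeout_ms;
}

/* Estimated clean-bus transfer time in microseconds, rounded up.
 * n_bytes counts every byte on the wire, address bytes included. */
uint32_t i2c_clean_xfer_us(uint32_t n_bytes, uint32_t scl_hz)
{
    assert(scl_hz > 0);
    uint64_t clocks = (uint64_t)n_bytes * 9u;   /* 8 data bits + ACK */
    return (uint32_t)((clocks * 1000000u + scl_hz - 1u) / scl_hz);
}
```

For scale, a 200-byte transfer at 400 kHz comes out at 4500 µs by this estimate, about a tenth of a 50 ms timeout.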

What about hardware watchdogs

A hardware watchdog is the last line of defence — if the entire task hangs despite the timeout, the watchdog reboots the device. That's the recovery from a recovery failure, not the primary recovery. Recovery via reboot also means losing in-RAM state, sensor calibration in progress, and partial telemetry — much worse than recovery via bus-clear.

The GPIO Bus-Clearing Routine

Once the driver detects the bus is hung, the recovery has a specific, well-defined shape that's been industry practice for fifteen years and is documented in the NXP I2C-bus specification (UM10204, section 3.1.16). The fix is to manually clock the bus until the stuck slave releases SDA.

The procedure

1. Disable the I2C peripheral. Reconfigure SDA and SCL as open-drain GPIOs.
2. Release SCL so the pull-up takes it high.
3. Check SDA. If it's high, the bus is clear; jump to step 7.
4. Toggle SCL low, wait 5 µs, high, wait 5 µs. This is one clock pulse at 100 kHz.
5. Re-check SDA. If high, jump to step 7. If still low, return to step 4.
6. Repeat up to 9 times. If SDA is still low after 9 clocks, the bus is genuinely stuck; report an unrecoverable hang to the application.
7. Issue a manual STOP condition: SDA must rise while SCL is high. To avoid generating a START first, pull SCL low, pull SDA low, release SCL, then release SDA.
8. Reconfigure SDA and SCL as I2C alternate-function pins. Re-enable the I2C peripheral.

Why nine clocks

A byte transfer on I2C takes nine clocks: eight data bits plus the ACK bit. Wherever in the byte the slave was stuck, nine clocks are enough for it to finish shifting out the byte and release the line. We've never needed more than 9 in practice; we cap there to bound the recovery time at roughly 100 µs (nine 10 µs pulses plus the STOP).

Code sketch

The routine in our PHAL implementation is around 80 lines of C, structured as phal_i2c_bus_clear(bus) and called automatically by the driver's error path. The application doesn't see the recovery happening — it sees one I2C call return success on its next retry.
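
The following is an illustrative sketch of such a routine, not the PHAL code itself. It assumes the caller has already disabled the peripheral and switched both pins to open-drain GPIO, and it reaches the pins through a hypothetical table of hooks (bus_clear_io_t) so that it can run against a host-side fake. The fake_* items simulate a slave that holds SDA low until it has seen enough SCL rising edges; they exist only to exercise the logic. The STOP here lowers SDA while SCL is low, which is one common ordering and a choice made for this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical open-drain pin hooks: release=true lets the pull-up raise the
 * line, release=false drives it low. A real port maps these to GPIO registers. */
typedef struct {
    void (*set_scl)(void *ctx, bool release);
    void (*set_sda)(void *ctx, bool release);
    bool (*get_scl)(void *ctx);
    bool (*get_sda)(void *ctx);
    void (*delay_us)(void *ctx, unsigned us);
    void *ctx;
} bus_clear_io_t;

typedef enum { BUS_CLEAR_OK, BUS_CLEAR_SDA_STUCK, BUS_CLEAR_SCL_STUCK } bus_clear_result_t;
enum { BUS_CLEAR_MAX_CLOCKS = 9, BUS_CLEAR_HALF_PERIOD_US = 5 };

bus_clear_result_t i2c_bus_clear(const bus_clear_io_t *io)
{
    assert(io != NULL);
    io->set_sda(io->ctx, true);
    io->set_scl(io->ctx, true);
    io->delay_us(io->ctx, BUS_CLEAR_HALF_PERIOD_US);
    if (!io->get_scl(io->ctx)) {
        return BUS_CLEAR_SCL_STUCK;   /* SCL held low: no software recovery */
    }
    /* Clock SCL until the slave lets go of SDA, at most nine times. */
    for (int i = 0; i < BUS_CLEAR_MAX_CLOCKS && !io->get_sda(io->ctx); i++) {
        io->set_scl(io->ctx, false);
        io->delay_us(io->ctx, BUS_CLEAR_HALF_PERIOD_US);
        io->set_scl(io->ctx, true);
        io->delay_us(io->ctx, BUS_CLEAR_HALF_PERIOD_US);
    }
    if (!io->get_sda(io->ctx)) {
        return BUS_CLEAR_SDA_STUCK;   /* still low after nine clocks */
    }
    /* Manual STOP: lower SDA while SCL is low, then let SDA rise with SCL high. */
    io->set_scl(io->ctx, false);
    io->delay_us(io->ctx, BUS_CLEAR_HALF_PERIOD_US);
    io->set_sda(io->ctx, false);
    io->delay_us(io->ctx, BUS_CLEAR_HALF_PERIOD_US);
    io->set_scl(io->ctx, true);
    io->delay_us(io->ctx, BUS_CLEAR_HALF_PERIOD_US);
    io->set_sda(io->ctx, true);
    io->delay_us(io->ctx, BUS_CLEAR_HALF_PERIOD_US);
    return BUS_CLEAR_OK;
}

/* Host-side fake, for illustration only. */
typedef struct {
    bool scl_released, sda_released;  /* master side of each open-drain line */
    bool scl_shorted;                 /* simulates SCL stuck low */
    int  clocks_needed, clocks_seen;  /* slave frees SDA after this many edges */
    bool stop_seen;
} fake_bus_t;

static bool fake_slave_holds(const fake_bus_t *b) { return b->clocks_seen < b->clocks_needed; }
static void fake_set_scl(void *c, bool rel)
{
    fake_bus_t *b = c;
    if (rel && !b->scl_released) b->clocks_seen++;          /* SCL rising edge */
    b->scl_released = rel;
}
static void fake_set_sda(void *c, bool rel)
{
    fake_bus_t *b = c;
    if (rel && !b->sda_released && b->scl_released && !fake_slave_holds(b))
        b->stop_seen = true;                                /* SDA rose while SCL high */
    b->sda_released = rel;
}
static bool fake_get_scl(void *c) { fake_bus_t *b = c; return b->scl_released && !b->scl_shorted; }
static bool fake_get_sda(void *c) { fake_bus_t *b = c; return b->sda_released && !fake_slave_holds(b); }
static void fake_delay(void *c, unsigned us) { (void)c; (void)us; }

bus_clear_io_t fake_bus_io(fake_bus_t *b)
{
    bus_clear_io_t io = { fake_set_scl, fake_set_sda, fake_get_scl, fake_get_sda, fake_delay, b };
    return io;
}
```

On target, the same routine would be wrapped by code that disables the peripheral beforehand and restores the alternate-function pin configuration afterwards.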

Edge cases worth knowing

SCL stuck low — if SCL itself is low and won't go high, your slave has died completely or you've shorted the bus. No software recovery; this is a hardware fault.
Multi-master buses — if a second master is involved, the recovery sequence above can clobber its transaction. We don't ship multi-master in production firmware.
Slow slaves doing clock-stretching — don't mistake a clock-stretching slave for a hang. The pre-transaction check (SDA low while SCL high) won't fire here, because clock stretching pulls SCL low, not SDA.
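
These cases can be folded into one decision function. A hedged sketch: the names are hypothetical, and stretch_limit_us is an assumed per-board budget for how long a legitimate slave may hold SCL low, not a figure from the I2C specification. Multi-master arbitration is deliberately out of scope.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef enum {
    I2C_ACT_PROCEED,         /* both lines high */
    I2C_ACT_WAIT,            /* SCL low, still within the stretch allowance */
    I2C_ACT_BUS_CLEAR,       /* SDA low with SCL high: clock the slave out */
    I2C_ACT_HARDWARE_FAULT   /* SCL low for longer than the allowance */
} i2c_line_action_t;

/* Decide what to do from the sampled levels and how long SCL has been low. */
i2c_line_action_t i2c_line_policy(bool sda_high, bool scl_high,
                                  uint32_t scl_low_us, uint32_t stretch_limit_us)
{
    assert(stretch_limit_us > 0);
    if (!scl_high) {
        /* Never run bus-clear here: SCL low is stretching or a dead bus. */
        return (scl_low_us <= stretch_limit_us) ? I2C_ACT_WAIT
                                                : I2C_ACT_HARDWARE_FAULT;
    }
    return sda_high ? I2C_ACT_PROCEED : I2C_ACT_BUS_CLEAR;
}
```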

Which Sensor Families Hang Most

Not all I2C slaves hang at the same rate. Over four years of fleet telemetry across roughly 12,000 deployed units, we've kept a tally of bus-hang events by sensor family. The data is messy and population-skewed, but the relative ranking is consistent across customers and environments.

Worst offenders — hangs per million transactions

Low-cost humidity sensors (HDC1080, SHT2x first gen) — ~140 per million. The internal state machine is unhappy with VCC dips. We don't spec these for outdoor deployments any more.
Older EEPROMs (24LC256, AT24C256 clones) — ~85 per million. Write cycles are particularly prone to hangs if VCC dips during the internal program.
Cheap I2C IO expanders (PCA9555 unbranded) — ~70 per million. Genuine NXP parts are dramatically better; the count is dominated by counterfeit-suspect lots. Our counterfeit-detection article covers how we screen this.

Middle of the pack — 10–30 per million

MEMS pressure sensors (BMP280, BMP388) — better than HDC1080 but still ESD-sensitive on the SDA line.
Real-time clocks (DS3231, PCF8523) — backup-battery transitions are a small but real hang source.
Temperature sensors (TMP102, TMP117) — solid, but the cheaper TMP102 hangs more than the TMP117.

Best in class — under 5 per million

Modern MEMS IMUs (ICM-42688, BMI270, LSM6DSV) — engineered for cellular/portable environments, the I2C/SPI state machines are robust to brown-out.
Modern environmental sensors (SHT4x, BME688) — second-gen Sensirion and Bosch parts dramatically better than first-gen.
STMicro automotive-grade temperature/humidity — built for harsh power conditions.

What we recommend

For any deployment that's outdoor, battery-powered, or industrial, we now spec only sensors from the "best in class" tier. The unit-cost premium is typically 10–40 cents per part. The reduction in field-failure rate is roughly 20x. That maths is favourable on every fleet we've run it on.

Defaults Every I2C Driver Should Have

If you take one thing from this article, it's that bus-hang recovery should not be an opt-in flag — it should be a default. Here's the checklist we apply to every I2C driver we ship, vendor-HAL-wrapped or otherwise.

Driver-level defaults

Pre-transaction bus state check. Read SDA and SCL via GPIO before every START. If SDA is low while SCL is high, run bus-clear before continuing.
Per-call timeout. 50 ms default, configurable per device. Never wait indefinitely.
Bus-clear routine. Registered at driver init and callable from every error path. 9 SCL clocks, manual STOP, peripheral re-init.
Single retry on transient error. If a transaction fails, run bus-clear and retry once. Two consecutive failures escalate to the application as an unrecoverable error for that sensor.
Telemetry on every recovery. Count bus-clear invocations per sensor address. Report once per day. This is what surfaces the long-tail pattern that stays hidden when individual support tickets are closed as "transient."
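
The retry and telemetry items combine naturally into one wrapper. A minimal sketch, assuming the transfer and bus-clear operations are supplied as callbacks that return 0 on success; every name below is hypothetical, and fake_xfer and fake_clear are host-side stand-ins that fail a scripted number of times.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { I2C_OK = 0, I2C_ERR_UNRECOVERABLE = -1 };

typedef int (*i2c_xfer_fn)(void *ctx, uint8_t addr7);  /* 0 on success */
typedef int (*i2c_clear_fn)(void *ctx);                /* 0 if the bus was freed */

/* Per-address recovery counters, indexed by 7-bit address. */
typedef struct {
    uint32_t bus_clears[128];
    uint32_t unrecoverable[128];
} i2c_recovery_stats_t;

/* One attempt, then bus-clear plus a single retry, then escalate. */
int i2c_xfer_with_recovery(i2c_xfer_fn xfer, i2c_clear_fn clear, void *ctx,
                           uint8_t addr7, i2c_recovery_stats_t *stats)
{
    assert(xfer != NULL && clear != NULL && stats != NULL && addr7 < 128);
    if (xfer(ctx, addr7) == 0) {
        return I2C_OK;
    }
    stats->bus_clears[addr7]++;             /* counted for daily telemetry */
    if (clear(ctx) == 0 && xfer(ctx, addr7) == 0) {
        return I2C_OK;                      /* recovered on the single retry */
    }
    stats->unrecoverable[addr7]++;
    return I2C_ERR_UNRECOVERABLE;
}

/* Host-side fake for illustration: fails a scripted number of transfers. */
typedef struct {
    int fails_remaining;
    int clears;
} fake_dev_t;

int fake_xfer(void *ctx, uint8_t addr7)
{
    fake_dev_t *d = ctx;
    (void)addr7;
    if (d->fails_remaining > 0) {
        d->fails_remaining--;
        return -1;
    }
    return 0;
}

int fake_clear(void *ctx)
{
    fake_dev_t *d = ctx;
    d->clears++;
    return 0;
}
```

Counting before the retry means the telemetry records every recovery attempt, including the ones that succeed and would otherwise be invisible to the application.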

System-level defaults

One sensor failure does not block the rest. If sensor A is unrecoverable, the driver isolates it and continues servicing B, C, D. We've seen firmware that locked an entire RTOS task on one stuck sensor; that's a bug.
Watchdog tied to the I2C task. If the I2C task itself dies — not just one transaction — the hardware watchdog catches it. The driver's timeouts are first-line; the watchdog is last.
Power-rail check before retry. If the bus is hung and VCC is also dipping, a software bus-clear won't help. We sample the rail via ADC and only attempt recovery when the rail is healthy.
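
Two of these system-level rules are simple enough to sketch directly: gating recovery on rail voltage, and skipping isolated sensors when building the poll list. The names, the millivolt units, and the percentage tolerance are assumptions made for illustration; a real board would take its threshold from the sensor datasheets.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint8_t addr7;
    bool    isolated;   /* set once the sensor is reported unrecoverable */
} sensor_slot_t;

/* Rail gate: attempt recovery only when VCC has not dipped more than
 * tol_pct below nominal. The tolerance is an assumed per-board figure. */
bool rail_ok_for_recovery(uint32_t vcc_mv, uint32_t nominal_mv, uint32_t tol_pct)
{
    assert(tol_pct < 100);
    uint32_t floor_mv = nominal_mv - (nominal_mv * tol_pct) / 100u;
    return vcc_mv >= floor_mv;
}

/* Copy the addresses still in service into out[] (capacity n); returns the
 * count. Isolated sensors are skipped so one failure cannot block the rest. */
size_t sensors_in_service(const sensor_slot_t *slots, size_t n, uint8_t *out)
{
    assert(slots != NULL && out != NULL);
    size_t k = 0;
    for (size_t i = 0; i < n; i++) {
        if (!slots[i].isolated) {
            out[k++] = slots[i].addr7;
        }
    }
    return k;
}
```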

Validation

Test the bus-clear routine by deliberately hanging the bus: pull SDA low via a transistor controlled by another GPIO, then call a transaction. Confirm the driver detects, recovers, and the transaction completes on retry. This test runs in our nightly HIL suite — see our unit testing article for the broader framework.

If you're investigating a fleet with mysterious "silent sensor" RMAs, share the firmware and we'll review the I2C path. The fix is almost always 80 lines of code; the hard part is finding which 80.
