As I said in Part 2, I tried to keep it dead simple: how to actually train your own ML model. I know ML purists would call it superficial - but let’s face it: I’m not here to teach you the secrets of deep learning. I want hardware devs to see a real-world way that works on actual silicon – without drowning in theory rabbit holes.
Let’s move on.
From ONNX to C code – let’s see what X-CUBE-AI does with it
Open CubeMX.
Make sure you have X-CUBE-AI installed (the add-on that turns ONNX models into C code – the language the MCU compiler actually speaks). It should show up in the Middleware section. Click “Add network”, name it something like “pid”, switch format from Keras to ONNX, browse to your file, set compression and optimization options, then hit “Analyze”.
If you get the error “Your model ir_version (10) is higher than the checker’s (9)”, that’s STM32 being stuck in the past. Downgrade your ONNX model:
```python
import onnx

model = onnx.load("./models/PyTorchLinear9.onnx")  # replace with your filename
model.ir_version = 9
onnx.save(model, "./pid.onnx")  # output model for X-CUBE-AI
```
This is what we’re looking to get:
```mermaid
flowchart LR
    A["values: error, current<br>2 × int16_t = 4 bytes"] --> B["Input tensor"]
    C["Lag 1<br>4 bytes"] --> B
    D["Lag _n_<br>4 bytes"] --> B
    E["Lag 9<br>4 bytes"] --> B
    B --> R["Linear regression model<br>samples: 1 current + 9 lags<br>size: 4 bytes × 10 = 40 bytes of RAM"]
    R --> Q["Output<br>1 × int16_t"]
```
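To make the tensor layout concrete, here is a minimal Python sketch of how the 1×20 input window could be assembled on the host side. The names and the zero-padding strategy during start-up are my own assumptions, not the article’s actual code:

```python
import numpy as np
from collections import deque

N_LAGS = 10  # the current sample plus 9 lagged ones

# Rolling window of the last 10 (error, current) pairs
history = deque(maxlen=N_LAGS)

def build_input(error, current):
    """Push the newest sample and flatten the window into the 1x20 tensor."""
    history.append((error, current))
    # Zero-pad while the buffer is still filling up (start-up transient)
    window = list(history) + [(0.0, 0.0)] * (N_LAGS - len(history))
    return np.asarray(window, dtype=np.float32).reshape(1, 2 * N_LAGS)

x = build_input(0.5, 0.25)
print(x.shape, x.nbytes)  # (1, 20) 80 -- the 80 input bytes from the report below
```

Note that in float32 the window is already 80 bytes; keeping the samples as int16_t would halve that.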
Once the analysis completes, X-CUBE-AI produces a report. Here’s a quick look:
```
-----------------------------------------------------------------------------------------
type          : onnx
compression   : high
optimization  : time
target/series : stm32f0
model_fmt     : float
model_name    : pid
params #      : 21 items (84 B)
-----------------------------------------------------------------------------------------
input 1/1     : 'input', f32(1x20), 80 Bytes, activations
output 1/1    : 'output', f32(1x1), 4 Bytes, activations
macc          : 21
-----------------------------------------------------------------------------------------
```
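The parameter count is easy to sanity-check by hand: a single dense layer with 20 inputs and 1 output has 20 weights plus 1 bias. The MAC-counting convention below is my reading of the report, not official X-CUBE-AI documentation:

```python
# Single dense layer: y = w.x + b, 20 inputs -> 1 output
n_inputs, n_outputs = 20, 1

n_weights = n_inputs * n_outputs   # 20 multiplications
n_params = n_weights + n_outputs   # plus 1 bias -> 21 items
param_bytes = n_params * 4         # stored as f32 -> 84 B
n_macc = n_weights + n_outputs     # 20 MACs + the bias add, matching "macc: 21"

print(n_params, param_bytes, n_macc)  # 21 84 21
```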
… and the summary:
```
Requested memory size by section - "stm32f0" target
------------------------------ ------- ------ ---- ---
module                            text rodata data bss
------------------------------ ------- ------ ---- ---
NetworkRuntime1020_CM0_GCC.a     6,212      0    0   0
network.o                          396      8  616 116
network_data.o                      36     16   88   0
lib (toolchain)*                 1,788      0    0   0
------------------------------ ------- ------ ---- ---
RT total**                       8,432     24  704 116
------------------------------ ------- ------ ---- ---
weights                              0     88    0   0
activations                          0      0    0  84
io                                   0      0    0   0
------------------------------ ------- ------ ---- ---
TOTAL                            8,432    112  704 200
------------------------------ ------- ------ ---- ---
```
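It’s worth decoding what that table means for the chip. A quick sanity check of the totals, using the usual linker-section convention (text + rodata + the initialized-data image live in Flash; data + bss live in RAM):

```python
# Per-module text sizes from the report
rt_text = {
    "NetworkRuntime1020_CM0_GCC.a": 6212,
    "network.o": 396,
    "network_data.o": 36,
    "lib (toolchain)": 1788,
}
rt_total_text = sum(rt_text.values())
print(rt_total_text)  # 8432, the "RT total" row

# TOTAL row: what actually lands in Flash and RAM
text, rodata, data, bss = 8432, 112, 704, 200
flash = text + rodata + data   # code + constants + initialized-data image
ram = data + bss               # initialized + zeroed variables
print(flash, ram)              # 9248 904
```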
We calculated 40 bytes of input data (10 samples × 2 values × 2 bytes). Here’s what X-CUBE-AI sees:
Our network seen by X-CUBE-AI
It mostly matches, and is almost correct.
Surprise: everything is float! That comes from how we exported the Torch model and from X-CUBE-AI’s defaults: it either quantizes to int8_t or keeps float. With int8 quantization we’d lose too much precision for PID control (we drive a 12-bit DAC, and trust me: an 8-bit DAC is not enough). So our input bytes ballooned to 80. And remember: the Cortex-M0 has no hardware FPU, so every float op is emulated in software.
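To see why int8 is a non-starter here, compare the resolutions. A back-of-the-envelope sketch:

```python
dac_levels = 2 ** 12    # 12-bit DAC: 4096 distinct output codes
int8_levels = 2 ** 8    # int8 quantization: 256 distinct values

codes_per_step = dac_levels // int8_levels
print(codes_per_step)   # 16 DAC codes collapse into one quantization step
```

In other words, an int8 output can only ever hit 1 in 16 of the DAC codes the controller is supposed to use.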
The code footprint? 8,432 bytes of text alone; add rodata and data and the runtime eats well over half of the STM32F051’s 16 KB of Flash. And we haven’t even accounted for the HAL routines, let alone the application code. Expected inference cycle period: 1 ms.
walking on thin ice toward the goal
You can already feel we’re pretty deep in shit, but let’s measure exactly how deep it really is.
Lord Kelvin said it best: “to measure is to know”.
For me, serious engineering talk starts and ends with numbers. No numbers = no real discussion, just daydreaming, guessing, or influencer talk. So I ran a series of clean builds, starting from the boomer-style PID I already had running. See the graph below slowly ramping up the tension, like a good thriller?
```mermaid
flowchart TD
    A["HAL + boomer PID + serial comm<br>FLASH: 16168 B / 16 KB = 98.68%"] -->|removed PID| B
    B["HAL + serial comm<br>FLASH: 15 KB / 16 KB = 93.75%"] -->|added ML| C
    C["HAL + ML + serial comm<br>FLASH: 26996 B / 16 KB = 164.77%"] -->|removed HAL UART| D
    D["HAL + ML<br>FLASH: 20816 B / 16 KB = 127.05%<br><b>FAILED TO FIT THE MODEL !!!</b>"]
```
My own hand-written control algorithm (uint64_t math + feed-forward) takes ~800 bytes → 4.93% of Flash. After removing everything even remotely optional, we’re still 27% over budget. And we haven’t looked at execution performance yet.
| Configuration | Flash usage |
|---|---|
| full HAL + boomer PID + serial comm | 16168 B / 16 KB (98.68%) |
| full HAL + serial comm | ~15 KB / 16 KB (93.75%) |
| full HAL + ML + serial comm | 26996 B / 16 KB (164.77%) |
| full HAL + ML | 20816 B / 16 KB (127.05%) |
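The percentages are easy to reproduce (16 KB = 16384 bytes):

```python
FLASH = 16 * 1024  # STM32F051: 16 KB of Flash

builds = {
    "full HAL + boomer PID + serial comm": 16168,
    "full HAL + ML + serial comm": 26996,
    "full HAL + ML": 20816,
}
usage = {name: round(used / FLASH * 100, 2) for name, used in builds.items()}
print(usage)  # 98.68, 164.77 and 127.05 percent respectively
```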
I could just declare “impossible” and close the ticket. Or bring in a beefier MCU - but then we’re no longer comparing the same thing. I took the third option: strip the DRV8842 driver from the code, starve the HAL as much as possible, and hope we claw back those 27%. Then we’ll see. Anyway, we have a winner here!
Classic PID vs ML – size score: 1:0
So I deleted everything non-essential, leaving the HAL configured with basically just the model plus a single GPIO pin for timing the inference loop. And suddenly:
```
[build] Memory region   Used Size  Region Size  %age Used
[build]            RAM:    5552 B         8 KB     67.77%
[build]          FLASH:   16176 B        16 KB     98.73%
[driver] Build completed: 00:00:00.841
```
There is light at the end of the tunnel after all.
performance
Measurement is brutally simple: set GPIO low → run the math → set GPIO high. Logic analyzer/DSO shows the rest.
64-bit classic PID + feed-forward on STM32F051
Best case: under 40 µs → up to 25 000 control loops per second. Not bad for old-school.
linear regression ML model, emulated FPU, same chip
93 µs → roughly 11 000 inferences per second. Honestly? Given the code size and the lack of an FPU, I was expecting something slower, around 0.5 ms. It’s a promising number… yet:
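Turning those timings into loop rates is one division (timings taken from the measurements above):

```python
pid_us, ml_us = 40, 93  # measured execution times in microseconds

pid_rate = 1_000_000 // pid_us
ml_rate = 1_000_000 // ml_us
print(pid_rate, ml_rate)  # 25000 and 10752 -> the "25 000" and "roughly 11 000" per second
```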
Classic PID vs ML – performance score: 2:0
aftermath
The oldest rule still holds:
no machine learning model is going to beat a clean, mathematically described algorithm on the same limited hardware.
Ten kilobytes of matrix multiplications lost in every category that actually matters. Sure, you could move this model to a bigger, more expensive MCU - it would fit, running about half the speed of the boomer PID. But then you can stick an “AI powered” label on it and sell it for ten times the price. Marketing won, I suppose.
Why did I spend time checking something I already knew the answer to? I wanted to feel the numbers: the measurable, exact difference - how much faster is the classic approach, really? And I wanted to show you the simplest possible ML deployment flow on a tiny MCU, with real numbers and real tools of the trade. No BS, and not a single cent spent on fancy commercial software.
Embedded development in 2026 is very different from ten years ago. But a good engineer doesn’t blindly chase whatever is fashionable this month and dismiss everything older. A good engineer keeps a healthy balance: stay curious, stay skeptical, keep trying new things. Science isn’t something to believe in. Science is about questioning and proving.
So - what do you think about this PID vs ML series? Was it interesting? Too shallow somewhere, perhaps? Do you need more detail on any part? Or should we - just for practice - try our chances with another model? Let me know!