As I said in Part 2, I tried to keep it dead simple: how to actually train your own ML model. I know ML purists would call it superficial - but let’s face it: I’m not here to teach you the secrets of deep learning. I want hardware devs to see a real-world way that works on actual silicon – without drowning in theory rabbit holes.
Let’s move on.

From ONNX to C code – let’s see what X-CUBE-AI does with it

Open CubeMX.
Make sure you have X-CUBE-AI installed (the add-on that turns ONNX models into C code – the language the MCU compiler actually speaks). It should show up in the Middleware section. Click “Add network”, name it something like “pid”, switch format from Keras to ONNX, browse to your file, set compression and optimization options, then hit “Analyze”.

If you get the error “Your model ir_version (10) is higher than the checker’s (9)”, that’s STM32 being stuck in the past. Downgrade your ONNX model:

import onnx

model = onnx.load("./models/PyTorchLinear9.onnx")  # replace with your filename
model.ir_version = 9                               # downgrade so the X-CUBE-AI checker accepts it
onnx.save(model, "./pid.onnx")                     # output model for X-CUBE-AI

This is what we’re looking to get:

flowchart TD
    A[Present sample<br/>values: error, current<br/>2 × int16_t = 4 bytes] --> B[Input tensor]
    C[Lag 1<br/>4 bytes] --> B
    D[Lag _n_<br/>4 bytes] --> B
    E[Lag 9<br/>4 bytes] --> B
    B --> R[Linear regression model<br/>samples: 1 current + 9 lags<br/>size: 4 bytes x 10 = 40 bytes of RAM]
    R --> Q[Output<br/>1 x int16_t]
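Under the hood, this “network” is a single dot product: 20 input values (the present error/current pair plus nine lagged pairs), 20 weights and one bias, i.e. 21 parameters. A plain-Python sketch of that arithmetic, with made-up weights (the real ones come from the trained model):

```python
# Linear regression "inference" the MCU ends up running: y = w.x + b
# Weights and inputs here are random placeholders, for illustration only.
import random

N_LAGS = 9                        # 9 lagged samples + 1 present sample
N_INPUTS = (N_LAGS + 1) * 2       # an (error, current) pair per sample -> 20

random.seed(0)
weights = [random.uniform(-1, 1) for _ in range(N_INPUTS)]
bias = 0.1
x = [random.uniform(-1, 1) for _ in range(N_INPUTS)]   # flattened input tensor

y = sum(w * xi for w, xi in zip(weights, x)) + bias    # one MACC per weight

params = len(weights) + 1         # 20 weights + 1 bias = 21 items
print(params, params * 4)         # 21 parameters, 84 B as float32
```

Those 21 parameters at 4 bytes each are exactly the “21 items (84 B)” the analyzer reports.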

Once the analysis completes, X-CUBE-AI returns a report. Here’s a quick look:


 ----------------------------------------------------------------------------------------- 
 type               :   onnx                                                               
 compression        :   high                                                               
 optimization       :   time                                                               
 target/series      :   stm32f0                                                              
 model_fmt          :   float                                                              
 model_name         :   pid                                                                
 params #           :   21 items (84 B)                                                    
 ----------------------------------------------------------------------------------------- 
 input 1/1          :   'input', f32(1x20), 80 Bytes, activations                          
 output 1/1         :   'output', f32(1x1), 4 Bytes, activations                           
 macc               :   21                                                                 
 ----------------------------------------------------------------------------------------- 

… and the summary:

 Requested memory size by section - "stm32f0" target 
 ------------------------------ ------- -------- ------ ----- 
 module                            text   rodata   data   bss 
 ------------------------------ ------- -------- ------ ----- 
 NetworkRuntime1020_CM0_GCC.a     6,212        0      0     0 
 network.o                          396        8    616   116 
 network_data.o                      36       16     88     0 
 lib (toolchain)*                 1,788        0      0     0 
 ------------------------------ ------- -------- ------ ----- 
 RT total**                       8,432       24    704   116 
 ------------------------------ ------- -------- ------ ----- 
 weights                              0       88      0     0 
 activations                          0        0      0    84 
 io                                   0        0      0     0 
 ------------------------------ ------- -------- ------ ----- 
 TOTAL                            8,432      112    704   200 
 ------------------------------ ------- -------- ------ -----
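As a sanity check, the section totals in that summary really do add up; a quick sketch with the per-module sizes copied from the table:

```python
# Per-module sizes from the X-CUBE-AI summary: (text, rodata, data, bss)
modules = {
    "NetworkRuntime1020_CM0_GCC.a": (6212, 0, 0, 0),
    "network.o":                    (396, 8, 616, 116),
    "network_data.o":               (36, 16, 88, 0),
    "lib (toolchain)":              (1788, 0, 0, 0),
}

# Column-wise sum gives the "RT total" row
rt_total = [sum(col) for col in zip(*modules.values())]
print(rt_total)        # [8432, 24, 704, 116]

# Add weights (88 B of rodata) and activations (84 B of bss) for the TOTAL row
total = [rt_total[0], rt_total[1] + 88, rt_total[2], rt_total[3] + 84]
print(total)           # [8432, 112, 704, 200]
```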

We calculated 40 bytes of input data (10 samples × 2 values × 2 bytes). Here’s what X-CUBE-AI sees:

Graph result of analysis

Our network seen by X-CUBE-AI

The report matches our expectations: the input tensor size is correct (20 values) and the output is a single value.

Or rather: it mostly matches, and is almost correct.

Surprise: everything is float! That’s the PyTorch export plus X-CUBE-AI defaults: the tool works with either float or int8 quantized types, and with int8 quantization we’d lose too much precision for PID control (we drive a 12-bit DAC; trust me, an 8-bit DAC is not enough). Our 40 bytes of input ballooned to 80. And remember: the Cortex-M0 has no hardware FPU, so every float op is emulated in software.
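Both complaints take two minutes of arithmetic to check, assuming this project’s 10-sample input window and 12-bit DAC:

```python
# Input tensor size: 10 samples, each an (error, current) pair
n_values = 10 * 2
print(n_values * 2)    # the int16_t plan: 40 bytes
print(n_values * 4)    # the float32 reality in the report: 80 bytes

# Quantization resolution vs. the 12-bit DAC
dac_codes = 2 ** 12              # 4096 output levels
int8_levels = 2 ** 8             # 256 levels after int8 quantization
print(dac_codes // int8_levels)  # each int8 step smears across 16 DAC codes
```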

The code footprint? 8,432 bytes of text, plus rodata, data and bss - 9,448 bytes in total, more than half of the STM32F051’s entire flash. And we haven’t even accounted for the HAL routines, not to mention the application code. Expected inference cycle period: 1 ms.

walking the thin ice toward the goal

You already feel we’re pretty deep in shit, but let’s measure exactly how deep it really is.

Lord Kelvin said it best: “to measure is to know”.

For me, serious engineering talk starts and ends with numbers. No numbers = no real discussion, just daydreaming, guessing or influencing. So I ran a series of clean builds, starting from the boomer-style PID I already had running. See the graph below slowly ramping up the tension as in a good thriller?

flowchart TD
    A[HAL + serial comm + PID<br/>FLASH: 16168 B / 16 KB<br/>98.68%] -->|removed PID| B
    B[HAL + serial comm<br/>FLASH: 15 KB / 16 KB<br/>93.75%] -->|added ML| C
    C[HAL + ML + serial comm<br/>FLASH: 26996 B / 16 KB<br/>164.77%] -->|removed HAL UART| D
    D[HAL + ML<br/>FLASH: 20816 B / 16 KB<br/>127.05%<br/><br/>**FAILED TO FIT THE MODEL !!!**]

My own hand-written control algorithm (uint64_t math + feed-forward) takes ~800 bytes → 4.93% of flash. Even after removing everything even remotely optional, we’re still 27% over budget. And we haven’t looked at execution performance yet.

 Configuration                          Flash usage
 ------------------------------------ ------------------ --------
 full HAL + boomer PID + serial comm    16168 B / 16 KB    98.68%
 full HAL + serial comm                   15 KB / 16 KB    93.75%
 full HAL + ML + serial comm            26996 B / 16 KB   164.77%
 full HAL + ML                          20816 B / 16 KB   127.05%
 ------------------------------------ ------------------ --------
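The percentages are nothing mysterious: build size divided by the F051’s 16 KB (16,384 B) of flash. A quick sketch reproducing them:

```python
FLASH = 16 * 1024    # STM32F051 flash: 16 KB = 16384 B

# Linker-reported flash usage per build configuration, in bytes
builds = {
    "full HAL + boomer PID + serial comm": 16168,
    "full HAL + ML + serial comm": 26996,
    "full HAL + ML": 20816,
}

usage = {name: round(size / FLASH * 100, 2) for name, size in builds.items()}
for name, pct in usage.items():
    print(f"{name}: {pct}%")   # 98.68%, 164.77%, 127.05% - matching the table
```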

I could just declare “impossible” and close the ticket. Or bring a beefier MCU - but then we’re no longer comparing the same thing. I took the third option: strip the DRV8842 driver from the code, starve HAL as much as possible and hope we claw back those 27%. Then we’ll see. Anyway, we have a winner here!

Classic PID vs ML – size score: 1:0

So I deleted everything non-essential. Left HAL configured with basically just the model + one single GPIO pin for timing the inference loop. And suddenly:

[build] Memory region         Used Size  Region Size  %age Used
[build]              RAM:        5552 B         8 KB     67.77%
[build]            FLASH:       16176 B        16 KB     98.73%
[driver] Build completed: 00:00:00.841

There is light at the end of the tunnel after all.

performance

Measurement is brutally simple: set GPIO low → run the math → set GPIO high. Logic analyzer/DSO shows the rest.

This is how a classic, high precision PID can perform on STM32

64-bit classic PID + feed-forward on STM32F051

Best case: under 40 µs → up to 25,000 control loops per second. Not bad for old-school.

ML Linear Regression model performance on STM32F051 with emulated FPU

linear regression ML model, emulated FPU, same chip

93 µs → roughly 11,000 inferences per second. Honestly? Given the code size and the lack of an FPU I was expecting worse, around 0.5 ms. It’s a promising number… yet:
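The loop rates above are just reciprocals of the measured times; a sketch of that back-of-envelope math:

```python
# Measured loop times on the STM32F051 (GPIO-toggle method, from this article)
pid_us = 40    # classic 64-bit PID + feed-forward, best case
ml_us = 93     # linear regression model, emulated FPU

print(round(1e6 / pid_us))   # 25000 control loops per second
print(round(1e6 / ml_us))    # ~10753, i.e. roughly 11,000 inferences/s
print(ml_us / pid_us)        # the ML loop is ~2.3x slower
```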

Classic PID vs ML – performance score: 2:0

aftermath

The oldest rule still holds:

no machine learning model is going to beat a clean, mathematically described algorithm on the same limited hardware.

Ten kilobytes of matrix multiplications lost in every category that actually matters. Sure, you could move this model to a bigger, more expensive MCU – it would fit, running at about half the speed of the boomer PID. But then you could stick an “AI powered” label on it and sell it for ten times the price. Marketing won, I suppose.

Why did I spend time checking something I already knew the answer to? I wanted to feel the numbers: the measurable, exact difference – how much faster is the classic approach, really? And I wanted to show you the simplest possible ML deployment flow on a tiny MCU, with real numbers and real tools of the trade. No BS, and not a single cent spent on fancy commercial software.

Embedded development in 2026 is very different from ten years ago. But a good engineer doesn’t blindly chase whatever is fashionable this month, nor dismiss everything older. A good engineer keeps a healthy balance: stay curious, stay skeptical, keep trying new things. Science isn’t something to believe in. Science is about questioning and proving.

So - what do you think about this PID vs ML series? Was it interesting? Too shallow somewhere, perhaps? Need more detail on any part? Or should we - just for practice - try our chances with another model? Let me know!