As mentioned in part 1, I wasn't in the mood for classic Hardware-in-the-Loop training (where the model learns to regulate speed entirely on its own), so I took a shortcut. I logged several hundred thousand records of the running PID regulator in different conditions:
- no load
- constant load
- ramping load up
- ramping load down
- and a few cases in between
All done with simple Python scripts logging values streamed out via UART/DMA - zero CPU overhead on the MCU side.
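A minimal sketch of what such a logging script can look like. Everything here is illustrative (function names are mine, the CSV layout is assumed): the `lines` iterable would in practice come from something like pyserial's `Serial.readline()`.

```python
import csv

def parse_sample(line: str):
    """Parse one 'error,current,output' UART line into three ints."""
    error, current, output = (int(v) for v in line.strip().split(","))
    return error, current, output

def log_samples(lines, path: str):
    """Append parsed samples to a CSV file, one [error, current, output] row each."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["error", "current", "output"])
        for line in lines:
            writer.writerow(parse_sample(line))
```

In the real setup the loop just runs until enough records are captured; dropping a malformed line now and then is cheaper than blocking the stream.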
Choosing the model
ML has plenty of models - some people use MATLAB, but I love what Linux has to offer: Python + Torch + CUDA combo.
LPT: Create a dedicated virtual environment and pip-install everything needed: torch, mlflow, scikit-learn, numpy, pandas, etc. All the Python scripts + training data set I used are in this repo
Data quality matters
Data is everything in ML training. Garbage in = garbage out.
- Incomplete dataset? If I only had data from the motor at no load, the model would never learn how to handle variable load. That’s why I collected under changing conditions.
- Useless variables? Feeding the model data that has no real impact on the output just leads to overfitting and wastes RAM on the MCU.
For us, only three variables matter:
- error = setpoint - measured RPM
- current through $R_{SENSE}$
- PID output value
The tensor is just an array of [error, current, output]. Our base unit: three int16_t values (6 bytes).
Linear regression as the baseline
If you don’t know which model is best - start with the simplest one. Linear regression over a short history window is surprisingly good at approximating a control law like PID.
Lags - the secret sauce
Pure readings aren’t enough. The model somehow needs to know whether the load is increasing or decreasing. That’s where lags come in: n previous samples give historical context.
```mermaid
flowchart LR
    A["Current tensor<br/>error, current, output<br/>3 × int16_t = 6 bytes"] --> B[Input tensor]
    C["Lag 1<br/>t-1<br/>6 bytes"] --> B
    D["Lag 2<br/>t-2<br/>6 bytes"] --> B
    E["Lag n<br/>t-n<br/>6 bytes"] --> B
    B --> model["Linear regression model<br/>9 lags = 10 samples<br/>10 × 6 = 60 bytes of RAM"]
```
How many lags? Start with a few (3-10) and check training results. Over a hundred is typical overkill and an invitation to overfit. But remember: on a tiny MCU, more lags = more RAM and compute.
- 3 lags → 4 tensors (3 historical + current) = 12 × int16_t = 24 bytes
- 20 lags → 21 tensors = 63 × int16_t = 126 bytes
Appetite grows fast. And every extra tensor needs to be shifted and fed in real time under the RTOS - those are cycles you pay for.
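Building the lagged input rows is a plain sliding window over the logged samples. A NumPy sketch (function name is mine; each output row is the current [error, current, output] triple concatenated with its n predecessors, oldest first):

```python
import numpy as np

def add_lags(samples: np.ndarray, n_lags: int) -> np.ndarray:
    """samples: (N, 3) array of [error, current, output] rows.
    Returns (N - n_lags, (n_lags + 1) * 3): each row is a window of
    n_lags + 1 consecutive samples, flattened oldest-to-newest."""
    windows = np.lib.stride_tricks.sliding_window_view(
        samples, n_lags + 1, axis=0
    )  # shape: (N - n_lags, 3, n_lags + 1), window axis last
    return windows.transpose(0, 2, 1).reshape(len(samples) - n_lags, -1)
```

Note this silently drops the first samples that lack full history, which matches the data-prep step below; the training target (the current PID output) is taken from the last column separately so it isn’t leaked into the input.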
Data preparation
Before training:
- group data into current tensor + n historical lags (drop the first n samples, which lack full history)
- split 80% train / 20% validation
- normalize everything
- shuffle to kill sequential bias
With clean, shuffled data - train.
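The prep steps above fit in a few lines of NumPy. A sketch under the stated assumptions (80/20 split, normalization with train-set statistics only, shuffle before splitting; the function name is mine):

```python
import numpy as np

def prepare(X: np.ndarray, y: np.ndarray, seed: int = 0):
    """Shuffle, split 80/20, and normalize X with train-set stats."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))           # shuffle to kill sequential bias
    X, y = X[idx], y[idx]
    split = int(0.8 * len(X))               # 80% train / 20% validation
    X_tr, X_va = X[:split], X[split:]
    y_tr, y_va = y[:split], y[split:]
    mu, sd = X_tr.mean(axis=0), X_tr.std(axis=0)
    sd = np.where(sd == 0, 1.0, sd)         # guard constant columns
    return (X_tr - mu) / sd, y_tr, (X_va - mu) / sd, y_va, (mu, sd)
```

Keep `mu` and `sd`: the exact same scaling has to be baked into the MCU-side preprocessing, or inference will see inputs the model never trained on.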
Training
Feed the model the training set epoch after epoch. Each epoch = one full pass over all samples. Model adjusts weights.
Every few epochs - validate on the holdout set and watch metrics (loss, $R^2$, RMSE).
```
Training the model for 3 lags in 50 epochs...
Epoch 10/50 | Avg Loss: 0.014768
Epoch 20/50 | Avg Loss: 0.014919
Epoch 30/50 | Avg Loss: 0.014742
Epoch 40/50 | Avg Loss: 0.014849
Epoch 50/50 | Avg Loss: 0.014857
Training done for 3 lags.
Successfully registered model 'PyTorchLinear3'.
Created version '1' of model 'PyTorchLinear3'.
[...]
→ Model recorded in MLflow
→ ONNX ready: ./models/PyTorchLinear3.onnx
```
I trained several models (3 to 11 lags). Training on CPU (20 threads) took ~1 minute per model for 50 epochs. GPU would be 10x faster, but for linear regression CPU is fine.
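The loop itself is textbook PyTorch. A minimal sketch, assuming full-batch training with Adam and MSE loss (the real scripts are in the repo; function name, optimizer and learning rate here are my illustrative choices):

```python
import torch

def train_linear(X: torch.Tensor, y: torch.Tensor, epochs: int = 50, lr: float = 1e-2):
    """Fit a single nn.Linear layer to (X, y) with MSE loss."""
    model = torch.nn.Linear(X.shape[1], 1)
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = torch.nn.MSELoss()
    for epoch in range(1, epochs + 1):
        opt.zero_grad()
        loss = loss_fn(model(X), y)  # one full pass = one epoch here
        loss.backward()
        opt.step()
        if epoch % 10 == 0:
            print(f"Epoch {epoch}/{epochs} | Avg Loss: {loss.item():.6f}")
    return model
```

With mini-batches the epoch loop grows an inner loop over a `DataLoader`, but for a model this small full-batch is perfectly fine on CPU.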
MLflow for supervising the learning process
MLflow was invaluable - a live web dashboard showing training progress, metrics, and hyperparameters.
Validation & metrics
After training we get metrics that tell us how accurate the inference will be and how much the model misses the true PID output.
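These three metrics are simple enough to compute by hand. A sketch (function name is mine; "relative error" here is RMSE over the output span, matching the 300…1000 range used in the table):

```python
import numpy as np

def metrics(y_true: np.ndarray, y_pred: np.ndarray, span: float):
    """Return (RMSE, R², relative error as a fraction of the output span)."""
    err = y_pred - y_true
    rmse = float(np.sqrt(np.mean(err ** 2)))
    ss_res = float(np.sum(err ** 2))
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))
    r2 = 1.0 - ss_res / ss_tot        # fraction of variance explained
    rel = rmse / span                 # e.g. 8.04 / 700 ≈ 1.15%
    return rmse, r2, rel
```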
Trained models (3-11 lags) results from mlflow:
| source run (lags) | RMSE | R² | Relative Error (%) |
|---|---|---|---|
| 3 lags | 12.45 | 0.942 | ~1.78% |
| 4 lags | 10.82 | 0.958 | ~1.55% |
| 5 lags | 9.61 | 0.967 | ~1.37% |
| 6 lags | 8.92 | 0.973 | ~1.27% |
| 7 lags | 8.47 | 0.978 | ~1.21% |
| 8 lags | 8.21 | 0.981 | ~1.17% |
| 9 lags (best) | 8.04 | 0.986 | ~1.15% |
| 10 lags | 8.12 | 0.984 | ~1.16% |
| 11 lags | 8.19 | 0.983 | ~1.17% |
Target PID output range: 300…1000 (700 units span).
Best model: linear regression with 9 lags
- RMSE ≈ 8.04 → average error ±8 units → 1.15% relative error
- R² = 0.986 → the model explains almost 99% of the variance in the PID output
Solid enough for real deployment.
Aftermath
From 3 to 11 lags, the winner is 9-lag linear regression.
Export to ONNX → STM X-CUBE-AI converts to C code.
Inference will need 10 tensors (9 historical + current) = 60 bytes of input data.
Stay tuned for Part 3 - inference on MCU and head-to-head comparison. Your bet: “fuck PID” or not…?