As mentioned in part 1, I wasn’t in the mood for classic Hardware-in-the-Loop training (where the model learns to regulate speed entirely on its own), so I took a shortcut. I logged several hundred thousand records of the running PID regulator under different conditions:

  • no load
  • constant load
  • ramping load up
  • ramping load down
  • and a few cases in between

All done with simple Python scripts dumping values via UART/DMA. Zero CPU overhead on the MCU side.

choosing the model

ML offers plenty of model families - some people use MATLAB, but I love what Linux has to offer: the Python + Torch + CUDA combo.

LPT: Create a dedicated virtual environment and pip-install everything needed: torch, mlflow, scikit-learn, numpy, pandas, etc. All the Python scripts + the training data set I used are in this repo.

Data quality matters

Data is everything in ML training. Garbage in = garbage out.

  • Incomplete dataset? If I only had data from the motor at no load, the model would never learn how to handle variable load. That’s why I collected under changing conditions.
  • Useless variables? Feeding the model data that has no real impact on the output just leads to overfitting and wastes RAM on the MCU.

For us, only three variables matter:

  • error = setpoint - measured RPM
  • current through R_SENSE
  • PID output value

The tensor is just an array of [error, current, output]. Our base unit: three numbers of int16_t size.
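As a sanity check on the sizes, a single sample can be sketched like this (the values are made-up placeholders, not real log data):

```python
import numpy as np

# One sample = [error, current, output], stored as int16 to match the MCU side.
# The values below are made-up placeholders, not real log data.
error = 120      # setpoint - measured RPM
current = 345    # reading across R_SENSE
output = 810     # PID output value

sample = np.array([error, current, output], dtype=np.int16)
print(sample.nbytes)  # 3 × int16_t = 6 bytes per sample
```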

Linear regression as the baseline

If you don’t know which model is best - start with the simplest one. Linear regression is a surprisingly good fit here: a discrete PID output is just a linear combination of the current error, the accumulated error, and the previous error, so a linear model with enough history can approximate it closely.
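A quick way to see why: the discrete PID law u[k] = Kp·e[k] + Ki·Σe + Kd·(e[k] − e[k−1]) is linear in the error history, so ordinary least squares can recover it exactly. A minimal sketch with hypothetical gains (not the ones from my firmware):

```python
import numpy as np

# Hypothetical gains, just for the demo.
Kp, Ki, Kd = 2.0, 0.5, 0.1
rng = np.random.default_rng(0)
e = rng.normal(size=200)          # synthetic error signal

# Discrete PID: u[k] = Kp*e[k] + Ki*sum(e[:k+1]) + Kd*(e[k] - e[k-1])
integ = np.cumsum(e)
u = Kp * e[1:] + Ki * integ[1:] + Kd * (e[1:] - e[:-1])

# Least squares over [e[k], e[k-1], integral] recovers the gains exactly:
X = np.column_stack([e[1:], e[:-1], integ[1:]])
w, *_ = np.linalg.lstsq(X, u, rcond=None)
print(np.round(w, 3))  # -> [Kp+Kd, -Kd, Ki] = [2.1, -0.1, 0.5]
```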

Lags - the secret sauce

Pure readings aren’t enough. The model somehow needs to know whether the load is increasing or decreasing. That’s where lags come in: the n previous samples give it historical context.

flowchart TD
    A["Present sample<br>error, current, output<br>3 × int16_t = 6 bytes"] --> B[Input tensor]
    C["Lag 1<br>t-1<br>6 bytes"] --> B
    D["Lag 2<br>t-2<br>6 bytes"] --> B
    E["Lag n<br>t-n<br>6 bytes"] --> B
    B --> model["Linear regression model<br>9 lags = 10 samples<br>10 × 6 = 60 bytes of RAM"]

How many lags? Start with a few (3-10) and check the training results. Over a hundred is a typical sign of overfitting. But remember: on a tiny MCU, more lags = more RAM and compute.

  • 3 lags → 4 tensors (3 historical + current) = 12 × int16_t = 24 bytes
  • 20 lags → 21 tensors = 63 × int16_t = 126 bytes

Appetite grows fast. And every extra tensor needs to be computed in real time under the RTOS - those are cycles you pay for.
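The lag window itself is trivial to assemble. A sketch of the grouping (make_window is my own illustrative helper, not from the repo):

```python
import numpy as np

N_LAGS = 3
FEATS = 3  # error, current, output

def make_window(history, n_lags):
    """Stack the current sample plus n_lags previous ones into one input vector."""
    # history: 2-D array of shape (samples, 3), newest row last
    window = history[-(n_lags + 1):]          # current + n_lags rows
    return window.astype(np.int16).ravel()    # flatten to (n_lags+1)*3 values

samples = np.arange(30).reshape(10, FEATS)    # fake log: 10 samples of 3 values
x = make_window(samples, N_LAGS)
print(x.size, x.nbytes)  # 12 values, 24 bytes -> matches the 3-lag math above
```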

data preparation

Before training:

  • group data into a current tensor + n historical lags (drop the first n samples, which have no full history)
  • split 80% train / 20% validation
  • normalize everything (using statistics from the training split only, to avoid leakage)
  • shuffle to kill sequential bias

With clean, shuffled data - train.
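The whole preparation pipeline fits in a few lines of numpy. A sketch on fake data (the real scripts are in the repo; the shapes and names here are my own):

```python
import numpy as np

rng = np.random.default_rng(42)
data = rng.normal(size=(1000, 3))          # fake log: [error, current, output]
N_LAGS = 3

# 1. Group into current sample + N_LAGS historical samples (flattened),
#    dropping the first N_LAGS rows that have no full history.
X = np.column_stack([data[N_LAGS - i : len(data) - i] for i in range(N_LAGS + 1)])
y = data[N_LAGS:, 2]                       # target: the PID output column

# 2. 80/20 train/validation split.
split = int(0.8 * len(X))
X_train, X_val = X[:split], X[split:]
y_train, y_val = y[:split], y[split:]

# 3. Normalize with statistics from the training split only.
mean, std = X_train.mean(axis=0), X_train.std(axis=0)
X_train = (X_train - mean) / std
X_val = (X_val - mean) / std

# 4. Shuffle the training set to kill sequential bias.
perm = rng.permutation(len(X_train))
X_train, y_train = X_train[perm], y_train[perm]
```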

training

Feed the model the training set epoch after epoch. Each epoch = one full pass over all samples. Model adjusts weights.
Every few epochs - validate on holdout set and watch metrics (loss, R², RMSE).
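The training loop itself is a handful of lines of PyTorch. A minimal sketch on synthetic data standing in for the prepared dataset (3 lags → 12 input features; names and hyperparameters are illustrative, not the repo's):

```python
import torch
from torch import nn

torch.manual_seed(0)

# Synthetic stand-in for the prepared dataset: 12 features -> 1 PID output.
X = torch.randn(800, 12)
true_w = torch.randn(12, 1)
y = X @ true_w + 0.01 * torch.randn(800, 1)

model = nn.Linear(12, 1)                           # plain linear regression
opt = torch.optim.SGD(model.parameters(), lr=0.05)
loss_fn = nn.MSELoss()

losses = []
for epoch in range(50):                            # one full pass per epoch
    opt.zero_grad()
    loss = loss_fn(model(X), y)                    # how far we miss the PID output
    loss.backward()                                # compute gradients
    opt.step()                                     # adjust weights
    losses.append(loss.item())

print(f"first: {losses[0]:.4f}  last: {losses[-1]:.4f}")
```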

Training the model for 3 lags in 50 epochs...
  Epoch 10/50 | Avg Loss: 0.014768
  Epoch 20/50 | Avg Loss: 0.014919
  Epoch 30/50 | Avg Loss: 0.014742
  Epoch 40/50 | Avg Loss: 0.014849
  Epoch 50/50 | Avg Loss: 0.014857
Training done for 3 lags.
Successfully registered model 'PyTorchLinear3'.
Created version '1' of model 'PyTorchLinear3'.
[...]
   → Model recorded in MLflow
   → ONNX ready: ./models/PyTorchLinear3.onnx

I trained several models (3 to 11 lags). Training on CPU (20 threads) took ~1 minute per model for 50 epochs. GPU would be 10x faster, but for linear regression CPU is fine.


MLFlow for supervising the learning process

MLflow was invaluable - a live web dashboard showing training progress, metrics, and hyperparameters.

Validation & metrics

After training we get metrics that tell us how accurate inference will be and by how much the model misses the true PID output.

Trained models (3-11 lags) results from mlflow:

run (lags)      RMSE    R²      relative error (%)
3 lags          12.45   0.942   ~1.78%
4 lags          10.82   0.958   ~1.55%
5 lags           9.61   0.967   ~1.37%
6 lags           8.92   0.973   ~1.27%
7 lags           8.47   0.978   ~1.21%
8 lags           8.21   0.981   ~1.17%
9 lags (best)    8.04   0.986   ~1.15%
10 lags          8.12   0.984   ~1.16%
11 lags          8.19   0.983   ~1.17%

Target PID output range: 300…1000 (700 units span).
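The relative-error column is just RMSE over that span; a quick check for the 9-lag row:

```python
rmse = 8.04             # best model (9 lags), from the table above
span = 1000 - 300       # target PID output range: 700 units
rel_err = rmse / span * 100
print(f"{rel_err:.2f}%")  # -> 1.15%, matching the table
```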

Best model: linear regression with 9 lags

  • RMSE ≈ 8.04 → average error ±8 units → 1.15% relative error
  • R² = 0.986 → the model explains nearly 99% of the variance in the PID output

Solid enough for real deployment.

aftermath

From 3 to 11 lags, the winner is 9-lag linear regression.
Export to ONNX → STM X-CUBE-AI converts to C code.
Inference will need 10 tensors (9 historical + current) = 60 bytes of input data.

Stay tuned for Part 3 - inference on MCU and head-to-head comparison. Your bet: “fuck PID” or not…?