
What is Feature Scaling and Normalisation?

Feature scaling transforms numerical features to a comparable range so that no single feature dominates a model due to its magnitude. The two most common techniques are min-max scaling (normalisation), which maps values to [0, 1], and standardisation (z-score scaling), which centres data at zero with unit variance. Most distance-based and gradient-based models require feature scaling to perform well.


Why does feature scaling matter?

Consider a dataset with document length (range: 10-50,000 words) and number of images (range: 0-20). Without scaling, the document length feature dominates Euclidean distance calculations simply because its values are larger, not because it’s more informative.
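
The effect can be sketched with three made-up documents, each described as (word count, image count). The document values and the `min_max` helper are assumptions for illustration; only the feature ranges come from the example above.

```python
import math

# Three hypothetical documents: (word_count, image_count).
doc_a = (12_000, 2)
doc_b = (12_500, 18)   # similar length to a, very different image count
doc_c = (30_000, 2)    # very different length from a, same image count

# Feature ranges from the example: words 10-50,000, images 0-20.
RANGES = [(10, 50_000), (0, 20)]

def min_max(point, ranges=RANGES):
    """Map each feature to [0, 1] using its known (min, max) range."""
    return tuple((x - lo) / (hi - lo) for x, (lo, hi) in zip(point, ranges))

# Raw distances: the word-count axis decides almost everything.
raw_ab = math.dist(doc_a, doc_b)
raw_ac = math.dist(doc_a, doc_c)

# Scaled distances: both features contribute on a comparable footing.
scaled_ab = math.dist(min_max(doc_a), min_max(doc_b))
scaled_ac = math.dist(min_max(doc_a), min_max(doc_c))

print(f"raw:    d(a,b)={raw_ab:.1f}  d(a,c)={raw_ac:.1f}")
print(f"scaled: d(a,b)={scaled_ab:.3f}  d(a,c)={scaled_ac:.3f}")
```

On the raw features, a difference of 16 images adds less than one unit to a distance of roughly 500, so document b looks like a near neighbour of a. After scaling, the ordering flips and b is further from a than c is.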

Models particularly sensitive to feature scale:

  • K-Nearest Neighbours: distance-based, heavily affected
  • Support Vector Machines: margin maximisation is scale-sensitive
  • Neural networks / gradient descent: features on very different scales make the loss surface poorly conditioned, which slows or destabilises training
  • PCA: maximises variance, so large-scale features dominate components

Models that don’t require scaling:

  • Tree-based models (decision trees, random forests, gradient boosting): split on thresholds, scale-invariant
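
Why threshold splits are scale-invariant can be illustrated with a hand-rolled, single-feature decision stump. This is a simplified sketch, not scikit-learn's tree implementation; `best_split` and the sample data are hypothetical.

```python
def best_split(xs, ys):
    """Find the split on one feature that misclassifies the fewest points.

    Returns the set of sample indices sent to the left branch, which is
    what determines the stump's predictions.
    """
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    best_errors, best_left = None, None
    for k in range(1, len(order)):
        left, right = order[:k], order[k:]
        if xs[left[-1]] == xs[right[0]]:
            continue  # cannot place a threshold between identical values
        # Each side predicts its majority class; count the mistakes.
        errors = sum(
            len(side) - max(sum(ys[i] for i in side), sum(1 - ys[i] for i in side))
            for side in (left, right)
        )
        if best_errors is None or errors < best_errors:
            best_errors, best_left = errors, frozenset(left)
    return best_left

word_counts = [120, 800, 4_500, 9_000, 25_000, 40_000]
labels = [0, 0, 0, 1, 1, 1]

# Min-max scale the feature, then search for the best split again.
lo, hi = min(word_counts), max(word_counts)
scaled = [(x - lo) / (hi - lo) for x in word_counts]

split_raw = best_split(word_counts, labels)
split_scaled = best_split(scaled, labels)
print(split_raw == split_scaled)
```

Scaling is strictly increasing, so the samples keep their order and the same partition is found. Only the numeric value of the threshold changes.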

Min-max scaling (normalisation)

Maps all values to the range [0, 1]:

x_scaled = (x - x_min) / (x_max - x_min)

from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X_train)

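Best for: neural network inputs, and cases where you know the approximate min/max of the data. Weakness: sensitive to outliers, since a single extreme value stretches the range and compresses all other values into a narrow band (towards 0 for a high outlier, towards 1 for a low one).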
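
The outlier weakness is easy to see on a toy feature. The sketch below applies the min-max formula in plain Python; `min_max_scale` and the sample values are made up for illustration, and mapping a constant feature to 0.0 is simply a choice made here.

```python
def min_max_scale(values):
    """Scale a list of numbers to [0, 1] using its own min and max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant feature: there is no spread to scale, so return zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

clean = [10, 20, 30, 40, 50]
with_outlier = clean + [1000]   # one extreme value appended

scaled_clean = min_max_scale(clean)
scaled_outlier = min_max_scale(with_outlier)

print(scaled_clean)
print([round(v, 3) for v in scaled_outlier])
```

Without the outlier the five values spread evenly across [0, 1]. With it, the same five values are squeezed below 0.05 and the outlier alone sits at 1.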


Standardisation (Z-score scaling)

Centres data at zero and scales to unit variance:

x_scaled = (x - mean) / std

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_train)

Best for: most general cases, especially when you don’t know the range. Weakness: the output is unbounded, so outliers stay extreme, and they also skew the mean and standard deviation used for scaling.
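
A small worked example of the z-score formula, written with the standard library only. The `standardise` helper and the sample values are assumptions for illustration. It uses the population standard deviation (divide by n), which is also the convention the scikit-learn documentation describes for StandardScaler.

```python
import statistics

def standardise(values):
    """Return z-scores: (x - mean) / population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population std, i.e. divide by n
    if std == 0:
        # Constant feature: nothing to scale, so return zeros.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]

heights_cm = [150, 160, 170, 180, 190]
z = standardise(heights_cm)

# The same feature with one extreme value appended.
z_outlier = standardise([10, 20, 30, 40, 50, 1000])

print([round(v, 3) for v in z])
print([round(v, 3) for v in z_outlier])
```

The first result has mean 0 and standard deviation 1. In the second, the extreme value still stands well apart from the rest, because nothing clips or bounds the output.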


Min-max vs standardisation: which to use?

| Scenario | Recommended |
| --- | --- |
| Known bounded range (e.g. pixel values 0-255) | Min-max |
| Unknown range, possible outliers | Standardisation |
| Neural network training | Either (standardisation more common) |
| SVM | Standardisation |
| PCA | Standardisation |
| Data has extreme outliers | RobustScaler (uses median/IQR) |

Feature scaling and embedding vectors

Dense embedding vectors produced by SIE’s encoding models are already unit-normalised (L2 norm = 1) for cosine similarity search. You don’t need to apply additional scaling to embedding vectors.

However, when combining embedding similarity scores with other tabular features (e.g. document recency, click-through rate) in a re-ranking model, you’ll need to scale the non-embedding features to be comparable with the similarity scores.
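
One way this could look for a re-ranker is sketched below. The rows, the column layout (cosine similarity, age in days, click-through rate) and the helper names are all hypothetical. The similarity column is passed through unchanged and the two tabular columns are min-max scaled with parameters learned from the training rows.

```python
# Hypothetical training rows: (cosine_similarity, age_days, click_through_rate)
train_rows = [
    (0.82, 3, 0.041),
    (0.67, 400, 0.012),
    (0.91, 30, 0.095),
    (0.55, 1200, 0.003),
]

def fit_min_max(rows, columns):
    """Learn (min, max) for the selected columns from training rows only."""
    return {c: (min(r[c] for r in rows), max(r[c] for r in rows)) for c in columns}

def transform(row, params):
    """Scale the fitted columns; pass every other column through unchanged."""
    out = list(row)
    for c, (lo, hi) in params.items():
        out[c] = (row[c] - lo) / (hi - lo) if hi > lo else 0.0
    return tuple(out)

# Column 0 (similarity) is left as-is; columns 1 and 2 are scaled.
params = fit_min_max(train_rows, columns=[1, 2])
train_scaled = [transform(r, params) for r in train_rows]

# A new candidate at query time reuses the training parameters.
new_candidate = transform((0.74, 2000, 0.05), params)

for row in train_scaled:
    print(tuple(round(v, 3) for v in row))
print(tuple(round(v, 3) for v in new_candidate))
```

Values outside the training range, such as the 2000-day-old candidate here, land outside [0, 1]. Whether to clip them is a modelling decision this sketch leaves open.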


A critical rule: fit on training data, transform test data

Always fit the scaler on training data only, then apply the same transformation to validation and test data:

from sklearn.preprocessing import StandardScaler

# Correct: fit on the training split only
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit + transform
X_test_scaled = scaler.transform(X_test)        # transform only

# Wrong: leaks test statistics into training
X_all_scaled = StandardScaler().fit_transform(X_all)

Fitting on the full dataset causes data leakage and produces optimistically biased evaluation metrics.
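
To make the leak concrete, the sketch below uses a minimal single-feature stand-in for a fit/transform scaler. `SimpleStandardScaler` is a hypothetical class written for this example, not part of scikit-learn, and the data are invented so that the test split sits in a clearly different range.

```python
import statistics

class SimpleStandardScaler:
    """Minimal single-feature stand-in for a fit/transform scaler."""

    def fit(self, values):
        self.mean_ = statistics.fmean(values)
        # Fall back to 1.0 for a constant feature to avoid dividing by zero.
        self.std_ = statistics.pstdev(values) or 1.0
        return self

    def transform(self, values):
        return [(v - self.mean_) / self.std_ for v in values]

x_train = [1.0, 2.0, 3.0, 4.0, 5.0]
x_test = [50.0, 60.0]   # test split drawn from a shifted range

# Correct: parameters come from the training split alone.
correct = SimpleStandardScaler().fit(x_train)

# Leaky: test values have already influenced the mean and std.
leaky = SimpleStandardScaler().fit(x_train + x_test)

print("train-only mean/std:", correct.mean_, round(correct.std_, 3))
print("leaky mean/std:     ", round(leaky.mean_, 3), round(leaky.std_, 3))
print("train scaled, correct:", [round(v, 3) for v in correct.transform(x_train)])
print("train scaled, leaky:  ", [round(v, 3) for v in leaky.transform(x_train)])
```

The leaky scaler's parameters encode where the test data sits, so the training features a model sees already carry information about the test set.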


Frequently asked questions

Do I need to scale features for tree-based models? No. Decision trees, random forests, and gradient boosted trees split on thresholds, which are scale-invariant. Scaling doesn’t hurt, but it doesn’t help either.

What is RobustScaler? RobustScaler uses the median and interquartile range instead of mean and standard deviation, making it resistant to outliers: x_scaled = (x - median) / IQR
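
The formula can be tried out with the standard library. This is a sketch of the idea rather than scikit-learn's implementation: `robust_scale` is a hypothetical helper, and quartile conventions differ between libraries, so `method="inclusive"` is an assumption that may not reproduce another implementation digit for digit.

```python
import statistics

def robust_scale(values):
    """Scale with the median and interquartile range: (x - median) / IQR."""
    median = statistics.median(values)
    # Quartiles with linear interpolation over the observed data.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    if iqr == 0:
        # No spread in the middle half of the data: return zeros.
        return [0.0 for _ in values]
    return [(v - median) / iqr for v in values]

data = [1, 2, 3, 4, 100]   # one extreme value
scaled = robust_scale(data)
print(scaled)
```

The median and IQR are computed from the middle of the distribution, so the four ordinary values keep a sensible spread around 0 while the extreme value stays clearly marked as extreme.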

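Does feature scaling change the information content? No. Min-max scaling, standardisation and RobustScaler each apply an invertible linear (affine) transformation to every feature, so rank order and the ratios between differences within a feature are preserved, and the original values can be recovered from the fitted parameters. What changes is how much each feature contributes relative to the others in distance and gradient calculations.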

