Table Of Contents
- Feature Scaling Matters
- Basic StandardScaler Usage
- Train/Test Scaling
- Different Scaling Methods
- Pipeline Integration
- Handling Outliers
- Inverse Transform
- Feature-Specific Scaling
- When to Use Each Scaler
- Master Preprocessing
Feature Scaling Matters
Features measured on very different scales can dominate distance-based and gradient-based algorithms: a salary in the tens of thousands swamps an age in the tens. StandardScaler removes each feature's mean and scales it to unit variance, so all features contribute on comparable terms during model training.
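Before reaching for a library, the effect is easy to see with a toy distance computation. A minimal sketch (the means and standard deviations below are illustrative, not computed from a real dataset):

```python
import numpy as np

# Two customers described by (age, salary). Salary dwarfs age in raw units.
a = np.array([25.0, 50_000.0])
b = np.array([45.0, 52_000.0])

# The unscaled Euclidean distance is driven almost entirely by salary.
raw_dist = np.linalg.norm(a - b)

# After z-scoring each feature (illustrative means/stds), the 20-year age
# gap and the 2,000-unit salary gap contribute on comparable terms.
means = np.array([35.0, 51_000.0])
stds = np.array([10.0, 1_000.0])
scaled_dist = np.linalg.norm((a - means) / stds - (b - means) / stds)

print(raw_dist)     # ~2000: salary dominates
print(scaled_dist)  # ~2.83: both features matter
```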
Basic StandardScaler Usage
from sklearn.preprocessing import StandardScaler
import numpy as np
import pandas as pd
# Sample data with different scales
data = pd.DataFrame({
    'age': [25, 30, 35, 40, 45],
    'salary': [50000, 60000, 75000, 80000, 90000],
    'experience': [2, 5, 8, 12, 15]
})
print("Original data:")
print(data)
print(f"\nMeans: {data.mean().values}")
print(f"Std devs: {data.std().values}")  # note: pandas uses the sample std (ddof=1)
# Apply StandardScaler
scaler = StandardScaler()
scaled_data = scaler.fit_transform(data)
print("\nAfter scaling:")
print(f"Means: {scaled_data.mean(axis=0)}")    # ~0
print(f"Std devs: {scaled_data.std(axis=0)}")  # 1 (StandardScaler uses the population std, ddof=0)
Train/Test Scaling
from sklearn.model_selection import train_test_split
# Create larger dataset
X = np.random.randn(1000, 3) * [100, 10, 1000] # Different scales
y = np.random.randint(0, 2, 1000)
# Split data first
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
# Fit scaler on training data only
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test) # Use same scaler
print(f"Train means: {X_train_scaled.mean(axis=0)}")  # ~0 by construction
print(f"Test means: {X_test_scaled.mean(axis=0)}")    # near zero, not exact: scaled with train statistics
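One practical consequence of fitting on the training data only: the fitted scaler must travel with the model to inference time, so new samples are transformed with the same training statistics. A minimal sketch using Python's built-in pickle (the variable names and synthetic data are illustrative):

```python
import pickle
import numpy as np
from sklearn.preprocessing import StandardScaler

# Fit on training data (synthetic here, mirroring the split above).
rng = np.random.default_rng(42)
X_train = rng.normal(size=(100, 3)) * [100, 10, 1000]
scaler = StandardScaler().fit(X_train)

# Serialize the fitted scaler; in practice you would write these bytes
# to disk alongside the model and reload them at inference time.
blob = pickle.dumps(scaler)
loaded = pickle.loads(blob)

# The reloaded scaler carries the exact training statistics.
print(np.allclose(loaded.mean_, scaler.mean_))  # True
```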
Different Scaling Methods
from sklearn.preprocessing import MinMaxScaler, RobustScaler, Normalizer
# Sample data
X = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])
# StandardScaler (z-score normalization)
standard_scaler = StandardScaler()
X_standard = standard_scaler.fit_transform(X)
# MinMaxScaler (0-1 scaling)
minmax_scaler = MinMaxScaler()
X_minmax = minmax_scaler.fit_transform(X)
# RobustScaler (uses median and IQR)
robust_scaler = RobustScaler()
X_robust = robust_scaler.fit_transform(X)
print("StandardScaler:", X_standard.round(2))
print("MinMaxScaler:", X_minmax.round(2))
print("RobustScaler:", X_robust.round(2))
Pipeline Integration
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
# Create pipeline with scaling
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression(random_state=42))
])
# Fit pipeline
pipeline.fit(X_train, y_train)
# Predict
y_pred = pipeline.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Pipeline accuracy: {accuracy:.3f}")
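Bundling the scaler into the pipeline also makes cross-validation leakage-free: each fold refits the scaler on that fold's training portion only. A short sketch (the synthetic data below is illustrative):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3)) * [100, 10, 1000]
y = rng.integers(0, 2, 200)

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression(random_state=42))
])

# cross_val_score refits the whole pipeline per fold, so the scaler
# only ever sees that fold's training data -- no leakage into validation.
scores = cross_val_score(pipeline, X, y, cv=5)
print(scores.mean())
```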
Handling Outliers
# Data with outliers
data_with_outliers = np.array([
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
    [100, 200, 300]  # Outlier
])
# StandardScaler is sensitive to outliers
standard = StandardScaler().fit_transform(data_with_outliers)
# RobustScaler handles outliers better
robust = RobustScaler().fit_transform(data_with_outliers)
print("With outliers - Standard:", standard.round(2))
print("With outliers - Robust:", robust.round(2))
Inverse Transform
# Scale data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(data)
# Convert back to original scale
X_original = scaler.inverse_transform(X_scaled)
print("Original:", data.values[:2])
print("Scaled:", X_scaled[:2])
print("Inverse:", X_original[:2])
Feature-Specific Scaling
# Scale only specific columns
from sklearn.compose import ColumnTransformer
# Mixed data types
mixed_data = pd.DataFrame({
    'numerical1': [1, 2, 3, 4, 5],
    'numerical2': [100, 200, 300, 400, 500],
    'categorical': ['A', 'B', 'A', 'C', 'B']
})
# Scale only numerical columns
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), ['numerical1', 'numerical2']),
        ('cat', 'passthrough', ['categorical'])
    ]
)
processed = preprocessor.fit_transform(mixed_data)
print("Processed shape:", processed.shape)
When to Use Each Scaler
- StandardScaler: The default choice; works best when features are roughly Gaussian
- MinMaxScaler: When you need a bounded 0-1 range; preserves the shape of the distribution
- RobustScaler: When data contains outliers (uses median and IQR)
- Normalizer: Scales individual samples (rows) to unit norm; common in text analysis
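Normalizer is the odd one out: it operates on rows, not columns. A short sketch of the row-wise behavior:

```python
import numpy as np
from sklearn.preprocessing import Normalizer

X = np.array([[3.0, 4.0],
              [1.0, 0.0]])

# Normalizer rescales each ROW to unit L2 norm (columns are untouched),
# which is why it suits bag-of-words / TF-IDF rows rather than features.
X_norm = Normalizer(norm='l2').fit_transform(X)

print(X_norm)                          # rows [0.6, 0.8] and [1.0, 0.0]
print(np.linalg.norm(X_norm, axis=1))  # every row has norm 1
```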
Master Preprocessing
Explore feature engineering techniques, learn categorical encoding methods, and discover data transformation pipelines.