Machine learning is not only about training a model. Students also need to know whether the model learned a useful pattern or accidentally saw information it should not have seen.
Scenario: student project predictor
A student builds a model to predict whether a project will be submitted on time. The features include the number of drafts, the days until the deadline, and the final submission status. The last feature is a problem: it is only recorded after the outcome is known, so it directly reveals the answer.
# final_submission_status is recorded after the deadline, so it leaks the label.
features = ["draft_count", "days_until_deadline", "final_submission_status"]
label = "submitted_on_time"
Including final submission status is data leakage. The model may score very well, but it is using information that would not be available before the prediction is made.
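One way to make this check routine is to record, for each feature, whether its value is known at prediction time, and to filter on that before training. The sketch below is a minimal illustration only: the `known_at_prediction_time` mapping and the `safe_features` helper are hypothetical names introduced here, not part of any library.

```python
# Hypothetical bookkeeping: for each candidate feature, is its value
# available at the moment the prediction is made?
known_at_prediction_time = {
    "draft_count": True,
    "days_until_deadline": True,
    "final_submission_status": False,  # only known after the deadline: leaks the label
}


def safe_features(candidates, availability):
    """Return the candidates that are marked as known at prediction time."""
    # Features missing from the mapping are treated as unsafe by default.
    return [name for name in candidates if availability.get(name, False)]


features = ["draft_count", "days_until_deadline", "final_submission_status"]
print(safe_features(features, known_at_prediction_time))
# ['draft_count', 'days_until_deadline']
```

Treating unlisted features as unsafe is a deliberate choice in this sketch: a new column has to be reviewed before it can reach the model.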
Train/test split
A train/test split holds back part of the data so the model can be evaluated on examples it did not train on.
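from sklearn.model_selection import train_test_split

# X holds the feature columns and y the labels; both are assumed to be loaded already.
# Hold back 25% of the rows for testing; random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)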
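To make the idea concrete, here is a rough sketch of what a split does, written in plain Python. It is not scikit-learn's implementation; `simple_split` is a hypothetical helper that only illustrates shuffling the rows once and holding a fraction back.

```python
import random


def simple_split(rows, test_size=0.25, seed=42):
    """Shuffle row indices once, then hold back a fraction as the test set."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)  # seeded so the split is repeatable
    n_test = round(len(rows) * test_size)
    test_idx = indices[:n_test]
    train_idx = indices[n_test:]
    train = [rows[i] for i in train_idx]
    test = [rows[i] for i in test_idx]
    return train, test


rows = list(range(20))  # stand-in for 20 examples
train_rows, test_rows = simple_split(rows)
print(len(train_rows), len(test_rows))  # 15 5
```

No example appears in both parts, which is what lets the test score stand in for performance on unseen data.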
Good student checklist
- Would this feature be known at prediction time?
- Did preprocessing learn from the test set?
- Is the test data similar to the real data the model will see?
- Is accuracy enough, or do false positives and false negatives matter differently?
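Two of these questions can be checked in a few lines. The sketch below uses made-up numbers and hypothetical helper names (`fit_scaler`, `apply_scaler`, `confusion_counts`), and it assumes labels are 1 for "submitted on time" and 0 otherwise. The first part computes scaling statistics from the training values only and reuses them on the test values, so preprocessing never learns from the test set. The second part counts false positives and false negatives separately instead of reporting accuracy alone.

```python
from statistics import mean, pstdev


def fit_scaler(train_values):
    """Learn the mean and standard deviation from training data only."""
    return mean(train_values), pstdev(train_values)


def apply_scaler(values, mu, sigma):
    """Standardize values using statistics learned from the training set."""
    return [(v - mu) / sigma for v in values]


def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    return {
        "tp": sum(1 for t, p in pairs if t == 1 and p == 1),
        "tn": sum(1 for t, p in pairs if t == 0 and p == 0),
        "fp": sum(1 for t, p in pairs if t == 0 and p == 1),
        "fn": sum(1 for t, p in pairs if t == 1 and p == 0),
    }


# Made-up draft counts, for illustration only.
train_drafts = [1, 2, 3, 4]
test_drafts = [2, 6]

mu, sigma = fit_scaler(train_drafts)  # the test set is not used here
train_scaled = apply_scaler(train_drafts, mu, sigma)
test_scaled = apply_scaler(test_drafts, mu, sigma)

# Made-up labels and predictions, for illustration only.
y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0]
counts = confusion_counts(y_true, y_pred)
accuracy = (counts["tp"] + counts["tn"]) / len(y_true)
print(counts, accuracy)
```

scikit-learn's `StandardScaler` follows the same pattern: call `fit` on the training data, then `transform` on both sets. Whether a false positive or a false negative is worse depends on the project, for example whether it costs more to miss a late submission or to warn a student who was going to be on time.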
Practice prompt
Given a sports prediction dataset, identify which columns are available before the game starts and which columns leak the final outcome. Then retrain using only the pre-game features.
