AI/ML Train-Test Split and Data Leakage

Machine learning is not only about training a model. Students also need to know whether the model learned a useful pattern or accidentally saw information it should not have seen.

Scenario: student project predictor

A student builds a model to predict whether a project will be submitted on time. The features are number of drafts, days until deadline, and final submission status. The last feature is a problem: it is only known once the project has been handed in, so it directly reveals the answer.

features = ["draft_count", "days_until_deadline", "final_submission_status"]
label = "submitted_on_time"

Including final submission status is data leakage. The model may score very well during evaluation, but it relies on information that would not be available when the prediction is actually made, so the score says little about how the model would perform in real use.

Train/test split

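A train/test split holds back part of the data so the model can be evaluated on examples it did not train on. In the call below, test_size=0.25 holds back 25% of the rows for testing, and random_state=42 fixes the shuffle so the split is reproducible.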

from sklearn.model_selection import train_test_split

# X holds the feature columns and y the label column;
# both are assumed to be defined already.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
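
Once the data is split, any preprocessing step should learn its statistics from the training rows only and then be applied unchanged to the test rows. The following is a minimal sketch, assuming numeric features and scikit-learn's StandardScaler; the data is randomly generated purely for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=10.0, scale=3.0, size=(40, 2))  # made-up feature values
y = rng.integers(0, 2, size=40)                    # made-up labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

scaler = StandardScaler()
# fit_transform learns the mean and standard deviation from training rows only.
X_train_scaled = scaler.fit_transform(X_train)
# transform reuses those training statistics; the scaler is never fitted on test rows.
X_test_scaled = scaler.transform(X_test)

print("rows in train/test:", X_train.shape[0], X_test.shape[0])
print("means learned from train:", scaler.mean_)
```

Calling fit or fit_transform on the full dataset before splitting would let the test rows influence the scaling, which is a quieter form of the same leakage.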

Good student checklist

Would this feature be known at prediction time?
Did preprocessing (scaling, imputation, encoding) learn from the test set? It should be fitted on the training data only.
Is the test data similar to the real data the model will see?
Is accuracy enough, or do false positives and false negatives matter differently?
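
The last question is easier to judge with numbers. The sketch below uses made-up labels (90 on-time projects, 10 late ones) and a deliberately lazy model that always predicts "on time". It counts the four outcome types by hand so the arithmetic is visible.

```python
# Hypothetical labels: 1 = submitted on time, 0 = late.
y_true = [1] * 90 + [0] * 10
y_pred = [1] * 100  # a lazy model that always predicts "on time"

pairs = list(zip(y_true, y_pred))
true_positives = sum(t == 1 and p == 1 for t, p in pairs)
true_negatives = sum(t == 0 and p == 0 for t, p in pairs)
false_positives = sum(t == 0 and p == 1 for t, p in pairs)  # late, predicted on time
false_negatives = sum(t == 1 and p == 0 for t, p in pairs)  # on time, predicted late

accuracy = (true_positives + true_negatives) / len(y_true)
# Share of the late projects that the model actually flagged as late.
late_projects_caught = true_negatives / (true_negatives + false_positives)

print("accuracy:", accuracy)
print("late projects caught:", late_projects_caught)
```

Accuracy comes out at 0.9, yet the model never identifies a single late project. If the point of the predictor is to warn students who are at risk, the false positives here matter far more than the headline accuracy suggests.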

Practice prompt

Given a sports prediction dataset, identify which columns are known before the game starts and which columns leak the final outcome. Then retrain using only the pre-game features and compare the test score with the original.
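
One possible starting point is to write the audit down explicitly. The column names below are hypothetical and only illustrate the idea; a real dataset will have its own columns, and deciding whether each one is known before the game is the actual exercise.

```python
# Hypothetical columns, each marked by hand: is the value known before kickoff?
known_before_game = {
    "home_team_win_rate": True,
    "away_team_win_rate": True,
    "days_of_rest": True,
    "final_score_home": False,  # only exists after the game
    "final_score_away": False,  # only exists after the game
    "overtime_played": False,   # only exists after the game
}
label = "home_team_won"  # hypothetical label column

safe_features = [name for name, known in known_before_game.items() if known]
leaky_features = [name for name, known in known_before_game.items() if not known]

print("train on:", safe_features)
print("drop:", leaky_features)
```

Only the columns in safe_features would then be passed to the model as X, with the label column as y.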