AI/ML, Python, Data Science, Model Evaluation · June 3, 2026

AI/ML Train-Test Split and Data Leakage

A student-friendly explanation of train/test split, leakage, and why model accuracy can look better than it really is.

Machine learning is not only about training a model. Students also need to know whether the model learned a useful pattern or accidentally saw information it should not have seen.

Scenario: student project predictor

A student builds a model to predict whether a project will be submitted on time. The features are the number of drafts, the days until the deadline, and the final submission status. The last feature is a problem: it is only known once the project has been handed in, so it directly reveals the answer.

features = ["draft_count", "days_until_deadline", "final_submission_status"]
label = "submitted_on_time"

Including final submission status is data leakage. The model may score very well, but it is using information that would not be available before the prediction is made.
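To see how much a leaky feature can inflate a score, here is a minimal sketch on made-up data. The numbers, the noise level, the choice of a decision tree, and the helper name holdout_accuracy are all assumptions for illustration; only the column names come from the scenario above.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 400

# Synthetic, invented data: illustrative only, not real student records.
draft_count = rng.integers(0, 6, size=n)
days_until_deadline = rng.integers(1, 15, size=n)

# The label depends only loosely on the honest features, plus a lot of noise.
signal = 0.4 * draft_count + 0.1 * days_until_deadline + rng.normal(0, 1.5, size=n)
submitted_on_time = (signal > np.median(signal)).astype(int)

# The leaky column is recorded after the deadline, so here it simply copies the label.
final_submission_status = submitted_on_time.copy()


def holdout_accuracy(columns):
    """Train on 75% of the rows and return accuracy on the held-out 25%."""
    X = np.column_stack(columns)
    X_train, X_test, y_train, y_test = train_test_split(
        X, submitted_on_time, test_size=0.25, random_state=42
    )
    model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
    return model.score(X_test, y_test)


leaky_acc = holdout_accuracy(
    [draft_count, days_until_deadline, final_submission_status]
)
honest_acc = holdout_accuracy([draft_count, days_until_deadline])

print(f"with final_submission_status:    {leaky_acc:.2f}")
print(f"without final_submission_status: {honest_acc:.2f}")
```

In this sketch the leaky model can split on the copied label and looks nearly flawless, while the honest model scores noticeably lower. The lower number is the more realistic estimate of how the model would do before the deadline.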

Train/test split

A train/test split holds back part of the data so the model can be evaluated on examples it did not train on.

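from sklearn.model_selection import train_test_split

# X is the feature table and y is the label column, both assumed to be loaded already.
# test_size=0.25 holds back 25% of the rows; random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)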
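A split is only useful if the model is then fitted on the training part and scored on the held-out part. The sketch below shows one way that could look. The data is a synthetic stand-in generated with make_classification, and logistic regression is just a convenient example model, not something the scenario prescribes.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for X and y (assumption: 200 rows, 5 numeric features).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# With test_size=0.25, 200 rows become 150 training rows and 50 test rows.
print(X_train.shape, X_test.shape)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)  # the model only ever sees the training rows

train_acc = model.score(X_train, y_train)  # usually optimistic
test_acc = model.score(X_test, y_test)     # the number worth reporting

print(f"train accuracy: {train_acc:.2f}")
print(f"test accuracy:  {test_acc:.2f}")
```

A large gap between the two numbers is a hint that the model memorized the training rows rather than learning a pattern that carries over.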

Good student checklist

  • Would this feature be known at prediction time?
  • Did preprocessing learn from the test set?
  • Is the test data similar to the real data the model will see?
  • Is accuracy enough, or do false positives and false negatives matter differently?
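Two of the checklist questions can be checked directly in code. The sketch below, again on synthetic stand-in data, splits first and then fits a scaler inside a scikit-learn pipeline so that preprocessing learns from the training rows only. It then counts false positives and false negatives instead of reporting accuracy alone. The dataset, the scaler, and the model are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: 300 rows, 4 numeric features).
X, y = make_classification(n_samples=300, n_features=4, random_state=1)

# Split BEFORE any preprocessing so the test rows stay unseen.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# The pipeline fits the scaler on X_train only, then reuses those
# training statistics when it transforms X_test at prediction time.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipeline.fit(X_train, y_train)

y_pred = pipeline.predict(X_test)

# Rows are true labels, columns are predicted labels, in the order [0, 1].
tn, fp, fn, tp = confusion_matrix(y_test, y_pred, labels=[0, 1]).ravel()

print(f"false positives: {fp}")
print(f"false negatives: {fn}")
print(f"accuracy: {(tp + tn) / len(y_test):.2f}")
```

The leaky alternative would be calling the scaler's fit on all of X before splitting. That lets the test rows influence the mean and standard deviation the model trains with.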

Practice prompt

Given a sports prediction dataset, identify which columns are known before the game starts and which columns leak the final outcome. Then retrain the model using only the pre-game features.
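One possible starting point for the exercise is to write the decision down explicitly before training anything. Every column name below is hypothetical, invented for an imagined dataset. A real dataset will need its own review of when each column is recorded.

```python
# Hypothetical columns for an imagined sports dataset.
columns = [
    "home_team_win_rate",
    "away_team_win_rate",
    "days_since_last_game",
    "home_final_score",
    "away_final_score",
    "home_team_won",
]
label = "home_team_won"

# Assumption: these columns are only recorded once the game is over,
# so they would not be available at prediction time.
post_game_columns = {"home_final_score", "away_final_score"}

pre_game_features = [
    c for c in columns if c != label and c not in post_game_columns
]
leaky_columns = [c for c in columns if c in post_game_columns]

print("safe to train on:", pre_game_features)
print("leaks the outcome:", leaky_columns)
```

Keeping the list of post-game columns in one place makes the choice easy to review and easy to change if a column turns out to be recorded later than expected.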