The holy line of your ML project

    Many clients, many servers, dataset versions, database migrations, artificial data, deprecated data, preparing for new functionality.

    A casual day. Leakage and Overfitting are lurking at your desk. Why don’t you write the holy line of your ML project?

Just before fitting the model, compare the train and validation features:

assert not any([(train_rows == row).all(1).any() for row in test_rows])

Or, in a more informative way:

import warnings

# Count the test rows that also appear, value for value, in the training set.
number_of_leaks = sum((train_rows == row).all(axis=1).any() for row in test_rows)
if number_of_leaks > 0:
    warnings.warn('{} examples leak between train and validation sets'.format(number_of_leaks))
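The loop above compares each test row against the whole training matrix, which can get slow on large sets. Here is a minimal sketch of a set-based alternative, assuming both sets are 2-D NumPy arrays with the same column order. The function name `count_leaked_rows` is hypothetical and not part of any library.

```python
import numpy as np


def count_leaked_rows(train_rows, test_rows):
    """Count test rows that also appear, value for value, in train_rows."""
    train = np.asarray(train_rows)
    # Assumption: casting the test set to the training dtype is acceptable,
    # so that equal values also have equal byte representations.
    test = np.asarray(test_rows, dtype=train.dtype)
    # Store the raw bytes of every training row once, then do set lookups.
    train_keys = {row.tobytes() for row in train}
    return sum(row.tobytes() in train_keys for row in test)


# Toy data: the row (3.0, 4.0) is present in both sets.
train_rows = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
test_rows = np.array([[3.0, 4.0], [7.0, 8.0]])
leaks = count_leaked_rows(train_rows, test_rows)
```

Note that byte comparison is not identical to `==`: rows containing NaN with the same bit pattern count as equal here, while `0.0` and `-0.0` do not.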

    Checking for overlapping indices is often not enough. I’m sure these lines of code will one day save you a lot of time.
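To see why the index check alone can miss a leak, here is a small sketch with pandas. The data and the names `train_df` and `valid_df` are made up for illustration: the same feature row is stored twice under different index labels, as can happen after a migration or a dataset merge.

```python
import pandas as pd

# Toy data: the feature row (3, 4) was stored twice, under indices 11 and 13.
data = pd.DataFrame({"f1": [1, 3, 5, 3], "f2": [2, 4, 6, 4]},
                    index=[10, 11, 12, 13])
train_df = data.loc[[10, 11, 12]]
valid_df = data.loc[[13]]

# The index check passes: the two sets share no index labels.
shared_indices = train_df.index.intersection(valid_df.index)

# The content check does not: an inner merge on all shared columns
# returns the validation rows whose values also occur in the training set.
leaked = valid_df.merge(train_df.drop_duplicates(), how="inner")
```

Here `shared_indices` is empty while `leaked` still contains the duplicated row, so only the comparison by content catches the problem.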
