Casting the biggest enemy: Overfitting vs Silent Failures

For many years I had a strong conviction that the worst thing I could do was to overfit the training data.

I loved watching those legendary Kaggle disasters like here (Link), where a competitor drops from the top 10 to one of the last 10 places, so I could think to myself: “man, you’ve got it, you know how to do your job well”. There are plenty of signals telling you that the best you can do is to “trust your local CV”. The leaderboard is tricky, and dropping is a shame because it means you were greedy for the top spot and didn’t know what it was all about.

Well, that’s the common sentiment. And it’s all true. But while it’s true on Kaggle, Kaggle is not real life. In real life, uhh, there is a much worse evil, one that laughs at Overfitting while seizing your project by the throat. Its name is Silent Failure.

Assuming you’re able to build a representative validation set to drive your objective, you can run the classification and estimate the performance of the candidate algorithms. That is generally enough to get rid of overfitting. A representative validation set is not that hard to get in real life: it’s simply real client data. That’s why, in my opinion, overfitting drops to second place on the biggest-enemy podium.
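To make that concrete, here is a minimal sketch of that evaluation loop, assuming scikit-learn-style classifiers and a held-out slice of real client data; load_client_data and build_candidates are hypothetical placeholders for your own loading code, not part of any library:

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Real client data, held out as the representative validation set.
X, y = load_client_data()  # hypothetical loader for your own data

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Fit every candidate algorithm and score it on the same validation set.
for name, model in build_candidates().items():  # hypothetical: {'logreg': ..., 'svm': ...}
    model.fit(X_train, y_train)
    print(name, accuracy_score(y_val, model.predict(X_val)))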

Silent Failures are a broad collection of failures that are very hard to spot because they degrade performance gradually, even when the system seems reasonably good at the beginning.

For example, you may:

  • Forget to change a flag in your custom feature extractors or preprocessors
  • Use different versions of packages in your training and production environments (see the version-check sketch after this list)
    • For example, np.einsum changed its default optimize flag from False to True in v1.15.
    • In scikit-learn, version 0.22 changed the default logistic regression solver to an entirely different algorithm
  • Use a different operating system
  • Upload an old model file to production
    • The previous version worked almost as well as the current one, so how can you detect that it wasn’t replaced?
  • Fail to notice that features change over time
    • Someone changed the date format in the logs from DD-MM-YY to MM-DD-YY
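
The version-mismatch failure mode above is easy to guard against mechanically. Here is a minimal sketch, assuming the versions recorded at training time are shipped alongside the model artifact; the TRAINED_WITH dict and its contents are illustrative assumptions, not a prescription:

import numpy as np
import sklearn

# Versions recorded in the training environment and saved next to the model
# (illustrative values, replace with whatever your pipeline actually records).
TRAINED_WITH = {"numpy": "1.15.4", "scikit-learn": "0.22.2"}

def check_environment():
    """Fail loudly if production packages drift from the training environment."""
    found = {"numpy": np.__version__, "scikit-learn": sklearn.__version__}
    mismatches = {name: (TRAINED_WITH[name], found[name])
                  for name in TRAINED_WITH if TRAINED_WITH[name] != found[name]}
    if mismatches:
        raise RuntimeError(f"Package versions differ from training: {mismatches}")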

My experience shows that the probability distributions returned by the model should match between your validation and production environments to many decimal places to assure there aren’t any Silent Failures. Preparing such tests is my biggest piece of advice for today. For example:

import unittest

references = [0.11, 0.19, 0.7]  # reference distribution captured in the train environment

class TestSilentFailure(unittest.TestCase):
    def test_probability_distribution(self):
        # load_model and path_to_the_model stand for your own model serialization layer
        model = load_model(path_to_the_model)
        probabilities = model.predict_proba('example sentence')
        for probability, reference in zip(probabilities, references):
            self.assertAlmostEqual(probability, reference, places=7)
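
You can run this with python -m unittest inside the production environment (or the container image you are about to ship); if anything in the stack drifts, the test fails loudly instead of silently.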

Remember Rule #10: Watch for silent failures!

P.S. Has a Silent Failure ever happened to you? How many people did it affect?
