For a while I was strongly influenced by Test Driven Development (TDD). It took me some time to relate it to Machine Learning in general, and I hope to write a post about that someday.
I found, however, something much simpler, faster, cheaper, and nicer that saves me far more consistently. I did my best to place it within standard software development terminology but haven't found anything satisfying yet.
So I call it Pipeline Sanity Checks.
Sanity checks (smoke tests) determine whether it is possible and reasonable to proceed with further testing. Pipeline Sanity Checks determine whether it is possible and reasonable to proceed with further training.
In Machine Learning, and especially in Deep Learning, some parts of the system are always used, but only many hours or days after the process starts. The cost of a bug in these parts is huge. How do you prevent it? Normally, all components of the software should be covered by unit or integration tests. In ML, at a high level, these parts of the system are often hard to test because they interact with data, and the data happens to be so diverse that it's hard to specify its properties. What's worse, the data is often processed online.
For example, you may be surprised how much GPU memory your model needs to allocate to process the longest sequences in your data. You could impose a limit on the length, but you don't really know what value to choose a priori. And the memory usage can change drastically as soon as you change the type of cells you use.
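One cheap guard against this is to push a single batch built from the longest sequences through the model before training starts, so any memory blow-up surfaces immediately rather than hours in. A minimal, framework-agnostic sketch (the names `longest_batch_check` and `forward_fn` are mine, not from any library):

```python
def longest_batch_check(dataset, forward_fn, batch_size=2, length=len):
    """Build a batch from the longest examples and run it through the
    model once, so out-of-memory errors appear before real training."""
    worst = sorted(dataset, key=length, reverse=True)[:batch_size]
    return forward_fn(worst)

# Toy stand-ins for a real dataset and model forward pass.
data = [[1, 2, 3], [4, 5], [6, 7, 8, 9], [0]]
out = longest_batch_check(data, lambda batch: [sum(seq) for seq in batch])
```

With a real model, `forward_fn` would be one training step on the GPU; if it survives the worst-case batch, shorter sequences will fit too.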
Another example is online processing of the input data. The dimensionality of the entire neural network can change drastically if you decide to subsample some signal, say for efficiency.
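A check for this can be as small as one assertion between the preprocessor and the model, run before the long job. A sketch under assumed names (`check_input_shape` is hypothetical):

```python
def check_input_shape(preprocess, raw_example, expected_dim):
    """Fail fast if online preprocessing (e.g. subsampling) yields a
    different dimensionality than the network was built for."""
    x = preprocess(raw_example)
    if len(x) != expected_dim:
        raise ValueError(
            f"preprocessor produced dim {len(x)}, model expects {expected_dim}")
    return x

# Subsampling every 2nd value halves the dimensionality.
signal = list(range(100))
x = check_input_shape(lambda seq: seq[::2], signal, expected_dim=50)
```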
The range of possibilities in data and architectures rules out property-based tests and unit tests. System tests and integration tests somehow miss the point.
If you want to be more productive, you should definitely write Pipeline Sanity Checks.
I always trust my learning pipeline too much. After code changes, it rarely passes. Sometimes the disaster happens only after the first epoch, i.e., after 5 hours. If I'm not in the office, I notice the bug more than a dozen hours later. This slows down my research progress tremendously.
Some ideas for Pipeline Sanity Checks:
- Create a use_reduced_data flag. Before the real job starts, automatically run a few epochs of the training pipeline using just a few batches.
- Define the properties of the data using outliers.
- Catch preprocessor errors; supervise the classifiers' input and output ranges.
- Print the infected training examples after catching an error.
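The first and last points above can be combined in a tiny wrapper around the training loop: run a few batches first, and print whatever example breaks a step. A sketch under my own naming (`sanity_run` and `train_step` are hypothetical, not part of any framework):

```python
def sanity_run(batches, train_step, use_reduced_data=True, n_batches=3):
    """Run the training pipeline on a handful of batches before the real
    job starts, printing any example that makes a step blow up."""
    if use_reduced_data:
        batches = batches[:n_batches]
    for i, batch in enumerate(batches):
        try:
            train_step(batch)
        except Exception as exc:
            # Print the infected training example before re-raising.
            print(f"batch {i} broke the pipeline: {batch!r} ({exc})")
            raise

# With the flag on, only the first few batches are touched.
seen = []
sanity_run([[1], [2], [3], [4], [5]], seen.append)
```

The same `train_step` is then reused for the full run with `use_reduced_data=False`, so the sanity check exercises exactly the code that will run for hours.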
The number of experiments I can conceive of is always greater than my hardware capacity. In this sense, the hardware at my service is never enough, and I have to maximize its utility wisely. Sadly, I estimate I lose a month of research every year to failures that could easily have been spotted by Pipeline Sanity Checks. Well, prophets are not without honor, except in their hometown.