[D] Predicting value of feature/model engineering on very large data sets?

Setup: sizeable corpus (1TB+) of text data. The problem space roughly generalizes into a seq2seq model.

Given the size of the data and the parameter count required to fully exploit it, retraining the model against the full data set is (understandably) very expensive (tens of thousands of dollars per full training run).

Iterative pre-processing of the data is of course pricey as well.

Problem: How can we estimate the effect of some new feature/model-engineering without re-training on the entire data set?

Obviously we can do training runs on smaller amounts of data and/or with fewer parameters. This is of course a good start; if something doesn’t work at smaller volumes of data, it usually isn’t helpful at larger volumes either. But, at scale, enough data tends to wash away the value of many types of feature engineering and/or notionally more clever models.
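To make the "washes away at scale" worry concrete, here is a minimal sketch of one way to look at it: fit a simple power-law learning curve to a handful of small-scale runs for the baseline and for the variant, then compare the extrapolated gap at full data volume. Everything here is an assumption for illustration: the pure power-law form (no irreducible-loss floor), the function names, and the numbers, which are synthetic placeholders and not results from any real run.

```python
import numpy as np

def fit_power_law(n_examples, losses):
    """Fit loss ~= a * n**(-b) by least squares in log-log space; returns (a, b)."""
    slope, intercept = np.polyfit(np.log(n_examples), np.log(losses), deg=1)
    return float(np.exp(intercept)), float(-slope)

def predict_loss(a, b, n):
    """Evaluate the fitted power law at data volume n."""
    return a * n ** (-b)

# Synthetic placeholder measurements (NOT real results): validation loss
# at five small data volumes for a baseline and for a new-feature variant.
sizes = np.array([1e6, 3e6, 1e7, 3e7, 1e8])
baseline_loss = 20.0 * sizes ** -0.15
variant_loss = 16.0 * sizes ** -0.14  # better at small n, but a flatter curve

a_base, b_base = fit_power_law(sizes, baseline_loss)
a_var, b_var = fit_power_law(sizes, variant_loss)

full_scale = 1e11  # hypothetical full-corpus size, in training examples

# Positive gap means the variant beats the baseline at that data volume.
gap_small = predict_loss(a_base, b_base, sizes[0]) - predict_loss(a_var, b_var, sizes[0])
gap_full = predict_loss(a_base, b_base, full_scale) - predict_loss(a_var, b_var, full_scale)

print(f"baseline exponent: {b_base:.3f}, variant exponent: {b_var:.3f}")
print(f"gap at smallest run: {gap_small:+.4f}")
print(f"extrapolated gap at full scale: {gap_full:+.4f}")
```

With these made-up curves the variant wins at every measured size but has the shallower slope, so the extrapolated gap flips sign at full scale. Real curves are noisier and extrapolating several orders of magnitude is risky, so this is at best a sanity check, not a substitute for the full run.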

Are there better ways to drive this process? Bonus points if backed up by research!

Our current pattern is something like:

  • Build multiple new things
  • Test at much smaller data volumes
  • Do an approximately full rebuild (although see below) every week or two, leveraging all new features/changes, and see if it moves the needle.

This gives us some results, but it also means that when a rebuild includes multiple new features/model changes, it is very hard to disambiguate the effect of each one at scale. (And heaven help us if the overall performance goes down, despite all of our upfront testing.)

Instead, we’re left with doing ablations at much lower data volumes and using our best human intuition to decide what to apply at scale.
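For reference, the small-scale ablation step can be sketched as a full on/off grid over the candidate changes, with a linear fit to pull out a per-change effect. This is only an illustration under loud assumptions: the feature names are hypothetical, `run_small_experiment` is a stand-in for a cheap small-data training run (here it just returns synthetic additive scores so the example executes), and the additive model ignores interactions between changes.

```python
import itertools
import numpy as np

FEATURES = ["feat_a", "feat_b", "feat_c"]  # hypothetical names for candidate changes

def run_small_experiment(config):
    """Stand-in for a cheap small-data training run; returns a validation score.

    The effects below are synthetic placeholders, not real measurements.
    """
    true_effect = {"feat_a": 0.020, "feat_b": 0.000, "feat_c": -0.010}
    return 0.700 + sum(true_effect[name] for name, on in config.items() if on)

def estimate_main_effects(features, run_fn):
    """Run every on/off combination and fit score = base + sum(effect_i * on_i)."""
    rows, scores = [], []
    for flags in itertools.product([0, 1], repeat=len(features)):
        config = dict(zip(features, flags))
        rows.append([1.0, *flags])  # leading 1.0 is the intercept column
        scores.append(run_fn(config))
    design = np.array(rows, dtype=float)
    coef, *_ = np.linalg.lstsq(design, np.array(scores), rcond=None)
    return dict(zip(features, coef[1:]))  # drop the intercept

effects = estimate_main_effects(FEATURES, run_small_experiment)
for name, effect in effects.items():
    print(f"{name}: {effect:+.4f}")
```

The grid needs 2^k runs for k changes, which is only affordable at small data volumes, and that is exactly the limitation: nothing here says the estimated effects survive at full scale.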

It all kind of works, but is…unsatisfying. And probably unoptimized.


  • There are obviously lots of techniques to try to cram down the overall cost of re-training (e.g., warm starts from prior models or from other pre-trained models like RoBERTa). We are actively testing here; suggestions to decrease cost/wallclock are welcome.

  • Do we “need” to use all the data and/or use larger params to eke out every last bit of accuracy? To simplify the discussion here, let’s say yes, or at least assume that we’ve already done a fairly good job of testing to figure out the right trade-off point between model maximization and the accuracy required at the business level.

submitted by /u/farmingvillein