[P] Can someone provide an overview of what a tf.data pipeline looks like for real world data instead of ML-ready datasets?

I am refactoring my data input pipeline, replacing a custom set of classes that need to be scrapped (the process is inefficient, error-prone, hard to use, and not scalable) with an end-to-end pipeline built on TensorFlow 2.0’s tf.data module. I understand the overall process and how to use the module, but I have a few questions about how to properly structure the pipeline:

  1. Should I use an object-oriented design or a functional design? Intuitively I would go the OO route, using abstract base classes and eventually making each feature its own subclass, since my dataset doesn’t actually include any of my features, just the files from which the data to compute them is sourced. But in the examples I have seen, nobody really implements that sort of structure.
  2. Should I use the tf.feature_column module to define each individual feature and then concatenate those feature columns to produce my model input (roughly what the first sketch after this list shows)? Or should I concatenate the individual feature tensors together myself and use that as the input?
  3. My model takes three different inputs – one consisting of numerical and categorical features, and two separate tokenized sequence inputs. Should I implement this by creating three instances of tf.data.Dataset, one per input, or should I create one instance for all the data and then ‘pop’ (or whatever the equivalent operation is) the two columns holding my sequence data? Or is it just a matter of preference? The second sketch after this list is roughly what I have in mind for the single-Dataset route.
  4. This one is more just to help me out and isn’t necessarily about tf.data pipeline structure. How do I implement a dynamic tokenizer for my sequence data, one that simply adds each new word to the vocabulary dict and assigns it the next integer, so the model can be trained continuously on new data rather than having to build a new tokenizer and retrain the entire model from scratch every time a significant number of new words appear? The last sketch after this list shows the kind of thing I mean.
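
To make question 2 concrete, here is a minimal sketch of the feature_column route as I understand it; the column names (“price”, “color”) and the vocabulary are made up purely for illustration, not my real features:

    import tensorflow as tf

    # Hypothetical columns, just to illustrate the idea.
    price = tf.feature_column.numeric_column("price")
    color = tf.feature_column.indicator_column(
        tf.feature_column.categorical_column_with_vocabulary_list(
            "color", vocabulary_list=["red", "green", "blue"]
        )
    )

    # DenseFeatures concatenates the individual feature columns
    # into a single dense tensor that the model consumes.
    feature_layer = tf.keras.layers.DenseFeatures([price, color])

    batch = {
        "price": tf.constant([[9.99], [4.50]]),
        "color": tf.constant([["red"], ["blue"]]),
    }
    print(feature_layer(batch))  # shape (2, 4): 3 one-hot values + 1 numeric

The alternative I’m weighing is doing the encoding and concatenation myself with plain tensor ops inside a Dataset.map call.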
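
For question 3, this is roughly the single-Dataset structure I’m imagining; the shapes and input names (“tabular”, “seq_a”, “seq_b”) are placeholders, not my actual data:

    import numpy as np
    import tensorflow as tf

    # Toy stand-ins for the three model inputs.
    tabular = np.random.rand(8, 5).astype("float32")   # numeric + encoded categorical features
    seq_a = np.random.randint(0, 100, size=(8, 12))    # first tokenized sequence input
    seq_b = np.random.randint(0, 100, size=(8, 20))    # second tokenized sequence input
    labels = np.random.randint(0, 2, size=(8,))

    # One Dataset whose elements are ({input_name: tensor, ...}, label).
    # Keras matches the dict keys to the model's named Input layers,
    # so a single Dataset can feed a multi-input model directly.
    ds = (
        tf.data.Dataset.from_tensor_slices(
            ({"tabular": tabular, "seq_a": seq_a, "seq_b": seq_b}, labels)
        )
        .shuffle(buffer_size=8)
        .batch(4)
        .prefetch(tf.data.experimental.AUTOTUNE)
    )

    # The matching Input layers would be named the same way:
    # tf.keras.Input(shape=(5,), name="tabular"), etc.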
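
And for question 4, this is the kind of “just assign the next integer” tokenizer I mean, written as a plain-Python sketch (I realize growing the vocabulary also means resizing the embedding layer, which is part of what I’m unsure about):

    class GrowableVocab:
        """Toy word-to-id map that grows as new words appear (illustration only)."""

        def __init__(self, oov_token="<unk>"):
            self.word_to_id = {oov_token: 0}

        def encode(self, words, freeze=False):
            ids = []
            for word in words:
                if word not in self.word_to_id:
                    if freeze:              # at inference time, map unseen words to <unk>
                        ids.append(0)
                        continue
                    # Assign the next free integer to the new word.
                    self.word_to_id[word] = len(self.word_to_id)
                ids.append(self.word_to_id[word])
            return ids

    vocab = GrowableVocab()
    print(vocab.encode("the cat sat".split()))  # [1, 2, 3]
    print(vocab.encode("the dog sat".split()))  # [1, 4, 3] -- "dog" gets the next id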

If anyone could point to some good examples of pipelines like this, or just help me understand how the TF team designed this module to be used and how I can use it most effectively, I would be most appreciative and send a virtual hug your way, or order you UberEats – I’m dead serious lol, that’s how desperate I am to understand this. I have already planned out how everything is supposed to work, but I don’t want to start coding until I’m sure I’m structuring and using everything properly.

Cheers, and if you help me figure out the answers, I will literally DM you and get you UberEats (up to $20 max).

submitted by /u/that_one_ai_nerd