[P] Shuffle big file

Hi everyone,

Several times while working with huge files I have struggled to shuffle them, for instance before training. A common suggestion is to split the file into buckets and shuffle each bucket separately, but that is not a real shuffle: an element from bucket 0 can never end up in bucket 10, for instance.
A few months ago I made a library that does a real shuffle by shuffling the line indices and then re-reading the original file as many times as needed (with a certain batch size) to write out the complete shuffle.

Quick example: a file with 10K lines and batch_size 5K.
– Shuffle the line indices into index_shuffled
– Stream through the file and dump the lines matching the first 5K entries of index_shuffled
– Stream through the file once again and dump the lines matching the last 5K entries of index_shuffled
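
For what it's worth, here is a minimal Python sketch of that idea as I understand it (the function name and signature are my own, not the library's actual API): shuffle the line indices, then re-read the file once per batch and write out the lines belonging to that batch in shuffled order, so at most batch_size lines are held in memory at a time.

```python
import random


def shuffle_big_file(input_path, output_path, batch_size, seed=None):
    """Shuffle the lines of a large file using bounded memory.

    Only `batch_size` lines are held in memory at a time; the input
    file is re-read once per batch. (Hypothetical sketch, not the
    library's real interface.)
    """
    rng = random.Random(seed)

    # Pass 0: count the lines without loading the file into memory.
    with open(input_path, "r", encoding="utf-8") as f:
        n_lines = sum(1 for _ in f)

    # Shuffle the line indices, not the lines themselves.
    shuffled = list(range(n_lines))
    rng.shuffle(shuffled)

    with open(output_path, "w", encoding="utf-8") as out:
        for start in range(0, n_lines, batch_size):
            batch = shuffled[start:start + batch_size]
            wanted = set(batch)

            # Stream through the file once per batch and keep only the
            # lines whose original index belongs to the current batch.
            kept = {}
            with open(input_path, "r", encoding="utf-8") as f:
                for idx, line in enumerate(f):
                    if idx in wanted:
                        kept[idx] = line

            # Write the batch in its shuffled order.
            for idx in batch:
                out.write(kept[idx])


# Example matching the post: 10K lines, batch_size 5K -> two passes.
# shuffle_big_file("big.txt", "big_shuffled.txt", batch_size=5000, seed=0)
```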

Streaming through a file is cheap, which is why the performance seems quite reasonable to me: the input is re-read once per batch, i.e. ceil(number_of_lines / batch_size) passes in total.

It is really not that complicated, but I could not find it available anywhere…

Here is the link: https://github.com/YaYaB/shuffle-big-file

I hope it can be useful to some of you 🙂

Best,

YaYaB.

submitted by /u/YaYaBFr