[P] Shuffle big file

Written by torontoai on July 4, 2019. Posted in Reddit MachineLearning.

Hi everyone,

Several times when dealing with huge files, I have struggled to shuffle them, for instance before training. Some methods suggest splitting the file and shuffling each piece separately, but that is not a real shuffle (an element in bucket 0 will never appear in bucket 10, for instance).
A few months ago I made a library that does this by shuffling the line indices and then reading the original file as many times as necessary (with a given batch size) to complete the shuffle.

Quick example:
a file with 10K lines and a batch_size of 5K.
– Shuffle the line indices into index_shuffled
– Stream through the file and dump the lines listed in the first 5K entries of index_shuffled
– Stream through the file once again and dump the lines listed in the last 5K entries of index_shuffled
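
To make the steps above concrete, here is a minimal Python sketch of the same idea. It is not the library's actual code or API: the names `count_lines`, `shuffle_big_file` and the `seed` parameter are made up for illustration, and it assumes a plain text file with one record per line. Note that the shuffled index itself (one integer per line) is still kept in memory; only the line contents are limited to one batch at a time.

```python
import random


def count_lines(path):
    """Count the lines in a file by streaming through it once."""
    with open(path, "r", encoding="utf-8") as f:
        return sum(1 for _ in f)


def shuffle_big_file(src, dst, batch_size, seed=None):
    """Write the lines of `src` to `dst` in random order.

    Hypothetical helper: at most `batch_size` lines are held in memory
    at once, and the source file is re-read once per batch.
    """
    n = count_lines(src)

    # index_shuffled[k] is the source line number that ends up
    # at position k of the output file.
    index_shuffled = list(range(n))
    random.Random(seed).shuffle(index_shuffled)

    with open(dst, "w", encoding="utf-8") as out:
        for start in range(0, n, batch_size):
            batch = index_shuffled[start:start + batch_size]
            wanted = set(batch)
            picked = {}

            # One streaming pass: keep only the lines this batch needs.
            with open(src, "r", encoding="utf-8") as f:
                for i, line in enumerate(f):
                    if i in wanted:
                        # Make sure a last line without a newline still ends with one.
                        picked[i] = line if line.endswith("\n") else line + "\n"

            # Dump the batch in shuffled order.
            for i in batch:
                out.write(picked[i])
```

With the numbers from the example, `shuffle_big_file("data.txt", "data_shuffled.txt", batch_size=5000)` on a 10K-line file would do one pass to count the lines and then two passes to write the output, one per batch.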

Streaming through a file sequentially is not costly, which is why the performance seems quite good to me.

It is really not that complicated, but I could not find it available anywhere…

Here is the link: https://github.com/YaYaB/shuffle-big-file

I hope it can be useful to some of you 🙂

Best,

YaYaB.

submitted by /u/YaYaBFr