[P] Shuffle big file
Several times, while dealing with huge files, I have struggled to shuffle them, for instance for training. Some methods suggest splitting the file into chunks and shuffling each chunk separately, but that is not a real shuffle (an element in bucket 0 will never end up in bucket 10, for instance).
A few months ago I made a library that does a real shuffle: it shuffles the list of line indices, then reads the original file as many times as necessary (with a given batch size) to write the lines out in the shuffled order.
For example, take a file with 10K lines and a batch_size of 5K:
– Shuffle index into index_shuffled
– Stream through the file and dump the lines for the first 5K entries of index_shuffled
– Stream through the file once again and dump the lines for the last 5K entries of index_shuffled
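The steps above can be sketched in Python roughly like this. It is a minimal illustration written for this post, not the library's actual code: the function name shuffle_big_file, its parameters, and the UTF-8 encoding are all assumptions made for the example.

```python
import random


def shuffle_big_file(src, dst, batch_size, seed=None):
    """Write the lines of src to dst in random order, holding at most
    batch_size lines in memory at a time (hypothetical helper)."""
    # First pass: count the lines so the index can be built.
    with open(src, "r", encoding="utf-8") as f:
        n_lines = sum(1 for _ in f)

    # Shuffle the line numbers, not the lines themselves.
    index_shuffled = list(range(n_lines))
    random.Random(seed).shuffle(index_shuffled)

    with open(dst, "w", encoding="utf-8") as out:
        # One streaming pass over src per batch of shuffled indices.
        for start in range(0, n_lines, batch_size):
            batch = index_shuffled[start:start + batch_size]
            # Map source line number -> position inside this output batch.
            slot = {line_no: pos for pos, line_no in enumerate(batch)}
            buffer = [None] * len(batch)
            with open(src, "r", encoding="utf-8") as f:
                for line_no, line in enumerate(f):
                    pos = slot.get(line_no)
                    if pos is not None:
                        # Drop the newline so a last line without one is handled too.
                        buffer[pos] = line.rstrip("\n")
            out.write("".join(text + "\n" for text in buffer))
```

With 10K lines and batch_size=5000 this makes one counting pass plus two streaming passes. Memory use is the list of line numbers plus one batch of lines, so batch_size is the knob that trades memory for extra passes.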
Streaming through a file sequentially is cheap, so even with one pass per batch (two passes in the example above) the performance seems quite good to me.
It is really not that complicated, but I could not find it available anywhere…
Here is the link: https://github.com/YaYaB/shuffle-big-file
I hope it can be useful to some of you 🙂