Best tool for large-scale image processing
In the early 2010s, I actively used Hadoop, Hive, and HBase for large-scale data processing. Since then, I've been somewhat out of the loop, apart from occasional use of Spark. I am now wondering what the best open source software would be for storing a very large image dataset (hundreds of terabytes, if not multiple petabytes) on commodity hardware. I am posting here because the goal is to run ML algorithms over subsets of the images in this dataset, so it would be desirable to execute ML code in situ (on the nodes that hold the data), if possible. For my purposes, it's also safe to assume that writes are fairly infrequent.