Some popular datasets, such as LAION-400M, are distributed in the webdataset compressed format. Fastdup supports the following compressed file formats: tar, tgz, tar.gz, and zip. These compressed files can be located in a local folder or in a remote S3 or MinIO path. For example, the LAION dataset is distributed as a large collection of tar files.
Each tar file is first downloaded into a /tmp/<tarname>/ folder and then extracted. For each compressed tar file, fastdup generates two output files: <tarname>features.dat for the binary feature vectors and <tarname>features.dat.csv for the file list.
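The naming convention above can be illustrated with a short sketch. The shard name "00000.tar" is made up, and the direct string concatenation is an assumption based on the <tarname>features.dat template; check the actual files fastdup writes into your work_dir.

```python
# Hypothetical illustration of the per-tar output naming convention.
# The shard name "00000.tar" is invented for this example.
tar_name = "00000.tar"

# Binary feature vectors for all images extracted from this tar
# (concatenation pattern assumed from the <tarname>features.dat template):
features_file = tar_name + "features.dat"

# Companion file listing which image each feature row belongs to:
file_list = features_file + ".csv"

print(features_file)  # 00000.tarfeatures.dat
print(file_list)      # 00000.tarfeatures.dat.csv
```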
Example output file for the tar above (the path is given via the work_dir command-line argument).
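Once a run finishes, the per-tar .csv file list can be inspected with standard tooling. The sketch below is ours, not part of fastdup, and the single-column-with-header layout is an assumption; inspect your own <tarname>features.dat.csv to confirm its columns.

```python
# Hedged sketch: reading a per-tar file list written by fastdup.
# Assumption: the .csv has a header row followed by one file name per
# row in the first column -- verify against your actual output.
import csv

def read_file_list(path: str) -> list[str]:
    """Return the image file names recorded in a features .csv file."""
    with open(path, newline="") as f:
        rows = list(csv.reader(f))
    # Skip the assumed header row; take the first column of each row.
    return [row[0] for row in rows[1:] if row]
```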
Note that the file-deletion behaviour is configurable. By default, both the downloaded tar files and the extracted images are deleted after the feature vectors are extracted. If you want to keep them locally (assuming you have enough disk space), you can run with:
turi_param='delete_tar=0,delete_img=0'

This keeps all the downloaded tars and images in the /tmp folder.

Running example. Assume you have the full dataset downloaded into s3://mybucket/myfolder, with 40,000 tar files in total. Further assume you want to run on 20 compute nodes to extract the features in parallel. In this case you can run:
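The actual command is not shown in this snippet. As a hedged sketch, one way to split the work is to give each node a contiguous slice of the tar list; the commented-out fastdup.run call and its min_offset/max_offset parameter names are assumptions about the API, not a confirmed signature, so verify them against your installed fastdup version.

```python
# Hedged sketch: dividing 40,000 tar files evenly across 20 compute
# nodes, each processing a contiguous slice of the tar list.
NUM_TARS = 40_000
NUM_NODES = 20
TARS_PER_NODE = NUM_TARS // NUM_NODES  # 2,000 tars per node

def node_range(node_id: int) -> tuple[int, int]:
    """Return the [start, end) tar-index slice for a given node."""
    start = node_id * TARS_PER_NODE
    return start, start + TARS_PER_NODE

# On node 7 you would then run something like (hypothetical call --
# parameter names are assumptions, check your fastdup version):
# import fastdup
# lo, hi = node_range(7)
# fastdup.run(input_dir="s3://mybucket/myfolder",
#             work_dir="/path/to/work_dir",
#             min_offset=lo, max_offset=hi,
#             turi_param="delete_tar=0,delete_img=0")

print(node_range(0))   # (0, 2000)
print(node_range(19))  # (38000, 40000)
```

Splitting by contiguous index ranges keeps each node's downloads independent, so no coordination between nodes is needed beyond assigning each one a distinct node_id.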