Organizing a dataset can be a hassle, especially when the data is constantly evolving. ArtiVC is well suited to organizing datasets and offers the following benefits:
- No need to transfer files whose content already exists. Even if you rename a file or copy it to a different folder, ArtiVC knows the content is the same. This matters because images and videos are often moved or kept unchanged as a dataset evolves.
- Version tagging. When the dataset reaches a stable version, we can tag that commit with a human-readable version name.
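The first benefit rests on identifying files by their content rather than by their path. The sketch below illustrates that idea only; it is not ArtiVC's implementation, and the choice of SHA-256 (via `sha256sum`), the label folders, and the file names are all assumptions made for the example.

```shell
# Illustration only: not ArtiVC's code. SHA-256 and all names are assumptions.
set -eu
work=$(mktemp -d)                      # scratch directory for the demo
mkdir -p "$work/v1/daisy" "$work/v2/sunflower"
printf 'fake image bytes\n' > "$work/v1/daisy/img001.jpg"

# Simulate a cleanup: the same file ends up renamed under another label.
cp "$work/v1/daisy/img001.jpg" "$work/v2/sunflower/flower-a.jpg"

# Hash only the content (read from stdin), so the path plays no part.
hash_v1=$(sha256sum < "$work/v1/daisy/img001.jpg" | cut -d' ' -f1)
hash_v2=$(sha256sum < "$work/v2/sunflower/flower-a.jpg" | cut -d' ' -f1)

if [ "$hash_v1" = "$hash_v2" ]; then
    echo "same content: nothing new to transfer"
else
    echo "different content: transfer needed"
fi
rm -rf "$work"
```

Because the two hashes match, a tool that stores files by content hash can skip the transfer even though the path changed.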
Here are the common steps to prepare a dataset:
- Create a dataset folder and use subfolders as image labels
- Initialize the workspace.
    avc init s3://mybucket/datasets/flowers-classification
- Push your first release
    avc push -m 'first version'
- Clean the dataset and move the misclassified data
- Push the dataset again
    # See what data will be pushed
    avc status
    # Push
    avc push -m 'my second version'
- If new versions have been pushed by others, sync the dataset with the remote
    # Check the difference
    avc pull --dry-run
    # Sync with remote
    avc pull
    # or use the delete mode
    # avc pull --delete --dry-run
    # avc pull --delete
- Tag the version and see the change
    avc push
    avc tag v0.1.0
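The folder layout from the first step, with subfolders serving as image labels, might look like the sketch below. The label names (`daisy`, `rose`, `tulip`), the placeholder files, and the temporary location are hypothetical; only the folder name `flowers-classification` comes from the commands above.

```shell
# Illustration only: label names, files, and location are hypothetical.
set -eu
root=$(mktemp -d)
dataset="$root/flowers-classification"
mkdir -p "$dataset/daisy" "$dataset/rose" "$dataset/tulip"

# Empty placeholder files stand in for real images.
: > "$dataset/daisy/img001.jpg"
: > "$dataset/daisy/img002.jpg"
: > "$dataset/rose/img003.jpg"

# The label of every file is simply the name of its parent folder.
summary=""
for dir in "$dataset"/*/; do
    label=$(basename "$dir")
    count=$(find "$dir" -type f | wc -l | tr -d ' ')
    summary="$summary$label=$count "
    echo "$label: $count file(s)"
done
rm -rf "$root"
```

With this layout, fixing a misclassified image is just a move between subfolders, which is the kind of change the cleaning step produces.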
Use the dataset on another machine:
    avc clone s3://mybucket/datasets/flowers-classification
    cd flowers-classification