Dataset Preparation

Organizing a dataset can be a hassle, especially when the data is constantly evolving. ArtiVC is well suited to organizing datasets and offers the following benefits:

No need to transfer files whose content already exists. Even if you rename a file or copy it to a different folder, ArtiVC recognizes it as the same content. This matters because images and videos are often moved or kept unchanged as a dataset evolves.
Version tagging. When a dataset reaches a stable version, you can tag that commit with a human-readable version name.
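
The first benefit follows from identifying files by their content rather than by their path. The sketch below illustrates only that general idea, not ArtiVC's internals: the file names are hypothetical, and `sha256sum` (GNU coreutils) is an assumption chosen for illustration, not necessarily the hash ArtiVC uses.

```shell
# Illustration of content-based identity (not ArtiVC's actual implementation).
# Assumes GNU coreutils `sha256sum`; file and folder names are hypothetical.
workdir=$(mktemp -d)
mkdir -p "$workdir/daisy" "$workdir/rose"
printf 'placeholder image bytes' > "$workdir/daisy/0001.jpg"

# Copy the file to another label folder under a different name.
cp "$workdir/daisy/0001.jpg" "$workdir/rose/renamed.jpg"

# Hash only the content; the path plays no part in the digest.
hash_original=$(sha256sum < "$workdir/daisy/0001.jpg" | cut -d' ' -f1)
hash_copy=$(sha256sum < "$workdir/rose/renamed.jpg" | cut -d' ' -f1)

if [ "$hash_original" = "$hash_copy" ]; then
  echo "same content, different path"
else
  echo "different content"
fi
```

Because both digests match, a tool that stores data by content digest can treat the renamed copy as data it already has.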

Prepare a dataset

Here are the common steps to prepare a dataset:

Create a dataset folder and use subfolders as image labels
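
As a rough sketch of this layout, the commands below create one subfolder per label. The label names (`daisy`, `rose`, `tulip`) and the image path in the comment are hypothetical; substitute your own classes.

```shell
# Hypothetical label names; replace them with your own classes.
dataset_dir="flowers-classification"
for label in daisy rose tulip; do
  mkdir -p "$dataset_dir/$label"
done

# Each image then goes into the folder named after its label, for example:
#   flowers-classification/daisy/0001.jpg
ls "$dataset_dir"
```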

Initialize the workspace.

```
avc init s3://mybucket/datasets/flowers-classification
```

Push your first release
```
avc push -m 'first version'
```
Clean the dataset and move any misclassified data to the correct label folder

Push the dataset again

```
# See what data will be pushed
avc status
# Push
avc push -m 'my second version'
```

If others have pushed new versions, sync your dataset with the remote

```
# Check the difference
avc pull --dry-run
# Sync with remote
avc pull
# or use the delete mode
# avc pull --delete --dry-run
# avc pull --delete
```

Tag the version
```
avc push
avc tag v0.1.0
```
Then view the change history
```
avc log
```

Clone the dataset

Use the dataset on another machine

```
avc clone s3://mybucket/datasets/flowers-classification
cd flowers-classification
```
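
After cloning, a quick sanity check is to count the files under each label subfolder. The sketch below is one possible way to do that. So that it runs on its own, it builds a small stand-in folder with hypothetical labels and empty files; in practice, point `dataset_dir` at the cloned folder instead.

```shell
# Stand-in for a cloned dataset; in practice set dataset_dir to the cloned folder.
# Label and file names are hypothetical.
dataset_dir=$(mktemp -d)
mkdir -p "$dataset_dir/daisy" "$dataset_dir/rose"
touch "$dataset_dir/daisy/0001.jpg" "$dataset_dir/daisy/0002.jpg" "$dataset_dir/rose/0001.jpg"

# Count the files under each label subfolder.
# The */ glob matches only non-hidden directories.
for label_dir in "$dataset_dir"/*/; do
  label=$(basename "$label_dir")
  count=$(find "$label_dir" -type f | wc -l | tr -d ' ')
  echo "$label: $count"
done
```

Comparing these counts between machines is a simple way to confirm that both hold the same version of the dataset.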