# DagsHub
## TODO
- DagsHub and dvc
- Kedro
- Mlflow
- Poetry
- pipreqs
- pyenv
## Command Quick Reference
```bash
dvc remote add origin s3://dagshub-hello
dvc remote list
dvc remote default main
dvc push
dvc pull
git status
dvc status
```
```bash
# start tracking a file or directory
dvc add data/data.xml
dvc list https://dagshub.com/codecypher/hello-world
dvc list https://dagshub.com/codecypher/hello-world data
# import file or directory
dvc import https://github.com/iterative/dataset-registry \
get-started/data.xml -o data/data.xml
```
The metadata file is a placeholder for the original data that can be easily versioned like source code with Git:
```bash
git add data/data.dvc data/.gitignore
git commit -m "Add raw data"
```
```bash
# show which branch is selected
git branch
# create a branch and check it out in one step
git checkout -b metrics main
git checkout -b new_feature_branch
# switch between branches
git checkout branchname
```
----------
## Create a Project on DagsHub
This part of the Get Started section focuses on the configuration process when creating a project on DagsHub [1].
We will cover how to create a DagsHub repository, connect it to your local computer, configure DVC, and set DagsHub storage as remote storage.
```bash
dvc init
git status
# Configure DagsHub as DVC Remote Storage
dvc remote add origin https://dagshub.com/codecypher/hello-world.dvc
dvc remote modify origin --local auth basic
dvc remote modify origin --local user codecypher
dvc remote modify origin --local password 8414ba848e08cbc1c3943a7c0dc0122b9afd7774
# checkpoint
cat .dvc/config.local
```
For more information about DagsHub storage, visit the reference page.
If you still want to set up your own cloud remote storage, please refer to our [setup external remote storage](https://dagshub.com/docs/integration_guide/set_up_remote_storage_for_data_and_models/) page.
```bash
# Version and push DVC Configurations
git status -s
git add .dvc .dvcignore .gitignore
git commit -m "Initialize DVC"
git push
```
## Version Code and Data
In the previous part of the Get Started section, we created and configured a DagsHub repository.
In this part, we will download and add a project to our local directory, track the files using DVC and Git, and push the files to the remotes.
```bash
# Fork and clone the hello-world repository.
git clone -b start-version-project https://dagshub.com/codecypher/hello-world.git
# Configure DVC locally and set DagsHub as remote storage
dvc remote add origin https://dagshub.com/codecypher/hello-world.dvc
dvc remote modify origin --local auth basic
dvc remote modify origin --local user codecypher
dvc remote modify origin --local password 8414ba848e08cbc1c3943a7c0dc0122b9afd7774
# checkpoint
cat .dvc/config.local
```
```bash
# Add a Project¶
dvc get https://dagshub.com/nirbarazida/hello-world-files requirements.txt
dvc get https://dagshub.com/nirbarazida/hello-world-files src
dvc get https://dagshub.com/nirbarazida/hello-world-files data/
# Install Requirements
pip3 install -r requirements.txt
# checkpoint
git status -s
```
### Track Files Using Git and DVC
At this point, we need to decide which files will be tracked by Git and which will be tracked by DVC.
We will start with files tracked by DVC since this action will generate new files tracked by Git.
```bash
# Track Files with DVC
dvc add data
# Track the changes with Git
git add data.dvc .gitignore
git commit -m "Add the data directory to DVC tracking"
```
```bash
# Track Files with Git
git status -s
git add requirements.txt src/
git commit -m "Add requirements and src to Git tracking"
```
### Push the Files to the Remotes
```bash
# Push DVC tracked files
dvc push -r origin
# Push Git tracked files
git push
```
### Process and Track Data Changes
Now, we would like to preprocess our data and track the results using DVC.
We will run the `data_preprocessing.py` file from our CLI.
```bash
python3 src/data_preprocessing.py
tree data
# checkpoint
git status
dvc status
```
```bash
# version the new status of the data directory with DVC
dvc add data
git add data.dvc
git commit -m "Process raw-data and save it to data directory"
# Push changes to remote
dvc push -r origin
git push
```
In this section, we covered the basic workflow of DVC and Git:
- We added the project files to the repository and tracked them using Git and DVC.
- We generated preprocessed data files and learned how to add these changes to DVC.
## Track Experiments
In the previous part of the Get Started section, we learned how to track and push files to DagsHub using Git and DVC.
This part covers how to track your Data Science Experiments and save their parameters and metrics.
We assume you have a project that you want to add experiment tracking to.
We will be showing an example based on the result of the last section, but you can adapt it to your project in a straightforward way.
```bash
# create branch
git branch start-track-experiments
# switch branches
git checkout start-track-experiments
```
### Add DagsHub Logger
DagsHub logger is a plain Python Logger for your metrics and parameters.
- The logger saves the information as human-readable files – CSV for metrics files and YAML for parameters.
- Once you push these files to your DagsHub repository, they will be automatically parsed and visualized in the Experiments Tab.
NOTE: Since DagsHub Experiments uses generic formats, you don't have to use DagsHub Logger. Instead, you can write your metrics and parameters into `metrics.csv` and `params.yml` files however you want and push them to your DagsHub repository where they will automatically be scanned and added to the experiment tab.
```bash
# install the python package
pip3 install dagshub
```
Now import dagshub to `modeling.py` module and track the Random Forest Classifier Hyperparameters and ROC AUC Score.
```bash
# checkpoint
git status -s
# Track and commit the changes with Git
git add src/modeling.py
git commit -m "Add DagsHub Logger to the modeling module"
```
### Create New Experiment
To create a new experiment, we need to update at least one of the two `metrics.csv` or `params.yml` files, track them using Git, and push them to the DagsHub repository.
After editing the `modeling.py` module, once we run its script it will generate those two files.
```bash
# Run the script
python3 src/modeling.py
git status -s
```
As we can see for the above output, two new files were created containing the current experiment's information.
The `metrics.csv` file has four fields:
- Name: the name of the Metric.
- Value: the value of the Metric.
- Timestamp: the time that the log was written.
- Step: the step number when logging multi-step metrics like loss.
The `params.yml` file holds the hyperparameters of the Random Forest Classifier.
```bash
# Commit and push the files to the DagsHub repository using Git
git add metrics.csv params.yml
git commit -m "New Experiment - Random Forest Classifier with basic processing"
git push
```
The two files were added to the repository and one experiment was created.
The information about the experiment is displayed under the Experiment Tab.
This part covers the Experiment Tracking workflow. We highly recommend reading the experiment tab documentation to explore the various features that it has to offer.
## Explore a New Hypothesis
In the previous part, we learned how to track the project's files using Git and DVC, and track the experiments using DagsHub.
This part covers the most common practice of Exploring a New Hypothesis.
We will learn how to examine a new approach to process the data, compare the results, and save the project's best result.
```bash
# switch branches
git checkout master
```
### Basic Theory
The Data Science field is research-driven and exploring different solutions to a problem is a core principle. When a project evolves or grows in complexity, we need to compare results and see what approaches are more promising than others. In this process, we need to make sure we don't lose track of the project's components or miss any information. Therefore, it is useful to have a well-defined workflow.
The common workflow of exploring a new approach is to create a new branch for it.
In the branch, we will change the code, modify the data and models, and track them using Git and DVC. We compare the new model's performances with the current model. This comparison can be a hassle when not using the proper tools to track and visualize the result.
We can use DagsHub to overcome these challenges:
- We will log the models' performances to readable formats and commit them to DagsHub.
- Using the Experiment Tab, we will easily compare the results and determine if the new approach was effective or not.
- We can either merge the code, data, and models to the main branch or return to the main branch and retrieve the data and models from the remote storage to continue to the next experiment.
### Create a New Branch
We are using the Enron data set that contains emails.
The emails are stored in a CSV file and labeled as 'Ham' or 'Spam'.
The current data processing method for the emails is to lower-case the characters and removes the string's punctuations.
We will try to reduce the processing time by not removing punctuations and see how it will affect the model's performance.
```bash
# Create new branch
git checkout -b data-with-punctuations
```
### Update the Processing Method
Change the code in the `data_preprocessing.py` module.
```bash
# Change the code in data_preprocessing.py
# Track and Commit the changes with Git
git add src/data_preprocessing.py
git commit -m "Change data processing method - will not remove the string's punctuations"
# Run the script
python3 src/data_preprocessing.py
# checkpoint
git status
dvc status
# Track and commit the changes using DVC and Git.
dvc add data
git add data.dvc
git commit -m "Processed the data and tracked it using DVC and Git"
# Push the code and data changes to the remotes
git push origin data-with-punctuations
git push -f origin data-with-punctuations
dvc push -r origin
```
### Run a new Experiment and Compare the Results
We have everything set to run our second Data Science Experiment! We will train a new model and log its performance using the DagsHub logger. Then, we will push the updated metrics.csv file to DagsHub and easily compare the results.
```bash
# Runs script
python3 src/modeling.py
# checkpoint
git status -s
dvc status
# Track the changes using Git and push the to the DagsHub repository
git add metrics.csv
git commit -m "Punctuations experiment results - update metrics.csv file"
git push origin data-with-punctuations
```
With DagsHub, we can easily compare the model's performance between the two experiments.
We can open the Experiment Tab in the DagsHub repository and compare the model's ROC AUC scores.
As we can see in the image above, the new data processing method did not provide better results so we will not use it.
### Retrieve Files
Our experiment resulted in worse performance and we want to retrieve the previous version. Now, we can reap the benefits of our workflow.
The best version of the project is always stored on the main branch.
hen concluding an experiment with insufficient impprovements, we simply need to check out the version we want (the master branch) and pull the remote storage files based on the .dvc pointers.
```bash
# Checkout to branch master using Git and pull the data files
# from the remote storage using DVC
git checkout master
dvc checkout
```
Congratulations - Now we are finished!
In the Get Started section, we covered the fundamentals of DagsHub usage:
- We started by creating a repository and configuring Git and DVC.
- We added project files to the repository using Git (for code and configuration files) and DVC (for data).
- We created our very first data science experiment using DagsHub logger to log metrics and parameters.
- We learned how to explore new approaches and retrieve another version's files.
## References
[1]: [Get Started](https://dagshub.com/docs/getting_started/overview/)
[2]: [Get Started: Data Versioning](https://dvc.org/doc/start/data-management)
[3]: [Setup Remote Storage for Data and Models](https://dagshub.com/docs/integration_guide/set_up_remote_storage_for_data_and_models/)