# Kedro
Kedro is an open-source Python framework for creating reproducible, maintainable and modular data science code [4].
Kedro applies software engineering best-practices to machine learning code including modularity, separation of concerns and versioning [4].
## Commands
```zsh
kedro new --telemetry=no --starter=spaceflights-pandas
source ~/pyenv/coreml/bin/activate # activate virtual environment
```
## Overview
In a data science project, various coding components can be thought of as a flow of data: data flow from the source, to feature engineering, to modelling, to evaluation, etc. [1].
This flow of data is made more complex with training, evaluation, and scoring pipelines where the flow for each pipeline can be potentially very different [1].
`Kedro` is a Python framework to help structure code into a modular data pipeline [1].
Kedro allows reproducible and easy (one-line commands) running of different pipelines and even ad-hoc rerunning of a small portion of a pipeline [1].
- Kedro helps to accelerate data pipelining, enhance data science prototyping, and promote pipeline reproducibility [2].
- Kedro applies software engineering concepts to developing production-ready machine learning code to reduce the time and effort needed for successful model deployment [2]
- Kedro virtually eliminates re-engineering work from low-quality code and standardization of project templates for seamless collaborations [2].
The articles in [1] and [2] discuss the components and terminologies used in Kedro including Python examples on how to setup, configure, and run Kedro pipelines.
## What is Kedro?
Kedro is an open-source Python framework for creating reproducible, maintainable and modular data science code.
Here are some of the key concepts applied within Kedro [2]:
- Reproducibility: Ability to recreate the steps of a workflow across different pipeline runs and environments accurately and consistently.
- Modularity: Breaking down large code chunks into smaller, self-contained, and understandable units that are easy to test and modify.
- Maintainability: Use of standard code templates that allow teammates to readily comprehend and maintain the setup of any project, thereby promoting a standardized approach to collaborative development
- Versioning: Precise tracking of the data, configuration, and machine learning model used in each pipeline run.
- Documentation: Clear and structured information for easy understanding
Seamless Packaging: Allowing data science projects to be documented and shipped efficiently into production (with tools like Airflow or Docker).
## Why Kedro?
The path of bringing a data science project from pilot development to production is fraught with challenges [2]:
- Code that needs to be rewritten for production environments which leads to project delays.
- Disorganized project structures that make collaboration challenging.
- Data flow that is hard to trace.
- Functions that are too leong and difficult to test or reuse.
- Relationships between functions that are hard to understand.
## Kedro Concepts
Kedro is the first open-source software tool developed by McKinsey and is recently donated to Linux Foundation. It is a Python framework for creating reproducible, maintainable, and modular codes.
Kedro combines the best practices of software engineering with the world of data science
Here are the core components of Kedro [1]:
- Node: Function wrapper which wraps input to function, the function itself, and function output together (defines what codes should run)
- Pipeline: Link nodes together which resolves dependencies and determines the execution order of functions (defines what order the codes should be run)
- DataCatalog: Wrapper for data which links the input and output names specified in node to a file path
- Runner: Object that determines how the pipeline (code) is run such as sequentially or in parallel.
## Kedro Docs
Here is an outline of some helpful topics covered in the Kedro documentation [4]:
- IDE Support: Setup Visual Studio Code
- Create: Kedro starters
- Create: Kedro Tools
- Configure: Parameters
- Configure: Credentials
### Getting Started
- Concepts
- Glossary
- Kedro architecture
### Data Catalog
- Introduction
- Kedro Data Catalog
- Data Catalog YAML examples
- Lazy loading
### Develop
- Logging
- Debugging
### Integration & Plugins
- MLflow
- DVC
- PySpark
### Introduction to the Data Catalog
- The basics of catalog.yml
- Dataset access credentials
- Dataset versioning
- Use the Data Catalog within Kedro configuration
### Data Catalog YAML examples
This page contains examples of the YAML configuration file provided in `conf/base/catalog.yml` `or conf/local/catalog.yml`.
## References
[1]: [Kedro as a Data Pipeline in 10 Minutes](https://towardsdatascience.com/kedro-as-a-data-pipeline-in-10-minutes-21c1a7c6bbb)
[2]: [Build an Anomaly Detection Pipeline with Isolation Forest and Kedro](https://towardsdatascience.com/build-an-anomaly-detection-pipeline-with-isolation-forest-and-kedro-db5f4437bfab)
[3]: [Kedro — A Python Framework for Reproducible Data Science Project](https://towardsdatascience.com/kedro-a-python-framework-for-reproducible-data-science-project-4d44977d4f04)
[4]: [Kedro concepts](https://docs.kedro.org/en/stable/getting-started/kedro_concepts/)
-----
[Level Up Your MLOps Journey with Kedro](https://towardsdatascience.com/level-up-your-mlops-journey-with-kedro-5f000e5d0aa0)
[How to perform anomaly detection with the Isolation Forest algorithm](https://towardsdatascience.com/how-to-perform-anomaly-detection-with-the-isolation-forest-algorithm-e8c8372520bc?gi=9b318130c70a)