Dask is a free, open-source, flexible library for parallel computing in Python. It lets you work on larger-than-memory datasets and can dramatically speed up your computations. Dask is composed of two parts:
- Dynamic task scheduling optimized for computation. This is similar to Airflow, Luigi, Celery, or Make, but optimized for interactive computational workloads.
- "Big Data" collections like parallel arrays, dataframes, and lists that extend common interfaces like NumPy, pandas, or Python iterators to larger-than-memory or distributed environments. These parallel collections run on top of the dynamic task schedulers.
Familiar Interface
Python has a rich ecosystem of data science libraries, including NumPy for arrays, pandas for dataframes, Xarray for N-dimensional data, and scikit-learn for machine learning. Dask mirrors the APIs of those libraries, so familiar code scales with minimal changes.
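As a minimal sketch of that familiar interface (assuming only that the `dask` package is installed; the array size is illustrative), a Dask array is used almost exactly like a NumPy array:

```python
import dask.array as da

# A Dask array mirrors the NumPy API, but splits the data into
# chunks that can be processed in parallel.
x = da.ones((10_000, 10_000), chunks=(1_000, 1_000))

# Operations build a lazy task graph; .compute() executes it.
total = x.sum().compute()
print(total)  # 100000000.0
```

The only Dask-specific pieces here are the `chunks` argument and the final `.compute()` call; everything else reads like ordinary NumPy.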
Flexible
Sometimes your data doesn't fit neatly into a dataframe or an array, or maybe you have already written a whole piece of your pipeline and just want to make it faster. That's what dask.delayed is for: wrap any function with the dask.delayed decorator and its calls become lazy tasks that Dask can run in parallel.
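A minimal sketch of the dask.delayed pattern (the `inc` and `add` functions are hypothetical placeholders for your own pipeline steps):

```python
from dask import delayed

@delayed
def inc(x):
    return x + 1

@delayed
def add(a, b):
    return a + b

# Calling the decorated functions builds a task graph; nothing runs yet.
a = inc(1)
b = inc(2)
total = add(a, b)

# compute() executes the graph; the two inc() calls are independent,
# so Dask can run them in parallel.
print(total.compute())  # 5
```

Because graph construction is cheap, you can wrap an existing pipeline this way without restructuring it.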
Python-Native
Dask is written in Python and runs Python, so you can write, troubleshoot, and profile entirely in Python while keeping access to the full PyData stack.
Fast
Scale up to clusters with thousands of cores, or scale down to a laptop running a single process. Dask simplifies the big data workflow, and its excellent single-machine performance speeds up prototyping and leads to faster model deployment.
What makes up Dask
The Dask framework has a number of valuable components, including:
Data Collections
Dask's data collections expose API calls similar to popular types like pandas DataFrames and NumPy ndarrays, but on the backend the data is partitioned and distributed across workers. This lets users get the speed and memory benefits of distributed computing while keeping the programming style they are used to.
Task Scheduling
Dask can take a set of tasks that would normally run sequentially and distribute them across a collection of workers. If parts of your code could be computed concurrently, Dask makes it easy to parallelize them.
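A small sketch of that scheduling behavior (`slow_square` is a hypothetical stand-in for an expensive task; the sleep simulates real work):

```python
import time
import dask

@dask.delayed
def slow_square(x):
    time.sleep(0.1)  # stand-in for an expensive computation
    return x * x

# Each call only records a task; nothing has run yet.
tasks = [slow_square(i) for i in range(4)]

# dask.compute executes the independent tasks concurrently
# (by default on a local thread pool) rather than one after another.
results = dask.compute(*tasks)
print(results)  # (0, 1, 4, 9)
```

With the default threaded scheduler the four sleeps overlap, so the wall-clock time is close to one task's duration rather than the sum of all four.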
Machine Learning Support
Dask can be used in many ways for machine learning. The Dask-ML library provides distributed support for tasks like cross-validation and hyperparameter tuning, and libraries like XGBoost and LightGBM can be run directly on a Dask cluster of workers.
Blog post: What is Dask?
Dask is more than just a framework! This article addresses what makes Dask special and then explains in more detail how Dask works.
Other Dask resources
- Dask.org - the official documentation for Dask
- Should I Use Dask? - Saturn Cloud blog post on deciding whether to use Dask
- Snowflake and Dask - Saturn Cloud blog post discussing using Snowflake with Dask
- Dask: Parallelize Everything - Medium page on the value of Dask