Dask is a free, open-source, flexible library for parallel computing in Python. It lets you work on larger-than-memory datasets and can dramatically speed up your computations. Dask is composed of two parts:
- Dynamic task scheduling optimized for computation. This is similar to Airflow, Luigi, Celery, or Make, but optimized for interactive computational workloads.
- "Big Data" collections like parallel arrays, dataframes, and lists that extend common interfaces like NumPy, pandas, or Python iterators to larger-than-memory or distributed environments. These parallel collections run on top of the dynamic task schedulers.
Familiar Interface
Python has a rich ecosystem of data science libraries, including NumPy for arrays, pandas for dataframes, Xarray for N-dimensional data, and scikit-learn for machine learning. Dask mirrors the APIs of those libraries, so familiar code scales with minimal changes.
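As a minimal sketch of that familiar interface (assuming only that the `dask` package is installed; the array size is illustrative), a Dask array is used almost exactly like a NumPy array:

```python
import dask.array as da

# A Dask array mirrors the NumPy API, but splits the data into
# chunks that can be processed in parallel.
x = da.ones((10_000, 10_000), chunks=(1_000, 1_000))

# Operations build a lazy task graph; .compute() executes it.
total = x.sum().compute()
print(total)  # 100000000.0
```

The only Dask-specific pieces here are the `chunks` argument and the final `.compute()` call; everything else reads like ordinary NumPy.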
Flexible
Sometimes your data doesn't fit neatly into a dataframe or an array, or maybe you have already written a whole piece of your pipeline and just want to make it faster. That's what dask.delayed is for: wrap any function with the dask.delayed decorator and its calls become lazy tasks that Dask can run in parallel.
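A minimal sketch of the dask.delayed pattern (the `inc` and `add` functions are hypothetical placeholders for your own pipeline steps):

```python
from dask import delayed

@delayed
def inc(x):
    return x + 1

@delayed
def add(a, b):
    return a + b

# Calling the decorated functions builds a task graph; nothing runs yet.
a = inc(1)
b = inc(2)
total = add(a, b)

# compute() executes the graph; the two inc() calls are independent,
# so Dask can run them in parallel.
print(total.compute())  # 5
```

Because graph construction is cheap, you can wrap an existing pipeline this way without restructuring it.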
Python-Native
Dask is written in Python and runs Python, so you can write, troubleshoot, and profile entirely in Python while keeping access to the full PyData stack.
Fast
Scale up to clusters with thousands of cores, or scale down to a laptop running a single process. Dask simplifies the big data workflow, and its excellent single-machine performance speeds up prototyping and leads to faster model deployment.
What makes up Dask
The Dask framework has a number of valuable components, including:
Data Collections
Dask's data collections expose API calls similar to popular types like pandas DataFrames and NumPy ndarrays, but on the backend the data is partitioned and distributed across workers. This lets users get the speed and memory benefits of distributed computing while keeping the programming style they are used to.
Task Scheduling
Dask can take a set of tasks that would normally run sequentially and distribute them across a collection of workers. If parts of your code could be computed concurrently, Dask makes it easy to parallelize them.
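A small sketch of that scheduling behavior (`slow_square` is a hypothetical stand-in for an expensive task; the sleep simulates real work):

```python
import time
import dask

@dask.delayed
def slow_square(x):
    time.sleep(0.1)  # stand-in for an expensive computation
    return x * x

# Each call only records a task; nothing has run yet.
tasks = [slow_square(i) for i in range(4)]

# dask.compute executes the independent tasks concurrently
# (by default on a local thread pool) rather than one after another.
results = dask.compute(*tasks)
print(results)  # (0, 1, 4, 9)
```

With the default threaded scheduler the four sleeps overlap, so the wall-clock time is close to one task's duration rather than the sum of all four.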
Machine Learning Support
Dask can be used in many ways for machine learning. The Dask-ML library provides distributed support for tasks like cross-validation and hyperparameter tuning, and libraries like XGBoost and LightGBM can be run directly on a Dask cluster of workers.
Blog post: What is Dask?
Dask is more than just a framework! This article addresses what makes Dask special and then explains in more detail how Dask works.
Other Dask resources
- Dask.org - the official documentation for Dask
- Should I Use Dask? - Saturn Cloud blog post on deciding whether to use Dask
- Snowflake and Dask - Saturn Cloud blog post discussing using Snowflake with Dask
- Dask: Parallelize Everything - Medium page on the value of Dask