3 Ways to Schedule and Execute Python Jobs
Why would anyone want 3 ways to schedule and execute Python jobs? For many reasons! In particular, moving from one-off tasks that generate some value for your business to reusable, automated tasks that produce sustained value can be a game-changer for companies. This holds true whether those tasks are ETL, machine learning, or other functions entirely.
For example, training a model one time and predicting on one test sample might be academically interesting. But to really make this into a value proposition, you want to use this model to predict on data routinely and return those results in a usable form. In order to get to that point, you need a job scheduling solution.
However, choosing the way to automate and schedule those tasks deserves close attention. Choose a solution too bare-bones, and you’ll rapidly outgrow the tool and find yourself needing to switch again fast, wasting time and energy. Choose a tool too complex, and your users won’t want to attack the learning curve to use it effectively, making the tool really not useful at all.
In this article, we’re going to discuss the considerations that matter when choosing a scheduling solution for your Python workflows, and talk about which tools can meet different teams' needs and use cases.
Criteria
For ease of reading, we’ll evaluate the solutions using the following scorecard:
- Startup speed/ease: Starting from knowing nothing, how fast can you get a job scheduled, and how many headaches will you have?
- Error handling: If your scheduled job fails, how well does the system manage that? Does it notify you, retry, etc?
- Parallelization: Can you run many jobs in parallel in this system? Can you create task dependencies, so that jobs are run in the right order?
- Backfill: If a job fails, can the system go back and fill the missed runs? Can you ask the system to run jobs for dates in the past?
- Logging/transparency: Does the system keep detailed, easy to read logs? Does it include graphs or monitoring of the runtime of jobs?
- Ease of use: Day to day, when you’re using this system, how easy is it to get things done? How easy is it to onboard new team members to use?
- Support: Can you get human support for this system, if you’re willing to pay for it? Is there anyone to call if things go wrong?
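To make the error handling and backfill criteria concrete, here is a minimal plain-Python sketch of what a scheduler does for you in those two areas. It is not taken from any of the tools below; the function names `run_with_retries` and `missed_runs` are hypothetical, and a daily schedule is assumed.

```python
import time
from datetime import date, timedelta


def run_with_retries(job, max_retries=3, delay_seconds=0.0):
    """Call job() until it succeeds or the retry budget is used up."""
    last_error = None
    # One initial attempt plus up to max_retries further attempts.
    for attempt in range(1, max_retries + 2):
        try:
            return job()
        except Exception as exc:  # a real scheduler would also notify someone here
            last_error = exc
            if attempt <= max_retries:
                time.sleep(delay_seconds)
    raise RuntimeError(f"job failed after {max_retries} retries") from last_error


def missed_runs(start, end, completed):
    """Return every daily run date in [start, end] that is not in completed."""
    days = (end - start).days + 1
    all_dates = [start + timedelta(days=i) for i in range(days)]
    return [d for d in all_dates if d not in completed]
```

A backfill is then just a loop over `missed_runs(...)` that calls the job once per missing date. Tools that score well on these criteria handle both behaviors for you, usually through configuration rather than hand-written code.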
1. Cron
Cron is simply a command-line tool that many developers and data scientists will be quite familiar with. You’ll create a crontab file, which lists the jobs you wish to run along with the schedule on which each one should run.
Cron has no GUI at all, keeps no automatic logs of jobs run, and does not respond to errors or run failures. However, it’s extremely quick to start up because a handful of statements to the command line are all you need to get up and running. Because of this ease of entry, lots of users will start with cron when they first find a need for a scheduled job.
However, the first time a business-critical job fails silently in cron and causes major headaches for core functions in your company, you’ll find yourself starting to think that perhaps a more fully-featured tool is needed.
Startup Process
crontab -e
Add lines like this example to the resulting file for each job you wish to run.
01 12 * * * /usr/bin/somedirectory/somejob.py
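This would run at 12:01 every day. Save the file and cron will begin running the job on that schedule. Because the line invokes the script directly, the script needs to be executable and start with a shebang line pointing at your Python interpreter. Remember that errors will be silent and logs will not be kept unless you build logging into your Python script itself.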
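Since cron keeps no logs of its own, one way to build logging into the script is sketched below. This is a minimal sketch using only the standard library; `run_job`, `main`, and the log file name `somejob.log` are hypothetical placeholders, not part of cron.

```python
#!/usr/bin/env python3
import logging


def run_job():
    """Placeholder for the real work (hypothetical)."""
    return "done"


def main(job=run_job, log_path="somejob.log"):
    """Run the job, log success or the full traceback, and return an exit code."""
    logger = logging.getLogger("somejob")
    logger.setLevel(logging.INFO)
    handler = logging.FileHandler(log_path)
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    try:
        result = job()
        logger.info("job finished: %s", result)
        return 0
    except Exception:
        # logger.exception records the message plus the traceback
        logger.exception("job failed")
        return 1
    finally:
        logger.removeHandler(handler)
        handler.close()

# In the real script, finish with: sys.exit(main())
# so that a failure is also visible as a non-zero exit status.
```

With this in place, every run leaves a timestamped line in the log file, and a failure leaves a traceback instead of disappearing silently. Cron resolves relative paths from the job's working directory, so an absolute `log_path` is usually the safer choice.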
Scorecard
- Startup speed/ease: very fast and minimal infrastructure
- Error handling: minimal
- Parallelization: none
- Backfill: none
- Logging/transparency: none, must be manually built
- Ease of use: for simple implementations, easy. Complex needs probably not feasible to meet.
- Support: none
Helpful links: https://crontab.guru/
2. Airflow
Once cron’s limits start to bite, many users look toward Airflow. Airflow is an extremely full-featured, flexible, and robust tool that allows tremendously complex scheduling of jobs. It supports substantial parallelization of tasks by organizing them into what are called Directed Acyclic Graphs, or “DAGs”. This enables very precise management of task dependencies, such as requiring certain tasks to complete before others begin.
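The idea behind a DAG can be shown without Airflow at all. The sketch below is not Airflow's API; it uses Python's standard-library `graphlib` (Python 3.9+) and a hypothetical five-task pipeline to show how dependencies determine which tasks must wait and which could run side by side.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical pipeline: each key lists the tasks that must finish before it starts.
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "train_model": {"transform"},
    "build_report": {"transform"},
    "notify": {"train_model", "build_report"},
}


def execution_batches(deps):
    """Group tasks into batches; tasks in the same batch could run in parallel."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises CycleError if the graph contains a cycle
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # tasks whose dependencies are all done
        batches.append(ready)
        sorter.done(*ready)
    return batches


print(execution_batches(dependencies))
# [['extract'], ['transform'], ['build_report', 'train_model'], ['notify']]
```

Here `train_model` and `build_report` land in the same batch because neither depends on the other, which is exactly the kind of parallelism a DAG-based scheduler can exploit.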
Airflow also has very broad logging functionalities, error handling and retries, and backfill options. However, there’s one major drawback – Airflow is, frankly, quite hard to get started using.
If you speak to data scientists in the field who are not deeply experienced as developers, the reputation of Airflow is “it’s really powerful, but it’s so hard to use” – and this reputation is not inaccurate. Once a user spends significant time learning how Airflow works, incredible productivity can be achieved. But the time to that productivity from startup with Airflow is a major cost.
A thorough understanding of object-oriented programming is key to getting the most out of Airflow. Airflow is also infamous for XCom, the system it uses to pass data between tasks in a single job. XCom is unfortunately very challenging to learn, even for experienced users, and can cause many headaches in the development of otherwise straightforward jobs.
Startup Process
Borrowed from the official documentation, here: https://airflow.apache.org/docs/stable/start.html
# airflow needs a home, ~/airflow is the default,
# but you can lay foundation somewhere else if you prefer
# (optional)
export AIRFLOW_HOME=~/airflow
# install from pypi using pip
pip install apache-airflow
# initialize the database
# (this is the Airflow 1.x command; Airflow 2.x renamed it to `airflow db init`)
airflow initdb
# start the web server, default port is 8080
airflow webserver -p 8080
# start the scheduler
airflow scheduler
# visit localhost:8080 in the browser and enable the example dag in the home page
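Notice that this sets up a SQLite database for your Airflow instance to use. This default setup runs tasks one at a time, so to get any parallelization you’ll need to configure a different database backend (such as PostgreSQL or MySQL) along with an executor that supports parallel tasks.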
Also, note that you’ll need to write and schedule your DAGs after these steps. This tutorial is useful for learning the basics of that, but it gets very complex very fast.
https://airflow.apache.org/docs/stable/tutorial.html
For those who want to learn more about Airflow’s inner workings, I have given a talk that you might find helpful: https://github.com/skirmer/airflow_plus_redshift
Scorecard
- Startup speed/ease: difficult to get started, may require some specialized skill
- Error handling: very robust
- Parallelization: very robust
- Logging/transparency: very robust
- Backfill: very robust
- Ease of use: tough, steep learning curve for new users
- Support: none except GitHub project issues; it is an open source product.
Helpful links:
https://airflow.apache.org/docs/stable/start.html
https://airflow.apache.org/docs/stable/tutorial.html
https://github.com/skirmer/airflow_plus_redshift
3. Prefect
So, given these two choices, what’s a non-developer data scientist to do? If you need a solution that can get up and running and productive quickly, but will also provide many of the features allowing resilient pipelines and jobs, Prefect might be a good solution to explore.
One notable advantage Prefect offers is a paid, supported cloud offering, unlike Airflow. So, if you want to get up to a robust multi-user installation fast and with customer support, Prefect can offer that. You can even test the cloud offering for free with one user and 10 read-only users before making any purchase.
Startup Process
pip install prefect
prefect backend server
prefect agent start
prefect server start
Pop over to http://localhost:8080/ and see your GUI!
At this point the user can create a job in a Python file, which can essentially be a set of Python functions decorated with @task(log_stdout=True).
Example:
from prefect import task

@task(log_stdout=True)  # log_stdout=True sends print() output to the Prefect logs
def say_hi():
    print("Hello World!")
In the same script, you define the “Flow” and register it with Prefect so that it appears in the GUI.
Example:
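from prefect import Flow

with Flow("Greeting") as flow:
    say_hi()  # calling the task inside the context adds it to the flow
flow.register(project_name="Demo")  # register the flow under the "Demo" project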
After this step, the job will appear in the GUI ready to run.
Scorecard
- Startup speed/ease: a single-machine instance is quite easy and fast to start, and requires only Python familiarity
- Error handling: very robust
- Parallelization: very robust
- Logging/transparency: very robust
- Backfill: good
- Ease of use: good, new users can onboard quickly
- Support: Yes, with cloud product subscription
Helpful links:
https://docs.prefect.io/core/getting_started/first-steps.html
Conclusions
Users with very complex needs and/or software development experience may find that Airflow is the ideal tool for their use cases, while users who only need to run one or two occasional jobs that are not business-critical may want to stick with cron. But for a wide swath of users, especially data scientists, Prefect offers the right middle ground, combining rich functionality with ease of use. If you’re interested in learning more about Prefect, and how easily you can deploy and schedule models on the Saturn Cloud platform with it, give our free service a go.