How to use GitHub Actions to run cron jobs

Nikhil Das Nomula

GitHub Actions and data engineering

GitHub Actions is a great tool for CI. However, we recently had a data engineering use case where the client had no cloud infrastructure in place but needed to move data from a source to a destination, and wanted that to run on a schedule (cron). In this article we will go over why and how we ended up using GitHub Actions and its ability to run cron schedules to address this particular use case.

We wrote a Python script to achieve the data engineering task, and we had to think about what would work best for our client to automate it.

These were the options:

  1. Run the script on instances from one of the major cloud providers: AWS EC2, Google Compute Engine VM instances, or Azure VMs.
  2. Utilize Heroku or fly.io to run the Python script on.
  3. Utilize an orchestrator like Apache Airflow or Prefect.
  4. Use GitHub Actions.

The reason we chose GitHub Actions is simplicity for this use case.

If we had chosen either of the first two options, we would have had to set up GitHub connectivity and handle environment variables somewhere other than GitHub, where the script already resides. Apache Airflow and Prefect are overkill for what we are trying to achieve.

From a cost perspective, GitHub Actions is free within GitHub's usage limits. There is a caveat that scheduled runs can be delayed or dropped when load on GitHub's runners is high, but that was not an issue for us in this particular use case.

The best part is that when we transition this to the client, they have just one technology to manage instead of several.

Let's get into the how. The GitHub Actions workflow is pretty simple; here is what it looks like.

As you can see in the workflow below, we set up the cron schedule and pass secrets to the script as environment variables, while the standard python-dotenv package lets the same script read a local .env file during development, so secrets are never in your code. This takes care of the major concerns and provides a simple solution. That said, this approach is not suited for every need; which option to choose depends on multiple factors.

name: run <path-to-your-python-file>.py

on:
  workflow_dispatch: # This sets up a manual trigger that comes in pretty handy for testing
  schedule:
    - cron: '*/10 * * * *' # Every 10 minutes
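    # Note: GitHub Actions evaluates cron expressions in UTC; the five fields
    # are minute, hour, day of month, month, and day of week.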

jobs:
  build:
    runs-on: ubuntu-latest
    steps:

      - name: checkout repo content
        uses: actions/checkout@v3 # check out the repository content onto the runner

      - name: setup python
        uses: actions/setup-python@v4
        with:
          python-version: '3.12' # set up Python; you can change the version here
      
      - name: Upgrade pip
        run: |
          python -m pip install --upgrade pip wheel

      - name: Install dependencies
        run: |
          pip install python-dotenv==1.0.1 # The script imports this to handle sensitive values; add any other dependencies your application needs here

      - name: Run the script
        env:
          ENV_VAR1: ${{ secrets.ENV_VAR1 }} # Secrets created in GitHub are exposed to the script here as environment variables, taking care of security
        run: |
          python src/<path-to-your-python-file>.py
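
For completeness, here is a minimal sketch of what the script side could look like. This is an assumption rather than our client's actual script: the ENV_VAR1 name simply mirrors the placeholder in the workflow above.

import os

from dotenv import load_dotenv  # installed by the workflow via pip

# Locally, python-dotenv reads a .env file of KEY=VALUE lines (keep it out of
# version control). In GitHub Actions there is no .env file; the value comes
# from the env: block in the workflow, and load_dotenv() never overrides
# environment variables that are already set.
load_dotenv()


def main():
    env_var1 = os.environ["ENV_VAR1"]  # fails fast if the secret is missing
    # ... connect to the source and move the data to the destination ...
    print("Configuration loaded, starting the data move")


if __name__ == "__main__":
    main()

For local testing you would put ENV_VAR1=<value> in a .env file; in the repository you create the secret under Settings → Secrets and variables → Actions, or with the GitHub CLI (gh secret set ENV_VAR1).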

If you have any questions, feel free to reach out to us at nikhil.nomula@yajur.tech
