Google Cloud Platform Automation using Airflow DAG

In this post I will show you how you can automate a process in GCP using an Airflow DAG. In my last post I showed you how to write and test Python or PySpark code in GCP. Let’s say you have written a Python script to process some data, and now you want to execute that script every day at 11 am and save the output to a folder. By the end of this post you will be able to do exactly that.

Keeping code ready

Let’s say you have written a Python script named “hello_GCP.py”. Now you need to upload that script to a GCP storage bucket. If you don’t know how to upload a file to a GCP storage bucket, please refer to my previous post.
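The exact contents of hello_GCP.py are up to you. As a reference, a minimal PySpark script along the following lines would do; the output path is a placeholder, so point it at a folder in your own bucket.

# hello_GCP.py -- a minimal placeholder PySpark job
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hello_GCP").getOrCreate()

# Build a tiny DataFrame so the job has something to process
df = spark.createDataFrame([("hello", 1), ("GCP", 2)], ["word", "count"])

# Save the output to a folder in your storage bucket (placeholder path)
df.write.mode("overwrite").csv("gs://your-bucket/output/hello_GCP")

spark.stop()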

Write a DAG script

To automate a process in Google Cloud Platform using Airflow DAGs, you must write a DAG (Directed Acyclic Graph) script, because Airflow only understands DAGs. A DAG script is just a Python script in which you declare the tasks to execute and the order in which they should run. Now let’s write a simple DAG.

import airflow, pendulum
from airflow import DAG
from airflow.contrib.operators.dataproc_operator import DataprocClusterCreateOperator, DataProcPySparkOperator, DataprocClusterDeleteOperator
from airflow.operators.bash_operator import BashOperator
from airflow.utils.trigger_rule import TriggerRule
from datetime import datetime

default_args = {
    'owner': 'Airflow',
    'start_date': datetime(2021, 1, 28, 10, 15),
    'depends_on_past': False,
    'retries': 0
}

dag = DAG(
    dag_id='Data_Processing_1',
    description='Sample code to explain GCP automation using Airflow DAG',
    schedule_interval='@daily',   # how often the workflow runs
    catchup=False,
    default_args=default_args,
)

show_date = BashOperator(
    task_id='print_date',
    bash_command='date',          # print the current date
    dag=dag)

data_processing_test = DataProcPySparkOperator(
    dag=dag,
    task_id='data_processing_test',        # change the task_id according to your task
    main='some_folder/code/hello_GCP.py',  # your data processing code with its location (usually a gs:// path)
    cluster_name='cluster-1',              # change to the name of your Dataproc cluster
    region='us-central1',                  # change to the region of that cluster
)

show_date >> data_processing_test

Note: In the main argument of DataProcPySparkOperator you need to give the correct location of your data processing code, and schedule_interval='@daily' is what makes the DAG run every day.
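If you want the exact schedule from the beginning of this post (every day at 11 am), schedule_interval also accepts a cron expression instead of the @daily preset; for example:

# Run the DAG every day at 11:00 (Airflow evaluates this in UTC unless you configure a timezone)
schedule_interval="0 11 * * *",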


Steps to write an Airflow DAG script

So how do you write an Airflow DAG script? You just need to remember 5 important steps:

  1. Import packages: import all the Python dependencies required for the workflow, just like in any other Python code (the import lines at the top of the script).
  2. DAG default arguments: define DAG-specific arguments. This is just a Python dictionary containing the arguments that apply to all the tasks in your workflow (the default_args dictionary). To know more about all kinds of arguments, visit the official page.
  3. Instantiate a DAG: give the DAG a name and configure the schedule interval and other DAG settings (the dag = DAG(...) block).
  4. Tasks: write all the tasks for your Airflow workflow. I have written two tasks:
    1. show_date: prints the current date
    2. data_processing_test: executes our main data processing code
  5. Setting up dependencies: set the order in which the tasks should be executed. I have ordered the tasks so that show_date is executed first and then data_processing_test (the last line of the script); see the short sketch after this list for equivalent ways to write this.
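The bitshift style used in the script (show_date >> data_processing_test) is only one of several equivalent ways Airflow lets you express the same ordering; a quick sketch:

# Equivalent ways to say "run show_date first, then data_processing_test"
show_date >> data_processing_test                # bitshift syntax, as used in the script above
data_processing_test << show_date                # the same dependency written the other way around
show_date.set_downstream(data_processing_test)   # method syntax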

Upload DAG script

Once you have written your Airflow DAG code, you need to upload it into the DAGs folder of your GCP Composer environment. To do that, go to Composer, click on the DAGs folder of your environment, and upload the DAG script there.

(Screenshot: uploading the DAG script into the GCP Composer DAGs folder)
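If you prefer to upload the script programmatically rather than through the console, a minimal sketch using the google-cloud-storage client library would look like this; the bucket name and file names below are placeholders, and the real bucket name is the one behind your environment’s DAGs folder link.

# Sketch: upload the DAG script into the Composer environment's DAGs bucket
from google.cloud import storage

client = storage.Client()

# Placeholder bucket name -- copy the real one from your Composer environment's DAGs folder link
bucket = client.bucket("us-central1-my-composer-env-bucket")

# DAG files must land inside the dags/ folder of that bucket
blob = bucket.blob("dags/data_processing_dag.py")
blob.upload_from_filename("data_processing_dag.py")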

Once you have uploaded the DAG code to Composer, a DAG will be created in Airflow after a few minutes. The name of the DAG will be your dag_id: Data_Processing_1. If you click on it, a new window will open.

(Screenshot: viewing Airflow DAGs in the Google Cloud Platform Airflow UI)
  • To view the DAG code, click on Code.
  • To run the DAG, click on Trigger DAG and make sure the DAG is switched on (the toggle at the top left).
  • To check whether your DAG succeeded or failed, click on Graph View or Tree View.

Conclusion

This is the last part of my GCP tutorial series. In this tutorial I explained how you can automate a process in Google Cloud Platform using Airflow DAGs. You can also read my previous articles.

If you have any question or suggestion regarding this topic, see you in the comment section. I will try my best to answer.
