This post is part 2 of the Google Cloud Platform tutorial series. In my last post, I gave a basic overview of Google Cloud Platform and explained some of its important components. In this post, I will show you how to use those components in your day-to-day work in Google Cloud Platform.
By reading this article you will know:
- How to access Storage in GCP
- How to run Python code in GCP
- How to run Pyspark code in GCP
- How to automate the whole process in Google Cloud Platform (Data Pipeline in GCP)
How to create or select a project in GCP
To start working in GCP, the first thing you need to do is create or select a project. When you signed up for your Google Cloud account, it automatically enabled billing and created your first project, named “My First Project”.
To select an existing project or create a new one in Google Cloud Platform, click the project drop-down section.
Once you click that project drop-down, you will see a list of the projects you have already created (or, if you have not created any yet, only the automatically created “My First Project”).
From that list you can select any project you want to work on, or you can create a new project in GCP.
To create a new project in GCP, click NEW PROJECT.
Then provide a project name and a location (the organization or folder the project belongs to).
There is a warning that says “You have 12 projects remaining in your quota.” This means I can create 12 more projects with this account. When Google Cloud allocates resources to customers, it assigns quotas (such as the number of projects you can create) based on their previous usage. You can always request an increase to this quota to create more projects in Google Cloud Platform.
Access Storage in GCP
After selecting a project, you need to create a storage bucket for it (think of a bucket as a top-level folder). In this storage bucket you can upload data, code, etc. Without any data or code you cannot do anything, right?
To create a storage bucket, click the Navigation menu, scroll down, and click Cloud Storage in the Storage section (or simply search for “Cloud Storage”). The Cloud Storage page will open.
Now on the Cloud Storage page, click CREATE BUCKET -> name your bucket (folder name) -> CREATE.
You can also create folders inside these buckets, just like on your personal computer.
To upload a file into a bucket (or a folder under it), just click UPLOAD FILES. This file can be anything: data, code, a trained machine learning model, etc.
To access an uploaded file from Python or PySpark code, you need the gsutil path of that file. If you click on a particular file, you can find its gsutil path.
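The gsutil path always follows the same gs:// pattern, so you can also build it yourself in code. Here is a minimal sketch; the bucket and file names are made up, and the helper function is illustrative, not part of any GCP library:

```python
# The gsutil path of an object always follows the pattern:
#   gs://<bucket-name>/<path/inside/bucket>
def gsutil_path(bucket, blob_name):
    """Build the gs:// path for a file stored in a Cloud Storage bucket."""
    return f"gs://{bucket}/{blob_name.lstrip('/')}"

path = gsutil_path("my-first-bucket", "data/sales.csv")
print(path)  # gs://my-first-bucket/data/sales.csv

# With the pandas and gcsfs packages installed, you could read the file
# directly from that path:
# import pandas as pd
# df = pd.read_csv(path)
```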
How to run Python code in Google Cloud Platform
Once you have uploaded a data file to Google Cloud Storage, the next step is to write Python code to do something with that data. To write Python code in Google Cloud Platform, just go to Google Cloud Shell, type python, and press Enter (just like in cmd). It will open the Python interpreter in GCP, where you can write and test your code. To install any Python package, similar to cmd, you just type pip install package_name.
Note: By default this opens the Python 2.7 interpreter. If you want to work with Python 3, type python3 and press Enter.
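As a quick sanity check, here is a small standard-library snippet you could paste into that interpreter. The sample data is made up and stands in for a CSV file you might have uploaded to your bucket:

```python
import csv
import io

# Made-up sample standing in for an uploaded CSV file.
sample = "name,amount\nalice,10\nbob,20\n"

rows = list(csv.DictReader(io.StringIO(sample)))
total = sum(int(row["amount"]) for row in rows)
print(total)  # 30
```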
How to run Pyspark code in Google Cloud Platform
When you are working with big data, plain Python code will not be enough. To process data in parallel, you should use PySpark.
As you know from my previous post, for anything related to big data you need to knock on the door of Dataproc in Google Cloud Platform.
Now click Navigation menu -> select Dataproc from the Big Data section.
Now on the Dataproc page, select a cluster. If there is no cluster, you need to create one (it is quite simple; you just need to provide the correct region).
Once you click on a cluster, on the new page click VM INSTANCES -> click the SSH arrow on the right -> click Open in browser window. This opens a terminal session on the cluster.
How to automate the process in GCP
At this point you know how to access files from Google Cloud Storage and how to write Python or PySpark code to analyze or process that data. Now let’s say you are processing data with PySpark or Python code and saving the processed data to a storage location.
Now suppose you want to do the same data processing (or any other task, such as a machine learning or artificial intelligence job) at the same time every day. In that case you need to run your Python or PySpark code on that schedule. This is called a data pipeline in GCP.
Airflow in Google Cloud Platform can help you do this. You need to write DAG (Directed Acyclic Graph) code in Python for the data pipeline. I will write a separate article about data pipelines and explain how to write DAG code there.
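To give a flavor of what such a DAG looks like, here is a minimal sketch (all names are illustrative; it assumes an Airflow environment, such as Cloud Composer in GCP):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def process_data():
    # Your daily Python/PySpark processing would be triggered here.
    print("processing data...")


with DAG(
    dag_id="daily_processing",        # illustrative name
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",       # run at the same time every day
    catchup=False,
):
    PythonOperator(task_id="process", python_callable=process_data)
```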
If you have any questions or suggestions about this topic, see you in the comments section. I will try my best to answer.