Data warehouse using Bigquery: Basic ELT
This tutorial will help you create a data warehouse on Google Cloud and teach you how to build a basic ELT process in the cloud.
What is Google cloud?
GCP is a public cloud vendor, like its competitors Amazon Web Services (AWS) and Microsoft Azure. With GCP and other cloud vendors, customers can access compute resources housed in Google's data centers around the world for free or on a pay-per-use basis.
What is a data warehouse?
Data Warehousing integrates data and information collected from various sources into one comprehensive database.
What is ELT?
ELT stands for Extract, Load, Transform: data is extracted from a source system, loaded into a destination system, and transformed inside the destination system.
What is BigQuery?
BigQuery is Google’s fully managed, serverless data warehouse that enables scalable analysis over petabytes of data.
Enough of theory, let's dive straight into the practical part.
Creating a project
1. On the top bar, click on Create New Project.
2. Give the project a name; you may also edit your project ID if you wish to.
3. Now, from the project menu, click on your project.
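If you prefer the command line, a project can also be created from Cloud Shell. This is a minimal sketch, assuming the example project ID shorya-gcp-data-end (project IDs must be globally unique):
# Create a new project and make it the active one (project ID is an example).
gcloud projects create shorya-gcp-data-end --name="shorya-gcp-data-end"
gcloud config set project shorya-gcp-data-end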
Pre-requisites to build a data warehouse
1. Access Cloud Shell by clicking on the Cloud Shell icon in the top right corner of the console.
2. Check the current setup using the below command.
gcloud info
Click Authorize if prompted.
What we are checking here is that your account info and project name are displayed on the screen; make sure they are correct.
3. Next, we have to configure a new environment using the gcloud init command. It will display a lot of options; choose "create a new configuration". Give it a configuration name, for example 'shorya-config', and choose the configuration to use your existing email ID. At the end, choose your project number.
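If you would rather skip the interactive prompts, roughly the same setup can be done non-interactively. A sketch, where the configuration name, email, and project ID are examples:
# Create and activate a named configuration, then point it at your account and project.
gcloud config configurations create shorya-config
gcloud config set account your-email@example.com
gcloud config set project shorya-gcp-data-end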
Let's upload the dataset to a GCS bucket
1. In the navigation menu, click Cloud Storage.
2. Click on Create and give your bucket a name, e.g. "shorya-gcp-data-end-data-bucket".
3. Create a folder structure in the bucket like dataWarehousing/datasets, and under the datasets folder upload the folders from the link below in the same manner reflected in the image.
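The same bucket and folder structure can also be created from Cloud Shell. A sketch, assuming the bucket name above and a local datasets folder downloaded from the link:
# Create the bucket in a single region and upload the dataset folders under dataWarehousing/.
gsutil mb -l us-central1 gs://shorya-gcp-data-end-data-bucket
gsutil cp -r ./datasets gs://shorya-gcp-data-end-data-bucket/dataWarehousing/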
Let's finally develop our data warehouse
User story
As a regional manager, I want to know the top two region IDs, ordered by the total capacity of the stations in those regions. I also want to download the answers to my questions as a CSV file to my local computer. The data source is the stations table, which is located in a CloudSQL-MySQL database, and there will be more data sources in the future.
The ELT process that we will follow is: extract from MySQL, use GCS as a staging layer, load from GCS to BigQuery, and finally create a BigQuery data mart.
Create a MySQL database in Cloud SQL
We can create a Cloud SQL instance from the GCP console, but for simplicity let's create it using Cloud Shell.
gcloud sql instances create mysql-instance-source \
  --database-version=MYSQL_5_7 \
  --tier=db-g1-small --region=us-central1 \
  --root-password=shorya123 --availability-type=zonal --storage-size=10GB \
  --storage-type=HDD
Connect to the MySQL instance using the below command:
gcloud sql connect mysql-instance-source --user=root
Type your password; for me, it is shorya123.
CREATE DATABASE apps_db;
SHOW DATABASES;
CREATE TABLE apps_db.stations(
station_id varchar(255),
name varchar(255),
region_id varchar(10),
capacity integer);
Now we want to import data using a CSV file:
1. In the Cloud SQL console, click on the created instance and then click on the Import button.
2. Change the file format option to CSV.
3. Input the destination database, apps_db, and the table name, stations. (A gcloud alternative is sketched after this list.)
4. Now, in the shell, check the inserted data using the below command:
SELECT * FROM apps_db.stations LIMIT 10;
5. Exit the MySQL shell:
exit;
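For reference, the console import above can also be done with gcloud. A sketch; the GCS path to stations.csv is an example and should point at the file you uploaded:
# Import the CSV from GCS into the stations table of apps_db.
gcloud sql import csv mysql-instance-source \
  gs://shorya-gcp-data-end-data-bucket/dataWarehousing/datasets/stations/stations.csv \
  --database=apps_db \
  --table=stations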
Extract — Load — Transform
Extract data from MySQL to GCS
We first need to assign the Cloud SQL service account the Storage Object Admin role.
1. From the Cloud SQL console, go to your MySQL instance and click on it, then scroll down until you find the service account, and copy the service account into a text editor.
2. Now, to add a new role to the copied service account, go to the navigation menu, click on IAM & Admin -> IAM, click Grant Access, and paste the Cloud SQL service account into New principals.
3. Select the Storage Object Admin role and save.
After this process, your Cloud SQL service account will have permission to write and delete file objects in all GCS buckets in your project.
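The same role can be granted from Cloud Shell. A sketch; replace the member with the Cloud SQL service account you copied and the project ID with yours:
# Grant the Cloud SQL service account the Storage Object Admin role at project level.
gcloud projects add-iam-policy-binding shorya-gcp-data-end \
  --member="serviceAccount:[your-cloudsql-service-account]" \
  --role="roles/storage.objectAdmin"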
Now let's create a shell script to export the MySQL query results to a CSV file.
Type the below command in the shell:
vi gcloud_export_cloudsql_to_gcs.sh
Paste the below code and change your bucket name:
# Name of the GCS bucket to export to (replace with your bucket name).
bucket_name=[your-bucket-name]

# Export stations with station_id <= 200 into the 20180101 directory.
gcloud sql export csv mysql-instance-source \
  gs://$bucket_name/mysql_export/stations/20180101/stations.csv \
  --database=apps_db \
  --offload \
  --query='SELECT * FROM stations WHERE station_id <= 200;'

# Export stations with station_id <= 400 into the 20180102 directory.
gcloud sql export csv mysql-instance-source \
  gs://$bucket_name/mysql_export/stations/20180102/stations.csv \
  --database=apps_db \
  --offload \
  --query='SELECT * FROM stations WHERE station_id <= 400;'
sh gcloud_export_cloudsql_to_gcs.sh
The script will export the stations table twice; each export is stored in a different directory.
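As an optional check, you can list the exported files; substitute your bucket name:
# Both dated directories should contain a stations.csv file.
gsutil ls -r gs://[your-bucket-name]/mysql_export/stations/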
Now, because Cloud SQL is a paid service, we may delete the MySQL instance by running the below command.
gcloud sql instances delete mysql-instance-source
Load GCS to BigQuery
1. Move to the BigQuery console and create a new dataset, raw_bikesharing.
2. Under raw_bikesharing, click on the + Create table icon and fill it in as shown below.
3. Under Schema, click on Edit as text, write the below line, and click on Create table.
station_id:STRING,name:STRING,region_id:STRING,capacity:INTEGER
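If you prefer the command line, steps 1 to 3 can be done with the bq tool instead of the console. A sketch, assuming the dataset, table, schema, and GCS path used above:
# Create the dataset, then load one of the exported CSV files into a new stations table.
bq mk --dataset raw_bikesharing
bq load --source_format=CSV \
  raw_bikesharing.stations \
  gs://[your-bucket-name]/mysql_export/stations/20180101/stations.csv \
  station_id:STRING,name:STRING,region_id:STRING,capacity:INTEGER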
Create a BigQuery Data Mart
1. Let's create a new dataset with the name dm_regional_manager.
2. Create a view for our business problem.
CREATE VIEW `shorya-gcp-data-end.dm_regional_manager.top_2_region_by_capacity` AS
SELECT region_id, SUM(capacity) AS total_capacity
FROM `shorya-gcp-data-end.raw_bikesharing.stations`
WHERE region_id != ''
GROUP BY region_id
ORDER BY total_capacity DESC
LIMIT 2;
You can access your view by using the below query:
SELECT * FROM `dm_regional_manager.top_2_region_by_capacity`;
After checking your results, click on Save Results and choose CSV (local file).
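Alternatively, the result can be pulled straight to a local CSV file from Cloud Shell. A sketch; the output file name is an example:
# Query the view and write the result as CSV to a local file.
bq query --use_legacy_sql=false --format=csv \
  'SELECT * FROM `dm_regional_manager.top_2_region_by_capacity`' > top_2_regions.csv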
Woohoo!!! We have completed our most basic ELT process on Google Cloud.
Follow me on LinkedIn: https://www.linkedin.com/in/shorya-sharma-b94161121