Introduction to BigQuery

May 31, 2020 3 minutes read

BigData • Cloud

GCP • BigData

In this post, I will talk about Google’s BigQuery service for big data analysis.

Overview

BigQuery is a Serverless, highly scalable, cost-effective, enterprise-grade modern data warehouse offering on Google Cloud Platform. It allows analysts to use ANSI SQL to analyze petabytes of data at fast speed with no operational overhead.

Key Features

Features
Serverless	No-Ops model. Google manages all the resource provisioning behind-the-scene.
Fast SQL	Supports ANSI SQL with sub-second query response time and high concurrency.
Managed Storage	Once the data is loaded to BigQuery, it is stored & managed by BigQuery in effective way.
Data Encryption & Security	Data encrypted at rest and integrated with Cloud IAM for security.
BigQuery ML	Enables data scientists and data analysts to build, train and test ML models inside BigQuery using the SQL syntax.
BigQuery GIS	Enables location intelligence by allowing analysis and visualization of geospatial data inside BigQuery.
Flexible pricing model	On Demand and Flat rate pricing. For latest pricing model, please refer official documentation

For latest full list, please refer official documentation

How to access BigQuery?

There are multiple ways to interact with BigQuery:

Interacting with BigQuery using bq

Prerequisites

This post assumes the following:
1. We already have a GCP Project and BigQuery API enabled.
2. Google Cloud SDK (gcloud). If you don’t have, then refer to my previous blog - Getting started with Google Cloud SDK.

BigQuery Operations

Create a dataset
```
bq mk bq_dataset
```

List datasets

bq ls

Output:

datasetId
--------------
bq_dataset

Create table in a dataset

bq mk \
--table \
--expiration 3600 \
--description "This is my BQ table" \
--label env:dev \
bq_dataset.first_table \
col1:STRING,col2:FLOAT,col3:STRING

Examine a BigQuery table Note: I will examine a table from public dataset.

bq show bigquery-public-data:covid19_jhu_csse.summary

Output

Table bigquery-public-data:covid19_jhu_csse.summary

Last modified              Schema              Total Rows   Total Bytes   Expiration   Time Partitioning   Clustered Fields      Labels
----------------- ----------------------------- ------------ ------------- ------------ ------------------- ------------------ --------------
07 Jun 10:06:41   |- province_state: string     254940       41005062                                                          freebqcovid:
                |- country_region: string
                |- date: date
                |- latitude: float
                |- longitude: float
                |- location_geom: geography
                |- confirmed: integer
                |- deaths: integer
                |- recovered: integer
                |- active: integer
                |- fips: string
                |- admin2: string
                |- combined_key: string

Running a query

bq query --use_legacy_sql=false \
'SELECT
date,
country_region,
SUM(confirmed),
SUM(deaths)
FROM
`bigquery-public-data.covid19_jhu_csse.summary`
GROUP BY
date,
country_region
HAVING date = "2020-05-31"
AND
country_region IN ("India", "US")'

Output

Waiting on bqjob_r2935edb2e19bc9f_000001728e32aded_1 ... (0s) Current status: DONE
+------------+----------------+---------+--------+
|    date    | country_region |   f0_   |  f1_   |
+------------+----------------+---------+--------+
| 2020-05-31 | US             | 1790172 | 104381 |
| 2020-05-31 | India          |  190609 |   5408 |
+------------+----------------+---------+--------+

Cleanup: Delete the dataset
```
bq rm -r bq_dataset
```

Hope this blog helps you get familiar with BigQuery.

If you have a feedback or questions, please reach out to me on LinkedIn or Twitter