Introduction to BigQuery
In this post, I will talk about Google’s BigQuery service for big data analysis.
Overview
BigQuery is a Serverless, highly scalable, cost-effective, enterprise-grade modern data warehouse offering on Google Cloud Platform. It allows analysts to use ANSI SQL to analyze petabytes of data at fast speed with no operational overhead.
Key Features
Features | |
---|---|
Serverless | No-Ops model. Google manages all the resource provisioning behind-the-scene. |
Fast SQL | Supports ANSI SQL with sub-second query response time and high concurrency. |
Managed Storage | Once the data is loaded to BigQuery, it is stored & managed by BigQuery in effective way. |
Data Encryption & Security | Data encrypted at rest and integrated with Cloud IAM for security. |
BigQuery ML | Enables data scientists and data analysts to build, train and test ML models inside BigQuery using the SQL syntax. |
BigQuery GIS | Enables location intelligence by allowing analysis and visualization of geospatial data inside BigQuery. |
Flexible pricing model | On Demand and Flat rate pricing. For latest pricing model, please refer official documentation |
For latest full list, please refer official documentation
How to access BigQuery?
There are multiple ways to interact with BigQuery:
Interacting with BigQuery using bq
Prerequisites
This post assumes the following:
1. We already have a GCP Project and BigQuery API enabled.
2. Google Cloud SDK (gcloud
). If you don’t have, then refer to my previous blog - Getting started with Google Cloud SDK.
BigQuery Operations
Create a dataset
bq mk bq_dataset
List datasets
bq ls
Output:
datasetId -------------- bq_dataset
Create table in a dataset
bq mk \ --table \ --expiration 3600 \ --description "This is my BQ table" \ --label env:dev \ bq_dataset.first_table \ col1:STRING,col2:FLOAT,col3:STRING
Examine a BigQuery table Note: I will examine a table from public dataset.
bq show bigquery-public-data:covid19_jhu_csse.summary
Output
Table bigquery-public-data:covid19_jhu_csse.summary Last modified Schema Total Rows Total Bytes Expiration Time Partitioning Clustered Fields Labels ----------------- ----------------------------- ------------ ------------- ------------ ------------------- ------------------ -------------- 07 Jun 10:06:41 |- province_state: string 254940 41005062 freebqcovid: |- country_region: string |- date: date |- latitude: float |- longitude: float |- location_geom: geography |- confirmed: integer |- deaths: integer |- recovered: integer |- active: integer |- fips: string |- admin2: string |- combined_key: string
Running a query
bq query --use_legacy_sql=false \ 'SELECT date, country_region, SUM(confirmed), SUM(deaths) FROM `bigquery-public-data.covid19_jhu_csse.summary` GROUP BY date, country_region HAVING date = "2020-05-31" AND country_region IN ("India", "US")'
Output
Waiting on bqjob_r2935edb2e19bc9f_000001728e32aded_1 ... (0s) Current status: DONE +------------+----------------+---------+--------+ | date | country_region | f0_ | f1_ | +------------+----------------+---------+--------+ | 2020-05-31 | US | 1790172 | 104381 | | 2020-05-31 | India | 190609 | 5408 | +------------+----------------+---------+--------+
Cleanup: Delete the dataset
bq rm -r bq_dataset
Hope this blog helps you get familiar with BigQuery.
If you have a feedback or questions, please reach out to me on LinkedIn or Twitter