Introduction to BigQuery
In this post, I will talk about Google’s BigQuery service for big data analysis.
Overview
BigQuery is a Serverless, highly scalable, cost-effective, enterprise-grade modern data warehouse offering on Google Cloud Platform. It allows analysts to use ANSI SQL to analyze petabytes of data at fast speed with no operational overhead.
Key Features
| Features | |
|---|---|
| Serverless | No-Ops model. Google manages all the resource provisioning behind-the-scene. |
| Fast SQL | Supports ANSI SQL with sub-second query response time and high concurrency. |
| Managed Storage | Once the data is loaded to BigQuery, it is stored & managed by BigQuery in effective way. |
| Data Encryption & Security | Data encrypted at rest and integrated with Cloud IAM for security. |
| BigQuery ML | Enables data scientists and data analysts to build, train and test ML models inside BigQuery using the SQL syntax. |
| BigQuery GIS | Enables location intelligence by allowing analysis and visualization of geospatial data inside BigQuery. |
| Flexible pricing model | On Demand and Flat rate pricing. For latest pricing model, please refer official documentation |
For latest full list, please refer official documentation
How to access BigQuery?
There are multiple ways to interact with BigQuery:
Interacting with BigQuery using bq
Prerequisites
This post assumes the following:
1. We already have a GCP Project and BigQuery API enabled.
2. Google Cloud SDK (gcloud). If you don’t have, then refer to my previous blog - Getting started with Google Cloud SDK.
BigQuery Operations
Create a dataset
bq mk bq_datasetList datasets
bq lsOutput:
datasetId -------------- bq_datasetCreate table in a dataset
bq mk \ --table \ --expiration 3600 \ --description "This is my BQ table" \ --label env:dev \ bq_dataset.first_table \ col1:STRING,col2:FLOAT,col3:STRINGExamine a BigQuery table Note: I will examine a table from public dataset.
bq show bigquery-public-data:covid19_jhu_csse.summaryOutput
Table bigquery-public-data:covid19_jhu_csse.summary Last modified Schema Total Rows Total Bytes Expiration Time Partitioning Clustered Fields Labels ----------------- ----------------------------- ------------ ------------- ------------ ------------------- ------------------ -------------- 07 Jun 10:06:41 |- province_state: string 254940 41005062 freebqcovid: |- country_region: string |- date: date |- latitude: float |- longitude: float |- location_geom: geography |- confirmed: integer |- deaths: integer |- recovered: integer |- active: integer |- fips: string |- admin2: string |- combined_key: stringRunning a query
bq query --use_legacy_sql=false \ 'SELECT date, country_region, SUM(confirmed), SUM(deaths) FROM `bigquery-public-data.covid19_jhu_csse.summary` GROUP BY date, country_region HAVING date = "2020-05-31" AND country_region IN ("India", "US")'Output
Waiting on bqjob_r2935edb2e19bc9f_000001728e32aded_1 ... (0s) Current status: DONE +------------+----------------+---------+--------+ | date | country_region | f0_ | f1_ | +------------+----------------+---------+--------+ | 2020-05-31 | US | 1790172 | 104381 | | 2020-05-31 | India | 190609 | 5408 | +------------+----------------+---------+--------+Cleanup: Delete the dataset
bq rm -r bq_dataset
Hope this blog helps you get familiar with BigQuery.
If you have a feedback or questions, please reach out to me on LinkedIn or Twitter