Verifying your connection
In this tutorial we will show how to connect to Dimensions on Google BigQuery using Python, so that we can then run a few sample queries.
Note
This tutorial is intended for people who want to query BigQuery via a notebook, but the SQL queries in this lab can also be run directly from the BigQuery console.
Prerequisites
In order to run this tutorial, please ensure that:
- You have a valid Dimensions on Google BigQuery account and have configured a Google Cloud project.
- You have some basic familiarity with Python and Jupyter notebooks.
(This tutorial is based on a Jupyter notebook that is available directly via GitHub.)
Connection methods
There are a few options available:
- Use Google Colaboratory and your personal credentials. This option is the simplest of all, as it doesn't require you to install anything on your computer. It is normally ok for small to mid-sized projects that can live in the cloud.
- Use a local Jupyter environment and your personal credentials. This option requires you to install the Google Cloud SDK in order to authenticate. It is the best option if you want to work locally and/or have other Python libraries or services that you need to access.
- Use a local Jupyter environment and a service account. This option is really a variant of option 2, for users who must use a service account.
NOTE All of these options require you to first set up a GCP project (if you haven't already done so) and provide your project ID, e.g.:
MY_PROJECT_ID = "my-cool-gbq-project"
Option 1: using Google Colaboratory and your personal credentials
Google Colaboratory is a free cloud-based Jupyter environment from Google. It is an easy way to get started with notebooks.
Using your Google Account you can create notebooks, execute BigQuery queries and share these with other Google Accounts quickly and easily.
# authentication happens via your browser
from google.colab import auth
auth.authenticate_user()
print('Authenticated')
MY_PROJECT_ID = "my-cool-gbq-project"
from google.cloud import bigquery
client = bigquery.Client(project=MY_PROJECT_ID)
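To verify that the connection works, you can run a trivial query; this assumes your account and project are set up correctly:

# quick sanity check: run a trivial query and print the single resulting row
print(list(client.query("SELECT 1 AS ok").result()))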
Option 2: using a local Jupyter and your personal credentials
A Google Account represents a developer, an administrator, or any other person who interacts with Google Cloud. This is normally the Google account one has used to get access to the Dimensions on BigQuery product.
In order to configure programmatic access for local development, the easiest way is to authenticate using the Google Cloud SDK.
$ gcloud auth application-default login
Note: the command above should be run from a terminal or console. It generates a JSON file that is used as the application default credentials for the account selected during the login process. When you instantiate the default Client from a Google-provided package (such as BigQuery), it automatically authenticates using these credentials.
# install python client library
!pip install google-cloud-bigquery -U --quiet
from google.cloud import bigquery
MY_PROJECT_ID = "my-cool-gbq-project"
client = bigquery.Client(project=MY_PROJECT_ID)
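If you want to double-check which credentials and project were picked up, the google.auth package (installed alongside the client library) provides a helper. A minimal sketch:

import google.auth

# resolve the application default credentials and the project they map to
credentials, detected_project = google.auth.default()
print(detected_project)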
Option 3: using a local Jupyter and a service account
A service account is a special kind of account used by an application or a virtual machine (VM) instance, not a person.
Each service account is associated with two sets of public/private RSA key pairs that are used to authenticate to Google: Google-managed keys, and user-managed keys.
When using a service account, you just have to point your client object to a key file.
from google.cloud import bigquery
credentials_file = 'my-awesome-gbq-project-47616836.json'
MY_PROJECT_ID = "my-cool-gbq-project"
# Explicitly use service account credentials by specifying the private key file
client = bigquery.Client.from_service_account_json(credentials_file, project=MY_PROJECT_ID)
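Alternatively, you can build the credentials object explicitly via the google.oauth2 package and pass it to the client; this is equivalent, and makes the credentials reusable with other Google Cloud clients. A sketch, reusing the same (hypothetical) key file name:

from google.cloud import bigquery
from google.oauth2 import service_account

# load the service account key file into an explicit credentials object
credentials = service_account.Credentials.from_service_account_file(credentials_file)
client = bigquery.Client(credentials=credentials, project=MY_PROJECT_ID)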
Running queries
Once the connection is set up, all you have to do is type in a SQL query and run it using the client object.
# Query: Top publications from Oxford univ. by Altmetric Score in 2020
query_1 = """
SELECT
id,
title.preferred as title,
ARRAY_LENGTH(authors) as authors_count,
CAST(altmetrics.score as INT64) as altmetric_score
FROM
`dimensions-ai.data_analytics.publications`
WHERE
year = 2020 AND 'grid.4991.5' in UNNEST(research_orgs)
ORDER BY
altmetric_score DESC
LIMIT 5"""
# 1 - main syntax
query_job = client.query(query_1)
results = query_job.result() # Waits for job to complete.
for row in results:
    print("> {} : {}\n\tAuthors: {}\n\tAltmetric Score: {}".format(row.id, row.title, row.authors_count, row.altmetric_score))
> pub.1129493369 : Safety and immunogenicity of the ChAdOx1 nCoV-19 vaccine against SARS-CoV-2: a preliminary report of a phase 1/2, single-blind, randomised controlled trial
Authors: 366
Altmetric Score: 15451
> pub.1130340155 : Two metres or one: what is the evidence for physical distancing in covid-19?
Authors: 6
Altmetric Score: 15125
> pub.1127239818 : Remdesivir in adults with severe COVID-19: a randomised, double-blind, placebo-controlled, multicentre trial
Authors: 46
Altmetric Score: 12675
> pub.1131721397 : Scientific consensus on the COVID-19 pandemic: we need to act now
Authors: 31
Altmetric Score: 10192
> pub.1126016857 : Quantifying SARS-CoV-2 transmission suggests epidemic control with digital contact tracing
Authors: 9
Altmetric Score: 8320
A slightly different syntax is also possible:
# 2 - omit calling result()
query_job = client.query(query_1)
for row in query_job:
    print(row)
Row(('pub.1129493369', 'Safety and immunogenicity of the ChAdOx1 nCoV-19 vaccine against SARS-CoV-2: a preliminary report of a phase 1/2, single-blind, randomised controlled trial', 366, 15451), {'id': 0, 'title': 1, 'authors_count': 2, 'altmetric_score': 3})
Row(('pub.1130340155', 'Two metres or one: what is the evidence for physical distancing in covid-19?', 6, 15125), {'id': 0, 'title': 1, 'authors_count': 2, 'altmetric_score': 3})
Row(('pub.1127239818', 'Remdesivir in adults with severe COVID-19: a randomised, double-blind, placebo-controlled, multicentre trial', 46, 12675), {'id': 0, 'title': 1, 'authors_count': 2, 'altmetric_score': 3})
Row(('pub.1131721397', 'Scientific consensus on the COVID-19 pandemic: we need to act now', 31, 10192), {'id': 0, 'title': 1, 'authors_count': 2, 'altmetric_score': 3})
Row(('pub.1126016857', 'Quantifying SARS-CoV-2 transmission suggests epidemic control with digital contact tracing', 9, 8320), {'id': 0, 'title': 1, 'authors_count': 2, 'altmetric_score': 3})
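Note that the Row objects support both attribute-style and key-style access to column values, which is handy when column names are computed dynamically. For example:

# fetch the first row and read the same column in two equivalent ways
first_row = next(iter(client.query(query_1).result()))
print(first_row.altmetric_score)     # attribute access
print(first_row["altmetric_score"])  # key access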
Another handy feature is the ability to transform results directly into pandas DataFrames:
# 3 - return a dataframe
df = client.query(query_1).to_dataframe()
df
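The result is a regular pandas DataFrame, so all the usual methods apply; for example (the CSV file name is just illustrative):

df.head()  # inspect the first rows
df.to_csv("oxford_top_pubs_2020.csv", index=False)  # export to CSV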
Advanced: BigQuery magic command and dynamic parameters
The Google BigQuery library comes with a magic command that is essentially a nice shortcut method for running queries.
This extension needs to be loaded separately, e.g.:
%load_ext google.cloud.bigquery
We can then set up a couple of query parameters for the query itself, as well as the usual project ID value.
project_id = MY_PROJECT_ID
bq_params = {}
bq_params["journal_id"] = "jour.1115214"
Finally, we can run the query by starting a cell with the %%bigquery command:
%%bigquery --params $bq_params --project $project_id
# Publications per year for Nature Biotechnology
SELECT
count(*) as pubs, year, journal.title
FROM
`dimensions-ai.data_analytics.publications`
WHERE
year >= 2010
AND journal.id = @journal_id
GROUP BY
year, journal.title
ORDER BY
year DESC
Query complete after 0.02s: 100%|██████████| 1/1 [00:00<00:00, 699.28query/s]
Downloading: 100%|██████████| 11/11 [00:02<00:00, 4.31rows/s]
| Row | pubs | year | title |
|-----|------|------|-------|
| 1 | 438 | 2020 | Nature Biotechnology |
| 2 | 386 | 2019 | Nature Biotechnology |
| 3 | 374 | 2018 | Nature Biotechnology |
| 4 | 380 | 2017 | Nature Biotechnology |
| 5 | 436 | 2016 | Nature Biotechnology |
| 6 | 467 | 2015 | Nature Biotechnology |
| 7 | 475 | 2014 | Nature Biotechnology |
| 8 | 462 | 2013 | Nature Biotechnology |
| 9 | 507 | 2012 | Nature Biotechnology |
| 10 | 459 | 2011 | Nature Biotechnology |
| 11 | 486 | 2010 | Nature Biotechnology |
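The magic command can also store the results in a pandas DataFrame for further processing: pass a variable name right after %%bigquery (here df_pubs is an arbitrary name) and the results become available in that variable in subsequent cells. For example:

%%bigquery df_pubs --params $bq_params --project $project_id
SELECT count(*) as pubs, year
FROM `dimensions-ai.data_analytics.publications`
WHERE year >= 2010 AND journal.id = @journal_id
GROUP BY year
ORDER BY year DESC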
Troubleshooting
- Query fails with to_dataframe() ArrowNotImplementedError: try reinstalling pyarrow, i.e. pip install pyarrow -U
- Query fails with AttributeError: 'NoneType' object has no attribute 'transport': try pip install google-cloud-bigquery-storage -U and restart the notebook