Skip to content

From the DSL API to Google BigQuery

This tutorial demonstrates how to perform a full-text search in Dimensions using the Analytics API and then export the data to Google BigQuery for further analysis.

This technique allows to take advantage of the strengths of each of these data products:

  • The Analytics API allows to run full-text searches over the hundreds of millions documents stored in the Dimensions database. This makes it an ideal tool for identifying a corpus of documents using collections of keywords and/or other filters (note: this is the same functionality available when you search on app.dimensions.ai)
  • The Dimensions on Google BigQuery database allows to run SQL queries of any complexity using a cloud-based environment containing all of the metadata available in Dimensions, thus removing the need to download/analyse the data offline first. This makes is the perfect solution for advanced analytics tasks such as benchmarking, metrics calculations or impact analyses.

Prerequisites

In order to run this tutorial, please ensure that:

(This tutorial is based on a Jupyter notebook that is available directly via GitHub.)

Example: profiling researchers linked to a topic

The concrete usecase we'll be looking at involves running a full-text search for "moon landing" publications using the DSL API, then creating a corpus in BigQuery based on this search (eg see this search).

Once we have the publication corpus available in BigQuery, we will extract all associated researchers (=authors). At the same time, we are going to use SQL in order to enrich the results using other metrics (eg more researchers metadata including citations & altmetric).

Getting started

The following code will load the Python BigQuery library and authenticate you as a valid user.

!pip install google-cloud-bigquery -U --quiet
%load_ext google.cloud.bigquery

import sys
print("==\nAuthenticating...")
if 'google.colab' in sys.modules:
    from google.colab import auth
    auth.authenticate_user()
    print('..done (method: Colab)')
else:
    from google.cloud import bigquery
    print('..done (method: local credentials)')

#
# PLEASE UPDATE USING YOUR CLOUD PROJECT ID (= the 'billing' account)
#

MY_PROJECT_ID = "ds-data-solutions-gbq"

print("==\nTesting connection..")
client = bigquery.Client(project=MY_PROJECT_ID)
test = client.query("""
    SELECT COUNT(*) as pubs
    from `dimensions-ai.data_analytics.publications`
    """)
rows = [x for x in test.result()]
print("...success!")
print("Total publications in Dimensions: ", rows[0]['pubs'])
The google.cloud.bigquery extension is already loaded. To reload it, use:
  %reload_ext google.cloud.bigquery
==
Authenticating...
..done (method: local credentials)
==
Testing connection..
...success!
Total publications in Dimensions:  115963650

1. Connecting to the DSL API

For more background on the Analytics API and how to work with it, see this tutorial

!pip install dimcli --quiet

import dimcli
from dimcli.utils import *
import json
import sys
import pandas as pd
#

ENDPOINT = "https://app.dimensions.ai"
USERNAME, PASSWORD  = "", ""
dimcli.login(USERNAME, PASSWORD, ENDPOINT)
dsl = dimcli.Dsl()

Let's try running a sample query.

TIP Review the full text search syntax of the Dimensions Search Language.

%%dsldf

search publications for
  " \"moon landing\" AND Moon AND \"lunar surface\" "
return publications limit 10
Returned Publications: 10 (total = 11305)
Time: 0.72s
Row type volume pages id year author_affiliations title journal.id journal.title issue
0 article 122 100692 pub.1134954138 2021 `[[{'raw_affiliation': ['Department of Aerospac...` Review of space habitat designs for long term ... jour.1139377 Progress in Aerospace Sciences NaN
1 article 181 167-189 pub.1130384298 2021 `[[{'raw_affiliation': ['U.S. Naval War College...` Joseph G. Gavin, Jr. and MIT’s contribution to... jour.1134138 Acta Astronautica NaN
2 article 180 650-678 pub.1134475636 2021 `[[{'raw_affiliation': ['Skolkovo Institute of ...` Regolith-based additive manufacturing for sust... jour.1134138 Acta Astronautica NaN
3 article NaN 1-13 pub.1135101079 2021 `[[{'raw_affiliation': ['Centre for Teaching an...` Looking at Gail Jones’s “The Man in the Moon” ... jour.1137860 Journal of Australian Studies NaN
4 article 21 959 pub.1135057882 2021 `[[{'raw_affiliation': ['School of Artificial I...` Three-Dimensional Model of the Moon with Seman... jour.1033312 Sensors 3

2. Exporting DSL results to Google BigQuery

First off, we want to run the full-text search so to extract all relevant publications IDs.

Second, we will export the publications IDs to Google BigQuery. NOTE: Pandas provides a handy command to move data to BigQuery: DataFrame.to_gbq.

%%dslloopdf

search publications for
  " \"moon landing\" AND Moon AND \"lunar surface\" "
return publications[id]
Starting iteration with limit=1000 skip=0 ...
0-1000 / 11201 (0.43s)
1000-2000 / 11201 (1.01s)
2000-3000 / 11201 (0.69s)
3000-4000 / 11201 (1.03s)
4000-5000 / 11201 (1.30s)
5000-6000 / 11201 (1.58s)
6000-7000 / 11201 (0.33s)
7000-8000 / 11201 (0.25s)
8000-9000 / 11201 (0.27s)
9000-10000 / 11201 (0.64s)
10000-11000 / 11201 (1.05s)
11000-11201 / 11201 (1.80s)
===
Records extracted: 11201
Row id
0 pub.1128771471
1 pub.1130814402
2 pub.1131658726
3 pub.1124123379
4 pub.1131232278
... ...
11196 pub.1061739351
11197 pub.1025947790
11198 pub.1091822752
11199 pub.1025757974
11200 pub.1023928923

11201 rows × 1 columns

df = dsl_last_results

The command below will add a new table moonlanding to the demo_dsl dataset in GQB.

That destination table is entirely up to you of course, so you need to make sure you have write access to the database.

DATASET = "demo_dsl"
table_id = DATASET + ".moonlanding"
df.to_gbq(table_id, project_id = PROJECTID, if_exists="replace")
1it [00:05,  5.05s/it]

That's it - you should now be able to go to the online BigQuery console and see the new demo_dsl.moonlanding dataset.

3. Querying your new dataset using a JOIN on Dimensions

We can now use the publications IDs we imported in order to create a JOIN query on the main Dimensions dataset. This is a bit like creating a 'view' of Dimensions corresponding to the full-text search we have done above.

GOAL: Roughly, the results should be the same as the 'publication year' facet in the webapp

%%bigquery --project $PROJECTID

WITH mypubs AS (
  SELECT dim_pubs.*
  FROM
    `dimensions-ai.data_analytics.publications` dim_pubs
  JOIN
    `ds-data-solutions-gbq.demo_dsl.moonlanding` dslexport
  ON
    dim_pubs.id = dslexport.id
)

SELECT
  COUNT(id) as tot,
  year
FROM mypubs
GROUP BY year
ORDER BY tot DESC
Row tot year
0 10052 2003
1 133 2020
2 97 2019
3 70 2015
4 66 2017
5 65 2018
6 63 2009
7 59 2013
rows truncated for display

4. Using Google BigQuery to generate researcher statistics

The goal is to generate a table just like the one in the 'researchers' analytical view in the webapp.

For each researcher we want to display some extra information:

  • the total number of publications
  • the citations count
  • the total Altmetric Attention Score
%%bigquery --project $PROJECTID

WITH mypubs AS (
  SELECT dim_pubs.*
  FROM
    `dimensions-ai.data_analytics.publications` dim_pubs
  JOIN
    `ds-data-solutions-gbq.demo_dsl.moonlanding` dslexport
  ON
    dim_pubs.id = dslexport.id
),

researchers_metrics AS (
  SELECT researcher_id,
    COUNT(id) as publications_count,
    SUM(citations_count) as citations_count,
    SUM(altmetrics.score) as altmetric_sum
  FROM
    mypubs,
    UNNEST( researcher_ids ) as researcher_id
  GROUP BY researcher_id
)

SELECT * FROM researchers_metrics
ORDER BY publications_count DESC
Row researcher_id publications_count citations_count altmetric_sum
0 ur.01056354465.10 11 21.0 49.0
1 ur.014402173273.44 6 42.0 10.0
2 ur.012373502003.54 4 63.0 NaN
3 ur.010534421371.14 4 63.0 NaN
4 ur.015145367415.34 4 44.0 1.0
... ... ... ... ...
1080 ur.0767272510.86 1 5.0 NaN
1081 ur.07637166751.28 1 3.0 NaN
1082 ur.012762707227.21 1 3.0 NaN
1083 ur.010101533313.52 1 NaN 1.0
1084 ur.016406136233.64 1 NaN 16.0

Final step: let's add researchers names and current organization details by joining up data from the GRID table.

%%bigquery --project $PROJECTID

WITH mypubs AS (

  SELECT dim_pubs.*
  FROM
    `dimensions-ai.data_analytics.publications` dim_pubs
  JOIN
    `ds-data-solutions-gbq.demo_dsl.moonlanding` dslexport
  ON
    dim_pubs.id = dslexport.id

),

researchers_metrics AS (

  SELECT researcher_id,
    COUNT(id) as publications_count,
    SUM(citations_count) as citations_count,
    SUM(altmetrics.score) as altmetric_sum
  FROM
    mypubs,
    UNNEST( researcher_ids ) as researcher_id
  GROUP BY researcher_id

),

researchers_full AS (

  SELECT researchers_metrics.*,
    r.first_name, r.last_name, r.total_grants,
    grid.id as grid_id,
    grid.name as grid_name,
    grid.address.city as grid_city,
    grid.address.country as grid_country
  FROM
    researchers_metrics
  JOIN
    `dimensions-ai.data_analytics.researchers` r
    ON researchers_metrics.researcher_id = r.id
  JOIN
    `dimensions-ai.data_analytics.grid` grid
    ON grid.id = r.current_research_org

)



SELECT * FROM researchers_full
ORDER BY publications_count DESC
Row researcher_id publications_count citations_count altmetric_sum first_name last_name total_grants grid_id grid_name grid_city grid_country
0 ur.01056354465.10 11 21.0 49.0 Roger D Launius 4 grid.1214.6 Smithsonian Institution Washington D.C. United States
1 ur.014402173273.44 6 42.0 10.0 Joseph N Pelton 0 grid.33224.34 International Space University Illkirch-Graffenstaden France
2 ur.010243405673.63 4 14.0 NaN Sachiko Wakabayashi 1 grid.62167.34 Japan Aerospace Exploration Agency Tokyo Japan
3 ur.012503545245.69 4 12.0 1.0 Stephan Theil 2 grid.7551.6 German Aerospace Center Cologne Germany
4 ur.0720745255.73 4 8.0 28.0 Chun-Lai Li 3 grid.450302.0 National Astronomical Observatories Beijing China
... ... ... ... ... ... ... ... ... ... ... ...
886 ur.011430662526.22 1 10.0 NaN Gang Lei 0 grid.458502.e Technical Institute of Physics and Chemistry Beijing China
887 ur.015477605337.38 1 1.0 NaN Olivier Dubois-Matra 0 grid.424669.b European Space Research and Technology Centre Noordwijk-Binnen Netherlands
888 ur.013214745135.53 1 NaN NaN Catherine L Newell 0 grid.26790.3a University of Miami Coral Gables United States
889 ur.014464032227.37 1 41.0 141.0 Andrew M Carton 0 grid.25879.31 University of Pennsylvania Philadelphia United States
890 ur.010610572173.89 1 10.0 NaN Zhan Liu 0 grid.411510.0 China University of Mining and Technology Xuzhou China