From the DSL API to Google BigQuery
This tutorial demonstrates how to perform a full-text search in Dimensions using the Analytics API and then export the data to Google BigQuery for further analysis.
This technique allows to take advantage of the strengths of each of these data products:
- The Analytics API allows to run full-text searches over the hundreds of millions documents stored in the Dimensions database. This makes it an ideal tool for identifying a corpus of documents using collections of keywords and/or other filters (note: this is the same functionality available when you search on app.dimensions.ai)
- The Dimensions on Google BigQuery database allows to run SQL queries of any complexity using a cloud-based environment containing all of the metadata available in Dimensions, thus removing the need to download/analyse the data offline first. This makes is the perfect solution for advanced analytics tasks such as benchmarking, metrics calculations or impact analyses.
Prerequisites
In order to run this tutorial, please ensure that:
- You have a valid Dimensions on Google BigQuery account and have configured a Google Cloud project.
- You have a valid Dimensions API account.
- You have some basic familiarity with Python and Jupyter notebooks.
(This tutorial is based on a Jupyter notebook that is available directly via GitHub.)
Example: profiling researchers linked to a topic
The concrete usecase we'll be looking at involves running a full-text search for "moon landing" publications using the DSL API, then creating a corpus in BigQuery based on this search (eg see this search).
Once we have the publication corpus available in BigQuery, we will extract all associated researchers (=authors). At the same time, we are going to use SQL in order to enrich the results using other metrics (eg more researchers
metadata including citations & altmetric).
Getting started
The following code will load the Python BigQuery library and authenticate you as a valid user.
!pip install google-cloud-bigquery -U --quiet
%load_ext google.cloud.bigquery
import sys
print("==\nAuthenticating...")
if 'google.colab' in sys.modules:
from google.colab import auth
auth.authenticate_user()
print('..done (method: Colab)')
else:
from google.cloud import bigquery
print('..done (method: local credentials)')
#
# PLEASE UPDATE USING YOUR CLOUD PROJECT ID (= the 'billing' account)
#
MY_PROJECT_ID = "ds-data-solutions-gbq"
print("==\nTesting connection..")
client = bigquery.Client(project=MY_PROJECT_ID)
test = client.query("""
SELECT COUNT(*) as pubs
from `dimensions-ai.data_analytics.publications`
""")
rows = [x for x in test.result()]
print("...success!")
print("Total publications in Dimensions: ", rows[0]['pubs'])
The google.cloud.bigquery extension is already loaded. To reload it, use:
%reload_ext google.cloud.bigquery
==
Authenticating...
..done (method: local credentials)
==
Testing connection..
...success!
Total publications in Dimensions: 115963650
1. Connecting to the DSL API
For more background on the Analytics API and how to work with it, see this tutorial
!pip install dimcli --quiet
import dimcli
from dimcli.utils import *
import json
import sys
import pandas as pd
#
ENDPOINT = "https://app.dimensions.ai"
USERNAME, PASSWORD = "", ""
dimcli.login(USERNAME, PASSWORD, ENDPOINT)
dsl = dimcli.Dsl()
Let's try running a sample query.
TIP Review the full text search syntax of the Dimensions Search Language.
%%dsldf
search publications for
" \"moon landing\" AND Moon AND \"lunar surface\" "
return publications limit 10
Returned Publications: 10 (total = 11305)
[2mTime: 0.72s[0m
Row | type | volume | pages | id | year | author_affiliations | title | journal.id | journal.title | issue |
---|---|---|---|---|---|---|---|---|---|---|
0 | article | 122 | 100692 | pub.1134954138 | 2021 | `[[{'raw_affiliation': ['Department of Aerospac...` | Review of space habitat designs for long term ... | jour.1139377 | Progress in Aerospace Sciences | NaN |
1 | article | 181 | 167-189 | pub.1130384298 | 2021 | `[[{'raw_affiliation': ['U.S. Naval War College...` | Joseph G. Gavin, Jr. and MIT’s contribution to... | jour.1134138 | Acta Astronautica | NaN |
2 | article | 180 | 650-678 | pub.1134475636 | 2021 | `[[{'raw_affiliation': ['Skolkovo Institute of ...` | Regolith-based additive manufacturing for sust... | jour.1134138 | Acta Astronautica | NaN |
3 | article | NaN | 1-13 | pub.1135101079 | 2021 | `[[{'raw_affiliation': ['Centre for Teaching an...` | Looking at Gail Jones’s “The Man in the Moon” ... | jour.1137860 | Journal of Australian Studies | NaN |
4 | article | 21 | 959 | pub.1135057882 | 2021 | `[[{'raw_affiliation': ['School of Artificial I...` | Three-Dimensional Model of the Moon with Seman... | jour.1033312 | Sensors | 3 |
2. Exporting DSL results to Google BigQuery
First off, we want to run the full-text search so to extract all relevant publications IDs.
Second, we will export the publications IDs to Google BigQuery. NOTE: Pandas provides a handy command to move data to BigQuery: DataFrame.to_gbq.
%%dslloopdf
search publications for
" \"moon landing\" AND Moon AND \"lunar surface\" "
return publications[id]
Starting iteration with limit=1000 skip=0 ...
0-1000 / 11201 (0.43s)
1000-2000 / 11201 (1.01s)
2000-3000 / 11201 (0.69s)
3000-4000 / 11201 (1.03s)
4000-5000 / 11201 (1.30s)
5000-6000 / 11201 (1.58s)
6000-7000 / 11201 (0.33s)
7000-8000 / 11201 (0.25s)
8000-9000 / 11201 (0.27s)
9000-10000 / 11201 (0.64s)
10000-11000 / 11201 (1.05s)
11000-11201 / 11201 (1.80s)
===
Records extracted: 11201
Row | id |
---|---|
0 | pub.1128771471 |
1 | pub.1130814402 |
2 | pub.1131658726 |
3 | pub.1124123379 |
4 | pub.1131232278 |
... | ... |
11196 | pub.1061739351 |
11197 | pub.1025947790 |
11198 | pub.1091822752 |
11199 | pub.1025757974 |
11200 | pub.1023928923 |
11201 rows × 1 columns
df = dsl_last_results
The command below will add a new table moonlanding
to the demo_dsl
dataset in GQB.
That destination table is entirely up to you of course, so you need to make sure you have write access to the database.
DATASET = "demo_dsl"
table_id = DATASET + ".moonlanding"
df.to_gbq(table_id, project_id = PROJECTID, if_exists="replace")
1it [00:05, 5.05s/it]
That's it - you should now be able to go to the online BigQuery console and see the new demo_dsl.moonlanding
dataset.
3. Querying your new dataset using a JOIN on Dimensions
We can now use the publications IDs we imported in order to create a JOIN query on the main Dimensions dataset. This is a bit like creating a 'view' of Dimensions corresponding to the full-text search we have done above.
GOAL: Roughly, the results should be the same as the 'publication year' facet in the webapp
%%bigquery --project $PROJECTID
WITH mypubs AS (
SELECT dim_pubs.*
FROM
`dimensions-ai.data_analytics.publications` dim_pubs
JOIN
`ds-data-solutions-gbq.demo_dsl.moonlanding` dslexport
ON
dim_pubs.id = dslexport.id
)
SELECT
COUNT(id) as tot,
year
FROM mypubs
GROUP BY year
ORDER BY tot DESC
Row | tot | year |
---|---|---|
0 | 10052 | 2003 |
1 | 133 | 2020 |
2 | 97 | 2019 |
3 | 70 | 2015 |
4 | 66 | 2017 |
5 | 65 | 2018 |
6 | 63 | 2009 |
7 | 59 | 2013 |
rows truncated for display |
4. Using Google BigQuery to generate researcher statistics
The goal is to generate a table just like the one in the 'researchers' analytical view in the webapp.
For each researcher we want to display some extra information:
- the total number of publications
- the citations count
- the total Altmetric Attention Score
%%bigquery --project $PROJECTID
WITH mypubs AS (
SELECT dim_pubs.*
FROM
`dimensions-ai.data_analytics.publications` dim_pubs
JOIN
`ds-data-solutions-gbq.demo_dsl.moonlanding` dslexport
ON
dim_pubs.id = dslexport.id
),
researchers_metrics AS (
SELECT researcher_id,
COUNT(id) as publications_count,
SUM(citations_count) as citations_count,
SUM(altmetrics.score) as altmetric_sum
FROM
mypubs,
UNNEST( researcher_ids ) as researcher_id
GROUP BY researcher_id
)
SELECT * FROM researchers_metrics
ORDER BY publications_count DESC
Row | researcher_id | publications_count | citations_count | altmetric_sum |
---|---|---|---|---|
0 | ur.01056354465.10 | 11 | 21.0 | 49.0 |
1 | ur.014402173273.44 | 6 | 42.0 | 10.0 |
2 | ur.012373502003.54 | 4 | 63.0 | NaN |
3 | ur.010534421371.14 | 4 | 63.0 | NaN |
4 | ur.015145367415.34 | 4 | 44.0 | 1.0 |
... | ... | ... | ... | ... |
1080 | ur.0767272510.86 | 1 | 5.0 | NaN |
1081 | ur.07637166751.28 | 1 | 3.0 | NaN |
1082 | ur.012762707227.21 | 1 | 3.0 | NaN |
1083 | ur.010101533313.52 | 1 | NaN | 1.0 |
1084 | ur.016406136233.64 | 1 | NaN | 16.0 |
Final step: let's add researchers names and current organization details by joining up data from the GRID table.
%%bigquery --project $PROJECTID
WITH mypubs AS (
SELECT dim_pubs.*
FROM
`dimensions-ai.data_analytics.publications` dim_pubs
JOIN
`ds-data-solutions-gbq.demo_dsl.moonlanding` dslexport
ON
dim_pubs.id = dslexport.id
),
researchers_metrics AS (
SELECT researcher_id,
COUNT(id) as publications_count,
SUM(citations_count) as citations_count,
SUM(altmetrics.score) as altmetric_sum
FROM
mypubs,
UNNEST( researcher_ids ) as researcher_id
GROUP BY researcher_id
),
researchers_full AS (
SELECT researchers_metrics.*,
r.first_name, r.last_name, r.total_grants,
grid.id as grid_id,
grid.name as grid_name,
grid.address.city as grid_city,
grid.address.country as grid_country
FROM
researchers_metrics
JOIN
`dimensions-ai.data_analytics.researchers` r
ON researchers_metrics.researcher_id = r.id
JOIN
`dimensions-ai.data_analytics.grid` grid
ON grid.id = r.current_research_org
)
SELECT * FROM researchers_full
ORDER BY publications_count DESC
Row | researcher_id | publications_count | citations_count | altmetric_sum | first_name | last_name | total_grants | grid_id | grid_name | grid_city | grid_country |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | ur.01056354465.10 | 11 | 21.0 | 49.0 | Roger D | Launius | 4 | grid.1214.6 | Smithsonian Institution | Washington D.C. | United States |
1 | ur.014402173273.44 | 6 | 42.0 | 10.0 | Joseph N | Pelton | 0 | grid.33224.34 | International Space University | Illkirch-Graffenstaden | France |
2 | ur.010243405673.63 | 4 | 14.0 | NaN | Sachiko | Wakabayashi | 1 | grid.62167.34 | Japan Aerospace Exploration Agency | Tokyo | Japan |
3 | ur.012503545245.69 | 4 | 12.0 | 1.0 | Stephan | Theil | 2 | grid.7551.6 | German Aerospace Center | Cologne | Germany |
4 | ur.0720745255.73 | 4 | 8.0 | 28.0 | Chun-Lai | Li | 3 | grid.450302.0 | National Astronomical Observatories | Beijing | China |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
886 | ur.011430662526.22 | 1 | 10.0 | NaN | Gang | Lei | 0 | grid.458502.e | Technical Institute of Physics and Chemistry | Beijing | China |
887 | ur.015477605337.38 | 1 | 1.0 | NaN | Olivier | Dubois-Matra | 0 | grid.424669.b | European Space Research and Technology Centre | Noordwijk-Binnen | Netherlands |
888 | ur.013214745135.53 | 1 | NaN | NaN | Catherine L | Newell | 0 | grid.26790.3a | University of Miami | Coral Gables | United States |
889 | ur.014464032227.37 | 1 | 41.0 | 141.0 | Andrew M | Carton | 0 | grid.25879.31 | University of Pennsylvania | Philadelphia | United States |
890 | ur.010610572173.89 | 1 | 10.0 | NaN | Zhan | Liu | 0 | grid.411510.0 | China University of Mining and Technology | Xuzhou | China |