Identify a funder's grants and publications
Use case
How many grants have resulted in publications? This notebook serves two main purposes:
i. Enables identification of a funder's grants and resulting publications.
ii. Prepares a summary table of grants and publications (per year and in total), and a bar graph showing number of grants by year with and without publications.
Prerequisites
In order to run this tutorial, please ensure that:
- You have a valid Dimensions on Google BigQuery account and have configured a Google Cloud project.
- You have a valid Dimensions API account.
- You have some basic familiarity with Python and Jupyter notebooks.
This tutorial is based on a Jupyter notebook that is available directly via GitHub.
Method
This notebook retrieves grants and publication data from Dimensions the world's largest linked research information dataset. For more about Dimensions visit - https://www.dimensions.ai/
To complete this analysis the following steps are taken:
- Connect to the Dimensions database.
- Find funder's unique GRID ID. The Global Research Identifier Database (GRID) is a database of research-related organisations each with a unique and persistent identifier. For more about GRID visit - https://grid.ac/.
- Generate list of grants and resulting publications.
- (OPTIONAL) Export list of grants and resulting publications to excel.
- Summary table showing number of grants and publications in total and by year.
- Graph showing number of grants that have yielded one or more publications by year.
Using this notebook
Before running some cells you will have to fill in information about the funder and year range. When required instructions are set out in info boxes.
This notebook assumes installation of python packages pandas
and plotly
.
1. Connect to Dimensions
You will need to be authenticated to run these queries—see the "Verifying your connection tutorial for options.
Run this cell if you're connecting via a Google Colab notebook:
from google.colab import auth
auth.authenticate_user()
print('Authenticated')
Authenticated
import pandas as pd
from google.cloud import bigquery
GBQ_PROJECT_ID = 'ds-gov-funder-gbq' # <----- Input your project ID here
client = bigquery.Client(project=GBQ_PROJECT_ID)
2. Find funder's unique GRID ID
To identify funder's grants we need the funder's unique GRID ID. The following cell searches on the funders names and returns the associated GRID ID(s).
Instructions
- Before running the code in the following cell input funder's name
- Funder's name might return multiple GRID IDs, note all relevant GRID IDs as they are needed for the next step.
If no GRID IDs are returned, try searching different name variants for the funder. Avoid acronyms.
funder_name = 'Wellcome Trust' # Input funder name here
funder_name = funder_name.lower()
query = f"""
select
id as grid_id
, name
from `dimensions-ai.data_analytics.grid`
where lower(name) like '%{funder_name}%'
"""
funder = pd.read_gbq(query, project_id=GBQ_PROJECT_ID)
funder.head(10)
Row | grid_id | name |
---|---|---|
0 | grid.52788.30 | Wellcome Trust |
1 | grid.478079.5 | NIHR Wellcome Trust Southampton Clinical Resea... |
2 | grid.484745.e | Wellcome Trust/DBT India Alliance |
3 | grid.482844.6 | Wellcome Trust Centre for the History of Medicine |
4 | grid.419393.5 | Malawi-Liverpool-Wellcome Trust Clinical Resea... |
3. Generate list of grants and publications identifiers
The following cell generates a table of funders grants and any resulting publications for grants started in a defined time period. Only the first five rows of the table will be shown in this notebook, however the whole table can be exported to excel in the following step.
Table columns:
dimensions_grant_id
- Unique identifier for grant in Dimensionsgrant_start_year
- Year the grant startedgrant_number
- Grant number used by funderdimensions_publication_id
- Unique identifier for publications in Dimensionsdoi
- Digital Object Identifier
Instructions
Before running the code in the following cell, where indicated:
- input funders GRID ID between
('')
e.g.('grid.52788.30')
. If funder has two or more GRID IDs then list within brackets separated by commas e.g.('grid.453157.1', 'grid.453157.1')
. - Update year range, note that we can only look at full years and cannot be filtered by month or day.
query = f"""
with pubs as
(select id, doi from `dimensions-ai.data_analytics.publications`)
, grants as
(
select
id as dimensions_grant_id
, start_year as grant_start_year
, grant_number
, pub_id as dimensions_publication_id
from `dimensions-ai.data_analytics.grants` g left join unnest(g.resulting_publication_ids) as pub_id
where funder_org in ('grid.52788.30') -- ADD GRID ID HERE
and start_year between 2011 and 2015 -- UPDATE YEAR RANGE
)
Select
grants.*
, pubs.doi
from grants
left join pubs
on pubs.id = grants.dimensions_publication_id
"""
grants = pd.read_gbq(query)
grants.head()
Row | dimensions_grant_id | grant_start_year | grant_number | dimensions_publication_id | doi |
---|---|---|---|---|---|
0 | grant.3627143 | 2012 | 082265/Z/07/A | pub.1105902779 | 10.1016/j.jaci.2018.07.011 |
1 | grant.3635772 | 2012 | 098274/Z/12/Z | pub.1124346329 | 10.1016/j.bpj.2020.01.019 |
2 | grant.3636003 | 2012 | 097899/Z/11/Z | pub.1062809971 | 10.1016/j.jaci.2018.07.011 |
3 | grant.3642390 | 2012 | 098649/Z/12/Z | None | None |
4 | grant.3638982 | 2012 | 099618/Z/12/Z | pub.1023307051 | 10.1007/s00234-016-1648-3 |
4. (OPTIONAL) Export list of grants and resulting publications to excel
The following cell creates new excel file of grants and resulting publications as output from previous cell. The new file will appear in the same location as this notebook is saved.
Instructions
Before running code the code in the following cell replace FILE_NAME
with new file name.
# Creates new excel file.
writer = pd.ExcelWriter('FILE_NAME.xlsx', engine='openpyxl')
grants.to_excel(writer, sheet_name='grants_pubs')
writer.save()
5. Summary table of number of grants and publications in total and by year
The following cell creates a table with columns, for all years and per year:
grant_start_year
- first year of the grantgrants
- number of grants awarded by funderpublications
- number of publications resulting from awards by fundergrants_without_publications
- number of grants that have not resulted in a publicationperc_without_publications
- percentage of grants that have not resulted in a publicationavg_publications_per_grants
- average (mean) number of publications per grant
Instructions
Before running code the code in the following cell, where indicated:
- input funders GRID ID between
('')
e.g.('grid.52788.30')
. If funder has two or more GRID IDs then list within brackets separated by commas e.g.('grid.453157.1', 'grid.453157.1')
. - Update year range
query = f"""
with grant_pub_map as
(
select
id
, start_year
, pub_id -- is this the best way of finding publications? or search for grant or
from `dimensions-ai.data_analytics.grants` g left join unnest(g.resulting_publication_ids) as pub_id
where funder_org = 'grid.52788.30' -- ADD GRID IDs HERE
and start_year between 2011 and 2015 -- UPDATE YEAR RANGE
)
select
'All years' as grant_start_year
, count(distinct id) as grants
, count(pub_id) as publications
, countif(pub_id is null) as grants_without_publications
, round(((countif(pub_id is null)/count(distinct id))*100), 2) as perc_without_publications
, round((count(pub_id)/(count(distinct id) )),2) as avg_publications_per_grants
from grant_pub_map
union all
select
cast(start_year as string) as grant_start_year
, count(distinct id) as grants
, count(pub_id) as publications
, countif(pub_id is null) as grants_without_publications
, round(((countif(pub_id is null)/count(distinct id))*100), 2) as perc_without_publications
, round((count(pub_id)/(count(distinct id) )),2) as avg_publications_per_grants
from grant_pub_map
group by start_year
order by 1 desc ;
"""
grants1 = pd.read_gbq(query)
grants1.head(5)
Row | grant_start_year | grants | publications | grants_without_publications | perc_without_publications | avg_publications_per_grants |
---|---|---|---|---|---|---|
0 | All years | 6423 | 45840 | 3731 | 58.09 | 7.14 |
1 | 2015 | 1417 | 6901 | 833 | 58.79 | 4.87 |
2 | 2014 | 1327 | 8746 | 791 | 59.61 | 6.59 |
3 | 2013 | 1200 | 8217 | 690 | 57.50 | 6.85 |
4 | 2012 | 1159 | 8515 | 689 | 59.45 | 7.35 |
6. Bar graph showing number of grants that have yielded one or more publications by year
Using the table above the following cell generates a bar graph showing the number of grants with and without publications per year.
# Remove first row that counts all grants over all years
df = grants1.iloc[1:]
# Reverse rows so earlist year shown first on bar graph
df = df[::-1].reset_index()
# Calculate number of grants with publications
df.insert(5, 'grant_pubs', df["grants"] - df["grants_without_publications"])
import plotly.graph_objs as go
x = df.grant_start_year
trace1 = {
'x' : x,
'y' : df.grant_pubs,
'name' : 'Grants with publications',
'type' : 'bar'
};
trace2 = {
'x' : x,
'y' : df.grants_without_publications,
'name' : 'Grants without publications',
'type' : 'bar'
};
data = [trace1, trace2];
layout = {
'xaxis' : {'title' : 'Year'},
'yaxis' : {'title' : 'Number of grants'},
'barmode' : 'relative',
'title': 'Grants with and without publications'
};
fig = go.Figure(data = data, layout = layout)
fig.show()