Identify a funder's grants and publications

Use case

How many grants have resulted in publications? This notebook serves two main purposes:

i. Enables identification of a funder's grants and resulting publications.

ii. Prepares a summary table of grants and publications (per year and in total), and a bar graph showing number of grants by year with and without publications.

Prerequisites

In order to run this tutorial, please ensure that:

You have a valid Dimensions on Google BigQuery account and have configured a Google Cloud project.
You have a valid Dimensions API account.
You have some basic familiarity with Python and Jupyter notebooks.

This tutorial is based on a Jupyter notebook that is available directly via GitHub.

Method

This notebook retrieves grants and publication data from Dimensions the world's largest linked research information dataset. For more about Dimensions visit - https://www.dimensions.ai/

To complete this analysis the following steps are taken:

Connect to the Dimensions database.
Find funder's unique GRID ID. The Global Research Identifier Database (GRID) is a database of research-related organisations each with a unique and persistent identifier. For more about GRID visit - https://grid.ac/.
Generate list of grants and resulting publications.
(OPTIONAL) Export list of grants and resulting publications to excel.
Summary table showing number of grants and publications in total and by year.
Graph showing number of grants that have yielded one or more publications by year.

Using this notebook

Before running some cells you will have to fill in information about the funder and year range. When required instructions are set out in info boxes.

This notebook assumes installation of python packages pandas and plotly.

1. Connect to Dimensions

You will need to be authenticated to run these queries—see the "Verifying your connection tutorial for options.

Run this cell if you're connecting via a Google Colab notebook:

from google.colab import auth
auth.authenticate_user()
print('Authenticated')

Authenticated

import pandas as pd
from google.cloud import bigquery

GBQ_PROJECT_ID = 'ds-gov-funder-gbq' # <----- Input your project ID here
client = bigquery.Client(project=GBQ_PROJECT_ID)

2. Find funder's unique GRID ID

To identify funder's grants we need the funder's unique GRID ID. The following cell searches on the funders names and returns the associated GRID ID(s).

Instructions

Before running the code in the following cell input funder's name
Funder's name might return multiple GRID IDs, note all relevant GRID IDs as they are needed for the next step.

If no GRID IDs are returned, try searching different name variants for the funder. Avoid acronyms.

funder_name = 'Wellcome Trust' # Input funder name here

funder_name = funder_name.lower()
query = f"""
    select
        id as grid_id
        , name
    from `dimensions-ai.data_analytics.grid`
    where lower(name) like '%{funder_name}%'
"""

funder = pd.read_gbq(query, project_id=GBQ_PROJECT_ID)
funder.head(10)

Row	grid_id	name
0	grid.52788.30	Wellcome Trust
1	grid.478079.5	NIHR Wellcome Trust Southampton Clinical Resea...
2	grid.484745.e	Wellcome Trust/DBT India Alliance
3	grid.482844.6	Wellcome Trust Centre for the History of Medicine
4	grid.419393.5	Malawi-Liverpool-Wellcome Trust Clinical Resea...

3. Generate list of grants and publications identifiers

The following cell generates a table of funders grants and any resulting publications for grants started in a defined time period. Only the first five rows of the table will be shown in this notebook, however the whole table can be exported to excel in the following step.

Table columns:

dimensions_grant_id - Unique identifier for grant in Dimensions
grant_start_year - Year the grant started
grant_number - Grant number used by funder
dimensions_publication_id - Unique identifier for publications in Dimensions
doi - Digital Object Identifier

Instructions

Before running the code in the following cell, where indicated:

input funders GRID ID between ('') e.g. ('grid.52788.30'). If funder has two or more GRID IDs then list within brackets separated by commas e.g. ('grid.453157.1', 'grid.453157.1').
Update year range, note that we can only look at full years and cannot be filtered by month or day.

query = f"""
with pubs as
  (select id, doi from `dimensions-ai.data_analytics.publications`)
, grants as
  (
  select
    id as dimensions_grant_id
    , start_year as grant_start_year
    , grant_number
    , pub_id as dimensions_publication_id
  from `dimensions-ai.data_analytics.grants` g left join unnest(g.resulting_publication_ids) as pub_id
   where funder_org in ('grid.52788.30')  -- ADD GRID ID HERE
      and start_year between 2011 and 2015 -- UPDATE YEAR RANGE
    )
Select
  grants.*
  , pubs.doi
from grants
  left join pubs
    on pubs.id = grants.dimensions_publication_id
"""
grants = pd.read_gbq(query)
grants.head()

Row	dimensions_grant_id	grant_start_year	grant_number	dimensions_publication_id	doi
0	grant.3627143	2012	082265/Z/07/A	pub.1105902779	10.1016/j.jaci.2018.07.011
1	grant.3635772	2012	098274/Z/12/Z	pub.1124346329	10.1016/j.bpj.2020.01.019
2	grant.3636003	2012	097899/Z/11/Z	pub.1062809971	10.1016/j.jaci.2018.07.011
3	grant.3642390	2012	098649/Z/12/Z	None	None
4	grant.3638982	2012	099618/Z/12/Z	pub.1023307051	10.1007/s00234-016-1648-3

4. (OPTIONAL) Export list of grants and resulting publications to excel

The following cell creates new excel file of grants and resulting publications as output from previous cell. The new file will appear in the same location as this notebook is saved.

Instructions

Before running code the code in the following cell replace FILE_NAME with new file name.

# Creates new excel file.
writer = pd.ExcelWriter('FILE_NAME.xlsx',  engine='openpyxl')

grants.to_excel(writer, sheet_name='grants_pubs')

writer.save()

5. Summary table of number of grants and publications in total and by year

The following cell creates a table with columns, for all years and per year:

grant_start_year - first year of the grant
grants - number of grants awarded by funder
publications - number of publications resulting from awards by funder
grants_without_publications - number of grants that have not resulted in a publication
perc_without_publications - percentage of grants that have not resulted in a publication
avg_publications_per_grants - average (mean) number of publications per grant

Instructions

Before running code the code in the following cell, where indicated:

input funders GRID ID between ('') e.g. ('grid.52788.30'). If funder has two or more GRID IDs then list within brackets separated by commas e.g. ('grid.453157.1', 'grid.453157.1').
Update year range

query = f"""
with grant_pub_map as
  (
  select
    id
    , start_year
   , pub_id  -- is this the best way of finding publications? or search for grant or
  from `dimensions-ai.data_analytics.grants` g left join unnest(g.resulting_publication_ids) as pub_id
    where funder_org = 'grid.52788.30'  -- ADD GRID IDs HERE
      and start_year between 2011 and 2015 -- UPDATE YEAR RANGE
  )
select
  'All years' as grant_start_year
  , count(distinct id) as grants
  , count(pub_id) as publications
  , countif(pub_id is null) as grants_without_publications
  , round(((countif(pub_id is null)/count(distinct id))*100), 2) as perc_without_publications
  , round((count(pub_id)/(count(distinct id) )),2) as avg_publications_per_grants
from grant_pub_map
 union all
select
  cast(start_year as string) as grant_start_year
  , count(distinct id) as grants
  , count(pub_id) as publications
  , countif(pub_id is null) as grants_without_publications
  , round(((countif(pub_id is null)/count(distinct id))*100), 2) as perc_without_publications
  , round((count(pub_id)/(count(distinct id) )),2) as avg_publications_per_grants
from grant_pub_map
  group by start_year
order by 1 desc ;
"""

grants1 = pd.read_gbq(query)
grants1.head(5)

Row	grant_start_year	grants	publications	grants_without_publications	perc_without_publications	avg_publications_per_grants
0	All years	6423	45840	3731	58.09	7.14
1	2015	1417	6901	833	58.79	4.87
2	2014	1327	8746	791	59.61	6.59
3	2013	1200	8217	690	57.50	6.85
4	2012	1159	8515	689	59.45	7.35

6. Bar graph showing number of grants that have yielded one or more publications by year

Using the table above the following cell generates a bar graph showing the number of grants with and without publications per year.

# Remove first row that counts all grants over all years
df = grants1.iloc[1:]

# Reverse rows so earlist year shown first on bar graph
df = df[::-1].reset_index()

# Calculate number of grants with publications
df.insert(5, 'grant_pubs', df["grants"] - df["grants_without_publications"])


import plotly.graph_objs as go


x = df.grant_start_year

trace1 = {
    'x' : x,
    'y' : df.grant_pubs,
    'name' : 'Grants with publications',
    'type' : 'bar'
};
trace2 = {
    'x' : x,
    'y' : df.grants_without_publications,
    'name' : 'Grants without publications',
    'type' : 'bar'
};
data = [trace1, trace2];
layout = {
    'xaxis' : {'title' : 'Year'},
    'yaxis' : {'title' : 'Number of grants'},
    'barmode' : 'relative',
    'title': 'Grants with and without publications'
};
fig = go.Figure(data = data, layout = layout)
fig.show()

grant_pubs_barchart