Usage of Trust Markers in research
Use case
'Trust Markers' are indicators of integrity, professionalism and reproducibility in scientific research. They also highlight the level of research transparency within the document, and reduce the risks of allowing non-compliance to research integrity policies to go unobserved.
This notebook takes you through a few examples which address the questions below. It makes use of the Dimensions Research Integrity Dataset, an additional module to Dimensions on Google BigQuery (GBQ).
Using the dataset, you can answer questions such as:
- How many research articles use Trust Markers?
- How does coverage of Trust Markers differ across publishers, funders and research organisations?
- If researchers are using Trust Markers (e.g., data availability statements), how many are putting their data in repositories (and which repositories)?
Prerequisites
In order to run this tutorial, please ensure that:
- You have a valid Dimensions on Google BigQuery account and have configured a Google Cloud project. This must include access to the Dimensions Research Integrity Dataset.
- You have some basic familiarity with Python and Jupyter notebooks.
About Trust Markers
The Trust Markers in the dataset capture the integrity and reproducibility of scientific research: they represent a contract between authors and readers that proper research practices have been observed.
To read definitions of specific Trust Markers, see the GBQ schema documentation.
Method
This notebook retrieves trust marker and publication data from Dimensions, the world's largest linked research information dataset. In particular, the Trust Markers are taken from the DRI module.
To complete the analysis the following steps are taken:
- Connect to the Dimensions database.
- Gather information about general use of Trust Markers, broken down by publisher.
- Look at how usage of Trust Markers breaks down by research organisations in the US, by joining on data from GRID.
- Find out some of the most commonly claimed contributions by individuals to research across different fields.
- Understand the usage of repositories across funders, research orgs and research articles.
1. Connect
You will need to be authenticated to run these queries - see the "Verifying your connection" tutorial for options.
from google.colab import auth
auth.authenticate_user()
print('Authenticated')
Authenticated
#import other packages/modules needed for analysis
from google.cloud import bigquery
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objs as go
#config to avoid having to declare parameters multiple times
project_id = "ds-ripeta-gbq" #replace with the project ID associated with your account
from google.cloud.bigquery import magics
magics.context.project = project_id
client = bigquery.Client(project = project_id)
2. Trust Markers by publisher
Write and run a query
In this instance we will limit data to 2022 and 10 publishers to keep things manageable.
#write the query - we're limiting results here to keep things easy to follow
qry = client.query("""
SELECT
p.publisher.name,
100 * COUNTIF(tm.data.data_availability_statement.present)/ COUNT(p.id) AS data_availability,
100 * COUNTIF(tm.code.code_availability_statement.present)/COUNT(p.id) AS code_availability,
100 * COUNTIF(tm.authors.author_contribution_statement.present)/COUNT(p.id) AS author_contributions,
100 * COUNTIF(tm.authors.conflict_of_interest_statement.present)/COUNT(p.id) AS conflict_interest,
100 * COUNTIF(tm.funding.funding_statement.present)/COUNT(p.id) AS funding_statement,
#note: here we only count articles with MeSH terms 'Humans' or 'Animals', as an ethics statement is unlikely to be expected otherwise
100 * COUNTIF((tm.ethical_approval.ethical_approval_statement.present AND (('Humans' IN UNNEST(p.mesh_terms)) OR ('Animals' IN UNNEST(p.mesh_terms)))) )/
NULLIF(COUNTIF(('Humans' in UNNEST(p.mesh_terms)) OR ('Animals' IN unnest(p.mesh_terms))), 0) AS ethics_approval
FROM
dimensions-ai.data_analytics.publications p
INNER JOIN `dimensions-ai-integrity.data.trust_markers` tm
ON p.id = tm.id
WHERE p.year = 2022 AND p.document_type.classification = 'RESEARCH_ARTICLE'
GROUP BY 1
ORDER BY COUNT(p.id) DESC #order by number of publications in the trust marker dataset
--To keep things manageable for display purposes, we'll only look at 10 publishers for now
LIMIT 10
""")
#get the results
results = qry.result().to_dataframe() #may take a while depending on how much data you return
#take a peek
results
 | name | data_availability | code_availability | author_contributions | conflict_interest | funding_statement | ethics_approval
---|---|---|---|---|---|---|---
0 | Elsevier | 22.890000 | 1.597544 | 17.378246 | 80.405263 | 57.918947 | 18.943601 |
1 | Springer Nature | 49.878036 | 8.742873 | 52.908014 | 40.728133 | 71.049643 | 65.175558 |
2 | MDPI | 88.760718 | 1.339980 | 97.192921 | 99.243270 | 91.880857 | 13.336054 |
3 | Wiley | 39.376133 | 1.139870 | 20.882104 | 7.671746 | 58.562563 | 16.154602 |
4 | Frontiers | 97.583713 | 1.031047 | 99.547012 | 6.398587 | 79.234019 | 75.641116 |
5 | Taylor & Francis | 17.195639 | 0.605616 | 11.305576 | 83.174962 | 52.912265 | 28.457937 |
6 | Institute of Electrical and Electronics Engine... | 0.300312 | 0.723896 | 0.208514 | 0.599313 | 3.561780 | 5.201794 |
7 | American Chemical Society (ACS) | 2.348741 | 0.794581 | 42.137319 | 1.914696 | 75.219210 | 4.040979 |
8 | Hindawi | 85.378842 | 1.029393 | 23.275609 | 85.217410 | 52.681454 | 23.068174 |
9 | SAGE Publications | 9.527544 | 0.604082 | 18.841010 | 4.317599 | 90.031582 | 33.596158 |
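One detail worth noting in the query above: the ethics_approval denominator is wrapped in NULLIF(..., 0), so a publisher with no human- or animal-related papers yields NULL rather than a division-by-zero error. A minimal Python analogue of that guard (illustrative only, not part of the notebook's pipeline):

```python
def pct(numerator: int, denominator: int):
    # Analogue of SQL's 100 * x / NULLIF(y, 0): a zero denominator
    # yields None (i.e. NULL) instead of raising ZeroDivisionError
    return None if denominator == 0 else 100 * numerator / denominator

print(pct(3, 0))   # None
print(pct(1, 4))   # 25.0
```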
Visualize your results
#make data 'long'
long_results = pd.melt(
results,
id_vars = "name",
value_vars = results.columns.tolist()[1:7],
var_name = "trust_marker",
value_name = "pct"
)
#convert your data to a dictionary so plotly can handle it
result_dict = {
"z": long_results.pct.tolist(),
"x": long_results.trust_marker.tolist(),
"y": long_results.name.tolist()
}
#plot
plot = go.Figure(
data = go.Heatmap(
result_dict,
colorscale = "Blues"
)
)
plot.show()
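The reshape used above is the standard pd.melt wide-to-long pattern; a toy illustration with made-up numbers (hypothetical data, not query results):

```python
import pandas as pd

# Hypothetical wide table: one row per publisher, one column per marker
wide = pd.DataFrame({
    "name": ["A", "B"],
    "data_availability": [10.0, 20.0],
    "code_availability": [1.0, 2.0],
})
# id_vars stays fixed; every other column becomes (variable, value) pairs
long = pd.melt(wide, id_vars="name", var_name="trust_marker", value_name="pct")
# One row per (publisher, marker) pair - ready for a heatmap's z/x/y lists
print(long.shape)  # (4, 3)
```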
3. Trust Markers by research org
There are plenty of other analyses we can carry out using DRI. We are not obliged to pull in aggregated data either - we can pull data in 'row-by-row' and analyse it further in Python. We'll do this in the next example to look at the proportion of articles using Trust Markers at US universities.
Note that this example is set up to work generally. In a notebook environment (like Colab or Jupyter) you can use magic commands to reduce the code down to just the query, storing the result as a dataframe. Google has guidance on doing this. We will use these magic commands in future examples.
%%bigquery markers_by_us_uni
SELECT
p.id AS pub_id,
p.year,
orgs AS org_id,
CONCAT(g.name, ' (', g.address.city, ')') AS org_name,
tm.data.data_availability_statement.present AS das,
tm.code.code_availability_statement.present AS cas,
tm.authors.author_contribution_statement.present AS auth_cont,
tm.authors.conflict_of_interest_statement.present AS conflict_int,
tm.funding.funding_statement.present AS funding,
CASE
WHEN tm.ethical_approval.ethical_approval_statement.present IS TRUE AND ('Humans' IN UNNEST(p.mesh_terms) OR 'Animals' IN UNNEST(p.mesh_terms)) THEN TRUE
WHEN tm.ethical_approval.ethical_approval_statement.present IS FALSE AND ('Humans' IN UNNEST(p.mesh_terms) OR 'Animals' IN UNNEST(p.mesh_terms)) THEN FALSE
ELSE NULL
END AS ethics
FROM dimensions-ai.data_analytics.publications p,
UNNEST(research_orgs) orgs
INNER JOIN `dimensions-ai-integrity.data.trust_markers` tm
ON p.id = tm.id
INNER JOIN dimensions-ai.data_analytics.grid g
ON orgs = g.id
AND 'Education' IN UNNEST(g.types)
AND g.address.country = "United States"
WHERE
p.year = 2022
AND p.document_type.classification = 'RESEARCH_ARTICLE'
#you can see we've got the result straight to a df; take a look
markers_by_us_uni.head(5)
 | pub_id | year | org_id | org_name | das | cas | auth_cont | conflict_int | funding | ethics
---|---|---|---|---|---|---|---|---|---|---
0 | pub.1141586193 | 2022 | grid.36425.36 | Stony Brook University (Stony Brook) | False | False | False | False | True | <NA> |
1 | pub.1141411763 | 2022 | grid.5288.7 | Oregon Health & Science University (Portland) | False | False | True | True | False | False |
2 | pub.1141060825 | 2022 | grid.262273.0 | Queens College, CUNY (New York) | True | False | False | True | False | False |
3 | pub.1140891786 | 2022 | grid.255935.d | Fisk University (Nashville) | False | False | False | True | False | <NA> |
4 | pub.1141168735 | 2022 | grid.152326.1 | Vanderbilt University (Nashville) | False | False | False | False | False | True |
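Note the <NA> values in the ethics column: pandas holds these flags in its nullable boolean dtype, and in the 'at least one marker' test that follows, NA comparisons are skipped by .any(), so a missing ethics flag never counts as a marker. A toy illustration (hypothetical values):

```python
import pandas as pd

df = pd.DataFrame({
    "das": [True, False, False],
    "ethics": pd.array([True, pd.NA, pd.NA], dtype="boolean"),
})
# .eq(1) leaves <NA> as NA; .any(axis=1) skips NA values by default,
# so rows whose only candidate flag is <NA> come out False
df["tm"] = df[["das", "ethics"]].eq(1).any(axis=1)
print(df["tm"].tolist())  # [True, False, False]
```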
#now we'll manipulate as required in Python
marker_df = markers_by_us_uni
#flag marker cols
marker_cols = ["das", "cas", "auth_cont", "conflict_int", "funding", "ethics"]
#work out if there is at least one marker
marker_df["tm"] = marker_df[marker_cols].eq(1).any(axis = 1)
#identify institutions with at least 1,000 publications
gt1000 = (marker_df.
groupby(["org_id"], as_index = False).
agg({"pub_id": "count"})
)
gt1000 = gt1000[gt1000["pub_id"] >= 1000]["org_id"].to_list()
#summary
marker_sum = (marker_df.
groupby(["org_id", "org_name", "tm"], as_index = False).
agg({"pub_id": "count"})
)
#add on %
marker_sum["pct"] = 100 * marker_sum["pub_id"] / marker_sum.groupby(["org_id"])["pub_id"].transform("sum")
#keep only institutions with >=1,000 pubs and drop tm = False rows
marker_sum = marker_sum[(marker_sum["org_id"].isin(gt1000)) & (marker_sum["tm"] == True)]
#sort and slice to keep data manageable, pick the top 10 by number of publications...
marker_sum = marker_sum.sort_values("pub_id", ascending = False).head(10)
#...then order by pct for purposes of plot
marker_sum = marker_sum.sort_values("pct", ascending = False)
marker_sum
 | org_id | org_name | tm | pub_id | pct
---|---|---|---|---|---
1730 | grid.38142.3c | Harvard University (Cambridge) | True | 15532 | 85.797934 |
1309 | grid.266102.1 | University of California, San Francisco (San F... | True | 5955 | 85.511200 |
101 | grid.21107.35 | Johns Hopkins University (Baltimore) | True | 8246 | 84.800494 |
3785 | grid.5386.8 | Cornell University (Ithaca) | True | 5846 | 83.765582 |
1722 | grid.34477.33 | University of Washington (Seattle) | True | 7825 | 83.475571 |
47 | grid.168010.e | Stanford University (Stanford) | True | 7731 | 83.030824 |
713 | grid.25879.31 | University of Pennsylvania (Philadelphia) | True | 6571 | 82.747765 |
3647 | grid.47100.32 | Yale University (New Haven) | True | 5992 | 82.488987 |
85 | grid.19006.3e | University of California, Los Angeles (Los Ang... | True | 6714 | 82.329859 |
115 | grid.214458.e | University of Michigan–Ann Arbor (Ann Arbor) | True | 7672 | 80.301444 |
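The pct column above comes from the groupby/transform split-apply pattern: each group's count is divided by that group's total, which transform broadcasts back onto every row. On toy data (made-up counts, purely illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "org":  ["x", "x", "y"],
    "tm":   [True, False, True],
    "pubs": [80, 20, 50],
})
# transform("sum") returns a Series aligned to df's rows, holding each
# org's total, so the division yields a within-group percentage
df["pct"] = 100 * df["pubs"] / df.groupby("org")["pubs"].transform("sum")
print(df["pct"].tolist())  # [80.0, 20.0, 100.0]
```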
#plot the data
plot = px.bar(marker_sum, x = "pct", y = "org_name")
plot.show()
4. Author contributions to articles
You can go beyond just the 'basic' Trust Markers with Dimensions Research Integrity. You can also look at related data, such as recorded contributions to papers by individuals, or at which repositories data is being deposited in.
Let's take a look at author contributions by research categorisation (note: articles falling under multiple categories are counted once in each category, and articles mentioning the same verb more than once are only counted once per category). This will help us understand acknowledgement patterns in research and possibly identify discipline areas where practice is 'ahead of the curve'.
%%bigquery cont_df
SELECT
p.year,
cat.name,
contributor_verbs,
COUNT(DISTINCT p.id) publications
FROM dimensions-ai.data_analytics.publications p,
UNNEST(category_for.first_level.full) cat
INNER JOIN `dimensions-ai-integrity.data.trust_markers` tm
ON p.id = tm.id,
UNNEST(tm.authors.author_roles.keywords) contributor_verbs
WHERE p.type = 'article' AND p.year BETWEEN 2011 AND 2022
GROUP BY 1, 2, 3
#see what we get
cont_df.head(5)
 | year | name | contributor_verbs | publications
---|---|---|---|---
0 | 2021 | Biomedical And Clinical Sciences | using | 2733 |
1 | 2016 | Biomedical And Clinical Sciences | using | 1108 |
2 | 2022 | Psychology | using | 315 |
3 | 2015 | Biomedical And Clinical Sciences | using | 723 |
4 | 2021 | Agricultural, Veterinary And Food Sciences | using | 456 |
There are a lot of variables (fields and verbs) here and we can't visualise them all in one go. Instead, let's identify the top five verbs and six fields (by number of publications) and stick with just those for now.
#get the most common verbs
common_verbs = (cont_df.
groupby(["contributor_verbs"], as_index = False).
agg({"publications": "sum"}).
sort_values("publications", ascending = False).
head(5)
)["contributor_verbs"].to_list()
common_verbs
['performed', 'wrote', 'designed', 'contributed', 'approved']
#and the most common fields
common_fields = (cont_df.
groupby(["name"], as_index = False).
agg({"publications": "sum"}).
sort_values("publications", ascending = False).
head(6)
)["name"].to_list()
common_fields
['Biomedical And Clinical Sciences',
'Biological Sciences',
'Engineering',
'Health Sciences',
'Chemical Sciences',
'Agricultural, Veterinary And Food Sciences']
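Both selections above use the same aggregate-sort-head chain; pandas' nlargest is an equivalent shortcut, shown here on made-up counts (hypothetical data):

```python
import pandas as pd

df = pd.DataFrame({
    "verb": ["wrote", "wrote", "performed", "approved"],
    "publications": [5, 3, 6, 1],
})
# Sum per verb (wrote = 8, performed = 6, approved = 1),
# then take the two largest totals in one call
top = (df.groupby("verb", as_index=False)["publications"].sum()
         .nlargest(2, "publications")["verb"].tolist())
print(top)  # ['wrote', 'performed']
```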
#filter the data accordingly and sort
cont_df = cont_df[(cont_df["contributor_verbs"].isin(common_verbs)) & (cont_df["name"].isin(common_fields))].sort_values("year")
#and now plot the results
plot = px.line(
cont_df,
x = "year",
y = "publications",
color = "contributor_verbs",
facet_col = "name",
facet_col_wrap = 3
)
plot.show()
5. Repository usage
Now we'll take a look at how we can access the repositories data. Suppose we want to see the breadth of usage of specific repositories across funders and research orgs, to understand how widely the most common repositories are used.
%%bigquery repo_df
--the first two tables created by the WITH statements are needed to get full count of repositories based on names and URLs found
WITH
common_keywords AS(
SELECT
kw,
COUNT(DISTINCT id) AS pubs
FROM `dimensions-ai-integrity.data.trust_markers` tm,
UNNEST(tm.data.data_locations.repositories) kw
GROUP BY 1
),
repositories AS(
SELECT
id,
kw,
'url' isin
FROM `dimensions-ai-integrity.data.trust_markers` tm,
UNNEST(tm.data.data_locations.repository_urls) url
INNER JOIN common_keywords
ON REGEXP_CONTAINS(REPLACE(url,'10.17632','mendeley'), LOWER(kw))
UNION DISTINCT
SELECT
id,
replace(
replace(
#replace(
replace(kw,'open science framework','osf'),
#'gene','geo'),
'gene expression omnibus','geo'),
'sequence read archive','sra') kw,
'keyword' isin
FROM `dimensions-ai-integrity.data.trust_markers` tm,
UNNEST(tm.data.data_locations.repositories) kw
WHERE kw != 'board'
),
funders AS(
SELECT
pubs.id AS pub_id,
fund.grid_id AS funder_id
FROM `dimensions-ai.data_analytics.publications` pubs,
UNNEST(funding_details) fund
WHERE pubs.year = 2021
),
orgs AS(
SELECT
pubs.id AS pub_id,
org
FROM `dimensions-ai.data_analytics.publications` pubs,
UNNEST(research_orgs) org
WHERE pubs.year = 2021
), combined AS(
SELECT
rep.id,
CASE
WHEN REGEXP_CONTAINS(rep.kw, 'github') THEN 'github'
WHEN REGEXP_CONTAINS(rep.kw, 'osf') THEN 'osf'
WHEN REGEXP_CONTAINS(rep.kw, 'ncbi') THEN 'ncbi'
ELSE rep.kw
END AS kw,
f.funder_id AS funder,
o.org AS ro
FROM repositories rep
INNER JOIN funders f
ON rep.id = f.pub_id
INNER JOIN orgs o
ON rep.id = o.pub_id
)
SELECT
kw,
COUNT(DISTINCT funder) AS funders,
COUNT(DISTINCT ro) AS orgs,
COUNT(DISTINCT id) AS pubs
FROM combined
GROUP BY 1
ORDER BY pubs DESC
LIMIT 10
#see what we've got
repo_df
 | kw | funders | orgs | pubs
---|---|---|---|---
0 | ncbi | 1858 | 7378 | 16104 |
1 | github | 1680 | 6787 | 13160 |
2 | geo | 1456 | 5115 | 8802 |
3 | zenodo | 1175 | 4828 | 6757 |
4 | gene | 1429 | 5311 | 6441 |
5 | bioproject | 1136 | 4369 | 5925 |
6 | sra | 1172 | 4292 | 5817 |
7 | genbank | 1025 | 4123 | 5019 |
8 | figshare | 1164 | 4422 | 4683 |
9 | impact | 1176 | 4649 | 3659 |
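The CASE/REGEXP_CONTAINS step in the combined table collapses repository name variants onto a canonical keyword. The same idea in Python, as a rough sketch (the function name is hypothetical, and only the three rules from the SQL are mirrored):

```python
import re

def normalize_repo(kw: str) -> str:
    # Collapse variants: any keyword mentioning github/osf/ncbi
    # maps to that canonical name; everything else passes through
    for canon in ("github", "osf", "ncbi"):
        if re.search(canon, kw):
            return canon
    return kw

print(normalize_repo("github.com"))       # github
print(normalize_repo("ncbi bioproject"))  # ncbi
print(normalize_repo("zenodo"))           # zenodo
```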
#plot
plot = px.scatter(
repo_df,
x = "funders",
y = "orgs",
hover_data = ["kw"]
)
plot.show()