Skip to content

Usage of trust markers in research

Use case

'Trust markers' are indicators of integrity, professionalism and reproducibility in scientific research. They also highlight the level of research transparency within the document, and reduce the risks of allowing non-compliance to research integrity policies to go unobserved.

This notebook takes you through a few examples which address the above questions. It makes use of the Dimensions Research Integrity Dataset, an additional module to Dimensions on Google Big Query (GBQ).

Using the dataset, you can answer questions such as:

  • How many research articles use trust markers?
  • How does coverage of trust markers differ across publishers, funders and research organisations?
  • If researchers are using trust markers (eg, data availability statements), how many are putting their data in repositories (and which repositories)?

Prerequisites

In order to run this tutorial, please ensure that:

About trust markers

The trust markers in the Dataset represent the integrity and reproducibility of scientific research. Trust markers represent a contract between authors and readers that proper research practices have been observed. They also highlight the level of research transparency within the document, and reduce the risks of allowing non-compliance to research integrity policies to go unobserved.

Currently, the following trust markers are available via the dataset:

  • Funding statement: States if the author(s) of the paper were granted funding in order to conduct their research.
  • Ethical approval statement: Standalone statement affirming that the conducted research has been carried out in an ethical fashion with proper consent from all participating parties.
  • Competing interests statement: Declares possible sources of bias, based on personal interests of the author(s) in the findings of the research. For example, the source of funding, past or present employers of the author(s), or the author(s) financial interests.
  • Author contributions statement: Details of each author’s role in the development and publication of the manuscript.
  • Repositories: The names of any research data repositories used by the author(s) to preserve, organize and facilitate access to study data.
  • Data locations: Locations where research data (raw or processed) can be accessed.
  • Data availability statement: A dedicated section of a scientific work indicating whether data from the research is available and where it can be found.
  • Code availability statement: States if and how one could gain access to the code used to conduct the study/research.

Method

This notebook retrieves data about trust marker and publication data from Dimensions, the world's largest linked research information datatset. In particular the trust markers are taken from the DRI module.

To complete the analysis the following steps are taken:

  1. Connect to the Dimensions database.
  2. Gather information about general use of trust markers, broken down by publisher.
  3. Look at how usage of trust markers breaks down by research organisations in the US, by joining on data from GRID.
  4. Find out some of the most commonly claimed contributions by inviduals to research across different fields.
  5. Understand the usage of repositories across funders, research orgs and research articles.

1. Connect

You will need to be authenticated to run these queries - see the "Verifying yout connection" tutorial for options.

from google.colab import auth
auth.authenticate_user()
print('Authenticated')
Authenticated
#import other packages/modules needed for analysis
from google.cloud import bigquery
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objs as go

#set up client
client = bigquery.Client(project = 'ds-ripeta-gbq') #replace 'project' with the required project associate with your account

2. Trust markers by publisher

Write and run a query

In this instance we will limit data to 2022 and 10 publishers to keep things manageable.

#write the query - we're limiting results here to keep things easy to follow
qry = client.query("""
SELECT
    p.publisher.name, 
    100 * COUNTIF(tm.data.data_availability_statement)/ COUNT(p.id) AS data_availability,
    100 * COUNTIF(tm.code.code_availability_statement)/COUNT(p.id) AS code_availability,
    100 * COUNTIF(tm.author_contributions.author_contributions_statement)/COUNT(p.id) AS author_contributions,
    100 * COUNTIF(tm.competing_interests.competing_interests_statement)/COUNT(p.id) AS competing_interests,
    100 * COUNTIF(tm.funding.funding_statement)/COUNT(p.id) AS funding_statement,
    #note here we are only counting articles with mesh terms of 'animal' or 'human' as if this criteria isn't met it is unlikely an ethics statement would be expected
    100 * COUNTIF((tm.ethics.ethics_approval_statement AND (('Humans' IN UNNEST(p.mesh_terms)) OR ('Animals' IN UNNEST(p.mesh_terms)))) )/
        NULLIF(COUNTIF(('Humans' in UNNEST(p.mesh_terms)) OR ('Animals' IN unnest(p.mesh_terms))), 0) AS ethics_approval
FROM 
    dimensions-ai.data_analytics.publications p 
INNER JOIN `dimensions-ai-integrity.data.trust_markers` tm
    ON p.id = tm.id
WHERE p.year = 2022 AND p.document_type.classification = 'RESEARCH_ARTICLE'
GROUP BY 1
ORDER BY COUNT(p.id) DESC #order by number of publications in the trust marker dataset
--To keep things manageable for display purposes, we'll only look at 10 publishers for now
LIMIT 10
""")

#get the results
results = qry.result().to_dataframe() #may take a while depending on how much data your return

#take a peak
results
name data_availability code_availability author_contributions competing_interests funding_statement ethics_approval
0 Elsevier 23.220141 1.255325 9.852915 85.654392 55.983485 20.768440
1 Springer Nature 47.947094 5.239463 51.224349 63.354514 70.045505 66.610701
2 Wiley 41.088111 0.609560 18.156035 42.476518 54.245125 18.035888
3 MDPI 90.374943 0.825924 98.102639 92.952828 96.745167 14.255689
4 Frontiers 97.561644 0.484786 99.626311 57.620222 78.402516 75.672164
5 Institute of Electrical and Electronics Engine... 0.423834 0.548492 0.265446 4.599117 3.783712 5.615234
6 American Chemical Society (ACS) 2.445753 0.355457 51.336057 75.126886 71.855271 4.811294
7 Hindawi 86.262373 1.167437 21.282875 44.916294 50.766539 23.535921
8 Royal Society of Chemistry (RSC) 6.180646 0.328791 44.383674 79.058519 76.175271 14.142808
9 IOP Publishing 28.083510 0.314284 2.292990 38.624848 43.797704 19.267139

Visualize your results

#make data 'long'
long_results = pd.melt(
    results, 
    id_vars = "name", 
    value_vars = results.columns.tolist()[1:6],
    var_name = "trust_marker",
    value_name = "pct"
    )

#convert your data to a dictionary so plotly can handle it
result_dict = {
    "z": long_results.pct.tolist(),
    "x": long_results.trust_marker.tolist(),
    "y": long_results.name.tolist()
}

#plot
plot = go.Figure(
    data = go.Heatmap(
        result_dict,
        colorscale = "Blues"
        )
    )
plot.show()

3. Trust markers by research org

There are plenty of other analyses we can carry out using DRI. We are also not obliged to pull in aggregated data - we can also pull in data 'row-by-row' and analyse further in Python. We'll do this in the next example to look at the proportion of articles using trust markers at US universities.

Note that this example is set up to work generally. In a notebook environment (like Colab or Jupyter) you can use magic commands to reduce the code down to just the query and store as a summary dataframe. Google has guidance on doing this. We will use this magic commands in future examples.

%%bigquery markers_by_us_uni --project ds-ripeta-gbq 

SELECT
    p.id AS pub_id,
    p.year, 
    orgs AS org_id,
    CONCAT(g.name, ' (', g.address.city, ')') AS org_name,
    tm.data.data_availability_statement AS das,  
    tm.code.code_availability_statement AS cas,
    tm.author_contributions.author_contributions_statement AS auth_cont,
    tm.competing_interests.competing_interests_statement AS comp_int,
    tm.funding.funding_statement AS funding,
    CASE
      WHEN tm.ethics.ethics_approval_statement IS TRUE AND ('Humans' IN UNNEST(p.mesh_terms) OR 'Animals' IN UNNEST(p.mesh_terms)) THEN TRUE
      WHEN tm.ethics.ethics_approval_statement IS FALSE AND ('Humans' IN UNNEST(p.mesh_terms) OR 'Animals' IN UNNEST(p.mesh_terms)) THEN FALSE
      ELSE NULL
      END AS ethics
FROM dimensions-ai.data_analytics.publications p,
     UNNEST(research_orgs) orgs
INNER JOIN `dimensions-ai-integrity.data.trust_markers` tm 
    ON p.id = tm.id
INNER JOIN dimensions-ai.data_analytics.grid g 
    ON orgs = g.id 
    AND 'Education' IN UNNEST(g.types)
    AND g.address.country = "United States"
WHERE 
    p.year = 2022 
    AND p.document_type.classification = 'RESEARCH_ARTICLE'
Query is running:   0%|          |



Downloading:   0%|          |
#you can see we've got the result straight to a df; take a look
markers_by_us_uni.head(5)
pub_id year org_id org_name das cas auth_cont comp_int funding ethics
0 pub.1128274945 2022 grid.131063.6 University of Notre Dame (Notre Dame) False False False False False <NA>
1 pub.1131474369 2022 grid.267627.0 University of the Sciences (Philadelphia) False False True True False False
2 pub.1131474369 2022 grid.33489.35 University of Delaware (Newark) False False True True False False
3 pub.1126087463 2022 grid.147455.6 Carnegie Mellon University (Pittsburgh) False False False False False <NA>
4 pub.1107329658 2022 grid.266097.c University of California, Riverside (Riverside) True False False False True <NA>
#now we'll manipulate as required in Python
marker_df = markers_by_us_uni

#flag marker cols
marker_cols = ["das", "cas", "auth_cont", "comp_int", "funding", "ethics"]

#work out if there is a least one marker
marker_df["tm"] = marker_df[marker_cols].eq(1).any(axis = 1)

#institutions w/ <=1,000 publications
gt1000 = (marker_df.
          groupby(["org_id"], as_index = False).
          agg({"pub_id": "count"})
          )

gt1000 = gt1000[gt1000["pub_id"] >= 1000]["org_id"].to_list()

#summary
marker_sum = (marker_df.
              groupby(["org_id", "org_name", "tm"], as_index = False).
              agg({"pub_id": "count"})
)

#add on %
marker_sum["pct"] = 100 * marker_sum["pub_id"] / marker_sum.groupby(["org_id"])["pub_id"].transform("sum")

#remove institutions w/ <=1000 pubs and tm = False rows
marker_sum = marker_sum[(marker_sum["org_id"].isin(gt1000)) & (marker_sum["tm"] == True)]

#sort and slice to keep data manageable, pick the top 10 by number of publications...
marker_sum = marker_sum.sort_values("pub_id", ascending = False).head(10)
#...then order by pct for purposes of plot
marker_sum = marker_sum.sort_values("pct", ascending = False)

marker_sum
org_id org_name tm pub_id pct
1688 grid.38142.3c Harvard University (Cambridge) True 12198 86.028634
1275 grid.266102.1 University of California, San Francisco (San F... True 4623 85.865527
101 grid.21107.35 Johns Hopkins University (Baltimore) True 6354 84.259382
1680 grid.34477.33 University of Washington (Seattle) True 6070 84.025471
3424 grid.5386.8 Cornell University (Ithaca) True 4660 83.363148
47 grid.168010.e Stanford University (Stanford) True 6128 82.979012
85 grid.19006.3e University of California, Los Angeles (Los Ang... True 5245 82.820148
3295 grid.47100.32 Yale University (New Haven) True 4667 82.484977
699 grid.25879.31 University of Pennsylvania (Philadelphia) True 5177 82.279085
115 grid.214458.e University of Michigan–Ann Arbor (Ann Arbor) True 6045 81.043035
#plot the data
plot = px.bar(marker_sum, x = "pct", y = "org_name")
plot.show()

4. Author contributions to articles

You can go beyond just the 'basic' trust markers with Dimensions Research Integrity. You can also look at related data, such as recorded contributions to papers by individuals, or at which repositories data is being deposited in.

Let's take a look at author contributions by research categorisation (note: articles falling under multiple categories will be counted once in each category. Articles mentioning the same verb more than once are only counted once per category). This will help understand acknowledgement patterns in research and possibly identify discipline areas where practice is 'ahead of the curve'.

%%bigquery cont_df --project ds-ripeta-gbq

SELECT
    p.year, 
    cat.name,
    contributor_verbs,
    COUNT(DISTINCT p.id) publications      
FROM dimensions-ai.data_analytics.publications p,
    UNNEST(category_for.first_level.full) cat
INNER JOIN `dimensions-ai-integrity.data.trust_markers` tm
    ON p.id = tm.id,
    UNNEST(author_contributions.activity_keywords) contributor_verbs
WHERE  p.type= 'article' and p.year between 2011 and 2022
group by 1, 2, 3
Query is running:   0%|          |



Downloading:   0%|          |
#see what we get
cont_df.head(5)
year name contributor_verbs publications
0 2022 Chemical Sciences using 1322
1 2021 Health Sciences using 442
2 2022 Environmental Sciences using 309
3 2018 Built Environment And Design using 29
4 2020 Mathematical Sciences using 82

There are a lot of variables (field and verbs) here and we can't visualise them all in one go. Instead, let's identify the top five verbs and six fields (by number of publications) and stick with just those for now.

#get the most common verbs
common_verbs = (cont_df.
                groupby(["contributor_verbs"], as_index = False).
                agg({"publications": "sum"}).
                sort_values("publications", ascending = False).
                head(5)
                )["contributor_verbs"].to_list()

common_verbs
['performed', 'wrote', 'designed', 'contributed', 'approved']
#and the most common fields
common_fields = (cont_df.
                groupby(["name"], as_index = False).
                agg({"publications": "sum"}).
                sort_values("publications", ascending = False).
                head(6)
                )["name"].to_list()

common_fields
['Biomedical And Clinical Sciences',
 'Biological Sciences',
 'Health Sciences',
 'Engineering',
 'Chemical Sciences',
 'Agricultural, Veterinary And Food Sciences']
#filter the data accordingly and sort
cont_df = cont_df[(cont_df["contributor_verbs"].isin(common_verbs)) & (cont_df["name"].isin(common_fields))].sort_values("year")

#and now plot the results
plot = px.line(
    cont_df,
    x = "year",
    y = "publications",
    color = "contributor_verbs",
    facet_col = "name",
    facet_col_wrap = 3
)
plot.show()

5. Repository usage

Now we'll take a look at how we can access the repositories data. Suppose we want to see the breadth of usage of specific repositories across work across funders and research orgs, to understand how wide the usage of the most common repositories is.

%%bigquery repo_df --project ds-ripeta-gbq

--the first two tables created by the WITH statements are needed (for now) to tidy up repository data
WITH
  common_keywords AS(
  SELECT
    kw, 
    COUNT(DISTINCT id) AS pubs
  FROM `dimensions-ai-integrity.data.trust_markers` tm,
    UNNEST(tm.data.repository_keywords) kw
  GROUP BY 1
),
repositories AS(
  SELECT
  id,
  kw, 
  'url' isin
    FROM `dimensions-ai-integrity.data.trust_markers` tm,
      UNNEST(tm.data.data_repository_urls) url
    INNER JOIN common_keywords
      ON REGEXP_CONTAINS(REPLACE(url,'10.17632','mendeley'), LOWER(kw))
    UNION DISTINCT
      SELECT 
        id, 
          replace(
            replace(
              #replace(
                replace(kw,'open science framework','osf'),
              #'gene','geo'),
              'gene expression omnibus','geo'), 
            'sequence read archive','sra') kw, 
        'keyword' isin
        FROM `dimensions-ai-integrity.data.trust_markers` tm,
          UNNEST(tm.data.repository_keywords) kw
        WHERE kw != 'board'
        ),
  funders AS(
    SELECT
    pubs.id AS pub_id,
    fund.grid_id AS funder_id
    FROM `dimensions-ai.data_analytics.publications` pubs,
      UNNEST(funding_details) fund
    WHERE pubs.year = 2021
  ),
  orgs AS(
    SELECT
      pubs.id AS pub_id,
      org
    FROM `dimensions-ai.data_analytics.publications` pubs,
      UNNEST(research_orgs) org
    WHERE pubs.year = 2021
  ), combined AS(
  SELECT
    rep.id,
    rep.kw,
    f.funder_id AS funder,
    o.org AS ro
  FROM repositories rep
  INNER JOIN funders f
    ON rep.id = f.pub_id
  INNER JOIN orgs o
    ON rep.id = o.pub_id
  )
  SELECT
  kw,
  COUNT(DISTINCT funder) AS funders,
  COUNT(DISTINCT ro) AS orgs,
  COUNT(DISTINCT id) AS pubs
  FROM combined
  GROUP BY 1
  ORDER BY pubs DESC
  LIMIT 10
Query is running:   0%|          |



Downloading:   0%|          |
#see what we've got
repo_df
kw funders orgs pubs
0 github 1688 6915 13760
1 ncbi 1630 6586 12219
2 geo 1478 5323 9777
3 gene 1578 6037 9023
4 zenodo 1201 4922 7041
5 bioproject 1148 4442 6138
6 sra 1176 4351 6020
7 genbank 1067 4295 5289
8 figshare 1144 4433 4828
9 map 1002 3893 3315
#plot
plot = px.scatter(
    repo_df,
    x = "funders",
    y = "orgs",
    hover_data = ["kw"]
)

plot.show()