Usage of Trust Markers in research

Use case

'Trust Markers' are indicators of integrity, professionalism and reproducibility in scientific research. They also highlight the level of research transparency within the document, and reduce the risks of allowing non-compliance to research integrity policies to go unobserved.

This notebook takes you through a few examples which address the questions below. It makes use of the Dimensions Research Integrity Dataset, an additional module for Dimensions on Google BigQuery (GBQ).

Using the dataset, you can answer questions such as:

  • How many research articles use Trust Markers?
  • How does coverage of Trust Markers differ across publishers, funders and research organisations?
  • If researchers are using Trust Markers (eg, data availability statements), how many are putting their data in repositories (and which repositories)?

Prerequisites

In order to run this tutorial, please ensure that you have access to the Dimensions Research Integrity module on GBQ and can authenticate to Google Cloud (see step 1 below).

About Trust Markers

The Trust Markers in the dataset capture the integrity and reproducibility of scientific research, representing a contract between authors and readers that proper research practices have been observed.

To read definitions of specific Trust Markers, see the GBQ schema documentation.

Method

This notebook retrieves Trust Marker and publication data from Dimensions, the world's largest linked research information dataset. In particular, the Trust Markers are taken from the DRI module.

To complete the analysis the following steps are taken:

  1. Connect to the Dimensions database.
  2. Gather information about general use of Trust Markers, broken down by publisher.
  3. Look at how usage of Trust Markers breaks down by research organisations in the US, by joining on data from GRID.
  4. Find out some of the most commonly claimed contributions by individuals to research across different fields.
  5. Understand the usage of repositories across funders, research orgs and research articles.

1. Connect

You will need to be authenticated to run these queries - see the "Verifying your connection" tutorial for options.

from google.colab import auth
auth.authenticate_user()
print('Authenticated')
Authenticated
#import other packages/modules needed for analysis
from google.cloud import bigquery
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objs as go
#config to avoid having to declare parameters multiple times
project_id = "ds-ripeta-gbq" #replace with the project ID associated with your account

from google.cloud.bigquery import magics
magics.context.project = project_id

client = bigquery.Client(project = project_id)

2. Trust Markers by publisher

Write and run a query

In this instance we will limit data to 2022 and 10 publishers to keep things manageable.

#write the query - we're limiting results here to keep things easy to follow
qry = client.query("""
SELECT
    p.publisher.name,
    100 * COUNTIF(tm.data.data_availability_statement.present)/ COUNT(p.id) AS data_availability,
    100 * COUNTIF(tm.code.code_availability_statement.present)/COUNT(p.id) AS code_availability,
    100 * COUNTIF(tm.authors.author_contribution_statement.present)/COUNT(p.id) AS author_contributions,
    100 * COUNTIF(tm.authors.conflict_of_interest_statement.present)/COUNT(p.id) AS conflict_interest,
    100 * COUNTIF(tm.funding.funding_statement.present)/COUNT(p.id) AS funding_statement,
    #note here we are only counting articles with mesh terms of 'animal' or 'human' as if this criteria isn't met it is unlikely an ethics statement would be expected
    100 * COUNTIF((tm.ethical_approval.ethical_approval_statement.present AND (('Humans' IN UNNEST(p.mesh_terms)) OR ('Animals' IN UNNEST(p.mesh_terms)))) )/
        NULLIF(COUNTIF(('Humans' in UNNEST(p.mesh_terms)) OR ('Animals' IN unnest(p.mesh_terms))), 0) AS ethics_approval
FROM
    dimensions-ai.data_analytics.publications p
INNER JOIN `dimensions-ai-integrity.data.trust_markers` tm
    ON p.id = tm.id
WHERE p.year = 2022 AND p.document_type.classification = 'RESEARCH_ARTICLE'
GROUP BY 1
ORDER BY COUNT(p.id) DESC #order by number of publications in the trust marker dataset
--To keep things manageable for display purposes, we'll only look at 10 publishers for now
LIMIT 10
""")

#get the results
results = qry.result().to_dataframe() #may take a while depending on how much data you return

#take a peek
results
name data_availability code_availability author_contributions conflict_interest funding_statement ethics_approval
0 Elsevier 22.890000 1.597544 17.378246 80.405263 57.918947 18.943601
1 Springer Nature 49.878036 8.742873 52.908014 40.728133 71.049643 65.175558
2 MDPI 88.760718 1.339980 97.192921 99.243270 91.880857 13.336054
3 Wiley 39.376133 1.139870 20.882104 7.671746 58.562563 16.154602
4 Frontiers 97.583713 1.031047 99.547012 6.398587 79.234019 75.641116
5 Taylor & Francis 17.195639 0.605616 11.305576 83.174962 52.912265 28.457937
6 Institute of Electrical and Electronics Engine... 0.300312 0.723896 0.208514 0.599313 3.561780 5.201794
7 American Chemical Society (ACS) 2.348741 0.794581 42.137319 1.914696 75.219210 4.040979
8 Hindawi 85.378842 1.029393 23.275609 85.217410 52.681454 23.068174
9 SAGE Publications 9.527544 0.604082 18.841010 4.317599 90.031582 33.596158

Visualize your results

#make data 'long'
long_results = pd.melt(
    results,
    id_vars = "name",
    value_vars = results.columns.tolist()[1:7],
    var_name = "trust_marker",
    value_name = "pct"
    )

#convert your data to a dictionary so plotly can handle it
result_dict = {
    "z": long_results.pct.tolist(),
    "x": long_results.trust_marker.tolist(),
    "y": long_results.name.tolist()
}

#plot
plot = go.Figure(
    data = go.Heatmap(
        result_dict,
        colorscale = "Blues"
        )
    )

plot.show()

3. Trust Markers by research org

There are plenty of other analyses we can carry out using DRI. Nor are we obliged to pull in aggregated data - we can pull data in 'row-by-row' and analyse it further in Python. We'll do this in the next example to look at the proportion of articles using Trust Markers at US universities.

Note that this example is set up to work generally. In a notebook environment (like Colab or Jupyter) you can use magic commands to reduce the code down to just the query and store the result as a dataframe. Google has guidance on doing this. We will use these magic commands in future examples.

%%bigquery markers_by_us_uni

SELECT
    p.id AS pub_id,
    p.year,
    orgs AS org_id,
    CONCAT(g.name, ' (', g.address.city, ')') AS org_name,
    tm.data.data_availability_statement.present AS das,
    tm.code.code_availability_statement.present AS cas,
    tm.authors.author_contribution_statement.present AS auth_cont,
    tm.authors.conflict_of_interest_statement.present AS conflict_int,
    tm.funding.funding_statement.present AS funding,
    CASE
      WHEN tm.ethical_approval.ethical_approval_statement.present IS TRUE AND ('Humans' IN UNNEST(p.mesh_terms) OR 'Animals' IN UNNEST(p.mesh_terms)) THEN TRUE
      WHEN tm.ethical_approval.ethical_approval_statement.present IS FALSE AND ('Humans' IN UNNEST(p.mesh_terms) OR 'Animals' IN UNNEST(p.mesh_terms)) THEN FALSE
      ELSE NULL
      END AS ethics
FROM dimensions-ai.data_analytics.publications p,
     UNNEST(research_orgs) orgs
INNER JOIN `dimensions-ai-integrity.data.trust_markers` tm
    ON p.id = tm.id
INNER JOIN dimensions-ai.data_analytics.grid g
    ON orgs = g.id
    AND 'Education' IN UNNEST(g.types)
    AND g.address.country = "United States"
WHERE
    p.year = 2022
    AND p.document_type.classification = 'RESEARCH_ARTICLE'
#you can see we've got the result straight to a df; take a look
markers_by_us_uni.head(5)
pub_id year org_id org_name das cas auth_cont conflict_int funding ethics
0 pub.1141586193 2022 grid.36425.36 Stony Brook University (Stony Brook) False False False False True <NA>
1 pub.1141411763 2022 grid.5288.7 Oregon Health & Science University (Portland) False False True True False False
2 pub.1141060825 2022 grid.262273.0 Queens College, CUNY (New York) True False False True False False
3 pub.1140891786 2022 grid.255935.d Fisk University (Nashville) False False False True False <NA>
4 pub.1141168735 2022 grid.152326.1 Vanderbilt University (Nashville) False False False False False True
#now we'll manipulate as required in Python
marker_df = markers_by_us_uni

#flag marker cols
marker_cols = ["das", "cas", "auth_cont", "conflict_int", "funding", "ethics"]

#work out if there is at least one marker
marker_df["tm"] = marker_df[marker_cols].eq(1).any(axis = 1)

#institutions with at least 1,000 publications
gt1000 = (marker_df.
          groupby(["org_id"], as_index = False).
          agg({"pub_id": "count"})
          )

gt1000 = gt1000[gt1000["pub_id"] >= 1000]["org_id"].to_list()

#summary
marker_sum = (marker_df.
              groupby(["org_id", "org_name", "tm"], as_index = False).
              agg({"pub_id": "count"})
)

#add on %
marker_sum["pct"] = 100 * marker_sum["pub_id"] / marker_sum.groupby(["org_id"])["pub_id"].transform("sum")

#remove institutions w/ <1,000 pubs and tm = False rows
marker_sum = marker_sum[(marker_sum["org_id"].isin(gt1000)) & (marker_sum["tm"] == True)]

#sort and slice to keep data manageable, pick the top 10 by number of publications...
marker_sum = marker_sum.sort_values("pub_id", ascending = False).head(10)
#...then order by pct for purposes of plot
marker_sum = marker_sum.sort_values("pct", ascending = False)

marker_sum
org_id org_name tm pub_id pct
1730 grid.38142.3c Harvard University (Cambridge) True 15532 85.797934
1309 grid.266102.1 University of California, San Francisco (San F... True 5955 85.511200
101 grid.21107.35 Johns Hopkins University (Baltimore) True 8246 84.800494
3785 grid.5386.8 Cornell University (Ithaca) True 5846 83.765582
1722 grid.34477.33 University of Washington (Seattle) True 7825 83.475571
47 grid.168010.e Stanford University (Stanford) True 7731 83.030824
713 grid.25879.31 University of Pennsylvania (Philadelphia) True 6571 82.747765
3647 grid.47100.32 Yale University (New Haven) True 5992 82.488987
85 grid.19006.3e University of California, Los Angeles (Los Ang... True 6714 82.329859
115 grid.214458.e University of Michigan–Ann Arbor (Ann Arbor) True 7672 80.301444
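The percentage column above relies on pandas' groupby/transform pattern: `transform("sum")` broadcasts each group's total back onto its rows, so each row can be divided by its own group's sum. A minimal sketch with toy data (hypothetical org IDs, not real GRID records):

```python
import pandas as pd

#toy data: publication counts per org, split by whether any Trust Marker is present
df = pd.DataFrame({
    "org_id": ["grid.A", "grid.A", "grid.B", "grid.B"],
    "tm": [True, False, True, False],
    "pub_id": [80, 20, 30, 70],
})

#transform("sum") returns a series aligned to df's index, holding each org's total,
#so pct is each row's share of its org's publications
df["pct"] = 100 * df["pub_id"] / df.groupby("org_id")["pub_id"].transform("sum")

#keep only the rows with at least one Trust Marker, as in the analysis above
print(df[df["tm"]])
```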
#plot the data
plot = px.bar(marker_sum, x = "pct", y = "org_name")
plot.show()

4. Author contributions to articles

You can go beyond just the 'basic' Trust Markers with Dimensions Research Integrity. You can also look at related data, such as recorded contributions to papers by individuals, or at which repositories data is being deposited in.

Let's take a look at author contributions by research categorisation (note: articles falling under multiple categories are counted once in each category, and articles mentioning the same verb more than once are only counted once per category). This will help us understand acknowledgement patterns in research and possibly identify discipline areas where practice is 'ahead of the curve'.
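Those counting rules can be checked on toy data. A minimal pandas sketch (hypothetical publication IDs and categories, not the SQL used in this section) showing that a duplicate verb mention within a category collapses, while a publication still counts once in each of its categories:

```python
import pandas as pd

#one row per (publication, category, verb) mention; pub1 sits in two
#categories and mentions 'wrote' twice within Biology
rows = pd.DataFrame({
    "pub_id": ["pub1", "pub1", "pub1", "pub2"],
    "category": ["Biology", "Biology", "Chemistry", "Biology"],
    "verb": ["wrote", "wrote", "wrote", "designed"],
})

#COUNT(DISTINCT p.id) per (category, verb): nunique de-duplicates repeat
#mentions within a category, but keeps one count per category
counts = (rows.groupby(["category", "verb"])["pub_id"]
              .nunique()
              .reset_index(name="publications"))
print(counts)
```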

%%bigquery cont_df

SELECT
    p.year,
    cat.name,
    contributor_verbs,
    COUNT(DISTINCT p.id) publications
FROM dimensions-ai.data_analytics.publications p,
    UNNEST(category_for.first_level.full) cat
INNER JOIN `dimensions-ai-integrity.data.trust_markers` tm
    ON p.id = tm.id,
    UNNEST(tm.authors.author_roles.keywords) contributor_verbs
WHERE p.type = 'article' AND p.year BETWEEN 2011 AND 2022
GROUP BY 1, 2, 3
#see what we get
cont_df.head(5)
year name contributor_verbs publications
0 2021 Biomedical And Clinical Sciences using 2733
1 2016 Biomedical And Clinical Sciences using 1108
2 2022 Psychology using 315
3 2015 Biomedical And Clinical Sciences using 723
4 2021 Agricultural, Veterinary And Food Sciences using 456

There are a lot of variables (field and verbs) here and we can't visualise them all in one go. Instead, let's identify the top five verbs and six fields (by number of publications) and stick with just those for now.

#get the most common verbs
common_verbs = (cont_df.
                groupby(["contributor_verbs"], as_index = False).
                agg({"publications": "sum"}).
                sort_values("publications", ascending = False).
                head(5)
                )["contributor_verbs"].to_list()

common_verbs
['performed', 'wrote', 'designed', 'contributed', 'approved']
#and the most common fields
common_fields = (cont_df.
                groupby(["name"], as_index = False).
                agg({"publications": "sum"}).
                sort_values("publications", ascending = False).
                head(6)
                )["name"].to_list()

common_fields
['Biomedical And Clinical Sciences',
 'Biological Sciences',
 'Engineering',
 'Health Sciences',
 'Chemical Sciences',
 'Agricultural, Veterinary And Food Sciences']
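Both lists above come from the same chained groupby/agg/sort/head pattern, which is easy to sanity-check on toy data (hypothetical verb counts):

```python
import pandas as pd

#toy long-format counts, mirroring the shape of cont_df
toy = pd.DataFrame({
    "contributor_verbs": ["wrote", "wrote", "designed", "approved"],
    "publications": [10, 5, 8, 2],
})

#sum per verb, sort descending, keep the top two, extract as a list
top2 = (toy.groupby("contributor_verbs", as_index = False)
           .agg({"publications": "sum"})
           .sort_values("publications", ascending = False)
           .head(2)
           )["contributor_verbs"].to_list()

print(top2)
```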
#filter the data accordingly and sort
cont_df = cont_df[(cont_df["contributor_verbs"].isin(common_verbs)) & (cont_df["name"].isin(common_fields))].sort_values("year")

#and now plot the results
plot = px.line(
    cont_df,
    x = "year",
    y = "publications",
    color = "contributor_verbs",
    facet_col = "name",
    facet_col_wrap = 3
)
plot.show()

5. Repository usage

Now we'll take a look at how we can access the repositories data. Suppose we want to see the breadth of usage of specific repositories across funders and research orgs, to understand how widely the most common repositories are used.
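The query below spends most of its effort normalising repository keywords: collapsing name variants ('open science framework' vs 'osf', 'gene expression omnibus' vs 'geo') and grouping sub-repositories under a parent. That logic can be mirrored in plain Python; this is an illustrative sketch of the idea (a hypothetical helper, not the exact SQL):

```python
def canonical(kw: str) -> str:
    """Map common repository keyword variants onto a canonical name."""
    #collapse long-form names onto their abbreviations, as the query does
    kw = (kw.replace("open science framework", "osf")
            .replace("gene expression omnibus", "geo")
            .replace("sequence read archive", "sra"))
    #mirror the CASE expression: group sub-repositories under a parent name
    for parent in ("github", "osf", "ncbi"):
        if parent in kw:
            return parent
    return kw

print(canonical("open science framework"))  # -> osf
print(canonical("github pages"))            # -> github
```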

%%bigquery repo_df

--the first two tables created by the WITH statements are needed to get a full count of repositories based on the names and URLs found
WITH
  common_keywords AS(
  SELECT
    kw,
    COUNT(DISTINCT id) AS pubs
  FROM `dimensions-ai-integrity.data.trust_markers` tm,
    UNNEST(tm.data.data_locations.repositories) kw
  GROUP BY 1
),
repositories AS(
  SELECT
  id,
  kw,
  'url' isin
    FROM `dimensions-ai-integrity.data.trust_markers` tm,
      UNNEST(tm.data.data_locations.repository_urls) url
    INNER JOIN common_keywords
      ON REGEXP_CONTAINS(REPLACE(url,'10.17632','mendeley'), LOWER(kw))
    UNION DISTINCT
      SELECT
        id,
          replace(
            replace(
              #replace(
                replace(kw,'open science framework','osf'),
              #'gene','geo'),
              'gene expression omnibus','geo'),
            'sequence read archive','sra') kw,
        'keyword' isin
        FROM `dimensions-ai-integrity.data.trust_markers` tm,
          UNNEST(tm.data.data_locations.repositories) kw
        WHERE kw != 'board'
        ),
  funders AS(
    SELECT
    pubs.id AS pub_id,
    fund.grid_id AS funder_id
    FROM `dimensions-ai.data_analytics.publications` pubs,
      UNNEST(funding_details) fund
    WHERE pubs.year = 2021
  ),
  orgs AS(
    SELECT
      pubs.id AS pub_id,
      org
    FROM `dimensions-ai.data_analytics.publications` pubs,
      UNNEST(research_orgs) org
    WHERE pubs.year = 2021
  ), combined AS(
  SELECT
    rep.id,
     CASE
        WHEN REGEXP_CONTAINS(rep.kw, 'github') THEN 'github'
        WHEN REGEXP_CONTAINS(rep.kw, 'osf') THEN 'osf'
        WHEN REGEXP_CONTAINS(rep.kw, 'ncbi') THEN 'ncbi'
        ELSE rep.kw
      END AS kw,
    f.funder_id AS funder,
    o.org AS ro
  FROM repositories rep
  INNER JOIN funders f
    ON rep.id = f.pub_id
  INNER JOIN orgs o
    ON rep.id = o.pub_id
  )
  SELECT
  kw,
  COUNT(DISTINCT funder) AS funders,
  COUNT(DISTINCT ro) AS orgs,
  COUNT(DISTINCT id) AS pubs
  FROM combined
  GROUP BY 1
  ORDER BY pubs DESC
  LIMIT 10
#see what we've got
repo_df
kw funders orgs pubs
0 ncbi 1858 7378 16104
1 github 1680 6787 13160
2 geo 1456 5115 8802
3 zenodo 1175 4828 6757
4 gene 1429 5311 6441
5 bioproject 1136 4369 5925
6 sra 1172 4292 5817
7 genbank 1025 4123 5019
8 figshare 1164 4422 4683
9 impact 1176 4649 3659
#plot
plot = px.scatter(
    repo_df,
    x = "funders",
    y = "orgs",
    hover_data = ["kw"]
)

plot.show()