Usage of Trust Markers in research

Use case

'Trust Markers' are indicators of integrity, professionalism and reproducibility in scientific research. They also highlight the level of research transparency within the document, and reduce the risks of allowing non-compliance to research integrity policies to go unobserved.

This notebook takes you through a few examples which address the above questions. It makes use of the Dimensions Research Integrity Dataset, an additional module to Dimensions on Google Big Query (GBQ).

Using the dataset, you can answer questions such as:

How many research articles use Trust Markers?
How does coverage of Trust Markers differ across publishers, funders and research organisations?
If researchers are using Trust Markers (eg, data availability statements), how many are putting their data in repositories (and which repositories)?

Prerequisites

In order to run this tutorial, please ensure that:

You have a valid Dimensions on Google BigQuery account and have configured a Google Cloud project. This must include access to the Dimensions Research Integrity Dataset.
You have some basic familiarity with Python and Jupyter notebooks.

About Trust Markers

The Trust Markers in the Dataset represent the integrity and reproducibility of scientific research. Trust Markers represent a contract between authors and readers that proper research practices have been observed. They also highlight the level of research transparency within the document, and reduce the risks of allowing non-compliance to research integrity policies to go unobserved.

To read definitions of specific Trust Markers, see the GBQ schema documentation.

Method

This notebook retrieves data about trust marker and publication data from Dimensions, the world's largest linked research information datatset. In particular the Trust Markers are taken from the DRI module.

To complete the analysis the following steps are taken:

Connect to the Dimensions database.
Gather information about general use of Trust Markers, broken down by publisher.
Look at how usage of Trust Markers breaks down by research organisations in the US, by joining on data from GRID.
Find out some of the most commonly claimed contributions by inviduals to research across different fields.
Understand the usage of repositories across funders, research orgs and research articles.

1. Connect

You will need to be authenticated to run these queries - see the "Verifying yout connection" tutorial for options.

from google.colab import auth
auth.authenticate_user()
print('Authenticated')

Authenticated

#import other packages/modules needed for analysis
from google.cloud import bigquery
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objs as go

#config to avoid having to declare parameters multiple times
project_id = "ds-ripeta-gbq" #replace 'project' with the required project associate with your account

from google.cloud.bigquery import magics
magics.context.project = project_id

client = bigquery.Client(project = project_id)

2. Trust Markers by publisher

Write and run a query

In this instance we will limit data to 2022 and 10 publishers to keep things manageable.

#write the query - we're limiting results here to keep things easy to follow
qry = client.query("""
SELECT
    p.publisher.name,
    100 * COUNTIF(tm.data.data_availability_statement.present)/ COUNT(p.id) AS data_availability,
    100 * COUNTIF(tm.code.code_availability_statement.present)/COUNT(p.id) AS code_availability,
    100 * COUNTIF(tm.authors.author_contribution_statement.present)/COUNT(p.id) AS author_contributions,
    100 * COUNTIF(tm.authors.conflict_of_interest_statement.present)/COUNT(p.id) AS conflict_interest,
    100 * COUNTIF(tm.funding.funding_statement.present)/COUNT(p.id) AS funding_statement,
    #note here we are only counting articles with mesh terms of 'animal' or 'human' as if this criteria isn't met it is unlikely an ethics statement would be expected
    100 * COUNTIF((tm.ethical_approval.ethical_approval_statement.present AND (('Humans' IN UNNEST(p.mesh_terms)) OR ('Animals' IN UNNEST(p.mesh_terms)))) )/
        NULLIF(COUNTIF(('Humans' in UNNEST(p.mesh_terms)) OR ('Animals' IN unnest(p.mesh_terms))), 0) AS ethics_approval
FROM
    dimensions-ai.data_analytics.publications p
INNER JOIN `dimensions-ai-integrity.data.trust_markers` tm
    ON p.id = tm.id
WHERE p.year = 2022 AND p.document_type.classification = 'RESEARCH_ARTICLE'
GROUP BY 1
ORDER BY COUNT(p.id) DESC #order by number of publications in the trust marker dataset
--To keep things manageable for display purposes, we'll only look at 10 publishers for now
LIMIT 10
""")

#get the results
results = qry.result().to_dataframe() #may take a while depending on how much data your return

#take a peak
results

	name	data_availability	code_availability	author_contributions	conflict_interest	funding_statement	ethics_approval
0	Elsevier	22.890000	1.597544	17.378246	80.405263	57.918947	18.943601
1	Springer Nature	49.878036	8.742873	52.908014	40.728133	71.049643	65.175558
2	MDPI	88.760718	1.339980	97.192921	99.243270	91.880857	13.336054
3	Wiley	39.376133	1.139870	20.882104	7.671746	58.562563	16.154602
4	Frontiers	97.583713	1.031047	99.547012	6.398587	79.234019	75.641116
5	Taylor & Francis	17.195639	0.605616	11.305576	83.174962	52.912265	28.457937
6	Institute of Electrical and Electronics Engine...	0.300312	0.723896	0.208514	0.599313	3.561780	5.201794
7	American Chemical Society (ACS)	2.348741	0.794581	42.137319	1.914696	75.219210	4.040979
8	Hindawi	85.378842	1.029393	23.275609	85.217410	52.681454	23.068174
9	SAGE Publications	9.527544	0.604082	18.841010	4.317599	90.031582	33.596158

Visualize your results

#make data 'long'
long_results = pd.melt(
    results,
    id_vars = "name",
    value_vars = results.columns.tolist()[1:7],
    var_name = "trust_marker",
    value_name = "pct"
    )

#convert your data to a dictionary so plotly can handle it
result_dict = {
    "z": long_results.pct.tolist(),
    "x": long_results.trust_marker.tolist(),
    "y": long_results.name.tolist()
}

#plot
plot = go.Figure(
    data = go.Heatmap(
        result_dict,
        colorscale = "Blues"
        )
    )

plot.show()

3. Trust Markers by research org

There are plenty of other analyses we can carry out using DRI. We are also not obliged to pull in aggregated data - we can also pull in data 'row-by-row' and analyse further in Python. We'll do this in the next example to look at the proportion of articles using Trust Markers at US universities.

Note that this example is set up to work generally. In a notebook environment (like Colab or Jupyter) you can use magic commands to reduce the code down to just the query and store as a summary dataframe. Google has guidance on doing this. We will use this magic commands in future examples.

%%bigquery markers_by_us_uni

SELECT
    p.id AS pub_id,
    p.year,
    orgs AS org_id,
    CONCAT(g.name, ' (', g.address.city, ')') AS org_name,
    tm.data.data_availability_statement.present AS das,
    tm.code.code_availability_statement.present AS cas,
    tm.authors.author_contribution_statement.present AS auth_cont,
    tm.authors.conflict_of_interest_statement.present AS conflict_int,
    tm.funding.funding_statement.present AS funding,
    CASE
      WHEN tm.ethical_approval.ethical_approval_statement.present IS TRUE AND ('Humans' IN UNNEST(p.mesh_terms) OR 'Animals' IN UNNEST(p.mesh_terms)) THEN TRUE
      WHEN tm.ethical_approval.ethical_approval_statement.present IS FALSE AND ('Humans' IN UNNEST(p.mesh_terms) OR 'Animals' IN UNNEST(p.mesh_terms)) THEN FALSE
      ELSE NULL
      END AS ethics
FROM dimensions-ai.data_analytics.publications p,
     UNNEST(research_orgs) orgs
INNER JOIN `dimensions-ai-integrity.data.trust_markers` tm
    ON p.id = tm.id
INNER JOIN dimensions-ai.data_analytics.grid g
    ON orgs = g.id
    AND 'Education' IN UNNEST(g.types)
    AND g.address.country = "United States"
WHERE
    p.year = 2022
    AND p.document_type.classification = 'RESEARCH_ARTICLE'

Query is running:   0%|          |



Downloading:   0%|          |

#you can see we've got the result straight to a df; take a look
markers_by_us_uni.head(5)

	pub_id	year	org_id	org_name	das	cas	auth_cont	conflict_int	funding	ethics
0	pub.1141586193	2022	grid.36425.36	Stony Brook University (Stony Brook)	False	False	False	False	True	<NA>
1	pub.1141411763	2022	grid.5288.7	Oregon Health & Science University (Portland)	False	False	True	True	False	False
2	pub.1141060825	2022	grid.262273.0	Queens College, CUNY (New York)	True	False	False	True	False	False
3	pub.1140891786	2022	grid.255935.d	Fisk University (Nashville)	False	False	False	True	False	<NA>
4	pub.1141168735	2022	grid.152326.1	Vanderbilt University (Nashville)	False	False	False	False	False	True

#now we'll manipulate as required in Python
marker_df = markers_by_us_uni

#flag marker cols
marker_cols = ["das", "cas", "auth_cont", "conflict_int", "funding", "ethics"]

#work out if there is a least one marker
marker_df["tm"] = marker_df[marker_cols].eq(1).any(axis = 1)

#institutions w/ <=1,000 publications
gt1000 = (marker_df.
          groupby(["org_id"], as_index = False).
          agg({"pub_id": "count"})
          )

gt1000 = gt1000[gt1000["pub_id"] >= 1000]["org_id"].to_list()

#summary
marker_sum = (marker_df.
              groupby(["org_id", "org_name", "tm"], as_index = False).
              agg({"pub_id": "count"})
)

#add on %
marker_sum["pct"] = 100 * marker_sum["pub_id"] / marker_sum.groupby(["org_id"])["pub_id"].transform("sum")

#remove institutions w/ <=1000 pubs and tm = False rows
marker_sum = marker_sum[(marker_sum["org_id"].isin(gt1000)) & (marker_sum["tm"] == True)]

#sort and slice to keep data manageable, pick the top 10 by number of publications...
marker_sum = marker_sum.sort_values("pub_id", ascending = False).head(10)
#...then order by pct for purposes of plot
marker_sum = marker_sum.sort_values("pct", ascending = False)

marker_sum

	org_id	org_name	tm	pub_id	pct
1730	grid.38142.3c	Harvard University (Cambridge)	True	15532	85.797934
1309	grid.266102.1	University of California, San Francisco (San F...	True	5955	85.511200
101	grid.21107.35	Johns Hopkins University (Baltimore)	True	8246	84.800494
3785	grid.5386.8	Cornell University (Ithaca)	True	5846	83.765582
1722	grid.34477.33	University of Washington (Seattle)	True	7825	83.475571
47	grid.168010.e	Stanford University (Stanford)	True	7731	83.030824
713	grid.25879.31	University of Pennsylvania (Philadelphia)	True	6571	82.747765
3647	grid.47100.32	Yale University (New Haven)	True	5992	82.488987
85	grid.19006.3e	University of California, Los Angeles (Los Ang...	True	6714	82.329859
115	grid.214458.e	University of Michigan–Ann Arbor (Ann Arbor)	True	7672	80.301444

#plot the data
plot = px.bar(marker_sum, x = "pct", y = "org_name")
plot.show()

4. Author contributions to articles

You can go beyond just the 'basic' Trust Markers with Dimensions Research Integrity. You can also look at related data, such as recorded contributions to papers by individuals, or at which repositories data is being deposited in.

Let's take a look at author contributions by research categorisation (note: articles falling under multiple categories will be counted once in each category. Articles mentioning the same verb more than once are only counted once per category). This will help understand acknowledgement patterns in research and possibly identify discipline areas where practice is 'ahead of the curve'.

%%bigquery cont_df

SELECT
    p.year,
    cat.name,
    contributor_verbs,
    COUNT(DISTINCT p.id) publications
FROM dimensions-ai.data_analytics.publications p,
    UNNEST(category_for.first_level.full) cat
INNER JOIN `dimensions-ai-integrity.data.trust_markers` tm
    ON p.id = tm.id,
    UNNEST(tm.authors.author_roles.keywords) contributor_verbs
WHERE  p.type= 'article' and p.year between 2011 and 2022
group by 1, 2, 3

Query is running:   0%|          |



Downloading:   0%|          |

#see what we get
cont_df.head(5)

	year	name	contributor_verbs	publications
0	2021	Biomedical And Clinical Sciences	using	2733
1	2016	Biomedical And Clinical Sciences	using	1108
2	2022	Psychology	using	315
3	2015	Biomedical And Clinical Sciences	using	723
4	2021	Agricultural, Veterinary And Food Sciences	using	456

There are a lot of variables (field and verbs) here and we can't visualise them all in one go. Instead, let's identify the top five verbs and six fields (by number of publications) and stick with just those for now.

#get the most common verbs
common_verbs = (cont_df.
                groupby(["contributor_verbs"], as_index = False).
                agg({"publications": "sum"}).
                sort_values("publications", ascending = False).
                head(5)
                )["contributor_verbs"].to_list()

common_verbs

['performed', 'wrote', 'designed', 'contributed', 'approved']

#and the most common fields
common_fields = (cont_df.
                groupby(["name"], as_index = False).
                agg({"publications": "sum"}).
                sort_values("publications", ascending = False).
                head(6)
                )["name"].to_list()

common_fields

['Biomedical And Clinical Sciences',
 'Biological Sciences',
 'Engineering',
 'Health Sciences',
 'Chemical Sciences',
 'Agricultural, Veterinary And Food Sciences']

#filter the data accordingly and sort
cont_df = cont_df[(cont_df["contributor_verbs"].isin(common_verbs)) & (cont_df["name"].isin(common_fields))].sort_values("year")

#and now plot the results
plot = px.line(
    cont_df,
    x = "year",
    y = "publications",
    color = "contributor_verbs",
    facet_col = "name",
    facet_col_wrap = 3
)
plot.show()

5. Repository usage

Now we'll take a look at how we can access the repositories data. Suppose we want to see the breadth of usage of specific repositories across work across funders and research orgs, to understand how wide the usage of the most common repositories is.

%%bigquery repo_df

--the first two tables created by the WITH statements are needed to get full count of repositories based on names and URLs found
WITH
  common_keywords AS(
  SELECT
    kw,
    COUNT(DISTINCT id) AS pubs
  FROM `dimensions-ai-integrity.data.trust_markers` tm,
    UNNEST(tm.data.data_locations.repositories) kw
  GROUP BY 1
),
repositories AS(
  SELECT
  id,
  kw,
  'url' isin
    FROM `dimensions-ai-integrity.data.trust_markers` tm,
      UNNEST(tm.data.data_locations.repository_urls) url
    INNER JOIN common_keywords
      ON REGEXP_CONTAINS(REPLACE(url,'10.17632','mendeley'), LOWER(kw))
    UNION DISTINCT
      SELECT
        id,
          replace(
            replace(
              #replace(
                replace(kw,'open science framework','osf'),
              #'gene','geo'),
              'gene expression omnibus','geo'),
            'sequence read archive','sra') kw,
        'keyword' isin
        FROM `dimensions-ai-integrity.data.trust_markers` tm,
          UNNEST(tm.data.data_locations.repositories) kw
        WHERE kw != 'board'
        ),
  funders AS(
    SELECT
    pubs.id AS pub_id,
    fund.grid_id AS funder_id
    FROM `dimensions-ai.data_analytics.publications` pubs,
      UNNEST(funding_details) fund
    WHERE pubs.year = 2021
  ),
  orgs AS(
    SELECT
      pubs.id AS pub_id,
      org
    FROM `dimensions-ai.data_analytics.publications` pubs,
      UNNEST(research_orgs) org
    WHERE pubs.year = 2021
  ), combined AS(
  SELECT
    rep.id,
     CASE
        WHEN REGEXP_CONTAINS(rep.kw, 'github') THEN 'github'
        WHEN REGEXP_CONTAINS(rep.kw, 'osf') THEN 'osf'
        WHEN REGEXP_CONTAINS(rep.kw, 'ncbi') THEN 'ncbi'
        ELSE rep.kw
      END AS kw,
    f.funder_id AS funder,
    o.org AS ro
  FROM repositories rep
  INNER JOIN funders f
    ON rep.id = f.pub_id
  INNER JOIN orgs o
    ON rep.id = o.pub_id
  )
  SELECT
  kw,
  COUNT(DISTINCT funder) AS funders,
  COUNT(DISTINCT ro) AS orgs,
  COUNT(DISTINCT id) AS pubs
  FROM combined
  GROUP BY 1
  ORDER BY pubs DESC
  LIMIT 10

Query is running:   0%|          |



Downloading:   0%|          |

#see what we've got
repo_df

	kw	funders	orgs	pubs
0	ncbi	1858	7378	16104
1	github	1680	6787	13160
2	geo	1456	5115	8802
3	zenodo	1175	4828	6757
4	gene	1429	5311	6441
5	bioproject	1136	4369	5925
6	sra	1172	4292	5817
7	genbank	1025	4123	5019
8	figshare	1164	4422	4683
9	impact	1176	4649	3659

#plot
plot = px.scatter(
    repo_df,
    x = "funders",
    y = "orgs",
    hover_data = ["kw"]
)

plot.show()