DataCite Example Queries
The following examples explore how to use the openly available BigQuery dataset available at: ds-external-data.datacite.records_2025
Further documentation on the DataCite schema, along with how to get connected to BigQuery can be found at:
https://support.datacite.org/docs/api
Prerequisites
In order to run this tutorial, please ensure that you have
- Configured a Google Cloud project.
- Basic familiarity with Python and Jupyter notebooks.
(This tutorial is based on a Jupyter notebook that is available directly via GitHub.)
from google.colab import auth
auth.authenticate_user()
print('Authenticated')
from google.cloud import bigquery
from google.cloud.bigquery import magics
project_id = input("Enter the name of a GBQ project to use when running the code in this notebook: ")
magics.context.project = project_id
bq_params = {}
client = bigquery.Client(project=project_id)
%load_ext bigquery_magics
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
Before we go further, a quick warning. In BigQuery don't use "SELECT *" to explore a dataset. It will be expensive. Use only the columns that you need.
Exploring the DataCite dataset on Google BigQuery
In this notebook, we breakdown the DataCite schema in BigQuery, and demonstrate how to query each section.
- Resource Types
- Publication Years
- Titles and Publishers
- Descriptions (Abstracts)
- Subjects (Keywords)
- GeoLocations
- Rights & Licenses
- Creators & Affiliations (ROR)
- Related Identifiers (The PID Graph)
- Handling Versions
Counting Records
How many active DataCite DOIs do we have in this snapshot?
%%bigquery df_overview
SELECT
COUNT(*) as total_records
FROM
`ds-external-data.datacite.records_2025`
print(f"Total DataCite Records: {df_overview['total_records'][0]:,}")
Attributes
attributes.types
What are the most common resource types?
%%bigquery df_types
SELECT
attributes.types.resourceTypeGeneral as type,
COUNT(*) as count
FROM
`ds-external-data.datacite.records_2025`
GROUP BY
1
ORDER BY
2 DESC
LIMIT 20
df_types
We can also look at the specific resourceType (free text) vs the controlled resourceTypeGeneral.
%%bigquery df_types_detailed
SELECT
attributes.types.resourceTypeGeneral as general_type,
COALESCE(attributes.types.resourceType, '(Not Specified)') as specific_type,
COUNT(*) as count
FROM
`ds-external-data.datacite.records_2025`
GROUP BY
1, 2
ORDER BY
count DESC
LIMIT 15
df_types_detailed
attributes.publicationYear
How has the volume of research outputs grown over time?
%%bigquery df_years
SELECT
attributes.publicationYear as year,
COUNT(*) as count
FROM
`ds-external-data.datacite.records_2025`
WHERE
attributes.publicationYear BETWEEN 2010 AND 2025
GROUP BY
1
ORDER BY
1 DESC
plt.figure(figsize=(10, 5))
plt.plot(df_years['year'], df_years['count'], marker='o')
plt.title('DataCite Records by Publication Year (2010-2025)')
plt.xlabel('Year')
plt.ylabel('Count')
plt.grid(True)
plt.show()
attributes.titles & attributes.publisher
What is the basic descriptive metadata for these records?
%%bigquery df_core_detailed
SELECT
id as doi,
attributes.titles[SAFE_OFFSET(0)].title as primary_title,
attributes.publisher.name as publisher,
attributes.publicationYear
FROM
`ds-external-data.datacite.records_2025`
LIMIT 5
df_core_detailed
attributes.descriptions
What are these records about? (Extracting abstracts)
%%bigquery df_descriptions
SELECT
id as doi,
description_entry.descriptionType,
COALESCE(SUBSTR(description_entry.description, 0, 200), '(No Description Available)') as abstract_snippet
FROM
`ds-external-data.datacite.records_2025`,
UNNEST(attributes.descriptions) as description_entry
WHERE
description_entry.descriptionType = 'Abstract'
ORDER BY
LENGTH(description_entry.description) DESC
LIMIT 5
df_descriptions
attributes.subjects
What are the most frequent keywords used in the dataset?
%%bigquery df_subjects_list
SELECT
subj.subject as keyword,
COUNT(*) as frequency
FROM
`ds-external-data.datacite.records_2025`,
UNNEST(attributes.subjects) as subj
GROUP BY
1
ORDER BY
2 DESC
LIMIT 10
df_subjects_list
attributes.geoLocations
Where was data collected?
%%bigquery df_geoloc
SELECT
id as doi,
geo.geoLocationPlace,
geo.geoLocationPoint.pointLatitude as lat,
geo.geoLocationPoint.pointLongitude as long
FROM
`ds-external-data.datacite.records_2025`,
UNNEST(attributes.geoLocations) as geo
WHERE
geo.geoLocationPlace IS NOT NULL
LIMIT 10
df_geoloc
attributes.rightsList
What are the most common licenses?
%%bigquery df_licenses
SELECT
r.rights as license_name,
r.rightsUri as license_url,
COUNT(*) as count
FROM
`ds-external-data.datacite.records_2025`,
UNNEST(attributes.rightsList) as r
GROUP BY
1, 2
ORDER BY
3 DESC
LIMIT 10
df_licenses
attributes.citationCount & attributes.viewCount
Which records have citations or views recorded in the record?
%%bigquery df_usage
SELECT
id as doi,
attributes.titles[SAFE_OFFSET(0)].title as title,
attributes.citationCount,
attributes.viewCount,
attributes.downloadCount
FROM
`ds-external-data.datacite.records_2025`
WHERE
attributes.citationCount > 0
ORDER BY
attributes.citationCount DESC
LIMIT 10
df_usage
People & Organizations
Creators & Affiliations (ROR)
Who created this work, and what organizations are they affiliated with?
%%bigquery df_creators_detailed
SELECT
id as doi,
creator.name as creator_name,
(SELECT nameIdentifier FROM UNNEST(creator.nameIdentifiers) WHERE nameIdentifierScheme = 'ORCID' LIMIT 1) as orcid,
ARRAY_TO_STRING(ARRAY(SELECT name FROM UNNEST(creator.affiliation)), '; ') as affiliations
FROM
`ds-external-data.datacite.records_2025`,
UNNEST(attributes.creators) as creator
WHERE
ARRAY_LENGTH(creator.affiliation) > 0
LIMIT 10
df_creators_detailed
We can also find works by a specific organization (e.g., CERN) by joining with the ROR table.
%%bigquery df_org
SELECT
affiliation.affiliationIdentifier, affiliation.name,
count(distinct dc.id) dois,
FROM
`ds-external-data.datacite.records_2025` dc
INNER JOIN
UNNEST(dc.attributes.creators) as creator
INNER JOIN
UNNEST(creator.affiliation) as affiliation
WHERE
affiliation.affiliationIdentifierScheme = 'ROR'
AND dc.attributes.publicationYear = 2024
GROUP BY 1,2
ORDER BY 3 DESC
LIMIT 10
display(df_org)
PID Adoption (ORCID & ROR)
How many records include an ORCID iD for creators or contributors?
%%bigquery df_orcid_by_type
SELECT
attributes.types.resourceTypeGeneral as resource_type,
COUNT(distinct id) allrecords,
COUNT(distinct CASE WHEN ni.nameIdentifierScheme = 'ORCID' THEN id END) as creator_orcid_count,
COUNT(distinct CASE WHEN nit.nameIdentifierScheme = 'ORCID' THEN id END) as contributor_orcid_count
FROM
`ds-external-data.datacite.records_2025`
LEFT JOIN UNNEST(attributes.creators) c
LEFT JOIN UNNEST(c.nameIdentifiers) ni
LEFT JOIN UNNEST(attributes.contributors) ct
LEFT JOIN UNNEST(ct.nameIdentifiers) nit
#WHERE 'ORCID' in (ni.nameIdentifierScheme,nit.nameIdentifierScheme)
GROUP BY
1
ORDER BY
2 DESC
df_orcid_by_type
%%bigquery df_orcid_by_type_year
SELECT
attributes.publicationYear as year,
COUNT(distinct id) allrecords,
COUNT(distinct CASE WHEN ni.nameIdentifierScheme = 'ORCID' THEN id END) as creator_orcid_count,
COUNT(distinct CASE WHEN nit.nameIdentifierScheme = 'ORCID' THEN id END) as contributor_orcid_count,
ROUND(100*COUNT(distinct CASE WHEN 'ORCID' in (nit.nameIdentifierScheme,ni.nameIdentifierScheme) THEN id END)/ COUNT(distinct id) ,2) percentage_with_orcid
FROM
`ds-external-data.datacite.records_2025`
LEFT JOIN UNNEST(attributes.creators) c
LEFT JOIN UNNEST(c.nameIdentifiers) ni
LEFT JOIN UNNEST(attributes.contributors) ct
LEFT JOIN UNNEST(ct.nameIdentifiers) nit
WHERE attributes.types.resourceTypeGeneral = 'Dataset'
and attributes.publicationYear between 2015 and 2025
GROUP BY
1
ORDER BY
1 ASC
df_orcid_by_type_year.plot(x='year', y='percentage_with_orcid', kind='bar')
How many records contain at least one ROR ID in the affiliation metadata?
%%bigquery df_ror_count
SELECT
COUNT(DISTINCT id) as records_with_ror_affiliation,
COUNTIF(EXISTS(
SELECT 1 FROM UNNEST(attributes.creators) c, UNNEST(c.affiliation) a
WHERE a.affiliationIdentifierScheme = 'ROR'
)) as records_with_creator_ror,
COUNTIF(EXISTS(
SELECT 1 FROM UNNEST(attributes.contributors) c, UNNEST(c.affiliation) a
WHERE a.affiliationIdentifierScheme = 'ROR'
)) as records_with_contributor_ror
FROM
`ds-external-data.datacite.records_2025`
display(df_ror_count)
Connections & Funding
attributes.relatedIdentifiers (The PID Graph)
What other research objects (papers, software, versions) is this record connected to?
%%bigquery df_pid_graph
SELECT
id as source_doi,
rel.relationType,
rel.relatedIdentifierType as target_type,
rel.relatedIdentifier as target_id
FROM
`ds-external-data.datacite.records_2025`,
UNNEST(attributes.relatedIdentifiers) as rel
WHERE
rel.relationType IN ('IsSupplementTo', 'Cites', 'IsVersionOf')
LIMIT 10
display(df_pid_graph)
Which relationship types are most common?
%%bigquery df_relations
SELECT
relation.relationType,
COUNT(*) as frequency
FROM
`ds-external-data.datacite.records_2025`,
UNNEST(attributes.relatedIdentifiers) as relation
GROUP BY
1
ORDER BY
2 DESC
LIMIT 15
display(df_relations)
attributes.fundingReferences
Who paid for this research?
%%bigquery df_funding_refs
SELECT
id as doi,
funding.funderName,
funding.awardNumber,
funding.awardTitle
FROM
`ds-external-data.datacite.records_2025`,
UNNEST(attributes.fundingReferences) as funding
ORDER BY
funding.funderName DESC
LIMIT 10
display(df_funding_refs)
Advanced: Deduplication
Handling Versions
When analyzing research outputs, you often want to avoid double-counting. For example, a dataset might have versions 1.0, 1.1, and 2.0, all with different DOIs.
The following query uses a rigorous filtering strategy to remove superseded versions (IsPreviousVersionOf) and redundant identical copies (IsIdenticalTo), keeping only the most relevant record.
%%bigquery df_deduplicated
WITH dois_to_remove AS (
SELECT
dc.id
FROM
`ds-external-data.datacite.records_2025` dc
INNER JOIN
UNNEST(dc.attributes.relatedIdentifiers) related
ON related.relationType IN ('IsIdenticalTo', 'IsPreviousVersionOf')
-- Join with the table again to get the related DOI's metadata (date)
INNER JOIN
`ds-external-data.datacite.records_2025` dc2
ON dc2.id = related.relatedIdentifier
WHERE
(
-- Logic for Identical items: Remove the one registered LATER (or higher ID if tie)
(related.relationType = 'IsIdenticalTo'
AND (
SAFE_CAST(dc.attributes.registered AS TIMESTAMP) > SAFE_CAST(dc2.attributes.registered AS TIMESTAMP)
OR (
SAFE_CAST(dc.attributes.registered AS TIMESTAMP) = SAFE_CAST(dc2.attributes.registered AS TIMESTAMP)
AND dc.id > dc2.id
)
))
-- Logic for Versions: If dc is a "Previous Version Of" something else, remove dc.
OR (related.relationType = 'IsPreviousVersionOf')
)
-- Optional: Ensure we only deduplicate items owned by the same client/repository
AND dc2.relationships.client.data.id = dc.relationships.client.data.id
)
SELECT
COUNT(dc.id) as unique_records
FROM
`ds-external-data.datacite.records_2025` dc
LEFT JOIN
dois_to_remove dcr
ON dcr.id = dc.id
WHERE
dcr.id IS NULL
print(f"Total Deduped Records: {df_deduplicated['unique_records'][0]:,}")