Skip to content

Exploring the ORCiD dataset on Google Bigquery

The following examples explore how to use the openly available bigquery dataset available at: ds-open-datasets.orcid.summaries_2024

Further documentation on the orcid schema, along with how to get connected to bigquery can be found at: https://docs.dimensions.ai/bigquery/

Prerequisites

In order to run this tutorial, please ensure that you have

(This tutorial is based on a Jupyter notebook that is available directly via GitHub.)

from google.colab import auth
auth.authenticate_user()
print('Authenticated')
from google.cloud import bigquery

from google.cloud.bigquery import magics

project_id = input("Enter the name of a GBQ project to use when running the code in this notebook: ")

magics.context.project = project_id

bq_params = {}

client = bigquery.Client(project=project_id)

%load_ext bigquery_magics

Before we go further, a quick warning. In bigquery don't use "select *" to explore a dataset. It will be expensive. Use only the columns that you need

1. Querying orcid identifiers

1.1 How many active orcids do we have?

%%bigquery

select count(orcid_identifier)
from ds-open-datasets.orcid.summaries_2024
   where history.deactivation_date is null
f0_
0 21046010

1.2 How many orcid records have a publicly verified email?

%%bigquery

select count(orcid_identifier)
from ds-open-datasets.orcid.summaries_2024
   where history.deactivation_date is null
   and history.verified_email is True
f0_
0 16623231

1.3 How many orcids have been created by year?

%%bigquery

select
  extract(YEAR FROM timestamp(history.submission_date)) AS year,
  count(orcid_identifier)
from ds-open-datasets.orcid.summaries_2024
   where history.deactivation_date is null
   group by 1 -- i.e. year
   order by 1
year f0_
0 2012 43695
1 2013 420762
2 2014 584072
3 2015 731309
4 2016 1021238
5 2017 1343595
6 2018 1546783
7 2019 1959241
8 2020 2598695
9 2021 2659905
10 2022 2748914
11 2023 2975162
12 2024 2412639

1.4 How many orcids have been modified in 2023 (2024 is only up to the date of the snapshot) ?

%%bigquery

select  count(orcid_identifier)
from ds-open-datasets.orcid.summaries_2024
   where history.deactivation_date is null
   and extract(YEAR FROM timestamp(history.last_modified_date)) = 2023
f0_
0 7471503

2. person

2.1 How many orcid names are public?

%%bigquery

select
  count(orcid_identifier)
from ds-open-datasets.orcid.summaries_2024
where person.name.visibility = 'public'
f0_
0 21023410

2.2 How many orcid records have other names?

%%bigquery

 select count(orcid_identifier)
 from ds-open-datasets.orcid.summaries_2024
 where person.other_names is not null
f0_
0 742479

2.3 How many orcids have a populated credit name?

%%bigquery

 select count(orcid_identifier)
 from
 ds-open-datasets.orcid.summaries_2024
 where person.name.credit_name is not null
f0_
0 811105

2.4 What are the dominant domains of URLs in author profiles?

%%bigquery

select
  substr(url.url,1,25) url_beginning,
  count(orcid_identifier) as orcid_count
from ds-open-datasets.orcid.summaries_2024
  cross join unnest(person.researcher_urls.urls) as url
group by 1 -- url beginining
order by 2 desc -- count of orcids
limit 10
url_beginning orcid_count
0 https://www.linkedin.com/ 238827
1 https://www.researchgate. 150454
2 https://scholar.google.co 125628
3 https://www.facebook.com/ 40467
4 http://www.linkedin.com/i 35762
5 https://www.instagram.com 24594
6 https://sites.google.com/ 23693
7 https://publons.com/resea 13536
8 https://www.webofscience. 12911
9 https://scholar.google.es 10752

2.5 How many profiles have biographies?

%%bigquery

select count(orcid_identifier)
 from ds-open-datasets.orcid.summaries_2024
where person.biography.content is not null
f0_
0 917857

2.6 What are some of the most used words in biographies (other than stop words)?

%%bigquery

with
  bio_tokens as (
      select
        orcid_identifier.path orcid,
        split(person.biography.content,' ') as tokens
      from ds-open-datasets.orcid.summaries_2024
      where person.biography.content is not null
    ),
  tf_idf as (
    SELECT
      orcid,
      TF_IDF(tokens, 10000, 20) OVER() AS results
    FROM bio_tokens
    ORDER BY orcid
  )

select
  token.index,
  count(orcid) AS profiles
from tf_idf
  cross join unnest(results) as token
where token.index is not null
  and LENGTH(token.index) > 2
  and token.value > .7
group by 1 -- the word
order by 2 desc -- number of profiles including the word
limit 200
index profiles
0 Estudiante 1149
1 Student 1087
2 Medical 886
3 Researcher 880
4 student 860
... ... ...
195 Graduate 111
196 Cesar 110
197 Social 110
198 Applied 110
199 Finance 109

200 rows × 2 columns

2.7 How many public emails are available in ORCiD?

%%bigquery

select count(distinct email.email)
from ds-open-datasets.orcid.summaries_2024,
   unnest(person.emails.emails) as email
WHERE email.visibility = 'public'
  AND email IS NOT NULL
f0_
0 729428

2.8 How many people give publicly visible countries/regions in ORCiD?

The person.addresses field records the countries a person has worked in.

%%bigquery

select count(distinct address.source.source_orcid.path)
from ds-open-datasets.orcid.summaries_2024,
   unnest(person.addresses.addresses) as address
WHERE address.visibility = 'public'
  AND address IS NOT NULL
f0_
0 2552260

2.9 Which countries?

%%bigquery

select
  address.country,
  count(distinct address.source.source_orcid.path)
from ds-open-datasets.orcid.summaries_2024,
   unnest(person.addresses.addresses) as address
where address.visibility = 'public'
  and address is not null
group by address.country
order by 2 DESC -- count of orcids
country f0_
0 BR 272986
1 CN 231405
2 US 209177
3 IN 171638
4 ES 100541
... ... ...
245 NU 6
246 CC 6
247 BV 5
248 TK 4
249 PM 4

250 rows × 2 columns

2.10 What are the most frequent keywords in the ORCID dataset?

## Most frequent keywords?

%%bigquery

select lower(keyword.content), count(orcid_identifier.path)
from ds-open-datasets.orcid.summaries_2024,
   unnest(person.keywords.keywords) keyword
group by lower(keyword.content)
order by 2 desc -- i.e. count
f0_ f1_
0 machine learning 19257
1 artificial intelligence 10382
2 deep learning 7789
3 bioinformatics 7404
4 education 6359
... ... ...
1189703 социально-психологическая виктимология 1
1189704 analog circuits,oscillators, pll, adc, dac, lo... 1
1189705 high education, internationalization of educat... 1
1189706 structural mechanics, structural health monito... 1
1189707 ritual cognition 1

1189708 rows × 2 columns

2.11 What are the most common identifier types in ORCiD?

%%bigquery

select
  lower(identifier.type),
  count(orcid_identifier.path)
from ds-open-datasets.orcid.summaries_2024,
   unnest(person.external_identifiers.identifiers) identifier
group by 1 -- i.e. identifier type
order by 2 desc -- i.e. count
limit 10
f0_ f1_
0 scopus author id 1791459
1 researcherid 700225
2 sciprofiles 325959
3 loop profile 261417
4 ciência id 67820
5 researcher name resolver id 14228
6 gnd 9692
7 中国科学家在线 5457
8 isni 4227
9 pitt id 3400

3. activities

3.1 What are the most common academic roles in ORCiD?

%%bigquery

select
  record.role_title,
  count(orcid_identifier.path) as orcids
from ds-open-datasets.orcid.summaries_2024,
   unnest(activities.educations.groups) grp,
   unnest(grp.records) record
   where start_date.year = "2024"
group by record.role_title
order by orcids desc
limit 40
role_title orcids
0 PhD 2118
1 None 1768
2 Mestrado 753
3 Doctor of Philosophy 666
4 Doutorado 583
5 Ph.D. 510
6 PhD Student 372
7 Master 317
8 PhD student 293
9 Master of Science 256
10 Ph.D 222
11 Mestranda 216
12 Completed the course and project 168
13 Mestrando 149
14 Doutoranda 145
15 Graduação 137
16 PhD Candidate 136
17 Doctor 126
18 Doutorando 119
19 Mestre 112
20 Phd 97
21 Estudiante 94
22 PhD candidate 93
23 MD 91
24 Doctorate 89
25 Masters 84
26 MSc 78
27 Especialização 75
28 Master's Degree 74
29 Doctor of Philosophy (PhD) 69
30 Doctor of Medicine 68
31 Ph.D. Student 63
32 Master's degree 58
33 ESTUDIANTE 58
34 MS 56
35 PHD 55
36 Doctoral Student 51
37 Bacharelado 49
38 Master's 48
39 Graduate Student 47

3.2 How many employments have disambiguated addresses in 2024?

#How many employments have disambiguated addresses in 2024?

%%bigquery

select organization.disambiguated_organization.source, count(orcid_identifier.path)
from ds-open-datasets.orcid.summaries_2024,
   unnest(activities.employments.groups) grp,
   unnest(grp.records) record
   where start_date.year = "2024"
group by 1
order by 2 desc
source f0_
0 ROR 228862
1 None 30104
2 RINGGOLD 2865
3 FUNDREF 1904
4 GRID 576

3.3 What organizations have been tagged the most times in ORCID as having awarded funding?

%%bigquery

select
  funding_record.organization.name,
  count(*) AS mentioned_count
from ds-open-datasets.orcid.summaries_2024,
  unnest(activities.fundings.groups) as funding_groups,
  unnest(funding_groups.records) as funding_record
where funding_record.organization.name <> 'N/A'
group by funding_record.organization.name
order by mentioned_count desc
limit 10
name mentioned_count
0 Fundação para a Ciência e a Tecnologia 75382
1 European Commission 47984
2 Japan Society for the Promotion of Science 43370
3 National Natural Science Foundation of China 20965
4 Fundação para a Ciência e a Tecnologia, I.P. 17160
5 Coordenação de Aperfeiçoamento de Pessoal de N... 15851
6 Conselho Nacional de Desenvolvimento Científic... 15311
7 Canadian Institutes of Health Research 15123
8 Australian Research Council 14738
9 Swiss National Science Foundation 14689

3.4 What organizations make the most use of peer reviewers?

%%bigquery

select
  record.reviewer_role,
  record.review_type,
  record.convening_organization.name,
  count(orcid_identifier.path) AS orcids
from ds-open-datasets.orcid.summaries_2024,
    unnest(activities.peer_reviews.groups) as grp,
    unnest(grp.groups) as grpp,
    unnest(grpp.records) as record
group by record.reviewer_role, record.review_type, record.convening_organization.name
order by orcids desc
limit 10
reviewer_role review_type name orcids
0 reviewer review Clarivate PLC 3232053
1 reviewer review Publons 3117297
2 reviewer review Elsevier, Inc. 2660331
3 reviewer review Springer Nature 2261495
4 reviewer review American Chemical Society 1382847
5 reviewer review MDPI 694117
6 reviewer review SpringerNature 363646
7 reviewer review Wiley-VCH 300707
8 reviewer review Public Library of Science 166378
9 reviewer review BMJ Publishing Group 83693

3.5 What are the most used identifier types for works claimed in ORCID?

%%bigquery

select
  identifier.type,
  count(identifier.value) AS instances
from ds-open-datasets.orcid.summaries_2024,
  unnest(activities.works.groups) as grp,
  unnest(grp.external_ids.identifiers) as identifier
group by identifier.type
order by instances desc
limit 10
type instances
0 doi 77820619
1 eid 40109303
2 wosuid 13082114
3 source-work-id 10850721
4 pmid 8329151
5 other-id 2997732
6 pmc 2545463
7 arxiv 2107965
8 handle 1552937
9 isbn 1054806

3.6 What are the most common invited positions held by researchers with ORCIDs?

%%bigquery

select
  record.role_title,
  count(orcid_identifier.path) as orcids
from ds-open-datasets.orcid.summaries_2024,
  unnest(activities.invited_positions.records) as rec,
  unnest(rec.records) as record
where record.role_title is not null
group by record.role_title
order by orcids desc
role_title orcids
0 Visiting Professor 8391
1 Visiting Scholar 4841
2 Member 3984
3 Visiting Researcher 3603
4 Reviewer 3197
... ... ...
160075 Internal Medicine Chief Resident 1
160076 Leuven Institute for Advanced Studies Fellow 1
160077 Profesora invitada de Educación Superior de Po... 1
160078 Director, Depression and Anxiety Disorders Pro... 1
160079 Representance de los centros de investigacion ... 1

160080 rows × 2 columns

3.7 What are the most common membership positions held by researchers with ORCIDs?

%%bigquery

select
  record.role_title,
  count(orcid_identifier.path) as orcids
from ds-open-datasets.orcid.summaries_2024,
  unnest(activities.memberships.groups) as grp,
  unnest(grp.records) as record
where record.role_title is not null
group by record.role_title
order by orcids desc
limit 10
role_title orcids
0 Member 78249
1 Fellow 14622
2 Life Member 14297
3 member 11710
4 Student 10547
5 Student Member 6575
6 Miembro 5651
7 Associate Member 5036
8 Senior Member 4781
9 Life member 4575

3.8 What are the most common types of qualification recorded by researchers with ORCIDs?

%%bigquery

select
  record.role_title,
  count(orcid_identifier.path) as orcids
from ds-open-datasets.orcid.summaries_2024,
  unnest(activities.qualifications.groups) as grp,
  unnest(grp.records) as record
where record.role_title is not null
  and record.role_title <> 'Student'
group by record.role_title
order by orcids desc
limit 10
role_title orcids
0 PhD 44330
1 Investigador RENACYT 10225
2 Ph.D. 8451
3 Mestrado 7355
4 Ph.D 6843
5 Master 5351
6 MSc 5329
7 Doutorado 5197
8 Mestre 5147
9 PhD student 5109

3.9 What types of services do researchers with ORCIDs commonly say they have performed?

%%bigquery

select
  record.role_title,
  count(orcid_identifier.path) as orcids
from ds-open-datasets.orcid.summaries_2024,
  unnest(activities.services.groups) as grp,
  unnest(grp.records) as record
where record.role_title is not null
group by record.role_title
order by orcids desc
limit 10
role_title orcids
0 Member 11290
1 Reviewer 5011
2 President 4567
3 Associate Editor 2865
4 Chair 2463
5 Board Member 1897
6 Editorial Board Member 1882
7 Volunteer 1835
8 Director 1711
9 member 1653

3.10 What proposals for the use of specialist research facilities have involved the most researchers?

%%bigquery

select
  record.proposal.title.title,
  count(orcid_identifier.path) as orcids
from ds-open-datasets.orcid.summaries_2024,
  unnest(activities.research_resources.groups) as grp,
  unnest(grp.records) as record
group by record.proposal.title.title
order by orcids desc
limit 10
title orcids
0 Neutron Beam Award at Spallation Neutron Sourc... 858
1 Neutron Beam Award at High Flux Isotope Reacto... 338
2 Neutron Beam Award at High Flux Isotope Reacto... 199
3 Prediction of soil microbiome phenotypic respo... 16
4 Unraveling Redox Transformation Mechanisms of ... 13
5 The Transformation of Tetrahedral to Octahedra... 12
6 EVALUATION OF CEMENT COMPOSITES, ROCKS, AND OT... 10
7 Controls of bioorganic constituents of soils i... 7
8 In situ TEM study of hierarchical nanowires gr... 7
9 The effect of trace element content on the rat... 7