Exploring the ORCiD dataset on Google Bigquery
The following examples explore how to use the openly available bigquery dataset available at: ds-open-datasets.orcid.summaries_2024
Further documentation on the orcid schema, along with how to get connected to bigquery can be found at: https://docs.dimensions.ai/bigquery/
Prerequisites
In order to run this tutorial, please ensure that you have
(This tutorial is based on a Jupyter notebook that is available directly via GitHub.)
from google.colab import auth
auth.authenticate_user()
print('Authenticated')
from google.cloud import bigquery
from google.cloud.bigquery import magics
project_id = input("Enter the name of a GBQ project to use when running the code in this notebook: ")
magics.context.project = project_id
bq_params = {}
client = bigquery.Client(project=project_id)
%load_ext bigquery_magics
Before we go further, a quick warning. In bigquery don't use "select *" to explore a dataset. It will be expensive. Use only the columns that you need
1.1 How many active orcids do we have?
%%bigquery
select count(orcid_identifier)
from ds-open-datasets.orcid.summaries_2024
where history.deactivation_date is null
1.2 How many orcid records have a publicly verified email?
%%bigquery
select count(orcid_identifier)
from ds-open-datasets.orcid.summaries_2024
where history.deactivation_date is null
and history.verified_email is True
1.3 How many orcids have been created by year?
%%bigquery
select
extract(YEAR FROM timestamp(history.submission_date)) AS year,
count(orcid_identifier)
from ds-open-datasets.orcid.summaries_2024
where history.deactivation_date is null
group by 1 -- i.e. year
order by 1
|
year |
f0_ |
| 0
| 2012 |
43695 |
| 1
| 2013 |
420762 |
| 2
| 2014 |
584072 |
| 3
| 2015 |
731309 |
| 4
| 2016 |
1021238 |
| 5
| 2017 |
1343595 |
| 6
| 2018 |
1546783 |
| 7
| 2019 |
1959241 |
| 8
| 2020 |
2598695 |
| 9
| 2021 |
2659905 |
| 10
| 2022 |
2748914 |
| 11
| 2023 |
2975162 |
| 12
| 2024 |
2412639 |
1.4 How many orcids have been modified in 2023 (2024 is only up to the date of the snapshot) ?
%%bigquery
select count(orcid_identifier)
from ds-open-datasets.orcid.summaries_2024
where history.deactivation_date is null
and extract(YEAR FROM timestamp(history.last_modified_date)) = 2023
2. person
2.1 How many orcid names are public?
%%bigquery
select
count(orcid_identifier)
from ds-open-datasets.orcid.summaries_2024
where person.name.visibility = 'public'
2.2 How many orcid records have other names?
%%bigquery
select count(orcid_identifier)
from ds-open-datasets.orcid.summaries_2024
where person.other_names is not null
2.3 How many orcids have a populated credit name?
%%bigquery
select count(orcid_identifier)
from
ds-open-datasets.orcid.summaries_2024
where person.name.credit_name is not null
2.4 What are the dominant domains of URLs in author profiles?
%%bigquery
select
substr(url.url,1,25) url_beginning,
count(orcid_identifier) as orcid_count
from ds-open-datasets.orcid.summaries_2024
cross join unnest(person.researcher_urls.urls) as url
group by 1 -- url beginining
order by 2 desc -- count of orcids
limit 10
|
url_beginning |
orcid_count |
| 0
| https://www.linkedin.com/ |
238827 |
| 1
| https://www.researchgate. |
150454 |
| 2
| https://scholar.google.co |
125628 |
| 3
| https://www.facebook.com/ |
40467 |
| 4
| http://www.linkedin.com/i |
35762 |
| 5
| https://www.instagram.com |
24594 |
| 6
| https://sites.google.com/ |
23693 |
| 7
| https://publons.com/resea |
13536 |
| 8
| https://www.webofscience. |
12911 |
| 9
| https://scholar.google.es |
10752 |
2.5 How many profiles have biographies?
%%bigquery
select count(orcid_identifier)
from ds-open-datasets.orcid.summaries_2024
where person.biography.content is not null
2.6 What are some of the most used words in biographies (other than stop words)?
%%bigquery
with
bio_tokens as (
select
orcid_identifier.path orcid,
split(person.biography.content,' ') as tokens
from ds-open-datasets.orcid.summaries_2024
where person.biography.content is not null
),
tf_idf as (
SELECT
orcid,
TF_IDF(tokens, 10000, 20) OVER() AS results
FROM bio_tokens
ORDER BY orcid
)
select
token.index,
count(orcid) AS profiles
from tf_idf
cross join unnest(results) as token
where token.index is not null
and LENGTH(token.index) > 2
and token.value > .7
group by 1 -- the word
order by 2 desc -- number of profiles including the word
limit 200
|
index |
profiles |
| 0
| Estudiante |
1149 |
| 1
| Student |
1087 |
| 2
| Medical |
886 |
| 3
| Researcher |
880 |
| 4
| student |
860 |
| ...
| ... |
... |
| 195
| Graduate |
111 |
| 196
| Cesar |
110 |
| 197
| Social |
110 |
| 198
| Applied |
110 |
| 199
| Finance |
109 |
200 rows × 2 columns
2.7 How many public emails are available in ORCiD?
%%bigquery
select count(distinct email.email)
from ds-open-datasets.orcid.summaries_2024,
unnest(person.emails.emails) as email
WHERE email.visibility = 'public'
AND email IS NOT NULL
2.8 How many people give publicly visible countries/regions in ORCiD?
The person.addresses field records the countries a person has worked in.
%%bigquery
select count(distinct address.source.source_orcid.path)
from ds-open-datasets.orcid.summaries_2024,
unnest(person.addresses.addresses) as address
WHERE address.visibility = 'public'
AND address IS NOT NULL
2.9 Which countries?
%%bigquery
select
address.country,
count(distinct address.source.source_orcid.path)
from ds-open-datasets.orcid.summaries_2024,
unnest(person.addresses.addresses) as address
where address.visibility = 'public'
and address is not null
group by address.country
order by 2 DESC -- count of orcids
|
country |
f0_ |
| 0
| BR |
272986 |
| 1
| CN |
231405 |
| 2
| US |
209177 |
| 3
| IN |
171638 |
| 4
| ES |
100541 |
| ...
| ... |
... |
| 245
| NU |
6 |
| 246
| CC |
6 |
| 247
| BV |
5 |
| 248
| TK |
4 |
| 249
| PM |
4 |
250 rows × 2 columns
2.10 What are the most frequent keywords in the ORCID dataset?
## Most frequent keywords?
%%bigquery
select lower(keyword.content), count(orcid_identifier.path)
from ds-open-datasets.orcid.summaries_2024,
unnest(person.keywords.keywords) keyword
group by lower(keyword.content)
order by 2 desc -- i.e. count
|
f0_ |
f1_ |
| 0
| machine learning |
19257 |
| 1
| artificial intelligence |
10382 |
| 2
| deep learning |
7789 |
| 3
| bioinformatics |
7404 |
| 4
| education |
6359 |
| ...
| ... |
... |
| 1189703
| социально-психологическая виктимология |
1 |
| 1189704
| analog circuits,oscillators, pll, adc, dac, lo... |
1 |
| 1189705
| high education, internationalization of educat... |
1 |
| 1189706
| structural mechanics, structural health monito... |
1 |
| 1189707
| ritual cognition |
1 |
1189708 rows × 2 columns
2.11 What are the most common identifier types in ORCiD?
%%bigquery
select
lower(identifier.type),
count(orcid_identifier.path)
from ds-open-datasets.orcid.summaries_2024,
unnest(person.external_identifiers.identifiers) identifier
group by 1 -- i.e. identifier type
order by 2 desc -- i.e. count
limit 10
|
f0_ |
f1_ |
| 0
| scopus author id |
1791459 |
| 1
| researcherid |
700225 |
| 2
| sciprofiles |
325959 |
| 3
| loop profile |
261417 |
| 4
| ciência id |
67820 |
| 5
| researcher name resolver id |
14228 |
| 6
| gnd |
9692 |
| 7
| 中国科学家在线 |
5457 |
| 8
| isni |
4227 |
| 9
| pitt id |
3400 |
3.1 What are the most common academic roles in ORCiD?
%%bigquery
select
record.role_title,
count(orcid_identifier.path) as orcids
from ds-open-datasets.orcid.summaries_2024,
unnest(activities.educations.groups) grp,
unnest(grp.records) record
where start_date.year = "2024"
group by record.role_title
order by orcids desc
limit 40
|
role_title |
orcids |
| 0
| PhD |
2118 |
| 1
| None |
1768 |
| 2
| Mestrado |
753 |
| 3
| Doctor of Philosophy |
666 |
| 4
| Doutorado |
583 |
| 5
| Ph.D. |
510 |
| 6
| PhD Student |
372 |
| 7
| Master |
317 |
| 8
| PhD student |
293 |
| 9
| Master of Science |
256 |
| 10
| Ph.D |
222 |
| 11
| Mestranda |
216 |
| 12
| Completed the course and project |
168 |
| 13
| Mestrando |
149 |
| 14
| Doutoranda |
145 |
| 15
| Graduação |
137 |
| 16
| PhD Candidate |
136 |
| 17
| Doctor |
126 |
| 18
| Doutorando |
119 |
| 19
| Mestre |
112 |
| 20
| Phd |
97 |
| 21
| Estudiante |
94 |
| 22
| PhD candidate |
93 |
| 23
| MD |
91 |
| 24
| Doctorate |
89 |
| 25
| Masters |
84 |
| 26
| MSc |
78 |
| 27
| Especialização |
75 |
| 28
| Master's Degree |
74 |
| 29
| Doctor of Philosophy (PhD) |
69 |
| 30
| Doctor of Medicine |
68 |
| 31
| Ph.D. Student |
63 |
| 32
| Master's degree |
58 |
| 33
| ESTUDIANTE |
58 |
| 34
| MS |
56 |
| 35
| PHD |
55 |
| 36
| Doctoral Student |
51 |
| 37
| Bacharelado |
49 |
| 38
| Master's |
48 |
| 39
| Graduate Student |
47 |
3.2 How many employments have disambiguated addresses in 2024?
#How many employments have disambiguated addresses in 2024?
%%bigquery
select organization.disambiguated_organization.source, count(orcid_identifier.path)
from ds-open-datasets.orcid.summaries_2024,
unnest(activities.employments.groups) grp,
unnest(grp.records) record
where start_date.year = "2024"
group by 1
order by 2 desc
|
source |
f0_ |
| 0
| ROR |
228862 |
| 1
| None |
30104 |
| 2
| RINGGOLD |
2865 |
| 3
| FUNDREF |
1904 |
| 4
| GRID |
576 |
3.3 What organizations have been tagged the most times in ORCID as having awarded funding?
%%bigquery
select
funding_record.organization.name,
count(*) AS mentioned_count
from ds-open-datasets.orcid.summaries_2024,
unnest(activities.fundings.groups) as funding_groups,
unnest(funding_groups.records) as funding_record
where funding_record.organization.name <> 'N/A'
group by funding_record.organization.name
order by mentioned_count desc
limit 10
|
name |
mentioned_count |
| 0
| Fundação para a Ciência e a Tecnologia |
75382 |
| 1
| European Commission |
47984 |
| 2
| Japan Society for the Promotion of Science |
43370 |
| 3
| National Natural Science Foundation of China |
20965 |
| 4
| Fundação para a Ciência e a Tecnologia, I.P. |
17160 |
| 5
| Coordenação de Aperfeiçoamento de Pessoal de N... |
15851 |
| 6
| Conselho Nacional de Desenvolvimento Científic... |
15311 |
| 7
| Canadian Institutes of Health Research |
15123 |
| 8
| Australian Research Council |
14738 |
| 9
| Swiss National Science Foundation |
14689 |
3.4 What organizations make the most use of peer reviewers?
%%bigquery
select
record.reviewer_role,
record.review_type,
record.convening_organization.name,
count(orcid_identifier.path) AS orcids
from ds-open-datasets.orcid.summaries_2024,
unnest(activities.peer_reviews.groups) as grp,
unnest(grp.groups) as grpp,
unnest(grpp.records) as record
group by record.reviewer_role, record.review_type, record.convening_organization.name
order by orcids desc
limit 10
|
reviewer_role |
review_type |
name |
orcids |
| 0
| reviewer |
review |
Clarivate PLC |
3232053 |
| 1
| reviewer |
review |
Publons |
3117297 |
| 2
| reviewer |
review |
Elsevier, Inc. |
2660331 |
| 3
| reviewer |
review |
Springer Nature |
2261495 |
| 4
| reviewer |
review |
American Chemical Society |
1382847 |
| 5
| reviewer |
review |
MDPI |
694117 |
| 6
| reviewer |
review |
SpringerNature |
363646 |
| 7
| reviewer |
review |
Wiley-VCH |
300707 |
| 8
| reviewer |
review |
Public Library of Science |
166378 |
| 9
| reviewer |
review |
BMJ Publishing Group |
83693 |
3.5 What are the most used identifier types for works claimed in ORCID?
%%bigquery
select
identifier.type,
count(identifier.value) AS instances
from ds-open-datasets.orcid.summaries_2024,
unnest(activities.works.groups) as grp,
unnest(grp.external_ids.identifiers) as identifier
group by identifier.type
order by instances desc
limit 10
|
type |
instances |
| 0
| doi |
77820619 |
| 1
| eid |
40109303 |
| 2
| wosuid |
13082114 |
| 3
| source-work-id |
10850721 |
| 4
| pmid |
8329151 |
| 5
| other-id |
2997732 |
| 6
| pmc |
2545463 |
| 7
| arxiv |
2107965 |
| 8
| handle |
1552937 |
| 9
| isbn |
1054806 |
3.6 What are the most common invited positions held by researchers with ORCIDs?
%%bigquery
select
record.role_title,
count(orcid_identifier.path) as orcids
from ds-open-datasets.orcid.summaries_2024,
unnest(activities.invited_positions.records) as rec,
unnest(rec.records) as record
where record.role_title is not null
group by record.role_title
order by orcids desc
|
role_title |
orcids |
| 0
| Visiting Professor |
8391 |
| 1
| Visiting Scholar |
4841 |
| 2
| Member |
3984 |
| 3
| Visiting Researcher |
3603 |
| 4
| Reviewer |
3197 |
| ...
| ... |
... |
| 160075
| Internal Medicine Chief Resident |
1 |
| 160076
| Leuven Institute for Advanced Studies Fellow |
1 |
| 160077
| Profesora invitada de Educación Superior de Po... |
1 |
| 160078
| Director, Depression and Anxiety Disorders Pro... |
1 |
| 160079
| Representance de los centros de investigacion ... |
1 |
160080 rows × 2 columns
3.7 What are the most common membership positions held by researchers with ORCIDs?
%%bigquery
select
record.role_title,
count(orcid_identifier.path) as orcids
from ds-open-datasets.orcid.summaries_2024,
unnest(activities.memberships.groups) as grp,
unnest(grp.records) as record
where record.role_title is not null
group by record.role_title
order by orcids desc
limit 10
|
role_title |
orcids |
| 0
| Member |
78249 |
| 1
| Fellow |
14622 |
| 2
| Life Member |
14297 |
| 3
| member |
11710 |
| 4
| Student |
10547 |
| 5
| Student Member |
6575 |
| 6
| Miembro |
5651 |
| 7
| Associate Member |
5036 |
| 8
| Senior Member |
4781 |
| 9
| Life member |
4575 |
3.8 What are the most common types of qualification recorded by researchers with ORCIDs?
%%bigquery
select
record.role_title,
count(orcid_identifier.path) as orcids
from ds-open-datasets.orcid.summaries_2024,
unnest(activities.qualifications.groups) as grp,
unnest(grp.records) as record
where record.role_title is not null
and record.role_title <> 'Student'
group by record.role_title
order by orcids desc
limit 10
|
role_title |
orcids |
| 0
| PhD |
44330 |
| 1
| Investigador RENACYT |
10225 |
| 2
| Ph.D. |
8451 |
| 3
| Mestrado |
7355 |
| 4
| Ph.D |
6843 |
| 5
| Master |
5351 |
| 6
| MSc |
5329 |
| 7
| Doutorado |
5197 |
| 8
| Mestre |
5147 |
| 9
| PhD student |
5109 |
%%bigquery
select
record.role_title,
count(orcid_identifier.path) as orcids
from ds-open-datasets.orcid.summaries_2024,
unnest(activities.services.groups) as grp,
unnest(grp.records) as record
where record.role_title is not null
group by record.role_title
order by orcids desc
limit 10
|
role_title |
orcids |
| 0
| Member |
11290 |
| 1
| Reviewer |
5011 |
| 2
| President |
4567 |
| 3
| Associate Editor |
2865 |
| 4
| Chair |
2463 |
| 5
| Board Member |
1897 |
| 6
| Editorial Board Member |
1882 |
| 7
| Volunteer |
1835 |
| 8
| Director |
1711 |
| 9
| member |
1653 |
3.10 What proposals for the use of specialist research facilities have involved the most researchers?
%%bigquery
select
record.proposal.title.title,
count(orcid_identifier.path) as orcids
from ds-open-datasets.orcid.summaries_2024,
unnest(activities.research_resources.groups) as grp,
unnest(grp.records) as record
group by record.proposal.title.title
order by orcids desc
limit 10
|
title |
orcids |
| 0 |
Neutron Beam Award at Spallation Neutron Sourc... |
858 |
| 1 |
Neutron Beam Award at High Flux Isotope Reacto... |
338 |
| 2
| Neutron Beam Award at High Flux Isotope Reacto... |
199 |
| 3
| Prediction of soil microbiome phenotypic respo... |
16 |
| 4
| Unraveling Redox Transformation Mechanisms of ... |
13 |
| 5
| The Transformation of Tetrahedral to Octahedra... |
12 |
| 6
| EVALUATION OF CEMENT COMPOSITES, ROCKS, AND OT... |
10 |
| 7
| Controls of bioorganic constituents of soils i... |
7 |
| 8
| In situ TEM study of hierarchical nanowires gr... |
7 |
| 9
| The effect of trace element content on the rat... |
7 |