Exploring the ORCiD dataset on Google Bigquery
The following examples explore how to use the openly available bigquery dataset available at: ds-open-datasets.orcid.summaries_2024
Further documentation on the orcid schema, along with how to get connected to bigquery can be found at: https://docs.dimensions.ai/bigquery/
Prerequisites
In order to run this tutorial, please ensure that you have
(This tutorial is based on a Jupyter notebook that is available directly via GitHub.)
from google.colab import auth
auth.authenticate_user()
print('Authenticated')
from google.cloud import bigquery
from google.cloud.bigquery import magics
project_id = input("Enter the name of a GBQ project to use when running the code in this notebook: ")
magics.context.project = project_id
bq_params = {}
client = bigquery.Client(project=project_id)
%load_ext bigquery_magics
Before we go further, a quick warning. In bigquery don't use "select *" to explore a dataset. It will be expensive. Use only the columns that you need
1.1 How many active orcids do we have?
%%bigquery
select count(orcid_identifier)
from ds-open-datasets.orcid.summaries_2024
where history.deactivation_date is null
1.2 How many orcid records have a publicly verified email?
%%bigquery
select count(orcid_identifier)
from ds-open-datasets.orcid.summaries_2024
where history.deactivation_date is null
and history.verified_email is True
1.3 How many orcids have been created by year?
%%bigquery
select
extract(YEAR FROM timestamp(history.submission_date)) AS year,
count(orcid_identifier)
from ds-open-datasets.orcid.summaries_2024
where history.deactivation_date is null
group by 1 -- i.e. year
order by 1
|
year |
f0_ |
0
| 2012 |
43695 |
1
| 2013 |
420762 |
2
| 2014 |
584072 |
3
| 2015 |
731309 |
4
| 2016 |
1021238 |
5
| 2017 |
1343595 |
6
| 2018 |
1546783 |
7
| 2019 |
1959241 |
8
| 2020 |
2598695 |
9
| 2021 |
2659905 |
10
| 2022 |
2748914 |
11
| 2023 |
2975162 |
12
| 2024 |
2412639 |
1.4 How many orcids have been modified in 2023 (2024 is only up to the date of the snapshot) ?
%%bigquery
select count(orcid_identifier)
from ds-open-datasets.orcid.summaries_2024
where history.deactivation_date is null
and extract(YEAR FROM timestamp(history.last_modified_date)) = 2023
2. person
2.1 How many orcid names are public?
%%bigquery
select
count(orcid_identifier)
from ds-open-datasets.orcid.summaries_2024
where person.name.visibility = 'public'
2.2 How many orcid records have other names?
%%bigquery
select count(orcid_identifier)
from ds-open-datasets.orcid.summaries_2024
where person.other_names is not null
2.3 How many orcids have a populated credit name?
%%bigquery
select count(orcid_identifier)
from
ds-open-datasets.orcid.summaries_2024
where person.name.credit_name is not null
2.4 What are the dominant domains of URLs in author profiles?
%%bigquery
select
substr(url.url,1,25) url_beginning,
count(orcid_identifier) as orcid_count
from ds-open-datasets.orcid.summaries_2024
cross join unnest(person.researcher_urls.urls) as url
group by 1 -- url beginining
order by 2 desc -- count of orcids
limit 10
|
url_beginning |
orcid_count |
0
| https://www.linkedin.com/ |
238827 |
1
| https://www.researchgate. |
150454 |
2
| https://scholar.google.co |
125628 |
3
| https://www.facebook.com/ |
40467 |
4
| http://www.linkedin.com/i |
35762 |
5
| https://www.instagram.com |
24594 |
6
| https://sites.google.com/ |
23693 |
7
| https://publons.com/resea |
13536 |
8
| https://www.webofscience. |
12911 |
9
| https://scholar.google.es |
10752 |
2.5 How many profiles have biographies?
%%bigquery
select count(orcid_identifier)
from ds-open-datasets.orcid.summaries_2024
where person.biography.content is not null
2.6 What are some of the most used words in biographies (other than stop words)?
%%bigquery
with
bio_tokens as (
select
orcid_identifier.path orcid,
split(person.biography.content,' ') as tokens
from ds-open-datasets.orcid.summaries_2024
where person.biography.content is not null
),
tf_idf as (
SELECT
orcid,
TF_IDF(tokens, 10000, 20) OVER() AS results
FROM bio_tokens
ORDER BY orcid
)
select
token.index,
count(orcid) AS profiles
from tf_idf
cross join unnest(results) as token
where token.index is not null
and LENGTH(token.index) > 2
and token.value > .7
group by 1 -- the word
order by 2 desc -- number of profiles including the word
limit 200
|
index |
profiles |
0
| Estudiante |
1149 |
1
| Student |
1087 |
2
| Medical |
886 |
3
| Researcher |
880 |
4
| student |
860 |
...
| ... |
... |
195
| Graduate |
111 |
196
| Cesar |
110 |
197
| Social |
110 |
198
| Applied |
110 |
199
| Finance |
109 |
200 rows × 2 columns
2.7 How many public emails are available in ORCiD?
%%bigquery
select count(distinct email.email)
from ds-open-datasets.orcid.summaries_2024,
unnest(person.emails.emails) as email
WHERE email.visibility = 'public'
AND email IS NOT NULL
2.8 How many people give publicly visible countries/regions in ORCiD?
The person.addresses
field records the countries a person has worked in.
%%bigquery
select count(distinct address.source.source_orcid.path)
from ds-open-datasets.orcid.summaries_2024,
unnest(person.addresses.addresses) as address
WHERE address.visibility = 'public'
AND address IS NOT NULL
2.9 Which countries?
%%bigquery
select
address.country,
count(distinct address.source.source_orcid.path)
from ds-open-datasets.orcid.summaries_2024,
unnest(person.addresses.addresses) as address
where address.visibility = 'public'
and address is not null
group by address.country
order by 2 DESC -- count of orcids
|
country |
f0_ |
0
| BR |
272986 |
1
| CN |
231405 |
2
| US |
209177 |
3
| IN |
171638 |
4
| ES |
100541 |
...
| ... |
... |
245
| NU |
6 |
246
| CC |
6 |
247
| BV |
5 |
248
| TK |
4 |
249
| PM |
4 |
250 rows × 2 columns
2.10 What are the most frequent keywords in the ORCID dataset?
## Most frequent keywords?
%%bigquery
select lower(keyword.content), count(orcid_identifier.path)
from ds-open-datasets.orcid.summaries_2024,
unnest(person.keywords.keywords) keyword
group by lower(keyword.content)
order by 2 desc -- i.e. count
|
f0_ |
f1_ |
0
| machine learning |
19257 |
1
| artificial intelligence |
10382 |
2
| deep learning |
7789 |
3
| bioinformatics |
7404 |
4
| education |
6359 |
...
| ... |
... |
1189703
| социально-психологическая виктимология |
1 |
1189704
| analog circuits,oscillators, pll, adc, dac, lo... |
1 |
1189705
| high education, internationalization of educat... |
1 |
1189706
| structural mechanics, structural health monito... |
1 |
1189707
| ritual cognition |
1 |
1189708 rows × 2 columns
2.11 What are the most common identifier types in ORCiD?
%%bigquery
select
lower(identifier.type),
count(orcid_identifier.path)
from ds-open-datasets.orcid.summaries_2024,
unnest(person.external_identifiers.identifiers) identifier
group by 1 -- i.e. identifier type
order by 2 desc -- i.e. count
limit 10
|
f0_ |
f1_ |
0
| scopus author id |
1791459 |
1
| researcherid |
700225 |
2
| sciprofiles |
325959 |
3
| loop profile |
261417 |
4
| ciência id |
67820 |
5
| researcher name resolver id |
14228 |
6
| gnd |
9692 |
7
| 中国科学家在线 |
5457 |
8
| isni |
4227 |
9
| pitt id |
3400 |
3.1 What are the most common academic roles in ORCiD?
%%bigquery
select
record.role_title,
count(orcid_identifier.path) as orcids
from ds-open-datasets.orcid.summaries_2024,
unnest(activities.educations.groups) grp,
unnest(grp.records) record
where start_date.year = "2024"
group by record.role_title
order by orcids desc
limit 40
|
role_title |
orcids |
0
| PhD |
2118 |
1
| None |
1768 |
2
| Mestrado |
753 |
3
| Doctor of Philosophy |
666 |
4
| Doutorado |
583 |
5
| Ph.D. |
510 |
6
| PhD Student |
372 |
7
| Master |
317 |
8
| PhD student |
293 |
9
| Master of Science |
256 |
10
| Ph.D |
222 |
11
| Mestranda |
216 |
12
| Completed the course and project |
168 |
13
| Mestrando |
149 |
14
| Doutoranda |
145 |
15
| Graduação |
137 |
16
| PhD Candidate |
136 |
17
| Doctor |
126 |
18
| Doutorando |
119 |
19
| Mestre |
112 |
20
| Phd |
97 |
21
| Estudiante |
94 |
22
| PhD candidate |
93 |
23
| MD |
91 |
24
| Doctorate |
89 |
25
| Masters |
84 |
26
| MSc |
78 |
27
| Especialização |
75 |
28
| Master's Degree |
74 |
29
| Doctor of Philosophy (PhD) |
69 |
30
| Doctor of Medicine |
68 |
31
| Ph.D. Student |
63 |
32
| Master's degree |
58 |
33
| ESTUDIANTE |
58 |
34
| MS |
56 |
35
| PHD |
55 |
36
| Doctoral Student |
51 |
37
| Bacharelado |
49 |
38
| Master's |
48 |
39
| Graduate Student |
47 |
3.2 How many employments have disambiguated addresses in 2024?
#How many employments have disambiguated addresses in 2024?
%%bigquery
select organization.disambiguated_organization.source, count(orcid_identifier.path)
from ds-open-datasets.orcid.summaries_2024,
unnest(activities.employments.groups) grp,
unnest(grp.records) record
where start_date.year = "2024"
group by 1
order by 2 desc
|
source |
f0_ |
0
| ROR |
228862 |
1
| None |
30104 |
2
| RINGGOLD |
2865 |
3
| FUNDREF |
1904 |
4
| GRID |
576 |
3.3 What organizations have been tagged the most times in ORCID as having awarded funding?
%%bigquery
select
funding_record.organization.name,
count(*) AS mentioned_count
from ds-open-datasets.orcid.summaries_2024,
unnest(activities.fundings.groups) as funding_groups,
unnest(funding_groups.records) as funding_record
where funding_record.organization.name <> 'N/A'
group by funding_record.organization.name
order by mentioned_count desc
limit 10
|
name |
mentioned_count |
0
| Fundação para a Ciência e a Tecnologia |
75382 |
1
| European Commission |
47984 |
2
| Japan Society for the Promotion of Science |
43370 |
3
| National Natural Science Foundation of China |
20965 |
4
| Fundação para a Ciência e a Tecnologia, I.P. |
17160 |
5
| Coordenação de Aperfeiçoamento de Pessoal de N... |
15851 |
6
| Conselho Nacional de Desenvolvimento Científic... |
15311 |
7
| Canadian Institutes of Health Research |
15123 |
8
| Australian Research Council |
14738 |
9
| Swiss National Science Foundation |
14689 |
3.4 What organizations make the most use of peer reviewers?
%%bigquery
select
record.reviewer_role,
record.review_type,
record.convening_organization.name,
count(orcid_identifier.path) AS orcids
from ds-open-datasets.orcid.summaries_2024,
unnest(activities.peer_reviews.groups) as grp,
unnest(grp.groups) as grpp,
unnest(grpp.records) as record
group by record.reviewer_role, record.review_type, record.convening_organization.name
order by orcids desc
limit 10
|
reviewer_role |
review_type |
name |
orcids |
0
| reviewer |
review |
Clarivate PLC |
3232053 |
1
| reviewer |
review |
Publons |
3117297 |
2
| reviewer |
review |
Elsevier, Inc. |
2660331 |
3
| reviewer |
review |
Springer Nature |
2261495 |
4
| reviewer |
review |
American Chemical Society |
1382847 |
5
| reviewer |
review |
MDPI |
694117 |
6
| reviewer |
review |
SpringerNature |
363646 |
7
| reviewer |
review |
Wiley-VCH |
300707 |
8
| reviewer |
review |
Public Library of Science |
166378 |
9
| reviewer |
review |
BMJ Publishing Group |
83693 |
3.5 What are the most used identifier types for works claimed in ORCID?
%%bigquery
select
identifier.type,
count(identifier.value) AS instances
from ds-open-datasets.orcid.summaries_2024,
unnest(activities.works.groups) as grp,
unnest(grp.external_ids.identifiers) as identifier
group by identifier.type
order by instances desc
limit 10
|
type |
instances |
0
| doi |
77820619 |
1
| eid |
40109303 |
2
| wosuid |
13082114 |
3
| source-work-id |
10850721 |
4
| pmid |
8329151 |
5
| other-id |
2997732 |
6
| pmc |
2545463 |
7
| arxiv |
2107965 |
8
| handle |
1552937 |
9
| isbn |
1054806 |
3.6 What are the most common invited positions held by researchers with ORCIDs?
%%bigquery
select
record.role_title,
count(orcid_identifier.path) as orcids
from ds-open-datasets.orcid.summaries_2024,
unnest(activities.invited_positions.records) as rec,
unnest(rec.records) as record
where record.role_title is not null
group by record.role_title
order by orcids desc
|
role_title |
orcids |
0
| Visiting Professor |
8391 |
1
| Visiting Scholar |
4841 |
2
| Member |
3984 |
3
| Visiting Researcher |
3603 |
4
| Reviewer |
3197 |
...
| ... |
... |
160075
| Internal Medicine Chief Resident |
1 |
160076
| Leuven Institute for Advanced Studies Fellow |
1 |
160077
| Profesora invitada de Educación Superior de Po... |
1 |
160078
| Director, Depression and Anxiety Disorders Pro... |
1 |
160079
| Representance de los centros de investigacion ... |
1 |
160080 rows × 2 columns
3.7 What are the most common membership positions held by researchers with ORCIDs?
%%bigquery
select
record.role_title,
count(orcid_identifier.path) as orcids
from ds-open-datasets.orcid.summaries_2024,
unnest(activities.memberships.groups) as grp,
unnest(grp.records) as record
where record.role_title is not null
group by record.role_title
order by orcids desc
limit 10
|
role_title |
orcids |
0
| Member |
78249 |
1
| Fellow |
14622 |
2
| Life Member |
14297 |
3
| member |
11710 |
4
| Student |
10547 |
5
| Student Member |
6575 |
6
| Miembro |
5651 |
7
| Associate Member |
5036 |
8
| Senior Member |
4781 |
9
| Life member |
4575 |
3.8 What are the most common types of qualification recorded by researchers with ORCIDs?
%%bigquery
select
record.role_title,
count(orcid_identifier.path) as orcids
from ds-open-datasets.orcid.summaries_2024,
unnest(activities.qualifications.groups) as grp,
unnest(grp.records) as record
where record.role_title is not null
and record.role_title <> 'Student'
group by record.role_title
order by orcids desc
limit 10
|
role_title |
orcids |
0
| PhD |
44330 |
1
| Investigador RENACYT |
10225 |
2
| Ph.D. |
8451 |
3
| Mestrado |
7355 |
4
| Ph.D |
6843 |
5
| Master |
5351 |
6
| MSc |
5329 |
7
| Doutorado |
5197 |
8
| Mestre |
5147 |
9
| PhD student |
5109 |
%%bigquery
select
record.role_title,
count(orcid_identifier.path) as orcids
from ds-open-datasets.orcid.summaries_2024,
unnest(activities.services.groups) as grp,
unnest(grp.records) as record
where record.role_title is not null
group by record.role_title
order by orcids desc
limit 10
|
role_title |
orcids |
0
| Member |
11290 |
1
| Reviewer |
5011 |
2
| President |
4567 |
3
| Associate Editor |
2865 |
4
| Chair |
2463 |
5
| Board Member |
1897 |
6
| Editorial Board Member |
1882 |
7
| Volunteer |
1835 |
8
| Director |
1711 |
9
| member |
1653 |
3.10 What proposals for the use of specialist research facilities have involved the most researchers?
%%bigquery
select
record.proposal.title.title,
count(orcid_identifier.path) as orcids
from ds-open-datasets.orcid.summaries_2024,
unnest(activities.research_resources.groups) as grp,
unnest(grp.records) as record
group by record.proposal.title.title
order by orcids desc
limit 10
|
title |
orcids |
0 |
Neutron Beam Award at Spallation Neutron Sourc... |
858 |
1 |
Neutron Beam Award at High Flux Isotope Reacto... |
338 |
2
| Neutron Beam Award at High Flux Isotope Reacto... |
199 |
3
| Prediction of soil microbiome phenotypic respo... |
16 |
4
| Unraveling Redox Transformation Mechanisms of ... |
13 |
5
| The Transformation of Tetrahedral to Octahedra... |
12 |
6
| EVALUATION OF CEMENT COMPOSITES, ROCKS, AND OT... |
10 |
7
| Controls of bioorganic constituents of soils i... |
7 |
8
| In situ TEM study of hierarchical nanowires gr... |
7 |
9
| The effect of trace element content on the rat... |
7 |