5. Number of publications per SDG category
Level: Easy
This query is suitable for new users of Dimensions on Google BigQuery
Description
Many publications in the Dimensions database are classified under Sustainable Development Goals (SDGs).
This query returns the top five most commonly applied SDG classifications and the total number of publications in each one.
Query
SELECT
COUNT(p.id) AS tot,
sdg.name
FROM `dimensions-ai.data_analytics.publications` p
CROSS JOIN UNNEST(category_sdg.full) sdg
GROUP BY
sdg.name
ORDER BY
tot DESC
LIMIT 5
Breaking it down
This query is short but wades through a complex data structure. While the author names query has to deal with an array of structs, this one has to parse a struct of arrays: Refer to the schema documentation for the details, but if we want to access the names of the SDG categories relevant to a single publication, we have to drill through multiple layers:
- The
publications
table uses a single row to represent a single publication. - Publications with an SDG categorization have a struct in the
category_sdg
field with two keys:codes
andfull
. - We can access this by name, so
category_sdg.full
will give us the relevant entry for a publication. However,category_sdg.full
is an array. - Each entry in the
category_sdg.full
array is another struct, with three fields:code
,id
andname
. We want the names.
We can get through the first struct by getting the full
field by name. From there, we need to do something more complex:
CROSS JOIN UNNEST(category_sdg.full) sdg
This line performs a cross join. The less-technical explanation for what happens here is that, since category_sdg.full
is an array, using a cross join with the publications table creates a new field called sdg
, and each row of the publications
table is repeated, once for each value for sdg
. A quick demonstration:
WITHOUT UNNEST():
id | category_sdg.full |
---|---|
pub.123 | sdg1, sdg2, sdg3 |
pub.987 | sdg4, sdg1 |
WITH UNNEST():
id | sdg |
---|---|
pub.123 | sdg1 |
pub.123 | sdg2 |
pub.123 | sdg3 |
pub.987 | sdg4 |
pub.987 | sdg1 |
So after this clause:
CROSS JOIN UNNEST(category_sdg.full) sdg
sdg
, that has the nested fields associated with an individual SDG: code
, id
and name
. Since we want name
, that's what we refer to in the final query:
SELECT
COUNT(p.id) AS tot,
sdg.name
FROM `dimensions-ai.data_analytics.publications` p
CROSS JOIN UNNEST(category_sdg.full) sdg
GROUP BY
sdg.name
ORDER BY
tot DESC
LIMIT 5
We then count the number of IDs associated with each SDG name
by using a GROUP BY
clause.
Results
[
{
"tot": "1577950",
"name": "Affordable and Clean Energy"
},
{
"tot": "1455575",
"name": "Good Health and Well Being"
},
{
"tot": "769875",
"name": "Peace, Justice and Strong Institutions"
},
{
"tot": "633369",
"name": "Quality Education"
},
{
"tot": "507003",
"name": "Climate Action"
}
]