5. Number of publications per SDG category
Level: Easy
This query is suitable for new users of Dimensions on Google BigQuery
Description
Many publications in the Dimensions database are classified under Sustainable Development Goals (SDGs).
This query returns the top five most commonly applied SDG classifications and the total number of publications in each one.
Query
SELECT
COUNT(p.id) AS tot,
sdg.name
FROM `dimensions-ai.data_analytics.publications` p
CROSS JOIN UNNEST(category_sdg.full) sdg
GROUP BY
sdg.name
ORDER BY
tot DESC
LIMIT 5
Breaking it down
This query is short but wades through a complex data structure. While the author names query has to deal with an array of structs, this one has to parse a struct of arrays: Refer to the schema documentation for the details, but if we want to access the names of the SDG categories relevant to a single publication, we have to drill through multiple layers:
- The
publicationstable uses a single row to represent a single publication. - Publications with an SDG categorization have a struct in the
category_sdgfield with two keys:codesandfull. - We can access this by name, so
category_sdg.fullwill give us the relevant entry for a publication. However,category_sdg.fullis an array. - Each entry in the
category_sdg.fullarray is another struct, with three fields:code,idandname. We want the names.
We can get through the first struct by getting the full field by name. From there, we need to do something more complex:
CROSS JOIN UNNEST(category_sdg.full) sdg
This line performs a cross join. The less-technical explanation for what happens here is that, since category_sdg.full is an array, using a cross join with the publications table creates a new field called sdg, and each row of the publications table is repeated, once for each value for sdg. A quick demonstration:
WITHOUT UNNEST():
| id | category_sdg.full |
|---|---|
| pub.123 | sdg1, sdg2, sdg3 |
| pub.987 | sdg4, sdg1 |
WITH UNNEST():
| id | sdg |
|---|---|
| pub.123 | sdg1 |
| pub.123 | sdg2 |
| pub.123 | sdg3 |
| pub.987 | sdg4 |
| pub.987 | sdg1 |
So after this clause:
CROSS JOIN UNNEST(category_sdg.full) sdg
sdg, that has the nested fields associated with an individual SDG: code, id and name. Since we want name, that's what we refer to in the final query:
SELECT
COUNT(p.id) AS tot,
sdg.name
FROM `dimensions-ai.data_analytics.publications` p
CROSS JOIN UNNEST(category_sdg.full) sdg
GROUP BY
sdg.name
ORDER BY
tot DESC
LIMIT 5
We then count the number of IDs associated with each SDG name by using a GROUP BY clause.
Results
[
{
"tot": "1577950",
"name": "Affordable and Clean Energy"
},
{
"tot": "1455575",
"name": "Good Health and Well Being"
},
{
"tot": "769875",
"name": "Peace, Justice and Strong Institutions"
},
{
"tot": "633369",
"name": "Quality Education"
},
{
"tot": "507003",
"name": "Climate Action"
}
]