Skip to content

5. Number of publications per SDG category

Level: Easy

This query is suitable for new users of Dimensions on Google BigQuery

Description

Many publications in the Dimensions database are classified under Sustainable Development Goals (SDGs).

This query returns the top five most commonly applied SDG classifications and the total number of publications in each one.

Query

SELECT
  COUNT(p.id) AS tot,
  sdg.name
FROM `dimensions-ai.data_analytics.publications` p
CROSS JOIN UNNEST(category_sdg.full) sdg
GROUP BY
  sdg.name
ORDER BY
  tot DESC
LIMIT 5

Breaking it down

This query is short but wades through a complex data structure. While the author names query has to deal with an array of structs, this one has to parse a struct of arrays: Refer to the schema documentation for the details, but if we want to access the names of the SDG categories relevant to a single publication, we have to drill through multiple layers:

  1. The publications table uses a single row to represent a single publication.
  2. Publications with an SDG categorization have a struct in the category_sdg field with two keys: codes and full.
  3. We can access this by name, so category_sdg.full will give us the relevant entry for a publication. However, category_sdg.full is an array.
  4. Each entry in the category_sdg.full array is another struct, with three fields: code, id and name. We want the names.

We can get through the first struct by getting the full field by name. From there, we need to do something more complex:

CROSS JOIN UNNEST(category_sdg.full) sdg

This line performs a cross join. The less-technical explanation for what happens here is that, since category_sdg.full is an array, using a cross join with the publications table creates a new field called sdg, and each row of the publications table is repeated, once for each value for sdg. A quick demonstration:

WITHOUT UNNEST():

id category_sdg.full
pub.123 sdg1, sdg2, sdg3
pub.987 sdg4, sdg1

WITH UNNEST():

id sdg
pub.123 sdg1
pub.123 sdg2
pub.123 sdg3
pub.987 sdg4
pub.987 sdg1

So after this clause:

CROSS JOIN UNNEST(category_sdg.full) sdg
we have a new field, sdg, that has the nested fields associated with an individual SDG: code, id and name. Since we want name, that's what we refer to in the final query:

SELECT
  COUNT(p.id) AS tot,
  sdg.name
FROM `dimensions-ai.data_analytics.publications` p
CROSS JOIN UNNEST(category_sdg.full) sdg
GROUP BY
  sdg.name
ORDER BY
  tot DESC
LIMIT 5

We then count the number of IDs associated with each SDG name by using a GROUP BY clause.

Results

[
  {
    "tot": "1577950",
    "name": "Affordable and Clean Energy"
  },
  {
    "tot": "1455575",
    "name": "Good Health and Well Being"
  },
  {
    "tot": "769875",
    "name": "Peace, Justice and Strong Institutions"
  },
  {
    "tot": "633369",
    "name": "Quality Education"
  },
  {
    "tot": "507003",
    "name": "Climate Action"
  }
]