Skip to content

30. Publication volume growth rate

Level: Advanced

This query requires a good understanding of SQL and the Dimensions data model

Description

This query calculates the growth rate of the number of publications in a particular field over time. Using annual numbers can make visualizations almost impossible to read, since year-to-year numbers can be noisy and include large swings in either direction. To avoid this, the query below uses a sliding five-year window: For example, the publication count for 2005 would include all of the publications between 2001 and 2005.

The other way the results are smoothed is in our choice of comparison: Rather than comparing the 2005 total to the 2004 total, we compare 2005 to the previous 5-year window, which ended in 2000. Both of these strategies can be adjusted in the query:

  • To adjust the "width" of the window, modify the line that says 4 PRECEDING to whatever number of years works best for your data.
  • To adjust which windows are compared to get the growth rate, adjust the line that says LAG(running_total, 5) by changing the 5 to a different value. For example, a 5 here means we compare the 2005 total to the 2000 total, but, changing this to LAG(running_total, 1) would compare 2005 to 2004.

Unusual features

There are a few components of this query that may not be intuitive, or use a complicated syntax:

  • The SUM(num) OVER ... clause is part of a window function that includes the num field of the previous four rows (ordered by year). The BigQuery documentation has more information about using functions like SUM() OVER.
  • The LAG(running_total, 5) clause does something similar, but instead of adding the previous four rows together, it only retrieves the value from the running_total field from the row five positions higher when ordered by year. This is called a "navigation function" and is also discussed in the BigQuery documentation.
  • The allyears subquery is a clunky workaround for situations in which a particular year has no publications of interest. The pub_counts subquery returns a row for each year, and a count of publications in that year, but a year only appears if there's at least one publication. Since we're counting rows in the SUM and LAG functions described above, it would be a big problem if we're calculating a "five-year window" that actually stretches over eight years because three years in the window have no publications. allyears returns the exact same numbers as the pubcounts subquery, but it guarantees every year will have a row.
    • The start and end years for this step are defined in the params subquery. Ideally, we could just use the first and last years of the publications in the list, but BigQuery doesn't support "correlated subqueries" that reference each other, so this simpler method is used here instead.
  • The line that includes COALESCE(NULLIF(prev_total, 0), 1) is a bit of a hack. Ideally, the rate field would be calculated simply using (running_total - prev_total) / prev_total. However, there may be cases where prev_total is zero, which would throw an error for dividing by zero. Because there is no function for "change this value if it is zero," it has to happen in two steps:
    • NULLIF(prev_total, 0): If prev_total is equal to 0, this step returns NULL instead of 0.
    • COALESCE(NULLIF(prev_total, 0), 1): If the previous step returns NULL, this step will return 1 instead.
    • The end result is that all non-zero values of prev_total are used without modification, but a value of 0 is changed to a value of 1. This isn't perfect, since the real answer is an infinite growth rate.

Query

WITH params AS (
    SELECT
    -- Define the start and end points for your rate calculations.
    -- Publications outside of this range will be ignored.
        1990 AS minyear
       ,2020 AS maxyear
),
pub_counts AS (
  SELECT
    p.year AS pubyear, COUNT(p.id) AS num
  FROM `dimensions-ai.data_analytics.publications` p
  WHERE
    p.year >= (SELECT minyear FROM params)
    AND p.year <= (SELECT maxyear FROM params)
    ---- HERE is where to define your publications
    ---- of interest
    AND '2101' IN UNNEST(p.category_for.second_level.codes) -- Archaeology
  GROUP BY 1
),
allyears AS (
    SELECT year, COALESCE(p.num, 0) AS num
    FROM UNNEST(GENERATE_ARRAY(
        (SELECT minyear FROM params),
        (SELECT maxyear FROM params)
    )) year
    LEFT JOIN pub_counts p
        ON p.pubyear=year
),
pub_window AS (
    SELECT year, num
      ,SUM(num) OVER(ORDER BY year ROWS BETWEEN 4 PRECEDING AND CURRENT ROW) AS running_total
    FROM allyears
),
pub_collate AS (
    SELECT year, num, running_total
        ,LAG(running_total, 5) OVER(ORDER BY year ASC) AS prev_total
    FROM pub_window
)

SELECT year
    ,num AS pub_count
    ,(running_total - prev_total) / COALESCE(NULLIF(prev_total, 0), 1) AS rate
FROM pub_collate
WHERE year >= (SELECT minyear FROM params)+10
ORDER BY 1 ASC

Results

Row year pub_count rate
0 2000 4404 0.1616
1 2001 4152 0.1404
2 2002 4477 0.1413
3 2003 4328 0.1047
4 2004 4190 0.0397