30. Publication volume growth rate
Level: Advanced
This query requires a good understanding of SQL and the Dimensions data model
Description
This query calculates the growth rate of the number of publications in a particular field over time. Using annual numbers can make visualizations almost impossible to read, since year-to-year numbers can be noisy and include large swings in either direction. To avoid this, the query below uses a sliding five-year window: For example, the publication count for 2005 would include all of the publications between 2001 and 2005.
The other way the results are smoothed is in our choice of comparison: Rather than comparing the 2005 total to the 2004 total, we compare 2005 to the previous 5-year window, which ended in 2000. Both of these strategies can be adjusted in the query:
- To adjust the "width" of the window, modify the line that says
4 PRECEDING
to whatever number of years works best for your data. - To adjust which windows are compared to get the growth rate, adjust the line that says
LAG(running_total, 5)
by changing the5
to a different value. For example, a5
here means we compare the 2005 total to the 2000 total, but, changing this toLAG(running_total, 1)
would compare 2005 to 2004.
Unusual features
There are a few components of this query that may not be intuitive, or use a complicated syntax:
- The
SUM(num) OVER ...
clause is part of a window function that includes thenum
field of the previous four rows (ordered by year). The BigQuery documentation has more information about using functions likeSUM() OVER
. - The
LAG(running_total, 5)
clause does something similar, but instead of adding the previous four rows together, it only retrieves the value from therunning_total
field from the row five positions higher when ordered by year. This is called a "navigation function" and is also discussed in the BigQuery documentation. - The
allyears
subquery is a clunky workaround for situations in which a particular year has no publications of interest. Thepub_counts
subquery returns a row for each year, and a count of publications in that year, but a year only appears if there's at least one publication. Since we're counting rows in theSUM
andLAG
functions described above, it would be a big problem if we're calculating a "five-year window" that actually stretches over eight years because three years in the window have no publications.allyears
returns the exact same numbers as thepubcounts
subquery, but it guarantees every year will have a row.- The start and end years for this step are defined in the
params
subquery. Ideally, we could just use the first and last years of the publications in the list, but BigQuery doesn't support "correlated subqueries" that reference each other, so this simpler method is used here instead.
- The start and end years for this step are defined in the
- The line that includes
COALESCE(NULLIF(prev_total, 0), 1)
is a bit of a hack. Ideally, therate
field would be calculated simply using(running_total - prev_total) / prev_total
. However, there may be cases whereprev_total
is zero, which would throw an error for dividing by zero. Because there is no function for "change this value if it is zero," it has to happen in two steps:NULLIF(prev_total, 0)
: Ifprev_total
is equal to0
, this step returnsNULL
instead of0
.COALESCE(NULLIF(prev_total, 0), 1)
: If the previous step returnsNULL
, this step will return1
instead.- The end result is that all non-zero values of
prev_total
are used without modification, but a value of0
is changed to a value of1
. This isn't perfect, since the real answer is an infinite growth rate.
Query
WITH params AS (
SELECT
-- Define the start and end points for your rate calculations.
-- Publications outside of this range will be ignored.
1990 AS minyear
,2020 AS maxyear
),
pub_counts AS (
SELECT
p.year AS pubyear, COUNT(p.id) AS num
FROM `dimensions-ai.data_analytics.publications` p
WHERE
p.year >= (SELECT minyear FROM params)
AND p.year <= (SELECT maxyear FROM params)
---- HERE is where to define your publications
---- of interest
AND '2101' IN UNNEST(p.category_for.second_level.codes) -- Archaeology
GROUP BY 1
),
allyears AS (
SELECT year, COALESCE(p.num, 0) AS num
FROM UNNEST(GENERATE_ARRAY(
(SELECT minyear FROM params),
(SELECT maxyear FROM params)
)) year
LEFT JOIN pub_counts p
ON p.pubyear=year
),
pub_window AS (
SELECT year, num
,SUM(num) OVER(ORDER BY year ROWS BETWEEN 4 PRECEDING AND CURRENT ROW) AS running_total
FROM allyears
),
pub_collate AS (
SELECT year, num, running_total
,LAG(running_total, 5) OVER(ORDER BY year ASC) AS prev_total
FROM pub_window
)
SELECT year
,num AS pub_count
,(running_total - prev_total) / COALESCE(NULLIF(prev_total, 0), 1) AS rate
FROM pub_collate
WHERE year >= (SELECT minyear FROM params)+10
ORDER BY 1 ASC
Results
Row | year | pub_count | rate |
---|---|---|---|
0 | 2000 | 4404 | 0.1616 |
1 | 2001 | 4152 | 0.1404 |
2 | 2002 | 4477 | 0.1413 |
3 | 2003 | 4328 | 0.1047 |
4 | 2004 | 4190 | 0.0397 |