Working with dates
Each publication has several date fields available:

- `date`, `year`, `date_normal`, `date_online`, `date_print` refer to the publication object. See the documentation to find out more about their meaning.
- `date_imported_gbq` refers to when this record was last added to Google BigQuery. This date can be handy if you want to synchronize an external data source with BigQuery.
- `date_inserted` refers to when this record was originally added to Dimensions. This date does not change, even if the record is later adjusted.
The following examples show how to work with publications dates.
Prerequisites
In order to run this tutorial, please ensure that:
- You have a valid Dimensions on Google BigQuery account and have configured a Google Cloud project.
The online BigQuery console can be used to test the queries below.
Comparing date fields
Description
We'll get started by pulling a selection of the date fields to see their formats:
SELECT doi,
date,
date_normal,
year,
date_online,
date_print,
date_imported_gbq,
date_inserted
FROM `dimensions-ai.data_analytics.publications`
WHERE year = 2010
AND journal.id = "jour.1115214"
ORDER BY citations_count DESC
LIMIT 10
Results
Row | doi | date | date_normal | year | date_online | date_print | date_imported_gbq | date_inserted |
---|---|---|---|---|---|---|---|---|
0 | 10.1038/nbt.1621 | 2010-05-02 | 2010-05-02 | 2010 | 2010-05-02 | 2010-05 | 2021-02-10 01:09:29+00:00 | 2017-08-31 12:50:56+00:00 |
1 | 10.1038/nbt.1630 | 2010-05-02 | 2010-05-02 | 2010 | 2010-05-02 | 2010-05 | 2021-02-10 01:09:29+00:00 | 2017-08-31 12:50:56+00:00 |
2 | 10.1038/nbt.1614 | 2010-03 | 2010-03-01 | 2010 | null | 2010-03 | 2021-02-10 01:09:29+00:00 | 2017-08-31 12:50:56+00:00 |
3 | 10.1038/nbt.1685 | 2010-10-13 | 2010-10-13 | 2010 | 2010-10-13 | 2010-10 | 2021-02-10 00:53:56+00:00 | 2017-08-31 12:50:56+00:00 |
4 | 10.1038/nbt1210-1248 | 2010-12-07 | 2010-12-07 | 2010 | 2010-12-07 | 2010-12 | 2021-02-10 00:53:56+00:00 | 2017-08-31 12:50:56+00:00 |
5 | 10.1038/nbt.1755 | 2010-12-22 | 2010-12-22 | 2010 | 2010-12-22 | 2011-02 | 2021-02-10 01:09:29+00:00 | 2017-08-31 12:50:56+00:00 |
6 | 10.1038/nbt1010-1045 | 2010-10-13 | 2010-10-13 | 2010 | 2010-10-13 | 2010-10 | 2021-02-10 00:53:56+00:00 | 2017-08-31 12:50:56+00:00 |
7 | 10.1038/nbt.1633 | 2010-05-02 | 2010-05-02 | 2010 | 2010-05-02 | 2010-05 | 2021-02-10 00:53:56+00:00 | 2017-08-31 12:50:56+00:00 |
8 | 10.1038/nbt.1667 | 2010-07-19 | 2010-07-19 | 2010 | 2010-07-19 | 2010-08 | 2021-02-10 01:09:29+00:00 | 2017-08-31 12:50:56+00:00 |
9 | 10.1038/nbt.1641 | 2010-05-23 | 2010-05-23 | 2010 | 2010-05-23 | 2010-06 | 2021-02-10 00:53:56+00:00 | 2017-08-31 12:50:56+00:00 |
The first thing that sticks out is that some of the dates are actually timestamps: `date_imported_gbq` and `date_inserted` have times attached. The other important caveat is that some dates aren't whole dates: some values in the `date` and `date_print` fields contain only a year and month. These differences in type matter because they can add an extra step when you compare fields to each other. For example, if we wanted to count how many publications were added to Dimensions before their "publication" date, it would be intuitive to write a query like this:
SELECT COUNT(id)
FROM `dimensions-ai.data_analytics.publications`
WHERE
year = 2020
AND date > date_inserted
However, we get an error from BigQuery: `No matching signature for operator > for argument types: STRING, TIMESTAMP. Supported signature: ANY > ANY at [12:11]`. BigQuery won't perform the comparison because the two sides aren't of the same type: the `date` field is of type `STRING`, since it doesn't always have a day (or month) attached. The `date_normal` field solves this for us: it carries the same information as the `date` field, but fills in the gaps to make a complete `DATE` value, so `"2010-03"` in the `date` field becomes `2010-03-01` in `date_normal`. But swapping that in doesn't fix our problem either:
SELECT COUNT(id)
FROM `dimensions-ai.data_analytics.publications`
WHERE
year = 2020
AND date_normal > date_inserted
We run into a new variant of the issue: `No matching signature for operator > for argument types: DATE, TIMESTAMP. Supported signature: ANY > ANY at [5:7]`. `date_normal` now gives us a `DATE`, but we can't compare that to a `TIMESTAMP`. Generally, you can resolve most of these issues by converting one field to match the other's type, and BigQuery supports multiple functions for manipulating dates, datetimes, and timestamps. Casting the timestamp down to a date should do the trick:
SELECT COUNT(id)
FROM `dimensions-ai.data_analytics.publications`
WHERE
year = 2020
AND date_normal > DATE(date_inserted)
Results:
Row | f0_ |
---|---|
1 | 859011 |
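The same type discipline applies outside of BigQuery. As a rough Python analogy (standard library only, with made-up sample values), padding a partial date the way `date_normal` does, and casting a timestamp down to a date before comparing, look like this:

```python
from datetime import date, datetime

def normalize(partial: str) -> date:
    """Pad a partial date string like date_normal does: "2010-03" -> 2010-03-01."""
    parts = [int(p) for p in partial.split("-")]
    while len(parts) < 3:
        parts.append(1)  # fill a missing month/day with 1
    return date(*parts)

pub_date = normalize("2010-03")           # date(2010, 3, 1)
inserted = datetime(2017, 8, 31, 12, 50)  # a TIMESTAMP-like value

# Comparing a date to a datetime fails, just like DATE > TIMESTAMP in BigQuery.
try:
    pub_date > inserted
except TypeError as err:
    print("mismatch:", err)

# Casting the datetime down to a date first, like DATE(date_inserted), works.
print(pub_date > inserted.date())  # False: published before insertion
```

The `normalize` helper is hypothetical and only illustrates the padding behavior; the actual `date_normal` values are computed by Dimensions, not by you.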
Number of publications added to Dimensions by month
Description
Next, we'll use the `date_inserted` field to count the number of publications added to the Dimensions database per month. As we saw above, `date_inserted` is of type `TIMESTAMP`, so we use one of the timestamp manipulation functions to round every insertion date down to the first of its month:
SELECT
TIMESTAMP_TRUNC(date_inserted, MONTH) as added_date,
COUNT(id) as countDim
FROM
`dimensions-ai.data_analytics.publications`
GROUP BY added_date
ORDER BY added_date DESC
LIMIT 5
Results
Row | added_date | countDim |
---|---|---|
1 | 2021-04-01 00:00:00 UTC | 534043 |
2 | 2021-03-01 00:00:00 UTC | 746963 |
3 | 2021-02-01 00:00:00 UTC | 661575 |
4 | 2021-01-01 00:00:00 UTC | 687764 |
5 | 2020-12-01 00:00:00 UTC | 828307 |
We can see the dates have all been collapsed to the first of the month for each paper, but the attached times (and time zone) are unhelpful here. We can get rid of them by converting `date_inserted` to a `DATE` first and switching to the `DATE_TRUNC` function instead:
SELECT
DATE_TRUNC(DATE(date_inserted), MONTH) as added_date,
COUNT(id) as countDim
FROM
`dimensions-ai.data_analytics.publications`
GROUP BY added_date
ORDER BY added_date DESC
LIMIT 5
Results
Row | added_date | countDim |
---|---|---|
1 | 2021-04-01 | 534043 |
2 | 2021-03-01 | 746963 |
3 | 2021-02-01 | 661575 |
4 | 2021-01-01 | 687764 |
5 | 2020-12-01 | 828307 |
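For reference, this month-level rounding can be sketched in plain Python (standard library only, with hypothetical sample timestamps): `DATE(ts)` followed by `DATE_TRUNC(..., MONTH)` is essentially "drop the time, then snap the day to 1".

```python
from collections import Counter
from datetime import date, datetime

# Hypothetical sample of date_inserted-style timestamps.
inserted = [
    datetime(2021, 3, 5, 9, 30),
    datetime(2021, 3, 28, 17, 2),
    datetime(2021, 4, 1, 0, 15),
]

def month_start(ts: datetime) -> date:
    """DATE(ts) then truncate to MONTH: drop the time, snap to day 1."""
    return ts.date().replace(day=1)

counts = Counter(month_start(ts) for ts in inserted)
for month, n in sorted(counts.items(), reverse=True):
    print(month, n)
# 2021-04-01 1
# 2021-03-01 2
```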
That looks much better. If we want to work with different parts of the dates separately, we can also use `EXTRACT` to split things up:
SELECT
EXTRACT(MONTH FROM date_inserted) as added_month,
EXTRACT(YEAR FROM date_inserted) as added_year,
COUNT(id) as countDim
FROM
`dimensions-ai.data_analytics.publications`
GROUP BY added_month, added_year
ORDER BY added_year DESC, added_month DESC
LIMIT 5
Results
Row | added_month | added_year | countDim |
---|---|---|---|
1 | 4 | 2021 | 534043 |
2 | 3 | 2021 | 746963 |
3 | 2 | 2021 | 661575 |
4 | 1 | 2021 | 687764 |
5 | 12 | 2020 | 828307 |
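As a final Python analogy (again stdlib only, with made-up values), `EXTRACT(YEAR FROM ...)` and `EXTRACT(MONTH FROM ...)` correspond to reading the `.year` and `.month` attributes, and the `GROUP BY` corresponds to counting (year, month) pairs:

```python
from collections import Counter
from datetime import datetime

# Hypothetical date_inserted values spanning a year boundary.
inserted = [
    datetime(2020, 12, 15, 8, 0),
    datetime(2021, 1, 4, 12, 30),
    datetime(2021, 1, 20, 23, 59),
]

# EXTRACT(YEAR ...) / EXTRACT(MONTH ...) -> .year / .month
counts = Counter((ts.year, ts.month) for ts in inserted)
for (year, month), n in sorted(counts.items(), reverse=True):
    print(year, month, n)
# 2021 1 2
# 2020 12 1
```

Unlike truncating, extracting loses the ordering of rows within a single column, which is why the SQL version has to `ORDER BY` both `added_year` and `added_month`.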