Skip to content

Working with dates

Each publication has various dates available.

  • date, year, date_normal, date_online, date_print refer to the publication object. See the documentation to find out more about their meaning.
  • date_imported_gbq refers to when this record was last added to Google BigQuery - this date can be handy if you want to synchronize an external data source to BigQuery.
  • date_inserted: this refers to when this records was originally added to Dimensions. This date does not change, even if the record is later adjusted.

The following examples show how to work with publications dates.

Prerequisites

In order to run this tutorial, please ensure that:

The online BigQuery console can be used to test the queries below.

Comparing date fields

Description

We'll get started by pulling a selection of the date fields to see their formats:

SELECT doi,
       date,
       date_normal,
       year,
       date_online,
       date_print,
       date_imported_gbq,
       date_inserted
FROM   `dimensions-ai.data_analytics.publications`
WHERE year = 2010
      AND journal.id = "jour.1115214"
ORDER BY citations_count DESC
LIMIT 10

Results

Row doi date date_normal year date_online date_print date_imported_gbq date_inserted
0 10.1038/nbt.1621 2010-05-02 2010-05-02 2010 2010-05-02 2010-05 2021-02-10 01:09:29+00:00 2017-08-31 12:50:56+00:00
1 10.1038/nbt.1630 2010-05-02 2010-05-02 2010 2010-05-02 2010-05 2021-02-10 01:09:29+00:00 2017-08-31 12:50:56+00:00
2 10.1038/nbt.1614 2010-03 2010-03-01 2010 null 2010-03 2021-02-10 01:09:29+00:00 2017-08-31 12:50:56+00:00
3 10.1038/nbt.1685 2010-10-13 2010-10-13 2010 2010-10-13 2010-10 2021-02-10 00:53:56+00:00 2017-08-31 12:50:56+00:00
4 10.1038/nbt1210-1248 2010-12-07 2010-12-07 2010 2010-12-07 2010-12 2021-02-10 00:53:56+00:00 2017-08-31 12:50:56+00:00
5 10.1038/nbt.1755 2010-12-22 2010-12-22 2010 2010-12-22 2011-02 2021-02-10 01:09:29+00:00 2017-08-31 12:50:56+00:00
6 10.1038/nbt1010-1045 2010-10-13 2010-10-13 2010 2010-10-13 2010-10 2021-02-10 00:53:56+00:00 2017-08-31 12:50:56+00:00
7 10.1038/nbt.1633 2010-05-02 2010-05-02 2010 2010-05-02 2010-05 2021-02-10 00:53:56+00:00 2017-08-31 12:50:56+00:00
8 10.1038/nbt.1667 2010-07-19 2010-07-19 2010 2010-07-19 2010-08 2021-02-10 01:09:29+00:00 2017-08-31 12:50:56+00:00
9 10.1038/nbt.1641 2010-05-23 2010-05-23 2010 2010-05-23 2010-06 2021-02-10 00:53:56+00:00 2017-08-31 12:50:56+00:00

The first thing to stick out is that some of the dates are actually timestamps: date_imported_gbq and date_inserted have times attached to the dates. The other important caveat is that some dates aren't actually whole dates: Some values in the date and date_print fields have only a year and month. One of the reasons these different types are important is because can add an extra step when you compare fields to each other. For example, if we wanted to count how many publications were added to Dimensions before their "publication" date, it would be intuitive to write a query like this:

SELECT COUNT(id)
FROM `dimensions-ai.data_analytics.publications`
WHERE
  year = 2020
  AND date > date_inserted

However, we get an error from BigQuery: No matching signature for operator > for argument types: STRING, TIMESTAMP. Supported signature: ANY > ANY at [12:11]. BigQuery won't do the comparison because both sides of the comparison aren't of the same type: The date field is of type STRING, since it doesn't always have a day (or month) attached. The date_normal field solves this for us: It uses the same information as the date field, but it fills in the gaps to make a full DATE entry—so "2010-03" in the date field becomes 2010-03-01 in date_normal. But swapping that in doesn't fix our problems either:

SELECT COUNT(id)
FROM `dimensions-ai.data_analytics.publications`
WHERE
  year = 2020
  AND date_normal > date_inserted

We run into a new variant of the issue now: No matching signature for operator > for argument types: DATE, TIMESTAMP. Supported signature: ANY > ANY at [5:7]. Now date_normal gives us a DATE, but we can't compare that to a TIMESTAMP. Generally, you can mitigate most issues with comparing date fields by converting one of them to match the other, and BigQuery supports multiple functions for manipulating dates and datetimes. This one should do the trick:

SELECT COUNT(id)
FROM `dimensions-ai.data_analytics.publications`
WHERE
  year = 2020
  AND date_normal > DATE(date_inserted)

Results:

Row f0_
1 859011

Number of publications added to Dimensions by month

Description

Next, we'll use the date_inserted field to count the number of publications added to the Dimensions database per month. date_inserted is of type DATETIME, so we choose from the datetime manipulation functions to round down all dates to the first of the month:

SELECT
  DATETIME_TRUNC(date_inserted, MONTH) as added_date,
  COUNT(id) as countDim
FROM
  `dimensions-ai.data_analytics.publications`
GROUP BY added_date
ORDER BY added_date DESC
LIMIT 5

Results

Row added_date countDim
1 2021-04-01 00:00:00 UTC 534043
2 2021-03-01 00:00:00 UTC 746963
3 2021-02-01 00:00:00 UTC 661575
4 2021-01-01 00:00:00 UTC 687764
5 2020-12-01 00:00:00 UTC 828307

We can see the dates have all been collapsed into the first of the month for each paper, but those timestamps that are attached are unhelpful. We can get rid of them by converting date_inserted to a DATE first, and switch to using the DATE_TRUNC function instead:

SELECT
  DATE_TRUNC(DATE(date_inserted), MONTH) as added_date,
  COUNT(id) as countDim
FROM
  `dimensions-ai.data_analytics.publications`
GROUP BY added_date
ORDER BY added_date DESC
LIMIT 5

Results

Row added_date countDim
1 2021-04-01 534043
2 2021-03-01 746963
3 2021-02-01 661575
4 2021-01-01 687764
5 2020-12-01 828307

That looks much better. If we want to manipulate different parts of the dates separately, we can also use EXTRACT to split things up:

SELECT
  EXTRACT(MONTH FROM date_inserted) as added_month,
  EXTRACT(YEAR FROM date_inserted) as added_year,
  COUNT(id) as countDim
FROM
  `dimensions-ai.data_analytics.publications`
GROUP BY added_month, added_year
ORDER BY added_year DESC, added_month DESC
LIMIT 5

Results

Row added_month added_year countDim
1 4 2021 534043
2 3 2021 746963
3 2 2021 661575
4 1 2021 687764
5 12 2020 828307