Working with dates
Each publication has several date fields available:

- `date`, `year`, `date_normal`, `date_online`, `date_print` refer to the publication object. See the documentation to find out more about their meaning.
- `date_imported_gbq` refers to when this record was last added to Google BigQuery. This date can be handy if you want to synchronize an external data source with BigQuery.
- `date_inserted` refers to when this record was originally added to Dimensions. This date does not change, even if the record is later adjusted.
The following examples show how to work with publications dates.
Prerequisites
In order to run this tutorial, please ensure that:
- You have a valid Dimensions on Google BigQuery account and have configured a Google Cloud project.
The online BigQuery console can be used to test the queries below.
Comparing date fields
Description
We'll get started by pulling a selection of the date fields to see their formats:
SELECT doi,
date,
date_normal,
year,
date_online,
date_print,
date_imported_gbq,
date_inserted
FROM `dimensions-ai.data_analytics.publications`
WHERE year = 2010
AND journal.id = "jour.1115214"
ORDER BY citations_count DESC
LIMIT 10
Results
Row | doi | date | date_normal | year | date_online | date_print | date_imported_gbq | date_inserted |
---|---|---|---|---|---|---|---|---|
0 | 10.1038/nbt.1621 | 2010-05-02 | 2010-05-02 | 2010 | 2010-05-02 | 2010-05 | 2021-02-10 01:09:29+00:00 | 2017-08-31 12:50:56+00:00 |
1 | 10.1038/nbt.1630 | 2010-05-02 | 2010-05-02 | 2010 | 2010-05-02 | 2010-05 | 2021-02-10 01:09:29+00:00 | 2017-08-31 12:50:56+00:00 |
2 | 10.1038/nbt.1614 | 2010-03 | 2010-03-01 | 2010 | null | 2010-03 | 2021-02-10 01:09:29+00:00 | 2017-08-31 12:50:56+00:00 |
3 | 10.1038/nbt.1685 | 2010-10-13 | 2010-10-13 | 2010 | 2010-10-13 | 2010-10 | 2021-02-10 00:53:56+00:00 | 2017-08-31 12:50:56+00:00 |
4 | 10.1038/nbt1210-1248 | 2010-12-07 | 2010-12-07 | 2010 | 2010-12-07 | 2010-12 | 2021-02-10 00:53:56+00:00 | 2017-08-31 12:50:56+00:00 |
5 | 10.1038/nbt.1755 | 2010-12-22 | 2010-12-22 | 2010 | 2010-12-22 | 2011-02 | 2021-02-10 01:09:29+00:00 | 2017-08-31 12:50:56+00:00 |
6 | 10.1038/nbt1010-1045 | 2010-10-13 | 2010-10-13 | 2010 | 2010-10-13 | 2010-10 | 2021-02-10 00:53:56+00:00 | 2017-08-31 12:50:56+00:00 |
7 | 10.1038/nbt.1633 | 2010-05-02 | 2010-05-02 | 2010 | 2010-05-02 | 2010-05 | 2021-02-10 00:53:56+00:00 | 2017-08-31 12:50:56+00:00 |
8 | 10.1038/nbt.1667 | 2010-07-19 | 2010-07-19 | 2010 | 2010-07-19 | 2010-08 | 2021-02-10 01:09:29+00:00 | 2017-08-31 12:50:56+00:00 |
9 | 10.1038/nbt.1641 | 2010-05-23 | 2010-05-23 | 2010 | 2010-05-23 | 2010-06 | 2021-02-10 00:53:56+00:00 | 2017-08-31 12:50:56+00:00 |
The first thing that sticks out is that some of the dates are actually timestamps: `date_imported_gbq` and `date_inserted` have times attached. The other important caveat is that some dates aren't whole dates: some values in the `date` and `date_print` fields contain only a year and month. These differences in type matter because they can add an extra step when you compare fields to each other. For example, if we wanted to count how many publications were added to Dimensions before their "publication" date, it would be intuitive to write a query like this:
SELECT COUNT(id)
FROM `dimensions-ai.data_analytics.publications`
WHERE
year = 2020
AND date > date_inserted
However, we get an error from BigQuery: `No matching signature for operator > for argument types: STRING, TIMESTAMP. Supported signature: ANY > ANY at [12:11]`. BigQuery won't perform the comparison because the two sides aren't of the same type: the `date` field is of type `STRING`, since it doesn't always have a day (or month) attached. The `date_normal` field solves this for us: it carries the same information as the `date` field, but fills in the gaps to make a complete `DATE` value, so `"2010-03"` in the `date` field becomes `2010-03-01` in `date_normal`. But swapping that in doesn't fix our problem either:
SELECT COUNT(id)
FROM `dimensions-ai.data_analytics.publications`
WHERE
year = 2020
AND date_normal > date_inserted
We run into a new variant of the issue: `No matching signature for operator > for argument types: DATE, TIMESTAMP. Supported signature: ANY > ANY at [5:7]`. `date_normal` now gives us a `DATE`, but we can't compare that to a `TIMESTAMP`. Generally, you can resolve most of these issues by converting one field to match the other's type, and BigQuery supports multiple functions for manipulating dates, datetimes, and timestamps. Casting the timestamp down to a date should do the trick:
SELECT COUNT(id)
FROM `dimensions-ai.data_analytics.publications`
WHERE
year = 2020
AND date_normal > DATE(date_inserted)
Results:
Row | f0_ |
---|---|
1 | 859011 |
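The same type discipline applies outside of BigQuery. As a rough Python analogy (standard library only, with made-up sample values), padding a partial date the way `date_normal` does, and casting a timestamp down to a date before comparing, look like this:

```python
from datetime import date, datetime

def normalize(partial: str) -> date:
    """Pad a partial date string like date_normal does: "2010-03" -> 2010-03-01."""
    parts = [int(p) for p in partial.split("-")]
    while len(parts) < 3:
        parts.append(1)  # fill a missing month/day with 1
    return date(*parts)

pub_date = normalize("2010-03")           # date(2010, 3, 1)
inserted = datetime(2017, 8, 31, 12, 50)  # a TIMESTAMP-like value

# Comparing a date to a datetime fails, just like DATE > TIMESTAMP in BigQuery.
try:
    pub_date > inserted
except TypeError as err:
    print("mismatch:", err)

# Casting the datetime down to a date first, like DATE(date_inserted), works.
print(pub_date > inserted.date())  # False: published before insertion
```

The `normalize` helper is hypothetical and only illustrates the padding behavior; the actual `date_normal` values are computed by Dimensions, not by you.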
Number of publications added to Dimensions by month
Description
Next, we'll use the `date_inserted` field to count the number of publications added to the Dimensions database per month. As we saw above, `date_inserted` is of type `TIMESTAMP`, so we use one of the timestamp manipulation functions to round every insertion date down to the first of its month:
SELECT
TIMESTAMP_TRUNC(date_inserted, MONTH) as added_date,
COUNT(id) as countDim
FROM
`dimensions-ai.data_analytics.publications`
GROUP BY added_date
ORDER BY added_date DESC
LIMIT 5
Results
Row | added_date | countDim |
---|---|---|
1 | 2021-04-01 00:00:00 UTC | 534043 |
2 | 2021-03-01 00:00:00 UTC | 746963 |
3 | 2021-02-01 00:00:00 UTC | 661575 |
4 | 2021-01-01 00:00:00 UTC | 687764 |
5 | 2020-12-01 00:00:00 UTC | 828307 |
We can see the dates have all been collapsed to the first of the month for each paper, but the attached times (and time zone) are unhelpful here. We can get rid of them by converting `date_inserted` to a `DATE` first and switching to the `DATE_TRUNC` function instead:
SELECT
DATE_TRUNC(DATE(date_inserted), MONTH) as added_date,
COUNT(id) as countDim
FROM
`dimensions-ai.data_analytics.publications`
GROUP BY added_date
ORDER BY added_date DESC
LIMIT 5
Results
Row | added_date | countDim |
---|---|---|
1 | 2021-04-01 | 534043 |
2 | 2021-03-01 | 746963 |
3 | 2021-02-01 | 661575 |
4 | 2021-01-01 | 687764 |
5 | 2020-12-01 | 828307 |
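For reference, this month-level rounding can be sketched in plain Python (standard library only, with hypothetical sample timestamps): `DATE(ts)` followed by `DATE_TRUNC(..., MONTH)` is essentially "drop the time, then snap the day to 1".

```python
from collections import Counter
from datetime import date, datetime

# Hypothetical sample of date_inserted-style timestamps.
inserted = [
    datetime(2021, 3, 5, 9, 30),
    datetime(2021, 3, 28, 17, 2),
    datetime(2021, 4, 1, 0, 15),
]

def month_start(ts: datetime) -> date:
    """DATE(ts) then truncate to MONTH: drop the time, snap to day 1."""
    return ts.date().replace(day=1)

counts = Counter(month_start(ts) for ts in inserted)
for month, n in sorted(counts.items(), reverse=True):
    print(month, n)
# 2021-04-01 1
# 2021-03-01 2
```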
That looks much better. If we want to work with different parts of the dates separately, we can also use `EXTRACT` to split things up:
SELECT
EXTRACT(MONTH FROM date_inserted) as added_month,
EXTRACT(YEAR FROM date_inserted) as added_year,
COUNT(id) as countDim
FROM
`dimensions-ai.data_analytics.publications`
GROUP BY added_month, added_year
ORDER BY added_year DESC, added_month DESC
LIMIT 5
Results
Row | added_month | added_year | countDim |
---|---|---|---|
1 | 4 | 2021 | 534043 |
2 | 3 | 2021 | 746963 |
3 | 2 | 2021 | 661575 |
4 | 1 | 2021 | 687764 |
5 | 12 | 2020 | 828307 |
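As a final Python analogy (again stdlib only, with made-up values), `EXTRACT(YEAR FROM ...)` and `EXTRACT(MONTH FROM ...)` correspond to reading the `.year` and `.month` attributes, and the `GROUP BY` corresponds to counting (year, month) pairs:

```python
from collections import Counter
from datetime import datetime

# Hypothetical date_inserted values spanning a year boundary.
inserted = [
    datetime(2020, 12, 15, 8, 0),
    datetime(2021, 1, 4, 12, 30),
    datetime(2021, 1, 20, 23, 59),
]

# EXTRACT(YEAR ...) / EXTRACT(MONTH ...) -> .year / .month
counts = Counter((ts.year, ts.month) for ts in inserted)
for (year, month), n in sorted(counts.items(), reverse=True):
    print(year, month, n)
# 2021 1 2
# 2020 12 1
```

Unlike truncating, extracting loses the ordering of rows within a single column, which is why the SQL version has to `ORDER BY` both `added_year` and `added_month`.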