32. Creating a concepts network
Level: Advanced
This query requires a good understanding of SQL and the Dimensions data model
Description
This query generates two-concept pairs and counts how many publications are shared between these concepts (note: concepts in Dimensions are publication-level keywords normalised and weighted based on a relevancy score).
The query includes a subquery (user-provided-subquery
) that can be customised so to focus on arbitrary subsets of the Dimensions database. Parameters values for max number of nodes and min weight of edges to be included in the result (@max_nodes, @min_edge_weight) can be fine-tuned as needed.
The gist of the query lies in the double CROSS JOIN UNNEST. This mechanism allows to traverse a potentially very large number of relationships in seconds and to expose all relevant combinations of co-authoring organisations within the same data structure.
For more details, see also:
- Generating large-scale network analyses of scientific landscapes in seconds using Dimensions on Google BigQuery, International Conference on Science, Technology and Innovation Indicators (STI 2022) Granada September 2022
- Python tool to generate network visualizations: website and source code in Github
Query
As an example, we generate a concepts cooccurence network for all publications from the last 30 days that have an Altmetric Attention Score greater than 10:
WITH subset AS (
-- USER-PROVIDED-SUBQUERY
-- add any query in this section, as long as it returns publication IDs
-- EG here we get all papers from the last 30 days with Altmetric attention > 10
SELECT
id
FROM
`dimensions-ai.data_analytics.publications`
WHERE
EXTRACT(
DATE
FROM
date_inserted
) >= DATE_ADD(CURRENT_DATE(), INTERVAL -30 DAY)
AND altmetrics.score > 10
),
papercount AS (
SELECT concept.concept, COUNT(p.id) AS papers,
FROM `dimensions-ai.data_analytics.publications` p
INNER JOIN subset ON p.id=subset.id
CROSS JOIN UNNEST(p.concepts) concept
WHERE
year >= 1965
AND concept.relevance >= 0.5 -- @min_link_relevance
GROUP BY 1
),
filtered AS (
SELECT *
FROM papercount
WHERE papers >= 5 --@min_concept_frequency
ORDER BY papers DESC
LIMIT 500 -- @max_nodes
),
results AS (
SELECT concept1.concept AS concept_a, concept2.concept AS concept_b,
COUNT(p.id) AS overlap,
FROM `dimensions-ai.data_analytics.publications` p
INNER JOIN subset ON p.id=subset.id
CROSS JOIN UNNEST(p.concepts) concept1
CROSS JOIN UNNEST(p.concepts) concept2
INNER JOIN filtered pcount1 ON concept1.concept=pcount1.concept
INNER JOIN filtered pcount2 ON concept2.concept=pcount2.concept
WHERE year >= 1965
AND concept1.relevance >= 0.5 --@min_link_relevance
AND concept2.relevance >= 0.5 --@min_link_relevance
AND concept1.concept <> concept2.concept
GROUP BY 1,2
)
SELECT *
FROM results
WHERE overlap >= 3 --@min_edge_weight
Results
A sample of the results
concept_a | concept_b | overlap | ||
---|---|---|---|---|
concept_a | concept_b | overlap | ||
odds ratio | subgroup analysis | 5 | ||
randomized clinical trials | patients | 18 | ||
population-based study | Main Outcomes | 3 | ||
systematic review | Web of Science | 24 | ||
intervention | care | 18 | ||
depression | participants | 20 | ||
exercise | muscle | 5 | ||
heart failure | observational study | 3 | ||
non-small cell lung cancer | therapy | 3 | ||
type 2 diabetes | cohort study | 7 | ||
quality of life | primary outcome | 8 | ||
Cox proportional hazards models | women | 5 | ||
plants | cells | 3 | ||
nurses | outcomes | 6 | ||
COVID-19 | intervention | 3 | ||
COVID-19 pandemic | well-being | 3 | ||
clinical trials | randomized clinical trials | 46 | ||
subgroup analysis | meta-analysis | 17 | ||
odds ratio | logistic regression models | 3 | ||
more entries here... |
VOSViewer
A VOSViewer visualization using the results of this query can be explored online (note: the visualization uses data from the COVID-19 dataset)