SciEvo: A 2 Million, 30-Year Cross-disciplinary Dataset for Temporal Scientometric Analysis

Mar 3, 2025 · Yiqiao Jin, Yijia Xiao, Yiyang Wang, Jindong Wang · 2 min read
SciEvo dataset spans >30 years of scientific evolution
Abstract
Understanding the creation, evolution, and dissemination of scientific knowledge is crucial for bridging diverse subject areas and addressing complex global challenges such as pandemics, climate change, and ethical AI. Scientometrics, the quantitative and qualitative study of scientific literature, provides valuable insights into these processes. We introduce SciEvo, a longitudinal scientometric dataset with over two million academic publications, providing comprehensive contents information and citation graphs to support cross-disciplinary analyses. SciEvo is easy to use and available across platforms, including GitHub, Kaggle, and HuggingFace. Using SciEvo, we conduct a temporal study spanning over 30 years to explore key questions in scientometrics: the evolution of academic terminology, citation patterns, and interdisciplinary knowledge exchange. Our findings reveal critical insights, such as disparities in epistemic cultures, knowledge production modes, and citation practices. For example, rapidly developing, application-driven fields like LLMs exhibit significantly shorter citation age (2.48 years) compared to traditional theoretical disciplines like oral history (9.71 years).
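As a rough illustration of the citation-age figures quoted above, the sketch below computes a mean citation age from a list of citation edges. It assumes citation age means the gap in years between a citing paper and each paper it cites; the paper's exact definition may differ. The function name, field layout, and toy data are hypothetical and do not reflect SciEvo's actual schema.

```python
from statistics import mean


def mean_citation_age(pub_year, citations):
    """Average gap in years between a citing paper and the papers it cites.

    pub_year:  dict mapping paper id -> publication year (hypothetical layout)
    citations: iterable of (citing_id, cited_id) pairs
    Returns None when no edge has both years available.
    """
    ages = [
        pub_year[citing] - pub_year[cited]
        for citing, cited in citations
        # Skip edges where either endpoint has no known publication year.
        if citing in pub_year and cited in pub_year
    ]
    return mean(ages) if ages else None


# Toy data: made-up ids and years, not drawn from SciEvo.
pub_year = {"a": 2023, "b": 2021, "c": 2020, "d": 2010}
citations = [("a", "b"), ("a", "c"), ("b", "d")]
print(mean_citation_age(pub_year, citations))
```

Grouping the edges by the citing paper's field before averaging would give per-field values comparable in spirit to the 2.48-year and 9.71-year figures, though this sketch does not reproduce them.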
Type
Publication
Good Data AAAI 2025 Workshop

Best Paper Award 🏆

This work received the Best Paper Award at the Good-Data @ AAAI'25 Workshop.

Key Contributions

  1. Large-scale Temporal Dataset: 2M+ papers across 30 years with consistent temporal coverage
  2. Cross-disciplinary Scope: Coverage across multiple scientific domains
  3. Benchmark Tasks: Four established temporal analysis tasks for evaluation
  4. Quality Validation: Systematic filtering and verification procedures

Keywords

Scientometrics, Temporal Analysis, Dataset, Benchmark, Scientific Evolution