SciImpact: A Multi-Dimensional, Multi-Field Benchmark for Scientific Impact Prediction

SciImpact turns scientific impact prediction into a benchmark that spans 7 impact dimensions, 19 scientific fields, and 215,928 contrastive pairs. The benchmark shows that task-specific supervised fine-tuning lets compact open-weight models outperform much larger baselines.

Hangxiao Zhu¹ Yuyu Zhang² Ping Nie³ Yu Zhang¹

¹Texas A&M University ²Verdent AI ³University of Waterloo

Pairs: 215,928 · Dimensions: 7 · Fields: 19 · LLMs evaluated: 11
Performance radar across the seven SciImpact dimensions.
SFT-Qwen3-4B reaches 0.685 average accuracy across impact dimensions and leads on Citation, Award, Patent, Media, Dataset, and Model.

Beyond citation counts

The paper argues that scientific impact cannot be reduced to citations alone. SciImpact broadens evaluation to include Award, Patent, Media, Code, Dataset, and Model signals, while still preserving the classical Citation setting.

Each benchmark instance is a contrastive pair from the same field, asking models to decide which artifact has higher impact under a specific dimension. This gives a clean way to compare models across short-term and long-term impact settings without collapsing everything into one noisy score.
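The pairwise setup above can be sketched in a few lines. This is an illustrative mock-up, not the benchmark's actual schema: the field names, example texts, and the `predict` interface are assumptions made for clarity.

```python
# Hypothetical sketch of a SciImpact-style contrastive pair and pairwise
# accuracy. All field names here are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class ContrastivePair:
    dimension: str  # e.g., Citation, Award, Patent, Media, Code, Dataset, Model
    field: str      # scientific field shared by both artifacts
    text_a: str     # artifact A (e.g., title + abstract, or a README)
    text_b: str     # artifact B, drawn from the same field
    label: str      # "A" or "B": which artifact has higher impact

def pairwise_accuracy(pairs, predict):
    """Fraction of pairs where predict(pair) matches the gold label."""
    correct = sum(1 for p in pairs if predict(p) == p.label)
    return correct / len(pairs)

# Toy usage: a trivial "always pick A" baseline scores 0.5 on a balanced set.
pairs = [
    ContrastivePair("Citation", "chemistry", "paper A ...", "paper B ...", "A"),
    ContrastivePair("Citation", "chemistry", "paper C ...", "paper D ...", "B"),
]
print(pairwise_accuracy(pairs, lambda p: "A"))  # 0.5
```

Because every pair is matched on field and dimension, accuracy on a balanced pair set is directly interpretable: 0.5 is chance, and per-dimension scores can be compared without citation-count normalization.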

Benchmark scope

19 fields from computer science and medicine to history, sociology, and philosophy.

Input types

Paper title plus abstract, GitHub README files, Hugging Face dataset cards, and model cards.

Main result

Supervised fine-tuning turns a 4B model into the strongest average system across dimensions and fields.

Three-stage curation pipeline

Three-stage pipeline for SciImpact benchmark construction.
SciImpact combines candidate retrieval, impact labeling plus pair generation, and filtering plus quality control into one reproducible pipeline.

Stage 1

Collect candidate artifacts and metadata from MAPLE, OpenAlex, SciSciNet, Papers with Code, GitHub, and Hugging Face.

Stage 2

Compute dimension-specific impact signals and build contrastive pairs under matched field and time constraints.

Stage 3

Recover missing text, filter invalid items, remove duplicates, and balance train, validation, and test coverage.
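Stage 2's matched-pair generation can be sketched as follows. This is a minimal illustration of the matched-field, matched-time idea only; the record layout, the `min_gap` threshold, and matching on exact year are assumptions, not the paper's actual procedure.

```python
# Illustrative sketch of contrastive-pair generation under matched field and
# time constraints. Schema and thresholds are assumptions for illustration.
from itertools import combinations

def make_pairs(artifacts, dimension, min_gap=10):
    """Pair artifacts from the same field and year whose impact signal on
    `dimension` differs by at least `min_gap`; label the higher-impact side."""
    pairs = []
    for a, b in combinations(artifacts, 2):
        if a["field"] != b["field"] or a["year"] != b["year"]:
            continue  # enforce matched-field, matched-time comparison
        gap = a[dimension] - b[dimension]
        if abs(gap) < min_gap:
            continue  # drop near-ties so labels stay reliable
        label = "A" if gap > 0 else "B"
        pairs.append({"a": a["id"], "b": b["id"], "label": label})
    return pairs

arts = [
    {"id": "p1", "field": "chemistry", "year": 2018, "citations": 250},
    {"id": "p2", "field": "chemistry", "year": 2018, "citations": 12},
    {"id": "p3", "field": "history", "year": 2018, "citations": 40},
]
print(make_pairs(arts, "citations"))
# [{'a': 'p1', 'b': 'p2', 'label': 'A'}]
```

Matching on field and time keeps pairs fair (a 2018 history paper is never compared with a 2023 machine-learning paper), and the gap filter is one way to avoid labeling noise from near-equal artifacts.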

Impact dimensions: Citation, Award, Patent, Media, Code, Dataset, Model.

Field coverage

SciImpact spans art, biology, business, chemistry, computer science, economics, engineering, environmental science, geography, geology, history, materials science, mathematics, medicine, philosophy, physics, political science, psychology, and sociology.

Fine-tuning changes the shape of performance

Radar chart comparing o4-mini, Qwen3-4B, and SFT-Qwen3-4B across SciImpact dimensions.
Across dimensions, SFT-Qwen3-4B is strongest on average, with especially large gains on Award (0.837), Media (0.720), and Model (0.644).
Radar chart comparing o4-mini, Qwen3-4B, and SFT-Qwen3-4B across scientific fields.
Across fields, SFT-Qwen3-4B reaches 0.720 average accuracy, peaks at 0.768 on Chemistry, and stays strong outside the usual computer science and biomedicine focus.

Smaller can be stronger

After supervised fine-tuning on SciImpact, a 4B model can match or surpass larger open-weight systems and even strong closed-source baselines.

Award is easiest, public impact is harder

Models perform best on award-related judgments, while dimensions such as Patent and Media remain harder because they depend on broader external factors.

Scientific impact is heterogeneous

The benchmark exposes where textual signals are sufficient and where models likely need additional context beyond the artifact text itself.

Artifact length varies sharply by dimension

Abstract-based tasks such as Citation, Award, Patent, and Media stay relatively compact. In contrast, repository README files and Hugging Face cards are much longer and more variable, which makes the benchmark meaningfully heterogeneous even before modeling begins.

The paper standardizes evaluation by truncating inputs when needed, so models must learn to extract impact-relevant cues under a controlled prompt budget.
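A fixed prompt budget can be sketched as below. The whitespace-token proxy and the 512-token cap are assumptions for illustration; the paper's actual tokenizer and budget may differ.

```python
# Minimal sketch of truncating artifact text to a fixed prompt budget.
# Whitespace tokenization and the 512 cap are illustrative assumptions.
def truncate(text, max_tokens=512):
    tokens = text.split()
    return " ".join(tokens[:max_tokens])

long_readme = "word " * 2000  # a long repository README stand-in
print(len(truncate(long_readme).split()))  # 512
```

Under such a cap, long README files and Hugging Face cards lose their tails, so models are rewarded for finding impact-relevant cues early rather than for reading more text.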

Text length percentile plot across SciImpact dimensions.
Code artifacts are the longest and most variable; paper-abstract dimensions are shorter and more tightly clustered.

BibTeX

@misc{zhu2026sciimpact,
  title={SciImpact: A Multi-Dimensional, Multi-Field Benchmark for Scientific Impact Prediction},
  author={Zhu, Hangxiao and Zhang, Yuyu and Nie, Ping and Zhang, Yu},
  year={2026},
  url={https://gitlab.com/user-paper-review/SciImpact.git}
}