T2S-BENCH
Large Language Models (LLMs) have demonstrated remarkable capabilities across a wide range of language understanding and reasoning tasks. However, despite their success, their ability to explicitly structure information from complex text—capturing key entities, relations, and higher-order semantic organization—remains poorly understood and insufficiently evaluated. Existing benchmarks primarily assess surface-level generation or task-specific reasoning outcomes, leaving the fundamental problem of text-to-structure understanding largely unexplored.
To bridge this gap, we present
T2S-Bench, the first benchmark specifically designed to evaluate and improve models’ text-to-structure capabilities. T2S-Bench comprises 1.8K high-quality samples spanning 6 scientific domains, 17 subfields, and 32 distinct structural types, covering a wide spectrum of real-world semantic structures. Solving tasks in T2S-Bench requires models to move beyond fluent text generation and instead perform explicit semantic structuring, which proves challenging for all existing models.
Using
T2S-Bench, we conduct a comprehensive evaluation of 45 mainstream models across 10 model families. Our results reveal substantial headroom for improvement: on the multi-hop reasoning subset, the average exact-match accuracy is only 52.1%, and even the strongest model achieves merely 58.1% node accuracy on end-to-end structure extraction. These findings indicate that accurate text structuring remains a core bottleneck for current LLMs, even those excelling at downstream reasoning tasks.
LEADERBOARD
EM (Exact Match) and F1 scores on the Multi-hop Reasoning benchmark.
Metrics: EM (Exact Match), F1. Task categories: Overall, Computer Science, Economic Science, Environment Science, Life Science, Physical Science, Social Science.
Node and Link scores on the End-to-End Evaluation benchmark.
Metrics: Node Similarity, Link F1.
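The benchmark's exact scoring code is not shown on this page; the sketch below uses the common QA-style definitions of Exact Match and token-level F1 (normalize, then compare), which may differ in detail from T2S-Bench's implementation.

```python
import re
from collections import Counter

def normalize(text: str) -> str:
    """Lowercase, strip punctuation, collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())

def exact_match(pred: str, gold: str) -> float:
    """1.0 if the normalized strings match exactly, else 0.0."""
    return float(normalize(pred) == normalize(gold))

def f1_score(pred: str, gold: str) -> float:
    """Token-level F1 between a predicted and a gold answer."""
    pred_tokens = normalize(pred).split()
    gold_tokens = normalize(gold).split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; only one empty scores 0.
        return float(pred_tokens == gold_tokens)
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

Corpus-level EM and F1 are then averages of these per-sample scores over the benchmark subset.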
DATASET CONSTRUCTION
T2S-Bench is a comprehensive benchmark for evaluating models' ability to extract structured representations from scientific text. It includes three curated components:
T2S-Train-1.2k for training,
T2S-Bench-MR (500 samples) for multi-hop reasoning, and
T2S-Bench-E2E (87 samples) for end-to-end structuring.
Covering 6 scientific domains, 17 subfields, and 32 structure types, T2S-Bench provides high-quality, structure-grounded samples drawn from peer-reviewed academic papers.
Every sample underwent 6K+ model searches, 6 rounds of validation, and 3 rounds of human review, ensuring correctness of structure, text, and reasoning logic.
All data examples in T2S-Bench are organized into three high-quality subsets targeting different aspects of text-to-structure reasoning:
Notable statistics of
T2S-Bench
EXAMPLES
Examples for each Major Science Topic in
T2S-Bench-MR
Examples for each Major Science Topic in
T2S-Bench-E2E
Diagram examples for each Minor Science Topic in
T2S-Bench-E2E
APPLICATION
The ability to extract structured information from text can substantially improve the performance of LLMs in a wide range of real-world applications, such as generating figures for scientific papers and creating presentation slides. To provide an intuitive demonstration of this effect, we present a comparison of figures generated from the same scientific text with and without structured reasoning.
Without Structure-of-Thought: We directly prompt Nanobanana to generate an overview figure from the input text.
With Structure-of-Thought: We first guide Nanobanana to organize the paper into its structural components, producing key points and the relationships among them. The model then generates the figure based on both the text and the structured representation.
As shown in the comparison, the figure produced with Structure-of-Thought is significantly closer to a human-created scientific diagram in terms of layout, logical organization, and visual clarity. This result indicates that text structuring is a critical capability that models should acquire, as it enables them to more effectively assist users in practical, everyday tasks.
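The two-stage procedure above can be sketched as a minimal pipeline. The prompts and the `generate` callable below are illustrative assumptions, not the actual interface used with Nanobanana: `generate` stands in for any text- or image-generation call.

```python
from typing import Callable

# Hypothetical prompts; the real Structure-of-Thought prompts may differ.
STRUCTURE_PROMPT = (
    "Read the paper text below. List its key points as nodes and the "
    "relationships among them as (source, relation, target) links.\n\n{text}"
)
FIGURE_PROMPT = (
    "Generate an overview figure for the paper below, using the extracted "
    "structure to organize the layout.\n\n"
    "Paper text:\n{text}\n\nStructure:\n{structure}"
)

def structure_of_thought(text: str, generate: Callable[[str], str]) -> str:
    """Stage 1: extract an explicit structure from the text.
    Stage 2: condition figure generation on both text and structure."""
    structure = generate(STRUCTURE_PROMPT.format(text=text))
    return generate(FIGURE_PROMPT.format(text=text, structure=structure))
```

The baseline ("without Structure-of-Thought") corresponds to calling `generate` once on the raw text alone, skipping the structuring stage.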
@misc{wang2026t2sbenchstructureofthoughtbenchmarking,
title={T2S-Bench \& Structure-of-Thought: Benchmarking and Prompting Comprehensive Text-to-Structure Reasoning},
author={Qinsi Wang and Hancheng Ye and Jinhee Kim and Jinghan Ke and Yifei Wang and Martin Kuo and Zishan Shao and Dongting Li and Yueqian Lin and Ting Jiang and Chiyue Wei and Qi Qian and Wei Wen and Helen Li and Yiran Chen},
year={2026},
eprint={2603.03790},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2603.03790},
}