T2S-Bench

Benchmarking Comprehensive Text-to-Structure Reasoning

1Duke University, 2UT Austin, 3Meta

Introduction

Large Language Models (LLMs) have demonstrated remarkable capabilities across a wide range of language understanding and reasoning tasks. Despite this success, their ability to explicitly structure information from complex text—capturing key entities, relations, and higher-order semantic organization—remains poorly understood and insufficiently evaluated. Existing benchmarks primarily assess surface-level generation or task-specific reasoning outcomes, leaving the fundamental problem of text-to-structure understanding largely unexplored.

To bridge this gap, we present T2S-Bench, the first benchmark specifically designed to evaluate and improve models’ text-to-structure capabilities. T2S-Bench comprises 1.8K high-quality samples spanning 6 scientific domains, 17 subfields, and 32 distinct structural types, covering a wide spectrum of real-world semantic structures. Solving tasks in T2S-Bench requires models to move beyond fluent text generation and instead perform explicit semantic structuring, which proves challenging for all existing models.

Using T2S-Bench, we conduct a comprehensive evaluation of 45 mainstream models across 10 model families. Our results reveal substantial headroom for improvement: on the multi-hop reasoning subset, the average exact-match accuracy is only 52.1%, and even the strongest model achieves merely 58.1% node accuracy on end-to-end structure extraction. These findings indicate that accurate text structuring remains a core bottleneck for current LLMs, even those excelling at downstream reasoning tasks.

LEADERBOARD

Leaderboard on T2S-Bench-MR

EM (Exact Match) and F1 scores on the Multi-hop Reasoning benchmark.

Metrics: EM (Exact Match), F1.
Task categories: Overall, Computer Science, Economic Science, Environment Science, Life Science, Physical Science, Social Science.
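For reference, EM and token-level F1 are typically computed as in SQuAD-style QA evaluation. The sketch below follows that convention; T2S-Bench's exact normalization rules (e.g. article stripping, punctuation handling) are an assumption and may differ from what the benchmark actually uses.

```python
import re
import string
from collections import Counter

def normalize(ans: str) -> str:
    """SQuAD-style normalization: lowercase, drop punctuation and
    articles, collapse whitespace. (Assumed, not confirmed by T2S-Bench.)"""
    ans = ans.lower()
    ans = "".join(ch for ch in ans if ch not in set(string.punctuation))
    ans = re.sub(r"\b(a|an|the)\b", " ", ans)
    return " ".join(ans.split())

def exact_match(pred: str, gold: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(pred) == normalize(gold))

def token_f1(pred: str, gold: str) -> float:
    """Harmonic mean of token precision and recall over the answers."""
    pred_toks = normalize(pred).split()
    gold_toks = normalize(gold).split()
    common = Counter(pred_toks) & Counter(gold_toks)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_toks)
    recall = overlap / len(gold_toks)
    return 2 * precision * recall / (precision + recall)
```

For example, a prediction of "cell division" against a gold answer "cell" scores EM 0 but a nonzero F1, since one of its two tokens is correct.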

Leaderboard on T2S-Bench-E2E

Node and Link scores on the End-to-End Evaluation benchmark.

Metrics: Node Similarity, Link F1.
Task categories: Overall, Computer Science, Economic Science, Environment Science, Life Science, Physical Science, Social Science.
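One plausible way to score end-to-end structure extraction is to compare predicted and gold graphs directly: soft string similarity for nodes, and set-level F1 over (head, relation, tail) triples for links. The sketch below is illustrative only; the greedy matching and `difflib` similarity are assumptions, not T2S-Bench's documented metric definitions.

```python
from difflib import SequenceMatcher

def node_similarity(pred_nodes, gold_nodes):
    """Greedily match each gold node to its most similar remaining
    predicted node and average the similarities.
    (Illustrative; the benchmark's matching rule may differ.)"""
    if not gold_nodes:
        return 0.0
    remaining = list(pred_nodes)
    total = 0.0
    for gold in gold_nodes:
        if not remaining:
            break
        scores = [SequenceMatcher(None, gold.lower(), p.lower()).ratio()
                  for p in remaining]
        best = max(range(len(scores)), key=scores.__getitem__)
        total += scores[best]
        remaining.pop(best)  # each predicted node is used at most once
    return total / len(gold_nodes)

def link_f1(pred_links, gold_links):
    """F1 over exact-match (head, relation, tail) triples."""
    pred, gold = set(pred_links), set(gold_links)
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(gold)
    return 2 * precision * recall / (precision + recall)
```

Under this scheme, predicting one correct link and one spurious link against a single-link gold graph yields precision 0.5, recall 1.0, and link F1 of 2/3.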

DATASET CONSTRUCTION

Overview

T2S-Bench is a comprehensive benchmark for evaluating models' ability to extract structured representations from scientific text. It includes three curated components: T2S-Train-1.2k for training, T2S-Bench-MR (500 samples) for multi-hop reasoning, and T2S-Bench-E2E (87 samples) for end-to-end structuring. Covering 6 scientific domains, 17 subfields, and 32 structure types, T2S-Bench provides high-quality, structure-grounded samples drawn from peer-reviewed academic papers. Every sample underwent a 6K+ model search, 6 rounds of validation, and 3 rounds of human review, ensuring correctness of structure, text, and reasoning logic.

All data examples in T2S-Bench are organized into three high-quality subsets targeting different aspects of text-to-structure reasoning:

  • T2S-Train-1.2k: 1,200 verified text-structure pairs for training and instruction tuning.
  • T2S-Bench-MR: 500 multi-hop QA examples built from 4 structure-aware reasoning types × 32 templates.
  • T2S-Bench-E2E: 87 end-to-end structuring tasks with fixed nodes/links to ensure consistent evaluation.
The dataset is available for download on Hugging Face.
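To make the three subsets concrete, here is a minimal, hypothetical sketch of what a text-structure pair might look like. The field names and schema are purely illustrative assumptions, not the benchmark's actual format.

```python
# Hypothetical end-to-end structuring sample.
# All field names here are illustrative, NOT T2S-Bench's real schema.
sample = {
    "domain": "Life Science",
    "text": (
        "Mitochondria produce ATP through oxidative phosphorylation, "
        "which is driven by the electron transport chain."
    ),
    "structure": {
        # Fixed node and link sets enable consistent evaluation.
        "nodes": [
            "mitochondria",
            "ATP",
            "oxidative phosphorylation",
            "electron transport chain",
        ],
        "links": [
            ("mitochondria", "produces", "ATP"),
            ("oxidative phosphorylation", "driven by", "electron transport chain"),
        ],
    },
}

# In the E2E setting, a model receives only `text` and must recover
# the node and link sets in `structure`.
node_set = set(sample["structure"]["nodes"])
link_set = set(sample["structure"]["links"])
```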

Statistics

Notable statistics of T2S-Bench

EXAMPLES

Examples for each Major Science Topic in T2S-Bench-MR



Examples for each Major Science Topic in T2S-Bench-E2E



Diagram examples for each Minor Science Topic in T2S-Bench-E2E

APPLICATION

BibTeX

@misc{wang2026t2sbenchstructureofthoughtbenchmarking,
      title={T2S-Bench \& Structure-of-Thought: Benchmarking and Prompting Comprehensive Text-to-Structure Reasoning},
      author={Qinsi Wang and Hancheng Ye and Jinhee Kim and Jinghan Ke and Yifei Wang and Martin Kuo and Zishan Shao and Dongting Li and Yueqian Lin and Ting Jiang and Chiyue Wei and Qi Qian and Wei Wen and Helen Li and Yiran Chen},
      year={2026},
      eprint={2603.03790},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2603.03790}, 
}