🧘‍♂️ KramaBench: A Benchmark for AI Systems on Data Intensive Tasks
KramaBench is a comprehensive benchmark designed to evaluate the performance of AI systems on complex, data-intensive tasks that mirror real-world data science workflows. It contains 104 curated tasks over 1,700 datasets from six diverse domains—archaeology, astronomy, biomedical research, environmental science, legal discovery, and wildfire prevention—each requiring end-to-end reasoning, data processing, and pipeline orchestration. Our goal is to assess how well large language models and agentic systems can autonomously design and execute full data-to-insight pipelines, advancing research toward truly automated, reliable, and interpretable AI-driven data science.
The benchmark evaluates systems along three axes, sketched below:
- End-to-End Automation: Solve complete data science tasks without any human intervention.
- Pipeline Design: Generate a pipeline that includes all key functionalities required for a correct solution.
- Pipeline Implementation: Given a specific pipeline sub-task, implement the necessary code to execute it successfully.
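To make the three axes concrete, here is a minimal interface sketch of a system under test. All names (`DataScienceSystem`, `solve_task`, `design_pipeline`, `implement_step`) are illustrative assumptions, not the benchmark's actual API; see the KramaBench repository for the real harness.

```python
# Hypothetical system-under-test interface mirroring the three evaluation axes.
from abc import ABC, abstractmethod

class DataScienceSystem(ABC):
    """Illustrative system under test, evaluated along the three axes above."""

    @abstractmethod
    def solve_task(self, task: str, data_paths: list[str]) -> str:
        """End-to-end automation: produce the final answer with no human input."""

    @abstractmethod
    def design_pipeline(self, task: str) -> list[str]:
        """Pipeline design: return the ordered processing steps for the task."""

    @abstractmethod
    def implement_step(self, step: str) -> str:
        """Pipeline implementation: return runnable code for one sub-task."""
```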
Each task’s score is normalized to a [0, 1] range, and aggregate scores are reported as averages across workloads or the full benchmark. Higher scores indicate better overall system capability. All benchmark tasks have reference solutions manually validated by domain experts to ensure reliability and fairness.
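For concreteness, the following is a minimal sketch of that aggregation, assuming per-task scores have already been normalized to [0, 1]. It is purely illustrative; the authoritative implementation is the evaluation pipeline in the KramaBench repository.

```python
# Sketch of score aggregation: per-workload averages plus an overall average.
from statistics import mean

def aggregate_scores(task_scores: dict[str, list[float]]) -> dict[str, float]:
    """Average normalized task scores per workload, plus an overall score.

    Here the overall score averages over all tasks; averaging over workload
    means instead is an equally plausible reading of the text above.
    """
    per_workload = {w: mean(scores) for w, scores in task_scores.items()}
    all_scores = [s for scores in task_scores.values() for s in scores]
    per_workload["overall"] = mean(all_scores)
    return per_workload

# Example with two hypothetical domains:
print(aggregate_scores({
    "astronomy": [0.8, 0.6, 1.0],
    "wildfire": [0.4, 0.7],
}))
```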
For detailed methodology, evaluation procedures, and analysis, please refer to the original paper.
Current Rankings
| Rank | System | Model | Score (%) |
|---|---|---|---|
The name KramaBench 🧘‍♀️
You read that correctly: it's Krama, not Karma! The name KramaBench is a reference to the "Vinyasa Krama" practice of yoga, where the main focus is on a correct and seamless transition from one pose to the next.
In Sanskrit, "Krama" (क्रम) means "sequence" or "order" — emphasizing the importance of proper progression and methodical execution. Just as Vinyasa Krama yoga requires mindful attention to each pose transition, our benchmark evaluates AI systems on their ability to execute data science pipelines with a correct sequence of steps.
Submit your system's results
To add your system to the KramaBench leaderboard, follow these steps:
- Request Test Questions: Contact our team to receive a set of obscured test questions, which we will provide to you without ground-truth answers. You can reach us at the following email address:
- Run Your System: Process the test questions using your AI system and generate answers along with detailed reasoning traces. We will provide you with a reference solution output to format your results (an illustrative sketch follows this list).
- Submit Results: Send us your system's answers and the complete reasoning traces. Our team will run the evaluation pipeline to assess your system's performance.
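As a rough illustration of the artifacts requested above (answers plus reasoning traces), a results file might look like the sketch below. Every field name and value here is hypothetical; the authoritative format is the reference solution output the KramaBench team provides.

```python
# Hypothetical shape of a results submission -- illustrative only.
import json

submission = {
    "system": "my-agentic-system",       # hypothetical system name
    "model": "my-base-model",            # hypothetical underlying model
    "results": [
        {
            "task_id": "astronomy-1",    # hypothetical task identifier
            "answer": "42.7",            # the system's final answer
            "reasoning_trace": [         # ordered record of steps taken
                "Loaded the two provided CSV files",
                "Joined on object_id and filtered invalid rows",
                "Computed the requested mean magnitude",
            ],
        },
    ],
}

with open("submission.json", "w") as f:
    json.dump(submission, f, indent=2)
```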
To ensure fairness and consistency, all submissions are evaluated using the same methodology described in our paper. The evaluation process is automated using the open-source framework found in the KramaBench repository.
Once evaluated, your system will be added to the KramaBench leaderboard with its performance scores across all domains.
 
Resources
Access the benchmark data, source code, and research publication:
KramaBench: A Benchmark for AI Systems on Data Intensive Tasks
Eugenie Lai, Gerardo Vitagliano, Ziyu Zhang, Om Chabra, Sivaprasad Sudhir, Anna Zeng, Anton A. Zabreyko, Chenning Li, Ferdi Kossmann, Jialin Ding, Jun Chen, Markos Markakis, Matthew Russo, Weiyang Wang, Ziniu Wu, Michael J. Cafarella, Lei Cao, Samuel Madden, Tim Kraska (2025)