Leaderboard

Methodology

The AI AgriBench benchmark uses 416 Question-Answer (QA) pairs, carefully curated by agricultural experts from technical source documents, to evaluate responses from general-purpose Large Language Models and agriculture-specific advisory services. The model or service being compared is called the “Subject Model.” This data set is treated as “ground truth,” and a Subject Model’s answers are compared against the ground truth answers when computing the metrics reported in the leaderboard below.

The ground truth answers are generally 2-4 brief paragraphs long and, for uniform evaluation, do not include citations. To evaluate such “long-form” answers, we use four high-quality Large Language Models as “Judge Models” that compare each Subject Model response against the corresponding ground truth answer.

The Judge Models we use are Claude Opus 4.5, Gemini3-Pro-Preview, Kimi-K2-thinking, and GPT5.1.

By default, we use the first three Judge Models to evaluate a Subject Model’s responses. However, when one of the Judge Models is itself being evaluated as a Subject Model, we use the other three Judge Models from the set above. This ensures that a Subject Model is never used to judge itself.
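
The selection rule can be sketched in a few lines of Python. This is an illustration only, not the benchmark’s implementation; the function name and exact string matching on model names are assumptions:

    # Hypothetical sketch of the self-judging exclusion rule.
    JUDGE_POOL = [
        "Claude Opus 4.5",
        "Gemini3-Pro-Preview",
        "Kimi-K2-thinking",
        "GPT5.1",
    ]

    def select_judges(subject_model: str, n_judges: int = 3) -> list[str]:
        """Return the judge panel for a Subject Model, never including itself."""
        eligible = [m for m in JUDGE_POOL if m != subject_model]
        return eligible[:n_judges]

    # A Subject Model outside the pool gets the default first three judges;
    # a Subject Model inside the pool is judged by the remaining three.
    assert select_judges("Some-Other-Model") == JUDGE_POOL[:3]
    assert "Kimi-K2-thinking" not in select_judges("Kimi-K2-thinking")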
 
The data set, evaluation methodology, evaluation metrics, and judging prompts are described in more detail in an accompanying blog post.
 

Interpreting the Leaderboard

Metrics: This table ranks general-purpose Large Language Models and agriculture-specific advisory services (each called a “Subject Model”) in terms of four metrics:
  • Accuracy: Alignment with expert consensus and the gold answer. This includes correct terminology (disease/pest names, nutrient forms), factual correctness of diagnostic conclusions, and appropriateness of management recommendations. Completely correct, expert-aligned answers score 100; severely incorrect or misleading answers score near 0.
  • Relevance: How well the answer stays on topic and addresses the user’s agricultural question. Answers that drift into unrelated agronomy, ignore the main decision, or miss critical points are penalized.
  • Completeness: Whether the answer covers the key steps, caveats, and conditions needed for a farmer or advisor to act safely and effectively, rather than giving partial or fragmentary advice.
  • Conciseness: Whether the answer is focused, avoids unnecessary digressions, and communicates the required information efficiently.
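
As an illustration of how these per-metric scores might be combined, here is a minimal sketch that averages each 0-100 metric across the three judge verdicts for one QA pair. The averaging rule and the data layout are assumptions made for illustration; the accompanying blog post describes the actual aggregation:

    from statistics import mean

    METRICS = ("accuracy", "relevance", "completeness", "conciseness")

    def aggregate_scores(judge_scores: list[dict[str, float]]) -> dict[str, float]:
        """Average each 0-100 metric over all judge verdicts for one QA pair."""
        return {m: mean(js[m] for js in judge_scores) for m in METRICS}

    # One verdict per judge for a single QA pair (values are made up).
    verdicts = [
        {"accuracy": 90, "relevance": 85, "completeness": 70, "conciseness": 95},
        {"accuracy": 80, "relevance": 90, "completeness": 75, "conciseness": 90},
        {"accuracy": 85, "relevance": 80, "completeness": 80, "conciseness": 85},
    ]
    print(aggregate_scores(verdicts))
    # {'accuracy': 85.0, 'relevance': 85.0, 'completeness': 75.0, 'conciseness': 90.0}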

You can sort the entries in the table using any one of the metrics by clicking the arrows next to the metric name.
 
Agronomic Topic Categories: The 416 QA pairs are grouped into nine agronomic categories:
      Pests, Diseases, Weeds, Nutrition, Soils, Seed Hybrids, Horticulture, Water, Weather

A single QA pair may belong to multiple categories.  If you wish to narrow the comparisons to a particular category (or a few specific categories), you can filter the results by selecting those categories and deselecting the others with the selector buttons above the table.
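
Because membership is multi-label, the category filter is effectively a set-intersection test: a QA pair is kept if it carries at least one selected category. A minimal sketch, assuming each QA pair stores its categories as a set (the field names are hypothetical):

    QA_PAIRS = [
        {"id": 1, "categories": {"Pests", "Diseases"}},
        {"id": 2, "categories": {"Soils", "Nutrition"}},
        {"id": 3, "categories": {"Water", "Weather", "Soils"}},
    ]

    def filter_by_categories(qa_pairs, selected: set[str]):
        """Keep QA pairs tagged with at least one selected category."""
        return [qa for qa in qa_pairs if qa["categories"] & selected]

    # Selecting only "Soils" keeps pairs 2 and 3; pair 3 matches because
    # a multi-label pair matches on any one of its categories.
    print([qa["id"] for qa in filter_by_categories(QA_PAIRS, {"Soils"})])  # [2, 3]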
 

Reducing Evaluation Bias from Public Data Sources

Modern Large Language Models (LLMs), the foundation of public chatbots and of recent AI-driven agricultural advisory services, are pretrained on vast quantities of publicly accessible data, which potentially include the Extension publications from which the benchmark QA pairs were extracted. This creates a potential evaluation bias for such LLMs. One effective way to mitigate this bias is to limit the source data set to documents published after an LLM’s training cutoff date. We therefore use a second data set of QA pairs taken from documents published on or after September 1, 2024. Results on this “post-cutoff” data set for several frontier LLMs (GPT5.1, GPT5-mini, Gemini3-Pro-Preview, and Gemini2.5-Flash) are included in the leaderboard below and marked with ‘**’; comparing these results with those on the main data set gives an indication of the impact of this bias. Briefly, the results show little, and arguably negligible, impact from pretraining exposure to the public documents.
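
To make the mitigation concrete, here is a minimal sketch of the post-cutoff filter, assuming each source document carries a publication date (the document structure and field names are illustrative):

    from datetime import date

    CUTOFF = date(2024, 9, 1)  # post-cutoff set: published on or after this date

    documents = [
        {"title": "Corn rootworm management update", "published": date(2025, 2, 10)},
        {"title": "Soil sampling basics", "published": date(2019, 6, 1)},
    ]

    # Keep only documents the evaluated LLMs could not have seen in pretraining,
    # assuming their training cutoffs precede September 1, 2024.
    post_cutoff_docs = [d for d in documents if d["published"] >= CUTOFF]
    print([d["title"] for d in post_cutoff_docs])
    # ['Corn rootworm management update']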

The AI AgriBench Leaderboard

To sort this table by one of the four metrics, click on the arrows next to the corresponding column header.

Entries marked with ‘**’ indicate results from the post-cutoff data set.  See the section on “Reducing Evaluation Bias from Public Data Sources”, above, for more details.

There may be additional rows in the table that are not visible within the frame below; scroll the table’s contents to see them.