Leaderboard
Methodology
The AI AgriBench benchmark uses 416 Question-Answer (QA) pairs, carefully curated by agricultural experts from technical source documents, to evaluate responses from general-purpose Large Language Models and agriculture-specific advisory services. The model or service being evaluated is called the “Subject Model.” This data set is treated as ground truth, and a Subject Model’s answers are compared against the ground-truth answers when computing the metrics reported in the leaderboard below.
The ground truth answers are generally two to four brief paragraphs long and, to keep the evaluation uniform, do not include citations. To compare such “long-form” answers, we use four high-quality Large Language Models as “Judge Models” that score each Subject Model response against the corresponding ground truth answer.
The Judge Models we use are Claude Opus 4.5, Gemini3-Pro-Preview, Kimi-K2-thinking, and GPT5.1.
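The loop below is a minimal sketch of how such a multi-judge comparison can be wired up. It is illustrative only: the judge identifiers, the `ask_subject_model` and `ask_judge` helpers, and the data layout are assumptions, not the actual AgriBench harness.

```python
# Minimal sketch of the multi-judge comparison loop. Model identifiers,
# helper functions, and the data layout are illustrative assumptions.

QA_PAIRS: list[dict] = []  # 416 expert-curated {"question": ..., "gold": ...} pairs

JUDGE_MODELS = ["judge-1", "judge-2", "judge-3", "judge-4"]  # hypothetical IDs

def ask_subject_model(question: str) -> str:
    """Query the Subject Model under evaluation (API call omitted)."""
    raise NotImplementedError

def ask_judge(judge: str, question: str, gold: str, candidate: str) -> dict[str, int]:
    """Ask one Judge Model to grade the candidate answer against the
    ground truth, returning a 0-100 score per metric (API call omitted)."""
    raise NotImplementedError

def evaluate_subject_model() -> list[dict]:
    """Collect one judgment from each Judge Model for every QA pair."""
    results = []
    for pair in QA_PAIRS:
        candidate = ask_subject_model(pair["question"])
        judgments = [
            ask_judge(judge, pair["question"], pair["gold"], candidate)
            for judge in JUDGE_MODELS
        ]
        results.append({"question": pair["question"], "judgments": judgments})
    return results
```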
Interpreting the Leaderboard
- Accuracy–Alignment with expert consensus and the gold answer. This includes correct terminology (disease/pest names, nutrient forms), factual correctness of diagnostic conclusions, and appropriateness of management recommendations. Completely correct, expert-aligned answers score 100; severely incorrect or misleading answers score near 0.
- Relevance–How well the answer stays on topic and addresses the user’s agricultural question. Answers that drift into unrelated agronomy, ignore the main decision, or miss critical points are penalized.
- Completeness–Whether the answer covers the key steps, caveats, and conditions needed for a farmer or advisor to act safely and effectively, rather than giving partial or fragmentary advice.
- Conciseness–Whether the answer is focused, avoids unnecessary digressions, and communicates the required information efficiently.
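As a rough sketch of how per-judge scores along these four dimensions could be rolled up into leaderboard numbers, the snippet below averages each metric first across the judges and then across all QA pairs. The simple-mean aggregation and the `results` layout (carried over from the sketch above) are assumptions, not the documented AgriBench method.

```python
from statistics import mean

METRICS = ["accuracy", "relevance", "completeness", "conciseness"]

def aggregate(results: list[dict]) -> dict[str, float]:
    """Average each metric across the four judges per question, then
    across all QA pairs, yielding one leaderboard score per metric.
    (Simple-mean aggregation is an assumption.)"""
    per_metric: dict[str, list[float]] = {m: [] for m in METRICS}
    for entry in results:
        for m in METRICS:
            per_metric[m].append(mean(judgment[m] for judgment in entry["judgments"]))
    return {m: mean(scores) for m, scores in per_metric.items()}
```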
Reducing Evaluation Bias from Public Data Sources
Because the benchmark’s source documents are publicly available, some may appear in a Subject Model’s training data. To reduce this bias, a portion of the results is computed on a post-cutoff data set, drawn from material published after the models’ training cutoffs; these entries are marked in the leaderboard below.
The AI AgriBench Leaderboard
To sort this table by one of the four metrics, click on the arrows next to the corresponding column header.
Entries marked with ‘**’ indicate results from the post-cutoff data set. See the section on “Reducing Evaluation Bias from Public Data Sources”, above, for more details.
There may be additional rows in the table that are not visible within the frame below; scroll the table contents to see them.