The fourth iteration of the Generation, Evaluation & Metrics (GEM) Workshop will be held as part of ACL 2025 on July 31, 2025.

The workshop will be held in hybrid mode, with sessions both in person and via the conference portal.

Schedule

All times are in Vienna local time.

Start End Session
9:00 10:25 Opening Remarks, Keynotes by Barbara Plank and Leshem Choshen
10:25 10:55 Coffee Break
10:55 11:30 Talk Session 1
11:30 12:30 Poster Session Part 1
12:30 14:00 Lunch Break
14:00 15:00 Poster Session Part 2
15:00 15:30 Talk Session 2
15:30 16:00 Coffee Break
16:00 16:15 Talk Session 3
16:15 16:55 Keynote by Ehud Reiter
16:55 17:40 Panel Discussion
17:40 17:50 Closing Remarks

Keynotes

Keynote 1 - Barbara Plank

Ambiguity, Consistency and Reasoning in LLMs

ABSTRACT

Large Language Models (LLMs) are powerful yet fallible tools, often struggling with ambiguity, inconsistency, and flawed reasoning. This talk explores some of our recent research exposing these limitations in text and language-vision models. We examine how they misinterpret ambiguous entities, fail to maintain self-consistency, and exhibit biases when these issues remain unresolved. Using insights from controlled studies and new benchmarks, we dissect how models “know” but often cannot “apply” or “verify” that knowledge. We also highlight a promising intervention — vector ablation — to surgically address false refusals without sacrificing model accuracy. Together, these findings reveal the critical need for more work on nuanced evaluation and fine-grained control mechanisms in future LLM development.

BIO

Barbara Plank is Full Professor for AI and Computational Linguistics at LMU Munich, where she directs the MaiNLP lab and co-directs the Center for Information and Language Processing. She is also an ELLIS Fellow and Visiting Full Professor at IT University of Copenhagen. Her research lab focuses on human-facing NLP: making NLP models and evaluation more robust and inclusive, so that NLP can deal better with underlying shifts in data due to language variation, is fairer, and embraces human label variation.

Keynote 2 - Leshem Choshen

Evaluation at the Heart of the AI Wave

ABSTRACT

The AI wind also fills the sails of evaluation, creating a mass of evaluation work. In this talk, Leshem will present some of the most pressing open problems in evaluation and illustrate them with efforts they have participated in. These “blue sea” problems include pretraining evaluation, unified evaluation, multicultural evaluation, and contamination.

BIO

Leshem Choshen is a postdoctoral researcher at MIT and MIT-IBM, studying communal LLMs, from community-built LLMs to LLMs for humans and communities. They co-created model merging, TIES merging, and the babyLM pretraining challenge. They constantly work with the community to gather chats (please contribute), run LLM games in textArena, and pursue other efforts that call for community involvement. Throughout this work, they emphasize evaluation aspects, including reliable and efficient evaluation, tinyBenchmarks, benchmark agreement testing, and pretraining evaluation.

Keynote 3 - Ehud Reiter

We Should Evaluate Real-World Impact

ABSTRACT

The ACL community has shown very little interest in evaluating the real-world impact of deployed NLP systems. This limits the usefulness and rate of adoption of NLP in areas such as medicine. I will discuss various ways of evaluating real-world impact, and then share the results of a structured survey of the ACL Anthology, which suggests that perhaps 0.1% of its papers evaluate real-world impact; furthermore, most Anthology papers which include impact evaluations present them very sketchily and instead focus on metric evaluations. I will conclude with a discussion of when impact evaluation is appropriate, and steps the community could take to encourage it.

BIO

Ehud Reiter is a Professor of Computing Science at the University of Aberdeen and was formerly Chief Scientist of Arria NLG (a spinout he cofounded). He has been working on Natural Language Generation for 35 years, and in recent years has focused on healthcare applications and the evaluation of language generation. He is one of the most cited researchers in NLG, and his awards include an INLG Test of Time award for his work on data-to-text. He writes a widely read blog on NLG and evaluation (ehudreiter.com), and wrote a book on NLG which was published in November 2024.

Panelists

Douwe Kiela

Thiago Castro Ferreira

Pushkar Mishra

Sessions and Papers

Talk Session 1

Title | Authors
ReproNLP Shared Task Overview | Anya Belz for the ReproNLP Team (https://repronlp.github.io/)
Cleanse: Uncertainty Estimation Approach Using Clustering-based Semantic Consistency in LLMs | Minsuh Joo

Talk Session 2

Title | Authors
CoKe: Customizable Fine-Grained Story Evaluation via Chain-of-Keyword Rationalization | Brihi Joshi, Sriram Venkatapathy, Mohit Bansal, Nanyun Peng, Haw-Shiuan Chang
PapersPlease: A Benchmark for Evaluating Motivational Values of Large Language Models Based on ERG Theory | Junho Myung, Yeon Su, Sunwoo Kim, Shin Yoo, Alice Oh

Talk Session 3

Title | Authors
Psycholinguistic Word Features: a New Approach for the Evaluation of LLMs Alignment with Humans | Javier Conde, Miguel Gonzalez, Maria Grandury, Pedro Reviriego, Gonzalo Martinez, Marc Brysbaert

Poster Session - In-Person

All posters can be presented during both parts of the split poster session (with a lunch break in between).

Poster Session - Virtual

Important Dates

July 31, 2025 Workshop Date

Organization

Organizers