Event Type: Call for Papers
Conference: ACL 2026
Date: July 4th, 2026 (All day)
Location: San Diego, California, USA
Contact: gem-workshop-chairs@googlegroups.com


Overview

The fifth edition of the Natural Language Generation, Evaluation, and Metrics (GEM) Workshop will be at ACL 2026 in San Diego!

Evaluation of language models has grown to be a central theme in NLP research, while remaining far from solved. As LMs have become more powerful, errors have become tougher to spot and systems harder to distinguish. Evaluation practices are evolving rapidly, from living benchmarks like Chatbot Arena to LMs being used as evaluators themselves (e.g., LM as judge, autoraters). Further research is needed to understand the interplay between metrics, benchmarks, and human-in-the-loop evaluation, and their impact in real-world settings.

Topics of Interest

We welcome submissions related to, but not limited to, the following topics:

Special Tracks

Opinion and Statement Papers Track (New!)

We are introducing a special track for opinion and statement papers. These submissions will be presented in curated panel discussions, encouraging open dialogue on emerging topics in evaluation research.

We welcome bold, thought-provoking position papers that challenge conventional wisdom, propose new directions for the field, or offer critical perspectives on current evaluation practices. This track is an opportunity to spark discussion and debate: submissions need not present new empirical results but should offer well-argued viewpoints, supported by scientific evidence (e.g., prior studies), that advance our collective thinking about evaluation.

ReproNLP

The ReproNLP Shared Task on Reproducibility of Evaluations in NLP has been run for six consecutive years (2021–2026). ReproNLP 2026 will be part of the GEM Workshop at ACL 2026 in San Diego. It aims to (i) shed light on the extent to which past NLP evaluations have been reproducible, and (ii) draw conclusions regarding how NLP evaluations can be designed and reported in order to increase reproducibility. Participants submit reports on their reproductions of human evaluations from previous NLP literature, in which they quantitatively assess the degree of reproducibility using methods described in Belz (2025). More details can be found in the first call for participation for ReproNLP 2026 at https://repronlp.github.io.

Workshop Format

We aim to organize the workshop in an inclusive, highly interactive, and discussion-driven format. Paper presentations will center on themed poster sessions that allow presenters to interact with researchers who share their interests but come from varied backgrounds. The workshop will feature panels on emerging topics and multiple short keynotes by leading experts.

🎭 GEM Comic-Con Edition!

In the spirit of San Diego’s famous Comic-Con (July 23-26), this year’s GEM will be a special Comic-Con edition! We encourage participants to embrace creativity! Whether that’s through themed poster designs, comic-style slides, or dressing up as your favorite evaluation metric personified, we want this year’s workshop to be memorable and fun!

Submission Types

Submissions can take any of the following forms:

Submission Guidelines

Important Dates

Programme

More details on main page!

Time | Session/Authors | Title
08:55-09:10 | Opening remarks |
09:10-10:20 | Oral session #1 |
09:10-09:50 | Invited talk #1: Vered Shwartz | Follow the Evidence: Diagnosing the What, Where, and Why of Generative Model Failures
09:50-10:05 | Erfan Nourbakhsh, Mohammad Sadegh, Seyed Amir, Khoa Nguyen, John Quarles, Mimi Xie, Rocky Slavin | Are LLM Benchmarks Already Contaminated? A Systematic Review of Contamination Detection Methods
10:05-10:20 | Craig Thomson, Javier González, Anya Belz | Process Standardisation for Human Evaluation of NLP System Outputs
10:20-10:50 | Coffee break |
10:50-11:30 | Oral session #2 |
10:50-11:15 | Anya Belz, Craig Thomson, Javier González Corbelle | The Shared Task on Reproducibility of Evaluations in NLP (ReproNLP) 2026: Overview and Results
11:15-11:30 | Davan Harrison, Marilyn Walker | Cross-Domain Semantic Fidelity Evaluation for Meaning-to-Text Generation
11:35-12:35 | Poster session #1 |
 | See list of authors below | All posters, see list of papers below
12:35-13:55 | Lunch break |
13:55-14:55 | Poster session #2 |
 | See list of authors below | All posters, see list of papers below
15:00-15:40 | Oral session #3 |
15:00-15:40 | Invited talk #2: Swabha Swayamdipta | Small Samples, Big Reveal: What can we learn from limited observations of language model behavior?
15:40-16:10 | Coffee break |
16:10-17:20 | Oral session #4 |
16:10-16:25 | Avni Mittal, Rauno Arike | C2-Faith: Benchmarking LLM Judges for Causal and Coverage Faithfulness in Chain-of-Thought Reasoning
16:25-16:40 | Zefang Liu, Yinzhu Quan | EconWebArena: Benchmarking Autonomous Agents on Economic Tasks in Realistic Web Environments
16:40-17:20 | Invited talk #3: Chris Callison-Burch | Autorubric: A Unified Framework for Rubric-Based LLM Evaluation
17:20-17:30 | Closing session |

Organizing committee

Contact

For any questions, please check the workshop page or email the organisers: gem-workshop-chairs@googlegroups.com