The workshop will be held on August 6 as part of ACL-IJCNLP 2021 (August 1-6, 2021). It is endorsed by the ACL Special Interest Group on Natural Language Generation (SIGGEN).

Note: Our system output submission form remains open indefinitely; please continue contributing to our benchmark. If you want to help improve GEM in the future, join our team.

Workshop Overview

Natural language generation is one of the most active research fields in NLP, with generation, summarization, and dialog among the most submitted-to tracks. As such, the number of available datasets, metrics, models, and evaluation strategies is increasing rapidly. This leads to a situation in which new models are often evaluated on different Anglo-centric tasks with incompatible evaluation setups. With GEM, we aim to solve this problem by standardizing and improving the corpora on which NLG models are evaluated, and by supporting the development of better evaluation approaches. Submitted papers analyze the state of NLG evaluation and propose better alternatives. Moreover, we are organizing the living GEM benchmark, which incorporates new advances in data and in human and automatic evaluation to make it easier to evaluate models on challenging tasks with the right tools. In our shared task, models were applied to up to 11 tasks spanning 18 languages and 80 challenge sets, and their outputs were characterized using a combination of human evaluation and over 50 automatic metrics. Through the presented papers and the shared task, we aim to uncover shortcomings and opportunities for progress.

Schedule

All times are in UTC; please use a converter like this one to convert to your local time.

We do not distinguish between workshop papers and Findings of the ACL papers that are being presented - they are all great!

If you want to suggest questions to the panels, please submit and vote here.

Time (UTC) Session
11:30 - 12:00 Welcome and Explanation of Logistics (Recording)
12:00 - 13:00 Poster Session
  Evaluating the Efficacy of Summarization Evaluation across Languages
Fajri Koto, Jey Han Lau, and Timothy Baldwin
  Automatic Text Simplification for Social Good: Progress and Challenges
Sanja Stajner
  Flesch-Kincaid is Not a Text Simplification Evaluation Metric
Teerapaun Tanprasert and David Kauchak
  Human Perception in Natural Language Generation
Lorenzo De Mattei, Huiyuan Lai, Felice Dell’Orletta, and Malvina Nissim
  Semantic Similarity Based Evaluation for Abstractive News Summarization
Figen Beken Fikri, Kemal Oflazer, and Berrin Yanikoglu
  Shades of BLEU, Flavours of Success: The Case of MultiWOZ
Tomáš Nekvinda and Ondřej Dušek
13:00 - 13:45 Panel Discussion with Hady Elsahar, Seraphina Goldfarb-Tarrant, He He, and Ehud Reiter
Suggest questions here. (Recording)
13:45 - 14:00 Break
14:00 - 15:00 Talk Session (Recording)
  Personalized Response Generation with Tensor Factorization
Zhenghui Wang, Lingxiao Luo, and Diyi Yang
  A Review of Human Evaluation for Style Transfer
Eleftheria Briakou, Sweta Agrawal, Ke Zhang, Joel Tetreault, and Marine Carpuat
  GOT: Testing for Originality in Natural Language Generation
Jennifer Brooks and Abdou Youssef
  Evaluating Text Generation from Discourse Representation Structures
Chunliu Wang, Rik van Noord, Arianna Bisazza, and Johan Bos
15:00 - 16:00 Poster Session
  Detecting Hallucinated Content in Conditional Neural Sequence Generation
Chunting Zhou, Graham Neubig, Jiatao Gu, Mona Diab, Paco Guzman,
Luke Zettlemoyer, and Marjan Ghazvininejad
  Synthesizing Adversarial Negative Responses for Robust Response Ranking and Evaluation
Prakhar Gupta, Yulia Tsvetkov, and Jeffrey Bigham
  Perceptual Models of Machine-Edited Text
Elizabeth Merkhofer, Monica-Ann Mendoza, Rebecca Marvin, and John Henderson
  Improving Automated Evaluation of Open Domain Dialog via Diverse Reference Augmentation
Varun Gangal, Harsh Jhamtani, Eduard Hovy, and Taylor Berg-Kirkpatrick
  XL-Sum: Large-Scale Multilingual Abstractive Summarization for 44 Languages
Tahmid Hasan, Abhik Bhattacharjee, Md. Saiful Islam, Kazi Mubasshir, Yuan-Fang Li,
Yong-Bin Kang, M. Sohel Rahman, and Rifat Shahriyar
  Human Evaluation of Creative NLG Systems: An Interdisciplinary Survey on Recent Papers
Mika Hämäläinen and Khalid Alnajjar
16:00 - 17:00 Keynote by Asli Celikyilmaz (Recording)
Are language models enough for narrative coherence?
Abstract: Automatic text generation enables computers to summarize online meetings, write stories or articles about an event, hold customer-service conversations, chit-chat with individuals, describe pictures to the visually impaired, and perform similar tasks. In this talk, I will discuss the challenges and shortcomings of building such systems with current neural text generation models, focusing on issues related to modeling discourse structure and narrative flow. I will present our recent approaches that imbue transformer-based neural generators with structural representations by way of implicit memory architectures and latent structural embeddings. I will conclude the talk by pointing to avenues for future research.
17:00 - 17:45 Panel Discussion with Anya Belz, Asli Celikyilmaz, Mike Lewis, Lisa Li, and Wang Lu
Suggest questions here. (Recording)
17:45 - 18:00 Break
18:00 - 19:00 GEM Overview Session
  The GEM Benchmark: Natural Language Generation, its Evaluation and Metrics
Everyone listed on the GEM team page
  Reusable Templates and Guides For Documenting Datasets and Models for Natural Language
Processing and Generation: A Case Study of the HuggingFace and GEM Data and Model Cards
Angelina McMillan-Major, Salomey Osei, Juan Diego Rodriguez, Pawan Sasanka Ammanamanchi,
Sebastian Gehrmann, and Yacine Jernite
  Preliminary Results of the GEM Shared Task
GEM Organizers
  NL-Augmenter: A Collaborative Effort to Transform and Filter Text Datasets
Kaustubh Dhole, Sebastian Gehrmann, Jascha Sohl-Dickstein, Varun Prashant Gangal,
Tongshuang Wu, Simon Mille, Zhenhao Li, Aadesh Gupta, Samson Tan, Saad Mahmood,
Ashish Shrivastava, Ondrej Dusek, and Jinho D. Choi
19:00 - 20:00 GEM System Session
  Structure-to-Text Generation with Self-Training, Acceptability Classifiers and Context-Conditioning
for the GEM Shared Task
Shreyan Bakshi, Soumya Batra, Peyman Heidari, Ankit Arun, Shashank Jain, and Michael White
  NUIG-DSI’s submission to The GEM Benchmark 2021
Nivranshu Pasricha, Mihael Arcan, and Paul Buitelaar
  System Description for the CommonGen task with the POINTER model
Anna Shvets
  SimpleNER Sentence Simplification System for GEM 2021
K V Aditya Srivatsa, Monil Gokani, and Manish Shrivastava
20:00 - 21:00 Poster Session
  GO FIGURE: A Meta Evaluation of Factuality in Summarization
Saadia Gabriel, Asli Celikyilmaz, Rahul Jha, Yejin Choi, and Jianfeng Gao
  TellMeWhy: A Dataset for Answering Why-Questions in Narratives
Yash Kumar Lal, Nathanael Chambers, Raymond Mooney, and Niranjan Balasubramanian
  Is Human Scoring the Best Criteria for Summary Evaluation?
Oleg Vasilyev and John Bohannon
  Are Larger Pretrained Language Models Uniformly Better? Comparing Performance at the Instance Level
Ruiqi Zhong, Dhruba Ghosh, Dan Klein, and Jacob Steinhardt
  Elaborative Simplification: Content Addition and Explanation Generation in Text Simplification
Neha Srikanth and Junyi Jessy Li
  Decoding Methods for Neural Narrative Generation
Alexandra DeLucia, Aaron Mueller, Xiang Lisa Li, and João Sedoc

Important Dates

Workshop

February 2 First Call for Shared Task Submissions and Papers, Release of the Training Data

May 3 Workshop Paper Due Date (excl. shared tasks) UPDATED

May 28 Notification of Acceptance (excl. shared tasks)

June 7 Camera-ready papers due (excl. shared tasks)

Shared Task Dates

Modeling

February 2 Release of the Training Data

March 29 Release of the test sets

May 14 Modeling submissions due

June 11 System Descriptions and Analyses due

June 25 Notification of Acceptance (shared task)

July 9 Camera-ready papers and task descriptions due

August 5-6 Workshop Dates

Organization

The workshop is organized by

The shared task and the GEM environment are organized by a larger team, which is listed on this page.