Explore the challenges and advancements in evaluating natural language generation systems in this 44-minute talk by Wei Xu from the Center for Language & Speech Processing at JHU. Delve into a comparison of GPT models and human performance on constrained text generation tasks, focusing on paraphrase generation and text simplification. Learn about the Rank-and-Rate evaluation framework and see how GPT-3.5 compares with a fine-tuned T5 model and with human performance. Examine the limitations of existing automatic evaluation metrics and understand the potential of LENS, a learnable evaluation metric that outperforms current methods both in automatic evaluation and in minimal risk decoding for text generation.