Referring Expression Problem

Referring expression generation is a more domain-specific form of image captioning whose goal is to describe a sub-region of a given image. The Rational Speech Act (RSA) framework is a probabilistic reasoning approach that generates sentences by modeling communication as a game-theoretic exchange between a speaker and a listener. The advantage of RSA is its explainability: it can answer the question of why a speaking agent chooses a specific word or phrase over another. Can RSA be applied to the referring expression problem to generate better or more explainable descriptions?
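
To make the speaker-listener reasoning concrete, here is a minimal sketch of one round of RSA on a toy reference game. Everything in it is an assumption made for illustration: the three regions, the four one-word utterances, the hand-written truth table `lexicon`, the uniform prior over regions, the absence of utterance costs, and the rationality parameter `alpha`. It is not a captioning model, only the basic recursion (literal listener, then pragmatic speaker).

```python
import numpy as np

# Hypothetical toy scene: three image regions and four candidate words.
regions = ["blue square", "blue circle", "green square"]
utterances = ["blue", "green", "square", "circle"]

# lexicon[u, r] = 1 if utterance u is literally true of region r.
lexicon = np.array([
    [1, 1, 0],  # "blue"
    [0, 0, 1],  # "green"
    [1, 0, 1],  # "square"
    [0, 1, 0],  # "circle"
], dtype=float)


def normalize(m, axis):
    """Scale m so it sums to 1 along `axis` (assumes no all-zero slices)."""
    return m / m.sum(axis=axis, keepdims=True)


def rsa(lexicon, alpha=1.0):
    """One RSA step with a uniform region prior and no utterance cost."""
    # Literal listener: P(region | utterance), uniform over true regions.
    literal_listener = normalize(lexicon, axis=1)
    # Pragmatic speaker: P(utterance | region), proportional to how well
    # the literal listener would recover the intended region.
    pragmatic_speaker = normalize(literal_listener ** alpha, axis=0)
    return literal_listener, pragmatic_speaker


literal_listener, pragmatic_speaker = rsa(lexicon)

# Which word would the speaker most likely pick for the blue circle?
target = regions.index("blue circle")
best_word = utterances[int(np.argmax(pragmatic_speaker[:, target]))]
print(best_word)
```

In this toy setup the speaker prefers "circle" (probability 2/3) over "blue" (1/3) for the blue circle, because a literal listener hearing "blue" would be split between two regions. That comparison of probabilities is the kind of explanation the paragraph above refers to.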