Natural Language Generation – The Task

Natural Language Generation (NLG) is the task of translating structured data into natural language. For instance, imagine that you have a large amount of structured financial data that is hard to analyze. You can use NLG to translate this data into a text that summarizes it. With the ever-increasing amount of structured data available, the demand for NLG is rising steadily. Other applications include the automated generation of sports articles, where structured data from the games is used as a basis. Similarly, NLG tools can generate product descriptions on e-commerce websites from product information. Chatbots also rely on NLG to translate the current internal state of the chatbot into natural language.

This Post

Most commercial systems rely on large rule-based systems. In contrast, current research is working on data-driven approaches with the goal of learning the translation rules from training data. In recent years, the research community has focused on deep learning approaches, which have shown promising results. However, these approaches are far from perfect and are only applicable to small problems. There are still many open challenges, such as hallucination (i.e. the tendency of systems to generate texts that contain information that is not in the input data), evaluation (i.e. the problem of how to evaluate the generated texts automatically), and variability (i.e. the tendency of systems to generate repetitive utterances).

In this work, we address the issue of variability. We noticed that current state-of-the-art approaches generate utterances that all share the same structure. As a result, the utterances start to sound monotonous very quickly. This might not be an issue if you are just interested in asking for your balance. However, for scenarios where there is a need for professional-sounding texts (e.g. automated journalism), variability is an important issue.

In this article, we present a simple approach to NLG based on the so-called Semantically Conditioned LSTM proposed by Wen et al. in 2015. We show how we exploited the architecture and the properties of the data to generate many diverse utterances for the same data. We base this article on our paper, which we presented at the INLG 2018 conference. This article serves to introduce the high-level idea. In part two, we present an in-depth code walkthrough. We show how to preprocess the data, train the deep neural network, and analyze the output.

The Data

We use the data from the E2E-NLG challenge, which provides a very nice dataset in the restaurant domain. It is very large: with 50,000 pairs of meaning representations (i.e. structured data about a restaurant) and corresponding utterances (i.e. descriptions of said restaurant), it provides a large variety of different formulations. Comparable datasets provide only a fraction of this data. The reader can find more details on the data and the challenge here.

Example from the Training Data

Let us look at an example from the training data. Each meaning representation consists of a list of attribute-value pairs. The attributes describe the various features of the restaurant. Each attribute has a predefined set of possible values it can assume. In the example below, we see four different human-written formulations for the same meaning representation. We note that the descriptions show a considerable amount of variance. For instance, Ref 4 refers to the restaurant as a “dining establishment”, while Ref 3 mentions the name of the restaurant only in the second sentence.

MR name=Alimentum, food=Chinese, priceRange=high, area=riverside, familyFriendly=no
Ref1 For a high priced Chinese food, adult only Alimentum is located in riverside.
Ref2 The Alimentum offers Chinese for in the high price range in the riverside area but is not child friendly.
Ref3 There is a child friendly venue located in riverside with a high price range. It is called Alimentum and provides Chinese food.
Ref4 Alimentum is a Chinese dining establishment offering high priced menu options. It is located in the Riverside, and is not child friendly.
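
To make the structure of a meaning representation concrete, here is a minimal sketch that turns the string shown above into attribute-value pairs. It assumes the `attribute=value` notation used in this post and that values never contain commas; the function name `parse_mr` is ours and purely illustrative.

```python
def parse_mr(mr: str) -> dict[str, str]:
    """Split a meaning representation string into attribute-value pairs.

    Assumes the comma-separated `attribute=value` notation used in this post
    and that values do not contain commas.
    """
    pairs = {}
    for part in mr.split(","):
        attribute, _, value = part.strip().partition("=")
        pairs[attribute.strip()] = value.strip()
    return pairs


mr = parse_mr(
    "name=Alimentum, food=Chinese, priceRange=high, area=riverside, familyFriendly=no"
)
print(mr["name"], mr["familyFriendly"])
```

A plain dictionary is enough here because each attribute appears at most once per meaning representation in the examples shown.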

Generated Examples

The training data exhibits a large amount of variety, which stands in stark contrast to what current state-of-the-art algorithms generate. To emphasize this point, let us look at some examples of generated texts from a current state-of-the-art NLG system (see here). First, we look at some short texts. We note that the utterances all follow exactly the same template. Although one may argue that it is hard to introduce variety in these short descriptions, humans did just that. In the right column, we see the human-written version for the same meaning representation, which clearly shows more variety.

Generated Examples | Human Alternatives
Cocum is a pub near The Rice Boat. | For a coffee shop near The Rice Boat, try Cocum.
Cocum is a pub near The Sorrento. | Cocum is a coffee shop by The Sorrento.
Giraffe is a pub near The Rice Boat. | Near The Rice Boat you can visit a coffee shop called Giraffe.
Giraffe is a pub near The Bakers. | Giraffe is a coffee shop found near The Bakers.

Generated Long Examples

Similarly, when we look at longer generated examples (see Table below), we see the same monotonous structures. Overall, we observe a few recurring patterns:

  • The utterance always starts by stating the restaurant name, which forces the sentence structures to be very similar across all utterances.
  • In case the utterance is composed of multiple sentences, the follow-up sentences all start with a pronoun referring to the restaurant (i.e. “It is” or “Its”).
  • Although there exist many different ways of expressing an attribute-value pair, the generated utterances rely on only one formulation for each attribute-value pair.

Generated Examples (long)
The Cricketers is located near Ranch. It is not family-friendly and has a low customer rating.
The Cricketers provides Chinese food in the £20-25 price range. It is located in the city center. It is near All Bar One. Its customer rating is high.
The Mill is a low-priced pub in the city center near Raja Indian Cuisine. It is not family-friendly.
Wildwood is a family friendly pub serving French food. It is located in riverside near Raja Indian Cuisine.

Proposed Syntactic Manipulations

We have seen that the generated utterances follow a rather monotonous structure. For this reason, we derived three syntactic manipulations in order to increase the variability.

  • Manipulate the first word of the utterance. This step has the largest effect on the utterance structure, since it affects the sequence of attributes. Above all, it allows for utterances that start with parts of speech other than nouns.
  • Manipulate the first word of follow-up sentences in the utterances. Like the manipulation above, it allows for more variety in the sentence structure.
  • Use different formulations for rendering attribute-value pairs.

We base our reasoning for this conditioning on the observation that the neural network methods generate the most common structure. If we look at the frequencies of the different first words in the training data (see Image below), we see that in over 50% of the cases the utterance begins with the restaurant name (X-name). Thus, our conditioning teaches the neural network the correlation between the aforementioned manipulations and the desired formulations.
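
The first-word statistic can be approximated with a few lines of code. The following is only a sketch under our own assumptions: whitespace tokenization, non-empty utterances, and a simple string replacement of the restaurant name by the placeholder `X-name`. The helper names are illustrative and not taken from our code base.

```python
from collections import Counter


def first_word(utterance: str, name: str) -> str:
    """Return the first token, with the restaurant name replaced by X-name."""
    delexicalized = utterance.replace(name, "X-name")
    return delexicalized.split()[0]


def first_word_distribution(examples):
    """Relative frequency of each first word over (name, utterance) pairs."""
    counts = Counter(first_word(utterance, name) for name, utterance in examples)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.most_common()}


# The four references from the training example above.
examples = [
    ("Alimentum", "For a high priced Chinese food, adult only Alimentum is located in riverside."),
    ("Alimentum", "The Alimentum offers Chinese for in the high price range in the riverside area but is not child friendly."),
    ("Alimentum", "There is a child friendly venue located in riverside with a high price range."),
    ("Alimentum", "Alimentum is a Chinese dining establishment offering high priced menu options."),
]
print(first_word_distribution(examples))
```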


The main issue with this approach is that some formulations conflict with the meaning representation. For instance, when starting the utterance with the word “Family” (Utt2 below), the neural network hallucinates the additional information that the location is family friendly, although the meaning representation does not state anything about family friendliness. In the second example (Utt3 below), starting the utterance with “In” leads to the hallucination of the location being in the city center, which is not supported by the meaning representation.

This problem stems from the fact that we do not have access to the correct formulations during test-time. In fact, the test set only contains the meaning representation.

MR name=Blue Spice, eatType=coffee shop, customer rating=5 out of 5, near=Crowne Plaza Hotel
Utt1 Blue Spice is a coffee shop near Crowne Plaza Hotel with a 5 out of 5 customer rating.
Utt2 Family friendly coffee shop Blue Spice is located near Crowne Plaza Hotel and has a customer rating of 5 out of 5
Utt3 In the city center near Crowne Plaza Hotel is a coffee shop called Blue Spice. It has a customer rating of 5 out of 5


To solve this, we propose an over-generation and re-ranking approach. For each of the three manipulation types above, we sample 10 different candidates. Then, we let the neural network generate an utterance for each combination of these candidates. This means we effectively generate 10*10*10 = 1000 utterances for each meaning representation. Then we re-rank the utterances based on their correctness, filtering out those which are incorrect.
For this, we train standard classifiers, one for each attribute. The purpose of these classifiers is to classify the value which the generation network rendered in a given utterance. We can use these classifiers by letting them classify the rendered values and compare them to the meaning representation.
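
The over-generation and re-ranking loop can be sketched as follows. This is a toy illustration, not our implementation: `toy_generate` and `toy_classifiers` are hypothetical stand-ins for the trained generation network and the per-attribute classifiers, and the error count is one simple way to express "correctness".

```python
from itertools import product


def count_errors(mr, utterance, classifiers):
    """Number of attributes whose classified value differs from the MR."""
    errors = 0
    for attribute, classify in classifiers.items():
        if classify(utterance) != mr.get(attribute):
            errors += 1
    return errors


def overgenerate_and_rerank(mr, first_words, followup_words, formulations,
                            generate, classifiers):
    """Generate one utterance per candidate combination, rank by error count,
    and keep only the utterances without errors."""
    scored = []
    for combo in product(first_words, followup_words, formulations):
        utterance = generate(mr, *combo)
        scored.append((count_errors(mr, utterance, classifiers), utterance))
    scored.sort(key=lambda pair: pair[0])
    return [utterance for errors, utterance in scored if errors == 0]


# Toy stand-ins (hypothetical): a real system would call the trained networks.
def toy_generate(mr, first_word, followup_word, formulation):
    if first_word == "Family":
        return f"Family friendly {mr['eatType']} {mr['name']} {formulation} {mr['near']}."
    return f"{mr['name']} is a {mr['eatType']} that {formulation} {mr['near']}."


toy_classifiers = {
    # Each classifier returns the rendered value, or None if it is not mentioned.
    "familyFriendly": lambda utt: "yes" if "family friendly" in utt.lower() else None,
    "eatType": lambda utt: "coffee shop" if "coffee shop" in utt else None,
}

mr = {"name": "Blue Spice", "eatType": "coffee shop", "near": "Crowne Plaza Hotel"}
candidates = overgenerate_and_rerank(
    mr,
    first_words=["X-name", "Family"],
    followup_words=["It"],
    formulations=["is located near", "can be found near"],
    generate=toy_generate,
    classifiers=toy_classifiers,
)
print(candidates)
```

With 10 candidates per manipulation type, `product` yields the 10*10*10 = 1000 combinations mentioned above. In the toy run, the utterances starting with “Family” are dropped because the family-friendliness classifier detects a value that the meaning representation does not contain.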

Cherry-Picked Examples

Finally, we look at what our neural network produces.

MR name=The Punter, eatType=pub, food=English, priceRange=high, area=city-centre, familyFriendly=no, near=Raja Indian Cuisine
Baseline The Punter is a pub that serves English food in the high price range and is located in the city centre near Raja Indian Cuisine.
Our System If you are looking for a pub serving English food, try The Punter. It is located in the city centre near Raja Indian Cuisine. Prices are on the higher end and it is not child friendly.

MR name=Giraffe, eatType=restaurant, food=French, area=riverside, familyFriendly=yes, near=Raja Indian Cuisine
Baseline Giraffe is a family friendly restaurant that serves French food. It is located near Raja Indian Cuisine.
Our System In the riverside area there is a French restaurant called Giraffe. You will find it near Raja Indian Cuisine. Yes, it is family friendly.

We see that our solution indeed generates utterances that are more diverse. However, some cherry-picked examples are not sufficient to judge the quality of a solution. Thus, we also ran two evaluations of the generated utterances: we automatically evaluated the variety of the utterances by means of lexical complexity measures; and we performed a human evaluation where judges provide ratings on various dimensions.

Automated Evaluation

In order to measure the impact of the syntactic manipulations, we apply various measures of lexical complexity to the outputs of some baseline systems, our system, and the human-written texts (see Table below). In essence, these measures implement different ways of counting the number of different words used in a text.

For instance, our system uses 224 different tokens, which is significantly more than the baseline, but still fewer than the humans. We see the same pattern for all the other measures as well. For instance, for the type-token ratio (TTR), which is the ratio of the number of different words to the total number of words in a text, our system almost doubles the baseline. The moving-average TTR (MATTR), which normalizes for the different lengths of texts, shows that our system is close to the human variety. This phenomenon is due to the fact that the human-written texts have multiple references for the same meaning representation. Finally, the measure of textual lexical diversity (MTLD), which measures the lexical variety, also shows that our system produces a high degree of variability.
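
To illustrate the first two measures, here is a minimal sketch of TTR and MATTR. It assumes lowercased whitespace tokenization and non-empty texts. The default window size of 50 is our assumption for the example, not a value reported in the evaluation.

```python
def type_token_ratio(tokens):
    """Number of distinct tokens divided by the total number of tokens."""
    return len(set(tokens)) / len(tokens)


def moving_average_ttr(tokens, window=50):
    """Mean TTR over all sliding windows of a fixed length.

    Falls back to the plain TTR if the text is not longer than one window.
    """
    if len(tokens) <= window:
        return type_token_ratio(tokens)
    ratios = [
        type_token_ratio(tokens[start:start + window])
        for start in range(len(tokens) - window + 1)
    ]
    return sum(ratios) / len(ratios)


text = "Cocum is a pub near The Rice Boat . Giraffe is a pub near The Rice Boat ."
tokens = text.lower().split()
print(len(set(tokens)), type_token_ratio(tokens), moving_average_ttr(tokens, window=9))
```

Because every window has the same length, MATTR does not penalize longer texts the way the plain TTR does, which is why it is the fairer measure when comparing systems that produce outputs of different lengths.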

Human Evaluation

For the human evaluation, we found it hard to find a setting where humans would be able to assess the diversity of the utterances. The main problem is that we would need to expose the judges to utterances of the same system several times. We therefore opted for an extrinsic approach. For this, we provide the judge with two utterances for the same meaning representation: one utterance by the baseline system and one by our system. We ask the judges five questions: which of the two utterances they prefer (preference), which is easier to understand (comprehensibility), which is more concise (conciseness), which they judge to be more elegant (elegance), and which of the two they deem more professional (professional).

The results show that there is no significant difference regarding preference and comprehensibility. However, there are huge differences regarding the other three dimensions.

In 75% of the cases, the judges said that the utterances which the baseline produced are more concise. This is in line with our expectations, as our system tends to generate utterances that are more complicated. On the other hand, the judges rated the utterances of our system to be more elegant and professional sounding.


Increasing the variety of automatically generated texts is quite a challenging problem. With our system, we showed that it is possible to train systems that reflect this variance, provided there is enough variability in the data.

In the next post, we show how to implement this system with Keras. In case you cannot wait for the next blog entry, you can just clone the code from GitHub and play around with it yourself.
