Document Summaries in Danish with OpenAI

Still a need for a human in the loop

Tore G. C. Rich
Generative AI in the Newsroom
8 min read · May 4, 2023


Example of a published automatically generated article. In the blue box on the right, you can see an AI-generated summary of a management report.

Introduction

On Denmark’s most visited news site Ekstrabladet.dk, we publish automatically generated local news articles in Danish as a part of the Platform Intelligence in News (PIN) project. In these automatically generated articles we cover topics including companies’ financial reports, inspection reports for food establishments, and real-estate sales. The articles are generated with rule-based NLG (Natural Language Generation) using data retrieved automatically via APIs.

Yet this rule-based approach using structured data doesn’t offer the complete story that we want to share with our readers. For instance, companies’ financial reports often include commentary from management explaining why revenue, results, and equity turned out the way they did. This is important context to include in our articles. At the same time, these documents are often long, not always well written, and sometimes in English.

To incorporate information from these documents into our automated articles, we have been experimenting with the latest GPT models (text-davinci-003, gpt-3.5 turbo and gpt-4) from OpenAI to generate clean, well-written summaries in Danish to include in the articles. However, these AI-generated summaries are not always acceptable according to our editorial standards. And while we publish the rule-based auto-generated articles directly on Ekstrabladet.dk, we do not add the AI-generated summaries to the articles until they have been reviewed by a human.

I hope to reach a point where we can trust the summaries and publish them directly without a human in the loop, but we’re not there yet. In the rest of this post I examine the quality of these AI-generated summaries and the ongoing need for a human in the loop.

Setup

The rule-based automated articles on financial reports were automatically generated and published on Ekstrabladet.dk. For this, I built a pipeline called MAGNA (Monitoring and Auto-Generation of News Articles) to automate the process and to monitor and collect data from APIs, and I used online NLG software from AX Semantics to generate the articles from the collected data. At the same time, a summary was automatically generated via the OpenAI API. We always used the latest available GPT model and set the temperature to 0.5.
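As an illustration, here is a minimal sketch of what such a summarization call could look like with the pre-1.0 openai Python library that was current at the time. The helper name and prompt wording are illustrative, not the exact prompts used in MAGNA:

```python
import os
import openai

# Assumes the pre-1.0 openai Python package (e.g. 0.27.x) and an API key in the environment.
openai.api_key = os.environ["OPENAI_API_KEY"]

def summarize_management_report(report_text: str, model: str = "gpt-4") -> str:
    """Ask a chat model for a sober Danish summary of a management report.

    The prompt wording here is an illustrative approximation, not MAGNA's actual prompt.
    """
    response = openai.ChatCompletion.create(
        model=model,
        temperature=0.5,  # the temperature setting used in the project
        messages=[
            {"role": "system",
             "content": "You are a skilled journalist writing for a large Danish newspaper."},
            {"role": "user",
             "content": ("Write a sober summary in Danish of the following "
                         "management report:\n\n" + report_text)},
        ],
    )
    return response["choices"][0]["message"]["content"].strip()
```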

For the human in the loop, I built a user interface in MAGNA for evaluating the summaries, leading to an “accept” or “reject” decision. Each rejected summary was accompanied by a reason for its rejection. For each rejected summary, a new summary was generated; however, the reason for the rejection was not given to the model.
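To make the workflow concrete, here is a minimal sketch of how the accept/reject loop can be modelled. The record fields and function names are hypothetical, not MAGNA’s actual schema:

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Callable, Optional

@dataclass
class SummaryReview:
    """One human review decision (field names are illustrative)."""
    article_id: str
    summary_text: str
    model: str
    accepted: bool
    reject_reason: Optional[str] = None
    reviewed_at: datetime = field(default_factory=datetime.utcnow)

def review_until_accepted(report_text: str,
                          generate: Callable[[str], str],
                          review: Callable[[str], SummaryReview]):
    """Regenerate summaries until the human reviewer accepts one.

    Note that `generate` receives the same input on every attempt: in the
    setup described above, the rejection reason was not fed back to the
    model, so repeated attempts could repeat the same error.
    """
    attempts = []
    while True:
        summary = generate(report_text)
        decision = review(summary)  # human accepts or rejects, with a reason
        attempts.append(decision)
        if decision.accepted:
            return summary, attempts
```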

The prompt was continuously updated with learnings from the failed summaries, which I explain further below.

Data overview

The data comes exclusively from summaries of management reports in financial reports in the period 20 January to 11 April 2023. In this period, 92 articles were given an accepted summary. 54 of the articles received an acceptable summary on the first attempt (59%), while the remaining 38 articles took several attempts. A total of 193 summaries were generated, of which 101 were rejected. We did not give up on any articles.

The 101 rejected summaries fell into the following categories [number of articles] (number of summaries):

  • Missing important information [14] (28)
  • Incorrect summary [14] (23)
  • AI evaluates content [9] (17)
    (e.g. the model states that the company has had a successful year)
  • Poor language [9] (12)
  • Irrelevant information [4] (6)
  • AI adds own content [2] (7)
  • Wrong translation [2] (4)
  • Mentioning what is missing [1] (2)
    (e.g. the model adds: “the company does not provide further information about the reason for the unsatisfactory result”)
  • AI not aware of public knowledge [1] (1)
    (the model has limited knowledge of the world and events after 2021, which is evident in the summary. In this case, the model was unaware of a fatal accident that had been widely publicized.)
  • Summary too long [1] (1)

Analysis

In the end, 100% of the articles received a summary; however, only 59% of the articles had a summary accepted on the first attempt. The remaining 41% went through 2 to 12 attempts before an acceptable summary could be included in the article.
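These percentages follow directly from the counts in the data overview; a quick check of the arithmetic:

```python
articles = 92          # articles that ended up with an accepted summary
first_attempt = 54     # accepted on the first attempt
summaries = 193        # all generated summaries
rejected = 101         # rejected summaries

accepted = summaries - rejected                                 # 92, one per article
print(f"first-attempt rate: {first_attempt / articles:.0%}")    # ~59%
print(f"accepted share:     {accepted / summaries:.0%}")        # ~48%
print(f"rejected share:     {rejected / summaries:.0%}")        # ~52%
```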

A total of 193 summaries were generated, of which 48% were accepted and 52% rejected. However, since the model was not told why a summary was rejected, follow-up summaries often contained the same errors. Take the article that needed 12 attempts before a summary was accepted: six times a summary was rejected because text-davinci-003 included the same evaluation (in this case, the model wrote that the company had had a strong year), and in the five other rejected summaries the same information was missing. With a temperature setting of 0.5, the model was expected to generate variation by also sampling less likely tokens. If lower-probability tokens had produced the erroneous parts of a summary, there should have been a chance that those parts would change in a new summary.

In the case of the one article with 11 summary attempts (again text-davinci-003): eight times the summary was missing the same word, “Covid 19”; one summary contained an evaluation; and one summary had language so poor that part of the text was unclear.

In general, the main reasons for rejection were incorrect summaries, i.e. summaries containing errors of some kind, and summaries missing important information, i.e. information present in the original text that was erroneously omitted.

But summaries were also often rejected due to poor language or unwanted evaluation of the content. We did not assess whether the model’s evaluation of the content was correct (e.g. that the company had had a successful financial year); because we had asked for a sober (Danish: “nøgternt”) summary, any evaluation made by the model was counted as an error.

Model comparison

As mentioned, we went through three GPT models in the testing period: text-davinci-003, gpt-3.5 turbo and gpt-4. Most of the summaries (73%) were generated with text-davinci-003, a small share (6%) with gpt-3.5 turbo, and a larger share (21%) with gpt-4.

Looking at the summaries generated by each model, text-davinci-003 showed a much higher rejection rate than both gpt-3.5 turbo and gpt-4: 57% of its summaries were rejected, versus 36% for gpt-3.5 turbo and 39% for gpt-4, suggesting that overall quality may be higher in the newer chat models.

However, with a total of 193 summaries (141 from text-davinci-003, 11 from gpt-3.5 turbo, and 41 from gpt-4), the dataset is too small to support firm conclusions. In addition, we continuously improved the prompt and the system prompt, which likely reduced the number of rejected summaries over time.
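For reference, the per-model shares and the rejected counts implied by the rounded percentages above; the implied counts are approximations derived from those percentages (they sum to roughly the 101 rejected summaries):

```python
totals = {"text-davinci-003": 141, "gpt-3.5 turbo": 11, "gpt-4": 41}
reject_rates = {"text-davinci-003": 0.57, "gpt-3.5 turbo": 0.36, "gpt-4": 0.39}

grand_total = sum(totals.values())  # 193
for model, n in totals.items():
    share = n / grand_total
    implied_rejected = round(n * reject_rates[model])  # derived, approximate
    print(f"{model}: {share:.0%} of summaries, ~{implied_rejected} rejected")
```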

Prompt evolution

During the period we also experimented with 12 different versions of the prompt. In addition, gpt-3.5 turbo and gpt-4 take a system prompt, of which we used four versions.

The prompts generally got longer with each version, adding instructions based on failed summaries. The additions include statements like: “without typos”, “focus on unusual conditions”, “how the year has been”, “what the future holds” and “ignore irrelevant information that does not relate to the company”.

The system prompt also grew with each new version. At the beginning it simply said “You are a skilled journalist who is about to write an article for Ekstra Bladet”, but at OpenAI’s presentation of gpt-4 on March 14th I noticed that one of their example system prompts read “You are a TaxGPT, a large language model trained by OpenAI”, even though they had not actually trained a TaxGPT. I therefore changed the system prompt to: “You are JournalistGPT, a large language model trained by OpenAI, you are writing an article for the Danish newspaper Ekstra Bladet.” This prompt, however, could not be used with gpt-4, as the result was too tabloid in style, so I replaced “Ekstra Bladet” with “a large Danish newspaper”. This suggests that gpt-4 places more weight on the system prompt than gpt-3.5 turbo does, and/or that gpt-4 knows Ekstra Bladet better.
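For illustration, the quoted system prompts could be wired in along these lines; the surrounding code and the model check are a sketch, not MAGNA’s actual implementation:

```python
# System prompts as quoted above; the selection logic below is illustrative.
SYSTEM_PROMPT = (
    "You are JournalistGPT, a large language model trained by OpenAI, "
    "you are writing an article for the Danish newspaper Ekstra Bladet."
)

# With gpt-4 the newspaper name pushed the output toward tabloid style,
# so it was replaced with a generic description.
SYSTEM_PROMPT_GPT4 = (
    "You are JournalistGPT, a large language model trained by OpenAI, "
    "you are writing an article for a large Danish newspaper."
)

def build_messages(report_text: str, model: str) -> list:
    """Assemble the chat messages for a given model (illustrative helper)."""
    system_prompt = SYSTEM_PROMPT_GPT4 if model.startswith("gpt-4") else SYSTEM_PROMPT
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user",
         "content": ("Write a sober summary in Danish of the following "
                     "management report:\n\n" + report_text)},
    ]
```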

In sum

100% of the articles were given a summary, 59% on the first attempt. It is clear that GPT models now have the capability to generate acceptable summaries in Danish, both when the original language is English and when it is Danish. In this way we were able to add very useful content to the rule-based auto-generated articles, augmenting them with the insights and explanations presented in the companies’ financial reports.

However, the goal of reaching a point where we can trust the summaries and publish them directly without a human in the loop still seems very far away. A custom user interface for accepting and rejecting the summaries, combined with a database of accepted summaries, rejection reasons and prompts used, proved to be a useful tool for learning and improving the prompting.

A few of the texts took many summary attempts before one was accepted, since the model was not given the reason for the rejection. I have subsequently experimented, with success, with giving gpt-4 the rejection reason and telling it to write a new summary. This approach promises to greatly reduce the number of summaries per article.
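A minimal sketch of that follow-up experiment, assuming the same pre-1.0 openai library as above; the message wording is hypothetical, not the exact prompt used:

```python
import os
import openai

openai.api_key = os.environ["OPENAI_API_KEY"]

def regenerate_with_feedback(report_text: str, rejected_summary: str,
                             reject_reason: str) -> str:
    """Give gpt-4 the rejected summary and the reviewer's reason,
    then ask for a corrected summary (illustrative wording)."""
    response = openai.ChatCompletion.create(
        model="gpt-4",
        temperature=0.5,
        messages=[
            {"role": "system",
             "content": ("You are JournalistGPT, a large language model trained by "
                         "OpenAI, you are writing an article for a large Danish newspaper.")},
            {"role": "user",
             "content": ("Write a sober summary in Danish of the following "
                         "management report:\n\n" + report_text)},
            {"role": "assistant", "content": rejected_summary},
            {"role": "user",
             "content": ("The editor rejected that summary for the following reason: "
                         + reject_reason + " Please write a new summary that fixes this.")},
        ],
    )
    return response["choices"][0]["message"]["content"].strip()
```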

The quality of summaries seemed to improve with newer models, but it is worth noting that the data set was too small to say anything conclusive about this, and in addition the continuously improved prompts likely led to a reduced number of rejections.

Below are the main categories of rejected summaries; in future approaches it will be worth keeping an eye on whether the model produces similar types of errors:

  • Missing important information
  • Incorrect summary
  • AI evaluates content
    (e.g. the model states that the company has had a successful year)
  • Poor language
  • Irrelevant information

Appendix 1

Summaries per model

Appendix 2

Number of summaries per article

54 articles have only 1 summary
15 articles have 2 summaries
8 articles have 3 summaries
etc.

Appendix 3

Reject types

Articles can have multiple summaries with different reject types.

Only one reject type per summary is registered.

Example: 9 articles had summaries rejected due to poor language, but 12 summaries in total were rejected for poor language. This means that for some articles, multiple summaries with poor language were generated.
