On model selection for text summarization


On our journey toward our product MVP, we are building a platform that will guide users through the process of machine learning model selection by performing evidence-based evaluation. We’re starting with text summarization as a first platform task for two main reasons:

First, the product interviews we’ve conducted with developers and enterprises have shown firsthand that text summarization is one of the most important and widely used machine learning tasks that businesses rely on. This matches the history of the field: summarization is one of the original use cases for natural language modeling.

In a landscape still dominated by closed-source LLM systems, we want to help users by providing an easy way for them to a) evaluate whatever model they choose to use, b) get understandable and transparent results, and c) trust the tech stack they want to build. In short, we’re offering accessibility, transparency, and trust.

Second, from a business perspective, the decision to prioritize this use case is underpinned by the significant market potential for text summarization, a subset of the broader natural language processing (NLP) market, which is set to expand from $5.5 billion in 2020 to $13.8 billion by 2025, reflecting a compound annual growth rate (CAGR) of 20.3% (see On-Page).

An extremely interesting part of our recent work with summarization has been that “traditional” Seq2Seq encoder-decoder models perform better than newer autoregressive-style large language models with more parameters, especially taking into account practical considerations around inference speed and model size.

Picking a Summarization Model: Abstractive or Extractive?

Finding a good model for summarization is a daunting task, as the typical intuition that models with more parameters generally perform better goes out the window. For summarization, we need to consider the input, which will likely be long, so finding models that deal efficiently with long contexts is of paramount importance. In our business case, which is to create summaries of conversation threads, much as you might see in Slack or an email chain, the models need to extract key information from those threads while still accepting a context window large enough to capture the entire conversation history.

We identified that, for our use cases, it is far more valuable to produce abstractive summaries, which identify the important sections in the text and generate new text highlighting them, than extractive ones, which pick a subset of sentences and join them together. Since the final interface will be natural language, we don’t want users to have to interpret the often incoherent text snippets that extractive summaries produce.

How do we evaluate a “good” summary?

The most difficult part is identifying the evaluation metrics useful for judging the quality of summaries. Evaluation is a broad and complex topic, and it’s difficult to have a single metric that can answer the question, “Is this model a good abstractive summarizer?” 

For our early exploration, we stuck to tried-and-true metrics, limiting our scope to metrics that can be used with ground truth. Ground truth for summarization includes documents that are either manually summarized or bootstrapped from summaries generated by models and approved by humans. 

These include:

  • ROUGE (Recall-Oriented Understudy for Gisting Evaluation) - Compares an automatically generated summary to a reference summary (typically human-written) using a family of n-gram and longest-common-subsequence overlap measures, each scored from 0 to 1.
  • BLEU (Bilingual Evaluation Understudy) - Also a measure between 0 and 1, based on the modified n-gram precision of a candidate (model-generated) sentence against a reference (ground truth) sentence.
  • METEOR - Looks at the harmonic mean of unigram precision and recall, with recall weighted more heavily than precision.
  • BERTScore - Generates contextual embeddings of the ground truth summary and the model output and compares them using cosine similarity.
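To make the overlap idea behind ROUGE concrete, here is a minimal sketch of a ROUGE-N style score in plain Python. It is a simplified illustration, not the implementation used by any evaluation library: it assumes lowercase whitespace tokenization with no stemming, and the function names and example sentences are made up for this post.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return a multiset (Counter) of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate, reference, n=1):
    """Simplified ROUGE-N: precision, recall, and F1 over overlapping n-grams."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    # Counter intersection clips each n-gram to its minimum count in both texts.
    overlap = sum((cand & ref).values())
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical reference (ground truth) and candidate (model output) summaries.
reference = "the doctor will send the patient to a specialist"
candidate = "the doctor sends the patient to a specialist"
scores = rouge_n(candidate, reference, n=1)
print(scores)
```

Recall here asks how much of the reference the model output recovers, while precision asks how much of the model output is supported by the reference. A paraphrase with different wording can score low on both even when it is a good summary.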

Because our platform users won’t always have ground truth summarization data, we also explored some metrics that don’t require a ground truth example and instead base the evaluation on relating the information in the summary to the original input:

  • Judge model-based eval, where we use an LLM (for example, GPT-4) to judge whether the summary generated by our initial model is correct. This can be implemented from scratch, or similar metrics are available through evaluation libraries such as MLflow Evaluate and G-Eval. We essentially prompt the LLM to read the summary and assess different attributes:
    • Relevance: Evaluates if the summary includes only important information and excludes redundancies.
    • Coherence: Assesses the logical flow and organization of the summary.
    • Consistency: Checks if the summary aligns with the facts in the source document.
    • Fluency: Rates the grammar and readability of the summary.
  • SUPERT
    • An unsupervised summarization and summary evaluation library.
    • Has some ground-truth-based metrics available but mainly uses semantic similarity on sentence embeddings for evaluation.
  • SummEval 
    • Extractive fragment coverage: percentage of words in the summary that are from the source article, measuring the extent to which a summary is a derivative of a text.
    • Density: the average length of the extractive fragment to which each summary word belongs
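The coverage and density definitions above can be sketched in a few lines of Python. This is an illustrative version under stated assumptions, not SummEval's own code: it uses whitespace tokenization, finds shared fragments greedily (the longest match starting at each summary position), and the function names and example texts are hypothetical.

```python
def extractive_fragments(article_tokens, summary_tokens):
    """Greedily collect the longest token runs shared by the summary and the article."""
    fragments = []
    i = 0
    while i < len(summary_tokens):
        best = 0
        # Try every article position and keep the longest run matching from summary index i.
        for j in range(len(article_tokens)):
            k = 0
            while (i + k < len(summary_tokens)
                   and j + k < len(article_tokens)
                   and summary_tokens[i + k] == article_tokens[j + k]):
                k += 1
            best = max(best, k)
        if best > 0:
            fragments.append(summary_tokens[i:i + best])
        i += max(best, 1)  # skip past the fragment, or past a single novel word
    return fragments


def coverage_and_density(article, summary):
    """Coverage: share of summary words inside a fragment. Density: mean squared fragment length."""
    article_tokens = article.lower().split()
    summary_tokens = summary.lower().split()
    if not summary_tokens:
        return 0.0, 0.0
    fragments = extractive_fragments(article_tokens, summary_tokens)
    coverage = sum(len(f) for f in fragments) / len(summary_tokens)
    density = sum(len(f) ** 2 for f in fragments) / len(summary_tokens)
    return coverage, density


# Hypothetical source text and summary.
article = "the patient has trouble breathing and the doctor will refer the patient to a specialist"
summary = "the patient has trouble breathing so the doctor recommends a specialist"
coverage, density = coverage_and_density(article, summary)
print(f"coverage={coverage:.3f} density={density:.3f}")
```

A fully copied summary gets a coverage of 1 and a density equal to its length, while a fully reworded one gets 0 on both, so the pair gives a quick signal of how extractive a "abstractive" model's output really is.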

Picking a good abstractive summarization model

Once we have settled on how we want to evaluate the summarization task, the next step is finding models that can be used for abstractive summarization. In this area, we essentially are comparing what are now classical encoder-decoder models that have been around for almost five years to the new breed of large context models that are emerging on an almost daily basis. 

The models we identified that may be a good fit for us are as follows, with some of their pros and cons listed:

“Traditional” Encoder-Decoder Style Models for Summarization

  1. Longformer - A transformer that can be trained on both the masked-language-modeling (MLM) objective (same as BERT) and the autoregressive objective (i.e., the LM objective that GPTs are trained with), with the attention mechanism changed so that it scales linearly with the size of the input.
    1. Pros
      1. Relatively easy to implement and work with
      2. Long context window (up to 8k while still being relatively lightweight)
      3. Can also do abstractive summarizations while being performant
    2. Cons
      1. Architecture is a bit dated; newer long-context LLMs can likely outperform Longformer with some tweaking.
  2. T5 - A transformer model trained with a span-corruption objective (a variant of MLM) specifically for text-to-text generation. It is trained on a mixture of tasks: translation, summarization, QA, entailment, and paraphrasing, among others. It is NOT an LM in the sense that you prompt it with free-form instructions to get a specific type of output. Instead, the model is pre-trained with a specific input structure for each task type, and inputs must follow the format for that task (usually abstracted away behind an interface when using HF).
    1. Pros:
      1. Has been specifically trained for summarization and paraphrasing tasks. 
      2. Usually more performant than using BERT-based models. 
      3. Can be multi-purpose, but not as versatile as other LMs. 
      4. Can be fine-tuned further for both extractive and abstractive modeling. 
    2. Cons
      1. There is no hard context limit, but memory usage scales quadratically with context size, so it is not as efficient for large contexts as Longformer.
  3. Pegasus-X - A long-input abstractive summarization model. Pegasus-X is specialized for long summarization tasks and sacrifices some performance on short-input summarization; for those inputs we can fall back to using the Pegasus model directly. This is an LLM trained specifically for summarization, and it can be further fine-tuned.
    1. Pros
      1. Performs on-par or better than other LLMs on summarization tasks while having a smaller parameter size.
      2. Out-of-the-box near SOTA performance on summarization tasks. Has different model options for long input vs short input summaries.
      3. Can be fine-tuned on a single GPU.
    2. Cons
      1. A bit dated but maybe still good?
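To make the scaling difference between Longformer and T5 concrete, here is a back-of-the-envelope sketch that counts attention-score entries per head per layer. The window of 512 tokens and the helper names are assumptions for illustration, and the count ignores Longformer's global-attention tokens, so treat the numbers as rough orders of magnitude rather than measured memory usage.

```python
def full_attention_entries(seq_len):
    """Score-matrix entries when every token attends to every other token."""
    return seq_len * seq_len


def sliding_window_entries(seq_len, window):
    """Approximate entries when each token attends to at most `window` neighbours."""
    return seq_len * min(window, seq_len)


WINDOW = 512  # assumed local attention window, for illustration only

for n in (512, 2048, 8192):
    full = full_attention_entries(n)
    local = sliding_window_entries(n, WINDOW)
    print(f"{n:>5} tokens: full={full:>11,} local={local:>10,} ratio={full / local:.0f}x")
```

Doubling the input length quadruples the full-attention count but only doubles the windowed one, which is why the gap matters most for long conversation threads.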

Decoder-Only LLMs for Summarization

  1. Mistral - The Mistral 7B instruct model is fine-tuned for summarization with a context size up to 64k. This model produces far more natural language summarizations compared to the previous models but has the added caveat that it is a 7B size model, so cannot really be run in a resource-constrained environment. It has the same safety concerns as the base Mistral models in that it is not considered a safe model.
    1. Pros
      1. Higher quality abstractive summaries
      2. Large context window
    2. Cons
      1. Not trained for safety
      2. Relatively large model
  2. Guanaco - Open-source versions of the Llama models adapted using 4-bit QLoRA fine-tuning. Available from 7B to 65B parameters. Must have prompt instructions for summarizing.
    1. Pros
      1. Available in a large set of sizes from large to very large.
      2. Can be prompted for multiple types of tasks, not just summarization, although according to Reddit it is pretty good at it
    2. Cons
      1. Needs to load Llama weights, although we can potentially load weights using open-llama instead.
      2. None of the sizes are small enough if we later want to have an on-device solution

Once we have a good set of candidate models that we believe perform well, we can test these models on a real dataset. 

Evaluation Setup and Results

To start our exploration, we evaluated these model architectures using versions explicitly fine-tuned for summarization on Hugging Face where available, and the standard instruction-tuned or foundation models where not. Autoregressive LLMs, such as Mistral 7B, had very slow inference speeds, especially since large documents required the full large context size, even when the model was loaded at lower precision and accelerated on an A100 GPU.

Large parameter LLMs, which we consider to be at least 7B parameters and up in this case, completed a summarization inference every ~30 seconds, which is extremely high-latency, even for batch inference. The result quality was also not very good, often producing barely coherent texts. 

It became evident that effectively using autoregressive models for summarization, especially the ones with larger context windows, would require some finessing, either by crafting prompts better than “Summarize the following conversation: {conversation}” or through fine-tuning using advanced techniques (PPO or DPO).

Considering the scope of this investigation, we dove into models historically more successful at summarization: Seq2Seq encoder-decoder models. We found that the classical Longformer architecture performs beautifully even with our large context windows. 

We also tested multiple variants of Pegasus and T5 models and found that they perform reasonably well too, trading blows with the Longformer models on some metrics. Overall, the Longformer model had the best scores all around, with top or near-top scores on METEOR, bert_precision, bert_recall, and all ROUGE metrics. The T5 model fine-tuned for summarization was a close second on these metrics.

When looking at these scores, there are some caveats that we have to highlight. The BERT-based metrics vary less across models than METEOR and ROUGE do, making it harder to use them to discriminate between models. All the metrics we have used for this stage of our exploration evaluate model quality based on how closely the generated summaries match a ground truth. This method is not always the best, as there are many ways to summarize the same information using different words and phrasing. While bert_score maps the summaries to an embedding space and ideally would map different texts with the same meaning close together (and thus give them a high similarity score), that is not guaranteed. All this is to say that a higher score with these metrics only tells us that a certain model is closer to the reference summary that is provided, not that it will be a good summarizer in general.

Summarization Model Evaluation Results on A100s using HuggingFace Evaluate

A sample of results is shown below:

models                                                       | meteor | bert_precision | bert_recall | bert_f1 | rouge1 | rouge2 | rougeL | rougeL
mikeadimech/longformer-qmsum-meeting-summarization_summaries | 0.3147 | 0.8463         | 0.8780      | 0.8615  | 0.2161 | 0.0610 | 0.1722 | 0.1722
mistralai/Mistral-7B-v0.1_summaries                          | 0.2728 | 0.7856         | 0.8731      | 0.8269  | 0.1190 | 0.0401 | 0.0956 | 0.0956
google/pegasus-xsum_summaries                                | 0.1305 | 0.8607         | 0.8435      | 0.8518  | 0.1631 | 0.0322 | 0.1344 | 0.1344
Falconsai/text_summarization_summaries                       | 0.2842 | 0.8336         | 0.8632      | 0.8480  | 0.1736 | 0.0435 | 0.1374 | 0.1374
mrm8488/t5-base-finetuned-summarize-news_summaries           | 0.3176 | 0.8354         | 0.8737      | 0.8539  | 0.1972 | 0.0562 | 0.1494 | 0.1494
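The caveat about BERT-based metrics varying less than METEOR and ROUGE can be checked directly against the table. The sketch below copies four of the reported columns and computes the spread (max minus min) of each metric across models. The helper names are ours and purely illustrative, and the model IDs are written without the "_summaries" suffix.

```python
from statistics import mean

# Scores copied from the results table; model IDs shown without the "_summaries" suffix.
RESULTS = {
    "mikeadimech/longformer-qmsum-meeting-summarization":
        {"meteor": 0.3147, "bert_f1": 0.8615, "rouge1": 0.2161, "rougeL": 0.1722},
    "mistralai/Mistral-7B-v0.1":
        {"meteor": 0.2728, "bert_f1": 0.8269, "rouge1": 0.1190, "rougeL": 0.0956},
    "google/pegasus-xsum":
        {"meteor": 0.1305, "bert_f1": 0.8518, "rouge1": 0.1631, "rougeL": 0.1344},
    "Falconsai/text_summarization":
        {"meteor": 0.2842, "bert_f1": 0.8480, "rouge1": 0.1736, "rougeL": 0.1374},
    "mrm8488/t5-base-finetuned-summarize-news":
        {"meteor": 0.3176, "bert_f1": 0.8539, "rouge1": 0.1972, "rougeL": 0.1494},
}


def spread(metric):
    """Difference between the best and worst model on one metric."""
    values = [scores[metric] for scores in RESULTS.values()]
    return max(values) - min(values)


def relative_spread(metric):
    """Spread divided by the mean score, so metrics on different scales are comparable."""
    values = [scores[metric] for scores in RESULTS.values()]
    return spread(metric) / mean(values)


def best_model(metric):
    """Model ID with the highest score on one metric."""
    return max(RESULTS, key=lambda name: RESULTS[name][metric])


for metric in ("meteor", "bert_f1", "rouge1", "rougeL"):
    print(f"{metric:<8} spread={spread(metric):.4f} "
          f"relative={relative_spread(metric):.1%} best={best_model(metric)}")
```

On these numbers the bert_f1 spread is a small fraction of the METEOR and ROUGE spreads, which is why we lean on the latter when ranking models.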

Given the results, we found the following hold true for summarization tasks with large language models: 

  • Running evaluations was easiest using the HF evaluate library. Other libraries such as SUPERT or SummEval are also good options but require more setup 
  • Some metrics that are not available through HF evaluate are available through SummEval, but the metrics available through HF evaluate are a good starting point for exploration.
  • Larger models are not necessarily better, especially when considering the time it takes for a single summary generation and the quality of the generation. We need to explore highly compressed larger models, mixed 4 bit and 8 bit compression, finetuning, LoRA adapting of large models for faster inference on summarization.
  • Summarization quality based on metrics alone is not the only thing to consider, as the metrics miss contextual elements. From a metric standpoint Longformer performs great, but Pegasus produces the most natural-sounding summaries of the conversations based on how readable the responses are (see Appendix A for some outputs from Pegasus). It can gather contextual information and provide higher-level summaries, but it is still somewhat error-prone.
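As a rough guide to why 4-bit and 8-bit compression is worth exploring, here is some simple arithmetic on weight memory for a 7B-parameter model. The byte sizes per precision are standard, but the helper name is ours, and the estimate covers weights only: activations, the KV cache for long contexts, and quantization metadata are left out, so real usage will be higher.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(num_params, precision):
    """Approximate memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


NUM_PARAMS = 7e9  # a 7B-parameter model, the smallest "large" size we tested

for precision in BYTES_PER_PARAM:
    print(f"{precision}: ~{weight_memory_gb(NUM_PARAMS, precision):.1f} GB")
```

Lower precision shrinks the memory footprint, but it does not by itself guarantee faster generation or equal summary quality, so any compressed model still needs to go through the same evaluation.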

As a result of these findings, for the work we’re doing on the mzai platform, we’re focusing on providing support for the Longformer, T5, and Pegasus model families, as they offer the best summary quality given their relatively lightweight size and inference speed. If you’re considering summarization in your LLM pipeline, it can be a tricky landscape to navigate, but these tips are a good starting point for most use cases.

Appendix A:

Sample Summaries from Pegasus X-Sum vs Ground Truth

Example 1
  • Ground Truth Summary: #Person2# has trouble breathing. The doctor asks #Person2# about it and will send #Person2# to a pulmonary specialist.
  • Pegasus X-Sum: This is a recording of a conversation between a person and a doctor.
  • Notes: Summary is broad and high level.

Example 2
  • Ground Truth Summary: #Person1# and #Person2# talk about advertisements in Hong Kong. #Person2# likes the billboards, while #Person1# thinks there're too many advertisements. They both hate leaflets and broadcasts but like the way that advertising agencies use comedy in their campaign.
  • Pegasus X-Sum: How do you feel about advertising in Hong Kong?
  • Notes: Summary is high level, without mentioning the characters and missing relevant details.

Example 3
  • Ground Truth Summary: #Person1# tells #Person2# #Person2# hasn't got the position. #Person2# feels disappointed and #Person1# encourages #Person2# to keep working hard.
  • Pegasus X-Sum: #Person1#: I'm afraid I've failed the test.
  • Notes: More of an extractive summary of the original text than abstractive.

Example 4
  • Ground Truth Summary: #Person1# and #Person2# are touched by the wedding, and #Person1# says #Person1# and Tom are preparing for their engagement.
  • Pegasus X-Sum: A conversation between Harris and Anne, as they prepare for their wedding.
  • Notes: Identifies characters in the original text, but misidentifies whose wedding they are preparing for.
