Summarization

Description of summarization in the context of Iris: what it is and how it can be used.

When to use summarization?

Summarization transforms a document or a string into a shorter version while keeping the main points clear. It is better to use this brick than to ask the Interrogation "Can you summarize this document?".

Examples of use cases

Document summary

To get an overview of any document in the DMS (GED in French), summarization is used with the file as input.

String summary

After extracting all the information related to a specific theme from a project brief, we may want to summarize it into a short paragraph. To do so, we use the string summarizer function.

API

Summary of a new document

import requests

token = 'JWT ' + ''  # set your token here
url = "https://iris.egis-group.com/api/cgpt_structure/task_execute/?label_task=summary_file_path"

# The document is sent as a multipart file upload under the 'document_list' field.
files = [
  ('document_list', ('file_name.pdf', open('path_to_file.pdf', 'rb'), 'application/pdf'))
]
headers = {
  'Authorization': token
}

response = requests.post(url, headers=headers, files=files)

print(response.text)
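
As a follow-up, here is a minimal sketch of how the response above might be checked before use. The exact response schema is not documented on this page, so the sketch only prints the parsed JSON when one is returned, and falls back to the raw body otherwise.

if response.ok:
    try:
        print(response.json())  # parse the body as JSON when the API returns one
    except ValueError:
        print(response.text)    # otherwise fall back to the raw text
else:
    print(f"Request failed with status {response.status_code}: {response.text}")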

Summary of a string list

import requests
import json

token = 'JWT ' + ''  # set your token here
url = "https://iris.egis-group.com/api/cgpt_structure/task_execute/?label_task=summary_str"

# The strings to summarize are sent as a JSON list under the 'string_list' key.
payload = json.dumps({
  "string_list": [
    "Text 1 to summarize",
    "Text 2 to summarize"
  ]
})
headers = {
  'Authorization': token,
  'Content-Type': 'application/json'
}

response = requests.post(url, headers=headers, data=payload)

print(response.text)

How does it work?

The summarization algorithm we propose, based on Large Language Models (LLMs), unfolds in four steps, from the initial parsing of the document to the production of a concise and relevant summary. Each step employs dedicated techniques for processing and transforming the text, notably to address the limit of the Context Window (the maximum input size the LLM can handle).

  • Parsing: The first step involves analyzing the document's structure using Computer Vision algorithms to detect the layout. We have trained our own Layout Detection algorithms to reduce the model sizes. They are available on Hugging Face. This page structure detection enables the identification and separation of different visual elements of the document, such as text, images, and tables, into coherent text units. The goal is to preserve the logical structure of the original document for efficient understanding and further processing.

  • Chunking: Once the document is divided into coherent text units, the next step is "chunking". This step is necessary to address the Context Window problem. Each chunk must be informative and autonomous enough for the LLM to interpret it correctly without the document's global context. The choice of granularity for these chunks is crucial to balance text coherence against the size limit imposed by the model. Therefore, two small paragraphs might be merged, or a large paragraph might be split. For summarization, large chunks (about one page long) can be used, as shown in the sketch after this list.

  • Extraction of Key Elements: Next, each chunk is passed independently to the LLM to generate a summary of that chunk. The prompt instructs the model to produce a concise summary of the chunk.

  • Summary Production: All the chunk summaries are then concatenated, without any further treatment, to produce the final summary. In particular, no global summarization pass is performed, because it would require a much longer time before the first token is rendered to the user.
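
To make the chunking, extraction, and summary-production steps more concrete, here is a minimal sketch. It assumes the parsing step has already produced a list of text units; the call_llm function is a placeholder for the actual LLM client used by Iris, and the chunk size is illustrative only, not the production value.

MAX_CHUNK_CHARS = 4000  # roughly one page of text; illustrative value


def call_llm(prompt: str) -> str:
    """Placeholder for the real LLM call (client and model are not specified here)."""
    raise NotImplementedError


def build_chunks(text_units: list[str], max_chars: int = MAX_CHUNK_CHARS) -> list[str]:
    """Merge small text units and split large ones so each chunk stays near max_chars."""
    chunks, current = [], ""
    for unit in text_units:
        if len(current) + len(unit) <= max_chars:
            current = f"{current}\n{unit}".strip()
        else:
            if current:
                chunks.append(current)
            # Split a unit that is itself larger than one chunk.
            while len(unit) > max_chars:
                chunks.append(unit[:max_chars])
                unit = unit[max_chars:]
            current = unit
    if current:
        chunks.append(current)
    return chunks


def summarize(text_units: list[str]) -> str:
    chunks = build_chunks(text_units)
    # Step 3: each chunk is summarized independently by the LLM.
    chunk_summaries = [
        call_llm(f"Write a concise summary of the following text:\n\n{chunk}")
        for chunk in chunks
    ]
    # Step 4: concatenate the chunk summaries without a global summarization pass,
    # so the first tokens can be returned to the user quickly.
    return "\n\n".join(chunk_summaries)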
