To produce a summary manually, a person first has to read and understand the original document. Based on the events, facts or situations understood from the document, the important aspects are identified according to the purpose of the summary. The summary does not contain all of the information present in the original document, only the parts deemed important, since the goal of a summary is to reduce the amount of information presented. Once the important aspects have been identified, the summary is produced in a suitable output format.
The generic stages of summarization described above were addressed in earlier work by Luhn [2], in which the author states that summarization generally involves three aspects: input, analysis and output. For the input aspect, humans usually require an understanding of the natural language in which the document is written. Analysis requires determining the purpose of the summary and the target audience. Synthesizing a suitable output form for the summary is then the last step before it is presented to the user.
For each of these summarization aspects, there are many factors to consider regardless of whether the summarizer is a human or a machine. For the input aspect, the summarizer has to decide how to approach reading the document depending on how it is structured. For instance, chapter headings or the labels of figures and tables may contain information that is useful in the analysis stage. The metadata of some documents, such as the keywords of HTML web pages, may be beneficial too. If the document has been classified and the class or domain it belongs to is available to the summarizer, it may be possible to use knowledge specific to that domain to aid the analysis and output stages. The language of the text documents may also affect the summarizer. Human summarizers usually require an understanding of the language in which the document was written. Machines, on the other hand, lack the full and deep natural language understanding that humans normally possess. In addition, human summarizers usually have background information and common sense knowledge about the world, and possibly about the subjects in the document, which allows them to infer what is left unstated between sentences. Take the following two sentences as an example: “Adam ordered a delivery pizza. He liked its taste very much”. It can be inferred that the pizza was prepared, cooked and delivered to Adam, and that Adam then ate it. This inference requires common sense knowledge, which machines lack, as noted by Lenat in [3].
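As an illustration of the input-stage cues mentioned above, the following minimal sketch collects the keywords metadata and the heading text of an HTML page using only the Python standard library. The class name `StructureHints`, the function `extract_hints` and the sample page are hypothetical names introduced for this example; how such cues would then be weighted in the analysis stage is left open.

```python
from html.parser import HTMLParser


class StructureHints(HTMLParser):
    """Collects <meta name="keywords"> values and heading text from a page."""

    HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.keywords = []
        self.headings = []
        self._in_heading = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "keywords":
            # The keywords attribute is a comma-separated list.
            content = attrs.get("content") or ""
            self.keywords.extend(k.strip() for k in content.split(",") if k.strip())
        elif tag in self.HEADING_TAGS:
            self._in_heading = True
            self._buffer = []

    def handle_data(self, data):
        # Text inside nested inline tags (e.g. <em>) is still heading text.
        if self._in_heading:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag in self.HEADING_TAGS and self._in_heading:
            text = " ".join("".join(self._buffer).split())  # normalize whitespace
            if text:
                self.headings.append(text)
            self._in_heading = False


def extract_hints(html):
    """Return the keywords and headings found in an HTML string."""
    parser = StructureHints()
    parser.feed(html)
    parser.close()
    return {"keywords": parser.keywords, "headings": parser.headings}


# Hypothetical sample page, used only to exercise the sketch.
page = """<html><head>
<meta name="keywords" content="summarization, extracts, NLP">
</head><body><h1>Automatic  Summarization</h1><p>Body text.</p>
<h2>Input <em>aspects</em></h2></body></html>"""

hints = extract_hints(page)
print(hints["keywords"])
print(hints["headings"])
```

A summarizer could, for example, give extra weight to sentences that share terms with these keywords or headings, although that choice is an assumption of this illustration.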
In the analysis stage, the summarizer evaluates and selects the important parts of a document. The purpose of the summary affects how the document is analyzed. For instance, a summary may be written on the assumption that the user knows little about the subject, and hence an overview of the subject or of the content of the document is included in the summary. An update summary for a news event, on the other hand, would include only new key updates, on the assumption that the user has read the earlier articles. The user's goal may also affect how the document is analyzed. When a user searches for specific information, the summarizer would focus the summary on the information most closely related to what the user searched for. The number of documents to be summarized is another aspect affecting how the analysis is performed. In single-document summarization, the structure of the document may have a greater impact on the final summary than it does in multi-document summarization. This is especially evident if the structures of the different documents vary greatly.
The output of the summary can take different forms. The summary can consist of extracts, which contain unaltered pieces of the original text such as full sentences or paragraphs. It can also take the form of an abstract, in which new phrases or sentences are created. The length of a summary varies with the intended purpose and the desired compression rate. The summary can be made up of complete sentences or simply of phrases, as in news headlines. Summaries can be presented as plain text, or as text accompanied by other contextual information such as hyperlinks or related sentences, depending on the user interface being used.
In the work performed for this thesis, the focus is mainly on generating extract-based single-document or multi-document summaries. The extracts are in the form of sentences taken from the original text. The use of the Sentences Simplification Module (SSM) allows me to adapt the system to generate abstractive summaries, as described in chapter six. The summaries generated are domain-independent. Because the repositories used are in English, the developed framework is currently restricted to the English language. No prior knowledge is assumed about the users, and the summaries are influenced only by the contents of the text documents and by the user query, if one is supplied.
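To make this scope concrete, the following is a minimal sketch of sentence extraction over one or more documents with an optional user query. It is not the framework developed in this thesis: the word-frequency scoring, the naive sentence splitter, the small stopword list, and the names `summarize`, `compression_rate` and `query_weight` are all assumptions made for illustration. In particular, `compression_rate` is interpreted here as the fraction of sentences retained.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; real systems use larger ones).
STOPWORDS = {
    "a", "an", "and", "are", "as", "at", "by", "for", "from", "in",
    "is", "it", "of", "on", "or", "that", "the", "to", "was", "with",
}


def tokenize(text):
    """Lowercase the text and return its non-stopword alphanumeric tokens."""
    return [w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOPWORDS]


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def summarize(documents, compression_rate=0.3, query=None, query_weight=2.0):
    """Return an extract: unaltered sentences, kept in their original order."""
    sentences = [s for doc in documents for s in split_sentences(doc)]
    if not sentences:
        return []

    # Word frequencies over all input documents (single- or multi-document).
    freq = Counter(w for s in sentences for w in tokenize(s))
    query_terms = set(tokenize(query)) if query else set()

    def score(sentence):
        words = tokenize(sentence)
        if not words:
            return 0.0
        base = sum(freq[w] for w in words) / len(words)   # average word frequency
        overlap = len(query_terms.intersection(words))    # query terms present
        return base + query_weight * overlap

    keep = max(1, math.ceil(compression_rate * len(sentences)))
    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    return [sentences[i] for i in sorted(ranked[:keep])]


# Hypothetical input, used only to exercise the sketch.
docs = ["Adam ordered a pizza. The pizza was delivered quickly. "
        "He liked the pizza very much. The weather was cold."]

print(summarize(docs, compression_rate=0.5))
print(summarize(docs, compression_rate=0.5, query="weather"))
```

Without a query, the extract depends only on the document contents; supplying a query shifts the selection towards sentences that mention the query terms. No user profile enters the scoring, in line with the assumption that nothing is known about the user.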