Auto-Summarisation of Text Documents

Who has not seen the AutoSummarise tool in Word from MS Office? Some might also have played with it by changing the compression quotient, i.e. the ratio of the size of the summary to the size of the document.

I really don’t know the algorithm used inside (how could I? It’s Microsoft!), nor do I attempt to address every issue of auto-summarization. I was just trying to analyze what exactly we do when we summarize a document. It involves a lot of factors besides picking the important details out of the document: its title, how short the summary is going to be, and who is going to read it (students, professors, researchers, laymen, etc.), among others.

One thing that struck me was the idea of shortening big concepts made of small words into a small set of big words. By big and small I mean how complex a word’s meaning is, not how long the word is. Dilution is a very common concept in language: we tend to dilute a word into simpler words, just as we represent bigger modules using smaller, basic modules in our computer languages. Summarization is of course related to picking out the prominent words in a document, but I have not seen these tools use new words to describe the document. Not every concept is spelled out in detail in a document. Beyond that, it is quite helpful to simply replace a set of words with a single word that conveys more or less the same meaning. In short, I’m talking about mapping a meaning (its semantics) to the single word that carries that definition.
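
To make the idea concrete, here is a minimal sketch of replacing a set of small words with a single bigger word. The table CONDENSATIONS and the function condense are my own hypothetical names, and the entries are made up for illustration; a real tool would need a large, curated lexicon and some awareness of context, which plain string matching does not have.

```python
import re

# Hypothetical phrase-to-word table (an assumption for illustration only):
# each key is a "diluted" phrase, each value the single word that replaces it.
CONDENSATIONS = {
    "a person who writes programs": "a programmer",
    "at the same time": "simultaneously",
    "make smaller": "shrink",
}

def condense(text, table=CONDENSATIONS):
    """Replace every known phrase in text with its single-word equivalent."""
    # Try longer phrases first so a short phrase cannot pre-empt a longer one.
    for phrase in sorted(table, key=len, reverse=True):
        pattern = re.compile(r"\b" + re.escape(phrase) + r"\b", re.IGNORECASE)
        text = pattern.sub(table[phrase], text)
    return text

sentence = "She is a person who writes programs and tests them at the same time."
print(condense(sentence))
```

This only matches phrases literally, so it shows the mapping itself and nothing more; it does not decide whether the replacement actually preserves the meaning in a given sentence.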

Is it easy? No. However, with semantic graphs (as in RDF), perhaps a first stage can be built. If the meaning or structure of the words changes slightly, the graph is only minimally perturbed. Complexity increases when several different mappings lead to the same definition, that is, when the same word can be composed in different ways by arranging smaller words into different groups. This may be reduced by grouping similar words, so that picking any word from a group to replace another word of that group does not change the meaning of the sentence.
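
A rough sketch of the grouping idea, assuming a sentence can be written as (subject, predicate, object) triples in the spirit of RDF. This uses plain Python tuples rather than a real RDF library, and SIMILAR_GROUPS, normalise and overlap are hypothetical names of my own. Every word is mapped to the first member of its group, so two graphs that differ only by similar words end up identical, while a real change in meaning lowers the overlap.

```python
# Hypothetical groups of interchangeable words (an assumption for illustration);
# the first member of each group acts as its representative.
SIMILAR_GROUPS = [
    ("shorten", "condense", "abridge"),
    ("document", "text", "article"),
]

# Map every word to the representative of its group.
CANONICAL = {word: group[0] for group in SIMILAR_GROUPS for word in group}

def normalise(triples):
    """Rewrite each term of each triple to its group representative."""
    return {
        tuple(CANONICAL.get(term, term) for term in triple)
        for triple in triples
    }

def overlap(a, b):
    """Jaccard overlap of two triple sets after normalisation (1.0 = same graph)."""
    a, b = normalise(a), normalise(b)
    return len(a & b) / len(a | b)

original = {("tool", "shorten", "document"), ("reader", "read", "summary")}
reworded = {("tool", "condense", "article"), ("reader", "read", "summary")}
changed = {("tool", "translate", "document"), ("reader", "read", "summary")}

print(overlap(original, reworded))  # rewording within groups leaves the graph unchanged
print(overlap(original, changed))   # a different predicate perturbs the graph
```

Jaccard overlap is just one simple way to measure how much a graph is perturbed; I picked it for brevity, not because it is the right measure.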

Suggestions invited.


