Automated Topic-Based Extractive Summarization System

To build the summarizer, we followed four steps for end-to-end data extraction and summary generation: data preprocessing, content selection, information ordering, and content realization.

  1. Data Preprocessing - This step involved collecting raw data, segmenting it into sentences, tokenizing, removing stop words, and applying lemmatization and stemming.
  2. Content Selection - We used Latent Dirichlet Allocation (LDA) to select the most important sentences that contribute to the summary. To improve the accuracy of LDA, we used TF-IDF scores, which rank the words of a document by importance and relevance.
  3. Information Ordering - This phase orders the selected sentences so that the summary is coherent. We used cosine similarity to discard redundant sentences and pairwise cosine scores to determine the most coherent ordering.
  4. Content Realization - This phase puts the final touches on the sentences, removing extraneous parts that would make the summary wordy. To do this, we removed parenthetical text, eliminated sentences shorter than 8 words, and removed adverbs, among other edits.
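
The redundancy filtering in step 3 and two of the realization rules in step 4 can be sketched as follows. This is a minimal illustration using only the Python standard library, not the project's actual code: the function names, the plain bag-of-words vectors, and the 0.8 similarity threshold are assumptions made for the example. Only the 8-word minimum comes from the description above. Adverb removal and sentence ordering are left out because they would need a part-of-speech tagger and a scoring scheme that the description does not specify.

```python
import math
import re
from collections import Counter


def bow(sentence):
    """Lowercased bag-of-words count vector for one sentence."""
    return Counter(re.findall(r"[a-z0-9']+", sentence.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def drop_redundant(sentences, threshold=0.8):
    """Keep a sentence only if it is not too similar to one already kept.

    The threshold is an assumed value chosen for illustration.
    """
    kept = []
    for s in sentences:
        if all(cosine(bow(s), bow(k)) < threshold for k in kept):
            kept.append(s)
    return kept


def realize(sentences, min_words=8):
    """Strip parenthetical text, then drop sentences shorter than min_words."""
    out = []
    for s in sentences:
        s = re.sub(r"\s*\([^)]*\)", "", s).strip()
        if len(s.split()) >= min_words:
            out.append(s)
    return out
```

Calling `realize(drop_redundant(sentences))` on a list of selected sentences drops near-duplicates first and then applies the length and parenthesis rules, so a sentence that only becomes too short after its parenthetical is removed is also filtered out.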