NLP news tracking tool
The business goal
Our client is an Economics professor at Chicago university. She wanted to improve the ability of scholars, students, and academics to stay up to date on recent research by subscribing to multiple scientific journals, magazines, and almanacs.
The business goal of this project was to create a tool that tracks follow-ups to news stories by calculating similarities between pairs of news articles. Articles were collected from three major news agencies: BBC, NBC and ABC. Updates were captured by frequently polling the RSS feeds of those same agencies. When a new article was found, it was parsed from the web page: HTML and other irrelevant information were stripped, punctuation was removed, and a Porter stemmer was applied. The resulting text was then compared with previously processed articles by computing a pairwise similarity score based on TF-IDF.
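The processing steps above can be illustrated with a minimal sketch. It is not the project's actual code: it uses only the Python standard library so that it runs as is, it assumes cosine similarity between TF-IDF vectors as the pairwise score (the text does not say which measure was used), and it assumes a smoothed IDF formula. The names preprocess, tfidf_vectors and cosine_similarity are hypothetical. Stemming is passed in as a function; NLTK's PorterStemmer().stem could be supplied as the stem argument.

```python
import math
import string
from collections import Counter
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


# Translation table that turns every ASCII punctuation mark into a space.
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def preprocess(html, stem=lambda token: token):
    """Strip HTML, lowercase, remove punctuation, and stem each token.

    `stem` defaults to the identity function (an assumption made to keep
    the sketch dependency-free); pass a real stemmer to apply stemming.
    """
    extractor = _TextExtractor()
    extractor.feed(html)
    extractor.close()
    text = " ".join(extractor.chunks).lower().translate(_PUNCT_TO_SPACE)
    return [stem(token) for token in text.split()]


def tfidf_vectors(docs):
    """Return one {term: weight} dict per tokenized document.

    Uses relative term frequency and a smoothed IDF,
    ln((1 + N) / (1 + df)) + 1 (an assumed weighting variant).
    """
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc))
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

With vectors built this way, two articles that share most of their vocabulary score close to 1, while articles with no terms in common score 0.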
The main challenges of the project were estimating a TF-IDF similarity threshold above which articles are grouped together as the same story, and optimizing the algorithms to run quickly against a large dataset of previously processed articles. The tool was deployed on a remote server, where it stored processed articles and their similarities.
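How a threshold turns pairwise scores into groups can be sketched as follows. This is an illustration under assumptions, not the deployed algorithm: the function name group_by_threshold is hypothetical, the threshold value in the usage example is arbitrary, and the grouping rule (articles are linked whenever their score meets the threshold, and linked articles form one group) is only one of several possible choices.

```python
def group_by_threshold(n_articles, similarities, threshold):
    """Group article indices whose pairwise similarity meets the threshold.

    `similarities` maps (i, j) index pairs to a score. Groups are the
    connected components of the graph that has an edge for every pair
    scoring at or above `threshold`.
    """
    parent = list(range(n_articles))

    def find(i):
        # Follow parent links to the group's root, shortening the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for (i, j), score in similarities.items():
        if score >= threshold:
            root_i, root_j = find(i), find(j)
            if root_i != root_j:
                # Keep the smaller index as the root for stable output.
                parent[max(root_i, root_j)] = min(root_i, root_j)

    groups = {}
    for i in range(n_articles):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Hypothetical scores for four articles; 0.5 is an arbitrary threshold.
scores = {(0, 1): 0.8, (1, 2): 0.1, (0, 2): 0.05, (2, 3): 0.6}
print(group_by_threshold(4, scores, 0.5))  # [[0, 1], [2, 3]]
```

Raising the threshold splits stories into smaller groups and lowering it merges them, which is why choosing the value was a central difficulty.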
From a technical point of view, we used Python with NLTK for natural language processing, BeautifulSoup for scraping, and MySQL as the database.
As a result, our client enhanced the functionality of their application, making it more marketable.