NLP news tracking tool

The business goal

Our client is an Economics professor at Chicago university. She wanted to improve the ability of scholars, students, and academics to stay up to date on recent research by subscribing to multiple scientific journals, magazines, and almanacs.

The business goal of this project was to create a tool that tracks follow-ups to news stories by calculating the similarity between pairs of news articles. Articles were collected from three major news agencies: BBC, NBC and ABC. Updates were captured by frequently polling the RSS feeds of those agencies. When a new article was found, it was parsed from the web page, HTML and other irrelevant information was stripped, punctuation was removed, and a Porter stemmer was applied. The resulting text was then compared with previously processed articles by calculating a pairwise TF-IDF similarity score.
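The processing steps above can be sketched roughly as follows. This is a minimal illustration, not the project's code: it uses only the Python standard library, the function names (`preprocess`, `tfidf_vectors`, `cosine`) are hypothetical, and cosine similarity over TF-IDF vectors is assumed as the pairwise score because the write-up does not say how the score was computed. Stemming is left as a pluggable `stem` callable; NLTK's `PorterStemmer().stem` could be passed in.

```python
import math
import re
from collections import Counter
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def preprocess(html, stem=lambda word: word):
    """Strip markup and punctuation, lowercase, and stem each token.

    `stem` defaults to the identity function (an assumption made so the
    sketch has no third-party dependency).
    """
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.parts).lower()
    # Keeping only alphanumeric runs drops punctuation in the same pass.
    return [stem(token) for token in re.findall(r"[a-z0-9]+", text)]


def tfidf_vectors(docs):
    """Turn a list of token lists into one {term: weight} dict per document."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            # Smoothed IDF, so a term found in every document keeps a small weight.
            term: (count / len(doc)) * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```

With vectors built this way, an article and its follow-up share weighted terms and score above zero, while articles with no terms in common score exactly zero.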

The main challenges of the project were estimating a TF-IDF score threshold for grouping similar articles together and optimizing the algorithms to run quickly against a large dataset of previously processed articles. The tool was deployed on a remote server, where it stores the processed articles and their similarity scores.
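To illustrate how a score threshold turns pairwise similarities into groups of related articles, here is a small sketch. It assumes the scores are already computed and held in a dictionary keyed by article index pairs, and it groups articles transitively with a union-find structure; both the function name `group_by_threshold` and the transitive grouping rule are assumptions, since the write-up does not describe how groups were formed or how the threshold was estimated.

```python
def group_by_threshold(n_articles, similarities, threshold):
    """Group article indices whose pairwise score meets the threshold.

    `similarities` maps (i, j) index pairs to a score. Groups are connected
    components: if A matches B and B matches C, all three end up together.
    """
    parent = list(range(n_articles))

    def find(x):
        # Walk up to the root, shortening the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (i, j), score in similarities.items():
        if score >= threshold:
            root_i, root_j = find(i), find(j)
            if root_i != root_j:
                # Attach the larger root under the smaller one.
                parent[max(root_i, root_j)] = min(root_i, root_j)

    groups = {}
    for idx in range(n_articles):
        groups.setdefault(find(idx), []).append(idx)
    return sorted(groups.values())


# Hypothetical scores for four articles.
scores = {(0, 1): 0.8, (1, 2): 0.6, (2, 3): 0.1, (0, 3): 0.05}
loose = group_by_threshold(4, scores, threshold=0.5)
strict = group_by_threshold(4, scores, threshold=0.7)
```

Running the same scores through several thresholds, as in the last two lines, shows the trade-off involved: a lower threshold merges more articles into one story, while a higher one splits them apart.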

Technical side

From a technical point of view, the tool was built in Python: NLTK for natural language processing, BeautifulSoup for scraping, and MySQL as the database.
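A storage layer for processed articles and their similarity scores might look something like the sketch below. The schema, table names and helper functions are hypothetical, and `sqlite3` from the standard library stands in for MySQL so that the example runs without a database server.

```python
import sqlite3

# Hypothetical schema: one row per article, one row per scored pair.
SCHEMA = """
CREATE TABLE IF NOT EXISTS articles (
    id INTEGER PRIMARY KEY,
    url TEXT UNIQUE NOT NULL,
    source TEXT NOT NULL,
    processed_text TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS similarities (
    article_a INTEGER NOT NULL REFERENCES articles(id),
    article_b INTEGER NOT NULL REFERENCES articles(id),
    score REAL NOT NULL,
    PRIMARY KEY (article_a, article_b)
);
"""


def open_store(path=":memory:"):
    """Open (or create) the store and make sure the tables exist."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def add_article(conn, url, source, processed_text):
    """Insert a processed article and return its new id."""
    cur = conn.execute(
        "INSERT INTO articles (url, source, processed_text) VALUES (?, ?, ?)",
        (url, source, processed_text),
    )
    conn.commit()
    return cur.lastrowid


def add_similarity(conn, a, b, score):
    """Store a pair's score once, with the smaller id first."""
    first, second = sorted((a, b))
    conn.execute(
        "INSERT OR REPLACE INTO similarities VALUES (?, ?, ?)",
        (first, second, score),
    )
    conn.commit()


def related(conn, article_id, threshold):
    """Return (other_article_id, score) pairs at or above the threshold."""
    return conn.execute(
        """
        SELECT CASE WHEN article_a = ? THEN article_b ELSE article_a END, score
        FROM similarities
        WHERE (article_a = ? OR article_b = ?) AND score >= ?
        ORDER BY score DESC
        """,
        (article_id, article_id, article_id, threshold),
    ).fetchall()
```

Storing each pair once with the smaller id first keeps the similarity table half the size of a full matrix, at the cost of a slightly longer lookup query.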

 

Result

As a result, our client enhanced the functionality of their application, making it more marketable.

Tatyana Deryugina
Very professional team, good communication.
5/5