The need

The modern news cycle is always accelerating. For publishers, providing substantial, high-quality journalism is a difficult task. Publishers must quickly sort through and locate relevant content of different types from huge archives. The process of scanning such a large variety of data provides a perfect opportunity for the use of Microsoft AI.

The idea

By leveraging the Azure AI knowledge mining solution powered by Azure Cognitive Search, we could apply a specific AI model that identifies the variety of content types and formats in a publisher's archive and, by doing so, makes it easy to extract insights that can speed and scale the journalistic process.

The solution

With the Microsoft News team, we shaped an AI ingestion pipeline that leverages the power of Azure Cognitive Search, object vision, and OCR to analyze page layouts and to extract text, images, and key entities for indexing. We can then intelligently link those entities, giving journalists a powerful knowledge mining tool that helps them quickly find the specific content they need.

Technical details for the Insights and Discovery Accelerator

For brands like The Atlantic, which has been publishing since 1857, there are massive archives of photos, articles, long-form stories, ads, and more that must be searched through.

Working side by side with 4,500 of the world’s most influential and respected news outlets, Microsoft News learned that content search was a universal pain point for journalists. Searching archival news is hampered by several challenges:

  • Tagging: Manual tagging is slow, cumbersome, non-standardized, subjective, and often inaccurate. Would something you search for in the future have been tagged as important when the content was digitized?
  • Media types: Some archives are digital, but much material is still physical, which means thumbing through delicate books and documents that need preservation.
  • Formatting: Magazines vary layout elements such as two- or three-column structure, pull quotes, jump pages, and embedded ads. Templates for any magazine can evolve over time. Forced layouts produce hyphenation that is inappropriate when the text is reflowed into new layouts. A typical OCR scan runs left to right across the page, ignoring column breaks, which causes indexing issues.
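The column and hyphenation problems in the last bullet can be illustrated with a small sketch. It assumes OCR has already produced text lines with pixel positions and that the left edges of the columns are known; the `Line` type, the `column_edges` parameter, and the sample text are all hypothetical and are not taken from the project. The de-hyphenation is deliberately naive: it joins every line-end hyphen, so it would also collapse a genuine compound such as "long-form" that happens to break across lines.

```python
from dataclasses import dataclass


@dataclass
class Line:
    """One OCR text line with the pixel position of its top-left corner."""
    text: str
    left: float
    top: float


def assign_column(line, column_edges):
    """Return the index of the right-most column whose left edge is <= line.left."""
    index = 0
    for i, edge in enumerate(column_edges):
        if line.left >= edge:
            index = i
    return index


def naive_order(lines):
    """Row-by-row order: top to bottom, then left to right (ignores columns)."""
    return sorted(lines, key=lambda l: (l.top, l.left))


def reading_order(lines, column_edges):
    """Column-aware order: finish one column top to bottom before the next."""
    return sorted(lines, key=lambda l: (assign_column(l, column_edges), l.top))


def join_lines(lines):
    """Join lines into one string, merging words split by a line-end hyphen."""
    out = ""
    for line in lines:
        text = line.text.strip()
        if out.endswith("-"):
            out = out[:-1] + text  # naive: always drop the hyphen
        elif out:
            out += " " + text
        else:
            out = text
    return out


# Hypothetical two-column page: column 1 starts at x=0, column 2 at x=300.
page = [
    Line("The archive con-", 50, 100),
    Line("every decade of", 400, 100),
    Line("tains stories from", 50, 120),
    Line("the magazine.", 400, 120),
]

print(join_lines(naive_order(page)))
print(join_lines(reading_order(page, column_edges=[0, 300])))
```

The row-by-row order interleaves the two columns, while the column-aware order keeps each column together so the split word can be rejoined before indexing.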

Without organization, a rich data set can be unwieldy. The solution was to build tools that make it easier to navigate key facets within the data, such as issue, author, and year of publication.

Microsoft News and Unify, a Microsoft Digital Transformation partner, created a solution to help media organizations – like The Atlantic – crack open their archives and easily gain access to valuable insights. Part of that work was to analyze generational changes within the archives with respect to identified core features, splitting the archive into generational subsets. Once these subsets were created, the Insights and Discovery Accelerator labeled approximately 500 articles, randomly sampled from each generational age, to determine the preliminary discriminating characteristics of the document segment labels. It was key to explore the content as it was presented in different layouts.
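Sampling at random from each generational subset is a form of stratified sampling. The sketch below shows one way it could be done; the era boundaries in `ERA_STARTS`, the article dictionaries, and the `per_era` quota are illustrative assumptions, since the source does not say how the generations were delimited or how the roughly 500 labeled articles were divided among them.

```python
import bisect
import random
from collections import defaultdict

# Hypothetical era boundaries (first year of each layout generation).
ERA_STARTS = [1857, 1900, 1950, 2000]


def era_of_year(year):
    """Map a publication year to the start year of its (hypothetical) era."""
    i = bisect.bisect_right(ERA_STARTS, year) - 1
    return ERA_STARTS[max(i, 0)]


def sample_per_era(articles, per_era, seed=0):
    """Randomly pick up to `per_era` articles from each era, reproducibly."""
    rng = random.Random(seed)
    by_era = defaultdict(list)
    for article in articles:
        by_era[era_of_year(article["year"])].append(article)
    return {
        era: rng.sample(items, min(per_era, len(items)))
        for era, items in by_era.items()
    }


# Hypothetical archive: one article per year from 1857 through 2016.
archive = [{"id": i, "year": 1857 + i} for i in range(160)]
to_label = sample_per_era(archive, per_era=20)
for era in sorted(to_label):
    print(era, len(to_label[era]))
```

Fixing the seed keeps the labeling set reproducible, and capping the quota at the size of each subset avoids errors for eras with few surviving articles.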

After the labeling exercise, the documents and their bounding box labels were uploaded to Azure to train an Azure Custom Vision model that identifies document segments, columns, and other boundaries.
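Custom Vision object detection expresses each labeled region as fractions of the image size (left, top, width, height, each between 0 and 1). If the labels were drawn in pixel coordinates, which is an assumption here, they would need converting before upload. The following is a minimal sketch of that conversion; the `NormalizedRegion` type, the function name, and the "column" tag are hypothetical, and no Azure SDK call is made.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NormalizedRegion:
    """A labeled box expressed as fractions of the page width and height."""
    tag: str
    left: float
    top: float
    width: float
    height: float


def to_normalized_region(tag, box, page_width, page_height):
    """Convert a pixel box (x_min, y_min, x_max, y_max) to a normalized region."""
    x_min, y_min, x_max, y_max = box
    inside_x = 0 <= x_min < x_max <= page_width
    inside_y = 0 <= y_min < y_max <= page_height
    if not (inside_x and inside_y):
        raise ValueError(f"box {box} does not fit a {page_width}x{page_height} page")
    return NormalizedRegion(
        tag=tag,
        left=x_min / page_width,
        top=y_min / page_height,
        width=(x_max - x_min) / page_width,
        height=(y_max - y_min) / page_height,
    )


# Hypothetical label: one text column on a 1000 x 2000 pixel page scan.
region = to_normalized_region("column", (100, 200, 500, 1000), 1000, 2000)
print(region)
```

Normalizing the coordinates also makes labels comparable across scans of different resolutions, which matters when page sizes change between generations of the archive.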

This new skill was integrated into an Azure Cognitive Search enrichment pipeline to extract known entities (such as people, places, and dates), display their relationships in an easy-to-navigate data visualization, and power a faceted search experience.
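In Azure Cognitive Search, a custom skill is called over HTTP with a JSON body containing a `values` array, where each record has a `recordId` and a `data` object, and it must return one result per record with the same `recordId` plus `data`, `errors`, and `warnings`. The sketch below shows only that request and response shape. The `detect_segments` function is a hypothetical stand-in for the trained layout model, and the `image` and `segments` field names are assumptions rather than the project's actual schema.

```python
def detect_segments(image_ref):
    """Hypothetical stand-in for the trained layout model; returns fixed output."""
    return [
        {"label": "column", "left": 0.05, "top": 0.10, "width": 0.40, "height": 0.80},
        {"label": "column", "left": 0.55, "top": 0.10, "width": 0.40, "height": 0.80},
    ]


def run_skill(request, detector=detect_segments):
    """Process a custom-skill request body and build the response body."""
    results = []
    for record in request.get("values", []):
        result = {
            "recordId": record["recordId"],  # must echo the incoming id
            "data": {},
            "errors": [],
            "warnings": [],
        }
        try:
            image_ref = record["data"]["image"]  # assumed input field name
            result["data"]["segments"] = detector(image_ref)
        except Exception as exc:
            # Report the failure for this record only; other records continue.
            result["errors"].append({"message": f"Could not process record: {exc!r}"})
        results.append(result)
    return {"values": results}


# Hypothetical request with one good record and one missing its image.
request = {
    "values": [
        {"recordId": "1", "data": {"image": "page-001.png"}},
        {"recordId": "2", "data": {}},
    ]
}
response = run_skill(request)
for value in response["values"]:
    print(value["recordId"], len(value["data"].get("segments", [])), value["errors"])
```

Handling errors per record means a single unreadable page does not fail the whole batch, and the indexer can surface the message for that document.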
