Application samples from AI Lab

Discover innovative applications and learn how to extend your projects with AI.


The need

A common challenge facing publishers today: How can they produce high-quality journalism at a velocity that accommodates an always-accelerating news cycle? Sorting through a rich archive to locate relevant content takes a tremendous amount of time. The complexity is exacerbated by the variety of content types (articles, poetry, photography, advertisements and more) and by magazine formats and colloquial language that have changed over decades.

The idea

By leveraging the Azure AI knowledge mining solution powered by Azure Cognitive Search, we could apply a specific AI model that identifies the variety of content types and formats in a publisher's archive and, by doing so, makes it easy to extract insights that can speed and scale the journalistic process.

The solution

With the Microsoft News team, we shaped an AI ingestion pipeline that leverages the power of Azure Cognitive Search, object vision, and OCR to analyze page layouts and to extract text, images, and key entities for indexing. We can then intelligently link those entities, giving journalists a powerful knowledge mining tool that helps them find the specific content they need quickly.

Technical details for the Insights and Discovery Accelerator

Journalists face an ongoing conundrum: How do they mine decades of content across many kinds of media to extract insights that can be applied to today’s issues, while keeping up with a constantly accelerating news cycle? For brands like The Atlantic, which has been publishing since 1857, there are archives of photos, articles, long-form stories, ads and more.

Working side by side with 4,500 of the world’s most influential and respected news outlets, Microsoft News learned that content search was a universal pain point for journalists. Searching archival news is hampered by several challenges:

  • Tagging: Manual tagging is slow, cumbersome, non-standardized, subjective, and often inaccurate. Would something you search for in the future have been tagged as important when the content was digitized?
  • Media types: Some archives are digital, but much material is still physical, which means thumbing through delicate books and documents that need preservation.
  • Formatting: Magazines vary layout elements such as two- or three-column structure, pull quotes, jump pages, and embedded ads. Templates for any magazine can evolve over time. Forced layouts produce hyphenation that is inappropriate when the text is reflowed into new layouts. A typical OCR scan runs left to right across the page, ignoring column breaks, which causes indexing issues.
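
The column and hyphenation problems in the last bullet can be illustrated with a small sketch. This is not the accelerator's code: the `Line` type, the `column_edges` input (assumed to come from some layout-detection step), and the sample page are all hypothetical, and the hyphen handling is deliberately naive.

```python
from dataclasses import dataclass


@dataclass
class Line:
    text: str
    left: float  # x of the line's left edge, in pixels
    top: float   # y of the line's top edge, in pixels


def column_order(lines, column_edges):
    """Order OCR lines column by column, then top to bottom.

    column_edges is a sorted list of x positions where each column starts
    (a hypothetical input, e.g. produced by a layout-detection model).
    """
    def column_index(line):
        # Index of the last column whose left edge is at or before the line.
        idx = 0
        for i, edge in enumerate(column_edges):
            if line.left >= edge:
                idx = i
        return idx

    return sorted(lines, key=lambda l: (column_index(l), l.top))


def join_lines(lines):
    """Join ordered lines, merging words hyphenated across line breaks."""
    out = ""
    for line in lines:
        text = line.text.strip()
        if out.endswith("-"):
            # Naive assumption: a trailing hyphen is always a line-break hyphen.
            out = out[:-1] + text
        elif out:
            out += " " + text
        else:
            out = text
    return out


# Hypothetical two-column page: three lines in column 1, one in column 2.
page = [
    Line("The archive spans", 40, 100),
    Line("and photography.", 340, 100),
    Line("decades of report-", 40, 120),
    Line("ing, poetry,", 40, 140),
]

# Reading straight across the page mixes the two columns together.
naive = " ".join(l.text for l in sorted(page, key=lambda l: (l.top, l.left)))

# Reading column by column restores the sentence.
ordered = join_lines(column_order(page, column_edges=[0, 300]))
```

Here `naive` interleaves the second column into the middle of the first, while `ordered` reads as one sentence. A production pipeline would need a smarter hyphen rule, since this one would also merge genuine compounds that happen to break at a hyphen.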

Without organization, a rich data set can be unwieldy. The solution was to build tools that could more easily navigate key facets within the data, like issue, author, and year of publication.
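
To make the idea of facets concrete, here is a minimal sketch of facet counting and filtering over in-memory records. The field names and sample documents are invented for illustration; in the described solution this role is played by Azure Cognitive Search, not by code like this.

```python
from collections import Counter


def facet_counts(docs, fields):
    """Count how many documents carry each value of each facet field."""
    counts = {field: Counter() for field in fields}
    for doc in docs:
        for field in fields:
            value = doc.get(field)
            if value is not None:
                counts[field][value] += 1
    return counts


def filter_docs(docs, **selected):
    """Keep only documents that match every selected facet value."""
    return [d for d in docs if all(d.get(k) == v for k, v in selected.items())]


# Hypothetical archive records (not real data).
docs = [
    {"title": "Essay A", "author": "Author One", "year": 1857, "issue": "November"},
    {"title": "Poem B", "author": "Author Two", "year": 1857, "issue": "December"},
    {"title": "Essay C", "author": "Author One", "year": 1901, "issue": "March"},
]

counts = facet_counts(docs, ["author", "year"])
from_1857 = filter_docs(docs, year=1857)
by_author_in_1857 = filter_docs(docs, year=1857, author="Author One")
```

The counts tell a reader how many results sit behind each facet value before they click it, and each selection narrows the result set further.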

Microsoft News and Unify, a Microsoft Digital Transformation partner, created a solution to help media organizations like The Atlantic crack open their archives and easily gain access to valuable insights. Part of that work was to analyze generational changes within the archives with respect to identified core features. Once these subsets were created, the Insights and Discovery Accelerator labeled approximately 500 articles, randomly sampled from each generational age, to determine the preliminary discriminating characteristics of the document segment labels. It was key to explore the content as it was presented in different layouts.

Following the labeling exercise, the documents and bounding box labels were uploaded to Azure to train an Azure Custom Vision model to identify document segments, columns and other boundaries.
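
Preparing bounding box labels for upload usually involves a coordinate conversion. The sketch below assumes that labels were drawn in pixel coordinates and that the training service expects regions normalized to the 0 to 1 range as left, top, width, and height. That convention is an assumption to check against the Custom Vision documentation, and the function name and sample values are hypothetical.

```python
def to_normalized_region(box, image_width, image_height):
    """Convert a pixel box (x_min, y_min, x_max, y_max) to a normalized region.

    Returns left/top/width/height as fractions of the image size
    (assumed target convention; verify against the service you upload to).
    """
    x_min, y_min, x_max, y_max = box
    inside = 0 <= x_min < x_max <= image_width and 0 <= y_min < y_max <= image_height
    if not inside:
        raise ValueError(f"box {box} does not fit a {image_width}x{image_height} image")
    return {
        "left": x_min / image_width,
        "top": y_min / image_height,
        "width": (x_max - x_min) / image_width,
        "height": (y_max - y_min) / image_height,
    }


# Hypothetical label: one text column on a 1000 x 2000 pixel page scan.
region = to_normalized_region((100, 200, 500, 1000), image_width=1000, image_height=2000)
```

Normalized regions stay valid when a scan is resized, which helps when archive pages were digitized at different resolutions.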

This new skill was integrated into an Azure Cognitive Search enrichment pipeline to extract known entities (such as people, places, and dates) and display their relationships in an easy-to-navigate data visualization, as well as power a faceted search experience.
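
The source does not say how entity relationships are computed. One simple stand-in technique is co-occurrence counting, where two entities are linked whenever they appear in the same document. The sketch below shows that approach under stated assumptions: the function name, document IDs, and entity lists are made up for illustration.

```python
from collections import Counter
from itertools import combinations


def link_entities(docs):
    """Build weighted entity links from per-document entity lists.

    docs maps a document ID to the entities extracted from it. Each pair
    of distinct entities in the same document adds 1 to that pair's weight.
    """
    edges = Counter()
    for entities in docs.values():
        # Deduplicate and sort so each pair has one canonical ordering.
        for a, b in combinations(sorted(set(entities)), 2):
            edges[(a, b)] += 1
    return edges


# Hypothetical extraction output (not real pipeline data).
docs = {
    "doc1": ["Abraham Lincoln", "Washington"],
    "doc2": ["Washington", "Abraham Lincoln", "1863"],
    "doc3": ["1863"],
}

edges = link_entities(docs)
strongest = edges.most_common(1)[0]
```

The resulting weighted edges are the kind of input a relationship graph visualization could draw from, with heavier edges for entities that appear together more often.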

