Insights and Discovery Accelerator
Unlock insights and discover intelligent connections in unstructured data. The Insights and Discovery Accelerator uses Microsoft AI to speed and scale investigations and research by helping journalists find critical content in a sea of data.
The need
The modern news cycle is always accelerating. For publishers, providing substantial, high-quality journalism is a difficult task. Publishers must quickly sort through and locate relevant content of different types from huge archives. The process of scanning such a large variety of data provides a perfect opportunity for the use of Microsoft AI.
The idea
By leveraging the Azure AI knowledge mining solution powered by Azure Cognitive Search, we could apply a specific AI model that identifies the variety of content types and formats in a publisher's archive, making it easy to extract insights that speed and scale the journalistic process.
The solution
With the Microsoft News team, we shaped an AI ingestion pipeline that leverages the power of Azure Cognitive Search, object vision, and OCR to analyze page layouts and to extract text, images, and key entities for indexing. We can then intelligently link those entities, giving journalists a powerful knowledge mining tool that helps them find the specific content they need quickly.
Finding insights hidden in 160+ years of archives
Azure Cognitive Search ingests a variety of file formats and then applies custom AI models, OCR, entity extraction, and document classification. We can achieve incredible accuracy in identifying and labeling document segments, entities, and other key components like author, page number, issue, and more.
Technical details for the Insights and Discovery Accelerator
For brands like The Atlantic, which has been publishing since 1857, there are massive archives of photos, articles, long-form stories, ads, and more that must be searched.
Working side-by-side with 4,500 of the world’s most influential and respected news outlets, Microsoft News learned that content search was a universal pain point for journalists. Searching archival news is hampered by several challenges:
- Tagging: Manual tagging is slow, cumbersome, non-standardized, subjective, and often inaccurate. Would something you search for in the future have been tagged as important when the content was digitized?
- Media types: Some archives are digital, but much material is still physical, which means thumbing through delicate books and documents that need preservation.
- Formatting: Magazines vary layout elements such as two- or three-column structure, pull quotes, jump pages, and embedded ads. Templates for any magazine can evolve over time. Forced layouts produce hyphenation that is inappropriate when the text is reflowed into new layouts. A typical OCR scan runs left to right across the page, ignoring column breaks, which causes indexing issues.
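The column-break problem in the last bullet can be illustrated with a minimal sketch. It assumes OCR has already produced one bounding box per word and that the page has a known number of equal-width columns; the `Word` class, the `page` data, and both function names are hypothetical. The actual solution used a trained model to find column boundaries, so this only shows why reading order matters.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    text: str
    x: float  # left edge of the word's bounding box, in pixels
    y: float  # top edge of the word's bounding box, in pixels


def naive_order(words: List[Word]) -> List[str]:
    """Read strictly left to right, top to bottom, ignoring columns."""
    return [w.text for w in sorted(words, key=lambda w: (w.y, w.x))]


def reading_order(words: List[Word], page_width: float, columns: int) -> List[str]:
    """Read column by column, top to bottom within each column.

    Assumption: columns are equal-width vertical strips of the page.
    """
    column_width = page_width / columns

    def sort_key(word: Word):
        # Clamp so a word touching the right margin stays in the last column.
        column = min(int(word.x // column_width), columns - 1)
        return (column, word.y, word.x)

    return [w.text for w in sorted(words, key=sort_key)]


# Hypothetical two-column page, 200 pixels wide.
page = [
    Word("The", 10, 10), Word("archive", 50, 10),
    Word("Ads", 110, 10), Word("appear", 150, 10),
    Word("spans", 10, 30), Word("decades.", 50, 30),
    Word("here.", 110, 30),
]

print(" ".join(naive_order(page)))
# The archive Ads appear spans decades. here.
print(" ".join(reading_order(page, page_width=200, columns=2)))
# The archive spans decades. Ads appear here.
```

The naive ordering interleaves the two columns, so an index built from it would contain sentences that never appeared on the page.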
Without organization, a rich data set can be unwieldy. The solution was to build tools that could more easily navigate key facets within the data, like issue, author, and year of publication.
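As a rough illustration of what navigating by facets means, the sketch below counts how many documents fall under each value of a field and then narrows the set by a selected value. It runs over an in-memory list rather than a search service, and the sample records, field names, and function names are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List


def facet_counts(docs: Iterable[dict], fields: List[str]) -> Dict[str, Counter]:
    """Count documents per distinct value for each requested field."""
    counts = {field: Counter() for field in fields}
    for doc in docs:
        for field in fields:
            value = doc.get(field)
            if value is not None:
                counts[field][value] += 1
    return counts


def filter_by(docs: Iterable[dict], **selected) -> List[dict]:
    """Keep only documents matching every selected facet value."""
    return [d for d in docs if all(d.get(k) == v for k, v in selected.items())]


# Hypothetical records standing in for indexed articles.
docs = [
    {"title": "A", "author": "Smith", "year": 1901, "issue": "May"},
    {"title": "B", "author": "Smith", "year": 1902, "issue": "June"},
    {"title": "C", "author": "Jones", "year": 1901, "issue": "May"},
]

counts = facet_counts(docs, ["author", "year"])
print(counts["author"].most_common())  # [('Smith', 2), ('Jones', 1)]
print([d["title"] for d in filter_by(docs, author="Smith", year=1901)])  # ['A']
```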
Microsoft News and Unify, a Microsoft Digital Transformation partner, created a solution to help media organizations like The Atlantic crack open their archives and easily gain access to valuable insights. Part of that work was to analyze generational changes within the archives with respect to identified core features, grouping the archives into generational subsets. Once these subsets were created, the Insights and Discovery Accelerator labeled approximately 500 articles, randomly sampled from each generational age, to determine the preliminary discriminating characteristics of the document segment labels. It was key to explore the content as it was presented in different layouts.
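Sampling at random within each generational subset is a form of stratified sampling. A minimal sketch follows, assuming each article record carries a field naming its subset; the `era` field, the placeholder era names, and the `sample_per_group` function are hypothetical, and the sample size is left as a parameter because the text does not say exactly how the roughly 500 articles were distributed.

```python
import random
from collections import defaultdict
from typing import Dict, List


def sample_per_group(articles: List[dict], group_key: str,
                     per_group: int, seed: int = 0) -> Dict[str, List[dict]]:
    """Draw up to `per_group` articles at random, without replacement, from each group."""
    rng = random.Random(seed)  # fixed seed so the labeling set is reproducible
    groups = defaultdict(list)
    for article in articles:
        groups[article[group_key]].append(article)

    sample = {}
    for name, members in groups.items():
        # A small group contributes everything it has.
        k = min(per_group, len(members))
        sample[name] = rng.sample(members, k)
    return sample


# Placeholder data: two subsets of different sizes.
articles = (
    [{"id": i, "era": "era_a"} for i in range(1200)]
    + [{"id": 1200 + i, "era": "era_b"} for i in range(300)]
)

picked = sample_per_group(articles, group_key="era", per_group=500)
print({era: len(items) for era, items in picked.items()})
# {'era_a': 500, 'era_b': 300}
```

Sampling within each subset, rather than across the whole archive, keeps older or smaller layout generations from being crowded out of the labeled set.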
Following the labeling exercise, the documents and bounding box labels were uploaded to Azure to train an Azure Custom Vision model to identify document segments, columns, and other boundaries.
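Custom Vision object detection expects each labeled region as fractions of the image width and height rather than as pixel values, so pixel-space labels usually need converting before upload. The sketch below shows only that conversion and makes no service calls; the `PixelBox` class, the `"column"` label, and the output dictionary keys are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Union


@dataclass
class PixelBox:
    label: str   # e.g. a document segment type
    left: int    # pixels from the left edge of the scan
    top: int     # pixels from the top edge of the scan
    width: int
    height: int


def normalize(box: PixelBox, image_width: int, image_height: int) -> Dict[str, Union[str, float]]:
    """Convert a pixel bounding box to coordinates in the range 0 to 1."""
    if image_width <= 0 or image_height <= 0:
        raise ValueError("image dimensions must be positive")
    if box.left < 0 or box.top < 0 or box.width <= 0 or box.height <= 0:
        raise ValueError("box must have non-negative origin and positive size")
    if box.left + box.width > image_width or box.top + box.height > image_height:
        raise ValueError("box extends beyond the image")
    return {
        "tag": box.label,
        "left": box.left / image_width,
        "top": box.top / image_height,
        "width": box.width / image_width,
        "height": box.height / image_height,
    }


# Hypothetical label on a 1000 x 2000 pixel page scan.
region = normalize(PixelBox("column", 100, 200, 400, 600), 1000, 2000)
print(region)
# {'tag': 'column', 'left': 0.1, 'top': 0.1, 'width': 0.4, 'height': 0.3}
```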
This new skill was integrated into an Azure Cognitive Search enrichment pipeline to extract known entities (such as people, places, and dates) and display their relationships in an easy-to-navigate data visualization, as well as to power a faceted search experience.
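The text does not say how the skill was hosted. One common route is a custom Web API skill, where the enrichment pipeline posts a batch of records under a `values` array and expects one result per `recordId` in the reply. The sketch below handles that envelope in plain Python with no web framework; `run_skill` and `count_columns` are hypothetical, and `count_columns` is a stand-in for the trained segment model, not the real one.

```python
from typing import Any, Callable, Dict


def count_columns(data: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical enrichment: count detected regions tagged as columns."""
    boxes = data.get("boxes", [])
    return {"columnCount": sum(1 for b in boxes if b.get("tag") == "column")}


def run_skill(payload: Dict[str, Any],
              enrich: Callable[[Dict[str, Any]], Dict[str, Any]]) -> Dict[str, Any]:
    """Apply `enrich` to each record and return the custom-skill response envelope."""
    results = []
    for record in payload.get("values", []):
        result = {"recordId": record["recordId"], "data": {}, "errors": [], "warnings": []}
        try:
            result["data"] = enrich(record.get("data", {}))
        except Exception as exc:
            # Report the failure for this record only; other records still succeed.
            result["errors"].append({"message": str(exc)})
        results.append(result)
    return {"values": results}


# Hypothetical request: the second record carries malformed input.
request = {
    "values": [
        {"recordId": "1", "data": {"boxes": [{"tag": "column"}, {"tag": "ad"}, {"tag": "column"}]}},
        {"recordId": "2", "data": {"boxes": None}},
    ]
}

response = run_skill(request, count_columns)
print(response["values"][0]["data"])           # {'columnCount': 2}
print(len(response["values"][1]["errors"]))    # 1
```

Catching errors per record matters here because one unreadable page should not fail the whole batch.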
Resources:

JFK Files
JFK Files takes 34,000 complex files including photos, handwriting, government documents, and more, then extracts readable information. This knowledge is organized to enable new ways to explore the information.

Machine Reading
Machine Reading Comprehension (MRC) answers questions about written text. Using a neural network, MRC mimics the process of human readers. Ask a question and MRC reads a document until an answer is formed.

Responsible Conversational AI
Conversational AI is a new way for companies to interact with their customers across any channel, like digital assistants, chat or social media. To be effective, conversational bots need to be developed in a way that earns people’s trust.

Gen Studio
Gen Studio is a prototype created with collaborators from The Metropolitan Museum of Art, Microsoft, and MIT. Gen Studio uses AI to visually navigate The Met’s art collection.
Explore the possibilities of AI
Jumpstart your own AI innovations with learning resources and development solutions from Microsoft AI.

Innovation Developer Hub
Explore insights and behind-the-scenes technology for breakthrough AI innovations. From Tech Minutes videos to Technology Deep Dives, learn about the engineering that powers the future of AI.

AI School
Learn to create your own AI experiences with learning paths in conversational AI, machine learning, AI for devices, cognitive services, autonomous systems, AI strategy, and more.

AI platform
Start building AI solutions with powerful tools and services. Microsoft AI is a robust framework for developing AI solutions in conversational AI, machine learning, data sciences, robotics, IoT, and more.