Having seen the success of the Times Digital Archive product, used primarily in education and libraries, the Times Online team led by Anne Spackman, Editor-in-Chief for Times Online, embarked on discussions about the feasibility and commercial potential of presenting the entire Times archive online. With enthusiastic backing from senior management, the team needed to find an Enterprise Search solution with the right scalability, proven stability, and technical sophistication to deliver its vision. The team chose to implement the FAST Enterprise Search platform.
Times Online Archive: history as it happened, online.
The Times Online team had experienced FAST’s ability to deliver both on its ground-breaking Travel Channel and for its award-winning news site. The FAST Enterprise Search platform was therefore the natural choice for the archive project. The media expertise within the FAST consultancy team had also proven valuable, offering industry best-practice suggestions on indexing techniques and search strategies that would be critical to the efficient delivery of the archive project.
After a soft launch in early 2008, Times Online Archive was launched on 14 June 2008. The bulk of the FAST implementation took six months, but the project as a whole, including the construction of a new site, took nearly a year. The archive has achieved its initial objective of presenting news from 1785 to 1985 online: 20 million articles, pictures, and advertisements, all reproduced exactly as they appeared in the original newspaper edition. The material ranges from William Howard Russell’s groundbreaking coverage of the Crimean War to letters to the Editor from figures such as Karl Marx and Benito Mussolini.
For an introductory period the use of the archive is free, but users must register their personal details, which are valuable to the commercial team analyzing the demographics of the user base.
Behind the story: Digital scanning brings 200 years into view for Search
Optical Character Recognition (OCR) technology was used to scan in every page of the newspaper from 1785 to 1985. The archive team wanted to present images of the actual pages and not just plain text; however, this method of digitizing content has certain limitations. FAST engaged with the Times Online team to strategize ways to minimize the negative impact on search quality.
The OCR process converts each article into an XML format that can be fed automatically through the FAST file traverser into the FAST document processing pipeline. At this point, words are checked against predefined dictionaries and tagged when a match is found.
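The dictionary check can be pictured with a minimal Python sketch. It is not the FAST pipeline itself: the `DICTIONARIES` table, its terms, and the `tag_article` function are hypothetical names chosen for illustration, and the matching is a simple word lookup.

```python
# Hypothetical dictionaries: each tag maps to the terms that trigger it.
# These entries are illustrative and do not come from the real archive.
DICTIONARIES = {
    "War and Revolution": {"crimea", "sebastopol", "balaklava"},
    "Politics": {"parliament", "ministry"},
}


def tag_article(text):
    """Return the sorted tags whose dictionary shares a word with the text."""
    # Normalize each word: drop surrounding punctuation and ignore case.
    words = {w.strip(".,;:!?\"'").casefold() for w in text.split()}
    return sorted(tag for tag, terms in DICTIONARIES.items() if terms & words)
```

Calling `tag_article("The siege of Sebastopol continues.")` would tag the article with the one category whose dictionary contains a matching word.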
“The flexibility of the FAST product enabled us to complete the indexation process efficiently and respond to issues such as proper nouns spelt three different ways within an article, due to the OCR conversion,” says Drew Broomhall, Search Editor at Times Online.
The OCR process also records ‘X’ and ‘Y’ coordinates for each word, marking where it appears on the page. When a user searches for a term, the system returns all articles that contain that term and highlights its position within each article.
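As a rough illustration of coordinate-based highlighting, the following Python sketch assumes each OCR word is stored together with its position on the page image. The `OcrWord` type, its fields, the pixel units, and `highlight_boxes` are all hypothetical and are not taken from the FAST product or the archive’s actual data format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OcrWord:
    """One word from the OCR output (hypothetical structure)."""
    text: str
    x: int       # left edge on the page image, in pixels (assumed unit)
    y: int       # top edge on the page image, in pixels (assumed unit)
    width: int
    height: int


def highlight_boxes(words, term):
    """Return (x, y, width, height) boxes for every word matching the term."""
    needle = term.casefold()
    return [
        (w.x, w.y, w.width, w.height)
        for w in words
        # Ignore case and surrounding punctuation when comparing.
        if w.text.strip(".,;:!?\"'").casefold() == needle
    ]


# Made-up sample data for one article.
article = [
    OcrWord("The", 40, 100, 30, 12),
    OcrWord("Crimean", 75, 100, 62, 12),
    OcrWord("War,", 142, 100, 36, 12),
    OcrWord("war", 40, 118, 28, 12),
]
```

A page viewer could then draw a rectangle over the scanned image for each box returned by `highlight_boxes(article, "war")`.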
Relevance models determine the order of search results. When considering this part of the project, the archive team discovered that it needed a different ranking strategy from the one that would apply to contemporary newspapers.
“We found that in many archive articles, the nature of the content was not even mentioned in the headline and so used the FAST ranking module to build a fairly flat ranking profile where the headline and body of the text are roughly equivalent. We plan to evolve this over time using search intelligence to refine relevance,” says Drew Broomhall.
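The effect of a flat ranking profile can be sketched in Python. This is a simplified term-count model, not the FAST ranking module; the weight values, field names, and sample articles are assumptions made for illustration only.

```python
def field_score(text, term):
    """Count case-insensitive whole-word occurrences of the term in a field."""
    return text.casefold().split().count(term.casefold())


def rank_score(article, term, weights):
    """Weighted sum of per-field term counts."""
    return sum(
        weight * field_score(article.get(field, ""), term)
        for field, weight in weights.items()
    )


# Illustrative weights only; the source gives no actual values.
HEADLINE_HEAVY = {"headline": 5.0, "body": 1.0}  # favours headline matches
FLAT_PROFILE = {"headline": 1.0, "body": 1.0}    # headline and body roughly equal

# Made-up articles: "b" never names its subject in the headline.
articles = [
    {"id": "a", "headline": "Cholera in the capital",
     "body": "Reports of cholera spread."},
    {"id": "b", "headline": "Latest intelligence",
     "body": "cholera cases rose and cholera wards filled as cholera spread"},
]


def ranked_ids(weights, term="cholera"):
    """Return article ids ordered from highest to lowest score."""
    ordered = sorted(
        articles, key=lambda a: rank_score(a, term, weights), reverse=True
    )
    return [a["id"] for a in ordered]
```

Under the headline-heavy weights, article "a" ranks first; under the flat profile, article "b" moves ahead because its body discusses the subject more, even though its headline does not mention it.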
Benefits: A robust, scalable search platform
FAST has provided a search platform that not only indexes and searches The Times archive of 20 million articles, but also consistently delivers a high-quality query response within a two-second performance target.
“We wanted to set the gold standard for an online newspaper archive for arguably the most famous newspaper in the world. Providing an accurate, multimedia search platform to enable a project of this scale and sophistication was key to a high quality user experience.”
Editor-in-Chief, Times Online
The culture of the daily newspaper industry requires that changes can be made within the time it takes to put out tomorrow’s edition. FAST has met this requirement, allowing a complete reindex of the archive within 24 hours to support either editorial changes or the evolving business model.
Categorizing content to enhance the customer experience
In order to help users discover the richness and variety of The Times archive content, the FAST data dictionaries were customized to create 300 topic categories. For example, under the category of “War and Revolution”, a user can further refine their search by individual conflict. Such automated topics were created during the indexing process, by tagging associated content within the FAST dictionaries.
Using a lightweight rule system and Boolean logic, this search refinement has allowed the archive team to categorize content in areas that support the U.K. schools’ National Curriculum or areas where the editorial team felt that The Times coverage had historically been very strong.
Drew Broomhall explains an example of this in practice. “We wanted a category of ‘Angry Young Men’, but when I performed an initial search using this term, only two articles were returned. With FAST it was straightforward to customize the search query, enabling us to add playwrights’ names such as John Osborne or Harold Pinter to the topic query and constrain the result set using a date range.”
This approach also allows general terms such as “global warming” to be treated in context, as a single specific phrase rather than two separate words.
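A topic rule of this kind, combining a Boolean OR over several terms with a date constraint, might look like the following Python sketch. The `TopicRule` class is hypothetical and does not reflect FAST’s rule syntax. The terms come from the example above, while the date window is an assumed value because the source does not state one.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class TopicRule:
    """A hypothetical topic definition: any-of terms plus a date window."""
    name: str
    any_terms: tuple
    start: date
    end: date

    def matches(self, text, published):
        """True if the article is in the window AND contains any term."""
        in_range = self.start <= published <= self.end
        lowered = text.casefold()
        has_term = any(term.casefold() in lowered for term in self.any_terms)
        return in_range and has_term


# The 1955 to 1965 window is an assumption made for this sketch.
ANGRY_YOUNG_MEN = TopicRule(
    name="Angry Young Men",
    any_terms=("angry young men", "John Osborne", "Harold Pinter"),
    start=date(1955, 1, 1),
    end=date(1965, 12, 31),
)
```

An article that mentions Harold Pinter in 1960 would fall into the topic, while one that mentions him in 1984 would be excluded by the date range.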
Customer experience that meets the needs of all users
James Harding, Editor of The Times, said: “The Times Archive is a remarkable resource. It is a treasure trove for anyone interested in history, whether amateur or professional. From the personal and domestic, to unforgettable events on the world stage; from the French Revolution to the Falklands war, it can all be searched for using the online Times Archive.” The archive team anticipates significant interest from those researching genealogy. The archive can be used to search for family history through the births, deaths, and obituaries pages; by searching references to family names; or to see what events happened on the day that family members were born.
Search analytics to support future refinement and development
As the archive project is still evolving, the team has significant plans for development and refinement. Spellings of names and words have evolved over time and even the use of certain letters has changed, such as Y to V. The search development team intends to review search terms and results, building a synonyms dictionary to offer alternative spellings and ensure that the user discovers the correct content.
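A synonyms dictionary of this sort could work along the lines of the Python sketch below, which expands each query word into its known spelling variants before matching. The `SPELLING_VARIANTS` table and both functions are hypothetical; the three entries are familiar older British spellings chosen as examples, not terms taken from the team’s actual dictionary.

```python
# Hypothetical variant table; a real one would be built from search logs.
SPELLING_VARIANTS = {
    "show": {"shew"},
    "jail": {"gaol"},
    "connection": {"connexion"},
}


def expand_query(query):
    """Return, for each query word, the set of spellings to search for."""
    expanded = []
    for word in query.casefold().split():
        variants = {word} | SPELLING_VARIANTS.get(word, set())
        expanded.append(variants)
    return expanded


def matches(text, query):
    """True if every query word, or one of its variants, appears in the text."""
    tokens = set(text.casefold().split())
    return all(variants & tokens for variants in expand_query(query))
```

With this table, a search for "jail" would also find an article that only uses the older spelling "gaol".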
Query volumes will reveal the most popular content, which can then be promoted on the home page or shared with advertisers as part of the commercial team’s monetization strategy.
As the site builds its audience, the team anticipates using FAST to develop further specific products. One example under consideration is a separate searchable photography library, created by adding metadata to each image.
This project demonstrates the flexibility of the FAST platform to meet a range of different search applications within the News International group, maximizing the value of content to attract new users and create a host of monetization opportunities.