Microsoft Innovation Lab in Cairo (CMIC) 
Improving Services for Digitized Books
By: Kareem Darwish
Mostafa El Baradi

Abstract

Although digitized books are searchable, thanks to OCR and search technology, the use of digitized books continues to be unattractive to users. Perhaps search is not enough and other services are required to make digitized books more accessible and useful. This paper identifies and proposes specific services that are likely to improve accessibility and usability of digitized books. The services cover the areas of visualization, browsing, and recommendation.

Introduction

Since the advent of the printing press in the fifteenth century, the amount of printed text has grown overwhelmingly. This led to the existence of large collections of legacy books available only in print. Many recent initiatives, such as the Million Book Project, have focused on digitizing large repositories of legacy books (Barret et al., 2004; Lin, 2006; Simske and Lin, 2004). Such initiatives have been successful in digitizing millions of books in a variety of languages, including Arabic. Although much of the initial focus has been on the preservation and archiving of these books, the focus has steadily grown to include content-related services pertaining to how users can access and use the digitized content. The most prominent of these services has been search, enabled by optical character recognition (OCR), which transforms document images into text. Search allows users to find books of interest and perhaps pages or paragraphs of interest. Despite the existence of search, digitized books continue to be unattractive to many users. Other services are still required to enable the effective use of digitized books and to make books more accessible and alluring to users.

In this paper, we identify possible services being explored at CMIC that would improve the accessibility and usability of digitized books. The services revolve around visualization, navigation, and recommendation, which are covered in sections 2, 3, and 4 respectively.

Visualization

We believe that viewing digitized books is unattractive to users because of many factors, including: the sheer size of a book; access time, especially if a user is required to download an entire book; and formatting, which is inflexible and hence makes poor use of the screen’s viewable area. The following are possible ways to overcome these problems.

Book Size:

There are different book usage scenarios, including reading cover to cover, which is typical of fiction books, and selective reading, which is typical of non-fiction books or reference material. Barring the case where a user is interested in reading a book cover to cover, the size of books as single documents can be daunting to users. Unlike a physical book, which a user can flip through easily, a digitized book is a sequence of typically hundreds of images, of which the user can usually see one or two at a time. Flipping through it with “next page” and “previous page” navigational buttons that move too slowly, or with a slider bar that moves too fast, is inconvenient. Also, as the user traverses the book (or moves across books), it is easy to lose track of where (s)he is. Although little can be done about the size of a book, what is presented to the user can be significantly smaller than the whole book, which would greatly improve the flip-through process. Such a reduction in displayed material can come in several forms, including:

  1. Presenting sections or chapters rather than entire books. Such a presentation may require other aids to help the user maintain mental context, such as a table of contents or a chapter and section list displayed beside the subsection of the book.

  2. Presenting the first page of every chapter or section in a book with a facility to expand and see the rest of the chapter or section on demand.

  3. Presenting chapter or section summaries that again can be expanded to see the rest of the chapter or section. These summaries can be presented in the form of selected snippets or possibly key phrases in the form of tag clouds.

  4. Using an image viewer that can display many more pages of a book at once, with a facility to easily zoom in and out of pages. A technology such as SeaDragon can be very effective in allowing a user to “flip” through hundreds of digitized pages. In SeaDragon, hundreds of images can be present on the screen as thumbnails, and as the user zooms into any of them, the images are fetched at a resolution that matches the user’s screen, avoiding the delays associated with loading entire document images.

Access Time

Access time is a particularly important problem, especially for books that are stored as image files, which are typically large in size (hundreds of kilobytes per image), or as single files in formats such as PDF or DjVu, which would warrant the download of an entire book when perhaps just a few pages are needed. Some of the ways to deal with this problem include:

  1. Improving the compression of digitized pages. DjVu offers very high compression rates via the separation of the black and white foreground text from the usually colored background.

  2. Pre-fetching digitized pages that a user is likely to view. Guessing what the user is likely to view next can be done using heuristics or using more advanced machine learning techniques based on user activity data.

  3. Fetching images in a resolution that matches the resolution of the user’s screen. A technology such as SeaDragon does precisely that.

  4. Fetching individual pages instead of entire books. This would require the creation or utilization of image formats that apply technology such as DjVu to single images, and may require the development of a new viewer.
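Items 2 and 3 above can be illustrated with a minimal sketch. It assumes each page image is stored as a pyramid in which every level halves the width of the previous one, and it uses a simple reading-direction heuristic for pre-fetching. The function names and the pyramid layout are hypothetical; this is not the SeaDragon API, only an illustration of the idea.

```python
def pyramid_level(full_width_px, viewport_width_px):
    """Pick the coarsest pyramid level that still covers the viewport.

    Assumption: level 0 is the full-resolution image and each further
    level halves the width (hypothetical storage layout).
    """
    if viewport_width_px <= 0:
        raise ValueError("viewport width must be positive")
    level = 0
    # Keep halving while the next coarser level is still wide enough.
    while full_width_px // (2 ** (level + 1)) >= viewport_width_px:
        level += 1
    return level


def pages_to_prefetch(current_page, last_page, total_pages, window=2):
    """Guess the pages the user will view next from the reading direction.

    A heuristic only: if the user moved forward (or stayed), fetch the
    following pages; if the user moved backward, fetch the preceding ones.
    Pages are numbered from 1 to total_pages.
    """
    step = 1 if current_page >= last_page else -1
    candidates = [current_page + step * i for i in range(1, window + 1)]
    return [p for p in candidates if 1 <= p <= total_pages]
```

For example, under these assumptions a 4000-pixel-wide scan shown in a 1000-pixel-wide viewport would be served from level 2, a quarter of the original width. A learned model based on user activity data could replace the direction heuristic without changing the interface.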

Formatting

The formatting or re-formatting of a book to fit the desired view of the user and to make use of the available screen space can greatly improve book viewing. One way to improve formatting is to perform the so-called book reflow (Goodwin et al., 2006). Reflow involves accurately identifying the bounding boxes of words and then allowing these boxes to move freely on a page. Therefore, as a user changes the aspect ratio of a page, the lines will be reconstructed by reflowing the words (or their boxes) without doing any zooming. This can enable a user to dramatically change the formatting of a book, which would make it more readable.
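As a rough illustration of reflow, the sketch below takes the widths of word bounding boxes in reading order and greedily repacks them into lines of a given width. It assumes the boxes have already been detected and that only the width matters; the function name and the fixed inter-word gap are hypothetical simplifications, not the method of Goodwin et al.

```python
from typing import List


def reflow(word_widths: List[int], line_width: int, space: int = 1) -> List[List[int]]:
    """Greedily repack word boxes into lines no wider than line_width.

    word_widths: widths of the word bounding boxes in reading order.
    Returns a list of lines, each a list of word indices.
    A word wider than line_width is placed alone on its own line.
    """
    lines: List[List[int]] = []
    current: List[int] = []
    used = 0  # width consumed on the current line
    for i, w in enumerate(word_widths):
        needed = w if not current else used + space + w
        if current and needed > line_width:
            # The word does not fit: close the line and start a new one.
            lines.append(current)
            current, used = [i], w
        else:
            current.append(i)
            used = needed
    if current:
        lines.append(current)
    return lines
```

Calling `reflow` again with a different `line_width` corresponds to the user changing the aspect ratio of the page: the word images keep their size and only the line breaks move.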

Navigation

One of the fundamental problems with digitized books is that they exist as a long series of digitized pages with little to connect the different pages together. Creating links within a book or across books to make them browsable in a non-linear way can greatly improve the user experience. We suggest the following to improve navigation:

  1. Automatically identifying table of contents (TOC) and index pages and linking their entries to the relevant pages. There are a variety of proprietary systems that perform TOC and index page identification and linking using English OCR output with less than 1% character error rate. Much of the previous work on TOC page identification and linking used heuristics involving page matching and exploitation of page numbering (He et al., 2004; Lin, 2003; Lin et al., 1997; Mandal et al., 2003) or some form of machine learning (Satoh et al., 1995).

  2. Constructing tag clouds from the keywords and key phrases in a book, a chapter, or a section to enable users to jump to topics of interest quickly without having to flip through the images sequentially. There are many methods for keyword and key phrase extraction, such as those of (El-Beltagy, 2006; Medelyan and Witten, 2006; Turney, 1999). However, special care is required when extracting keywords and key phrases from OCR output, because OCR often introduces errors and hyphenated words are often split between two lines (and possibly between two pages). Other pieces of evidence, such as the position of a word sequence on the page and its size relative to surrounding words, can further improve keyword and key phrase extraction.

  3. Linking keywords or key phrases in a book to other books or to external resources such as Wikipedia, dictionaries, or online search engines. This can provide context for what the user is reading. A closely related, though not identical, task is the Link-the-Wiki track in the INEX evaluation, which aims at finding incoming and outgoing links for isolated pages (site LTW-INEX).

  4. Showing similar books as a user is browsing through books. More on this is addressed in the next section.
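The hyphenation issue raised in item 2 can be sketched as follows. The example rejoins words split across OCR line breaks and then counts term frequencies as tag-cloud weights. It is a deliberately simple stand-in for the keyphrase extractors cited above: the stopword list, the minimum word length, and the function names are assumptions made for illustration, and raw frequency is not what those systems use.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not a real resource).
STOPWORDS = {"the", "of", "and", "a", "in", "to", "is"}


def join_hyphenated(lines):
    """Join OCR lines, merging words split by a trailing hyphen.

    Caveat: a genuine compound that happens to break at its hyphen
    would be merged too; telling the cases apart needs a lexicon.
    """
    text = ""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if text.endswith("-"):
            text = text[:-1] + line  # drop the hyphen and glue the halves
        elif text:
            text += " " + line
        else:
            text = line
    return text


def tag_weights(lines, top_n=5):
    """Return (word, count) pairs to size the entries of a tag cloud."""
    words = re.findall(r"[a-z]+", join_hyphenated(lines).lower())
    counts = Counter(w for w in words if w not in STOPWORDS and len(w) > 2)
    return counts.most_common(top_n)
```

The same joining step would have to be applied across page boundaries, and the weights could be adjusted using the positional and font-size evidence mentioned in item 2.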

Recommendation

Recommendation services for digitized content have demonstrated significant value in enhancing the user experience across domains (e.g. multimedia, news, and shopping) by providing a convenient means for the non-linear exploration of related content. In the context of digitized books, recommendations can be issued on the basis of metadata, content, or patterns of user activity, at either the book or the segment level. Elements of collaborative intelligence may further contribute to an effective system by using user ratings to help rank results. Adomavicius and Tuzhilin (2005) offer a fairly good survey of current state-of-the-art recommendation technology. We suggest the following to allow for a more comprehensive browsing experience in which the user is afforded the opportunity to stumble upon books of interest that may otherwise have been overlooked:

  1. Issuing recommendations based on content. This type of recommendation can be based on:
    1. Matching metadata associated with a book, such as titles, author names, publication years, major subjects, and user annotations. Amazon.com uses this type of recommendation (among other types) (Linden et al., 2003) to recommend other books by the same author and other books covering the same topics.

    2. Employing NLP tools for the analysis and matching of content, e.g. key phrases or even writing styles. An example system is the one reported on by Mooney and Roy that uses information extraction and text categorization to make content-based recommendations. Another innovative system is the so-called BookLamp system, which analyzes the text of a book for the pace of events and the density of adjectives as indicators of writing style.

    3. Leveraging user profiling technologies to detect and learn from user activity patterns, in order to analyze and predict users’ interests and issue recommendations on that basis (Middleton et al., 2004). Again, such recommendations can be based on metadata or on other features automatically extracted from content.

  2. Employing collaborative recommendation, which compares a user’s profile to the profiles of other users. This is perhaps the most important way that Amazon.com makes recommendations, by providing a list of books that other users have bought alongside the book being viewed (Herlocker et al., 2004).

  3. Constructing social networks that connect users with similar interests to expand a user’s exploration horizon in a visually appealing way. Similarly, the modeling of these networks based on relationships between books could present itself as an intuitive method for the navigation of recommendations (Basu et al., 1998).

  4. Incorporating explicit user recommendation by allowing users to rate or tag books as interesting or good reads. This is similar to what is currently being done in book clubs.

Devising a recommendation service that includes some or all of the above would allow for a more sophisticated and inclusive browsing experience for digitized books and render the user experience more attractive.
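The collaborative idea of listing books that other users accessed alongside the one being viewed can be sketched with simple co-occurrence counts. This is a simplification under stated assumptions: the input format, function names, and raw-count ranking are hypothetical, and production systems such as the one described by Linden et al. (2003) are more elaborate.

```python
from collections import Counter, defaultdict
from itertools import permutations


def build_cooccurrence(user_books):
    """Count how often each ordered pair of books shares a user.

    user_books: dict mapping a user id to the list of books that user
    viewed or bought (hypothetical input format).
    """
    co = defaultdict(Counter)
    for books in user_books.values():
        # set() ensures a repeated book is counted once per user.
        for a, b in permutations(set(books), 2):
            co[a][b] += 1
    return co


def also_viewed(co, book, top_n=3):
    """Return the books most often seen together with the given book."""
    return [b for b, _ in co.get(book, Counter()).most_common(top_n)]
```

For instance, if three users accessed both book A and book B while only two accessed A and C, `also_viewed(co, "A", 1)` would suggest B. Raw counts favor popular books, so a real service would likely normalize them and combine them with the content-based and explicit signals listed above.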

Conclusion

Search, which is the most ubiquitous service for digitized books, is an insufficient lure to users who favor more convenient content, such as web content. In this paper we present services being explored at CMIC that have the potential to make digitized books more appealing to a larger user base. The presented services cover some areas of visualization, navigation, and recommendation. We believe that creating and adopting services that address these areas would significantly improve the usability and accessibility of digitized books and would potentially lead to wider acceptance amongst users.

REFERENCES

Adomavicius, G. and A. Tuzhilin (2005). Toward the Next Generation of Recommender Systems: A Survey of the State-of-the-Art and Possible Extensions. In Transactions on Knowledge and Data Engineering, Vol. 17, No. 6, pp. 734-749, June 2005.

Barret, W., L. Hutchison, et al. (2004). Digital Mountain: From Granite Archive to Global Access. Proc. of International Workshop on Document Image Analysis for Libraries, Palo Alto, January 2004, pp. 104-121.

Basu, C., H. Hirsh, and W. Cohen (1998). Recommendation as Classification: Using Social and Content-Based Information in Recommendation. Proceedings of the tenth conference on Artificial Intelligence/Innovative applications of artificial intelligence, AAAI.

El-Beltagy, S. R. (2006). KP-Miner: A Simple System for Effective Keyphrase Extraction. IEEE Innovations in Information Technology, 1-5.

Goodwin, R. L., T. N. Terry, A. B., F. Akalin, and J. Shagam (2006). Efficient processing of non-reflow content in a digital image. United States Patent 20070237428.

He, F., X. Ding and L. Peng (2004). Hierarchical logical structure extraction of book documents by analyzing tables of contents. Document Recognition and Retrieval XI, Proc. of SPIE-IS&T Electronic Imaging, SPIE Vol. 5296, 2004.

Herlocker, J. L., J. A. Konstan, L. G. Terveen, and J. T. Riedl (2004). Evaluating Collaborative Filtering Recommender Systems. ACM Transactions on Information Systems, 2004.

Le Bourgeois, F. and E. Trinh, et al. (2004). Document Image Analysis Solutions for Digital Libraries. Proc. International Workshop on Document Image Analysis for Libraries, Palo Alto, January 2004, pp. 2-24.

Lin, C., Y. Niwa, and S. Narita (1997). Logical Structure Analysis of Book Document Images Using Contents Information. ICDAR-1997.

Lin, X. (2006). Quality Assurance in High Volume Document Digitization: A Survey. DIAL'06, pp. 312-319.

Lin, X. (2003). Automatic Document Navigation for Digital Content Re-mastering. HP technical report, 2003

Linden, G., B. Smith, and J. York (2003). Amazon.com Recommendations: Item-to-Item Collaborative Filtering. IEEE Internet Computing, Vol. 7, No. 1, 2003.

Mandal, S., S. P. Chowdhury, A. K. Das, and B. Chanda (2003). Automated Detection and Segmentation of Table of Contents Pages from Document Images. ICDAR 2003.

Medelyan, O. and I. H. Witten (2006). Thesaurus based automatic keyphrase indexing. In Proceedings of the Joint Conference on Digital Libraries, pp. 296–297. Chapel Hill, NC, USA

Middleton, S. E., N. R. Shadbolt, and D. C. De Roure (2004). Ontological User Profiling in Recommender Systems. ACM Transactions on Information Systems, 2004.

Satoh, S., A. Takasu, and E. Katsura (1995). An automated Generation of Electronic Library based on Document Image Understanding. Proceedings of the third International Conference on Document Analysis and recognition, ICDAR’95.

Simske, S. and X. Lin (2004). Creating Digital Libraries: Content Generation and Re-mastering. Proc. International Workshop on Document Image Analysis for Libraries, Palo Alto, January 2004, pp. 33-45.

Thoma G. and G. Ford (2002). Automated Data Entry System: Performance Issues. Proc. SPIE Conference on Document Recognition and Retrieval IX, San Jose, 2002, pp. 181-190


Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.



 
