Abstract
Although digitized books are searchable, thanks to OCR and search technology, the
use of digitized books continues to be unattractive to users. Perhaps search is
not enough and other services are required to make digitized books more accessible
and useful. This paper identifies and proposes specific services that are likely
to improve accessibility and usability of digitized books. The services cover the
areas of visualization, browsing, and recommendation.
Introduction
Since the advent of the printing press in the fifteenth century, the amount of printed
text has grown overwhelmingly. This led to the existence of large collections of
legacy books available only in print. Many recent initiatives, such as the Million
Book Project, have focused on digitizing large repositories of legacy books (Barret
et al., 2004; Lin, 2006; Simske and Lin, 2004). Such initiatives have been successful
in digitizing millions of books in a variety of languages, including Arabic.
Although much of the initial focus has been on the preservation and archiving
of these books, the focus has steadily grown to include content-related services
pertaining to how users can access and use the digitized content. The most prominent
of such services has been search, enabled by optical character recognition (OCR),
which transforms document images into text. Search allows users to find books of interest
and perhaps pages or paragraphs of interest. Despite the existence of search, digitized
books continue to be unattractive to many users. Other services are still required
to enable the effective use of digitized books and to make books more accessible
and alluring to users.
In this paper, we identify possible services being explored at CMIC that would improve
the accessibility and usability of digitized books. The services revolve around
visualization, navigation, and recommendation, which are covered in sections 2,
3, and 4 respectively.
Visualization
We believe that viewing digitized books is unattractive to users because of many
factors including: the sheer size of a book; access time, especially if a user is
required to download an entire book; and formatting, which is inflexible and hence
makes poor use of the screen’s viewable area. The following are possible ways to
overcome these problems.
Book Size
There are different book usage scenarios, including reading cover to cover, which is
typical of fiction books, and selective reading, which is typical of non-fiction books
or reference material. Barring the case where a user would be interested in reading
a book cover to cover, the size of books as single documents can be daunting to
users. Unlike a physical book that a user can flip through easily, flipping through
a digitized book with a sequence of typically hundreds of images, of which the user
can usually see 1 or 2 at a time, using “next page” and “previous page” navigational
buttons that move too slowly or a slider bar that moves too fast is inconvenient.
Also, as the user traverses through the book (or across books), it is easy for the
user to lose track of where (s)he is. Although little can be done about the size
of a book, what is presented to the user can be significantly smaller than a book
and would greatly improve the flip through process. Such reduction in displayed
material can come in several forms including:
- Presenting sections or chapters rather than entire books. Such a presentation may
require other aids to help the user maintain mental context, such as a table of contents
or a chapter and section list displayed beside the subsection of the book.
- Presenting the first page of every chapter or section in a book with a facility
to expand and see the rest of the chapter or section on demand.
- Presenting chapter or section summaries that again can be expanded to see the rest
of the chapter or section. These summaries can be presented in the form of selected
snippets or possibly key phrases in the form of tag clouds.
- Using an image viewer that can display many more pages in a book at once with a
facility to easily zoom in and out of pages. A technology such as SeaDragon can be
very effective in allowing a user to “flip” through hundreds of digitized pages. In
SeaDragon, hundreds of images can be present on the screen as thumbnails and as
the user zooms into any of the images, the images are fetched with a resolution
that matches the user’s screen, avoiding delays associated with loading entire document
images.
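The resolution matching described in the last item can be illustrated with a generic image pyramid in which each level halves the resolution of the level above it. The following is a minimal sketch under that assumption, not SeaDragon's actual implementation; the function names and the example page size are hypothetical.

```python
import math


def max_pyramid_level(full_width: int, full_height: int) -> int:
    """Index of the full-resolution level when level 0 is about one pixel wide."""
    longest_side = max(full_width, full_height)
    # Integer form of ceil(log2(longest_side)), which avoids float rounding.
    return (longest_side - 1).bit_length()


def level_size(full_width: int, full_height: int, level: int) -> tuple[int, int]:
    """Pixel size of the page image stored at the given pyramid level."""
    scale = 2 ** (max_pyramid_level(full_width, full_height) - level)
    return math.ceil(full_width / scale), math.ceil(full_height / scale)


def choose_level(full_width: int, full_height: int, display_width: int) -> int:
    """Smallest level that is at least as wide as the on-screen area."""
    top = max_pyramid_level(full_width, full_height)
    for level in range(top + 1):
        width, _ = level_size(full_width, full_height, level)
        if width >= display_width:
            return level
    # The screen area is wider than the scan, so use full resolution.
    return top
```

For a hypothetical 2480 x 3508 pixel scan, a 120-pixel-wide thumbnail would be served from a much smaller level than a 1000-pixel-wide reading view, so the full image is only transferred when the user zooms all the way in.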
Access Time
Access time is a particularly important problem, especially for books that are stored
as image files, which are typically large in size (hundreds of kilobytes per image),
or as single files in formats such as PDF or DjVu, which would warrant the download
of an entire book when perhaps just a few pages are needed. Some of the ways to
deal with this problem include:
- Improving the compression of digitized pages. DjVu offers very high compression
rates via the separation of the black and white foreground text from the usually
colored background.
- Pre-fetching digitized pages that a user is likely to view. Guessing what the user
is likely to view next can be done using heuristics or using more advanced machine
learning techniques based on user activity data.
- Fetching images in a resolution that matches the resolution of the user’s screen.
A technology such as SeaDragon does precisely that.
- Fetching individual pages instead of entire books. This would require the creation or use
of image formats that may apply technology such as DjVu to single images
and may require the development of a new viewer.
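One of the simplest heuristics for pre-fetching is to assume that the user keeps moving in the direction of the last page turn. The sketch below illustrates that idea only; the class name, the `fetch_page` callback, and the `lookahead` and `capacity` parameters are all hypothetical, and a real viewer would issue the extra requests in the background rather than synchronously.

```python
from collections import OrderedDict


class PagePrefetcher:
    """Caches page images and pre-fetches pages in the reading direction."""

    def __init__(self, fetch_page, total_pages, lookahead=2, capacity=16):
        self.fetch_page = fetch_page      # callable: page number -> image data
        self.total_pages = total_pages
        self.lookahead = lookahead        # how many pages to fetch ahead
        self.capacity = capacity          # maximum number of cached pages
        self.cache = OrderedDict()
        self.last_page = None

    def _load(self, page):
        if page in self.cache:
            self.cache.move_to_end(page)  # mark as most recently used
        else:
            self.cache[page] = self.fetch_page(page)
            if len(self.cache) > self.capacity:
                self.cache.popitem(last=False)  # evict least recently used
        return self.cache[page]

    def view(self, page):
        """Return the requested page and warm the cache for likely next pages."""
        image = self._load(page)
        moving_back = self.last_page is not None and page < self.last_page
        step = -1 if moving_back else 1
        self.last_page = page
        for offset in range(1, self.lookahead + 1):
            candidate = page + step * offset
            if 1 <= candidate <= self.total_pages:
                self._load(candidate)
        return image
```

A predictor learned from user activity data could replace the fixed direction rule without changing the caching logic.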
Formatting
The formatting or re-formatting of a book to fit the desired view of the user and
to make use of the available screen space can greatly improve book viewing. One
way to improve formatting is to perform the so-called book reflow (Goodwin et al.,
2006). Reflow involves accurately identifying the bounding boxes of words and then
allowing these boxes to move freely on a page. Therefore, as a user changes the
aspect ratio of a page, the lines will be reconstructed by reflowing the words (or
their boxes) without doing any zooming. This can enable a user to dramatically change
the formatting of a book, which would make it more readable.
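The core of reflow, once word bounding boxes are known, can be sketched as a greedy line-filling pass over the boxes. This is a minimal illustration and not the method of the cited patent; the `WordBox` type and the gap parameters are hypothetical, and box detection itself is assumed to have been done already.

```python
from dataclasses import dataclass


@dataclass
class WordBox:
    text: str
    width: int   # box width in pixels
    height: int  # box height in pixels


def reflow(words, page_width, word_gap=8, line_gap=6):
    """Place word boxes left to right, wrapping when a box would overflow.

    Returns a list of (text, x, y) positions for the top-left corner of each box.
    """
    placed = []
    x = y = 0
    line_height = 0
    for word in words:
        # Wrap only if the line already holds a word and this one does not fit.
        if x > 0 and x + word.width > page_width:
            y += line_height + line_gap
            x = 0
            line_height = 0
        placed.append((word.text, x, y))
        x += word.width + word_gap
        line_height = max(line_height, word.height)
    return placed
```

Calling `reflow` again with a different `page_width` rebuilds the lines from the same boxes, which is the behavior described above when the user changes the aspect ratio of a page.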
Navigation
One of the fundamental problems with digitized books is that they exist as a long
series of digitized pages with little to connect the different pages together. Creating
links within a book or across books to make them browsable in a non-linear way can
greatly improve the user experience. We suggest the following to improve navigation:
- Automatically identifying table of contents (TOC) and index pages and linking their entries
to the relevant pages. There are a variety of proprietary systems that perform TOC
and index page identification and linking using English OCR output with less than
1% character error rate. Much of the previous work on TOC page identification and
linking used heuristics involving page matching and exploitation of page numbering
(He et al., 2004; Lin, 2003; Lin et al., 1997; Mandal et al., 2003) or some form
of machine learning (Satoh et al., 1995).
- Constructing tag-clouds with the keywords and key phrases in a book, a chapter,
or a section to enable users to jump to topics of interest quickly without having
to sequentially flip through the images. There are many methods for keyword and
key phrase extraction (El-Beltagy, 2006; Medelyan and Witten, 2006;
Turney, 1999). However, special care is required when extracting keywords and key
phrases from OCR output, because OCR often introduces errors and hyphenated words
are often split between 2 lines (and possibly between 2 pages). Other pieces of
evidence, such as the position of a word sequence on the page and its size relative
to surrounding words, can further improve keyword and key phrase extraction.
- Linking keywords or key phrases in a book to other books or external resources such
as Wikipedia, dictionaries, or online search engines. This can provide context to
what the user is reading. Although not completely identical, the Link-the-Wiki track
in the INEX evaluation has been aiming at finding incoming and outgoing links to
isolated pages (site LTW-INEX)
- Showing similar books as a user is browsing through books. More on this is addressed
in the next section.
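The hyphenation issue raised for tag clouds can be made concrete with a small sketch that rejoins words split across OCR lines before counting terms. It uses plain term frequency as a stand-in for the cited extraction methods, and the stopword list and function names are hypothetical. The joining rule is deliberately naive: it also merges genuine hyphenated compounds that happen to break at a line end.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real system would use a fuller one.
STOPWORDS = {"the", "and", "for", "that", "with", "are", "was", "this", "from"}


def join_hyphenated(lines):
    """Join OCR lines, merging a word split by a hyphen at the end of a line."""
    text = ""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if text.endswith("-") and line[0].islower():
            text = text[:-1] + line   # drop the hyphen and glue the halves
        elif text:
            text += " " + line
        else:
            text = line
    return text


def tag_weights(lines, top_n=5):
    """Return the top_n (term, count) pairs to size entries in a tag cloud."""
    tokens = re.findall(r"[a-z]+", join_hyphenated(lines).lower())
    counts = Counter(t for t in tokens if len(t) > 2 and t not in STOPWORDS)
    return counts.most_common(top_n)
```

Because page text can be concatenated before calling `join_hyphenated`, the same rule covers words split between two pages.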
Recommendation
Recommendation services for digitized content have demonstrated significant value
in enhancing the user experience across domains (e.g., multimedia, news, and
shopping) by providing a convenient means for the non-linear exploration of related
content. In the context of digitized books, recommendations can be issued on the
basis of metadata, content, or patterns of user activity at either the book or segment
level. Elements of collaborative intelligence may further contribute to an effective
system by leveraging user ratings in contributing to the ranking of results. Adomavicius
and Tuzhilin (2005) offer a fairly good survey of current state-of-the-art recommendation
technology. We suggest the following to allow for a more comprehensive browsing
experience in which the user is afforded the opportunity to stumble upon books of
interest which may otherwise have been overlooked:
- Issuing recommendations based on content. This type of recommendation can be based
on:
  - Matching of metadata associated with a book, such as titles, author names, publication
  years, major subjects, and user annotations. Amazon.com uses this type of recommendation
  (among other types) (Linden et al., 2003) to recommend other books by the same author
  and other books covering the same topics.
  - Employing NLP tools for the analysis and matching of content, e.g. key phrases or
  even writing styles. An example system is the one reported on by Mooney and Roy
  that uses information extraction and text categorization to make content-based recommendations.
  Another innovative system is the so-called BookLamp system that analyzes the text
  of a book for the pace of events and the density of adjectives as indicators of writing
  style.
- Leveraging user profiling technologies to detect and learn from user activity patterns
for the analysis and prediction of users’ interests and issue recommendations on
that basis (Middleton et al., 2004). Again such recommendation can be based on metadata
or other automatically extracted features from content.
- Employing collaborative recommendation, which compares a user profile to the profiles
of other users. This is perhaps the most important way that Amazon.com makes recommendations,
by providing a list of books that other users have bought alongside the book being
viewed (Herlocker et al., 2004).
- Constructing social networks that connect users with similar interests to expand
a user’s exploration horizon in a visually appealing way. Similarly, the modeling
of these networks based on relationships between books could present itself as an
intuitive method for the navigation of recommendations (Basu et al., 1998).
- Incorporating explicit user recommendation by allowing users to rate or tag books
as interesting or good reads. This is similar to what is currently being done in
book clubs.
Devising a recommendation service that includes some or all of the above would allow
for a more sophisticated and inclusive browsing experience for digitized books and
render the user experience more attractive.
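The "other users also bought" style of collaborative recommendation mentioned above can be sketched with raw co-occurrence counts over user reading lists. This is a deliberate simplification for illustration and not the algorithm of Linden et al. (2003); the function names and sample data layout are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations


def build_cooccurrence(reading_lists):
    """Count how often each pair of books appears in the same user's list."""
    co = defaultdict(Counter)
    for books in reading_lists:
        for a, b in combinations(sorted(set(books)), 2):
            co[a][b] += 1
            co[b][a] += 1
    return co


def also_read(co, book, top_n=3):
    """Books most often read together with the given book.

    Ties are broken alphabetically so the output is deterministic.
    """
    neighbours = co.get(book, {})
    ranked = sorted(neighbours.items(), key=lambda item: (-item[1], item[0]))
    return [title for title, _ in ranked[:top_n]]
```

Raw counts favor popular books, so a practical system would likely normalize them, for example by the popularity of each book, before ranking.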
Conclusion
Search, which is the most ubiquitous service for digitized books, is an insufficient
lure to users who favor more convenient content, such as web content. In this paper
we present services being explored at CMIC that have the potential of making digitized
books more appealing to a larger user base. The presented services cover some areas of
visualization, navigation, and recommendation. We believe that creating and adopting
services that address these areas would significantly improve the usability and
accessibility of digitized books and would potentially lead to wider acceptance
amongst users.
References
Adomavicius, G. and A. Tuzhilin (2005). Toward the Next Generation of Recommender
Systems: A Survey of the State-of-the-Art and Possible Extensions. In Transactions
on Knowledge and Data Engineering, Vol. 17, No. 6, pp. 734-749, June 2005.
Barret, W., L. Hutchison, et al. (2004). Digital Mountain: From Granite Archive
to Global Access. Proc. of International Workshop on Document Image Analysis for
Libraries, Palo Alto, January 2004, pp. 104-121.
Basu, C., H. Hirsh, and W. Cohen (1998). Recommendation as Classification: Using
Social and Content-Based Information in Recommendation. Proceedings of the tenth
conference on Artificial Intelligence/Innovative applications of artificial intelligence,
AAAI.
El-Beltagy, S. R. (2006). KP-Miner: A Simple System for Effective Keyphrase Extraction.
IEEE Innovations in Information Technology, 1-5.
Goodwin, R. L., T. N. Terry, A. B., F. Akalin, and J. Shagam (2006). Efficient processing
of non-reflow content in a digital image. United States Patent 20070237428.
He, F., X. Ding and L. Peng (2004). Hierarchical logical structure extraction of
book documents by analyzing tables of contents. Document Recognition and Retrieval
XI, Proc. of SPIE-IS&T Electronic Imaging, SPIE Vol. 5296, 2004.
Herlocker, J. L., J.A. Konstan, L.G. Terveen, and J.T. Riedl (2004). Evaluating Collaborative
Filtering Recommender Systems. ACM Transactions on Information Systems, 2004.
Le Bourgeois, F. and E. Trinh, et al. (2004). Document Image Analysis Solutions
for Digital Libraries. Proc. International Workshop on Document Image Analysis for
Libraries, Palo Alto, January 2004, pp. 2-24.
Lin, C., Y. Niwa, and S. Narita (1997). Logical Structure Analysis of Book Document
Images Using Contents Information. ICDAR-1997.
Lin, X. (2006). Quality Assurance in High Volume Document Digitization: A Survey.
DIAL'06, pp. 312–319.
Lin, X. (2003). Automatic Document Navigation for Digital Content Re-mastering.
HP technical report, 2003.
Linden, G., B. Smith, and J. York (2003). Amazon.com Recommendations: Item-to-Item
Collaborative Filtering. IEEE Internet Computing, Vol. 7, No. 1, 2003.
Mandal, S., S.P. Chowdhury, A.K. Das, and B. Chanda (2003). Automated Detection and
Segmentation of Table of Contents Pages from Document Images. ICDAR 2003.
Medelyan, O. and I. H. Witten (2006). Thesaurus based automatic keyphrase indexing.
In Proceedings of the Joint Conference on Digital Libraries, pp. 296–297. Chapel
Hill, NC, USA.
Middleton, S.E., N.R. Shadbolt, and D.C. De Roure (2004). Ontological User Profiling in
Recommender Systems. ACM Transactions on Information Systems, 2004.
Satoh, S., A. Takasu, and E. Katsura (1995). An automated Generation of Electronic
Library based on Document Image Understanding. Proceedings of the third International
Conference on Document Analysis and recognition, ICDAR’95.
Simske, S. and X. Lin (2004). Creating Digital Libraries: Content Generation and
Re-mastering. Proc. International Workshop on Document Image Analysis for Libraries,
Palo Alto, January 2004, pp. 33-45.
Thoma, G. and G. Ford (2002). Automated Data Entry System: Performance Issues. Proc.
SPIE Conference on Document Recognition and Retrieval IX, San Jose, 2002, pp. 181-190.