Executive Summary
Databases provide content storage for many sites that dynamically create web pages around them, including ecommerce catalog sites, online news sites, and even entertainment sites. Intranets often contain large amounts of text stored in databases as well. These databases generally have their own query functions, which may appear to take the place of a full-text search engine, but this is not necessarily the case. Database querying is not oriented towards text search and relevance ranking: it is optimized for locating widgets by part number and listing the inventory of leather slippers, but it is far less effective at helping site visitors find articles on widget quality or compare leather and fleece slippers. Full-text search engines are often a better solution, moving the processing load from the database to the search engine and adding crucial search functionality. Some database vendors offer their own full-text search products; alternatively, an independent full-text search engine product may be used to search the textual content that is stored in the database.
Arabic is a morphologically complex language that allows great flexibility in word formation: complex rules govern the creation of morphological variations, making it possible to form hundreds or thousands of closely related words from one root. In Arabic, it is therefore extremely important for a single search query to capture potential word variations. The Arabic language word breaker in SharePoint Portal Server v3 (SPSv3) provides support for these requirements.

Overview & Scope
The next generation of SharePoint Portal Server (SPSv3) aims to advance in the marketplace by better satisfying customers' search needs. Search functionality will be enhanced with language-specific word breakers, stemmers, and Named Entity (NE) detection to provide increased relevance. Word breakers are an essential part of a search engine, since they define the elements of a search query that will be matched against the document index. Many search engines use the simple language-neutral technique of breaking on white space, which is insufficient for accurate language-specific word breaking. The new word breakers, which benefit from linguistic and statistical information, will significantly enhance the user's search experience in a variety of structurally different languages.

Goals
SharePoint Portal Server (SPSv3) provides language-specific word breakers for a set of languages. The main goal of extending language coverage and improving word breaking behavior is to improve the search engine experience and gain an advantage over competitors in terms of language coverage. These new language-specific word breakers will improve the user experience when using these languages in a search context.

Breaking and Non-Breaking Characters
The determination of word breaking characters is essential, as it establishes which characters will be coded as word separators. Breaking characters include white space characters, punctuation marks, quotation marks, parentheses, symbols, and more. Any character that is not explicitly listed as a breaking character is a non-breaking character. One way of categorizing word breaking characters from a linguistic point of view is to assign them to two main groups:

Language-neutral word breaking characters
These are characters that are word breaking in all languages we cover. If a character does not exist in a given language but exists in other languages and is consistently word breaking in those languages, it also belongs to this category.

Language-specific word breaking characters
These are characters that are word breaking in a particular language or group of languages but which are not word breaking in all the languages that we cover.
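
The two groups above can be sketched as a small lookup-driven tokenizer. This is a minimal illustration only: the character sets, the language codes, and the function name break_words are assumptions made for the example, not the sets used by SPSv3. The apostrophe is used as a hypothetical language-specific breaker (breaking in French, non-breaking in English).

```python
# Hypothetical character sets, chosen only to illustrate the two groups.
LANGUAGE_NEUTRAL_BREAKERS = set(' \t\n.,;:!?()[]"')
LANGUAGE_SPECIFIC_BREAKERS = {
    "fr": {"'"},   # assumption: the apostrophe separates words in French
    "en": set(),   # assumption: the apostrophe stays inside English words
}


def break_words(text, lang):
    """Split text on language-neutral plus language-specific breakers."""
    breakers = LANGUAGE_NEUTRAL_BREAKERS | LANGUAGE_SPECIFIC_BREAKERS.get(lang, set())
    tokens, current = [], []
    for ch in text:
        if ch in breakers:
            # A breaking character closes the current token, if any.
            if current:
                tokens.append("".join(current))
                current = []
        else:
            # Anything not listed as breaking is non-breaking.
            current.append(ch)
    if current:
        tokens.append("".join(current))
    return tokens
```

With these assumed sets, break_words("l'article (final)", "fr") separates "l" from "article", while break_words("don't stop", "en") keeps "don't" whole.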

Special Cases
There are many special cases in word breaking which override standard word breaking behavior. These special cases typically arise either because normally word breaking characters do not break words in certain contexts, or because a particular language uses a punctuation token or a special symbol in a way that combines with the form of a word and therefore requires special treatment. Common examples include abbreviations and acronyms.

Named Entities
Named Entities are sequences of tokens that may contain normally breaking characters but that we want to recognize as a single token. Each Named Entity is also linked to a standardized format so that different ways of representing the same information can be related to each other, thus extending search coverage (recall).
The SharePoint Portal Server (SPSv3) Arabic Word Breaker includes Named Entities of the following types: Numbers, Currencies, Times, Dates, Emails, URLs, File Paths and File Names with Extensions.

Numbers
Numbers can include punctuation elements and can be written in different ways (e.g. 2 and 2.0). The word breaker normalizes the different formats to the same underlying format to ensure that the user search matches all relevant results.
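
As a rough sketch of this kind of normalization, the following maps "2" and "2.0" to one canonical string using Python's decimal module. The function name normalize_number and the choice of canonical form are assumptions for illustration; the underlying format used by the word breaker is not described here.

```python
from decimal import Decimal, InvalidOperation


def normalize_number(token):
    """Return a canonical string for a numeric token, or None if not numeric."""
    try:
        value = Decimal(token)
    except InvalidOperation:
        return None
    if not value.is_finite():
        # Reject NaN and infinities, which Decimal would otherwise accept.
        return None
    # normalize() drops trailing zeros; the "f" format avoids exponent notation.
    return format(value.normalize(), "f")
```

Indexing the canonical string alongside the surface form means a query for one spelling of a number can match documents that use another.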

Currency
Currencies can include punctuation elements and be written in different ways (e.g. $15, 15$, and 15USD). The word breaker normalizes the different formats to the same underlying format to ensure that the user search matches all relevant results.
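
A minimal sketch of the idea, using the three spellings from the example. The mapping of "$" to "USD", the regular expressions, and the (code, amount) output shape are all assumptions made for this illustration.

```python
import re

# Hypothetical symbol table; "$" is assumed to mean US dollars here.
SYMBOL_TO_CODE = {"$": "USD"}

_AMOUNT = r"(?P<amt>\d+(?:\.\d+)?)"
_CURRENCY = r"(?P<cur>\$|USD)"
_PREFIX = re.compile("^" + _CURRENCY + r"\s?" + _AMOUNT + "$")  # $15
_SUFFIX = re.compile("^" + _AMOUNT + r"\s?" + _CURRENCY + "$")  # 15$, 15USD


def normalize_currency(token):
    """Return (currency_code, amount) for a currency token, else None."""
    match = _PREFIX.match(token) or _SUFFIX.match(token)
    if match is None:
        return None
    symbol = match.group("cur")
    return (SYMBOL_TO_CODE.get(symbol, symbol), float(match.group("amt")))
```

All three spellings in the example reduce to the same pair, so they can be matched against each other.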

Time
Times can include punctuation elements and be written in different ways (e.g. 1:15 and 13:15). The word breaker normalizes the different formats to the same underlying format to ensure that the user search matches all relevant results.

Date
Dates can include punctuation elements and be written in different ways (e.g. 3rd January 2007, 01/03/07, 03/01/07). The word breaker normalizes the different formats to the same underlying format to ensure that the user search matches all relevant results.
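
The sketch below shows one way such date formats could be reduced to a single form (ISO 8601 is used here as an assumed target). Numeric dates are ambiguous between day-first and month-first conventions, so the caller must say which applies; the day_first parameter and the function name are hypothetical.

```python
import re
from datetime import datetime


def normalize_date(text, day_first):
    """Return an ISO date string for a few simple date spellings, else None."""
    # Drop English ordinal suffixes: "3rd" -> "3".
    cleaned = re.sub(r"(\d)(st|nd|rd|th)\b", r"\1", text.strip())
    numeric = "%d/%m/%y" if day_first else "%m/%d/%y"
    # %B matches English month names in Python's default "C" locale.
    for fmt in ("%d %B %Y", numeric):
        try:
            return datetime.strptime(cleaned, fmt).date().isoformat()
        except ValueError:
            continue
    return None
```

Under these assumptions, "3rd January 2007", "03/01/07" read day-first, and "01/03/07" read month-first all normalize to the same value.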

Email
Email addresses can include punctuation elements. The word breaker recognizes email addresses as single unbroken units (e.g. mail@microsoft.com).

URLs
URLs can include punctuation elements. The word breaker recognizes URLs as single unbroken units (e.g. http://www.microsoft.com).

File Paths and File Names with Extensions
The word breaker recognizes file paths and file names with extensions as single unbroken units (e.g. c:\windows, \\windows, and c:\windows\install.log).
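
Recognizing emails, URLs, and file paths as single units can be sketched with a few patterns that are checked before ordinary breaking is applied. The patterns below are deliberately simplified and are assumptions for this example; real address and path grammars are much broader, and the function names are hypothetical.

```python
import re

# Simplified, hypothetical patterns for the three entity types.
ENTITY_PATTERNS = [
    ("url", re.compile(r"https?://[^\s]+")),
    ("email", re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")),
    ("path", re.compile(r"(?:[A-Za-z]:\\|\\\\)[^\s]+")),
]


def tag_token(token):
    """Return the entity type of a whitespace-delimited token, or 'word'."""
    for label, pattern in ENTITY_PATTERNS:
        if pattern.fullmatch(token):
            return label
    return "word"


def break_preserving_entities(text):
    """Keep recognized entities whole; break everything else on non-word characters."""
    tokens = []
    for chunk in text.split():
        if tag_token(chunk) != "word":
            tokens.append(chunk)
        else:
            tokens.extend(t for t in re.split(r"\W+", chunk) if t)
    return tokens
```

Without the entity check, the punctuation inside mail@microsoft.com or c:\windows\install.log would split each of them into several unrelated tokens.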

Additional Features for Search
This section groups together a number of additional features related to the SPSv3 word breaker. These features are:

Pass-through Feature
When a query is enclosed in quotation marks, the word or words in the search query are matched without change against the index.
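
A minimal sketch of pass-through behavior follows. The function plan_query and the caller-supplied expand function are hypothetical; toy_expand merely stands in for whatever query expansion would otherwise be applied.

```python
def plan_query(query, expand):
    """Return the terms to match against the index.

    `expand` is a caller-supplied (hypothetical) function that turns an
    unquoted query into a list of expanded terms.
    """
    stripped = query.strip()
    if len(stripped) >= 2 and stripped[0] == '"' and stripped[-1] == '"':
        # Quoted query: pass the words through unchanged.
        return stripped[1:-1].split()
    return expand(stripped)


def toy_expand(query):
    """Stand-in expansion: each word plus a naive plural form."""
    return [form for word in query.split() for form in (word, word + "s")]
```

Here plan_query('"leather slipper"', toy_expand) returns only the two words as typed, while the unquoted query is expanded.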

Noise Words
Noise Words can be used to remove irrelevant matches ("noise") from searches. The list of noise words is necessarily language-specific. The noise word list should contain frequent words that do not substantially add to the meaning of a user query.
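
Noise word filtering can be sketched as a per-language set lookup. The word lists below are small hypothetical samples chosen for the example, not the lists shipped with the product.

```python
# Hypothetical per-language noise word lists.
NOISE_WORDS = {
    "en": {"the", "of", "a", "on", "and"},
    "ar": {"في", "من", "على"},
}


def remove_noise(tokens, lang):
    """Drop tokens that appear in the noise word list for the given language."""
    noise = NOISE_WORDS.get(lang, set())
    return [t for t in tokens if t.lower() not in noise]
```

For example, remove_noise(["the", "articles", "on", "widget", "quality"], "en") keeps only "articles", "widget", and "quality".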

Special Word List
A Special Word List is provided internally in SPSv3 to prevent words that contain breaking characters from being broken for the purposes of search queries (e.g. C++, C#, .Net).

Custom Dictionaries
Custom dictionaries have been provided to enable an SPSv3 Site Administrator to add further words that contain breaking characters to the Special Word List so that they will not be broken for the purposes of search queries. This allows the Administrator to customize the word breaker to the requirements of a particular local environment or language.
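
The interaction between a built-in special word list and an administrator's custom dictionary can be sketched as follows. The built-in entries come from the examples in the text; the function name, the custom_words parameter, and the breaking rule (split on non-word characters) are assumptions for this illustration.

```python
import re

# Built-in entries taken from the examples in the text.
SPECIAL_WORDS = {"c++", "c#", ".net"}


def break_with_special_words(text, custom_words=()):
    """Break text into words, keeping special and custom words intact."""
    protected = SPECIAL_WORDS | {w.lower() for w in custom_words}
    tokens = []
    for chunk in text.split():
        # Strip surrounding punctuation that cannot be part of a special word here.
        candidate = chunk.strip(',;:!?()"')
        if candidate.lower() in protected:
            tokens.append(candidate)
        else:
            tokens.extend(t for t in re.split(r"\W+", chunk) if t)
    return tokens
```

With no custom dictionary, "F#" is broken down to "F"; passing custom_words=["F#"] keeps it whole. This sketch does not handle a special word directly followed by a sentence-final period.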

Diacritics
The word breaker preserves diacritics, emitting each form together with its diacritics. Diacritics are marks added to a letter or phoneme to indicate a special phonetic value. They distinguish words that are otherwise graphically identical, such as "اليُمنُ", "لُبنانِ".
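
To make the notion of a diacritic concrete, the sketch below separates the base letters of a token from its marks, assuming the marks are encoded as Unicode nonspacing marks (general category Mn), which is the case for the Arabic short vowel signs. It only illustrates how the marks can be identified; it is not a description of how the word breaker stores or emits them.

```python
import unicodedata


def is_diacritic(ch):
    """True for Unicode nonspacing marks, e.g. Arabic damma (U+064F)."""
    return unicodedata.category(ch) == "Mn"


def split_diacritics(word):
    """Return (base letters, list of diacritic marks) for a token."""
    base = "".join(ch for ch in word if not is_diacritic(ch))
    marks = [ch for ch in word if is_diacritic(ch)]
    return base, marks
```

Applied to a vocalized token, the first element of the result is the bare letter sequence that the token shares with its unvocalized spelling.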

Morphology and Word Breakers
Once word breaking has taken place and the individual word forms have been identified, SPSv3 morphological processing enables the word forms to be linked to other morphologically related forms of the same word for the purpose of expanding the user search query to find more relevant matches.
This morphological linking can be achieved in different ways.

Morphological Analysis
The inflected word form can simply be linked to its base form and, via the base form, to all inflected forms sharing the same base form. This requires the index to store the base form(s) of each inflected form along with the inflected form itself. Note that more than one base form may be associated with words that are ambiguous with respect to their part of speech (e.g. “lives” may have a verb base form “live” and a noun base form “life”). For languages with very large numbers of related word forms, this is the approach generally adopted.
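
The index-time approach can be sketched with a toy lexicon. Everything here is hypothetical: the BASE_FORMS table stands in for a real morphological analyzer, and for brevity the index is keyed by base form only rather than storing inflected and base forms side by side.

```python
from collections import defaultdict

# Hypothetical toy lexicon: inflected form -> base form(s).
BASE_FORMS = {
    "lives": ["live", "life"],   # ambiguous between verb and noun
    "live": ["live"],
    "living": ["live"],
    "life": ["life"],
}


def build_index(docs):
    """Index each document under the base form(s) of every word it contains."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            for base in BASE_FORMS.get(word, [word]):
                index[base].add(doc_id)
    return index


def search(index, term):
    """Analyze the query term to its base form(s) and look those up."""
    hits = set()
    for base in BASE_FORMS.get(term.lower(), [term.lower()]):
        hits |= index.get(base, set())
    return hits
```

Because the query term is reduced to its base form before lookup, only one or two index probes are needed regardless of how many inflected forms exist.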

Morphological Generation
Alternatively, the inflected query term can be expanded to all inflected forms related to its base form, requiring more matches to be made at query time but less information to be stored at index time. The choice of which approach to adopt depends on performance considerations such as size of index, speed of indexing, and the number of query terms generated for search. For most languages with less extensive morphological variation, this is the approach generally adopted.
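
The query-time alternative can be sketched in the same toy setting. The INFLECTIONS table is a hypothetical stand-in for a morphological generator, and expand_query is an illustrative name.

```python
# Hypothetical toy paradigm table: base form -> all of its inflected forms.
INFLECTIONS = {
    "live": ["live", "lives", "lived", "living"],
    "life": ["life", "lives"],
}


def expand_query(term):
    """Expand a query term to every form sharing a base form with it."""
    forms = set()
    for inflected in INFLECTIONS.values():
        if term in inflected:
            forms.update(inflected)
    # Unknown terms are searched as typed.
    return sorted(forms) or [term]
```

Here the index needs only surface forms, but a single query term such as "lives" turns into five lookups, which is why the number of generated query terms matters when choosing between the two approaches.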

Arabic Morphological Analyzer
The Arabic Morphological Analyzer embedded in the Arabic search engine provides linguistic support for the analysis of the Arabic language, which is characterized by rich morphology. This section groups together some of its principal aspects.

Coverage and Reliability
The search engine provides more relevant results related to the search query term by using morphological processing. For example:
- When searching for the word سفينة, forms such as وسفن will also appear in the search results.

User Scenarios
The user may specify different forms of the word as a query. For example, the user might enter اسلحة or أسلحة (plural form) rather than سلاح as a query. In this case, the results must be the same whichever form is entered. It would be a major setback to the user if أسلحة yielded different results from سلاح due to an incomplete analysis of the query.
As another example, entering لطائرات should return the same results as entering الطائرات or بطائرة. This is especially important in multi-word queries, where sets of words are copied and pasted, and the user does not want (or does not know how) to reduce the words down to their basic form. Simple or rule-based stemmers (basic stemmers) do not guarantee this: with them, some inflections trigger different results, or no results at all.
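
The consistency requirement in these scenarios can be expressed as a simple check: every surface form of a word must reduce to the same lookup key. In the sketch below a small hand-written table stands in for the morphological analyzer, whose internals are not described here; the table, the function name query_key, and the choice of the singular as the key are assumptions for illustration.

```python
# Hypothetical stand-in for the analyzer: surface form -> base form.
LEMMA_OF = {
    "سلاح": "سلاح",
    "أسلحة": "سلاح",
    "اسلحة": "سلاح",
    "طائرة": "طائرة",
    "لطائرات": "طائرة",
    "الطائرات": "طائرة",
    "بطائرة": "طائرة",
}


def query_key(query):
    """Reduce each word of a query to its base form; unknown words are kept as typed."""
    return tuple(LEMMA_OF.get(word, word) for word in query.split())
```

Two queries return the same results exactly when their keys are equal, so query_key("أسلحة") and query_key("سلاح") must compare equal, as must the keys for لطائرات, الطائرات, and بطائرة.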

The information contained in this document represents the current view of Microsoft Corporation on the issues discussed as of the date of publication. Because Microsoft must respond to changing market conditions, this document should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information presented after the date of publication. The information represents the product at the time this document was printed and should be used for planning purposes only. Information is subject to change at any time without prior notice.
This document is for informational purposes only. MICROSOFT MAKES NO WARRANTIES, EXPRESS OR IMPLIED, IN THIS DOCUMENT.
© 2006 Microsoft Corporation. All rights reserved. Microsoft, Active Directory, Excel, Internet Explorer, Microsoft Dynamics, MSDN, the Office logo, Outlook, PivotTable, PowerPoint, SharePoint, Visio, Visual Basic, Visual Studio, Windows, Windows Server, Windows Server System, and Windows Vista are either registered trademarks or trademarks of Microsoft Corporation in the United States and/or other countries. All other trademarks are property of their respective owners.