Microsoft Office Servers 2007 Arabic Edition Word-Breaker

Table of contents

  1. Executive Summary


  2. Overview & Scope


  3. Goals


  4. Breaking and Non-Breaking Characters

    4.1  Language-neutral word breaking characters

    4.2  Language-specific word breaking character

    4.3  Special Cases

  5. Named Entities

    5.1  Numbers

    5.2  Currency

    5.3  Time

    5.4  Date

    5.5  Email

    5.6  URLs

    5.7  File Paths and File Names with Extensions

  6. Additional Features for Search

    6.1  Pass-through Feature

    6.2  Noise Words

    6.3  Special Word List

    6.4  Custom Dictionaries

    6.5  Diacritics

  7. Morphology and Word Breakers

    7.1  Morphological Analysis

    7.2  Morphological Generation

  8. Arabic Morphological Analyzer

    8.1  Coverage and Reliability

    8.2  User Scenarios



Executive Summary

Databases provide content storage for many sites that dynamically create web pages around them, including e-commerce catalog sites, online news, and even entertainment sites. Intranets often contain large amounts of text stored in databases as well. These databases generally have their own query functions, which may appear to take the place of a full-text search engine, but this is not necessarily the case. Database querying is not oriented towards text search and relevance ranking: it is optimized for locating widgets by part number and listing the inventory of leather slippers, but is far less effective at helping site visitors find articles on widget quality or compare leather and fleece slippers. Full-text search engines are often a better solution, moving the processing load from the database to the search engine and adding crucial search functionality. Some database vendors offer their own full-text search products; alternatively, an independent full-text search engine product may be used to search the textual content that is stored in the database.

Arabic is a morphologically complex language in that it provides flexibility in word formation: complex rules govern the creation of morphological variations, making it possible to form hundreds or thousands of closely related words from one root. Therefore, in Arabic, it is extremely important for a single search query to capture potential word variations. The Arabic language word breaker in SharePoint Portal Server (SPS) v3 provides support for these requirements.


Overview & Scope

The next-generation SharePoint Portal Server (SPSv3) will advance in the marketplace by satisfying customers. Search functionality will be enhanced by providing language-specific word breakers, stemmers and Named Entity (NE) detection to provide increased relevance. Word breakers are an essential part of a search engine, since they define the elements of a search query which will be matched against the document index. Many search engines use the simple language-neutral technique of breaking on white space, which is insufficient for providing accurate language-specific word breaking. The new word breakers, which benefit from linguistic and statistical information, will significantly enhance the user's search experience in a variety of structurally different languages.


Goals

The SharePoint Portal Server (SPSv3) provides language-specific word breakers for a set of languages. The main goal of extending language coverage and improving word breaking behavior is to improve the search engine experience and gain an advantage over competitors in terms of language coverage. These new language-specific word breakers will improve the user experience when using these languages in a search context.


Breaking and Non-Breaking Characters

The determination of word breaking characters is essential, as it establishes which characters will be coded as word separators. Breaking characters include white space characters, punctuation marks, quotation marks, parentheses, symbols, and more. Any character that is not explicitly listed as a breaking character is not a breaking character. One way of categorizing word breaking characters from a linguistic point of view is to assign them to two main groups:


Language-neutral word breaking characters

These are characters that are word breaking in all languages we cover. If a character does not exist in a given language but exists in other languages and is consistently word breaking in those languages, it also belongs to this category.


Language-specific word breaking character

These are characters that are word breaking in a particular language or group of languages but which are not word breaking in all the languages that we cover.
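
The two groups can be pictured with a minimal sketch. The character sets below are illustrative assumptions only (this document does not list the actual SPSv3 tables), and the language tags and the function name break_words are hypothetical. The apostrophe is used as an example of a character that might break words in one language but not in another.

```python
# Illustrative, assumed sets; the real SPSv3 character tables are not given here.
# Includes the Arabic comma, semicolon and question mark (U+060C, U+061B, U+061F).
LANGUAGE_NEUTRAL_BREAKERS = set(" \t\n()[]\"!?,;:\u060C\u061B\u061F")

# Hypothetical language-specific breakers, keyed by a language tag.
LANGUAGE_SPECIFIC_BREAKERS = {
    "fr": set("'"),  # assumption: apostrophe separates an elided article
}


def break_words(text, lang):
    """Split text on the neutral breakers plus those specific to lang."""
    breakers = LANGUAGE_NEUTRAL_BREAKERS | LANGUAGE_SPECIFIC_BREAKERS.get(lang, set())
    words, current = [], []
    for ch in text:
        if ch in breakers:
            # A breaking character closes the word collected so far.
            if current:
                words.append("".join(current))
                current = []
        else:
            # Anything not explicitly listed is non-breaking.
            current.append(ch)
    if current:
        words.append("".join(current))
    return words


print(break_words("l'index (search)", "fr"))
print(break_words("don't stop", "en"))
```

Note that the sketch follows the rule stated earlier: a character that is not explicitly listed never breaks a word.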


Special Cases

There are many special cases in word breaking which override standard word breaking behavior. These special cases typically arise either because a normally word breaking character does not break words in certain contexts, or because a particular language uses a punctuation token or special symbol in a way that combines with the form of a word and therefore requires special treatment. Common examples include abbreviations and acronyms.


Named Entities

Named Entities are sequences of tokens, possibly containing normally breaking characters, that we want to recognize as a single token and link to a standardized format. Linking to a standardized format allows different ways of representing the same information to be related to each other, thus extending search coverage (recall).

The SharePoint Server (SPSv3) Arabic Word Breaker includes Named Entities of the following types: Numbers, Currencies, Times, Dates, Emails, URLs, File Paths and File Names with Extensions.


Numbers

Numbers can include punctuation elements and can be written in different ways (e.g. 2 and 2.0). The word breaker normalizes the different formats to the same underlying format to ensure that the user search matches all relevant results.
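
As a rough illustration of this kind of normalization (not the SPSv3 implementation, whose underlying format is not described here), the sketch below maps "2" and "2.0" to one canonical string. The function name normalize_number is hypothetical, and the handling of Arabic-Indic digits is an added assumption.

```python
from decimal import Decimal

# Assumption: Arabic-Indic digits (U+0660..U+0669) should match ASCII digits.
ARABIC_INDIC_TO_ASCII = str.maketrans(
    "\u0660\u0661\u0662\u0663\u0664\u0665\u0666\u0667\u0668\u0669",
    "0123456789",
)


def normalize_number(token):
    """Return one canonical string for equivalent spellings of a number."""
    value = Decimal(token.translate(ARABIC_INDIC_TO_ASCII))
    # normalize() drops trailing zeros; the "f" format avoids exponent notation.
    return format(value.normalize(), "f")


print(normalize_number("2"), normalize_number("2.0"))
```

Indexing the canonical string alongside the original token would let a query for either spelling match both.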


Currency

Currencies can include punctuation elements and be written in different ways (e.g. $15, 15$, and 15USD). The word breaker normalizes the different formats to the same underlying format to ensure that the user search matches all relevant results.
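
A minimal sketch of the same idea for currencies follows. The canonical form "USD 15", the symbol table and the function name normalize_currency are all assumptions chosen for illustration; the document does not specify the format SPSv3 uses.

```python
import re

# Hypothetical symbol table; only the dollar sign from the example is covered.
SYMBOL_TO_CODE = {"$": "USD"}

# A unit (symbol or three-letter code) may appear before or after the amount.
_CURRENCY = re.compile(
    r"^(?P<pre>[$]|[A-Z]{3})?(?P<amount>\d+(?:\.\d+)?)(?P<post>[$]|[A-Z]{3})?$"
)


def normalize_currency(token):
    """Map '$15', '15$' and '15USD' to one assumed canonical form."""
    match = _CURRENCY.match(token)
    if match is None:
        return None
    pre, post = match.group("pre"), match.group("post")
    # Exactly one unit is required: reject bare numbers and doubled units.
    if (pre is None) == (post is None):
        return None
    unit = pre or post
    return f"{SYMBOL_TO_CODE.get(unit, unit)} {match.group('amount')}"


print([normalize_currency(t) for t in ("$15", "15$", "15USD")])
```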


Time

Times can include punctuation elements and be written in different ways (e.g. 1:15 and 13:15). The word breaker normalizes the different formats to the same underlying format to ensure that the user search matches all relevant results.


Date

Dates can include punctuation elements and be written in different ways (e.g. 3rd January 2007, 01/03/07, 03/01/07). The word breaker normalizes the different formats to the same underlying format to ensure that the user search matches all relevant results.
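
The sketch below shows one way such date normalization could look, using ISO 8601 as an assumed canonical form. The function name normalize_date and the day_first flag are hypothetical; the flag is needed because a numeric date such as 01/03/07 is ambiguous without knowing the writer's convention. Month names are parsed with the default (English) locale.

```python
import re
from datetime import datetime


def normalize_date(text, day_first=True):
    """Return an ISO 8601 date string, or None if no known format matches."""
    # Drop ordinal suffixes so that "3rd" parses as the day number 3.
    cleaned = re.sub(r"(\d+)(?:st|nd|rd|th)\b", r"\1", text.strip())
    numeric = "%d/%m/%y" if day_first else "%m/%d/%y"
    for fmt in ("%d %B %Y", numeric):
        try:
            return datetime.strptime(cleaned, fmt).date().isoformat()
        except ValueError:
            continue
    return None


print(normalize_date("3rd January 2007"))
print(normalize_date("03/01/07", day_first=True))
print(normalize_date("01/03/07", day_first=False))
```

Under the stated conventions, all three calls yield the same canonical date, which is what allows the different spellings to match each other.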


Email

Email addresses can include punctuation elements. The word breaker recognizes email addresses as single unbroken units (e.g. mail@microsoft.com).


URLs

URLs can include punctuation elements. The word breaker recognizes URLs as single unbroken units (e.g. http://www.microsoft.com).


File Paths and File Names with Extensions

The word breaker recognizes file paths and file names with extensions as single unbroken units (e.g. c:\windows, or \\windows, and c:\windows\install.log).
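
Keeping email addresses, URLs and file paths unbroken can be sketched with a tokenizer that tries entity patterns before falling back to ordinary words. The patterns below are deliberately simplified assumptions, not the rules SPSv3 applies, and the names ENTITY_PATTERN and tokenize are hypothetical.

```python
import re

# Entity alternatives are listed before the plain-word fallback, so an entity
# is claimed as one unit before its punctuation can split it.
ENTITY_PATTERN = re.compile(
    r"""
      (?P<url>https?://\S+)
    | (?P<email>[\w.+-]+@[\w-]+(?:\.[\w-]+)+)
    | (?P<path>[A-Za-z]:\\\S*|\\\\\S+)
    | (?P<word>\w+)
    """,
    re.VERBOSE,
)


def tokenize(text):
    """Return (kind, token) pairs; entities stay single unbroken tokens."""
    return [(m.lastgroup, m.group()) for m in ENTITY_PATTERN.finditer(text)]


print(tokenize(r"open c:\windows\install.log now"))
print(tokenize("mail@microsoft.com http://www.microsoft.com"))
```

This simplified version treats any run of non-space characters after a URL scheme or drive prefix as part of the entity, so trailing punctuation would need extra handling in practice.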


Additional Features for Search

This section groups together a number of additional features related to the SPSv3 word breaker. These features are:


Pass-through Feature

By enclosing a query in quotation marks, the word or words in the search query are matched without change against the index.
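
A minimal sketch of the pass-through decision is given below. The function prepare_query and the stand-in toy_expand are hypothetical names; the real expansion step is the morphological processing described later in this document.

```python
def prepare_query(query, expand):
    """Return the list of terms to match against the index."""
    stripped = query.strip()
    if len(stripped) >= 2 and stripped.startswith('"') and stripped.endswith('"'):
        # Quoted: pass the text through exactly as typed, minus the quotes.
        return [stripped[1:-1]]
    # Unquoted: let the supplied expansion step add related forms.
    return expand(stripped)


def toy_expand(term):
    # Stand-in for real morphological expansion (an assumption for the demo).
    return [term, term + "s"]


print(prepare_query('"ship"', toy_expand))
print(prepare_query("ship", toy_expand))
```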


Noise Words

Noise words can be used to remove irrelevant matches ("noise") from searches. The list of noise words is necessarily language-specific. The noise word list should contain frequent words that do not substantially add to the meaning of a user query.
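
Noise word filtering can be sketched as a simple set lookup. The list below is a tiny illustrative assumption (three common Arabic prepositions and two English function words), not the noise word list shipped with the product, and remove_noise is a hypothetical name.

```python
# Illustrative, assumed noise words: Arabic "in", "from", "on" plus two
# English function words. A real list is language-specific and much longer.
NOISE_WORDS = {"في", "من", "على", "the", "of"}


def remove_noise(tokens):
    """Drop tokens that appear in the noise word list."""
    return [token for token in tokens if token.lower() not in NOISE_WORDS]


print(remove_noise(["البحث", "في", "الوثائق"]))
print(remove_noise(["The", "quality", "of", "widgets"]))
```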


Special Word List

A Special Word List is provided internally in SPSv3 to prevent words which contain breaking characters (e.g. C++, C#, .Net) from being broken for the purposes of search queries.


Custom Dictionaries

Custom dictionaries have been provided to enable an SPSv3 Site Administrator to add further words which contain breaking characters to the Special Word List so that they will not be broken for the purposes of search queries. This allows the Administrator to customize the word breaker to the requirements of a particular local environment or language.
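
The interplay between the built-in Special Word List and a custom dictionary can be sketched as follows. The built-in entries are the three examples named in this document; the function build_tokenizer, the custom entry "AT&T" and the case-insensitive matching are assumptions made for illustration, and the actual custom dictionary format is not shown here.

```python
import re

# The three examples named in this document; the real list is internal.
BUILT_IN_SPECIAL_WORDS = {"C++", "C#", ".Net"}


def build_tokenizer(custom_words=()):
    """Return a tokenizer that keeps special and custom words unbroken."""
    protected = sorted(
        BUILT_IN_SPECIAL_WORDS | set(custom_words), key=len, reverse=True
    )
    alternatives = "|".join(re.escape(word) for word in protected)
    # Protected words are tried first; anything else falls back to \w+ runs.
    pattern = re.compile(
        rf"(?<!\w)(?:{alternatives})(?!\w)|\w+", re.IGNORECASE
    )
    return pattern.findall


default_tokenizer = build_tokenizer()
custom_tokenizer = build_tokenizer(["AT&T"])  # hypothetical administrator entry

print(default_tokenizer("C++ and C# in .Net"))
print(custom_tokenizer("AT&T uses C#"))
```

Without the custom entry, the ampersand acts as a breaking character and "AT&T" is split into two tokens.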


Diacritics

The word breaker preserves diacritics, emitting each word form with its diacritics intact. Diacritics are marks added to a letter or phoneme to indicate a special phonetic value. They distinguish words that are otherwise graphically identical, such as "اليُمنُ", "لُبنانِ".
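
To show why preserving diacritics matters, the sketch below takes two Arabic words that share the same letters and differ only in their diacritics (the words meaning "knowledge" and "flag", chosen here as an illustration and not taken from the product). The helper strip_diacritics is hypothetical and exists only to show what would be lost if the marks were dropped; it is not a step the word breaker performs.

```python
import unicodedata


def strip_diacritics(word):
    """Remove nonspacing marks; shown only to illustrate what is lost."""
    return "".join(ch for ch in word if unicodedata.category(ch) != "Mn")


# ain + kasra, lam + sukun, meem: "knowledge"
ILM = "\u0639\u0650\u0644\u0652\u0645"
# ain + fatha, lam + fatha, meem: "flag"
ALAM = "\u0639\u064E\u0644\u064E\u0645"

# The emitted forms differ, but their bare letter skeletons are identical.
print(ILM != ALAM)
print(strip_diacritics(ILM) == strip_diacritics(ALAM))
```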


Morphology and Word Breakers

Once word breaking has taken place and the individual word forms have been identified, SPSv3 morphological processing enables the word forms to be linked to other morphologically related forms of the same word for the purpose of expanding the user search query to find more relevant matches.

This morphological linking can be achieved in different ways.


Morphological Analysis

The inflected word form can simply be linked to its base form and, via the base form, to all inflected forms sharing that base form. This requires the index to store, along with each inflected form, its base form(s). Note that more than one base form may be associated with words that are ambiguous with respect to their part of speech (e.g. “lives” may have a verb base form “live” and a noun base form “life”). For languages with very large numbers of related word forms, this is the approach generally adopted.
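
The index-time approach can be sketched with a toy inverted index keyed by base form. The lookup table BASE_FORMS is a hand-written assumption built around the "lives" example above; a real analyzer would compute base forms rather than look them up, and the function names are hypothetical.

```python
from collections import defaultdict

# Hand-written toy table; "lives" is ambiguous and has two base forms.
BASE_FORMS = {
    "live": {"live"},
    "lives": {"live", "life"},
    "living": {"live"},
    "life": {"life"},
}


def base_forms(word):
    """Return the base form(s) of a word; unknown words map to themselves."""
    return BASE_FORMS.get(word, {word})


def build_index(docs):
    """Index each document under the base form(s) of every word it contains."""
    index = defaultdict(set)
    for doc_id, words in docs.items():
        for word in words:
            for base in base_forms(word):
                index[base].add(doc_id)
    return index


def search(index, term):
    """Reduce the query term to its base form(s) and look those up."""
    hits = set()
    for base in base_forms(term):
        hits |= index.get(base, set())
    return hits


docs = {1: ["lives"], 2: ["living"], 3: ["life"]}
index = build_index(docs)
print(search(index, "live"), search(index, "lives"))
```

The query itself stays a single lookup per base form; the cost is paid in the extra base form information stored at index time.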


Morphological Generation

Alternatively, the inflected query term can be expanded to all inflected forms related to its base form, requiring more matches to be made at query time but less information to be stored at index time. The choice of which approach to adopt depends on performance considerations such as size of index, speed of indexing, and the number of query terms generated for search. For most languages with less extensive morphological variation, this is the approach generally adopted.
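
For contrast, the query-time approach can be sketched as follows: the index stores only the surface forms found in documents, and the query term is expanded to every related form. The table INFLECTIONS is a hand-written assumption and the function names are hypothetical.

```python
# Hand-written toy table mapping each base form to its inflected forms.
INFLECTIONS = {
    "live": {"live", "lives", "lived", "living"},
    "life": {"life", "lives"},
}


def expand(term):
    """Expand a query term to all forms sharing a base form with it."""
    expanded = {term}
    for forms in INFLECTIONS.values():
        if term in forms:
            expanded |= forms
    return expanded


def search(surface_index, term):
    """Look up every expanded form in an index of unmodified surface forms."""
    hits = set()
    for form in expand(term):
        hits |= surface_index.get(form, set())
    return hits


surface_index = {"lives": {1}, "living": {2}, "life": {3}}
print(sorted(expand("living")))
print(search(surface_index, "lived"))
```

Here the index stays small, but one query term turns into several lookups, which is the trade-off described above.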


Arabic Morphological Analyzer

The Arabic Morphological Analyzer embedded in the Arabic search engine provides linguistic support for the analysis of the Arabic language, which is characterized by rich morphology. This section groups together some of its principal aspects.


Coverage and Reliability

The search engine provides more relevant results related to the search query term by using morphological processing. For example:

  • When searching for the word سفينة, the forms such as وسفن will also appear in the search results.


User Scenarios

The user may specify different forms of the word as a query. For example, the user might enter اسلحة or أسلحة (plural form) rather than سلاح as a query. In this case, the results must be consistent for any query. It would be a major setback for the user if أسلحة yielded different results from سلاح due to an incomplete analysis of the query.

As another example, entering لطائرات should return the same results as entering الطائرات or بطائرة. This is especially important in multi-word queries, where sets of words are copied and pasted, and the user does not want (or does not know how) to reduce the words down to their basic form. Simple or rule-based stemmers (basic stemmers) do not guarantee this: some inflections trigger different results, or no results at all.
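
The consistency requirement can be expressed as a simple check: every variant of a word should reduce to the same lookup key. The sketch below uses a hand-built table containing only the forms quoted in this section as a stand-in for the analyzer, which derives base forms by morphological analysis rather than by lookup. The names BASE_FORM and query_key are hypothetical.

```python
# Toy stand-in for the analyzer: only the word forms quoted in this section.
BASE_FORM = {
    "سلاح": "سلاح",
    "أسلحة": "سلاح",
    "اسلحة": "سلاح",
    "طائرة": "طائرة",
    "لطائرات": "طائرة",
    "الطائرات": "طائرة",
    "بطائرة": "طائرة",
    "سفينة": "سفينة",
    "وسفن": "سفينة",
}


def query_key(query):
    """Reduce each query word to its base form; unknown words are kept as-is."""
    return tuple(BASE_FORM.get(word, word) for word in query.split())


# Every variant of the same word yields the same key, hence the same results.
print(query_key("أسلحة") == query_key("سلاح") == query_key("اسلحة"))
print(query_key("لطائرات") == query_key("الطائرات") == query_key("بطائرة"))
```

Because each word is reduced independently, the same holds for multi-word queries that are pasted in without being reduced by hand.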



The information contained in this document represents the current view of Microsoft Corporation on the issues discussed as of the date of publication. Because Microsoft must respond to changing market conditions, this document should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information presented after the date of publication. The information represents the product at the time this document was printed and should be used for planning purposes only. Information is subject to change at any time without prior notice.

This document is for informational purposes only. MICROSOFT MAKES NO WARRANTIES, EXPRESS OR IMPLIED, IN THIS DOCUMENT.

© 2006 Microsoft Corporation. All rights reserved. Microsoft, Active Directory, Excel, Internet Explorer, Microsoft Dynamics, MSDN, the Office logo, Outlook, PivotTable, PowerPoint, SharePoint, Visio, Visual Basic, Visual Studio, Windows, Windows Server, Windows Server System, and Windows Vista are either registered trademarks or trademarks of Microsoft Corporation in the United States and/or other countries. All other trademarks are property of their respective owners.




 Last updated Sunday, April 29, 2007