Search Glossary


-A-

Absolute Boosting
Enables a document to be displayed consistently at a given position in the result set when a user searches with a specific query. It also prevents individual documents from being displayed when a user searches with a specific query.

Accent Normalization
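Matches accented and unaccented spellings of a word by normalizing accented characters. The index can preserve both the original and normalized forms for accented words (for example, hôtel).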
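
As a rough illustration of keeping both forms, the sketch below strips combining accent marks with Python's standard unicodedata module. The function names strip_accents and index_forms are hypothetical and not part of any search product.

```python
import unicodedata


def strip_accents(word: str) -> str:
    """Return the word with combining accent marks removed."""
    # NFD splits an accented character into its base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def index_forms(word: str) -> set[str]:
    """Return the forms to index: the original and the normalized one."""
    return {word, strip_accents(word)}
```

With both forms indexed, a query for either "hôtel" or "hotel" can reach the same document.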

Access Control List (ACL)
A data set which grants permissions, or access rights, to each user or group for a specific system object, such as a directory or file. Using the ACL information from the content repositories, the same permissions apply to search results. This means that a user is able to see only the query results that he or she is entitled to view, based on his or her permissions towards the source content repository.
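
One simple way to picture this is a filter over search hits. The sketch below assumes each hit carries an "acl" set of user and group names copied from the source repository; the field name and the function visible_results are hypothetical.

```python
def visible_results(results: list[dict], user: str, groups: set[str]) -> list[dict]:
    """Keep only hits whose ACL names the user or one of the user's groups."""
    principals = {user, *groups}
    # A hit is visible when its ACL shares at least one principal with the user.
    return [hit for hit in results if principals & hit["acl"]]
```

A real system would usually resolve this inside the index rather than after the query, as described under Index-Based Security.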

Adjacent Searching
Commonly referred to as proximity search. An extension to Boolean searching, this technique checks the position of terms and only matches those within the specified distance. It's a good way to cut down on irrelevant matches and get better results.

Alert
A message that the enterprise search engine broadcasts (for example, to a front-end application, or a messaging system, such as e-mail, Short Message Service [SMS], or Instant Messaging [IM]) when a document satisfies a stored query. Alerts are either near real-time or configured as asynchronous events run on a scheduled basis.

Alert Engine
A matching engine which performs matching of incoming documents against stored queries (triggers). A match generates an alert.

Alert Query
An alert query is the set of filtering conditions an end user or external application sends to the alert engine. Each alert query is composed of several matching conditions and Boolean operators in a way that is similar to a search query.

Anchor Text
The textual components of Web hyperlinks (text links or 'alt' text associated with image hyperlinks). Anchor text may provide additional descriptive information about the referred page and, therefore, is often indexed as metadata to the referred document. Anchor texts may be used when detecting links from other pages to a given page. The anchor texts associated with these links can be used to compute an authority rank component. The referring anchor texts may also be included as searchable content for the referred documents.

Anti-Phrasing
Identifying word sequences in queries that do not contribute essentially to the query's meaning, such as "Where can I find" or "Where is."

Approximate Matching
Matching a query term and a term within a document based on approximations. Such approximations can be based on spell-check (see Spell-Checking) or linguistic normalization (such as lemmatization or accent normalization).

Asian Language Tokenization
Tokenization (word segmentation) for Asian languages requires special treatment. These languages do not allow text to be split into word entities by referring to white space or other separators. Asian language text needs to be split into tokens that can be treated as words during document processing and matching.

Authority
One dimension of search relevancy. This indicates that the document is considered to be an authority for this query. That is, the document is being referred to by others, for example, through Web anchor texts. Many items can be part of the analysis of documents to determine this parameter—Web link cardinality, article references, page impressions, and product revenue, to name a few.

Average Query Response Time
The average time it takes for the search engine to respond to a given query. Typically, there are two times that can be measured: 1) the average response time of the search engine itself, and 2) that of the complete system for an end-to-end query (that is, including the application and Web server times).

 

-B-

Bigram
A bigram is another term for a two-word phrase. It can also be seen as an N-gram where N=2.
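
A minimal sketch of N-gram generation over whitespace-separated words; the function name ngrams and the whitespace tokenization are illustrative assumptions.

```python
def ngrams(text: str, n: int = 2) -> list[tuple[str, ...]]:
    """Return all runs of n consecutive words; n=2 yields bigrams."""
    words = text.split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
```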

Boolean Search or Boolean Query
A form of logical comparison. Boolean operators let you define whether multiple search terms are matched within a text block. A Boolean expression is constructed by joining terms together with the three special operators: AND, OR, and NOT. You also can combine sub-expressions within a query using these Boolean expressions. Proximity search using the NEAR/ONEAR operators is somewhat related. NEAR is similar to AND but implies an additional constraint that the terms should appear within a given distance in the document.
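
The three operators map naturally onto set operations over the documents that contain each term. The sketch below uses a tiny made-up corpus; docs and postings are hypothetical names, and a real engine would look terms up in an index instead of scanning text.

```python
# Hypothetical three-document corpus, keyed by document ID.
docs = {
    1: "enterprise search platform",
    2: "web search engine",
    3: "enterprise content management",
}


def postings(term: str) -> set[int]:
    """Return the IDs of documents that contain the term."""
    return {doc_id for doc_id, text in docs.items() if term in text.split()}


both = postings("enterprise") & postings("search")     # enterprise AND search
either = postings("platform") | postings("engine")     # platform OR engine
without = postings("enterprise") - postings("search")  # enterprise AND NOT search
```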

Boosting
May be used to alter the relevancy value of a document compared to other documents in a search index, typically because it is perceived to be a more valuable resource. It is the addition or subtraction of a value to a document's rank (relevancy). By default, documents with the highest rank values are returned to the user before documents of lower rank values. Boosting may be applied in two ways: query independent (document boosting), used to boost high-quality pages for all queries that match the document; and query dependent (query boosting), where specific documents may be boosted for given queries.

Boundary Match
The ability to limit a query term or phrase to the start and/or end of an indexed field or parameter. Combining start and end condition provides an exact field/parameter match.

 

-C-

Case Sensitive/Insensitive Searching
Search engines most often will normalize words to lower case. Some search applications, though, may use case-sensitive search against specific content, such as metadata.

Categorization or Classification
The process of organizing pieces of information into topical categories. Usually, these are hierarchical trees, with the most general topics at the top and the most specific at the bottom. A department store might have: Products, Shoes, Women, Cross-Trainers, while a gardening site might have a category: Plants, Flowers, California Natives, Poppies. In either case, searchers can understand more about the content of the page when they know the category. Some categorization products will attempt to classify data automatically, while others assist human catalogers. A search engine may apply categorization of the documents in the index based on similarities (typically based on a training set), matching rules, or programmatic rules. See also Results Clustering.

Collection
Content that is to be processed, made searchable, and retrieved as a logical unit. Content types can be grouped by source and by the processing rules that are to be applied to this type of content.

Collection-Level Security
The application tier will assign different authorization levels to various collections within the search index. End users then have access to the set of collections that map to their authorization levels.

Completeness
In relation to relevancy, a gauge of how well the document matches superior document contexts, such as the title or the URL. It describes what matches the query: document title, author, mention in the body text, metadata linked to the document, and both the root and expanded form of words.

Concept Extraction
The ability to mine concepts from data using linguistic analysis.

Content Aggregation
Bringing together content from multiple source repositories for retrieval at a later time. In some cases, this term also is used for the amalgamation of search results into a comprehensive whole.

Content Connector
A content connector extracts content from an external content repository (file systems, content management systems, databases, collaboration applications) and adds this to the search system for indexing. Connectors may be based on push or pull technology, depending on the capabilities of the content repository.

Content Routing
In a large search system, the index is split into multiple columns/partitions. Often, different algorithms may be chosen for routing content to columns/partitions. An efficient method is to apply a statistical distribution of documents based on a hashing algorithm. In other cases, routing may be based on collections or other attributes of the content.
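
A hash-based router could look roughly like the sketch below. The choice of SHA-1 and the function name route_to_column are assumptions made for illustration; any stable hash with a uniform distribution would serve.

```python
import hashlib


def route_to_column(doc_id: str, num_columns: int) -> int:
    """Map a document ID to a column number in [0, num_columns)."""
    # A cryptographic digest is stable across runs, unlike Python's built-in hash().
    digest = hashlib.sha1(doc_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_columns
```

Because the mapping depends only on the document ID, an update to a document is routed to the same column as the original.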

Context Relevancy
One dimension of search relevancy. The importance of a term, phrase, or entity match depends on the matching context. Contexts may be fields or semantic structures of the document, such as paragraph, sentence, or title.

Contextual Entity Extraction
Extracted entities (see Entity Extraction) can be annotated automatically to semantic structures in the text, such as paragraphs or sentences. Such annotation enables normalized matching of entities in addition to contextual navigation into detected entities from search results.

Contextual Insight
Next-generation search intelligence that dynamically identifies relationships so that users can find facts and answers to questions quickly, such as, "When did the Berlin Wall fall?" Users get both the contextual results with extreme precision and the contextual navigation for further investigation of related information.

Crawling
The act of accessing Web servers or file systems in order to extract information to feed into the enterprise search platform. By following links, a crawler is able to traverse Web content hierarchies based on a single-start URL.

 

-D-

Date Range
A search engine may provide an option to search for documents modified on a specific date, before a date, after a date, or between two dates. See also Freshness Boosting.

Deep Navigator
A type of dynamic drill-down navigator which applies on-the-fly aggregation of result values across the entire result set for a query. See also Navigator.

Dictionary/Thesaurus
Within the context of search, a dictionary supports linguistic processing of content and queries against a list of words, terms, or phrases to improve recall and precision for a query. A compiled dictionary structure normally is used for performance reasons.

Directed Search
A narrow search within a specified area of the indexed content. Users may choose to search within "news" if they want the latest updates on today's game, for example, instead of having to search within "news," "culture," and "sports."

Document Element
Part of a document within the document processing framework. A document is divided into elements to enable individual processing and indexing of structural parts of the documents. These may be heading, body, and metadata from HTML documents, or fields within a database schema. Document elements are mapped to searchable fields within the index.

Document-Level Security
Within a search engine, document-level security implies that the search index provides the same document access control granularity as the source content repositories. This may be facilitated by mapping the ACL information from the content repositories to the index.

Document Processing Pipeline
The sequential set of pre-index document processing stages.

Document Processing Stage
The document processing stage may modify, remove, or add information to a document, such as adding new meta-information for linguistic processing or extracting information about the language the document is written in. Also known as a document processor.

Document Summary
A document summary represents the subset of a matching document's content that is returned with a query result.

Document Summary Field
The content of an individual field within a document summary. The set of summary fields returned is typically a subset of the indexed fields of a document. See also Dynamic Teaser.

Document Vector
A set of (keyword, weight) pairs, where keyword is a word or a phrase associated with the document, and weight is a numerical measure of how important keyword is for the document. Vectors are a kind of document signature (word-weight pairs) representing a document's content in a way that allows comparison between documents. It is the numerical representation of the unstructured textual content of a document. Vectors can be used to enable clustering and refinement operations.
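
One common way to compare two such vectors is cosine similarity. The glossary does not name a specific measure, so the sketch below is an assumption; vectors are represented as plain keyword-to-weight dictionaries.

```python
import math


def cosine_similarity(a: dict[str, float], b: dict[str, float]) -> float:
    """Return the cosine of the angle between two keyword-weight vectors."""
    dot = sum(weight * b.get(keyword, 0.0) for keyword, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_a * norm_b)
```

Documents that share heavily weighted keywords score close to 1; documents with no keywords in common score 0.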

Duplicate Detection
Search engines may apply different levels of duplicate detection. Exact duplicates are copies of the same document located in different repositories. The next level of duplicate detection is typically to look for documents with equal visible content (excluding metadata). Certain applications may apply even more aggressive duplicate detection (for example, based on a set of fields that are equal). See also Field Collapsing.

Dynamic Concept Extraction
The ability to mine concepts from data present in the result set of a query through statistical and linguistic analysis. See also Entity Extraction.

Dynamic Drill-Down
A navigation tool for structured data, dynamic drill-down provides multidimensional drill-down in structured data based on facets of content. This enables on-the-fly aggregation of result values for multiple fields across the documents in a result set. For numeric data, this also includes dynamic binning of result values based on statistical value distribution across the result set.
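
The aggregation step can be pictured as counting field values across the result set. This is a minimal sketch with a hypothetical function name, navigator_counts; it does not cover the dynamic binning of numeric values.

```python
from collections import Counter


def navigator_counts(result_set: list[dict], field: str) -> Counter:
    """Count how many documents in the result set carry each value of a field."""
    # Documents that lack the field are skipped rather than counted as empty.
    return Counter(doc[field] for doc in result_set if field in doc)
```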

Dynamic Rank
The process by which rank components are computed during matching, based on the level of match between document and query.

Dynamic Teaser
A short summary of a document, generated based on the actual query. The dynamic teaser shows the regions of the document matching the query with the query terms highlighted.

 

-E-

ETL-Type Tools
Extract, transform, and load (ETL) is a data-integration function that involves extracting data from outside sources, transforming it to fit business needs and ultimately loading it into a data warehouse. In a search application, ETL tools may be used for merging of database records and content normalization.

Entity Extraction
The ability of an enterprise search platform to parse and recognize informational entities based on dictionaries, patterns, or a combination of the two, such as geographic names, personal names, company names, company-specific products, etc. Entities typically can be annotated to the indexed documents and enhance the search and navigation experience. See also Contextual Entity Extraction.

Exact Match
Matching query terms to document words exactly. This will not allow fuzzy matching based on spell-check or linguistic normalization.

 

-F-

False Positives
When a search returns results that do not contain what was searched for.

Federated Search
In a federated search, users receive results from multiple search and retrieval systems, for example, from other search engines, commercial information services, or internal databases. Federation is the blending of results from multiple, often non-compatible, search systems.

Field
The schema of a search index splits documents into fields. Fields specify those elements of a document that are to be searchable or presented in the result.

Field Collapsing
Used in order to collapse a group of results with similar value for a given field to a single entry in the result set. Site collapsing is a special case of this, where all results for a given Web site are collapsed into one or two entries in the result set, typically with an associated "more documents from this site" link. See also Duplicate Detection.
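
A minimal sketch of collapsing, assuming the results arrive already sorted by rank. The function name collapse and the keep parameter are hypothetical.

```python
def collapse(results: list[dict], field: str, keep: int = 1) -> list[dict]:
    """Keep at most `keep` results for each distinct value of `field`."""
    seen: dict[object, int] = {}
    collapsed = []
    for hit in results:  # input order (rank order) is preserved
        value = hit.get(field)
        if seen.get(value, 0) < keep:
            collapsed.append(hit)
        seen[value] = seen.get(value, 0) + 1
    return collapsed
```

Calling it with field set to a site name and keep=2 would correspond to the site collapsing case with two entries per site.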

File Traverser
A tool for accessing the files (for example, Microsoft Office Word, HTML, and XML files) that are located on a standard file system in order to bring them into the index of the enterprise search platform. A file traverser works along file system directory structures, whereas a Web Crawler crawls Web servers along URI structures.

Freshness
The "age" of the document compared to the time of query.

Freshness Boosting
Enhances relevancy by boosting documents based on their relative age—that is, compared to the time of query.

Full-Text Sorting
Sorting search results based on the full textual content of a field (or a configurable number of characters).

Fuzzy Matching
Exact matching is very strict: either a word matches or it doesn’t. An attempt to improve search recall by matching more than the exact word, fuzzy matching techniques try to reduce words to their core and then match all forms of the word. See also Approximate Matching.

 

-G-

Geo-Enabled Search
The ability to sort, filter, and rank documents based on their geographical distance from a given source position, typically the position of the end user.
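
Sorting by distance needs a distance function. The sketch below uses the haversine great-circle formula with a mean Earth radius of 6371 km; that choice, the "lat"/"lon" field names, and the function names are assumptions for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Return the great-circle distance in kilometers between two points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def sort_by_distance(docs: list[dict], lat: float, lon: float) -> list[dict]:
    """Return documents ordered from nearest to farthest from (lat, lon)."""
    return sorted(docs, key=lambda d: haversine_km(lat, lon, d["lat"], d["lon"]))
```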

Golden Set
The set of documents and queries that are to be used for testing (a minimum of 2,000 documents and at least 50 queries). Typically these are selected manually.

 

-I-

Impression Logging
The ability to log all query results that are displayed to the end user. This means that each query result generates an impression log entry for each document returned on the result page.

Incremental Indexing
Enables an efficient combination of freshness and scalability in a search node. The index can be partitioned into different segments (partitions) with different refresh rates. Data is indexed initially in the smallest index, where it resides for a given period of time before it propagates to a larger index. This procedure repeats itself: Once data has entered index P, it stays there for a configurable amount of time before it moves to index P+1. This concept takes place within one search node and is independent of the partitioning performed using multiple columns of search nodes.

Index
An index is an inverted representation of searchable content: a list of the terms occurring in the documents, each with a reference to all documents that contain it.
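
A toy version of this structure is sketched below. Lowercasing and whitespace splitting stand in for real tokenization, and the function name build_index is hypothetical.

```python
from collections import defaultdict


def build_index(docs: dict[int, str]) -> dict[str, set[int]]:
    """Map each term to the set of IDs of the documents that contain it."""
    index: defaultdict[str, set[int]] = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return dict(index)
```

Looking up a term is then a single dictionary access, independent of how many documents were indexed.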

Index-Based Security
The resolution of a repository's document ACL permissions at query time by the index itself through the use of stored metadata. Using this method, results lists will only include hits for which the searcher has viewing permissions. Compared to post-processing, the index-based security method gives higher query performance and enables the search engine to return correct counts for navigators and related concepts.

Index Profile
The configuration file that defines the schema for a searchable index. This includes fields and properties of the index, similar to an XML schema, but also specifies field types and search engine-specific field features.

Indexing Latency
The time from when a document is added to the search system to when the document is included in the searchable index.

Ingestion Rate
The number of documents per unit time that an enterprise search platform can process.

 

-L-

Lemmatization
Using lemmatization enables the search system to recognize and match different grammatical forms of a word. For example, searching for "mouse" will also produce hits on "mice."

Lemmatization by Expansion
A type of lemmatization which expands words into the full set of inflected forms.

Lemmatization by Reduction
A type of lemmatization, also referred to as base form reduction, that normalizes indexed terms and query terms to their grammatical base form. For example, "ate" becomes "eat."
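
In its simplest form, reduction is a dictionary lookup applied to both indexed terms and query terms. The BASE_FORMS table below is a tiny hypothetical dictionary; production systems use full linguistic dictionaries for each language.

```python
# Hypothetical inflected-form-to-base-form dictionary.
BASE_FORMS = {"ate": "eat", "eaten": "eat", "mice": "mouse"}


def lemmatize(term: str) -> str:
    """Return the base form of a term, or the lowercased term if unknown."""
    lowered = term.lower()
    return BASE_FORMS.get(lowered, lowered)


def matches(query_term: str, document_term: str) -> bool:
    """Two terms match when they reduce to the same base form."""
    return lemmatize(query_term) == lemmatize(document_term)
```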

Linguistics
The study of the nature, structure, and variation of language. In advanced enterprise search platforms, linguistics analysis enables transformation of content and queries for the purposes of improving relevancy, recall, and precision.

Link Cardinality
The number of links in a set that refer to a given document. It is best used to determine the relevancy of a Web page by factoring in how many other pages refer to the page under consideration.

 

-M-

Metadata
Often described as "data about data." Metadata typically augments the full text of a document to help with recall, precision, creating filters, and working with navigators.

Mining
Finding useful facts in databases of text; evaluating large amounts of stored data and looking for useful patterns.

More Like This
A way to refine search by identifying the right set of documents and then locating similar documents. This allows the searcher to control the direction of the search and focus on the most fruitful lines of inquiry.

Morphologic Analysis
Used in query analysis, this analysis includes all forms of a given word through linguistic normalization (lemmatization).

Morphology
The study of the structure and form of words in language or a language, including inflection, derivation, and the formation of compounds.

Multilevel Sorting
Sorting by multiple fields. Both text and integer fields may be sorted upon (ascending or descending). Field sorting may be combined with rank sorting.

 

-N-

Name-Value Pairs
In a search context, name-value pairs are raw data that is normalized into a structured "tree" of information. They are then sent downstream to waiting document processors. For example, name-value pairs can be data about cars that is structured into categories containing information about "make," "color," "year," and "mileage."

Natural Language Processing (NLP)
Instead of using Boolean logic, the user simply can type in a question as a query. The simplest processing just removes stop words and uses statistical approaches. Natural language processing is the process of using linguistic analysis to infer meaning from human-written text that could not be extracted using the individual word meanings.

Navigation
Information discovery through drill-down into query results. Navigation is possible both on document level attributes and entities along with contextual entities within the matching context of the search results. Dynamic drill-down may be used to drill down into any dimension of the documents that can be represented as numeric information or well-defined terms or strings. The combination of entity extraction and drill-down provides a powerful way of drilling down into the results. See also Navigator.

Navigator
A construct that enables filtering and grouping of search results. On an international site, you may have a navigator that enables you to display only results with content in a given language (for instance, "Display English results only").

Node
In general, a node is a basic unit used to build data structures, such as linked lists and tree data structures. In an enterprise search system, a node usually refers to a server.

Noun Phrase Extraction
Implies that phrases, such as "competitive advantage," "key driver," and "seller's market," can be extracted and annotated to a document prior to indexing.

 

-O-

Offensive Content Filter
Detects offensive content (sexual, drug-related, violent) in a document and can tag the document as offensive and, based on that, either exclude the document from the index (as spam) or enable removal of offensive content on a per-query basis.

Ontology
Defines concepts, providing a way to move towards consistency in vocabulary. It provides a working model of the entities and interactions of a particular topic, such as dentistry or anthropology. It also captures specific knowledge related to a given domain, for example, finance or pharmaceuticals.

Orthographic Analysis
Used in checking for typos and official variants (for example, German spelling).

 

-P-

Parametric Search
Allows people to find items of interest based on an individual item’s parameters or particular characteristics. Such parameters or facets may be represented as fields within a search index.

Parsing
The process of analyzing input to determine its grammatical structure with respect to a formal grammar. A parser is a computer program that carries out this task. Parsing transforms input text into a data structure, usually a tree, which is suitable for later processing and which captures the implied hierarchy of the input. Generally, parsers operate in two stages: first identifying the meaningful tokens in the input and then building a parse tree from those tokens.

Phonetic Search
The analysis of words that are pronounced similarly in order to detect all possible variants.

Phrase Detection
The recognition and grouping of an idiom, such as "home run" or "Christmas tree." Detection of an implicit phrase in a query may improve the precision of a query.

Phrase Searching
A search engine may provide an option to search a set of words as a phrase, either by typing in quotation marks (""), by using a command, or by clicking a button. When it receives this kind of search, the engine will generally locate all words that match the search terms and discard those which are not next to each other in the correct order. To perform this task, the index must store the position of the word in the document so that the search engine can tell where the words are located. See also Proximity Search.

Phrasing
The recognition and grouping of an idiom, such as "home run" or "Christmas tree."

Precision and Recall
Precision is the ability to retrieve the most precise results. Higher precision means better relevance and more precise results but may imply fewer results returned. For a query, recall means the ability to retrieve as many documents as possible that match or are related to a query. Recall may be improved by linguistic processing, such as lemmatization, spell-checking, and synonym expansion. In information retrieval, there's a classic tension between recall and precision. By specifying more recall (trying to find all the relevant items), you often get a lot of junk. If you limit your search by trying to find only precisely relevant items, you can miss important items because they don’t use quite the same vocabulary.
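
The tension described above is easier to see with numbers. The sketch below uses the standard information-retrieval formulas (precision as the share of retrieved documents that are relevant, recall as the share of relevant documents that are retrieved), which is an assumption beyond the informal wording of this entry.

```python
def precision_recall(retrieved: set, relevant: set) -> tuple[float, float]:
    """Return (precision, recall) for one query."""
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```

Returning every document in the index drives recall to 1 while precision collapses; returning a single sure hit does the opposite.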

Processing Pipeline
Sequential stages of processing within the search engine before the creation of final index of the content.

Proper Name Recognition
A way of identifying word sequences in text that are defined as proper names or phrases in the appropriate dictionary. See also Spell-Checking.

Proximity Boosting
Documents that contain the query terms closer together are ranked higher than documents that contain these terms distributed throughout the document. This may also be referred to as implicit proximity.

Proximity Search
An extension to Boolean searching, this technique checks the position of terms and only matches those within the specified distance. It's a good way to cut down on irrelevant matches and get better results. Search using the NEAR/ONEAR operators places an explicit proximity constraint on the operands of the operator. NEAR is similar to AND but requires that the terms appear within a given word distance in the document. ONEAR also requires that the terms appear in the order given in the query.
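
The difference between the two operators can be sketched over a list of tokens. Measuring distance as the difference between word positions is an assumption here; actual engines may count intervening words differently.

```python
def positions(tokens: list[str], term: str) -> list[int]:
    """Return every position at which the term occurs."""
    return [i for i, token in enumerate(tokens) if token == term]


def near(tokens: list[str], a: str, b: str, distance: int) -> bool:
    """True if a and b occur within `distance` positions, in any order."""
    return any(abs(i - j) <= distance
               for i in positions(tokens, a) for j in positions(tokens, b))


def onear(tokens: list[str], a: str, b: str, distance: int) -> bool:
    """True if b occurs after a, no more than `distance` positions later."""
    return any(0 < j - i <= distance
               for i in positions(tokens, a) for j in positions(tokens, b))
```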

 

-Q-

Quality
In relation to relevancy, this term refers to the quality of the document and how important it is as viewed by the content owner or search application.

Queries Per Second (QPS)
The number of queries that the enterprise search platform will process in one second. This normally is a function of hardware (capability) and licensing (what is allowed based on contract terms).

Query
The combination of the word or words used for searching and any options allowed by the search engine.

Query and Result Processing
The application of algorithms to the original query or to the raw results returned by the search engine. This is useful for modifying queries to reflect an inferred behavior, for example, using synonym expansion or business rules to modify the results (such as resorting or teaser modification), and to customize the search experience. The overall goal is to analyze and identify the essence of the searcher’s intent from the query and to return the most relevant set of results.

Query Syntax
The syntactic rules that must be observed when submitting queries to a search engine (for example, the use of parentheses and Boolean operators). Sometimes, a query transformation stage may be used to allow end users to use a different syntax from the one expected by the search engine.

Query Term Weight
The ability to support different relevance weight for different terms in a query.

Query Transformation
The analysis and subsequent rewriting of a query, using linguistic transformations, such as lemmatization and spell-checking. Custom query transformation stages may also be used if necessary. Equivalent to query processing.

 

-R-

Range Restrictions
The ability to limit a search to a specified range of a numerical metadata field. For example, a search for a digital camera priced between $250 and $400.

Rank Profile
A rank profile concept enables full control of the relative weight of each component for a given query (for example, how important an article's title is relative to the main text or how important proximity is versus freshness). This enables individual relevance tuning of different query applications.
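
One way to read this is as a weighted sum over rank components. The sketch below assumes that model; the component names, the weights, and the function name score are made up for illustration.

```python
def score(components: dict[str, float], profile: dict[str, float]) -> float:
    """Combine a document's rank components using the weights of a rank profile."""
    # Components that the profile does not mention contribute nothing.
    return sum(profile.get(name, 0.0) * value for name, value in components.items())


# Two hypothetical profiles for two query applications.
news_profile = {"freshness": 0.7, "proximity": 0.3}
archive_profile = {"freshness": 0.1, "proximity": 0.9}
```

The same document can then rank high in a news application and low in an archive search without any change to the index.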

Ranking
A way of arranging result documents according to their relevancy to a query.

Ranking Models
Models used to determine how closely content matches a particular query and whether it should be included in the search results.

Real-Time Indexing
The ability to index content with short latency, typically within seconds from when the enterprise search platform receives a document for indexing.

Recall
For a query, recall means the ability to retrieve as many documents as possible that match or are related to a query. Recall may be improved by linguistic processing such as lemmatization, spell-checking, and synonym expansion.

Relative Boosting
Enables a document to be displayed always among the first 20 documents in the result list, provided a user searched with a specific query. For all other queries, the ranking position of the document will not be affected.

Relevance or Relevance Ranking
Relevancy is the measure of how well the indexed page answers the question. Only the searcher can define how relevant a document is, in relation to the query: there is no way to automate it. When there are many query matches, the search engines must rank the results by relevance score, sorting the results listing so that the pages most likely to be useful will appear first. Varying algorithms are used to define relevancy.

Result Set
A set of document summaries returned for a query.

Result-Side (Shallow) Navigators
A type of dynamic drill-down navigator. Drill-down navigators are created across an extended but non-exhaustive result set (typically, the 200 highest ranked results).

Results-Based Binning
Performs ad-hoc clustering of results into dynamic bins based on value distribution for this parameter in the results. See also Dynamic Drill-Down.

Results Clustering
Grouping similar results together to make it easy to see which results relate to each other. This can be supervised (based on a taxonomy) or unsupervised (based on on-the-fly similarity analysis).

Results Transformation
The algorithmic processing of search results, which includes result-set reordering, duplicate removal, adding navigation information, and result content conversion or reformatting. Equivalent to results processing.

Rows and Columns
A search installation may be configured in a row and column configuration for performance and fault-tolerance reasons. Multiple columns are used in order to partition the indexed content for large data volumes. Each column contains a unique subset of the indexed content. Multiple rows are used for query performance scaling and fault-tolerance. Each row within a column is identical with respect to the indexed content.

 

-S-

Scalability
Indicates the capability of a system to increase total throughput under an increased load when resources (typically hardware) are added.

Scope Field
Contains hierarchically structured content. It enables schema flexibility and the ability to conserve hierarchical relationships rather than flattening the data as is often required by metadata engines.

Scope Search
Enables search in hierarchical content structures without a need to know the schema in advance.

Search Cluster
A group of search nodes (row/column matrix) that shares the same index schema (index profile).

Search Profile
A concept used in order to identify the set of search attributes common for a given search application. This includes global filter constraints (such as collection), query processing parameters (such as linguistics), and result handling parameters (such as navigation settings).

Search Terms
The words entered by the searcher, which are part of the query, along with other instructions. The search engine will look for these words in the index and return the matching results, usually sorted by relevance. Some search engines will allow Boolean operators, adjacency, match phrases, partial words, and provide other options.

Semantic Analysis
Applying a combination of general and specific thesauri, ontology, and automatic phrasing, for example, to understand the intention of the query.

Semantic Indexing
The indexing of content by detection and annotation of sentences, paragraphs, and other semantic structures in unstructured content. This enables you to limit your search to paragraphs or other semantic elements in the text.

Sentiment Analysis
The evaluation of the sentiment (general tone) of a text or document, typically positive or negative, based on its use of language. The sentiment is determined by applying computational linguistics algorithms.

Similarity Searching
The ability to search for similar documents. Similar in a search context may be similar to a document in a result set or similar to an example document. Similarity searching may be based on: find similar (find documents similar to the selected document or based on input of a full document or a chunk of a document submitted through the search interface); refine similar (within the scope of the original query, find documents similar to the selected document); or exclude similar (within the scope of the original query, find documents different from the selected document). See also Document Vector.

Spell-Check Optimization
The process of optimizing a spell-check dictionary towards a live search index. In this way, the dictionary is aligned with the actual domain of the given search application, taking into consideration term frequencies and domain-specific terminology.

Spell-Checking
Individual query terms and phrases are spell-checked against a dictionary. The spell-check algorithm normally is based on the edit distance between a query term and the dictionary term. The edit distance is given by the number of basic character operations (add, delete, swap) required to transform the misspelled query term or phrase to the closest term in the dictionary. A special variant of spell-check is the phonetic spell-check, where the edit distance is computed based on a phonetic representation of the words. See also Approximate Matching.
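
The edit distance described above can be sketched with the classic dynamic-programming algorithm. This sketch assumes that "swap" means replacing one character with another (the Levenshtein distance); the function names and the sample dictionary are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Count the single-character adds, deletes, and replacements
    needed to turn string a into string b (Levenshtein distance)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # add char_b
                previous[j - 1] + cost,  # replace char_a with char_b
            ))
        previous = current
    return previous[-1]


def suggest(term: str, dictionary):
    """Return the dictionary term closest to the query term.

    Ties are broken alphabetically, which is an arbitrary choice.
    """
    return min(dictionary, key=lambda word: (edit_distance(term, word), word))


suggestion = suggest("serch", ["search", "engine", "index"])
```

A production spell-checker would typically also weigh term frequencies, as described under Spell-Check Optimization, rather than rely on distance alone.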

Statistics
In relation to relevancy, this term indicates statistically how well the content of the overall document matches the query. One measure is the number of times the query terms appear in the document and how rare those terms are within the complete corpus. Another is the proximity of the words in the document, that is, how close they are to one another.

Stemming
Using linguistic analysis to reduce a word to its root form (stem) and then matching all forms of a word in a search query to all forms of the same word in documents. Stemming, in contrast to lemmatization, normally is based only on removing trailing parts of a word, leaving the stem. Lemmatization normally is based on dictionary look-up in addition to stemming rules.
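
A minimal sketch of suffix-stripping stemming follows. The suffix list and the minimum stem length are toy assumptions chosen for this example; real stemmers use far more elaborate, language-specific rule sets.

```python
# Toy suffix list, checked in order; an assumption for illustration.
SUFFIXES = ("ing", "ed", "es", "s")


def stem(word: str) -> str:
    """Strip the first matching trailing suffix, keeping at least
    three characters of stem."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


stems = [stem(w) for w in ("searching", "searched", "searches", "search")]
```

Because all four forms reduce to the same stem, a query for any one of them can match documents containing the others. Irregular forms such as "went" and "go" are out of reach for rules like these, which is where dictionary-based lemmatization comes in.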

Stop Words
Words which are frequent and have little meaning. They are removed and not indexed. In advanced enterprise search platforms, customers can control the list of stop words by managing the stop word dictionary.
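
Stop word removal can be sketched as a filter applied to tokens before indexing. The word list below is a small hypothetical example, not the dictionary of any particular product.

```python
# Hypothetical stop word dictionary; real lists are language specific.
STOP_WORDS = {"a", "an", "and", "of", "the", "to"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens that appear in the stop word dictionary."""
    return [token for token in tokens if token.lower() not in stop_words]


kept = remove_stop_words(["The", "history", "of", "search"])
```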

Structural Analysis
Allows documents to be classified based on structure and linguistic analysis (for example, the home page of an Internet service provider), in addition to the detection and extraction of more complex elements (such as the opening hours of the ISP's customer service operations).

Substring Search
Searching for parts of a string, as with a wildcard search ("*term*"). A word or token (for Asian language documents) is split up into smaller entities, called substrings, consisting of a defined number of characters. Substring search is often used for Asian languages, which do not have a word structure similar to that of Western languages.
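
The splitting of tokens into fixed-size substrings can be sketched as below. The substring size of two characters, the in-memory index, and all function names are assumptions made for illustration.

```python
def substrings(token: str, size: int = 2):
    """Split a token into overlapping substrings of a fixed size."""
    if len(token) <= size:
        return [token]
    return [token[i:i + size] for i in range(len(token) - size + 1)]


def build_index(docs, size: int = 2):
    """Map each substring to the ids of the documents that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for token in text.split():
            for piece in substrings(token, size):
                index.setdefault(piece, set()).add(doc_id)
    return index


def substring_candidates(index, term: str, size: int = 2):
    """Return documents that contain every substring of the term.

    These are candidates only: the pieces may occur in different places,
    so a real engine would verify positions before reporting a match.
    """
    sets = [index.get(piece, set()) for piece in substrings(term, size)]
    return set.intersection(*sets)


index = build_index({1: "searching", 2: "reaching"})
hits = substring_candidates(index, "arch")
```

The same splitting works for text without spaces between words: substrings("東京都") yields the pieces "東京" and "京都".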

Supervised Clustering
Provides a grouped view, based on predefined categories, and maps results to pre-determined categories (that is, category information provided for the documents prior to indexing).

Synonym Expansion
The expansion of a query or document with a defined list of synonyms for the words it originally contains.
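
Query-side expansion can be sketched as a dictionary lookup per term. The synonym list and the function name are hypothetical.

```python
# Hypothetical synonym dictionary.
SYNONYMS = {
    "car": ["automobile", "vehicle"],
    "film": ["movie"],
}


def expand_query(terms, synonyms=SYNONYMS):
    """Return the original terms followed by their listed synonyms."""
    expanded = []
    for term in terms:
        for candidate in [term] + synonyms.get(term.lower(), []):
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded


expanded_terms = expand_query(["cheap", "car"])
```

A search engine would usually combine the expanded terms with OR so that a document matching any of the synonyms is returned.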

Syntactic Analysis
Used to analyze a query through entity and phrase extraction and anti-phrasing, and to remove word-sense ambiguity (the color orange versus the fruit, for example).

Syntactical Patterns
Used for detecting information entities, such as people, places, product codes, and prices.

 

-T-

TF-IDF
Term frequency (TF) and inverse document frequency (IDF) are used together as a measure of the statistical strength of a given word relative to a query. TF is the measure of how often a word appears in a document. IDF is the measure of the rarity of a word within the search index.
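
One common way to combine the two measures is to multiply them. The sketch below uses a relative term frequency and a logarithmic inverse document frequency; this is one of several weighting variants, and the function name and sample corpus are assumptions for illustration.

```python
import math


def tf_idf(term, document, corpus):
    """Score a term for one document (a list of tokens) within a corpus
    (a list of such documents)."""
    # TF: share of the document's tokens that are the term.
    tf = document.count(term) / len(document) if document else 0.0
    # IDF: the fewer documents contain the term, the higher the value.
    containing = sum(1 for doc in corpus if term in doc)
    idf = math.log(len(corpus) / containing) if containing else 0.0
    return tf * idf


corpus = [
    ["search", "engine"],
    ["search", "index"],
    ["rare", "term"],
]
common_score = tf_idf("search", corpus[0], corpus)
rare_score = tf_idf("rare", corpus[2], corpus)
```

Both terms make up half of their documents, but "rare" occurs in only one of the three documents and therefore scores higher than "search", which occurs in two.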

Taxonomy
A defined hierarchy of categories: a tree-like structure of customer-specific or market-specific terminology that defines how categories relate to one another. Taxonomy provides a conceptual framework for discussion, analysis, or information retrieval. For example, a car manufacturer may have a taxonomy based on the type of car (convertible, SUV, or wagon). Taxonomies help partition the search environment and experience, based on a predefined knowledge of categories. This helps limit the number of "noisy" results returned to the user.

Thesaurus
Stores synonyms and related words. This allows a search engine to map city planning to land use, for example, and to show the relevant pages even if the vocabulary of the text did not match.

Tokenization
Involves detection of white-space characters and other symbols that separate words from each other and that are not relevant to the matching process. It is part of the linguistic analysis, in which text is split into word entities. More complex tokenization is used for CJK languages, in which semantic analysis is required to identify word boundaries. FAST ESP provides a highly configurable tokenization, which enables you to configure whether or not special characters will be discarded, treated as white space, or indexed as normal characters.
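
The three ways of handling special characters can be sketched as follows. This is a generic illustration, not the configuration interface of FAST ESP; the function name, the mode names, and the default character set are all assumptions.

```python
def tokenize(text: str, special_chars: str = "-_/", mode: str = "whitespace"):
    """Split text into tokens on white space.

    mode controls the special characters:
      "discard"    - drop them, joining the surrounding text
      "whitespace" - treat them as token separators
      "keep"       - index them as normal characters
    """
    if mode not in ("discard", "whitespace", "keep"):
        raise ValueError("unknown mode: " + mode)
    pieces = []
    for char in text:
        if char in special_chars and mode == "discard":
            continue
        if char in special_chars and mode == "whitespace":
            pieces.append(" ")
        else:
            pieces.append(char)
    return "".join(pieces).split()


as_separators = tokenize("state-of-the-art search")
discarded = tokenize("state-of-the-art search", mode="discard")
kept = tokenize("state-of-the-art search", mode="keep")
```

The choice matters for matching: with separators a query for "art" finds the text, whereas with the characters kept only the full hyphenated token matches.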

 

-U-

Unsupervised Clustering
Provides grouping of related documents on the basis of their content without referring to a taxonomy. Unsupervised clustering creates a taxonomy on-the-fly, parceling documents into dynamic partitions.

 

-V-

Variations (Linguistic)
Linguistic variations include different spellings of the same word. These are typically language specific and related to country differences, for example:
English: spelling variations for the United Kingdom/United States/Australia, such as color/colour; Portuguese: Brazil versus Portugal; German: spelling reform. A search engine may normalize across spelling variations in a way similar to lemmatization.

Vector
A kind of document signature (word-weight pairs) representing a document's content in a way that allows comparison between documents. A vector is the numerical representation of the unstructured textual content of a document. Vectors can be used to enable clustering and refinement operations.
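
Comparison between two word-weight vectors is often done with cosine similarity, sketched below. The glossary does not name a specific comparison measure, so the use of cosine similarity is an assumption, as are the function name and sample weights.

```python
import math


def cosine_similarity(a: dict, b: dict) -> float:
    """Compare two document vectors given as {word: weight} mappings.

    Returns a value near 1.0 for vectors pointing the same way and
    0.0 for vectors with no words in common (given positive weights).
    """
    dot = sum(weight * b.get(word, 0.0) for word, weight in a.items())
    norm_a = math.sqrt(sum(weight * weight for weight in a.values()))
    norm_b = math.sqrt(sum(weight * weight for weight in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


doc_a = {"search": 0.8, "engine": 0.6}
doc_b = {"search": 0.6, "index": 0.8}
doc_c = {"recipe": 1.0}
related = cosine_similarity(doc_a, doc_b)
unrelated = cosine_similarity(doc_a, doc_c)
```

A find-similar operation could rank candidate documents by this score against the vector of the selected document.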

 

-W-

Wildcard
A wildcard character can be used to substitute for any other character or characters in a string. Common wildcards include "*" (zero or more characters) and "?" (a single character).
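
The two wildcards can be tried out with the fnmatch module in the Python standard library, which uses the same "*" and "?" conventions (and additionally supports bracketed character sets). The helper name and the term list are hypothetical.

```python
from fnmatch import fnmatchcase


def wildcard_match(terms, pattern):
    """Return the terms that match a wildcard pattern, case-sensitively."""
    return [term for term in terms if fnmatchcase(term, pattern)]


terms = ["search", "seat", "sea", "reach"]
star_matches = wildcard_match(terms, "sea*")      # "*" matches zero or more characters
question_matches = wildcard_match(terms, "sea?")  # "?" matches exactly one character
```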

 

-X-

XML Searching
The ability to search into XML structures, including hierarchical content structures. Industry standards for XML searching include XPath and XQuery.
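
As a small illustration of path-based XML searching, the sketch below uses the ElementTree module in the Python standard library, which supports a limited subset of XPath (and no XQuery). The sample document and its element names are invented for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document.
XML = """<catalog>
  <book lang="en"><title>Search Basics</title><year>2008</year></book>
  <book lang="no"><title>Søk</title><year>2007</year></book>
</catalog>"""

root = ET.fromstring(XML)

# All titles, wherever they appear in the hierarchy.
all_titles = [title.text for title in root.findall(".//title")]

# Titles of books whose lang attribute is "en".
english_titles = [title.text for title in root.findall("./book[@lang='en']/title")]
```

The second expression restricts the match by position in the hierarchy and by attribute value, which is the kind of structural constraint a flat full-text query cannot express.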

 

-Z-

Zero Results
A futile query; a query that returns 0 (no) results.
