Monday, April 27, 2009

FAST search engine for SharePoint PART 2 (Overview of FAST)

The right way to introduce FAST is to give an overview of how indexing and search are really done and which processes take place.

 

[image: simplified overview of the FAST indexing and search processes]

I’m sure it’s not the best graphic you have ever seen, but I’m not a designer for a reason.

This picture provides a VERY simplified overview of processes, but it is a start.

FAST terminology: a document is not a Microsoft Word document per se (even though it can be); a document is a data entity that is subject to indexing. A document can be a file, a DB record, a web page, etc. XML docs are a bit of a different story, but I’ll talk about them later.

  1. Content: Content can be in many formats, structured and unstructured. Content sources are defined at the “collection” level. A collection is a logical grouping of searchable content, and creating one is the very first step you take.
  2. Connector: A connector is assigned to a “collection” and is used to feed documents in batches to the document processing pipeline.
  3. Document processing pipelines refine your documents and apply entity extractors, matching processes, and business logic. Only one pipeline can be associated with a “collection”. A document goes through many stages within the pipeline, but they can all be split into three groups: pre-processing, document manipulation, and post-processing.
  4. If the document is not discarded within the pipeline, it gets indexed with all the attributes that were assigned to it in the previous step. The original content is also stored in FXML format (FastXML).
  5. The query processing pipeline comes into play when an end user submits a query. It provides additional analysis of the content retrieved from the index and further refines the result set.
  6. Finally, the end-user interface renders the results and provides further ways of drilling into them.
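
To make steps 2 to 4 a bit more concrete, here is a minimal sketch in Python. It is not FAST code: the names Document, run_pipeline and feed are hypothetical, and a plain dict stands in for the index. It only shows the idea of a connector feeding a batch through pipeline stages, where any stage may discard a document before it reaches the index.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Document:
    """Any indexable entity: a file, a DB record, a web page, etc."""
    doc_id: str
    content: str
    attributes: dict = field(default_factory=dict)


# A stage returns the (possibly modified) document, or None to discard it.
Stage = Callable[[Document], Optional[Document]]


def drop_empty(doc: Document) -> Optional[Document]:
    # Pre-processing: discard documents that have no usable content.
    return doc if doc.content.strip() else None


def tag_length(doc: Document) -> Optional[Document]:
    # Document manipulation: attach an attribute that is indexed with the doc.
    doc.attributes["length"] = len(doc.content)
    return doc


def run_pipeline(doc: Document, stages: list) -> Optional[Document]:
    # Run the stages in order and stop as soon as one discards the document.
    for stage in stages:
        doc = stage(doc)
        if doc is None:
            return None
    return doc


def feed(batch: list, stages: list, index: dict) -> None:
    # Connector role (hypothetical): push a batch through the pipeline and
    # index only the documents that survive it.
    for doc in batch:
        processed = run_pipeline(doc, stages)
        if processed is not None:
            index[processed.doc_id] = processed


index = {}
batch = [Document("1", "SharePoint search overview"), Document("2", "   ")]
feed(batch, [drop_empty, tag_length], index)
```

After the run, only document "1" is in the index, carrying the attribute that the pipeline assigned to it.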

In the next post I’ll cover content sources and connectors in more detail.

Enjoy :-)

Friday, April 24, 2009

Why FAST?

So what is all the fuss about? Why would we even want to consider FAST as a search engine? What does FAST do that other search engines fail to do, and why?

As we all remember, in the early 2000s it was all about data and collecting data: “information is power”. Almost all companies jumped on the “data collection” wagon. A few years later they were forced to collect data in light of all the laws and regulations imposed by the government. At around the same time, and even earlier, the wave of data analysis and data mining became incredibly strong. Companies realized that the data they had was an unknown mass, potentially incredibly valuable, but its value was obscured by the quantity and often poor quality of the data.

The SharePoint “age” started educating companies about the true meaning of the “garbage in, garbage out” concept. I personally often tell clients that the SharePoint deployment stage should be treated as a “Spring Cleaning” and reorganization of content.

But is the “Spring Cleaning” approach always valid and feasible? No.

Can large enterprises afford such an undertaking? Most likely not. Hiring an army of temps to tag documents will not cut it, since they lack a subject matter expert’s knowledge and business insight.

Properly configured FAST can impose multiple layers of information taxonomy structure on already existing content, thus providing unparalleled insight and access to information.

When the FAST “bug” bit me, I tried to play devil’s advocate with myself and started researching FAST competitors in the enterprise search market, such as Autonomy, Endeca, and the Google appliance. Unfortunately, on my competitor research journey I could not get past the sales reps, who had trouble answering even basic questions, to more “techy” people (and so far I’m drinking the sales reps’ Kool-Aid). I had to resort to comparing the type and level of influence or administrative intervention you have over the search engine in terms of content relevance, analysis, ranking… you name it. FAST prevailed on many levels over the other guys.

Let’s take the analogy of apple trees in an orchard (ha-ha-ha, my Russian really comes through here). Companies say: I want Google-quality search within the enterprise.

Google is famous for Internet search, but this is the same as going to a free-range farm and shaking an apple tree, then using the fruit that falls to the ground. You have no control over which apples fall first, and their pattern of falling and arrangement on the ground gives you no clue about the quality of the fruit.

Using FAST Search is like going to a farm that specializes in cultivating the best apples, so you get the best fruit all the time. It is presented to you in whatever way you need: by the ton, in baskets, in a 5-pound paper bag, raw, or processed into apple cider, puree, etc. (even by color).

In conclusion, I just want to add a couple of facts about FAST (in no particular order):

  • FAST can handle 40,000 terabytes (40 petabytes) of data
  • 10 billion documents
  • 2,000 queries/second

And all of this can be expanded through federated installations!

And according to Dstar – Data News company:

  • 1,000 terabytes is about one hundred times the contents of the Library of Congress, the largest library in the world, with more than 18 million books, 2.5 million recordings, 12 million photographs, 4.5 million maps, and 54 million manuscripts

According to Answers.com

  • Approximately fifteen thousand terabytes of data will be generated each year in particle physics experiments using CERN’s Large Hadron Collider, launched in May 2008
  • As of November 2006, eBay had 2,000 terabytes of data
  • In 2007, NOAA maintained approximately 1,000 terabytes of climate data. NOAA expects that their Comprehensive Large Array-data Stewardship System (CLASS) library will hold 20,000 terabytes of data by 2011 and 140,000 terabytes by 2020

Isn’t that alone impressive?

Enjoy :-)


FAST search engine for SharePoint PART 1 (Dictionaries)

In this post I’d like to introduce you to FAST ESP dictionaries and their benefits.

FAST ESP (Enterprise Search Platform) has many prebuilt dictionaries: dictionaries that support basic entity extraction, such as locations, people’s names, and company names, and dictionaries that facilitate context recognition through lemmatization, synonyms, spelling variations, stop word elimination, etc.

Lemmatization: generally speaking, lemmatization means mapping a word to its base form or to its other inflectional forms. Ex: searching for “to do” will also bring back “doing” and “done”.
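
As a toy illustration of the idea (not how FAST implements it), the sketch below uses a tiny hand-written lemma table. The LEMMAS table and both function names are made up for this example; a real lemmatization dictionary is language specific and far larger.

```python
# Toy lemma table (an assumption for illustration): inflected form -> base form.
LEMMAS = {"do": "do", "does": "do", "doing": "do", "did": "do", "done": "do"}


def lemmatize(word: str) -> str:
    # Words missing from the table are treated as their own base form.
    lowered = word.lower()
    return LEMMAS.get(lowered, lowered)


def matching_words(query_word: str, text: str) -> list:
    # Return every word in the text that shares a base form with the query word.
    target = lemmatize(query_word)
    return [w for w in text.split() if lemmatize(w) == target]


hits = matching_words("do", "She is doing what was done")
```

Here a query for "do" picks up both "doing" and "done", because all three map to the same base form.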

Dictionaries also support advanced phrase recognition, but this feature and lemmatization do not go together; they are mutually exclusive. Phrase recognition gives you the following: if I execute a query on “SharePoint for Squirrels”, FAST will immediately recognize that “SharePoint for Squirrels” is a PHRASE and will not bring back absolutely irrelevant results where the source documents mention “Count of squirrels in SharePoint” or “maintaining squirrels database in SharePoint”.
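
The difference between matching individual words and matching a phrase can be sketched in a few lines of Python. This is only an illustration under my own assumptions (whitespace tokenization, exact word order); the function names are hypothetical and say nothing about FAST internals.

```python
def tokens(text: str) -> list:
    # Naive tokenization: lowercase and split on whitespace.
    return text.lower().split()


def contains_all_words(query: str, doc: str) -> bool:
    # Bag-of-words match: every query word appears somewhere in the document.
    return set(tokens(query)) <= set(tokens(doc))


def contains_phrase(query: str, doc: str) -> bool:
    # Phrase match: the query words appear next to each other, in order.
    q, d = tokens(query), tokens(doc)
    return any(d[i:i + len(q)] == q for i in range(len(d) - len(q) + 1))


query = "SharePoint for Squirrels"
relevant = "A guide to SharePoint for Squirrels and other rodents"
irrelevant = "Maintaining a squirrels database for SharePoint"

word_match = contains_all_words(query, irrelevant)   # all three words are present
phrase_match = contains_phrase(query, irrelevant)    # but not as a phrase
```

The second document contains every query word, so a word-by-word match lets it through, while the phrase match rejects it and still accepts the first document.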

Dictionaries, like almost any feature of FAST, are a big topic that could fill a small book, but I’ll try to cover some of the OOTB dictionaries (index and/or query side) at a high level:

  1. Phonetic search
    1. Used for detection of sound-alike words
  2. Spelling
    1. Within the spelling dictionary each word has a weight value which may be used to influence the ranking value. The weight value is based on the frequency of the word in a specific language. Lower values mean more frequent usage.
  3. Synonyms
  4. Tokenization
    1. Detection of characters that might be irrelevant in the query, ex: “-“, white spaces, etc.
  5. Character normalization
    1. Used for characters with accents that are not commonly supplied by users
  6. Locations and Proper Name recognition
    1. Used to identify locations, addresses, and names of people and companies.
  7. Anti-phrasing and stop words
    1. Used to eliminate irrelevant or often repeated words
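
A few of the items above (tokenization, character normalization, stop words) are easy to picture with a small Python sketch. The stop word list and the function names are assumptions made for the example; FAST ships its own language-specific dictionaries for this.

```python
import unicodedata

# Illustrative stop word list (an assumption, not a FAST dictionary).
STOP_WORDS = {"the", "a", "an", "of", "in"}


def strip_accents(text: str) -> str:
    # Character normalization: decompose accented characters, then drop the
    # combining marks so that "café" becomes "cafe".
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalize_query(query: str) -> list:
    # Tokenization: lowercase, treat "-" as a separator, split on whitespace,
    # then remove stop words.
    words = strip_accents(query).lower().replace("-", " ").split()
    return [w for w in words if w not in STOP_WORDS]


terms = normalize_query("The café in Zürich-West")
```

The query is reduced to the three terms that actually carry meaning: "cafe", "zurich" and "west".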

FAST also provides automatic language detection in 70 languages, and dictionaries are language specific.

Custom FAST Dictionaries

Nowadays almost every company uses its own terminology, or industry- or department-specific jargon and abbreviations. When searching for information, people use keywords that make sense to them. Ex: if I search for “PTO”, I’ll get results with documents where “PTO” is mentioned, but not “vacation” or “personal day”, even though by entering “PTO” I simply wanted to look up my company’s vacation days policy. When people search for information, it is very important for the search engine to understand what they REALLY are looking for in the context of their keywords.
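
The “PTO” case boils down to synonym expansion. Here is a minimal sketch of that idea in Python; the SYNONYMS table and expand_query are hypothetical names for this example, not part of FAST.

```python
# Hypothetical company-specific synonym dictionary.
SYNONYMS = {"pto": ["vacation", "personal day"]}


def expand_query(query: str) -> list:
    # Keep every original keyword and add its known synonyms right after it.
    terms = []
    for word in query.lower().split():
        terms.append(word)
        terms.extend(SYNONYMS.get(word, []))
    return terms


expanded = expand_query("PTO policy")
```

With the expanded terms, a document that only talks about “vacation” can still match a query for “PTO”.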

Dictionaries can be used on the content processing side (document pipelines), before a document even gets indexed, to supply more information about the document to the index.

This will contribute to index growth, but it will make serving query results a bit faster, since less work has to be done at query time.

They can also be used on the query processing side, which might directly affect query performance.

Keep in mind that almost all updates to index-side dictionaries will require re-indexing of the content.

Just an idea on building a custom dictionary: I heard this idea at the first FAST Forward Conference 2008 in Orlando.

What if we start using the Internet to build our dictionaries? The Internet is the biggest data repository in the world, and data on the Internet is often contained in tables. Table structure can be easily recognized in the FAST document pipeline.

Example of FAST indexing a web page with a table of cars:

Make    Model     Type
BMW     X5        SUV
BMW     X3        SUV
BMW     325 XI    Sedan
Acura   MDX       SUV
Acura   RL        Sedan
Acura   TL        Sedan
Acura   RSX       SUV
Acura   TSX       SUV

By utilizing the GetAttributes() method in a document processor, FAST can return this document as a dictionary and populate your custom dictionary with the values from the above table. Just think about all the opportunities the WWW opens for you :-)

(When I say “document”, do not confuse it with an MS Word document; a document in FAST vernacular is any type of information entity: DB record, web page, file, etc.)

How will it help? Let’s say that someone goes to my FAST search engine and decides to find a car by entering “Acura sedan” into the search box. The documents in my index do not contain the “sedan” keyword, but they do have make and model names. Without a dictionary based on the above table, the search would bring back everything on “Acura”. With this dictionary, the matching stage in my document pipeline lets FAST identify the relevant documents in my index by matching the “RL” and “TL” models to “sedan” in the dictionary, and return both sedan models in the result set as opposed to a whole bunch of irrelevant content where the “Acura” keyword is mentioned.
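
The lookup described above can be sketched as follows. This is plain Python, not the FAST document processor API: the rows are typed in by hand from the cars table, and build_dictionary and models_for are hypothetical helpers that only show how a table turns into a dictionary that answers “Acura sedan”.

```python
# Rows copied from the cars table above: (make, model, type).
ROWS = [
    ("BMW", "X5", "SUV"),
    ("BMW", "X3", "SUV"),
    ("BMW", "325 XI", "Sedan"),
    ("Acura", "MDX", "SUV"),
    ("Acura", "RL", "Sedan"),
    ("Acura", "TL", "Sedan"),
    ("Acura", "RSX", "SUV"),
    ("Acura", "TSX", "SUV"),
]


def build_dictionary(rows: list) -> dict:
    # Group model names under a (make, type) key, e.g. ("acura", "sedan").
    dictionary = {}
    for make, model, car_type in rows:
        key = (make.lower(), car_type.lower())
        dictionary.setdefault(key, []).append(model)
    return dictionary


def models_for(query: str, dictionary: dict) -> list:
    # Simplifying assumption: the query is exactly "<make> <type>".
    words = query.lower().split()
    if len(words) != 2:
        return []
    return dictionary.get((words[0], words[1]), [])


car_dictionary = build_dictionary(ROWS)
sedans = models_for("Acura sedan", car_dictionary)
```

The lookup returns the “RL” and “TL” models, which can then be matched against documents that mention those model names but never the word “sedan”.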

FAST dictionaries are created, populated, and viewed using the Dictionary Manager, via the command-line script dictman. You can use this utility in interactive mode to populate a dictionary manually, and in non-interactive mode to populate a dictionary through a batch file.

Dictionaries are used to improve relevance and can drastically improve search results if properly deployed.

There are pros and cons to deploying different dictionaries, so a pre-planning process has to take place. Consideration must be given to which dictionaries to deploy, based on the content you want to serve and the business requirements, and to when to deploy dictionaries on the index side versus the query side. This is more of a “Best Practices” talk; I’ll try to cover it in other posts.

Enjoy :-)

Tuesday, April 21, 2009

FAST as SharePoint search engine

While I was in the UK at the beginning of this month, I had a very interesting conversation with one of the SharePoint MVPs (no names, but he knows who he is :-). The conversation mainly revolved around FAST and the fact that most SharePointers don’t have a clue what the FAST search engine is and what it does. The common knowledge is that FAST is going to be tightly integrated with SharePoint and will make search better….. That’s it!

As a result of that conversation, I’ve decided to start a series of blog posts about FAST features and the value they bring to enterprises. Stay tuned: in the next post I’ll start unveiling some of my favorite features. Posts will be titled “FAST search engine for SharePoint PART 1,2,3….etc”

Enjoy

Monday, April 13, 2009

UK Best Practices Conference

The European Best Practices Conference just passed, and it was one of the biggest SharePoint conferences that the UK has seen. First of all, I’d like to thank Combined Knowledge for inviting me to speak at 2 sessions and 1 “ask the experts” panel. The conference gave lots of opportunities to learn and to network with other people in the industry. The organizers did show what it means to take a “Best Practices Conference” to the next level. Steve Smith and Zoe Watson from Combined Knowledge took a lot of care in personally making sure that all speakers and attendees felt very welcome in London and at the conference itself. The only sad part was that the great weather stayed only for the duration of the conference; since I took the rest of the week to enjoy my little vacation in London, I did need an umbrella afterwards :-)

Once the conference was over, Eric Shupps dedicated all his time to showing us (Bob Fox, Stacy Draper, and George) the beauty of London and all the great places to visit. Eric, thank you!

To wrap up this post I urge you to become a regular attendee of this great event, as it provides deep insight into the technology and the community.  It was great seeing old friends and meeting new ones along with all the attendees that came to the event and my sessions as well. 

As they say…. Cheers!!