Wednesday, March 27, 2013

Stuff I've seen: A system for personal information retrieval and re-use

This paper is about the Stuff I've Seen (SIS) system designed at Microsoft Research.
According to studies, about 60-80% of the web pages people access while searching are revisits to pages they have already seen. SIS is built around this observation. It creates a unified index of all the information a person has seen on his/her computer(s), in whatever form it arrives: local disk files, emails, web pages, calendar entries, etc. It also maintains contextual information about each item, in the form of thumbnails, metadata about the document, and previews; this is something web search results lack. SIS thus provides a unified index across multiple applications that differ in data organization and indexing technique. Because it builds a local index over the data, it is quicker than web search systems, and it also offers query refinement. The rich contextual cues and previews gathered from the user's activities are useful.
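To make the idea of a unified index with contextual cues concrete, here is a minimal sketch. The SeenItem class and its field names are my own illustration, not the paper's data model; the point is simply that files, emails, web pages, and calendar entries all collapse into one common record that also carries the preview/thumbnail context the paper emphasizes.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

# Hypothetical record for one "seen" item; field names are illustrative,
# not taken from the SIS paper.
@dataclass
class SeenItem:
    source: str                           # "file", "email", "web", "calendar", ...
    location: str                         # path, message id, or URL
    title: str
    text: str                             # extracted character stream for indexing
    last_seen: datetime
    preview: str = ""                     # short snippet shown in results
    thumbnail_path: Optional[str] = None  # cached thumbnail, if any

# Items from different applications, all sharing the same schema.
items = [
    SeenItem("web", "http://example.com/sis.html", "SIS paper",
             "personal information retrieval and re-use ...",
             datetime(2013, 3, 20), preview="personal information retrieval..."),
    SeenItem("email", "msg-1234", "Re: meeting notes",
             "notes from the retrieval reading group ...",
             datetime(2013, 3, 25), preview="notes from the reading group..."),
]
```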

The system comprises five components: a Gatherer, a Filter, a Tokenizer, an Indexer, and a Retriever. The Gatherer collects data from the various sources. The Filter emits a character stream from the decoded data. The Tokenizer performs tokenization and linguistic processing. The Indexer generates an inverted index used for searching. The Retriever is the query interface the user employs to search the populated index. An SIS client must run on every user's machine. The user interface came in two forms: Top view and Side view. The former is a list view with filters for refining attributes in each column, and is more flexible. The latter differs in the positioning of the filters: they are placed on the left and revealed serially, which was easier to understand and produced less cluttered results. The system was evaluated using questionnaires and log analysis. Users were asked about their search experience before and after using SIS. Log analysis gave insight into query patterns, user interface interactions, and the quality of the results produced, and the authors also monitored how users interacted with the interface. Very few user queries used boolean operators or phrases; the average query length was 1.59 words.
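The following is a toy sketch of how such a five-stage pipeline could fit together, under the assumption of plain-text sources and a simple term-to-document inverted index. The function names and the implicit-AND retrieval are my own simplification; the real Gatherer and Filter plug into application-specific protocols and document formats.

```python
import re
from collections import defaultdict

def gather():
    """Gatherer: yield (doc_id, raw content) pairs from the various sources."""
    yield "file:notes.txt", "Reading notes on personal information retrieval"
    yield "mail:42", "Meeting moved to Friday, see calendar"

def filter_to_text(raw):
    """Filter: decode the source format and emit a plain character stream."""
    return raw  # already plain text in this toy example

def tokenize(text):
    """Tokenizer: lowercase and split into word tokens (minimal linguistic processing)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index():
    """Indexer: build an inverted index mapping term -> set of doc ids."""
    index = defaultdict(set)
    for doc_id, raw in gather():
        for term in tokenize(filter_to_text(raw)):
            index[term].add(doc_id)
    return index

def retrieve(index, query):
    """Retriever: return docs containing every query term (implicit AND)."""
    terms = tokenize(query)
    if not terms:
        return set()
    results = index.get(terms[0], set()).copy()
    for term in terms[1:]:
        results &= index.get(term, set())
    return results

index = build_index()
print(retrieve(index, "calendar"))               # {'mail:42'}
print(retrieve(index, "information retrieval"))  # {'file:notes.txt'}
```

The low average query length and rare use of boolean operators reported in the evaluation suggest that such a simple implicit-AND retriever would handle the bulk of real queries.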

This seems to be a fairly standard search engine, with the main addition being that it gathers and parses data from several sources. Since it is customized for a particular user, it makes sense to store the index locally on the machine, which gives good response times. Web search engines these days have become much smarter and are personalized to the user. One direct disadvantage I see is that the system cannot produce results for anything the user has never visited. It will also increase the load on the user's machine, since it runs continuously, and over time the gathered data will keep accumulating, consuming a significant amount of the local machine's storage.
