Wednesday, March 27, 2013

The Web as a graph

This blog is about a paper by R. Kumar, P Raghavan, S. Rajagopalan, D. Sivakumar, A. Tomkins, E. Upfal

Topic search problem: 
Given a search topic, find the high quality output pages for the query.
Authoritative pages: focused on a topic. Hub pages contain links to relevant pages on the topic.

HITS algorithm: Every page has a pair of non-negative weights <xp, yp>.
If a page is pointed by many good hubs, its authority weight, xp is increased by no of hubs.
The hub weight, xy is updated in the same fashion.
Page with large weights represent a very dense pattern of linkage, from pages of a large hub pages weight of pages to page of large weight authority. Typically, the relative ordering of hub/authority scores become stable after 5 iterations.

Topic enumeration problem: 
Given a snapshot of web, output all communities in it.
Bipartite core Ci,j is a graph of (i+j) nodes that contains atleast one Ki,j as a subgraph.
Trawling: We can find a large fraction of cyber communities by enumerating all bipartite cores of the web. Naive algorithms are not effective for doing this.

Elimination generation paradigm: Scan the entire web graph, modify it and extract some metadata. The residual graph is written to disk and it along with the meta forms the new state.  During each pass, elimination operation (prune off edges from nodes lower in-degree) and generation operations (indentify qualifying nodes, output core or indicate if it cannot exist) are interleaved. After few runs, the benefits of elimination and generation tail off.
This algorithm runs in linear time in size of the web graph.

Supervised classification problem: 
Given a set of predefined categories, build a system that assigns a document one of the categories
Using terms in the anchor text of links pointing to documents degrade the accuracy.
Better: "Statistical model based" Pages on similar topics tend to be linked as compared to non related ones. This relation can be captured using the Markow Random Field (MRF) model.

Hyperclass algorithm: Instead of just considering page p for classification, consider pages surrounding it IFF p -> q and q -> p. If hyperlinks are used along with the text, the categorization is improved.
Relaxation labeling: Category labels of linked pages and of those to be categorized are adjusted until most probable class configuration is found.

Web graph models can be analytical tool, explanatory tool and predictive tool.
3 properties of web graph model: The rich get richer, correlated out links, correlated in-links.
The in-degrees and out-degrees of nodes follow a inverse polynomial distribution.

No comments:

Post a Comment