I read this paper on 10/19/2012.
Original paper: http://www.icdt.tu-dortmund.de/proceedings/edbticdt2011proc/WebProceedings/papers/edbt/a39-stoyanovich.pdf [PDF]
[Updated 10/19/2012. I just found a nice slideshare from Julia Stoyanovich.]
Bottom-up Algorithm for Rank Aware (Interval Based) clustering.
This paper argues that in some scenarios providing search results in a ranked manner is improved by grouping similar results such that the user is more easily able to access a varied set of matching responses. The proposal is to cluster results before presenting them to the user. Clustering is performed locally, that is, after applying the filter and by taking into account the user’s selected ranking criterion.
The example used for the paper is based on Yahoo! personal searches. For a particular filter (age range, sex, etc), a user may select a ranking variable (e.g. income, highest to lowest). Without the proposed cluster, the user may see a long list of very similar results (perhaps software engineers) before seeing a different type of result (a different profession), and if uninterested in the first category, would have to wade inconveniently through a long list.
The paper describes ways to measure qualities of clusters, including formal definitions, which are then used as a basis for an algorithm “BARAC” for “Bottom-up Algorithm for Rank Aware (Interval Based) clustering”. It also discusses complexity and computational costs, and results from user tests.
- Interval-based clustering refers to grouping an attribute by a range of values (age between 20 and 25).
- CLIQUE — a rank unaware clustering framework, used both as a point of comparison and a starting template for BARAC. [Agrawal 1998]
- Clustering measures are based on locality, quality, tightness, maximality.
The paper is clear and easily consumed (I read it on BART), and is sufficiently descriptive to be actionable (even includes pseudo code for BARAC and most sub routines). Discusses some practical issues in implementation. Some general valuable lessons can be extracted.