I read this paper on 10/19/2012.
Original paper: http://www.icdt.tu-dortmund.de/proceedings/edbticdt2011proc/WebProceedings/papers/edbt/a39-stoyanovich.pdf [PDF]
[Updated 10/19/2012. I just found a nice slideshare from Julia Stoyanovich.]
Bottom-up Algorithm for Rank Aware (Interval Based) clustering.
This paper argues that in some scenarios a ranked list of search results is improved by grouping similar results, so that the user can more easily reach a varied set of matching responses. The proposal is to cluster results before presenting them to the user. Clustering is performed locally, that is, after applying the filter and taking into account the user’s selected ranking criterion.
The example used for the paper is based on Yahoo! personal searches. For a particular filter (age range, sex, etc.), a user may select a ranking variable (e.g. income, highest to lowest). Without the proposed clustering, the user may see a long list of very similar results (perhaps software engineers) before seeing a different type of result (a different profession) and, if uninterested in the first category, would have to wade through that long list to get there.
The paper describes ways to measure qualities of clusters, including formal definitions, which are then used as a basis for an algorithm “BARAC” for “Bottom-up Algorithm for Rank Aware (Interval Based) clustering”. It also discusses complexity and computational costs, and results from user tests.
- Interval-based clustering refers to grouping an attribute by a range of values (age between 20 and 25).
- CLIQUE — a rank unaware clustering framework, used both as a point of comparison and a starting template for BARAC. [Agrawal 1998]
- Clustering measures are based on locality, quality, tightness, maximality.
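To make the interval idea concrete for myself, here is a minimal sketch of "filter, rank, then group by an attribute interval". It is not BARAC and does not implement the paper's quality measures; it only illustrates clustering locally on the top-ranked matches. The function names, the fixed interval width, the top_n cutoff, and the sample profiles are all my own assumptions.

```python
from collections import defaultdict


def interval_label(value, width):
    """Return the half-open interval [lo, lo + width) that contains value."""
    lo = (value // width) * width
    return (lo, lo + width)


def group_top_ranked(records, keep, rank_key, cluster_key, width, top_n):
    """Filter the records, rank them, then group the top_n by an attribute interval.

    Hypothetical helper for illustration only: fixed-width intervals stand in
    for the interval clusters that a real algorithm would have to discover.
    """
    matches = [r for r in records if keep(r)]
    ranked = sorted(matches, key=lambda r: r[rank_key], reverse=True)[:top_n]
    groups = defaultdict(list)
    for r in ranked:
        groups[interval_label(r[cluster_key], width)].append(r)
    # Dicts keep insertion order, so groups appear in order of their best-ranked member.
    return dict(groups)


# Made-up sample data (income in thousands).
profiles = [
    {"name": "A", "age": 22, "income": 120},
    {"name": "B", "age": 24, "income": 110},
    {"name": "C", "age": 31, "income": 100},
    {"name": "D", "age": 23, "income": 95},
    {"name": "E", "age": 45, "income": 90},
    {"name": "F", "age": 33, "income": 50},
]

result = group_top_ranked(
    profiles,
    keep=lambda p: 20 <= p["age"] < 40,  # the user's filter
    rank_key="income",                   # the user's ranking criterion
    cluster_key="age",
    width=5,
    top_n=4,
)

for interval, members in result.items():
    print(interval, [p["name"] for p in members])
```

With this toy data the user sees two groups instead of one flat list, and can skip the first group as a whole if it is not interesting.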
The paper is clear and easily consumed (I read it on BART), and is sufficiently descriptive to be actionable (it even includes pseudocode for BARAC and most subroutines). It discusses some practical implementation issues, and some valuable general lessons can be extracted from it.
Some weeks ago I finished Consilience, by E. O. Wilson.
The book is grandly ambitious, informative, critical and didactic. And while few people would agree with everything that Wilson says, fewer still could claim that they are not better off for having read the book.
First we are given sweeping overviews of the major disciplines of knowledge; a report of the condition, progress, and challenges of each discipline. This is then placed into a broad context, like approximate positioning of pieces at the start of a jigsaw puzzle. Finally, Wilson examines the gaps between these islands of human knowledge and argues that the greatest potential now lies there, in these gaps.
One of my interesting new reads is “Tasting Beer: An Insider’s Guide to the World’s Greatest Drink” by Randy Mosher. [Amazon].
The chapters so far cover a brief history, the use of senses, qualities of beer, and the brewing process. Learning about beer is more than good party conversation. Already I’m finding that picking out a brew at Bevmo has become more interesting and thoughtful. Beer labels and descriptions make more sense.
I find it rewarding to learn about beer, in the same way I like to read about the science of cooking. And now I’m finally starting to get some of those elusive facts straight. For example, Ales (starting with ‘A’, top of the alphabet) use top-fermenting yeasts and ferment at higher temperatures than Lagers, and those higher temperatures encourage faster chemical changes, yield higher alcohols and result in generally more complex flavors.
Remaining chapters will cover some specific important styles, important regions, and food pairings.
Blogging is hard, and yields no reward. I think I can get over this initial hump — why I’d like to do so is a whole separate issue — by writing frequently about something that can be clearly identified: what I read.
I read a lot, voraciously, often to the point where I don’t want to admit how much time I spend (waste?) on reading. Most of it is crap. I have over 80 feed subscriptions, and I browse websites and news, and fiction books and magazines. But those will not be the subject of this blog; for that there is Google Reader and social network sharing. [See my sidebar Reader widget].
I will write a little about the real stuff that I read. Readings related to a specific interest, goal, or pafnuty. I hope some of you will share my interests and offer comments or even be encouraged to start a discussion on these topics that inspire me.
Today I read (more) about Pinax. Pinax is a collection of reusable applications built on top of Django. (Django is an open source web framework written in Python, and Python is a high-level programming language that has been the subject of my interest and exploration in recent months.) I’m interested in Pinax because it lets one realize relatively complex web projects quickly, by reducing the time spent on all those necessary but universal components so one can focus instead on the domain-specific needs.
I listened to a couple of presentations by the founder of the project, James Tauber, and he echoed many of my own thoughts on reusable code and the opportunity it provides to accelerate the development of projects.
I’ve actually been reading about Pinax for a while, and I put up a little development test site last night. Today I was trying to understand whether Pinax would be a viable tool for a specific project I’m working on.
Foreshadowing: Anya managed to use her NYU academic resources to find some articles I was looking for on generative grammar!