Non-convex sets with k-means and hierarchical clustering

Bad-mouthing old friends

I got into a conversation recently about k-means clustering — you know, as you do — and let me tell you, poor k-means was really getting bashed. “K-means sucks at this”, “K-means can’t do that”. It was really rather vicious, and I felt I had to step up to defend our old friend k-means. So I started writing up something that shows that those oft-highlighted weaknesses of k-means aren’t nearly as bad as people think, and in most cases don’t outweigh the awesomeness that k-means brings to the party.

It started to get quite lengthy, so I’m breaking it up into pieces and maybe I’ll put it all together into one thing later. This post is the first of those pieces.

Convex sets

“K-means can’t handle non-convex sets”.

A non-convex set

Convex sets: In Euclidean space, an object is convex if for every pair of points within the object, every point on the straight line segment that joins them is also within the object. [Source: Wikipedia.]
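
In symbols: a set $S$ is convex if for every pair of points $x, y \in S$ and every $t \in [0, 1]$,

$$t x + (1 - t) y \in S.$$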

The k-means algorithm, in its basic form, is like making little circular paper cut-outs and using them to cover the data points. We can change the quantity and size and position of our paper cut-outs, but they are still round and, thus, these non-convex shapes evade us.

That is, what we are doing when we use k-means is constructing a mixture of k Gaussians. This works well if the data can be described by spatially separated hyper-spheres.

Here’s a clustering example, borrowed directly from the sklearn documentation on clustering. These are two slightly entangled banana shapes: two non-convex shapes, and they are not spatially separated.

The two banana shapes

When we try to use k-means on this example, it doesn’t do very well. There’s just no way to form these two clusters with two little circular paper cut-outs. Or three.

k-means performs poorly on the banana shapes
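
If you want to poke at this yourself, here is a minimal sketch of that failure using scikit-learn's make_moons; the noise level and random seeds are my own choices, not taken from the sklearn example.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons

# Two entangled banana shapes, similar to the sklearn clustering example.
X, true_labels = make_moons(n_samples=616, noise=0.05, random_state=0)

# Plain k-means with two (or three) clusters carves the plane into convex
# regions, slicing each banana apart instead of separating the two shapes.
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
```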

K-means pairs well

But by combining k-means with another algorithm, hierarchical clustering, we can solve this problem. Pairing k-means with other techniques turns out to be a very effective way to draw on its benefits while overcoming its deficiencies. It’s something of a theme here; I’ll do it again in another post, just you watch.

First, we cluster the data into a large number of clusters using k-means. Below, I’ve plotted the cluster centroids after running k-means with k=21. [Why 21? Well, actually, it doesn’t matter very much in the end.]

Centroids of 21 k-means clusters

Then we take these many clusters from k-means and start merging them into bigger clusters using a single-link agglomerative method. That is, we repeatedly pick the two clusters that are closest together and merge them. It is important in this scenario that we use the “single-link” method, in which the distance between two clusters is defined as the distance between the closest pair of data points we can find, one from each cluster.
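
In scikit-learn terms, a minimal sketch of the whole two-stage recipe might look like this (again, the noise level and seeds are assumptions on my part, and the plotting is left out):

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=616, noise=0.05, random_state=0)

# Stage 1: over-cluster with k-means. The exact k matters little, as long as it is largish.
k = 21
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
labels = km.labels_

# Stage 2: single-link agglomeration of the k-means clusters, where the distance
# between two clusters is the distance between their closest pair of data points.
D = np.zeros((k, k))
for i in range(k):
    for j in range(i + 1, k):
        D[i, j] = D[j, i] = cdist(X[labels == i], X[labels == j]).min()

# In scikit-learn releases before 1.2, use affinity="precomputed" instead of metric=.
agg = AgglomerativeClustering(n_clusters=2, metric="precomputed", linkage="single")
merged = agg.fit_predict(D)

final_labels = merged[labels]  # each point's final cluster: one label per banana
```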

Here’s what that merging process looks like:

hierarchical clustering animation

Whoa, whoa. Did you see that one near the end? The one where we’ve taken 616 data points, formed a whole bunch of clusters with k-means [I used k=51 for the animation to get lots of colorful frames], and then agglomerated them into … this:

clustered bananas

Yup, that one. So pretty.

So many benefits

You get it already, I’m sure. We’re making lots of those little circles, covering all the data points with them. Then, we are attaching the little circles to each other, in pairs, by repeatedly picking the two that are closest.

K-means and single-link clustering: combining the two algorithms is a pretty robust technique. It is less sensitive to initialization than pure k-means, and less sensitive to the choice of parameters. When we have many points, we use an algorithm that is fast and parallelizable; once the heavy lifting is done, we can afford to use the more expensive hierarchical method and reap its benefits, too.

There are many additional problems with k-means: sensitivity to initialization, the need to pick k, poor performance in high dimensions. Today we looked at those damn non-convex sets. I’ll dive into some of the others in future posts.

By the way, note that in today’s banana-shapes solution we didn’t have to specify the expected final number of clusters ahead of time. We specified an arbitrary large number for k, but we finished up with hierarchical clustering, and there we could use one of many well-studied techniques to decide when to stop merging. For example, we could automate a stopping rule using concepts of separation and cohesion — see this post for a hint.
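
As one illustration (my own choice of measure, not necessarily what that post describes), we could pick the number of merged clusters that maximizes the silhouette score, a simple combination of cohesion and separation. Continuing the sketch above, reusing X, D, and labels:

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

# Try a range of final cluster counts and keep the one with the best silhouette.
best_n, best_score = 2, -1.0
for n in range(2, 10):
    agg = AgglomerativeClustering(n_clusters=n, metric="precomputed", linkage="single")
    candidate = agg.fit_predict(D)[labels]
    score = silhouette_score(X, candidate)
    if score > best_score:
        best_n, best_score = n, score
```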


Reading workflow and backposting to reading-log

One of my categories on this blog is “reading-log”, which I intended as a way to highlight books, articles, and papers that I’ve read recently. I’ve been very negligent about this, but fortunately this is one of those situations where it’s not too late to catch up.

I keep notes (on Evernote) with the date that I read the material and the thoughts it inspired, so I can still go back and post them retroactively. I can even artificially date the WordPress posts. I’ll be trying to do some of that over the next few days. If all goes well, subscribers will see a flurry of activity (which hopefully doesn’t chase any of them away).

I’ve been reading a lot these days. My reading workflow is always evolving, but I’ve got a system that seems to be working pretty well, and as a result I find it easier to read more and be efficient.


I use Feedly, to which I switched after the days of Google Reader. I currently have 120+ sources (web feeds) in six or seven categories. I am picky with my subscriptions, and feeds that feel like clutter are weeded out (I have a separate category for feeds “on probation”, and I’ll skip those articles on busy days). After years of this, I find a lot of value and entertainment in my feeds.

I skim these web feeds on my phone using Feedly’s Android app. This is fast consumption, and it’s easy to do when taking a break or during in-between moments. Anything requiring deeper attention or more time, I save for later using Pocket.

In addition to web feeds via Feedly, my Pocket queue is populated by tweets, web browsing, active research, and things-people-send-to-me. The ability to easily save anything for later means I have fewer interruptions and distractions. There is a separate time and place for consuming all that material. This makes me more efficient.

When researching a particular subject, for personal interest or for a client, I read papers and “heavier” articles. I have a Dropbox folder where I keep this research material, and it stays there even after I’ve read it, for future reference. I’ll often transfer unread articles from this folder to my Kindle; I always keep the ol’ ebook reader filled with a collection of unread novels, non-fiction books, and dozens of research papers. This is particularly wonderful when traveling, as I am now.

We all have so many sources of reading material, and there are a lot of tools to help us manage everything. I’ve shared only the most significant of the tools I use (and hinted at the taxonomies I’ve invented to organize things), with which I’m able to read, watch, and listen to a lot more material without feeling overwhelmed or constantly interrupted.

Keep an eye on this reading-log WordPress category — I’ll be doing that back-posting and perhaps you’ll find we have common reading interests.

 

Mean absolute percentage error (MAPE) in Scikit-learn

On CrossValidated, the StackExchange for statistics, someone asks:

How can we calculate the Mean absolute percentage error (MAPE) of our predictions using Python and scikit-learn?

Mean Absolute Percentage Error (MAPE) is a metric used to assess the success of a regression analysis. Read my answer on CV here:

http://stats.stackexchange.com/questions/58391/mean-absolute-percentage-error-mape-in-scikit-learn/62511#62511
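
The answer has the discussion; for quick reference, a minimal NumPy sketch of the formula looks like this (note that it assumes none of the true values are zero):

```python
import numpy as np

def mean_absolute_percentage_error(y_true, y_pred):
    """MAPE, in percent. Undefined when any true value is zero."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

print(mean_absolute_percentage_error([3.0, 2.5, 4.0, 7.0], [2.5, 3.0, 4.0, 8.0]))
```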


Thomas Davenport on Creativity in Quantitative Analysis

The intersection of creativity and quantitative analysis is fascinating. There is nothing surprising about the idea that good analysis requires — or at least can often require — a fair bit of creativity. After all, creativity is well recognized as an important trait in the sciences, in mathematics, and in many other disciplines.

We love to hear how great ideas and great discoveries were triggered by happenstance (like watching an apple fall) or by radically different ways of thinking about familiar things (such as the idea that time and space are relative to an observer). Through these stories we acknowledge the importance of creative insight, even (especially?) in fields normally associated with methodical and systematic processes. Data analysis * fits neatly into this line of thinking. It is quantitative, scientific, mathematical. And it requires, at its best, innovative thought and a creative approach.

* Data analysis, data science, quantitative analysis — these are loosely defined and often annoying terms. In the context of this blog post, these terms refer to the use of data to form hypotheses and build models (or products, or tools, or reports) that add some business value or inform decisions. 

But what is the nature of that creativity — is it like artistic creativity, or something quite different? Why is it required, and to what extent can that requirement be replaced or made obsolete? How can this creativity be recognized in an individual, or encouraged in a particular team or environment? These questions are intriguing philosophically, and they also have great relevance today.

I was interested to listen to Thomas Davenport share a few of his thoughts on this subject in a recent interview hosted by analytics software reviewer SoftwareAdvice. (There is more about the interview, with a link, below.) Davenport describes data analysis problem-solving projects as being composed of three stages: “Framing the problem”, “Solving the problem” and “Communicating the Results”, and suggests that the stages requiring the most creativity are the first and the last.

Actually, Davenport seemed to go out of his way to emphasize the first and third stages throughout the interview, claiming that they are often overlooked in favor of the obvious middle stage. Perhaps the idea that more creativity is required in those two stages was just part of his attempt to draw attention to them.

I agree with Davenport that creativity helps at every stage of a project. I think there is arguably a distinct type of creativity unique to each stage (sounds like a fun train of thought for a follow-up post). I also agree with Davenport that the creativity required in framing a problem and forming a hypothesis is often overlooked and underestimated. I’ll go even further and say that that creativity, the creativity of the first stage, is destined to be one of the most important and most desired skills in this discipline.

It’s still very difficult to apply any sophisticated algorithms to large amounts of data — the majority of companies are happy if they can simply count things in their data. Davenport touches on this subject in the interview. He calls it the “big data equal small math problem” and notes that “it won’t be that way forever.”

It won’t be that way forever because we’re slowly but surely getting better at searching and organizing and querying big data. The trouble is that we don’t know how to take advantage of those capabilities. What can we do with all this data?

Creative and sophisticated uses will be found for commonly encountered data and packaged for easy deployment or sold as a service. We see this happening in web analytics and, increasingly, in other common scenarios. But most companies also collect data that is domain specific or specialized or unique to them. Data analysts will need to understand their company’s business, their challenges, their data, and find ways to put that data to good use.

We don’t just need creative solutions to common problems. We need creative analysts for uncommon problems.

As a community, we have much to do. Exploratory data analysis, in the Tufte sense, is still under-developed, and there are few publicly available resources to help develop those skills. We can learn from other disciplines about creativity. We can bring our data-driven problem solving approach to understanding and improving the creative process.

I’m personally very interested in building these resources and helping the community as a whole develop creative skills in data analysis. I’m also interested, as I’ve told many of my data-nerd friends, in building a “data innovation consulting firm”. An IDEO for data.


The interview I discuss above can be found here: http://plotting-success.softwareadvice.com/hangout-future-of-working-with-data-0613/

Thomas Davenport: Keeping Up with the Quants

Thomas Davenport is a (quite prolific) author, co-founder and research director of the International Institute for Analytics, and a visiting professor at Harvard Business School. The interview was my first exposure to his ideas. He speaks about his new book “Keeping Up with the Quants: Your Guide to Understanding and Using Analytics” and addresses several subjects, including creativity, the need for humans in the analytical process, the type of people who make good analysts, and advice for new graduates. His assertions on creativity were one of the underlying themes of the interview (I haven’t read the book).

I found myself agreeing, in general, with much of what Davenport says. It is clear he knows his audience — his book is described as a guide to the data-driven world for business professionals — and he does well to present his ideas in broad and easily understood terms. The book is co-authored by Jinho Kim, a professor of business and statistics at the Korea National Defense University, who also seems to be focused on the business side of things (PhD from Wharton) and on educating about data analysis in a business context.

As someone who often works with non-technical business folks wrestling with data-related projects, I’ve put the book on my to-read list, and it may turn out to be a good gift for clients.

Reading Log: “Five Ball-Tree Construction Algorithms”, Omohundro

“Five Balltree Construction Algorithms.” (1989).
Stephen M. Omohundro

I browsed this paper after reading several blog posts and articles about balltree-related algorithms, including:

  1. “Damn Cool Algorithms, Part 1: BK-Trees.” Nick Johnson. http://blog.notdot.net/2007/4/Damn-Cool-Algorithms-Part-1-BK-Trees
  2. “VP trees: A data structure for finding stuff fast.” Stephen Hanov. http://stevehanov.ca/blog/index.php?id=130

These and Omohundro’s paper are worthwhile reading. Even if one is not able to apply these data structures directly, there is still benefit in reading about them. They reminded me that:

  • A concept that is intuitively straightforward can often be impractical or impossible to implement for a particular application.
  • Data structures can be designed and built specifically to optimize an operation (that is required by your algorithm).
  • That curse of dimensionality, god damnit.
  • There are many really cool and clever algorithms that you’ll never be able to apply in your domain.

Balltrees and related structures are hierarchical, tree-like representations. They place data points in the tree and provide instructions for traversing the tree in such a way as to optimize some expected future operation. The clearest application is nearest neighbor search. They also give you an excuse to sensibly use terms like “hyper-spheres” and “leaf balls”.
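
As a concrete taste, scikit-learn ships a ball tree implementation; here is a small sketch of building one on made-up data and querying nearest neighbors:

```python
import numpy as np
from sklearn.neighbors import BallTree

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))      # made-up data: 1000 points in 3 dimensions

tree = BallTree(X, leaf_size=40)    # construction happens here
dist, ind = tree.query(X[:5], k=3)  # 3 nearest neighbors of the first 5 points
```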

Construction times for these structures don’t tend to scale well. Think O(N^3). A lot of effort is put into improving and optimizing construction, but direct application of these structures to large data sets is not tractable.

Relatedly: kd balls, Burkhard-Keller (BK) trees, and VP-trees. And others.

Reading Log: “Overlapping Experiment Infrastructure at Google”, D. Tang

“Overlapping Experiment Infrastructure at Google” D. Tang
Published KDD Proceedings 2010
http://dl.acm.org/citation.cfm?id=1835810

This paper describes the thought process and concepts behind the extensive testing philosophy and infrastructure at Google.

Reading log: This is a very useful paper I read a while ago and dug up again for a client in June. The concepts I learned here seem to emerge intermittently when meeting with clients.

I think this should be required reading for anyone getting started with overlapping testing infrastructures (those that manage multiple tests at the same time). Lean Analytics!

Key take-aways include:

  • The concept of domains, subsets, and layers to partition parameters and design the infrastructure (a toy sketch of the layer idea follows this list)
  • Binary push vs. data push; separating testing parameters from program code
  • Canary experiments and defining the expected range of monitored metrics
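
As a toy illustration of the layer idea (entirely my own sketch, not code from the paper): hashing the diversion key independently per layer means a given user falls into at most one experiment per layer, while experiments in different layers overlap freely.

```python
import hashlib

def bucket(user_id: str, layer: str, n_buckets: int = 1000) -> int:
    """Hash the user id independently per layer into one of n_buckets buckets."""
    digest = hashlib.md5(f"{layer}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % n_buckets

# A user gets one bucket (and hence at most one experiment) in each layer.
layers = ["ui-layer", "ranking-layer", "ads-layer"]
assignment = {layer: bucket("user-42", layer) for layer in layers}
print(assignment)
```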

My concerns (i.e. interests or applications in mind) with re-reading this paper for my client were:

  • Applying overlapping infrastructure to A/B testing vs. Multi-Arm Bandit testing,
  • The particulars of having a shared control group
  • Using such an infrastructure to test and select machine learning algorithm hyperparameters

Detached HEAD — a git discovery

Recently I found myself with a detached HEAD. In Git.

This was the first time I had encountered such a thing. When you are working on, or check out, commits that are not attached to any branch, you have a detached HEAD situation. Your commits are branchless. There is a pretty easy fix for this, and the solution is easy to find on SO.

Check out SO: Why did git detach my head?

I retraced my steps to figure out exactly how this happened.

I created a branch (git branch newfeature; git checkout newfeature) and then cloned my repository for further work on this branch. This created an ambiguity for git: both the clone and the master branch had a branch named newfeature. When I pulled my work from master with git pull, the commits were not attached to any branch.

The symptoms

I didn’t recognize this unfamiliar situation. I did notice I couldn’t find all those commits.

  • They weren’t visible with git log or git log newfeature.
  • git status with newfeature checked out showed a clean working directory.

With help from @ddunlop, I was finally able to view the commits with git log <hash>. I got the commit hash using git log in my cloned repo.

This is how I resolved the problem:

  1. git checkout <hash>.  I checked out my most recent commit using its hash. Git informed me that I was now in a ‘detached HEAD’ state. After that it was easy. I googled the provocative “detached HEAD” message and did some learning.
  2. git checkout newfeature. Back onto the named branch.
  3. git branch newfeature_2 6e51426cdb. This creates a new branch pointing at the dangling commit, so those commits are no longer branchless.
  4. git merge newfeature_2. Merge the rescued commits into newfeature.
  5. git checkout master. Switch to master.
  6. git merge newfeature. Bring everything into master.

Then I just deleted the extra branches.

In the process, I also learned about “tracking” branches. Check out the useful SO: Switch branch without detaching head