I’ve been working on a tutorial on change detection. This is the first time I’ve attempted to write a tutorial, and it’s been a useful learning process. I’m not “done” yet, but I feel it is at the point where I can announce that it exists.
While the transition from “notebooks where Aman is fooling around” to “a well-written tutorial with a narrative” is far from complete, I’ve invested enough time without any validation of whether anyone is actually interested in reading all this. If there’s significant interest or requests, I will be happy to invest more time in cleaning up this tutorial and maybe even adding more topics.
You can check out my tutorial on change detection here:
Why a tutorial on change detection?
Change detection is a type of problem in which we want to detect an interesting change in a signal. In the tutorial, I consider the signal to be a stream of (scalar) values, and take an online approach to the problem. I leave it open to interpretation what we mean by “interesting change”.
The objective of the tutorial is to introduce and discuss some fundamental concepts in change detection, to highlight some important use cases of these algorithms, and to encourage the reader to think about the context of the problem before designing a solution.
I find this topic very interesting, and I’ve been fortunate to have had the chance to work on a few projects over the last few years in which change detection was an important component. The problem is also tied to the more general subject of anomaly detection, which is turning out to be a recurring theme in my work.
Most importantly, I think there is a huge and growing demand in this subject. It is simply impossible to use purely manual methods to keep tabs on the tremendous amounts of data that we’re generating as a society, and in many scenarios the most interesting things are those that are NOT normal: values that fall rapidly, rise sharply, fit an unusual pattern, or fail to fit a usual one. Systems that utilize change detection algorithms, often as part of a larger solution, will help us sort through all our data and enable appropriate decisions to be made accurately and on time.
Some of the topics covered in the tutorial are:
- Online vs. offline algorithms, and simulating online detection
- Designing residuals and stop conditions
- Welford’s method
- Comparing multiple windows on a single stream
- Anomaly detection in EKG signals
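Welford’s method, from the list above, is compact enough to show inline. Here is a minimal sketch of an online mean-and-variance accumulator (the class name and structure are mine, not taken from the tutorial):

```python
class Welford:
    """Online mean and variance via Welford's method.

    Updates running statistics one value at a time, avoiding the
    numerical instability of the naive sum-of-squares approach.
    """

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self):
        # Sample variance; undefined for fewer than two observations.
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0
```

Feeding it a stream one value at a time yields the same mean and sample variance as a batch computation, without storing the history — which is exactly what an online change detector needs.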
To look at EKG signals, I borrowed from Ted Dunning’s presentation at Strata. I recreated his solution in Python (his original solution, available on GitHub at https://github.com/tdunning/anomaly-detection/, uses Java and Mahout).
I actually haven’t finished writing this section yet; it’s an exciting enough topic that I feel I could sink a lot of time into it. I spent (only) several hours reading and ingesting the EKG data, and (only) several more hours re-writing Ted Dunning’s algorithm. But after that initial effort, I put in a large amount of intermittent effort trying to make this section presentable and useful for a tutorial — and therein lies the time sink. I’ll fix this section up based on feedback from readers.
Ted Dunning’s approach to anomaly detection in EKG signals is as follows. First, a dictionary of representative small segments (windows) is created. These windows are found by using k-means clustering on a “normal” EKG signal. This dictionary, constructed from the training data, is used to reconstruct the EKG signal of interest.
If a test signal is successfully reconstructed from the dictionary, the signal is much like those found in the training data, and can be considered normal. If the reconstruction has a high error rate, there’s something in the signal that may be anomalous, suggesting a potentially unhealthy EKG that should be investigated further.
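As a rough illustration of the reconstruction idea (this is not Ted Dunning’s actual code: the pure-Python sketch below skips the k-means dictionary training, overlapping windows, and normalization that a real solution would use, and the function names are mine):

```python
def window_error(w, atom):
    """Mean squared error between a signal window and a dictionary atom."""
    return sum((a - b) ** 2 for a, b in zip(w, atom)) / len(w)

def reconstruction_error(signal, dictionary, width):
    """Slide a non-overlapping window over the signal; for each window,
    find the closest dictionary atom and record the residual error.
    High per-window error flags a possible anomaly."""
    errors = []
    for start in range(0, len(signal) - width + 1, width):
        w = signal[start:start + width]
        errors.append(min(window_error(w, atom) for atom in dictionary))
    return errors
```

A signal composed of shapes present in the dictionary reconstructs with near-zero error, while a window unlike anything seen in training produces a large residual — that residual is the anomaly score.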
I have not tuned the model in the tutorial; there is room for improvement in the training phase, parameter selection, and other adjustments. That’s another potential time sink, so I’ve temporarily convinced myself that it’s actually better, for a tutorial, to leave that as an exercise for the reader.
If all this sounds interesting, please do take a look at the tutorial here:
I’ll be happy to receive any comments or feedback!
Bad-mouthing old friends
I got into a conversation recently about k-means clustering — you know, as you do — and let me tell you, poor k-means was really getting bashed. “K-means sucks at this”, “K-means can’t do that”. It was really rather vicious, and I felt I had to step up to defend our old friend k-means. So I started writing up something that shows that those oft-highlighted weaknesses of k-means aren’t nearly as bad as people think, and in most cases don’t outweigh the awesomeness that k-means brings to the party.
It started to get quite lengthy, so I’m breaking it up into pieces and maybe I’ll put it all together into one thing later. This post is the first of those pieces.
“K-means can’t handle non-convex sets”.
Convex sets: In Euclidean space, an object is convex if for every pair of points within the object, every point on the straight line segment that joins them is also within the object. [Source: Wikipedia.]
The k-means algorithm, in its basic form, is like making little circular paper cutouts and using them to cover the data points. We can change the quantity and size and position of our paper cut-outs, but they are still round and, thus, these non-convex shapes evade us.
That is, what we are doing when we use k-means is constructing a mixture of k Gaussians. This works well if the data can be described by spatially separated hyper-spheres.
Here’s a clustering example, borrowed directly from the sklearn documentation on clustering. These are two slightly entangled banana shapes. That’s two non-convex shapes, and they are not spatially separated.
When we try to use k-means on this example, it doesn’t do very well. There’s just no way to form these two clusters with two little circular paper cut-outs. Or three.
K-means pairs well
But by combining k-means with another algorithm, hierarchical clustering, we can solve this problem. Pairing k-means with other techniques turns out to be a very effective way to draw from its benefits while overcoming its deficiencies. Consider it our theme. I’ll do it again in another post, just you watch.
First, we cluster the data into a large number of clusters using k-means. Below, I’ve plotted the centroids of the clusters after k-means clustering with k = 21. [Why 21? Well, actually, it doesn’t matter very much in the end.]
Then we take these many clusters from k-means and start clustering them together into bigger clusters using a single-link agglomerative method. That is, we repeatedly pick the two clusters that are closest together and merge them. It is important in this scenario that we use the “single-link” method, in which the distance between two clusters is defined as the distance between the two closest data points we can find, one from each cluster.
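The merging loop can be sketched in plain Python, assuming the k-means centroids are already in hand (the function name and the naive implementation are mine; a real pipeline would use a library routine for the agglomeration):

```python
import math

def single_link_merge(points, target):
    """Naive single-link agglomerative clustering.

    `points` would be the k-means centroids; we repeatedly merge the two
    clusters whose closest pair of members is nearest, until only
    `target` clusters remain. Returns a list of clusters (lists of points).
    """
    clusters = [[p] for p in points]

    def link_distance(c1, c2):
        # Single-link: distance between the two closest members.
        return min(math.dist(a, b) for a in c1 for b in c2)

    while len(clusters) > target:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = link_distance(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i].extend(clusters.pop(j))
    return clusters
```

Because single-link only cares about the nearest pair of members, chains of centroids laid along a curved shape get stitched together, which is exactly what lets the combination trace out the bananas.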
Here’s what that looks like:
Woah woah. Did you see that one near the end? The one where we’ve taken 616 data points, formed a whole bunch of clusters with k-means [I used k = 51 for the animation to get lots of colorful frames], and then agglomerated them into … this:
Yup, that one. So pretty.
So many benefits
You get it already, I’m sure. We’re making lots of those little circles, covering all the data points with them. Then, we are attaching the little circles to each other, in pairs, by repeatedly picking the two that are closest.
K-means and single-link clustering. Combining the two algorithms is a pretty robust technique. It is less sensitive to initialization than pure k-means. It is also less sensitive to choice of parameters. When we have many points, we use an algorithm that is fast and parallelizable. After the heavy lifting is done, we can afford to use the more expensive hierarchical method, and reap its benefits, too.
There are many additional problems with k-means: sensitivity to initialization, the need to pick k, poor performance in high dimensions. Today we looked at those damn non-convex sets. I’ll dive into some of the others in future posts.
By the way, in the banana shapes solution today, note that we don’t have to specify ahead of time the expected final number of clusters. We specified some arbitrary large number for k, but we finished up with hierarchical clustering. We could use one of many well-studied techniques to decide when to stop clustering. For example, we could automate a stopping rule using concepts of separation and cohesion — see this post for a hint.
The intersection of creativity and quantitative analysis is very fascinating. There is nothing surprising about the idea that good analysis requires — or at least, can often require — a fair bit of creativity. After all, creativity is well recognized as an important trait in the sciences, in mathematics, and in many other disciplines.
We love to hear how great ideas and great discoveries were triggered by happenstance (like watching an apple fall) or by radically different ways of thinking about familiar things (such as the idea that time and space are relative to an observer). Through these stories we acknowledge the importance of creative insight, even (especially?) in fields normally associated with methodical and systematic processes. Data analysis* fits neatly into this line of thinking. It is quantitative, scientific, mathematical. And it requires, at its best, innovative thought and a creative approach.
* Data analysis, data science, quantitative analysis — these are loosely defined and often annoying terms. In the context of this blog post, these terms refer to the use of data to form hypotheses and build models (or products, or tools, or reports) that add some business value or inform decisions.
But what is the nature of that creativity — is it like artistic creativity, or something quite different? Why is it required, and to what extent can that requirement be replaced or made obsolete? How can this creativity be recognized in an individual, or encouraged in a particular team or environment? These questions are intriguing philosophically, and they also have great relevance today.
I was interested to listen to Thomas Davenport share a few of his thoughts on this subject in a recent interview hosted by analytics software reviewer SoftwareAdvice. (There is more about the interview, with a link, below.) Davenport describes data analysis projects as being composed of three stages: “Framing the problem”, “Solving the problem”, and “Communicating the results”, and suggests that the first and last stages are the ones that require the most creativity.
Actually, Davenport seemed to go out of his way to emphasize the first and third stages throughout the interview, claiming that they are often overlooked in favor of the obvious middle stage. Perhaps the idea that more creativity is required in those two stages was just part of his attempt to draw attention to them.
I agree with Davenport that creativity helps at every stage of the project. I think that there is arguably a distinct type of creativity unique to each stage (sounds like a fun train of thought for a follow-up post). I also agree with Davenport that the creativity required in framing a problem and forming a hypothesis is often overlooked and underestimated. I’ll go even further and say that that creativity, the creativity of the first stage, is destined to be one of the most important and most desired skills in this discipline.
It’s still very difficult to apply any sophisticated algorithms to large amounts of data — the majority of companies are happy if they can simply count things in their data. Davenport touches on this subject in the interview. He calls it the “big data equal small math problem” and notes that “it won’t be that way forever.”
It won’t be that way forever because we’re slowly but surely getting better at searching and organizing and querying big data. The trouble is that we don’t know how to take advantage of those capabilities. What can we do with all this data?
Creative and sophisticated uses will be found for commonly encountered data and packaged for easy deployment or sold as a service. We see this happening in web analytics and, increasingly, in other common scenarios. But most companies also collect data that is domain specific or specialized or unique to them. Data analysts will need to understand their company’s business, their challenges, their data, and find ways to put that data to good use.
We don’t just need creative solutions to common problems. We need creative analysts for uncommon problems.
As a community, we have much to do. Exploratory data analysis, in the Tufte sense, is still under-developed, and there are few publicly available resources to help develop those skills. We can learn from other disciplines about creativity. We can bring our data-driven problem solving approach to understanding and improving the creative process.
I’m personally very interested in building these resources and helping the community as a whole develop creative skills in data analysis. I’m also interested, as I’ve told many of my data-nerd friends, in building a “data innovation consulting firm”. An IDEO for data.
The interview I discuss above can be found here: http://plotting-success.softwareadvice.com/hangout-future-of-working-with-data-0613/
Thomas Davenport is a (quite prolific) author, co-founder and research director of the International Institute for Analytics, and visiting professor at Harvard Business School. The interview was my first exposure to his ideas. He speaks about his new book “Keeping Up with the Quants: Your Guide to Understanding and Using Analytics” and addresses several subjects including creativity, the need for humans in the analytical process, the type of people that make good analysts, and advice to new graduates. His assertions on creativity were one of the underlying themes of the interview (I haven’t read the book).
I found myself agreeing, in general, with much of what Davenport says. It is clear he knows his audience — his book is described as a guide to the data-driven world for business professionals — in that he does well to present his ideas in broad and easily understood terms. The book is co-authored by Jinho Kim, a professor of business and statistics at the Korea National Defense University, who seems to also be focused on the business side of things (PhD from Wharton) and in educating about data analysis in a business context.
As someone who often works with non-technical business folks wrestling with data-related projects, I’ve put the book on my to-read list, and it may turn out to be a good gift for clients.
I am cross-posting here one of my answers on Stack Exchange – Cross Validation (that’s like Stack Overflow for statistics). The question had already been answered and accepted when I posted my answer several months later, but I chose to spend some time putting in my thoughts anyway.
I’m particularly interested in the interpretation of simple plots in the context of exploratory data analysis, and am planning to compile resources for data explorers on this subject. My plan is to do this via a wiki — which I have already installed but not yet populated with much information. So busy these days!
How does one determine the number of clusters through interpretation of a Silhouette plot?
[I have paraphrased the OP’s question. -AA]
(Click through to my answer on Cross Validated for the most recent version of the answer, which may have changed.)
Sergey’s answer contains the critical point, which is that the silhouette coefficient quantifies the quality of clustering achieved — so you should select the number of clusters that maximizes the silhouette coefficient.
The long answer is that the best way to evaluate the results of your clustering efforts is to start by actually examining — human inspection — the clusters formed and making a determination based on an understanding of what the data represents, what a cluster represents, and what the clustering is intended to achieve.
There are numerous quantitative methods of evaluating clustering results which should be used as tools, with full understanding of the limitations. They tend to be fairly intuitive in nature, and thus have a natural appeal (like clustering problems in general).
Examples: cluster mass / radius / density, cohesion or separation between clusters, etc. These concepts are often combined, for example, the ratio of separation to cohesion should be large if clustering was successful.
The way clustering is measured is informed by the type of clustering algorithm used. For example, measuring the quality of a complete clustering algorithm (in which all points are put into clusters) can be very different from measuring the quality of a threshold-based fuzzy clustering algorithm (in which some points might be left un-clustered as ‘noise’).
The silhouette coefficient is one such measure. It works as follows:
For each point p, first find the average distance between p and all other points in the same cluster (this is a measure of cohesion, call it A). Then find the average distance between p and all points in the nearest cluster (this is a measure of separation from the closest other cluster, call it B). The silhouette coefficient for p is the difference between B and A divided by the greater of the two (max(A,B)).
We evaluate the silhouette coefficient of each point, and from this we can obtain the ‘overall’ average silhouette coefficient.
Intuitively, we are trying to measure the space between clusters. If cluster cohesion is good (A is small) and cluster separation is good (B is large), the numerator will be large and the coefficient will be close to 1.
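The computation just described can be sketched in a few lines of plain Python (the function name is mine; in practice you would reach for a library implementation such as scikit-learn’s silhouette_score):

```python
import math

def silhouette(points, labels):
    """Average silhouette coefficient, following the definition above:
    for each point, A = mean distance to the other points in its own
    cluster, B = mean distance to the points in the nearest other
    cluster, and s = (B - A) / max(A, B)."""
    clusters = set(labels)
    scores = []
    for i, p in enumerate(points):
        # Collect distances from p to every other point, by cluster.
        by_cluster = {c: [] for c in clusters}
        for j, q in enumerate(points):
            if i != j:
                by_cluster[labels[j]].append(math.dist(p, q))
        own = by_cluster[labels[i]]
        if not own:  # singleton cluster: define its silhouette as 0
            scores.append(0.0)
            continue
        a = sum(own) / len(own)
        b = min(sum(d) / len(d) for c, d in by_cluster.items()
                if c != labels[i] and d)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)
```

Well-separated, tight clusters push the average toward 1; a labeling that splits natural groups drags it down, which is why sweeping k and picking the maximum is a reasonable heuristic.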
I’ve constructed an example here to demonstrate this graphically.
In these plots the same data is plotted five times; the colors indicate the clusters created by k-means clustering, with k = 1,2,3,4,5. That is, I’ve forced a clustering algorithm to divide the data into 2 clusters, then 3, and so on, and colored the graph accordingly.
The silhouette plot shows that the silhouette coefficient was highest when k = 3, suggesting that’s the optimal number of clusters. In this example we are lucky to be able to visualize the data, and we might agree that indeed, three clusters best capture the segmentation of this data set.
If we were unable to visualize the data, perhaps because of higher dimensionality, a silhouette plot would still give us a suggestion. However, I hope my somewhat long-winded answer here also makes the point that this “suggestion” could be very insufficient or just plain wrong in certain scenarios.
Very interesting to read this Stack Overflow question today. The question and responses, written mostly in 2009, really bring to light how much public awareness of this field has grown in recent years. The same question would likely not be asked today — a Google search would inform the curious and thus questions on the topic would be more specific than this one. If it were asked, the answers would be quite different.
I especially enjoyed looking at the train of thought that many followed in trying to answer the question. Some thought the problem was completely intractable, others that text in itself was insufficient (“…but if you had keystroke timings you should be able to figure it out”). Those that suggested solutions focused on counting occurrences of emotional words (“maybe you can count the number of 4 letter words?”) or punctuation in the text with various manually-assigned weights, or ruminated about the use of ALL CAPS or font styles (“bold red text is probably an angry user”). The use of spelling and grammar mistakes was suggested as some type of indicator — were these mistakes really so rare in 2009 that they could be interpreted as emotional?
(I don’t mean in any way to be derogatory towards those that chimed in, of course. )
One of the answers (the one most up-voted) used the term ‘sentiment analysis’ and described a solution schema that is still used effectively today. This points out, as many of you already know, that the requisite NLP research has been around since 2009 and long before; it is public interest and awareness, and the sheer number of people interested in these types of questions, that have grown so dramatically since.
I gave in this week. I turned on the air conditioner at my apartment.
Not only did I use my A/C, I actually left it on ‘economy mode’ while I was at work. This resulted in the dramatic spike in the graph above, where my energy consumption shot up to about 13 kilowatt-hours, compared to my average of ~3 kWh in May and ~4 kWh in the rest of June.
Was it really all that hot? Weather Underground says that it got up to 101 degrees on those two days at the nearby BART station. And if you’ve ever been to my apartment, you know that when you walk up to my door the temperature rises palpably with each step. I can honestly estimate it must have been 110 degrees F in my living room.
I have colleagues in India (Kalkaji, in New Delhi), and this week their temperatures were quite comparable to this. I definitely will not jump on this rare opportunity to whine safely, though, lest they point out that their temperatures, while similar to ours this week, were actually abnormally cool compared to their own averages and norms. I wouldn’t hear the end of it for 51 weeks.
I’ll try to keep the broader perspective, but I’m just not very good with heat, and I’m simultaneously stubborn about turning on the air conditioner. Fortunately for me, the temperature dropped back down by about 20 degrees today, and I am comfortable with the windows open and a supply of water, Gatorade, and ice cream.