I’ve been working on a tutorial on change detection. This is the first time I’ve attempted to write a tutorial, and it’s been a useful learning process. I’m not “done” yet, but I feel it is at the point where I can announce that it exists.
While the transition from “notebooks where Aman is fooling around” to “a well-written tutorial with a narrative” is far from complete, I’ve invested enough time without any validation of whether anyone is actually interested in reading all this. If there’s significant interest or requests, I will be happy to invest more time in cleaning up this tutorial and maybe even adding more topics.
You can check out my tutorial on change detection here:
Why a tutorial on change detection?
Change detection is a type of problem in which we want to detect an interesting change in a signal. In the tutorial, I consider the signal to be a stream of (scalar) values, and take an online approach to the problem. I leave it open to interpretation what we mean by “interesting change”.
The objective of the tutorial is to introduce some fundamental concepts in change detection, to highlight some important use-cases of these algorithms, and to encourage the reader to think about the context of the problem before designing a solution.
I find this topic very interesting, and I’ve been fortunate to have had the chance to work on a few projects over the last few years in which change detection was an important component. The problem is also tied to the more general subject of anomaly detection, which is turning out to be a recurring theme in my work.
Most importantly, I think there is a huge and growing demand in this area. It is simply impossible to use purely manual methods to keep tabs on the tremendous amounts of data that we’re generating as a society, and in many scenarios the most interesting things are those that are NOT normal: values that fall rapidly, that rise sharply, that fit an unusual pattern, or that do not fit a usual pattern. Systems that utilize change detection algorithms, often as part of a larger solution, will help us sort through all our data and enable appropriate decisions to be made accurately and on time.
Some of the topics covered in the tutorial are:
- Online vs Offline algorithms, simulating online detection.
- Designing residuals and stop conditions
- Welford’s method
- Comparing multiple windows on a single stream.
- Anomaly detection in EKG signals
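Of the topics above, Welford’s method is compact enough to sketch right here. It keeps a running mean and variance of a stream in a single pass, which is what an online detector needs. This is a minimal sketch, not the code from the tutorial; the class name `RunningStats` and its attributes are hypothetical names chosen for this example.

```python
class RunningStats:
    """Running mean and variance of a stream (Welford's method)."""

    def __init__(self):
        self.n = 0        # number of values seen so far
        self.mean = 0.0   # running mean
        self.m2 = 0.0     # running sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        # Uses the deviation from the old mean and from the new mean.
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self):
        # Sample variance; defined as 0.0 here until two values have arrived.
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0


stats = RunningStats()
for value in [2, 4, 4, 4, 5, 5, 7, 9]:
    stats.update(value)
```

Each value is seen once and then discarded, so memory use stays constant however long the stream runs.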
To look at EKG signals, I borrowed from Ted Dunning’s presentation at Strata. I recreated his solution in python (his solution, available on github at https://github.com/tdunning/anomaly-detection/ uses Java and Mahout).
I actually haven’t finished writing this section yet; it’s an exciting enough topic that I feel I could sink a lot of time into it. I spent (only) several hours reading and ingesting the EKG data, and (only) several more hours re-writing Ted Dunning’s algorithm. But after that initial effort, I put in a large amount of intermittent effort trying to make this section presentable and useful for a tutorial, and therein lies the time sink. I’ll fix this section up based on feedback from readers.
Ted Dunning’s approach to anomaly detection in EKG signals is as follows. First, a dictionary of representative small segments (windows) is created. These windows are found by using k-means clustering on a “normal” EKG signal. This dictionary, constructed from the training data, is used to reconstruct the EKG signal of interest.
If a test signal is successfully reconstructed from the dictionary, the signal is much like those found in the training data, and can be considered normal. If the reconstruction has a high error rate, there’s something in the signal that may be anomalous, suggesting a potentially unhealthy EKG that should be investigated further.
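The reconstruction idea can be sketched in a few lines. This is a heavily simplified illustration and not Ted Dunning’s implementation: it assumes the dictionary already exists (here it is a hand-made toy rather than the output of k-means), uses non-overlapping windows, and does no scaling. The function names and the toy values are all made up for this example.

```python
import math


def reconstruct(signal, dictionary):
    """Rebuild a signal window by window from the closest dictionary entry."""
    window = len(dictionary[0])
    rebuilt = []
    for start in range(0, len(signal) - window + 1, window):
        segment = signal[start:start + window]
        # Pick the dictionary window nearest to this segment (Euclidean).
        best = min(dictionary, key=lambda w: math.dist(w, segment))
        rebuilt.extend(best)
    return rebuilt


def reconstruction_error(signal, dictionary):
    """Mean squared difference between the signal and its reconstruction."""
    rebuilt = reconstruct(signal, dictionary)
    return sum((s - r) ** 2 for s, r in zip(signal, rebuilt)) / len(rebuilt)


# Toy dictionary: a flat window and a "beat" window (assumed, not learned).
dictionary = [[0.0, 0.0, 0.0, 0.0], [0.0, 5.0, -2.0, 0.0]]

normal = [0.0, 0.0, 0.0, 0.0, 0.0, 5.0, -2.0, 0.0] * 3
anomalous = list(normal)
anomalous[8:12] = [0.0, 2.0, 3.0, 0.0]  # a shape the dictionary has never seen

normal_error = reconstruction_error(normal, dictionary)
anomalous_error = reconstruction_error(anomalous, dictionary)
```

The normal signal is rebuilt exactly, while the unfamiliar window leaves a residual error. That error is the quantity one would threshold to flag a possible anomaly.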
I have not tuned the model in the tutorial; there is room for improvement in the training phase, parameter selection, and other adjustments. That’s another potential time sink, so I’ve temporarily convinced myself that it’s actually better, for a tutorial, to leave that as an exercise for the reader.
If all this sounds interesting, please do take a look at the tutorial here:
I’ll be happy to receive any comments or feedback!
I am happy to announce that recently I’ve joined forces with The Data Guild!
What is — who are — The Data Guild? Their website says:
The Data Guild brings together deeply experienced data scientists, social scientists, designers and engineers from diverse industry backgrounds to tackle important problems and challenges.
This new relationship doesn’t encroach on any of the benefits and freedoms that I enjoy by working independently, and that was an important consideration. And there are great practical reasons to work with a team. But what really attracted my interest in The Data Guild, and the reasons why I want to work with them, are less tangible than these.
When I visited universities in India a few years ago, I had noticed a strong resistance to the sharing of knowledge that leads to creative thinking and unique ideas. The system in which those schools lived seemed severely limited in this regard. But by working as an independent consultant, I am constantly fighting a similar battle.
It costs me a great deal of energy to continually expose myself to new ideas and projects, to find inter-disciplinary collaboration. And I am rarely able to bounce ideas around with someone who understands the nuances of what I am talking about; I also lack intra-disciplinary collaboration.
By being part of a community like The Data Guild, I hope to find frequent opportunities for such cross-pollination of ideas.
But that’s not the best part.
It has been two years since I started working independently as a consultant, and I have naturally been in a mood of self-assessment. I had quit my job back then because I was not satisfied with just earning a good salary. I had wanted to work on problems that I found more interesting and challenging. I feel good about my progress on this front. But I had also wanted to work on projects that had some positive impact in a way that mattered to me. In that, I have far to go.
So it was perfect timing when, last month, I met with the founders of The Data Guild — Chris Diehl, Dave Gutelius and Cameron Turner. They talked about their vision of assembling a team of experts that were passionate about doing something significant with their efforts.
There is plenty of money to be made forming a company or working for one in the “big data” world. In this nascent industry, the “low-hanging fruit” — the business models that are immediately profitable — are ones that I do not find to be satisfying. Developing a new non-relational database, or optimizing bidding strategies for advertising — these projects are often technically impressive and have good business justification. But I do not find them compelling.
I would like to spend my time working on problems that are interesting not just for their own sake, but for the impact that they have on our world. In their first blog post, The Data Guild writes:
“We shouldn’t have been surprised; the best and brightest people we know want a chance to make a difference in the world, and to work creatively on teams where they can reach their full potential. We wanted to create a space where these incredible teams could tackle the most significant global challenges we face – but also make a living doing it. We wanted to challenge the idea that there’s a necessary tradeoff between making a difference and making a living.”
People who think like this are people I can be proud to work with. That is the reason I’m excited about working with The Data Guild.
Bad mouthing old friends
I got into a conversation recently about k-means clustering (you know, as you do) and let me tell you, poor k-means was really getting bashed. “K-means sucks at this”, “K-means can’t do that”. It was really rather vicious, and I felt I had to step up to defend our old friend k-means. So I started writing up something that shows that those oft-highlighted weaknesses of k-means aren’t nearly as bad as people think, and in most cases don’t outweigh the awesomeness that k-means brings to the party.
It started to get quite lengthy, so I’m breaking it up into pieces and maybe I’ll put it all together into one thing later. This post is the first of those pieces.
“K-means can’t handle non-convex sets”.
Convex sets: In Euclidean space, an object is convex if for every pair of points within the object, every point on the straight line segment that joins them is also within the object. [Source: Wikipedia.]
The k-means algorithm, in its basic form, is like making little circular paper cutouts and using them to cover the data points. We can change the quantity and size and position of our paper cut-outs, but they are still round and, thus, these non-convex shapes evade us.
That is, what we are doing when we use k-means is, roughly, fitting a mixture of k spherical Gaussians. This works well if the data can be described by spatially separated hyper-spheres.
Here’s a clustering example, borrowed directly from the sklearn documentation on clustering. These are two slightly entangled banana shapes. That’s two non-convex shapes, and they are not spatially separated.
When we try to use k-means on this example, it doesn’t do very well. There’s just no way to form these two clusters with two little circular paper cut-outs. Or three.
K-means pairs well
But by combining k-means with another algorithm, hierarchical clustering, we can solve this problem. Pairing k-means with other techniques turns out to be a very effective way to draw from its benefits while overcoming its deficiencies. It’s like our theme. I’ll do it again in another post, just you watch.
First, we cluster the data into a large number of clusters using k-means. Below, I’ve plotted the centroids of clusters after k-means clustering using k=21. [Why 21? Well, actually, it doesn’t matter very much in the end.]
Then, we take these many clusters from k-means and start clustering them together into bigger clusters using a single-link agglomerative method. That is, we repeatedly pick the two clusters that are closest together and merge them. It is important in this scenario that we use the “single-link” method, in which the distance between two clusters is defined by the distance between the two closest data points we can find, one from each cluster.
Here’s what that looks like:
Woah woah. Did you see that one near the end? The one where we’ve taken 616 data points, formed a whole bunch [I used k=51 for the animation to get lots of colorful frames] of clusters with k-means, and then agglomerated them into … this:
Yup, that one. So pretty.
So many benefits
You get it already, I’m sure. We’re making lots of those little circles, covering all the data points with them. Then, we are attaching the little circles to each other, in pairs, by repeatedly picking the two that are closest.
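For the curious, here is roughly what those two steps look like in plain Python. It is a sketch under several assumptions, not the code behind the plots: the two banana shapes are generated without noise, k-means is seeded with evenly strided points so that the run is repeatable (random or k-means++ seeding is more usual), and every function name is invented for this example. Merging the closest pair of clusters over and over under single-link is the same as walking the cluster-to-cluster distances in ascending order and joining anything not yet joined, which is how the second function does it.

```python
import math


def two_moons(n_per_moon=300):
    """Two interleaved, noise-free half circles (the banana shapes)."""
    ts = [math.pi * i / (n_per_moon - 1) for i in range(n_per_moon)]
    upper = [(math.cos(t), math.sin(t)) for t in ts]
    lower = [(1 - math.cos(t), 0.5 - math.sin(t)) for t in ts]
    return upper + lower


def kmeans(points, k, iters=50):
    """Plain Lloyd's algorithm; returns a cluster index for every point."""
    step = len(points) // k
    centroids = [points[i * step] for i in range(k)]  # strided seeding (assumption)
    assign = None
    for _ in range(iters):
        new_assign = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                      for p in points]
        if new_assign == assign:
            break  # assignments stopped changing
        assign = new_assign
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:  # keep the old centroid if a cluster goes empty
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return assign


def single_link_merge(points, assign, k, n_final):
    """Merge k-means clusters by single-link until n_final clusters remain."""
    groups = [[p for p, a in zip(points, assign) if a == c] for c in range(k)]
    edges = []
    for i in range(k):
        for j in range(i + 1, k):
            if groups[i] and groups[j]:
                # Single-link: closest pair of points, one from each cluster.
                d = min(math.dist(p, q) for p in groups[i] for q in groups[j])
                edges.append((d, i, j))
    edges.sort()

    parent = list(range(k))

    def find(c):
        while parent[c] != c:
            parent[c] = parent[parent[c]]
            c = parent[c]
        return c

    remaining = sum(1 for g in groups if g)
    for _, i, j in edges:
        if remaining <= n_final:
            break
        ri, rj = find(i), find(j)
        if ri != rj:  # the closest pair of still-separate clusters
            parent[rj] = ri
            remaining -= 1
    return [find(a) for a in assign]


points = two_moons()
assign = kmeans(points, k=21)
labels = single_link_merge(points, assign, k=21, n_final=2)
```

On this clean toy data the first 300 points (one banana) end up sharing a label and the last 300 share the other. Noisy data would need more care, since a single stray point can bridge two clusters under single-link.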
K-means and single-link clustering. Combining the two algorithms is a pretty robust technique. It is less sensitive to initialization than pure k-means. It is also less sensitive to choice of parameters. When we have many points, we use an algorithm that is fast and parallelizable. After the heavy lifting is done, we can afford to use the more expensive hierarchical method, and reap its benefits, too.
There are many additional problems with k-means: sensitivity to initialization, the need to pick k, poor performance in high dimensions. Today we looked at those damn non-convex sets. I’ll dive into some of the others in future posts.
By the way, in the banana shapes solution today, note that we don’t have to specify ahead of time the expected final number of clusters. We specified some arbitrary large number for k, but we finished up with hierarchical clustering. We could use one of many well-studied techniques to decide when to stop clustering. For example, we could automate a stopping rule using concepts of separation and cohesion — see this post for a hint.
How can we calculate the Mean absolute percentage error (MAPE) of our predictions using Python and scikit-learn?
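One hedged answer: MAPE is simple enough to compute by hand. The sketch below uses plain Python, and the function name `mape` is my own choice for this example. As far as I know, scikit-learn 0.24 and later also ship `sklearn.metrics.mean_absolute_percentage_error`, which returns a fraction (0.05 rather than 5.0), so the two differ by a factor of 100.

```python
def mape(y_true, y_pred):
    """Mean absolute percentage error, returned as a percentage."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    # Note: a zero in y_true raises ZeroDivisionError; MAPE is undefined there.
    total = sum(abs((t - p) / t) for t, p in zip(y_true, y_pred))
    return 100.0 * total / len(y_true)


# Errors of 10%, 10% and 0% average out to 6.67%.
score = mape([100.0, 200.0, 400.0], [110.0, 180.0, 400.0])
```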
“Five Balltree Construction Algorithms.” (1989).
Stephen M. Omohundro
I browsed this paper after reading several blog posts and articles about balltree-related algorithms, including:
- “Damn Cool Algorithms, Part 1: BK-Trees.” Nick Johnson. http://blog.notdot.net/2007/4/Damn-Cool-Algorithms-Part-1-BK-Trees
- “VP trees: A data structure for finding stuff fast.” Stephen Hanov. http://stevehanov.ca/blog/index.php?id=130
These and Omohundro’s paper are worthwhile reading. Even if one is not able to apply these data structures directly, reading about them is still beneficial. When I was reading them, I was reminded that:
- A concept that is intuitively straightforward can often be impractical or impossible to implement for a particular application.
- Data structures can be designed and built specifically to optimize an operation (that is required by your algorithm)
- That curse of dimensionality, god damnit.
- There are many really cool and clever algorithms that you’ll never be able to apply in your domain.
Balltree and related structures are hierarchical, tree-like representations. They place data points in the tree and provide instructions for traversal of the tree in such a way as to optimize some expected future operation. The clearest application is nearest neighbor search. They also give you an excuse to sensibly use terms like “hyper-spheres” and “leaf balls”.
Construction times for these structures don’t tend to scale well. Think O(N^3). A lot of effort is put into improving and optimizing construction, but direct application of these structures to large data sets is not tractable.
Relatedly: k-d trees, Burkhard-Keller (BK) trees, and VP-trees. And others.
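To make the "data structure built to optimize one operation" point concrete, here is a minimal BK-tree sketch over strings with edit distance. It is my own illustration, not code from Omohundro’s paper or from the blog posts above, and the names `BKTree`, `add` and `search` are assumptions made for this example. The pruning step relies only on the triangle inequality, so any true metric would do.

```python
def levenshtein(a, b):
    """Edit distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


class BKTree:
    """Each node is (item, children), with children keyed by distance to item."""

    def __init__(self, distance):
        self.distance = distance
        self.root = None

    def add(self, item):
        if self.root is None:
            self.root = (item, {})
            return
        node = self.root
        while True:
            d = self.distance(item, node[0])
            if d == 0:
                return  # already stored
            if d in node[1]:
                node = node[1][d]  # descend along the edge with this distance
            else:
                node[1][d] = (item, {})
                return

    def search(self, query, radius):
        """Return sorted (distance, item) pairs within radius of query."""
        results = []
        stack = [self.root] if self.root else []
        while stack:
            item, children = stack.pop()
            d = self.distance(query, item)
            if d <= radius:
                results.append((d, item))
            for child_d, child in children.items():
                # Triangle inequality: other subtrees cannot hold a match.
                if d - radius <= child_d <= d + radius:
                    stack.append(child)
        return sorted(results)


tree = BKTree(levenshtein)
for word in ["book", "books", "cake", "boo", "cape", "cart", "boon", "cook"]:
    tree.add(word)
matches = tree.search("bool", 1)
```

Here `matches` holds the three words within one edit of "bool". The search skips any subtree whose edge distance falls outside the allowed band, which is where the savings over a linear scan come from.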
“MAD Skills: New Analysis Practices for Big Data”
Cohen, et al.
Proceedings of the VLDB Endowment, Volume 2 Issue 2, August 2009
Reading log: I’m not sure when I read this paper, so the back-dating is pretty much arbitrary.
Abstract from the paper:
As massive data acquisition and storage becomes increasingly affordable, a wide variety of enterprises are employing statisticians to engage in sophisticated data analysis. In this paper we highlight the emerging practice of Magnetic, Agile, Deep (MAD) data analysis as a radical departure from traditional Enterprise Data Warehouses and Business Intelligence. We present our design philosophy, techniques and experience providing MAD analytics for one of the world’s largest advertising networks at Fox Audience Network, using the Greenplum parallel database system. We describe database design methodologies that support the agile working style of analysts in these settings. We present data-parallel algorithms for sophisticated statistical techniques, with a focus on density methods. Finally, we reflect on database system features that enable agile design and flexible algorithm development using both SQL and MapReduce interfaces over a variety of storage mechanisms.
I was concerned that this paper would turn into a white-paper or technical sales piece on joint hardware-software product offerings by Greenplum. It presents a Greenplum case study: the Greenplum database as used by their client, Fox Audience Network.
- MAD is Magnetic, Agile, Deep data analysis
- The authors define the MAD acronym as a re-imagination of the data warehouse concept such that:
  - Magnetic: encourages (attracts) new data sources, has reduced sensitivity to cleanliness of data sources
  - Agile: logical and physical contents of the database can evolve and adapt rapidly
  - Deep: avoids BI rollups and sampling to serve more demanding statistical analyses
- Presented as an alternative to “traditional Enterprise Data Warehouses and Business Intelligence.”
- Emphasis is on moving data to a data warehouse rapidly, and using a staged approach to clean and integrate the new data.
Provides background / definitions: OLAP, data cubes, common statistical systems, parallel processing paradigms, some statistical concepts, tf-idf analysis, ordinary least squares curve fitting, etc. Then basically just states that all this is possible in a fast, dynamic fashion using Greenplum technology.
I skimmed rather than read this paper. It felt like it was at least a review of some important concepts, but I’m not sure I actually got anything out of this read.
I’ve been increasingly interested in this subject — given a stream of data, a time-series such as, perhaps, a periodic measurement from a sensor, how do we define and identify anomalous values quickly and efficiently?
[Update: Check out this August 11th post by Ben Lorica, focusing on IT Ops tools in this space.]
Michèle Basseville has written several papers on the subject which I found very helpful. These two were among the first I read, in February, while researching for a new client.
- “Statistical methods for change detection.” (2002).
- “Detecting Changes in Signals and Systems: A Survey.” Automatica, Vol. 24, No. 3, pp. 309-326, 1988.
Her approach involves two major steps. First, from the signal, generate “residuals”, which are defined as having three properties: residuals should be close to zero under ambient (normal) conditions, insensitive to noise, and sensitive to fault (anomaly). Second, evaluate the residuals using one or more previously designed decision rules (stop conditions).
Basseville defines multiple criteria for designing detection algorithms, which I found very useful. For each application, different criteria may take priority. They are often opposing or mutually exclusive to implement. An obvious example is balancing false positives and false negatives. Another tradeoff is the mean time between false alarms and the delay in fault detection. She draws the distinction between off-line and on-line change detection, and design differences in algorithms in each case.
Some of the ingredients she uses and discusses include:
- likelihood ratio and cumulative sum tests.
- the Page-Hinkley Stopping Rule
- using local approaches and moving windows to reduce computation costs.
- spectral properties of the incoming signal
- Cumulative Sum Control Chart (CUSUM) by ES Page — http://en.wikipedia.org/wiki/Cusum
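To make the residual-plus-stopping-rule recipe concrete, here is a minimal one-sided CUSUM sketch. It is only an illustration under assumptions, not code from Basseville’s papers: the residual is simply the value minus a known normal mean, and the parameter names `target_mean`, `drift` and `threshold` (and their values below) are made up for this example.

```python
def cusum_detect(stream, target_mean, drift, threshold):
    """Return the index of the first alarm for an upward shift, or None."""
    g = 0.0  # cumulative sum of evidence, clipped at zero
    for i, x in enumerate(stream):
        residual = x - target_mean          # near zero under normal conditions
        g = max(0.0, g + residual - drift)  # drift absorbs small fluctuations
        if g > threshold:                   # the stop condition
            return i
    return None


# Toy stream: the mean jumps from 0 to 2 at index 50.
signal = [0.0] * 50 + [2.0] * 50
alarm_index = cusum_detect(signal, target_mean=0.0, drift=0.5, threshold=5.0)
```

The alarm fires a few samples after the jump rather than immediately. That lag is the detection delay, and raising `threshold` trades a longer delay for fewer false alarms, the same tradeoff described above.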
If one is interested in this subject, I imagine Basseville is a familiar name already. Following her works and the papers that cite them is a deep dive straight into the subject. I find it all fascinating and hope to get many chances to utilize these techniques in future projects.