I’ve been working on a tutorial on change detection. This is the first time I’ve attempted to write a tutorial, and it’s been a useful learning process. I’m not “done” yet, but I feel it is at the point where I can announce that it exists.
While the transition from “notebooks where Aman is fooling around” to “a well-written tutorial with a narrative” is far from complete, I’ve invested enough time without any validation of whether anyone is actually interested in reading all this. If there’s significant interest or requests, I will be happy to invest more time in cleaning up this tutorial and maybe even adding more topics.
You can check out my tutorial on change detection here:
Why a tutorial on change detection?
Change detection is a type of problem in which we want to detect an interesting change in a signal. In the tutorial, I consider the signal to be a stream of (scalar) values, and take an online approach to the problem. I leave it open to interpretation what we mean by “interesting change”.
The objective of the tutorial is to introduce to discuss some fundamental concepts in change detection, to highlight some important use-cases of these algorithms, and to encourage the reader to think about the context of the problem before designing a solution.
I find this topic very interesting, and I’ve been fortunate to have had the chance to work on a few projects over the last years in which change detection was an important component. The problem is also tied to the more general subject of anomaly detection, which is turning out to be a recurring theme in my work.
Most importantly, I think there is a huge and growing demand in this subject. It is simply impossible to use purely manual methods to keep tabs on the tremendous amounts of data that we’re generating as a society, and in many scenarios the most interesting things are those that are NOT normal — they that fall rapidly, they rise sharply, they fit an unusual pattern or or they do not fit a usual pattern. Systems that utilize change detection algorithms — often as part of a larger solution — will help us sort through all our data and enable appropriate decisions to be made accurately and on time.
Some of the topics covered in the tutorial are:
- Online vs Offline algorithms, simulating online detection.
- Designing residuals and stop conditions
- Welford’s method
- Comparing multiple windows on a single stream.
- Anomaly detection in EKG signals
To look at EKG signals, I borrowed from Ted Dunning’s presentation at Strata. I recreated his solution in python (his solution, available on github at https://github.com/tdunning/anomaly-detection/ uses Java and Mahout).
I actually haven’t finished writing this section yet, it’s an exciting enough topic that I feel I could sink a lot of time into it. I spent (only) several hours reading and ingesting the EKG data, and (only) several more hours re-writing Ted Dunning’s algorithm. But after that initial effort, I put in a large amount of intermittent effort trying to make this section presentable and useful for a tutorial — and therein lies the time sink. I’ll fix this section up based on feedback from readers.
Ted Dunning’s approach to anomaly detection in EKG signals is as follows. First, a dictionary of representative small segments (windows) is created. These windows are found by using k-means clustering on a “normal” EKG signal. This dictionary, constructed from the training data, is used to reconstruct the EKG signal of interest.
If a test signal is successfully reconstructed from the dictionary, the signal is much like those found in the training data, and can be considered normal. If the reconstruction has a high error rate, there’s something in the signal that may be anomalous, suggesting a potentially unhealthy EKG that should be investigated further.
I have not tuned the model in the tutorial; there is room for improvement in the training phase, parameter selection, and other adjustments. That’s another potential time sink, so I’ve temporarily convinced myself that it’s actually better, for a tutorial, to leave that as an exercise for the reader.
If all this sounds interesting, please do take a look at the tutorial here:
I’ll be happy to receive any comments or feedback!
Bad mouthing old friends
I got into a conversation recently about k-means clustering — you know, as you do — and let me tell you, poor k-means was really getting bashed. “K-means sucks at this”, “K-means can’t do that”. It was really rather vicious, and I felt I had to step up to defend our old friend k-means. So I started writing up something that shows that those oft-highlighted weaknesses of k-means aren’t nearly bad as people think, and in most cases don’t outweigh the awesomeness that k-means brings to the party.
It started to get quite lengthy, so I’m breaking it up into pieces and maybe I’ll put it all together into one thing later. This post is the first of those pieces.
“K-means can’t handle non-convex sets”.
Convex sets: In Euclidean space, an object is convex if for every pair of points within the object, every point on the straight line segment that joins them is also within the object. [Source: Wikipedia.]
The k-means algorithm, in its basic form, is like making little circular paper cutouts and using them to cover the data points. We can change the quantity and size and position of our paper cut-outs, but they are still round and, thus, these non-convex shapes evade us.
That is, what are doing when we use k-means is constructing a mixture of k-gaussians. This works well if the data can be described by spatially separated hyper-spheres.
Here’s a clustering example, borrowed directly from the sklearn documentation on clustering. These are two slightly entangled banana spheres. That’s two non-convex shapes, and they are not spatially separated.
When we try to use k-means on this example, it doesn’t do very well. There’s just no way to form these two clusters with two little circular paper cut-outs. Or three.
K-means pairs well
But by combining k-means with another algorithm, hierarchical clustering, we can solve this problem. Pairing k-means with other techniques turns out to be a very effective way to draw from its benefits while overcoming its deficiencies. It’s like our theme. I’ll do it again in another post, just you watch.
First, we cluster the data into a large number of clusters using k-means. Below, I’ve plotted the centroids of clusters after k-means clustering using 21. [Why 21? Well, actually, it doesn’t matter very much in the end.]
Then, we take these many clusters from k-means and then start clustering them together into bigger clusters using a single-link agglomerative method. That is, we repeatedly pick the two clusters that are closest together and merge them. It is important in this scenario that we use the “single-link” method, in which the distance between two clusters is defined by the distance between the two closest data points we can find, one from each cluster.
Here’s what that looks like:
Woah woah. Did you see that one near the end? The one where we’ve taken 616 data points, formed a whole bunch [I used k=51 for the animation to get lots of colorful frames] of clusters with k-means , and then agglomerated them into … this:
Yup, that one. So pretty.
So many benefits
You get it already, I’m sure. We’re making lots of those little circles, covering all the data points with them. Then, we are attaching the little circles to each other, in pairs, by repeatedly picking the two that are closest.
K-means and single-link clustering. Combining the two algorithms is a pretty robust technique. It is less sensitive to initialization than pure k-means. It is also less sensitive to choice of parameters. When we have many points, we use an algorithm that is fast and parallelizable. After the heavy lifting is done, we can afford to use the more expensive hierarchical method, and reap its benefits, too.
There are many additional problems with k-means: sensitivity to initialization, the need to pick k, poor performance in high-dimensions. Today we looked at those damn non-convex sets. I’ll dive into some of the others in future posts.
By the way, in the banana shapes solution today, note that we don’t have to specify ahead of time the expected final number of clusters. We specified some arbitrary large number for k, but we finished up with hierarchical clustering. We could use one of many well-studied techniques to decide when to stop clustering. For example, we could automate a stopping rule using concepts of separation and cohesion — see this post for a hint.
I’ve been increasingly interested in this subject — given a stream of data, a time-series such as, perhaps, a periodic measurement from a sensor, how do we define and identify anomalous values quickly and efficiently?
[Update: Check out this August 11th post by Ben Lorica, focusing on IT Ops tools in this space.]
Michèle Basseville has written several papers on the subject which I found very helpful. These two were among the first I read, in February, while researching for a new client.
- “Statistical methods for change detection.” (2002).
- “Detecting Changes in Signals and Systems: A Survey” Automation, Vol. 2,t, No. 3, pp. 309-326, 1988
His approach involves two major steps. First, from the signal, generate “residuals”, which are defined as having three properties: residuals should be close to zero under ambient (normal) conditions, insensitive to noise, and sensitive to fault (anomaly). Second, evaluate the residuals using one or more previously design decision rules (stop conditions).
Bassevile defines multiple criteria for designing detection algorithms, which I found very useful. For each application, different criteria may take priority. They are often opposing or mutually exclusive to implement. An obvious example is balancing false positives and false negatives. Another tradeoff is the mean time between false alarms and the delay in fault detection. He draws the distinction between off-line and on-line change detection, and design differences in algorithms in each case.
Some of the ingredients he uses and discusses include:
- likelihood ratio and cumalative sum tests.
- the Page-Hinkley Stopping Rule
- using local approaches and moving windows to reduce computation costs.
- spectral properties of the incoming signal
- Cumulative Sum Control Chart (CUSUM) by ES Page — http://en.wikipedia.org/wiki/Cusum
If one is interested in this subject, I imagine Basseville is a familiar name already. Following his works and the paper that cite them is a deep dive straight into the subject. I find it all fascinating and hope to get many chances to utilize these techniques in future projects.
Read this paper in November 2012 (backdated post).
Compare this paper to many similar ones with general advice or observations on Machine Learning, such as the one by Andrew Ng.
These are the useful things Pedro chooses to highlight in this paper.
- Selection of an ML algorithm is simpler if you understand the three components: representation, evaluation (scoring function), optimization.
- Generalization is the goal. Use cross validation, optimize for the test set, not training data.
- Data and algorithms must be combined with domain knowledge and experience for good results. This is a good thing. Anti-Kaggle.
- Understand bias vs. variance in overfitting. Use techniques like regularization to combat them.
- Curse of dimensionality
- Theoretical guarantees in ML algorithms are not a criterion for practical decisions.
- Feature engineering is the most important contributor to success/failure of ML.
- More data > algorithmic sophistication, but adds scalability issues. Try the simplest learners first.
- Use ensemble methods. More models are better.
- “… Simpler hypotheses should be preferred because simplicity is a virtue in its own right, not because of a hypothetical connection with accuracy. This is probably what Occam meant in the first place.”
- Just because something has a representation doesn’t mean it can be learned.
- Correlation is not Causation.
I read this paper on 10/19/2012.
Original paper: http://www.icdt.tu-dortmund.de/proceedings/edbticdt2011proc/WebProceedings/papers/edbt/a39-stoyanovich.pdf [PDF]
[Updated 10/19/2012. I just found a nice slideshare from Julia Stoyanovich.]
Bottom-up Algorithm for Rank Aware (Interval Based) clustering.
This paper argues that in some scenarios providing search results in a ranked manner is improved by grouping similar results such that the user is more easily able to access a varied set of matching responses. The proposal is to cluster results before presenting them to the user. Clustering is performed locally, that is, after applying the filter and by taking into account the user’s selected ranking criterion.
The example used for the paper is based on Yahoo! personal searches. For a particular filter (age range, sex, etc), a user may select a ranking variable (e.g. income, highest to lowest). Without the proposed cluster, the user may see a long list of very similar results (perhaps software engineers) before seeing a different type of result (a different profession), and if uninterested in the first category, would have to wade inconveniently through a long list.
The paper describes ways to measure qualities of clusters, including formal definitions, which are then used as a basis for an algorithm “BARAC” for “Bottom-up Algorithm for Rank Aware (Interval Based) clustering”. It also discusses complexity and computational costs, and results from user tests.
- Interval-based clustering refers to grouping an attribute by a range of values (age between 20 and 25).
- CLIQUE — a rank unaware clustering framework, used both as a point of comparison and a starting template for BARAC. [Agrawal 1998]
- Clustering measures are based on locality, quality, tightness, maximality.
The paper is clear and easily consumed (I read it on BART), and is sufficiently descriptive to be actionable (even includes pseudo code for BARAC and most sub routines). Discusses some practical issues in implementation. Some general valuable lessons can be extracted.
Some weeks ago I finished Consilience, by EO Wilson.
The book is grandly ambitious, informative, critical and didactic. And while few people would agree with everything that Wilson says, fewer still could claim that they are not better off for having read the book.
First we are given sweeping overviews of the major disciplines of knowledge; a report of the condition, progress, and challenges of each discipline. This is then placed into a broad context, like approximate positioning of pieces at the start of a jigsaw puzzle. Finally, Wilson examines the gaps between these islands of human knowledge and argues that the greatest potential now lies there, in these gaps.
Part of the “Today I read” series.
Today I read this paper:
“Language Generation for Conversational Agent by Evolution of Plan Trees with Genetic Programming” by Sungsoo Lim and Sung-Bae Cho. Link.
Anya was able to get this paper for me from NYU resources. It was a quick read. I had hoped for something that could apply to the Pablo project, but that was not to be.
The paper uses a population of (initially simple) sentences and joining operators to form new (initially more complex) sentences. They generated 20 sentences (population size), humans rated each on their quality (fitness score), and then used the fittest of these sentences to contribute to the next generation. 90 generations later, they ended up with sentences approximately 60% more fit than the first generation.
Ten human subjects were used to rate sentences (at least 10,800 of them) and I’m curious about the inevitable problems with that. For example, the authors mention that there was a statistically visible bias in ratings which wasn’t compensated for.
Since Pablo is web-based and public, it would be easy to implement a rating system for this kind of algorithm. However, there are a lot of associated problems that suggest this would be a difficult idea, with questionable results.
Two more comments. The researchers wrote the joining operators and performed the work in the Korean language, which made some of the details of their work less interesting for me to read. Also, Pablo attempts to generate sentences from words, not sentences from sentences.