I’ve been working on a tutorial on change detection. This is the first time I’ve attempted to write a tutorial, and it’s been a useful learning process. I’m not “done” yet, but I feel it is at the point where I can announce that it exists.
While the transition from “notebooks where Aman is fooling around” to “a well-written tutorial with a narrative” is far from complete, I’ve invested enough time without any validation of whether anyone is actually interested in reading all this. If there’s significant interest or requests, I will be happy to invest more time in cleaning up this tutorial and maybe even adding more topics.
You can check out my tutorial on change detection here:
Why a tutorial on change detection?
Change detection is a type of problem in which we want to detect an interesting change in a signal. In the tutorial, I consider the signal to be a stream of (scalar) values, and take an online approach to the problem. I leave it open to interpretation what we mean by “interesting change”.
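To make the setting concrete, here is a minimal sketch (not code from the tutorial) of one classic online scheme, the one-sided CUSUM detector. The stream, threshold, and drift values below are invented for the example:

```python
import random

def cusum(stream, target, threshold, drift):
    """One-sided CUSUM: alarm when the cumulative positive
    deviation from the target level exceeds the threshold."""
    s = 0.0
    for i, x in enumerate(stream):
        s = max(0.0, s + (x - target - drift))
        if s > threshold:
            yield i   # index at which a change is declared
            s = 0.0   # reset and keep monitoring

# a synthetic stream whose mean jumps from 0 to 2 at index 50
random.seed(0)
stream = [random.gauss(0, 0.5) for _ in range(50)] + \
         [random.gauss(2, 0.5) for _ in range(50)]
alarms = list(cusum(stream, target=0.0, threshold=5.0, drift=0.5))
```

The drift term makes the detector insensitive to small wobbles around the target; only a sustained shift accumulates enough evidence to trip the threshold.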
The objective of the tutorial is to introduce and discuss some fundamental concepts in change detection, to highlight some important use cases of these algorithms, and to encourage the reader to think about the context of the problem before designing a solution.
I find this topic very interesting, and I’ve been fortunate to have had the chance to work on a few projects over the last years in which change detection was an important component. The problem is also tied to the more general subject of anomaly detection, which is turning out to be a recurring theme in my work.
Most importantly, I think there is a huge and growing demand in this subject. It is simply impossible to use purely manual methods to keep tabs on the tremendous amounts of data that we’re generating as a society, and in many scenarios the most interesting things are those that are NOT normal — things that fall rapidly, rise sharply, fit an unusual pattern, or fail to fit a usual one. Systems that utilize change detection algorithms — often as part of a larger solution — will help us sort through all our data and enable appropriate decisions to be made accurately and on time.
Some of the topics covered in the tutorial are:
- Online vs. offline algorithms, and simulating online detection
- Designing residuals and stop conditions
- Welford’s method
- Comparing multiple windows on a single stream
- Anomaly detection in EKG signals
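To give a flavor of one of these topics: Welford’s method maintains a running mean and variance in a single numerically stable pass over the stream. This is a textbook sketch, not the tutorial’s exact code:

```python
class Welford:
    """Welford's online algorithm for a running mean and variance."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the current mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self):
        # sample variance; undefined for fewer than two observations
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0

w = Welford()
for x in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    w.update(x)
```

Because each value is folded in as it arrives, nothing needs to be buffered — exactly what an online change detector wants from its summary statistics.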
To look at EKG signals, I borrowed from Ted Dunning’s presentation at Strata, recreating his solution in Python (the original, available on GitHub at https://github.com/tdunning/anomaly-detection/, uses Java and Mahout).
I actually haven’t finished writing this section yet; the topic is exciting enough that I could sink a lot of time into it. I spent (only) several hours reading and ingesting the EKG data, and (only) several more re-implementing Ted Dunning’s algorithm. The real time sink has been the intermittent effort, since that initial burst, of making the section presentable and useful for a tutorial. I’ll fix this section up based on feedback from readers.
Ted Dunning’s approach to anomaly detection in EKG signals is as follows. First, a dictionary of representative small segments (windows) is created. These windows are found by using k-means clustering on a “normal” EKG signal. This dictionary, constructed from the training data, is used to reconstruct the EKG signal of interest.
If a test signal is successfully reconstructed from the dictionary, the signal is much like those found in the training data, and can be considered normal. If the reconstruction has a high error rate, there’s something in the signal that may be anomalous, suggesting a potentially unhealthy EKG that should be investigated further.
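A rough sketch of this pipeline in Python. This is not the tutorial’s code: the synthetic signal, window width, and cluster count are invented for illustration, and scikit-learn’s KMeans stands in for Mahout:

```python
import numpy as np
from sklearn.cluster import KMeans

def make_windows(signal, width):
    # chop the signal into non-overlapping windows of fixed width
    n = len(signal) // width
    return np.array([signal[i * width:(i + 1) * width] for i in range(n)])

def reconstruction_error(signal, km, width):
    # replace each window by its nearest dictionary entry, measure the misfit
    windows = make_windows(signal, width)
    nearest = km.cluster_centers_[km.predict(windows)]
    return float(np.mean((windows - nearest) ** 2))

rng = np.random.default_rng(0)
t = np.arange(4000)
# stand-in for a "normal" EKG: a periodic signal plus noise
normal = np.sin(2 * np.pi * t / 50) + 0.05 * rng.standard_normal(len(t))
anomalous = normal.copy()
anomalous[2000:2100] += 3.0  # inject an abnormal excursion

width = 25
km = KMeans(n_clusters=8, n_init=10, random_state=0).fit(make_windows(normal, width))
err_normal = reconstruction_error(normal, km, width)
err_anomalous = reconstruction_error(anomalous, km, width)
```

The dictionary reconstructs the normal signal almost perfectly, while the injected excursion produces windows far from any dictionary entry and a much larger error.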
I have not tuned the model in the tutorial; there is room for improvement in the training phase, parameter selection, and other adjustments. That’s another potential time sink, so I’ve temporarily convinced myself that it’s actually better, for a tutorial, to leave that as an exercise for the reader.
If all this sounds interesting, please do take a look at the tutorial here:
I’ll be happy to receive any comments or feedback!
After attending a lecture at University of San Francisco by Jonathan Reichental (@Reichental) on the use of open data in the public sector, I started poking around some data sets available at Data.gov.
Data.gov is pretty impressive. The site was established in 2009 by Vivek Kundra, the first person with the title “Federal CIO” of the United States, appointed by Barack Obama. It is rapidly adding data sets; sixty-four thousand data sets have been added just in the last year.
Interestingly, there is an open-source version of data.gov itself, called the Open Government Platform. It is built on Drupal and available on GitHub. The initiative is spearheaded by the US and Indian governments to help promote transparency and citizen engagement by making data widely and easily available. Awesome.
- The data was easy to find. I downloaded it without declaring who I am or why I’m downloading the data, and I didn’t have to wait for any approval.
- The data was well-formatted and trivially easy to digest using python pandas.
- IPython notebook and data source available below.
- iPynb (Python code to import data and generate plot): http://nbviewer.ipython.org/6417852
- Data from: http://wonder.cdc.gov/cancer.html
If you’re interested in this data, you should also check out http://statecancerprofiles.cancer.gov/, which I didn’t know existed until I started writing this post. I was able to retrieve this map from there:
How can we calculate the Mean absolute percentage error (MAPE) of our predictions using Python and scikit-learn?
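A minimal sketch with NumPy; recent scikit-learn versions also provide sklearn.metrics.mean_absolute_percentage_error, which returns a fraction rather than a percentage:

```python
import numpy as np

def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent.
    Assumes no true value is zero (the metric is undefined there)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs((y_true - y_pred) / y_true)) * 100)

print(mape([100, 200, 300], [110, 190, 330]))  # (10% + 5% + 10%) / 3 ≈ 8.33
```

Watch out for true values at or near zero, which blow the metric up; that is the usual argument for preferring MAE or sMAPE on such data.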
Check out the alphabet soup of data web visualizations I am swimming in today.
- R is statistical and computational software.
- ggplot2 is a graphing library for R, developed by Hadley Wickham.
The first three are pretty powerful and, if they are not already, are fast becoming critical parts of the data toolkit. The last is a promising newcomer, worth keeping an eye on.
So far so good. If you’re a data nerd, you probably already know all this. Stick with me.
It turns out that all these libraries, which do slightly different but related things and do them well, could work very well together. They’re not tightly integrated (yet), but there are several efforts to make it so.
Hadley Wickham, creator of the R package ggplot2, is a fan of d3.js and has suggested that the next version of ggplot2 will probably be redone on the web, likely using d3. He’s also working on a new R library that more immediately allows the two to work well together. This is great news.
He’s calling it R2D3 (named, supposedly, more at the insistence of friends who are Star Wars geeks than due to his own fandom).
(Confusingly, there were some unfounded rumors that Hadley’s next version of ggplot would be called R2D3.)
There are also a few projects to get Raphaël.js to work well with d3.js. One of them is called ‘d34raphael’. Another, a bit more ambitious, is a custom build of d3 powered by Raphaël. Awesome! Guess what it’s called? R2D3.
It’s not that uncommon for two open source libraries to have the same name, but these libraries both address the needs of a pretty niche audience. They both work with d3.js, but one extends “upstream” towards the data and the other extends “downstream” toward the graphics. It’s more than conceivable that someone would want to use all of them at the same time: R, R2D3, d3, R2D3, and Raphaël.
Apparently the two authors, Mike Hemesath and Hadley Wickham, didn’t know about each other’s projects when they named their own. If both projects are adopted widely, it will be interesting to see if either of them eventually decides to change names.
On a new Ubuntu machine, I needed to use NLTK. This serves as a quick reference for myself, and maybe you’ll find it useful as well.
> mkdir bananaproject
> cd bananaproject
> virtualenv ENV
> source ENV/bin/activate
I created a new folder for my project, and a new virtualenv for it. Virtualenv comes in damn handy for managing portability and dependencies on multiple python projects. The last command activated the virtual environment, so subsequent commands are now taking place within it.
> pip install yolk
> yolk -l
I installed yolk, which lists what I have installed and ready to use in my virtualenv. I used it to check dependencies before installing NLTK.
> sudo apt-get install python-numpy
> pip install pyyaml
Numpy is a package I’m okay with having installed system-wide, not just in this virtualenv. Pyyaml on the other hand I installed just for this project.
> mkdir ENV/src
> cd ENV/src
> wget http://nltk.googlecode.com/files/nltk-2.0.1rc1.zip
> unzip nltk-2.0.1rc1.zip
> cd nltk-2.0.1rc1
> python setup.py install
Self-explanatory. Of course, the link to NLTK will soon be outdated; the latest can be found at http://www.nltk.org/download. The virtualenv was activated while I ran the install.
At this point I thought I was done, but when I started ipython and tried to import nltk, I got an import error. I needed to tell ipython about the python executable I’m using and the changes to sys.path.
This is only necessary because of the way I set up my virtualenv and the order in which I installed things. A simple alternative is to create the virtualenv with the --no-site-packages option, and then install ipython afresh for that project.
This post came in handy: http://blog.ufsoft.org/2009/1/29/ipython-and-virtualenv. However, it was written in 2009, and I’m using ipython 0.12. A slight variation is necessary for ipython >= 0.11.
> vi ~/.ipython/virtualenv.py
[ Use this, or a variation thereof: https://gist.github.com/1176035 ]
> ipython profile create
The profile create command tells ipython to create default config files, which we can then play with. The command will tell you where the ipython_config.py file has been created, and in it we need to find this line:
c.InteractiveShellApp.exec_files = []
and change it to:
c.InteractiveShellApp.exec_files = ['virtualenv.py']
Now whenever I start ipython, the virtualenv.py script will be executed, which will set my sys.path variables the way I need them. I can now happily import numpy and import nltk in ipython.
In : import nltk
In : phrase = nltk.word_tokenize("That was easy.")
In : nltk.pos_tag(phrase)
Out: [('That', 'DT'), ('was', 'VBD'), ('easy', 'JJ'), ('.', '.')]
Plot the sigmoid function.
The result confused me until I noticed the tick values on my x-axis, which matplotlib selected unintelligently (thanks, Sasha). The plot looks right once we simply correct the plot domain.
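Correcting the domain explicitly might look like this (a sketch assuming NumPy and matplotlib; the filename and domain are my own choices):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# an explicit, symmetric domain avoids misleading auto-selected ticks
x = np.linspace(-10, 10, 500)
plt.plot(x, sigmoid(x))
plt.xlabel("x")
plt.ylabel("sigmoid(x)")
plt.savefig("sigmoid.png")
```

With the domain pinned down, the familiar S-shape centered at x = 0 appears, and the tick labels can no longer mislead.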
I would like to know more about how different plotting packages, such as matplotlib and ggplot2 in R, select default values for xrange and yrange.