A blog post by Vytautas Jančauskas describes the implementation of Andrews curves in Python pandas. These curves, introduced in David F. Andrews' 1972 paper, let one visualize high-dimensional data through a transformation.
It is now trivial to generate such a plot from your pandas DataFrame:

import pandas as pd
from pandas.plotting import andrews_curves

df = pd.DataFrame(some_data, columns=['y', 'x1', 'x2', 'x3', 'x4', 'x5'])
andrews_curves(df, 'y')
I think this is a powerful and exciting tool that could be very insightful for exploratory data analysis.
I read this paper that expounds upon some of Andrew’s ideas:
César García-Osorio, Colin Fyfe, “Visualization of High-Dimensional Data via Orthogonal Curves” (2005).
After playing around and reading a bit, I came up with some ideas for future work on this new feature:
Labels and ticks
In the above example plot, which I generated myself, the xticks are at multiples of π. That is sensible, because what we are looking at is the projection of the data onto a vector of Fourier series terms over the range −π < t < π. The current pandas implementation, however, places xticks at integer values, and it provides no axis labels. I should create a PR for this.
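A sketch of what that fix might look like from the user's side, using the Axes object that pandas' andrews_curves returns (the data here is made up purely for illustration):

```python
import numpy as np
import pandas as pd
from pandas.plotting import andrews_curves

# Illustrative data: one class column 'y' plus four numeric features.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(30, 4)), columns=['x1', 'x2', 'x3', 'x4'])
df['y'] = ['a', 'b'] * 15

ax = andrews_curves(df, 'y')

# Ticks at multiples of pi, matching the (-pi, pi) domain of the curves.
ax.set_xticks([-np.pi, -np.pi / 2, 0, np.pi / 2, np.pi])
ax.set_xticklabels([r'$-\pi$', r'$-\pi/2$', '0', r'$\pi/2$', r'$\pi$'])
ax.set_xlabel('t')
ax.set_ylabel('f(t)')
```

The same set_xticks/set_xticklabels calls are essentially what a pandas PR would bake into the plotting function itself.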
The shape of Andrews curves is highly influenced by the order of variables in the transformation, so in the pandas implementation, the order of the columns is significant.
Here are two plots of the air quality data set — the only difference is column order:
[I've added the code I used to generate these plots at the bottom of this section.]
One might argue this difference does not matter: if all you are doing is checking for structure in a dataset, then the particular shape of that structure is unimportant (compare the airquality Andrews plot to the one with random data above). But shapes can in fact be very important when you are using a visualization to develop intuition about numbers. Andrews curves can also be informative beyond a binary "yes there is" / "no there isn't" verdict on structure, and in that case the column order becomes analogous to bin width in a histogram.
Here is the same “column-order experiment” as above, this time for the mtcars dataset:
Surprised? Me too. For the sake of reproducibility, here are the column orders for the three mtcars plots:
['qsec', 'vs', 'am', 'gear', 'carb', 'mpg', 'cyl', 'disp', 'hp', 'drat', 'wt']
['wt', 'qsec', 'vs', 'am', 'gear', 'carb', 'mpg', 'cyl', 'disp', 'hp', 'drat']
['drat', 'wt', 'qsec', 'vs', 'am', 'gear', 'carb', 'mpg', 'cyl', 'disp', 'hp']
This is an inherent weakness of Andrews curves, and no fault of pandas; the people who provide powerful tools cannot be responsible for every mistake users might make. However, to continue the analogy made earlier: anybody creating a tool to generate histograms will provide a way to adjust the bin sizes. In the same way, this sensitivity should probably be acknowledged somehow, for example by allowing the user to specify a column order when creating an Andrews plot, or by generating several plots, each with a random column order.
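A user-side sketch of both suggestions (the function name and interface here are my own invention, not a pandas API):

```python
import random

import pandas as pd
from pandas.plotting import andrews_curves


def andrews_with_order(df, class_column, column_order=None, seed=None):
    """Plot Andrews curves with an explicit or randomly shuffled column order.

    `column_order` lists the feature columns (excluding the class column);
    if None, a random permutation is used. Returns the matplotlib Axes.
    """
    features = [c for c in df.columns if c != class_column]
    if column_order is None:
        rng = random.Random(seed)
        column_order = features[:]
        rng.shuffle(column_order)
    # Reorder the frame before handing it to pandas; andrews_curves
    # consumes the columns in the order they appear.
    return andrews_curves(df[[class_column] + list(column_order)], class_column)
```

Calling this several times with different seeds gives the "several random orders" variant suggested above; passing an explicit `column_order` gives the user-specified variant.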
Andrews curves have other weaknesses too, such as biasing some frequencies over others. Variations exist to address these weaknesses, and there are other visualizations built on the same principle of transforming high-dimensional data. These might be worth exploring in more detail, but I'm out of time for now; see the García-Osorio paper for more details.
The code used to generate some of the plots in this post:
/cc @orbitfold @tacaswell @jtratner
Update (2013 Oct 30):
In the above column order test, I was simply cycling the column order, not shuffling them. In the plot below, I’m rearranging the columns completely using random.shuffle(). Also included as a bonus is a side-by-side comparison with a Parallel Coordinates Plot (PCP).
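A sketch of that experiment (with stand-in data, since I'm not reproducing the mtcars/airquality loading here):

```python
import random

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from pandas.plotting import andrews_curves, parallel_coordinates

# Illustrative stand-in data; the post used mtcars and airquality.
rng = np.random.default_rng(1)
df = pd.DataFrame(rng.normal(size=(32, 5)),
                  columns=['mpg', 'cyl', 'disp', 'hp', 'wt'])
df['name'] = ['a', 'b'] * 16

cols = [c for c in df.columns if c != 'name']
random.shuffle(cols)              # a full random rearrangement, not a cycle
shuffled = df[['name'] + cols]

# Side-by-side: Andrews curves vs. a Parallel Coordinates Plot.
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
andrews_curves(shuffled, 'name', ax=ax1)
parallel_coordinates(shuffled, 'name', ax=ax2)
ax1.set_title('Andrews curves')
ax2.set_title('Parallel coordinates')
```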
One of my categories on this blog is "reading-log", which I intended as a way to highlight one of the books, articles, or papers that I've read recently. I've been very negligent about this, but fortunately it's one of those situations where it's not too late to catch up.
I keep notes (on Evernote) with the date that I read the material and thoughts that it inspired. So I can still go back and post them retroactively. I can even artificially date the WordPress Post. I’ll be trying to do some of that over the next few days. If all goes well, subscribers will see a flurry of activity (which hopefully doesn’t chase any of them away).
I’ve been reading a lot these days. My reading workflow is always evolving, but I’ve got a system that seems to be working pretty well, and as a result I find it easier to read more and be efficient.
I use Feedly, to which I switched after the days of Google Reader. I currently have 120+ sources (web feeds) in six or seven categories. I am picky with my subscriptions, and feeds that feel like clutter are weeded out (I have a separate category for feeds "on probation", and I'll skip those articles on busy days). After years of this, I find a lot of value and entertainment in my feeds.
I skim these web feeds on my phone using Feedly's Android app. This is fast consumption, and easy to do when taking a break or during in-between moments. Anything requiring deeper attention or more time, I save for later using Pocket.
In addition to web feeds via Feedly, my Pocket queue is populated by tweets, web browsing, active research, and things-people-send-to-me. The ability to easily save anything for later means I have fewer interruptions and distractions. There is a separate time and place for consuming all that material. This makes me more efficient.
When researching on a particular subject, for personal interest or for a client, I read papers and “heavier” articles. I have a Dropbox folder where I keep this research material, and it stays there even after I’ve read it, for future reference. I’ll often transfer unread articles from this folder to my Kindle; I always keep the ol’ ebook filled with a collection of unread novels, non-fiction books, and dozens of research papers. This is particularly wonderful when traveling, as I am now.
We all have so many sources of reading material, and there are a lot of tools to help us manage everything. I've shared only the most significant of the tools I use (and hinted at the taxonomies I've invented to organize things), with which I'm able to read, watch, and listen to a lot more material without feeling overwhelmed or constantly interrupted.
Keep an eye on this reading-log WordPress category — I’ll be doing that back-posting and perhaps you’ll find we have common reading interests.
“Five Balltree Construction Algorithms.” (1989).
Stephen M. Omohundro
I browsed this paper after reading several blog posts and articles about balltree-related algorithms, including:
- “Damn Cool Algorithms, Part 1: BK-Trees.” Nick Johnson. http://blog.notdot.net/2007/4/Damn-Cool-Algorithms-Part-1-BK-Trees
- “VP trees: A data structure for finding stuff fast.” Stephen Hanov. http://stevehanov.ca/blog/index.php?id=130
These and Omohundro's paper are worthwhile reading. Even if one cannot directly apply these data structures, they are still worth the read. Reading them reminded me that:
- A concept that is intuitively straightforward can often be impractical or impossible to implement for a particular application.
- Data structures can be designed and built specifically to optimize an operation required by your algorithm.
- That curse of dimensionality, god damnit.
- There are many really cool and clever algorithms that you’ll never be able to apply in your domain.
Balltrees and related structures are hierarchical, tree-like representations. They place data points in a tree and provide instructions for traversing it in a way that optimizes some expected future operation. The clearest application is nearest-neighbor search. They also give you an excuse to sensibly use terms like "hyper-spheres" and "leaf balls".
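For a quick taste of the nearest-neighbor use case, scikit-learn ships a ball tree implementation; a minimal query looks like this (the data is made up for illustration):

```python
import numpy as np
from sklearn.neighbors import BallTree

# 1000 random points in 8 dimensions.
rng = np.random.default_rng(0)
points = rng.random((1000, 8))

# Build the hierarchy of "balls" once...
tree = BallTree(points, leaf_size=40)

# ...then query it: the 3 nearest neighbors of the first 5 points.
dist, idx = tree.query(points[:5], k=3)
```

Since the query points are themselves in the tree, the nearest neighbor of each is itself, at distance zero; the payoff is that each query traverses the tree rather than scanning all 1000 points.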
Construction times for these structures tend not to scale well; think O(N^3). A lot of effort goes into improving and optimizing construction, but direct application of these structures to large data sets is not tractable.
Relatedly: k-d trees, Burkhard-Keller (BK) trees, and VP-trees, among others.
"Overlapping Experiment Infrastructure at Google"
Diane Tang et al., published in the KDD 2010 proceedings
This paper describes the thought process and concepts behind the extensive testing philosophy and infrastructure at Google.
Reading log: This is a very useful paper I read a while ago and dug up again for a client in June. The concepts I learned here seem to emerge intermittently when meeting with clients.
I think this should be required reading for anyone getting started with overlapping testing infrastructures (those that manage multiple tests at the same time). Lean Analytics!
Key take-aways include:
- the concept of domains, subsets and layers to partition parameters and design infrastructure
- binary push vs data push; separating testing parameters from program code.
- Canary experiments and defining expected range of monitored metrics
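As a toy illustration of the layer idea (the salts, bucket count, and layer names here are my own invention, not Google's actual diversion scheme):

```python
import hashlib

NUM_BUCKETS = 1000


def bucket(user_id: str, layer_salt: str) -> int:
    """Divert a user into a bucket within one layer.

    Each layer hashes with its own salt, so bucket assignments in
    different layers are effectively independent: a user sits in at
    most one experiment per layer, while experiments in different
    layers overlap on the same traffic.
    """
    h = hashlib.md5(f"{layer_salt}:{user_id}".encode()).hexdigest()
    return int(h, 16) % NUM_BUCKETS


# The same user lands in independent buckets in two different layers.
ui_bucket = bucket("user-42", layer_salt="ui-layer")
ranking_bucket = bucket("user-42", layer_salt="ranking-layer")
```

The key property is determinism per layer: the same user always gets the same bucket in a given layer, which is what makes consistent experiment membership possible.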
My concerns (i.e. interests or applications in mind) with re-reading this paper for my client were:
- Applying overlapping infrastructure to A/B testing vs. Multi-Armed Bandit testing,
- The particulars of having a shared control group
- Using such an infrastructure to test and select machine learning algorithm hyperparameters
“MAD Skills: New Analysis Practices for Big Data”
Cohen, et al.
Proceedings of the VLDB Endowment, Volume 2 Issue 2, August 2009
Reading log: I’m not sure when I read this paper, so the back-dating is pretty much arbitrary.
Abstract from the paper:
As massive data acquisition and storage becomes increasingly affordable, a wide variety of enterprises are employing statisticians to engage in sophisticated data analysis. In this paper we highlight the emerging practice of Magnetic, Agile, Deep (MAD) data analysis as a radical departure from traditional Enterprise Data Warehouses and Business Intelligence. We present our design philosophy, techniques and experience providing MAD analytics for one of the world's largest advertising networks at Fox Audience Network, using the Greenplum parallel database system. We describe database design methodologies that support the agile working style of analysts in these settings. We present data-parallel algorithms for sophisticated statistical techniques, with a focus on density methods. Finally, we reflect on database system features that enable agile design and flexible algorithm development using both SQL and MapReduce interfaces over a variety of storage mechanisms.
I was concerned that this paper would turn into a white paper or technical sales piece on joint hardware-software product offerings by Greenplum. It does present a Greenplum case study: the Greenplum database as used by their client Fox Audience Network.
- MAD is Magnetic, Agile, Deep data analysis
- The authors define the MAD acronym as a re-imagination of the data warehouse concept such that:
- Magnetic: encourages (attracts) new data sources, has reduced sensitivity to cleanliness of data sources
- Agile: logical and physical contents of the database can evolve and adapt rapidly
- Deep: Avoid BI rollups and sampling to serve more demanding statistical analyses.
- Presented as an alternative to “traditional Enterprise Data Warehouses and Business Intelligence.”
- Emphasis is on moving data to a data warehouse rapidly, and using a staged approach to clean and integrate the new data.
Provides background / definitions: OLAP, data cubes, common statistical systems, parallel processing paradigms, some statistical concepts, tf-idf analysis, ordinary least squares curve fitting, etc. Then it basically just states that all of this is possible in a fast, dynamic fashion using Greenplum technology.
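The data-parallel OLS idea is worth making concrete: each data partition contributes only its sufficient statistics X'X and X'y, which can be summed across partitions before a single solve. Here is a minimal Python sketch of that idea (my own illustration, not the paper's Greenplum/SQL implementation):

```python
import numpy as np


def ols_data_parallel(chunks):
    """Fit ordinary least squares over an iterable of (X, y) chunks.

    Each chunk contributes only its sufficient statistics X'X and X'y,
    so the accumulation can run independently on each data partition
    and be combined with a simple sum: the essence of data-parallel OLS.
    """
    xtx, xty = None, None
    for X, y in chunks:
        xtx = X.T @ X if xtx is None else xtx + X.T @ X
        xty = X.T @ y if xty is None else xty + X.T @ y
    # One small solve at the end, independent of how the data was split.
    return np.linalg.solve(xtx, xty)
```

Splitting the same data into a different number of chunks gives the same coefficients, which is exactly why the computation maps cleanly onto a parallel database.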
I skimmed rather than read this paper. It felt like at least a review of some important concepts, but I'm not sure I got anything out of this read.
I've been increasingly interested in this subject: given a stream of data, a time series such as a periodic measurement from a sensor, how do we define and identify anomalous values quickly and efficiently?
[Update: Check out this August 11th post by Ben Lorica, focusing on IT Ops tools in this space.]
Michèle Basseville has written several papers on the subject which I found very helpful. These two were among the first I read, in February, while researching for a new client.
- “Statistical methods for change detection.” (2002).
- "Detecting Changes in Signals and Systems: A Survey." Automatica, Vol. 24, No. 3, pp. 309-326, 1988
Her approach involves two major steps. First, generate "residuals" from the signal; these are defined by three properties: residuals should be close to zero under ambient (normal) conditions, insensitive to noise, and sensitive to faults (anomalies). Second, evaluate the residuals using one or more previously designed decision rules (stopping conditions).
Basseville defines multiple criteria for designing detection algorithms, which I found very useful. Different criteria take priority in different applications, and they are often opposing or mutually exclusive to implement. An obvious example is balancing false positives against false negatives; another tradeoff is between the mean time between false alarms and the delay in fault detection. She draws a distinction between off-line and on-line change detection, and the design differences between algorithms in each case.
Some of the ingredients he uses and discusses include:
- likelihood ratio and cumulative sum tests
- the Page-Hinkley Stopping Rule
- using local approaches and moving windows to reduce computation costs.
- spectral properties of the incoming signal
- the Cumulative Sum Control Chart (CUSUM) by E. S. Page: http://en.wikipedia.org/wiki/Cusum
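As a concrete taste of one ingredient, here is a minimal one-sided CUSUM in the spirit of Page's rule (the drift and threshold parameters are illustrative choices of mine, not values from the papers):

```python
def cusum(samples, target, drift=0.5, threshold=5.0):
    """One-sided CUSUM (Page's rule) for detecting an upward mean shift.

    Accumulates deviations above `target + drift`; the running sum is
    clipped at zero so noise under normal conditions cannot build up.
    Returns the index at which the sum first exceeds `threshold`
    (the alarm), or None if no change is detected.
    """
    s = 0.0
    for i, x in enumerate(samples):
        s = max(0.0, s + (x - target - drift))
        if s > threshold:
            return i
    return None
```

Note how this fits the two-step scheme above: `x - target` plays the role of the residual (near zero under normal conditions), and the thresholded cumulative sum is the decision rule.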
If you are interested in this subject, I imagine Basseville is already a familiar name. Following her works and the papers that cite them is a deep dive straight into the subject. I find it all fascinating and hope to get many chances to use these techniques in future projects.
Read this paper in November 2012 (backdated post).
Compare this paper to many similar ones with general advice or observations on Machine Learning, such as the one by Andrew Ng.
These are the useful things Pedro Domingos chooses to highlight in this paper:
- Selection of an ML algorithm is simpler if you understand the three components: representation, evaluation (scoring function), optimization.
- Generalization is the goal. Use cross-validation, and measure performance on held-out test data, not the training data.
- Data and algorithms must be combined with domain knowledge and experience for good results. This is a good thing. Anti-Kaggle.
- Understand the bias vs. variance tradeoff in overfitting. Use techniques like regularization to combat it.
- Curse of dimensionality
- Theoretical guarantees in ML algorithms are not a criterion for practical decisions.
- Feature engineering is the most important contributor to success/failure of ML.
- More data > algorithmic sophistication, but adds scalability issues. Try the simplest learners first.
- Use ensemble methods. More models are better.
- “… Simpler hypotheses should be preferred because simplicity is a virtue in its own right, not because of a hypothetical connection with accuracy. This is probably what Occam meant in the first place.”
- Just because something has a representation doesn’t mean it can be learned.
- Correlation is not Causation.
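The generalization point above is easy to make concrete with a hand-rolled k-fold cross-validation loop (a sketch; the function names and interface are mine):

```python
import numpy as np


def k_fold_score(model_fit, model_score, X, y, k=5, seed=0):
    """Estimate generalization performance by k-fold cross-validation.

    `model_fit(X, y)` returns a fitted model; `model_score(model, X, y)`
    returns its score on data. Scoring only on folds the model never
    saw during fitting is the point: training-set scores are
    optimistically biased.
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    folds = np.array_split(idx, k)
    scores = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        model = model_fit(X[train], y[train])
        scores.append(model_score(model, X[test], y[test]))
    return float(np.mean(scores))
```

Any fit/score pair can be plugged in, from a constant predictor to a full learner; the averaging over held-out folds is what makes the estimate honest.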