Andrew’s Curves now free with python pandas (Reading log)

A blog post by Vytautas Jančauskas talks about the implementation of Andrew’s Curves in Python Pandas. These curves, introduced in David Andrew’s paper in 1972, allow one to visualize high dimensional data through transformation.

It is now trivial to generate such a plot from your pandas dataframe:

import pandas as pd
df = pd.Dataframe(some_data, columns = ['y', 'x1', 'x2', 'x3', 'x4', 'x5'])
pd.tools.plotting.andrews_curves(df, class_column='y')

I think this is a powerful and exciting tool that could be very insightful for exploratory data analysis.

SO_andrewscurves_03

Example: Andrews Plot of randomly generated data

I noticed a bug in the pandas implementation, which resulted in a Stack Overflow question and a pull request to pandas. The bug was corrected with impressive speed.

I read this paper that expounds upon some of Andrew’s ideas:
César García-Osorio, Colin Fyfe, “Visualization of High-Dimensional Data via Orthogonal Curves” (2005).

After playing around and reading a bit, I came up with some ideas for future work on this new feature:

Labels and ticks

In the above example plot, which I generated, the xticks are at multiples of π — which is sensible because what we are looking at is the projection of data onto the vector of Fourier series on the range (−π < t < π) . But the current pandas implementation has xticks at integer multiples. It also doesn’t provide axis labels. I should create a PR for this.

Column order

The shape of Andrew’s curves are highly influenced by the order of variables in the transformation. So in the pandas implementation, the order of the columns is significant.

Here are two plots of the air quality data set — the only difference is column order:
[Added the code I used to generate these plot at the bottom of this section.]

SO_andrewscurves_04_column_order

Andrew’s Curves on the same dataset (airquality), but with changed column order.

One might argue this difference does not matter… that if all you are doing is checking for structure in a dataset, then the shape of that structure is not important (compare the airquality Andrew’s Plot to the one with random data above). But in fact shapes can be very important when you are using visual data to develop an intuition about numbers. Also, Andrew’s Curves can be very informative beyond a binary “yes there is” / “no there isn’t” decision with respect to structure, and in that case the column order here could become analogous to bin widths in a histogram.

Here is the same “column-order experiment” as above, this time for the mtcars dataset:

SO_andrewscurves_06_column_order

Andrew’s Curves on the same dataset (mtcars), but with changed column order.

Surprised? Me too. For the sake of reproducibility, here are the column orders for the three mtcars plots:

['qsec', 'vs', 'am', 'gear', 'carb', 'mpg', 'cyl', 'disp', 'hp', 'drat', 'wt']
['wt', 'qsec', 'vs', 'am', 'gear', 'carb', 'mpg', 'cyl', 'disp', 'hp', 'drat']
['drat', 'wt', 'qsec', 'vs', 'am', 'gear', 'carb', 'mpg', 'cyl', 'disp', 'hp']

This is an inherent weakness with Andrew’s Curves, and no fault of pandas. The people that provide powerful tools cannot be responsible for mistakes that users might make. However, going along with analogy made earlier, anybody creating a tool to generate histograms will provide the capability to adjust bin sizes. In the same way, this vulnerability might need to be acknowledged in some way: by, for example, allowing the user to specify a column order when creating an Andrews Plot, or by allowing the user to generate several plots each with a random column order.

Other Plots

Andrew’s Curves also have other weaknesses, such as biasing some frequencies over others. Variations exist to address these weaknesses, and there are other visualizations built on the principle of transforming high-dimensional data. These might be worth exploring in more detail, but I’m out of time for now. See the paper by García-Osorio for more details.

The code used to generate some of the plots in this post:


/cc @orbitfold @tacaswell @jtratner


Update (2013 Oct 30):
In the above column order test, I was simply cycling the column order, not shuffling them. In the plot below, I’m rearranging the columns completely using random.shuffle(). Also included as a bonus is a side-by-side comparison with a Parallel Coordinates Plot (PCP).

AndrewsCurves_ss01_vsPCP_shuffle_columns

On the left are Andrew’s Curves, and the right column of figures are Parallel Coordinate Plots. Each row has a different column order. [Used the mtcars dataset with ‘gear’ as the class column. ]

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s