
UC Berkeley BIDS Launch and Consilience

Yesterday I attended the launch of the University of California, Berkeley Institute for Data Science (BIDS). The Moore and Sloan Foundations announced a five-year, $37.8 million contribution to kick-start this Institute, which will be the third of its kind in the country; the other two are at the University of Washington and NYU. The Institute will open its physical space in 2014, with some pretty nice real estate inside the Doe Memorial Library.

[Image: University of California, Berkeley logo (Photo credit: Wikipedia)]

I am pretty enthusiastic to have this Institute so close to home. There will be great opportunities to attend events and take advantage of whatever resources are made available to the community at large (I’m not a student at Cal). More than that, I would be interested in contributing my own time, or enabling a collaboration with The Data Guild, in whatever way possible, to advance the local data science community through UC Berkeley.

[Photo: Packed house at the BIDS launch event]

The launch event consisted of talks and presentations by many of the people involved, including Cal Chancellor Nicholas Dirks, the director of BIDS (and Nobel laureate) Saul Perlmutter, Tim O’Reilly, and Peter Norvig of Google fame. There were also interesting talks about academic data science projects currently in progress at the University. 

A key idea, one that seemed to form a common thread across all the talks, was that of consilience. The term was popularized by EO Wilson in his 1998 book of the same name, in which he talks about disciplines (the hard sciences, the social sciences, and the humanities) moving closer to each other. Part observation and part projection, Wilson argued that some of this bridging between disciplines would be driven by advances in technology and computation.

In the data science context, this shrinking of gaps between previously distinct communities and cultures is often observed between the scientific/academic and the commercial/industrial communities, two groups which historically have had very different objectives and approaches. We have seen in recent years that this is changing rapidly. Joshua Bloom noted in the panel discussion at the end of the evening that they are still quite separate, and likely will always be separate, but that they are undeniably much closer together than they have been in the past.

The talks at the BIDS launch event went beyond this common observation, though. Several mentioned the meeting of the hard sciences with the social sciences, and the inter-disciplinary collaborations enabled by data science. They talked about the benefit of learning to think about problems in new, more data-centric ways, and how such a data-driven approach is methodology-centered rather than domain-specific. They specifically described how this shift towards methodology would create new types of specialists who can operate successfully across many disciplines. They even described a shift in cultures, harking directly back to EO Wilson, and back to CP Snow's "Two Cultures" argument.

Wonderful, and appropriate, that the launch of a new institute of data science should bring together so many bright people from a broad array of backgrounds, and create an opportunity for these philosophical reflections. The next few decades are going to be a very exciting time, when we get to observe, and be part of, the contribution that data science is making to the unity of knowledge.


Andrews Curves now free with python pandas (Reading log)

A blog post by Vytautas Jančauskas talks about the implementation of Andrews curves in Python pandas. These curves, introduced in David Andrews' 1972 paper, allow one to visualize high-dimensional data through a transformation to a curve.
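For reference, the transformation maps each observation x = (x_1, x_2, x_3, ...) to a finite Fourier series and plots the resulting curve over −π < t < π:

f_x(t) = \frac{x_1}{\sqrt{2}} + x_2 \sin(t) + x_3 \cos(t) + x_4 \sin(2t) + x_5 \cos(2t) + \cdots

Each variable is attached to a different basis function, which is also why the column order changes the shape of the curves (more on that below).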

It is now trivial to generate such a plot from your pandas dataframe:

import pandas as pd
# some_data: any array-like of shape (n_rows, 6); the class labels go in column 'y'
df = pd.DataFrame(some_data, columns=['y', 'x1', 'x2', 'x3', 'x4', 'x5'])
pd.tools.plotting.andrews_curves(df, class_column='y')

I think this is a powerful and exciting tool that could be very insightful for exploratory data analysis.

[Plot: Example Andrews plot of randomly generated data]

I noticed a bug in the pandas implementation, which resulted in a Stack Overflow question and a pull request to pandas. The bug was corrected with impressive speed.

I read this paper that expounds upon some of Andrews' ideas:
César García-Osorio, Colin Fyfe, “Visualization of High-Dimensional Data via Orthogonal Curves” (2005).

After playing around and reading a bit, I came up with some ideas for future work on this new feature:

Labels and ticks

In the above example plot, which I generated, the xticks are at multiples of π, which is sensible because what we are looking at is the projection of the data onto a Fourier-series basis over the range −π < t < π. But the current pandas implementation places the xticks at integer values. It also doesn't provide axis labels. I should create a PR for this.
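Until such a PR lands, the ticks and labels can be set manually after the fact. A minimal sketch, assuming the df with class column 'y' from the snippet above:

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# andrews_curves returns the matplotlib Axes, so we can adjust it afterwards
ax = pd.tools.plotting.andrews_curves(df, class_column='y')
ax.set_xticks(np.linspace(-np.pi, np.pi, 5))
ax.set_xticklabels([r'$-\pi$', r'$-\pi/2$', '0', r'$\pi/2$', r'$\pi$'])
ax.set_xlabel('t')
ax.set_ylabel('f_x(t)')
plt.show()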

Column order

The shape of Andrews curves is highly influenced by the order of variables in the transformation, so in the pandas implementation the order of the columns is significant.

Here are two plots of the air quality data set; the only difference is the column order.
[I've added the code used to generate these plots at the bottom of this section.]

[Plot: Andrews curves on the same dataset (airquality), with changed column order]

One might argue that this difference does not matter: that if all you are doing is checking for structure in a dataset, then the shape of that structure is not important (compare the airquality Andrews plot to the one with random data above). But in fact shapes can be very important when you are using visualizations to develop an intuition about the numbers. Also, Andrews curves can be informative well beyond a binary "yes there is" / "no there isn't" decision about structure, and in that case the column order becomes analogous to the bin widths of a histogram.

Here is the same “column-order experiment” as above, this time for the mtcars dataset:

[Plot: Andrews curves on the same dataset (mtcars), with changed column order]

Surprised? Me too. For the sake of reproducibility, here are the column orders for the three mtcars plots:

['qsec', 'vs', 'am', 'gear', 'carb', 'mpg', 'cyl', 'disp', 'hp', 'drat', 'wt']
['wt', 'qsec', 'vs', 'am', 'gear', 'carb', 'mpg', 'cyl', 'disp', 'hp', 'drat']
['drat', 'wt', 'qsec', 'vs', 'am', 'gear', 'carb', 'mpg', 'cyl', 'disp', 'hp']

This is an inherent weakness of Andrews curves, and no fault of pandas. The people who provide powerful tools cannot be responsible for mistakes that users might make. However, going along with the analogy made earlier, anybody creating a tool to generate histograms will provide the ability to adjust bin sizes. In the same way, this vulnerability might need to be acknowledged: for example, by allowing the user to specify a column order when creating an Andrews plot, or by letting the user generate several plots, each with a random column order.
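In the meantime, nothing stops you from doing that yourself by reindexing the DataFrame before plotting. A minimal sketch; andrews_in_order is a hypothetical helper, and df / class_col are assumed to already exist:

import pandas as pd

# Hypothetical helper, not part of pandas: plot Andrews curves with an explicit
# feature-column order by reindexing the DataFrame before plotting.
def andrews_in_order(df, class_col, order):
    return pd.tools.plotting.andrews_curves(df[[class_col] + list(order)],
                                            class_column=class_col)

# e.g. cycle the feature columns by one position, as in the experiment above
features = [c for c in df.columns if c != class_col]
ax = andrews_in_order(df, class_col, features[1:] + features[:1])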

Other Plots

Andrews curves also have other weaknesses, such as biasing some frequencies over others. Variations exist to address these weaknesses, and there are other visualizations built on the same principle of transforming high-dimensional data. These might be worth exploring in more detail, but I'm out of time for now. See the paper by García-Osorio and Fyfe for more details.

The code used to generate some of the plots in this post:


/cc @orbitfold @tacaswell @jtratner


Update (2013 Oct 30):
In the column-order test above, I was simply cycling the column order, not shuffling it. In the plot below, I rearrange the columns completely using random.shuffle(). As a bonus, I've included a side-by-side comparison with a Parallel Coordinates Plot (PCP).
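Roughly how such a comparison can be generated; this is a sketch that assumes the mtcars data is already loaded into a DataFrame df, with 'gear' as the class column (per the caption below):

import random
import matplotlib.pyplot as plt
import pandas as pd

cols = [c for c in df.columns if c != 'gear']
fig, axes = plt.subplots(3, 2, figsize=(12, 12))
for ax_andrews, ax_pcp in axes:
    random.shuffle(cols)                   # a fresh random column order per row
    data = df[['gear'] + cols]
    pd.tools.plotting.andrews_curves(data, class_column='gear', ax=ax_andrews)
    pd.tools.plotting.parallel_coordinates(data, class_column='gear', ax=ax_pcp)
plt.show()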

[Plot: Andrews curves (left column) vs. Parallel Coordinates Plots (right column); each row uses a different random column order. Generated from the mtcars dataset with 'gear' as the class column.]

Data.gov, Open Government Platform, and Cancer data sets

After attending a lecture at the University of San Francisco by Jonathan Reichental (@Reichental) on the use of open data in the public sector, I started poking around some data sets available at Data.gov.

Data.gov is pretty impressive. The site was established in 2009 by Vivek Kundra, the first Federal CIO of the United States, appointed by Barack Obama. It is rapidly adding data sets; sixty-four thousand data sets have been added just in the last year.

Interestingly, there is an open-source version of data.gov itself, called the Open Government Platform. It is built on Drupal and available on GitHub. The initiative is spearheaded by the US and Indian governments to help promote transparency and citizen engagement by making data widely and easily available. Awesome.

The Indian version is: data.gov.in. There is also a Canadian version, a Ghanaian version, and many other countries are following suit.

I started mucking around and produced a plot of age-adjusted urinary bladder cancer occurrence by state.

  • The data was easy to find. I downloaded it without declaring who I am or why I wanted it, and I didn't have to wait for any approval.
  • The data was well-formatted and trivially easy to digest using python pandas (a rough sketch follows below).
  • IPython notebook and data source available below.
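As a rough illustration of how little work was involved; the file name and column names below are placeholders, not the exact ones from the data.gov download:

import pandas as pd
import matplotlib.pyplot as plt

# placeholder file and column names, for illustration only
df = pd.read_csv('urinary_bladder_cancer_by_state.csv')
rates = df.set_index('State')['Age-Adjusted Rate']
rates.plot(kind='barh', figsize=(6, 12))
plt.xlabel('Age-adjusted incidence per 100,000')
plt.tight_layout()
plt.show()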

 

[Plot: Age-adjusted urinary bladder cancer incidence by state]

 

If you're interested in this data, you should also check out http://statecancerprofiles.cancer.gov/, which I didn't know existed until I started writing this post. I was able to retrieve this map from there:

[Map: Bladder cancer rates by US state, from statecancerprofiles.cancer.gov]

 

Reading workflow and backposting to reading-log

One of the categories on this blog is "reading-log", which I intended as a way to highlight a book, article, or paper that I've read recently. I've been very negligent about this, but fortunately it's one of those situations where it's not too late to catch up.

I keep notes (on Evernote) with the date that I read the material and thoughts that it inspired. So I can still go back and post them retroactively. I can even artificially date the WordPress Post. I’ll be trying to do some of that over the next few days. If all goes well, subscribers will see a flurry of activity (which hopefully doesn’t chase any of them away).

I’ve been reading a lot these days. My reading workflow is always evolving, but I’ve got a system that seems to be working pretty well, and as a result I find it easier to read more and be efficient.


I use Feedly, to which I switched after the days of Google Reader. I currently have 120+ sources (web feeds) in six or seven categories. I am picky with my subscriptions, and feeds that feel like clutter are weeded out (I have a separate category for feeds "on probation", and I'll skip those articles on busy days). After years of this, I find a lot of value and entertainment in my feeds.

I skim these web feeds on my phone using Feedly's Android app. This is fast consumption, and easy to do when taking a break or during in-between moments. Anything requiring deeper attention or more time, I save for later using Pocket.

In addition to web feeds via Feedly, my Pocket queue is populated by tweets, web browsing, active research, and things-people-send-to-me. The ability to easily save anything for later means I have fewer interruptions and distractions. There is a separate time and place for consuming all that material. This makes me more efficient.

When researching on a particular subject, for personal interest or for a client, I read papers and “heavier” articles. I have a Dropbox folder where I keep this research material, and it stays there even after I’ve read it, for future reference. I’ll often transfer unread articles from this folder to my Kindle; I always keep the ol’ ebook filled with a collection of unread novels, non-fiction books, and dozens of research papers. This is particularly wonderful when traveling, as I am now.

We all have so many sources of reading material, and there are a lot of tools to help us manage everything. I've shared only the most significant of the tools I use (and hinted at the taxonomies I've invented to organize things), with which I'm able to read, watch, and listen to a lot more material without feeling overwhelmed or constantly interrupted.

Keep an eye on this reading-log WordPress category — I’ll be doing that back-posting and perhaps you’ll find we have common reading interests.

 

Mean absolute percentage error (MAPE) in Scikit-learn

On Cross Validated, the Stack Exchange site for statistics, someone asks:

How can we calculate the Mean absolute percentage error (MAPE) of our predictions using Python and scikit-learn?

Mean Absolute Percentage Error (MAPE) is a metric used to evaluate the accuracy of regression predictions. Read my answer on CV here:

http://stats.stackexchange.com/questions/58391/mean-absolute-percentage-error-mape-in-scikit-learn/62511#62511
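For reference, the metric itself is only a few lines of NumPy. A minimal sketch (note that it is undefined when y_true contains zeros):

import numpy as np

def mean_absolute_percentage_error(y_true, y_pred):
    """MAPE in percent. Undefined when y_true contains zeros."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100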


Reading Log: "Five Balltree Construction Algorithms", Omohundro

“Five Balltree Construction Algorithms.” (1989).
Stephen M. Omohundro

I browsed this paper after reading several blog posts and articles about balltree-related algorithms, including:

  1. “Damn Cool Algorithms, Part 1: BK-Trees.” Nick Johnson. http://blog.notdot.net/2007/4/Damn-Cool-Algorithms-Part-1-BK-Trees
  2. “VP trees: A data structure for finding stuff fast.” Stephen Hanov. http://stevehanov.ca/blog/index.php?id=130

These and Omohundro's paper are worthwhile reading. Even if one cannot directly apply these data structures, there is still benefit in reading about them. Doing so reminded me that:

  • A concept that is intuitively straightforward can often be impractical or impossible to implement for a particular application.
  • Data structures can be designed and built specifically to optimize an operation (that is required by your algorithm)
  • That curse of dimensionality, god damnit.
  • There are many really cool and clever algorithms that you’ll never be able to apply in your domain.

Balltrees and related structures are hierarchical, tree-like representations of the data. They place data points in a tree and provide rules for traversing it in a way that optimizes some expected future operation. The clearest application is nearest neighbor search. They also give you an excuse to sensibly use terms like "hyper-spheres" and "leaf balls".
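As an aside (not something from the paper), scikit-learn ships a ball tree implementation, which makes the nearest neighbor use case easy to try out:

import numpy as np
from sklearn.neighbors import BallTree

X = np.random.rand(1000, 10)          # 1000 points in 10 dimensions
tree = BallTree(X, leaf_size=40)      # construction happens here
dist, ind = tree.query(X[:1], k=5)    # 5 nearest neighbors of the first point
print(ind, dist)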

Construction times for these structures don't tend to scale well: think O(N^3). A lot of effort is put into improving and optimizing construction, but direct application of these structures to large data sets is not tractable.

Relatedly: k-d trees, Burkhard-Keller (BK) trees, and VP-trees, among others.

Reading Log: “Overlapping Experiment Infrastructure at Google”, D. Tang

"Overlapping Experiment Infrastructure: More, Better, Faster Experimentation," D. Tang et al.
Published in the KDD Proceedings, 2010
http://dl.acm.org/citation.cfm?id=1835810

This paper describes the thought process and concepts behind the extensive testing philosophy and infrastructure at Google.

Reading log: This is a very useful paper I read a while ago and dug up again for a client in June. The concepts I learned here seem to emerge intermittently when meeting with clients.

I think this should be required reading for anyone getting started with overlapping testing infrastructures (those that manage multiple tests at the same time). Lean Analytics!

Key take-aways include:

  • the concept of domains, subsets, and layers to partition parameters and design the infrastructure (see the toy sketch after this list)
  • binary pushes vs. data pushes; separating experiment parameters from program code
  • canary experiments, and defining expected ranges for monitored metrics
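A toy sketch of the layering idea (my own illustration, not Google's implementation): each layer diverts traffic independently, so a user lands in at most one experiment per layer but can be in several experiments across layers.

import hashlib

# Layer names and experiment arms below are made up for illustration.
LAYERS = {
    'ui_layer':      ['ui_control', 'ui_variant_a', 'ui_variant_b'],
    'ranking_layer': ['rank_control', 'rank_variant'],
}

def assign(user_id):
    assignments = {}
    for layer, experiments in LAYERS.items():
        # hash the (layer, user) pair so diversion is independent across layers
        h = int(hashlib.md5('{}:{}'.format(layer, user_id).encode()).hexdigest(), 16)
        assignments[layer] = experiments[h % len(experiments)]
    return assignments

print(assign('user-42'))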

My concerns (i.e., the interests and applications I had in mind) in re-reading this paper for my client were:

  • Applying an overlapping infrastructure to A/B testing vs. multi-armed bandit testing,
  • The particulars of having a shared control group
  • Using such an infrastructure to test and select machine learning algorithm hyperparameters