Category: Coding

Data.gov, Open Government Platform, and Cancer data sets

After attending a lecture at University of San Francisco by Jonathan Reichental (@Reichental) on the use of open data in the public sector, I started poking around some data sets available at Data.gov.

Data.gov is pretty impressive. The site was established in 2009 by Vivek Kundra, the first person to hold the title of Federal CIO of the United States, appointed by Barack Obama. It is adding data sets rapidly: sixty-four thousand have been added in the last year alone.

Interestingly, there is an open-source version of data.gov itself, called the Open Government Platform. It is built on Drupal and available on GitHub. The initiative is spearheaded by the US and Indian governments to help promote transparency and citizen engagement by making data widely and easily available. Awesome.

The Indian version is data.gov.in. There is also a Canadian version, a Ghanaian version, and many other countries are following suit.

I started mucking around and produced a plot of age-adjusted urinary bladder cancer occurrence by state.

  • The data was easy to find. I downloaded it without declaring who I am or why I wanted the data, and I didn’t have to wait for any approval.
  • The data was well-formatted and trivially easy to digest using Python pandas.
  • IPython notebook and data source available below.
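
For the curious, the pandas workflow boils down to something like this (a rough sketch; the file name and column names here are stand-ins, not necessarily what the actual data.gov CSV uses):

import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical file and column names; the actual data.gov CSV may differ.
df = pd.read_csv("urinary_bladder_cancer_by_state.csv")

# Sort states by their age-adjusted rate and draw a horizontal bar chart.
rates = df.set_index("State")["Age-Adjusted Rate"].sort_values()
rates.plot(kind="barh", figsize=(8, 12))
plt.xlabel("Age-adjusted rate per 100,000")
plt.tight_layout()
plt.show()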

[Plot: age-adjusted urinary bladder cancer rates by state, from the data.gov data set]

If you’re interested in this data, you should also check out http://statecancerprofiles.cancer.gov/ , which I didn’t know existed until I started writing this post. I was able to retrieve this map from there:

[Map: US state map of bladder cancer rates, from statecancerprofiles.cancer.gov]


Mean absolute percentage error (MAPE) in Scikit-learn

On Cross Validated, the Stack Exchange site for statistics, someone asks:

How can we calculate the Mean absolute percentage error (MAPE) of our predictions using Python and scikit-learn?

Mean absolute percentage error (MAPE) is a metric used to evaluate the accuracy of a regression model’s predictions. Read my answer on CV here:

http://stats.stackexchange.com/questions/58391/mean-absolute-percentage-error-mape-in-scikit-learn/62511#62511
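
For reference, MAPE is nearly a one-liner with NumPy (a minimal sketch along the lines of that answer; scikit-learn has no built-in MAPE metric at the time of writing, and note that MAPE is undefined whenever a true value is zero):

import numpy as np

def mean_absolute_percentage_error(y_true, y_pred):
    # MAPE = mean(|(y_true - y_pred) / y_true|) * 100
    # Undefined when any element of y_true is zero.
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

# Example: predictions off by 10% and 5% give a MAPE of 7.5.
print(mean_absolute_percentage_error([100, 200], [110, 190]))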


Detached HEAD — a git discovery

Recently I found myself with a detached HEAD. In Git.

This was the first time I had encountered such a thing. When you check out, or commit on top of, a commit that is not attached to any branch, you have a detached HEAD situation: your commits are branchless. There is a pretty easy fix, and the solution is pretty easy to find on SO.

Check out SO: Why did git detach my head?

I retraced my steps to figure out exactly how this happened.

I created a branch (git branch newfeature; git checkout newfeature) and then cloned my repository for further work on this branch. This created an ambiguity for Git: both the clone and the original repository had a branch named newfeature. When I pulled my work from master with git pull, the commits were not attached to any branch.

The symptoms

I didn’t recognize this unfamiliar situation at first; I just noticed that I couldn’t find all those commits.

  • They weren’t visible with git log or git log newfeature.
  • git status with newfeature checked out showed a clean working directory.

With help from @ddunlop, I was finally able to view the commits with git log <hash>. I got the commit hash using git log in my cloned repo.

This is how I resolved the problem.

  1. git checkout <hash>. I checked out my most recent commit using its hash. Git informed me that I was now in a ‘detached HEAD’ state. After that it was easy: I googled the provocative “detached HEAD” message and did some learning.
  2. git checkout newfeature. I switched back to the named branch.
  3. git branch newfeature_2 6e51426cdb. I created a new branch pointing at the stranded commit, so it was no longer branchless.
  4. git merge newfeature_2. This brought the stranded commits into newfeature.
  5. git checkout master.
  6. git merge newfeature. Finally, I merged the feature branch back into master.

Then I just deleted the extra branches.

In the process, I also learned about “tracking” branches. Check out the useful SO: Switch branch without detaching head

R2D3 and other letters and numbers

Check out the alphabet soup of data web visualizations I am swimming in today.

  • R is a language and environment for statistical computing.
  • d3.js is a JavaScript library for building beautiful visualizations on the web. It binds data to document object model (DOM) elements and renders them, typically as scalable vector graphics (SVG).
  • ggplot2 is a graphing library for R, developed by Hadley Wickham.
  • Raphaël.js is a JavaScript library for working with vector graphics. (It’s different from d3.js: Raphaël creates and manipulates vector graphics objects that are also DOM objects, while d3 is primarily designed to tie data directly to DOM objects. There is some overlap, but they serve different purposes.)

The first three are pretty powerful and, if they are not already, are fast becoming critical parts of the data toolkit. The last is a promising newcomer, worth keeping an eye on.

So far so good. If you’re a data nerd, you probably already know all this. Stick with me.

It turns out that these libraries, which do slightly different but related things and do them well, would work very well together. They’re not tightly integrated (yet), but there are several efforts to make it so.

Hadley Wickham, creator of the R package ggplot2, is a fan of d3.js and has suggested that the next version of ggplot2 will probably be redone for the web, likely using d3. He’s also working on a new R library that more immediately allows the two to work well together. This is great news.

He’s calling it R2D3 (named, supposedly, more at the insistence of friends who are Star Wars geeks than due to his own fandom).


(Confusingly, there were also some unfounded rumors that Hadley’s next version of ggplot2 itself would be called R2D3.)

There are also a few projects to get Raphaël.js to work well with d3.js. One of them is called ‘d34raphael’. Another, a bit more ambitious, is a custom build of d3 powered by Raphaël. Awesome! Guess what it’s called? R2D3.

It’s not that uncommon for two open-source libraries to have the same name, but these libraries both address the needs of a pretty niche audience. They both work with d3.js, but one extends “upstream” towards the data and the other extends “downstream” toward the graphics. It’s more than conceivable for someone to want to use all of them at the same time: R, R2D3, D3, R2D3, and Raphaël.

Apparently the two authors, Mike Hemesath and Hadley Wickham, didn’t know about each other’s projects when they named their own. If both projects are adopted widely, it will be interesting to see if either of them eventually decides to change names.


virtualenv for an nltk project with ipython configuration

On a new Ubuntu machine, I needed to use NLTK. This serves as a quick reference for myself, and maybe you’ll find it useful as well.

> mkdir bananaproject
> cd bananaproject
> virtualenv ENV
> source ENV/bin/activate

I created a new folder for my project, and a new virtualenv for it. Virtualenv comes in damn handy for managing portability and dependencies across multiple Python projects. The last command activated the virtual environment, so subsequent commands take place within it.

> pip install yolk
> yolk -l

I installed yolk, which lists what is installed and ready to use in my virtualenv. I used it to check dependencies before installing NLTK.

> sudo apt-get install python-numpy
> pip install pyyaml

NumPy is a package I’m okay with having installed system-wide, not just in this virtualenv. PyYAML, on the other hand, I installed just for this project.

> mkdir ENV/src
> cd ENV/src
> wget http://nltk.googlecode.com/files/nltk-2.0.1rc1.zip
> unzip nltk-2.0.1rc1.zip
> cd nltk-2.0.1rc1
> python setup.py install

Self-explanatory. Of course, the link to NLTK will soon be outdated; the latest version can be found at http://www.nltk.org/download. The virtualenv was active while I ran the install.

At this point I thought I was done, but when I started ipython and tried to import nltk, I got an import error. I needed to tell ipython about the Python executable I’m using and the changes to sys.path.

This is only necessary because of the way I set up my virtualenv and the order in which I installed things. A simpler alternative is to create the virtualenv with the --no-site-packages option, and then install ipython afresh for that project.

This post came in handy: http://blog.ufsoft.org/2009/1/29/ipython-and-virtualenv. However, it was written in 2009, and I’m using ipython 0.12; a slight variation is necessary for ipython >= 0.11.

> vi ~/.ipython/virtualenv.py
[ Use this, or a variation thereof: https://gist.github.com/1176035 ]
> ipython profile create

The profile create command tells ipython to create default config files, which we can then play with. The command will tell you where the ipython_config.py file has been created, and in it we need to find this line:
c.InteractiveShellApp.exec_files = []
and change it to:
c.InteractiveShellApp.exec_files = ['virtualenv.py']
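
For reference, the linked gist boils down to something like this (my paraphrase of the approach, not the gist verbatim): if ipython was started inside an active virtualenv, prepend that env’s site-packages to sys.path.

# ~/.ipython/virtualenv.py -- rough sketch of the approach, not the gist verbatim
import os
import sys

if 'VIRTUAL_ENV' in os.environ:
    site_packages = os.path.join(
        os.environ['VIRTUAL_ENV'],
        'lib',
        'python%d.%d' % sys.version_info[:2],
        'site-packages',
    )
    # Make packages installed in the active virtualenv importable.
    sys.path.insert(0, site_packages)
    print('Prepended to sys.path: %s' % site_packages)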

Now whenever I start ipython, the virtualenv.py script will be executed, which will set my sys.path variables the way I need them. I can now happily import numpy and import nltk in ipython.

In [1]: import nltk
In [2]: phrase = nltk.word_tokenize("That was easy.")
In [3]: nltk.pos_tag(phrase)
Out[3]: [('That', 'DT'), ('was', 'VBD'), ('easy', 'JJ'), ('.', '.')]


Nginx-based stack

I’ve been working on moving this blog to a different server, and simultaneously migrating some other sites I spend time on.

In an attempt to do something new and to create an environment that provides some much needed flexibility, I’m putting some extra time and energy into selecting a server and technology stack. Here are the highlights:

  • nginx instead of Apache. Nothing against Apache, honestly; LAMP-ish stacks have been my M.O. for a long while. Nginx will, however, provide many benefits: first, exploring a completely new web server will improve my understanding of how web servers work; second, I suspect using Nginx with uWSGI will make it easier to deploy my increasing number of Python + virtualenv + (some framework) projects; third, I run several low-traffic domains on the same box, and Apache has really been struggling with that.
  • Transfer my blog out of WP.com. I find myself wanting to do more and more with my WordPress blog that just isn’t possible with wordpress.com hosting. Having built several WP themes now, I feel nimble enough to put a custom theme together quickly. The ability to install certain currently inaccessible plugins will be very satisfying and I want to play around with writing some of my own plugins as well.
  • Use the Natural Language Toolkit (nltk) to make Pablo more fun at thesexycow.com and to do some for-fun natural language analysis on my blog content.
  • I will stick with Linode; I’ve been happy with them in servers past. I’ll be using Ubuntu 11.04 Natty Narwhal, which has Python 2.7.1 and other impressive version numbers that Cent-#&$@#$-OS will probably get around to implementing no sooner than 2020.

So far, I have set up the server and Nginx with FastCGI, and started working on configuring WordPress and the first iteration of this blog’s theme.

sigmoid function fail

Plot the sigmoid function.

sig(u)=\frac{1}{1+e^{-u}}

Does this look sigmoidal to you?

This result confused me until I noticed (thanks, Sasha) the tick values on my x-axis, which matplotlib had selected unintelligently. The fix is simply to correct the plot domain:

xs = [0.01*x for x in range(-1000,1000)]
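
Put together, the corrected plot looks something like this (a minimal sketch, assuming plain matplotlib; the labels are mine):

import math
import matplotlib.pyplot as plt

# Sample u from -10 to 10 in steps of 0.01 so the S-curve is visible.
xs = [0.01*x for x in range(-1000, 1000)]
ys = [1.0 / (1.0 + math.exp(-u)) for u in xs]

plt.plot(xs, ys)
plt.xlabel("u")
plt.ylabel("sig(u)")
plt.show()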


I would like to know more about how different plotting packages, such as matplotlib and ggplot2 in R, select default values for xrange and yrange.
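
Matplotlib, at least, makes its choices easy to inspect and override (a small sketch; get_xlim and set_xlim are standard Axes methods):

import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([1, 2, 3], [4, 5, 6])

# See what limits matplotlib picked automatically...
print(ax.get_xlim(), ax.get_ylim())

# ...and override them explicitly if the defaults are unhelpful.
ax.set_xlim(0, 10)
ax.set_ylim(0, 10)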