UC Berkeley BIDS Launch and Consilience

Yesterday I attended the launch of the University of California Berkeley Institute for Data Science (BIDS). The Moore and Sloan Foundations announced a five-year, $37.8 million contribution to kick-start this Institute, which will be the third of its kind in the country. The other two are at the University of Washington and NYU. The Institute will open physically in 2014, with some pretty nice real estate inside the Doe Memorial Library.

University of California, Berkeley logo (Photo credit: Wikipedia)

I am pretty enthusiastic to have this Institute so close to home. There will be great opportunities to attend events and take advantage of whatever resources are made available to the community at large (I’m not a student at Cal). More than that, I would be interested in contributing my own time, or enabling a collaboration with The Data Guild, in whatever way possible, to advance the local data science community through UC Berkeley.

Packed house at the BIDS launch event

The launch event consisted of talks and presentations by many of the people involved, including Cal Chancellor Nicholas Dirks, the director of BIDS (and Nobel laureate) Saul Perlmutter, Tim O’Reilly, and Peter Norvig of Google fame. There were also interesting talks about academic data science projects currently in progress at the University. 

A key idea, one that seemed to form a common thread across all the talks, was that of consilience. The term was popularized by E. O. Wilson in his 1998 book of the same name, in which he talks about disciplines (the hard sciences, the social sciences, and the humanities) moving closer to each other. Part observation and part projection, Wilson’s argument held that some of this bridging between disciplines would be due to advances in technology and computation.

In the data science context, this shrinking of gaps between previously distinct communities and cultures is often observed between the scientific/academic and the commercial/industrial communities, two groups which historically have had very different objectives and approaches. We have seen in recent years that this is changing rapidly. Joshua Bloom noted in the panel discussion at the end of the evening that they are still quite separate, and likely will always be separate, but that they are undeniably much closer together than they have been in the past.

The talks at the BIDS launch event went beyond this common observation, though. Several mentioned the meeting of the hard sciences with the social sciences, and the interdisciplinary collaborations made possible through data science. They talked about the benefit of learning to think about problems in new, more data-centric ways, and how such a data-driven approach is centered on methodology rather than on a specific domain. They specifically described how this shift towards methodology would create new types of specialists who could operate successfully across many disciplines. They even described a shift in cultures, harking directly back to E. O. Wilson, and back to C. P. Snow’s “Two Cultures” argument.

It is wonderful, and appropriate, that the launch of a new institute of data science should bring together so many bright people from a broad array of backgrounds, and create an opportunity for these philosophical reflections. These next few decades are going to be a very exciting time, when we get to observe and be part of the contribution that data science is making to the unity of knowledge.

A list of public open Data Sets

I have been collecting a small list of public / open data sets for my own personal use. I have put the list online as a first entry on a new wiki. You can check it out here:

http://eda.amanahuja.me/PublicDataSets

A couple of other comments, for the curious:

  • The wiki is also tied to another domain, so you can see the same page from http://eda.fenristech.com/PublicDataSets. This is an unresolved internal conflict. Suggestions welcome. 
  • The purpose of this wiki isn’t quite defined, but I have a good idea of where it is headed. Stay tuned to learn more.
  • I really wanted a way to sync an Evernote notebook to a MoinMoin wiki. I don’t think something like that already exists, nor do I have the time to work on it, but it would be damn convenient right now.

 

Data Science and Consultancy

An article recently posted on Harvard Business Review declares data scientist “The Sexiest Job of the 21st Century”. The authors, Thomas Davenport and DJ Patil, are both familiar names to me, especially Patil, and, as expected, I found the article to be an interesting read. (It is also an easily palatable read, suitable for sending along to parents and friends who are still a little confused by what it is I do.)

One passage, in particular, caught my interest:

Considering the difficulty of finding and keeping data scientists, one would think that a good strategy would involve hiring them as consultants… But the data scientists we’ve spoken with say they want to build things, not just give advice to a decision maker. One described being a consultant as “the dead zone—all you get to do is tell someone else what the analyses say they should do.” By creating solutions that work, they can have more impact and leave their marks as pioneers of their profession.

As a data scientist* who operates as a consultant, I found this thought-provoking. Is hiring a data scientist as a consultant a good strategy for a company? Is it true that most data scientists are averse to consulting because they cannot have as much impact as a consultant as they can as a full-time employee? I certainly can’t speak for other data scientists, but here are some of my thoughts.

Many data science projects are well suited for consulting.

There are many indicators to help an organization decide when to outsource a project and when to handle it “in-house”. I’ve worked for many years in the world of technical consulting, and in my experience a significant percentage of data science projects are well suited to outsourcing.

  • Data science is usually not related to an organization’s core competency. A business that is good at making widgets may not be well equipped to build a team and develop processes for doing data science.
  • Many data science projects involve validating an idea before it is put into production. A consultant is often the right person to efficiently investigate the feasibility of an idea and determine its potential return-on-investment. An outsider will have the emotional detachment and political freedom to declare whether the project is well-grounded and realistic, and what it will take to execute the vision. Once validated, a business can make an informed decision about whether to build the product in-house.
  • Hiring a full-time data scientist can be difficult and time-consuming, especially when the individuals recruiting aren’t equipped to evaluate such candidates, and data scientists command a high salary. Much time and many resources can be saved by first validating and developing a strategy with a consultancy.

There is also another important consideration.

“Data scientist” is a very broadly defined category. An experienced statistician with some programming skills, an experienced programmer with some knowledge of machine learning, or a veteran business analyst with proficiency in big data architecture may each truthfully call themselves a data scientist. That the term is overloaded causes problems in the context of recruiting, and there’s another consequence.

A data science project is composed of many different components and many different phases: data exploration, confirmatory analysis, translating hypotheses into business strategy, communicating yet-to-be-developed data-centric ideas to executives, architecting and developing production-ready systems, and optimizing and scaling infrastructure. Each of these requires a very different skill set, yet most organizations find themselves hiring a data scientist or several data scientists without an understanding of which skills will be needed, when, and for how long. A strategic approach is very important.

For example, an organization may bring on a consultant to do those things that require specialization and need to be done only once; hire a permanent data scientist for long-term tasks and tasks that require intimate knowledge of secure internal data; and train existing technical teams to handle some of the development and maintenance of the data science product in production.

A consultant enables organizations to explore or experiment with an idea (or develop new ideas) with less risk and investment.

[UPDATE: There are, of course, many disadvantages of using a consultant over hiring in-house. Employees have more intimate knowledge of the business and the data. There are important considerations related to data confidentiality and related legal restrictions in many industries. And there are the more general pitfalls of outsourcing, about which much is written elsewhere. I do not mean to imply that hiring a consultant for data science is always the right thing to do — just that there are many scenarios in which it is.]

Data Exploration is fun!

Switching perspectives to that of the data scientist, there are many reasons to choose consulting over working full-time for an employer. For me, one of the most important is simply that … it’s fun!

Patil and Davenport quote a data scientist who clearly gets the most satisfaction out of building finished products, but the beginning stages of the data product development cycle are equally rewarding. There is a unique challenge in gaining a broad understanding of the client organization and its business goals, and in exploring the available data and their latent potential. One must develop hypotheses and find creative ways to test them. There is both a focus on achieving the objective and creative license in the methodology. Often there’s an opportunity to find an unforeseen, innovative use for the data.

Although truly data-driven companies will continually explore new ideas in their data, it is usually in the early exploratory phases of a project that I learn the most and feel the most rewarded for my work. As a consultant, I am able to maximize the amount of time I spend on my favorite data science tasks.

Yes, there is a cost to this luxury, and the quoted data scientist makes a good point about what he calls the ‘dead zone’ and the impact of building lasting solutions. But for me, at this point of my career, I’m very happy with the trade-off — and I would imagine that many other data scientists are too.

*I usually don’t refer to myself as a ‘data scientist’, but that’s a discussion for another day.