It’s been over a month since Strata 2012. Lest they meet the same fate as scribbles from conferences long past, I’m putting my notes here instead of leaving them in whatever notebook I happened to be toting around that week.
The biggest a-ha moment, and one that I’ll be writing about in the future, came from Ben Goldacre’s keynote, when he compared big data practitioners to drunks looking for car keys only where the light shines. We focus on the data that’s available without asking, “what’s missing?” Plus, it’s fun to hear someone with a British accent say “blobogram.”
Getting Started With Hadoop
This was a three-hour overview of what Hadoop is, how its two components (Hadoop Distributed File System, or HDFS, and MapReduce) work, and the tools that round out its ecosystem. Sara from Cloudera explained the details clearly and in plain English, using language that technologists of all stripes could understand, not just big data geeks.
- Hadoop gives you scalability and the ability to handle failure
- In return, you need to rethink your data processing algorithms
- Writing MapReduce jobs directly is like coding in assembly. The Hadoop ecosystem has tools that help you create, schedule, and run jobs, query HDFS, move data between HDFS and other types of storage, and aggregate files for Hadoop input
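To make the "rethink your algorithms" point concrete, here is a minimal sketch of the MapReduce pattern as a word count in plain Python. It does not use Hadoop at all; the function names (map_phase, shuffle, reduce_phase) are my own, and the shuffle step only imitates the grouping that the framework would do between the map and reduce stages.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group emitted values by key (done by the framework in Hadoop)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence on a list of text lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = word_count(["big data big ideas", "Big questions"])
```

Each map call sees only one line and each reduce call sees only one key, which is what lets a real cluster spread the work across machines and rerun a piece if a machine fails.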
The Craft of Data Journalism
This three-hour session was run by two folks from The Guardian (one is Simon Rogers), who demonstrated some of their data-driven projects and shared some of their tools. The Guardian’s data stories are blogged here: http://www.guardian.co.uk/news/datablog.
- Google Spreadsheets: crucial because they’re free, easy, can be used as a backend, and are easy to mash up
- Google Fusion Tables: used to create maps from location data. Pop-ups that appear for each data point on the map are customizable
- Google Refine: for data cleanup
- ScraperWiki: to pull data from websites
- Bit.ly://fusionborderlinks: Guardian-created list of links to different boundaries (can be combined with Fusion table location data to add boundaries to maps)
- Presenters reported that ~80% of their time is getting, cleaning, and verifying the data—not creating the visualizations. This is consistent with my experience and also verifies that pre-cleaned & formatted data could be useful to journalists
Keynotes were a series of short presentations. Favorite quotes:
- Systems thinking is the science in data science
- Store raw data and project schemas onto it dynamically – Doug Cutting, Cloudera
- More, messy data has greater value than smaller, cleaner data. – Dave Campbell, Microsoft
- Problems confronting humanity: this is where big data needs to go – Mike Olson, Cloudera
- Big data is most valuable when it tells us what we don’t know we don’t know (HT Donald Rumsfeld). Much more valuable than data puking onto dashboards – Avinash Kaushik, Market Motive
- Big data can and should be used to find out what’s missing – Ben Goldacre, Bad Science
- What data do we have that we aren’t using? What are we throwing away—especially things that could better help us quantify impact (web logs, eblast data, social media stats)
The Science of Visualization
- When creating visualizations, exploit the power of human visual processing
- Humans are slow at mental math (e.g., adding things up in tables) but fast visually
- Therefore, use visual pop-outs when possible (e.g., coloring numbers that should stand out)
- Data view: exploit bi-directional processing
- Data stories: show people where to look
- Purpose of visualizations is to speed up the cycle of analysis
- Some best practices (get the slides for the complete overview):
- If there are 3 variables, code the 3rd with size, not with color
- People see area, not height
- Size is ordinal; don’t use it to encode non-ordinal data
- Color is okay for ordinal data but works even better for non-ordinal data (as long as there aren’t too many values, up to 20 or so)…takes advantage of the visual pop-out
- Nominal: if there are over 20 items, colors become less effective, but shapes can work (good example is the animal graph in Tufte’s Beautiful Evidence)
- Small sizes make colors and shapes much less effective
- Although heatmaps are currently popular, they are often not the best choice for quantitative data
- If heatmap is being used to identify outliers, that’s not really quantitative data, so it’s okay
- But otherwise, it’s often best to use size instead of heatmaps (especially filled maps)
- Stephen Few’s Show Me the Numbers: accessible and practical advice for getting started
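One way to act on "people see area, not height" when encoding a variable with size is to scale each mark so that its area, rather than its radius, tracks the value. The sketch below is my own reading of that advice, not something shown in the session; radius_for_value and the 40-pixel maximum are hypothetical names and numbers chosen for illustration.

```python
import math


def radius_for_value(value, max_value, max_radius=40.0):
    """Return a circle radius whose AREA is proportional to value.

    Area grows with the square of the radius, so the radius must grow
    with the square root of the value.
    """
    if max_value <= 0:
        raise ValueError("max_value must be positive")
    if value < 0:
        raise ValueError("value must be non-negative")
    return max_radius * math.sqrt(value / max_value)


values = [25, 50, 100]
radii = [radius_for_value(v, max(values)) for v in values]
```

With this scaling, a value four times larger gets a circle with twice the radius and four times the area. Scaling the radius linearly instead would give sixteen times the area and exaggerate the difference.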
Effective Data Visualization
Although this session yielded some useful information, it was too focused on the presenter’s company, with not enough information about general best practices. That said, the company itself (DataMarket) is interesting—they are trying to be the Google of datasets. You find a dataset on their site, choose it, and are immediately presented with a visualization. Furthermore, people can now upload their own data to the site.
In other words, they have solved the problem of creating generic visualization code that will work with any dataset.
- DataMarket uses a Python/Django/Postgres stack
- In 2010, they reviewed 100+ libraries to see which met their 5 requirements for the site (compatibility with all major browsers, server-side rendering, vector output, iOS compatibility, full control). Protovis was the choice, although it lacked server-side rendering and did not work with IE 7/8 (because it outputs SVG)
- DataMarket developed a server-side solution using node.js, svglib, and some other stuff to transform SVG to PDF for rendering
- They created a shim for IE 7/8 that captures Protovis SVG output and transforms it to VML (IE 7/8 = ~20% of their users)
- Although Protovis is no longer being actively developed, there is still an active community. DataMarket’s strategy is to stick with Protovis until more IE users are on 9 and then switch to D3
Data Visualization for a Better World
This was a Bay-area meetup group and not part of Strata. Really great event—Jake Porway, the founder of Data Without Borders, is smart, humble, and passionate about matching data scientists with social and non-profit orgs.
- The front end of the NYT R&D Twitter sharing visualization is JS and Processing, with Python to get the data and MongoDB to store it
- 2nd viz demoed uses OpenPaths.cc and WebGL
- On their own, hackers will develop apps that solve their own problems. Therefore, hackathons need to include those who will use the apps to change the world (i.e., social sector orgs)
- “Unglamorous” things that are easy for data scientists to do can be transformative for social service orgs.
Video Graphics: Engaging and Informing
- Video lets you bring a human persona to the data
- On articulating data into stories:
- It goes beyond finding the x & y coordinates that match. Must engage people in the insights
- The story is the concept, not the characters
- Information doesn’t want to be free…it mostly wants to lie around. Don’t let your data lie around!
- Don’t confuse insight with story
- Find the “red thread” of story
- On showing multi-modal data:
- Words and pics need to be brought together
- Cross the streams! Show AND tell. Pictures AND words. Data AND interpretation.
- Don’t be afraid of entertainment and gloss
- Use the visual strengths of the medium
- Not everything needs to look like a diagram
- Why bother with flat video graphics (e.g., a bunch of words & numbers with narration)? People make these b/c they’re cheap to make, but a PowerPoint or PDF would be just as good.
- Example of non-flat words: intro text to Star Wars movies.
- Make things real
- Spread data socially
- Use a known expert
- Video must be relevant to NOW (e.g., Crisis of Credit video)
- Videos work well as an introduction to a subject
From Predictive Modeling to Optimization
- Predictive models don’t cause change!
- We don’t pay enough attention to the process of using predictive model output to optimize
- Traditional process (using insurance biz as an example)
- Objective: more money!
- Levers: what variables you can tweak (policy prices, etc.)
- PhD-quality algorithm (actuaries)
- Modeler/simulator to test what happens when you pull the levers
- Optimizer: pulls levers over and over again until you reach the desired objective (replaces the PhD-quality algorithm)
- Objective: maximize lifetime value of customer
- Levers: recommendation system, offers, discounts, customer care calls
- Data. What data can you collect? That is the right question, not “what data do you have already.” In insurance example, presenter implemented random pricing for 6 months to get enough data to create the model.
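The simulator-plus-optimizer loop described above can be sketched in a few lines. This is a toy under loud assumptions: the linear demand curve, the unit cost, and the price range are all invented for illustration, and a brute-force search over candidate prices stands in for whatever optimizer the presenter actually used.

```python
def simulate_profit(price, unit_cost=20.0):
    """Toy simulator: predict profit for one setting of the price lever.

    Assumes a made-up linear demand curve; in practice this is where the
    predictive model fitted to collected data would go.
    """
    demand = max(0.0, 1000.0 - 10.0 * price)
    return (price - unit_cost) * demand


def optimize(simulator, candidate_levers):
    """Optimizer: try every lever setting and keep the best objective value."""
    return max(candidate_levers, key=simulator)


best_price = optimize(simulate_profit, range(20, 101))
best_profit = simulate_profit(best_price)
```

The predictive model only answers "what happens if I set the lever here?" The change comes from the loop around it, which keeps pulling the lever until the objective stops improving.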
Visualizing Geo Data
- Just Plot It from Aaron Koblin is a helpful resource for map ideas
- Processing creates beautiful data visualizations
- TileMill is great for maps (slippy map tiles)
- Unfolding: glues together Processing and TileMill
- Cartograms (not covered in the session) are useful for displaying quantities on maps
- Unless you have training, do not choose your own colors for maps! Use ColorBrewer.
- Stats of the Union is an excellent example of an iPad mapping app
- You don’t always need to use the traditional boundaries on a map. A good example is the United States of Craigslist, which displays data according to Craigslist regions.
- Get Shapefiles at census.gov/geo/www/tiger
- Pyshape: Python library for shapes
- Add a spatial tree to your bag of tricks; it helps maps render much faster. Rtree is one tool. Quick way to query and associate data with counties (on Github)
- Choice of bin sizes for map colors changes the story. See slides for a great example of this!
- Equal intervals are bad! They don’t take into account the way your data is structured.
- Using quantiles is a reasonable first pass.
- Presenter (Jason Sundram of Where, Inc) has tweaked D3 choropleth code to make it interactive! See Github…
- WebGL used to make the spinning globe thing
- Very cool, but probably not useful to an org that specializes in U.S. data only
- Need < 250,000 points
- Cheat by binning (e.g., 72.123456 = 72.1)
- Scale values
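Going back to the point about bin sizes for map colors, the difference between equal intervals and quantiles is easy to see on skewed data. The sketch below uses made-up county values with one large outlier; the helper names and the simple index-based quantile rule are my own choices, not the presenter's code.

```python
from bisect import bisect_right


def equal_interval_breaks(values, n_bins):
    """Breaks that split the value range into n_bins equally wide intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [lo + width * i for i in range(1, n_bins)]


def quantile_breaks(values, n_bins):
    """Breaks chosen so each bin holds roughly the same number of values."""
    ordered = sorted(values)
    return [ordered[(len(ordered) * i) // n_bins] for i in range(1, n_bins)]


def bin_counts(values, breaks):
    """Count how many values land in each color bin."""
    counts = [0] * (len(breaks) + 1)
    for value in values:
        counts[bisect_right(breaks, value)] += 1
    return counts


# Hypothetical, skewed data: one outlier dominates the range.
county_values = [1, 2, 2, 3, 4, 5, 6, 8, 10, 12, 15, 500]

equal_counts = bin_counts(county_values, equal_interval_breaks(county_values, 4))
quantile_counts = bin_counts(county_values, quantile_breaks(county_values, 4))
```

With equal intervals, eleven of the twelve values share a single color and two colors go unused, so the map shows little more than the outlier. Quantiles put three values in each bin, which is why they make a reasonable first pass.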
Big Data for the Common Good
- Jake Porway and Virginia Carlson
- There are two kinds of government data:
- Data generated as a result of the government going about its business (e.g., logs of White House visitors). When Gov 2.0 folks talk about data, they are mostly talking about this.
- “Designed” data. Created for the public (e.g., Census Bureau reports such as CFFR).
- The Designed Data is what social orgs often rely on for their business, but it is going away. For example, food banks can’t get good data about their local economic conditions, so they can’t tell where to provide services.
- Foundations are asking for metrics. Organizations are sitting on tons of data. Result = funnel effect…more data coming in all the time, but the outgoing pipe remains the same size.
- There is a gap between civic coders and community groups/leaders.
- Something that may seem easy/trivial to a data scientist might be transformative to a non-profit
- You can’t plan social programs without data any more than you could do it without a budget.
- Virginia is pitching commercial data for the common good.
- Break down silos between data (e.g., gov), talent (e.g. Google), and social good (e.g., Unicef).
- Example: Could Google tell the foodbank where searches for their services are coming from?
Exploring the Stories Behind the Data
- Cheryl Phillips, Seattle Times
- Data’s increasing involvement in news stories doesn’t change the basics: you still need to tell who, what, when, why, how. Good example of this is NYT’s Faces of the Dead (casualties of war project)
- Nutgraph…journalism term. 3rd or 4th paragraph in a story…the “why” that follows the lead
- The nutgraph shows the reader why he or she should care, so don’t put it too late in the story
- Data in the story should support the nutgraph. Don’t plug in too many numbers.
- Visualizations also need a good lead-in and nutgraph
- Narrative techniques apply. Are there patterns (e.g., chronologic, geographic)?
- Images and visualizations allow you to use details without overwhelming the reader with the details in text.
- However, too many details in a viz can be as boring as too many details in text.
- Cheryl thinks visualizations should allow readers to annotate and share and encourages developers to get busy creating that feature.