
No Excuses for Ugly Excel Charts

2/2/2012: Corrected the revised bar chart by setting the horizontal axis minimum to zero. Thanks to Jon Peltier for the catch.

Excel remains the de-facto graphing tool at National Priorities Project. A simple chart is often the best way to convey information about federal spending and budgeting, and Excel is the common language among our researchers and IT team.

Using Excel, however, is no excuse for ignoring style and the best practices of information display. So many organizations put out amazing, well-researched publications and then tack on default Excel graphs as an afterthought. But graphs are often what people look at first, and they deserve to be first-class citizens in the editing process.

I created some Excel chart templates for NPP, drawing on two sources for inspiration and practical advice: the classic Visual Display of Quantitative Information by Edward Tufte and The Wall Street Journal Guide to Information Graphics by Dona Wong.

Tufte is big on eliminating “unnecessary ink” that distracts from the information, and Wong advocates requiring the least amount of work on the reader’s part. With their advice in mind, I modified Excel’s default bar chart from this:

Bar chart: Excel default

To this:

Bar chart: new template

  • Smaller gap between bars
  • Don’t make readers guess the numbers; if possible, label the bars directly
  • Direct labeling means you don’t need the noisy gridlines or even the x-axis
  • Remove the y-axis tick marks for even more noise reduction
  • Get rid of those zeros by showing data in millions or billions
  • Make sure the entire length of the bars is shown (in this case, by setting the horizontal axis minimum to zero). HT Jon Peltier.

The pie chart got a similar treatment. The Excel default:

Pie chart: Excel default

The new template:

Pie chart: new template

  • Label the pie slices directly—don’t make people use a legend to decode
  • Avoid the default Office color palette and develop your own (ours is based on colors from our website)
  • A white line between pie slices emphasizes the boundaries

Excel isn’t perfect, but it’s out there in the world, and you can’t ignore it. Luckily, a little extra effort goes a long way.


Python, Django, & MySQL on Windows 7, Part 5: Installing MySQL

This is the fifth and final post in a dummies guide to getting started with Python, Django, & MySQL on Windows 7.

By now, you should have Django installed into a virtual environment.  These tutorials aren’t meant to cover building a django app, just to point out the quirks involved with getting a project up and running on Windows.  These tutorials also assume you want to construct real applications using a real development environment.

To that end, you’ll want a heftier database than sqlite.  We use MySQL at the office, so these instructions cover installing it and using it with Django.

Install MySQL

  1. Download and install MySQL.
  2. Once MySQL is installed, proceed through the configuration wizard. Check the Include Bin Directory in Windows PATH box.
  3. When prompted, set a password for the MySQL root account.
  4. Once the installation wizard is done, open a command window and log in to MySQL with the root account: mysql -uroot -p (you’ll be prompted for the password).
  5. After logging in, run the following commands to create a database, create a user for your Django project, and grant the user database access.
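
The commands themselves boil down to something like the following sketch (djangodb and djangouser are example names; substitute your own database, user, and password):

-- create a database for the project (djangodb is an example name)
CREATE DATABASE djangodb CHARACTER SET utf8;

-- create a MySQL account for the Django project to use
CREATE USER 'djangouser'@'localhost' IDENTIFIED BY 'your-password';

-- give that account full access to the new database
GRANT ALL PRIVILEGES ON djangodb.* TO 'djangouser'@'localhost';
FLUSH PRIVILEGES;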

Install MySQL-python

You’ll need the MySQL-python package, a Python interface to MySQL.

  1. Download the windows MySQL-python distribution here.  The author has some instructions about the appropriate version; assuming a 32-bit version of Python 2.7, you’d download this package (.exe).
  2. After downloading, do not run the Windows installer. Doing so will install MySQL-python to your root Python installation, which virtual environments created via --no-site-packages won't be able to see.
  3. Instead, install the downloaded package to your virtual environment by using easy_install, which can install from Windows binary installers:
    easy_install file://c:/users/you/downloads/mysql-python-1.2.3.win32-py2.7.exe (modify to reflect the location of the downloaded installer and its name).
    installing mysql-python package via easy_install

Configure Django

Next, you’ll need to update the database-related settings of your Django project.

  1. From the directory of your Django project, open settings.py using your favorite editor.
  2. Update the default key in the DATABASES dictionary. Set ENGINE to django.db.backends.mysql and set NAME, USER, and PASSWORD to the database name, username, and password you chose when installing MySQL (a sample configuration is sketched after this list). See Part I of the Django tutorial for more information about database settings.
  3. Open a command window, activate your virtual environment, and change to the directory of your Django project.
  4. Type python manage.py syncdb. This command creates the underlying tables required for your Django project.
    syncdb output
  5. If the syncdb worked, you have Python, Django, and MySQL communicating in harmony.  Congratulations!  You can now proceed through the Django tutorial and create your first application.
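
For reference, the MySQL portion of settings.py ends up looking roughly like this (a sketch using the example database name and account from the MySQL sketch earlier; substitute your own values):

DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.mysql',   # use Django's MySQL backend
        'NAME': 'djangodb',                      # the database created earlier (example name)
        'USER': 'djangouser',                    # the MySQL account for the project (example name)
        'PASSWORD': 'your-password',             # the password you set for that account
        'HOST': '',                              # empty string means localhost
        'PORT': '',                              # empty string means the default MySQL port
    }
}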

Python, Django, & MySQL on Windows 7, Part 4: Installing Django

This is the fourth post in a dummies guide to getting started with Python, Django, & MySQL on Windows 7.

We’re finally ready to install Django, a popular Web-development framework. Detailed instructions for building out a Django site are beyond the scope of this humble tutorial; try The Definitive Guide to Django or Django’s online Getting started docs for that.

These directions will simply make sure you can get up and running.

Installing Django

  1. Open a command window.
  2. Go to (or create) the virtual environment you’ll be using for your django project. For this example, I created a virtualenv called django-tutorial: virtualenv django-tutorial --no-site-packages
  3. Install django: pip install django
    install django 
  4. Start an interactive interpreter by typing python (or iPython, if you’ve made it virtual environment-aware).
  5. Test the install by importing the django module and checking its version: https://gist.github.com/1177372 (a minimal example appears after this list).
  6. Create a new directory to hold your Django projects and code. Change to it.
  7. Think of a name for your first Django project and create it by running the following command: python -m django-admin startproject [projectname] (thanks JukkaN!).
    Important: most Django docs show django-admin.py startproject [projectname] to start a new project, which can cause import errors and other trouble for Windows users. See this stackoverflow thread for details.
  8. You should now see the project's folder in your Django directory:
    django project folder
  9. Change into the new project folder.
  10. Test the new project by typing python manage.py.  Manage.py is Django’s command line utility; you should see a list of its available subcommands.
  11. A further test is to start up Django’s development server: python manage.py runserver. You should see something like this:
    django runserver
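
As mentioned in step 5, checking the install from the interactive interpreter amounts to something like this (the exact version number will vary):

>>> import django
>>> django.get_version()
'1.3.1'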

If you’ve made it this far, you’ve successfully installed Django and created your first project.

Next up is Part 5: Installing MySQL.


Two month milestone

flowering tree in Childs Park

Monday marked two completed months with the National Priorities Project. Though these weeks haven’t produced much writing, they’ve been a whirlwind of learning:

  • Python
  • Django
  • MySQL
  • The joy of setting up a proper Windows dev environment using the above three items
  • Piston, a tool for powering APIs through Django
  • Linux
  • Git/Github
  • The Federal Budget process
  • The Consolidated Federal Funds Report, a huge annual file of government expenditures.
  • Various other indicators about the state of our union: gas emissions by state, average teacher salaries, people in poverty, insurance enrollments, etc.
  • Finally, I’m NPP’s interim Twitterer, a fascinating distraction.

One day soon I’ll write a Dummies Guide to Setting up Python/Django/MySQL on Windows post. In the meantime, it’s great to be back in the hands-on tech saddle.


Save the data, save on FOIA?

Last week I wrote my first entries on the National Priorities Project’s (NPP) blog. Friday’s piece concerned the potential $32 million cuts to the Federal government’s open data initiatives.

Alexander Howard wrote a tremendous overview of the situation, from the recent history of open government platforms to the less-than-perfect implementation of those platforms to the implications of having their funding cut from $34 million to $2 million.

He quoted some of NPP’s numbers that try to put $32 million in context. In terms of the Federal budget, it’s a tiny sum of money–.0009% of the proposed FY11 spending.

That’s an interesting figure, but even if $32 million is just a drop in the bucket, that’s not to say we should spend it carelessly. I’m new to the open government scene, but you don’t have to dig too far into Data.gov to realize it’s far from perfect. Howard’s primer provides some insight into the perverse incentives behind quirks like datasets split up by geography and agencies that don’t publish their juicy stuff.

But consider another number we published: $32 million is 7.7% of the amount that the government spent processing Freedom of Information Act (FOIA) requests in FY10.

A compelling story would be to find out what types of FOIA requests could be serviced via the Data.gov suite of sites. Even better, why not use these requests to prioritize the data that’s released online?

If we can use $32 million to take a bite out of that $416+ million FOIA bill*, why not pursue that investment?

Some of my colleagues would say because it’s not about the money—it’s about policy. As a developer, I have a hard time wrapping my brain around that. Policy? Why wouldn’t our elected officials just make decisions that are logical?

It seems I have much to learn.

*figure pulled from FOIA.gov/data.


The new gig: return to data

Open Government Data Venn Diagram

diagram by Justin Grimes

I’m about two weeks into a new job with the National Priorities Project (NPP), a small, non-profit organization based in Northampton, MA.

NPP’s mission is to educate citizens about their federal budget: the budget process, the proposed numbers, and how our government ultimately spends money. All too often, it’s difficult for people to understand how the huge numbers bandied about in Washington, DC affect their states, schools, and communities.

The organization began in 1983, long before technology made it easier to analyze and share this information and long before the likes of Data.gov.

NPP has formed some exciting partnerships in recent years, and I’m excited by our direction: combining the staff’s budget experts with an improved database search tool (launching soon) and an API developed in conjunction with the Sunlight Foundation. Finding additional indicators that, when combined with spending data, will tell a richer budget story. Telling those stories and giving our constituents the means to tell theirs too.

My role will be to look after and augment our database and help create visualizations and web tools for general use. Naturally, I took it upon myself to expand the job description to blogging, so look out budget analysts: I’m coming to gum up your blog with data stories.

It’s daunting, but I’m very excited to jump into the government data transparency community and learn the ropes. I’m also thrilled to return to a data-oriented job.

The ecosystem of government departments and agencies and their respective datasets is downright crazy, but I like crazy. It makes for good blog stories.


Beware the change scripts created by SQL Server Management Studio

Part of my job is moonlighting as a SQL Server database admin, so a co-worker recently asked me to run a script against a production table. The task was simple: add a few new columns and create some indexes. He had generated the script by using SQL Server 2008 Management Studio’s (SSMS) Generate Change Script function.

Although the general sequence of steps in the generated script made sense, I have some complaints about the SSMS output. Here’s a very simple recreation of the scenario.

Consider a table with two columns. One is a primary key, and one has a default constraint:

CREATE TABLE phillies (
   phils_id INT IDENTITY(1,1)
,  phils_year CHAR(4) NOT NULL
   CONSTRAINT df_phillies_year DEFAULT (2008) 
,  CONSTRAINT pk_phils_id 
      PRIMARY KEY CLUSTERED (phils_id))

Using the SSMS table design view to add two columns to the phillies table and saving the change script results in the following code:

BEGIN TRANSACTION
GO
ALTER TABLE dbo.phillies
   DROP CONSTRAINT df_phillies_year
GO
CREATE TABLE dbo.Tmp_phillies (
   phils_id INT NOT NULL IDENTITY (1, 1),
   phils_year CHAR(4) NOT NULL,
   division_champ_flag bit NOT NULL,
   national_champ_flag bit NOT NULL
)  ON [PRIMARY]
GO
ALTER TABLE dbo.Tmp_phillies 
   ADD CONSTRAINT
   df_phillies_year DEFAULT ((2008)) FOR phils_year
GO
SET IDENTITY_INSERT dbo.Tmp_phillies ON
GO
IF EXISTS(SELECT * FROM dbo.phillies)
   EXEC('INSERT INTO dbo.Tmp_phillies 
      (phils_id, phils_year)
   SELECT phils_id, phils_year 
   FROM dbo.phillies 
   WITH (HOLDLOCK TABLOCKX)')
GO
SET IDENTITY_INSERT dbo.Tmp_phillies OFF
GO
DROP TABLE dbo.phillies
GO
EXECUTE sp_rename 
   N'dbo.Tmp_phillies', N'phillies', 'OBJECT' 
GO
ALTER TABLE dbo.phillies 
   ADD CONSTRAINT
   pk_phils_id PRIMARY KEY CLUSTERED 
   (phils_id) ON [PRIMARY]
GO
COMMIT

This script contains BEGIN TRANSACTION and COMMIT TRANSACTION statements but doesn’t accompany them with any kind of error handling. So if you run it as-is and encounter an error, nothing gets rolled back.

Any error handling that you might add, however, is thwarted by the fact that the script’s statements are contained in individual batches (i.e., separated by GO statements).

Say you individually check each statement for errors and issue a rollback/return if something goes awry.

BEGIN TRANSACTION
GO
[snip]
CREATE TABLE dbo.Tmp_phillies (
   phils_id INT NOT NULL IDENTITY (1, 1),
   phils_year CHAR(4) NOT NULL,
   division_champ_flag bit NOT NULL,
   national_champ_flag bit NOT NULL
)  ON [PRIMARY]
GO
[snip]
-- throw an error
SELECT 23/0
IF @@ERROR <> 0
BEGIN
   PRINT 'error!'
   ROLLBACK TRANSACTION
   RETURN
END
[snip]
GO
DROP TABLE dbo.phillies

In this scenario, changes that occurred prior to the error will be rolled back. However, although the RETURN statement exits the current batch, subsequent batches (for example, the one that deletes your table) will execute.

A TRY/CATCH block is another potential error-handling method, but TRY/CATCH blocks can’t span multiple batches.

So what’s with the batch-happy SSMS? In older versions of SQL Server it might have been necessary to separate a CREATE TABLE statement and subsequent ALTER TABLE statements into separate batches (I haven’t tested this). But SQL Server 2005 and 2008 provide statement-level recompilation, so the multitude of batches is not necessary.

When I tested a “GO-less” version of the script, it worked swimmingly. It’s understandable that SSMS can’t generate the error-handling, but if it would reduce the scripts’ number of batches to the minimum required, it would be easier to add the error handling yourself.
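
For illustration, a single-batch version with TRY/CATCH added might look roughly like the sketch below (a sketch of the approach rather than the exact script; the rest of the generated statements are snipped):

BEGIN TRY
   BEGIN TRANSACTION

   ALTER TABLE dbo.phillies
      DROP CONSTRAINT df_phillies_year

   CREATE TABLE dbo.Tmp_phillies (
      phils_id INT NOT NULL IDENTITY (1, 1),
      phils_year CHAR(4) NOT NULL,
      division_champ_flag bit NOT NULL,
      national_champ_flag bit NOT NULL
   )  ON [PRIMARY]

   -- [snip] the remaining generated statements, with the GO separators removed

   COMMIT TRANSACTION
END TRY
BEGIN CATCH
   -- any runtime error lands here: undo the work and report the problem
   IF @@TRANCOUNT > 0
      ROLLBACK TRANSACTION
   PRINT ERROR_MESSAGE()
END CATCH

With everything in one batch, the transaction, the error handling, and the statements it protects can finally live in the same scope.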

Conclusion: don’t rely on these SQL Server Management Studio-generated change scripts for any serious work, and definitely don’t use them as a model for how to write robust T-SQL.


Database Design Presentation

I’m extremely happy to work in an organization with an uber-smart and supportive group of developers. Once a month we meet for a lunchtime “tech talk” presentation on a topic chosen via Google Moderator.

This month, Tim Allen and I collaborated and spoke about database design. Working with Tim to research and organize the material was a blast, and even though we put silly pictures of ourselves on the slides, I’m posting them.

We focused most closely on normalization and indexing, with a few of our other best practices thrown in the mix. The concepts should be applicable to any RDBMS, but the details are specific to MS SQL Server, the database used in most of the organization’s applications.

wcit-techtalk-database-design


Hypnotic B-Tree Video

While preparing a database design presentation with my co-worker Tim, I discovered this hypnotic video demonstrating inserts into a b-tree data structure. For some reason, the music compels me to leave the cubicle and pay a visit to Rami’s falafel truck.


SQL Server operand type clash!

Note:  This article was originally published to the Wharton Computing Developer Center.

Yesterday a fellow developer hit a strange SQL error and determined that the culprit was a T-SQL CASE statement used in his ORDER BY clause.

ORDER BY
CASE
WHEN UPPER(@orderBy)='GROUPMESSAGEID' THEN groupMessageID
WHEN UPPER(@orderBy)='GAMEID' THEN gameID
WHEN UPPER(@orderBy)='GROUPID' THEN groupID
WHEN UPPER(@orderBy)='MESSAGEID' THEN messageID
WHEN UPPER(@orderBy)='ROUNDSENT' THEN roundSent
WHEN UPPER(@orderBy)='SENTON' THEN sentOn
ELSE groupMessageID
END

This code results in the following message:
Operand type clash: uniqueidentifier is incompatible with datetime

After some digging around, we found the underlying cause of the error:  the case statement return values have different data types.  In some cases, returning different data types will behave as expected.  However, mixing numeric and character data causes problems.

In the statement above, gameID, groupID, roundSent, and groupMessageID are integers, sentOn is a datetime, and messageID is a uniqueidentifier.  Because the data type precedence pecking order in this case is datetime, then int, then uniqueidentifier, SQL Server chose datetime as the return type. Uniqueidentifiers cannot be converted to datetimes, hence the error message.
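
One common workaround (a sketch using the columns from the query above, not necessarily what we ended up doing) is to give each data type its own CASE expression in the ORDER BY, so no single expression has to reconcile incompatible types:

ORDER BY
CASE UPPER(@orderBy)                      -- integer columns only
   WHEN 'GAMEID'    THEN gameID
   WHEN 'GROUPID'   THEN groupID
   WHEN 'ROUNDSENT' THEN roundSent
   WHEN 'SENTON'    THEN NULL             -- handled by its own CASE below
   WHEN 'MESSAGEID' THEN NULL             -- handled by its own CASE below
   ELSE groupMessageID
END,
CASE WHEN UPPER(@orderBy) = 'SENTON'    THEN sentOn    END,   -- datetime only
CASE WHEN UPPER(@orderBy) = 'MESSAGEID' THEN messageID END    -- uniqueidentifier only

Each CASE now returns a single data type (or NULL), so there is nothing for SQL Server to reconcile.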

It all became clear after reading this article by George Mastros.  Thank you to George.
