Nausheen Husain

[NO'SHEEN HOO-SANE']



Let's play with csvkit!


* Download data here. Use "Download zip" on the bottom right and save the folder to your Desktop.

** Open up the files called "2001_state_profiles.xlsx" and "2013_state_profiles.xlsx" in Excel. Take a look at the data. What questions can we ask of this data?

Let's try to answer some questions.

* Open up your command line ("Terminal" on Mac).

* Let's install csvkit using $ sudo pip install csvkit. You may have to type in a password.

* Change directories into the data folder. $ cd /Users/your_name/Desktop/SSRC-data-gh-pages (don't type the $).

Let's play with hate crime data from 2001.

* First, we'll convert the Excel file into a CSV (comma-separated values) file, which is generally easier to work with. $ in2csv 2001_state_profiles.xlsx > 2001_percentage.csv

* Let's look at the data as a whole with $ csvlook 2001_percentage.csv. This should give you a fairly readable view of states, participating agencies, and incidents reported.

* Let's get some more information about our columns using $ csvstat 2001_percentage.csv.

* Let's say we want to focus on the agency information for each state. Try $ csvcut -c 1,2,4 2001_percentage.csv | csvlook | head.

Pipes are awesome!

* Let's get even more specific. Try $ csvcut -c 1,2,4 2001_percentage.csv | csvgrep -c Participating_state -m Illinois | csvlook.

* Organization is important. Check out what happens when you type $ csvcut -c 1,2,4 2001_percentage.csv | csvsort -c Number_of_participating_agencies -r | csvlook.

** What would you type into the command line to sort this data from most to least populated states?

Your turn! Play around with 2013_state_profiles.xlsx. Do the following:

* Turn the Excel file into a CSV file called "2013_percentage.csv".

* Use csvstat, csvlook, head, csvcut, and csvgrep.

* Find out which state has the higher rate of agencies submitting incident reports relative to the number of participating agencies: New York or Tennessee.

Wouldn't it be cool if we could compare that rate for ALL of the states? What would we need to do to make this possible?

* Open up your SSRC-data-gh-pages folder in your Finder and drag "hatecrimesdivide.py" into any text editing program (TextEdit, TextMate, Sublime). You can also just watch the screen for this part.

Scripting is awesome!

* Go back to your command line and run $ python hatecrimesdivide.py to apply the Python script to 2001_percentage.csv.

* Now try $ csvlook 2001_percentage.csv | head. What did the script do?

* Let's return to our text editor and edit the script so it points at our 2013_percentage.csv file. Change input_file = '2001_percentage.csv' to input_file = '2013_percentage.csv'. Change writer.writerow(next(reader) + ["2001_percentage"]) to writer.writerow(next(reader) + ["2013_percentage"]). (There's a rough sketch of the script after these steps if you want to see where these lines live.)

* Run the same commands for the 2013 data: $ python hatecrimesdivide.py and then $ csvlook 2013_percentage.csv | head.
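
In case you'd rather not open the file yourself, here's a minimal sketch of what hatecrimesdivide.py might look like. Only the two lines mentioned above (input_file = ... and writer.writerow(next(reader) + [...])) come from this handout; the column positions, the divide itself, and the write-back-to-the-same-file approach are assumptions, so treat this as a sketch rather than the real script.

```python
import csv

input_file = '2001_percentage.csv'

# Pull the whole file into memory so we can write the new column
# back out under the same filename.
with open(input_file) as f:
    rows = list(csv.reader(f))

with open(input_file, 'w', newline='') as f:
    writer = csv.writer(f)
    reader = iter(rows)
    # The header line you edit when switching to the 2013 file:
    writer.writerow(next(reader) + ["2001_percentage"])
    for row in reader:
        # Assumption: row[1] is Number_of_participating_agencies and row[3]
        # is agencies submitting incident reports (the same columns we kept
        # earlier with csvcut -c 1,2,4). Strip commas in case the numbers
        # have thousands separators.
        participating = float(row[1].replace(',', '') or 0)
        submitting = float(row[3].replace(',', '') or 0)
        writer.writerow(row + [submitting / participating if participating else 0])
```

Whatever the real script does internally, the result is the same: each row gets a new rate column tacked onto the end, which is why csvcut -c 1,6 picks it up in a moment.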

Now we have rates of reporting for all the states for 2001 and 2013.

We can use csvkit to compare these two years. To visualize this data, we need to get the percentages into one CSV file along with the Participating_state column.

* Let's go back to our csvcut command. Try $ csvcut -c 1,6 2001_percentage.csv > 2001_cut.csv

* Let's see what that looks like using $ csvlook 2001_cut.csv | head.

* Now let's do the same with our 2013 data. Type in $ csvcut -c 1,6 2013_percentage.csv > 2013_cut.csv and check out the result with $ csvlook 2013_cut.csv | head.

Now we have two files, 2001_cut.csv and 2013_cut.csv, each with a state column and a percentage column. Wouldn't it be great if we could somehow magically join those?

OH WAIT. WE CAN TOTALLY DO THAT.

* Let's see what csvjoin can do for us. Try $ csvjoin -c Participating_state 2001_cut.csv 2013_cut.csv | csvlook | head.

* Let's use it with our pipes to cut out the duplicate state column the join leaves behind: $ csvjoin -c Participating_state 2001_cut.csv 2013_cut.csv | csvcut -c 1,2,4 > finaldata.csv.

* Let's take a look at the data in its final, chartable form: $ csvlook finaldata.csv. Isn't it beautiful?

Resources:

* Check out csvkit here.

* Here's a great Medium article on how to mislead with numbers, using the NYTimes as an example.

* The NPR Visuals blog always has great posts on how they've worked with a variety of data as well as what tools they've found helpful.
