Love PANDA? Try csvkit to Standardize Your Data

    by Christopher Groskopf
    September 17, 2012

    Now that [PANDA](http://pandaproject.net) has been out in the world for a while, we would like to suggest you also take a look at a “cousin” project of sorts: [csvkit](http://csvkit.readthedocs.org/en/latest/index.html). csvkit is a suite of power tools for working with comma-separated values (CSV) files. It’s a command-line toolkit, so you will need some minimal technical know-how to get started, but it also has [extensive documentation](http://csvkit.readthedocs.org/en/latest/index.html). Those new to such tools will find the [tutorial](http://csvkit.readthedocs.org/en/latest/index.html#tutorial) particularly useful!


    The PANDA project aims to make basic data analysis quick and easy for news organizations, and to make data sharing simple. Both PANDA and csvkit have at their core the belief that simple data formats are the best data formats. In PANDA’s case this means that, although we import from CSV or Excel files, we only ever export to simple CSV files. This gives us the highest degree of interoperability with other software packages, because almost anything can read a CSV file. csvkit builds on the simplicity of CSV files by allowing you to do all sorts of useful things with them.



    Here are some of the things that you can do with csvkit:

    – Convert Excel, DBF, fixed-width and JSON files into CSV with [in2csv](http://csvkit.readthedocs.org/en/latest/scripts/in2csv.html).
    – Filter a CSV down to a subset of columns with [csvcut](http://csvkit.readthedocs.org/en/latest/scripts/csvcut.html).
    – Search and filter rows of a CSV with [csvgrep](http://csvkit.readthedocs.org/en/latest/scripts/csvgrep.html).
    – Perform SQL-like “joins” between CSV files with [csvjoin](http://csvkit.readthedocs.org/en/latest/scripts/csvjoin.html).
    – Convert a CSV to JSON or GeoJSON with [csvjson](http://csvkit.readthedocs.org/en/latest/scripts/csvjson.html).
    – Import a CSV straight into a database with [csvsql](http://csvkit.readthedocs.org/en/latest/scripts/csvsql.html).
    – Generate summary statistics for a CSV with [csvstat](http://csvkit.readthedocs.org/en/latest/scripts/csvstat.html).
    – And a whole lot of other useful things!
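
    To make the column and row filtering in the list above concrete, here is a minimal sketch in plain Python of the kind of operation csvcut and csvgrep perform. It is not csvkit's implementation and does not call csvkit; the sample data and the helper names `cut_columns` and `grep_rows` are invented for illustration.

```python
import csv
import io

# Hypothetical sample data, invented for illustration only.
SAMPLE = """name,county,population
Tyler,Smith,96900
Longview,Gregg,80455
Whitehouse,Smith,7660
"""


def cut_columns(rows, columns):
    """Keep only the named columns, in the order given (roughly what csvcut does)."""
    return [{c: row[c] for c in columns} for row in rows]


def grep_rows(rows, column, text):
    """Keep rows whose value in `column` contains `text` (roughly what csvgrep does)."""
    return [row for row in rows if text in row[column]]


# Read the CSV into a list of dictionaries keyed by the header row.
rows = list(csv.DictReader(io.StringIO(SAMPLE)))

# Filter rows first, then narrow the columns.
smith = grep_rows(rows, "county", "Smith")
result = cut_columns(smith, ["name", "population"])

# Write the filtered data back out as CSV text.
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=["name", "population"], lineterminator="\n")
writer.writeheader()
writer.writerows(result)
print(out.getvalue(), end="")
```

    On the command line, csvkit lets you chain steps like these with pipes instead of writing any code, which is much of its appeal.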


    PANDA and csvkit are a natural pairing of tools. Use csvkit to clean up or standardize your data before importing it into PANDA. Have a DBF or fixed-width file you can’t import into PANDA? Use csvkit to convert it into a CSV. Need to use data from your PANDA in your web app? Export it to CSV and use csvjson to convert it to JSON. Using csvkit with PANDA will make your newsroom data even more useful.
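
    As a rough illustration of the export-to-JSON workflow, the sketch below turns CSV text into a JSON array of objects using only Python's standard library. It is an assumption-laden stand-in for the idea behind csvjson, not its actual code: the export contents and the `csv_to_json` helper are hypothetical, and every value is left as a string rather than being converted to a number or date.

```python
import csv
import io
import json

# Hypothetical CSV export, invented for illustration only.
EXPORT = """id,headline
1,City budget approved
2,School board meets
"""


def csv_to_json(csv_text):
    """Convert CSV text to a JSON array with one object per row."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    return json.dumps(rows, indent=2)


json_text = csv_to_json(EXPORT)
print(json_text)
```

    A web app could then load the resulting JSON directly, with each row available as an object keyed by column name.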

    csvkit works on Linux, OS X and Windows with Python versions 2.6 and 2.7, or with PyPy. Head over to the [documentation](http://csvkit.readthedocs.org/en/latest/index.html) and get started with faster, better data processing.

    Christopher Groskopf is the lead developer on the PANDA Project and a former developer on the Chicago Tribune’s News Applications Team. He is also the creator of django-boundaryservice, csvkit, and Hack Tyler. His residence is in flux, but you can find him on Twitter regardless of his present whereabouts: @onyxfish.
