Bash CSV Handling

From TheBeard Science Project Wiki
Jump to: navigation, search

I'm sick of screwing around with the same caveats in CSV files, so I'm creating this definitive guide for my own reference. I will update this as I learn to handle individual issues.

For the following examples, I will use this sample CSV file test.csv (assuming Unix line ending "\n"). I've included several "problems" in it.

id,firstname,lastname,email,phone,hiredate
1111,"Colin","Quinn","c.quinn@example.com",1112223333,"02/12/2017"
2222,"Richard","Vos, Richard","r.vos@example.com",2223334444,"06/27/2019"
3333,"Patrice","O'Neal","p.o'neal@example.com",3334445555,"08/11/2018"
4444,"Jim","Jefferies","j.jefferies@example.com",,"09/11/2001"

Quoted Strings with Commas

When your fields are separated by commas, and one of the fields has a quoted string with a comma in it, it can screw up the parsing of the fields.

The problem in the sample file is in the second line. Someone accidentally entered "Vos, Richard" in a field that is only supposed to contain the last name "Vos". This kind of clerical error happens all the time. Not only is it incorrect, it has a comma in the text which is the character we use to delimit the fields.

if we try to naively break out the fields into separate columns, we see a problem:

cat test.csv | column -t -s ","
id    firstname  lastname     email                      phone                hiredate
1111  "Colin"    "Quinn"      "c.quinn@example.com"      1112223333           "02/12/2017"
2222  "Richard"  "Vos          Richard"                  "r.vos@example.com"  2223334444    "06/27/2019"
3333  "Patrice"  "O'Neal"     "p.o'neal@example.com"     3334445555           "08/11/2018"
4444  "Jim"      "Jefferies"  "j.jefferies@example.com"  "09/11/2001"

The comma in the lastname field was identified as a field as delimiter (field separator), so now the columns on the second line are shifted to the right.

The best way I've found to handle this without using a utility that is specifically for CSV parsing is this:

cat test.csv | sed 's/,\("[^"]*"\)*/<FS>\1/g' | column -t -s "<FS>"
id    firstname  lastname        email                      phone         hiredate
1111  "Colin"    "Quinn"         "c.quinn@example.com"      1112223333    "02/12/2017"
2222  "Richard"  "Vos, Richard"  "r.vos@example.com"        2223334444    "06/27/2019"
3333  "Patrice"  "O'Neal"        "p.o'neal@example.com"     3334445555    "08/11/2018"
4444  "Jim"      "Jefferies"     "j.jefferies@example.com"  "09/11/2001"

Note: Obviously, the last name needs to be corrected, but we don't want mistakes like this to break automated processes. Also, you should be using input validation for anything that takes user input, but this is for when that is not possible.

Empty Fields

On line 4 of the example file, we see that there is no phone number, so we have a field that is null. This is entirely valid CSV, but how do we handle it?

The naive approach of just separating fields by the comma delimiter presents a problem:

cat test.csv | column -t -s ","
id    firstname  lastname     email                      phone                hiredate
1111  "Colin"    "Quinn"      "c.quinn@example.com"      1112223333           "02/12/2017"
2222  "Richard"  "Vos          Richard"                  "r.vos@example.com"  2223334444    "06/27/2019"
3333  "Patrice"  "O'Neal"     "p.o'neal@example.com"     3334445555           "08/11/2018"
4444  "Jim"      "Jefferies"  "j.jefferies@example.com"  "09/11/2001"

The phone field gets populated by the hiredate field.

The best way to handle this is to pass the output through awk while defining the output field separator (OFS) as having a space before the comma:

cat test.csv | awk -F, -vOFS=" ," '{$1=$1}1' | column -t -s ","
id     firstname   lastname      email                       phone                 hiredate
1111   "Colin"     "Quinn"       "c.quinn@example.com"       1112223333            "02/12/2017"
2222   "Richard"   "Vos           Richard"                   "r.vos@example.com"   2223334444    "06/27/2019"
3333   "Patrice"   "O'Neal"      "p.o'neal@example.com"      3334445555            "08/11/2018"
4444   "Jim"       "Jefferies"   "j.jefferies@example.com"                         "09/11/2001"

Notice now that the phone field is blank as it should be.