Data Cleanup Automation Fun

I’m still playing with LANTA’s bus routes data, girding myself for — eventually — adding the bus routes into OpenStreetMap, but I recently decided to rebuild my data, basically starting over from scratch with the PDF’s I got from the LANTA website. My current workflow is:

PDF  —> CSV —> cleaned-up CSV —> GeoJSON —> cleaned-up GeoJSON —> (eventually) a PostGIS database

The conversion from PDF to CSV was automatic, using a Java program I found online, and the CSV cleanup — fixing things like transposed latitude/longitude, missing minus signs etc — was done manually (using a text editor and LibreOffice Calc), which was relatively uneventful but  laborious — there are 68 individual bus routes, each with its own file.

Things got even more laborious with the next two tasks. My first go-around with the conversion to GeoJSON was done manually within QGIS: load each of the CSV files, individually filling out the required parameters for each one. I wasn’t looking forward to converting the files individually, so I wrote a Python script to save all my route layers in GeoJSON format. (Just as an aside, I have to say I really like GeoJSON as a vector file format: I find it much easier to work with than the dated, unwieldy Shapefile standard; it’s also easier to open and work with in the JOSM editor. All my new data, if it’s not going into the database, is getting stored as GeoJSON files.)

The “GeoJSON cleanup” is where I massage the data into the forms I want: some of the table columns are unnecessary, some need to be renamed, there are a few extra columns to add (and populate), and finally I wanted to convert the LANTA bus stop names format (in  Robbie-the-Robot-style ALL CAPS) to something a little easier o the eyes. Doing this manually would have been beyond laborious, so I wrote another Python script to massage the route files. This turned out to be more of a learning experience — as in, multiple versions of the program failed spectacularly until I got it right — and probably took longer than the brute force, manual changes approach would have, but at least it wasn’t laborious…

I’m still not really sure why this version worked when others did not, but here is my code:

# script to run through all visible layers,
# adding/deleting/renaming fields as required
# to convert from LANTA bus data to something
# more like OpenStreetMap bus stop attributes
# it also properly capitalizes feature names

from PyQt4.QtCore import QVariant
import re

# some functions and regular expressions for string manipulation
def match_lower( matchobj ):
def match_upper ( matchobj ):
reg = re.compile( 'Lati|Longi|Time|At Street|On Street|Direct|Placem' )
reg_ns = re.compile( r'\(ns|fs|mid|off\)|\bncc\b|\bhs\b', re.IGNORECASE )
reg_nth = re.compile( r'[1-9]?[0-9][a-z]{,2}', re.IGNORECASE )
reg_leftParen = re.compile( r'([^\s])(\(\w*\))' )
reg_rightParen = re.compile( r'(\))([^\s])' )
reg_space = re.compile( r'\s+' )

for layer in iface.mapCanvas().layers():

# reset these variables for each new layer processed

myLayerName =
highwayExists = False
networkExists = False
operatorExists = False
publicExists = False
busExists = False
routeExists = False

pr = layer.dataProvider()

# change the names of some fields
fields = pr.fields()
count = 0
for field in fields:
fieldName =
print "test for changing field name " + fieldName + " count ", count
if ( fieldName == 'Public Information Name' ):
print 'changing to name'
layer.renameAttribute( count, 'name' )
if ( fieldName == 'Stop Number' ):
print 'changing to ref'
layer.renameAttribute( count, 'ref' )
if ( fieldName == 'Stop Order' ):
print 'changing to stop_order'
layer.renameAttribute( count, 'stop_order' )
count += 1

pr = layer.dataProvider()

# delete some fields
fields = pr.fields()
count = 0
for field in fields:
fieldName =
print "test for deleting fields " + fieldName + " count ", count
m = reg.match( fieldName )
if m:
print fieldName
delList.append( count )
print delList
count += 1

pr = layer.dataProvider()

# add some fields, checking if they don't already exist
count = 0
fields = pr.fields()
for field in fields:
fieldName =
print "test for adding fields " + fieldName + " count ", count
if( fieldName == 'highway' ):
highwayExists = True
print 'highway', count
if( fieldName == 'network' ):
networkExists = True
print 'network', count
if(fieldName == 'public_transport' ):
publicExists = True
if(fieldName == 'bus' ):
busExists = True
if ( fieldName == 'operator' ):
operatorExists = True
if( == 'route' ):
routeExists = True
count += 1
if( not highwayExists ):
print "adding highway"
pr.addAttributes( [ QgsField("highway", QVariant.String) ] )
if( not networkExists ):
print "adding network"
pr.addAttributes( [ QgsField("network", QVariant.String) ] )
if( not operatorExists ):
print "adding operator"
pr.addAttributes( [ QgsField("operator", QVariant.String) ] )
if( not publicExists ):
print "adding public_transportation"
pr.addAttributes( [ QgsField("public_transport", QVariant.String) ] )
if( not busExists ):
print "adding bus"
pr.addAttributes( [ QgsField("bus", QVariant.String) ] )
if not routeExists:
print "adding route"
pr.addAttributes( [ QgsField('route', QVariant.String) ] )

# add attributes to new fields for all features
for feature in layer.getFeatures():
feature['highway'] = "bus_stop"
feature['network'] = 'LANTA'
feature['operator'] = 'Lehigh and Northampton Transportation Authority'
feature['public_transport'] = 'platform'
feature['bus'] = 'yes'
feature['route'] = myLayerName

# clean up text in name field
fieldName = 'name'
for feature in layer.getFeatures():
myString = feature[fieldName]
myString = myString.title()
myString = reg_ns.sub( match_upper, myString )
myString = reg_nth.sub( match_lower, myString )
myString = reg_space.sub( ' ', myString )
myString = reg_leftParen.sub( r'\1 \2 ', myString )
mystring = reg_rightParen.sub( r'\1 \2', myString )
print myString
feature[fieldName] = myString


Once I got this up and running, I realized that I wanted t do some more preliminary cleanup on my spreadsheets, so I was back to square one. I couldn’t really find out how to do a bulk load of my CSV files into QGIS, and I realized that QGIS was just using ogr2ogr under the hood, so I decided to do the bulk converting, CSV to GeoJSON, with a shell script that calls ogr2ogr. Yet another learning curve later, and it works great. More code:

for i in *.csv
echo $i
tail -n+2 $i | ogr2ogr -nln ${i%.csv} -f "GeoJSON" ${i%csv}geojson \
ogrinfo ${i%csv}geojson

It struck me then, that all my data was really text, and so working with it in a more unix-ey fashion, with shell scripting and text manipulation programs (sed, awk) to do the conversions directly from the CSV files, was probably my better strategy. Oh well, I did the first part of it with bash at least, and the Python script works well enough.


I did it anyway, using awk to add/subtract/rename/Capitalize the CSV data before running it through ogr2ogr. The code (below) is about 35 lines (as opposed to 150 for the Python script, which only does part of the job anyway) and runs really fast:

#! /bin/bash
for i in *.csv
echo Converting $i to GeoJSON
cat $i | awk -F "," -v rtname=${i%.csv} 'BEGIN {
$2 != "" && /Stop/ {
print "ref,Latitude,Longitude,name,stop_order,highway,public_transport,bus,network,operator,route"
$2 != "" && /^[0-9]{4,4}/ {
string = tolower($8)
n=split(string,a," ")
string=toupper(substr(a[1],1,1)) substr(a[1],2)
for(i=2;i<=n;i++) {
string = string " " toupper(substr(a[i],1,1)) substr(a[i],2)
t_str = "\(ns\)|\(fs\)|\(mid\)|\b[nN]cc\b|\b[Hh]s\b"
if ( match( string, t_str, match_array ) ) {
the_match = toupper(match_array[0])
gsub(t_str, the_match, string)
gsub(/ *\(/, " (", string)
$8 = string
print $1 "," $2 "," $3 "," $8 "," $9 ",bus_stop,platform,yes,LANTA,Lehigh and Northampton Transportation Authority," rtname
}' > test1.csv
cat test1.csv
ogr2ogr -nln ${i%.csv} -f "GeoJSON" ${i%csv}geojson test1.csv -oo KEEP_GEOM_COLUMNS=NO
ogrinfo ${i%csv}geojson
rm test1.csv

Comments are closed.