• Tag Archives linux
  • Linux/computer stuff.

  • Data Cleanup Automation Fun

    I’m still playing with LANTA’s bus routes data, girding myself for — eventually — adding the bus routes into OpenStreetMap, but I recently decided to rebuild my data, basically starting over from scratch with the PDF’s I got from the LANTA website. My current workflow is:

    PDF  —> CSV —> cleaned-up CSV —> GeoJSON —> cleaned-up GeoJSON —> (eventually) a PostGIS database

    The conversion from PDF to CSV was automatic, using a Java program I found online, and the CSV cleanup — fixing things like transposed latitude/longitude, missing minus signs etc — was done manually (using a text editor and LibreOffice Calc), which was relatively uneventful but  laborious — there are 68 individual bus routes, each with its own file.

    Things got even more laborious with the next two tasks. My first go-around with the conversion to GeoJSON was done manually within QGIS: load each of the CSV files, individually filling out the required parameters for each one. I wasn’t looking forward to converting the files individually, so I wrote a Python script to save all my route layers in GeoJSON format. (Just as an aside, I have to say I really like GeoJSON as a vector file format: I find it much easier to work with than the dated, unwieldy Shapefile standard; it’s also easier to open and work with in the JOSM editor. All my new data, if it’s not going into the database, is getting stored as GeoJSON files.)

    The “GeoJSON cleanup” is where I massage the data into the forms I want: some of the table columns are unnecessary, some need to be renamed, there are a few extra columns to add (and populate), and finally I wanted to convert the LANTA bus stop names format (in  Robbie-the-Robot-style ALL CAPS) to something a little easier o the eyes. Doing this manually would have been beyond laborious, so I wrote another Python script to massage the route files. This turned out to be more of a learning experience — as in, multiple versions of the program failed spectacularly until I got it right — and probably took longer than the brute force, manual changes approach would have, but at least it wasn’t laborious…

    I’m still not really sure why this version worked when others did not, but here is my code:

    # script to run through all visible layers,
    # adding/deleting/renaming fields as required
    # to convert from LANTA bus data to something
    # more like OpenStreetMap bus stop attributes
    # it also properly capitalizes feature names

    from PyQt4.QtCore import QVariant
    import re

    # some functions and regular expressions for string manipulation
    def match_lower( matchobj ):
    return matchobj.group().lower()
    def match_upper ( matchobj ):
    return matchobj.group().upper()
    reg = re.compile( 'Lati|Longi|Time|At Street|On Street|Direct|Placem' )
    reg_ns = re.compile( r'\(ns|fs|mid|off\)|\bncc\b|\bhs\b', re.IGNORECASE )
    reg_nth = re.compile( r'[1-9]?[0-9][a-z]{,2}', re.IGNORECASE )
    reg_leftParen = re.compile( r'([^\s])(\(\w*\))' )
    reg_rightParen = re.compile( r'(\))([^\s])' )
    reg_space = re.compile( r'\s+' )

    for layer in iface.mapCanvas().layers():

    # reset these variables for each new layer processed

    myLayerName = layer.name()
    highwayExists = False
    networkExists = False
    operatorExists = False
    publicExists = False
    busExists = False
    routeExists = False
    delList=[]

    layer.startEditing()
    pr = layer.dataProvider()

    # change the names of some fields
    fields = pr.fields()
    count = 0
    for field in fields:
    fieldName = field.name()
    print "test for changing field name " + fieldName + " count ", count
    if ( fieldName == 'Public Information Name' ):
    print 'changing to name'
    layer.renameAttribute( count, 'name' )
    if ( fieldName == 'Stop Number' ):
    print 'changing to ref'
    layer.renameAttribute( count, 'ref' )
    if ( fieldName == 'Stop Order' ):
    print 'changing to stop_order'
    layer.renameAttribute( count, 'stop_order' )
    count += 1
    layer.updateFields()

    layer.commitChanges()
    layer.reload()
    layer.startEditing()
    pr = layer.dataProvider()

    # delete some fields
    fields = pr.fields()
    count = 0
    for field in fields:
    fieldName = field.name()
    print "test for deleting fields " + fieldName + " count ", count
    m = reg.match( fieldName )
    if m:
    print fieldName
    delList.append( count )
    print delList
    count += 1
    pr.deleteAttributes(delList)
    layer.updateFields()

    layer.commitChanges()
    layer.reload()
    layer.startEditing()
    pr = layer.dataProvider()

    # add some fields, checking if they don't already exist
    count = 0
    fields = pr.fields()
    for field in fields:
    fieldName = field.name()
    print "test for adding fields " + fieldName + " count ", count
    if( fieldName == 'highway' ):
    highwayExists = True
    print 'highway', count
    if( fieldName == 'network' ):
    networkExists = True
    print 'network', count
    if(fieldName == 'public_transport' ):
    publicExists = True
    if(fieldName == 'bus' ):
    busExists = True
    if ( fieldName == 'operator' ):
    operatorExists = True
    if( field.name() == 'route' ):
    routeExists = True
    count += 1
    if( not highwayExists ):
    print "adding highway"
    pr.addAttributes( [ QgsField("highway", QVariant.String) ] )
    layer.updateFields()
    if( not networkExists ):
    print "adding network"
    pr.addAttributes( [ QgsField("network", QVariant.String) ] )
    if( not operatorExists ):
    print "adding operator"
    pr.addAttributes( [ QgsField("operator", QVariant.String) ] )
    if( not publicExists ):
    print "adding public_transportation"
    pr.addAttributes( [ QgsField("public_transport", QVariant.String) ] )
    if( not busExists ):
    print "adding bus"
    pr.addAttributes( [ QgsField("bus", QVariant.String) ] )
    if not routeExists:
    print "adding route"
    pr.addAttributes( [ QgsField('route', QVariant.String) ] )
    layer.updateFields()

    # add attributes to new fields for all features
    for feature in layer.getFeatures():
    feature['highway'] = "bus_stop"
    feature['network'] = 'LANTA'
    feature['operator'] = 'Lehigh and Northampton Transportation Authority'
    feature['public_transport'] = 'platform'
    feature['bus'] = 'yes'
    feature['route'] = myLayerName
    layer.updateFeature(feature)
    layer.updateFields()

    # clean up text in name field
    fieldName = 'name'
    for feature in layer.getFeatures():
    myString = feature[fieldName]
    myString = myString.title()
    myString = reg_ns.sub( match_upper, myString )
    myString = reg_nth.sub( match_lower, myString )
    myString = reg_space.sub( ' ', myString )
    myString = reg_leftParen.sub( r'\1 \2 ', myString )
    mystring = reg_rightParen.sub( r'\1 \2', myString )
    print myString
    feature[fieldName] = myString
    layer.updateFeature(feature)

    layer.commitChanges()
    layer.reload()

    Once I got this up and running, I realized that I wanted t do some more preliminary cleanup on my spreadsheets, so I was back to square one. I couldn’t really find out how to do a bulk load of my CSV files into QGIS, and I realized that QGIS was just using ogr2ogr under the hood, so I decided to do the bulk converting, CSV to GeoJSON, with a shell script that calls ogr2ogr. Yet another learning curve later, and it works great. More code:

    for i in *.csv
    do
    echo $i
    tail -n+2 $i | ogr2ogr -nln ${i%.csv} -f "GeoJSON" ${i%csv}geojson \
    CSV:/vsistdin/ -oo X_POSSIBLE_NAMES=Lon* -oo Y_POSSIBLE_NAMES=Lat* -oo KEEP_GEOM_COLUMNS=NO
    ogrinfo ${i%csv}geojson
    done

    It struck me then, that all my data was really text, and so working with it in a more unix-ey fashion, with shell scripting and text manipulation programs (sed, awk) to do the conversions directly from the CSV files, was probably my better strategy. Oh well, I did the first part of it with bash at least, and the Python script works well enough.

    UPDATE:

    I did it anyway, using awk to add/subtract/rename/Capitalize the CSV data before running it through ogr2ogr. The code (below) is about 35 lines (as opposed to 150 for the Python script, which only does part of the job anyway) and runs really fast:

    #! /bin/bash
    for i in *.csv
    do
    echo Converting $i to GeoJSON
    cat $i | awk -F "," -v rtname=${i%.csv} 'BEGIN {
    }
    $2 != "" && /Stop/ {
    print "ref,Latitude,Longitude,name,stop_order,highway,public_transport,bus,network,operator,route"
    }
    $2 != "" && /^[0-9]{4,4}/ {
    string = tolower($8)
    n=split(string,a," ")
    string=toupper(substr(a[1],1,1)) substr(a[1],2)
    for(i=2;i<=n;i++) {
    string = string " " toupper(substr(a[i],1,1)) substr(a[i],2)
    }
    t_str = "\(ns\)|\(fs\)|\(mid\)|\b[nN]cc\b|\b[Hh]s\b"
    if ( match( string, t_str, match_array ) ) {
    the_match = toupper(match_array[0])
    gsub(t_str, the_match, string)
    }
    gsub(/ *\(/, " (", string)
    $8 = string
    print $1 "," $2 "," $3 "," $8 "," $9 ",bus_stop,platform,yes,LANTA,Lehigh and Northampton Transportation Authority," rtname
    }' > test1.csv
    cat test1.csv
    ogr2ogr -nln ${i%.csv} -f "GeoJSON" ${i%csv}geojson test1.csv -oo KEEP_GEOM_COLUMNS=NO
    ogrinfo ${i%csv}geojson
    rm test1.csv
    done