Wrangle OpenStreetMap Data [MongoDB]

Bella Napoli - Naples, Italy

Author: Shawn P Emhe II

I have lived in the Naples area for a significant portion of my adult life. Originally dubbed Neapolis, or "New City," by the ancient Greeks, Naples is anything but. The tight winding streets and dense population should provide interesting challenges for this project.

The map for the project can be found here.

Problems with the Data

Running audits on the Naples area dataset revealed several obstacles that needed to be worked through before the data could be wrangled into a MongoDB database. These problems ranged from the wide array of street types used in the region to inconsistent capitalization by map contributors and over-abbreviation in street names. There were even conflicts in the way contributors tagged the city's most famous gastronomic contribution. In addition, Italy has several different phone formats, making phone number validation a feat of its own.

An audit module was written to scan the file and identify errors in the data.

Street Names

The Italian format for street names places the street type at the beginning of the name. In some situations the street names are also prefaced with a Roman numeral to indicate their order. Small cross streets frequently use this syntax. For example, "III Traversa Pisciarella" indicates that the road is the third street to cross Via Pisciarella.

After auditing the OSM file and building a list of expected street types, only a few errors were found in the street types:

  • In some situations, "Prima" and "Seconda" (first and second) were used where Roman numerals were expected.
  • "Via" was found misspelled as "Viia".

These errors were fixed by mapping the corrections in a dictionary.

st_name_mapping = {
    'Prima': 'I',
    'Seconda': 'II',
    'viia': 'Via',
    'Viia': 'Via'
}

However, other formatting problems were found:

  • Capitalization was inconsistent.
  • Street names used abbreviations whose meanings weren't apparent.

Correcting capitalization was made easy by converting the strings to title case. A regular expression was used to ensure that Roman numerals were fixed after this step; e.g., "Iv" needed to be converted back to "IV" after title case was applied.

The abbreviations required more attention. They were handled by building a regular expression that captured the abbreviation along with the immediately following word, which provided the context needed before making a correction. The full street names were tracked down and used to build a dictionary.

abbreviations = {
    "A. De": "Antonio De",
    "A. S. Novaro": "Angelo Silvio Novaro",
    "B. V. Romano": "Beato Vincenzo Romano",
    "G. Di": "Gaspare Di",
    "G. Marotta": "Giuseppe Marotta",
    "G. Melisurgo": "Guglielmo Melisurgo",
    "S. Angelo": "Sant'Angelo",
    "S.Agnese": "Sant'Agnese",
    "S.Ignazio": "Sant'Ignazio"
}
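
The repair functions below rely on three precompiled regular expressions that are not shown in this report. A plausible reconstruction, inferred from how they are used (the audit module's actual patterns may differ), is:

import re

# Roman numerals left in mixed case by title casing, e.g. "Iv" or "Iii"
roman_numeral_re = re.compile(r'\b[IVXivx]+\b')

# Group 1 captures an optional Roman numeral prefix ("III "),
# group 2 captures the street type that follows ("Traversa")
street_type_re = re.compile(r'^((?:[IVX]+ )?)(\S+)')

# One or two initials plus the following word, e.g. "A. De" or "S.Agnese"
over_abbr_re = re.compile(r'\b[A-Z]\.\s?(?:[A-Z]\.\s?)?\w+')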

The final functions for repairing street names and abbreviations are below. Notice that the update_short_name function calls the update_street_name function to ensure that it has a properly formatted street name before expanding the abbreviation.

def update_street_name(name):
    """Fixes capitalization and common misspellings in a street name

    :param name: street name to correct
    :return: corrected street name
    """
    # Convert lowercase street names to title case
    if name.islower():
        name = name.title()

    # Find Roman Numerals and make sure they are capitalized
    name = roman_numeral_re.sub(lambda x: str.upper(x.group()), name)

    # retrieve the street type
    m = street_type_re.search(name)
    street_type = m.group(2)
    if street_type in st_name_mapping:
        replacement = r'\1' + st_name_mapping[street_type]
        name = street_type_re.sub(replacement, name)
    elif street_type.islower():
        replacement = r'\1' + street_type.capitalize()
        name = street_type_re.sub(replacement, name)

    return name


def update_short_name(name):
    """Expands an abbreviated name to full length

    :param name: Abbreviated name to expand
    :return: Unabbreviated name
    """
    # First verify that the common errors have been fixed
    name = update_street_name(name)

    # Find the abbreviation to replace
    m = over_abbr_re.search(name)
    if m:
        if m.group() in abbreviations:
            name = over_abbr_re.sub(abbreviations[m.group()], name)

    return name
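
Assuming the patterns sketched above, a quick check with illustrative inputs (these names are examples, not pulled from the dataset) behaves as expected:

print update_short_name('via A. De Curtis')          # Via Antonio De Curtis
print update_street_name('ii traversa pisciarella')  # II Traversa Pisciarella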

Cuisine Types

Auditing the cuisine type tags revealed a feature of Italian dining that I had already discovered while living in the country: there is a strong national pride in the local cuisine, and the vast majority of restaurants serve Italian food. However, there were some anomalies in the way even the local food was tagged.

  • There were three different tags for pizza:
    • pizza
    • italian_pizza
    • fried_pizza

The "italian_pizza" tag can safely be collapsed to "pizza", since the qualifier adds no information. However, from experience I know that fried pizza is an option that not all pizzerias offer, so that tag is worth keeping.

  • There were multiple ways of tagging the local food:
    • italian
    • regional
    • regional,_italian

"Regional" is also relavent because many restaurants like to distinguish themselves as being representative of the classic dishes of their area. However, "regional italian" can be shortened to "regional" without losing any information and increase conformity of the tags.

The following dictionary and function were created to update the cuisine tags.

cuisine_types = {
    "italian_pizza": "pizza",
    "regional,_italian": "regional"
}

def update_cuisine(cuisine_type):
    """
    Updates a cuisine type
    :param cuisine_type: type to fix
    :return: corrected cuisine type
    """
    if cuisine_type in cuisine_types:
        return cuisine_types[cuisine_type]
    else:
        return cuisine_type
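
A quick sanity check against the tag values found during the audit:

print update_cuisine('italian_pizza')      # pizza
print update_cuisine('regional,_italian')  # regional
print update_cuisine('seafood')            # seafood (passes through unchanged)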

Phone Numbers

A unique feature of Italian phone numbers is that their lengths can vary. For this reason, I decided the best way to store them was without dashes or spaces, so that a separate format did not need to be managed for each possible length.

During the audit multiple discrepancies were found in the way phone numbers had been added. Out of 612 numbers:

  • 589 numbers included a country code
  • 572 numbers had spaces
  • 2 had dashes
  • 3 were missing the landline prefix (landlines are required to start with 0, while cellular numbers all start with 3)
  • 1 number was found with the wrong number of digits

A function was created to conform the numbers. It returns None when a number has an invalid length, to avoid keeping incorrect data.

def update_number(number):
    """
    Corrects number formatting
    :param number: phone number to reformat
    :return: phone number in +<country code><number> format
    """
    # remove non-numeric characters
    number = re.sub(r'[^\d]', '', number)

    # remove country code for length checking
    number = re.sub(r'^39', '', number)

    # Return None if the number has an invalid length
    if not 6 <= len(number) <= 11:
        return None

    # Verify landlines include a 0 prefix.
    # A land line is any number not starting with a 3.
    number = re.sub(r'^([^03]\d+)', r'0\1', number)

    # Insert country code
    number = "+39" + number

    return number
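
For example, with some made-up inputs (not from the dataset) the function behaves as follows:

print update_number('+39 081 555-1234')  # +390815551234
print update_number('81 5551234')        # +390815551234 (landline 0 restored)
print update_number('12345')             # None (invalid length)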

The One That Got Away

Later, while exploring the data in MongoDB, I discovered an unexpected type value.

napoli.distinct('type')
[u'node', u'way', u'civil']

I recalled from the sample project in the course that the author had found an instance of second-level "k" tags overwriting top-level data. To fix this, I added a check to the data.py module that prepends the string "tag_" to any "k" tag that already exists in the node.

if k in node:
    k = 'tag_' + k

This logic corrects the problem without having to anticipate which "k" values will cause conflicts.
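
For context, a minimal sketch of how that check fits into the tag-processing loop (the actual data.py contains more shaping logic than shown here):

def shape_tags(element, node):
    """Copies second-level tag data into the node dictionary
    without clobbering existing top-level fields."""
    for tag in element.iter('tag'):
        k = tag.attrib['k']
        # Prefix conflicting keys so a tag like type='civil'
        # cannot overwrite the top-level 'node'/'way' type field
        if k in node:
            k = 'tag_' + k
        node[k] = tag.attrib['v']
    return node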

Exploring the Data

The napoli.osm XML file was 91.6 MB. The processing steps resulted in a 101 MB JSON file, which was imported into a collection named "napoli" in the "openstreetmap" database with the mongoimport command below.

'C:\Program Files\MongoDB\Server\3.6\bin\mongoimport.exe' /d openstreetmap /c napoli /file:.\napoli.osm.json
2018-05-06T14:19:59.322+0200    connected to: localhost
2018-05-06T14:20:00.672+0200    [####....................] openstreetmap.napoli 19.2MB/101MB (18.9%)
2018-05-06T14:20:03.671+0200    [#############...........] openstreetmap.napoli 58.8MB/101MB (58.1%)
2018-05-06T14:20:06.670+0200    [########################] openstreetmap.napoli 101MB/101MB (100.0%)
2018-05-06T14:20:06.677+0200    [########################] openstreetmap.napoli 101MB/101MB (100.0%)
2018-05-06T14:20:06.678+0200    imported 478915 documents

The output showed that 478,915 documents were imported.

Connect to MongoDB

The pymongo library provides tools that can be used to explore the database with Python.

In [1]:
# Import necessary modules
from pymongo import MongoClient, GEO2D
import numpy as np

# Create the client connection
client = MongoClient('mongodb://localhost:27017')

# Create a reference to the database and collection
db = client.openstreetmap
napoli = client.openstreetmap.napoli

Database Statistics

In [2]:
# Convert bytes to megabytes
def bytes_to_mb(size):
    return size / 1024 / 1024

# The db.command() method runs MongoDB commands on the database.
# The collstats command returns a dictionary of collection statistics which contains 'size'
coll_size = bytes_to_mb(db.command('collstats', 'napoli')['size'])

# The distinct method returns unique values from a field. Here dot notation is used to get 
# all of the unique uid's from the created objects, and then the length is captured
n_users = len(napoli.distinct('created.uid'))

# The count method returns the number of results for each query.
n_nodes = napoli.count({'type': 'node'})
n_ways = napoli.count({'type': 'way'})

print 'Collection Size: {} MB'.format(coll_size)
print 'Unique Users: {}'.format(n_users)
print 'Nodes: {}'.format(n_nodes)
print 'Ways: {}'.format(n_ways)
Collection Size: 109 MB
Unique Users: 715
Nodes: 412185
Ways: 66730

10 Most Common Cuisines

MongoDB uses Aggregation Pipelines to run queries that aggregate totals. The pipelines break the process down into a list of steps.

In [3]:
# Create a list of aggregation steps to feed as a pipeline
pipeline = [
    # because the cuisine field is an array, it needs to be "unwound" before aggregation
    {'$unwind': '$cuisine'},
    # aggregate by cuisine type and count each type
    {'$group': {
        '_id': '$cuisine',
        'count': {'$sum':1}
        }
    },
    # sort descending
    {'$sort': {'count': -1}},
    # limiting to top 10
    {'$limit': 10}
]

cuisine_counts = napoli.aggregate(pipeline)

# Lambda function to modify the output of aggregate
pretty_count = lambda x: (x['_id'].encode('utf-8'), x['count'])

for count in cuisine_counts:
    print(pretty_count(count))
('italian', 104)
('pizza', 91)
('regional', 25)
('seafood', 11)
('coffee_shop', 9)
('kebab', 7)
('burger', 6)
('fish', 5)
('pasta', 3)
('sandwich', 3)

Mangiamo Italiano

While Kebab and Burger have managed to make their way into the top 10, the rest is an all-Italian affair. Sandwiches are also arguably international, but even they have a unique Italian flavor here.

Pizza on Every Corner

Sometimes it feels like Napoli has pizzerias the way U.S. cities have Starbucks. But MongoDB can use geospatial queries to make quantifiable measurements.

First, a geospatial index needs to be created using the "pos" data.

napoli.create_index([('pos', GEO2D)])

Then we can collect all of the pizzerias in the city of Napoli and store their positions. The full dataset includes rural surrounding areas, so the search will be restricted to the city itself.

In [4]:
pizzeria_locations = list(napoli.find(
    {'cuisine': 'pizza', 'pos': {'$exists': 1}, 'address.city': 'Napoli'},
    {'pos': 1}))

The $geoNear aggregation stage can be used to query for the nearest pizzeria to a location. The code below loops through all of the pizzerias and builds a list of distances to the next closest one.

In [5]:
distances = []
for pizzeria in pizzeria_locations:
    pipeline = [
        {
            '$geoNear': {
                'near': pizzeria['pos'],  # use the current pizzeria's position
                'query': {
                    'cuisine': 'pizza',  # find only pizzerias
                    '_id': {'$ne': pizzeria['_id']}},  # but exclude the current pizzeria
                'distanceMultiplier': 6371,  # Earth's radius in km, used to convert the result from radians
                'distanceField': 'distance.calculated',
                'num': 1,  # limit to only 1 result
            }
        },
        {'$project': {'distance.calculated': 1}}
    ]
    distances.append(napoli.aggregate(pipeline).next()['distance']['calculated'])

Output statistics:

In [6]:
print 'The average distance is {:.2f} km'.format(np.mean(distances))
print 'The closest pizzerias were {:.2f} km'.format(np.min(distances))
print 'The farthest were {:.2f} km'.format(np.max(distances))
The average distance is 12.44 km
The closest pizzerias were 0.84 km
The farthest were 76.67 km

The average and minimum distances are higher than I expected. I suspect this is an indication of incomplete data; if more restaurants were accurately tagged, the results would likely show that the locations are much closer together.

Additional Improvements

Querying the documents with "tourism" information revealed that aside from museums, the dataset contains very little information that could help a user explore Naples's storied past. It is possible that some of the sites tagged "attraction" could be historical, but this would require extra digging on the user's part.

In [7]:
pipeline = [
    {'$match': {
        'tourism': {'$exists': 1}
    }},
    {'$group': {
        '_id': '$tourism',
        'count': {'$sum': 1}
    }}
]

for count in napoli.aggregate(pipeline):
    print(pretty_count(count))
('gallery', 1)
('apartment', 4)
('hostel', 13)
('theme_park', 3)
('caravan_site', 4)
('artwork', 10)
('motel', 4)
('yes', 4)
('information', 16)
('picnic_site', 5)
('aquarium', 1)
('viewpoint', 48)
('museum', 49)
('hotel', 163)
('guest_house', 175)
('camp_site', 3)
('attraction', 23)

The Comune di Napoli - Open Data site has information that could enhance the dataset. Specifically, it has shapefiles for UNESCO zones, the historic center, and archaeological areas. This information could be imported and tagged for its historical significance. However, it would require additional wrangling to convert the shapefiles into JSON files so that they could be read into MongoDB.
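
As a sketch of that conversion, assuming the geopandas library and a hypothetical shapefile named unesco_zones.shp:

import geopandas as gpd

# Read the shapefile (the filename is hypothetical)
zones = gpd.read_file('unesco_zones.shp')

# Write it back out as GeoJSON
zones.to_file('unesco_zones.json', driver='GeoJSON')

The resulting file is a single GeoJSON FeatureCollection, so one further step would be needed to emit one document per feature before running mongoimport.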