Bella Napoli - Naples, Italy
Author: Shawn P Emhe II
I have lived in the Naples area for a significant portion of my adult life. Originally dubbed Neapolis, or "New City," by the ancient Greeks, Naples is anything but. The tight, winding streets and dense population should provide interesting challenges for this project.
The map for the project can be found here.
Running audits on the Naples area dataset revealed several obstacles that needed to be worked through before the data could be wrangled into a MongoDB database. The problems ranged from the wide array of street types used in the region and inconsistent capitalization by map contributors to over-abbreviated street names. There were even conflicts in the way contributors tagged the city's most famous gastronomic contribution. In addition, Italy has several different phone formats, making phone number validation a feat of its own.
An audit module was used to scan the file and find errors in the data.
The Italian format for street names places the street type at the beginning of the name. In some situations the street names are also prefaced with a Roman numeral to indicate cadence; small traversal streets frequently use this syntax. For example, "III Traversa Pisciarella" indicates that the road is the third street to cross Via Pisciarella.
After auditing the OSM file and building a list of expected street types, only a few errors were found in the street types. These errors were removed by mapping the corrections in a dictionary.
st_name_mapping = {
    'Prima': 'I',
    'Seconda': 'II',
    'viia': 'Via',
    'Viia': 'Via'
}
However, other formatting problems were found.
Correcting capitalization was made easy by converting the strings to title case. A regular expression was used to ensure that Roman numerals were fixed after this step; e.g., "Iv" needed to be converted back to "IV" after title case was applied.
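The Roman-numeral pattern itself is not shown in the excerpt; a minimal sketch of what roman_numeral_re might look like (my assumption, not the module's exact pattern):

```python
import re

# Assumed pattern: a whole word made only of Roman-numeral letters,
# which title case mangles into forms like "Iv" or "Iii"
roman_numeral_re = re.compile(r'\b[IVXivx]+\b')

name = 'iv traversa pisciarella'.title()   # -> 'Iv Traversa Pisciarella'
name = roman_numeral_re.sub(lambda m: m.group().upper(), name)
print(name)  # IV Traversa Pisciarella
```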
The abbreviations required more attention. They were handled by building a regular expression that captured the abbreviation along with the immediately following word. This allowed for capturing the context before making a correction. The full street names were tracked down and used to build a dictionary.
abbreviations = {
    "A. De": "Antonio De",
    "A. S. Novaro": "Angelo Silvio Novaro",
    "B. V. Romano": "Beato Vincenzo Romano",
    "G. Di": "Gaspare Di",
    "G. Marotta": "Giuseppe Marotta",
    "G. Melisurgo": "Guglielmo Melisurgo",
    "S. Angelo": "Sant'Angelo",
    "S.Agnese": "Sant'Agnese",
    "S.Ignazio": "Sant'Ignazio"
}
The final functions for repairing street names and abbreviations are below. Notice that the update_short_name function calls the update_street_name function to ensure that it has a properly formatted street name before expanding the abbreviation.
def update_street_name(name):
    # Convert lowercase street names to title case
    if name.islower():
        name = name.title()
    # Find Roman numerals and make sure they are capitalized
    name = roman_numeral_re.sub(lambda x: str.upper(x.group()), name)
    # Retrieve the street type
    m = street_type_re.search(name)
    street_type = m.group(2)
    if street_type in st_name_mapping:
        replacement = r'\1' + st_name_mapping[street_type]
        name = street_type_re.sub(replacement, name)
    elif street_type.islower():
        replacement = r'\1' + street_type.capitalize()
        name = street_type_re.sub(replacement, name)
    return name
def update_short_name(name):
    """Expands an abbreviated name to full length

    :param name: Abbreviated name to expand
    :return: Unabbreviated name
    """
    # First verify that the common errors have been fixed
    name = update_street_name(name)
    # Find the abbreviation to replace
    m = over_abbr_re.search(name)
    if m and m.group() in abbreviations:
        name = over_abbr_re.sub(abbreviations[m.group()], name)
    return name
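The two helper patterns these functions rely on, street_type_re and over_abbr_re, live elsewhere in the audit module. A sketch consistent with how the functions use them, with group 1 holding any Roman-numeral prefix and group 2 the street type (my assumptions, not the module's exact patterns):

```python
import re

# group 1: optional Roman-numeral prefix such as 'III '; group 2: the street type
street_type_re = re.compile(r'^((?:[IVX]+ )?)(\S+)')
# one or more initials like 'S.' or 'A. S.' plus the word that follows, for context
over_abbr_re = re.compile(r'\b(?:[A-Z]\. ?)+\w+')

m = street_type_re.search('III Traversa Pisciarella')
print(m.group(1), m.group(2))                        # 'III ' 'Traversa'
print(over_abbr_re.search('Via S. Angelo').group())  # 'S. Angelo'
```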
Auditing the cuisine type tags revealed a feature of Italian dining that I had already discovered while living in the country. There is a strong national pride in the local cuisine, and the vast majority of restaurants serve Italian food. However, there were some anomalies in the way even the local food was tagged.
The "italian" qualifier on the pizza tag can easily be dropped as superfluous information. However, from experience I know that fried pizza is an option that not all pizzerias offer and is worth keeping.
"Regional" is also relavent because many restaurants like to distinguish themselves as being representative of the classic dishes of their area. However, "regional italian" can be shortened to "regional" without losing any information and increase conformity of the tags.
The following dictionary and function were created to update the cuisine tags.
cuisine_types = {
    "italian_pizza": "pizza",
    "regional,_italian": "regional"
}
def update_cuisine(cuisine_type):
    """
    Updates a cuisine type

    :param cuisine_type: type to fix
    :return: corrected cuisine type
    """
    if cuisine_type in cuisine_types:
        return cuisine_types[cuisine_type]
    return cuisine_type
A unique feature of Italian phone numbers is that their lengths can vary. For this reason I decided that the best way to store them was without dashes or spaces. That way a different format did not need to be managed for each possible phone number length.
During the audit, multiple discrepancies were found in the way phone numbers had been added across the 612 numbers in the dataset.
A function was created to conform the numbers. None is returned when a phone number has an invalid length, to avoid keeping incorrect data.
def update_number(number):
    """
    Corrects number formatting

    :param number: phone number to reformat
    :return: phone number in +<country code><number> format
    """
    # Remove non-numeric characters
    number = re.sub(r'[^\d]', '', number)
    # Remove the country code for length checking
    number = re.sub(r'^39', '', number)
    # Return None if the number is an incorrect length
    if not 6 <= len(number) <= 11:
        return None
    # Verify landlines include a 0 prefix.
    # A landline is any number not starting with a 3.
    number = re.sub(r'^([^03]\d+)', r'0\1', number)
    # Insert the country code
    number = "+39" + number
    return number
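A few hypothetical inputs illustrate the behavior; the helper below repeats the same steps as update_number so the example runs standalone:

```python
import re

def normalize(number):
    # same steps as update_number above, restated for a self-contained demo
    number = re.sub(r'[^\d]', '', number)            # strip non-digits
    number = re.sub(r'^39', '', number)              # drop the country code
    if not 6 <= len(number) <= 11:
        return None                                  # reject bad lengths
    number = re.sub(r'^([^03]\d+)', r'0\1', number)  # restore the landline 0 prefix
    return '+39' + number

print(normalize('+39 081 123 4567'))  # +390811234567 (landline, 0 kept)
print(normalize('333 123 4567'))      # +393331234567 (mobile, starts with 3)
print(normalize('81 1234567'))        # +390811234567 (missing 0 restored)
print(normalize('12345'))             # None (too short)
```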
Later, while exploring the data in MongoDB, I discovered an unexpected type value.
napoli.distinct('type')
[u'node', u'way', u'civil']
I recalled from the sample project in the course that the author had found an instance of second level "k" tags overwriting the top level data. To fix this I added a check to the data.py module that prepended the string "tag_" to any "k" tags that already existed in the node.
if k in node:
    k = 'tag_' + k
This logic corrects the problem without having to anticipate which "k" values will cause conflicts.
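For instance, a way whose child tag has k='type' would otherwise overwrite the element's own type; with the check in place, the tag survives under a new key (a toy illustration):

```python
# toy document standing in for a partially built node
node = {'type': 'way'}

# a second-level tag whose key collides with top-level data
k, v = 'type', 'civil'
if k in node:
    k = 'tag_' + k
node[k] = v

print(node)  # {'type': 'way', 'tag_type': 'civil'}
```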
The napoli.osm XML file was 91.6 MB. The processing steps resulted in a 101 MB JSON file that was imported with the mongoimport command. The command below was used to import the JSON documents into a collection named "napoli" in the "openstreetmap" database.
'C:\Program Files\MongoDB\Server\3.6\bin\mongoimport.exe' /d openstreetmap /c napoli /file:.\napoli.osm.json
2018-05-06T14:19:59.322+0200 connected to: localhost
2018-05-06T14:20:00.672+0200 [####....................] openstreetmap.napoli 19.2MB/101MB (18.9%)
2018-05-06T14:20:03.671+0200 [#############...........] openstreetmap.napoli 58.8MB/101MB (58.1%)
2018-05-06T14:20:06.670+0200 [########################] openstreetmap.napoli 101MB/101MB (100.0%)
2018-05-06T14:20:06.677+0200 [########################] openstreetmap.napoli 101MB/101MB (100.0%)
2018-05-06T14:20:06.678+0200 imported 478915 documents
The output showed that 478,915 documents were imported.
The pymongo library provides tools that can be used to explore the database with python.
# Import necessary modules
from pymongo import MongoClient, GEO2D
import numpy as np
# Create the client connection
client = MongoClient('mongodb://localhost:27017')
# Create references to the database and collection
db = client.openstreetmap
napoli = db.napoli
# Convert bytes to megabytes
def bytes_to_mb(size):
    return size / 1024 / 1024
# The db.command() method runs MongoDB commands on the database.
# The collstats command returns a dictionary of collection statistics which contains 'size'
coll_size = bytes_to_mb(db.command('collstats', 'napoli')['size'])
# The distinct method returns unique values from a field. Here dot notation is used to get
# all of the unique uid's from the created objects, and then the length is captured
n_users = len(napoli.distinct('created.uid'))
# The count method returns the number of results for each query.
n_nodes = napoli.count({'type': 'node'})
n_ways = napoli.count({'type': 'way'})
print 'Collection Size: {} MB'.format(coll_size)
print 'Unique Users: {}'.format(n_users)
print 'Nodes: {}'.format(n_nodes)
print 'Ways: {}'.format(n_ways)
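One caveat with the helper above: under Python 2, bytes_to_mb performs integer division and silently truncates fractional megabytes. Using float constants keeps the precision in either interpreter version:

```python
def bytes_to_mb(size):
    # float constants force true division under both Python 2 and 3
    return size / 1024.0 / 1024.0

print(bytes_to_mb(96468992))  # 92.0
print(bytes_to_mb(1572864))   # 1.5
```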
MongoDB uses Aggregation Pipelines to run queries that aggregate totals. The pipelines break the process down into a list of steps.
# Create a list of aggregation steps to feed as a pipeline
pipeline = [
    # because the cuisine field is an array, it needs to be "unwound" before aggregation
    {'$unwind': '$cuisine'},
    # aggregate by cuisine type and count each type
    {'$group': {
        '_id': '$cuisine',
        'count': {'$sum': 1}
    }},
    # sort descending
    {'$sort': {'count': -1}},
    # limit to the top 10
    {'$limit': 10}
]
cuisine_counts = napoli.aggregate(pipeline)
# Lambda function to modify the output of aggregate
pretty_count = lambda x: (x['_id'].encode('utf-8'), x['count'])
for count in cuisine_counts:
print(pretty_count(count))
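For intuition, the $unwind, $group, $sort, and $limit stages correspond to flattening, counting, and ranking in plain Python (shown here with made-up documents, not the real collection):

```python
from collections import Counter

# made-up stand-ins for documents with an array-valued cuisine field
docs = [
    {'cuisine': ['pizza', 'regional']},
    {'cuisine': ['pizza']},
    {'cuisine': ['kebab']},
]

# $unwind: one record per array element
unwound = [c for doc in docs for c in doc['cuisine']]
# $group with $sum, then $sort by count, then $limit 10
top = Counter(unwound).most_common(10)
print(top)
```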
While kebab and burger have managed to make their way into the top 10, the rest are an all-Italian affair. Sandwiches are also arguably international, but even they have a unique Italian flavor here.
Sometimes it feels like Napoli has pizzerias the way U.S. cities have Starbucks. But MongoDB can use geospatial queries to make quantifiable measurements.
First, a geospatial index needs to be created using the "pos" data.
napoli.create_index([('pos', GEO2D)])
Then we can collect all of the pizzerias in the city of Napoli and store their positions. The full dataset includes rural surrounding areas, so the search will be restricted to the city itself.
pizzaria_locations = list(napoli.find(
    {'cuisine': 'pizza', 'pos': {'$exists': 1}, 'address.city': 'Napoli'},
    {'pos': 1}))
The $geoNear operator can be used to query for the nearest pizzeria to a location. This code loops through all the pizzerias and builds a list of distances to the next closest location.
distances = []
for pizzaria in pizzaria_locations:
    pipeline = [
        {'$geoNear': {
            'near': pizzaria['pos'],               # use the current pizzeria's position
            'query': {
                'cuisine': 'pizza',                # find only pizzerias
                '_id': {'$ne': pizzaria['_id']}},  # but exclude the current pizzeria
            'distanceMultiplier': 6371,            # Earth's radius in km, converts the result from radians
            'distanceField': 'distance.calculated',
            'num': 1,                              # limit to only 1 result
        }},
        {'$project': {'distance.calculated': 1}}
    ]
    distances.append(napoli.aggregate(pipeline).next()['distance']['calculated'])
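The distanceMultiplier of 6371, the Earth's radius in kilometers, converts the radian distances a 2d index returns into km. The conversion can be sanity-checked with a plain haversine implementation (my own sketch, with hypothetical coordinates):

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lon1, lat1, lon2, lat2):
    # great-circle distance in km; 6371 is the same Earth radius used above
    lon1, lat1, lon2, lat2 = map(radians, (lon1, lat1, lon2, lat2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))

# two hypothetical points about 0.02 degrees of longitude apart in central Naples
print(round(haversine_km(14.25, 40.85, 14.27, 40.85), 2))
```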
Output statistics:
print 'The average distance is {:.2f} km'.format(np.mean(distances))
print 'The closest pizzerias were {:.2f} km apart'.format(np.min(distances))
print 'The farthest were {:.2f} km apart'.format(np.max(distances))
The average and minimum distances are higher than I expected. I suspect this indicates incomplete data; if more restaurants were accurately tagged, the results would likely show the locations to be much closer together.
Querying the documents with "tourism" information revealed that aside from museums, the dataset contains very little information that could help a user explore Naples's storied past. It is possible that some of the sites tagged "attraction" could be historical, but this would require extra digging on the user's part.
pipeline = [
    {'$match': {
        'tourism': {'$exists': 1}
    }},
    {'$group': {
        '_id': '$tourism',
        'count': {'$sum': 1}
    }}
]
for count in napoli.aggregate(pipeline):
print(pretty_count(count))
The Comune di Napoli - Open Data site has information that could enhance the dataset. Specifically, it has shapefiles for UNESCO zones, the historic center, and archaeological areas. This information could be imported and tagged for its historical significance. However, it would require additional wrangling to convert the shapefiles into JSON files so that they could be read into MongoDB.