
Surviving the Apocalypse in Los Angeles

Mike Chaykowsky and Richard Mata

Abstract
We want to find the best place in Los Angeles to be during a zombie apocalypse. We created a graph
network of Los Angeles in which each node represents an intersection of two roads. Some nodes were tagged
as the intersections closest to businesses that represent areas we want to be near or far from. We want to be
near supplies: hospitals, hardware stores, grocery stores, pharmacies, and sporting goods stores. We want
to be far from areas of dense population: schools, museums, recreation centers, and hotels. The graph's
edge weights were set so that distances to each of these nodes are measured along street pathways. The final
results revealed the areas of Los Angeles that are optimal because they are far from populated areas, and the
areas that are optimal because they are close to supplies. The region where the two overlap, Glendale, CA,
is the best place to be in Los Angeles.

Contents
Introduction
Shortest Path Problems
NetworkX
OpenStreetMap (OSM)
Los Angeles - Open Data Portal (ODP)
Analysis
Conclusion
Code Appendix

Introduction

George Romero's 1968 Night of the Living Dead depicts a world where a growing group of living dead
monsters wreaks havoc on those still living. Romero's film is often credited as the original modern
zombie film; however, an earlier portrayal of zombies can be traced back to Victor Halperin's White Zombie,
which is about an evil voodoo priest in Haiti who zombifies a woman. At the time of Romero's film, these
living dead monsters were depictions of the ghoul, an evil demon, originally of Muslim legend, that feeds on
human beings. Nowadays such creatures have come to be known as zombies, a word said to derive from
the West African words ndzumbi, meaning corpse, and nzamni, meaning spirit of a dead person.
So, what is a zombie apocalypse? A zombie apocalypse is the dismantling of civilized society due to a zombie
outbreak. Modern zombie films have become infamous for depicting such disasters: a post-industrial wasteland
with human corpses roaming abandoned cities while small groups of survivalists scavenge for any resources
they can find. Such outbreaks can be caused by a virus, bacteria, or some other phenomenon causing
those who have been infected to act primitively and destructively towards other human beings. Thus, those
attacked by zombies may often become zombies themselves, causing exponential growth of the zombie
population. Due to the breakdown of society, those who have yet to become infected are left to fend for
themselves. This requires them to scavenge for resources that are essential to surviving this wasteland
of walking dead, all while avoiding any unnecessary visits to areas that may have a potentially high zombie
presence.
The city of Los Angeles is one of the most densely populated cities in the nation.1 In fact, the Los Angeles
area has a higher population than every U.S. state apart from California, Texas, New York, and Florida. In a
city with approximately 7,000 people per square mile and a total population of approximately 13 million people,
a zombie outbreak could cause the city to fall in a matter of days. Los Angeles is essentially a
death trap if a zombie outbreak were ever to occur, but could there still be a potential safe zone in Los
Angeles? For this project, we looked for what could possibly be the safest place in Los Angeles during a
zombie apocalypse. We essentially want our safe zone to be close to resources while avoiding areas
where the population may accumulate. Thus, we took hardware stores, sporting goods stores,
grocery stores, hospitals, and pharmacies as our parameters for resources; in addition, we took schools,
museums, hotels, and recreation centers as our parameters for heavily populated areas. By creating a
network graph of Los Angeles, we can find the shortest path from all resources to all street intersections, and
the longest path from all heavily populated areas to all street intersections.
Here a graph (or simple graph) is a set of vertices V and a set of edges E that are unordered pairs of distinct
elements of V:

if e ∈ E, then e = {u, v}, where u, v ∈ V, u ≠ v.

Thus, a weighted graph is a graph with a real-valued weight w assigned to each edge e ∈ E. For example,2
consider the following graph with weights assigned to each edge:
How can we determine the shortest path from node s to node x? This is what is known as a shortest path
problem.

Shortest Path Problems

In general, our problem can be classified as a shortest path problem: the problem of finding a path
between two nodes of a weighted graph such that the sum of the weights of its edges is minimized. There
are three variations of shortest path problems: single-source, single-destination, and all-pairs
shortest path. In our case, we will be dealing with a single-source shortest path problem. That is,
given a weighted graph and a source node s, we want to find the shortest path from the source node to all
1 Brasuell, J. (2012). Los Angeles is the Most Densely Populated Urban Area in the US. Retrieved December 10, 2016, from http://la.curbed.com/2012/3/26/10385086/los-angeles-is-the-most-densely-populated-urban-area-in-the-us
2 Percus, A. G. (2016). Discrete Mathematical Modeling [Course Notes].

Figure 1: Weighted graph.

other nodes in the graph. For this project, the source nodes are all the resources and heavily populated areas
we defined earlier. Such problems can be solved using Dijkstra's algorithm.
Published in 1959 by computer scientist Edsger W. Dijkstra, this algorithm solves the single-source shortest
path problem. The algorithm goes as follows:

1. Given a source node s, assign all shortest path lengths:

l(s, s) := 0, and l(s, u) := ∞, ∀ u ≠ s, u ∈ V

2. Set current node as source node, v := s. Mark all other nodes as unvisited, creating a set of all the
unvisited nodes.
3. For the current node v, consider all its unvisited neighbors u and calculate their tentative distances:

l(s, v) + w(v, u)

where w(v, u) is the weighted value assigned to the edge between nodes v and u.
4. Compare the newly calculated tentative distance to the currently assigned value, and keep the smaller
of the two as the new value of the path length:

l(s, u) := min{l(s, u), l(s, v) + w(v, u)}

5. Let y be an unvisited node that minimizes l(s, y), and set it as our current node: v := y. Repeat
steps 3 and 4.
6. The algorithm continues until all nodes have been marked as visited, at which point the
algorithm ends, returning the smallest path length, l(s, x), from source node s to sink node x.

Thus, the output of Dijkstra's algorithm gives us the value of the shortest path from the source node(s), i.e. the
nodes in our network representing resources and heavily populated areas, to our sink node(s), represented
by every other node in our network that is not a resource or a heavily populated area. Let us consider the
following example to illustrate how the algorithm works. Consider the weighted graph shown in
Figure 1 above, and find the shortest path from node s to node x using Dijkstra's algorithm:
Visit s:
l(s, a) := min{l(s, a), l(s, s) + w(s, a)} = min{∞, 0 + 5} = 5
l(s, c) := min{l(s, c), l(s, s) + w(s, c)} = min{∞, 0 + 1} = 1
l(s, e) := min{l(s, e), l(s, s) + w(s, e)} = min{∞, 0 + 3} = 3

Next, we visit all unvisited neighbors of s.

Visit a:

l(s, b) := min{l(s, b), l(s, a) + w(a, b)} = min{∞, 5 + 3} = 8


l(s, c) := min{l(s, c), l(s, a) + w(a, c)} = min{1, 5 + 1} = 1

Visit c:

l(s, b) := min{l(s, b), l(s, c) + w(c, b)} = min{8, 1 + 5} = 6

Visit e:

l(s, d) := min{l(s, d), l(s, e) + w(e, d)} = min{∞, 3 + 5} = 8

Lastly, we visit all unvisited neighbors of a, c, and e.


Visit b:

l(s, x) := min{l(s, x), l(s, b) + w(b, x)} = min{∞, 6 + 1} = 7

Visit d:

l(s, x) := min{l(s, x), l(s, d) + w(d, x)} = min{7, 8 + 3} = 7

Thus, our shortest path length from node s to node x is l(s, x) = 7.


What is the computational complexity of Dijkstra's algorithm? Every time we visit a node, the number of
relaxation operations scales linearly with the number of edges incident to that node, so across the whole run
the relaxations cost O(|E|). In the simple array-based implementation, selecting the minimum-distance unvisited
node additionally costs O(|V|) per visit. Thus, given a graph with |V| = n, and since |E| = O(n²) for a simple
graph, the overall computational complexity is O(n²).

NetworkX

NetworkX3 is a Python library for doing in-memory graph analysis. Below we can see a sample of what it
looks like to analyze a graph in NetworkX. Say we have some data about the travel distances between a
handful of location types relevant to our problem. These distances are completely made up. We can see from
the adjacency matrix that some locations are simply impossible to reach from others (entries of 9999); let us
just assume we do not want to make those transitions due to traffic. We can then look at Dijkstra's
shortest path from 'Start' to 'Safety'.

locations = ['Start', 'Grocery', 'Sporting goods', 'Pharmacy', 'Hardware'
             , 'Museum', 'School', 'Hotel', 'Safety']

adjacency = [
  [  0,  200, 1000, 1275, 1600, 4150, 6500, 5500, 9999] #Start
, [ -1,    0,  175,  300,   55, 5000, 5500, 9999, 9999] #Grocery
, [ -1,   -1,    0,   50,  170, 4750, 3890, 9999, 9999] #Sporting goods
, [ -1,   -1,   -1,    0, 1900, 9999, 9999, 9999, 9999] #Pharmacy
, [ -1,   -1,   -1,   -1,    0,  650, 9999, 9999, 9999] #Hardware
, [ -1,   -1,   -1,   -1,   -1,    0, 6000, 3500, 9999] #Museum
, [ -1,   -1,   -1,   -1,   -1,   -1,    0, 3200, 9999] #School
, [ -1,   -1,   -1,   -1,   -1,   -1,   -1,    0,  900] #Hotel
, [ -1,   -1,   -1,   -1,   -1,   -1,   -1,   -1,    0] #Safety
]

# Build the graph, skipping unreachable (-1 and 9999) entries.
g = nx.DiGraph()
for i in range(len(locations)):
    for j in range(len(locations)):
        d = adjacency[i][j]
        if 0 < d < 9999:
            g.add_edge(locations[i], locations[j], m=d)

nx.dijkstra_path(g, source='Start', target='Safety', weight='m')

['Start', 'Grocery', 'Hardware', 'Museum', 'Hotel', 'Safety']

3 networkx-1.11, 30 January 2016.

Figure 2: Fake data.

This is a simple example, but it illustrates what we need to do for the apocalypse question.
We need to obtain a series of nodes and edges, with weights attached to each edge. Once we have built
this graph, we must determine the shortest path to all of the supplies and the shortest path to all of the
populated areas. Then we simply take the lowest 5th percentile of shortest distances to supplies, the farthest
95th percentile of distances to populated areas, and overlay the final maps looking for overlap.
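That percentile filtering step can be sketched with NumPy masks; the arrays and their values here are invented for illustration, not taken from the project's results:

```python
import numpy as np

# Hypothetical per-intersection shortest-path distances (meters).
supply_dist = np.array([120., 950., 340., 4100., 60., 2800., 700., 5300.])
crowd_dist = np.array([800., 150., 2600., 3900., 6100., 4700., 500., 90.])

# Keep intersections in the lowest 5th percentile of distance to supplies
# and the highest 95th percentile of distance to populated areas.
near_supplies = supply_dist <= np.percentile(supply_dist, 5)
far_from_crowds = crowd_dist >= np.percentile(crowd_dist, 95)

# The overlap of the two masks is the candidate safe zone.
safe = near_supplies & far_from_crowds
```

With these made-up numbers, only the fifth intersection survives both filters.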

OpenStreetMap (OSM)4

OpenStreetMap's export feature will export all the current raw OpenStreetMap data (nodes, ways, relations;
keys and tags) in an XML format. It does this by pointing your browser directly at the OpenStreetMap API
to retrieve a bounding box of data (a map call). This has limitations in terms of the size and complexity of the
data you can request. To get around this limitation, we can direct our machine at the API link on the website via the terminal: wget
-O sm_map2.xml "http://overpass-api.de/api/map?bbox=-118.4282,33.9986,-118.0989,34.1672".
The output here is approximately 1.3 GB, and the first few lines look as follows.
Once the data has been exported from OSM's API, we can see that the data are broken up into three categories:
nodes, ways, and relations. We care particularly about the ways category. If we look at the ways, we will see
that our streets are categorized as highway=primary, highway=secondary, and highway=tertiary. So what
we want to do is search through all of the ways, and every time we encounter a highway we want to make a
node. Since our goal is to have nodes only at the intersections, this is not good enough yet. So the
code checks whether a single node shows up in more than one way; once a node appears a second time, it is
classified as an intersection. Then the latitude and longitude are pulled from the tag of the node in OSM and
saved as one of our NetworkX nodes.
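A minimal sketch of that intersection test, with a made-up set of ways standing in for the real OSM export:

```python
from collections import defaultdict

# Each way is a list of node refs, as parsed from its <nd ref="..."/> children.
ways = {
    'way1': ['n1', 'n2', 'n3'],
    'way2': ['n4', 'n2', 'n5'],   # shares n2 with way1
    'way3': ['n6', 'n5'],         # shares n5 with way2
}

# Record which ways each node ref appears in; a ref seen in more than
# one way is where two roads meet, i.e. an intersection.
seen_in = defaultdict(set)
for way_id, refs in ways.items():
    for ref in refs:
        seen_in[ref].add(way_id)

intersections = sorted(r for r, w in seen_in.items() if len(w) > 1)
```

Here `intersections` comes out as ['n2', 'n5'], the two nodes shared between ways.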
4 Copyright, OpenStreetMap contributors

Figure 3: OSM data output.

It is easy to compute the distance between two points, but efficiently finding, among many points of
2-dimensional geographic data, the one closest to an arbitrary query point is much harder by brute force.
A KD-tree stores data with multidimensional structure in a way that supports fast nearest-neighbor lookups.
After grabbing the latitudes and longitudes, we build a KD-tree around them.
Using this spatial relation, we can ask: given an arbitrary latitude and longitude, what is the closest latitude
and longitude to it?
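A minimal version of that lookup with scipy.spatial.KDTree, using invented coordinates (note that, like the project code, it treats lon/lat as plain Euclidean coordinates):

```python
from scipy import spatial

# (lon, lat) pairs for three hypothetical intersections.
intersections = [(-118.40, 34.00), (-118.30, 34.10), (-118.25, 34.05)]
tree = spatial.KDTree(intersections)

# Nearest intersection to an arbitrary query point.
dist, idx = tree.query((-118.31, 34.09))
nearest = intersections[int(idx)]
```

For this query point the tree returns the second intersection, (-118.30, 34.10).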
One of the decisions we had to make was how to measure distance. The two ways that seemed feasible were
to measure "as the crow flies" distance or to measure distance along roadways. There are positives
and negatives to each. Measuring as the crow flies could be argued to be the optimal choice, since
in a zombie apocalypse people may just be walking everywhere and paths may not be constrained
by roads. However, the general assumption is that to travel from one place to another, roadways will
still be used as a point of reference and possibly followed to completion. So the distances were set using
primary roads, secondary roads, and tertiary roads. Each type of road was given a weight of 1, 3, and 5,
respectively. This means that when the algorithm is run on the network graph, primary roads yield the
shortest distance from node to node, then secondary roads, then tertiary roads. These road definitions
are drawn directly from the way OpenStreetMap classifies its roads.

Los Angeles - Open Data Portal (ODP)5

If we want to determine the areas closest to supplies, we have to know where all of the supplies
are in Los Angeles. The same is true for populated areas. As stated earlier, we look at hardware stores, sporting
goods stores, pharmacies, hospitals, and grocery stores to determine supplies. To find where all of these stores are
in LA, we will use the Los Angeles Open Data Portal (ODP). This has a fairly complete list of all active
businesses in Los Angeles. Attached to each business is an NAICS code and description that classifies the
business type. This is what we will utilize to find the supply stores.
These classifications certainly have several false positives. In other words, we may have a business classified as
a pharmacy that is really a local physician's office without the types of supplies we are looking for. So
our method to fix this issue was to take only the businesses that we recognized by hand-picking the names.
This is certainly not a very rigorous approach, but it seemed to work for our purposes, as we could pick out all
of the chains and many of the smaller shops whose titles were indicative of a pharmacy (or whichever business
type we were searching for). Once we have all of the businesses in LA, we can plot them using their latitudes
and longitudes and ask the NetworkX graph which node is closest to each latitude and longitude. Assuming
there are street intersections close to all of these businesses, we can assign each of these nodes the id of
the business. Now we have a set of nodes in the graph that we want to be near during a zombie apocalypse.
5 https://data.lacity.org/

Figure 4: businesses.csv

Analysis

NetworkX now has a node for every intersection in Los Angeles, nodes that have been tagged with
businesses we want to be close to, and edges between nodes indicating whether a primary, secondary,
or tertiary road connects them, with weights 1, 3, and 5, respectively. Using Dijkstra's shortest path
algorithm, we can now ask our graph for the distances from each of the supply businesses to every other
node in the graph. Since the complexity of this is astronomical, when constructing the data frame we want to
make sure to check distances iteratively. That is to say, every time we find a point that has a supply store
closer to it, we override the stored entry with which supply store it is and how far away it is. Even with this
method we had trouble with the complexity of the calculations and had to reduce the number of nearest
intersections to search through to obtain a result. Using this dataframe, which contains the node id, the
distance to the nearest supply store, and the store id, we look at the areas within the nearest 5th percentile of
distance to supply stores. We then ran the same analysis with the populated areas, searching for schools,
museums, hotels, and recreation centers, and looked at the farthest 95th percentile of nodes from these
locations. We then determined that where the two optimal areas overlapped would be the unified optimal
location to be during a zombie apocalypse.
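The iterative "keep whichever store is closer" update can be sketched in pandas. The frame layout mirrors the appendix code, but the node ids, store ids, and distances here are invented:

```python
import pandas as pd

# Running table: for each intersection (dest), the nearest store seen so far.
df = pd.DataFrame({'dest': ['n1', 'n2', 'n3'],
                   'start': 's1', 'dist': [500.0, 1200.0, 300.0]})

# Distances from the next store to the same intersections.
df2 = pd.DataFrame({'dest': ['n1', 'n2', 'n3'],
                    'start': 's2', 'dist': [900.0, 400.0, 350.0]})

# Per intersection, keep whichever store is closer.
m = df.merge(df2, on='dest', suffixes=('_x', '_y'))
closer_x = m['dist_x'] <= m['dist_y']
m['dist'] = m['dist_x'].where(closer_x, m['dist_y'])
m['start'] = m['start_x'].where(closer_x, m['start_y'])
df = m[['dest', 'start', 'dist']]
```

After the update, n2's nearest store switches from s1 (1200 m) to s2 (400 m), while n1 and n3 keep s1.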
Figure 5 shows the area that is optimal because it is close to supply stores (green) and the area that is
optimal because it is far from populated areas (red). The two strongly overlap in one region, which we have
determined to be Glendale, CA.

Conclusion

This project aimed to answer the question: what is the safest location in Los Angeles, one that is closest
to resources and furthest away from heavily populated areas? Classifying this problem as a shortest path
problem allowed us to implement Dijkstra's algorithm to determine a shortest path to resources and a longest
path to heavily populated areas. To do this, we had to create a network of Los Angeles. Using OpenStreetMap,
we set upper and lower bounds based on latitude and longitude coordinates.
After downloading all the data within those bounds, we used the Python package NetworkX to
create a graph of Los Angeles. We took the nodes in our graph to represent all road intersections, businesses
we classified as resources, and locations we classified as heavily populated areas. The edges of the graph
represented the roads connecting nodes together. Edges were given tags to classify the type
of road, and edge weights indicated the scale at which to use each road type, based on the assumption that
certain roads provide a more direct route. Implementing Dijkstra's algorithm, we searched for the shortest
path from all source nodes (resources and heavily populated areas) to all sink nodes (every other node in the

Figure 5: Results.

network). For heavily populated areas, we had to tweak the algorithm so that instead of returning
the shortest path value, it would return the value of the longest path. In turn, we geographically mapped out
all resources, along with all the areas we found to be closest to those locations; similarly, we geographically
mapped out all heavily populated areas, along with all the areas we found to be furthest from those locations.
When overlaying both maps, we found locations that were both far from heavily populated areas and close
to resources. The most notable location that fit both criteria was the city of Glendale.
For future work, one possible area that could be worked on is the optimization of the weights given to a specific
edge. With our weights based on merely arbitrary values, one could optimize the choice of weights
to better reflect actual geographical distances between nodes.

Code Appendix

Much of the code was adapted from an open source tutorial6 on geographic analysis of Los Angeles.

%matplotlib inline
import matplotlib.pyplot as plt
import osmapi
import matplotlib
import matplotlib.cm as cm
import requests
from matplotlib.colors import colorConverter
from scipy import spatial
import numpy as np
import pandas as pd
from mpl_toolkits.basemap import Basemap
import shapefile
import pickle
import shapely
import geocoder
import geopy
from geopy.distance import vincenty
import networkx as nx
import sys
import os
from shapely import geometry
from lxml import etree
from collections import defaultdict
from io import StringIO, BytesIO

def rec_print(elem, level=0):
    for i in range(level):
        sys.stdout.write('\t')
    sys.stdout.write(elem.tag)
    if elem.text is not None:
        sys.stdout.write(elem.text)
    ecount = 0
    for k in range(len(elem.keys())):
        key = elem.keys()[k]
        val = elem.values()[k]
        if ecount > 0:
            sys.stdout.write(';')
        sys.stdout.write(' ' + key + '=' + val)
        ecount += 1
    print ''
    for child in elem.getchildren():
        rec_print(child, level+1)

tree = etree.parse("sm_map2.xml")

6 Polich, Kyle. Geographic Analysis of Los Angeles. 2015-05-12.

node_counts = defaultdict(int)
r = tree.getroot()
r.get('generator')
children = r.getchildren()
for child in children:
    node_counts[child.tag] += 1

#print(node_counts)

tags = {}
interesting = ['node', 'relation', 'way']
for interest in interesting:
    tags[interest] = defaultdict(int)

for child in children:
    tag = child.tag
    if tag == 'node' or tag == 'relation' or tag == 'way':
        tag_counts = tags[tag]
        gchildren = child.getchildren()
        for gchild in gchildren:
            key = gchild.get('k')
            tag_counts[key] += 1

roads = {}
values = defaultdict(int)
children = r.getchildren()
for child in children:
    tag = child.tag
    if tag == 'way':
        gchildren = child.getchildren()
        for gchild in gchildren:
            k = gchild.get('k')
            if k=='highway':
                i = child.keys().index('id')
                road_id = child.values()[i]
                v = gchild.get('v')
                if v=='primary' or v=='secondary' or v=='tertiary' or v=='residential':
                    roads[road_id] = child
                    values[v] += 1

nodes = {}
children = r.getchildren()
for child in children:
    tag = child.tag
    if tag == 'node':
        ref = child.values()[child.keys().index('id')]
        lat = float(child.values()[child.keys().index('lat')])
        lon = float(child.values()[child.keys().index('lon')])
        nodes[ref] = [lat, lon]

g = nx.Graph()

intersection_check = defaultdict(list)

for road_id in roads.keys():
    road = roads[road_id]
    children = road.getchildren()
    ncount = 0
    for child in children:
        if child.tag=='nd':
            i = child.keys().index('ref')
            ref = child.values()[i]
            intersection_check[ref].append(road_id)

for road in roads.values():
    road_id = road.get('id')
    ncount = 0
    name = '?'
    hwtype = '?'
    last_ref = -1
    for child in road.getchildren():
        if child.tag=='tag':
            i = child.keys().index('k')
            j = child.keys().index('v')
            k = child.values()[i]
            v = child.values()[j]
            if k=='name':
                name = v
            if k=='highway':
                hwtype = v
    for child in road.getchildren():
        if child.tag=='nd':
            ref = child.get('ref')
            n = nodes[ref]
            lat = n[0]
            lon = n[1]
            ic = intersection_check[ref]
            if len(ic) > 1:
                g.add_node(ref, {'label': name, 'lat': lat, 'lon': lon})
            else:
                g.add_node(ref, {'label': '[intersection]', 'lat': lat, 'lon': lon})
            if ncount > 0:
                d = vincenty(nodes[ref], nodes[last_ref])
                # scale by road class: primary 1x, secondary 3x, all else 5x
                if hwtype=='primary':
                    d *= 1
                elif hwtype=='secondary':
                    d *= 3
                else:
                    d *= 5
                w = d
                if last_ref != -1:
                    g.add_edge(ref, last_ref, {'weight': w.meters})
            ncount += 1
            last_ref = ref

refids = []
roadids = []
lats = []
lons = []

for road in roads.values():
    road_id = road.get('id')
    for child in road.getchildren():
        if child.tag=='nd':
            refid = child.get('ref')
            ll = nodes[refid]
            refids.append(refid)
            roadids.append(road_id)
            lats.append(ll[0])
            lons.append(ll[1])

lldf = pd.DataFrame({'refid': refids, 'road_id': roadids, 'lat': lats, 'lon': lons})

data = zip(lldf.lon, lldf.lat)
rtree = spatial.KDTree(data)

# GET DATA ON ACTIVE BUSINESSES IN LOS ANGELES

businesses = pd.read_csv('businesses.csv')

# SELECT ALL SUPPLY STORES AND AGGREGATE


filter = businesses['PRIMARY NAICS DESCRIPTION'] == 'Grocery stores (including supermarkets & convenienc
groceries = businesses[filter].copy()
filter = businesses['PRIMARY NAICS DESCRIPTION'] == 'Hardware, & plumbing & heating equipment & supplie
hardware = businesses[filter].copy()
filter = businesses['PRIMARY NAICS DESCRIPTION'] == 'Hospitals'
hospitals = businesses[filter].copy()
filter = businesses['PRIMARY NAICS DESCRIPTION'] == 'Pharmacies & drug stores'
pharmacies = businesses[filter].copy()
filter = businesses['PRIMARY NAICS DESCRIPTION'] == 'Sporting & recreational goods & supplies'
sporting = businesses[filter].copy()
gp1 = pd.DataFrame(groceries.groupby('BUSINESS NAME')['?LOCATION ACCOUNT #'].count())
gp2 = pd.DataFrame(hardware.groupby('BUSINESS NAME')['?LOCATION ACCOUNT #'].count())
gp3 = pd.DataFrame(hospitals.groupby('BUSINESS NAME')['?LOCATION ACCOUNT #'].count())
gp4 = pd.DataFrame(pharmacies.groupby('BUSINESS NAME')['?LOCATION ACCOUNT #'].count())
gp5 = pd.DataFrame(sporting.groupby('BUSINESS NAME')['?LOCATION ACCOUNT #'].count())

gp = pd.concat([gp1, gp2, gp3, gp4, gp5])

gp.reset_index(inplace=True)
gp.columns = ['name', 'freq']
gp.sort_values('freq', ascending=False, inplace=True)
gp[1:20]

These chunks were run separately but included here for brevity.

# SELECT ALL POPULATED AREAS AND AGGREGATE


filter = businesses['PRIMARY NAICS DESCRIPTION'] == 'Educational services (including schools, colleges,
schools = businesses[filter].copy()
filter = businesses['PRIMARY NAICS DESCRIPTION'] == 'Museums, historical sites, & similar institutions'
museums = businesses[filter].copy()
filter = businesses['PRIMARY NAICS DESCRIPTION'] == 'Traveler accommodation (including hotels, motels, &

hotels = businesses[filter].copy()
filter = businesses['PRIMARY NAICS DESCRIPTION'] == 'Other amusement & recreation services (including go
recreation = businesses[filter].copy()
gp1 = pd.DataFrame(schools.groupby('BUSINESS NAME')['?LOCATION ACCOUNT #'].count())
gp2 = pd.DataFrame(museums.groupby('BUSINESS NAME')['?LOCATION ACCOUNT #'].count())
gp3 = pd.DataFrame(hotels.groupby('BUSINESS NAME')['?LOCATION ACCOUNT #'].count())
gp4 = pd.DataFrame(recreation.groupby('BUSINESS NAME')['?LOCATION ACCOUNT #'].count())

gp = pd.concat([gp1, gp2, gp3, gp4])

gp.reset_index(inplace=True)
gp.columns = ['name', 'freq']
gp.sort_values('freq', ascending=False, inplace=True)
gp[400:500]

# LABEL THOSE THAT ARE CHAINS OR SEEM LEGITIMATE

schools['chain'] = 0
museums['chain'] = 0
hotels['chain'] = 0
recreation['chain'] = 0

chains = ['ILEAD CALIFORNIA CHARTERS 1', 'LOYOLA MARYMOUNT UNIVERSITY', 'SK MOVEMENT INC'
, 'CHURCH OF THE LIVING WORD', 'ALLIANT INTERNATIONAL UNIVERSITY', "MOUNT ST MARY'S COLLEGE/C"
, 'HILLCREST CHRISTIAN SCHOOL', 'MARRIOTT HOTEL SERVICES INC', 'CAMPBELL HALL-EPISCOPAL'
, 'B W MIDWILSHIRE PLAZA HOTEL INC', 'THE SHERATON LLC', 'SOUTHWESTERN LAW SCHOOL'
, 'HOLLYWOOD INN EXP SOUTH', 'HILTON MANAGEMENT LLC', 'GLENN VALLEY INN LLC'
, 'ALEXANDRIA MOTEL LLC', 'HOLLYWOOD LA BREA INN, LLC', 'BROWNSTONE HOTEL, LP'
, 'A A OFICINA CENTRAL HISPANA DE LOS ANGELES /C', 'GLOBE HOMES LLC'
, 'ONLINE EDUCATIONAL SCHOOLS INC', 'WEB EDUCATIONAL SERVICES INC'
, 'TRAFFIC EDUCATION CENTER', 'THE PUBLIC HEALTH FOUNDATION OF LOS ANGELES COUNTY INC'
, 'FITNESS INTERNATIONAL LLC', 'THIRTY-FIRST DISTRICT CALIFORNIA CONGRESS OF PARENTS & T'
, 'HAMID KHALES', 'SOLEDAD ENRICHMENT ACTION INC', 'BRE/ESA P PORTFOLIO OPERATING LESSEE INC'
, 'AMERICAN CROWN CIRCUS INC', 'VLADIMIR KRAVETS', 'FITNESS & SPORTS CLUBS LLC'
, 'FEDERATION OF PRESCHOOL/COMMUNITY EDUCATION CENTERS INC'
, 'PARTNERSHIPS TO UPLIFT COMMUNITIES LOS ANGELES', 'LOS ANGELES UNIFIED SCHOOL DISTRICT'
, 'BERNY RAMOS PRODUCTIONS INC', 'YOGA WORKS INC', 'LOS ANGELES URBAN LEAGUE', 'ARMAN GASPARYA
, 'KIPP LA SCHOOLS', 'CENTURY /LEARNING INITIATIVES FOR TODAY INC', 'CHEAP AND EASY ONLINE, IN
, 'SKID ROW HOUSING TRUST CO /C ET AL']

for chain in chains:
    filter = schools['BUSINESS NAME'] == chain
    schools.loc[filter, 'chain'] = 1
    filter = museums['BUSINESS NAME'] == chain
    museums.loc[filter, 'chain'] = 1
    filter = hotels['BUSINESS NAME'] == chain
    hotels.loc[filter, 'chain'] = 1
    filter = recreation['BUSINESS NAME'] == chain
    recreation.loc[filter, 'chain'] = 1

chains1 = schools[schools.chain==1].copy()
chains2 = recreation[recreation.chain==1].copy()
chains3 = hotels[hotels.chain==1].copy()

chains = pd.concat([chains1,chains2,chains3])

These chunks were run separately but included here for brevity.

# LABEL THOSE THAT ARE CHAINS OR SEEM LEGITIMATE

groceries['chain'] = 0
hardware['chain'] = 0
hospitals['chain'] = 0
pharmacies['chain'] = 0
sporting['chain'] = 0

chains = ['THE VONS COMPANIES INC', "TRADER JOE'S CO", 'FRESH & EASY, LLC', "GELSON'S MARKETS"
, "BERBERIAN ENTERPRISES INC", 'NUMERO UNO ACQUISITIONS LLC', 'ALBERTSONS LLC'
, 'MOTHERS NUTRITIONAL CENTER INC', 'THE VONS COMPANIES INC', "TRADER JOE'S CO", 'FRESH & EASY
, "GELSON'S MARKETS", "BERBERIAN ENTERPRISES INC", 'NUMERO UNO ACQUISITIONS LLC'
, 'ALBERTSONS LLC', 'MOTHERS NUTRITIONAL CENTER INC', 'THRIFTY PAYLESS INC', 'GARFIELD BEACH C
, 'USC OBSTETRICIANS AND GYNECOLOGIST /C', 'KAISER FOUNDATION HOSPITALS', 'LOS ANGELES CHRISTI
, 'PROVIDENCE MEDICAL INSTITUTE INC', 'CEDARS SINAI MEDICAL CARE FOUNDATION', 'THE NORTHEAST C
, 'SOUTHERN CALIFORNIA HEALTHCARE SYSTEM INC', 'KEDREN COMMUNITY HEALTH CENTER INC'
, 'PACIFICA HOSPITAL OF THE VALLEY', 'ORION HOSPICE CARE SERVICES INC', 'CHILDREN HOSPITAL'
, 'CLOVER SURGICAL CENTER INC', 'DESERT V V MEDICAL GROUP INC', 'DANIEL FREEMAN HOSPITALS INC'
, 'CEDARS-SINAI MEDICAL CENTER', 'CASPIAN MEDICAL CLINIC CORP', 'ALTAMED HEALTH SERVICES CORP'
, 'BEVERLY SURGERY CENTER INC', 'CALIFORNIA HOSPITAL MEDICAL CENTER-LOS ANGELES /C'
, 'HEALTHPOINTE MEDICAL GROUP INC', 'HANA MEDICAL CENTER INC','THRIFTY PAYLESS INC'
, 'GARFIELD BEACH CVS LLC', 'WALGREEN CO /C', 'THE VONS COMPANIES INC', "TRADER JOE'S CO"
, 'KENK USA INC', 'BERBERIAN ENTERPRISES INC', "GELSON'S MARKETS", 'NUMERO UNO ACQUISITIONS LL
, 'LONGS DRUG STORES CALIFORNIA LLC', 'MOTHERS NUTRITIONAL CENTER INC', 'MRS GOOCHS NATURAL FO
, 'FERGUSON ENTERPRISES INC', 'BODEGA LATINA CORP', 'SUPER CENTER CONCEPTS INC'
, 'HIRSCH PIPE/SUPPLY CO A CORP', 'KIRAN HUSSAIN', 'NORTHGATE GONZALEZ LLC', 'SHARMEENS ENTERP

for chain in chains:
    filter = groceries['BUSINESS NAME'] == chain
    groceries.loc[filter, 'chain'] = 1
    filter = hardware['BUSINESS NAME'] == chain
    hardware.loc[filter, 'chain'] = 1
    filter = hospitals['BUSINESS NAME'] == chain
    hospitals.loc[filter, 'chain'] = 1
    filter = pharmacies['BUSINESS NAME'] == chain
    pharmacies.loc[filter, 'chain'] = 1
    filter = sporting['BUSINESS NAME'] == chain
    sporting.loc[filter, 'chain'] = 1

chains1 = groceries[groceries.chain==1].copy()
chains2 = hardware[hardware.chain==1].copy()
chains3 = hospitals[hospitals.chain==1].copy()
chains4 = pharmacies[pharmacies.chain==1].copy()
chains5 = sporting[sporting.chain==1].copy()

chains = pd.concat([chains1,chains2,chains3,chains4,chains5])

# LOAD ANY CACHED GEO-CODED LOCATIONS, RETRIEVE THE REST VIA API AND SAVE
# DATASET HAS LOCATIONS, BUT THEY AREN'T VERY PRECISE AND SOME ARE MISSING

full_address = chains.apply(lambda x: x['STREET ADDRESS'] + ', ' + x['CITY'] + ', CA ' + x['ZIP CODE'], axis=1)
chains['FULL ADDRESS'] = full_address

gfname = 'geocoded.pkl'
gresults = {}
if os.path.isfile(gfname):
    print 'Retrieving previous geo-coded data'
    f = open(gfname, 'r')
    gresults = pickle.load(f)
    f.close()

import time
for f in full_address:
    try:
        loc = gresults[f]
    except KeyError:
        try:
            print 'Retrieving ' + f
            loc = geocoder.google(f)
            gresults[f] = loc.latlng
            time.sleep(1)
        except geopy.exc.GeocoderTimedOut:
            time.sleep(60)

f = open(gfname, 'wb')
pickle.dump(gresults, f)
f.close()

lats = []
lons = []
for f in full_address:
    loc = gresults[f]
    lats.append(loc[0])
    lons.append(loc[1])

chains['lat'] = lats
chains['lon'] = lons

chains.to_csv('chains2.tsv', sep='\t')

nearest_intersection = []
for i in range(chains.shape[0]):
    row = chains.iloc[i]
    r = rtree.query([row.lon, row.lat])
    rownum = r[1]
    refid = lldf.loc[rownum, 'refid']
    nearest_intersection.append(refid)

# GET DISTANCE TO ALL NODES FROM FIRST SUPPLY STORE or POPULATED AREA

ni = nearest_intersection[0]
spaths = nx.single_source_dijkstra_path_length(g, source=ni, weight='weight')
df = pd.DataFrame({'start': ni, 'dest': spaths.keys(), 'dist': spaths.values()})

for i in range(1,10):
    ni = nearest_intersection[i]
    spaths = nx.single_source_dijkstra_path_length(g, source=ni, weight='weight')
    df2 = pd.DataFrame({'start': ni, 'dest': spaths.keys(), 'dist': spaths.values()})
    m = pd.merge(df, df2, on=['dest'])
    xmin = m['dist_x'] <= m['dist_y']
    m['dist'] = 9999
    m.ix[xmin, 'dist'] = m[xmin]['dist_x']
    m.ix[xmin, 'start'] = m[xmin]['start_x']
    m.ix[~xmin, 'dist'] = m[~xmin]['dist_y']
    m.ix[~xmin, 'start'] = m[~xmin]['start_y']
    m.drop('dist_x', axis=1, inplace=True)
    m.drop('dist_y', axis=1, inplace=True)
    m.drop('start_x', axis=1, inplace=True)
    m.drop('start_y', axis=1, inplace=True)
    df = m

df.sort_values('dist', inplace=True)
df.index = np.arange(df.shape[0])
plt.figure(figsize=(15,8))
plt.plot(df.index, df.dist)
plt.xlabel('ordered distances')
plt.ylabel('meters')
plt.show()

cutoff = np.percentile(df.dist, 95)


desert_df = df[df.dist > cutoff]
desert_df.columns = ['location_id', 'dist', 'nearest_store_id']
desert_df.index = np.arange(desert_df.shape[0])

These chunks were run separately but included here for brevity.

# SET RADIAL DISTANCE FOR OPTIMAL AREAS

cutoff = np.percentile(df.dist, 5)
desert_df = df[df.dist < cutoff]
desert_df.columns = ['location_id', 'dist', 'nearest_store_id']
desert_df.index = np.arange(desert_df.shape[0])

deserts = []
for lid in desert_df['location_id']:
    deserts.append(nodes[lid])

df3 = pd.DataFrame(deserts)
df3.columns = ['lat', 'lon']

# VISUALIZE RESULTS

fig = plt.figure(figsize=(12,12))
ax = fig.add_axes([0.1,0.1,0.8,0.8])
bmap = Basemap(projection='stere',lon_0=-100,lat_0=35.,\
llcrnrlat=33.3,urcrnrlat=34.5,\
llcrnrlon=-118.8,urcrnrlon=-118.2,\
rsphere=6371200.,resolution='h',area_thresh=10000)
bmap.drawcountries()
bmap.drawcounties(linewidth=0.25)
bmap.drawmapboundary(fill_color='#99ffff')
bmap.fillcontinents(color='#ffffff',lake_color='darkblue', zorder=0)
bmap.drawrivers()

x, y = bmap(lons, lats)
bmap.scatter(x,y,50,marker='o',color='b', alpha=0.2)

x, y = bmap(df3.lon.tolist(), df3.lat.tolist())
bmap.scatter(x,y,100,marker='o',color='r', alpha=0.2)

plt.show()

