Exploratory Data Analysis of the 2014 New York City 311 Call Data

The aim of this EDA is to understand the NYC 311 call data and observe any noticeable trends that we could use to make predictions.

Author: Ikponmwosa Felix Ogbeide

Questions to answer through this project:

Which boroughs have the most call incidents?

Which agency gets the most incidents?

Which borough has the fastest incident resolution time?

How do the incidents vary month to month?

PS I'm also on the lookout for interesting trends and observations beyond the questions above!

Before I begin exploring the data, it is essential to understand what type of data I am working with. This will enable me to decide what kind of data cleaning or wrangling is necessary for this project.

In [1]: # Import python libraries for data manipulation
        import numpy as np
        import pandas as pd

In [2]: # Read 311 data from csv file into pandas dataframe
        datafile = '311_Service_Requests_from_2014.csv'
        df = pd.read_csv(datafile)
        df.head(3)

C:\Users\Ikp\Anaconda3\lib\site-packages\IPython\core\interactiveshell.py:2698: DtypeWarning: Columns (8,17,40,41,42,43,44,45,46,47,48,49) have mixed types. Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)

Out[2]:
   Unique Key  Created Date            Closed Date             Agency  Agency Name                      Complaint Type      Descriptor         Location Type    Incident Zip  Incident Address
0  28457271    07/11/2014 03:08:58 PM  08/05/2014 12:41:37 PM  DOT     Department of Transportation     Sidewalk Condition  Defacement         Sidewalk         11368         37-73 104 STREET
1  28644314    08/08/2014 02:06:22 PM  08/12/2014 11:33:34 AM  DCA     Department of Consumer Affairs   Consumer Complaint  False Advertising  NaN              10014         113 WASHINGTON PLACE
2  29306886    11/18/2014 12:52:40 AM  11/18/2014 01:35:22 AM  NYPD    New York City Police Department  Blocked Driveway    No Access          Street/Sidewalk  11358         42-10 159 STREET

3 rows × 53 columns

In [3]: # Setting pandas maximum columns display so as to view all columns at once.
        pd.set_option('display.max_columns', None)

In [4]: # Since the datafile has a Unique Key column it's better to have that as the index
        # Re-read datafile into df pandas dataframe setting unique key as index
        df = pd.read_csv(datafile, index_col='Unique Key')

C:\Users\Ikp\Anaconda3\lib\site-packages\IPython\core\interactiveshell.py:2698: DtypeWarning: Columns (8,17,40,41,42,43,44,45,46,47,48,49) have mixed types. Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)
C:\Users\Ikp\Anaconda3\lib\site-packages\numpy\lib\arraysetops.py:463: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
  mask |= (ar1 == a)

From the warning, we know we have mixed types in some columns. It's best to investigate which columns the mixed types occur in.
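As a quick side check (not part of the original notebook), the Python types present in any column flagged by the DtypeWarning can be counted directly; mixed_type_summary is just a hypothetical helper name, and the column in the commented example is only an illustration.

    # Hypothetical helper: count how many values of each Python type a column holds
    import pandas as pd

    def mixed_type_summary(series):
        return series.map(lambda v: type(v).__name__).value_counts()

    # Example usage, assuming df has been read as above:
    # mixed_type_summary(df['Incident Zip'])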

In [5]: # I'll create a list to store the columns with mixed data types.
        # Later on in the project, I'll revisit this list of mixed data types
        df_mixed_dt = [df.columns[7], df.columns[16], df.columns[39], df.columns[40], df.columns[41],
                       df.columns[42], df.columns[43], df.columns[44], df.columns[45], df.columns[46],
                       df.columns[47], df.columns[48]]

In [6]: # View object types of each column
        df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2099056 entries, 28457271 to 29607189
Data columns (total 52 columns):

Created Date                      object
Closed Date                       object
Agency                            object
Agency Name                       object
Complaint Type                    object
Descriptor                        object
Location Type                     object
Incident Zip                      object
Incident Address                  object
Street Name                       object
Cross Street 1                    object
Cross Street 2                    object
Intersection Street 1             object
Intersection Street 2             object
Address Type                      object
City                              object
Landmark                          object
Facility Type                     object
Status                            object
Due Date                          object
Resolution Description            object
Resolution Action Updated Date    object
Community Board                   object
Borough                           object
X Coordinate (State Plane)        float64
Y Coordinate (State Plane)        float64
Park Facility Name                object
Park Borough                      object
School Name                       object
School Number                     object
School Region                     object
School Code                       object
School Phone Number               object
School Address                    object
School City                       object
School State                      object
School Zip                        object
School Not Found                  object
School or Citywide Complaint      float64
Vehicle Type                      object
Taxi Company Borough              object
Taxi Pick Up Location             object
Bridge Highway Name               object
Bridge Highway Direction          object
Road Ramp                         object
Bridge Highway Segment            object
Garage Lot Name                   object
Ferry Direction                   object
Ferry Terminal Name               object
Latitude                          float64
Longitude                         float64
Location                          object
dtypes: float64(5), object(47)
memory usage: 848.8+ MB

In [7]: # Count of non-null values for every column
        df_columns = list(df.columns)
        df[df_columns].count()

Out[7]: Created Date                      2099056
        Closed Date                       2053916
        Agency                            2099056
        Agency Name                       2099056
        Complaint Type                    2099056
        Descriptor                        2086658
        Location Type                     1546759
        Incident Zip                      1952035
        Incident Address                  1641505
        Street Name                       1641401
        Cross Street 1                    1675745
        Cross Street 2                    1668168
        Intersection Street 1              323622
        Intersection Street 2              323523
        Address Type                      1994598
        City                              1953071
        Landmark                              810
        Facility Type                      536769
        Status                            2099053
        Due Date                           870518
        Resolution Description            2085477
        Resolution Action Updated Date    2060578
        Community Board                   2099056
        Borough                           2099056
        X Coordinate (State Plane)        1888333
        Y Coordinate (State Plane)        1888333
        Park Facility Name                2099056
        Park Borough                      2099056
        School Name                       2099056
        School Number                     2098598
        School Region                     2088711
        School Code                       2088711
        School Phone Number               2099056
        School Address                    2099055
        School City                       2099056
        School State                      2099056
        School Zip                        2099054
        School Not Found                   857159
        School or Citywide Complaint            0
        Vehicle Type                          481
        Taxi Company Borough                 1689
        Taxi Pick Up Location               14906
        Bridge Highway Name                  9223
        Bridge Highway Direction             9216
        Road Ramp                            9177
        Bridge Highway Segment              10678
        Garage Lot Name                       624
        Ferry Direction                       479
        Ferry Terminal Name                  1596
        Latitude                          1888333
        Longitude                         1888333
        Location                          1888333
        dtype: int64

Columns with a lot of missing data won't be useful to my analysis. In addition, some columns in the data simply aren't relevant to this analysis, so it's best to remove these columns.
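As an alternative to hand-picking the drop list in the next cell, one could flag columns by their share of missing values. This is only a sketch and not what the notebook does; the 50% threshold is an arbitrary illustrative choice.

    # Sketch: list columns where more than half of the values are missing.
    # The 0.5 threshold is illustrative, not the cutoff used in this notebook.
    sparse_cols = df.columns[df.isnull().mean() > 0.5].tolist()
    print(sparse_cols)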

In [8]: # A list of columns to remove from the dataframe
        df_cols_rmv = ['Incident Address', 'Street Name', 'Cross Street 1', 'Cross Street 2',
                       'Intersection Street 1', 'Intersection Street 2', 'Landmark', 'Facility Type',
                       'Due Date', 'Resolution Description', 'Community Board',
                       'X Coordinate (State Plane)', 'Y Coordinate (State Plane)', 'Park Facility Name',
                       'Park Borough', 'School Name', 'School Number', 'School Region', 'School Code',
                       'School Phone Number', 'School Address', 'School City', 'School State',
                       'School Zip', 'School Not Found', 'School or Citywide Complaint',
                       'Vehicle Type', 'Taxi Company Borough', 'Taxi Pick Up Location',
                       'Bridge Highway Name', 'Bridge Highway Direction', 'Road Ramp',
                       'Bridge Highway Segment', 'Garage Lot Name', 'Ferry Direction',
                       'Ferry Terminal Name', 'Location', 'Address Type', 'Agency Name',
                       'Resolution Action Updated Date', 'Descriptor', 'Location Type']

In [9]: # Remove the columns added to the df_cols_rmv list from the df dataframe
        df.drop(df_cols_rmv, inplace=True, axis=1)

In [10]: df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2099056 entries, 28457271 to 29607189
Data columns (total 10 columns):

Created Date      object
Closed Date       object
Agency            object
Complaint Type    object
Incident Zip      object
City              object
Status            object
Borough           object
Latitude          float64
Longitude         float64
dtypes: float64(2), object(8)
memory usage: 176.2+ MB

In [11]: # Time to investigate df columns with mixed data types
         df_mixed_dt

Out[11]: ['Incident Zip', 'Landmark', 'Vehicle Type', 'Taxi Company Borough', 'Taxi Pick Up Location',
          'Bridge Highway Name', 'Bridge Highway Direction', 'Road Ramp', 'Bridge Highway Segment',
          'Garage Lot Name', 'Ferry Direction', 'Ferry Terminal Name']

In [12]: # Drop columns in df_mixed_dt that aren't in df anymore
         df_mixed_dt = [x for x in df_mixed_dt if x in list(df.columns)]

In [13]: df_mixed_dt

Out[13]: ['Incident Zip']

In [14]: # Explore Incident Zip
         df['Incident Zip'].unique()

Out[14]: array(['11368', '10014', '11358', '10018', nan, '10466', '11357', '10309', '10030', '10029', '11417', '11369', '10465', '10473', '10461', '11101', '10010', '11217', '11355', '10019', '10128', '10462', '10456', '11385', '11356', '11225', '10028', '11104', '11203', '10024', '11249', '11103', '11211', '11364', '11432', '10023', '10003', '11375', '11354', '11361', '10009', '10314', '11412', '10036', '10012', '10163-4668', '11429', '10451', '11426', '10011', '11233', '10457', '10460', '10305', '10021', '11237', '10007', '10016', '10306', '11210', '11213', '11372', '11226', '11218', '11221', '11414', '11205', '10307', '11204', '11418', '11234', '11230', '11223', '10004', '11377', '11370', '11423', '10017', '11367', '10031', '10002', '11222', '10027', '11207', '10025', '11235', '11421', '10469', '11216', '11219', '11434', '10312', '11238', '11411', '11214', '10005', '11365', '11373', '10452', '11201', '10468', '10038', '11228', '10282', '10303', '11105', '10472', '10000', '10308', '11378', '11236', '10001', '10475', '10020', '10034', '11004', '11102', '11374', '10464', '11428', '11416', '10304', '10453', '11208', '11209', '10022', '11379', '11435', '11360', '11212', '10026', '10065', '11001', '11106', '07114', '11215', '07512', '10032', '11419', '11427', '11096', '07631', '10035', '10463', '11220', '11436', '10467', '11422', '11229', '10301', '10013', '10040', '10033', '11366', '11433', '11362', '10075', '11363', '11040', '10458', '10470', '11206',


'11231', '11413', '10454', '11239', '10455', '10459', '10704', '10302', '11232', '11514', '11420', '10037', '10310', '11415', '10474', '11224', '10710', '10006', '11691', '10039', '11692', '11693', '10044', '11694', '11430', '29616-0759', '10471', '07017', '10069', '11746', '10280', '94524', '11109', '00000', '10048', '23450', '85044', '10804', 'OOOOO', 'UNKNOWN', '07601', '11749', '10533', '11566', '11359', '11716', '11516', '00083', '32789', '11803', '11559', '55426-1066', '11706', '11710', '10281', '10701', '11553', '23541', '55438', '11580', '11757', '37615', '11697', '07054', '11042', '11802-9060', '10116', '11554', '08857', '10805', '12602-0550', '07030', '07642', '77072', '92123', '11735', '10801', '10119', '11520', '10163', '11003', '55614', '92046-3023', '95476', '10129', '11005', '10601', '08817', '07087', '11452', '10153', 'NEW YORK A', '11581', '33319', '12550', '11590', '10402', '61702-3068', '08691', '11579', '07024', '07981', '07305', '11550', '44139', '000000', '11747', '07047', '10548', '11501', '98057', '10591', '23502', '10538', '75081', '10803', '77081', '07302', '11776', '11021', '75244', '75201-4234', '10589', '77057', '1000', '10112', '91752', '07306', '11598', '89703', '11242', '07652', '63301', '32824', '11563', '07010', '66211', '11572', '10122', '10603', '11556', '10606', '10705', '11791', '11758', '11030', '07666', '07310', '10103', '07351', '75007-1958', '10930', 'NJ 07114', '01701-0166', '10954', '11720', '06851', '10536', '14221-7820', '10552', '61654', '07647', '60089', '10501', '10162', '30071', '11552', '15251', '11241', '75024-2959', '11801', '11371', '11530', '11771', '08876', '11725', '07936', '0000', '11722', '07094', '07605', '29603', '11714', 'NJ', '327500', '75240', '37917', '32896', '11723', '08003', '07390', '19053', '000', '07073', '06902', '11777', '30356', '11545', '10507', '92508', '11010', '11901', '60181', '99999', '30005', '07052', '33634', '07111', '07207', '20705', '07960', '55439', '14221-5783', '30188', '84020', 11370.0, 11361.0, 11417.0, 10021.0, 10312.0, 11375.0, 10465.0, 11206.0, 11697.0, 11357.0, 10458.0, 11224.0, 10305.0, 11236.0, 11214.0, 11691.0, 11354.0, 11102.0, 10022.0, 10023.0, 11209.0, 10463.0, 11229.0, 10075.0, 11231.0, 10024.0, 11223.0, 10028.0, 11412.0, 11385.0, 11360.0, 10451.0, 11692.0, 11040.0, 10308.0, 10038.0, 11220.0, 11369.0, 11235.0, 11416.0, 11372.0, 11238.0, 10128.0, 10017.0, 11230.0, 11694.0, 10012.0, 10309.0, 11414.0, 11232.0, 10472.0, 10033.0, 11434.0, 11356.0, 10468.0, 11358.0, 11228.0, 10306.0, 10016.0, 10036.0, 10304.0, 11001.0, 10454.0, 11377.0, 11216.0, 11106.0, 10461.0, 10003.0, 10006.0, 10314.0, 10030.0, 11365.0, 11005.0, 11103.0, 11379.0, 10462.0, 11368.0, 11435.0, 10014.0, 11215.0, 10029.0, 11233.0, 11222.0, 10471.0, 11225.0, 11373.0, 10301.0, 11212.0, 11201.0, 10307.0, 10452.0, 11237.0, 11104.0, 11432.0, 11436.0, 11362.0, 11420.0, 11208.0, 10470.0, 11366.0, 11429.0, 11219.0, 11413.0, 10466.0, 11367.0, 11419.0, 10455.0, 10019.0, 11422.0, 10467.0, 11239.0, 11378.0, 10025.0, 11226.0, 10011.0, 10009.0, 11210.0, 11423.0, 11426.0, 10034.0, 11105.0, 10475.0, 10002.0, 11355.0, 11221.0, 10032.0, 11204.0, 11374.0, 11364.0, 10001.0, 11249.0, 11411.0, 11234.0, 10473.0, 10065.0, 11211.0, 11217.0, 11218.0, 11203.0, 10460.0, 10013.0, 11205.0, 10310.0, 10010.0, 11101.0, 10007.0, 11004.0, 11433.0, 10005.0, 10027.0, 10474.0, 11421.0, 10469.0, 10026.0, 10035.0, 11207.0, 11213.0, 10453.0, 11418.0, 11415.0, 10040.0, 10457.0, 10031.0, 10303.0, 11427.0, 10069.0, 11428.0, 10464.0, 10004.0, 10018.0, 10456.0, 10302.0, 
11363.0, 11693.0, 10459.0, 10037.0, 10039.0, 7094.0, 10103.0, 10044.0, 10280.0, 7030.0, 11563.0, 10553.0, 11430.0, 7117.0, 10523.0, 11552.0, 11003.0, 7114.0, '19044', '11701', '11754', '54305-1654', '07664', '10549', '95476-9005', '10577', '33137-0098', '07102', '171111', '92019', '11797-9012', 75081.0, 10020.0, 10595.0, 10119.0, 10111.0, 11109.0, 10591.0, 83.0, 10701.0, 10129.0, 11520.0, 10122.0, 7310.0, 11733.0, 92123.0, 7632.0, 11507.0, 8776.0, 8691.0, 11042.0, 10710.0, 98206.0, 14225.0, 11577.0, '10523', '23541-1223', 11716.0, 75240.0, 11550.0, 7070.0, 10000.0, 13202.0, 8540.0, 10282.0, 11021.0, 11763.0, 7801.0, 60148.0, 11501.0, 11749.0, 7663.0, 11241.0, 10153.0, 10123.0, 10281.0, 0.0, 33486.0, 85080.0, 11704.0, 55812.0, 11.0, 10801.0, 11530.0, 11371.0, 7104.0, 23113.0, 11741.0, 11695.0, 10162.0, 85044.0, 7073.0, 10048.0, 11566.0, 11202.0, 10705.0, 11553.0, 11561.0, 8724.0, '48195-0391', '08540', 11746.0, 11797.0, 77201.0, 11542.0, 11251.0, 7410.0, 7450.0, 11793.0, 11559.0, 11768.0, 11581.0, 11359.0, 11590.0, 11703.0, 75024.0, 95476.0, 92506.0, 7307.0, 10107.0, 8520.0, 7144.0, 11011.0, 10803.0, 14644.0, 7920.0, 10984.0, 11570.0, 10965.0, 11778.0, 17032.0, 11580.0, 11569.0, 11565.0, 7901.0, 7041.0, 11756.0, 90304.0, 18466.0, 11386.0, 33304.0, 7306.0, 1123.0, 11710.0, 30339.0, 8069.0, 7024.0, 11598.0, '11788', '07086', '45274-2596', 7042.0, 10604.0, 7304.0, 8854.0, 11743.0, 11803.0, 10708.0, 6851.0, 30005.0, 60018.0, 6511.0, 91754.0, 7302.0, 8527.0, 23502.0, 11599532.0, 11735.0, 8690.0, 7430.0, 11554.0, 8003.0, 7601.0, 10112.0, 11725.0, 10041.0, 114566.0, 95762.0, '14221', 20188.0, 73126.0, '60179', '11434-420', '07661', '14450', '07632', '07712', '61602', '10583', '29616', '11576', '20201', '11570', '11317', 'ZIP CODE', '11730', '35210-6928', '11251', '63042', '32255', 92018.0, 11776.0, 6854.0, 1000000.0, 3110.0, 53707.0, 80155.0, 7652.0, 11791.0, 11801.0, 11030.0, 10601.0, 11804.0, 7047.0, 7730.0, 11020.0, 19044.0, 12203.0, '10041', '11561', '75403', '43216', '06801-1459', '6462430478', '02140', '07410', '11738', '74147', '10925', '11695', '08854', '07458', '32750', '11215-0389', '18519', '15219', '11756-0215', '11715', '48195-0954', '33428', '70163', '11704', '11779', '75026-0848', '02205', '7056', '11802', '02459', '08807', '11510', '64108', '85711', '07620', '98036', '10541', '10573', '11590-5114', '92013-0848', '10514', '07304', '07650', '91710', '18773-9635', '30303', '0', '11729', '78265', '07090', '11542', '10550', '11797', '55125', '33486', '30348-5658', '60076', '30360', '10101', '10111', '12203', '12941-0129', '91754', '11001-2024', '10123', '91356', '44122-5340', '10602-5055', '48451-0505', '35219', '07144', '11024', '1143', '07042', '07079', '08873', '10522', '11756', '11762', '08827-4010', '08080', '18966', '07002', '10994', '75086', '07093', '11797-9001', '60018-4501', '11020', '48393-1022', '33611', '11793', '11111', '43607', '32073', '12345', '11766', 'INDIAN WEL', '03104', '10528', '20123', '07203', '12024', '11549', '19850', '27713', '07670', '07067', '54602', '11767', '37214', '78265-9707', '0000000', '14206', '07417', '11540', '92923', '85251-6547', '30144-7518', '11741', '11797-9004', '91504', '43054', '10015', '117', '11507', '11753', 11747.0, 7981.0, 11980.0, 11576.0, 7372.0, 14615.0, 1208.0, 7940.0, '07675', '28201-1115', '14108', '10532', '14009', '10566', '34230', '14203', '06901', 'NEWARK', '11743', '14883', '11570-4705', '11535', '94566-9049', '6201-5773', '10008', '11533', '60523', '66207', '43218-3083', '80111', '10703', '10107', '29659', '320896', 
'08086', '12590', 'ANONYMOUS', '92036', '10952', '35476', 'NY', '45241', '11772-3518', '43614', '10150', '18042', '48090-2036', '11795', '14692-2878', 'UNSURE', '20188', '90036', '30345', '782659705', '11735-9000', '08830', '11760', '10435', '11518', '33122', '07045', '67504', '92506', '55438-0902', 10177.0, 10115.0, 11242.0, 10165.0, 76406.0, 17602.0, 12814.0, 10550.0, 7646.0, 33130.0, 11111.0, 7093.0, 33480.0, 2907.0, 22172.0, 11787.0, 14626.0, 68135.0, 7017.0, 10104.0, 10589.0, '10023-0007', 11341.0, 29006.0, 10703.0, 10605.0, 11431.0, 10116.0, 1850.0, 74106.0, 11514.0, 12345.0, 84770.0, 18015.0, 10150.0, 12779.0, 100000.0, 10603.0, 10108.0], dtype=object)

It's clear that Incident Zip has some issues beyond the mixed data types:

Mixed data types, sometimes floats, sometimes strings

Some of the zip codes have 4 extra digits appended after a hyphen

Some zip codes are nan, others are values like ?, UNKNOWN, ANONYMOUS and so on

In [15]: # Function that cleans the Incident Zip values and returns nan for data that cannot be cleaned
         def correct_zip(zip_code):
             try:
                 zip_code = int(float(zip_code))
             except:
                 try:
                     zip_code = int(float(zip_code.split('-')[0]))
                 except:
                     return np.nan
             if zip_code < 10000 or zip_code > 19999:
                 return np.nan
             else:
                 return str(zip_code)
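A few spot checks of correct_zip on the kinds of raw values seen in the unique() output above; the expected results in the comments follow from the function's logic rather than from the notebook's own output.

    # Spot-check correct_zip on representative raw values from the column
    for raw in ['11368', '10163-4668', 11370.0, 'UNKNOWN', '00083', np.nan]:
        print(repr(raw), '->', correct_zip(raw))
    # '11368'      -> '11368'  (already a valid-looking zip)
    # '10163-4668' -> '10163'  (hyphenated suffix stripped)
    # 11370.0      -> '11370'  (float converted back to a string zip)
    # 'UNKNOWN'    -> nan      (cannot be parsed)
    # '00083'      -> nan      (outside the 10000-19999 range kept here)
    # nan          -> nan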


In [16]: # Apply correct_zip function to clean Incident Zip data
         df['Incident Zip'] = df['Incident Zip'].apply(correct_zip)

In [17]: # Remove rows from data that have Incident Zip as null, i.e. nan
         df = df[df['Incident Zip'].notnull()]

In [18]: # Count of non-null values for every column
         df_columns = list(df.columns)
         df[df_columns].count()

Out[18]: Created Date      1951306
         Closed Date       1910835
         Agency            1951306
         Complaint Type    1951306
         Incident Zip      1951306
         City              1951245
         Status            1951303
         Borough           1951306
         Latitude          1887423
         Longitude         1887423
         dtype: int64

In [19]: # Closed Date, Latitude, and Longitude all have missing values; best to remove the rows where data in those columns is missing
         df = df[(df['Latitude'].notnull()) & (df['Longitude'].notnull()) & (df['Closed Date'].notnull())]

In [20]: # Count of non-null values for every column
         df_columns = list(df.columns)
         df[df_columns].count()

Out[20]: Created Date      1847268
         Closed Date       1847268
         Agency            1847268
         Complaint Type    1847268
         Incident Zip      1847268
         City              1847211
         Status            1847265
         Borough           1847268
         Latitude          1847268
         Longitude         1847268
         dtype: int64

In [21]: # As part of the EDA process, all columns should be explored to identify columns with messy data or nan
         df['Agency'].unique()

Out[21]: array(['DOT', 'DCA', 'NYPD', 'FDNY', 'DOHMH', 'TLC', 'DOE', 'DOB', 'DPR', 'EDC', 'DSNY', 'DEP',
                'DOITT', 'HPD', '3-1-1', 'DFTA', 'DHS'], dtype=object)

In [22]: df['Complaint Type'].unique()

Out[22]: array(['Sidewalk Condition', 'Consumer Complaint', 'Blocked Driveway', 'EAP Inspection - F59', 'Derelict Vehicle', 'Street Condition', 'Indoor Air Quality', 'Broken Muni Meter', 'Illegal Parking', 'Animal Abuse', 'Taxi Complaint', 'Street Sign - Damaged', 'Noise - Street/Sidewalk', 'Noise - Commercial', 'School Maintenance', 'Homeless Encampment', 'Noise - Vehicle', 'Construction', 'Vending', 'Traffic', 'Food Establishment', 'Damaged Tree', 'Street Sign - Missing', 'For Hire Vehicle Complaint', 'Dead Tree', 'Standing Water', 'Illegal Tree Damage', 'Highway Condition', 'Noise - Park', 'Overgrown Tree/Branches', 'Drinking', 'Maintenance or Facility', 'Fire Safety Director - F58', 'Noise - Helicopter', 'Illegal Fireworks', 'Root/Sewer/Sidewalk Condition', 'Urinating in Public', 'Dirty Conditions', 'Noise - House of Worship', 'DPR Internal', 'Food Poisoning', 'Industrial Waste', 'Water System', 'Graffiti', 'Curb Condition', 'Building/Use', 'Fire Alarm - New System', 'Mold', 'Street Sign - Dangling', 'Rodent', 'Unsanitary Animal Pvt Property', 'Public Payphone Complaint', 'Animal in a Park', 'General Construction/Plumbing', 'Asbestos', 'Sewer', 'Street Light Condition', 'Plumbing', 'FLOORING/STAIRS', 'Harboring Bees/Wasps', 'Fire Alarm - Modification', 'Fire Alarm - Addition', 'Bike/Roller/Skate Chronic', 'Violation of Park Rules', 'City Vehicle Placard Complaint', 'Fire Alarm - Reinspection', 'Disorderly Youth', 'Bus Stop Shelter Placement', 'Open Flame Permit', 'Indoor Sewage', 'Bike Rack Condition', 'Public Toilet', 'Bridge Condition', 'Posting Advertisement', 'GENERAL', 'Senior Center Complaint', 'Panhandling', 'Snow', 'Special Projects Inspection Team (SPIT)', 'Municipal Parking Facility', 'Special Enforcement', 'Highway Sign - Damaged', 'Taxi Report', 'Broken Parking Meter', 'Beach/Pool/Sauna Complaint', 'Derelict Vehicles', 'Traffic Signal Condition', 'HEAT/HOT WATER', 'Missed Collection (All Materials)', 'Public Assembly', 'Overflowing Litter Baskets', 'Sweeping/Missed', 'Vacant Lot', 'ELEVATOR', 'Sanitation Condition', 'Other Enforcement', 'Investigations and Discipline (IAD)', 'Drinking Water', 'Found Property', 'Squeegee', 'Unsanitary Pigeon Condition', 'ELECTRIC', 'Hazardous Materials', 'UNSANITARY CONDITION', 'Bottled Water', 'Boilers', 'Electrical', 'Elevator', 'Emergency Response Team (ERT)', 'BEST/Site Safety', 'Request Xmas Tree Collection', 'Cranes and Derricks', 'Tanning', 'PLUMBING', 'APPLIANCE', 'HEATING', 'Litter Basket / Request', 'For Hire Vehicle Report', 'DOOR/WINDOW', 'WATER LEAK', 'Sweeping/Inadequate', 'Scaffold Safety', 'Fire Alarm - Replacement', 'Radioactive Material', 'Recycling Enforcement', 'Overflowing Recycling Baskets', 'Derelict Bicycle', 'Homeless Person Assistance', 'PAINT/PLASTER', 'Highway Sign - Dangling', 'OUTSIDE BUILDING', 'Water Quality', 'Public Assembly - Temporary', 'Miscellaneous Categories', 'Lifeguard', 'NONCONST', 'PAINT - PLASTER', 'GENERAL CONSTRUCTION', 'CONSTRUCTION', 'Legal Services Provider Complaint', 'Non-Residential Heat', 'Highway Sign - Missing', 'X-Ray Machine/Equipment', 'SAFETY', 'VACANT APARTMENT', 'Stalled Sites', 'Building Condition', 'AGENCY', 'Transportation Provider Complaint', 'Water Conservation', 'Noise', 'Air Quality', 'Plant', 'Lead', 'Collection Truck Noise', 'Special Natural Area District (SNAD)', 'Adopt-A-Basket', 'Literature Request', 'SG-99', 'Noise - Residential', 'Non-Emergency Police Matter', 'New Tree Request'], dtype=object)

In [23]: df['City'].unique()

Out[23]: array(['CORONA', 'NEW YORK', 'FLUSHING', 'BRONX', 'WHITESTONE', 'STATEN ISLAND', 'OZONE PARK', 'EAST ELMHURST', 'LONG ISLAND CITY', 'BROOKLYN', 'RIDGEWOOD', 'COLLEGE POINT', 'SUNNYSIDE', 'ASTORIA', 'OAKLAND GARDENS', 'FOREST HILLS', 'BAYSIDE', 'SAINT ALBANS', 'QUEENS VILLAGE', 'BELLEROSE', 'JACKSON HEIGHTS', 'HOWARD BEACH', 'RICHMOND HILL', 'WOODSIDE', 'HOLLIS', 'WOODHAVEN', 'JAMAICA', 'FRESH MEADOWS', 'ELMHURST', 'Ridgewood', 'MASPETH', 'GLEN OAKS', 'REGO PARK', 'Long Island City', 'MIDDLE VILLAGE', 'Bayside', nan, 'SOUTH RICHMOND HILL', 'ROSEDALE', 'Little Neck', 'LITTLE NECK', 'Jamaica', 'Richmond Hill', 'SPRINGFIELD GARDENS', 'Fresh Meadows', 'East Elmhurst', 'Woodhaven', 'Howard Beach', 'FLORAL PARK', 'KEW GARDENS', 'SOUTH OZONE PARK', 'CAMBRIA HEIGHTS', 'Far Rockaway', 'Flushing', 'South Ozone Park', 'Elmhurst', 'Ozone Park', 'Corona', 'South Richmond Hill', 'Jackson Heights', 'FAR ROCKAWAY', 'Queens Village', 'Springfield Gardens', 'Astoria', 'Cambria Heights', 'Glen Oaks', 'ROCKAWAY PARK', 'Rego Park', 'ARVERNE', 'Middle Village', 'NEW HYDE PARK', 'Woodside', 'Kew Gardens', 'Rockaway Park', 'Hollis', 'Maspeth', 'Rosedale', 'Saint Albans', 'Arverne', 'BREEZY POINT', 'Forest Hills', 'Oakland Gardens', 'Sunnyside', 'Bellerose', 'QUEENS', 'Whitestone', 'Floral Park', 'New Hyde Park', 'College Point', 'NEW HEMPSTEAD', 'UNKNOWN', 'BEDFORD HILLS', 'Breezy Point', 'BELLMORE', 'MANHATTAN'], dtype=object)

In [24]: df['Status'].unique()

Out[24]: array(['Closed', 'Pending', 'Assigned', nan, 'Open', 'Started'], dtype=object)

In [25]: df['Borough'].unique()

Out[25]: array(['QUEENS', 'MANHATTAN', 'BRONX', 'STATEN ISLAND', 'BROOKLYN', 'Unspecified'], dtype=object)

Columns with messy or missing data


City - contains some city names in uppercase and others in lowercase

Status - contains nan values

Borough - contains 'Unspecified' boroughs

In [26]: # Let's look at the unspecified boroughs; we want to be sure that removing this data from df won't cause problems later on
         df[df['Borough']=='Unspecified'][['Agency', 'City']]

Out[26]:
            Agency  City
Unique Key
29167930    NYPD    STATEN ISLAND
28650971    DPR     STATEN ISLAND
28993659    NYPD    STATEN ISLAND
28337889    NYPD    STATEN ISLAND
28768853    TLC     NEW HEMPSTEAD
27339785    TLC     BEDFORD HILLS
28911820    NYPD    STATEN ISLAND
28767733    NYPD    STATEN ISLAND
28867255    NYPD    STATEN ISLAND
29044115    NYPD    STATEN ISLAND
29183835    DCA     BELLMORE
28994609    NYPD    STATEN ISLAND
29503930    NYPD    STATEN ISLAND
29040790    NYPD    STATEN ISLAND
29211842    NYPD    STATEN ISLAND
28151378    NYPD    STATEN ISLAND
28356529    NYPD    STATEN ISLAND
28983245    NYPD    STATEN ISLAND
28763594    NYPD    STATEN ISLAND
29217588    NYPD    STATEN ISLAND
29209600    NYPD    STATEN ISLAND
28961292    NYPD    STATEN ISLAND
29159476    NYPD    STATEN ISLAND
28945701    NYPD    STATEN ISLAND
28305246    NYPD    STATEN ISLAND
28501571    NYPD    STATEN ISLAND
27739170    DPR     STATEN ISLAND
28135459    TLC     BAYSIDE
28944841    NYPD    STATEN ISLAND

In [27]: # Majority of the data belongs to the NYPD Agency and occurs in Staten Island
         # To ensure I don't lose too much data from NYPD, I need to check that this accounts for a negligible share of NYPD records
         nypd_total = df[df['Agency']=='NYPD']['Borough'].count()
         nypd_unspecified = df[(df['Borough']=='Unspecified') & (df['Agency']=="NYPD")]['Borough'].count()
         nypd_unspec_perct = nypd_unspecified/nypd_total*100
         print("%1.3f"%nypd_unspec_perct)

0.005

In [28]: # Boroughs that are unspecified are so few that they can be removed
         df = df[df['Borough'] != 'Unspecified']

In [29]: # Number of rows with Status as nan
         status_nan = len(df[df['Status'].isnull()].index)
         print(status_nan)

3

In [30]: # The number of rows with Status as nan is 3, which is also negligible; I can remove them from the dataframe as well
         df = df[df['Status'].notnull()]

In [31]: # Since some city values are represented both in uppercase and lowercase, it's better to have the city values in the same case
         # Convert all City values to Camel Case
         def camel_case(city):
             try:
                 city = city.split(' ')
                 city = ' '.join([x.lower().capitalize() for x in city])
                 if city == 'Unknown':
                     return np.nan
                 else:
                     return city
             except:
                 return np.nan
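A couple of spot checks of camel_case on values of the kind seen in the City column; the expected results in the comments follow from the function's logic rather than from the notebook's own output.

    # Spot-check camel_case on representative City values
    for raw in ['STATEN ISLAND', 'Long Island City', 'UNKNOWN', np.nan]:
        print(repr(raw), '->', camel_case(raw))
    # 'STATEN ISLAND'    -> 'Staten Island'
    # 'Long Island City' -> 'Long Island City'
    # 'UNKNOWN'          -> nan  (mapped to missing)
    # nan                -> nan  (split fails and is caught by the except)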

In [32]: # Apply camel_case function to City column
         df['City'] = df['City'].apply(camel_case)

In [33]: # Let's view the City values with nan
         df[df['City'].isnull()].groupby('Agency')['Status'].count()

Out[33]: Agency
         DOT    57
         TLC     1
         Name: Status, dtype: int64

In [34]: # 57 of the cities with nan value belong to the DOT Agency.
         # It's better to know if this is significant before removing it
         city_null_dot = len(df[(df['City'].isnull()) & (df['Agency']=='DOT')].index)
         dot_total = len(df[df['Agency']=='DOT'].index)
         city_null_dot_perct = (city_null_dot/dot_total)*100
         print("%1.3f"%city_null_dot_perct)

0.024

In [35]: # 0.024% is negligible, so cities with nan can be removed from df
         df = df[df['City'].notnull()]

In [36]: # Created Date and Closed Date aren't DateTime objects. It's convenient to work with DateTime objects.
         # Convert Created Date and Closed Date values to DateTime objects.
         import datetime
         df['Created Date'] = df['Created Date'].apply(lambda x: datetime.datetime.strptime(x, '%m/%d/%Y %I:%M:%S %p'))
         df['Closed Date'] = df['Closed Date'].apply(lambda x: datetime.datetime.strptime(x, '%m/%d/%Y %I:%M:%S %p'))
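As a side note (an alternative not used in this notebook), pandas can do the same conversion in a vectorized way with pd.to_datetime, which is typically much faster on a dataframe of this size; the format string matches the one used above.

    # Vectorized alternative to the row-wise strptime conversion above
    df['Created Date'] = pd.to_datetime(df['Created Date'], format='%m/%d/%Y %I:%M:%S %p')
    df['Closed Date'] = pd.to_datetime(df['Closed Date'], format='%m/%d/%Y %I:%M:%S %p')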

In [37]: # It would be useful to create a column to compute how long it takes to close a complaint
         df['Processing Time'] = df['Closed Date'] - df['Created Date']

In [38]: # Viewing the descriptive statistics on the Processing Time can give some insights on turnaround time
         df['Processing Time'].describe()

Out[38]: count                       1847178
         mean     14 days 18:28:19.930685
         std      47 days 13:22:45.772770
         min         -365 days +00:00:00
         25%             0 days 03:10:15
         50%             2 days 00:00:00
         75%       8 days 19:38:00.750000
         max           918 days 14:08:12
         Name: Processing Time, dtype: object

From the descriptive statistics, we can see that the minimum processing time is negative. This means something is wrong with the date data and it should be explored.

In [39]: # View Processing Time data that is negative
         df[df['Processing Time'] < datetime.timedelta(0,0,0)].head(3)

Out[39]:
            Created Date  Closed Date  Agency  Complaint Type                  Incident Zip  City      Status   Borough    Latitude   Longitude   Processing Time
Unique Key
28581213    2014-07-31    2014-07-23   DOHMH   Unsanitary Animal Pvt Property  10456         Bronx     Pending  BRONX      40.835153  -73.912449  -8 days
28541630    2014-07-25    2014-07-07   DOHMH   Rodent                          11206         Brooklyn  Pending  BROOKLYN   40.701265  -73.929265  -18 days
28934215    2014-09-22    2014-08-25   DOHMH   Rodent                          10031         New York  Pending  MANHATTAN  40.827318  -73.946620  -28 days

There are issues with some data in df: the Closed Date in some rows precedes the Created Date, which results in a negative processing time.

In [40]: # Remove all data from df that have negative Processing Time
         df = df[df['Processing Time'] >= datetime.timedelta(0,0,0)]

In [41]: # Count of non-null values for every column
         df_columns = list(df.columns)
         df[df_columns].count()

Out[41]: Created Date       1830970
         Closed Date        1830970
         Agency             1830970
         Complaint Type     1830970
         Incident Zip       1830970
         City               1830970
         Status             1830970
         Borough            1830970
         Latitude           1830970
         Longitude          1830970
         Processing Time    1830970
         dtype: int64

Now that the data looks clean enough for further exploration, I'll create a function that incorporates the whole data cleaning process. This makes future work on the dataset convenient.

In [42]: def open_311_data(datafile):
             import numpy as np
             import pandas as pd
             import datetime

             # Function to clean Incident Zip
             def correct_zip(zip_code):
                 try:
                     zip_code = int(float(zip_code))
                 except:
                     try:
                         zip_code = int(float(zip_code.split('-')[0]))
                     except:
                         return np.nan
                 if zip_code < 10000 or zip_code > 19999:
                     return np.nan
                 else:
                     return str(zip_code)

             # Function to clean City values, i.e. convert City values to Camel Case
             def camel_case(city):
                 try:
                     city = city.split(' ')
                     city = ' '.join([x.lower().capitalize() for x in city])
                     if city == 'Unknown':
                         return np.nan
                     else:
                         return city
                 except:
                     return np.nan

             # Read the file
             df = pd.read_csv(datafile, index_col='Unique Key')

             # Drop columns that aren't relevant to this analysis
             df_cols_rmv = ['Incident Address', 'Street Name', 'Cross Street 1', 'Cross Street 2',
                            'Intersection Street 1', 'Intersection Street 2', 'Landmark', 'Facility Type',
                            'Due Date', 'Resolution Description', 'Community Board',
                            'X Coordinate (State Plane)', 'Y Coordinate (State Plane)',
                            'Park Facility Name', 'Park Borough', 'School Name', 'School Number',
                            'School Region', 'School Code', 'School Phone Number', 'School Address',
                            'School City', 'School State', 'School Zip', 'School Not Found',
                            'School or Citywide Complaint', 'Vehicle Type', 'Taxi Company Borough',
                            'Taxi Pick Up Location', 'Bridge Highway Name', 'Bridge Highway Direction',
                            'Road Ramp', 'Bridge Highway Segment', 'Garage Lot Name', 'Ferry Direction',
                            'Ferry Terminal Name', 'Location', 'Address Type', 'Agency Name',
                            'Resolution Action Updated Date', 'Descriptor', 'Location Type']
             df.drop(df_cols_rmv, inplace=True, axis=1)

             # Clean Incident Zip
             df['Incident Zip'] = df['Incident Zip'].apply(correct_zip)

             # Clean City values
             df['City'] = df['City'].apply(camel_case)

             # Drop unspecified boroughs
             df = df[df['Borough'] != 'Unspecified']

             # Drop all rows with nan
             df = df.dropna(how='any')

             # Convert Created Date and Closed Date to datetime objects, create a Processing Time column
             df['Created Date'] = df['Created Date'].apply(lambda x: datetime.datetime.strptime(x, '%m/%d/%Y %I:%M:%S %p'))
             df['Closed Date'] = df['Closed Date'].apply(lambda x: datetime.datetime.strptime(x, '%m/%d/%Y %I:%M:%S %p'))
             df['Processing Time'] = df['Closed Date'] - df['Created Date']

             # Remove negative processing time rows from the dataframe
             df = df[df['Processing Time'] >= datetime.timedelta(0,0,0)]

             return df

In [43]: # Open, read, and process the NYC 311 dataset using the open_311_data function
         datafile = '311_Service_Requests_from_2014.csv'
         df = open_311_data(datafile)
         df.head(3)

C:\Users\Ikp\Anaconda3\lib\site-packages\IPython\core\interactiveshell.py:2802: DtypeWarning: Columns (8,17,40,41,42,43,44,45,46,47,48,49) have mixed types. Specify dtype option on import or set low_memory=False.
  if self.run_code(code, result):
C:\Users\Ikp\Anaconda3\lib\site-packages\numpy\lib\arraysetops.py:463: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
  mask |= (ar1 == a)

Out[43]:
            Created Date         Closed Date          Agency  Complaint Type      Incident Zip  City      Status  Borough    Latitude   Longitude   Processing Time
Unique Key
28457271    2014-07-11 15:08:58  2014-08-05 12:41:37  DOT     Sidewalk Condition  11368         Corona    Closed  QUEENS     40.751870  -73.862718  24 days 21:32:39
28644314    2014-08-08 14:06:22  2014-08-12 11:33:34  DCA     Consumer Complaint  10014         New York  Closed  MANHATTAN  40.732623  -74.001119  3 days 21:27:12
29306886    2014-11-18 00:52:40  2014-11-18 01:35:22  NYPD    Blocked Driveway    11358         Flushing  Closed  QUEENS     40.760384  -73.806826  0 days 00:42:42

Visualizations

In [44]: import matplotlib.pyplot as plt
         %matplotlib inline

In [45]: # Visualizing 311 call data incidents with a heat map
         import gmaps

In [46]: import settings  # Contains my Google map API key
         gmaps.configure(api_key=settings.API_KEY)  # Fill in with your API key
         new_york_coordinates = (40.75, -74.00)
         locations = df[['Latitude', 'Longitude']]
         fig = gmaps.figure(center=new_york_coordinates, zoom_level=12)
         heatmap_layer = gmaps.heatmap_layer(locations)
         fig.add_layer(heatmap_layer)
         fig

Failed to display Jupyter Widget of type Figure.

If you're reading this message in the Jupyter Notebook or JupyterLab Notebook, it may mean that the widgets JavaScript is still loading. If this message persists, it likely means that the widgets JavaScript library is either not installed or not enabled. See the Jupyter Widgets Documentation (https://ipywidgets.readthedocs.io/en/stable/user_install.html) for setup instructions.

If you're reading this message in another frontend (for example, a static rendering on GitHub or NBViewer), it may mean that your frontend doesn't currently support widgets.
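Since the gmaps widget doesn't render in a static view, a rough static alternative (not part of the original notebook) is a 2D density plot of the Latitude/Longitude pairs with matplotlib; it gives a similar picture of incident density without needing an API key.

    # Static density sketch as a stand-in for the gmaps heatmap
    import matplotlib.pyplot as plt

    plt.figure(figsize=(8, 8))
    plt.hexbin(df['Longitude'], df['Latitude'], gridsize=200, bins='log', cmap='inferno')
    plt.colorbar(label='log10(count)')
    plt.xlabel('Longitude')
    plt.ylabel('Latitude')
    plt.title('311 incident density (2014)');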

In [47]: # Exploration of incidents by Borough
         borough = df.groupby('Borough')
         borough.size().plot(kind='bar', figsize=(12,6), title='Incidents by Borough');

From the graph, we can see that Brooklyn has the most incidents, while Staten Island has the least. It should also be noted that Staten Island is the smallest of the five boroughs, which could be why it has the fewest incidents.
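To follow up on the borough-size caveat, the counts could be normalized by population. This is only an illustrative sketch; the populations below are approximate 2010 census figures added here for the example and are not part of the dataset.

    # Rough per-capita view (approximate 2010 census populations, for illustration only)
    borough_pop = pd.Series({'BRONX': 1385108, 'BROOKLYN': 2504700, 'MANHATTAN': 1585873,
                             'QUEENS': 2230722, 'STATEN ISLAND': 468730})
    incidents_per_1000 = borough.size() / borough_pop * 1000
    incidents_per_1000.sort_values(ascending=False).plot(kind='bar', figsize=(12,6),
                                                         title='Incidents per 1,000 residents by Borough');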

In [48]: # Visualization of incidents by Agency
         agency = df.groupby('Agency')
         agency.size().plot(kind='bar', figsize=(12,6), title='Incident calls per Agency');

HPD has the highest number of complaints, followed by NYPD.

In [49]: # Visualization of number of incidents in each Borough by Agency
         agency_borough = df.groupby(['Agency','Borough']).size().unstack()
         agency_borough.plot(kind='bar', title='Total Incidents in each Borough by Agency', figsize=(15,7));

In [50]: # Visualization of top Agencies with most incidents per borough
         col_number = 2
         row_number = 3


         fig, axes = plt.subplots(row_number, col_number, figsize=(12,12))

         for i, (label, col) in enumerate(agency_borough.iteritems()):
             ax = axes[int(i/col_number), i%col_number]
             col = col.sort_values(ascending=True)[:5]
             col.plot(kind='barh', ax=ax)
             ax.set_title(label)

         plt.tight_layout()

In [51]: # Visualization of most Complaints per Borough
         borough_comp = df.groupby(['Complaint Type','Borough']).size().unstack()

         col_number = 2
         row_number = 3
         fig, axes = plt.subplots(row_number, col_number, figsize=(12,12))

         for i, (label, col) in enumerate(borough_comp.iteritems()):
             ax = axes[int(i/col_number), i%col_number]
             col = col.sort_values(ascending=True)[:15]
             col.plot(kind='barh', ax=ax)
             ax.set_title(label)

         plt.tight_layout()

Visualization of processing time.

The Processing Time in the dataframe is a timedelta object; it is easier to convert the processing time into floats for calculation.
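To make the conversion in the next cell concrete: dividing a Timedelta by np.timedelta64(1, 'D') expresses it as a fractional number of days.

    # Example of the timedelta-to-days conversion used below
    import numpy as np
    import pandas as pd

    td = pd.Timedelta('2 days 12:00:00')
    print(td / np.timedelta64(1, 'D'))   # 2.5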

In [52]: import numpy as np
         df['Processing Time Float'] = df['Processing Time'].apply(lambda x: x/np.timedelta64(1, 'D'))

In [53]: # Histogram of Processing Time
         df['Processing Time Float'].hist(bins=30, figsize=(15,7));


Since datetime objects occur in the dataframe, I can build a bar graph to show incidents per month and other interesting information. This allows easy discovery of noticeable trends and seasonality.

I can do this by adding a column to the data to keep track of year and month only.

In [54]: import datetime
         df['YYYY-MM'] = df['Created Date'].apply(lambda x: datetime.datetime.strftime(x, '%Y-%m'))

In [55]: # Incidents on a monthly basis
         monthly_incidents = df.groupby('YYYY-MM').size().plot(figsize=(12,5), title='Incidents on a monthly basis');

In [56]: # Boroughs with Processing Time on a monthly basis
         df.groupby(['YYYY-MM','Borough'])['Processing Time Float'].mean().unstack().plot(figsize=(15,7),
                                                                                          title='Processing time per Borough on a monthly basis');

In [57]: # Processing time per Borough
         df.groupby('Borough')['Processing Time Float'].mean().plot(kind='bar', figsize=(15,7),
                                                                     title='Processing Time per Borough');

In [58]: # Visualization of Number of Complaints per Agency on a monthly basis
         date_agency = df.groupby(['YYYY-MM', 'Agency'])
         date_agency.size().unstack().plot(kind='bar', figsize=(15,7),
                                           title='Number of Complaints per Agency on a monthly basis');
         plt.legend(loc='center left', bbox_to_anchor=(1.0, 0.5));

In [59]: # Visualization of Agencies with their number of Complaints
         df.groupby('Agency').size().sort_values(ascending=False).plot(kind='bar', figsize=(15,7),
                                                                       title='Number of Complaints per Agency');


Since HPD has the most complaints, I'll explore the data relating to HPD to learn more about the complaints it handles.

In [60]: # Visualization of Incidents handled by HPD by Borough on a monthly basis
         df[df['Agency']=='HPD'].groupby(['YYYY-MM','Borough']).size().unstack().plot(figsize=(12,7),
                                                                                      title='Incidents per Borough on a monthly basis');

In [61]: # Visualization of Complaints handled by HPD
         df[df['Agency']=='HPD'].groupby('Complaint Type').size().sort_values(ascending=False).plot(kind='bar',
                                                                                                    figsize=(12,6),
                                                                                                    title='Number of each complaint type handled by HPD');

Visualizations of Complaint Type

In [62]: # Visualization of number of complaints per complaint type
         df.groupby('Complaint Type').size().sort_values(ascending=False)[:20].plot(kind='bar', figsize=(15,6),
                                                                                    title='Bar graph of Complaint Type');

Noise - Residential has the most complaints; let's explore it further.

In [63]: # Borough with the most Noise - Residential complaints
         df[df['Complaint Type']=='Noise - Residential'].groupby('Borough').size()[:10].sort_values(ascending=False).plot(kind='bar',
                                                                                                                          title='Residential Noise Complaints per Borough');


Brooklyn has the most residential noise complaints. It would be interesting to know whether this noise peaked at some point during the year or is uniform throughout the year.
In [64]: brooklyn_noise = df[(df['Borough']=='BROOKLYN') & (df['Complaint Type']=='Noise - Residential')]
         brooklyn_noise.groupby('YYYY-MM').size().plot(kind='bar', figsize=(12,6),
                                                       title='Residential noise complaints in Brooklyn on a monthly basis');
In [65]: # Complaints per Borough through the year
         df.groupby(['YYYY-MM','Borough']).size().unstack().plot(figsize=(15,6))
         plt.legend(loc='center left', bbox_to_anchor=(1.0, 0.5));
Observations

Brooklyn has the highest number of incident calls, followed by Queens. Staten Island has the fewest incident calls.

HPD has the most incident calls, followed by NYPD.

The majority of incidents occur in January, followed by November; incident calls dip to their lowest in September, followed by April.

HPD-related incident calls follow a nearly regular pattern across all boroughs from month to month. Heat/Hot Water complaints are the most frequent.

Noise in residential areas was the most common complaint in 2014, followed by Heat/Hot Water complaints.

Noise complaints peaked in September and were lowest in February.
Conclusion

Brooklyn has on average the slowest processing time from month to month, and this could be associated with the fact that it has the highest number of incident calls.