Ch3 Slides

PANDAS FOUNDATIONS
Indexing
time series
pandas Foundations
Using pandas to read datetime objects

read_csv() function
Can read strings into datetime objects
Need to specify parse_dates=True
ISO 8601 format
yyyy-mm-dd hh:mm:ss
pandas Foundations
Product sales CSV

pandas Foundations
Parse dates
In [1]: import pandas as pd
In [2]: sales = pd.read_csv('sales-feb-2015.csv',

...: parse_dates=True, index_col= 'Date')
pandas Foundations
Parse dates
In [3]: sales.head()
Out[3]:
Company Product Units
Date
2015-02-02 08:30:00 Hooli Software 3
2015-02-02 21:00:00 Mediacore Hardware 9
2015-02-03 14:00:00 Initech Software 13
2015-02-04 15:30:00 Streeplex Software 13
2015-02-04 22:00:00 Acme Coporation Hardware 14
pandas Foundations
Parse dates
In [4]: sales.info()
DatetimeIndex: 19 entries, 2015-02-02 08:30:00 to 2015-02-26
09:00:00
Data columns (total 3 columns):
Company 19 non-null object
Product 19 non-null object
Units 19 non-null int64
dtypes: int64(1), object(2)
memory usage: 608.0+ bytes
pandas Foundations
Selecting single datetime

In [5]: sales.loc['2015-02-19 11:00:00', 'Company']
Out[5]: Mediacore
pandas Foundations
Selecting whole day

In [6]: sales.loc['2015-2-5']
Out[6]:
Date
2015-02-05 02:00:00 Acme Coporation Software 19
2015-02-05 22:00:00 Hooli Service 10
pandas Foundations
Partial datetime string selection

Alternative formats:
sales.loc[February 5, 2015]
sales.loc[2015-Feb-5]
Whole month: sales.loc[2015-2]
Whole year: sales.loc[2015]
pandas Foundations
Selecting whole month

In [7]: sales.loc['2015-2']
Out[7]:
Date
2015-02-02 08:30:00 Hooli Software 3
2015-02-05 02:00:00 Acme Coporation Software 19
2015-02-05 22:00:00 Hooli Service 10
2015-02-09 09:00:00 Streeplex Service 19
2015-02-09 13:00:00 Mediacore Software 7
2015-02-11 23:00:00 Hooli Software 4
2015-02-16 12:00:00 Hooli Software 10
pandas Foundations
Slicing using dates/times

In [8]: sales.loc['2015-2-16':'2015-2-20']
Out[8]:
Date
2015-02-16 12:00:00 Hooli Software 10
2015-02-19 16:00:00 Mediacore Service 10
pandas Foundations
Convert strings to datetime

In [9]: evening_2_11 = pd.to_datetime(['2015-2-11 20:00',
...: '2015-2-11 21:00', '2015-2-11 22:00', '2015-2-11 23:00'])
In [10]: evening_2_11
Out[10]:
DatetimeIndex(['2015-02-11 20:00:00', '2015-02-11 21:00:00',
'2015-02-11 22:00:00', '2015-02-11 23:00:00'],
dtype='datetime64[ns]', freq=None)
pandas Foundations
Reindexing DataFrame
In [11]: sales.reindex(evening_2_11)
Out[11]:
2015-02-11 20:00:00 Initech Software 7.0
2015-02-11 21:00:00 NaN NaN NaN
2015-02-11 22:00:00 NaN NaN NaN
2015-02-11 23:00:00 Hooli Software 4.0
pandas Foundations
Filling missing values

In [12]: sales.reindex(evening_2_11, method='ffill')
Out[12]:
2015-02-11 23:00:00 Hooli Software 4
In [13]: sales.reindex(evening_2_11, method='bfill')

Out[13]:
2015-02-11 21:00:00 Hooli Software 4
2015-02-11 22:00:00 Hooli Software 4
2015-02-11 23:00:00 Hooli Software 4
PANDAS FOUNDATIONS
Lets practice!
PANDAS FOUNDATIONS
Resampling
time series data
pandas Foundations
Sales data

Out[3]:
Date
2015-02-02 08:30:00 Hooli Software 3
pandas Foundations
Resampling
Statistical methods over dierent time intervals
mean(), sum(), count(), etc.
Down-sampling
reduce datetime rows to slower frequency
Up-sampling
increase datetime rows to faster frequency
pandas Foundations
Aggregating means
In [4]: daily_mean = sales.resample('D').mean()
In [5]: daily_mean
Out[5]:
Units
Date
2015-02-02 6.0
2015-02-03 13.0
2015-02-04 13.5
2015-02-05 14.5
2015-02-06 NaN
2015-02-07 1.0
2015-02-08 NaN
2015-02-09 13.0
2015-02-10 NaN
2015-02-11 5.5
2015-02-12 NaN
2015-02-13 NaN
2015-02-14 NaN
pandas Foundations
Verifying
In [6]: print(daily_mean.loc['2015-2-2'])
Units 6.0
Name: 2015-02-02 00:00:00, dtype: float64
In [7]: print(sales.loc['2015-2-2', 'Units'])

Date
2015-02-02 08:30:00 3
2015-02-02 21:00:00 9
Name: Units, dtype: int64
In [8]: sales.loc['2015-2-2', 'Units'].mean()

Out[8]: 6.0
pandas Foundations
Method chaining
In [9]: sales.resample('D').sum()
Out[9]:
Units
Date
2015-02-02 6.0
2015-02-03 13.0
2015-02-04 13.5
2015-02-05 14.5
2015-02-06 NaN
2015-02-07 1.0
2015-02-08 NaN
2015-02-09 13.0
2015-02-10 NaN
2015-02-11 5.5
2015-02-12 NaN
2015-02-13 NaN
pandas Foundations
Method chaining
In [10]: sales.resample('D').sum().max()
Out[10]:
Units 29.0
dtype: float64
pandas Foundations
Resampling strings
In [11]: sales.resample('W').count()
Out[11]:
Date
2015-02-08 8 8 8
2015-02-15 4 4 4
2015-02-22 5 5 5
2015-03-01 2 2 2
pandas Foundations
Resampling frequencies
Input Description
min, T minute
H hour
D day
B business day
W week
M month
Q quarter
A year
pandas Foundations
Multiplying frequencies
In [12]: sales.loc[:,'Units'].resample('2W').sum()
Out[12]:
Date
2015-02-08 82
2015-02-22 79
2015-03-08 14
Freq: 2W-SUN, Name: Units, dtype: int64
pandas Foundations
Upsampling
In [13]: two_days = sales.loc['2015-2-4': '2015-2-5', 'Units']
In [13]: two_days
Out[13]:
Date
2015-02-04 15:30:00 13
2015-02-04 22:00:00 14
2015-02-05 02:00:00 19
2015-02-05 22:00:00 10
Name: Units, dtype: int64
pandas Foundations
Upsampling and filling

In [14]: two_days.resample('4H').ffill()
Out[14]:
Date
Date
2015-02-04 12:00:00 NaN
2015-02-04 16:00:00 13.0
2015-02-04 20:00:00 13.0
2015-02-05 00:00:00 14.0
2015-02-05 04:00:00 19.0
2015-02-05 08:00:00 19.0
2015-02-05 12:00:00 19.0
2015-02-05 16:00:00 19.0
2015-02-05 20:00:00 19.0
Freq: 4H, Name: Units, dtype: float64
PANDAS FOUNDATIONS
Lets practice!
PANDAS FOUNDATIONS
Manipulating
time series data
pandas Foundations
Sales data

...: parse_dates=['Date'])
Out[3]:
Date Company Product Units
0 2015-02-02 08:30:00 Hooli Software 3
1 2015-02-02 21:00:00 Mediacore Hardware 9
2 2015-02-03 14:00:00 Initech Software 13
3 2015-02-04 15:30:00 Streeplex Software 13
4 2015-02-04 22:00:00 Acme Coporation Hardware 14
pandas Foundations
String methods
In [4]: sales['Company'].str.upper()
Out[4]:
0 HOOLI
1 MEDIACORE
2 INITECH
3 STREEPLEX
4 ACME COPORATION
5 ACME COPORATION
6 HOOLI
7 ACME COPORATION
8 STREEPLEX
9 MEDIACORE
10 INITECH
11 HOOLI
12 HOOLI
13 MEDIACORE
14 MEDIACORE
15 MEDIACORE
pandas Foundations
Substring matching
In [5]: sales['Product'].str.contains('ware')
Out[5]:
0 True
1 True
2 True
3 True
4 True
5 True
6 False
7 True
8 False
9 True
10 True
11 True
12 True
13 True
14 False
pandas Foundations
Boolean arithmetic
In [6]: True + False
Out[6]: 1
In [7]: True + True

Out[7]: 2
In [8]: False + False

Out[8]: 0
pandas Foundations
Boolean reduction
In [9]: sales['Product'].str.contains('ware').sum()
Out[9]: 14
pandas Foundations
Datetime methods
In [9]: sales['Date'].dt.hour
Out[9]:
0 8
1 21
2 14
3 15
4 22
5 2
6 22
7 23
8 9
9 13
10 20
11 23
12 12
13 11
14 16
pandas Foundations
Set timezone
In [10]: central = sales['Date'].dt.tz_localize('US/Central')
In [11]: central
Out[11]:
0 2015-02-02 08:30:00-06:00
1 2015-02-02 21:00:00-06:00
2 2015-02-03 14:00:00-06:00
3 2015-02-04 15:30:00-06:00
4 2015-02-04 22:00:00-06:00
5 2015-02-05 02:00:00-06:00
6 2015-02-05 22:00:00-06:00
7 2015-02-07 23:00:00-06:00
8 2015-02-09 09:00:00-06:00
9 2015-02-09 13:00:00-06:00
10 2015-02-11 20:00:00-06:00
11 2015-02-11 23:00:00-06:00
12 2015-02-16 12:00:00-06:00
Name: Date, dtype: datetime64[ns, US/Central]

pandas Foundations
Convert timezone
In [12]: central.dt.tz_convert('US/Eastern')
Out[12]:
0 2015-02-02 09:30:00-05:00
1 2015-02-02 22:00:00-05:00
2 2015-02-03 15:00:00-05:00
3 2015-02-04 16:30:00-05:00
4 2015-02-04 23:00:00-05:00
5 2015-02-05 03:00:00-05:00
6 2015-02-05 23:00:00-05:00
7 2015-02-08 00:00:00-05:00
8 2015-02-09 10:00:00-05:00
9 2015-02-09 14:00:00-05:00
10 2015-02-11 21:00:00-05:00
11 2015-02-12 00:00:00-05:00
12 2015-02-16 13:00:00-05:00
13 2015-02-19 12:00:00-05:00
14 2015-02-19 17:00:00-05:00
Name: Date, dtype: datetime64[ns, US/Eastern]

pandas Foundations
Method chaining
In [13]: sales['Date'].dt.tz_localize('US/Central').
...: dt.tz_convert('US/Eastern')
Out[13]:
0 2015-02-02 09:30:00-05:00
1 2015-02-02 22:00:00-05:00
2 2015-02-03 15:00:00-05:00
3 2015-02-04 16:30:00-05:00
4 2015-02-04 23:00:00-05:00
5 2015-02-05 03:00:00-05:00
6 2015-02-05 23:00:00-05:00
7 2015-02-08 00:00:00-05:00
8 2015-02-09 10:00:00-05:00
9 2015-02-09 14:00:00-05:00
10 2015-02-11 21:00:00-05:00
11 2015-02-12 00:00:00-05:00
12 2015-02-16 13:00:00-05:00
13 2015-02-19 12:00:00-05:00
14 2015-02-19 17:00:00-05:00
Name: Date, dtype: datetime64[ns, US/Eastern]

pandas Foundations
World Population
In [14]: population = pd.read_csv('world_population.csv',
In [15]: population
Out[15]:
Population
Date
1960-12-31 2.087485e+10
1970-12-31 2.536513e+10
1980-12-31 3.057186e+10
1990-12-31 3.644928e+10
2000-12-31 4.228550e+10
2010-12-31 4.802217e+10
pandas Foundations
Upsample population
In [16]: population.resample('A').first()
Out[16]:
Population
Date
1960-12-31 2.087485e+10
1961-12-31 NaN
1962-12-31 NaN
1963-12-31 NaN
1964-12-31 NaN
1965-12-31 NaN
1966-12-31 NaN
1967-12-31 NaN
1968-12-31 NaN
1969-12-31 NaN
1970-12-31 2.536513e+10
1971-12-31 NaN
1972-12-31 NaN
pandas Foundations
Interpolate missing data

In [17]: population.resample('A').first().interpolate('linear')
Out[17]:
Population
Date
1960-12-31 2.087485e+10
1961-12-31 2.132388e+10
1962-12-31 2.177290e+10
1963-12-31 2.222193e+10
1964-12-31 2.267096e+10
1965-12-31 2.311999e+10
1966-12-31 2.356902e+10
1967-12-31 2.401805e+10
1968-12-31 2.446707e+10
1969-12-31 2.491610e+10
1970-12-31 2.536513e+10
1971-12-31 2.588580e+10
1972-12-31 2.640648e+10
PANDAS FOUNDATIONS
Lets practice!
PANDAS FOUNDATIONS
Time series
visualization
pandas Foundations
Topics
Line types
Plot types
Subplots
pandas Foundations
S&P 500 Data

In [2]: import matplotlib.pyplot as plt
In [3]: sp500 = pd.read_csv('sp500.csv', parse_dates=True,

...: index_col= 'Date')
In [4]: sp500.head()
Out[4]:
Open High Low Close Volume Adj Close
Date
2010-01-04 1116.560059 1133.869995 1116.560059 1132.989990 3991400000 1132.989990
2010-01-05 1132.660034 1136.630005 1129.660034 1136.520020 2491020000 1136.520020
2010-01-06 1135.709961 1139.189941 1133.949951 1137.140015 4972660000 1137.140015
2010-01-07 1136.270020 1142.459961 1131.319946 1141.689941 5270680000 1141.689941
2010-01-08 1140.520020 1145.390015 1136.219971 1144.979980 4389590000 1144.979980
pandas Foundations
Pandas plot
In [5]: sp500['Close'].plot()
In [6]: plt.show()
pandas Foundations
Default plot
pandas Foundations
Labels and title

In [7]: sp500['Close'].plot(title='S&P 500')
In [8]: plt.ylabel('Closing Price (US Dollars)')
In [9]: plt.show()
pandas Foundations
Labels and title

pandas Foundations
One week
In [10]: sp500.loc['2012-4-1':'2012-4-7', 'Close'].plot(title='S&P
...: 500')
In [12]: plt.show()
pandas Foundations
One week
pandas Foundations
Plot styles
In [13]: sp500.loc['2012-4', 'Close'].plot(style='k.-',
...: title='S&P500')
In [15]: plt.show()
pandas Foundations
One week
pandas Foundations
More plot styles

Style format string
color (k: black)
marker (. : dot)
line type (-: solid)
pandas Foundations
More plot styles

Color Marker Line
b: blue o: circle : do"ed
g: green *: star : dashed
r: red s: square
c: cyan +: plus
pandas Foundations
Area plot
In [16]: sp500['Close'].plot(kind='area', title='S&P 500')
In [18]: plt.show()
pandas Foundations
Area plot
pandas Foundations
Multiple columns
In [19]: sp500.loc['2012', ['Close','Volume']].plot(title='S&P
...: 500')
In [20]: plt.show()
pandas Foundations
Multiple columns
pandas Foundations
Subplots
In [21]: sp500.loc['2012', ['Close','Volume']].plot(subplots=True)
In [22]: plt.show()
pandas Foundations
Subplots
PANDAS FOUNDATIONS
Lets practice!

Ch3 Slides

Hochgeladen von

Dokumentinformationen

Copyright

Verfügbare Formate

Dieses Dokument teilen

Dokument teilen oder einbetten

Freigabeoptionen

Stufen Sie dieses Dokument als nützlich ein?

Sind diese Inhalte unangemessen?

Copyright:

Verfügbare Formate

Ch3 Slides

Hochgeladen von

Copyright:

Verfügbare Formate

PANDAS FOUNDATIONS

Using pandas to read datetime objects

Product sales CSV

In [2]: sales = pd.read_csv('sales-feb-2015.csv',

Selecting single datetime

Selecting whole day

Partial datetime string selection

Selecting whole month

Slicing using dates/times

Convert strings to datetime

Filling missing values

In [13]: sales.reindex(evening_2_11, method='bfill')

In [2]: sales = pd.read_csv('sales-feb-2015.csv',

In [7]: print(sales.loc['2015-2-2', 'Units'])

In [8]: sales.loc['2015-2-2', 'Units'].mean()

Upsampling and filling

In [2]: sales = pd.read_csv('sales-feb-2015.csv',

In [7]: True + True

In [8]: False + False

Name: Date, dtype: datetime64[ns, US/Central]

Name: Date, dtype: datetime64[ns, US/Eastern]

Name: Date, dtype: datetime64[ns, US/Eastern]

Interpolate missing data

S&P 500 Data

In [2]: import matplotlib.pyplot as plt

In [3]: sp500 = pd.read_csv('sp500.csv', parse_dates=True,

Labels and title

In [8]: plt.ylabel('Closing Price (US Dollars)')

Labels and title

In [11]: plt.ylabel('Closing Price (US Dollars)')

In [14]: plt.ylabel('Closing Price (US Dollars)')

More plot styles

More plot styles

b: blue o: circle : do"ed

g: green *: star : dashed

In [17]: plt.ylabel('Closing Price (US Dollars)')

Das könnte Ihnen auch gefallen