Programming for Data Science

Time Series

Dr. Bhargavi R

SCOPE, VIT Chennai

  • Pandas provides a rich set of tools for working with dates and times.
  • Time stamp - references a particular instant in time (e.g., September 2nd, 2020, 8:00am).
  • Time interval - references a length of time between a particular beginning and end point.
  • Period - references a special case of time interval in which each interval is of uniform length and the intervals do not overlap (e.g., consecutive calendar months).
  • Time delta - references an exact length of time.

Periods can be used to check whether a specific event occurs within a certain span of time. In short, a Period represents an interval, while a Timestamp represents a single point in time.
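To make the four terms concrete, here is a minimal sketch that builds one object of each kind and checks whether an instant falls inside a span. The variable names and the chosen dates are illustrative assumptions, not part of the original notes.

```python
import pandas as pd

# A Timestamp is a single instant.
ts = pd.Timestamp("2020-09-02 08:00")

# A Period is a whole span of uniform length (here, the month of September 2020).
month = pd.Period("2020-09", freq="M")

# A Timedelta is an exact duration.
delta = pd.Timedelta(days=1, hours=2)

# An Interval is an arbitrary span between two end points (left end included).
window = pd.Interval(pd.Timestamp("2020-09-01"), pd.Timestamp("2020-09-03"), closed="left")

# Does the instant fall inside the period or the interval?
in_month = month.start_time <= ts <= month.end_time
in_window = ts in window

print(in_month, in_window)  # True True
print(ts + delta)           # 2020-09-03 10:00:00
```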

In [1]:
import numpy as np
import pandas as pd
from datetime import datetime
from dateutil import parser
In [2]:
# get today's date and time
print(datetime.today())
2023-06-20 21:23:15.735922
In [3]:
# Create a datetime object
print(datetime(day=21, month = 9, year= 2020))
2020-09-21 00:00:00
In [4]:
# Set year, month, day and time
datetime(year = 2020, month = 8, day = 4, hour = 14, minute = 35)
Out[4]:
datetime.datetime(2020, 8, 4, 14, 35)
In [5]:
date = parser.parse(" 2020, Sept 2nd 8:30:45")
date
Out[5]:
datetime.datetime(2020, 9, 2, 8, 30, 45)
In [6]:
print(date.strftime('%A'))
print(date.strftime('%a'))
print(date.strftime('%m'))
print(date.strftime('%B'))
print(date.strftime("%H:%M:%S"))
Wednesday
Wed
09
September
08:30:45
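The format codes used with strftime also work in the other direction. The sketch below, using only the standard library, formats a datetime to text and parses it back with datetime.strptime; the format string is an arbitrary choice for illustration.

```python
from datetime import datetime

stamp = datetime(2020, 9, 2, 8, 30, 45)
fmt = "%Y-%m-%d %H:%M:%S"  # illustrative format string

# datetime -> text
text = stamp.strftime(fmt)

# text -> datetime, using the same format codes
parsed = datetime.strptime(text, fmt)

print(text)             # 2020-09-02 08:30:45
print(parsed == stamp)  # True
```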
In [7]:
# Numpy - Timeseries - datetime64
start_date = np.array('2020-07-04', dtype=np.datetime64)
print(start_date)
2020-07-04
In [8]:
dateseries = start_date + np.arange(10)
dateseries
Out[8]:
array(['2020-07-04', '2020-07-05', '2020-07-06', '2020-07-07',
       '2020-07-08', '2020-07-09', '2020-07-10', '2020-07-11',
       '2020-07-12', '2020-07-13'], dtype='datetime64[D]')
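The `[D]` in the dtype above is the unit of the datetime64 values. As a small sketch (the dates are illustrative), NumPy infers the unit from the string it is given, and subtracting two datetime64 values yields a timedelta64:

```python
import numpy as np

start = np.datetime64("2020-07-04")
end = np.datetime64("2020-07-13")

# Subtracting two day-resolution values gives a timedelta64 measured in days.
gap = end - start

# Including hours and minutes in the string gives minute resolution.
with_time = np.datetime64("2020-07-04T08:30")

print(gap)              # 9 days
print(with_time.dtype)  # datetime64[m]
```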

Dates and Times in Pandas

In [9]:
# A pandas Timestamp refers to a specific instant in time, with nanosecond precision
pd.Timestamp(year=2020, month=9, day=2, hour=8, minute=30, second=20, microsecond=100, nanosecond=99)
Out[9]:
Timestamp('2020-09-02 08:30:20.000100099')
In [10]:
pd.Timestamp('2020-9-21')
Out[10]:
Timestamp('2020-09-21 00:00:00')
In [11]:
pd.Timestamp(2020, 9, 18, 12)
Out[11]:
Timestamp('2020-09-18 12:00:00')
In [12]:
pd.Timestamp('2020/9-13')
Out[12]:
Timestamp('2020-09-13 00:00:00')
In [13]:
pd.Timestamp('June 9, 2020 13:45')
Out[13]:
Timestamp('2020-06-09 13:45:00')
In [14]:
# Pandas can parse dates and times from various sources and formats
dates = pd.to_datetime([datetime(2019, 1,26), np.datetime64('2018-01-01'), '10th of Feb 2019', '2019 April 5', 'January 15 2020',
                       'Aug. 31, 2020', 'SEPT, 2 2020'])
dates
Out[14]:
DatetimeIndex(['2019-01-26', '2018-01-01', '2019-02-10', '2019-04-05',
               '2020-01-15', '2020-08-31', '2020-09-02'],
              dtype='datetime64[ns]', freq=None)
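When the input format is known, or the input may contain bad values, to_datetime can be steered explicitly. The following is a minimal sketch using the `format` and `errors` parameters; the sample strings are made up for illustration.

```python
import pandas as pd

# An explicit format removes ambiguity: here the day comes first.
day_first = pd.to_datetime("02/09/2020", format="%d/%m/%Y")

# errors="coerce" turns unparseable entries into NaT instead of raising.
mixed = pd.to_datetime(["2020-09-02", "not a date"], errors="coerce")

print(day_first)  # 2020-09-02 00:00:00
print(mixed)
```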
In [15]:
# Addition operation 
start = np.datetime64('2000-01-15')
print(start)
dates = pd.to_datetime(start + np.arange(20))
dates
2000-01-15
Out[15]:
DatetimeIndex(['2000-01-15', '2000-01-16', '2000-01-17', '2000-01-18',
               '2000-01-19', '2000-01-20', '2000-01-21', '2000-01-22',
               '2000-01-23', '2000-01-24', '2000-01-25', '2000-01-26',
               '2000-01-27', '2000-01-28', '2000-01-29', '2000-01-30',
               '2000-01-31', '2000-02-01', '2000-02-02', '2000-02-03'],
              dtype='datetime64[ns]', freq=None)
  • Timedeltas are differences between times.
  • They are expressed in different units, e.g. days, hours, minutes, seconds.
  • They can be both positive and negative.
In [16]:
pd.Timedelta('1 days')
Out[16]:
Timedelta('1 days 00:00:00')
In [17]:
pd.Timedelta('1 days 2 hours')
Out[17]:
Timedelta('1 days 02:00:00')
In [18]:
pd.Timedelta('1 days 00:00:00')
Out[18]:
Timedelta('1 days 00:00:00')
In [19]:
pd.Timedelta(1, unit='d')
Out[19]:
Timedelta('1 days 00:00:00')
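A Timedelta also arises naturally as the difference between two Timestamps, and it is negative when the order is reversed. A short sketch with illustrative dates:

```python
import pandas as pd

earlier = pd.Timestamp("2020-09-01 06:00")
later = pd.Timestamp("2020-09-02 08:30")

forward = later - earlier    # positive duration
backward = earlier - later   # negative duration

print(forward)                  # 1 days 02:30:00
print(backward)                 # -2 days +21:30:00
print(forward.total_seconds())  # 95400.0

# The same duration built from keyword arguments
print(forward == pd.Timedelta(days=1, hours=2, minutes=30))  # True
```

Note how the negative value is displayed: `-2 days +21:30:00` means two days back and then 21.5 hours forward, which is the same as minus 1 day 2.5 hours.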
In [20]:
start = np.datetime64('2000-01-15')
dates = start + pd.to_timedelta(np.arange(20), unit = 'd')
dates
Out[20]:
DatetimeIndex(['2000-01-15', '2000-01-16', '2000-01-17', '2000-01-18',
               '2000-01-19', '2000-01-20', '2000-01-21', '2000-01-22',
               '2000-01-23', '2000-01-24', '2000-01-25', '2000-01-26',
               '2000-01-27', '2000-01-28', '2000-01-29', '2000-01-30',
               '2000-01-31', '2000-02-01', '2000-02-02', '2000-02-03'],
              dtype='datetime64[ns]', freq=None)
In [21]:
pd.to_timedelta(np.arange(20), unit = 'w')
Out[21]:
TimedeltaIndex([  '0 days',   '7 days',  '14 days',  '21 days',  '28 days',
                 '35 days',  '42 days',  '49 days',  '56 days',  '63 days',
                 '70 days',  '77 days',  '84 days',  '91 days',  '98 days',
                '105 days', '112 days', '119 days', '126 days', '133 days'],
               dtype='timedelta64[ns]', freq=None)
In [22]:
start = np.datetime64('2000-01-15')
dates = start + pd.to_timedelta(np.arange(20), unit = 'w')
dates
Out[22]:
DatetimeIndex(['2000-01-15', '2000-01-22', '2000-01-29', '2000-02-05',
               '2000-02-12', '2000-02-19', '2000-02-26', '2000-03-04',
               '2000-03-11', '2000-03-18', '2000-03-25', '2000-04-01',
               '2000-04-08', '2000-04-15', '2000-04-22', '2000-04-29',
               '2000-05-06', '2000-05-13', '2000-05-20', '2000-05-27'],
              dtype='datetime64[ns]', freq=None)

Timedelta limitations

  • Pandas represents Timedeltas at nanosecond resolution using 64-bit integers.
  • The limits of a 64-bit integer therefore determine the Timedelta limits.
In [23]:
print(pd.Timedelta.min)
print(pd.Timedelta.max)
-106752 days +00:12:43.145224193
106751 days 23:47:16.854775807
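The link between these limits and the 64-bit integer range can be checked directly. This sketch assumes the default nanosecond resolution and uses the `value` attribute, which holds the duration as an integer count of nanoseconds. The smallest int64 value itself is reserved for NaT, which is why the minimum is one nanosecond short of it.

```python
import numpy as np
import pandas as pd

int64_max = np.iinfo(np.int64).max

# The maximum Timedelta is exactly the largest int64, counted in nanoseconds.
print(pd.Timedelta.max.value == int64_max)   # True

# The minimum is the negated maximum, not the smallest int64.
print(pd.Timedelta.min.value == -int64_max)  # True

# Rough span in years in each direction (about 292)
years = pd.Timedelta.max.days / 365.25
print(round(years))
```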
In [24]:
ts = pd.Series(pd.date_range('2020-1-1', periods=5, freq='D'))  # Creates a series
ts
Out[24]:
0   2020-01-01
1   2020-01-02
2   2020-01-03
3   2020-01-04
4   2020-01-05
dtype: datetime64[ns]
In [25]:
td = pd.Series([pd.Timedelta(days=i) for i in range(5)])
td
Out[25]:
0   0 days
1   1 days
2   2 days
3   3 days
4   4 days
dtype: timedelta64[ns]
In [26]:
df = pd.DataFrame({'A': ts, 'B': td})
df
Out[26]:
A B
0 2020-01-01 0 days
1 2020-01-02 1 days
2 2020-01-03 2 days
3 2020-01-04 3 days
4 2020-01-05 4 days
In [27]:
df['C'] = df['A'] + df['B']
df
Out[27]:
A B C
0 2020-01-01 0 days 2020-01-01
1 2020-01-02 1 days 2020-01-03
2 2020-01-03 2 days 2020-01-05
3 2020-01-04 3 days 2020-01-07
4 2020-01-05 4 days 2020-01-09
In [28]:
print(df.min(axis = 0))
print(df.max(axis = 0))
A    2020-01-01 00:00:00
B        0 days 00:00:00
C    2020-01-01 00:00:00
dtype: object
A    2020-01-05 00:00:00
B        4 days 00:00:00
C    2020-01-09 00:00:00
dtype: object

TimedeltaIndex

  • To generate an index of time deltas, you can use either the TimedeltaIndex constructor or the timedelta_range() function.
In [29]:
from datetime import timedelta  # import only timedelta so the datetime class imported earlier is not shadowed
pd.TimedeltaIndex(['1 days', '1 days, 00:00:05', np.timedelta64(2, 'D'),
                   timedelta(days=2, seconds=2)])
Out[29]:
TimedeltaIndex(['1 days 00:00:00', '1 days 00:00:05', '2 days 00:00:00',
                '2 days 00:00:02'],
               dtype='timedelta64[ns]', freq=None)
In [30]:
# Generate sequences of fixed-frequency dates and time spans
dates = pd.date_range('2000-01-15', periods = 10, freq = 'W' )
dates
Out[30]:
DatetimeIndex(['2000-01-16', '2000-01-23', '2000-01-30', '2000-02-06',
               '2000-02-13', '2000-02-20', '2000-02-27', '2000-03-05',
               '2000-03-12', '2000-03-19'],
              dtype='datetime64[ns]', freq='W-SUN')
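The result above starts on 2000-01-16 rather than the requested 2000-01-15 because `'W'` is shorthand for `'W-SUN'`: every generated date must be a Sunday, and 2000-01-15 is a Saturday. The sketch below compares this with two alternatives that keep the start date; the variable names are illustrative.

```python
import pandas as pd

start = "2000-01-15"  # a Saturday

weekly_sun = pd.date_range(start, periods=3, freq="W")      # same as W-SUN
weekly_sat = pd.date_range(start, periods=3, freq="W-SAT")  # anchored on Saturday
every_7d = pd.date_range(start, periods=3, freq="7D")       # plain 7-day step, no anchor

print(weekly_sun[0].date())  # 2000-01-16
print(weekly_sat[0].date())  # 2000-01-15
print(every_7d[0].date())    # 2000-01-15
```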
In [31]:
s = pd.Series(np.arange(100), index=pd.timedelta_range('1 days', periods=100, freq='h'))
s
Out[31]:
1 days 00:00:00     0
1 days 01:00:00     1
1 days 02:00:00     2
1 days 03:00:00     3
1 days 04:00:00     4
                   ..
4 days 23:00:00    95
5 days 00:00:00    96
5 days 01:00:00    97
5 days 02:00:00    98
5 days 03:00:00    99
Freq: H, Length: 100, dtype: int64
In [34]:
# dates = pd.date_range(datetime(2000, 3, 15), periods = 20, freq = 'd')
# print(dates)
In [36]:
# days = np.array([pd.Timedelta(days = 10) for i in range(20)])
# print(days)

# new_dates = dates + days
# new_dates
In [37]:
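# `dates` holds the 10 weekly dates from In [30]; `days` holds 20 offsets of 10 days each
days = [pd.Timedelta(days=10) for _ in range(20)]
s1 = pd.Series(dates)
s2 = pd.Series(days)
# Addition aligns on the index, so labels 10-19 (absent from s1) become NaT
s = s1 + s2
s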
Out[37]:
0    2000-01-26
1    2000-02-02
2    2000-02-09
3    2000-02-16
4    2000-02-23
5    2000-03-01
6    2000-03-08
7    2000-03-15
8    2000-03-22
9    2000-03-29
10          NaT
11          NaT
12          NaT
13          NaT
14          NaT
15          NaT
16          NaT
17          NaT
18          NaT
19          NaT
dtype: datetime64[ns]
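The NaT rows appear because Series arithmetic aligns on index labels, and the two Series have different lengths. A smaller sketch of the same effect, with made-up sizes, also shows that a scalar Timedelta is broadcast to every row and so avoids the mismatch:

```python
import pandas as pd

dates = pd.Series(pd.date_range("2000-01-16", periods=3, freq="W"))
offsets = pd.Series([pd.Timedelta(days=10)] * 5)

# Index labels 3 and 4 exist only in `offsets`, so the sum is NaT there.
misaligned = dates + offsets

# A scalar Timedelta is applied to every row instead.
shifted = dates + pd.Timedelta(days=10)

print(misaligned)
print(shifted)
```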
In [44]:
# import random
# data = np.array([random.randint(1, 20) for i in range(20)])
# df = pd.Series(data, index = dates)
# print(df)
# print(df.index.day)
# print(df.index.month)
In [43]:
# df['2000-03-19' : '2000-04-02']
In [40]:
day = pd.Timestamp('2020-09-02')
print(day.day)
print(day.day_name())
print((day + pd.Timedelta('2 days')).day_name())
day = day + pd.Timedelta('2 days')
print((day + pd.offsets.BDay()).day_name())
2
Wednesday
Friday
Monday
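The business-day offset skips the weekend, which is why Friday plus one BDay gives Monday. As a related sketch, pd.bdate_range generates a run of business days directly; the start date is the same Friday reached above.

```python
import pandas as pd

friday = pd.Timestamp("2020-09-04")

# Three consecutive business days starting on a Friday
business_days = pd.bdate_range(friday, periods=3)
names = business_days.day_name().tolist()

print(names)  # ['Friday', 'Monday', 'Tuesday']
```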
In [41]:
p = pd.Period('2020-01-01', freq = 'M')
test = pd.Timestamp('2020-01-01 22:11')
test1 = pd.Timestamp('2020-01-10')
test2 = pd.Timestamp('2020-02-10')
print(p.start_time < test < p.end_time)
print(p.start_time < test1 < p.end_time)
print(p.start_time < test2 < p.end_time)
print(p.start_time)
print(p.end_time)
True
True
False
2020-01-01 00:00:00
2020-01-31 23:59:59.999999999
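One caveat with the strict `<` comparisons above: a Timestamp exactly equal to `p.start_time` would be reported as outside the period. The sketch below shows two alternatives. `in_period` is a hypothetical helper written for this example, not a pandas function, and the second approach converts the Timestamp to a Period of the same frequency and compares.

```python
import pandas as pd

def in_period(ts, period):
    """Hypothetical helper: True if ts lies in period, end points included."""
    return period.start_time <= ts <= period.end_time

p = pd.Period("2020-01", freq="M")

print(in_period(pd.Timestamp("2020-01-01"), p))  # True, even at the exact start
print(in_period(pd.Timestamp("2020-02-10"), p))  # False

# Alternative: convert the Timestamp to its containing month and compare.
print(pd.Timestamp("2020-01-10").to_period("M") == p)  # True
```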
In [42]:
pdf = pd.Series(pd.period_range('1/1/2011', freq='M', periods=3))
print(pdf)
0    2011-01
1    2011-02
2    2011-03
dtype: period[M]
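Periods support simple arithmetic and can be converted back to Timestamps. A brief sketch on the same monthly range, with illustrative variable names:

```python
import pandas as pd

months = pd.period_range("2011-01", periods=3, freq="M")

# Adding an integer shifts each period by that many months.
next_months = months + 1

# Convert each period to the Timestamp at its start.
starts = months.to_timestamp(how="start")

print(next_months)
print(starts)
```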