Resampling OHLC Data with Pandas
Introduction
When working with time series data in pandas, it’s common to need to resample the data at specific intervals. In this article, we’ll explore how to resample an OHLC (Open, High, Low, Close) dataframe with pandas and handle edge cases where there isn’t enough data for a full resampling interval.
Prerequisites
- Python 3.x
- pandas 1.x
- numpy 1.x
Installing Required Libraries
To install the required libraries, run the following command in your terminal:
pip install pandas numpy
Sample Data
Let’s start with some sample data. We’ll create a dataframe of OHLC and volume data with timestamps as the index.
import pandas as pd
import numpy as np
# Create an hourly date range covering 2024-07-17 and 2024-07-18
# (without freq='h', date_range defaults to daily steps and yields only two rows)
date_range = pd.date_range('2024-07-17 00:00', '2024-07-18 23:00', freq='h')
# Generate random OHLC data (seeded so the values are reproducible)
np.random.seed(0)
data = {
    'open': np.random.uniform(100, 150, len(date_range)),
    'high': np.random.uniform(120, 160, len(date_range)),
    'low': np.random.uniform(90, 130, len(date_range)),
    'close': np.random.uniform(110, 150, len(date_range)),
    'volume': np.random.randint(10000, 500000, len(date_range))
}
# Build the dataframe with the timestamps as its DatetimeIndex
df = pd.DataFrame(data, index=date_range)
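Because each column is drawn independently, a randomly generated row can have a high below its open or a low above its close. The following is a minimal sketch of one way to make such rows internally consistent before resampling. The helper name `make_consistent` and the small hand-written frame are hypothetical and exist only for illustration.

```python
import pandas as pd

def make_consistent(ohlc: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: widen high/low so they bracket every price in the row."""
    out = ohlc.copy()
    prices = ohlc[['open', 'high', 'low', 'close']]
    out['high'] = prices.max(axis=1)  # high becomes the largest price in the row
    out['low'] = prices.min(axis=1)   # low becomes the smallest price in the row
    return out

idx = pd.date_range('2024-07-17', periods=3, freq='h')
raw = pd.DataFrame({
    'open':  [140.0, 105.0, 120.0],
    'high':  [125.0, 150.0, 130.0],  # first row: high is below open
    'low':   [95.0, 128.0, 100.0],   # second row: low is above open
    'close': [110.0, 115.0, 125.0],
}, index=idx)

fixed = make_consistent(raw)
print(fixed)
```

The same helper could be applied to the random sample as `df = make_consistent(df)`; the volume column is left untouched.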
Resampling OHLC Data
Now that we have our sample data, let’s resample the OHLC dataframe at a 3-hour interval.
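# Resample the dataframe at a 3-hour interval.
# The timestamps are already the index, so no `on=` argument is needed;
# lowercase 'h' avoids the deprecation warning that 'H' raises on pandas 2.2+.
resampled_df = df.resample('3h').agg({
    'open': 'first',   # first price in the interval
    'high': 'max',     # highest price in the interval
    'low': 'min',      # lowest price in the interval
    'close': 'last',   # last price in the interval
    'volume': 'sum'    # total volume traded in the interval
})
The result has one row per 3-hour interval, with columns in the order open, high, low, close, volume. The figures below are illustrative; the randomly generated sample data will produce different values: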
2024-07-17 07:00:00 1025.6 1025.6 1025.6 1025.6 5343
2024-07-17 10:00:00 3075.2 3103.2 3062.6 3086.2 1311047
2024-07-17 13:00:00 3096.8 3109.4 3080.8 3094.4 460161
2024-07-17 16:00:00 3127.6 3147.0 3107.2 3129.6 1294609
2024-07-17 19:00:00 3112.0 3124.8 3100.8 3115.2 260935
2024-07-18 07:00:00 1036.2 1036.2 1036.2 1036.2 1403
2024-07-18 10:00:00 3097.2 3115.6 3076.2 3098.2 1097643
2024-07-18 13:00:00 3123.4 3136.8 3108.0 3129.2 782193
2024-07-18 16:00:00 3153.0 3182.2 3146.8 3175.6 1149318
2024-07-18 19:00:00 3202.4 3209.0 3190.2 3201.6 362198
As you can see, the last interval starts at 19:00 and the rows for 22:00 and 23:00 do not appear in the resampled dataframe.
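To check which interval each row lands in, it can help to count the rows per bin. The sketch below assumes hourly rows from 09:00 to 23:00 on a single day and uses `offset='1h'` to shift the bin edges so the labels read 07:00, 10:00, and so on. Both the data layout and the offset are assumptions made for illustration, not details taken from the dataset behind the output above.

```python
import pandas as pd

# Assumed layout: one row per hour from 09:00 to 23:00 on a single day.
idx = pd.date_range('2024-07-17 09:00', '2024-07-17 23:00', freq='h')
hourly = pd.DataFrame({'close': range(len(idx))}, index=idx)

# Default anchoring: bin edges are aligned to midnight (00:00, 03:00, ...).
default_counts = hourly['close'].resample('3h').count()

# offset='1h' shifts the bin edges to 01:00, 04:00, 07:00, ...
shifted_counts = hourly['close'].resample('3h', offset='1h').count()

print(default_counts)
print(shifted_counts)
```

With the shifted edges, the first bin (07:00) holds only the 09:00 row and the last bin (22:00) holds the 22:00 and 23:00 rows, so the counts make partially filled intervals easy to spot.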
Handling Edge Cases
Now that we’ve seen how to resample OHLC data at a specific interval, let’s handle edge cases where there isn’t enough data for a full resampling interval.
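One possible way to deal with intervals that lack a full set of rows is to separate complete bins from partial ones after aggregating. This is a minimal sketch under stated assumptions: the frame `bars`, its deterministic prices, and the threshold `expected_rows` are all made up for illustration.

```python
import pandas as pd

# Assumed data: hourly rows from 09:00 to 22:00 (14 rows), so the last
# 3-hour bin (21:00) receives only two rows.
idx = pd.date_range('2024-07-17 09:00', '2024-07-17 22:00', freq='h')
base = pd.Series(range(len(idx)), index=idx, dtype='float64') + 100.0
bars = pd.DataFrame({
    'open': base,
    'high': base + 1.0,
    'low': base - 1.0,
    'close': base + 0.5,
    'volume': 10,
}, index=idx)

agg_rules = {
    'open': 'first',
    'high': 'max',
    'low': 'min',
    'close': 'last',
    'volume': 'sum',
}
ohlc = bars.resample('3h').agg(agg_rules)

# Count how many source rows fed each bin.
rows_per_bin = bars['close'].resample('3h').count()
expected_rows = 3  # a full 3-hour bin of hourly data

complete = ohlc[rows_per_bin == expected_rows]  # bins with all three rows
partial = ohlc[rows_per_bin < expected_rows]    # bins missing at least one row

print(complete)
print(partial)
```

Whether partial bins should be dropped, kept, or reported separately depends on how the resampled bars are used downstream.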
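# Resample the dataframe at a 2-hour interval, then reindex onto the
# timestamps from 2024-07-17 22:00 to 2024-07-18 23:00 (hourly steps are
# assumed here), padding each timestamp with the most recent resampled row
resampled_df_edge_case = df.resample('2h').agg({
    'open': 'first',
    'high': 'max',
    'low': 'min',
    'close': 'last',
    'volume': 'sum'
}).reindex(pd.date_range('2024-07-17 22:00', '2024-07-18 23:00', freq='h'), method='pad')
The result is indexed by the requested timestamps. The rows below are illustrative; the randomly generated sample data will produce different values: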
2024-07-17 22:00:00 1039.8 1043.0 1035.2 1037.0 47903
2024-07-18 22:00:00 1065.8 1070.0 1065.2 1069.4 68737
2024-07-18 23:00:00 1069.0 1069.6 1068.8 1069.0 2421
As you can see, the 22:00 and 23:00 timestamps are now included in the resampled dataframe.
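To make the effect of `reindex` with `method='pad'` concrete, here is a small sketch on a hand-written series of 2-hour closes. The values and the variable names are invented for illustration.

```python
import pandas as pd

# Two 2-hour bins, labelled 20:00 and 22:00.
bins = pd.date_range('2024-07-17 20:00', periods=2, freq='2h')
closes = pd.Series([101.0, 102.0], index=bins)

# Reindex onto hourly timestamps; 'pad' copies the most recent earlier value
# into each timestamp that has no row of its own.
hourly_index = pd.date_range('2024-07-17 20:00', '2024-07-17 23:00', freq='h')
padded = closes.reindex(hourly_index, method='pad')

print(padded)
```

The 21:00 and 23:00 entries repeat the values of the 20:00 and 22:00 bins, so padding fills gaps with earlier data rather than adding new information.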
Conclusion
Resampling OHLC data at a specific interval is a common task when working with time series data in pandas. By using the resample method and handling edge cases, we can make sure that every row of the source data is accounted for in the result.
In this article, we’ve covered how to resample OHLC data at a 3-hour interval and handle edge cases where there isn’t enough data for a full resampling interval. We’ve also seen how to use the reindex method to pad missing timestamps and fill gaps in our resampled dataframe.
Last modified on 2024-01-11