Easy way to run 2,370,886 rows of options data

Over 1,200,000 options contracts trade daily.
Analyzing that volume of data has long been something only professionals with sophisticated tools could do.
Not anymore.
Polars is a high-performance DataFrame library like pandas.
Only it can be up to 100x faster.
In today’s newsletter, you’ll use Polars to load 2,370,886 rows of historic options data (about 500MB). From there, you’ll build functions to filter the data and analyze the implied volatility structure.
Let’s go!
An easy way to run 2,370,886 rows of options data
Polars is a high-performance Python library designed for data analysis. Its Rust core makes it fast and memory efficient.
It's quickly becoming a favorite tool for analysts who work with big data sets.
Polars processes large datasets quickly because of its concurrency capabilities. It uses lazy evaluation, so users can build complex filters without immediate execution, as in the sketch below. The API is similar to pandas, which reduces the learning curve.
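Here's a minimal sketch of what a lazy query looks like (the file name and column names are placeholders, not part of today's dataset):

import polars as pl

# Build a lazy query: nothing is read or computed yet
lazy_query = (
    pl.scan_csv("options.csv")        # placeholder file
    .filter(pl.col("volume") > 100)   # placeholder column
    .group_by("symbol")
    .agg(pl.col("iv").mean())
)

# Polars optimizes the full plan and executes it in parallel on collect()
result = lazy_query.collect()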
Financial professionals are leveraging Polars to streamline data analysis. It's used in backtesting for evaluating past performance and in conducting risk analytics. High-frequency traders also benefit from its fast data processing at sub-second speeds.
You can follow along with today’s newsletter using data you can download here.
Ready to get started?
Imports and set up
Make sure to unzip the data in your working directory. Then import the libraries we need for the analysis.
import glob
import os
import datetime as dt
import polars as pl
import matplotlib.pyplot as plt
We use these libraries to handle file operations, work with dates, process data efficiently, and create plots.
Read and process option chain data
First, we define a function to read and process option chain data from CSV files.
def read_chains(fl):
    df = (
        pl.read_csv(fl)
        .with_columns([
            pl.col("date").str.to_datetime("%m/%d/%Y", strict=False),
            pl.col("option_expiration").str.to_datetime("%m/%d/%Y", strict=False),
            pl.col("forward_price")
            .cast(pl.Float64, strict=False)
            .fill_null(pl.lit(None).cast(pl.Float64))
        ])
        .sort("date")
    )
    return df
This function reads a CSV file and processes its contents with Polars. It parses the date columns into datetimes and casts the forward price to a float, leaving unparseable values as nulls. The function then sorts the data by date and returns the processed Polars DataFrame.
files = glob.glob(os.path.join("rut-eod", "*.csv"))
df = pl.concat([read_chains(fl) for fl in files], how="vertical_relaxed")
We locate all CSV files in the "rut-eod" directory and read them using our read_chains function. The resulting dataframes are combined into a single large dataframe containing all the option chain data. For 500MB of data, this takes about two seconds.
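As a quick sanity check, you can inspect the combined DataFrame (the exact output depends on your copy of the data):

# Inspect the combined DataFrame
print(df.shape)   # number of rows and columns
print(df.schema)  # column names and data types
print(df.select(pl.col("date").min(), pl.col("date").max()))  # date range covered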
Define helper functions for data analysis
We create two helper functions to extract specific data from our options chains.
def read_vol_curve(df, as_of_date, underlying, expiry, delta_low, delta_high):
    return (
        df
        .filter(
            (pl.col("date") == as_of_date)
            & (pl.col("symbol") == underlying)
            & (pl.col("option_expiration") == expiry)
            & (
                ((pl.col("delta") >= delta_low) & (pl.col("delta") <= delta_high))
                | ((pl.col("delta") >= -delta_high) & (pl.col("delta") <= -delta_low))
            )
        )
        .group_by("strike")
        .agg(pl.col("iv").mean().alias("iv_mean"))
        .sort("strike")
    )


def query_expirations(df, as_of_date, underlying, dte=30):
    cutoff_date = as_of_date + dt.timedelta(days=dte)

    return (
        df
        .filter(
            (pl.col("date") == as_of_date) &
            (pl.col("symbol") == underlying) &
            (pl.col("option_expiration") > cutoff_date)
        )
        .group_by("option_expiration")
        .agg(pl.col("volume").sum().alias("total_volume"))
        .sort("option_expiration")
        .get_column("option_expiration")
        .to_list()
    )
The read_vol_curve function filters the chain to a single expiration and a delta band, then averages implied volatility by strike to build a volatility curve. The query_expirations function returns all expiration dates more than dte days beyond the as-of date.
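As a quick check, you can call the helpers directly for a single curve (the dates and symbol below mirror the parameters used in the next step and assume they exist in your data):

# Pull the volatility curve for the first qualifying expiration
sample_expiries = query_expirations(df, dt.datetime(2013, 6, 3), "RUT", dte=30)
sample_curve = read_vol_curve(
    df, dt.datetime(2013, 6, 3), "RUT", sample_expiries[0], 0.05, 0.50
)
print(sample_curve.head())  # strike and mean implied volatility per strike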
Set parameters and create volatility smile plot
Set the parameters to filter the options data for a specific as-of date, underlying, days to expiration, and delta range.
as_of_date = dt.datetime(2013, 6, 3)
expiry = dt.datetime(2015, 12, 18)
underlying = "RUT"
dte = 30
delta_low = 0.05
delta_high = 0.50

expiries = query_expirations(df, as_of_date, underlying, dte)
fig, ax = plt.subplots(figsize=(10, 6))
cmap = plt.get_cmap("rainbow", len(expiries))
format_kw = {"linewidth": 0.5, "alpha": 0.85}

for i, expiry in enumerate(expiries):
    curve = read_vol_curve(
        df,
        as_of_date,
        underlying,
        expiry,
        delta_low,
        delta_high
    )
    ax.plot(
        curve["strike"],
        curve["iv_mean"],
        label=expiry.strftime("%Y-%m-%d"),
        color=cmap(i),
        **format_kw
    )
ax.set_ylabel("implied volatility")
ax.legend(loc="upper right", framealpha=0.7)
We set our analysis parameters, including the date, underlying asset, and delta range. We then query for relevant expiration dates and create a plot. For each expiration, we calculate and plot the volatility curve. This creates a visual representation of the volatility smile across different expiration dates.
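If you run this as a standalone script rather than in a notebook, you may want to label the x-axis and display the figure explicitly (an optional addition, not part of the original snippet):

# Optional finishing touches when running outside a notebook
ax.set_xlabel("strike")
ax.set_title(f"{underlying} volatility smiles as of {as_of_date:%Y-%m-%d}")
plt.tight_layout()
plt.show()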
The result is a chart of implied volatility smiles, one curve per expiration date.
The chart depicts implied volatility across different expiration dates, indicating volatility term structures shifting downward and flattening as expiration approaches. This suggests market expectations of reduced uncertainty and declining volatility with time. Traders might interpret this as a normalization of market conditions or decreased perceived risk over the given period.
Your next steps
The best way to get familiar with Polars is to inspect the data and add filters to the function that reads the data. For example, instead of filtering for a range of delta, you can filter for other Greeks.
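As a sketch, a variant that filters on vega instead of delta might look like this (it assumes the data includes a vega column, which you should verify against the file's schema):

# Hypothetical variation: build the curve from options with enough vega exposure
def read_vol_curve_by_vega(df, as_of_date, underlying, expiry, vega_min):
    return (
        df
        .filter(
            (pl.col("date") == as_of_date)
            & (pl.col("symbol") == underlying)
            & (pl.col("option_expiration") == expiry)
            & (pl.col("vega") >= vega_min)  # assumes a "vega" column exists
        )
        .group_by("strike")
        .agg(pl.col("iv").mean().alias("iv_mean"))
        .sort("strike")
    )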
