Use Polars and DuckDB to build a research database

November 23, 2024

All quants work with data to analyze the markets.

Most of them maintain internal databases. Firms like Jane Street, Man Group, and Goldman Sachs have entire teams dedicated to managing data.

Building your own stock price database is important if you want to combine data sets to analyze the markets. Putting stock prices, economic data, and custom analytics in one database helps you research the markets faster.

You can use the cloud, but running a database is expensive.

There’s a better way.

With Python, Polars, and DuckDB, you can set up your own local stock database. This solution gives you speed, control, and reliability.

In fact, Polars is significantly faster than pandas for most DataFrame operations, and DuckDB is one of the fastest analytical database engines available.

By reading today’s newsletter, you’ll build a local DuckDB database with Polars.

Let's go!

Use Polars and DuckDB to build a research database

With the explosion of free market data, there’s never been a better time to start storing it. Once you have it, combine it, transform it, and analyze it in any way you want.

Polars, known for speed and efficiency, handles large datasets, while DuckDB excels at querying data with SQL (and is also lightning fast).

Setting up involves installing both libraries and sourcing stock data, in this case from Yahoo Finance. Polars loads and transforms the data with an expressive API that makes filtering and more complex operations straightforward.

DuckDB stores the processed data and gives you SQL for in-depth analysis. The setup is simple to use and keeps all data processing local and under your control.

Professionals use these tools for fast data manipulation and analysis. Polars and DuckDB offer scalability, making them ideal for handling large datasets efficiently.

You can run complex queries, like calculating moving averages or detecting price breakouts, all on your own machine.

Let's see how it works with Python.

Imports and setup

Given the power of what we’re about to do, it’s amazing we only need three libraries.

import polars as pl
import yfinance as yf
import duckdb

Now let’s grab some price data to work with.

prices = yf.download("NVDA", start='2023-01-01', end='2024-01-01')

df = (
    pl
    .from_pandas(
        prices
        .reset_index()
    )
    .with_columns(
        [pl.lit("NVDA").alias("symbol")]
    )
)

The code uses Yahoo Finance to download stock prices for NVIDIA, covering the specified date range.

The downloaded data is in pandas format, which is then converted into a Polars DataFrame for lightning-fast processing.

After resetting the index so the date becomes a regular column, a new column "symbol" is added with the value "NVDA" for easy identification of the stock in later analyses. This setup is especially useful for managing and analyzing large datasets.
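
One caveat: depending on your yfinance version, the downloaded prices may come back with a two-level column header (for example ('Close', 'NVDA')). If that happens, here's a minimal sketch to flatten it before the Polars conversion:

import pandas as pd

# Flatten a two-level column header, if present, so Polars sees
# plain column names like "Close" and "Volume".
if isinstance(prices.columns, pd.MultiIndex):
    prices.columns = prices.columns.get_level_values(0)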

Create a database and store the stock data for querying

This code initializes a connection to a DuckDB database called 'stocks.db' and creates a table named 'stocks'. If the table already exists, it will not create a new one. The data from our Polars DataFrame is stored in this table. This allows us to use SQL queries on the stock data, which is particularly helpful for complex data manipulations and analyses.

con = duckdb.connect('stocks.db')
con.execute("""
    CREATE TABLE IF NOT EXISTS stocks AS SELECT * FROM df
""")

The connection to DuckDB is established, allowing for SQL on data stored in-memory or on disk.

The code checks if a table named 'stocks' exists in the database. If not, it creates it using the data from the Polars DataFrame.

That’s right: We can query directly from the Polars DataFrame!
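
For example, here's a minimal sketch (assuming df is still in scope) that runs SQL against the DataFrame without touching the table:

# DuckDB resolves 'df' by name from the local Python variables,
# so the query runs straight against the Polars DataFrame.
print(con.execute("SELECT COUNT(*) AS n_rows FROM df").fetchdf())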

This integration of Polars and DuckDB lets us leverage SQL queries for data analysis, combining the strengths of both SQL and DataFrame manipulations.

Let’s see some examples.

Perform SQL queries to extract insights from the data

Here, we execute SQL queries on the stored stock data to calculate the average closing price and find high-volume trading days.

pl.DataFrame(
    con.execute("""
    SELECT symbol, round(avg(Close), 2) as avg_close FROM stocks GROUP BY symbol
    """).fetchdf()
)

The SQL query calculates the average closing price of NVIDIA stock by grouping all data entries by the stock symbol and averaging the closing prices. The result is then rounded to two decimal places.
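
For comparison, here is the same aggregation expressed natively in Polars (a sketch assuming a recent Polars version with group_by):

# Mean closing price per symbol, computed on the Polars side.
avg_close = (
    df
    .group_by("symbol")
    .agg(pl.col("Close").mean().round(2).alias("avg_close"))
)
print(avg_close)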

Let’s see a few more examples.

pl.DataFrame(
    con.execute("""
    SELECT symbol,
           "Date",
           Volume,
           Close
    FROM stocks
    ORDER BY Volume DESC
    LIMIT 5
    """).fetchdf()
)

This query retrieves the top five trading days by volume. The result is converted into a Polars DataFrame, allowing easy manipulation and visualization of the output.
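
Remember the price breakouts mentioned earlier? Here's a hedged sketch of one way to flag them in SQL: a day counts as a breakout when the close exceeds the highest close of the previous 20 sessions.

breakout_query = """
WITH with_prior_high AS (
    SELECT
        "Date",
        symbol,
        Close,
        MAX(Close) OVER (
            PARTITION BY symbol
            ORDER BY "Date"
            ROWS BETWEEN 20 PRECEDING AND 1 PRECEDING
        ) AS prior_20d_high
    FROM stocks
)
SELECT * FROM with_prior_high
WHERE Close > prior_20d_high
ORDER BY "Date";
"""
breakouts = pl.DataFrame(con.execute(breakout_query).fetchdf())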

Finally, a more complicated query: computing the 20-day volume-weighted average price (VWAP).

1vwap_query = """
2WITH daily_vwap AS (
3    SELECT 
4        "Date",
5        symbol,
6        SUM(Volume * Close) / SUM(Volume) as vwap
7    FROM stocks
8    GROUP BY "Date", symbol
9),
10rolling_vwap AS (
11    SELECT 
12        "Date",
13        symbol,
14        AVG(vwap) OVER (
15            PARTITION BY symbol 
16            ORDER BY "Date" 
17            ROWS BETWEEN 19 PRECEDING AND CURRENT ROW
18        ) as rolling_20d_vwap
19    FROM daily_vwap
20)
21SELECT * FROM rolling_vwap
22ORDER BY symbol, "Date";
23"""
24p_df = pl.DataFrame(con.execute(vwap_query).fetchdf())

The SQL first calculates the daily VWAP by dividing the total value traded (Volume * Close) by the total volume for each day.

Next, it computes a 20-day rolling average of these daily VWAPs to smooth short-term fluctuations and reveal longer-term price trends. The use of a window function in SQL allows the rolling average to be calculated efficiently.

The output is then converted into a Polars DataFrame for further analysis or visualization.

The VWAP is an important metric for traders as it provides insights into the average price a stock has traded at, factoring in volume. The rolling VWAP smooths out daily fluctuations to reveal longer-term trends.
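
The same rolling-window idea can also be expressed directly in Polars. Here's a minimal sketch, assuming the df built earlier, that computes a 20-day simple moving average of the close per symbol:

# 20-day rolling mean of the close, computed per symbol in Polars.
sma = (
    df
    .sort("Date")
    .with_columns(
        pl.col("Close")
        .rolling_mean(window_size=20)
        .over("symbol")
        .alias("sma_20d")
    )
)
print(sma.tail())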

Finally, we can close the connection.

con.close()

Your next steps

Now try modifying the code to download data for a different stock symbol, such as AAPL or TSLA. You could also change the date range to analyze different periods.
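
As a starting point, here's a hedged sketch of a helper that downloads another ticker and appends it to the same stocks table (the add_symbol function is illustrative, not part of the code above):

import duckdb
import polars as pl
import yfinance as yf

def add_symbol(con, symbol, start="2023-01-01", end="2024-01-01"):
    # Download, convert to Polars, and tag the rows with the symbol,
    # mirroring the pattern used for NVDA above.
    prices = yf.download(symbol, start=start, end=end)
    new_df = (
        pl.from_pandas(prices.reset_index())
        .with_columns(pl.lit(symbol).alias("symbol"))
    )
    # Assumes new_df has the same column layout as the stocks table;
    # DuckDB finds 'new_df' by name in the local variables.
    con.execute("INSERT INTO stocks SELECT * FROM new_df")

con = duckdb.connect("stocks.db")
add_symbol(con, "AAPL")
add_symbol(con, "TSLA")
con.close()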
