PQN #015: How to build an automated quant database

How to build an automated quant database

How to build an automated quant database

In today’s issue I’m going to show you how to build your own automated stock price database.

All quants work with data to analyze the markets. Most of them have internal databases. Hedge funds and banks like Jane Street, Man Group, and Goldman Sachs have teams of people managing data.

Building your own stock price database is important if you want to combine data to analyze the markets. Putting stock prices, economic trends, and custom analytics in a database helps you research the markets faster.

Data drives the markets

With the explosion of free market data, there’s never been a better time to start storing it. Once you have it, combine it, transform it, and analyze it in any way you want.

By the end of this issue you will know how to:

  • Use SQLite to build a database
  • Download stock data for free
  • Store the data in a database
  • Run it all automatically

I’m going to show you how to do it with Python.

Let’s get started

A quick note

In today’s issue, I use a Python script. By using a script, I can automatically download data using Task Manager (Windows) or CRON (Mac/Linux).

Open your favorite code editor and create a Python file called market_data.py.

Step 1: Import the libraries

Start by importing the libraries. I like to use yfinance to get stock data. I use SQLite as a database you can store on your computer. It’s an efficient file-based database which makes it perfect for building research tools. It uses standard SQL so you can easily move to a different database like MySQL or Postgres.

To create the database, you need to import the SQLite library. You can use pandas to interact directly with the database.

from sys import argv

import pandas as pd
import yfinance as yf
import sqlite3

Step 2: Download and store stock price data

The script you are building will do two things:

  1. Save data within a date range
  2. Save data for the last trading day

To make it easy, create three functions to download the data, save data within a range, and save data for the last trading day.

First, create a function that downloads data.

def get_stock_data(symbol, start, end):
    data = yf.download(symbol, start=start, end=end)
    data.reset_index(inplace=True)
    data.rename(columns={
        "Date": "date",
        "Open": "open",
        "Low": "low",
        "Close": "close",
        "Adj Close": "adj_close",
        "Volume": "volume"
    }, inplace=True)
    data['symbol'] = symbol
    return data

The function uses yfinance to download data. Then it converts the date to a column, renames the columns, adds the symbol, and returns a DataFrame. Add the symbol so you can save more than one stock in the same table.

Next, create a function that uses get_stock_data to get stock data and saves it into the database.

def save_data_range(symbol, start, end, con):
    data = get_stock_data(symbol, start, end)
    data.to_sql(
        "stock_data", 
        con, 
        if_exists="append", 
        index=False
    )

The function uses pandas to save the data in the DataFrame into the database. Finally, create a function that saves the last trading day’s data into the database.

def save_last_trading_session(symbol, con):
    today = pd.Timestamp.today()
    data = get_stock_data(symbol, today, today)
    data.to_sql(
        "stock_data", 
        con, 
        if_exists="append", 
        index=False
    )

This function grabs data from today and inserts it into the database.

I want to run this script automatically. Before I do, I need to add some code to run it from the command line.

if __name__ == "__main__":
    # usage example for bulk insert
    #     python market_data.py bulk SPY 2022-01-01 2022-10-20
    # usage example for last session
    #     python market_aata.py last SPY

    con = sqlite3.connect("market_data.sqlite")

    if argv[1] == "bulk":
        symbol = argv[2]
        start = argv[3]
        end = argv[4]
        save_data_range(symbol, start, end, con)
        print(f"{symbol} saved between {start} and {end}")
    elif argv[1] == "last":
        symbol = argv[2]
        save_last_trading_session(symbol, con)
        print(f"{symbol} saved")
    else:
        print("Enter bulk or last")

The comments have examples of how to run the code at the command line.

The code in your file should look like this.

from sys import argv

import pandas as pd
import yfinance as yf
import sqlite3

def get_stock_data(symbol, start, end):
    data = yf.download(symbol, start=start, end=end)
    data.reset_index(inplace=True)
    data.rename(columns={
        "Date": "date",
        "Open": "open",
        "Low": "low",
        "Close": "close",
        "Adj Close": "adj_close",
        "Volume": "volume"
    }, inplace=True)
    data['symbol'] = symbol
    return data

def save_data_range(symbol, start, end, con):
    data = get_stock_data(symbol, start, end)
    data.to_sql(
        "stock_data", 
        con, 
        if_exists="append", 
        index=False
    )

def save_last_trading_session(symbol, con):
    today = pd.Timestamp.today()
    data = get_stock_data(symbol, today, today)
    data.to_sql(
        "stock_data", 
        con, 
        if_exists="append", 
        index=False
    )

if __name__ == "__main__":
    # usage example for bulk insert
    #     python market_data.py bulk SPY 2022-01-01 2022-10-20
    # usage example for last session
    #     python market_aata.py last SPY

    con = sqlite3.connect("market_data.sqlite")

    if argv[1] == "bulk":
        symbol = argv[2]
        start = argv[3]
        end = argv[4]
        save_data_range(symbol, start, end, con)
        print(f"{symbol} saved between {start} and {end}")
    elif argv[1] == "last":
        symbol = argv[2]
        save_last_trading_session(symbol, con)
        print(f"{symbol} saved")
    else:
        print("Enter bulk or last")

Remember, to use this code, you need to run it in the terminal.

Step 3: Automate your script

The best way to build your database is to first download history. You do this once to “backfill” the historic data. Then you schedule the script to run every trading day after the market close to capture that day’s data going forward.

There are a lot of references to schedule a Python job on Mac and Windows so I won’t cover it in detail here.

Here are two that will work:

  1. ​Schedule on Mac/Linux
  2. ​Schedule on Windows

Bonus: Query the SQLite database with pandas in a Jupyter Notebook

To get data out of the database, create a Jupyter Notebook in the same directory as the database file. Then you can write SQL as you would normally.

import sqlite3
import pandas as pd

# connect to the database
con = sqlite3.connect("market_data.sqlite")

# simple select statement
df_1 = pd.read_sql_query("SELECT * from stock_data where symbol='SPY'", con)

# where statement
df_2 = pd.read_sql_query("SELECT * from stock_data where symbol='SPY' and volume > 100000", con)

You can join with other tables, filter, and group in any way you want.

Well, that’s it for today. I hope you enjoyed it.

See you again next week.