How to download more fundamental data to power trading

Quants, financial analysts, and traders use fundamental data for investing and trading. These data are derived from the quarterly and annual statements that companies file with the U.S. Securities and Exchange Commission (SEC).

These statements are rich with data that can be used to build predictive factor models for investment portfolios.

The problem?

We can’t download all these documents, parse them, and use them in a way that is useful for analysis at scale.

Until now.

The EDGAR (Electronic Data Gathering, Analysis, and Retrieval) system is operated by the SEC. It automates the submission and retrieval of financial documents filed by companies. It also makes these data available electronically.

EDGAR includes filings like annual reports (10-K) and quarterly reports (10-Q).

The data can be downloaded in various formats, including HTML, XML, and plain text, all of which are easy to work with in Python.

After reading today’s newsletter, you’ll be able to download filing data, parse it, and use it to compute the price-to-earnings ratio.

Let’s dive in!

Imports and set up

Let’s start with the libraries we need for the analysis. Most come from the Python standard library or are common third-party packages (requests, tqdm, and pandas). The exception is OpenBB, which we’ll use for market data.

import requests
from io import BytesIO
from zipfile import ZipFile, BadZipFile
from pathlib import Path
from tqdm import tqdm
import pandas as pd
from openbb import obb

# Set the URLs to Edgar's data repository
SEC_URL = "https://www.sec.gov/"
FSN_PATH = "files/dera/data/financial-statement-and-notes-data-sets/"
DATA_PATH = Path("edgar")
user_agent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36"

Download and extract the filing data

We first generate a list of filing quarters to download. In our example, we’ll grab four quarters’ worth of data from 2015. The EDGAR filing data is large, so start with only a few quarters to get things running.

filing_periods = [
    (d.year, d.quarter) for d in pd.date_range("2015", "2015-12-31", freq="QE")
]

for yr, qtr in tqdm(filing_periods):
    path = DATA_PATH / f"{yr}_{qtr}" / "source"
    if not path.exists():
        path.mkdir(parents=True)
    filing = f"{yr}q{qtr}_notes.zip"
    url = f"{SEC_URL}{FSN_PATH}{filing}"
    response = requests.get(url, headers={"User-Agent": user_agent}).content
    with ZipFile(BytesIO(response)) as zip_file:
        for file in zip_file.namelist():
            local_file = path / file
            if local_file.exists():
                continue
            with local_file.open("wb") as output:
                for line in zip_file.open(file).readlines():
                    output.write(line)

This code iterates over the filing periods and, for each one, downloads and extracts the SEC filing data. For each year and quarter, it builds a directory path and fetches the Zip file from the SEC website. Each file in the archive is then saved to that directory, skipping files that already exist.

The next step is to convert the extracted files to Parquet, an on-disk columnar format.
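The loop above assumes every response is a valid Zip archive, and BadZipFile is imported but never used. Here is a minimal sketch of a more defensive extraction step. It assumes a hypothetical helper, extract_archive (not part of the original code), and a flat archive with no subdirectories. The helper skips files already on disk and returns an empty list when the payload is not a Zip file, for example when the server returns an HTML error page.

```python
from io import BytesIO
from pathlib import Path
from zipfile import BadZipFile, ZipFile


def extract_archive(payload: bytes, target_dir: Path) -> list[str]:
    """Extract a Zip payload into target_dir and return the names written.

    Hypothetical helper for illustration only. Assumes the archive is flat
    (no subdirectories). Returns an empty list if the payload is not a
    valid Zip archive.
    """
    target_dir.mkdir(parents=True, exist_ok=True)
    written = []
    try:
        with ZipFile(BytesIO(payload)) as zip_file:
            for name in zip_file.namelist():
                local_file = target_dir / name
                if local_file.exists():
                    continue  # already extracted on a previous run
                local_file.write_bytes(zip_file.read(name))
                written.append(name)
    except BadZipFile:
        # The payload was not a Zip archive (e.g. an error page)
        return []
    return written
```

In the download loop, a call like extract_archive(response, path) could stand in for the inner with-block, and an empty return value would flag a quarter that needs a retry.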

for f in tqdm(sorted(list(DATA_PATH.glob("**/*.tsv")))):
    parquet_path = f.parent.parent / "parquet"
    if not parquet_path.exists():
        parquet_path.mkdir(parents=True)
    file_name = f.stem + ".parquet"
    if not (parquet_path / file_name).exists():
        df = pd.read_csv(
            f, sep="\t", encoding="latin1", low_memory=False, on_bad_lines="skip"
        )
        df.to_parquet(parquet_path / file_name)
        f.unlink()

The code iterates through the downloaded TSV files in the DATA_PATH directory, converting each to a Parquet file in the ‘parquet’ subdirectory. For each TSV file, it reads the file into a DataFrame, writes the DataFrame to a new Parquet file, and then deletes the original TSV file.
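Because the original TSV is deleted after conversion, it helps to know what on_bad_lines="skip" does: rows with more fields than the header are dropped silently. The sketch below shows the effect on a few made-up rows. The column names mirror the SEC files, but the values are illustrative only.

```python
from io import StringIO

import pandas as pd

# Three made-up rows; the second has an extra field and is malformed.
raw = (
    "adsh\ttag\tvalue\n"
    "A-1\tNetIncomeLoss\t1\n"
    "A-2\tGrossProfit\t2\tunexpected\n"
    "A-3\tOperatingIncomeLoss\t3\n"
)

# The malformed row is skipped instead of raising a parser error.
df = pd.read_csv(StringIO(raw), sep="\t", on_bad_lines="skip")
print(df["adsh"].tolist())
```

Only the first and third rows survive. Comparing the row count of the DataFrame with the line count of the TSV before calling unlink is one way to notice when rows were dropped.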

Build the fundamentals data set

Now that we’ve stored the data from each of the filings as Parquet files, we can begin building the data set.

sub = pd.read_parquet(DATA_PATH / '2015_3' / 'parquet' / 'sub.parquet')
name = "APPLE INC"
cik = sub[sub.name == name].T.dropna().squeeze().cik
aapl_subs = pd.DataFrame()
for sub_path in DATA_PATH.glob("**/sub.parquet"):
    sub = pd.read_parquet(sub_path)
    aapl_sub = sub[
        (sub.cik.astype(int) == cik) & (sub.form.isin(["10-Q", "10-K"]))
    ]
    aapl_subs = pd.concat([aapl_subs, aapl_sub])

This code looks up Apple’s Central Index Key (CIK), the SEC’s company identifier, by filtering the sub DataFrame on the company name APPLE INC. The matching row is transposed, stripped of empty fields, and squeezed into a Series so the cik can be read as an attribute. The loop then uses that CIK to collect Apple’s quarterly (10-Q) and annual (10-K) submissions from every downloaded period.
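The transpose-and-squeeze lookup above only works when exactly one row matches the company name. Here is a small sketch of the same two steps (find the CIK, then filter on CIK and form type) using a toy stand-in for sub.parquet. The accession numbers are made up, and taking the first match with .iloc[0] is an assumption of this sketch rather than the original approach.

```python
import pandas as pd

# Toy stand-in for sub.parquet; the accession numbers are made up.
sub = pd.DataFrame(
    {
        "adsh": ["0000000000-15-000001", "0000000000-15-000002", "0000000000-15-000003"],
        "cik": [320193, 320193, 789019],
        "name": ["APPLE INC", "APPLE INC", "MICROSOFT CORP"],
        "form": ["10-Q", "8-K", "10-Q"],
    }
)

# Look up the CIK by exact company name, taking the first match.
cik = sub.loc[sub["name"] == "APPLE INC", "cik"].iloc[0]

# Keep only that company's quarterly and annual reports.
filings = sub[(sub["cik"] == cik) & sub["form"].isin(["10-Q", "10-K"])]
print(filings["adsh"].tolist())
```

The 8-K row and the other company’s 10-Q are filtered out, leaving a single accession number.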

Use the fundamental data to build the PE ratio

First, extract all numerical data available from the Apple filings.

aapl_nums = pd.DataFrame()
for num_path in DATA_PATH.glob("**/num.parquet"):
    num = pd.read_parquet(num_path).drop("dimh", axis=1)
    aapl_num = num[num.adsh.isin(aapl_subs.adsh)]
    aapl_nums = pd.concat([aapl_nums, aapl_num])
aapl_nums.ddate = pd.to_datetime(aapl_nums.ddate, format="%Y%m%d")
aapl_nums.to_parquet(DATA_PATH / "aapl_nums.parquet")

Now we can select a field, such as diluted earnings per share (EPS), and combine it with market data to calculate the price-to-earnings ratio.

eps = aapl_nums[
    (aapl_nums.tag == "EarningsPerShareDiluted") & (aapl_nums.qtrs == 1)
].drop("tag", axis=1)
eps = eps.groupby("adsh").apply(
    lambda x: x.nlargest(n=1, columns=["ddate"]), include_groups=False
)
eps = eps[["ddate", "value"]].set_index("ddate").squeeze().sort_index()
ax = eps.plot.bar()
ax.set_xticklabels(eps.index.to_period("Q"));

This code keeps the single-quarter (qtrs == 1) diluted EPS values, takes the most recent period reported in each filing, and plots the resulting series in a bar chart.
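A filing can report the same tag for several periods (the current quarter plus comparison periods), which is why the code keeps only the row with the latest ddate for each accession number (adsh). The sketch below makes the same selection on made-up numbers, using sort_values and drop_duplicates rather than groupby and apply. It is an alternative shown for illustration, not the original method.

```python
import pandas as pd

# Made-up EPS rows: each filing reports the current and the prior-year quarter.
nums = pd.DataFrame(
    {
        "adsh": ["f1", "f1", "f2", "f2"],
        "ddate": pd.to_datetime(
            ["2014-03-31", "2015-03-31", "2014-06-30", "2015-06-30"]
        ),
        "value": [1.0, 2.0, 3.0, 4.0],
    }
)

# Sort by period end, then keep the last (most recent) row per filing.
latest = (
    nums.sort_values("ddate")
    .drop_duplicates("adsh", keep="last")
    .set_index("ddate")["value"]
    .sort_index()
)
print(latest)
```

The result holds one value per filing, indexed by the period end date, matching the shape of the eps Series built above.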

Now use OpenBB to download market data and align it with the EPS data to compute the price-to-earnings ratio.

aapl = (
    obb.equity.price.historical(
        "AAPL", start_date="2014-12-31", end_date=eps.index.max(), provider="yfinance"
    )
    .to_df()
    .resample("D")
    .last()
    .loc["2014":"2015"]
)

pe = aapl.close.to_frame("price").join(eps.to_frame("eps")).ffill().dropna()
pe["pe_ratio"] = pe.price.div(pe.eps)
ax = pe.plot(subplots=True, figsize=(16, 8), legend=False, lw=0.5)
ax[0].set_title("Close")
ax[1].set_title("Diluted EPS")
ax[2].set_title("Trailing P/E")

The result is a chart depicting the closing price, diluted EPS, and the price-to-earnings ratio. Note that the ratio here divides the price by a single quarter’s EPS.
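A conventional trailing P/E divides the price by the sum of the last four quarters of EPS. Here is a minimal sketch of that calculation, assuming at least four quarters of EPS are available. All prices and EPS values below are made up for illustration.

```python
import pandas as pd

# Made-up quarterly diluted EPS, indexed by period end date.
eps = pd.Series(
    [1.0, 2.0, 3.0, 4.0, 5.0],
    index=pd.to_datetime(
        ["2014-12-31", "2015-03-31", "2015-06-30", "2015-09-30", "2015-12-31"]
    ),
    name="eps",
)

# Trailing-twelve-month EPS: rolling sum over the last four quarters.
ttm_eps = eps.rolling(4).sum().dropna().rename("ttm_eps")

# Made-up daily closing prices around the last two quarter ends.
price = pd.Series(
    [100.0, 105.0, 140.0, 147.0],
    index=pd.to_datetime(["2015-09-30", "2015-10-01", "2015-12-31", "2016-01-04"]),
    name="price",
)

# Align quarterly EPS to the daily price index and carry it forward.
pe = price.to_frame().join(ttm_eps).ffill().dropna()
pe["pe_ratio"] = pe["price"] / pe["ttm_eps"]
print(pe)
```

The join and ffill steps mirror the alignment in the code above. Only the EPS input changes from a single quarter to a four-quarter sum.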

Next steps

We only scratched the surface of what is available through the EDGAR filings. As a next step, extract and parse the following fields:

• PaymentsOfDividendsCommonStock

• WeightedAverageNumberOfDilutedSharesOutstanding

• OperatingIncomeLoss

• NetIncomeLoss

• GrossProfit
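As a starting point, the EPS extraction can be generalized to several tags at once. The sketch below assumes a hypothetical helper, quarterly_fields, and a toy stand-in for aapl_nums with made-up values. It also assumes each tag appears at most once per period after filtering, which real filings with multiple segments may not satisfy.

```python
import pandas as pd

TAGS = ["NetIncomeLoss", "GrossProfit"]

# Toy stand-in for aapl_nums; all values are made up.
nums = pd.DataFrame(
    {
        "adsh": ["f1", "f1", "f1", "f2", "f2"],
        "tag": [
            "NetIncomeLoss", "NetIncomeLoss", "GrossProfit",
            "NetIncomeLoss", "GrossProfit",
        ],
        "qtrs": [1, 1, 1, 1, 1],
        "ddate": pd.to_datetime(
            ["2014-03-31", "2015-03-31", "2015-03-31", "2015-06-30", "2015-06-30"]
        ),
        "value": [10.0, 12.0, 30.0, 11.0, 28.0],
    }
)


def quarterly_fields(nums, tags):
    """Return a period-by-tag table of single-quarter values.

    Hypothetical helper for illustration only.
    """
    subset = nums[nums["tag"].isin(tags) & (nums["qtrs"] == 1)]
    # Keep the most recent period reported for each tag in each filing.
    latest = subset.sort_values("ddate").drop_duplicates(
        ["adsh", "tag"], keep="last"
    )
    # pivot raises if a (ddate, tag) pair appears more than once.
    return latest.pivot(index="ddate", columns="tag", values="value").sort_index()


fields = quarterly_fields(nums, TAGS)
print(fields)
```

Each column of the result is a quarterly series that can be joined to prices in the same way as the EPS data.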