
How an old Nintendo baddie boosts portfolio analysis

K-medoids is a powerful method used in data science to cluster similar data together. It's like k-means but more robust to outliers, which makes it useful for portfolio analysis.

Today’s newsletter is based on a reader’s suggestion. We look at k-medoids, which is a villain in the popular Nintendo game Metroid.

No, it’s not. But if you know Metroid, you have to agree:

It sounds like one!

It’s actually a powerful method used in data science to cluster similar data together. It’s robust to outliers, which makes it super useful when clustering features of a portfolio of financial assets.

Let’s check it out!


K-medoids is similar to k-means. The difference is the cluster center: k-means assigns each point to the nearest cluster mean, while k-medoids assigns each point to the nearest “medoid,” an actual data point chosen to represent the cluster.
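To make the idea concrete, here is a minimal sketch of an alternating k-medoids loop in plain NumPy. It is a simplified illustration, not the scikit-learn-extra implementation. The function name `k_medoids`, the `init_medoids` argument, and the toy data are all assumptions made for this example.

```python
import numpy as np


def k_medoids(X, init_medoids, n_iter=100):
    """Simplified alternating k-medoids (illustrative sketch only)."""
    # Pairwise Euclidean distances between all points
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    medoids = np.array(init_medoids)

    for _ in range(n_iter):
        # Assignment step: each point joins its nearest medoid
        labels = dist[:, medoids].argmin(axis=1)

        new_medoids = medoids.copy()
        for j in range(len(medoids)):
            members = np.flatnonzero(labels == j)
            if members.size == 0:
                continue  # keep the old medoid if a cluster is empty
            # Update step: pick the member with the smallest total
            # distance to the other members of its cluster
            within = dist[np.ix_(members, members)].sum(axis=1)
            new_medoids[j] = members[within.argmin()]

        if np.array_equal(new_medoids, medoids):
            break  # medoids stopped moving
        medoids = new_medoids

    labels = dist[:, medoids].argmin(axis=1)
    return medoids, labels


# Toy data: two well-separated groups of three points each
X = np.array(
    [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float
)
medoids, labels = k_medoids(X, init_medoids=[0, 1])
print(X[medoids])  # every cluster center is a row of X
print(labels)
```

Note that the medoids are returned as row indices into the data, so each cluster center is always one of the original observations.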


In the context of clustering portfolio returns and volatility, k-medoids is more robust to outliers, since it uses actual portfolio feature values as cluster centers, whereas k-means can be influenced by extreme returns or volatilities. In the case of a portfolio of high-volatility tech stocks, this can be a problem.
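A quick way to see the outlier effect is to compare the mean and the medoid of a handful of points. The sketch below uses made-up (return, volatility) pairs, with one extreme value added on purpose. The numbers are hypothetical and only for illustration.

```python
import numpy as np

# Hypothetical (annual return, annual volatility) pairs.
# The last row is a deliberate outlier.
points = np.array(
    [
        [0.10, 0.20],
        [0.12, 0.22],
        [0.11, 0.21],
        [0.09, 0.19],
        [1.50, 0.90],
    ]
)

# The mean is pulled toward the outlier
mean_center = points.mean(axis=0)

# The medoid is the actual point with the smallest total
# distance to all other points
dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
medoid = points[dist.sum(axis=1).argmin()]

print("mean:  ", mean_center)
print("medoid:", medoid)
```

The mean lands in empty space between the four similar points and the outlier, while the medoid stays on one of the four similar points.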

By understanding k-medoids and how it’s used in practice, you can make more informed decisions about your investments.

Here’s how.

Imports and set up

We’ll use the scikit-learn-extra module to run the k-medoids analysis. scikit-learn-extra is a module for machine learning that extends scikit-learn. It includes algorithms that are useful but do not satisfy the scikit-learn inclusion criteria.

import numpy as np
import pandas as pd
from sklearn_extra.cluster import KMedoids
import matplotlib.pyplot as plt
from openbb_terminal.sdk import openbb

For this analysis, we’ll cluster the annualized returns and volatility of the stocks in the Nasdaq-100 index. To get the tickers, we can use pandas.

nq = pd.read_html("https://en.wikipedia.org/wiki/Nasdaq-100")[4]
symbols = nq.Ticker.tolist()
data = openbb.stocks.ca.hist(
    symbols, 
    start_date="2020-01-01", 
    end_date="2022-12-31"
)

This code parses the tables on the Wikipedia page and returns the table at index 4. If the page layout changes, that index may need adjusting. Then we create a list of ticker symbols and pass them to OpenBB to get the historic price data. Finally, we get the annualized returns and volatility for each ticker.

moments = (
    data
    .pct_change()
    .describe()
    .T[["mean", "std"]]
    .rename(columns={"mean": "returns", "std": "vol"})
) * [252, np.sqrt(252)]

We compute the daily return of each ticker, then use the describe method to get a DataFrame with the summary statistics. From there, we extract the mean and standard deviation, rename the columns, and annualize the values by multiplying the mean by 252 trading days and the standard deviation by the square root of 252.
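If the chained version feels dense, the same numbers can be computed step by step. The sketch below uses simulated prices for two made-up tickers (`AAA` and `BBB`) rather than real market data, and checks that the explicit calculation matches the describe-based one.

```python
import numpy as np
import pandas as pd

# Simulated daily returns for two hypothetical tickers
rng = np.random.default_rng(42)
dates = pd.bdate_range("2022-01-03", periods=252)
daily = pd.DataFrame(
    rng.normal(0.0005, 0.02, size=(252, 2)),
    index=dates,
    columns=["AAA", "BBB"],
)
prices = 100 * (1 + daily).cumprod()

# Explicit version: daily returns, then annualize each statistic
returns = prices.pct_change().dropna()
explicit = pd.DataFrame(
    {
        "returns": returns.mean() * 252,
        "vol": returns.std() * np.sqrt(252),
    }
)

# Chained version, as used in the newsletter
via_describe = (
    prices
    .pct_change()
    .describe()
    .T[["mean", "std"]]
    .rename(columns={"mean": "returns", "std": "vol"})
) * [252, np.sqrt(252)]

print(explicit)
```

Both versions give one row per ticker with a `returns` and a `vol` column.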

Running the k-medoids analysis

Getting the medoids is only one line of code. The remaining code creates colors for the points in each cluster.

km = KMedoids(n_clusters=5).fit(moments)
labels = km.labels_
unique_labels = set(labels)
colors = [
    plt.cm.Spectral(each) for each in np.linspace(0, 1, len(unique_labels))
]

From there, we generate the plot with the medoids in bright cyan.

for k, col in zip(unique_labels, colors):
    class_member_mask = labels == k

    xy = moments[class_member_mask]
    plt.plot(
        xy.iloc[:, 0],
        xy.iloc[:, 1],
        "o",
        markerfacecolor=tuple(col),
        markeredgecolor="k",
    )

plt.plot(
    km.cluster_centers_[:, 0],
    km.cluster_centers_[:, 1],
    "o",
    markerfacecolor="cyan",
    markeredgecolor="k",
)
plt.xlabel("Return")
plt.ylabel("Ann. Vol.")

We iterate over the cluster labels and their corresponding colors, plotting the data points belonging to each cluster in 2D space. The class_member_mask is used to filter the data points that belong to the current cluster.

Finally, the code plots the cluster centers in cyan to indicate their positions on the graph.


Each cluster represents a set of stocks with similar risk-return characteristics. By examining these clusters, we can identify stocks that are statistically similar in terms of their performance metrics.

This information is valuable for asset allocation strategies when seeking specific risk and return objectives. For instance, a cluster with high returns and low volatility would be particularly appealing for risk-averse investors seeking stable growth.

Conversely, a cluster with high returns and high volatility might be more suitable for investors with a higher risk tolerance.
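To examine the clusters in a table rather than on a chart, one option is to attach the labels to the features and group by cluster. The sketch below uses a small hypothetical `moments` DataFrame and a made-up `labels` array in place of the real results.

```python
import numpy as np
import pandas as pd

# Hypothetical features and cluster labels for four made-up tickers
moments = pd.DataFrame(
    {"returns": [0.10, 0.35, 0.20, 0.45], "vol": [0.20, 0.50, 0.30, 0.60]},
    index=["AAA", "BBB", "CCC", "DDD"],
)
labels = np.array([0, 1, 0, 1])

clustered = moments.assign(cluster=labels)

# Average risk and return per cluster, plus the number of stocks
summary = clustered.groupby("cluster").agg(
    n_stocks=("returns", "size"),
    avg_return=("returns", "mean"),
    avg_vol=("vol", "mean"),
)

# Tickers that belong to each cluster
members = {c: g.index.tolist() for c, g in clustered.groupby("cluster")}

print(summary)
print(members)
```

With the real analysis, `labels` would come from `km.labels_` and `moments` from the earlier step.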

Next steps

As a next step, try to use the k-medoids analysis on different features. For example, you can compute the conditional value at risk and Sharpe ratio for each stock.
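Here is one possible starting point, sketched with simulated daily returns for three made-up tickers. It assumes a historical CVaR at the 5% level and a risk-free rate of zero. The helper names `sharpe_ratio` and `cvar` are hypothetical, not part of any library.

```python
import numpy as np
import pandas as pd

# Simulated daily returns for three hypothetical tickers
rng = np.random.default_rng(0)
returns = pd.DataFrame(
    rng.normal(0.0005, 0.02, size=(500, 3)),
    columns=["AAA", "BBB", "CCC"],
)


def sharpe_ratio(r, risk_free=0.0):
    """Annualized Sharpe ratio from daily returns (daily risk-free rate)."""
    excess = r - risk_free
    return excess.mean() / excess.std() * np.sqrt(252)


def cvar(r, level=0.05):
    """Historical CVaR: average of the returns at or below the quantile."""
    var = r.quantile(level)
    return r[r <= var].mean()


# One row per ticker, one column per feature
features = pd.DataFrame(
    {
        "sharpe": returns.apply(sharpe_ratio),
        "cvar": returns.apply(cvar),
    }
)
print(features)
```

The resulting `features` DataFrame has the same shape as `moments`, so it could be passed to `KMedoids` in the same way. Since the two columns sit on different scales, it may be worth standardizing them first.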

Does the new analysis give you further insight into how you might construct a portfolio?