Generate synthetic market data with TensorFlow
The lifeblood of quant finance is data.
The problem is that data is sometimes hard to come by. It may be expensive or just not available.
What if we had a way to generate synthetic market data?
Artificially recreating a dataset is a complex process. The new data needs to mimic the existing data distributions and not introduce biasing or noise in the dataset.
That’s where Generative Adversarial Networks (GANs) can help.
In today’s guest post by LSEG, we’ll explore how to use GANs to recreate datasets.
You can read their entire article here.
A GAN is a twin neural network architecture that learns a dataset's distribution and then generates new data from it.
We call it a twin structure because it comprises two networks, the Generator and the Discriminator, which compete with each other during training.
The learning process mimics the way we as humans learn with the help of an expert: we try to solve a problem, succeed or fail, and receive feedback from the expert on what to do better next time we try.
Let’s see how it works.
Data ingestion and feature engineering
Using the Refinitiv Data libraries, we ingest tick data for OMX index futures by calling rd.get_history()
inside a while loop, which works around the 10,000-data-point limit per call.
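The pagination pattern can be sketched as follows. This is not the article's exact code: fetch_history() below is a hypothetical, runnable stand-in for rd.get_history() (which needs live Refinitiv credentials), serving small synthetic pages so the loop itself can be demonstrated.

```python
import pandas as pd

MAX_ROWS = 3  # stand-in for the real 10,000-row limit of rd.get_history()

# Fake tick pages, newest first, mimicking what
# rd.get_history(universe=..., end=..., count=MAX_ROWS) would return.
_pages = [
    pd.DataFrame({"TRDPRC_1": [101.0, 100.5, 100.0]},
                 index=pd.to_datetime(["2023-01-02 10:00:06",
                                       "2023-01-02 10:00:05",
                                       "2023-01-02 10:00:04"])),
    pd.DataFrame({"TRDPRC_1": [99.5, 99.0, 98.5]},
                 index=pd.to_datetime(["2023-01-02 10:00:03",
                                       "2023-01-02 10:00:02",
                                       "2023-01-02 10:00:01"])),
]

def fetch_history(end, count):
    """Hypothetical placeholder for rd.get_history(); pops the next page."""
    return _pages.pop(0) if _pages else pd.DataFrame()

def ingest_all(end):
    """Page backwards through time until a call returns no rows."""
    chunks = []
    while True:
        df = fetch_history(end=end, count=MAX_ROWS)
        if df.empty:
            break
        chunks.append(df)
        end = df.index.min()  # next request ends where this page started
    return pd.concat(chunks).sort_index()

ticks = ingest_all(end="2023-01-02 10:00:07")
print(len(ticks))  # 6
```

Each iteration moves the end timestamp back to the oldest tick received, so consecutive calls walk through history in page-sized steps.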
The data is then consolidated into a DataFrame, with numerical transformations applied to the columns.
Feature engineering includes calculating trade per tick (TPT) and price gradients using a custom tick_gradient()
function.
The processed tick data within the flash crash event timeframe is then ready for further analysis and modeling.
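The article's tick_gradient() implementation and exact TPT definition are not reproduced here, so the sketch below shows one plausible interpretation: TPT as cumulative volume per tick, and the price gradient as a per-tick finite difference via np.gradient. Column names TRDPRC_1 and TRDVOL_1 are assumed.

```python
import numpy as np
import pandas as pd

def tick_gradient(prices):
    """Hypothetical version of the article's custom function:
    rate of change of price per tick (central finite differences)."""
    return np.gradient(prices.to_numpy())

df = pd.DataFrame({
    "TRDPRC_1": [100.0, 100.5, 99.0, 98.0, 98.5],  # trade price
    "TRDVOL_1": [10, 40, 25, 60, 15],              # trade volume
})

# Trade per tick (TPT), assumed here to mean cumulative traded volume
# divided by the running tick count.
df["TPT"] = df["TRDVOL_1"].cumsum() / np.arange(1, len(df) + 1)
df["gradient"] = tick_gradient(df["TRDPRC_1"])
print(df[["TPT", "gradient"]])
```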
Model preparation and transformation
We then scale the data and reduce its dimensionality with PCA, which helps with visualization and analysis.
Besides compressing the data, this step pickles the scaler and the PCA model for future use.
This step prepares the data for comparison with synthetic data generated by the GAN.
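A minimal sketch of this preparation step, using scikit-learn's StandardScaler and PCA with random stand-in features (the real input would be the engineered tick features). Persisting both fitted transforms means synthetic data can later be projected into exactly the same space as the real data.

```python
import pickle

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
features = rng.normal(size=(500, 6))  # stand-in for the engineered features

# Standardize each feature, then compress to 2 principal components
# for easy visual comparison against the GAN's output.
scaler = StandardScaler()
scaled = scaler.fit_transform(features)

pca = PCA(n_components=2)
components = pca.fit_transform(scaled)

# Pickle both fitted transforms so synthetic samples can be mapped
# through the identical scaler/PCA later.
with open("scaler.pkl", "wb") as f:
    pickle.dump(scaler, f)
with open("pca.pkl", "wb") as f:
    pickle.dump(pca, f)

print(components.shape)  # (500, 2)
```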
GAN modeling and training
Finally, we build the GAN using TensorFlow.
The discriminator learns to differentiate between real and synthetic data, while the generator learns to produce data that mimics the real data distribution.
The GAN is trained iteratively, with the discriminator and generator updated in alternating steps so that the synthetic data progressively better matches the real data.
The training process is visualized by comparing the principal components of the real and generated data.
This process highlights the GAN's ability to replicate complex data distributions such as those seen during a flash crash event.
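As an illustration of the training loop described above, here is a deliberately tiny TensorFlow GAN, not the article's model. Layer sizes, the 2-D data (standing in for the two principal components), and the Gaussian "real" distribution are all assumptions chosen to keep the sketch small and fast.

```python
import numpy as np
import tensorflow as tf

tf.random.set_seed(0)
LATENT_DIM, DATA_DIM = 8, 2  # 2-D samples, e.g. two principal components

# Generator: noise -> synthetic sample. Discriminator: sample -> real/fake logit.
generator = tf.keras.Sequential([
    tf.keras.Input(shape=(LATENT_DIM,)),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(DATA_DIM),
])
discriminator = tf.keras.Sequential([
    tf.keras.Input(shape=(DATA_DIM,)),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1),
])

bce = tf.keras.losses.BinaryCrossentropy(from_logits=True)
g_opt = tf.keras.optimizers.Adam(1e-3)
d_opt = tf.keras.optimizers.Adam(1e-3)

# Stand-in "real" data: an offset Gaussian cloud.
real = tf.constant(
    np.random.default_rng(0).normal(loc=3.0, size=(256, DATA_DIM)),
    dtype=tf.float32,
)

def train_step(real_batch):
    noise = tf.random.normal([tf.shape(real_batch)[0], LATENT_DIM])
    with tf.GradientTape() as g_tape, tf.GradientTape() as d_tape:
        fake = generator(noise, training=True)
        real_logits = discriminator(real_batch, training=True)
        fake_logits = discriminator(fake, training=True)
        # Discriminator: label real as 1, fake as 0. Generator: fool it.
        d_loss = (bce(tf.ones_like(real_logits), real_logits)
                  + bce(tf.zeros_like(fake_logits), fake_logits))
        g_loss = bce(tf.ones_like(fake_logits), fake_logits)
    d_opt.apply_gradients(
        zip(d_tape.gradient(d_loss, discriminator.trainable_variables),
            discriminator.trainable_variables))
    g_opt.apply_gradients(
        zip(g_tape.gradient(g_loss, generator.trainable_variables),
            generator.trainable_variables))
    return d_loss, g_loss

for _ in range(50):  # a handful of alternating updates, just to demonstrate
    d_loss, g_loss = train_step(real)

samples = generator(tf.random.normal([100, LATENT_DIM])).numpy()
print(samples.shape)  # (100, 2)
```

In a real run, you would train far longer and periodically project the generated samples through the saved scaler and PCA to compare their principal components against the real data's.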
Next steps
The LSEG article covers all the steps of this process in detail. Check it out here.