Real-world datasets are important when building and testing machine learning models: you can benchmark a model, or expose its flaws, by running it against several different datasets. Sometimes, though, you need to construct synthetic datasets to test your algorithms, introducing noise, correlations, or redundant information into the data in a controlled way.
Obtaining Time Series Datasets in Python
This guide will teach you 3 main things:
- How to use pandas_datareader to load data.
- How to call a web data server’s API using the requests library.
- How to generate synthetic time-series data.
Loading Data Using pandas_datareader
You can use pip to install these libraries on your system (for example, pip install pandas-datareader). The pandas_datareader library lets you fetch financial and economic time series from a variety of sources, such as Yahoo Finance, the World Bank, and the St. Louis Fed (FRED).
You can read an economic time series from Federal Reserve Economic Data (FRED); every time series in FRED is identified by a symbol. For sources such as the World Bank, you can also read data for specific countries by specifying an ISO-3166-1 country code.
Behind the scenes, pandas_datareader fetches data from the web in real time and assembles it into a pandas DataFrame. Each data source requires a different reader.
Fetching Data Using Web APIs
Some web services expose APIs that you can call to get data without needing to authenticate.
You can do this in Python using the standard library's urllib or the third-party requests library. The World Bank, for instance, has freely available APIs that return data as JSON, XML, or plain text.
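As a sketch of calling the World Bank API with requests, using the indicator SP.POP.TOTL (total population) and the country code US as example parameters:

```python
# Requires: pip install requests
# SP.POP.TOTL (total population) and the country code "US" are example
# choices; the World Bank API needs no authentication.
import requests

url = "https://api.worldbank.org/v2/country/US/indicator/SP.POP.TOTL"
resp = requests.get(url, params={"format": "json", "date": "2015:2020"})
resp.raise_for_status()

# The JSON response is a two-element list: paging metadata, then records.
meta, records = resp.json()
for rec in records:
    print(rec["date"], rec["value"])
```

Each record carries the year and the indicator value, which you can then load into a pandas DataFrame for analysis.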
Creating Synthetic Data Using NumPy
We may not want to use real-world data for our project if we require something specific that’s unlikely to occur in the real world. Testing a model using ideal time-series data is one such scenario.
Use NumPy to create random samples from various distributions.
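One common pattern is to build an ideal time series from components you fully control. The sketch below (trend, seasonal cycle, and Gaussian noise, with all parameters chosen arbitrarily for illustration) shows the idea:

```python
import numpy as np

# Seeded generator so the synthetic series is reproducible.
rng = np.random.default_rng(seed=42)
n = 200  # number of time steps

trend = 0.1 * np.arange(n)                              # linear upward trend
seasonal = 2.0 * np.sin(2 * np.pi * np.arange(n) / 12)  # 12-step cycle
noise = rng.normal(loc=0.0, scale=0.5, size=n)          # Gaussian noise

# The synthetic series is the sum of the known components.
series = trend + seasonal + noise
print(series[:5])
```

Because you know the ground-truth trend and seasonality, you can check exactly how well a model recovers them, something real-world data never allows.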
Read more at machinelearningmastery.com.