Institutional asset management is under pressure: fee compression, shifting investor sentiment and technological advances are pushing portfolio managers to find new ways to deliver value to investors. The volume of alternative data has exploded over the last several years, and that growth is likely to accelerate over the next decade. Hedge funds spend vast amounts of money on alternative data, though most of it is not useful in its raw form.
Boosted.ai CEO Josh Pantony recently sat down for a webinar to discuss how to identify the complex problems that arise with alternative data and, more importantly, how to fix them. Three data problems from that discussion are worth emphasizing: incorrect point-in-time data, data derived in an in-sample fashion and data with a form of survivorship bias.
How Machines Can Collapse from Incorrect Point-in-Time Data
One of the main problems we’ve seen when it comes to point-in-time data is that often, a data set provider will say the market saw something on a given date when in fact, it couldn’t have seen it until several weeks later.
An example we see arise repeatedly is around earnings releases. If a company announces earnings on, say, January 3rd, and then three weeks later announces revised earnings, the data provider will supply the revised figures but stamp them with January 3rd. In reality, the market couldn’t have seen the revised figures on January 3rd, but nothing in the data tells the machine that. The machine will learn from incorrect point-in-time data and find patterns that look great in a backtest, but when it goes live, the system will collapse because the revised figures it learned to rely on are not available at prediction time.
Any machine learning model built on this incorrect point-in-time data will learn the bias and fail once live. In this example, the earnings model must be built on correct point-in-time data, in which each value carries the date the market could first see it, to keep this bias out of the model.
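As a minimal sketch of what correct point-in-time data can look like, assume each earnings figure is stored with two dates: the fiscal period it describes and the date it was actually published. The `EarningsRecord` type, the `eps_as_of` helper, the ticker and the EPS values below are all hypothetical and only illustrate the January 3rd example above.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class EarningsRecord:
    ticker: str
    period_end: date   # fiscal period the figure describes
    published: date    # date the market could first see this value
    eps: float


def eps_as_of(records: List[EarningsRecord], ticker: str,
              period_end: date, as_of: date) -> Optional[float]:
    """Return the most recent EPS value that was public on `as_of`, or None."""
    visible = [r for r in records
               if r.ticker == ticker
               and r.period_end == period_end
               and r.published <= as_of]
    if not visible:
        return None
    return max(visible, key=lambda r: r.published).eps


# Hypothetical data: an original release on January 3rd, revised three weeks later.
q4 = date(2019, 12, 31)
records = [
    EarningsRecord("ABC", q4, date(2020, 1, 3), 1.00),    # original release
    EarningsRecord("ABC", q4, date(2020, 1, 24), 0.80),   # revision
]

before_release = eps_as_of(records, "ABC", q4, date(2020, 1, 2))
on_release = eps_as_of(records, "ABC", q4, date(2020, 1, 3))
after_revision = eps_as_of(records, "ABC", q4, date(2020, 1, 24))
```

Here a backtest that asks for the value on January 3rd gets the original 1.00, and the revised 0.80 only becomes visible from January 24th. A feed that stores the revision under the January 3rd date cannot make that distinction.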
In-Sample Model Data and Prediction Periods
Often, a data vendor might have various interesting sets of data. They will recognize that the data is hard to use on its own, so they will build a prediction model to support and decode the data.
One way to make earnings predictions is to look at credit card data. A vendor can go back in time and take all available credit card data from the past ten years and train a machine learning model to make predictions based on that period. However, if that model is then used to make a prediction for 2015, it is cheating: its training set already included 2015 data, so the backtest looks better than it should, and the model will run into bias problems when it goes live.
The solution to this challenge is to access the raw data and build your own version of the prediction model. Rather than training a model on all available data over the past ten years, it’s essential to train on a rolling window of data and ensure that window never overlaps the prediction period.
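The rolling-window idea can be sketched as a walk-forward split. This is an illustration under stated assumptions, not a vendor’s or Boosted.ai’s implementation: the `walk_forward_splits` function, the yearly granularity, the three-year window and the optional `gap` parameter are all hypothetical choices.

```python
from typing import Iterator, List, Tuple


def walk_forward_splits(periods: List[int], train_size: int,
                        gap: int = 0) -> Iterator[Tuple[List[int], int]]:
    """Yield (training_periods, prediction_period) pairs in time order.

    Every training window ends strictly before its prediction period.
    `gap` leaves extra periods unused between the two.
    """
    ordered = sorted(periods)
    for i in range(train_size + gap, len(ordered)):
        train = ordered[i - gap - train_size:i - gap]
        yield train, ordered[i]


years = list(range(2010, 2020))   # hypothetical ten years of data
splits = list(walk_forward_splits(years, train_size=3))

# Look up which years the model may train on when predicting 2015.
train_for_2015 = next(train for train, target in splits if target == 2015)
```

With these settings, the model that predicts 2015 is trained only on 2012, 2013 and 2014, so the prediction is out-of-sample. A model is refit for each prediction period, which costs more compute than a single fit on all ten years but keeps the backtest honest.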
Survivorship Bias and Non-Active Tickers
Finally, it’s critical to assess and deal with potential survivorship bias within a model. A lack of awareness around survivorship bias will lead to the machine making risky bets that will work in a backtest but collapse when run live.
For example, suppose a vendor doesn’t capture data on delisted companies. If you try to create a model based on underlying data that doesn’t include delisted companies, the machine will learn from that and assume the lack of data guarantees the companies within the data set haven’t gone bankrupt or been bought. While the lack of data on delisted companies will make things look great in a backtest because the data was never present, the model will break down the moment it goes live.
This bias means not only that vendors are missing data for non-active tickers, but also that the way vendors collect data for non-active tickers differs from how they collect data for active tickers, leading the machine to make risky decisions.
Boosted.ai creates a point-in-time representation of any given universe that accurately captures the mandate you are looking for in order to solve for survivorship bias. From there, we extract anomalous behavior and capture that data to build a new prediction model that doesn’t suffer from the same survivorship bias.
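To make the idea of a point-in-time universe concrete, here is a minimal sketch that assumes listing and delisting dates are available for every ticker, including ones that no longer trade. It is not Boosted.ai’s implementation: the `Listing` type, both helper functions, the tickers and the dates are hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional, Set


@dataclass(frozen=True)
class Listing:
    ticker: str
    listed: date
    delisted: Optional[date] = None   # None means still trading today


def universe_as_of(listings: List[Listing], as_of: date) -> Set[str]:
    """Tickers that were tradable on `as_of`, including ones delisted later."""
    return {item.ticker for item in listings
            if item.listed <= as_of
            and (item.delisted is None or as_of < item.delisted)}


def survivors_only(listings: List[Listing]) -> Set[str]:
    """The biased universe: only tickers that are still active today."""
    return {item.ticker for item in listings if item.delisted is None}


# Hypothetical listings; BBB is later delisted, CCC lists after the backtest date.
listings = [
    Listing("AAA", date(2005, 1, 3)),
    Listing("BBB", date(2008, 6, 2), delisted=date(2016, 3, 1)),
    Listing("CCC", date(2017, 9, 1)),
]

universe_2015 = universe_as_of(listings, date(2015, 1, 2))
biased = survivors_only(listings)
```

For a backtest date in early 2015, the point-in-time universe contains AAA and BBB, while the survivors-only universe contains AAA and CCC. The biased version drops a company that later delisted and includes one that did not yet trade, so a model trained on it never sees the failures it needs to learn from.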
What’s Next for Alternative Data: Demystifying the ‘Black Box’
The biggest obstacle holding back broader adoption of machine learning in investment management is the ‘black box’ nature of advanced and modern machine learning techniques.
Once a machine learns something from underlying data, identifying what it learned and why it did so is critical to avoiding bias. Boosted.ai’s human-plus-machine approach is explainable: it allows us to understand where data comes from and why the machine makes the predictions it does and, ultimately, to avoid bias derived from incorrect underlying data. This is yet another critical step in adapting alternative data for asset management in a clean and explainable way.