A Detailed Guide to CECL Data Collection

Part Two in a Two-Part Series on CECL Data Requirements


Even with CECL compelling banks to collect more internal loan data, we continue to emphasize profitability as the primary benefit of robust, proprietary, loan-level data. Make no mistake, the data template we outline below is for CECL modeling. CECL compliance, however, is a prerequisite to profitability. Also, while third-party data may suffice for some components of the CECL estimate, especially in the early years of implementation, reliance on third-party data can drag down profitability. Third-party data is often expensive to buy, may be unsatisfactory to an auditor, and can yield less accurate forecasts. Inaccurate forecasts mean volatile loss reserves and excessive capital buffers that dilute shareholder profitability. An accurate forecast built on internal data not only solves these problems but can also be leveraged to optimize loan screening and loan pricing decisions.

Below is a detailed table of data fields to collect. You should collect this dataset whether you plan to build credit models yourself or hire a vendor. A good vendor would expect a dataset like the one outlined below.

The table is not exhaustive for every asset class and circumstance, but covers the basics and then some, and is plenty serviceable for CECL modeling. A regional bank that had collected and preserved these data fields at a loan level over this past business cycle would be in a league of its own. In our new CECL world, it would hold a data asset worth perhaps more than its loan portfolio.

The following variables are useful in building probability of default, loss severity, and/or prepayment models – the models that inform credit loss forecasts. Note that variables in italics can be calculated from the collected variables.

Below the table are a few important notes.

Loan-Level Data to Collect and Preserve for CECL Modeling

Preserve the full time series: All data should be preserved, beginning with the origination data. Each new month of data should add to, not overwrite, prior data. If your loan servicing system cannot accommodate this, other reasonably priced databases are available. Without the full time series of data, you cannot establish delinquency roll rates, the impact of macroeconomics on delinquency transitions, or prepayment patterns. In short, you lose accuracy. We have previously written that datasets should span at least ten years.
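To make the point about appending rather than overwriting concrete, here is a minimal sketch of how preserved monthly snapshots feed a delinquency roll-rate estimate. The function names (`append_snapshot`, `roll_rates`), the state labels, and the three sample loans are all hypothetical, and the sketch assumes one snapshot per consecutive month keyed as `YYYY-MM`. A production dataset would carry far more fields and far more history.

```python
from collections import Counter, defaultdict


def append_snapshot(history, month, snapshot):
    """Add one month of loan states to the history; never overwrite a prior month."""
    if month in history:
        raise ValueError(f"snapshot for {month} already stored; refusing to overwrite")
    history[month] = dict(snapshot)


def roll_rates(history):
    """Estimate one-month delinquency transition rates from consecutive snapshots."""
    counts = defaultdict(Counter)
    months = sorted(history)  # 'YYYY-MM' keys sort chronologically
    for prev, curr in zip(months, months[1:]):
        for loan_id, state in history[prev].items():
            if loan_id in history[curr]:
                counts[state][history[curr][loan_id]] += 1
    return {
        start: {end: n / sum(ends.values()) for end, n in ends.items()}
        for start, ends in counts.items()
    }


# Hypothetical sample: loan ID -> delinquency state at month end
history = {}
append_snapshot(history, "2023-01", {"A": "current", "B": "current", "C": "30dpd"})
append_snapshot(history, "2023-02", {"A": "current", "B": "30dpd", "C": "60dpd"})
append_snapshot(history, "2023-03", {"A": "current", "B": "current", "C": "default"})

rates = roll_rates(history)
```

If only the latest month had been kept, `roll_rates` would have no consecutive pairs to compare and would return an empty result, which is the loss of accuracy described above.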

Preserve original credit characteristics: A CECL model needs to forecast credit losses long into the future, based on credit characteristics available at the time the model is run. It does us little good to learn relationships between today’s FICO scores and short-term default probabilities over the next twelve months. For the most part, our model must predict default based on static credit characteristics, with dynamism entering the model through the macroeconomic inputs, which might assume improving, deteriorating, or stable conditions. An exception exists where the credit characteristic itself can be reasonably predicted, as is the case with LTV on the basis of real estate indices.
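The LTV exception can be illustrated with a short sketch. It assumes the bank has preserved the original property value and can look up a real estate price index at origination and at the run date. The function name `indexed_ltv` and all figures are hypothetical, and scaling the value by the index ratio is one simple approach, not the only one.

```python
def indexed_ltv(balance, original_value, index_at_origination, index_now):
    """Estimate current LTV by scaling the original property value with a price index."""
    if original_value <= 0 or index_at_origination <= 0:
        raise ValueError("original value and origination index must be positive")
    estimated_value = original_value * index_now / index_at_origination
    return balance / estimated_value


# Hypothetical loan: the index has risen 10% since origination
ltv = indexed_ltv(
    balance=180_000,
    original_value=250_000,
    index_at_origination=200.0,
    index_now=220.0,
)
```

Feeding a forecast index path into the same calculation gives a projected LTV for each future period, which is what lets this one characteristic vary over the forecast horizon.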

Capture updated credit characteristics: When in doubt, capture the data. We noted in the prior bullet point that updated credit characteristics may not always be useful, especially if they are not captured systematically and regularly across the portfolio. But your credit modeler might discover that a better model can be built using “most recent” credit characteristics rather than original credit characteristics in certain cases. Also, updated credit characteristics can be useful for portfolio segmentation.

Notes on specific variables: The reasons for collecting some of the variables will be apparent. Here are the less self-explanatory CECL data requirements and the reasons for collecting these variables:

Payment date lets us match loan outcomes with macroeconomic factors and calculate loan age, an important explanatory variable.

Loan age is useful in establishing default probabilities, as most assets exhibit different default probabilities at different stages of their life.

Interest rate information is useful in confirming scheduled payment and as an explanatory variable in default and prepayment models. Loans exhibit higher default and prepayment probabilities, all else equal, when charged higher interest rates.

Scheduled payment lets us calculate prepayment and underpayment.

TDRs and modifications inform default probabilities as they are signs of distress.

Outstanding principal balance at end of period is useful to confirm scheduled payment, to measure loss severity, and possibly as an explanatory variable in prepayment and credit models.
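Several of the variables in these notes can be derived from the collected fields. The sketch below shows one way to compute loan age, a scheduled payment, and the prepayment or underpayment amount. It assumes a fully amortizing, fixed-rate loan with monthly payments, which will not fit every asset class. The function names and sample figures are hypothetical.

```python
from datetime import date


def loan_age_months(origination, payment_date):
    """Whole calendar months between origination and the payment date."""
    return (payment_date.year - origination.year) * 12 + (
        payment_date.month - origination.month
    )


def scheduled_payment(balance, annual_rate, remaining_months):
    """Level monthly payment for a fully amortizing fixed-rate loan (annuity formula)."""
    r = annual_rate / 12
    if r == 0:
        return balance / remaining_months
    return balance * r / (1 - (1 + r) ** -remaining_months)


def payment_variance(actual, scheduled):
    """Positive values indicate prepayment; negative values indicate underpayment."""
    return actual - scheduled


# Hypothetical loan: $100,000 at 6% with 360 months remaining
age = loan_age_months(date(2020, 3, 15), date(2023, 1, 1))
pmt = scheduled_payment(100_000, 0.06, 360)
excess = payment_variance(actual=800.00, scheduled=pmt)
shortfall = payment_variance(actual=400.00, scheduled=pmt)
```

Comparing the computed payment against the servicing system's recorded scheduled payment is also a quick check on the interest rate and balance fields.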

Why No Risk Ratings? It wouldn’t hurt to include risk ratings in your monthly data, and if you are committed to building a risk ratings migration model, you would need them. We prefer delinquency state transition models, however, because they are objective. Risk ratings are either subjective or else an amalgamation of metrics that could be disaggregated and modeled individually. For most banks, a risk rating is akin to a prediction – it ranks the likelihood of future defaults or the magnitude of forecasted losses. Predicting a future risk rating is thus like predicting a prediction. It is both more useful and more doable to predict objective outcomes, which is why we prefer to model based on delinquency status.

Don’t hesitate to get in touch if you have thoughts or questions about CECL data collection.