An empirical analysis on 80,000 annual observations of listed companies demonstrates the primacy of the economic cycle and sectoral exposure in predicting defaults
Macro-Sectoral Factors Dominate Financial Metrics
Introduction: Explainability, an imperative for credit and investment decisions
In M&A, private equity and corporate financing operations, the ability to anticipate the default risk of a target company determines the quality of decision-making. Traditional credit analysis methods rely on standardized financial ratios (leverage, interest coverage, liquidity), supplemented by a qualitative assessment of the sector and macroeconomic context.
Machine learning now offers the possibility to quantify these intuitions and objectively prioritize risk factors. But in a regulated environment (ECB/EBA directives on internal risk models) and facing multiple stakeholders (boards of directors, investment committees, regulators), the requirement for interpretability is central: a “black box” model is not auditable, therefore not usable.
This study applies explainable machine learning techniques to a dataset of 80,000 company-year observations to identify the true drivers of corporate default risk.
Methodology: Dataset, models and metrics
Data scope
The dataset covers 80,000 company-year observations from companies listed on Nasdaq and NYSE between 2000 and 2018. Each observation integrates:
Over 20 financial indicators: long-term debt, EBITDA, EBIT, net revenue, gross margin, total assets, accounts receivable, equity, operating cash flow, etc.
Detailed sectoral classifications: division (macro-sector), majorgroup (micro-sector), allowing fine granularity of industrial exposure.
Target variable: default status (binary), with a strong class imbalance typical of real credit data (defaults represent less than 5% of observations).
Algorithms used
Several model families were tested and compared:
Clustering (K-Means, DBSCAN) to identify homogeneous company profiles in terms of risk and understand the latent structure of the data.
Random Forest, a robust ensemble algorithm capable of handling complex interactions between variables and providing native variable importance.
LightGBM, a gradient boosting algorithm optimized for large datasets and performing well on imbalanced classes through weighting techniques.
Neural networks (multilayer perceptron) to capture subtle non-linear relationships and test the added value of more complex models.
SMOTE (Synthetic Minority Over-sampling Technique), a resampling step rather than a model family, to correct class imbalance by synthetically generating default observations, thus limiting model bias toward the majority class.
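To make the SMOTE idea concrete, here is a minimal sketch using only the Python standard library. It is not the study's code: the `smote` function, the two features (leverage, interest coverage) and all values are hypothetical, and a real project would more likely rely on a dedicated library.

```python
import math
import random


def smote(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` among the other minority samples
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position on the segment base -> neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Hypothetical default observations: (leverage, interest coverage)
defaults = [(6.0, 0.8), (7.5, 0.5), (5.5, 1.1), (8.0, 0.4)]
new_points = smote(defaults, n_new=6)
print(new_points)
```

Each synthetic point lies on a segment between two real default observations, so it stays inside the region already occupied by the minority class. Resampling of this kind is normally applied to the training split only, so that the test set keeps the real default rate.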
Metrics: AUC and Precision-Recall
In a context of imbalanced classes, overall accuracy is a misleading metric. A naive model systematically predicting “no default” would achieve 95% accuracy if only 5% of companies default, while being completely useless.
Two metrics were therefore prioritized:
AUC-ROC (Area Under the Receiver Operating Characteristic Curve): measures the model’s ability to discriminate between healthy and at-risk companies, regardless of the chosen decision threshold. An AUC of 0.85+ indicates good discriminative performance.
Precision-Recall Curve: emphasizes effective detection of defaults. Precision measures the proportion of true alerts among all alerts issued, while recall measures the proportion of defaults actually detected. In an M&A screening context, missing a default (false negative) is more costly than a false alert (false positive).
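The sketch below illustrates these metrics on a toy portfolio of 40 companies with 2 defaults (5%). The scores are invented for illustration and the helper functions are hypothetical, written with the standard library only.

```python
def roc_auc(y_true, scores):
    """AUC as the probability that a random default is scored above a
    random non-default (ties count for one half)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))


def precision_recall(y_true, scores, threshold):
    """Precision and recall when every score >= threshold raises an alert."""
    alerts = [s >= threshold for s in scores]
    tp = sum(1 for y, a in zip(y_true, alerts) if y == 1 and a)
    fp = sum(1 for y, a in zip(y_true, alerts) if y == 0 and a)
    fn = sum(1 for y, a in zip(y_true, alerts) if y == 1 and not a)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# 38 healthy companies followed by 2 defaults; scores are made up
y_true = [0] * 38 + [1] * 2
scores = [0.05] * 30 + [0.20] * 5 + [0.60] * 3 + [0.70, 0.40]

# Naive model: always predicts "no default"
naive_accuracy = sum(1 for y in y_true if y == 0) / len(y_true)
naive_recall = 0.0  # it never raises an alert, so it detects no default

auc = roc_auc(y_true, scores)
strict = precision_recall(y_true, scores, threshold=0.5)
lenient = precision_recall(y_true, scores, threshold=0.3)
print(naive_accuracy, auc, strict, lenient)
```

On this toy data the naive model reaches 95% accuracy while detecting no default at all. Lowering the alert threshold from 0.5 to 0.3 raises recall from 0.5 to 1.0, which is the direction a screening process would favor when a missed default costs more than a false alert.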
Explainability via SHAP (SHapley Additive exPlanations)
SHAP is a method based on game theory that assigns each variable a marginal contribution to the prediction, for each observation. Concretely, SHAP breaks down the model’s prediction by indicating which factors push toward high or low risk, and to what extent.
This transparency addresses three critical requirements:
Regulatory compliance: ECB/EBA directives require that internal credit risk models be validatable and interpretable.
Decision-maker trust: a board of directors or investment committee must understand why a target is deemed risky, beyond a simple probabilistic score.
Actionability: identify levers for risk reduction (financial restructuring, sectoral diversification, timing of entry into the cycle).
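The Shapley attribution described above can be sketched by brute force on a tiny example. This is an illustration under stated assumptions, not the study's pipeline: `toy_model`, its coefficients and the feature names are hypothetical, and features outside a coalition are simply replaced by a baseline value. Libraries use faster algorithms for tree ensembles; the enumeration below is only practical for a handful of features.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, x, baseline):
    """Exact Shapley values of model(x) relative to model(baseline).
    Features outside a coalition take their baseline value."""
    names = list(x)
    n = len(names)

    def value(coalition):
        point = {f: (x[f] if f in coalition else baseline[f]) for f in names}
        return model(point)

    phi = {}
    for f in names:
        others = [g for g in names if g != f]
        total = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {f}) - value(s))
        phi[f] = total
    return phi


def toy_model(p):
    """Hypothetical default score, for illustration only."""
    score = 0.02
    score += 0.10 * p["recession_year"]
    score += 0.05 * p["cyclical_sector"]
    score += 0.08 * p["recession_year"] * p["cyclical_sector"]  # interaction
    score += 0.01 * p["leverage"]
    return score


x = {"recession_year": 1, "cyclical_sector": 1, "leverage": 4.0}
baseline = {"recession_year": 0, "cyclical_sector": 0, "leverage": 2.0}
phi = shapley_values(toy_model, x, baseline)
print(phi)
```

The contributions sum to the gap between the prediction for this company and the baseline prediction, and the interaction term is split equally between the two features involved. That additivity is what lets a committee read a score as "baseline risk plus the effect of each factor".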
Empirical results: The hierarchy of risk drivers
The SHAP analysis, aggregated over the entire dataset for the best-performing algorithm, reveals the following hierarchy of the ten main default risk drivers (ranked by mean absolute SHAP value):
1. fyear (fiscal year) — dominant factor
2. division (macro sectoral classification)
3. majorgroup (micro sectoral classification)
4. Long-term debt
5. Earnings before interest, taxes, depreciation and amortization
6. Earnings before interest and taxes
7. Net revenue
8. Gross margin
9. Total assets
10. Total receivables
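A ranking of this kind is typically obtained by averaging the absolute SHAP value of each variable across observations. The sketch below shows the aggregation on three invented observations; the numbers and the helper name are hypothetical and do not come from the study.

```python
def rank_by_mean_abs_shap(shap_rows):
    """shap_rows: one dict per observation, mapping feature -> SHAP value.
    Returns (feature, mean |SHAP|) pairs, most important first."""
    features = list(shap_rows[0])
    importance = {
        f: sum(abs(row[f]) for row in shap_rows) / len(shap_rows)
        for f in features
    }
    return sorted(importance.items(), key=lambda kv: kv[1], reverse=True)


# Made-up SHAP values for three observations
rows = [
    {"fyear": 0.12, "division": -0.06, "long_term_debt": 0.03},
    {"fyear": -0.10, "division": 0.08, "long_term_debt": -0.02},
    {"fyear": 0.08, "division": 0.04, "long_term_debt": 0.01},
]
ranking = rank_by_mean_abs_shap(rows)
print(ranking)
```

Taking absolute values before averaging matters: a variable such as fyear pushes risk up in some years and down in others, so a signed average would partly cancel out and understate its influence.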

Interpretation: macro and sector before financial metrics
This result is counter-intuitive compared to traditional credit analysis practice, which often places financial ratios (leverage, profitability, liquidity) in the foreground. Here, empirical data reveal a different reality: macroeconomic context and sectoral exposure are the primary determinants of default risk, before the company’s intrinsic financial health.
1. Fiscal year (fyear): the economic cycle as dominant driver
Fiscal year captures economic cycle effects over the 2000-2018 period: 2008-2009 financial crisis, 2015-2016 slowdown, post-2010 expansion. A perfectly managed company, with solid financial ratios, can default if it operates in a cyclical sector at the wrong time in the cycle.
Implication for M&A operations: Entry timing in a transaction is crucial. Acquiring a target at the top of the cycle in a cyclical sector (construction, commodities, transport) exposes the acquirer to significantly higher default risk than an acquisition at the bottom of the cycle or in a counter-cyclical sector.
Empirical illustration: Sample companies that defaulted in 2009-2010 had comparable financial profiles to those that survived, but operated in sectors more sensitive to the cycle (retail, construction, finance). Fiscal year alone explains a substantial part of default variance.
2. Sectoral classifications (division, majorgroup): industrial exposure structures risk
Certain sectors are structurally riskier than others, regardless of the management quality of the companies operating in them. The data confirm this intuition:
High-risk sectors: Industries where default lurks
Coal mining: Average probability ~0.55. This sector is highly dependent on commodity prices, which fluctuate with global demand and energy transition to renewables. Add strict environmental regulations (e.g., emission standards) and high costs for retired miners’ health, and you get a recipe for bankruptcies, as seen with several US companies in the 2010-2020 period.


Local passenger transit: Average probability ~0.50. Strong competition from services like Uber or personal vehicles, combined with low margins due to fixed costs (fleet maintenance) and demand sensitive to crises (e.g., post-COVID collapse). Insufficient public subsidies and fuel inflation worsen vulnerability, leading to cascading bankruptcies in urban transport.
Water transportation services: Average probability ~0.45. Weather hazards (storms, floods) disrupt operations, while global supply chains expose operators to additional costs (unpaid fuel, as in the Hanjin case in 2016). Fleet overcapacity and dependence on international trade amplify risks during economic slowdowns.


Oil and gas extraction: Average probability ~0.40. Extreme dependence on oil prices, volatile due to geopolitics (e.g., wars or OPEC) and green transition. Operational risks (accidents, exploration costs) and high debts for infrastructure often lead to restructurings, as during the price collapse in 2020.
Textile mill products: Average probability ~0.35. Fierce international competition (low-cost Asia), volatile supply chains (raw materials like cotton) and pressures for sustainability (e.g., failures like Renewcell in 2024 due to high recycling costs). Fashion demand fluctuations and raw material inflation precipitate bankruptcies.

Resilient sectors: Industries that better resist crises

Non-depository credit institutions: Average probability ~0.20. Resilient thanks to strict federal regulation limiting excessive exposures, and adaptation to economic cycles through products like alternative loans. Less impacted by traditional banking crises, they benefit from favorable monetary policies.
Leather products: Average probability ~0.18. Innovation in sustainability (e.g., eco-responsible finishes) and local supply chains reduce vulnerabilities. The sector benefits from niche demand (luxury, craftsmanship) and regulations favoring quality, making bankruptcies less frequent despite competition.


Railroad transportation: Average probability ~0.15. Supported by massive public subsidies for infrastructure (e.g., US investments of $23 billion annually) and 1980 deregulation that boosted efficiency. Essential logistics (freight) ensures resilience to shocks, unlike more volatile modes.
Legal services: Average probability ~0.12. Constant demand, even increased in recession (bankruptcies, restructurings, litigation). Lawyers benefit from high barriers to entry (qualifications) and rapid adaptation to crises, as seen during recessions when insolvency practices explode.


Fishing and hunting: Average probability ~0.10. Stable regulations (quotas, aid via Chapter 12 for farmers/fishermen) and manageable seasonality with local niches. Despite environmental risks, the sector resists via subsidies and constant food demand, avoiding massive bankruptcies.
The AI model highlights how factors like price volatility or public support influence bankruptcy risk.
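Sector-level figures like those above can be read as the mean predicted default probability of the companies in each sector. That reading is an assumption about how the averages were produced, and the records below are invented; the sketch only shows the aggregation step.

```python
from collections import defaultdict


def average_by_sector(records):
    """records: iterable of (sector, predicted default probability)."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for sector, prob in records:
        sums[sector] += prob
        counts[sector] += 1
    return {sector: sums[sector] / counts[sector] for sector in sums}


# Hypothetical model outputs for four company-year observations
records = [
    ("Coal mining", 0.60),
    ("Coal mining", 0.50),
    ("Legal services", 0.10),
    ("Legal services", 0.14),
]
sector_risk = average_by_sector(records)
print(sector_risk)
```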
Implication for target screening: Before analyzing a target’s financial statements, it is imperative to qualify its sectoral exposure. A target with high EBITDA in a structurally declining sector presents higher risk than a target with average EBITDA in a growing sector.
3. Financial metrics: important, but conditioned by context
Financial ratios (LT debt, EBITDA, EBIT, revenue, gross margin, assets, receivables) come after macro-sectoral factors in the SHAP hierarchy. This doesn’t mean they are negligible, but that their impact is conditioned by context.
Long-term debt: Its relative importance (4th SHAP position) confirms that refinancing risk and creditor pressure remain central. However, an indebted company in a resilient sector (utilities) presents lower risk than a lightly indebted company in a crisis sector.
EBITDA, EBIT, gross margin: These operational profitability indicators signal management quality and competitive positioning, but their predictive power is modulated by cycle and sector.
Total assets and accounts receivable: These size and cash conversion cycle variables capture liquidity and working capital management effects, particularly critical during macroeconomic stress phases.
Implications for M&A practice
1. Due diligence: qualify macro-sectoral risk before financial analysis
Results suggest a revised analytical sequence for target screening:
Step 1 – Macro-sectoral qualification: Identify exposure to economic cycle (cyclical vs resilient sector) and sectoral positioning (growth, mature, declining sector).
Step 2 – Conditional financial analysis: Interpret financial ratios in light of macro-sectoral context. A 5x leverage may be acceptable in a stable sector with predictable cash flows (utilities, infrastructure), but prohibitive in a cyclical sector (construction, commodities).
Step 3 – Contextualized stress tests: Model stress scenarios integrating sectoral shocks (sectoral demand decline, technological disruption) and macro (recession, rate hikes), not just idiosyncratic shocks (key client loss, litigation).
2. Valuation: adjust risk premiums according to macro-sectoral exposure
In DCF (Discounted Cash Flow) valuation models used in M&A, the discount rate (WACC) integrates a risk premium. SHAP results suggest this premium should be calibrated according to macro-sectoral exposure, beyond traditional sectoral beta.
Example: A target in “division X” sector (identified as high-risk by SHAP) should see its WACC increased by 100-200 basis points compared to a comparable target in a resilient sector, even if financial ratios are similar.
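To give an order of magnitude, the sketch below values the same hypothetical cash flows with a base WACC and with a WACC raised by 150 basis points. The cash flows, the 8% base rate and the 2% terminal growth are illustrative assumptions, and `dcf_value` is a simplified helper, not a full valuation model.

```python
def dcf_value(cash_flows, wacc, terminal_growth):
    """Present value of explicit cash flows plus a Gordon-growth
    terminal value discounted from the last explicit year."""
    pv = sum(cf / (1 + wacc) ** t for t, cf in enumerate(cash_flows, start=1))
    terminal = cash_flows[-1] * (1 + terminal_growth) / (wacc - terminal_growth)
    return pv + terminal / (1 + wacc) ** len(cash_flows)


cash_flows = [100.0] * 5  # hypothetical free cash flows, years 1 to 5

base = dcf_value(cash_flows, wacc=0.08, terminal_growth=0.02)
# Same target, but in a sector flagged as high-risk: WACC + 150 bp
stressed = dcf_value(cash_flows, wacc=0.095, terminal_growth=0.02)
discount = 1 - stressed / base
print(round(base, 1), round(stressed, 1), round(discount, 3))
```

With these made-up inputs, the 150 basis point premium lowers the valuation by roughly a fifth, which shows how sensitive a DCF is to the risk premium when most of the value sits in the terminal value.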
3. Transaction structuring: protection clauses and earn-out
When SHAP analysis reveals strong exposure to cycle and sector, transaction structuring can integrate protection mechanisms:
Price adjustment clauses: Indexation of final price on macro-sectoral indicators (sectoral index, GDP growth, commodity prices).
Conditional earn-out: Deferred payment linked to target performance, but adjusted for macro-sectoral effects (EBITDA normalization relative to sectoral average).
Warranties & indemnifications: Enhanced warranties on macro-sectoral assumptions underlying the business plan.
4. Post-acquisition monitoring: predictive risk dashboards
For investment portfolios (private equity, corporate ventures), SHAP models enable construction of dynamic risk dashboards:
Macro-sectoral alerts: Early detection of sectoral context deterioration (sectoral sentiment index, disruption signals).
Evolving predictive scoring: Quarterly update of default risk score integrating new financial data AND macro-sectoral evolutions.
Intervention prioritization: Concentration of value creation efforts on holdings exposed to stressed sectors, before financial metrics deteriorate.
Model performance: Random Forest and LightGBM leading
Algorithm comparison
On the test set, discriminative performances (AUC-ROC) are as follows:
Random Forest: AUC = 0.87 | High robustness, native variable importance, explainable via SHAP.
LightGBM: AUC = 0.81 | Lower performance than Random Forest, fast training, correct handling of imbalanced classes.
Neural networks: AUC = 0.80 | Performance comparable to LightGBM, but higher complexity and lower interpretability.
Clustering + rules: AUC = 0.79 | Descriptive approach useful for portfolio segmentation, but lower predictive performance.
The performance-explainability tradeoff
Random Forest offers the best tradeoff: the highest AUC among the models tested (0.87), reduced overfitting risk, ease of hyperparameter optimization, and explainability via SHAP.
Limitations and enrichment perspectives
Methodological limitations
Temporal scope: The dataset ends in 2018, before the COVID-19 pandemic, post-2021 inflation, recent geopolitical tensions, and accelerated energy transition. Macro-sectoral patterns have evolved.
Geographic scope: Nasdaq and NYSE only. Risk dynamics differ significantly in Europe (different regulation, financing structures), Asia or emerging markets.
Missing variables: Critical qualitative factors not captured in structured financial statements: management quality, governance, ongoing litigation, ESG exposure, patents and R&D, customer-supplier concentration.
Enrichment pathways
Alternative data: Integration of textual data (NLP on annual reports, earnings call transcripts, sectoral press) to capture market sentiment and early deterioration signals.
ESG data: ESG scores emerge as resilience predictors (solid governance, climate risk management). Their integration would improve prediction on transitioning sectors (energy, automotive).
Dynamic models: Prediction of risk trajectories (temporal evolution of score) rather than annual snapshots. Useful for anticipating progressive deteriorations.
Sectoral specialization: Training dedicated models by industrial division to capture specific business logic (for example, risk in aviation sector strongly depends on kerosene price and passenger traffic, variables absent from the generic model).
Conclusion: Toward AI-augmented due diligence
This empirical study on 80,000 company-year observations demonstrates that corporate default risk is understood first through the lens of economic cycle and sectoral exposure, before traditional financial metrics. This result, counter-intuitive compared to standard credit analysis practice, is robust across several machine learning model families.
For M&A and private equity operations, the implication is strategic: macro-sectoral risk qualification must precede financial analysis, and valuations must integrate risk premiums calibrated on these factors.
Model explainability via SHAP is not a technical luxury, but an operational imperative: it transforms probabilistic predictions into actionable insights for investment committees, boards of directors, and regulators.
Explainable AI does not replace business expertise, it augments it: it quantifies intuitions, reveals patterns invisible to the naked eye, and provides a rigorous framework for decision-making in complex and regulated environments.
This project is the opposite of a black box. The source code is accessible on GitHub.