
Credit Risk Classification with Logistic & Random Forest Models

This analysis explores patterns in consumer creditworthiness using the well-known German Credit dataset from the R package caret, which includes detailed financial and demographic information on 1,000 loan applicants. Each applicant is classified as either having good or bad credit risk based on their loan repayment history. The analysis begins with exploratory visualizations to identify differences in characteristics such as age, loan duration, and residential stability across credit risk groups. A logistic regression model is then used to estimate the odds of good credit based on a mix of demographic characteristics, financial indicators, loan details, and employment history. This approach provides interpretable relationships between each predictor and the likelihood of a favorable credit outcome. Finally, a machine learning approach known as a random forest model is applied to assess predictive performance and variable importance, using permutation-based accuracy metrics to identify the most influential features in classifying credit risk. The logistic and random forest models serve complementary purposes: one helps interpret which factors are associated with credit outcomes, while the other focuses on prediction and ranking the strongest predictors.
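
For readers who want to follow along, a minimal R sketch of loading and inspecting the data is shown below. It assumes the GermanCredit data frame bundled with caret, whose Class factor labels each applicant as Good or Bad; the column names shown are taken from that version of the dataset and may need adjusting.

# Load the German Credit data bundled with the caret package
library(caret)
data(GermanCredit)

# Outcome balance: the classic split is 700 Good vs. 300 Bad applicants
table(GermanCredit$Class)

# A few of the variables used throughout the analysis
summary(GermanCredit[, c("Age", "Duration", "Amount", "ResidenceDuration")])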

Initial Insights: Age, Loan Duration, and Residence

The plots below compare credit risk groups on three critical variables:

  • Age

  • Loan duration

  • Years at current residence

Each chart overlays violin plots with box plots to show both distribution and central tendency. The violin plot’s width at any point reflects the density of observations; wider sections indicate more individuals with values in that range, while narrower sections indicate fewer.
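
One way such a panel might be built in R with ggplot2 is sketched below, assuming the Age and Class columns from caret's GermanCredit; the other panels would simply swap in Duration or ResidenceDuration.

library(caret)
library(ggplot2)

data(GermanCredit)

# Violin layer shows the full distribution; a narrow box plot adds the
# median and interquartile range on top
ggplot(GermanCredit, aes(x = Class, y = Age, fill = Class)) +
  geom_violin(trim = FALSE, alpha = 0.6) +
  geom_boxplot(width = 0.15, outlier.shape = NA) +
  labs(title = "Age by Credit Risk Group",
       x = "Credit risk", y = "Age (years)") +
  theme_minimal() +
  theme(legend.position = "none")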

Applicants with bad credit tend to be younger, on average, and are more concentrated in the lower age range. Loans associated with bad credit also tend to have longer durations, with a clear concentration of borrowers in the higher end of the loan term distribution. In contrast, there appears to be no meaningful relationship between credit risk and length of residence, suggesting that residential stability does not differentiate between good and bad credit applicants in this dataset.

Bivariate Insights into Credit Risk: The Roles of Age, Loan Duration, and Residential Stability

[Figure: credit_violin_plots.png]

Explaining Credit Outcomes with Logistic Regression

The logistic regression model estimates the probability of a good credit outcome from the following predictors: age, whether the loan was used for a car purchase, homeownership status, loan duration, years at current residence, number of existing credits, loan amount, indicators for low checking and savings balances, long-term employment (over seven years), and foreign worker status. The resulting odds ratios represent the multiplicative change in the odds of having good credit associated with a one-unit increase in each predictor, holding all else constant. For example, homeowners have 57% higher odds of good credit than non-homeowners, suggesting a positive association between housing stability and creditworthiness. In contrast, applicants with less than $200 in their checking account have 78% lower odds of good credit, highlighting the importance of liquid assets. The coefficient plot below displays odds ratios for only the statistically significant predictors, visualizing the direction and strength of each association and offering a more intuitive view of how each factor relates to credit risk.
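
A rough sketch of how a model along these lines could be fit in R is shown below. The predictor names follow caret's dummy-variable coding (for example, Housing.Own and CheckingAccountStatus.lt.0) and stand in for, rather than reproduce, the exact specification described above.

library(caret)
data(GermanCredit)

# Model the probability that an applicant has good credit
GermanCredit$GoodCredit <- ifelse(GermanCredit$Class == "Good", 1, 0)

# Illustrative specification; predictor names use caret's dummy coding
# and may differ from the exact variables in the fitted model
logit_fit <- glm(
  GoodCredit ~ Age + Duration + Amount + ResidenceDuration +
    NumberExistingCredits + Housing.Own + Purpose.NewCar +
    CheckingAccountStatus.lt.0 + SavingsAccountBonds.lt.100 +
    EmploymentDuration.gt.7 + ForeignWorker,
  family = binomial(link = "logit"),
  data = GermanCredit
)

# Odds ratios (exponentiated coefficients) and Wald 95% intervals
exp(coef(logit_fit))
exp(confint.default(logit_fit))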

Odds Ratios for Statistically Significant Predictors

[Figure: coefplot.png]

Machine Learning with Random Forests

To complement the logistic regression model, a random forest classifier was used to evaluate predictive performance and identify the most influential variables in classifying credit risk. Unlike logistic regression, which estimates interpretable relationships between individual predictors and the outcome, random forests excel at capturing complex, nonlinear patterns and interactions across variables without requiring strong parametric assumptions. Variable importance was assessed using permutation-based accuracy, which measures how much model performance declines when each predictor is randomly shuffled. The most influential variable was the indicator for less than $200 in checking, which had the highest importance score (42.6), indicating that shuffling this feature has a substantial negative effect on predictive accuracy. This underscores the strong predictive power of low liquid balances in determining credit risk. One advantage of this approach over traditional regression is that it enables direct comparison of variable importance across predictors measured on different scales, such as binary flags and continuous variables. The random forest analysis provides a more holistic view of predictive influence, helping prioritize variables that may be critical in automated credit scoring systems, even if they are less interpretable in isolation.
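
A minimal sketch of this step is shown below, assuming the randomForest package and its permutation-based MeanDecreaseAccuracy measure; the seed, tree count, and formula are illustrative rather than the exact settings behind the reported 42.6 score.

library(caret)
library(randomForest)

data(GermanCredit)
set.seed(42)  # illustrative seed for reproducibility

# importance = TRUE stores permutation-based accuracy importance
rf_fit <- randomForest(
  Class ~ .,
  data = GermanCredit,
  ntree = 500,
  importance = TRUE
)

# Mean decrease in accuracy when each predictor is randomly shuffled
imp <- importance(rf_fit, type = 1)
head(imp[order(imp[, 1], decreasing = TRUE), , drop = FALSE], 10)

# Dot chart of the top predictors by permutation importance
varImpPlot(rf_fit, type = 1, n.var = 10)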

Variable Importance (Permutation Accuracy)

[Figure: rf.png]