Loan Approval Prediction

The GitHub repository for this project can be viewed here.

The Loan Approval Prediction project leverages various machine learning models to predict whether a loan application will be approved based on applicant details such as income, loan amount, marital status, education, credit history, and employment type.

We performed extensive exploratory data analysis (EDA), handled missing values, encoded categorical variables, and compared multiple algorithms including Logistic Regression, Random Forest, Gradient Boosting, XGBoost, LightGBM, and CatBoost.

After hyperparameter tuning, the Random Forest Classifier offered the best overall balance: the highest ROC-AUC (0.7814), with accuracy and F1-score within 0.001 of the tuned CatBoost model.


📂 Dataset

Dataset: Loan Prediction Dataset – Kaggle


🔍 Project Workflow

  1. Data Preprocessing

    • Filled missing categorical features with the mode.
    • Filled missing numerical features with the median.
    • One-hot encoded categorical variables.
    • Cleaned feature names for compatibility with ML libraries.
  2. Exploratory Data Analysis (EDA)

    • Visualized distributions of categorical and numeric variables.
    • Analyzed the target variable distribution.
    • Checked correlations between numeric features.
  3. Model Comparison

    • Baseline: Logistic Regression.
    • Tree-based models: Random Forest, Gradient Boosting, XGBoost, LightGBM, CatBoost.
    • Evaluation metrics: Accuracy, F1-score, ROC-AUC.
  4. Hyperparameter Tuning

    • Tuned Random Forest, CatBoost, and XGBoost using GridSearchCV.
  5. Final Model

    • Tuned Random Forest selected as the final model.
    • Accuracy: 0.8143
    • F1-score: 0.8779
    • ROC-AUC: 0.7814
  6. Evaluation

    • Cross-validation metrics.
    • Confusion matrix visualization.
    • Feature importance analysis.
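The preprocessing and tuning steps in the workflow can be sketched roughly as follows. This is a minimal illustration, not the repository's actual code: the toy columns (`income`, `loan_amount`, `credit_history`, `education`, `dependents`, `approved`), the `preprocess` helper, the parameter grid, and the `roc_auc` scoring choice are all assumptions made for the example. Only the techniques (mode/median imputation, one-hot encoding, feature-name cleaning, `GridSearchCV` on a Random Forest) come from the list above.

```python
import re

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV


def preprocess(df: pd.DataFrame, target: str) -> tuple[pd.DataFrame, pd.Series]:
    """Impute, one-hot encode, and clean column names; returns (X, y)."""
    y = df[target]
    X = df.drop(columns=[target]).copy()

    for col in X.columns:
        if pd.api.types.is_numeric_dtype(X[col]):
            # Numerical features: fill gaps with the median.
            X[col] = X[col].fillna(X[col].median())
        else:
            # Categorical features: fill gaps with the most frequent value.
            X[col] = X[col].fillna(X[col].mode().iloc[0])

    # One-hot encode the remaining categorical columns.
    X = pd.get_dummies(X, dtype=int)

    # Replace anything other than letters, digits, and underscores, since
    # some ML libraries reject special characters in feature names.
    X.columns = [re.sub(r"[^0-9A-Za-z_]", "_", c) for c in X.columns]
    return X, y


# Hypothetical toy data standing in for the real loan dataset.
rng = np.random.default_rng(0)
n = 120
toy = pd.DataFrame({
    "income": rng.normal(5000, 1500, n).round(),
    "loan_amount": rng.normal(150, 40, n).round(),
    "credit_history": rng.integers(0, 2, n).astype(float),
    "education": rng.choice(["Graduate", "Not Graduate"], n),
    "dependents": rng.choice(["0", "1", "2", "3+"], n),
})
toy["approved"] = ((toy["credit_history"] == 1) | (rng.random(n) < 0.2)).astype(int)
toy.loc[::10, "loan_amount"] = np.nan   # simulate missing numerical values
toy.loc[::15, "education"] = np.nan     # simulate missing categorical values

X, y = preprocess(toy, target="approved")

# Illustrative grid only; the grid used in the project is not stated here.
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={"n_estimators": [25, 50], "max_depth": [3, None]},
    scoring="roc_auc",
    cv=3,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

For simplicity, the sketch imputes on the full table before cross-validation. Fitting the imputation inside each training fold (for example with a scikit-learn `Pipeline`) would avoid leaking validation statistics into training.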

📈 Results Summary

Model                    Accuracy   F1-score   ROC-AUC
Random Forest (Tuned)    0.8143     0.8779     0.7814
CatBoost (Tuned)         0.8144     0.8788     0.7782
XGBoost (Tuned)          0.8078     0.8743     0.7744
Logistic Regression      0.8046     0.8729     0.7495
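The three metrics reported here measure different things, which is why one model can show an F1-score above its accuracy and a ROC-AUC below both. The dependency-free sketch below spells out the standard definitions on a made-up five-row example; the labels and scores are illustrative, not project data. In practice, `accuracy_score`, `f1_score`, and `roc_auc_score` from `sklearn.metrics` compute the same quantities.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1(y_true, y_pred):
    """Harmonic mean of precision and recall for the positive class (label 1)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    return 2 * tp / (2 * tp + fp + fn)


def roc_auc(y_true, scores):
    """Chance that a random positive is scored above a random negative.

    Ties count as half a win. This pairwise form is O(n^2), which is fine
    for illustration but not for large datasets.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up example: three approved (1) and two rejected (0) applications.
y_true = [1, 1, 1, 0, 0]
y_score = [0.9, 0.8, 0.6, 0.7, 0.2]           # predicted approval probabilities
y_pred = [int(s >= 0.5) for s in y_score]     # default 0.5 cut-off

print(round(accuracy(y_true, y_pred), 3))  # 0.8
print(round(f1(y_true, y_pred), 3))        # 0.857
print(round(roc_auc(y_true, y_score), 3))  # 0.833
```

Accuracy and F1 depend on the chosen cut-off, while ROC-AUC depends only on how the scores rank positives against negatives.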

📂 Sample Predictions

The results.csv file in the repository contains sample predictions generated by the final tuned Random Forest model.


Next Steps

  • Implement stacking/ensembling of top-performing models.
  • Experiment with advanced feature engineering (interaction terms, ratios).
  • Optimize classification threshold for business-specific metrics.
  • Deploy as a web app for real-time loan approval predictions.
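As one possible starting point for the threshold-optimization item, the sketch below sweeps candidate cut-offs and keeps the one with the lowest total misclassification cost. The cost values, the `total_cost` and `best_threshold` helpers, and the sample data are all hypothetical. A real cost model would come from the lender, and the threshold should be chosen on held-out validation data.

```python
def total_cost(y_true, scores, threshold, cost_fp, cost_fn):
    """Cost of classifying score >= threshold as approved (label 1)."""
    fp = sum(t == 0 and s >= threshold for t, s in zip(y_true, scores))
    fn = sum(t == 1 and s < threshold for t, s in zip(y_true, scores))
    return cost_fp * fp + cost_fn * fn


def best_threshold(y_true, scores, cost_fp, cost_fn):
    """Return the observed score that minimizes total cost when used as cut-off."""
    candidates = sorted(set(scores))
    return min(
        candidates,
        key=lambda t: total_cost(y_true, scores, t, cost_fp, cost_fn),
    )


# Hypothetical assumption: approving a bad loan (false positive) costs
# five times as much as rejecting a good one (false negative).
y_true = [1, 1, 1, 0, 0]
y_score = [0.9, 0.8, 0.6, 0.7, 0.2]

threshold = best_threshold(y_true, y_score, cost_fp=5, cost_fn=1)
print(threshold)                                    # 0.8
print(total_cost(y_true, y_score, threshold, 5, 1)) # 1
```

With these costs the sweep moves the cut-off well above 0.5, trading one rejected good applicant for zero approved bad ones.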