Smart Agent Recruitment

📝 Description

  • This is a classification machine learning problem: identify the agents / applicants of a Financial Distribution company who will be able to source business for the company within 3 months of completing their 7-day corporate training.
  • Predictions are made using LightGBM. XGBoost and AdaBoost were also used for experimentation.
  • A Power BI Dashboard was developed to capture past trends in agent recruitment and derive insights from the data.

📊 Power BI Dashboard

The Overview Dashboard and the Applicant Details Dashboard for FinMan Agent Recruitment.




📁 Code

⌛ Dataset

The dataset train.csv is used for training. It has 9,527 records with 23 features.
The dataset consists of the following attributes :

  • ID : Application ID for the Applicant.
  • Office_PIN
  • Application_Receipt_Date
  • Applicant_City_PIN
  • Applicant_Gender
  • Applicant_Birthdate
  • Applicant_Marital_Status
  • Applicant_Occupation
  • Applicant_Qualification
  • Manager_DOJ
  • Manager_Joining_Designation
  • Manager_Current_Designation
  • Manager_Grade
  • Manager_Status : Status of Employment of the Manager (Confirmed / Probation).
  • Manager_Gender
  • Manager_DoB
  • Manager_Num_Application : Number of applications sourced by the Manager.
  • Manager_Num_Coded
  • Manager_Business : Amount of Business Sourced by the Manager in the last 3 months.
  • Manager_Num_Products : Number of Products sold by the Manager in the last 3 months.
  • Manager_Business2 : Amount of Business Sourced by the Manager in the last 3 months, excluding the amount sourced by Category A advisor.
  • Manager_Num_Products2 : Number of Products sold by the Manager in the last 3 months, excluding the number sold by Category A advisor.
  • Business_Sourced : If the Applicant was able to source Business within 3 months (0 : Didn't Source Business , 1 : Sourced Business).

📃 Technical Overview

The project has been divided into the following steps :

1. Exploratory Data Analysis

In this step, features with missing values and outliers, the target variable distribution, and the numerical and categorical feature distributions were examined, and Univariate and Bivariate Analysis was performed.
Some of the data insights are given below. (For the detailed EDA, please refer to the ipynb notebook.)

  • During Univariate Analysis, it is observed that all the numerical features are skewed.





  • The features Manager_Business and Manager_Business2 are highly correlated. Similarly, a high correlation is observed between Manager_Num_Products and Manager_Num_Products2. To reduce multicollinearity, the columns Manager_Business2 and Manager_Num_Products2 are dropped.



  • As expected, there is a strong correlation between Manager_Num_Products and Manager_Business: as the number of products sold increases, the amount of business sourced by the Manager also increases.



  • The peak number of applications was received in May 2007. In the initial months the number of applications received was low, but it increased in the subsequent months. The bulk of the applications was received from July to December in both 2007 and 2008.



  • It is observed that initially, in the period Apr - Aug 2007, the number of products sold where business was sourced was much lower than where business was not sourced. The number of products sold where business was sourced started to increase in September 2007. The gap between the business-sourced and non-sourced groups gradually narrowed, and this trend continued until March 2008. There were instances where the number of products sold when business was sourced exceeded the number when it was not.



  • On investigating the applications received throughout the time period, a trend emerges. On a given day, the agents whose applications were received first, or relatively early in the day, tended to source business within 3 months of the 7-day training. This pattern is observed across all 16 months of the train dataset and is captured as a feature in the Feature Engineering step.
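
The multicollinearity check described in the bullets above can be sketched in plain Python. This is a minimal illustration, not the notebook's code: the sample values are made up, the 0.9 threshold is an assumption, and `pearson` and `drop_correlated` are hypothetical helper names.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def drop_correlated(columns, pairs, threshold=0.9):
    """Return a copy of `columns` without the second column of each pair
    whose absolute correlation exceeds `threshold` (an assumed cutoff)."""
    kept = dict(columns)
    for keep, candidate in pairs:
        if abs(pearson(columns[keep], columns[candidate])) > threshold:
            kept.pop(candidate, None)
    return kept


# Made-up values, for illustration only.
columns = {
    "Manager_Business": [100.0, 250.0, 0.0, 400.0, 320.0],
    "Manager_Business2": [90.0, 240.0, 0.0, 380.0, 300.0],
    "Manager_Num_Products": [2.0, 5.0, 0.0, 9.0, 6.0],
    "Manager_Num_Products2": [2.0, 4.0, 0.0, 8.0, 6.0],
}
pairs = [
    ("Manager_Business", "Manager_Business2"),
    ("Manager_Num_Products", "Manager_Num_Products2"),
]
reduced = drop_correlated(columns, pairs)
print(sorted(reduced))
```

Keeping the first column of each pair mirrors the choice stated above, where the "2" variants are the ones dropped.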



2. Data Preprocessing / Cleaning

  • 19 of the 23 features had missing values.
  • Arbitrary Value imputation is used to handle missing values in the numerical, categorical and date columns / features.
  • The date columns were converted to a proper datetime data type.
  • Irrelevant features were dropped from the train and test datasets.
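
As a rough sketch of the cleaning steps above, arbitrary value imputation replaces each missing entry with a fixed sentinel so the model can treat "missing" as its own value. The sentinels, the date format and the helper names below are assumptions; the README does not state which values the notebook uses.

```python
from datetime import datetime

# Hypothetical sentinels: the README does not state which arbitrary values were used.
NUMERIC_FILL = -1
CATEGORY_FILL = "Missing"
DATE_FILL = datetime(1900, 1, 1)
DATE_FORMAT = "%m/%d/%Y"  # assumed layout of the raw date strings


def is_missing(value):
    """Treat None and empty strings as missing."""
    return value is None or value == ""


def clean_record(record, numeric_cols, category_cols, date_cols):
    """Return a copy of `record` with missing values replaced by sentinels
    and date strings converted to datetime objects."""
    cleaned = dict(record)
    for col in numeric_cols:
        raw = record.get(col)
        cleaned[col] = NUMERIC_FILL if is_missing(raw) else float(raw)
    for col in category_cols:
        if is_missing(record.get(col)):
            cleaned[col] = CATEGORY_FILL
    for col in date_cols:
        raw = record.get(col)
        cleaned[col] = DATE_FILL if is_missing(raw) else datetime.strptime(raw, DATE_FORMAT)
    return cleaned


# Made-up row, for illustration only.
row = {
    "Manager_Business": "",
    "Applicant_Gender": None,
    "Applicant_Birthdate": "12/19/1971",
    "Manager_DOJ": "",
}
cleaned = clean_record(
    row,
    numeric_cols=["Manager_Business"],
    category_cols=["Applicant_Gender"],
    date_cols=["Applicant_Birthdate", "Manager_DOJ"],
)
print(cleaned)
```

Tree-based models such as LightGBM and XGBoost can split on a sentinel directly, which is one reason this simple scheme is often used with them.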

3. Feature Engineering

  • In this step 4 extra numerical features were created :
    • Agent_Age : The age of the Applicant / agent as of the Application Receipt Date.
    • Manager_Age : The age of the Manager as of the Application Receipt Date.
    • Manager_Exp : The work experience of Manager in the company.
    • App_Order_Percent : Percentile of the position of the Application Received calculated at a daily level.
  • The categorical features (Applicant_Gender, Applicant_Occupation) were One Hot Encoded and (Manager_Joining_Designation, Manager_Current_Designation) were Label Encoded.
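
The engineered features above could be computed along these lines. This is a sketch under stated assumptions: `age_in_years` and `app_order_percent` are hypothetical helpers, the order of rows within a day is assumed to reflect the order in which applications were received, and "percentile" is taken to mean position divided by the number of applications that day. The notebook may define it differently.

```python
from collections import defaultdict
from datetime import date


def age_in_years(born, on):
    """Completed years between `born` and `on` (used for ages and experience)."""
    had_anniversary = (on.month, on.day) >= (born.month, born.day)
    return on.year - born.year - (0 if had_anniversary else 1)


def app_order_percent(applications):
    """Map each application ID to its position within its receipt date,
    scaled to (0, 1]. `applications` is a list of (ID, receipt_date)
    tuples in the order the rows appear in the file."""
    by_day = defaultdict(list)
    for app_id, received in applications:
        by_day[received].append(app_id)
    result = {}
    for ids in by_day.values():
        for position, app_id in enumerate(ids, start=1):
            result[app_id] = position / len(ids)
    return result


# Made-up applications, for illustration only.
applications = [
    ("A1", date(2007, 5, 1)),
    ("A2", date(2007, 5, 1)),
    ("A3", date(2007, 5, 1)),
    ("A4", date(2007, 5, 1)),
    ("B1", date(2007, 5, 2)),
]
order = app_order_percent(applications)
agent_age = age_in_years(date(1980, 6, 15), date(2007, 5, 1))
print(order, agent_age)
```

Agent_Age, Manager_Age and Manager_Exp would all use the same date difference, with Applicant_Birthdate, Manager_DoB and Manager_DOJ respectively as the start date.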

📈 Modelling and Evaluation

  • In the modelling part, the following models are used :
    • XGBoost (Mean CV Scores : 0.88256, Variance in CV Scores : 0.00564)
    • Light Gradient Boosting (Mean CV Scores : 0.8826, Variance in CV Scores : 0.00338)
    • AdaBoost (Mean CV Scores : 0.8755, Variance in CV Scores : 0.000684)
  • The scoring metric is ROC_AUC.
  • Randomized Search CV is used for hyperparameter tuning and finding the best parameters under roc_auc scoring.
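
For readers unfamiliar with the metric, ROC AUC can be read as the probability that a randomly chosen positive example (sourced business) is scored higher than a randomly chosen negative one. The sketch below computes it directly from that definition and summarizes a set of fold scores. The function names and the fold scores are made up, and using population variance for the "Variance in CV Scores" figures is an assumption.

```python
from statistics import mean, pvariance


def roc_auc(y_true, y_score):
    """ROC AUC as the share of (positive, negative) pairs in which the
    positive example has the higher score; ties count as half a win."""
    positives = [s for y, s in zip(y_true, y_score) if y == 1]
    negatives = [s for y, s in zip(y_true, y_score) if y == 0]
    if not positives or not negatives:
        raise ValueError("ROC AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def summarize(fold_scores):
    """Mean and population variance of per-fold scores (assumed summary)."""
    return mean(fold_scores), pvariance(fold_scores)


auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc)  # 0.75: three of the four positive/negative pairs are ordered correctly

cv_mean, cv_var = summarize([0.5, 0.7])  # made-up fold scores
print(cv_mean, cv_var)
```

The pairwise loop is quadratic and only meant to make the definition concrete; library implementations compute the same quantity from sorted scores.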

📋 Results

Feature Importance

XGBoost

In the XGBoost model, the top 5 features by importance are : Agent_Age, App_Order_Percent, Manager_Age, Applicant_City_PIN and Manager_Exp.

LightGBM

In the LightGBM model, the top features by importance are : App_Order_Percent, Manager_Exp, Applicant_City_PIN, Office_PIN, Manager_Age and Agent_Age.

⚙️ Tools and Technologies used

The tools used in this project include:

  • Python - Used for Data Quality Assessment, Data Cleaning, Exploratory Data Analysis of the datasets, feature engineering and building the model.
  • Power BI - Used to explore the data and create the charts, graphs and visualizations for a Dashboard that captures past trends in Agent Recruitment.

✒️ Authors