The goal of this project is to predict the payment date of invoices from the given dataset. The project is implemented in the HRC30094W_AMIT_PATNAIK_prediction.ipynb Jupyter Notebook.
![Screenshot 2023-06-27 at 5 06 33 PM](https://private-user-images.githubusercontent.com/77204104/249137254-eb8bbb4e-ddd2-4bba-8325-3da9bd0f560b.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MzQ2MTA1ODUsIm5iZiI6MTczNDYxMDI4NSwicGF0aCI6Ii83NzIwNDEwNC8yNDkxMzcyNTQtZWI4YmJiNGUtZGRkMi00YmJhLTgzMjUtM2RhOWJkMGY1NjBiLnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNDEyMTklMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjQxMjE5VDEyMTEyNVomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPWI0YTY4ZWRhNmE0MzQyYWZlMmQ2MGNiNWQ5NjRkMTlhNGNkNThlZWUyMTllZmE5YmIxNjc3MjE4ZGM5NzJlMjAmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0In0.PJMjvTo0qcRKAJzrJ9KBJbVVYeYg397lW6N1MgVvX14)
- Importing related Libraries
- Storing the Dataset into the Dataframe
- Checking the Shape of the Dataframe
- Checking the Detail Information of the Dataframe
- Displaying All the Column Names
- Describing the Entire Dataset
- Data Cleaning
- Showing Top 5 Records from the Dataset
- Data Type Conversion
- Splitting the Dataset
- Splitting the Train and Test Data
- Exploratory Data Analysis (EDA)
- Feature Engineering
- Feature Selection
- Modeling
- Comparing Different Machine Learning Models
- Linear Regression
- Decision Tree Regression
- Random Forest Regression
- Support Vector Regression
- Extreme Gradient Boost Regression
- End of the Project
![Screenshot 2023-06-27 at 4 13 32 PM](https://private-user-images.githubusercontent.com/77204104/249125209-308cd0c5-2448-4f54-a0ec-3f8008875bdd.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MzQ2MTA1ODUsIm5iZiI6MTczNDYxMDI4NSwicGF0aCI6Ii83NzIwNDEwNC8yNDkxMjUyMDktMzA4Y2QwYzUtMjQ0OC00ZjU0LWEwZWMtM2Y4MDA4ODc1YmRkLnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNDEyMTklMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjQxMjE5VDEyMTEyNVomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPTI0NzhiY2NjNmY2MTU4NGFmYWVlNzczOWI5MTA3MGVhNWU5MWM0ZTU4NTljZDlmMDBlNTBkYWNkYWYyM2I3NWImWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0In0.GqQFoj87RLUFwmzp1XtHaNOJWwf-XdgugKetP1Z1qJU)
![Screenshot 2023-06-27 at 4 14 55 PM](https://private-user-images.githubusercontent.com/77204104/249125555-b8796040-2c0f-49aa-91a9-72b01f20950e.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MzQ2MTA1ODUsIm5iZiI6MTczNDYxMDI4NSwicGF0aCI6Ii83NzIwNDEwNC8yNDkxMjU1NTUtYjg3OTYwNDAtMmMwZi00OWFhLTkxYTktNzJiMDFmMjA5NTBlLnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNDEyMTklMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjQxMjE5VDEyMTEyNVomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPWRkNWIxNmRlODFiODdiOTIzZGU5OWE5NTBhM2ZmZWQzMDVkMWFkNGVlMjE2Njc1NDAyNzgzZjA0OTUwODhiMmUmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0In0.iypUfoz1qlIxWKldFscs2WIk5pjeRKflOGGiPsXqdZU)
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')
df=pd.read_csv(r"C:\Users\amitp\hr\dataset.csv")
df
business_code | cust_number | name_customer | clear_date | buisness_year | doc_id | posting_date | document_create_date | document_create_date.1 | due_in_date | invoice_currency | document type | posting_id | area_business | total_open_amount | baseline_create_date | cust_payment_terms | invoice_id | isOpen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | U001 | 0200769623 | WAL-MAR corp | 2020-02-11 00:00:00 | 2020.0 | 1.930438e+09 | 2020-01-26 | 20200125 | 20200126 | 20200210.0 | USD | RV | 1.0 | NaN | 54273.28 | 20200126.0 | NAH4 | 1.930438e+09 | 0 |
1 | U001 | 0200980828 | BEN E | 2019-08-08 00:00:00 | 2019.0 | 1.929646e+09 | 2019-07-22 | 20190722 | 20190722 | 20190811.0 | USD | RV | 1.0 | NaN | 79656.60 | 20190722.0 | NAD1 | 1.929646e+09 | 0 |
2 | U001 | 0200792734 | MDV/ trust | 2019-12-30 00:00:00 | 2019.0 | 1.929874e+09 | 2019-09-14 | 20190914 | 20190914 | 20190929.0 | USD | RV | 1.0 | NaN | 2253.86 | 20190914.0 | NAA8 | 1.929874e+09 | 0 |
3 | CA02 | 0140105686 | SYSC llc | NaN | 2020.0 | 2.960623e+09 | 2020-03-30 | 20200330 | 20200330 | 20200410.0 | CAD | RV | 1.0 | NaN | 3299.70 | 20200331.0 | CA10 | 2.960623e+09 | 1 |
4 | U001 | 0200769623 | WAL-MAR foundation | 2019-11-25 00:00:00 | 2019.0 | 1.930148e+09 | 2019-11-13 | 20191113 | 20191113 | 20191128.0 | USD | RV | 1.0 | NaN | 33133.29 | 20191113.0 | NAH4 | 1.930148e+09 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
49995 | U001 | 0200561861 | CO corporation | NaN | 2020.0 | 1.930797e+09 | 2020-04-21 | 20200417 | 20200421 | 20200506.0 | USD | RV | 1.0 | NaN | 3187.86 | 20200421.0 | NAA8 | 1.930797e+09 | 1 |
49996 | U001 | 0200769623 | WAL-MAR co | 2019-09-03 00:00:00 | 2019.0 | 1.929744e+09 | 2019-08-15 | 20190814 | 20190815 | 20190830.0 | USD | RV | 1.0 | NaN | 6766.54 | 20190815.0 | NAH4 | 1.929744e+09 | 0 |
49997 | U001 | 0200772595 | SAFEW associates | 2020-03-05 00:00:00 | 2020.0 | 1.930537e+09 | 2020-02-19 | 20200218 | 20200219 | 20200305.0 | USD | RV | 1.0 | NaN | 6120.86 | 20200219.0 | NAA8 | 1.930537e+09 | 0 |
49998 | U001 | 0200726979 | BJ'S llc | 2019-12-12 00:00:00 | 2019.0 | 1.930199e+09 | 2019-11-27 | 20191126 | 20191127 | 20191212.0 | USD | RV | 1.0 | NaN | 63.48 | 20191127.0 | NAA8 | 1.930199e+09 | 0 |
49999 | U001 | 0200020431 | DEC corp | 2019-01-15 00:00:00 | 2019.0 | 1.928576e+09 | 2019-01-05 | 20190105 | 20190105 | 20190124.0 | USD | RV | 1.0 | NaN | 1790.30 | 20190101.0 | NAM4 | 1.928576e+09 | 0 |
50000 rows × 19 columns
df.shape
(50000, 19)
df.info
<bound method DataFrame.info of business_code cust_number name_customer clear_date \
0 U001 0200769623 WAL-MAR corp 2020-02-11 00:00:00
1 U001 0200980828 BEN E 2019-08-08 00:00:00
2 U001 0200792734 MDV/ trust 2019-12-30 00:00:00
3 CA02 0140105686 SYSC llc NaN
4 U001 0200769623 WAL-MAR foundation 2019-11-25 00:00:00
... ... ... ... ...
49995 U001 0200561861 CO corporation NaN
49996 U001 0200769623 WAL-MAR co 2019-09-03 00:00:00
49997 U001 0200772595 SAFEW associates 2020-03-05 00:00:00
49998 U001 0200726979 BJ'S llc 2019-12-12 00:00:00
49999 U001 0200020431 DEC corp 2019-01-15 00:00:00
buisness_year doc_id posting_date document_create_date \
0 2020.0 1.930438e+09 2020-01-26 20200125
1 2019.0 1.929646e+09 2019-07-22 20190722
2 2019.0 1.929874e+09 2019-09-14 20190914
3 2020.0 2.960623e+09 2020-03-30 20200330
4 2019.0 1.930148e+09 2019-11-13 20191113
... ... ... ... ...
49995 2020.0 1.930797e+09 2020-04-21 20200417
49996 2019.0 1.929744e+09 2019-08-15 20190814
49997 2020.0 1.930537e+09 2020-02-19 20200218
49998 2019.0 1.930199e+09 2019-11-27 20191126
49999 2019.0 1.928576e+09 2019-01-05 20190105
document_create_date.1 due_in_date invoice_currency document type \
0 20200126 20200210.0 USD RV
1 20190722 20190811.0 USD RV
2 20190914 20190929.0 USD RV
3 20200330 20200410.0 CAD RV
4 20191113 20191128.0 USD RV
... ... ... ... ...
49995 20200421 20200506.0 USD RV
49996 20190815 20190830.0 USD RV
49997 20200219 20200305.0 USD RV
49998 20191127 20191212.0 USD RV
49999 20190105 20190124.0 USD RV
posting_id area_business total_open_amount baseline_create_date \
0 1.0 NaN 54273.28 20200126.0
1 1.0 NaN 79656.60 20190722.0
2 1.0 NaN 2253.86 20190914.0
3 1.0 NaN 3299.70 20200331.0
4 1.0 NaN 33133.29 20191113.0
... ... ... ... ...
49995 1.0 NaN 3187.86 20200421.0
49996 1.0 NaN 6766.54 20190815.0
49997 1.0 NaN 6120.86 20200219.0
49998 1.0 NaN 63.48 20191127.0
49999 1.0 NaN 1790.30 20190101.0
cust_payment_terms invoice_id isOpen
0 NAH4 1.930438e+09 0
1 NAD1 1.929646e+09 0
2 NAA8 1.929874e+09 0
3 CA10 2.960623e+09 1
4 NAH4 1.930148e+09 0
... ... ... ...
49995 NAA8 1.930797e+09 1
49996 NAH4 1.929744e+09 0
49997 NAA8 1.930537e+09 0
49998 NAA8 1.930199e+09 0
49999 NAM4 1.928576e+09 0
[50000 rows x 19 columns]>
df.columns
Index(['business_code', 'cust_number', 'name_customer', 'clear_date',
'buisness_year', 'doc_id', 'posting_date', 'document_create_date',
'document_create_date.1', 'due_in_date', 'invoice_currency',
'document type', 'posting_id', 'area_business', 'total_open_amount',
'baseline_create_date', 'cust_payment_terms', 'invoice_id', 'isOpen'],
dtype='object')
df.describe()
buisness_year | doc_id | document_create_date | document_create_date.1 | due_in_date | posting_id | area_business | total_open_amount | baseline_create_date | invoice_id | isOpen | |
---|---|---|---|---|---|---|---|---|---|---|---|
count | 50000.000000 | 5.000000e+04 | 5.000000e+04 | 5.000000e+04 | 5.000000e+04 | 50000.0 | 0.0 | 50000.000000 | 5.000000e+04 | 4.999400e+04 | 50000.000000 |
mean | 2019.305700 | 2.012238e+09 | 2.019351e+07 | 2.019354e+07 | 2.019368e+07 | 1.0 | NaN | 32337.021651 | 2.019354e+07 | 2.011340e+09 | 0.200000 |
std | 0.460708 | 2.885235e+08 | 4.496041e+03 | 4.482134e+03 | 4.470614e+03 | 0.0 | NaN | 39205.975231 | 4.482701e+03 | 2.766335e+08 | 0.400004 |
min | 2019.000000 | 1.928502e+09 | 2.018123e+07 | 2.018123e+07 | 2.018122e+07 | 1.0 | NaN | 0.720000 | 2.018121e+07 | 1.928502e+09 | 0.000000 |
25% | 2019.000000 | 1.929342e+09 | 2.019050e+07 | 2.019051e+07 | 2.019052e+07 | 1.0 | NaN | 4928.312500 | 2.019050e+07 | 1.929342e+09 | 0.000000 |
50% | 2019.000000 | 1.929964e+09 | 2.019091e+07 | 2.019091e+07 | 2.019093e+07 | 1.0 | NaN | 17609.010000 | 2.019091e+07 | 1.929964e+09 | 0.000000 |
75% | 2020.000000 | 1.930619e+09 | 2.020013e+07 | 2.020013e+07 | 2.020022e+07 | 1.0 | NaN | 47133.635000 | 2.020013e+07 | 1.930619e+09 | 0.000000 |
max | 2020.000000 | 9.500000e+09 | 2.020052e+07 | 2.020052e+07 | 2.020071e+07 | 1.0 | NaN | 668593.360000 | 2.020052e+07 | 2.960636e+09 | 1.000000 |
- Top 5 records from the dataset
df.head()
business_code | cust_number | name_customer | clear_date | buisness_year | doc_id | posting_date | document_create_date | document_create_date.1 | due_in_date | invoice_currency | document type | posting_id | area_business | total_open_amount | baseline_create_date | cust_payment_terms | invoice_id | isOpen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | U001 | 0200769623 | WAL-MAR corp | 2020-02-11 00:00:00 | 2020.0 | 1.930438e+09 | 2020-01-26 | 20200125 | 20200126 | 20200210.0 | USD | RV | 1.0 | NaN | 54273.28 | 20200126.0 | NAH4 | 1.930438e+09 | 0 |
1 | U001 | 0200980828 | BEN E | 2019-08-08 00:00:00 | 2019.0 | 1.929646e+09 | 2019-07-22 | 20190722 | 20190722 | 20190811.0 | USD | RV | 1.0 | NaN | 79656.60 | 20190722.0 | NAD1 | 1.929646e+09 | 0 |
2 | U001 | 0200792734 | MDV/ trust | 2019-12-30 00:00:00 | 2019.0 | 1.929874e+09 | 2019-09-14 | 20190914 | 20190914 | 20190929.0 | USD | RV | 1.0 | NaN | 2253.86 | 20190914.0 | NAA8 | 1.929874e+09 | 0 |
3 | CA02 | 0140105686 | SYSC llc | NaN | 2020.0 | 2.960623e+09 | 2020-03-30 | 20200330 | 20200330 | 20200410.0 | CAD | RV | 1.0 | NaN | 3299.70 | 20200331.0 | CA10 | 2.960623e+09 | 1 |
4 | U001 | 0200769623 | WAL-MAR foundation | 2019-11-25 00:00:00 | 2019.0 | 1.930148e+09 | 2019-11-13 | 20191113 | 20191113 | 20191128.0 | USD | RV | 1.0 | NaN | 33133.29 | 20191113.0 | NAH4 | 1.930148e+09 | 0 |
- Output: area_business is 100% null, clear_date is 20% null, invoice_id is 0.012% null
nullpercent=df.isna().mean()*100
nullpercent
business_code 0.000
cust_number 0.000
name_customer 0.000
clear_date 20.000
buisness_year 0.000
doc_id 0.000
posting_date 0.000
document_create_date 0.000
document_create_date.1 0.000
due_in_date 0.000
invoice_currency 0.000
document type 0.000
posting_id 0.000
area_business 100.000
total_open_amount 0.000
baseline_create_date 0.000
cust_payment_terms 0.000
invoice_id 0.012
isOpen 0.000
dtype: float64
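- Optional follow-up (not in the original notebook) - columns above a chosen null cutoff can be listed programmatically:
# Hypothetical helper: list columns whose null share exceeds a cutoff.
# With a 50% cutoff this flags only 'area_business' for the data above.
nullpercent[nullpercent > 50].index.tolist()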
Most rows should have the same value in invoice_id and doc_id:
df[["invoice_id","doc_id"]]
invoice_id | doc_id | |
---|---|---|
0 | 1.930438e+09 | 1.930438e+09 |
1 | 1.929646e+09 | 1.929646e+09 |
2 | 1.929874e+09 | 1.929874e+09 |
3 | 2.960623e+09 | 2.960623e+09 |
4 | 1.930148e+09 | 1.930148e+09 |
... | ... | ... |
49995 | 1.930797e+09 | 1.930797e+09 |
49996 | 1.929744e+09 | 1.929744e+09 |
49997 | 1.930537e+09 | 1.930537e+09 |
49998 | 1.930199e+09 | 1.930199e+09 |
49999 | 1.928576e+09 | 1.928576e+09 |
50000 rows × 2 columns
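- A minimal check (not in the original notebook) to quantify this; a value near 1.0 means one of the two columns is redundant:
# Fraction of rows where invoice_id equals doc_id (NaNs compare as unequal).
(df["invoice_id"] == df["doc_id"]).mean()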
Check whether 'baseline_create_date', "document_create_date" and 'document_create_date.1' are almost identical.
- If they are, we will drop the redundant ones later.
df[['baseline_create_date',"document_create_date",'document_create_date.1']]
baseline_create_date | document_create_date | document_create_date.1 | |
---|---|---|---|
0 | 20200126.0 | 20200125 | 20200126 |
1 | 20190722.0 | 20190722 | 20190722 |
2 | 20190914.0 | 20190914 | 20190914 |
3 | 20200331.0 | 20200330 | 20200330 |
4 | 20191113.0 | 20191113 | 20191113 |
... | ... | ... | ... |
49995 | 20200421.0 | 20200417 | 20200421 |
49996 | 20190815.0 | 20190814 | 20190815 |
49997 | 20200219.0 | 20200218 | 20200219 |
49998 | 20191127.0 | 20191126 | 20191127 |
49999 | 20190101.0 | 20190105 | 20190105 |
50000 rows × 3 columns
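- A hedged check (not in the original notebook) of how often these columns actually agree:
# Share of rows where the two document-create columns agree, and where the
# baseline date matches document_create_date.1 (all are YYYYMMDD numbers).
print((df["document_create_date"] == df["document_create_date.1"]).mean())
print((df["baseline_create_date"] == df["document_create_date.1"]).mean())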
df[["posting_id"]]
posting_id | |
---|---|
0 | 1.0 |
1 | 1.0 |
2 | 1.0 |
3 | 1.0 |
4 | 1.0 |
... | ... |
49995 | 1.0 |
49996 | 1.0 |
49997 | 1.0 |
49998 | 1.0 |
49999 | 1.0 |
50000 rows × 1 columns
df[['isOpen']]
isOpen | |
---|---|
0 | 0 |
1 | 0 |
2 | 0 |
3 | 1 |
4 | 0 |
... | ... |
49995 | 1 |
49996 | 0 |
49997 | 0 |
49998 | 0 |
49999 | 0 |
50000 rows × 1 columns
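- Optional check (not in the original notebook) - confirm which of these columns are effectively constant before dropping:
# posting_id should show a single unique value; isOpen should show two.
df[["posting_id", "isOpen"]].nunique()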
Columns to drop:
- 'area_business'
- "posting_id"
- "invoice_id"
- "document_create_date"
- "isOpen"
- 'document type'
- 'document_create_date.1'
df.drop(['area_business',
         "posting_id",
         "invoice_id",
         "document_create_date",
         "isOpen",
         'document type',
         'document_create_date.1'], axis=1, inplace=True)
df.columns
Index(['business_code', 'cust_number', 'name_customer', 'clear_date',
'buisness_year', 'doc_id', 'posting_date', 'due_in_date',
'invoice_currency', 'total_open_amount', 'baseline_create_date',
'cust_payment_terms'],
dtype='object')
df.duplicated(keep='first')
0 False
1 False
2 False
3 False
4 False
...
49995 False
49996 False
49997 False
49998 False
49999 False
Length: 50000, dtype: bool
df.duplicated().sum()
1161
df.drop_duplicates(inplace=True)
- The duplicate count must be 0 by now
df.duplicated().sum()
0
df.shape
(48839, 12)
df.isnull().sum()
business_code 0
cust_number 0
name_customer 0
clear_date 9681
buisness_year 0
doc_id 0
posting_date 0
due_in_date 0
invoice_currency 0
total_open_amount 0
baseline_create_date 0
cust_payment_terms 0
dtype: int64
#Data type Conversion
df.dtypes
business_code object
cust_number object
name_customer object
clear_date object
buisness_year float64
doc_id float64
posting_date object
due_in_date float64
invoice_currency object
total_open_amount float64
baseline_create_date float64
cust_payment_terms object
dtype: object
- clear_date
- posting_date
- due_in_date
- baseline_create_date
df.dtypes[['clear_date',
           'posting_date',
           'due_in_date',
           'baseline_create_date']]
clear_date object
posting_date object
due_in_date float64
baseline_create_date float64
dtype: object
- clear_date
- posting_date
- due_in_date
- baseline_create_date
- Note - You have to convert all the above columns to datetime; the numeric ones are parsed with the "%Y%m%d" format
df["clear_date"]=pd.to_datetime(df["clear_date"])
df["posting_date"]=pd.to_datetime(df["posting_date"])
df ["due_in_date"]=pd.to_datetime(df["due_in_date"], format="%Y%m%d")
df["baseline_create_date"]=pd.to_datetime(df["baseline_create_date"], format="%Y%m%d")
df.dtypes[['clear_date','posting_date','due_in_date','baseline_create_date']]
clear_date datetime64[ns]
posting_date datetime64[ns]
due_in_date datetime64[ns]
baseline_create_date datetime64[ns]
dtype: object
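- Optional sanity check (not in the original notebook) - the min/max of each parsed date column should fall inside the expected late-2018 to mid-2020 window:
# Date ranges after conversion; values outside that window would hint at a
# parsing problem with the "%Y%m%d" format.
df[["clear_date", "posting_date", "due_in_date", "baseline_create_date"]].agg(["min", "max"])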
- Count of each currency
df["invoice_currency"].value_counts()
USD 45011
CAD 3828
Name: invoice_currency, dtype: int64
df[["total_open_amount"]]
total_open_amount | |
---|---|
0 | 54273.28 |
1 | 79656.60 |
2 | 2253.86 |
3 | 3299.70 |
4 | 33133.29 |
... | ... |
49995 | 3187.86 |
49996 | 6766.54 |
49997 | 6120.86 |
49998 | 63.48 |
49999 | 1790.30 |
48839 rows × 1 columns
- 1 CAD = 0.7 USD
- Create a new column, "converted_usd", that stores USD amounts as-is and CAD amounts converted to USD
df['total_open_amount'] = np.where(df['invoice_currency'] == 'USD', df['total_open_amount'], df['total_open_amount'] * 0.7)
df['converted_usd'] = df['total_open_amount']
display_ipoll=pd.DataFrame({'converted_usd':df.converted_usd})
display_ipoll
converted_usd | |
---|---|
0 | 54273.28 |
1 | 79656.60 |
2 | 2253.86 |
3 | 2309.79 |
4 | 33133.29 |
... | ... |
49995 | 3187.86 |
49996 | 6766.54 |
49997 | 6120.86 |
49998 | 63.48 |
49999 | 1790.30 |
48839 rows × 1 columns
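- The two-step code above first overwrites total_open_amount and then copies it; an equivalent one-step sketch that leaves the original column untouched:
# Build converted_usd directly, assuming the same flat 1 CAD = 0.70 USD rate.
df["converted_usd"] = np.where(df["invoice_currency"] == "USD",
                               df["total_open_amount"],
                               df["total_open_amount"] * 0.7)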
- Note - use "buisness_year" column for this
df[["buisness_year"]].value_counts()
buisness_year
2019.0 33975
2020.0 14864
dtype: int64
Drop the now-redundant currency columns:
- 'invoice_currency'
- 'total_open_amount'
df.drop(['invoice_currency','total_open_amount'],axis=1,inplace=True)
df.shape[1]
11
- Note - Only one column is expected to still contain nulls
df.columns[df.isna().any()]
Index(['clear_date'], dtype='object')
df['clear_date'].isna().sum()
9681
- First dataframe (refer to it as maindata) containing only the rows that have NO null values in that column (this is going to be our train dataset)
- Second dataframe (refer to it as nulldata) containing the rows that have null values in that column (this is the set the final model will predict on)
maindata=df[df['clear_date'].notna()]
nulldata=df[df['clear_date'].isna()]
maindata.shape
(39158, 11)
nulldata.shape
(9681, 11)
maindata.head()
business_code | cust_number | name_customer | clear_date | buisness_year | doc_id | posting_date | due_in_date | baseline_create_date | cust_payment_terms | converted_usd | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | U001 | 0200769623 | WAL-MAR corp | 2020-02-11 | 2020.0 | 1.930438e+09 | 2020-01-26 | 2020-02-10 | 2020-01-26 | NAH4 | 54273.280 |
1 | U001 | 0200980828 | BEN E | 2019-08-08 | 2019.0 | 1.929646e+09 | 2019-07-22 | 2019-08-11 | 2019-07-22 | NAD1 | 79656.600 |
2 | U001 | 0200792734 | MDV/ trust | 2019-12-30 | 2019.0 | 1.929874e+09 | 2019-09-14 | 2019-09-29 | 2019-09-14 | NAA8 | 2253.860 |
4 | U001 | 0200769623 | WAL-MAR foundation | 2019-11-25 | 2019.0 | 1.930148e+09 | 2019-11-13 | 2019-11-28 | 2019-11-13 | NAH4 | 33133.290 |
5 | CA02 | 0140106181 | THE corporation | 2019-12-04 | 2019.0 | 2.960581e+09 | 2019-09-20 | 2019-10-04 | 2019-09-24 | CA10 | 15558.088 |
nulldata.tail()
business_code | cust_number | name_customer | clear_date | buisness_year | doc_id | posting_date | due_in_date | baseline_create_date | cust_payment_terms | converted_usd | |
---|---|---|---|---|---|---|---|---|---|---|---|
49975 | U001 | 0200769623 | WAL-MAR in | NaT | 2020.0 | 1.930625e+09 | 2020-03-10 | 2020-03-25 | 2020-03-10 | NAH4 | 13114.99 |
49980 | U001 | 0200769623 | WAL-MAR corporation | NaT | 2020.0 | 1.930851e+09 | 2020-05-03 | 2020-05-18 | 2020-05-03 | NAH4 | 8899.71 |
49982 | U001 | 0200148860 | DOLLA co | NaT | 2020.0 | 1.930638e+09 | 2020-03-11 | 2020-03-26 | 2020-03-11 | NAA8 | 4967.06 |
49992 | U001 | 0200900909 | SYSCO co | NaT | 2020.0 | 1.930702e+09 | 2020-03-25 | 2020-04-09 | 2020-03-25 | NAA8 | 1998.64 |
49995 | U001 | 0200561861 | CO corporation | NaT | 2020.0 | 1.930797e+09 | 2020-04-21 | 2020-05-06 | 2020-04-21 | NAA8 | 3187.86 |
- Note - You are expected to create a new column 'delay' from two existing columns, "clear_date" and "due_in_date"
- Formula - delay = clear_date - due_in_date
maindata['delay']=maindata['clear_date'] - maindata['due_in_date']
maindata.head()
business_code | cust_number | name_customer | clear_date | buisness_year | doc_id | posting_date | due_in_date | baseline_create_date | cust_payment_terms | converted_usd | delay | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | U001 | 0200769623 | WAL-MAR corp | 2020-02-11 | 2020.0 | 1.930438e+09 | 2020-01-26 | 2020-02-10 | 2020-01-26 | NAH4 | 54273.280 | 1 days |
1 | U001 | 0200980828 | BEN E | 2019-08-08 | 2019.0 | 1.929646e+09 | 2019-07-22 | 2019-08-11 | 2019-07-22 | NAD1 | 79656.600 | -3 days |
2 | U001 | 0200792734 | MDV/ trust | 2019-12-30 | 2019.0 | 1.929874e+09 | 2019-09-14 | 2019-09-29 | 2019-09-14 | NAA8 | 2253.860 | 92 days |
4 | U001 | 0200769623 | WAL-MAR foundation | 2019-11-25 | 2019.0 | 1.930148e+09 | 2019-11-13 | 2019-11-28 | 2019-11-13 | NAH4 | 33133.290 | -3 days |
5 | CA02 | 0140106181 | THE corporation | 2019-12-04 | 2019.0 | 2.960581e+09 | 2019-09-20 | 2019-10-04 | 2019-09-24 | CA10 | 15558.088 | 61 days |
- Note - You are expected to make a new column "avgdelay" by grouping "name_customer" column with reapect to mean of the "Delay" column.
- This new column "avg_delay" is meant to store "customer_name" wise delay
- groupby('name_customer')['Delay'].mean(numeric_only=False)
- Display the new "avg_delay" column
avg_delay=maindata.groupby('name_customer')['delay'].mean(numeric_only=False)
pd.DataFrame(avg_delay)
delay | |
---|---|
name_customer | |
11078 us | 17 days 00:00:00 |
17135 associates | -10 days +00:00:00 |
17135 llc | -3 days +00:00:00 |
236008 associates | -3 days +00:00:00 |
99 CE | 2 days 00:00:00 |
... | ... |
YEN BROS corp | 0 days 00:00:00 |
YEN BROS corporation | -1 days +12:00:00 |
YEN BROS llc | -2 days +00:00:00 |
ZARCO co | -1 days +00:00:00 |
ZIYAD us | 6 days 00:00:00 |
3889 rows × 1 columns
You need to add the "avg_delay" column with the maindata, mapped with "name_customer" column
- Note - You need to use map function to map the avgdelay with respect to "name_customer" column
maindata['avg_delay']=maindata['name_customer'].map(avg_delay)
maindata
business_code | cust_number | name_customer | clear_date | buisness_year | doc_id | posting_date | due_in_date | baseline_create_date | cust_payment_terms | converted_usd | delay | avg_delay | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | U001 | 0200769623 | WAL-MAR corp | 2020-02-11 | 2020.0 | 1.930438e+09 | 2020-01-26 | 2020-02-10 | 2020-01-26 | NAH4 | 54273.280 | 1 days | -3 days +07:08:49.779837776 |
1 | U001 | 0200980828 | BEN E | 2019-08-08 | 2019.0 | 1.929646e+09 | 2019-07-22 | 2019-08-11 | 2019-07-22 | NAD1 | 79656.600 | -3 days | 19 days 00:00:00 |
2 | U001 | 0200792734 | MDV/ trust | 2019-12-30 | 2019.0 | 1.929874e+09 | 2019-09-14 | 2019-09-29 | 2019-09-14 | NAA8 | 2253.860 | 92 days | 8 days 02:10:54.545454545 |
4 | U001 | 0200769623 | WAL-MAR foundation | 2019-11-25 | 2019.0 | 1.930148e+09 | 2019-11-13 | 2019-11-28 | 2019-11-13 | NAH4 | 33133.290 | -3 days | -3 days +19:33:27.692307693 |
5 | CA02 | 0140106181 | THE corporation | 2019-12-04 | 2019.0 | 2.960581e+09 | 2019-09-20 | 2019-10-04 | 2019-09-24 | CA10 | 15558.088 | 61 days | 71 days 10:40:00 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
49994 | U001 | 0200762301 | C&S WH trust | 2019-07-25 | 2019.0 | 1.929601e+09 | 2019-07-10 | 2019-07-25 | 2019-07-10 | NAC6 | 84780.400 | 0 days | -1 days +14:45:00 |
49996 | U001 | 0200769623 | WAL-MAR co | 2019-09-03 | 2019.0 | 1.929744e+09 | 2019-08-15 | 2019-08-30 | 2019-08-15 | NAH4 | 6766.540 | 4 days | -3 days +12:40:08.540925267 |
49997 | U001 | 0200772595 | SAFEW associates | 2020-03-05 | 2020.0 | 1.930537e+09 | 2020-02-19 | 2020-03-05 | 2020-02-19 | NAA8 | 6120.860 | 0 days | 1 days 01:08:34.285714285 |
49998 | U001 | 0200726979 | BJ'S llc | 2019-12-12 | 2019.0 | 1.930199e+09 | 2019-11-27 | 2019-12-12 | 2019-11-27 | NAA8 | 63.480 | 0 days | 1 days 13:36:42.985074626 |
49999 | U001 | 0200020431 | DEC corp | 2019-01-15 | 2019.0 | 1.928576e+09 | 2019-01-05 | 2019-01-24 | 2019-01-01 | NAM4 | 1790.300 | -9 days | -4 days +02:20:52.173913044 |
39158 rows × 13 columns
- Timedelta format: 17 days 00:00:00
- Same value in seconds: 1641600.0
maindata['avg_delay']=maindata['avg_delay'].dt.total_seconds()
maindata['avg_delay']
0 -2.334702e+05
1 1.641600e+06
2 6.990545e+05
4 -1.887923e+05
5 6.172800e+06
...
49994 -3.330000e+04
49996 -2.135915e+05
49997 9.051429e+04
49998 1.354030e+05
49999 -3.371478e+05
Name: avg_delay, Length: 39158, dtype: float64
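- Illustrative conversion back to days (86,400 seconds per day), not part of the original notebook:
# 1641600.0 seconds / 86400 = 19 days, matching the timedelta shown earlier.
(maindata["avg_delay"] / 86400).head()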
Since the "avg_delay" column was created from the "delay" and "clear_date" columns, those two columns are no longer needed
- You are expected to drop the "delay" and "clear_date" columns from the maindata dataframe
maindata.drop(['delay','clear_date'],axis=1,inplace=True)
- Note - y should hold the target column, i.e. "avg_delay", and every other column should be in X
- X holds the source fields and y holds the target field
X=maindata.iloc[:,0:-1]
X.head()
business_code | cust_number | name_customer | buisness_year | doc_id | posting_date | due_in_date | baseline_create_date | cust_payment_terms | converted_usd | |
---|---|---|---|---|---|---|---|---|---|---|
0 | U001 | 0200769623 | WAL-MAR corp | 2020.0 | 1.930438e+09 | 2020-01-26 | 2020-02-10 | 2020-01-26 | NAH4 | 54273.280 |
1 | U001 | 0200980828 | BEN E | 2019.0 | 1.929646e+09 | 2019-07-22 | 2019-08-11 | 2019-07-22 | NAD1 | 79656.600 |
2 | U001 | 0200792734 | MDV/ trust | 2019.0 | 1.929874e+09 | 2019-09-14 | 2019-09-29 | 2019-09-14 | NAA8 | 2253.860 |
4 | U001 | 0200769623 | WAL-MAR foundation | 2019.0 | 1.930148e+09 | 2019-11-13 | 2019-11-28 | 2019-11-13 | NAH4 | 33133.290 |
5 | CA02 | 0140106181 | THE corporation | 2019.0 | 2.960581e+09 | 2019-09-20 | 2019-10-04 | 2019-09-24 | CA10 | 15558.088 |
Y=maindata.iloc[:,-1]
Y.head()
0 -2.334702e+05
1 1.641600e+06
2 6.990545e+05
4 -1.887923e+05
5 6.172800e+06
Name: avg_delay, dtype: float64
- Note - The expected output should be in "X_train", "X_loc_test", "y_train", "y_loc_test" format
from sklearn.model_selection import train_test_split
X_train,X_loc_test,y_train,y_loc_test=train_test_split(X,Y,test_size=0.4)
X_train.shape,X_loc_test.shape,y_train.shape,y_loc_test.shape
((23494, 10), (15664, 10), (23494,), (15664,))
Now you are expected to split the "X_loc_test" and "y_loc_test" datasets into test and validation sets (named as below) with a 50:50 ratio
- Note - The expected output should be in "X_val", "X_test", "y_val", "y_test" format
X_val,X_test,y_val,y_test=train_test_split(X_loc_test,y_loc_test,test_size=0.5)
X_val.shape,X_test.shape,y_val.shape,y_test.shape
((7832, 10), (7832, 10), (7832,), (7832,))
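- Note (an addition, not in the original) - neither split fixes a random seed, so the shapes above stay the same but row membership changes between runs; a reproducible variant would be:
# Same 60/20/20 split, but deterministic across reruns (assumed seed 42).
X_train, X_loc_test, y_train, y_loc_test = train_test_split(X, Y, test_size=0.4, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_loc_test, y_loc_test, test_size=0.5, random_state=42)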
- Note - You are expected to make a distribution plot for the target variable
plt.figure(figsize=(20,10))
sns.distplot(Y)
<AxesSubplot:xlabel='avg_delay', ylabel='Density'>
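- Note - sns.distplot is deprecated in recent seaborn releases; a modern equivalent (assuming seaborn >= 0.11) is:
# Histogram with a KDE overlay, the replacement for distplot.
plt.figure(figsize=(20, 10))
sns.histplot(Y, kde=True, stat="density")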
You are expected to group the X_train dataset by the 'name_customer' column and count 'doc_id' per customer
- Note - code given for the groupby statement: X_train.groupby(by=['name_customer'], as_index=False)['doc_id'].count()
X_train.groupby(by=['name_customer'], as_index=False)['doc_id'].count()
name_customer | doc_id | |
---|---|---|
0 | 99 CE | 1 |
1 | 99 CE associates | 1 |
2 | 99 CE co | 1 |
3 | 99 CE corporation | 2 |
4 | 99 CE foundation | 3 |
... | ... | ... |
3131 | YEN BROS co | 1 |
3132 | YEN BROS corp | 1 |
3133 | YEN BROS corporation | 2 |
3134 | ZARCO co | 1 |
3135 | ZIYAD us | 1 |
3136 rows × 2 columns
z = X_train['doc_id']
sns.distplot(z)
<AxesSubplot:xlabel='doc_id', ylabel='Density'>
Create a distribution plot only for the "buisness_year" column, and a separate distribution plot of "buisness_year" along with the "doc_id" column
X_train.buisness_year=X_train.buisness_year.astype(dtype=int)
plt.figure(figsize=(10,5))
sns.distplot(X_train.buisness_year)
<AxesSubplot:xlabel='buisness_year', ylabel='Density'>
X_train.head()
business_code | cust_number | name_customer | buisness_year | doc_id | posting_date | due_in_date | baseline_create_date | cust_payment_terms | converted_usd | |
---|---|---|---|---|---|---|---|---|---|---|
33991 | U001 | 0200769623 | WAL-MAR us | 2019 | 1.928720e+09 | 2019-02-01 | 2019-02-16 | 2019-02-01 | NAH4 | 41417.43 |
37525 | U001 | 0200721222 | GO associates | 2019 | 1.928612e+09 | 2019-01-13 | 2019-01-28 | 2019-01-13 | NAA8 | 62069.98 |
31588 | U001 | 0200759878 | SA corporation | 2019 | 1.929058e+09 | 2019-03-31 | 2019-04-15 | 2019-03-31 | NAH4 | 12871.79 |
41228 | U001 | 0200772595 | SAFEW in | 2019 | 1.928780e+09 | 2019-02-11 | 2019-02-26 | 2019-02-11 | NAA8 | 63252.25 |
36162 | U001 | 0200439158 | POST foundation | 2019 | 1.930052e+09 | 2019-10-30 | 2019-11-14 | 2019-10-30 | NAA8 | 22481.28 |
X_train.describe()
buisness_year | doc_id | converted_usd | |
---|---|---|---|
count | 23494.000000 | 2.349400e+04 | 23494.000000 |
mean | 2019.133353 | 2.014658e+09 | 30561.172186 |
std | 0.339963 | 2.930471e+08 | 35892.284892 |
min | 2019.000000 | 1.928509e+09 | 7.510000 |
25% | 2019.000000 | 1.929180e+09 | 4622.383500 |
50% | 2019.000000 | 1.929738e+09 | 17027.670000 |
75% | 2019.000000 | 1.930210e+09 | 45345.510000 |
max | 2020.000000 | 9.500000e+09 | 668593.360000 |
The "business_code" column inside X_train, is a categorical column, so you need to perform Labelencoder on that particular column
- Note - call the Label Encoder from sklearn library and use the fit() function on "business_code" column
- Note - Please fill in the blanks (two) to complete this code
from sklearn.preprocessing import LabelEncoder
business_coder = LabelEncoder()
business_coder.fit(X_train['business_code'])
LabelEncoder()
- Note - For the training set you are expected to use fit_transform()
- Note - For the test set you are expected to use transform()
- Note - For the validation set you are expected to use transform()
- Partial code is provided, please fill in the blanks
X_train['business_code_enc'] = business_coder.fit_transform(X_train['business_code'])
X_val['business_code_enc'] = business_coder.transform(X_val['business_code'])
X_test['business_code_enc'] = business_coder.transform(X_test['business_code'])
X_train[['business_code','business_code_enc']].head(10)
business_code | business_code_enc | |
---|---|---|
33991 | U001 | 1 |
37525 | U001 | 1 |
31588 | U001 | 1 |
41228 | U001 | 1 |
36162 | U001 | 1 |
11188 | U001 | 1 |
30086 | U001 | 1 |
46888 | U001 | 1 |
48415 | U001 | 1 |
48455 | U001 | 1 |
Create a function called "custom" for dropping given columns (here 'business_code') from the train, test and validation dataframes
- Note - Fill in the blank to complete the code
def custom(col, traindf=X_train, valdf=X_val, testdf=X_test):
    traindf.drop(col, axis=1, inplace=True)
    valdf.drop(col, axis=1, inplace=True)
    testdf.drop(col, axis=1, inplace=True)
    return traindf, valdf, testdf
Call the function by passing the column name that needs to be dropped from the train, test and validation dataframes. The updated dataframes are returned and stored back into X_train, X_val, X_test
- Note - Fill in the blank to complete the code
X_train,X_val,X_test=custom(['business_code'])
Manually replacing string values with numbers: here we replace specific prefixes in the customer numbers, e.g. 'CCCA' with 1, 'CCU' with 2 and so on, and convert the "cust_number" field to int type.
- We do this for all three dataframes as shown below. This is fully completed code; no need to modify anything here
X_train['cust_number'] = X_train['cust_number'].str.replace('CCCA',"1").str.replace('CCU',"2").str.replace('CC',"3").astype(int)
X_test['cust_number'] = X_test['cust_number'].str.replace('CCCA',"1").str.replace('CCU',"2").str.replace('CC',"3").astype(int)
X_val['cust_number'] = X_val['cust_number'].str.replace('CCCA',"1").str.replace('CCU',"2").str.replace('CC',"3").astype(int)
EncoderExt differs from LabelEncoder by handling unseen classes: an 'Unknown' label is added during fit, and transform maps any new item to the 'Unknown' class id.
- Note - Keep this code as it is, we will be using this later on.
# For encoding unseen labels
class EncoderExt(object):
    def __init__(self):
        self.label_encoder = LabelEncoder()

    def fit(self, data_list):
        self.label_encoder = self.label_encoder.fit(list(data_list) + ['Unknown'])
        self.classes_ = self.label_encoder.classes_
        return self

    def transform(self, data_list):
        new_data_list = list(data_list)
        for unique_item in np.unique(data_list):
            if unique_item not in self.label_encoder.classes_:
                new_data_list = ['Unknown' if x == unique_item else x for x in new_data_list]
        return self.label_encoder.transform(new_data_list)
- Note - Keep the code as it is, no need to change
label_encoder = EncoderExt()
label_encoder.fit(X_train['name_customer'])
X_train['name_customer_enc']=label_encoder.transform(X_train['name_customer'])
X_val['name_customer_enc']=label_encoder.transform(X_val['name_customer'])
X_test['name_customer_enc']=label_encoder.transform(X_test['name_customer'])
Now that the new column "name_customer_enc" has been created, drop the "name_customer" column from all three dataframes
- Note - Keep the code as it is, no need to change
X_train ,X_val, X_test = custom(['name_customer'])
- Note - Keep the code as it is, no need to change
label_encoder1 = EncoderExt()
label_encoder1.fit(X_train['cust_payment_terms'])
X_train['cust_payment_terms_enc']=label_encoder1.transform(X_train['cust_payment_terms'])
X_val['cust_payment_terms_enc']=label_encoder1.transform(X_val['cust_payment_terms'])
X_test['cust_payment_terms_enc']=label_encoder1.transform(X_test['cust_payment_terms'])
X_train ,X_val, X_test = custom(['cust_payment_terms'])
- Note - You are expected to use dtypes
X_train.dtypes
cust_number int32
buisness_year int32
doc_id float64
posting_date datetime64[ns]
due_in_date datetime64[ns]
baseline_create_date datetime64[ns]
converted_usd float64
business_code_enc int32
name_customer_enc int32
cust_payment_terms_enc int32
dtype: object
X_test.dtypes
cust_number int32
buisness_year float64
doc_id float64
posting_date datetime64[ns]
due_in_date datetime64[ns]
baseline_create_date datetime64[ns]
converted_usd float64
business_code_enc int32
name_customer_enc int32
cust_payment_terms_enc int32
dtype: object
X_val.dtypes
cust_number int32
buisness_year float64
doc_id float64
posting_date datetime64[ns]
due_in_date datetime64[ns]
baseline_create_date datetime64[ns]
converted_usd float64
business_code_enc int32
name_customer_enc int32
cust_payment_terms_enc int32
dtype: object
- Extract days from "posting_date" column and store it into a new column "day_of_postingdate" for train, test and validation dataset
- Extract months from "posting_date" column and store it into a new column "month_of_postingdate" for train, test and validation dataset
- Extract year from "posting_date" column and store it into a new column "year_of_postingdate" for train, test and validation dataset
- Note - You are supposed to use
- dt.day
- dt.month
- dt.year
X_train['day_of_postingdate'] = X_train['posting_date'].dt.day
X_train['month_of_postingdate'] = X_train['posting_date'].dt.month
X_train['year_of_postingdate'] = X_train['posting_date'].dt.year
X_val['day_of_postingdate'] = X_val['posting_date'].dt.day
X_val['month_of_postingdate'] = X_val['posting_date'].dt.month
X_val['year_of_postingdate'] = X_val['posting_date'].dt.year
X_test['day_of_postingdate'] = X_test['posting_date'].dt.day
X_test['month_of_postingdate'] = X_test['posting_date'].dt.month
X_test['year_of_postingdate'] = X_test['posting_date'].dt.year
X_train ,X_val, X_test = custom(['posting_date'])
- Extract days from "baseline_create_date" column and store it into a new column "day_of_createdate" for train, test and validation dataset
- Extract months from "baseline_create_date" column and store it into a new column "month_of_createdate" for train, test and validation dataset
- Extract year from "baseline_create_date" column and store it into a new column "year_of_createdate" for train, test and validation dataset
- Note - You are supposed to use
- dt.day
- dt.month
- dt.year
- Note - Do as shown in the previous two code cells
X_train["day_of_createdate"] = X_train["baseline_create_date"].dt.day
X_train["month_of_createdate"] = X_train["baseline_create_date"].dt.month
X_train["year_of_createdate"] = X_train["baseline_create_date"].dt.year
X_val["day_of_createdate"] = X_val["baseline_create_date"].dt.day
X_val["month_of_createdate"] = X_val["baseline_create_date"].dt.month
X_val["year_of_createdate"] = X_val["baseline_create_date"].dt.year
X_test["day_of_createdate"] = X_test["baseline_create_date"].dt.day
X_test["month_of_createdate"] = X_test["baseline_create_date"].dt.month
X_test["year_of_createdate"] = X_test["baseline_create_date"].dt.year
pass the "baseline_create_date" column into the Custom function for train, test and validation dataset
X_train ,X_val, X_test = custom(["baseline_create_date"])
- Extract days from "due_in_date" column and store it into a new column "day_of_due" for train, test and validation dataset
- Extract months from "due_in_date" column and store it into a new column "month_of_due" for train, test and validation dataset
- Extract year from "due_in_date" column and store it into a new column "year_of_due" for train, test and validation dataset
- Note - You are supposed to use
- dt.day
- dt.month
- dt.year
- Note - Do as shown in the previous code
X_train["day_of_due"] = X_train["due_in_date"].dt.day
X_train["month_of_due"] = X_train["due_in_date"].dt.month
X_train["year_of_due"] = X_train["due_in_date"].dt.year
X_val["day_of_due"] = X_val["due_in_date"].dt.day
X_val["month_of_due"] = X_val["due_in_date"].dt.month
X_val["year_of_due"] = X_val["due_in_date"].dt.year
X_test["day_of_due"] = X_test["due_in_date"].dt.day
X_test["month_of_due"] = X_test["due_in_date"].dt.month
X_test["year_of_due" ] = X_test["due_in_date"].dt.year
pass the "due_in_date" column into the Custom function for train, test and validation dataset
X_train ,X_val, X_test = custom(["due_in_date"])
- Note - all the column data types should now be numeric (int or float)
X_train.dtypes,X_val.dtypes,X_test.dtypes
(cust_number int32
buisness_year int32
doc_id float64
converted_usd float64
business_code_enc int32
name_customer_enc int32
cust_payment_terms_enc int32
day_of_postingdate int64
month_of_postingdate int64
year_of_postingdate int64
day_of_createdate int64
month_of_createdate int64
year_of_createdate int64
day_of_due int64
month_of_due int64
year_of_due int64
dtype: object,
cust_number int32
buisness_year float64
doc_id float64
converted_usd float64
business_code_enc int32
name_customer_enc int32
cust_payment_terms_enc int32
day_of_postingdate int64
month_of_postingdate int64
year_of_postingdate int64
day_of_createdate int64
month_of_createdate int64
year_of_createdate int64
day_of_due int64
month_of_due int64
year_of_due int64
dtype: object,
cust_number int32
buisness_year float64
doc_id float64
converted_usd float64
business_code_enc int32
name_customer_enc int32
cust_payment_terms_enc int32
day_of_postingdate int64
month_of_postingdate int64
year_of_postingdate int64
day_of_createdate int64
month_of_createdate int64
year_of_createdate int64
day_of_due int64
month_of_due int64
year_of_due int64
dtype: object)
- Calling the VarianceThreshold Function
- Note - Keep the code as it is, no need to change
from sklearn.feature_selection import VarianceThreshold
constant_filter = VarianceThreshold(threshold=0)
constant_filter.fit(X_train)
len(X_train.columns[constant_filter.get_support()])
16
- Note - Keep the code as it is, no need to change
constant_columns = [column for column in X_train.columns
                    if column not in X_train.columns[constant_filter.get_support()]]
print(len(constant_columns))
0
- Transpose the feature matrix
- Print the number of duplicated features
- Select the duplicated feature column names
- Note - Keep the code as it is, no need to change
x_train_T = X_train.T
print(x_train_T.duplicated().sum())
duplicated_columns = x_train_T[x_train_T.duplicated()].index.values
0
- We have created a function called handling_correlation, which returns the columns whose pairwise correlation with another column exceeds a given threshold (default 0.8)
- Note - Keep the code as it is, no need to change
def handling_correlation(X_train, threshold=0.8):
    corr_features = set()
    corr_matrix = X_train.corr()
    for i in range(len(corr_matrix.columns)):
        for j in range(i):
            if abs(corr_matrix.iloc[i, j]) > threshold:
                colname = corr_matrix.columns[i]
                corr_features.add(colname)
    return list(corr_features)
- Note : Here we are trying to find the highly correlated fields in X_train
- Please fill in the blanks to call the handling_correlation() function with a threshold value of 0.85
train=X_train.copy()
handling_correlation(train.copy(),0.85)
['month_of_createdate',
'year_of_due',
'month_of_due',
'day_of_createdate',
'year_of_createdate',
'year_of_postingdate']
- Note - Keep the code as it is, no need to change
colormap = plt.cm.RdBu
plt.figure(figsize=(14,12))
plt.title('Pearson Correlation of Features', y=1.05, size=20)
sns.heatmap(X_train.merge(y_train , on = X_train.index ).corr(),linewidths=0.1,vmax=1.0,
square=True, cmap='gist_rainbow_r', linecolor='white', annot=True)
<AxesSubplot:title={'center':'Pearson Correlation of Features'}>
- Note - Fill in the blanks to call the appropriate method
from sklearn.feature_selection import VarianceThreshold
sel = VarianceThreshold(0.8)
sel.fit(X_train)
VarianceThreshold(threshold=0.8)
sel.variances_
array([1.85032604e+15, 1.15570122e-01, 8.58729238e+16, 1.28820128e+09,
2.77491114e-01, 1.09872068e+06, 1.35298449e+02, 7.62974860e+01,
1.23349081e+01, 1.16416596e-01, 7.77242187e+01, 1.23410019e+01,
1.16522784e-01, 7.61016288e+01, 1.21711484e+01, 1.18812605e-01])
- 'year_of_createdate'
- 'year_of_due'
- 'day_of_createdate'
- 'year_of_postingdate'
- 'month_of_due'
- 'month_of_createdate'
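- Optional sketch (not in the original notebook) - the columns falling below the variance threshold can be pulled straight from the fitted selector; together with the correlation output above, these motivate the drop candidates listed here:
# Columns whose variance falls below the 0.8 threshold (the year columns,
# whose variances above are roughly 0.12).
low_variance_cols = X_train.columns[~sel.get_support()].tolist()
print(low_variance_cols)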
Now you need to compare different machine learning models and find out the best-performing one
- Linear Regression
- Decision Tree Regression
- Random Forest Regression
- Support Vector Regression
- Extreme Gradient Boost Regression
For each model, track:
- MSE
- R2
- Algorithm
MSE_Score = []
R2_Score = []
Algorithm = []
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
- Step 1 : Call the Linear Regression from the sklearn library
- Step 2 : Make an object of Linear Regression
- Step 3 : Fit the X_train and y_train dataframes into the object
- Step 4 : Predict the output by passing the X_test dataset into the predict function
- Note - Append the algorithm name into the Algorithm list for tracking purposes
from sklearn.linear_model import LinearRegression
Algorithm.append('LinearRegression')
regressor = LinearRegression()
regressor.fit(X_train, y_train)
predicted= regressor.predict(X_test)
Compute the following for the y_test and predicted data and store them in the respective lists for comparison:
- Mean Squared Error
- R-squared
MSE_Score.append(mean_squared_error(y_test, predicted))
R2_Score.append(r2_score(y_test, predicted))
predict_test= regressor.predict(X_val)
mean_squared_error(y_val, predict_test, squared=False)
497380.6218706041
for i in Algorithm, MSE_Score, R2_Score:
print(i,end=',')
['LinearRegression'],[288105958884.5953],[0.344296527869704],
- Step 1 : Call the Support Vector Regressor from the sklearn library
- Step 2 : Make an object of SVR
- Step 3 : Fit the X_train and y_train dataframes into the object
- Step 4 : Predict the output by passing the X_test dataset into the predict function
- Note - Append the algorithm name into the Algorithm list for tracking purposes
from sklearn.svm import SVR
Algorithm.append('Support Vector Regression')
regressor = SVR()
regressor.fit(X_train, y_train)
predicted= regressor.predict(X_test)
Compute the following for the "y_test" and "predicted" data and store them in the respective lists for comparison:
- Mean Squared Error
- R-squared
MSE_Score.append(mean_squared_error(y_test, predicted))
R2_Score.append(r2_score(y_test, predicted))
predict_test= regressor.predict(X_val)
mean_squared_error(y_val, predict_test, squared=False)
618688.683054989
for i in Algorithm, MSE_Score, R2_Score:
print(i,end=',')
['LinearRegression', 'Support Vector Regression'],[288105958884.5953, 442957027534.2891],[0.344296527869704, -0.008130696370260093],
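- The negative R2 above is expected: SVR is scale-sensitive and these features span very different ranges. A hedged variant (an addition, not in the original notebook) that standardizes the features first:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# SVR on standardized features; without scaling, kernel distances are
# dominated by large-magnitude columns such as doc_id. Whether this lifts
# R2 meaningfully on this data is an assumption to verify.
scaled_svr = make_pipeline(StandardScaler(), SVR())
scaled_svr.fit(X_train, y_train)
r2_score(y_test, scaled_svr.predict(X_test))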
- Step 1 : Call the Decision Tree Regressor from the sklearn library
- Step 2 : Make an object of Decision Tree
- Step 3 : Fit the X_train and y_train dataframes into the object
- Step 4 : Predict the output by passing the X_test dataset into the predict function
- Note - Append the algorithm name into the Algorithm list for tracking purposes
from sklearn.tree import DecisionTreeRegressor
Algorithm.append('Decision Tree Regressor')
regressor = DecisionTreeRegressor()
regressor.fit(X_train, y_train)
predicted= regressor.predict(X_test)
Compute the following for the y_test and predicted data and store them in the respective lists for comparison:
- Mean Squared Error
- R-squared
MSE_Score.append(mean_squared_error(y_test, predicted))
R2_Score.append(r2_score(y_test, predicted))
predict_test= regressor.predict(X_val)
mean_squared_error(y_val, predict_test, squared=False)
515974.497143672
for i in Algorithm, MSE_Score, R2_Score:
print(i,end=',')
['LinearRegression', 'Support Vector Regression', 'Decision Tree Regressor'],[288105958884.5953, 442957027534.2891, 187930847727.1828],[0.344296527869704, -0.008130696370260093, 0.572286148290102],
- Step 1 : Call the Random Forest Regressor from the sklearn library
- Step 2 : Make an object of Random Forest
- Step 3 : Fit the X_train and y_train dataframes into the object
- Step 4 : Predict the output by passing the X_test dataset into the predict function
- Note - Append the algorithm name into the Algorithm list for tracking purposes
from sklearn.ensemble import RandomForestRegressor
Algorithm.append('Random Forest Regressor')
regressor = RandomForestRegressor()
regressor.fit(X_train, y_train)
predicted= regressor.predict(X_test)
Compute the following for the y_test and predicted data and store them in the respective lists for comparison:
- Mean Squared Error
- R-squared
MSE_Score.append(mean_squared_error(y_test, predicted))
R2_Score.append(r2_score(y_test, predicted))
predict_test= regressor.predict(X_val)
mean_squared_error(y_val, predict_test, squared=False)
363378.5731367752
for i in Algorithm, MSE_Score, R2_Score:
print(i,end=',')
['LinearRegression', 'Support Vector Regression', 'Decision Tree Regressor', 'Random Forest Regressor'],[288105958884.5953, 442957027534.2891, 187930847727.1828, 110124866391.19873],[0.344296527869704, -0.008130696370260093, 0.572286148290102, 0.7493656238831272],
Extreme Gradient Boost Regression
- Step 1 : Call the XGBoost Regressor from the xgboost library
- Step 2 : Make an object of XGBoost
- Step 3 : Fit the X_train and y_train dataframes into the object
- Step 4 : Predict the output by passing the X_test dataset into the predict function
- Note - Append the algorithm name into the Algorithm list for tracking purposes
- Note - No need to change the code
import xgboost as xgb
Algorithm.append('XGB Regressor')
regressor = xgb.XGBRegressor()
regressor.fit(X_train, y_train)
predicted = regressor.predict(X_test)
Compute the following for the y_test and predicted data and store them in the respective lists for comparison:
- Mean Squared Error
- R-squared
MSE_Score.append(mean_squared_error(y_test, predicted))
R2_Score.append(r2_score(y_test, predicted))
predict_test= regressor.predict(X_val)
mean_squared_error(y_val, predict_test, squared=False)
342795.9447604792
for i in Algorithm, MSE_Score, R2_Score:
print(i,end=',')
['LinearRegression', 'Support Vector Regression', 'Decision Tree Regressor', 'Random Forest Regressor', 'XGB Regressor'],[288105958884.5953, 442957027534.2891, 187930847727.1828, 110124866391.19873, 95108767969.45203],[0.344296527869704, -0.008130696370260093, 0.572286148290102, 0.7493656238831272, 0.7835409249116406],
comparison = pd.DataFrame({'Algorithm': Algorithm, 'MSE_Score': MSE_Score, 'R2_Score': R2_Score})
comparison
Algorithm | MSE_Score | R2_Score | |
---|---|---|---|
0 | LinearRegression | 2.881060e+11 | 0.344297 |
1 | Support Vector Regression | 4.429570e+11 | -0.008131 |
2 | Decision Tree Regressor | 1.879308e+11 | 0.572286 |
3 | Random Forest Regressor | 1.101249e+11 | 0.749366 |
4 | XGB Regressor | 9.510877e+10 | 0.783541 |
- Step 1 - Fit X_train and y_train inside the model
- Step 2 - Predict the X_test dataset
- Step 3 - Predict the X_val dataset
- Note - No need to change the code
regressorfinal = xgb.XGBRegressor()
regressorfinal.fit(X_train, y_train)
predictedfinal = regressorfinal.predict(X_test)
predict_testfinal = regressorfinal.predict(X_val)
- Note - No need to change the code
mean_squared_error(y_test,predictedfinal,squared=False)
308397.0946190188
mean_squared_error(y_val,predict_testfinal,squared=False)
342795.9447604792
r2_score(y_test,predictedfinal)
0.7835409249116406
r2_score(y_val,predict_testfinal)
0.6917078094013065
regressorfinal.score(X_train,y_train)*100
94.84248330173806
regressorfinal.score(X_val,y_val)*100
69.17078094013065
regressorfinal.score(X_test,y_test)*100
78.35409249116407
- The XGBoost model has the best combination of prediction performance and processing time compared to the other algorithms. It is a highly flexible and versatile tool that can handle most regression, classification and ranking problems, as well as user-defined objective functions. It also outperforms several other well-known implementations of gradient tree boosting.
Now apply the same preprocessing steps to nulldata:
- Step 1 : Label Encoding
- Step 2 : Day, Month and Year extraction
- Step 3 : Change all the column data types into int64 or float64
- Step 4 : Drop the useless columns
nulldata.head()
business_code | cust_number | name_customer | clear_date | buisness_year | doc_id | posting_date | due_in_date | baseline_create_date | cust_payment_terms | converted_usd | |
---|---|---|---|---|---|---|---|---|---|---|---|
3 | CA02 | 0140105686 | SYSC llc | NaT | 2020.0 | 2.960623e+09 | 2020-03-30 | 2020-04-10 | 2020-03-31 | CA10 | 2309.79 |
7 | U001 | 0200744019 | TARG us | NaT | 2020.0 | 1.930659e+09 | 2020-03-19 | 2020-04-03 | 2020-03-19 | NAA8 | 11173.02 |
10 | U001 | 0200418007 | AM | NaT | 2020.0 | 1.930611e+09 | 2020-03-11 | 2020-03-26 | 2020-03-11 | NAA8 | 3525.59 |
14 | U001 | 0200739534 | OK systems | NaT | 2020.0 | 1.930788e+09 | 2020-04-15 | 2020-04-30 | 2020-04-15 | NAA8 | 121105.65 |
15 | U001 | 0200353024 | DECA corporation | NaT | 2020.0 | 1.930817e+09 | 2020-04-23 | 2020-04-26 | 2020-04-16 | NAM2 | 3726.06 |
nulldata.shape
(9681, 11)
nulldata.describe()
buisness_year | doc_id | converted_usd | |
---|---|---|---|
count | 9681.0 | 9.681000e+03 | 9681.000000 |
mean | 2020.0 | 2.006165e+09 | 32065.681125 |
std | 0.0 | 2.673629e+08 | 35419.613688 |
min | 2020.0 | 1.930535e+09 | 0.720000 |
25% | 2020.0 | 1.930658e+09 | 5607.190000 |
50% | 2020.0 | 1.930731e+09 | 19024.190000 |
75% | 2020.0 | 1.930818e+09 | 47752.640000 |
max | 2020.0 | 2.960636e+09 | 653644.800000 |
nulldata.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 9681 entries, 3 to 49995
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 business_code 9681 non-null object
1 cust_number 9681 non-null object
2 name_customer 9681 non-null object
3 clear_date 0 non-null datetime64[ns]
4 buisness_year 9681 non-null float64
5 doc_id 9681 non-null float64
6 posting_date 9681 non-null datetime64[ns]
7 due_in_date 9681 non-null datetime64[ns]
8 baseline_create_date 9681 non-null datetime64[ns]
9 cust_payment_terms 9681 non-null object
10 converted_usd 9681 non-null float64
dtypes: datetime64[ns](4), float64(3), object(4)
memory usage: 907.6+ KB
nulldata1=nulldata.copy()
- Note - you are expected to fit "business_code" as it is a categorical variable
- Note - No need to change the code
from sklearn.preprocessing import LabelEncoder
business_codern = LabelEncoder()
business_codern.fit(nulldata['business_code'])
nulldata['business_code_enc'] = business_codern.transform(nulldata['business_code'])
- Note - No need to change the code
nulldata['cust_number'] = nulldata['cust_number'].str.replace('CCCA',"1").str.replace('CCU',"2").str.replace('CC',"3").astype(int)
You need to extract day, month and year from the "clear_date", "posting_date", "due_in_date" and "baseline_create_date" columns
- Note - You are supposed to use:
- dt.day
- dt.month
- dt.year
nulldata['day_of_postingdate'] = nulldata['posting_date'].dt.day
nulldata['month_of_postingdate'] = nulldata['posting_date'].dt.month
nulldata['year_of_postingdate'] = nulldata['posting_date'].dt.year
nulldata['day_of_cleardate'] = nulldata['clear_date'].dt.day
nulldata['month_of_cleardate'] = nulldata['clear_date'].dt.month
nulldata['year_of_cleardate'] = nulldata['clear_date'].dt.year
nulldata['day_of_createdate'] = nulldata['baseline_create_date'].dt.day
nulldata['month_of_createdate'] = nulldata['baseline_create_date'].dt.month
nulldata['year_of_createdate'] = nulldata['baseline_create_date'].dt.year
nulldata['day_of_due'] = nulldata['due_in_date'].dt.day
nulldata['month_of_due'] = nulldata['due_in_date'].dt.month
nulldata['year_of_due'] = nulldata['due_in_date'].dt.year
Label encode the following columns:
- 'cust_payment_terms' and store into 'cust_payment_terms_enc'
- 'business_code' and store into 'business_code_enc'
- 'name_customer' and store into 'name_customer_enc'
- Note - No need to change the code
nulldata['cust_payment_terms_enc'] = label_encoder1.transform(nulldata['cust_payment_terms'])
# business_code must be encoded with the encoder fitted on business codes
# (the original cell mistakenly reused label_encoder1, the payment-terms encoder)
nulldata['business_code_enc'] = business_codern.transform(nulldata['business_code'])
nulldata['name_customer_enc'] = label_encoder.transform(nulldata['name_customer'])
nulldata.dtypes
business_code object
cust_number int32
name_customer object
clear_date datetime64[ns]
buisness_year float64
doc_id float64
posting_date datetime64[ns]
due_in_date datetime64[ns]
baseline_create_date datetime64[ns]
cust_payment_terms object
converted_usd float64
business_code_enc int32
day_of_postingdate int64
month_of_postingdate int64
year_of_postingdate int64
day_of_cleardate float64
month_of_cleardate float64
year_of_cleardate float64
day_of_createdate int64
month_of_createdate int64
year_of_createdate int64
day_of_due int64
month_of_due int64
year_of_due int64
cust_payment_terms_enc int32
name_customer_enc int32
dtype: object
Drop the following columns from nulldata:
- 'business_code'
- "baseline_create_date"
- "due_in_date"
- "posting_date"
- "name_customer"
- "clear_date"
- "cust_payment_terms"
- 'day_of_cleardate'
- "month_of_cleardate"
- "year_of_cleardate"
null_col_drop = ['business_code',
                 "baseline_create_date",
                 "due_in_date",
                 "posting_date",
                 "name_customer",
                 "clear_date",
                 "cust_payment_terms",
                 'day_of_cleardate',
                 "month_of_cleardate",
                 "year_of_cleardate"]
nulldata.drop(null_col_drop, axis=1,inplace=True)
nulldata.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 9681 entries, 3 to 49995
Data columns (total 16 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 cust_number 9681 non-null int32
1 buisness_year 9681 non-null float64
2 doc_id 9681 non-null float64
3 converted_usd 9681 non-null float64
4 business_code_enc 9681 non-null int32
5 day_of_postingdate 9681 non-null int64
6 month_of_postingdate 9681 non-null int64
7 year_of_postingdate 9681 non-null int64
8 day_of_createdate 9681 non-null int64
9 month_of_createdate 9681 non-null int64
10 year_of_createdate 9681 non-null int64
11 day_of_due 9681 non-null int64
12 month_of_due 9681 non-null int64
13 year_of_due 9681 non-null int64
14 cust_payment_terms_enc 9681 non-null int32
15 name_customer_enc 9681 non-null int32
dtypes: float64(3), int32(4), int64(9)
memory usage: 1.1 MB
- Use the info() method to compare the two dataframes
print(f'{nulldata.info()} {X_test.info}')  # note: X_test.info is missing its parentheses, so the f-string prints the bound method (and with it the full-frame repr) shown below
<class 'pandas.core.frame.DataFrame'>
Int64Index: 9681 entries, 3 to 49995
Data columns (total 16 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 cust_number 9681 non-null int32
1 buisness_year 9681 non-null float64
2 doc_id 9681 non-null float64
3 converted_usd 9681 non-null float64
4 business_code_enc 9681 non-null int32
5 day_of_postingdate 9681 non-null int64
6 month_of_postingdate 9681 non-null int64
7 year_of_postingdate 9681 non-null int64
8 day_of_createdate 9681 non-null int64
9 month_of_createdate 9681 non-null int64
10 year_of_createdate 9681 non-null int64
11 day_of_due 9681 non-null int64
12 month_of_due 9681 non-null int64
13 year_of_due 9681 non-null int64
14 cust_payment_terms_enc 9681 non-null int32
15 name_customer_enc 9681 non-null int32
dtypes: float64(3), int32(4), int64(9)
memory usage: 1.1 MB
None <bound method DataFrame.info of cust_number buisness_year doc_id converted_usd \
25235 200769623 2019.0 1.929519e+09 6492.52
20446 200771157 2020.0 1.930470e+09 15008.50
32222 200785971 2019.0 1.929519e+09 352.12
14292 200744019 2019.0 1.928833e+09 59624.53
41577 200763814 2019.0 1.929567e+09 235.54
... ... ... ... ...
37783 200726979 2019.0 1.929656e+09 59.16
49589 200803720 2020.0 1.930445e+09 8518.44
49697 200708411 2019.0 1.929742e+09 54128.57
19265 200704045 2019.0 1.928577e+09 53454.66
661 200772670 2019.0 1.929791e+09 15839.46
business_code_enc name_customer_enc cust_payment_terms_enc \
25235 1 3035 33
20446 1 3083 20
32222 1 2760 20
14292 1 2833 20
41577 1 2773 28
... ... ... ...
37783 1 330 20
49589 1 725 35
49697 1 2494 20
19265 1 2276 20
661 1 194 47
day_of_postingdate month_of_postingdate year_of_postingdate \
25235 21 6 2019
20446 3 2 2020
32222 21 6 2019
14292 20 2 2019
41577 9 7 2019
... ... ... ...
37783 25 7 2019
49589 27 1 2020
49697 14 8 2019
19265 5 1 2019
661 25 8 2019
day_of_createdate month_of_createdate year_of_createdate day_of_due \
25235 21 6 2019 6
20446 3 2 2020 18
32222 21 6 2019 6
14292 20 2 2019 7
41577 9 7 2019 29
... ... ... ... ...
37783 25 7 2019 9
49589 16 1 2020 26
49697 14 8 2019 29
19265 5 1 2019 20
661 25 8 2019 9
month_of_due year_of_due
25235 7 2019
20446 2 2020
32222 7 2019
14292 3 2019
41577 7 2019
... ... ...
37783 8 2019
49589 1 2020
49697 8 2019
19265 1 2019
661 9 2019
[7832 rows x 16 columns]>
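Rather than eyeballing the two info() dumps, the column mismatch discussed next can be checked programmatically:
# Index.equals is order-sensitive; comparing sets ignores order.
print(nulldata.columns.equals(X_test.columns))       # False: order differs
print(set(nulldata.columns) == set(X_test.columns))  # True: same names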
You may have noticed that there is a mismatch in the column sequence when comparing the two dataframes.
- Note - Before it can be fed into the machine learning model, the "nulldata" columns must be reordered to match the "X_test" dataframe
- Display all the columns of the X_test dataframe
- Display all the columns of the nulldata dataframe
- Store nulldata with the new column sequence in a new dataframe
- Note - The code is given below, no need to change
X_test.columns
Index(['cust_number', 'buisness_year', 'doc_id', 'converted_usd',
'business_code_enc', 'name_customer_enc', 'cust_payment_terms_enc',
'day_of_postingdate', 'month_of_postingdate', 'year_of_postingdate',
'day_of_createdate', 'month_of_createdate', 'year_of_createdate',
'day_of_due', 'month_of_due', 'year_of_due'],
dtype='object')
nulldata.columns
Index(['cust_number', 'buisness_year', 'doc_id', 'converted_usd',
'business_code_enc', 'day_of_postingdate', 'month_of_postingdate',
'year_of_postingdate', 'day_of_createdate', 'month_of_createdate',
'year_of_createdate', 'day_of_due', 'month_of_due', 'year_of_due',
'cust_payment_terms_enc', 'name_customer_enc'],
dtype='object')
nulldata2=nulldata[['cust_number', 'buisness_year', 'doc_id', 'converted_usd',
'business_code_enc', 'name_customer_enc', 'cust_payment_terms_enc',
'day_of_postingdate', 'month_of_postingdate', 'year_of_postingdate',
'day_of_createdate', 'month_of_createdate', 'year_of_createdate',
'day_of_due', 'month_of_due', 'year_of_due']]
nulldata2.head()
| | cust_number | buisness_year | doc_id | converted_usd | business_code_enc | name_customer_enc | cust_payment_terms_enc | day_of_postingdate | month_of_postingdate | year_of_postingdate | day_of_createdate | month_of_createdate | year_of_createdate | day_of_due | month_of_due | year_of_due |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3 | 140105686 | 2020.0 | 2.960623e+09 | 2309.79 | 67 | 2754 | 4 | 30 | 3 | 2020 | 31 | 3 | 2020 | 10 | 4 | 2020 |
| 7 | 200744019 | 2020.0 | 1.930659e+09 | 11173.02 | 67 | 2841 | 20 | 19 | 3 | 2020 | 19 | 3 | 2020 | 3 | 4 | 2020 |
| 10 | 200418007 | 2020.0 | 1.930611e+09 | 3525.59 | 67 | 92 | 20 | 11 | 3 | 2020 | 11 | 3 | 2020 | 26 | 3 | 2020 |
| 14 | 200739534 | 2020.0 | 1.930788e+09 | 121105.65 | 67 | 2051 | 20 | 15 | 4 | 2020 | 15 | 4 | 2020 | 30 | 4 | 2020 |
| 15 | 200353024 | 2020.0 | 1.930817e+09 | 3726.06 | 67 | 732 | 35 | 23 | 4 | 2020 | 16 | 4 | 2020 | 26 | 4 | 2020 |
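As an aside, the same reordering can be derived from X_test itself instead of hard-coding the sixteen names, which keeps the two frames in sync if the feature list ever changes:
# Equivalent one-liner: take nulldata's columns in X_test's order.
nulldata2 = nulldata[X_test.columns]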
final_result=regressorfinal.predict(nulldata2)
- Note - No need to change the code
final_result = pd.Series(final_result,name='avg_delay')
final_result
0 986812.937500
1 182130.406250
2 46762.542969
3 179988.281250
4 -653035.625000
...
9676 -45526.519531
9677 -4528.582031
9678 -189695.156250
9679 26729.308594
9680 -23034.001953
Name: avg_delay, Length: 9681, dtype: float32
Now merge this final_result series with the backup of the "nulldata" dataframe (nulldata1) created in an earlier step.
nulldata1.reset_index(drop=True, inplace=True)
Final = nulldata1.merge(final_result, on=nulldata.index)
Final
| | key_0 | business_code | cust_number | name_customer | clear_date | buisness_year | doc_id | posting_date | due_in_date | baseline_create_date | cust_payment_terms | converted_usd | avg_delay |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | CA02 | 0140105686 | SYSC llc | NaT | 2020.0 | 2.960623e+09 | 2020-03-30 | 2020-04-10 | 2020-03-31 | CA10 | 2309.79 | 986812.937500 |
| 1 | 7 | U001 | 0200744019 | TARG us | NaT | 2020.0 | 1.930659e+09 | 2020-03-19 | 2020-04-03 | 2020-03-19 | NAA8 | 11173.02 | 182130.406250 |
| 2 | 10 | U001 | 0200418007 | AM | NaT | 2020.0 | 1.930611e+09 | 2020-03-11 | 2020-03-26 | 2020-03-11 | NAA8 | 3525.59 | 46762.542969 |
| 3 | 14 | U001 | 0200739534 | OK systems | NaT | 2020.0 | 1.930788e+09 | 2020-04-15 | 2020-04-30 | 2020-04-15 | NAA8 | 121105.65 | 179988.281250 |
| 4 | 15 | U001 | 0200353024 | DECA corporation | NaT | 2020.0 | 1.930817e+09 | 2020-04-23 | 2020-04-26 | 2020-04-16 | NAM2 | 3726.06 | -653035.625000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 9676 | 49975 | U001 | 0200769623 | WAL-MAR in | NaT | 2020.0 | 1.930625e+09 | 2020-03-10 | 2020-03-25 | 2020-03-10 | NAH4 | 13114.99 | -45526.519531 |
| 9677 | 49980 | U001 | 0200769623 | WAL-MAR corporation | NaT | 2020.0 | 1.930851e+09 | 2020-05-03 | 2020-05-18 | 2020-05-03 | NAH4 | 8899.71 | -4528.582031 |
| 9678 | 49982 | U001 | 0200148860 | DOLLA co | NaT | 2020.0 | 1.930638e+09 | 2020-03-11 | 2020-03-26 | 2020-03-11 | NAA8 | 4967.06 | -189695.156250 |
| 9679 | 49992 | U001 | 0200900909 | SYSCO co | NaT | 2020.0 | 1.930702e+09 | 2020-03-25 | 2020-04-09 | 2020-03-25 | NAA8 | 1998.64 | 26729.308594 |
| 9680 | 49995 | U001 | 0200561861 | CO corporation | NaT | 2020.0 | 1.930797e+09 | 2020-04-21 | 2020-05-06 | 2020-04-21 | NAA8 | 3187.86 | -23034.001953 |
9681 rows × 13 columns
Final.shape
(9681, 13)
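Since nulldata1's index was just reset and both objects have 9,681 rows, a positional pd.concat would attach the predictions equivalently, without producing the key_0 column (Final_alt is a hypothetical name):
# Alternative to merge: align by position after the reset_index above.
Final_alt = pd.concat([nulldata1, final_result], axis=1)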
- Convert "due_in_date" into datetime format
- Convert "avg_delay" into datetime format
- Create a new column "clear_date" and store the sum of "due_in_date" and "avg_delay"
- display the new "clear_date" column
- Note - Code is given below, no need to change
Final['clear_date'] = pd.to_datetime(Final['due_in_date']) + pd.to_timedelta(Final['avg_delay'], unit='s')
Final['clear_date']
0 2020-04-21 10:06:52.937500000
1 2020-04-05 02:35:30.406250000
2 2020-03-26 12:59:22.542968750
3 2020-05-02 01:59:48.281250000
4 2020-04-18 10:36:04.375000000
...
9676 2020-03-24 11:21:13.480468750
9677 2020-05-17 22:44:31.417968750
9678 2020-03-23 19:18:24.843750000
9679 2020-04-09 07:25:29.308593750
9680 2020-05-05 17:36:05.998046875
Name: clear_date, Length: 9681, dtype: datetime64[ns]
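As a sanity check on the arithmetic: the first prediction, 986812.9375 seconds, is about 11.4 days, which carries the 2020-04-10 due date to the 2020-04-21 timestamp shown above:
# 986812.9375 s = 11 days + 36412.9375 s = 11 days 10:06:52.9375
print(pd.to_datetime('2020-04-10') + pd.to_timedelta(986812.9375, unit='s'))
# 2020-04-21 10:06:52.937500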
- Note - Convert "avg_delay" from seconds to whole days with the formula avg_delay // (24 * 3600)
- Note - Full code is given for this, no need to change
Final['avg_delay'] = Final.apply(lambda row: row.avg_delay//(24 * 3600), axis = 1)
Final['avg_delay']
0 11.0
1 2.0
2 0.0
3 2.0
4 -8.0
...
9676 -1.0
9677 -1.0
9678 -3.0
9679 0.0
9680 -1.0
Name: avg_delay, Length: 9681, dtype: float64
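The row-wise apply works, but floor division is vectorised in pandas, so the same seconds-to-days conversion can be written in one line:
# Vectorised equivalent of the apply() above.
Final['avg_delay'] = Final['avg_delay'] // (24 * 3600)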
Next, you need to perform binning:
- Create a list of bins, i.e. bins = [0, 15, 30, 45, 60, 100]
- Create a list of labels, i.e. labels = ['0-15', '16-30', '31-45', '46-60', 'Greater than 60']
- Perform the binning with the cut() function on the "Final" dataframe
- Please fill in the first two lines of the code
bins = [0, 15, 30, 45, 60, 100]
labels = ['0-15', '16-30', '31-45', '46-60', 'Greater than 60']
Final['Aging Bucket'] = pd.cut(Final['avg_delay'], bins=bins, labels=labels, right=False)
Final.drop(['key_0','avg_delay'],axis=1,inplace=True)
Final['Aging Bucket'].value_counts()
0-15 4908
16-30 120
31-45 70
46-60 27
Greater than 60 7
Name: Aging Bucket, dtype: int64
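Note that right=False makes each bin closed on the left and open on the right, and any value outside [0, 100), including the negative delays (payments predicted to arrive early), gets NaN; that is why the counts above sum to 5,132 rather than 9,681. A small demonstration:
# -1 and 100 fall outside the edges and map to NaN.
demo = pd.cut(pd.Series([-1.0, 0.0, 15.0, 99.0, 100.0]),
              bins=bins, labels=labels, right=False)
print(demo.tolist())  # [nan, '0-15', '16-30', 'Greater than 60', nan]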
Final.head  # note: head is missing its parentheses, so pandas prints the bound method with the full-frame repr below; Final.head() would show just the first five rows
<bound method NDFrame.head of business_code cust_number name_customer \
0 CA02 0140105686 SYSC llc
1 U001 0200744019 TARG us
2 U001 0200418007 AM
3 U001 0200739534 OK systems
4 U001 0200353024 DECA corporation
... ... ... ...
9676 U001 0200769623 WAL-MAR in
9677 U001 0200769623 WAL-MAR corporation
9678 U001 0200148860 DOLLA co
9679 U001 0200900909 SYSCO co
9680 U001 0200561861 CO corporation
clear_date buisness_year doc_id posting_date \
0 2020-04-21 10:06:52.937500000 2020.0 2.960623e+09 2020-03-30
1 2020-04-05 02:35:30.406250000 2020.0 1.930659e+09 2020-03-19
2 2020-03-26 12:59:22.542968750 2020.0 1.930611e+09 2020-03-11
3 2020-05-02 01:59:48.281250000 2020.0 1.930788e+09 2020-04-15
4 2020-04-18 10:36:04.375000000 2020.0 1.930817e+09 2020-04-23
... ... ... ... ...
9676 2020-03-24 11:21:13.480468750 2020.0 1.930625e+09 2020-03-10
9677 2020-05-17 22:44:31.417968750 2020.0 1.930851e+09 2020-05-03
9678 2020-03-23 19:18:24.843750000 2020.0 1.930638e+09 2020-03-11
9679 2020-04-09 07:25:29.308593750 2020.0 1.930702e+09 2020-03-25
9680 2020-05-05 17:36:05.998046875 2020.0 1.930797e+09 2020-04-21
due_in_date baseline_create_date cust_payment_terms converted_usd \
0 2020-04-10 2020-03-31 CA10 2309.79
1 2020-04-03 2020-03-19 NAA8 11173.02
2 2020-03-26 2020-03-11 NAA8 3525.59
3 2020-04-30 2020-04-15 NAA8 121105.65
4 2020-04-26 2020-04-16 NAM2 3726.06
... ... ... ... ...
9676 2020-03-25 2020-03-10 NAH4 13114.99
9677 2020-05-18 2020-05-03 NAH4 8899.71
9678 2020-03-26 2020-03-11 NAA8 4967.06
9679 2020-04-09 2020-03-25 NAA8 1998.64
9680 2020-05-06 2020-04-21 NAA8 3187.86
Aging Bucket
0 0-15
1 0-15
2 0-15
3 0-15
4 NaN
... ...
9676 NaN
9677 NaN
9678 NaN
9679 0-15
9680 NaN
[9681 rows x 12 columns]>
Final.to_csv('HRC30094W_AMIT_PATNAIK.csv')
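By default to_csv also writes the row index as an unnamed first column; if the deliverable should contain only the twelve data columns, pass index=False:
# Optional: omit the pandas index from the exported file.
Final.to_csv('HRC30094W_AMIT_PATNAIK.csv', index=False)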