Loading¶
In [1]:
# read in the employee data
import pandas as pd
employee_data = pd.read_csv('../Data/employee_data.csv')
employee_data.head()
Out[1]:
| | EmployeeID | Age | Gender | DistanceFromHome | JobLevel | Department | MonthlyIncome | PerformanceRating | JobSatisfaction | Attrition |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1001 | 41 | Female | 1 | 2 | Sales | 5993 | 3 | 4 | Yes |
| 1 | 1002 | 49 | Male | 8 | 2 | Research & Development | 5130 | 4 | 2 | No |
| 2 | 1004 | 37 | Male | 2 | 1 | Research & Development | 2090 | 3 | 3 | Yes |
| 3 | 1005 | 33 | Female | 3 | 1 | Research & Development | 2909 | 3 | 3 | No |
| 4 | 1007 | 27 | Male | 2 | 1 | Research & Development | 3468 | 3 | 2 | No |
In [2]:
# note the number of rows and columns
employee_data.shape
Out[2]:
(1470, 10)
In [3]:
# view the data types of all the columns
employee_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1470 entries, 0 to 1469
Data columns (total 10 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   EmployeeID         1470 non-null   int64
 1   Age                1470 non-null   int64
 2   Gender             1470 non-null   object
 3   DistanceFromHome   1470 non-null   int64
 4   JobLevel           1470 non-null   int64
 5   Department         1470 non-null   object
 6   MonthlyIncome      1470 non-null   int64
 7   PerformanceRating  1470 non-null   int64
 8   JobSatisfaction    1470 non-null   int64
 9   Attrition          1470 non-null   object
dtypes: int64(7), object(3)
memory usage: 115.0+ KB
In [4]:
# look at the numeric columns
employee_data.dtypes[employee_data.dtypes == 'int64']
Out[4]:
EmployeeID           int64
Age                  int64
DistanceFromHome     int64
JobLevel             int64
MonthlyIncome        int64
PerformanceRating    int64
JobSatisfaction      int64
dtype: object
In [5]:
# look at the non-numeric columns
employee_data.dtypes[employee_data.dtypes != 'int64']
Out[5]:
Gender        object
Department    object
Attrition     object
dtype: object
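An equivalent, more direct filter (a sketch, not part of the original notebook) uses select_dtypes:

# select_dtypes filters columns by dtype without indexing into .dtypes
employee_data.select_dtypes(exclude='number').head()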
In [6]:
# create a copy of the dataframe
data = employee_data.copy()
data.head()
Out[6]:
| | EmployeeID | Age | Gender | DistanceFromHome | JobLevel | Department | MonthlyIncome | PerformanceRating | JobSatisfaction | Attrition |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1001 | 41 | Female | 1 | 2 | Sales | 5993 | 3 | 4 | Yes |
| 1 | 1002 | 49 | Male | 8 | 2 | Research & Development | 5130 | 4 | 2 | No |
| 2 | 1004 | 37 | Male | 2 | 1 | Research & Development | 2090 | 3 | 3 | Yes |
| 3 | 1005 | 33 | Female | 3 | 1 | Research & Development | 2909 | 3 | 3 | No |
| 4 | 1007 | 27 | Male | 2 | 1 | Research & Development | 3468 | 3 | 2 | No |
In [7]:
# look at the gender values
data.Gender.value_counts()
Out[7]:
Gender
Male      882
Female    588
Name: count, dtype: int64
In [8]:
# change gender into a numeric field using np.where (Female=1, Male=0)
import numpy as np
data.Gender = np.where(data.Gender == 'Female', 1, 0)
data.Gender.head()
Out[8]:
0    1
1    0
2    0
3    1
4    0
Name: Gender, dtype: int64
In [9]:
# look at the attrition values
data.Attrition.value_counts()
Out[9]:
Attrition
No     1233
Yes     237
Name: count, dtype: int64
In [10]:
# change attrition to a numeric field using np.where (Yes=1, No=0)
data.Attrition = np.where(data.Attrition == 'Yes', 1, 0)
data.Attrition.head()
Out[10]:
0    1
1    0
2    1
3    0
4    0
Name: Attrition, dtype: int64
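An equivalent mapping-based encoding (a sketch applied to a fresh copy, so it does not clash with the np.where version already run above):

# .map spells out the label-to-number mapping explicitly
alt = employee_data.copy()
alt['Attrition'] = alt['Attrition'].map({'Yes': 1, 'No': 0})
alt['Attrition'].head()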
In [11]:
# look at the department values
data.Department.value_counts()
Out[11]:
Department
Research & Development    961
Sales                     446
Human Resources            63
Name: count, dtype: int64
In [12]:
# change department to a numeric field via dummy variables
pd.get_dummies(data.Department).astype('int').head()
Out[12]:
| | Human Resources | Research & Development | Sales |
|---|---|---|---|
| 0 | 0 | 0 | 1 |
| 1 | 0 | 1 | 0 |
| 2 | 0 | 1 | 0 |
| 3 | 0 | 1 | 0 |
| 4 | 0 | 1 | 0 |
In [13]:
# attach the dummy columns to the dataframe and drop the original Department column
data = pd.concat([data, pd.get_dummies(data.Department).astype('int')], axis=1)
data.drop('Department', axis=1, inplace=True)
data.head()
Out[13]:
| | EmployeeID | Age | Gender | DistanceFromHome | JobLevel | MonthlyIncome | PerformanceRating | JobSatisfaction | Attrition | Human Resources | Research & Development | Sales |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1001 | 41 | 1 | 1 | 2 | 5993 | 3 | 4 | 1 | 0 | 0 | 1 |
| 1 | 1002 | 49 | 0 | 8 | 2 | 5130 | 4 | 2 | 0 | 0 | 1 | 0 |
| 2 | 1004 | 37 | 0 | 2 | 1 | 2090 | 3 | 3 | 1 | 0 | 1 | 0 |
| 3 | 1005 | 33 | 1 | 3 | 1 | 2909 | 3 | 3 | 0 | 0 | 1 | 0 |
| 4 | 1007 | 27 | 0 | 2 | 1 | 3468 | 3 | 2 | 0 | 0 | 1 | 0 |
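The encode-concat-drop steps above can also be collapsed into a single call; a minimal sketch, starting again from the raw data (the data_alt name is illustrative):

# get_dummies on the whole frame encodes Department and drops the original column in one step
data_alt = pd.get_dummies(employee_data, columns=['Department'], dtype=int)
data_alt.head()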
In [14]:
# view the cleaned dataframe
data.head()
Out[14]:
| | EmployeeID | Age | Gender | DistanceFromHome | JobLevel | MonthlyIncome | PerformanceRating | JobSatisfaction | Attrition | Human Resources | Research & Development | Sales |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1001 | 41 | 1 | 1 | 2 | 5993 | 3 | 4 | 1 | 0 | 0 | 1 |
| 1 | 1002 | 49 | 0 | 8 | 2 | 5130 | 4 | 2 | 0 | 0 | 1 | 0 |
| 2 | 1004 | 37 | 0 | 2 | 1 | 2090 | 3 | 3 | 1 | 0 | 1 | 0 |
| 3 | 1005 | 33 | 1 | 3 | 1 | 2909 | 3 | 3 | 0 | 0 | 1 | 0 |
| 4 | 1007 | 27 | 0 | 2 | 1 | 3468 | 3 | 2 | 0 | 0 | 1 | 0 |
In [15]:
# note the number of rows and columns
data.shape
Out[15]:
(1470, 12)
In [16]:
# what is the overall attrition rate, i.e. what percent of employees leave the company?
data.Attrition.mean() # about 16% of employees leave the company
Out[16]:
0.16122448979591836
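The same figure can be cross-checked with value_counts (a quick sketch, not part of the original notebook):

# normalize=True returns proportions of stayers (0) and leavers (1)
data.Attrition.value_counts(normalize=True)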
In [17]:
# create a summary table to show the mean of each column for employees who stay vs leave - what are your takeaways?
data.groupby('Attrition').mean()
Out[17]:
| Attrition | EmployeeID | Age | Gender | DistanceFromHome | JobLevel | MonthlyIncome | PerformanceRating | JobSatisfaction | Human Resources | Research & Development | Sales |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2027.656123 | 37.561233 | 0.406326 | 8.915653 | 2.145985 | 6832.739659 | 3.153285 | 2.778589 | 0.041363 | 0.671533 | 0.287105 |
| 1 | 2010.345992 | 33.607595 | 0.367089 | 10.632911 | 1.637131 | 4787.092827 | 3.156118 | 2.468354 | 0.050633 | 0.561181 | 0.388186 |
Insight: Employees who stay tend to be older, are more likely to be female, live closer to the office, hold more senior and better-paid roles, report higher job satisfaction, and are more likely to work in Research & Development.
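To put numbers on those gaps, a short sketch (not from the original notebook) subtracts the stayers' means from the leavers' means:

# positive values are higher for leavers, negative values higher for stayers
summary = data.groupby('Attrition').mean()
(summary.loc[1] - summary.loc[0]).sort_values()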
In [18]:
# create a new dataframe without the attrition column for us to model on
df = data.drop('Attrition', axis=1)
df.head()
Out[18]:
| | EmployeeID | Age | Gender | DistanceFromHome | JobLevel | MonthlyIncome | PerformanceRating | JobSatisfaction | Human Resources | Research & Development | Sales |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1001 | 41 | 1 | 1 | 2 | 5993 | 3 | 4 | 0 | 0 | 1 |
| 1 | 1002 | 49 | 0 | 8 | 2 | 5130 | 4 | 2 | 0 | 1 | 0 |
| 2 | 1004 | 37 | 0 | 2 | 1 | 2090 | 3 | 3 | 0 | 1 | 0 |
| 3 | 1005 | 33 | 1 | 3 | 1 | 2909 | 3 | 3 | 0 | 1 | 0 |
| 4 | 1007 | 27 | 0 | 2 | 1 | 3468 | 3 | 2 | 0 | 1 | 0 |
In [19]:
# drop the employee ID column as well before modeling
df = df.drop(columns='EmployeeID')
df.head()
Out[19]:
| | Age | Gender | DistanceFromHome | JobLevel | MonthlyIncome | PerformanceRating | JobSatisfaction | Human Resources | Research & Development | Sales |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 41 | 1 | 1 | 2 | 5993 | 3 | 4 | 0 | 0 | 1 |
| 1 | 49 | 0 | 8 | 2 | 5130 | 4 | 2 | 0 | 1 | 0 |
| 2 | 37 | 0 | 2 | 1 | 2090 | 3 | 3 | 0 | 1 | 0 |
| 3 | 33 | 1 | 3 | 1 | 2909 | 3 | 3 | 0 | 1 | 0 |
| 4 | 27 | 0 | 2 | 1 | 3468 | 3 | 2 | 0 | 1 | 0 |
In [20]:
# note the number of rows and columns in the dataframe
df.shape
Out[20]:
(1470, 10)
In [21]:
# create a pair plot comparing all the columns of the dataframe - what observations do you notice?
import seaborn as sns
sns.pairplot(df);
OBSERVATIONS:
- Age and gender appear fairly evenly distributed
- Most employees live relatively close to the office
- Job level and income are correlated (checked numerically below)
- There are fewer high performers
- Most people are happy with their jobs
- There are far fewer people in HR than in the other departments
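The job level and income relationship called out above can be confirmed numerically; a minimal check (not part of the original notebook):

# Pearson correlation between job level and monthly income
df[['JobLevel', 'MonthlyIncome']].corr()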
K-Means Clustering¶
a. Standardize the data¶
In [22]:
# scale the data using standardization
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df_scaled = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)
df_scaled.head()
Out[22]:
| | Age | Gender | DistanceFromHome | JobLevel | MonthlyIncome | PerformanceRating | JobSatisfaction | Human Resources | Research & Development | Sales |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.446350 | 1.224745 | -1.010909 | -0.057788 | -0.108350 | -0.426230 | 1.153254 | -0.211604 | -1.374051 | 1.515244 |
| 1 | 1.322365 | -0.816497 | -0.147150 | -0.057788 | -0.291719 | 2.346151 | -0.660853 | -0.211604 | 0.727775 | -0.659960 |
| 2 | 0.008343 | -0.816497 | -0.887515 | -0.961486 | -0.937654 | -0.426230 | 0.246200 | -0.211604 | 0.727775 | -0.659960 |
| 3 | -0.429664 | 1.224745 | -0.764121 | -0.961486 | -0.763634 | -0.426230 | 0.246200 | -0.211604 | 0.727775 | -0.659960 |
| 4 | -1.086676 | -0.816497 | -0.887515 | -0.961486 | -0.644858 | -0.426230 | -0.660853 | -0.211604 | 0.727775 | -0.659960 |
In [23]:
# double-check that all the column means are ~0 and standard deviations are ~1
df_scaled.describe()
Out[23]:
| | Age | Gender | DistanceFromHome | JobLevel | MonthlyIncome | PerformanceRating | JobSatisfaction | Human Resources | Research & Development | Sales |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 1.470000e+03 | 1.470000e+03 | 1.470000e+03 | 1.470000e+03 | 1.470000e+03 | 1.470000e+03 | 1.470000e+03 | 1.470000e+03 | 1.470000e+03 | 1.470000e+03 |
| mean | -3.504377e-17 | -4.350262e-17 | 4.350262e-17 | -2.658493e-17 | -4.471102e-17 | -6.114534e-16 | -9.183886e-17 | 6.767074e-17 | 2.900174e-17 | 8.458842e-17 |
| std | 1.000340e+00 | 1.000340e+00 | 1.000340e+00 | 1.000340e+00 | 1.000340e+00 | 1.000340e+00 | 1.000340e+00 | 1.000340e+00 | 1.000340e+00 | 1.000340e+00 |
| min | -2.072192e+00 | -8.164966e-01 | -1.010909e+00 | -9.614864e-01 | -1.167343e+00 | -4.262300e-01 | -1.567907e+00 | -2.116037e-01 | -1.374051e+00 | -6.599598e-01 |
| 25% | -7.581700e-01 | -8.164966e-01 | -8.875151e-01 | -9.614864e-01 | -7.632087e-01 | -4.262300e-01 | -6.608532e-01 | -2.116037e-01 | -1.374051e+00 | -6.599598e-01 |
| 50% | -1.011589e-01 | -8.164966e-01 | -2.705440e-01 | -5.778755e-02 | -3.365516e-01 | -4.262300e-01 | 2.462002e-01 | -2.116037e-01 | 7.277751e-01 | -6.599598e-01 |
| 75% | 6.653541e-01 | 1.224745e+00 | 5.932157e-01 | 8.459113e-01 | 3.986245e-01 | -4.262300e-01 | 1.153254e+00 | -2.116037e-01 | 7.277751e-01 | 1.515244e+00 |
| max | 2.526886e+00 | 1.224745e+00 | 2.444129e+00 | 2.653309e+00 | 2.867626e+00 | 2.346151e+00 | 1.153254e+00 | 4.725816e+00 | 7.277751e-01 | 1.515244e+00 |
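The same check can be made programmatic; a small sketch, noting that StandardScaler standardizes with the population standard deviation (ddof=0) while pandas' .std() defaults to ddof=1, which is why the stds above read 1.000340 rather than exactly 1:

# means should be ~0 and population standard deviations ~1 after scaling
assert np.allclose(df_scaled.mean(), 0)
assert np.allclose(df_scaled.std(ddof=0), 1)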
b. Write a loop to fit models with 2 to 15 clusters and record the inertia and silhouette scores¶
In [24]:
# import kmeans and write a loop to fit models with 2 to 15 clusters
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
# create empty lists to hold the inertia and silhouette values
inertia_values = []
silhouette_scores = []
# fit models with 2 to 15 clusters, and record the inertia and silhouette scores
for k in range(2, 16):
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)  # n_init pinned to 10 (instead of 'auto') for reproducibility
    kmeans.fit(df_scaled)
    inertia_values.append(kmeans.inertia_)
    silhouette_scores.append(silhouette_score(df_scaled, kmeans.labels_, metric='euclidean', sample_size=None))
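Before plotting, it can help to eyeball the two metrics side by side; a small sketch:

# pair each k with its recorded inertia and silhouette score
pd.DataFrame({'inertia': inertia_values, 'silhouette': silhouette_scores},
             index=range(2, 16)).head()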
In [25]:
# plot the inertia values
import matplotlib.pyplot as plt
# turn the list into a series for plotting
inertia_series = pd.Series(inertia_values, index=range(2, 16))
# plot the data
inertia_series.plot(marker='o')
plt.xlabel("Number of Clusters (k)")
plt.ylabel("Inertia")
plt.title("Number of Clusters vs. Inertia");
In [26]:
# plot the silhouette scores
# turn the list into a series for plotting
silhouette_series = pd.Series(silhouette_scores, index=range(2, 16))