1. Loading
In [1]:
# read in the employee data
import pandas as pd
employee_data = pd.read_csv('../Data/employee_data.csv')
employee_data.head()
Out[1]:
EmployeeID | Age | Gender | DistanceFromHome | JobLevel | Department | MonthlyIncome | PerformanceRating | JobSatisfaction | Attrition | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 1001 | 41 | Female | 1 | 2 | Sales | 5993 | 3 | 4 | Yes |
1 | 1002 | 49 | Male | 8 | 2 | Research & Development | 5130 | 4 | 2 | No |
2 | 1004 | 37 | Male | 2 | 1 | Research & Development | 2090 | 3 | 3 | Yes |
3 | 1005 | 33 | Female | 3 | 1 | Research & Development | 2909 | 3 | 3 | No |
4 | 1007 | 27 | Male | 2 | 1 | Research & Development | 3468 | 3 | 2 | No |
In [2]:
# note the number of rows and columns
employee_data.shape
Out[2]:
(1470, 10)
In [3]:
# view the data types of all the columns
employee_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1470 entries, 0 to 1469
Data columns (total 10 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   EmployeeID         1470 non-null   int64
 1   Age                1470 non-null   int64
 2   Gender             1470 non-null   object
 3   DistanceFromHome   1470 non-null   int64
 4   JobLevel           1470 non-null   int64
 5   Department         1470 non-null   object
 6   MonthlyIncome      1470 non-null   int64
 7   PerformanceRating  1470 non-null   int64
 8   JobSatisfaction    1470 non-null   int64
 9   Attrition          1470 non-null   object
dtypes: int64(7), object(3)
memory usage: 115.0+ KB
In [4]:
# look at the numeric columns
employee_data.dtypes[employee_data.dtypes == 'int64']
Out[4]:
EmployeeID           int64
Age                  int64
DistanceFromHome     int64
JobLevel             int64
MonthlyIncome        int64
PerformanceRating    int64
JobSatisfaction      int64
dtype: object
In [5]:
# look at the non-numeric columns
employee_data.dtypes[employee_data.dtypes != 'int64']
Out[5]:
Gender        object
Department    object
Attrition     object
dtype: object
In [6]:
# create a copy of the dataframe
data = employee_data.copy()
data.head()
Out[6]:
EmployeeID | Age | Gender | DistanceFromHome | JobLevel | Department | MonthlyIncome | PerformanceRating | JobSatisfaction | Attrition | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 1001 | 41 | Female | 1 | 2 | Sales | 5993 | 3 | 4 | Yes |
1 | 1002 | 49 | Male | 8 | 2 | Research & Development | 5130 | 4 | 2 | No |
2 | 1004 | 37 | Male | 2 | 1 | Research & Development | 2090 | 3 | 3 | Yes |
3 | 1005 | 33 | Female | 3 | 1 | Research & Development | 2909 | 3 | 3 | No |
4 | 1007 | 27 | Male | 2 | 1 | Research & Development | 3468 | 3 | 2 | No |
In [7]:
# look at the gender values
data.Gender.value_counts()
Out[7]:
Gender
Male      882
Female    588
Name: count, dtype: int64
In [8]:
# change gender into a numeric field using np.where
import numpy as np
data.Gender = np.where(data.Gender == 'Female', 1, 0)
data.Gender.head()
Out[8]:
0    1
1    0
2    0
3    1
4    0
Name: Gender, dtype: int64
In [9]:
# look at the attrition values
data.Attrition.value_counts()
Out[9]:
Attrition
No     1233
Yes     237
Name: count, dtype: int64
In [10]:
# change attrition to a numeric field using np.where
data.Attrition = np.where(data.Attrition == 'Yes', 1, 0)
data.Attrition.head()
Out[10]:
0    1
1    0
2    1
3    0
4    0
Name: Attrition, dtype: int64
In [11]:
# look at the department values
data.Department.value_counts()
Out[11]:
Department
Research & Development    961
Sales                     446
Human Resources            63
Name: count, dtype: int64
In [12]:
# change department to a numeric field via dummy variables
pd.get_dummies(data.Department).astype('int').head()
Out[12]:
Human Resources | Research & Development | Sales | |
---|---|---|---|
0 | 0 | 0 | 1 |
1 | 0 | 1 | 0 |
2 | 0 | 1 | 0 |
3 | 0 | 1 | 0 |
4 | 0 | 1 | 0 |
In [13]:
# attach the dummy columns back onto the dataframe and drop the original Department column
data = pd.concat([data, pd.get_dummies(data.Department).astype('int')], axis=1)
data.drop('Department', axis=1, inplace=True)
data.head()
Out[13]:
EmployeeID | Age | Gender | DistanceFromHome | JobLevel | MonthlyIncome | PerformanceRating | JobSatisfaction | Attrition | Human Resources | Research & Development | Sales | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1001 | 41 | 1 | 1 | 2 | 5993 | 3 | 4 | 1 | 0 | 0 | 1 |
1 | 1002 | 49 | 0 | 8 | 2 | 5130 | 4 | 2 | 0 | 0 | 1 | 0 |
2 | 1004 | 37 | 0 | 2 | 1 | 2090 | 3 | 3 | 1 | 0 | 1 | 0 |
3 | 1005 | 33 | 1 | 3 | 1 | 2909 | 3 | 3 | 0 | 0 | 1 | 0 |
4 | 1007 | 27 | 0 | 2 | 1 | 3468 | 3 | 2 | 0 | 0 | 1 | 0 |
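For reference, `pd.get_dummies` can also encode and drop the original column in one call; a sketch of this alternative (note it prefixes the new columns as `Department_...`, unlike the manual concat above):
In [ ]:
# one-step alternative: encode Department and drop the original column
# (the new columns come out named 'Department_Sales', etc.)
pd.get_dummies(employee_data, columns=['Department'], dtype=int).head()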
In [14]:
# view the cleaned dataframe
data.head()
Out[14]:
EmployeeID | Age | Gender | DistanceFromHome | JobLevel | MonthlyIncome | PerformanceRating | JobSatisfaction | Attrition | Human Resources | Research & Development | Sales | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1001 | 41 | 1 | 1 | 2 | 5993 | 3 | 4 | 1 | 0 | 0 | 1 |
1 | 1002 | 49 | 0 | 8 | 2 | 5130 | 4 | 2 | 0 | 0 | 1 | 0 |
2 | 1004 | 37 | 0 | 2 | 1 | 2090 | 3 | 3 | 1 | 0 | 1 | 0 |
3 | 1005 | 33 | 1 | 3 | 1 | 2909 | 3 | 3 | 0 | 0 | 1 | 0 |
4 | 1007 | 27 | 0 | 2 | 1 | 3468 | 3 | 2 | 0 | 0 | 1 | 0 |
In [15]:
# note the number of rows and columns
data.shape
Out[15]:
(1470, 12)
In [16]:
# what is the overall attrition rate for all employees, i.e. what percent of employees leave the company?
data.Attrition.mean() # 16% of employees leave the company
Out[16]:
0.16122448979591836
In [17]:
# create a summary table to show the mean of each column for employees who stay vs leave - what are your takeaways?
data.groupby('Attrition').mean()
Out[17]:
EmployeeID | Age | Gender | DistanceFromHome | JobLevel | MonthlyIncome | PerformanceRating | JobSatisfaction | Human Resources | Research & Development | Sales | |
---|---|---|---|---|---|---|---|---|---|---|---|
Attrition | |||||||||||
0 | 2027.656123 | 37.561233 | 0.406326 | 8.915653 | 2.145985 | 6832.739659 | 3.153285 | 2.778589 | 0.041363 | 0.671533 | 0.287105 |
1 | 2010.345992 | 33.607595 | 0.367089 | 10.632911 | 1.637131 | 4787.092827 | 3.156118 | 2.468354 | 0.050633 | 0.561181 | 0.388186 |
Insight: Employees who stay tend to be older, female, and more senior (higher job level and monthly income); they live closer to the office, report higher job satisfaction, and are more likely to work in Research & Development.
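To make those gaps explicit, a minimal sketch that differences the two rows of the summary table above (units differ by column, so compare within a column, not across):
In [ ]:
# per-column gap between leavers (1) and stayers (0), in original units
summary = data.groupby('Attrition').mean()
(summary.loc[1] - summary.loc[0]).drop('EmployeeID')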
In [18]:
# create a new dataframe without the attrition column for us to model on
df = data.drop('Attrition', axis=1)
df.head()
Out[18]:
EmployeeID | Age | Gender | DistanceFromHome | JobLevel | MonthlyIncome | PerformanceRating | JobSatisfaction | Human Resources | Research & Development | Sales | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1001 | 41 | 1 | 1 | 2 | 5993 | 3 | 4 | 0 | 0 | 1 |
1 | 1002 | 49 | 0 | 8 | 2 | 5130 | 4 | 2 | 0 | 1 | 0 |
2 | 1004 | 37 | 0 | 2 | 1 | 2090 | 3 | 3 | 0 | 1 | 0 |
3 | 1005 | 33 | 1 | 3 | 1 | 2909 | 3 | 3 | 0 | 1 | 0 |
4 | 1007 | 27 | 0 | 2 | 1 | 3468 | 3 | 2 | 0 | 1 | 0 |
In [19]:
# drop the employee column as well before modeling
df = df.drop(columns='EmployeeID')
df.head()
Out[19]:
Age | Gender | DistanceFromHome | JobLevel | MonthlyIncome | PerformanceRating | JobSatisfaction | Human Resources | Research & Development | Sales | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 41 | 1 | 1 | 2 | 5993 | 3 | 4 | 0 | 0 | 1 |
1 | 49 | 0 | 8 | 2 | 5130 | 4 | 2 | 0 | 1 | 0 |
2 | 37 | 0 | 2 | 1 | 2090 | 3 | 3 | 0 | 1 | 0 |
3 | 33 | 1 | 3 | 1 | 2909 | 3 | 3 | 0 | 1 | 0 |
4 | 27 | 0 | 2 | 1 | 3468 | 3 | 2 | 0 | 1 | 0 |
In [20]:
# note the number of rows and columns in the dataframe
df.shape
Out[20]:
(1470, 10)
In [21]:
# create a pair plot comparing all the columns of the dataframe - what observations do you notice?
import seaborn as sns
sns.pairplot(df);
OBSERVATIONS:
- Age and gender appear fairly evenly distributed
- More people live closer to the office
- Job level and income are correlated (quantified below)
- There are fewer high performers
- Most people are happy with their jobs
- There are far fewer people in HR than in the other departments
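To quantify the job level / income observation flagged above, a quick check:
In [ ]:
# correlation between job level and monthly income
df[['JobLevel', 'MonthlyIncome']].corr()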
2. K-Means Clustering
a. Standardize the data
In [22]:
# scale the data using standardization
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df_scaled = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)
df_scaled.head()
Out[22]:
Age | Gender | DistanceFromHome | JobLevel | MonthlyIncome | PerformanceRating | JobSatisfaction | Human Resources | Research & Development | Sales | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 0.446350 | 1.224745 | -1.010909 | -0.057788 | -0.108350 | -0.426230 | 1.153254 | -0.211604 | -1.374051 | 1.515244 |
1 | 1.322365 | -0.816497 | -0.147150 | -0.057788 | -0.291719 | 2.346151 | -0.660853 | -0.211604 | 0.727775 | -0.659960 |
2 | 0.008343 | -0.816497 | -0.887515 | -0.961486 | -0.937654 | -0.426230 | 0.246200 | -0.211604 | 0.727775 | -0.659960 |
3 | -0.429664 | 1.224745 | -0.764121 | -0.961486 | -0.763634 | -0.426230 | 0.246200 | -0.211604 | 0.727775 | -0.659960 |
4 | -1.086676 | -0.816497 | -0.887515 | -0.961486 | -0.644858 | -0.426230 | -0.660853 | -0.211604 | 0.727775 | -0.659960 |
In [23]:
# double check that all the column means are 0 and standard deviations are 1
df_scaled.describe()
Out[23]:
Age | Gender | DistanceFromHome | JobLevel | MonthlyIncome | PerformanceRating | JobSatisfaction | Human Resources | Research & Development | Sales | |
---|---|---|---|---|---|---|---|---|---|---|
count | 1.470000e+03 | 1.470000e+03 | 1.470000e+03 | 1.470000e+03 | 1.470000e+03 | 1.470000e+03 | 1.470000e+03 | 1.470000e+03 | 1.470000e+03 | 1.470000e+03 |
mean | -3.504377e-17 | -4.350262e-17 | 4.350262e-17 | -2.658493e-17 | -4.471102e-17 | -6.114534e-16 | -9.183886e-17 | 6.767074e-17 | 2.900174e-17 | 8.458842e-17 |
std | 1.000340e+00 | 1.000340e+00 | 1.000340e+00 | 1.000340e+00 | 1.000340e+00 | 1.000340e+00 | 1.000340e+00 | 1.000340e+00 | 1.000340e+00 | 1.000340e+00 |
min | -2.072192e+00 | -8.164966e-01 | -1.010909e+00 | -9.614864e-01 | -1.167343e+00 | -4.262300e-01 | -1.567907e+00 | -2.116037e-01 | -1.374051e+00 | -6.599598e-01 |
25% | -7.581700e-01 | -8.164966e-01 | -8.875151e-01 | -9.614864e-01 | -7.632087e-01 | -4.262300e-01 | -6.608532e-01 | -2.116037e-01 | -1.374051e+00 | -6.599598e-01 |
50% | -1.011589e-01 | -8.164966e-01 | -2.705440e-01 | -5.778755e-02 | -3.365516e-01 | -4.262300e-01 | 2.462002e-01 | -2.116037e-01 | 7.277751e-01 | -6.599598e-01 |
75% | 6.653541e-01 | 1.224745e+00 | 5.932157e-01 | 8.459113e-01 | 3.986245e-01 | -4.262300e-01 | 1.153254e+00 | -2.116037e-01 | 7.277751e-01 | 1.515244e+00 |
max | 2.526886e+00 | 1.224745e+00 | 2.444129e+00 | 2.653309e+00 | 2.867626e+00 | 2.346151e+00 | 1.153254e+00 | 4.725816e+00 | 7.277751e-01 | 1.515244e+00 |
b. Write a loop to fit models with 2 to 15 clusters and record the inertia and silhouette scores
In [24]:
# import kmeans and write a loop to fit models with 2 to 15 clusters
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
# create empty lists to hold the inertia and silhouette values
inertia_values = []
silhouette_scores = []
# fit models with 2 - 15 clusters, adding the inertia and silhouette scores to the lists
for k in range(2, 16):
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)  # set n_init explicitly; the default changed across sklearn versions
    kmeans.fit(df_scaled)
    inertia_values.append(kmeans.inertia_)
    silhouette_scores.append(silhouette_score(df_scaled, kmeans.labels_, metric='euclidean'))
In [25]:
# plot the inertia values
import matplotlib.pyplot as plt
# turn the list into a series for plotting
inertia_series = pd.Series(inertia_values, index=range(2, 16))
# plot the data
inertia_series.plot(marker='o')
plt.xlabel("Number of Clusters (k)")
plt.ylabel("Inertia")
plt.title("Number of Clusters vs. Inertia");
In [26]:
# plot the silhouette scores
# turn the list into a series for plotting
silhouette_series = pd.Series(silhouette_scores, index=range(2, 16))
# plot the data
silhouette_series.plot(marker='o')
plt.xlabel("Number of Clusters (k)")
plt.ylabel("Silhouette Score")
plt.title("Number of Clusters vs. Silhouette Score");
c. Identify a k value that looks like an elbow on the inertia plot and has a high silhouette score
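Before fitting, a quick programmatic cross-check (a sketch, reusing the `silhouette_series` built above) of what the plots suggest:
In [ ]:
# the k in 2-15 with the highest silhouette score
silhouette_series.idxmax()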
In [27]:
# fit a kmeans model for the k value that you identified
kmeans4 = KMeans(n_clusters=4, n_init=10, random_state=42)
kmeans4.fit(df_scaled)
Out[27]:
KMeans(n_clusters=4, n_init=10, random_state=42)
In [28]:
# find the number of employees in each cluster
from collections import Counter
Counter(kmeans4.labels_)
Out[28]:
Counter({0: 747, 1: 407, 2: 253, 3: 63})
In [29]:
# create a heat map of the cluster centers
import seaborn as sns
import matplotlib.pyplot as plt
cluster_centers4 = pd.DataFrame(kmeans4.cluster_centers_, columns=df_scaled.columns)
plt.figure(figsize=(10, 2))
sns.heatmap(cluster_centers4, annot=True, cmap="RdBu", fmt=".1f", linewidths=.5);
Interpret the clusters:
- Cluster 0: junior, research & dev employees
- Cluster 1: sales employees
- Cluster 2: senior employees
- Cluster 3: HR employees
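The labels above come from eyeballing the heat map; a small sketch that pulls each cluster's most extreme standardized features programmatically:
In [ ]:
# top 3 features (by absolute standardized center value) per cluster
for i, row in cluster_centers4.iterrows():
    top = row.abs().sort_values(ascending=False).head(3)
    print(f"Cluster {i}: " + ", ".join(f"{col} ({row[col]:+.1f})" for col in top.index))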
3. PCA
a. Fit a PCA Model with 2 components for visualization
In [30]:
# fit a PCA model with 2 components
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
pca.fit(df_scaled)
Out[30]:
PCA(n_components=2)
In [31]:
# view the explained variance ratio
pca.explained_variance_ratio_
Out[31]:
array([0.23793893, 0.18883434])
In [32]:
# view the components
pca.components_
Out[32]:
array([[ 0.43287352,  0.04877625, -0.00285089,  0.60509274,  0.59445012,
        -0.02556521, -0.00472736,  0.02964393, -0.21392918,  0.20833797],
       [-0.21384802,  0.00840873,  0.01653328, -0.14533326, -0.17730123,
        -0.04153184,  0.01140416,  0.11374447, -0.67887246,  0.65246219]])
In [33]:
# view the columns
df_scaled.columns
Out[33]:
Index(['Age', 'Gender', 'DistanceFromHome', 'JobLevel', 'MonthlyIncome', 'PerformanceRating', 'JobSatisfaction', 'Human Resources', 'Research & Development', 'Sales'], dtype='object')
Interpret the components:
- Component 1: higher age, job level, monthly income = more senior
- Component 2: lower = research, higher = sales
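Reading the raw `components_` array against the column list is error-prone; a loadings table (a sketch) lines the two up:
In [ ]:
# pair each feature with its weight on the two components
loadings = pd.DataFrame(pca.components_.T, index=df_scaled.columns, columns=['PC1', 'PC2'])
loadings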
b. Overlay the K-Means cluster colors
In [34]:
# transform the data
df_scaled_transformed = pd.DataFrame(pca.transform(df_scaled), columns=['PC1', 'PC2'])
df_scaled_transformed.head()
Out[34]:
PC1 | PC2 | |
---|---|---|
0 | 0.765263 | 1.853973 |
1 | -0.031684 | -1.285674 |
2 | -1.462588 | -0.645564 |
3 | -1.449531 | -0.563547 |
4 | -1.758252 | -0.473654 |
In [35]:
# plot the data
import matplotlib.pyplot as plt
import seaborn as sns
sns.scatterplot(x='PC1', y='PC2', data=df_scaled_transformed)
plt.xlabel('More Senior -->')
plt.ylabel('<-- Research | Sales -->');
In [36]:
# overlay the kmeans clusters (hint: set the hue to be the cluster labels)
sns.scatterplot(x='PC1', y='PC2', data=df_scaled_transformed,
                hue=kmeans4.labels_, palette='viridis');
c. Overlay the Department colors instead
In [37]:
# overlay the department colors (hint: set the hue to be the department column)
sns.scatterplot(x='PC1', y='PC2', data=df_scaled_transformed,
                hue=employee_data.Department, palette='viridis')
plt.legend(loc='upper right');
4. Another K-Means Clustering without the department
Since the departments seemed to dominate the visualization, let's exclude them and try fitting more K-Means models.
a. Create a new dataframe without the Departments
In [38]:
# create a new dataframe that excludes the three department columns from the scaled dataframe
df_scaled_v2 = df_scaled.iloc[:, :7]
df_scaled_v2.head()
Out[38]:
Age | Gender | DistanceFromHome | JobLevel | MonthlyIncome | PerformanceRating | JobSatisfaction | |
---|---|---|---|---|---|---|---|
0 | 0.446350 | 1.224745 | -1.010909 | -0.057788 | -0.108350 | -0.426230 | 1.153254 |
1 | 1.322365 | -0.816497 | -0.147150 | -0.057788 | -0.291719 | 2.346151 | -0.660853 |
2 | 0.008343 | -0.816497 | -0.887515 | -0.961486 | -0.937654 | -0.426230 | 0.246200 |
3 | -0.429664 | 1.224745 | -0.764121 | -0.961486 | -0.763634 | -0.426230 | 0.246200 |
4 | -1.086676 | -0.816497 | -0.887515 | -0.961486 | -0.644858 | -0.426230 | -0.660853 |
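The positional slice works because the department dummies are the last three columns; a name-based sketch that is less fragile if the column order ever changes:
In [ ]:
# same result via column names instead of positions
dept_cols = ['Human Resources', 'Research & Development', 'Sales']
df_scaled_v2 = df_scaled.drop(columns=dept_cols)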
b. Write a loop to fit models with 2 to 15 clusters and record the inertia and silhouette scores
In [39]:
# write a loop to fit models with 2 to 15 clusters
# create empty lists to hold the inertia and silhouette values
inertia_values_v2 = []
silhouette_scores_v2 = []
# fit models with 2 - 15 clusters, adding the inertia and silhouette scores to the lists
for k in range(2, 16):
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)  # set n_init explicitly; the default changed across sklearn versions
    kmeans.fit(df_scaled_v2)
    inertia_values_v2.append(kmeans.inertia_)
    silhouette_scores_v2.append(silhouette_score(df_scaled_v2, kmeans.labels_, metric='euclidean'))
In [40]:
# plot the inertia values
# turn the list into a series for plotting
inertia_series_v2 = pd.Series(inertia_values_v2, index=range(2, 16))
# plot the data
inertia_series_v2.plot(marker='o')
plt.xlabel("Number of Clusters (k)")
plt.ylabel("Inertia")
plt.title("Number of Clusters vs. Inertia");
In [41]:
# plot the silhouette scores
# turn the list into a series for plotting
silhouette_series_v2 = pd.Series(silhouette_scores_v2, index=range(2, 16))
# plot the data
silhouette_series_v2.plot(marker='o')
plt.xlabel("Number of Clusters (k)")
plt.ylabel("Silhouette Score")
plt.title("Number of Clusters vs. Silhouette Score");
c. Identify a few k values that look like elbows on the inertia plot and have high silhouette scores
i. k=3
In [42]:
# fit a kmeans model for the k value that you identified
kmeans3_v2 = KMeans(n_clusters=3, n_init=10, random_state=42)
kmeans3_v2.fit(df_scaled_v2)
Out[42]:
KMeans(n_clusters=3, n_init=10, random_state=42)
In [43]:
# find the number of employees in each cluster
Counter(kmeans3_v2.labels_)
Out[43]:
Counter({1: 988, 2: 282, 0: 200})
In [44]:
# create a heat map of the cluster centers
cluster_centers3_v2 = pd.DataFrame(kmeans3_v2.cluster_centers_, columns=df_scaled_v2.columns)
plt.figure(figsize=(10, 2))
sns.heatmap(cluster_centers3_v2, annot=True, cmap="RdBu", fmt=".1f", linewidths=.5);
Interpret the clusters:
- Cluster 0: high performing employees
- Cluster 1: junior, low performing employees
- Cluster 2: senior employees
ii. k=4
In [45]:
# fit a kmeans model for the k value that you identified
kmeans4_v2 = KMeans(n_clusters=4, n_init=10, random_state=42)
kmeans4_v2.fit(df_scaled_v2)
Out[45]:
KMeans(n_clusters=4, n_init=10, random_state=42)
In [46]:
# find the number of employees in each cluster
Counter(kmeans4_v2.labels_)
Out[46]:
Counter({2: 616, 0: 405, 1: 249, 3: 200})
In [47]:
# create a heat map of the cluster centers
cluster_centers4_v2 = pd.DataFrame(kmeans4_v2.cluster_centers_, columns=df_scaled_v2.columns)
plt.figure(figsize=(10, 2))
sns.heatmap(cluster_centers4_v2, annot=True, cmap="RdBu", fmt=".1f", linewidths=.5);
Interpret the clusters:
- Cluster 0: female employees
- Cluster 1: senior employees
- Cluster 2: male employees
- Cluster 3: high performing employees
iii. k=6
In [48]:
# fit a kmeans model for the k value that you identified
kmeans6_v2 = KMeans(n_clusters=6, n_init=10, random_state=42)
kmeans6_v2.fit(df_scaled_v2)
Out[48]:
KMeans(n_clusters=6, n_init=10, random_state=42)
In [49]:
# find the number of employees in each cluster
Counter(kmeans6_v2.labels_)
Out[49]:
Counter({3: 349, 0: 304, 4: 219, 5: 201, 1: 200, 2: 197})
In [50]:
# create a heat map of the cluster centers
cluster_centers6_v2 = pd.DataFrame(kmeans6_v2.cluster_centers_, columns=df_scaled_v2.columns)
plt.figure(figsize=(10, 2))
sns.heatmap(cluster_centers6_v2, annot=True, cmap="RdBu", fmt=".1f", linewidths=.5);
Interpret the clusters:
- Cluster 0: men who like their jobs
- Cluster 1: high performers
- Cluster 2: long commuters
- Cluster 3: women
- Cluster 4: senior employees
- Cluster 5: men who dislike their jobs
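Standardized centers are good for spotting contrasts, but profiles in original units are easier to share; a sketch using the unscaled `df` (its first seven columns are these same features):
In [ ]:
# mean of each original-unit feature per cluster
df.iloc[:, :7].groupby(kmeans6_v2.labels_).mean().round(1)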
5. PCA without the department
a. Fit a PCA Model with 2 components for visualization
In [51]:
# fit a PCA model with 2 components
from sklearn.decomposition import PCA
pca_v2 = PCA(n_components=2)
pca_v2.fit(df_scaled_v2)
Out[51]:
PCA(n_components=2)
In [52]:
# view the explained variance ratio
pca_v2.explained_variance_ratio_ # PC1 now explains more variance than before (0.33 vs 0.24) with the department dummies removed
Out[52]:
array([0.33354222, 0.14850324])
In [53]:
# view the components
pca_v2.components_
Out[53]:
array([[ 0.47124275,  0.0460627 , -0.00629691,  0.62393147,  0.62140377,
        -0.01687984, -0.00712661],
       [ 0.01896906,  0.58615904,  0.41405416, -0.01838822, -0.02778096,
         0.4833175 , -0.49991119]])
In [54]:
# view the columns
df_scaled_v2.columns
Out[54]:
Index(['Age', 'Gender', 'DistanceFromHome', 'JobLevel', 'MonthlyIncome', 'PerformanceRating', 'JobSatisfaction'], dtype='object')
Interpret the components:
- Component 1: higher age, job level, monthly income = more senior
- Component 2: <-- happy in job | women, longer commute, higher performing -->
b. Overlay the K-Means cluster colors
In [55]:
# transform the data
df_scaled_transformed_v2 = pd.DataFrame(pca_v2.transform(df_scaled_v2), columns=['PC1', 'PC2'])
df_scaled_transformed_v2.head()
Out[55]:
PC1 | PC2 | |
---|---|---|
0 | 0.168712 | -0.470665 |
1 | 0.334248 | 0.959030 |
2 | -1.205213 | -1.131272 |
3 | -1.210236 | 0.103169 |
4 | -1.532824 | -0.706731 |
In [56]:
# plot the data
import matplotlib.pyplot as plt
import seaborn as sns
sns.scatterplot(x='PC1', y='PC2', data=df_scaled_transformed_v2)
plt.xlabel('More Senior -->')
plt.ylabel('<-- Happy in Job | Women / Longer Commute / High Performing -->');
In [57]:
# overlay the kmeans clusters (choose your favorite k-means model from the previous section)
sns.scatterplot(x='PC1', y='PC2', data=df_scaled_transformed_v2,
                hue=kmeans6_v2.labels_, palette='viridis');
c. Create a 3D plot
In [58]:
# fit a PCA model with 3 components
pca3_v2 = PCA(n_components=3)
pca3_v2.fit(df_scaled_v2)
Out[58]:
PCA(n_components=3)
In [59]:
# view the explained variance ratio
pca3_v2.explained_variance_ratio_ # the first two components match the 2-component fit; PC3 adds another ~15%
Out[59]:
array([0.33354222, 0.14850324, 0.14578114])
In [60]:
# view the components
pca3_v2.components_
Out[60]:
array([[ 0.47124275,  0.0460627 , -0.00629691,  0.62393147,  0.62140377,
        -0.01687984, -0.00712661],
       [ 0.01896906,  0.58615904,  0.41405416, -0.01838822, -0.02778096,
         0.4833175 , -0.49991119],
       [ 0.02702996, -0.36313603,  0.55488385,  0.02159616,  0.01065236,
         0.52050007,  0.53666127]])
In [61]:
# view the columns
df_scaled_v2.columns
Out[61]:
Index(['Age', 'Gender', 'DistanceFromHome', 'JobLevel', 'MonthlyIncome', 'PerformanceRating', 'JobSatisfaction'], dtype='object')
Interpret the components:
- Component 1: higher age, job level, monthly income = more senior
- Component 2: <-- happy in job | women, longer commute, higher performing -->
- Component 3: longer commute, higher performance, happy in job
In [62]:
# transform the data
df_scaled_transformed3_v2 = pd.DataFrame(pca3_v2.transform(df_scaled_v2), columns=['PC1', 'PC2', 'PC3'])
df_scaled_transformed3_v2.head()
Out[62]:
PC1 | PC2 | PC3 | |
---|---|---|---|
0 | 0.168712 | -0.470665 | -0.598970 |
1 | 0.334248 | 0.959030 | 1.112754 |
2 | -1.205213 | -1.131272 | -0.316222 |
3 | -1.210236 | 0.103169 | -0.998987 |
4 | -1.532824 | -0.706731 | -0.829482 |
In [63]:
# import plotting libraries
import matplotlib.pyplot as plt
import seaborn as sns
# combine the data and cluster labels
cluster_labels = pd.Series(kmeans6_v2.labels_, name='cluster')
# create a clean dataframe
df_clean = pd.concat([df_scaled_transformed3_v2, cluster_labels], axis=1)
# create a 3d scatter plot
fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(111, projection='3d')  # idiomatic replacement for the deprecated Axes3D(fig) pattern
# specify the data and labels
sc = ax.scatter(df_clean['PC1'], df_clean['PC2'], df_clean['PC3'],
                c=df_clean['cluster'], cmap='tab10')
ax.set_xlabel('PC1')
ax.set_ylabel('PC2')
ax.set_zlabel('PC3')
# add a legend
plt.legend(*sc.legend_elements(), title='clusters',
           bbox_to_anchor=(1.05, 1));
6. EDA on Clusters
a. Confirm the 6 clusters
In [64]:
# view the kmeans model with 6 clusters
kmeans6_v2
Out[64]:
KMeans(n_clusters=6, n_init=10, random_state=42)
In [65]:
# view the cluster labels
kmeans6_v2.labels_
Out[65]:
array([3, 1, 0, ..., 1, 5, 0], dtype=int32)
b. Create a dataframe with the cluster labels and names
In [66]:
# create a dataframe holding the cluster labels (the cluster names are added next)
clusters = pd.DataFrame(kmeans6_v2.labels_, columns=['Cluster'])
clusters.head()
Out[66]:
Cluster | |
---|---|
0 | 3 |
1 | 1 |
2 | 0 |
3 | 3 |
4 | 5 |
In [67]:
# create a mapping for the cluster names
cluster_mapping = {0: 'Men who like their jobs',
                   1: 'High performers',
                   2: 'Long commuters',
                   3: 'Female employees',
                   4: 'Senior employees',
                   5: 'Men who dislike their jobs'}
In [68]:
# combine the labels and names into a single dataframe
clusters['Cluster_Name'] = clusters['Cluster'].map(cluster_mapping)
clusters.head()
Out[68]:
Cluster | Cluster_Name | |
---|---|---|
0 | 3 | Female employees |
1 | 1 | High performers |
2 | 0 | Men who like their jobs |
3 | 3 | Female employees |
4 | 5 | Men who dislike their jobs |
c. View the attrition rates for each cluster
In [69]:
# combine the clusters and attrition data
clusters = pd.concat([clusters, data.Attrition], axis=1)
clusters.head()
Out[69]:
Cluster | Cluster_Name | Attrition | |
---|---|---|---|
0 | 3 | Female employees | 1 |
1 | 1 | High performers | 0 |
2 | 0 | Men who like their jobs | 1 |
3 | 3 | Female employees | 0 |
4 | 5 | Men who dislike their jobs | 0 |
In [70]:
# what is the attrition rate for each cluster?
clusters.groupby(['Cluster_Name'])['Attrition'].mean()
Out[70]:
Cluster_Name
Female employees              0.154728
High performers               0.185000
Long commuters                0.218274
Men who dislike their jobs    0.189055
Men who like their jobs       0.161184
Senior employees              0.073059
Name: Attrition, dtype: float64
In [71]:
# sort the values
clusters.groupby(['Cluster_Name'])['Attrition'].mean().sort_values(ascending=False)
Out[71]:
Cluster_Name
Long commuters                0.218274
Men who dislike their jobs    0.189055
High performers               0.185000
Men who like their jobs       0.161184
Female employees              0.154728
Senior employees              0.073059
Name: Attrition, dtype: float64
Interpret the findings:
- Long commuters are most likely to leave
- Senior employees are most likely to stay
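Each rate carries more weight when read next to its cluster size (raw counts follow in the next cell); a sketch combining both:
In [ ]:
# attrition rate and cluster size side by side
clusters.groupby('Cluster_Name')['Attrition'].agg(['mean', 'count'])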
In [72]:
# find the number of employees in each cluster
clusters.Cluster.value_counts()
Out[72]:
Cluster
3    349
0    304
4    219
5    201
1    200
2    197
Name: count, dtype: int64
d. View the department breakdown for each cluster
In [73]:
# combine the clusters and department data
clusters = pd.concat([clusters, employee_data.Department], axis=1)
clusters.head()
Out[73]:
Cluster | Cluster_Name | Attrition | Department | |
---|---|---|---|---|
0 | 3 | Female employees | 1 | Sales |
1 | 1 | High performers | 0 | Research & Development |
2 | 0 | Men who like their jobs | 1 | Research & Development |
3 | 3 | Female employees | 0 | Research & Development |
4 | 5 | Men who dislike their jobs | 0 | Research & Development |
In [74]:
# what is the attrition rate for each cluster + department combination?
clusters.groupby(['Cluster_Name', 'Department']).mean()
Out[74]:
Cluster | Attrition | ||
---|---|---|---|
Cluster_Name | Department | ||
Female employees | Human Resources | 3.0 | 0.300000 |
Research & Development | 3.0 | 0.121076 | |
Sales | 3.0 | 0.206897 | |
High performers | Human Resources | 1.0 | 0.142857 |
Research & Development | 1.0 | 0.188406 | |
Sales | 1.0 | 0.181818 | |
Long commuters | Human Resources | 2.0 | 0.666667 |
Research & Development | 2.0 | 0.153846 | |
Sales | 2.0 | 0.311475 | |
Men who dislike their jobs | Human Resources | 5.0 | 0.214286 |
Research & Development | 5.0 | 0.172131 | |
Sales | 5.0 | 0.215385 | |
Men who like their jobs | Human Resources | 0.0 | 0.071429 |
Research & Development | 0.0 | 0.152284 | |
Sales | 0.0 | 0.193548 | |
Senior employees | Human Resources | 4.0 | 0.000000 |
Research & Development | 4.0 | 0.059603 | |
Sales | 4.0 | 0.125000 |
In [75]:
# sort the values
clusters.groupby(['Cluster_Name', 'Department']).mean().sort_values('Attrition', ascending=False)
Out[75]:
Cluster | Attrition | ||
---|---|---|---|
Cluster_Name | Department | ||
Long commuters | Human Resources | 2.0 | 0.666667 |
Sales | 2.0 | 0.311475 | |
Female employees | Human Resources | 3.0 | 0.300000 |
Men who dislike their jobs | Sales | 5.0 | 0.215385 |
Human Resources | 5.0 | 0.214286 | |
Female employees | Sales | 3.0 | 0.206897 |
Men who like their jobs | Sales | 0.0 | 0.193548 |
High performers | Research & Development | 1.0 | 0.188406 |
Sales | 1.0 | 0.181818 | |
Men who dislike their jobs | Research & Development | 5.0 | 0.172131 |
Long commuters | Research & Development | 2.0 | 0.153846 |
Men who like their jobs | Research & Development | 0.0 | 0.152284 |
High performers | Human Resources | 1.0 | 0.142857 |
Senior employees | Sales | 4.0 | 0.125000 |
Female employees | Research & Development | 3.0 | 0.121076 |
Men who like their jobs | Human Resources | 0.0 | 0.071429 |
Senior employees | Research & Development | 4.0 | 0.059603 |
Human Resources | 4.0 | 0.000000 |
Interpret the findings:
- The groups most likely to leave are long commuters (especially in HR and Sales) and women in HR
- The groups most likely to stay are senior employees, men in HR who like their jobs, and women in Research & Development
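A caveat on the extremes: some cluster + department cells are tiny (the count table in the next cell shows only 6 long-commuting HR employees), so a sketch keeping the counts next to the rates:
In [ ]:
# rate plus count, so small cells like the 6 HR long commuters stand out
clusters.groupby(['Cluster_Name', 'Department'])['Attrition'].agg(['mean', 'count']).sort_values('mean', ascending=False)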
In [76]:
# find the number of employees in each cluster + department combo
clusters.groupby(['Cluster_Name', 'Department']).count()
Out[76]:
Cluster | Attrition | ||
---|---|---|---|
Cluster_Name | Department | ||
Female employees | Human Resources | 10 | 10 |
Research & Development | 223 | 223 | |
Sales | 116 | 116 | |
High performers | Human Resources | 7 | 7 |
Research & Development | 138 | 138 | |
Sales | 55 | 55 | |
Long commuters | Human Resources | 6 | 6 |
Research & Development | 130 | 130 | |
Sales | 61 | 61 | |
Men who dislike their jobs | Human Resources | 14 | 14 |
Research & Development | 122 | 122 | |
Sales | 65 | 65 | |
Men who like their jobs | Human Resources | 14 | 14 |
Research & Development | 197 | 197 | |
Sales | 93 | 93 | |
Senior employees | Human Resources | 12 | 12 |
Research & Development | 151 | 151 | |
Sales | 56 | 56 |
7. Recommender
In [77]:
# looking at the clusters, what segment info would you share with the team?
clusters.groupby(['Cluster_Name'])['Attrition'].mean().sort_values(ascending=False)
Out[77]:
Cluster_Name
Long commuters                0.218274
Men who dislike their jobs    0.189055
High performers               0.185000
Men who like their jobs       0.161184
Female employees              0.154728
Senior employees              0.073059
Name: Attrition, dtype: float64
Recommendations for each cluster:
Higher attrition:
- Long commuters: offer remote or hybrid options and build a more inclusive remote culture
- Men who dislike their jobs: have their managers hold one-on-one conversations to understand their concerns
- High performers: create opportunities to move into more senior positions
Lower attrition:
- Senior employees: lower attrition makes sense here, since these employees have already stayed a long time
- Female employees: an interesting finding; dig further into why this group stays
- Men who like their jobs: it makes sense that people who like their jobs would stay
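Finally, a sketch (the output file name is hypothetical) that attaches the cluster names back to the employee IDs so the team can act on the segments:
In [ ]:
# join cluster assignments back to employee IDs and save for the team
segments = pd.concat([employee_data.EmployeeID, clusters], axis=1)
segments.to_csv('employee_segments.csv', index=False)  # hypothetical output path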