Objective¶
A travel agency has asked you and your team to create a machine learning model to help them predict whether or not potential customers will purchase a new vacation package they are offering. As part of this process, the agency would also like to understand the potential customers better to more efficiently market their new package.
Data Definition¶
- CustomerID: Unique customer ID
- ProdTaken: Whether the customer has purchased a package or not (0: No, 1: Yes)
- Age: Age of customer
- TypeofContact: How customer was contacted (Company Invited or Self Inquiry)
- CityTier: City tier depends on the development of a city, population, facilities, and living standards. The categories are ordered i.e. Tier 1 > Tier 2 > Tier 3
- DurationOfPitch: Duration of the pitch by a salesperson to the customer
- Occupation: Occupation of customer
- Gender: Gender of customer
- NumberOfPersonVisiting: Total number of people planning to take the trip with the customer
- NumberOfFollowups: Total number of follow-ups has been done by the salesperson after the sales pitch
- ProductPitched: Product pitched by the salesperson
- PreferredPropertyStar: Preferred hotel property rating by customer
- MaritalStatus: Marital status of customer
- NumberOfTrips: Average number of trips in a year by customer
- Passport: The customer has a passport or not (0: No, 1: Yes)
- PitchSatisfactionScore: Sales pitch satisfaction score
- OwnCar: Whether the customers own a car or not (0: No, 1: Yes)
- NumberOfChildrenVisiting: Total number of children with age less than 5 planning to take the trip with the customer
- Designation: Designation of the customer in the current organization
- MonthlyIncome: Gross monthly income of the customer
Boilerplate¶
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn.preprocessing
scaler = sklearn.preprocessing.MinMaxScaler()
labelEncoder = sklearn.preprocessing.LabelEncoder()
sns.set(color_codes=True)
%matplotlib inline
df = pd.read_csv("vacation_package.csv")
Functions¶
def ReturnByPurchase(columnparam):
purchased = df.loc[(df[columnparam].isnull() == False) & (df['ProdTaken'] == 1)]
nopurchase = df.loc[(df[columnparam].isnull() == False) & (df['ProdTaken'] == 0)]
print("Purchased Mean: " + str(round(purchased[columnparam].mean(),6)))
print("Purchased Median: " + str(purchased[columnparam].median()))
print("Purchased Mode: " + str(purchased[columnparam].mode()))
print('\n' + "-----------" + '\n')
print("Unpurchased Mean: " + str(round(nopurchase[columnparam].mean(),6)))
print("Unpurchased Median: " + str(nopurchase[columnparam].median()))
print("Unpurchased Mode: " + str(nopurchase[columnparam].mode()))
def histogramboxplot(data, feature, figsize=(12,7), kde=True, bins=None):
"""
Boxplot and histogram combined
data: dataframe
feature: dataframe column
figsize: size of figure (default (12,7))
kde: whether to show the density curve (default False)
bins: number of bins for histogram (default None)
"""
f2, (ax_box2, ax_hist2) = plt.subplots(
nrows=2, # Number of rows of the subplot grid= 2
sharex=True, # x-axis will be shared among all subplots
gridspec_kw={"height_ratios": (0.25, 0.75)},
figsize=figsize,
) # creating the 2 subplots
sns.boxplot(
data=data, x=feature, ax=ax_box2, showmeans=True, meanprops = dict(marker='D', markeredgecolor='black', markerfacecolor='red')
) # boxplot will be created and a star will indicate the mean value of the column
sns.histplot(
data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins
) if bins else sns.histplot(
data=data, x=feature, kde=kde, ax=ax_hist2
) # For histogram
ax_hist2.axvline(
data[feature].mean(), color="red", linestyle="-"
) # Add mean to the histogram
ax_hist2.axvline(
data[feature].median(), color="black", linestyle="--"
) # Add median to the histogram
def stackedbarplot(data, predictor, target):
"""
Print the category counts and plot a stacked bar chart
data: dataframe
predictor: independent variable
target: target variable
"""
count = data[predictor].nunique()
sorter = data[target].value_counts().index[-1]
tab1 = pd.crosstab(data[predictor], data[target], margins=True).sort_values(
by=sorter, ascending=False
)
print(tab1)
print("-" * 115)
tab = pd.crosstab(data[predictor], data[target], normalize="index").sort_values(
by=sorter, ascending=False
)
print(tab)
tab.plot(kind="bar", stacked=True, figsize=(count + 5, 5))
plt.legend(
loc="lower left", frameon=False,
)
plt.legend(loc="upper left", bbox_to_anchor=(1, 1))
plt.show()
def descviolinbox(Data, tablename):
"""
Prints the Description of a numeric value and a stacked violin and boxplot
Data = Dataframe
value = Target Column
"""
tablename = str(tablename)
description = Data[tablename].describe()
sns.violinplot(data=Data, y=tablename)
sns.boxplot(data=Data, y=tablename, color="gray")
return description
def bounds(Data, tablename):
"""
Returns upper and lower bounds for a table, helps us capture outliers
Data = Dataframe
tablename = Target Column
"""
tablename = str(tablename)
Q1 = Data[tablename].quantile(0.25)
Q3 = Data[tablename].quantile(0.75)
IQR = Q3 - Q1
lowerlim = Q1 - 1.5 * IQR
upperlim = Q3 + 1.5 * IQR
print("Q1: " + str(Q1))
print("Q3: " + str(Q3))
print("IQR: " + str(IQR))
print("lower limit: " + str(lowerlim))
print("upper limit: " + str(upperlim))
return(lowerlim, upperlim, Q1, Q3)
def numericdata(Data, tablename):
"""
Returns the Mean, Median, Mode, and Range for a dataset
Data = Dataframe
tablename = target column
"""
tablename = str(tablename)
Q1 = Data[tablename].quantile(0.25)
Q3 = Data[tablename].quantile(0.75)
IQR = Q3-Q1
mean = Data[tablename].mean()
median = Data[tablename].median()
mode = Data[tablename].mode()[0]
mean = round(mean,6)
median = round(median,6)
mode = round(mode,6)
IQR = round(IQR,6)
print("Mean: " + str(mean))
print("Median: " + str(median))
print("Mode: " + str(mode))
print("Range: " + str(IQR))
def labeldisplot(Data, tablename):
tablename = str(tablename)
sns.displot(data=Data, x=tablename, kde=True);
plt.axvline(x=Data[tablename].median(), color = 'black')
plt.axvline(x=Data[tablename].mean(), color = 'red', ls='--')
def labeledbarplot(data, feature, perc=False, n=None):
"""
Barplot with percentage at the top
data: dataframe
feature: dataframe column
perc: whether to display percentages instead of count (default is False)
n: displays the top n category levels (default is None, i.e., display all levels)
"""
total = len(data[feature]) # length of the column
count = data[feature].nunique()
if n is None:
plt.figure(figsize=(count + 1, 5))
else:
plt.figure(figsize=(n + 1, 5))
plt.xticks(rotation=90, fontsize=15)
ax = sns.countplot(
data=data,
x=feature,
palette="Paired",
order=data[feature].value_counts().index[:n].sort_values(),
)
for p in ax.patches:
if perc == True:
label = "{:.1f}%".format(
100 * p.get_height() / total
) # percentage of each class of the category
else:
label = p.get_height() # count of each level of the category
x = p.get_x() + p.get_width() / 2 # width of the plot
y = p.get_height() # height of the plot
ax.annotate(
label,
(x, y),
ha="center",
va="center",
size=12,
xytext=(0, 5),
textcoords="offset points",
) # annotate the percentage
plt.show() # show the plot
def stackedbox(data, value, hue):
"""
data= Dataframe
Value = Independent variable we want to test
Hue = Dependent Variable we want to include
"""
sns.catplot(data= data, x = hue, y = value, kind="box")
def minmeanmax(data, independent, dependent):
"""
Prints the min, mean and max of things grouped by our dependent variable
data = Dataframe
independent = The variable we want to do math on
Dependent = the variable we want to group by
"""
notpurchased = df.loc[df[dependent] == 0] #Adjust Number of df.Locs for number of variables
purchased = df.loc[df[dependent] == 1]
print("Not Purchased Min Mean and Max:" + '\n')
print(f' Min: {notpurchased[independent].min()}')
print(f' Mean: {round(notpurchased[independent].mean(), 6)}')
print(f' Max: {notpurchased[independent].max()}')
print("\n" + "Purchased Min Mean and Max:" + "\n")
print(f' Min: {purchased[independent].min()}')
print(f' Mean: {round(purchased[independent].mean(), 6)}')
print(f' Max: {purchased[independent].max()}')
#Add as many print statement blocks as required for the dependent variable, keep this count low
Data Overview¶
# df.sample(10)
| CustomerID | ProdTaken | Age | TypeofContact | CityTier | DurationOfPitch | Occupation | Gender | NumberOfPersonVisiting | NumberOfFollowups | ProductPitched | PreferredPropertyStar | MaritalStatus | NumberOfTrips | Passport | PitchSatisfactionScore | OwnCar | NumberOfChildrenVisiting | Designation | MonthlyIncome | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 996 | 200996 | 0 | 27.0 | Self Enquiry | 3 | 16.0 | Small Business | Female | 3 | 4.0 | Deluxe | 3.0 | Divorced | 2.0 | 1 | 3 | 1 | 2.0 | Manager | 20769.0 |
| 100 | 200100 | 1 | 37.0 | Self Enquiry | 2 | 12.0 | Salaried | Male | 3 | 3.0 | Basic | 5.0 | Married | 5.0 | 1 | 2 | 1 | 0.0 | Executive | 17073.0 |
| 4493 | 204493 | 0 | 35.0 | Self Enquiry | 1 | 9.0 | Small Business | Female | 3 | 5.0 | Basic | 5.0 | Unmarried | 3.0 | 0 | 1 | 1 | 1.0 | Executive | 23059.0 |
| 3579 | 203579 | 0 | 47.0 | Self Enquiry | 3 | 7.0 | Salaried | Male | 3 | 2.0 | Super Deluxe | 5.0 | Single | NaN | 0 | 4 | 1 | 2.0 | AVP | 36245.0 |
| 320 | 200320 | 0 | 27.0 | Self Enquiry | 3 | NaN | Salaried | Male | 3 | 3.0 | Deluxe | 3.0 | Single | 2.0 | 1 | 4 | 1 | 2.0 | Manager | NaN |
| 2585 | 202585 | 0 | 46.0 | Self Enquiry | 1 | 36.0 | Small Business | Male | 3 | 4.0 | Basic | 3.0 | Unmarried | 7.0 | 0 | 2 | 1 | 1.0 | Executive | 22130.0 |
| 3409 | 203409 | 0 | 26.0 | Self Enquiry | 1 | 26.0 | Small Business | Male | 4 | 4.0 | Basic | 3.0 | Divorced | 5.0 | 0 | 5 | 1 | 3.0 | Executive | 22347.0 |
| 3718 | 203718 | 0 | 32.0 | Self Enquiry | 3 | 36.0 | Small Business | Female | 4 | 5.0 | Deluxe | 3.0 | Married | 3.0 | 0 | 3 | 1 | 1.0 | Manager | 24146.0 |
| 2039 | 202039 | 0 | 36.0 | Company Invited | 3 | 14.0 | Salaried | Male | 3 | 4.0 | Standard | 5.0 | Unmarried | 2.0 | 0 | 1 | 0 | 0.0 | Senior Manager | 22587.0 |
| 3581 | 203581 | 0 | 23.0 | Company Invited | 1 | 33.0 | Salaried | Female | 3 | 5.0 | Basic | 3.0 | Married | 3.0 | 1 | 3 | 0 | 1.0 | Executive | 21492.0 |
# df.sample(10)
| CustomerID | ProdTaken | Age | TypeofContact | CityTier | DurationOfPitch | Occupation | Gender | NumberOfPersonVisiting | NumberOfFollowups | ProductPitched | PreferredPropertyStar | MaritalStatus | NumberOfTrips | Passport | PitchSatisfactionScore | OwnCar | NumberOfChildrenVisiting | Designation | MonthlyIncome | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3602 | 203602 | 0 | 38.0 | Self Enquiry | 1 | 26.0 | Salaried | Male | 3 | 4.0 | Deluxe | 3.0 | Married | 5.0 | 0 | 1 | 0 | 1.0 | Manager | 24446.0 |
| 2235 | 202235 | 0 | 38.0 | Company Invited | 1 | 9.0 | Salaried | Male | 2 | 3.0 | Basic | 3.0 | Married | 4.0 | 0 | 3 | 1 | 0.0 | Executive | 17821.0 |
| 4820 | 204820 | 1 | 35.0 | Self Enquiry | 1 | 14.0 | Salaried | Male | 4 | 4.0 | Deluxe | 3.0 | Married | 3.0 | 0 | 3 | 0 | 1.0 | Manager | 21263.0 |
| 2864 | 202864 | 0 | 30.0 | Self Enquiry | 1 | 10.0 | Small Business | Male | 4 | 5.0 | Standard | 3.0 | Divorced | 3.0 | 0 | 2 | 1 | 3.0 | Senior Manager | 30613.0 |
| 1577 | 201577 | 1 | 25.0 | Self Enquiry | 3 | 11.0 | Small Business | Male | 2 | 4.0 | Deluxe | 3.0 | Single | 2.0 | 1 | 3 | 0 | 1.0 | Manager | 20744.0 |
| 3342 | 203342 | 0 | 44.0 | Self Enquiry | 1 | 10.0 | Salaried | Male | 4 | 6.0 | King | NaN | Divorced | 5.0 | 0 | 5 | 1 | 3.0 | VP | 38418.0 |
| 1303 | 201303 | 0 | 47.0 | Self Enquiry | 1 | 10.0 | Salaried | Female | 3 | 4.0 | Standard | 4.0 | Married | 1.0 | 1 | 4 | 1 | 2.0 | Senior Manager | 25333.0 |
| 666 | 200666 | 1 | NaN | Self Enquiry | 1 | 9.0 | Salaried | Female | 2 | 3.0 | Deluxe | 3.0 | Divorced | 1.0 | 1 | 5 | 1 | 1.0 | Manager | NaN |
| 1129 | 201129 | 0 | 42.0 | Self Enquiry | 1 | 15.0 | Salaried | Male | 3 | 4.0 | King | 3.0 | Single | 1.0 | 0 | 1 | 1 | 1.0 | VP | 34613.0 |
| 1232 | 201232 | 0 | 35.0 | Self Enquiry | 1 | 33.0 | Salaried | Male | 2 | 3.0 | Deluxe | 3.0 | Married | 3.0 | 0 | 5 | 0 | 1.0 | Manager | 21883.0 |
Observation¶
Null values appear to be represented as NaN
df.shape
(4888, 20)
Observation¶
There appears to be 20 columns and 4888 rows
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 4888 entries, 0 to 4887 Data columns (total 20 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 CustomerID 4888 non-null int64 1 ProdTaken 4888 non-null int64 2 Age 4662 non-null float64 3 TypeofContact 4863 non-null object 4 CityTier 4888 non-null int64 5 DurationOfPitch 4637 non-null float64 6 Occupation 4888 non-null object 7 Gender 4888 non-null object 8 NumberOfPersonVisiting 4888 non-null int64 9 NumberOfFollowups 4843 non-null float64 10 ProductPitched 4888 non-null object 11 PreferredPropertyStar 4862 non-null float64 12 MaritalStatus 4888 non-null object 13 NumberOfTrips 4748 non-null float64 14 Passport 4888 non-null int64 15 PitchSatisfactionScore 4888 non-null int64 16 OwnCar 4888 non-null int64 17 NumberOfChildrenVisiting 4822 non-null float64 18 Designation 4888 non-null object 19 MonthlyIncome 4655 non-null float64 dtypes: float64(7), int64(7), object(6) memory usage: 763.9+ KB
Observation¶
Our data appears to all be floats, integers, and objects, and they all make sense for where they are. There appear to be a few null values in our dataset
df.isnull().sum()
CustomerID 0 ProdTaken 0 Age 226 TypeofContact 25 CityTier 0 DurationOfPitch 251 Occupation 0 Gender 0 NumberOfPersonVisiting 0 NumberOfFollowups 45 ProductPitched 0 PreferredPropertyStar 26 MaritalStatus 0 NumberOfTrips 140 Passport 0 PitchSatisfactionScore 0 OwnCar 0 NumberOfChildrenVisiting 66 Designation 0 MonthlyIncome 233 dtype: int64
Observation¶
we have welll over 200 empty rows in three columns, one column with over 100 and 4 more with less than 100, these will need to be handled
df.duplicated().sum()
0
Observation¶
There are no duplicate rows in our dataset
Exploring Numeric Data¶
df.describe().T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| CustomerID | 4888.0 | 202443.500000 | 1411.188388 | 200000.0 | 201221.75 | 202443.5 | 203665.25 | 204887.0 |
| ProdTaken | 4888.0 | 0.188216 | 0.390925 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
| Age | 4662.0 | 37.622265 | 9.316387 | 18.0 | 31.00 | 36.0 | 44.00 | 61.0 |
| CityTier | 4888.0 | 1.654255 | 0.916583 | 1.0 | 1.00 | 1.0 | 3.00 | 3.0 |
| DurationOfPitch | 4637.0 | 15.490835 | 8.519643 | 5.0 | 9.00 | 13.0 | 20.00 | 127.0 |
| NumberOfPersonVisiting | 4888.0 | 2.905074 | 0.724891 | 1.0 | 2.00 | 3.0 | 3.00 | 5.0 |
| NumberOfFollowups | 4843.0 | 3.708445 | 1.002509 | 1.0 | 3.00 | 4.0 | 4.00 | 6.0 |
| PreferredPropertyStar | 4862.0 | 3.581037 | 0.798009 | 3.0 | 3.00 | 3.0 | 4.00 | 5.0 |
| NumberOfTrips | 4748.0 | 3.236521 | 1.849019 | 1.0 | 2.00 | 3.0 | 4.00 | 22.0 |
| Passport | 4888.0 | 0.290917 | 0.454232 | 0.0 | 0.00 | 0.0 | 1.00 | 1.0 |
| PitchSatisfactionScore | 4888.0 | 3.078151 | 1.365792 | 1.0 | 2.00 | 3.0 | 4.00 | 5.0 |
| OwnCar | 4888.0 | 0.620295 | 0.485363 | 0.0 | 0.00 | 1.0 | 1.00 | 1.0 |
| NumberOfChildrenVisiting | 4822.0 | 1.187267 | 0.857861 | 0.0 | 1.00 | 1.0 | 2.00 | 3.0 |
| MonthlyIncome | 4655.0 | 23619.853491 | 5380.698361 | 1000.0 | 20346.00 | 22347.0 | 25571.00 | 98678.0 |
# df.mode()
| CustomerID | ProdTaken | Age | TypeofContact | CityTier | DurationOfPitch | Occupation | Gender | NumberOfPersonVisiting | NumberOfFollowups | ProductPitched | PreferredPropertyStar | MaritalStatus | NumberOfTrips | Passport | PitchSatisfactionScore | OwnCar | NumberOfChildrenVisiting | Designation | MonthlyIncome | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 200000 | 0.0 | 35.0 | Self Enquiry | 1.0 | 9.0 | Salaried | Male | 3.0 | 4.0 | Basic | 3.0 | Married | 2.0 | 0.0 | 3.0 | 1.0 | 1.0 | Executive | 17342.0 |
| 1 | 200001 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 20855.0 |
| 2 | 200002 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 21020.0 |
| 3 | 200003 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 21288.0 |
| 4 | 200004 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4883 | 204883 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4884 | 204884 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4885 | 204885 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4886 | 204886 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4887 | 204887 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
4888 rows × 20 columns
Observations¶
- Most people did not purchase a package
- Ages range from 18 to 61, with the middle being in the mid 30s
- City Tier ranges from 1-3, with most people being in a Tier 1 city
- Pitches ranged from 5 minutes to 2 hours with most being 13-15 minutes
- Only 1-5 people visited a destination, with most common number of people visiting being three
- There were usually 4 followups
- The most frequent product pitched was the basic package
- Preferred Properties ranged from 3-5 stars, with most being 3 stars
- Most individuals wer married
- There were between 1 to 22 trips taken annually, with the middle being 3, and most people taking 2.
- Most people did not have a passport
- Pitch Satisfaction Scores hovered around 3
- Most Individuals owned a car
- On average one child went with each person
- The most common designation was executive
- The average monthly salary hovered between 22 and 23k
# df.corr()
| CustomerID | ProdTaken | Age | CityTier | DurationOfPitch | NumberOfPersonVisiting | NumberOfFollowups | PreferredPropertyStar | NumberOfTrips | Passport | PitchSatisfactionScore | OwnCar | NumberOfChildrenVisiting | MonthlyIncome | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CustomerID | 1.000000 | 0.056506 | 0.032127 | 0.012975 | 0.064298 | 0.604013 | 0.427539 | 0.010553 | 0.224848 | 0.007974 | -0.035847 | 0.003805 | 0.511763 | 0.276833 |
| ProdTaken | 0.056506 | 1.000000 | -0.147254 | 0.086852 | 0.078257 | 0.009627 | 0.112171 | 0.099577 | 0.018898 | 0.260844 | 0.051394 | -0.011508 | 0.007421 | -0.130585 |
| Age | 0.032127 | -0.147254 | 1.000000 | -0.015625 | -0.012063 | 0.011621 | -0.002577 | -0.010474 | 0.184905 | 0.033399 | 0.018510 | 0.048654 | 0.007370 | 0.464869 |
| CityTier | 0.012975 | 0.086852 | -0.015625 | 1.000000 | 0.022703 | -0.001671 | 0.023652 | -0.009164 | -0.029709 | 0.001793 | -0.042160 | 0.003817 | 0.000672 | 0.051817 |
| DurationOfPitch | 0.064298 | 0.078257 | -0.012063 | 0.022703 | 1.000000 | 0.065141 | 0.009434 | -0.006637 | 0.009715 | 0.033034 | -0.002880 | -0.001626 | 0.031408 | -0.006252 |
| NumberOfPersonVisiting | 0.604013 | 0.009627 | 0.011621 | -0.001671 | 0.065141 | 1.000000 | 0.328569 | 0.033867 | 0.195223 | 0.011177 | -0.019581 | 0.010362 | 0.610621 | 0.195134 |
| NumberOfFollowups | 0.427539 | 0.112171 | -0.002577 | 0.023652 | 0.009434 | 0.328569 | 1.000000 | -0.024176 | 0.139517 | 0.004970 | 0.004054 | 0.012112 | 0.286425 | 0.176503 |
| PreferredPropertyStar | 0.010553 | 0.099577 | -0.010474 | -0.009164 | -0.006637 | 0.033867 | -0.024176 | 1.000000 | 0.012115 | 0.001040 | -0.022701 | 0.015742 | 0.035798 | 0.014289 |
| NumberOfTrips | 0.224848 | 0.018898 | 0.184905 | -0.029709 | 0.009715 | 0.195223 | 0.139517 | 0.012115 | 1.000000 | 0.012949 | -0.004378 | -0.011825 | 0.168795 | 0.139105 |
| Passport | 0.007974 | 0.260844 | 0.033399 | 0.001793 | 0.033034 | 0.011177 | 0.004970 | 0.001040 | 0.012949 | 1.000000 | 0.002926 | -0.022330 | 0.020264 | 0.002545 |
| PitchSatisfactionScore | -0.035847 | 0.051394 | 0.018510 | -0.042160 | -0.002880 | -0.019581 | 0.004054 | -0.022701 | -0.004378 | 0.002926 | 1.000000 | 0.068850 | 0.000878 | 0.030421 |
| OwnCar | 0.003805 | -0.011508 | 0.048654 | 0.003817 | -0.001626 | 0.010362 | 0.012112 | 0.015742 | -0.011825 | -0.022330 | 0.068850 | 1.000000 | 0.026572 | 0.080262 |
| NumberOfChildrenVisiting | 0.511763 | 0.007421 | 0.007370 | 0.000672 | 0.031408 | 0.610621 | 0.286425 | 0.035798 | 0.168795 | 0.020264 | 0.000878 | 0.026572 | 1.000000 | 0.201643 |
| MonthlyIncome | 0.276833 | -0.130585 | 0.464869 | 0.051817 | -0.006252 | 0.195134 | 0.176503 | 0.014289 | 0.139105 | 0.002545 | 0.030421 | 0.080262 | 0.201643 | 1.000000 |
Observations¶
Most correlations in this dataset are weak with ProdTaken, most heavily with NumberofFollowups, and Passport being relatively positively correlated, and age being moderately negatively correlated. Some data changes will need to be done
Checking for Odd Categorical Data¶
df.nunique()
CustomerID 4888 ProdTaken 2 Age 44 TypeofContact 2 CityTier 3 DurationOfPitch 34 Occupation 4 Gender 3 NumberOfPersonVisiting 5 NumberOfFollowups 6 ProductPitched 5 PreferredPropertyStar 3 MaritalStatus 4 NumberOfTrips 12 Passport 2 PitchSatisfactionScore 5 OwnCar 2 NumberOfChildrenVisiting 4 Designation 5 MonthlyIncome 2475 dtype: int64
Observations¶
There are some odd things here, like three unique genders and four unique marital statuses.
df['Gender'].value_counts()
Male 2916 Female 1817 Fe Male 155 Name: Gender, dtype: int64
There are 155 incorrect labels for the "Female" Category
df['MaritalStatus'].value_counts()
Married 2340 Divorced 950 Single 916 Unmarried 682 Name: MaritalStatus, dtype: int64
There isn't anything wrong with this category, but it might be possible to lump them int married and unmarried if we want to down the line
Removing Unnecessary Column(s)¶
We know each value is unique, but we do not need an ID, since we do not desire to keep track of individual customers in our dataset, rather we want to be able to predict based on values that have bearing besides autoincrements
df = df.drop(['CustomerID'], axis = 1)
# df.head(5)
| ProdTaken | Age | TypeofContact | CityTier | DurationOfPitch | Occupation | Gender | NumberOfPersonVisiting | NumberOfFollowups | ProductPitched | PreferredPropertyStar | MaritalStatus | NumberOfTrips | Passport | PitchSatisfactionScore | OwnCar | NumberOfChildrenVisiting | Designation | MonthlyIncome | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 41.0 | Self Enquiry | 3 | 6.0 | Salaried | Female | 3 | 3.0 | Deluxe | 3.0 | Single | 1.0 | 1 | 2 | 1 | 0.0 | Manager | 20993.0 |
| 1 | 0 | 49.0 | Company Invited | 1 | 14.0 | Salaried | Male | 3 | 4.0 | Deluxe | 4.0 | Divorced | 2.0 | 0 | 3 | 1 | 2.0 | Manager | 20130.0 |
| 2 | 1 | 37.0 | Self Enquiry | 1 | 8.0 | Free Lancer | Male | 3 | 4.0 | Basic | 3.0 | Single | 7.0 | 1 | 3 | 0 | 0.0 | Executive | 17090.0 |
| 3 | 0 | 33.0 | Company Invited | 1 | 9.0 | Salaried | Female | 2 | 3.0 | Basic | 3.0 | Divorced | 2.0 | 1 | 5 | 1 | 1.0 | Executive | 17909.0 |
| 4 | 0 | NaN | Self Enquiry | 1 | 8.0 | Small Business | Male | 2 | 3.0 | Basic | 4.0 | Divorced | 1.0 | 0 | 5 | 1 | 0.0 | Executive | 18468.0 |
Data Preprocessing¶
Handling Null values¶
df.isnull().sum()
ProdTaken 0 Age 226 TypeofContact 25 CityTier 0 DurationOfPitch 251 Occupation 0 Gender 0 NumberOfPersonVisiting 0 NumberOfFollowups 45 ProductPitched 0 PreferredPropertyStar 26 MaritalStatus 0 NumberOfTrips 140 Passport 0 PitchSatisfactionScore 0 OwnCar 0 NumberOfChildrenVisiting 66 Designation 0 MonthlyIncome 233 dtype: int64
Age¶
df.Age.isna().sum()
226
df.Age.describe().T
count 4662.000000 mean 37.622265 std 9.316387 min 18.000000 25% 31.000000 50% 36.000000 75% 44.000000 max 61.000000 Name: Age, dtype: float64
sns.displot(df['Age'], kde = True)
plt.axvline(x = df.Age.median(), color = 'black', ls='--')
plt.axvline(x = df.Age.mean(), color = 'red')
<matplotlib.lines.Line2D at 0x249363edb20>
Observations¶
The data is skewed right, and given that it likely has a bearing on the dependent variable, and a difference depending on whether or not they purchased, imputing the median based on whether or not they purchased ticket seems like a smart idea.
Finding the values¶
ReturnByPurchase('Age')
Purchased Mean: 34.770548 Purchased Median: 33.0 Purchased Mode: 0 29.0 Name: Age, dtype: float64 ----------- Unpurchased Mean: 38.282092 Unpurchased Median: 37.0 Unpurchased Mode: 0 36.0 Name: Age, dtype: float64
Setting the Values¶
df.loc[(df['Age'].isnull() == True) & (df['ProdTaken'] == 1), ['Age']] = 33
df.loc[(df['Age'].isnull() == True) & (df['ProdTaken'] == 0),['Age']] = 37
Examining Changes¶
df.Age.isnull().sum()
0
df.Age.describe().T
count 4888.000000 mean 37.557488 std 9.109545 min 18.000000 25% 31.000000 50% 37.000000 75% 43.000000 max 61.000000 Name: Age, dtype: float64
sns.displot(df['Age'], kde = True)
plt.axvline(x = df.Age.median(), color = 'black', ls='--')
plt.axvline(x = df.Age.mean(), color = 'red')
<matplotlib.lines.Line2D at 0x24937bbf520>
The median and the mean appear to have been brought closer together in this dataset due to the imputing we just did, but overall the datas structure has not changed much
Type of Contact¶
df.TypeofContact.isna().sum()
25
# df[df['TypeofContact'].isnull()]
| ProdTaken | Age | TypeofContact | CityTier | DurationOfPitch | Occupation | Gender | NumberOfPersonVisiting | NumberOfFollowups | ProductPitched | PreferredPropertyStar | MaritalStatus | NumberOfTrips | Passport | PitchSatisfactionScore | OwnCar | NumberOfChildrenVisiting | Designation | MonthlyIncome | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 224 | 0 | 31.0 | NaN | 1 | NaN | Small Business | Male | 2 | 5.0 | Deluxe | 3.0 | Divorced | 1.0 | 0 | 3 | 1 | 0.0 | Manager | NaN |
| 571 | 0 | 26.0 | NaN | 1 | NaN | Salaried | Female | 3 | 5.0 | Basic | 3.0 | Married | 4.0 | 0 | 4 | 1 | 2.0 | Executive | NaN |
| 572 | 0 | 29.0 | NaN | 1 | NaN | Small Business | Female | 3 | 3.0 | Deluxe | 3.0 | Divorced | 5.0 | 0 | 2 | 1 | 0.0 | Manager | NaN |
| 576 | 0 | 27.0 | NaN | 3 | NaN | Small Business | Male | 2 | 3.0 | Deluxe | 3.0 | Divorced | 1.0 | 0 | 3 | 0 | 1.0 | Manager | NaN |
| 579 | 0 | 34.0 | NaN | 1 | NaN | Small Business | Female | 2 | 4.0 | Basic | 5.0 | Single | 2.0 | 0 | 2 | 1 | 1.0 | Executive | NaN |
| 598 | 1 | 28.0 | NaN | 1 | NaN | Small Business | Male | 2 | 3.0 | Basic | 3.0 | Single | 7.0 | 0 | 3 | 0 | 0.0 | Executive | NaN |
| 622 | 0 | 32.0 | NaN | 3 | NaN | Salaried | Male | 3 | 3.0 | Deluxe | 3.0 | Married | 3.0 | 0 | 2 | 0 | 0.0 | Manager | NaN |
| 724 | 0 | 24.0 | NaN | 1 | NaN | Small Business | Female | 2 | 4.0 | Deluxe | 3.0 | Married | 2.0 | 0 | 3 | 1 | 1.0 | Manager | NaN |
| 843 | 0 | 26.0 | NaN | 1 | NaN | Small Business | Male | 2 | 1.0 | Basic | 3.0 | Divorced | 2.0 | 0 | 5 | 1 | 1.0 | Executive | NaN |
| 1021 | 1 | 25.0 | NaN | 3 | NaN | Salaried | Male | 3 | 4.0 | Basic | 5.0 | Divorced | 4.0 | 0 | 1 | 1 | 0.0 | Executive | NaN |
| 1047 | 0 | 33.0 | NaN | 3 | NaN | Small Business | Male | 2 | 3.0 | Deluxe | 5.0 | Divorced | 1.0 | 0 | 3 | 0 | 0.0 | Manager | NaN |
| 1143 | 0 | 45.0 | NaN | 3 | NaN | Small Business | Male | 2 | 4.0 | Deluxe | 5.0 | Married | 2.0 | 0 | 3 | 0 | 0.0 | Manager | NaN |
| 1182 | 0 | 36.0 | NaN | 1 | NaN | Small Business | Female | 2 | 4.0 | Deluxe | 3.0 | Married | 1.0 | 0 | 5 | 1 | 1.0 | Manager | NaN |
| 1217 | 0 | 24.0 | NaN | 1 | NaN | Small Business | Male | 3 | 1.0 | Basic | 3.0 | Married | 2.0 | 0 | 1 | 0 | 0.0 | Executive | NaN |
| 1356 | 0 | 41.0 | NaN | 3 | NaN | Small Business | Female | 2 | 3.0 | Deluxe | 4.0 | Married | 6.0 | 0 | 3 | 1 | 1.0 | Manager | NaN |
| 1469 | 0 | 34.0 | NaN | 1 | NaN | Small Business | Male | 2 | 1.0 | Deluxe | 3.0 | Married | 3.0 | 0 | 3 | 0 | 1.0 | Manager | NaN |
| 1694 | 0 | 31.0 | NaN | 1 | NaN | Small Business | Male | 2 | 5.0 | Deluxe | 3.0 | Married | 1.0 | 0 | 3 | 0 | 0.0 | Manager | NaN |
| 2041 | 0 | 26.0 | NaN | 1 | NaN | Salaried | Female | 3 | 5.0 | Basic | 3.0 | Married | 4.0 | 0 | 4 | 1 | 0.0 | Executive | NaN |
| 2042 | 0 | 29.0 | NaN | 1 | NaN | Small Business | Female | 3 | 3.0 | Deluxe | 3.0 | Married | 5.0 | 0 | 1 | 0 | 1.0 | Manager | NaN |
| 2046 | 0 | 27.0 | NaN | 3 | NaN | Small Business | Male | 2 | 3.0 | Deluxe | 3.0 | Married | 1.0 | 0 | 3 | 1 | 1.0 | Manager | NaN |
| 2049 | 0 | 34.0 | NaN | 1 | NaN | Small Business | Female | 2 | 4.0 | Basic | 5.0 | Single | 2.0 | 0 | 1 | 1 | 0.0 | Executive | NaN |
| 2068 | 1 | 28.0 | NaN | 1 | NaN | Small Business | Male | 2 | 3.0 | Basic | 3.0 | Single | 7.0 | 0 | 3 | 1 | 1.0 | Executive | NaN |
| 2092 | 0 | 32.0 | NaN | 3 | NaN | Salaried | Male | 3 | 3.0 | Deluxe | 3.0 | Married | 3.0 | 0 | 1 | 0 | 2.0 | Manager | NaN |
| 2194 | 0 | 24.0 | NaN | 1 | NaN | Small Business | Female | 2 | 4.0 | Deluxe | 3.0 | Married | 2.0 | 0 | 3 | 0 | 0.0 | Manager | NaN |
| 2313 | 0 | 26.0 | NaN | 1 | NaN | Small Business | Male | 2 | 1.0 | Basic | 3.0 | Married | 2.0 | 0 | 5 | 1 | 1.0 | Executive | NaN |
df['TypeofContact'].mode()
0 Self Enquiry Name: TypeofContact, dtype: object
df.loc[df.TypeofContact.isnull(), 'TypeofContact'] = 'Self Enquiry'
Notes¶
With only 25 missing fields, and the vast majority of TypeofContact being self enquiry, we just imputed with the mode of the TypeofContact.
df.TypeofContact.isna().sum()
0
Duration of Pitch¶
df.DurationOfPitch.isna().sum()
251
df.DurationOfPitch.describe()
count 4637.000000 mean 15.490835 std 8.519643 min 5.000000 25% 9.000000 50% 13.000000 75% 20.000000 max 127.000000 Name: DurationOfPitch, dtype: float64
sns.displot(df['DurationOfPitch'], kde = True)
plt.axvline(x = df.DurationOfPitch.median(), color = 'black', ls='--')
plt.axvline(x = df.DurationOfPitch.mean(), color = 'red')
<matplotlib.lines.Line2D at 0x24937bd7ca0>
Observations¶
This data is extremely skewed right, we will have to come back when we are examining outliers later on and fix this, as I believe the 120 is an inaccurate number, given the KDE near 1 out to 120. We will have to impute with the median rather than the mean for null values. Due to the large number, splitting by whether they purchased or not seems wise again.
Finding the Values¶
ReturnByPurchase('DurationOfPitch')
Purchased Mean: 16.873143 Purchased Median: 15.0 Purchased Mode: 0 9.0 Name: DurationOfPitch, dtype: float64 ----------- Unpurchased Mean: 15.169325 Unpurchased Median: 13.0 Unpurchased Mode: 0 9.0 Name: DurationOfPitch, dtype: float64
Setting the Values¶
df.loc[(df['DurationOfPitch'].isnull() == True) & (df['ProdTaken'] == 1), ['DurationOfPitch']] = 15
df.loc[(df['DurationOfPitch'].isnull() == True) & (df['ProdTaken'] == 0),['DurationOfPitch']] = 13
Examining Changes¶
df.DurationOfPitch.describe()
count 4888.000000 mean 15.381342 std 8.313127 min 5.000000 25% 9.000000 50% 13.000000 75% 19.000000 max 127.000000 Name: DurationOfPitch, dtype: float64
sns.displot(df['DurationOfPitch'], kde = True)
plt.axvline(x = df.DurationOfPitch.median(), color = 'black', ls='--')
plt.axvline(x = df.DurationOfPitch.mean(), color = 'red')
<matplotlib.lines.Line2D at 0x24937e9bfd0>
There appears to be little difference in the structure of the data after these changes, except the increase in the number of objects in the median
Number of Followups¶
df.NumberOfFollowups.isna().sum()
45
df.NumberOfFollowups.describe()
count 4843.000000 mean 3.708445 std 1.002509 min 1.000000 25% 3.000000 50% 4.000000 75% 4.000000 max 6.000000 Name: NumberOfFollowups, dtype: float64
sns.displot(df['NumberOfFollowups'], kde = True)
plt.axvline(x = df.NumberOfFollowups.median(), color = 'black', ls='--')
plt.axvline(x = df.NumberOfFollowups.mean(), color = 'red')
<matplotlib.lines.Line2D at 0x24937f74310>
Observations¶
This is mainly categorical with an integer, with the median and mode both being 4, and the mean a little off at 3.7. This likely has bearing on the dependent variable, so we will impute with that in mind
Finding the Values¶
ReturnByPurchase('NumberOfFollowups')
Purchased Mean: 3.941886 Purchased Median: 4.0 Purchased Mode: 0 4.0 Name: NumberOfFollowups, dtype: float64 ----------- Unpurchased Mean: 3.654286 Unpurchased Median: 4.0 Unpurchased Mode: 0 4.0 Name: NumberOfFollowups, dtype: float64
Setting the Values¶
df.loc[df['NumberOfFollowups'].isnull(), ['NumberOfFollowups']] = 4
Checking the Results¶
df.NumberOfFollowups.isna().sum()
0
df.NumberOfFollowups.describe()
count 4888.000000 mean 3.711129 std 0.998271 min 1.000000 25% 3.000000 50% 4.000000 75% 4.000000 max 6.000000 Name: NumberOfFollowups, dtype: float64
sns.displot(df['NumberOfFollowups'], kde = True)
plt.axvline(x = df.NumberOfFollowups.median(), color = 'black', ls='--')
plt.axvline(x = df.NumberOfFollowups.mean(), color = 'red')
<matplotlib.lines.Line2D at 0x24937fe0610>
The data looks to mostly be in the same shape as before, due to the low number of changes to fields
Preferred Property Star¶
df.PreferredPropertyStar.isna().sum()
26
df.PreferredPropertyStar.describe()
count 4862.000000 mean 3.581037 std 0.798009 min 3.000000 25% 3.000000 50% 3.000000 75% 4.000000 max 5.000000 Name: PreferredPropertyStar, dtype: float64
sns.displot(df['PreferredPropertyStar'], kde = True)
plt.axvline(x = df.PreferredPropertyStar.median(), color = 'black', ls='--')
plt.axvline(x = df.PreferredPropertyStar.mean(), color = 'red')
<matplotlib.lines.Line2D at 0x249380501f0>
Observations¶
This is another categorical value with an integer base. The median and Mode are both at 3, and the mean is a bit above at 3.58 I will impute with the median once more
Finding the Values¶
ReturnByPurchase('PreferredPropertyStar')
Purchased Mean: 3.746171 Purchased Median: 3.0 Purchased Mode: 0 3.0 Name: PreferredPropertyStar, dtype: float64 ----------- Unpurchased Mean: 3.542806 Unpurchased Median: 3.0 Unpurchased Mode: 0 3.0 Name: PreferredPropertyStar, dtype: float64
df.loc[df['PreferredPropertyStar'].isnull(), ['PreferredPropertyStar']] = 3
Checking the Results¶
df.PreferredPropertyStar.isna().sum()
0
df.PreferredPropertyStar.describe()
count 4888.000000 mean 3.577946 std 0.797005 min 3.000000 25% 3.000000 50% 3.000000 75% 4.000000 max 5.000000 Name: PreferredPropertyStar, dtype: float64
sns.displot(df['PreferredPropertyStar'], kde = True)
plt.axvline(x = df.PreferredPropertyStar.median(), color = 'black', ls='--')
plt.axvline(x = df.PreferredPropertyStar.mean(), color = 'red')
<matplotlib.lines.Line2D at 0x249380d2f10>
As expected at this point, the data appears to mostly be the same as before.
Number of Trips¶
df.NumberOfTrips.isna().sum()
140
df.NumberOfTrips.describe()
count 4748.000000 mean 3.236521 std 1.849019 min 1.000000 25% 2.000000 50% 3.000000 75% 4.000000 max 22.000000 Name: NumberOfTrips, dtype: float64
sns.displot(df['NumberOfTrips'], kde = True)
plt.axvline(x = df.NumberOfTrips.median(), color = 'black', ls='--')
plt.axvline(x = df.NumberOfTrips.mean(), color = 'red')
<matplotlib.lines.Line2D at 0x2493912a880>
Observations¶
The median and mean are close, and there is definitely an outlier up at 22, given the right skew, I will impute our missing values with the Median
Finding the Values¶
ReturnByPurchase('NumberOfTrips')
Purchased Mean: 3.30837 Purchased Median: 3.0 Purchased Mode: 0 2.0 Name: NumberOfTrips, dtype: float64 ----------- Unpurchased Mean: 3.219531 Unpurchased Median: 3.0 Unpurchased Mode: 0 2.0 Name: NumberOfTrips, dtype: float64
Changing the Values¶
df.loc[df['NumberOfTrips'].isnull(), ['NumberOfTrips']] = 3
Checking the Changes¶
df.NumberOfTrips.isna().sum()
0
df.NumberOfTrips.describe()
count 4888.000000 mean 3.229746 std 1.822769 min 1.000000 25% 2.000000 50% 3.000000 75% 4.000000 max 22.000000 Name: NumberOfTrips, dtype: float64
sns.displot(df['NumberOfTrips'], kde = True)
plt.axvline(x = df.NumberOfTrips.median(), color = 'black', ls='--')
plt.axvline(x = df.NumberOfTrips.mean(), color = 'red')
<matplotlib.lines.Line2D at 0x249391926a0>
At the risk of sounding like a broken record, our data looks similar to how it was initially which we expected
Number of Children Visitng¶
df.NumberOfChildrenVisiting.isna().sum()
66
df.NumberOfChildrenVisiting.describe()
count 4822.000000 mean 1.187267 std 0.857861 min 0.000000 25% 1.000000 50% 1.000000 75% 2.000000 max 3.000000 Name: NumberOfChildrenVisiting, dtype: float64
sns.displot(df['NumberOfChildrenVisiting'], kde = True)
plt.axvline(x = df.NumberOfChildrenVisiting.median(), color = 'black', ls='--')
plt.axvline(x = df.NumberOfChildrenVisiting.mean(), color = 'red')
<matplotlib.lines.Line2D at 0x24939393970>
Observations¶
The median and mode are equal at one, and the mean is slightly higher at 1.18, we will impute the missing values with the median
Finding the Values¶
ReturnByPurchase('NumberOfChildrenVisiting')
Purchased Mean: 1.200438 Purchased Median: 1.0 Purchased Mode: 0 1.0 Name: NumberOfChildrenVisiting, dtype: float64 ----------- Unpurchased Mean: 1.18419 Unpurchased Median: 1.0 Unpurchased Mode: 0 1.0 Name: NumberOfChildrenVisiting, dtype: float64
Changing the Values¶
df.loc[df['NumberOfChildrenVisiting'].isnull(), ['NumberOfChildrenVisiting']] = 1
Checking the Changes¶
df.NumberOfChildrenVisiting.isna().sum()
0
df.NumberOfChildrenVisiting.describe()
count 4888.000000 mean 1.184738 std 0.852323 min 0.000000 25% 1.000000 50% 1.000000 75% 2.000000 max 3.000000 Name: NumberOfChildrenVisiting, dtype: float64
sns.displot(df['NumberOfChildrenVisiting'], kde = True)
plt.axvline(x = df.NumberOfChildrenVisiting.median(), color = 'black', ls='--')
plt.axvline(x = df.NumberOfChildrenVisiting.mean(), color = 'red')
<matplotlib.lines.Line2D at 0x2493924f580>
Monthly Income¶
df.MonthlyIncome.isna().sum()
233
df.MonthlyIncome.describe()
count 4655.000000 mean 23619.853491 std 5380.698361 min 1000.000000 25% 20346.000000 50% 22347.000000 75% 25571.000000 max 98678.000000 Name: MonthlyIncome, dtype: float64
sns.displot(df['MonthlyIncome'], kde = True)
plt.axvline(x = df.MonthlyIncome.median(), color = 'black', ls='--')
plt.axvline(x = df.MonthlyIncome.mean(), color = 'red')
<matplotlib.lines.Line2D at 0x2493951fb50>
Observations¶
We have a few pieces of data causing skewing to the right, with a massive max, as well as some outliers more than likely beneath our dataset. Out median and mean are fairly close in the grand scheme of things, so imputing with median will be helpful in bringing them closer.
Finding the Values¶
ReturnByPurchase('MonthlyIncome')
Purchased Mean: 22172.827703 Purchased Median: 21172.0 Purchased Mode: 0 17293.0 1 17404.0 2 20971.0 3 21082.0 Name: MonthlyIncome, dtype: float64 ----------- Unpurchased Mean: 23960.962835 Unpurchased Median: 22729.0 Unpurchased Mode: 0 20855.0 Name: MonthlyIncome, dtype: float64
Changing the Values¶
df.loc[(df['MonthlyIncome'].isnull() == True) & (df['ProdTaken'] == 1), ['MonthlyIncome']] = 21172
df.loc[(df['MonthlyIncome'].isnull() == True) & (df['ProdTaken'] == 0),['MonthlyIncome']] = 22729
Checking the Changes¶
df.MonthlyIncome.isna().sum()
0
df.MonthlyIncome.describe()
count 4888.000000 mean 23567.195376 std 5257.438805 min 1000.000000 25% 20485.000000 50% 22595.500000 75% 25424.750000 max 98678.000000 Name: MonthlyIncome, dtype: float64
sns.displot(df['NumberOfTrips'], kde = True)
plt.axvline(x = df.NumberOfTrips.median(), color = 'black', ls='--')
plt.axvline(x = df.NumberOfTrips.mean(), color = 'red')
<matplotlib.lines.Line2D at 0x2493a708070>
The mean dropped a little bit, and the median increased some, but the overall structure of the data stays mostly unchanged
Confirming no more Null Values¶
df.isna().sum()
ProdTaken 0 Age 0 TypeofContact 0 CityTier 0 DurationOfPitch 0 Occupation 0 Gender 0 NumberOfPersonVisiting 0 NumberOfFollowups 0 ProductPitched 0 PreferredPropertyStar 0 MaritalStatus 0 NumberOfTrips 0 Passport 0 PitchSatisfactionScore 0 OwnCar 0 NumberOfChildrenVisiting 0 Designation 0 MonthlyIncome 0 dtype: int64
Handling Outliers¶
df.describe().T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| ProdTaken | 4888.0 | 0.188216 | 0.390925 | 0.0 | 0.0 | 0.0 | 0.00 | 1.0 |
| Age | 4888.0 | 37.557488 | 9.109545 | 18.0 | 31.0 | 37.0 | 43.00 | 61.0 |
| CityTier | 4888.0 | 1.654255 | 0.916583 | 1.0 | 1.0 | 1.0 | 3.00 | 3.0 |
| DurationOfPitch | 4888.0 | 15.381342 | 8.313127 | 5.0 | 9.0 | 13.0 | 19.00 | 127.0 |
| NumberOfPersonVisiting | 4888.0 | 2.905074 | 0.724891 | 1.0 | 2.0 | 3.0 | 3.00 | 5.0 |
| NumberOfFollowups | 4888.0 | 3.711129 | 0.998271 | 1.0 | 3.0 | 4.0 | 4.00 | 6.0 |
| PreferredPropertyStar | 4888.0 | 3.577946 | 0.797005 | 3.0 | 3.0 | 3.0 | 4.00 | 5.0 |
| NumberOfTrips | 4888.0 | 3.229746 | 1.822769 | 1.0 | 2.0 | 3.0 | 4.00 | 22.0 |
| Passport | 4888.0 | 0.290917 | 0.454232 | 0.0 | 0.0 | 0.0 | 1.00 | 1.0 |
| PitchSatisfactionScore | 4888.0 | 3.078151 | 1.365792 | 1.0 | 2.0 | 3.0 | 4.00 | 5.0 |
| OwnCar | 4888.0 | 0.620295 | 0.485363 | 0.0 | 0.0 | 1.0 | 1.00 | 1.0 |
| NumberOfChildrenVisiting | 4888.0 | 1.184738 | 0.852323 | 0.0 | 1.0 | 1.0 | 2.00 | 3.0 |
| MonthlyIncome | 4888.0 | 23567.195376 | 5257.438805 | 1000.0 | 20485.0 | 22595.5 | 25424.75 | 98678.0 |
Duration of Pitch¶
df.DurationOfPitch.describe()
count 4888.000000 mean 15.381342 std 8.313127 min 5.000000 25% 9.000000 50% 13.000000 75% 19.000000 max 127.000000 Name: DurationOfPitch, dtype: float64
histogramboxplot(data = df, feature = 'DurationOfPitch')
There are some pretty extreme outliers in Duration of Pitch, so cleaning up these outliers will really affect the way this data is structured, and hopefully produce better results. The ones close to the whisker are likely accurate, but the ones at 120 minutes seem like erronous entry, so I will set it to the median.
# df[df['DurationOfPitch'] > 120]
| ProdTaken | Age | TypeofContact | CityTier | DurationOfPitch | Occupation | Gender | NumberOfPersonVisiting | NumberOfFollowups | ProductPitched | PreferredPropertyStar | MaritalStatus | NumberOfTrips | Passport | PitchSatisfactionScore | OwnCar | NumberOfChildrenVisiting | Designation | MonthlyIncome | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1434 | 0 | 37.0 | Company Invited | 3 | 126.0 | Salaried | Male | 2 | 3.0 | Basic | 3.0 | Married | 3.0 | 0 | 1 | 1 | 1.0 | Executive | 18482.0 |
| 3878 | 0 | 53.0 | Company Invited | 3 | 127.0 | Salaried | Male | 3 | 4.0 | Basic | 3.0 | Married | 4.0 | 0 | 1 | 1 | 2.0 | Executive | 22160.0 |
df.groupby("ProdTaken")["DurationOfPitch"].median()
ProdTaken 0 13.0 1 15.0 Name: DurationOfPitch, dtype: float64
df.loc[df['DurationOfPitch'] > 125, 'DurationOfPitch'] = 13
histogramboxplot(data = df, feature = 'DurationOfPitch')
This has improved the look of the dataset immediately, and makes it more approachable for further modification if needed, but ultimately likely will not
Number of Persons Visiting¶
df.DurationOfPitch.describe()
count 4888.000000 mean 15.334902 std 8.003437 min 5.000000 25% 9.000000 50% 13.000000 75% 19.000000 max 36.000000 Name: DurationOfPitch, dtype: float64
histogramboxplot(data = df, feature = 'NumberOfPersonVisiting')
This only has one outlier at the top, and it feels like it is likely an accurate value, so I will be leaving it for the time being
Number of Followups¶
df.DurationOfPitch.describe()
count 4888.000000 mean 15.334902 std 8.003437 min 5.000000 25% 9.000000 50% 13.000000 75% 19.000000 max 36.000000 Name: DurationOfPitch, dtype: float64
histogramboxplot(data = df, feature = 'NumberOfFollowups')
This one has a single datapoint above and a single one below that happens to be out of the whiskers, and seem accurate, so I will be leaving them be.
Number of Trips¶
df.DurationOfPitch.describe()
count 4888.000000 mean 15.334902 std 8.003437 min 5.000000 25% 9.000000 50% 13.000000 75% 19.000000 max 36.000000 Name: DurationOfPitch, dtype: float64
histogramboxplot(data = df, feature = 'NumberOfTrips')
This one was a debate, it is possible that the outliers at the top were accurate, but they also seem a bit extreme for the dataset, and may throw off a model. I have decided to take the values that are above 15 and impute them with the median, but leave the one near ten
# df[df['NumberOfTrips'] > 15]
| ProdTaken | Age | TypeofContact | CityTier | DurationOfPitch | Occupation | Gender | NumberOfPersonVisiting | NumberOfFollowups | ProductPitched | PreferredPropertyStar | MaritalStatus | NumberOfTrips | Passport | PitchSatisfactionScore | OwnCar | NumberOfChildrenVisiting | Designation | MonthlyIncome | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 385 | 1 | 30.0 | Company Invited | 1 | 10.0 | Large Business | Male | 2 | 3.0 | Basic | 3.0 | Single | 19.0 | 1 | 4 | 1 | 1.0 | Executive | 17285.0 |
| 816 | 0 | 39.0 | Company Invited | 1 | 15.0 | Salaried | Male | 3 | 3.0 | Deluxe | 4.0 | Unmarried | 21.0 | 0 | 2 | 1 | 0.0 | Manager | 21782.0 |
| 2829 | 1 | 31.0 | Company Invited | 1 | 11.0 | Large Business | Male | 3 | 4.0 | Basic | 3.0 | Single | 20.0 | 1 | 4 | 1 | 2.0 | Executive | 20963.0 |
| 3260 | 0 | 40.0 | Company Invited | 1 | 16.0 | Salaried | Male | 4 | 4.0 | Deluxe | 4.0 | Unmarried | 22.0 | 0 | 2 | 1 | 1.0 | Manager | 25460.0 |
df.groupby("ProdTaken")["NumberOfTrips"].median()
ProdTaken 0 3.0 1 3.0 Name: NumberOfTrips, dtype: float64
df.loc[df['NumberOfTrips'] > 18, 'NumberOfTrips'] = 3
histogramboxplot(data = df, feature = 'NumberOfTrips')
This seems to keep the shape of the data well intact, while leaving a likely accurate outlier in the data to prevent model overfit
Monthly Income¶
df.DurationOfPitch.describe()
count 4888.000000 mean 15.334902 std 8.003437 min 5.000000 25% 9.000000 50% 13.000000 75% 19.000000 max 36.000000 Name: DurationOfPitch, dtype: float64
histogramboxplot(data = df, feature = 'MonthlyIncome')
There are some outliers low, and a ton of outliers high. The ones below 40,000 seem likely accurate, but the ones approaching 100,000 a month seem likely erronous. I will examine the low values, determine if their data seems erronous, and then decide how to impute them. I will leave the values below 40,000 intact and impute the ones above 80,000 with the median
# df[df['MonthlyIncome'] < 10000]
| ProdTaken | Age | TypeofContact | CityTier | DurationOfPitch | Occupation | Gender | NumberOfPersonVisiting | NumberOfFollowups | ProductPitched | PreferredPropertyStar | MaritalStatus | NumberOfTrips | Passport | PitchSatisfactionScore | OwnCar | NumberOfChildrenVisiting | Designation | MonthlyIncome | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 142 | 0 | 38.0 | Self Enquiry | 1 | 9.0 | Large Business | Female | 2 | 3.0 | Deluxe | 3.0 | Single | 4.0 | 1 | 5 | 0 | 0.0 | Manager | 1000.0 |
| 2586 | 0 | 39.0 | Self Enquiry | 1 | 10.0 | Large Business | Female | 3 | 4.0 | Deluxe | 3.0 | Single | 5.0 | 1 | 5 | 0 | 1.0 | Manager | 4678.0 |
These values seem inaccurate given the fact that they are living in a major city, and are a manager in large busines. I will impute them with the median
# df[df['MonthlyIncome'] > 80000]
| ProdTaken | Age | TypeofContact | CityTier | DurationOfPitch | Occupation | Gender | NumberOfPersonVisiting | NumberOfFollowups | ProductPitched | PreferredPropertyStar | MaritalStatus | NumberOfTrips | Passport | PitchSatisfactionScore | OwnCar | NumberOfChildrenVisiting | Designation | MonthlyIncome | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 38 | 0 | 36.0 | Self Enquiry | 1 | 11.0 | Salaried | Female | 2 | 4.0 | Basic | 3.0 | Divorced | 1.0 | 1 | 2 | 1 | 0.0 | Executive | 95000.0 |
| 2482 | 0 | 37.0 | Self Enquiry | 1 | 12.0 | Salaried | Female | 3 | 5.0 | Basic | 5.0 | Divorced | 2.0 | 1 | 2 | 1 | 1.0 | Executive | 98678.0 |
df.groupby("ProdTaken")["MonthlyIncome"].median()
ProdTaken 0 22729.0 1 21172.0 Name: MonthlyIncome, dtype: float64
df.loc[df['MonthlyIncome'] > 80000, 'MonthlyIncome'] = 22729
df.loc[df['MonthlyIncome'] < 10000, 'MonthlyIncome'] = 22729
histogramboxplot(data = df, feature = 'MonthlyIncome')
This seems much better in terms of data, I will likely log scale this later due to the high number just outside the the upper whisker, but for now, I will leave it be.
Remove erronous Categorical Data¶
df['Gender'].value_counts()
Male 2916 Female 1817 Fe Male 155 Name: Gender, dtype: int64
df.loc[df['Gender'] == 'Fe Male', ['Gender']] = 'Female'
df['Gender'].value_counts()
Male 2916 Female 1972 Name: Gender, dtype: int64
Exploratory Data Analysis¶
Univariate Analysis¶
# df.head(5)
| ProdTaken | Age | TypeofContact | CityTier | DurationOfPitch | Occupation | Gender | NumberOfPersonVisiting | NumberOfFollowups | ProductPitched | PreferredPropertyStar | MaritalStatus | NumberOfTrips | Passport | PitchSatisfactionScore | OwnCar | NumberOfChildrenVisiting | Designation | MonthlyIncome | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 41.0 | Self Enquiry | 3 | 6.0 | Salaried | Female | 3 | 3.0 | Deluxe | 3.0 | Single | 1.0 | 1 | 2 | 1 | 0.0 | Manager | 20993.0 |
| 1 | 0 | 49.0 | Company Invited | 1 | 14.0 | Salaried | Male | 3 | 4.0 | Deluxe | 4.0 | Divorced | 2.0 | 0 | 3 | 1 | 2.0 | Manager | 20130.0 |
| 2 | 1 | 37.0 | Self Enquiry | 1 | 8.0 | Free Lancer | Male | 3 | 4.0 | Basic | 3.0 | Single | 7.0 | 1 | 3 | 0 | 0.0 | Executive | 17090.0 |
| 3 | 0 | 33.0 | Company Invited | 1 | 9.0 | Salaried | Female | 2 | 3.0 | Basic | 3.0 | Divorced | 2.0 | 1 | 5 | 1 | 1.0 | Executive | 17909.0 |
| 4 | 0 | 37.0 | Self Enquiry | 1 | 8.0 | Small Business | Male | 2 | 3.0 | Basic | 4.0 | Divorced | 1.0 | 0 | 5 | 1 | 0.0 | Executive | 18468.0 |
df.describe().T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| ProdTaken | 4888.0 | 0.188216 | 0.390925 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
| Age | 4888.0 | 37.557488 | 9.109545 | 18.0 | 31.00 | 37.0 | 43.00 | 61.0 |
| CityTier | 4888.0 | 1.654255 | 0.916583 | 1.0 | 1.00 | 1.0 | 3.00 | 3.0 |
| DurationOfPitch | 4888.0 | 15.334902 | 8.003437 | 5.0 | 9.00 | 13.0 | 19.00 | 36.0 |
| NumberOfPersonVisiting | 4888.0 | 2.905074 | 0.724891 | 1.0 | 2.00 | 3.0 | 3.00 | 5.0 |
| NumberOfFollowups | 4888.0 | 3.711129 | 0.998271 | 1.0 | 3.00 | 4.0 | 4.00 | 6.0 |
| PreferredPropertyStar | 4888.0 | 3.577946 | 0.797005 | 3.0 | 3.00 | 3.0 | 4.00 | 5.0 |
| NumberOfTrips | 4888.0 | 3.215426 | 1.754188 | 1.0 | 2.00 | 3.0 | 4.00 | 8.0 |
| Passport | 4888.0 | 0.290917 | 0.454232 | 0.0 | 0.00 | 0.0 | 1.00 | 1.0 |
| PitchSatisfactionScore | 4888.0 | 3.078151 | 1.365792 | 1.0 | 2.00 | 3.0 | 4.00 | 5.0 |
| OwnCar | 4888.0 | 0.620295 | 0.485363 | 0.0 | 0.00 | 1.0 | 1.00 | 1.0 |
| NumberOfChildrenVisiting | 4888.0 | 1.184738 | 0.852323 | 0.0 | 1.00 | 1.0 | 2.00 | 3.0 |
| MonthlyIncome | 4888.0 | 23545.010434 | 5026.428314 | 16009.0 | 20486.75 | 22596.5 | 25407.75 | 38677.0 |
Age¶
descviolinbox(df, 'Age')
count 4888.000000 mean 37.557488 std 9.109545 min 18.000000 25% 31.000000 50% 37.000000 75% 43.000000 max 61.000000 Name: Age, dtype: float64
labeldisplot(df, 'Age')
bounds(df, 'Age')
Q1: 31.0 Q3: 43.0 IQR: 12.0 lower limit: 13.0 upper limit: 61.0
(13.0, 61.0, 31.0, 43.0)
numericdata(df, 'Age')
Mean: 37.557488 Median: 37.0 Mode: 37.0 Range: 12.0
Observations¶
Mean is 37.5 median is 37, mode is also 37 and the IQR is 12
- Median and mean are close, with the mean slightly to the right
- There is a slight right skew to the data, but not enough to fix with a log scale down the road
- We will probably want to scale our data, given that it has higher values than most of our data
- Interesting thing that might require looking at the dataset, some ages have double the count to those next to them, not something normally anticipated with age.
Duration of Pitch¶
descviolinbox(df, 'DurationOfPitch')
count 4888.000000 mean 15.334902 std 8.003437 min 5.000000 25% 9.000000 50% 13.000000 75% 19.000000 max 36.000000 Name: DurationOfPitch, dtype: float64
labeldisplot(df, 'DurationOfPitch')
bounds(df, 'DurationOfPitch')
Q1: 9.0 Q3: 19.0 IQR: 10.0 lower limit: -6.0 upper limit: 34.0
(-6.0, 34.0, 9.0, 19.0)
numericdata(df, 'DurationOfPitch')
Mean: 15.334902 Median: 13.0 Mode: 9.0 Range: 10.0
Observations¶
Mean is 15.3, Median is 13, Mode is 9, and the IQR is 10
- We have a pretty significant right skew in our data
- The KDE is important in this graph, because our initial thought was the Mode was 13
- Due to the long right tail, a log scale might be a good idea
- We also may want to just bin the data, as keeping this numeric might overfit the model
Monthly Income¶
descviolinbox(df, 'MonthlyIncome')
count 4888.000000 mean 23545.010434 std 5026.428314 min 16009.000000 25% 20486.750000 50% 22596.500000 75% 25407.750000 max 38677.000000 Name: MonthlyIncome, dtype: float64
labeldisplot(df, 'MonthlyIncome')
bounds(df, 'MonthlyIncome')
Q1: 20486.75 Q3: 25407.75 IQR: 4921.0 lower limit: 13105.25 upper limit: 32789.25
(13105.25, 32789.25, 20486.75, 25407.75)
numericdata(df, 'MonthlyIncome')
Mean: 23545.010434 Median: 22596.5 Mode: 22729.0 Range: 4921.0
Observations¶
Mean is \$23,545, Median is \$22,597, Mode is \$22,729 and the range is \$4,921
- This data has an extremely long right tale, and correspondingly a significant right skew
- The data is not very close to a normal distribution
- most of the right tail is flat, meaning there are roughly the same number of records
- There is a gap just below 20,000 which doesn't seem like accurate data, acquiring information for customers in this range may help us fit our model
- This seems to be a prime candidate for a log scale, as it has a long right tail, and the numeric values are significantly larger than the rest of the dataset
Prod Taken¶
labeledbarplot(df, 'ProdTaken')
Observations¶
Almost 4 times as many people did not purchase a product as those who did.
Type of Contact¶
labeledbarplot(df, 'TypeofContact')
Observations¶
Roughly double the number of people in our dataset contacted by self enquiry than Company Invited
City Tier¶
labeledbarplot(df, 'CityTier')
Observations¶
Most people do not live in a tier 2 city, and double the number of people live in Tier 1 cities than Tier 3
Occupation¶
labeledbarplot(df, 'Occupation')
Observations¶
- Most individuals are Salaried or Small Business
- Free Lancer has no significant bearing on our dataset
- Free Lancer and Large business should merge to prevent overfit, as an "Other" category
Gender¶
labeledbarplot(df, 'Gender', perc= True )
Observations¶
There are more men than women in this dataset, but it is much more representative than the data we normally see. There are roughly 1000 more men than women, or about a 60/40 split between men and women
Number of People Visiting¶
labeledbarplot(df, 'NumberOfPersonVisiting')
Observations¶
- This dataset is extremely close to a normal distribution
- 1 and 5 should be combined into an "other" category to prevent overfit, or 1+2 and 4+5 merge
Number of Followups¶
labeledbarplot(df, 'NumberOfFollowups')
Observations¶
- This again seems normally distributed
- Either 1, 2, and six should merge into "other", or more likely, one and two merge, and five and six merge.
Product Pitched¶
labeledbarplot(df, 'ProductPitched')
Observations¶
- This is not a normal distribution, with Basic and Deluxe topping the chart
- Basic, Deluxe, and Standard should remain the same while King and Super Deluxe merge into an "Other Category
Preferred Property Star¶
labeledbarplot(df, 'PreferredPropertyStar')
Observations¶
There are triple the number of people who prefer 3 star properties to 4 or 5
Marital Status¶
labeledbarplot(df, 'MaritalStatus')
Observations¶
- Married has double the number of records than other categories
- It is probbaly a good idea to simplify just to "Married" or "Unmarried"
Number of Trips¶
labeledbarplot(df, 'NumberOfTrips')
Observations¶
- This is not normally distributed, 2 and 3 have double the number of other records
- Merging categories for 6, 7, and 8 into "6 and Above" seems like a good idea to prevent overfit
Passport¶
labeledbarplot(df, 'Passport', perc = True)
Observations¶
70% of individuals in our dataset do not have a passport, which may significantly impact our sales of overseas packages
Pitch Satisfaction Score¶
labeledbarplot(df, 'PitchSatisfactionScore')
Observations¶
Our pitch satisfaction score records seem to be all fairly statistically significant and fairly normal in distribution. 3 seems to have the highest with roughly 500 records more than our other data. 2 is the lowest with a thousnad less records than 3
Car Ownership¶
labeledbarplot(df, 'OwnCar', perc =True)
Observations¶
Roughly 40% of people in our dataset did not own a car
Number of Children Visiting¶
labeledbarplot(df, 'NumberOfChildrenVisiting')
Observations¶
- Fairly normal distribution
- Combine 2 and 3 together to make 2+
- May be a good idea to merge with other datasets, such as NumberofPersonVisiting to create a simple category of if that group has children or not.
Designation¶
labeledbarplot(df, 'Designation')
Observations¶
- This is a fairly normal distribution
- I think there are too many categories for this, combining AVP, VP, and Senior Manager into just "other" makes sense, or AVP and VP can be combined into executive, and senior manager to manager
Bivariate Analysis¶
fig, ax = plt.subplots(figsize=(12,12))
sns.heatmap(df.corr(), annot = True, cmap = "mako", square = True, ax = ax)
<AxesSubplot:>
Observations¶
- Most correlation with our dependent variable is extremely weak, and finding data to strengthen this is probably needed, maybe in categorical items that don't show up here
- Passport had the strongest correlation with the dependent variable at 0.26, but it did mean people with passports were more likely to purchase a package
- Age had a fairly strong correlation with monthly income at 0.46
- Number of Followups had a fairly weak correlation with number of people visiting
- Number of Person Visiting and Number of Children Visiting had a strong correlation
- The dependent variable had a 0.13 and 0.15 negative correlation with Monthly income and age respectively
sns.pairplot(df, diag_kind="kde", hue = 'ProdTaken')
<seaborn.axisgrid.PairGrid at 0x2493a99fd30>
Observations¶
Age and Duration of pitch have an extremely loose, or no correlation with monthy lincome
Numeric Means by ProdTaken¶
df.groupby(['ProdTaken']).mean().T
| ProdTaken | 0 | 1 |
|---|---|---|
| Age | 38.223286 | 34.685870 |
| CityTier | 1.615927 | 1.819565 |
| DurationOfPitch | 14.999496 | 16.781522 |
| NumberOfPersonVisiting | 2.901714 | 2.919565 |
| NumberOfFollowups | 3.657510 | 3.942391 |
| PreferredPropertyStar | 3.540071 | 3.741304 |
| NumberOfTrips | 3.203125 | 3.268478 |
| Passport | 0.233871 | 0.536957 |
| PitchSatisfactionScore | 3.044355 | 3.223913 |
| OwnCar | 0.622984 | 0.608696 |
| NumberOfChildrenVisiting | 1.181452 | 1.198913 |
| MonthlyIncome | 23871.228831 | 22138.016304 |
Observations¶
- People who purchase the packages are younger
- In general, people who purchased the package had a 17 minute pitch
- People who purchased a package had more followups than those who didn't
- People who purchased a package preferred higher property stars
- Passport holders were more likely to purchase a package
- Those more satisfied with the pitch were more likely to purchase a package
- Car owners were slightly less likely to purchase a package
- Those that made a lower income were more likely to purchase a package
Numeric Columns By ProdTaken¶
Age¶
stackedbox(df, 'Age', 'ProdTaken')
minmeanmax(df, 'Age', 'ProdTaken')
Not Purchased Min Mean and Max: Min: 18.0 Mean: 38.223286 Max: 61.0 Purchased Min Mean and Max: Min: 18.0 Mean: 34.68587 Max: 60.0
Observations¶
Those who purchased a package were slightly younger all around, with a lower median, mean, and Q3 point than those who didn't interestingly though they had the same minimum, but those who purchased had a lower maximum
Duration of Pitch¶
stackedbox(df, 'DurationOfPitch', 'ProdTaken')
minmeanmax(df, 'DurationOfPitch', 'ProdTaken')
Not Purchased Min Mean and Max: Min: 5.0 Mean: 14.999496 Max: 36.0 Purchased Min Mean and Max: Min: 6.0 Mean: 16.781522 Max: 36.0
Observations¶
- Those that purchased packages had a longer pitch in every way except for the maximum
- Those that didn't have some outliers due to having a lower mean, median, a smaller IQR, and a lower Q3 point
- If we saw significant correlation with duration of pitch it might be worth removing those errant values, but for now we will be leaving them
Monthly Income¶
stackedbox(df, 'MonthlyIncome', 'ProdTaken')
minmeanmax(df, 'MonthlyIncome', 'ProdTaken')
Not Purchased Min Mean and Max: Min: 16051.0 Mean: 23871.228831 Max: 38677.0 Purchased Min Mean and Max: Min: 16009.0 Mean: 22138.016304 Max: 38537.0
Observations¶
- We saw above that MonthlyIncome and Prodtaken had a slight negative correlation
- Given the values we see here, with a clear lower min, Q1, mean, median, Q3, and max the outliers are causing issues with that correlation
- I do not wish to remove or cap the outliers, as they are likely accurate, but doing a log scale probably helps us build a better model with the changes to the structure likely bringing in some of those outliers
Categorical Columns by ProdTaken¶
# df.head(5)
| ProdTaken | Age | TypeofContact | CityTier | DurationOfPitch | Occupation | Gender | NumberOfPersonVisiting | NumberOfFollowups | ProductPitched | PreferredPropertyStar | MaritalStatus | NumberOfTrips | Passport | PitchSatisfactionScore | OwnCar | NumberOfChildrenVisiting | Designation | MonthlyIncome | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 41.0 | Self Enquiry | 3 | 6.0 | Salaried | Female | 3 | 3.0 | Deluxe | 3.0 | Single | 1.0 | 1 | 2 | 1 | 0.0 | Manager | 20993.0 |
| 1 | 0 | 49.0 | Company Invited | 1 | 14.0 | Salaried | Male | 3 | 4.0 | Deluxe | 4.0 | Divorced | 2.0 | 0 | 3 | 1 | 2.0 | Manager | 20130.0 |
| 2 | 1 | 37.0 | Self Enquiry | 1 | 8.0 | Free Lancer | Male | 3 | 4.0 | Basic | 3.0 | Single | 7.0 | 1 | 3 | 0 | 0.0 | Executive | 17090.0 |
| 3 | 0 | 33.0 | Company Invited | 1 | 9.0 | Salaried | Female | 2 | 3.0 | Basic | 3.0 | Divorced | 2.0 | 1 | 5 | 1 | 1.0 | Executive | 17909.0 |
| 4 | 0 | 37.0 | Self Enquiry | 1 | 8.0 | Small Business | Male | 2 | 3.0 | Basic | 4.0 | Divorced | 1.0 | 0 | 5 | 1 | 0.0 | Executive | 18468.0 |
Type of Contact¶
stackedbarplot(df, 'TypeofContact', 'ProdTaken')
ProdTaken 0 1 All TypeofContact All 3968 920 4888 Self Enquiry 2859 610 3469 Company Invited 1109 310 1419 ------------------------------------------------------------------------------------------------------------------- ProdTaken 0 1 TypeofContact Company Invited 0.781536 0.218464 Self Enquiry 0.824157 0.175843
sns.countplot(data= df, x='TypeofContact', hue = 'ProdTaken')
<AxesSubplot:xlabel='TypeofContact', ylabel='count'>
Observations¶
- Self Enquiries make up the majority of our data, and consequently share the most accepted products
- If you were invited by a company you were slightly more likely to purchase the product. with a little over 20% of customers purchasing the product
City Tier¶
stackedbarplot(df, 'CityTier', 'ProdTaken')
ProdTaken 0 1 All CityTier All 3968 920 4888 1 2670 520 3190 3 1146 354 1500 2 152 46 198 ------------------------------------------------------------------------------------------------------------------- ProdTaken 0 1 CityTier 3 0.764000 0.236000 2 0.767677 0.232323 1 0.836991 0.163009
sns.countplot(data= df, x='CityTier', hue = 'ProdTaken')
<AxesSubplot:xlabel='CityTier', ylabel='count'>
Obserations¶
- Tier 2 and 3 cities purchased products at a much higher rate than Tier 1 Cities
- Tier 1 Cities were offered many more packages than Tier 2 and 3 cities
- Tier 1 cities purchased the most packages, but Tier 3 is close with significantly fewer products offered
- tier 2 cities have very few records, so observations may be more skewed than Tier 1 or Tier 3 cities
Occupation¶
stackedbarplot(df, 'Occupation', 'ProdTaken')
ProdTaken 0 1 All Occupation All 3968 920 4888 Salaried 1954 414 2368 Small Business 1700 384 2084 Large Business 314 120 434 Free Lancer 0 2 2 ------------------------------------------------------------------------------------------------------------------- ProdTaken 0 1 Occupation Free Lancer 0.000000 1.000000 Large Business 0.723502 0.276498 Small Business 0.815739 0.184261 Salaried 0.825169 0.174831
sns.countplot(data= df, x='Occupation', hue = 'ProdTaken')
<AxesSubplot:xlabel='Occupation', ylabel='count'>
df.loc[df['Occupation'] == 'Free Lancer']
| ProdTaken | Age | TypeofContact | CityTier | DurationOfPitch | Occupation | Gender | NumberOfPersonVisiting | NumberOfFollowups | ProductPitched | PreferredPropertyStar | MaritalStatus | NumberOfTrips | Passport | PitchSatisfactionScore | OwnCar | NumberOfChildrenVisiting | Designation | MonthlyIncome | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 1 | 37.0 | Self Enquiry | 1 | 8.0 | Free Lancer | Male | 3 | 4.0 | Basic | 3.0 | Single | 7.0 | 1 | 3 | 0 | 0.0 | Executive | 17090.0 |
| 2446 | 1 | 38.0 | Self Enquiry | 1 | 9.0 | Free Lancer | Male | 4 | 5.0 | Basic | 3.0 | Single | 8.0 | 1 | 3 | 0 | 1.0 | Executive | 20768.0 |
Observations¶
- Free Lancers take 100% of the packages they are offered, but there are exactly two records for freelancers.
- Removing the Free Lancer records and/or changing the label to one of the others will help us prevent disrupting the model
- Of the other models, large businesses took 30% of the packages offered, but had the smallest count of packages bought
- Small Business and Salaried both had a close purchase rate of about 20%, but salaried were offered more packages for roughly the same count
Gender¶
stackedbarplot(df, 'Gender', 'ProdTaken')
ProdTaken 0 1 All Gender All 3968 920 4888 Male 2338 578 2916 Female 1630 342 1972 ------------------------------------------------------------------------------------------------------------------- ProdTaken 0 1 Gender Male 0.801783 0.198217 Female 0.826572 0.173428
sns.countplot(data= df, x='Gender', hue = 'ProdTaken')
<AxesSubplot:xlabel='Gender', ylabel='count'>
Observations¶
- Men purchased packages at a slightly higher rate than women
- Men were offered significantly more packages than women were
- Our resultant model may be slightly skewed to view men more favorably than women in package sales
Number of Persons Visiting¶
stackedbarplot(df, 'NumberOfPersonVisiting', 'ProdTaken')
ProdTaken 0 1 All NumberOfPersonVisiting All 3968 920 4888 3 1942 460 2402 2 1151 267 1418 4 833 193 1026 1 39 0 39 5 3 0 3 ------------------------------------------------------------------------------------------------------------------- ProdTaken 0 1 NumberOfPersonVisiting 3 0.808493 0.191507 2 0.811707 0.188293 4 0.811891 0.188109 1 1.000000 0.000000 5 1.000000 0.000000
sns.countplot(data= df, x='NumberOfPersonVisiting', hue = 'ProdTaken')
<AxesSubplot:xlabel='NumberOfPersonVisiting', ylabel='count'>
Observations¶
- 1 and 5 people visiting did not purchase any packages
- 2,3, and 4 person groups all purchased about 20% of the packages they were offered
- There are 39 rows for 1 and 3 rows for 5, removing the rows outright is a bad idea
- Grouping 1 and 2, and 4 and 5 seems like a good idea, that does cause a slight negative skew to both, but overall they should stay roughly the same
Number of Followups¶
stackedbarplot(df, 'NumberOfFollowups', 'ProdTaken')
ProdTaken 0 1 All NumberOfFollowups All 3968 920 4888 4.0 1726 387 2113 3.0 1222 244 1466 5.0 577 191 768 6.0 82 54 136 2.0 205 24 229 1.0 156 20 176 ------------------------------------------------------------------------------------------------------------------- ProdTaken 0 1 NumberOfFollowups 6.0 0.602941 0.397059 5.0 0.751302 0.248698 4.0 0.816848 0.183152 3.0 0.833561 0.166439 1.0 0.886364 0.113636 2.0 0.895197 0.104803
sns.countplot(data= df, x='NumberOfFollowups', hue = 'ProdTaken')
<AxesSubplot:xlabel='NumberOfFollowups', ylabel='count'>
Observations¶
- Six had the highest percentage of people purchase a package at roughly 40%
- One and two had the lowest percentage of takers with around 10% each
- There is a pretty significant correlation between number of followups and ProdTaken
Product Pitched¶
stackedbarplot(df, 'ProductPitched', 'ProdTaken')
ProdTaken 0 1 All ProductPitched All 3968 920 4888 Basic 1290 552 1842 Deluxe 1528 204 1732 Standard 618 124 742 King 210 20 230 Super Deluxe 322 20 342 ------------------------------------------------------------------------------------------------------------------- ProdTaken 0 1 ProductPitched Basic 0.700326 0.299674 Standard 0.832884 0.167116 Deluxe 0.882217 0.117783 King 0.913043 0.086957 Super Deluxe 0.941520 0.058480
sns.countplot(data= df, x='ProductPitched', hue = 'ProdTaken')
<AxesSubplot:xlabel='ProductPitched', ylabel='count'>
Observations¶
- Basic Performed the Best in terms of sales with 30% of pitched products being sold
- The Super Deluxe performed the absolute worst, with just 6% of those that were pitched the product purchasing it
- The Deluxe was pitched a similar number of times to Basic, but performed far worse at just 10% of those having the product pitched purchasing the product
- There is a fairly strong negative correlation between products pitched and ProdTaken (assuming Super Deluxe or King are Higher numbers and Basic is the lowest)
Preferred Property Star¶
stackedbarplot(df, 'PreferredPropertyStar', 'ProdTaken')
ProdTaken 0 1 All PreferredPropertyStar All 3968 920 4888 3.0 2531 488 3019 5.0 706 250 956 4.0 731 182 913 ------------------------------------------------------------------------------------------------------------------- ProdTaken 0 1 PreferredPropertyStar 5.0 0.738494 0.261506 4.0 0.800657 0.199343 3.0 0.838357 0.161643
sns.countplot(data= df, x='PreferredPropertyStar', hue = 'ProdTaken')
<AxesSubplot:xlabel='PreferredPropertyStar', ylabel='count'>
Observations¶
- 3 was preferred vastly more often over 4 and 5 star properties
- People who preferred 3, purchased a package only 16% of the time
- 4 and 5 were preferred by roughly an equal number of customers
- people who preferred 5 star properties purchased a packages 26% of the time, a massive leap of those that preferred 3 stars
- There is a positive correlation between property star and prodtaken
Marital Status¶
stackedbarplot(df, 'MaritalStatus', 'ProdTaken')
ProdTaken 0 1 All MaritalStatus All 3968 920 4888 Married 2014 326 2340 Single 612 304 916 Unmarried 516 166 682 Divorced 826 124 950 ------------------------------------------------------------------------------------------------------------------- ProdTaken 0 1 MaritalStatus Single 0.668122 0.331878 Unmarried 0.756598 0.243402 Married 0.860684 0.139316 Divorced 0.869474 0.130526
sns.countplot(data= df, x='MaritalStatus', hue = 'ProdTaken')
<AxesSubplot:xlabel='MaritalStatus', ylabel='count'>
Observations¶
- Single people were by far the most likely to purchase a package, with 33% of individuals in that group purchasing a package
- Those that are married and divorced, the highest number of records in our dataset were the least likely to purchase a package at 14 and 13 percent each
- Single individuals have a similar count of package takers to married despite being vastly underrepresented in our dataset
- Unmarried individuals also had a fairly high percentage of takers with 24% buyinh a package
- These seems to be a strong correlation between those that have never married, and products being purchased
Number of Trips¶
stackedbarplot(df, 'NumberOfTrips', 'ProdTaken')
ProdTaken 0 1 All NumberOfTrips All 3968 920 4888 2.0 1165 299 1464 3.0 992 231 1223 1.0 508 112 620 6.0 258 64 322 5.0 396 62 458 7.0 156 62 218 4.0 417 61 478 8.0 76 29 105 ------------------------------------------------------------------------------------------------------------------- ProdTaken 0 1 NumberOfTrips 7.0 0.715596 0.284404 8.0 0.723810 0.276190 2.0 0.795765 0.204235 6.0 0.801242 0.198758 3.0 0.811120 0.188880 1.0 0.819355 0.180645 5.0 0.864629 0.135371 4.0 0.872385 0.127615
sns.countplot(data= df, x='NumberOfTrips', hue = 'ProdTaken')
<AxesSubplot:xlabel='NumberOfTrips', ylabel='count'>
Observations¶
- 7 and 8 top the list in terms of shares of packages sold, with roughly 28% each
- 1, 2, 3, and 6 sit in the middle around 20%
- 4 and 5 are the lowest with roughly 13% each
- There is a moderate correlation between number of trips and purchasing products
- Given the number of different categories, grouping 6 and above seems like a good idea, due to model overfit, however, the dataset is also currently numeric so this may be a bad idea
Passport¶
stackedbarplot(df, 'Passport', 'ProdTaken')
ProdTaken 0 1 All Passport All 3968 920 4888 1 928 494 1422 0 3040 426 3466 ------------------------------------------------------------------------------------------------------------------- ProdTaken 0 1 Passport 1 0.652602 0.347398 0 0.877092 0.122908
sns.countplot(data= df, x='Passport', hue = 'ProdTaken')
<AxesSubplot:xlabel='Passport', ylabel='count'>
Observation¶
- 35% of people who had a passport purchased a product vs 12% if they did not posses a passport
- There is a strong correlation between passport ownership and vacation taking, but passport owners are a smaller part of our dataset
Pitch Satisfaction Score¶
stackedbarplot(df, 'PitchSatisfactionScore', 'ProdTaken')
ProdTaken 0 1 All PitchSatisfactionScore All 3968 920 4888 3 1162 316 1478 5 760 210 970 4 750 162 912 1 798 144 942 2 498 88 586 ------------------------------------------------------------------------------------------------------------------- ProdTaken 0 1 PitchSatisfactionScore 5 0.783505 0.216495 3 0.786198 0.213802 4 0.822368 0.177632 1 0.847134 0.152866 2 0.849829 0.150171
sns.countplot(data= df, x='PitchSatisfactionScore', hue = 'ProdTaken')
<AxesSubplot:xlabel='PitchSatisfactionScore', ylabel='count'>
Observations¶
- 3 and 5 top the list for packages purchased at roughly 20%, while 1 and 2 sit at the bottom with 15%
- 4 sits in the middle with 18% purchasing a package
- There is a weak correlation between pitch satisfaction and product purchase, but its hardly there
Own Car¶
stackedbarplot(df, 'OwnCar', 'ProdTaken')
ProdTaken 0 1 All OwnCar All 3968 920 4888 1 2472 560 3032 0 1496 360 1856 ------------------------------------------------------------------------------------------------------------------- ProdTaken 0 1 OwnCar 0 0.806034 0.193966 1 0.815303 0.184697
sns.countplot(data= df, x='OwnCar', hue = 'ProdTaken')
<AxesSubplot:xlabel='OwnCar', ylabel='count'>
Observations¶
- There appears to be almost no correlation between car ownership and product purchase
- We can probably drop this column in feature engineering, due to only a 0.9% difference between purchase rates
Number of Children Visiting¶
stackedbarplot(df, 'NumberOfChildrenVisiting', 'ProdTaken')
ProdTaken 0 1 All NumberOfChildrenVisiting All 3968 920 4888 1.0 1747 399 2146 2.0 1082 253 1335 0.0 880 202 1082 3.0 259 66 325 ------------------------------------------------------------------------------------------------------------------- ProdTaken 0 1 NumberOfChildrenVisiting 3.0 0.796923 0.203077 2.0 0.810487 0.189513 0.0 0.813309 0.186691 1.0 0.814073 0.185927
sns.countplot(data= df, x='NumberOfChildrenVisiting', hue = 'ProdTaken')
<AxesSubplot:xlabel='NumberOfChildrenVisiting', ylabel='count'>
Observations¶
- This again had a very small variance in product purchase rates, only deviating between 18.6% and 20.3%, or a 1.7% difference
- We can likely drop this entire column to prevent model overfit
Designation¶
stackedbarplot(df, 'Designation', 'ProdTaken')
ProdTaken 0 1 All Designation All 3968 920 4888 Executive 1290 552 1842 Manager 1528 204 1732 Senior Manager 618 124 742 AVP 322 20 342 VP 210 20 230 ------------------------------------------------------------------------------------------------------------------- ProdTaken 0 1 Designation Executive 0.700326 0.299674 Senior Manager 0.832884 0.167116 Manager 0.882217 0.117783 VP 0.913043 0.086957 AVP 0.941520 0.058480
sns.countplot(data= df, x='Designation', hue = 'ProdTaken')
<AxesSubplot:xlabel='Designation', ylabel='count'>
Observations¶
- Executives were far and away the most likely to purchase a package at 30%
- The next Highest was Senior Manager at 17%, almost half of the Executive numbers
- AVP was the lowest with just 5% purchasing a package
- Assuming the designation is supposed to be ordered Manager, Senior Manager, AVP, VP, Executive, there doesn't seem to be a strong correlation in the data, but there is something to senior managers and executives having a higher rate, that the data should be kept
- Grouping AVP and VP is a good idea
Multivariate Analysis of Different Features¶
Pointplot of Income, Number of Trips, and Product Taken¶
plt.figure(figsize=(15, 7))
sns.pointplot(x = 'NumberOfTrips', y='MonthlyIncome', data=df, hue = 'ProdTaken')
<AxesSubplot:xlabel='NumberOfTrips', ylabel='MonthlyIncome'>
Observations¶
The more you made, the less likely you were to buy any package, but the data seemed pretty similar in the variance of income by number of trips, with a major leap up from 2 trips a year to 3 in income
PointPlot of Income, Designation, and Product Taken¶
plt.figure(figsize=(15, 7))
sns.pointplot(x = 'Designation', y='MonthlyIncome', data=df, hue = 'ProdTaken')
<AxesSubplot:xlabel='Designation', ylabel='MonthlyIncome'>
Observations¶
- This absolutely blew us out of the water, we were expecting AVP and VP to be beneath executive in terms of pay, but executive makes much less than even manager.
- This weird pay gap could be explained by shares taken as income by executives rather than cash, but I am unfamiliar enough with the dataset that this is not a certainty
- Variance in pay for those that purchased a package is much higher than those that didn't, given the size of the error bars we see for AVP
PointPlot of Income, Product Piched, and ProdTaken¶
plt.figure(figsize=(15, 7))
sns.pointplot(x = 'ProductPitched', y='MonthlyIncome', data=df, hue = 'ProdTaken')
<AxesSubplot:xlabel='ProductPitched', ylabel='MonthlyIncome'>
Observations¶
- This graph is identical to the graph of pay above, meaning that individuals with the same designation were all pitched the same package with no variation
- This could have unintentional consequences for our dataset, and it may be a good idea to get more data than we currently have to account for this issue
PointPlot of Income, Preferred Property Star, and ProdTaken¶
plt.figure(figsize=(15, 7))
sns.pointplot(x = 'PreferredPropertyStar', y='MonthlyIncome', data=df, hue = 'ProdTaken')
<AxesSubplot:xlabel='PreferredPropertyStar', ylabel='MonthlyIncome'>
Observations¶
- those that preferred 3 star made more than those that preferred 4
- Five Star preference had a wide gap, with those preferring 5 star hotels and purchasing a package making less than 3 star preferrers who did the same
- People who made above 23000 a year did not purchase a package
PointPlot of Pitch Duration, Pitch Satisfaction, and Prod Taken¶
plt.figure(figsize=(15, 7))
sns.pointplot(x = 'PitchSatisfactionScore', y='DurationOfPitch', data=df, hue = 'ProdTaken')
<AxesSubplot:xlabel='PitchSatisfactionScore', ylabel='DurationOfPitch'>
Observations¶
- There is a lot of variance in pitch duration for those that purchased a product, with large error bars
- The longer the pitch was, the more likely people were to purchase a product
- Four star pitches tended to be shorter and less likely to result in a product being purchased than 3 or Five
Pointplot of Age, Marital Status, and ProdTaken¶
plt.figure(figsize=(15, 7))
sns.pointplot(data=df, x='MaritalStatus', y='Age', hue='ProdTaken')
<AxesSubplot:xlabel='MaritalStatus', ylabel='Age'>
Observations¶
- Single individuals who bought packages were the youngest people in our dataset
- Each category who purchased a product was younger than those who didn't
- Divorced and married individuals were the oldest in the dataset, and those that were divorced and bought a package had the widest spread of age in the dataset
Conclusions and Recommendations for EDA¶
Predictive Strength of Columns¶
Features that had strong predictors for our dependent variable were:
- Product Pitched
- Marital Status
- Number of Followups
- Passport
- Designation
These are important for determining whether or not someone purchases a package from our company. There were some more moderate predictors as well such as:
- Income
- Duration of Pitch
- City Tier
- Occupation
- Preferred Property Star
- Number of Trips
These weren't as important to the end result, but still has a bearing that is worth teasing out in our model. There are some weaker columns, that aren't as important such as:
- Age
- Type of Contact
- Pitch Satisfaction Score
- Gender
These were all extremely close value wise to one another, with little spread between values. they also, like pitch satisfaction score could have values that undermine the correlation the rest of the values had presented. They could be modified, or removed without significant impact to the model. Finally we have columns that appear to have no bearing on the data whatsoever:
- Number of Children
- Own car
- Number of Person visiting
These should all be removed to prevent overfit of the model.
NonNumeric Categories that Encoding¶
- Type of Contact
- Occupation
- Gender
- Product Pitched
- Marital Status
- Designation
Feature Engineering¶
Remove Unnecessary Columns¶
Own Car¶
This had no bearing on whether or not an individual purchased a package, and should be removed due to potential model overfitting
stackedbarplot(df, 'OwnCar', 'ProdTaken')
ProdTaken 0 1 All OwnCar All 3968 920 4888 1 2472 560 3032 0 1496 360 1856 ------------------------------------------------------------------------------------------------------------------- ProdTaken 0 1 OwnCar 0 0.806034 0.193966 1 0.815303 0.184697
df.head(5)
| ProdTaken | Age | TypeofContact | CityTier | DurationOfPitch | Occupation | Gender | NumberOfPersonVisiting | NumberOfFollowups | ProductPitched | PreferredPropertyStar | MaritalStatus | NumberOfTrips | Passport | PitchSatisfactionScore | OwnCar | NumberOfChildrenVisiting | Designation | MonthlyIncome | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 41.0 | Self Enquiry | 3 | 6.0 | Salaried | Female | 3 | 3.0 | Deluxe | 3.0 | Single | 1.0 | 1 | 2 | 1 | 0.0 | Manager | 20993.0 |
| 1 | 0 | 49.0 | Company Invited | 1 | 14.0 | Salaried | Male | 3 | 4.0 | Deluxe | 4.0 | Divorced | 2.0 | 0 | 3 | 1 | 2.0 | Manager | 20130.0 |
| 2 | 1 | 37.0 | Self Enquiry | 1 | 8.0 | Free Lancer | Male | 3 | 4.0 | Basic | 3.0 | Single | 7.0 | 1 | 3 | 0 | 0.0 | Executive | 17090.0 |
| 3 | 0 | 33.0 | Company Invited | 1 | 9.0 | Salaried | Female | 2 | 3.0 | Basic | 3.0 | Divorced | 2.0 | 1 | 5 | 1 | 1.0 | Executive | 17909.0 |
| 4 | 0 | 37.0 | Self Enquiry | 1 | 8.0 | Small Business | Male | 2 | 3.0 | Basic | 4.0 | Divorced | 1.0 | 0 | 5 | 1 | 0.0 | Executive | 18468.0 |
df.drop('OwnCar', axis = 1, inplace = True)
df.head(5)
| ProdTaken | Age | TypeofContact | CityTier | DurationOfPitch | Occupation | Gender | NumberOfPersonVisiting | NumberOfFollowups | ProductPitched | PreferredPropertyStar | MaritalStatus | NumberOfTrips | Passport | PitchSatisfactionScore | NumberOfChildrenVisiting | Designation | MonthlyIncome | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 41.0 | Self Enquiry | 3 | 6.0 | Salaried | Female | 3 | 3.0 | Deluxe | 3.0 | Single | 1.0 | 1 | 2 | 0.0 | Manager | 20993.0 |
| 1 | 0 | 49.0 | Company Invited | 1 | 14.0 | Salaried | Male | 3 | 4.0 | Deluxe | 4.0 | Divorced | 2.0 | 0 | 3 | 2.0 | Manager | 20130.0 |
| 2 | 1 | 37.0 | Self Enquiry | 1 | 8.0 | Free Lancer | Male | 3 | 4.0 | Basic | 3.0 | Single | 7.0 | 1 | 3 | 0.0 | Executive | 17090.0 |
| 3 | 0 | 33.0 | Company Invited | 1 | 9.0 | Salaried | Female | 2 | 3.0 | Basic | 3.0 | Divorced | 2.0 | 1 | 5 | 1.0 | Executive | 17909.0 |
| 4 | 0 | 37.0 | Self Enquiry | 1 | 8.0 | Small Business | Male | 2 | 3.0 | Basic | 4.0 | Divorced | 1.0 | 0 | 5 | 0.0 | Executive | 18468.0 |
Number of Person Visiting¶
This too, had no bearing on whether or not a vaction package was purchased for 2, 3, and 4. one and five each did not purchase a package, but there were only 44 total records among the two categories. I am removing it to prevent model overfit on such a minor piece of data
stackedbarplot(df, 'NumberOfPersonVisiting', 'ProdTaken')
ProdTaken 0 1 All NumberOfPersonVisiting All 3968 920 4888 3 1942 460 2402 2 1151 267 1418 4 833 193 1026 1 39 0 39 5 3 0 3 ------------------------------------------------------------------------------------------------------------------- ProdTaken 0 1 NumberOfPersonVisiting 3 0.808493 0.191507 2 0.811707 0.188293 4 0.811891 0.188109 1 1.000000 0.000000 5 1.000000 0.000000
df.head(5)
| ProdTaken | Age | TypeofContact | CityTier | DurationOfPitch | Occupation | Gender | NumberOfPersonVisiting | NumberOfFollowups | ProductPitched | PreferredPropertyStar | MaritalStatus | NumberOfTrips | Passport | PitchSatisfactionScore | NumberOfChildrenVisiting | Designation | MonthlyIncome | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 41.0 | Self Enquiry | 3 | 6.0 | Salaried | Female | 3 | 3.0 | Deluxe | 3.0 | Single | 1.0 | 1 | 2 | 0.0 | Manager | 20993.0 |
| 1 | 0 | 49.0 | Company Invited | 1 | 14.0 | Salaried | Male | 3 | 4.0 | Deluxe | 4.0 | Divorced | 2.0 | 0 | 3 | 2.0 | Manager | 20130.0 |
| 2 | 1 | 37.0 | Self Enquiry | 1 | 8.0 | Free Lancer | Male | 3 | 4.0 | Basic | 3.0 | Single | 7.0 | 1 | 3 | 0.0 | Executive | 17090.0 |
| 3 | 0 | 33.0 | Company Invited | 1 | 9.0 | Salaried | Female | 2 | 3.0 | Basic | 3.0 | Divorced | 2.0 | 1 | 5 | 1.0 | Executive | 17909.0 |
| 4 | 0 | 37.0 | Self Enquiry | 1 | 8.0 | Small Business | Male | 2 | 3.0 | Basic | 4.0 | Divorced | 1.0 | 0 | 5 | 0.0 | Executive | 18468.0 |
df.drop('NumberOfPersonVisiting', axis = 1, inplace = True)
df.head(5)
| ProdTaken | Age | TypeofContact | CityTier | DurationOfPitch | Occupation | Gender | NumberOfFollowups | ProductPitched | PreferredPropertyStar | MaritalStatus | NumberOfTrips | Passport | PitchSatisfactionScore | NumberOfChildrenVisiting | Designation | MonthlyIncome | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 41.0 | Self Enquiry | 3 | 6.0 | Salaried | Female | 3.0 | Deluxe | 3.0 | Single | 1.0 | 1 | 2 | 0.0 | Manager | 20993.0 |
| 1 | 0 | 49.0 | Company Invited | 1 | 14.0 | Salaried | Male | 4.0 | Deluxe | 4.0 | Divorced | 2.0 | 0 | 3 | 2.0 | Manager | 20130.0 |
| 2 | 1 | 37.0 | Self Enquiry | 1 | 8.0 | Free Lancer | Male | 4.0 | Basic | 3.0 | Single | 7.0 | 1 | 3 | 0.0 | Executive | 17090.0 |
| 3 | 0 | 33.0 | Company Invited | 1 | 9.0 | Salaried | Female | 3.0 | Basic | 3.0 | Divorced | 2.0 | 1 | 5 | 1.0 | Executive | 17909.0 |
| 4 | 0 | 37.0 | Self Enquiry | 1 | 8.0 | Small Business | Male | 3.0 | Basic | 4.0 | Divorced | 1.0 | 0 | 5 | 0.0 | Executive | 18468.0 |
Number of Children Visiting¶
This is the third column that showed no bearing on whether or not a product was purchased, and again, will be removed to prevent model overfit.
stackedbarplot(df, 'NumberOfChildrenVisiting', 'ProdTaken')
ProdTaken 0 1 All NumberOfChildrenVisiting All 3968 920 4888 1.0 1747 399 2146 2.0 1082 253 1335 0.0 880 202 1082 3.0 259 66 325 ------------------------------------------------------------------------------------------------------------------- ProdTaken 0 1 NumberOfChildrenVisiting 3.0 0.796923 0.203077 2.0 0.810487 0.189513 0.0 0.813309 0.186691 1.0 0.814073 0.185927
df.head(5)
| ProdTaken | Age | TypeofContact | CityTier | DurationOfPitch | Occupation | Gender | NumberOfFollowups | ProductPitched | PreferredPropertyStar | MaritalStatus | NumberOfTrips | Passport | PitchSatisfactionScore | NumberOfChildrenVisiting | Designation | MonthlyIncome | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 41.0 | Self Enquiry | 3 | 6.0 | Salaried | Female | 3.0 | Deluxe | 3.0 | Single | 1.0 | 1 | 2 | 0.0 | Manager | 20993.0 |
| 1 | 0 | 49.0 | Company Invited | 1 | 14.0 | Salaried | Male | 4.0 | Deluxe | 4.0 | Divorced | 2.0 | 0 | 3 | 2.0 | Manager | 20130.0 |
| 2 | 1 | 37.0 | Self Enquiry | 1 | 8.0 | Free Lancer | Male | 4.0 | Basic | 3.0 | Single | 7.0 | 1 | 3 | 0.0 | Executive | 17090.0 |
| 3 | 0 | 33.0 | Company Invited | 1 | 9.0 | Salaried | Female | 3.0 | Basic | 3.0 | Divorced | 2.0 | 1 | 5 | 1.0 | Executive | 17909.0 |
| 4 | 0 | 37.0 | Self Enquiry | 1 | 8.0 | Small Business | Male | 3.0 | Basic | 4.0 | Divorced | 1.0 | 0 | 5 | 0.0 | Executive | 18468.0 |
df.drop('NumberOfChildrenVisiting', axis = 1, inplace = True)
df.head(5)
| ProdTaken | Age | TypeofContact | CityTier | DurationOfPitch | Occupation | Gender | NumberOfFollowups | ProductPitched | PreferredPropertyStar | MaritalStatus | NumberOfTrips | Passport | PitchSatisfactionScore | Designation | MonthlyIncome | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 41.0 | Self Enquiry | 3 | 6.0 | Salaried | Female | 3.0 | Deluxe | 3.0 | Single | 1.0 | 1 | 2 | Manager | 20993.0 |
| 1 | 0 | 49.0 | Company Invited | 1 | 14.0 | Salaried | Male | 4.0 | Deluxe | 4.0 | Divorced | 2.0 | 0 | 3 | Manager | 20130.0 |
| 2 | 1 | 37.0 | Self Enquiry | 1 | 8.0 | Free Lancer | Male | 4.0 | Basic | 3.0 | Single | 7.0 | 1 | 3 | Executive | 17090.0 |
| 3 | 0 | 33.0 | Company Invited | 1 | 9.0 | Salaried | Female | 3.0 | Basic | 3.0 | Divorced | 2.0 | 1 | 5 | Executive | 17909.0 |
| 4 | 0 | 37.0 | Self Enquiry | 1 | 8.0 | Small Business | Male | 3.0 | Basic | 4.0 | Divorced | 1.0 | 0 | 5 | Executive | 18468.0 |
Pitch Satisfaction Score¶
We have two columns, pitch duration and pitch satisfaction score that describe an individual pitch. While Pitch Satisfaction score could be useful, 3 and 5 both have a larger number of takers than 4, and Pitch Duration tends to show a stronger correlation with ProdTaken than Pitch Satisfaction does. This may negatively affect a model down the road, but we do not have the background to know for sure.
stackedbarplot(df, 'PitchSatisfactionScore', 'ProdTaken')
ProdTaken 0 1 All PitchSatisfactionScore All 3968 920 4888 3 1162 316 1478 5 760 210 970 4 750 162 912 1 798 144 942 2 498 88 586 ------------------------------------------------------------------------------------------------------------------- ProdTaken 0 1 PitchSatisfactionScore 5 0.783505 0.216495 3 0.786198 0.213802 4 0.822368 0.177632 1 0.847134 0.152866 2 0.849829 0.150171
df.head(5)
| ProdTaken | Age | TypeofContact | CityTier | DurationOfPitch | Occupation | Gender | NumberOfFollowups | ProductPitched | PreferredPropertyStar | MaritalStatus | NumberOfTrips | Passport | PitchSatisfactionScore | Designation | MonthlyIncome | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 41.0 | Self Enquiry | 3 | 6.0 | Salaried | Female | 3.0 | Deluxe | 3.0 | Single | 1.0 | 1 | 2 | Manager | 20993.0 |
| 1 | 0 | 49.0 | Company Invited | 1 | 14.0 | Salaried | Male | 4.0 | Deluxe | 4.0 | Divorced | 2.0 | 0 | 3 | Manager | 20130.0 |
| 2 | 1 | 37.0 | Self Enquiry | 1 | 8.0 | Free Lancer | Male | 4.0 | Basic | 3.0 | Single | 7.0 | 1 | 3 | Executive | 17090.0 |
| 3 | 0 | 33.0 | Company Invited | 1 | 9.0 | Salaried | Female | 3.0 | Basic | 3.0 | Divorced | 2.0 | 1 | 5 | Executive | 17909.0 |
| 4 | 0 | 37.0 | Self Enquiry | 1 | 8.0 | Small Business | Male | 3.0 | Basic | 4.0 | Divorced | 1.0 | 0 | 5 | Executive | 18468.0 |
df.drop('PitchSatisfactionScore', axis = 1, inplace = True)
df.head(5)
| ProdTaken | Age | TypeofContact | CityTier | DurationOfPitch | Occupation | Gender | NumberOfFollowups | ProductPitched | PreferredPropertyStar | MaritalStatus | NumberOfTrips | Passport | Designation | MonthlyIncome | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 41.0 | Self Enquiry | 3 | 6.0 | Salaried | Female | 3.0 | Deluxe | 3.0 | Single | 1.0 | 1 | Manager | 20993.0 |
| 1 | 0 | 49.0 | Company Invited | 1 | 14.0 | Salaried | Male | 4.0 | Deluxe | 4.0 | Divorced | 2.0 | 0 | Manager | 20130.0 |
| 2 | 1 | 37.0 | Self Enquiry | 1 | 8.0 | Free Lancer | Male | 4.0 | Basic | 3.0 | Single | 7.0 | 1 | Executive | 17090.0 |
| 3 | 0 | 33.0 | Company Invited | 1 | 9.0 | Salaried | Female | 3.0 | Basic | 3.0 | Divorced | 2.0 | 1 | Executive | 17909.0 |
| 4 | 0 | 37.0 | Self Enquiry | 1 | 8.0 | Small Business | Male | 3.0 | Basic | 4.0 | Divorced | 1.0 | 0 | Executive | 18468.0 |
Clean up and Consolidate labels in Categorical featues¶
Occupation¶
sns.countplot(data= df, x='Occupation', hue = 'ProdTaken')
<AxesSubplot:xlabel='Occupation', ylabel='count'>
Freelancer is a statistically insignificant amount of data, and so we decided to drop it. It could negatively affect our data, and we wanted to ensure that our model didn't assume every Freelancer was going to purchase a package
df.drop(df[(df['Occupation'] == 'Free Lancer')].index, inplace=True)
sns.countplot(data= df, x='Occupation', hue = 'ProdTaken')
<AxesSubplot:xlabel='Occupation', ylabel='count'>
Number of Trips¶
stackedbarplot(df, 'NumberOfTrips', 'ProdTaken')
ProdTaken 0 1 All NumberOfTrips All 3968 918 4886 2.0 1165 299 1464 3.0 992 231 1223 1.0 508 112 620 6.0 258 64 322 5.0 396 62 458 4.0 417 61 478 7.0 156 61 217 8.0 76 28 104 ------------------------------------------------------------------------------------------------------------------- ProdTaken 0 1 NumberOfTrips 7.0 0.718894 0.281106 8.0 0.730769 0.269231 2.0 0.795765 0.204235 6.0 0.801242 0.198758 3.0 0.811120 0.188880 1.0 0.819355 0.180645 5.0 0.864629 0.135371 4.0 0.872385 0.127615
We have eight different categories for trips, and some of them share similar probabilities for package purchase rates. If we leave them be, we may overfit a model, so binning them (before later reconverting them to numeric fields) should help us prevent this. Due to how they are arranged with purchase rates we would like to group them like the following
- 1-3
- 4-5
- 6
- 7+
This will halve the number of catefories in our dataset and keep the general preditictive strengths of each one. 6 as a value is difficult due to not fitting well with 4 and five or seven and eight, and better fitting with 1-3. We just left it on its own to not make the eventual data difficult to understand
sns.countplot(data= df, x='NumberOfTrips', hue = 'ProdTaken')
<AxesSubplot:xlabel='NumberOfTrips', ylabel='count'>
conditions = [
(df['NumberOfTrips'] >= 7),
(df['NumberOfTrips'] == 6),
(df['NumberOfTrips'] >= 4) & (df['NumberOfTrips'] <= 5),
(df['NumberOfTrips'] >= 1) & (df['NumberOfTrips'] <= 3)
]
values = [
'7+',
'6',
'4-5',
'1-3'
]
df['NumberOfTrips'] = np.select(conditions, values)
sns.countplot(data=df, x='NumberOfTrips', order=['1-3', '4-5', '6', '7+'], hue='ProdTaken')
<AxesSubplot:xlabel='NumberOfTrips', ylabel='count'>
stackedbarplot(df, 'NumberOfTrips', 'ProdTaken')
ProdTaken 0 1 All NumberOfTrips All 3968 918 4886 1-3 2665 642 3307 4-5 813 123 936 7+ 232 89 321 6 258 64 322 ------------------------------------------------------------------------------------------------------------------- ProdTaken 0 1 NumberOfTrips 7+ 0.722741 0.277259 6 0.801242 0.198758 1-3 0.805866 0.194134 4-5 0.868590 0.131410
We can see that there was no significant chantges to the % of buyers in our data by grouping the numbers this way, so our model should be able to predict well without being overfit
Designation¶
stackedbarplot(df, 'Designation', 'ProdTaken')
ProdTaken 0 1 All Designation All 3968 918 4886 Executive 1290 550 1840 Manager 1528 204 1732 Senior Manager 618 124 742 AVP 322 20 342 VP 210 20 230 ------------------------------------------------------------------------------------------------------------------- ProdTaken 0 1 Designation Executive 0.701087 0.298913 Senior Manager 0.832884 0.167116 Manager 0.882217 0.117783 VP 0.913043 0.086957 AVP 0.941520 0.058480
sns.countplot(data=df, x='Designation', hue='ProdTaken')
<AxesSubplot:xlabel='Designation', ylabel='count'>
plt.figure(figsize=(15, 7))
sns.pointplot(data=df, x='Designation', y='MonthlyIncome')
<AxesSubplot:xlabel='Designation', ylabel='MonthlyIncome'>
This data was a little bit weird both in its initial labelling and how we hope to consolidate it. Executives made the least, followed by managers, then senior managers, AVPs, and finally VPs made the most. We want to group VP and AVP, and then group Manager and Senior manager while leaving executive alone.
conditions = [
(df['Designation'] == 'Manager'),
(df['Designation'] == 'Executive'),
(df['Designation'] == 'Senior Manager'),
(df['Designation'] == 'AVP'),
(df['Designation'] == 'VP')
]
values = [
'Management',
'Executive',
'Management',
'VP and AVP',
'VP and AVP'
]
df['Designation'] = np.select(conditions, values)
stackedbarplot(df, 'Designation', 'ProdTaken')
ProdTaken 0 1 All Designation All 3968 918 4886 Executive 1290 550 1840 Management 2146 328 2474 VP and AVP 532 40 572 ------------------------------------------------------------------------------------------------------------------- ProdTaken 0 1 Designation Executive 0.701087 0.298913 Management 0.867421 0.132579 VP and AVP 0.930070 0.069930
sns.countplot(data=df, x='Designation', hue='ProdTaken')
<AxesSubplot:xlabel='Designation', ylabel='count'>
The rough idea of the percentage of buyers has stayed intact after we have reduced the number of columns in Designation, and grouped them together. This also more accurately reflects the number of individuals set to be an executive
Scale Numeric Columns¶
df.head(5)
| ProdTaken | Age | TypeofContact | CityTier | DurationOfPitch | Occupation | Gender | NumberOfFollowups | ProductPitched | PreferredPropertyStar | MaritalStatus | NumberOfTrips | Passport | Designation | MonthlyIncome | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 41.0 | Self Enquiry | 3 | 6.0 | Salaried | Female | 3.0 | Deluxe | 3.0 | Single | 1-3 | 1 | Management | 20993.0 |
| 1 | 0 | 49.0 | Company Invited | 1 | 14.0 | Salaried | Male | 4.0 | Deluxe | 4.0 | Divorced | 1-3 | 0 | Management | 20130.0 |
| 3 | 0 | 33.0 | Company Invited | 1 | 9.0 | Salaried | Female | 3.0 | Basic | 3.0 | Divorced | 1-3 | 1 | Executive | 17909.0 |
| 4 | 0 | 37.0 | Self Enquiry | 1 | 8.0 | Small Business | Male | 3.0 | Basic | 4.0 | Divorced | 1-3 | 0 | Executive | 18468.0 |
| 5 | 0 | 32.0 | Company Invited | 1 | 8.0 | Salaried | Male | 3.0 | Basic | 3.0 | Single | 1-3 | 0 | Executive | 18068.0 |
We have three numeric columns that have a wider range than the rest of our dataset that will need to be scaled accordingly. These are Age, Duration of Pitch and Monthly Income. Age has relatively few outliers, so it will just be min-maxed scaled to preserve the rough shape of the data. Monthly Income and Duration of Pitch have a long right tail, and will be log scaled to change the datashape towards a normal distribution
Age¶
labeldisplot(df, 'Age')
We can see our data is close to normal, and wouldn't benefit greatly from log scaling
scaler.fit(df[['Age']])
df[['Age']] = scaler.transform(df[['Age']])
labeldisplot(df, 'Age')
We can see the shape of our data remains intact, with its slight right skew but otherwise mostly normal distribution
Duration of Pitch¶
labeldisplot(df, 'DurationOfPitch')
sns.kdeplot(data=df, x="DurationOfPitch", fill=False)
<AxesSubplot:xlabel='DurationOfPitch', ylabel='Density'>
df.DurationOfPitch.describe()
count 4886.000000 mean 15.337700 std 8.003873 min 5.000000 25% 9.000000 50% 13.000000 75% 19.000000 max 36.000000 Name: DurationOfPitch, dtype: float64
We don't want the data to stay this way, and a log scale will help us reign in this long right tail, that is all within the whiskers of the data, shifting the center to the right
df['DurationOfPitch'] = np.log(df['DurationOfPitch'])
df.DurationOfPitch.describe()
count 4886.000000 mean 2.603731 std 0.499056 min 1.609438 25% 2.197225 50% 2.564949 75% 2.944439 max 3.583519 Name: DurationOfPitch, dtype: float64
sns.kdeplot(data=df, x="DurationOfPitch", fill=False)
<AxesSubplot:xlabel='DurationOfPitch', ylabel='Density'>
labeldisplot(df, 'DurationOfPitch')
This data is still not normal, but it is significantly closer than it would be otherwise, and should help prevent skewing our model, but this requires further testing
Monthly Income¶
labeldisplot(df, 'MonthlyIncome')
This data has the vague structure of a normal distribution, but it has a long right tail that we want to remove and make the dataset as a whole more normal.
sns.kdeplot(data=df, x="MonthlyIncome", fill=False)
<AxesSubplot:xlabel='MonthlyIncome', ylabel='Density'>
df.MonthlyIncome.describe()
count 4886.000000 mean 23546.899918 std 5026.451397 min 16009.000000 25% 20487.250000 50% 22597.000000 75% 25411.250000 max 38677.000000 Name: MonthlyIncome, dtype: float64
df['MonthlyIncome'] = np.log(df['MonthlyIncome'])
df.MonthlyIncome.describe()
count 4886.000000 mean 10.045997 std 0.199718 min 9.680906 25% 9.927558 50% 10.025572 75% 10.142947 max 10.563000 Name: MonthlyIncome, dtype: float64
sns.kdeplot(data=df, x="MonthlyIncome", fill=False)
<AxesSubplot:xlabel='MonthlyIncome', ylabel='Density'>
labeldisplot(df, 'MonthlyIncome')
This at least shortened the tail, but didn't do an insane amount to normalize the distribution. We will now need to go through and do a min/max encoding due to the numbers being larger than we want, especially in comparison to the rest of the data
scaler.fit(df[['MonthlyIncome']])
df[['MonthlyIncome']] = scaler.transform(df[['MonthlyIncome']])
labeldisplot(df, 'MonthlyIncome')
We have kept the shape of the data, but left it between 0 and 1.
Encoding Categorical Values¶
df.head()
| ProdTaken | Age | TypeofContact | CityTier | DurationOfPitch | Occupation | Gender | NumberOfFollowups | ProductPitched | PreferredPropertyStar | MaritalStatus | NumberOfTrips | Passport | Designation | MonthlyIncome | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0.534884 | Self Enquiry | 3 | 1.791759 | Salaried | Female | 3.0 | Deluxe | 3.0 | Single | 1-3 | 1 | Management | 0.307267 |
| 1 | 0 | 0.720930 | Company Invited | 1 | 2.639057 | Salaried | Male | 4.0 | Deluxe | 4.0 | Divorced | 1-3 | 0 | Management | 0.259678 |
| 3 | 0 | 0.348837 | Company Invited | 1 | 2.197225 | Salaried | Female | 3.0 | Basic | 3.0 | Divorced | 1-3 | 1 | Executive | 0.127143 |
| 4 | 0 | 0.441860 | Self Enquiry | 1 | 2.079442 | Small Business | Male | 3.0 | Basic | 4.0 | Divorced | 1-3 | 0 | Executive | 0.161988 |
| 5 | 0 | 0.325581 | Company Invited | 1 | 2.079442 | Salaried | Male | 3.0 | Basic | 3.0 | Single | 1-3 | 0 | Executive | 0.137164 |
Type of Contact, Occupation, gender, ProductPitched, Marital Status, Number of Trips, and Designation are all non-numeric and will need to be encoded using either onehot or label encoding. Given that we already have a large number of columns, we will be utilizing Lable Encoding for these values.
Type of Contact¶
labeledbarplot(df, 'TypeofContact')
df.head(5)
| ProdTaken | Age | TypeofContact | CityTier | DurationOfPitch | Occupation | Gender | NumberOfFollowups | ProductPitched | PreferredPropertyStar | MaritalStatus | NumberOfTrips | Passport | Designation | MonthlyIncome | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0.534884 | Self Enquiry | 3 | 1.791759 | Salaried | Female | 3.0 | Deluxe | 3.0 | Single | 1-3 | 1 | Management | 0.307267 |
| 1 | 0 | 0.720930 | Company Invited | 1 | 2.639057 | Salaried | Male | 4.0 | Deluxe | 4.0 | Divorced | 1-3 | 0 | Management | 0.259678 |
| 3 | 0 | 0.348837 | Company Invited | 1 | 2.197225 | Salaried | Female | 3.0 | Basic | 3.0 | Divorced | 1-3 | 1 | Executive | 0.127143 |
| 4 | 0 | 0.441860 | Self Enquiry | 1 | 2.079442 | Small Business | Male | 3.0 | Basic | 4.0 | Divorced | 1-3 | 0 | Executive | 0.161988 |
| 5 | 0 | 0.325581 | Company Invited | 1 | 2.079442 | Salaried | Male | 3.0 | Basic | 3.0 | Single | 1-3 | 0 | Executive | 0.137164 |
Transforming the Data with Label Encoder¶
df['TypeofContact'] = labelEncoder.fit_transform(df['TypeofContact'])
df.head(5)
| ProdTaken | Age | TypeofContact | CityTier | DurationOfPitch | Occupation | Gender | NumberOfFollowups | ProductPitched | PreferredPropertyStar | MaritalStatus | NumberOfTrips | Passport | Designation | MonthlyIncome | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0.534884 | 1 | 3 | 1.791759 | Salaried | Female | 3.0 | Deluxe | 3.0 | Single | 1-3 | 1 | Management | 0.307267 |
| 1 | 0 | 0.720930 | 0 | 1 | 2.639057 | Salaried | Male | 4.0 | Deluxe | 4.0 | Divorced | 1-3 | 0 | Management | 0.259678 |
| 3 | 0 | 0.348837 | 0 | 1 | 2.197225 | Salaried | Female | 3.0 | Basic | 3.0 | Divorced | 1-3 | 1 | Executive | 0.127143 |
| 4 | 0 | 0.441860 | 1 | 1 | 2.079442 | Small Business | Male | 3.0 | Basic | 4.0 | Divorced | 1-3 | 0 | Executive | 0.161988 |
| 5 | 0 | 0.325581 | 0 | 1 | 2.079442 | Salaried | Male | 3.0 | Basic | 3.0 | Single | 1-3 | 0 | Executive | 0.137164 |
labeledbarplot(df, 'TypeofContact')
Occupation¶
df.head(5)
| ProdTaken | Age | TypeofContact | CityTier | DurationOfPitch | Occupation | Gender | NumberOfFollowups | ProductPitched | PreferredPropertyStar | MaritalStatus | NumberOfTrips | Passport | Designation | MonthlyIncome | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0.534884 | 1 | 3 | 1.791759 | Salaried | Female | 3.0 | Deluxe | 3.0 | Single | 1-3 | 1 | Management | 0.307267 |
| 1 | 0 | 0.720930 | 0 | 1 | 2.639057 | Salaried | Male | 4.0 | Deluxe | 4.0 | Divorced | 1-3 | 0 | Management | 0.259678 |
| 3 | 0 | 0.348837 | 0 | 1 | 2.197225 | Salaried | Female | 3.0 | Basic | 3.0 | Divorced | 1-3 | 1 | Executive | 0.127143 |
| 4 | 0 | 0.441860 | 1 | 1 | 2.079442 | Small Business | Male | 3.0 | Basic | 4.0 | Divorced | 1-3 | 0 | Executive | 0.161988 |
| 5 | 0 | 0.325581 | 0 | 1 | 2.079442 | Salaried | Male | 3.0 | Basic | 3.0 | Single | 1-3 | 0 | Executive | 0.137164 |
labeledbarplot(df, 'Occupation')
Transforming the Data with Label Encoder¶
df['Occupation'] = labelEncoder.fit_transform(df['Occupation'])
labeledbarplot(df, 'Occupation')
df.head(5)
| ProdTaken | Age | TypeofContact | CityTier | DurationOfPitch | Occupation | Gender | NumberOfFollowups | ProductPitched | PreferredPropertyStar | MaritalStatus | NumberOfTrips | Passport | Designation | MonthlyIncome | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0.534884 | 1 | 3 | 1.791759 | 1 | Female | 3.0 | Deluxe | 3.0 | Single | 1-3 | 1 | Management | 0.307267 |
| 1 | 0 | 0.720930 | 0 | 1 | 2.639057 | 1 | Male | 4.0 | Deluxe | 4.0 | Divorced | 1-3 | 0 | Management | 0.259678 |
| 3 | 0 | 0.348837 | 0 | 1 | 2.197225 | 1 | Female | 3.0 | Basic | 3.0 | Divorced | 1-3 | 1 | Executive | 0.127143 |
| 4 | 0 | 0.441860 | 1 | 1 | 2.079442 | 2 | Male | 3.0 | Basic | 4.0 | Divorced | 1-3 | 0 | Executive | 0.161988 |
| 5 | 0 | 0.325581 | 0 | 1 | 2.079442 | 1 | Male | 3.0 | Basic | 3.0 | Single | 1-3 | 0 | Executive | 0.137164 |
Gender¶
labeledbarplot(df, 'Gender')
df.head(5)
| ProdTaken | Age | TypeofContact | CityTier | DurationOfPitch | Occupation | Gender | NumberOfFollowups | ProductPitched | PreferredPropertyStar | MaritalStatus | NumberOfTrips | Passport | Designation | MonthlyIncome | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0.534884 | 1 | 3 | 1.791759 | 1 | Female | 3.0 | Deluxe | 3.0 | Single | 1-3 | 1 | Management | 0.307267 |
| 1 | 0 | 0.720930 | 0 | 1 | 2.639057 | 1 | Male | 4.0 | Deluxe | 4.0 | Divorced | 1-3 | 0 | Management | 0.259678 |
| 3 | 0 | 0.348837 | 0 | 1 | 2.197225 | 1 | Female | 3.0 | Basic | 3.0 | Divorced | 1-3 | 1 | Executive | 0.127143 |
| 4 | 0 | 0.441860 | 1 | 1 | 2.079442 | 2 | Male | 3.0 | Basic | 4.0 | Divorced | 1-3 | 0 | Executive | 0.161988 |
| 5 | 0 | 0.325581 | 0 | 1 | 2.079442 | 1 | Male | 3.0 | Basic | 3.0 | Single | 1-3 | 0 | Executive | 0.137164 |
Transforming the Data with Label Encoder¶
df['Gender'] = labelEncoder.fit_transform(df['Gender'])
df.head(5)
| ProdTaken | Age | TypeofContact | CityTier | DurationOfPitch | Occupation | Gender | NumberOfFollowups | ProductPitched | PreferredPropertyStar | MaritalStatus | NumberOfTrips | Passport | Designation | MonthlyIncome | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0.534884 | 1 | 3 | 1.791759 | 1 | 0 | 3.0 | Deluxe | 3.0 | Single | 1-3 | 1 | Management | 0.307267 |
| 1 | 0 | 0.720930 | 0 | 1 | 2.639057 | 1 | 1 | 4.0 | Deluxe | 4.0 | Divorced | 1-3 | 0 | Management | 0.259678 |
| 3 | 0 | 0.348837 | 0 | 1 | 2.197225 | 1 | 0 | 3.0 | Basic | 3.0 | Divorced | 1-3 | 1 | Executive | 0.127143 |
| 4 | 0 | 0.441860 | 1 | 1 | 2.079442 | 2 | 1 | 3.0 | Basic | 4.0 | Divorced | 1-3 | 0 | Executive | 0.161988 |
| 5 | 0 | 0.325581 | 0 | 1 | 2.079442 | 1 | 1 | 3.0 | Basic | 3.0 | Single | 1-3 | 0 | Executive | 0.137164 |
labeledbarplot(df, 'Gender')
Product Pitched¶
labeledbarplot(df, 'ProductPitched')
df.head(5)
| ProdTaken | Age | TypeofContact | CityTier | DurationOfPitch | Occupation | Gender | NumberOfFollowups | ProductPitched | PreferredPropertyStar | MaritalStatus | NumberOfTrips | Passport | Designation | MonthlyIncome | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0.534884 | 1 | 3 | 1.791759 | 1 | 0 | 3.0 | Deluxe | 3.0 | Single | 1-3 | 1 | Management | 0.307267 |
| 1 | 0 | 0.720930 | 0 | 1 | 2.639057 | 1 | 1 | 4.0 | Deluxe | 4.0 | Divorced | 1-3 | 0 | Management | 0.259678 |
| 3 | 0 | 0.348837 | 0 | 1 | 2.197225 | 1 | 0 | 3.0 | Basic | 3.0 | Divorced | 1-3 | 1 | Executive | 0.127143 |
| 4 | 0 | 0.441860 | 1 | 1 | 2.079442 | 2 | 1 | 3.0 | Basic | 4.0 | Divorced | 1-3 | 0 | Executive | 0.161988 |
| 5 | 0 | 0.325581 | 0 | 1 | 2.079442 | 1 | 1 | 3.0 | Basic | 3.0 | Single | 1-3 | 0 | Executive | 0.137164 |
Transforming the Data with Label Encoder¶
df['ProductPitched'] = labelEncoder.fit_transform(df['ProductPitched'])
df.head(5)
| ProdTaken | Age | TypeofContact | CityTier | DurationOfPitch | Occupation | Gender | NumberOfFollowups | ProductPitched | PreferredPropertyStar | MaritalStatus | NumberOfTrips | Passport | Designation | MonthlyIncome | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0.534884 | 1 | 3 | 1.791759 | 1 | 0 | 3.0 | 1 | 3.0 | Single | 1-3 | 1 | Management | 0.307267 |
| 1 | 0 | 0.720930 | 0 | 1 | 2.639057 | 1 | 1 | 4.0 | 1 | 4.0 | Divorced | 1-3 | 0 | Management | 0.259678 |
| 3 | 0 | 0.348837 | 0 | 1 | 2.197225 | 1 | 0 | 3.0 | 0 | 3.0 | Divorced | 1-3 | 1 | Executive | 0.127143 |
| 4 | 0 | 0.441860 | 1 | 1 | 2.079442 | 2 | 1 | 3.0 | 0 | 4.0 | Divorced | 1-3 | 0 | Executive | 0.161988 |
| 5 | 0 | 0.325581 | 0 | 1 | 2.079442 | 1 | 1 | 3.0 | 0 | 3.0 | Single | 1-3 | 0 | Executive | 0.137164 |
labeledbarplot(df, 'ProductPitched')
Marital Status¶
labeledbarplot(df, 'MaritalStatus')
df.head(5)
| ProdTaken | Age | TypeofContact | CityTier | DurationOfPitch | Occupation | Gender | NumberOfFollowups | ProductPitched | PreferredPropertyStar | MaritalStatus | NumberOfTrips | Passport | Designation | MonthlyIncome | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0.534884 | 1 | 3 | 1.791759 | 1 | 0 | 3.0 | 1 | 3.0 | Single | 1-3 | 1 | Management | 0.307267 |
| 1 | 0 | 0.720930 | 0 | 1 | 2.639057 | 1 | 1 | 4.0 | 1 | 4.0 | Divorced | 1-3 | 0 | Management | 0.259678 |
| 3 | 0 | 0.348837 | 0 | 1 | 2.197225 | 1 | 0 | 3.0 | 0 | 3.0 | Divorced | 1-3 | 1 | Executive | 0.127143 |
| 4 | 0 | 0.441860 | 1 | 1 | 2.079442 | 2 | 1 | 3.0 | 0 | 4.0 | Divorced | 1-3 | 0 | Executive | 0.161988 |
| 5 | 0 | 0.325581 | 0 | 1 | 2.079442 | 1 | 1 | 3.0 | 0 | 3.0 | Single | 1-3 | 0 | Executive | 0.137164 |
Transforming the Data with Label Encoder¶
df['MaritalStatus'] = labelEncoder.fit_transform(df['MaritalStatus'])
df.head(5)
| ProdTaken | Age | TypeofContact | CityTier | DurationOfPitch | Occupation | Gender | NumberOfFollowups | ProductPitched | PreferredPropertyStar | MaritalStatus | NumberOfTrips | Passport | Designation | MonthlyIncome | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0.534884 | 1 | 3 | 1.791759 | 1 | 0 | 3.0 | 1 | 3.0 | 2 | 1-3 | 1 | Management | 0.307267 |
| 1 | 0 | 0.720930 | 0 | 1 | 2.639057 | 1 | 1 | 4.0 | 1 | 4.0 | 0 | 1-3 | 0 | Management | 0.259678 |
| 3 | 0 | 0.348837 | 0 | 1 | 2.197225 | 1 | 0 | 3.0 | 0 | 3.0 | 0 | 1-3 | 1 | Executive | 0.127143 |
| 4 | 0 | 0.441860 | 1 | 1 | 2.079442 | 2 | 1 | 3.0 | 0 | 4.0 | 0 | 1-3 | 0 | Executive | 0.161988 |
| 5 | 0 | 0.325581 | 0 | 1 | 2.079442 | 1 | 1 | 3.0 | 0 | 3.0 | 2 | 1-3 | 0 | Executive | 0.137164 |
labeledbarplot(df, 'MaritalStatus')
Number of Trips¶
labeledbarplot(df, 'NumberOfTrips')
df.head(10)
| ProdTaken | Age | TypeofContact | CityTier | DurationOfPitch | Occupation | Gender | NumberOfFollowups | ProductPitched | PreferredPropertyStar | MaritalStatus | NumberOfTrips | Passport | Designation | MonthlyIncome | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0.534884 | 1 | 3 | 1.791759 | 1 | 0 | 3.0 | 1 | 3.0 | 2 | 1-3 | 1 | Management | 0.307267 |
| 1 | 0 | 0.720930 | 0 | 1 | 2.639057 | 1 | 1 | 4.0 | 1 | 4.0 | 0 | 1-3 | 0 | Management | 0.259678 |
| 3 | 0 | 0.348837 | 0 | 1 | 2.197225 | 1 | 0 | 3.0 | 0 | 3.0 | 0 | 1-3 | 1 | Executive | 0.127143 |
| 4 | 0 | 0.441860 | 1 | 1 | 2.079442 | 2 | 1 | 3.0 | 0 | 4.0 | 0 | 1-3 | 0 | Executive | 0.161988 |
| 5 | 0 | 0.325581 | 0 | 1 | 2.079442 | 1 | 1 | 3.0 | 0 | 3.0 | 2 | 1-3 | 0 | Executive | 0.137164 |
| 6 | 0 | 0.953488 | 1 | 1 | 2.197225 | 2 | 0 | 2.0 | 0 | 5.0 | 0 | 4-5 | 1 | Executive | 0.111912 |
| 7 | 0 | 0.279070 | 1 | 1 | 3.401197 | 1 | 1 | 3.0 | 0 | 3.0 | 1 | 1-3 | 0 | Executive | 0.113387 |
| 8 | 0 | 0.465116 | 0 | 1 | 3.367296 | 1 | 1 | 4.0 | 3 | 3.0 | 3 | 1-3 | 0 | Management | 0.483602 |
| 9 | 0 | 0.418605 | 1 | 1 | 3.496508 | 2 | 1 | 3.0 | 1 | 3.0 | 0 | 7+ | 0 | Management | 0.265688 |
| 10 | 0 | 0.395349 | 1 | 1 | 3.091042 | 2 | 1 | 2.0 | 0 | 4.0 | 0 | 1-3 | 0 | Executive | 0.096149 |
Transforming the Data with Label Encoder¶
df['NumberOfTrips'] = labelEncoder.fit_transform(df['NumberOfTrips'])
df.head(10)
| ProdTaken | Age | TypeofContact | CityTier | DurationOfPitch | Occupation | Gender | NumberOfFollowups | ProductPitched | PreferredPropertyStar | MaritalStatus | NumberOfTrips | Passport | Designation | MonthlyIncome | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0.534884 | 1 | 3 | 1.791759 | 1 | 0 | 3.0 | 1 | 3.0 | 2 | 0 | 1 | Management | 0.307267 |
| 1 | 0 | 0.720930 | 0 | 1 | 2.639057 | 1 | 1 | 4.0 | 1 | 4.0 | 0 | 0 | 0 | Management | 0.259678 |
| 3 | 0 | 0.348837 | 0 | 1 | 2.197225 | 1 | 0 | 3.0 | 0 | 3.0 | 0 | 0 | 1 | Executive | 0.127143 |
| 4 | 0 | 0.441860 | 1 | 1 | 2.079442 | 2 | 1 | 3.0 | 0 | 4.0 | 0 | 0 | 0 | Executive | 0.161988 |
| 5 | 0 | 0.325581 | 0 | 1 | 2.079442 | 1 | 1 | 3.0 | 0 | 3.0 | 2 | 0 | 0 | Executive | 0.137164 |
| 6 | 0 | 0.953488 | 1 | 1 | 2.197225 | 2 | 0 | 2.0 | 0 | 5.0 | 0 | 1 | 1 | Executive | 0.111912 |
| 7 | 0 | 0.279070 | 1 | 1 | 3.401197 | 1 | 1 | 3.0 | 0 | 3.0 | 1 | 0 | 0 | Executive | 0.113387 |
| 8 | 0 | 0.465116 | 0 | 1 | 3.367296 | 1 | 1 | 4.0 | 3 | 3.0 | 3 | 0 | 0 | Management | 0.483602 |
| 9 | 0 | 0.418605 | 1 | 1 | 3.496508 | 2 | 1 | 3.0 | 1 | 3.0 | 0 | 3 | 0 | Management | 0.265688 |
| 10 | 0 | 0.395349 | 1 | 1 | 3.091042 | 2 | 1 | 2.0 | 0 | 4.0 | 0 | 0 | 0 | Executive | 0.096149 |
labeledbarplot(df, 'NumberOfTrips')
Designation¶
labeledbarplot(df, 'Designation')
df.head(5)
| ProdTaken | Age | TypeofContact | CityTier | DurationOfPitch | Occupation | Gender | NumberOfFollowups | ProductPitched | PreferredPropertyStar | MaritalStatus | NumberOfTrips | Passport | Designation | MonthlyIncome | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0.534884 | 1 | 3 | 1.791759 | 1 | 0 | 3.0 | 1 | 3.0 | 2 | 0 | 1 | Management | 0.307267 |
| 1 | 0 | 0.720930 | 0 | 1 | 2.639057 | 1 | 1 | 4.0 | 1 | 4.0 | 0 | 0 | 0 | Management | 0.259678 |
| 3 | 0 | 0.348837 | 0 | 1 | 2.197225 | 1 | 0 | 3.0 | 0 | 3.0 | 0 | 0 | 1 | Executive | 0.127143 |
| 4 | 0 | 0.441860 | 1 | 1 | 2.079442 | 2 | 1 | 3.0 | 0 | 4.0 | 0 | 0 | 0 | Executive | 0.161988 |
| 5 | 0 | 0.325581 | 0 | 1 | 2.079442 | 1 | 1 | 3.0 | 0 | 3.0 | 2 | 0 | 0 | Executive | 0.137164 |
Transforming the Data with Label Encoder¶
df['Designation'] = labelEncoder.fit_transform(df['Designation'])
df.head(5)
| ProdTaken | Age | TypeofContact | CityTier | DurationOfPitch | Occupation | Gender | NumberOfFollowups | ProductPitched | PreferredPropertyStar | MaritalStatus | NumberOfTrips | Passport | Designation | MonthlyIncome | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0.534884 | 1 | 3 | 1.791759 | 1 | 0 | 3.0 | 1 | 3.0 | 2 | 0 | 1 | 1 | 0.307267 |
| 1 | 0 | 0.720930 | 0 | 1 | 2.639057 | 1 | 1 | 4.0 | 1 | 4.0 | 0 | 0 | 0 | 1 | 0.259678 |
| 3 | 0 | 0.348837 | 0 | 1 | 2.197225 | 1 | 0 | 3.0 | 0 | 3.0 | 0 | 0 | 1 | 0 | 0.127143 |
| 4 | 0 | 0.441860 | 1 | 1 | 2.079442 | 2 | 1 | 3.0 | 0 | 4.0 | 0 | 0 | 0 | 0 | 0.161988 |
| 5 | 0 | 0.325581 | 0 | 1 | 2.079442 | 1 | 1 | 3.0 | 0 | 3.0 | 2 | 0 | 0 | 0 | 0.137164 |
labeledbarplot(df, 'Designation')
Features that Remain Untouched¶
We had a few features that we didn't touch in our Feature Engineering and we wanted to walk through what we had yet to cover in our engineering.
Prodtaken was our dependent variable, and had already been encoded to be between 0 and 1 in the dataset, it didn't need any more engineering from us.
City Tier was alread numeric between 1 and 3. We could have done label encoding, but we felt it may be confusing, and the range was within features that had already been encoded in a similar way
Number of followups had six different categories, one for each number of followup. We considered grouping and then encoding them again, but there was good variance in prodtaken for each column, and so decided to leave it be.
Preferred property star was a debate for us, we considered doing label encoding, since the array for it started at 3 rather than 0, but all the ranges were within those covered by number of followups, and we didn't know how it would affect the model one way or the other due to lack of knowledge about how models are made
Passport was similar to ProdTaken, with it already having been encoded and given a value between 0 and 1, and as such didn't need touching by us