Objective¶
A travel agency has asked you and your team to create a machine learning model to help them predict whether or not potential customers will purchase a new vacation package they are offering. As part of this process, the agency would also like to understand the potential customers better to more efficiently market their new package.
Data Definition¶
- CustomerID: Unique customer ID
- ProdTaken: Whether the customer has purchased a package or not (0: No, 1: Yes)
- Age: Age of customer
- TypeofContact: How customer was contacted (Company Invited or Self Inquiry)
- CityTier: City tier depends on the development of a city, population, facilities, and living standards. The categories are ordered i.e. Tier 1 > Tier 2 > Tier 3
- DurationOfPitch: Duration of the pitch by a salesperson to the customer
- Occupation: Occupation of customer
- Gender: Gender of customer
- NumberOfPersonVisiting: Total number of people planning to take the trip with the customer
- NumberOfFollowups: Total number of follow-ups has been done by the salesperson after the sales pitch
- ProductPitched: Product pitched by the salesperson
- PreferredPropertyStar: Preferred hotel property rating by customer
- MaritalStatus: Marital status of customer
- NumberOfTrips: Average number of trips in a year by customer
- Passport: The customer has a passport or not (0: No, 1: Yes)
- PitchSatisfactionScore: Sales pitch satisfaction score
- OwnCar: Whether the customers own a car or not (0: No, 1: Yes)
- NumberOfChildrenVisiting: Total number of children with age less than 5 planning to take the trip with the customer
- Designation: Designation of the customer in the current organization
- MonthlyIncome: Gross monthly income of the customer
Boilerplate¶
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn.preprocessing
scaler = sklearn.preprocessing.MinMaxScaler()
labelEncoder = sklearn.preprocessing.LabelEncoder()
sns.set(color_codes=True)
%matplotlib inline
df = pd.read_csv("vacation_package.csv")
Functions¶
def ReturnByPurchase(columnparam):
purchased = df.loc[(df[columnparam].isnull() == False) & (df['ProdTaken'] == 1)]
nopurchase = df.loc[(df[columnparam].isnull() == False) & (df['ProdTaken'] == 0)]
print("Purchased Mean: " + str(round(purchased[columnparam].mean(),6)))
print("Purchased Median: " + str(purchased[columnparam].median()))
print("Purchased Mode: " + str(purchased[columnparam].mode()))
print('\n' + "-----------" + '\n')
print("Unpurchased Mean: " + str(round(nopurchase[columnparam].mean(),6)))
print("Unpurchased Median: " + str(nopurchase[columnparam].median()))
print("Unpurchased Mode: " + str(nopurchase[columnparam].mode()))
def histogramboxplot(data, feature, figsize=(12,7), kde=True, bins=None):
"""
Boxplot and histogram combined
data: dataframe
feature: dataframe column
figsize: size of figure (default (12,7))
kde: whether to show the density curve (default False)
bins: number of bins for histogram (default None)
"""
f2, (ax_box2, ax_hist2) = plt.subplots(
nrows=2, # Number of rows of the subplot grid= 2
sharex=True, # x-axis will be shared among all subplots
gridspec_kw={"height_ratios": (0.25, 0.75)},
figsize=figsize,
) # creating the 2 subplots
sns.boxplot(
data=data, x=feature, ax=ax_box2, showmeans=True, meanprops = dict(marker='D', markeredgecolor='black', markerfacecolor='red')
) # boxplot will be created and a star will indicate the mean value of the column
sns.histplot(
data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins
) if bins else sns.histplot(
data=data, x=feature, kde=kde, ax=ax_hist2
) # For histogram
ax_hist2.axvline(
data[feature].mean(), color="red", linestyle="-"
) # Add mean to the histogram
ax_hist2.axvline(
data[feature].median(), color="black", linestyle="--"
) # Add median to the histogram
def stackedbarplot(data, predictor, target):
"""
Print the category counts and plot a stacked bar chart
data: dataframe
predictor: independent variable
target: target variable
"""
count = data[predictor].nunique()
sorter = data[target].value_counts().index[-1]
tab1 = pd.crosstab(data[predictor], data[target], margins=True).sort_values(
by=sorter, ascending=False
)
print(tab1)
print("-" * 115)
tab = pd.crosstab(data[predictor], data[target], normalize="index").sort_values(
by=sorter, ascending=False
)
print(tab)
tab.plot(kind="bar", stacked=True, figsize=(count + 5, 5))
plt.legend(
loc="lower left", frameon=False,
)
plt.legend(loc="upper left", bbox_to_anchor=(1, 1))
plt.show()
def descviolinbox(Data, tablename):
"""
Prints the Description of a numeric value and a stacked violin and boxplot
Data = Dataframe
value = Target Column
"""
tablename = str(tablename)
description = Data[tablename].describe()
sns.violinplot(data=Data, y=tablename)
sns.boxplot(data=Data, y=tablename, color="gray")
return description
def bounds(Data, tablename):
"""
Returns upper and lower bounds for a table, helps us capture outliers
Data = Dataframe
tablename = Target Column
"""
tablename = str(tablename)
Q1 = Data[tablename].quantile(0.25)
Q3 = Data[tablename].quantile(0.75)
IQR = Q3 - Q1
lowerlim = Q1 - 1.5 * IQR
upperlim = Q3 + 1.5 * IQR
print("Q1: " + str(Q1))
print("Q3: " + str(Q3))
print("IQR: " + str(IQR))
print("lower limit: " + str(lowerlim))
print("upper limit: " + str(upperlim))
return(lowerlim, upperlim, Q1, Q3)
def numericdata(Data, tablename):
"""
Returns the Mean, Median, Mode, and Range for a dataset
Data = Dataframe
tablename = target column
"""
tablename = str(tablename)
Q1 = Data[tablename].quantile(0.25)
Q3 = Data[tablename].quantile(0.75)
IQR = Q3-Q1
mean = Data[tablename].mean()
median = Data[tablename].median()
mode = Data[tablename].mode()[0]
mean = round(mean,6)
median = round(median,6)
mode = round(mode,6)
IQR = round(IQR,6)
print("Mean: " + str(mean))
print("Median: " + str(median))
print("Mode: " + str(mode))
print("Range: " + str(IQR))
def labeldisplot(Data, tablename):
tablename = str(tablename)
sns.displot(data=Data, x=tablename, kde=True);
plt.axvline(x=Data[tablename].median(), color = 'black')
plt.axvline(x=Data[tablename].mean(), color = 'red', ls='--')
def labeledbarplot(data, feature, perc=False, n=None):
"""
Barplot with percentage at the top
data: dataframe
feature: dataframe column
perc: whether to display percentages instead of count (default is False)
n: displays the top n category levels (default is None, i.e., display all levels)
"""
total = len(data[feature]) # length of the column
count = data[feature].nunique()
if n is None:
plt.figure(figsize=(count + 1, 5))
else:
plt.figure(figsize=(n + 1, 5))
plt.xticks(rotation=90, fontsize=15)
ax = sns.countplot(
data=data,
x=feature,
palette="Paired",
order=data[feature].value_counts().index[:n].sort_values(),
)
for p in ax.patches:
if perc == True:
label = "{:.1f}%".format(
100 * p.get_height() / total
) # percentage of each class of the category
else:
label = p.get_height() # count of each level of the category
x = p.get_x() + p.get_width() / 2 # width of the plot
y = p.get_height() # height of the plot
ax.annotate(
label,
(x, y),
ha="center",
va="center",
size=12,
xytext=(0, 5),
textcoords="offset points",
) # annotate the percentage
plt.show() # show the plot
def stackedbox(data, value, hue):
"""
data= Dataframe
Value = Independent variable we want to test
Hue = Dependent Variable we want to include
"""
sns.catplot(data= data, x = hue, y = value, kind="box")
def minmeanmax(data, independent, dependent):
"""
Prints the min, mean and max of things grouped by our dependent variable
data = Dataframe
independent = The variable we want to do math on
Dependent = the variable we want to group by
"""
notpurchased = df.loc[df[dependent] == 0] #Adjust Number of df.Locs for number of variables
purchased = df.loc[df[dependent] == 1]
print("Not Purchased Min Mean and Max:" + '\n')
print(f' Min: {notpurchased[independent].min()}')
print(f' Mean: {round(notpurchased[independent].mean(), 6)}')
print(f' Max: {notpurchased[independent].max()}')
print("\n" + "Purchased Min Mean and Max:" + "\n")
print(f' Min: {purchased[independent].min()}')
print(f' Mean: {round(purchased[independent].mean(), 6)}')
print(f' Max: {purchased[independent].max()}')
#Add as many print statement blocks as required for the dependent variable, keep this count low
Data Overview¶
# df.sample(10)
| CustomerID | ProdTaken | Age | TypeofContact | CityTier | DurationOfPitch | Occupation | Gender | NumberOfPersonVisiting | NumberOfFollowups | ProductPitched | PreferredPropertyStar | MaritalStatus | NumberOfTrips | Passport | PitchSatisfactionScore | OwnCar | NumberOfChildrenVisiting | Designation | MonthlyIncome | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 996 | 200996 | 0 | 27.0 | Self Enquiry | 3 | 16.0 | Small Business | Female | 3 | 4.0 | Deluxe | 3.0 | Divorced | 2.0 | 1 | 3 | 1 | 2.0 | Manager | 20769.0 |
| 100 | 200100 | 1 | 37.0 | Self Enquiry | 2 | 12.0 | Salaried | Male | 3 | 3.0 | Basic | 5.0 | Married | 5.0 | 1 | 2 | 1 | 0.0 | Executive | 17073.0 |
| 4493 | 204493 | 0 | 35.0 | Self Enquiry | 1 | 9.0 | Small Business | Female | 3 | 5.0 | Basic | 5.0 | Unmarried | 3.0 | 0 | 1 | 1 | 1.0 | Executive | 23059.0 |
| 3579 | 203579 | 0 | 47.0 | Self Enquiry | 3 | 7.0 | Salaried | Male | 3 | 2.0 | Super Deluxe | 5.0 | Single | NaN | 0 | 4 | 1 | 2.0 | AVP | 36245.0 |
| 320 | 200320 | 0 | 27.0 | Self Enquiry | 3 | NaN | Salaried | Male | 3 | 3.0 | Deluxe | 3.0 | Single | 2.0 | 1 | 4 | 1 | 2.0 | Manager | NaN |
| 2585 | 202585 | 0 | 46.0 | Self Enquiry | 1 | 36.0 | Small Business | Male | 3 | 4.0 | Basic | 3.0 | Unmarried | 7.0 | 0 | 2 | 1 | 1.0 | Executive | 22130.0 |
| 3409 | 203409 | 0 | 26.0 | Self Enquiry | 1 | 26.0 | Small Business | Male | 4 | 4.0 | Basic | 3.0 | Divorced | 5.0 | 0 | 5 | 1 | 3.0 | Executive | 22347.0 |
| 3718 | 203718 | 0 | 32.0 | Self Enquiry | 3 | 36.0 | Small Business | Female | 4 | 5.0 | Deluxe | 3.0 | Married | 3.0 | 0 | 3 | 1 | 1.0 | Manager | 24146.0 |
| 2039 | 202039 | 0 | 36.0 | Company Invited | 3 | 14.0 | Salaried | Male | 3 | 4.0 | Standard | 5.0 | Unmarried | 2.0 | 0 | 1 | 0 | 0.0 | Senior Manager | 22587.0 |
| 3581 | 203581 | 0 | 23.0 | Company Invited | 1 | 33.0 | Salaried | Female | 3 | 5.0 | Basic | 3.0 | Married | 3.0 | 1 | 3 | 0 | 1.0 | Executive | 21492.0 |
# df.sample(10)
| CustomerID | ProdTaken | Age | TypeofContact | CityTier | DurationOfPitch | Occupation | Gender | NumberOfPersonVisiting | NumberOfFollowups | ProductPitched | PreferredPropertyStar | MaritalStatus | NumberOfTrips | Passport | PitchSatisfactionScore | OwnCar | NumberOfChildrenVisiting | Designation | MonthlyIncome | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3602 | 203602 | 0 | 38.0 | Self Enquiry | 1 | 26.0 | Salaried | Male | 3 | 4.0 | Deluxe | 3.0 | Married | 5.0 | 0 | 1 | 0 | 1.0 | Manager | 24446.0 |
| 2235 | 202235 | 0 | 38.0 | Company Invited | 1 | 9.0 | Salaried | Male | 2 | 3.0 | Basic | 3.0 | Married | 4.0 | 0 | 3 | 1 | 0.0 | Executive | 17821.0 |
| 4820 | 204820 | 1 | 35.0 | Self Enquiry | 1 | 14.0 | Salaried | Male | 4 | 4.0 | Deluxe | 3.0 | Married | 3.0 | 0 | 3 | 0 | 1.0 | Manager | 21263.0 |
| 2864 | 202864 | 0 | 30.0 | Self Enquiry | 1 | 10.0 | Small Business | Male | 4 | 5.0 | Standard | 3.0 | Divorced | 3.0 | 0 | 2 | 1 | 3.0 | Senior Manager | 30613.0 |
| 1577 | 201577 | 1 | 25.0 | Self Enquiry | 3 | 11.0 | Small Business | Male | 2 | 4.0 | Deluxe | 3.0 | Single | 2.0 | 1 | 3 | 0 | 1.0 | Manager | 20744.0 |
| 3342 | 203342 | 0 | 44.0 | Self Enquiry | 1 | 10.0 | Salaried | Male | 4 | 6.0 | King | NaN | Divorced | 5.0 | 0 | 5 | 1 | 3.0 | VP | 38418.0 |
| 1303 | 201303 | 0 | 47.0 | Self Enquiry | 1 | 10.0 | Salaried | Female | 3 | 4.0 | Standard | 4.0 | Married | 1.0 | 1 | 4 | 1 | 2.0 | Senior Manager | 25333.0 |
| 666 | 200666 | 1 | NaN | Self Enquiry | 1 | 9.0 | Salaried | Female | 2 | 3.0 | Deluxe | 3.0 | Divorced | 1.0 | 1 | 5 | 1 | 1.0 | Manager | NaN |
| 1129 | 201129 | 0 | 42.0 | Self Enquiry | 1 | 15.0 | Salaried | Male | 3 | 4.0 | King | 3.0 | Single | 1.0 | 0 | 1 | 1 | 1.0 | VP | 34613.0 |
| 1232 | 201232 | 0 | 35.0 | Self Enquiry | 1 | 33.0 | Salaried | Male | 2 | 3.0 | Deluxe | 3.0 | Married | 3.0 | 0 | 5 | 0 | 1.0 | Manager | 21883.0 |
Observation¶
Null values appear to be represented as NaN
df.shape
(4888, 20)
Observation¶
There appears to be 20 columns and 4888 rows
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 4888 entries, 0 to 4887 Data columns (total 20 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 CustomerID 4888 non-null int64 1 ProdTaken 4888 non-null int64 2 Age 4662 non-null float64 3 TypeofContact 4863 non-null object 4 CityTier 4888 non-null int64 5 DurationOfPitch 4637 non-null float64 6 Occupation 4888 non-null object 7 Gender 4888 non-null object 8 NumberOfPersonVisiting 4888 non-null int64 9 NumberOfFollowups 4843 non-null float64 10 ProductPitched 4888 non-null object 11 PreferredPropertyStar 4862 non-null float64 12 MaritalStatus 4888 non-null object 13 NumberOfTrips 4748 non-null float64 14 Passport 4888 non-null int64 15 PitchSatisfactionScore 4888 non-null int64 16 OwnCar 4888 non-null int64 17 NumberOfChildrenVisiting 4822 non-null float64 18 Designation 4888 non-null object 19 MonthlyIncome 4655 non-null float64 dtypes: float64(7), int64(7), object(6) memory usage: 763.9+ KB
Observation¶
Our data appears to all be floats, integers, and objects, and they all make sense for where they are. There appear to be a few null values in our dataset
df.isnull().sum()
CustomerID 0 ProdTaken 0 Age 226 TypeofContact 25 CityTier 0 DurationOfPitch 251 Occupation 0 Gender 0 NumberOfPersonVisiting 0 NumberOfFollowups 45 ProductPitched 0 PreferredPropertyStar 26 MaritalStatus 0 NumberOfTrips 140 Passport 0 PitchSatisfactionScore 0 OwnCar 0 NumberOfChildrenVisiting 66 Designation 0 MonthlyIncome 233 dtype: int64
Observation¶
we have welll over 200 empty rows in three columns, one column with over 100 and 4 more with less than 100, these will need to be handled
df.duplicated().sum()
0
Observation¶
There are no duplicate rows in our dataset
Exploring Numeric Data¶
df.describe().T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| CustomerID | 4888.0 | 202443.500000 | 1411.188388 | 200000.0 | 201221.75 | 202443.5 | 203665.25 | 204887.0 |
| ProdTaken | 4888.0 | 0.188216 | 0.390925 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
| Age | 4662.0 | 37.622265 | 9.316387 | 18.0 | 31.00 | 36.0 | 44.00 | 61.0 |
| CityTier | 4888.0 | 1.654255 | 0.916583 | 1.0 | 1.00 | 1.0 | 3.00 | 3.0 |
| DurationOfPitch | 4637.0 | 15.490835 | 8.519643 | 5.0 | 9.00 | 13.0 | 20.00 | 127.0 |
| NumberOfPersonVisiting | 4888.0 | 2.905074 | 0.724891 | 1.0 | 2.00 | 3.0 | 3.00 | 5.0 |
| NumberOfFollowups | 4843.0 | 3.708445 | 1.002509 | 1.0 | 3.00 | 4.0 | 4.00 | 6.0 |
| PreferredPropertyStar | 4862.0 | 3.581037 | 0.798009 | 3.0 | 3.00 | 3.0 | 4.00 | 5.0 |
| NumberOfTrips | 4748.0 | 3.236521 | 1.849019 | 1.0 | 2.00 | 3.0 | 4.00 | 22.0 |
| Passport | 4888.0 | 0.290917 | 0.454232 | 0.0 | 0.00 | 0.0 | 1.00 | 1.0 |
| PitchSatisfactionScore | 4888.0 | 3.078151 | 1.365792 | 1.0 | 2.00 | 3.0 | 4.00 | 5.0 |
| OwnCar | 4888.0 | 0.620295 | 0.485363 | 0.0 | 0.00 | 1.0 | 1.00 | 1.0 |
| NumberOfChildrenVisiting | 4822.0 | 1.187267 | 0.857861 | 0.0 | 1.00 | 1.0 | 2.00 | 3.0 |
| MonthlyIncome | 4655.0 | 23619.853491 | 5380.698361 | 1000.0 | 20346.00 | 22347.0 | 25571.00 | 98678.0 |
# df.mode()
| CustomerID | ProdTaken | Age | TypeofContact | CityTier | DurationOfPitch | Occupation | Gender | NumberOfPersonVisiting | NumberOfFollowups | ProductPitched | PreferredPropertyStar | MaritalStatus | NumberOfTrips | Passport | PitchSatisfactionScore | OwnCar | NumberOfChildrenVisiting | Designation | MonthlyIncome | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 200000 | 0.0 | 35.0 | Self Enquiry | 1.0 | 9.0 | Salaried | Male | 3.0 | 4.0 | Basic | 3.0 | Married | 2.0 | 0.0 | 3.0 | 1.0 | 1.0 | Executive | 17342.0 |
| 1 | 200001 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 20855.0 |
| 2 | 200002 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 21020.0 |
| 3 | 200003 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 21288.0 |
| 4 | 200004 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4883 | 204883 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4884 | 204884 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4885 | 204885 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4886 | 204886 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4887 | 204887 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
4888 rows × 20 columns
Observations¶
- Most people did not purchase a package
- Ages range from 18 to 61, with the middle being in the mid 30s
- City Tier ranges from 1-3, with most people being in a Tier 1 city
- Pitches ranged from 5 minutes to 2 hours with most being 13-15 minutes
- Only 1-5 people visited a destination, with most common number of people visiting being three
- There were usually 4 followups
- The most frequent product pitched was the basic package
- Preferred Properties ranged from 3-5 stars, with most being 3 stars
- Most individuals wer married
- There were between 1 to 22 trips taken annually, with the middle being 3, and most people taking 2.
- Most people did not have a passport
- Pitch Satisfaction Scores hovered around 3
- Most Individuals owned a car
- On average one child went with each person
- The most common designation was executive
- The average monthly salary hovered between 22 and 23k
# df.corr()
| CustomerID | ProdTaken | Age | CityTier | DurationOfPitch | NumberOfPersonVisiting | NumberOfFollowups | PreferredPropertyStar | NumberOfTrips | Passport | PitchSatisfactionScore | OwnCar | NumberOfChildrenVisiting | MonthlyIncome | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CustomerID | 1.000000 | 0.056506 | 0.032127 | 0.012975 | 0.064298 | 0.604013 | 0.427539 | 0.010553 | 0.224848 | 0.007974 | -0.035847 | 0.003805 | 0.511763 | 0.276833 |
| ProdTaken | 0.056506 | 1.000000 | -0.147254 | 0.086852 | 0.078257 | 0.009627 | 0.112171 | 0.099577 | 0.018898 | 0.260844 | 0.051394 | -0.011508 | 0.007421 | -0.130585 |
| Age | 0.032127 | -0.147254 | 1.000000 | -0.015625 | -0.012063 | 0.011621 | -0.002577 | -0.010474 | 0.184905 | 0.033399 | 0.018510 | 0.048654 | 0.007370 | 0.464869 |
| CityTier | 0.012975 | 0.086852 | -0.015625 | 1.000000 | 0.022703 | -0.001671 | 0.023652 | -0.009164 | -0.029709 | 0.001793 | -0.042160 | 0.003817 | 0.000672 | 0.051817 |
| DurationOfPitch | 0.064298 | 0.078257 | -0.012063 | 0.022703 | 1.000000 | 0.065141 | 0.009434 | -0.006637 | 0.009715 | 0.033034 | -0.002880 | -0.001626 | 0.031408 | -0.006252 |
| NumberOfPersonVisiting | 0.604013 | 0.009627 | 0.011621 | -0.001671 | 0.065141 | 1.000000 | 0.328569 | 0.033867 | 0.195223 | 0.011177 | -0.019581 | 0.010362 | 0.610621 | 0.195134 |
| NumberOfFollowups | 0.427539 | 0.112171 | -0.002577 | 0.023652 | 0.009434 | 0.328569 | 1.000000 | -0.024176 | 0.139517 | 0.004970 | 0.004054 | 0.012112 | 0.286425 | 0.176503 |
| PreferredPropertyStar | 0.010553 | 0.099577 | -0.010474 | -0.009164 | -0.006637 | 0.033867 | -0.024176 | 1.000000 | 0.012115 | 0.001040 | -0.022701 | 0.015742 | 0.035798 | 0.014289 |
| NumberOfTrips | 0.224848 | 0.018898 | 0.184905 | -0.029709 | 0.009715 | 0.195223 | 0.139517 | 0.012115 | 1.000000 | 0.012949 | -0.004378 | -0.011825 | 0.168795 | 0.139105 |
| Passport | 0.007974 | 0.260844 | 0.033399 | 0.001793 | 0.033034 | 0.011177 | 0.004970 | 0.001040 | 0.012949 | 1.000000 | 0.002926 | -0.022330 | 0.020264 | 0.002545 |
| PitchSatisfactionScore | -0.035847 | 0.051394 | 0.018510 | -0.042160 | -0.002880 | -0.019581 | 0.004054 | -0.022701 | -0.004378 | 0.002926 | 1.000000 | 0.068850 | 0.000878 | 0.030421 |
| OwnCar | 0.003805 | -0.011508 | 0.048654 | 0.003817 | -0.001626 | 0.010362 | 0.012112 | 0.015742 | -0.011825 | -0.022330 | 0.068850 | 1.000000 | 0.026572 | 0.080262 |
| NumberOfChildrenVisiting | 0.511763 | 0.007421 | 0.007370 | 0.000672 | 0.031408 | 0.610621 | 0.286425 | 0.035798 | 0.168795 | 0.020264 | 0.000878 | 0.026572 | 1.000000 | 0.201643 |
| MonthlyIncome | 0.276833 | -0.130585 | 0.464869 | 0.051817 | -0.006252 | 0.195134 | 0.176503 | 0.014289 | 0.139105 | 0.002545 | 0.030421 | 0.080262 | 0.201643 | 1.000000 |
Observations¶
Most correlations in this dataset are weak with ProdTaken, most heavily with NumberofFollowups, and Passport being relatively positively correlated, and age being moderately negatively correlated. Some data changes will need to be done
Checking for Odd Categorical Data¶
df.nunique()
CustomerID 4888 ProdTaken 2 Age 44 TypeofContact 2 CityTier 3 DurationOfPitch 34 Occupation 4 Gender 3 NumberOfPersonVisiting 5 NumberOfFollowups 6 ProductPitched 5 PreferredPropertyStar 3 MaritalStatus 4 NumberOfTrips 12 Passport 2 PitchSatisfactionScore 5 OwnCar 2 NumberOfChildrenVisiting 4 Designation 5 MonthlyIncome 2475 dtype: int64
Observations¶
There are some odd things here, like three unique genders and four unique marital statuses.
df['Gender'].value_counts()
Male 2916 Female 1817 Fe Male 155 Name: Gender, dtype: int64
There are 155 incorrect labels for the "Female" Category
df['MaritalStatus'].value_counts()
Married 2340 Divorced 950 Single 916 Unmarried 682 Name: MaritalStatus, dtype: int64
There isn't anything wrong with this category, but it might be possible to lump them int married and unmarried if we want to down the line
Removing Unnecessary Column(s)¶
We know each value is unique, but we do not need an ID, since we do not desire to keep track of individual customers in our dataset, rather we want to be able to predict based on values that have bearing besides autoincrements
df = df.drop(['CustomerID'], axis = 1)
# df.head(5)
| ProdTaken | Age | TypeofContact | CityTier | DurationOfPitch | Occupation | Gender | NumberOfPersonVisiting | NumberOfFollowups | ProductPitched | PreferredPropertyStar | MaritalStatus | NumberOfTrips | Passport | PitchSatisfactionScore | OwnCar | NumberOfChildrenVisiting | Designation | MonthlyIncome | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 41.0 | Self Enquiry | 3 | 6.0 | Salaried | Female | 3 | 3.0 | Deluxe | 3.0 | Single | 1.0 | 1 | 2 | 1 | 0.0 | Manager | 20993.0 |
| 1 | 0 | 49.0 | Company Invited | 1 | 14.0 | Salaried | Male | 3 | 4.0 | Deluxe | 4.0 | Divorced | 2.0 | 0 | 3 | 1 | 2.0 | Manager | 20130.0 |
| 2 | 1 | 37.0 | Self Enquiry | 1 | 8.0 | Free Lancer | Male | 3 | 4.0 | Basic | 3.0 | Single | 7.0 | 1 | 3 | 0 | 0.0 | Executive | 17090.0 |
| 3 | 0 | 33.0 | Company Invited | 1 | 9.0 | Salaried | Female | 2 | 3.0 | Basic | 3.0 | Divorced | 2.0 | 1 | 5 | 1 | 1.0 | Executive | 17909.0 |
| 4 | 0 | NaN | Self Enquiry | 1 | 8.0 | Small Business | Male | 2 | 3.0 | Basic | 4.0 | Divorced | 1.0 | 0 | 5 | 1 | 0.0 | Executive | 18468.0 |
Data Preprocessing¶
Handling Null values¶
df.isnull().sum()
ProdTaken 0 Age 226 TypeofContact 25 CityTier 0 DurationOfPitch 251 Occupation 0 Gender 0 NumberOfPersonVisiting 0 NumberOfFollowups 45 ProductPitched 0 PreferredPropertyStar 26 MaritalStatus 0 NumberOfTrips 140 Passport 0 PitchSatisfactionScore 0 OwnCar 0 NumberOfChildrenVisiting 66 Designation 0 MonthlyIncome 233 dtype: int64
Age¶
df.Age.isna().sum()
226
df.Age.describe().T
count 4662.000000 mean 37.622265 std 9.316387 min 18.000000 25% 31.000000 50% 36.000000 75% 44.000000 max 61.000000 Name: Age, dtype: float64
sns.displot(df['Age'], kde = True)
plt.axvline(x = df.Age.median(), color = 'black', ls='--')
plt.axvline(x = df.Age.mean(), color = 'red')
<matplotlib.lines.Line2D at 0x249363edb20>
Observations¶
The data is skewed right, and given that it likely has a bearing on the dependent variable, and a difference depending on whether or not they purchased, imputing the median based on whether or not they purchased ticket seems like a smart idea.
Finding the values¶
ReturnByPurchase('Age')
Purchased Mean: 34.770548 Purchased Median: 33.0 Purchased Mode: 0 29.0 Name: Age, dtype: float64 ----------- Unpurchased Mean: 38.282092 Unpurchased Median: 37.0 Unpurchased Mode: 0 36.0 Name: Age, dtype: float64
Setting the Values¶
df.loc[(df['Age'].isnull() == True) & (df['ProdTaken'] == 1), ['Age']] = 33
df.loc[(df['Age'].isnull() == True) & (df['ProdTaken'] == 0),['Age']] = 37
Examining Changes¶
df.Age.isnull().sum()
0
df.Age.describe().T
count 4888.000000 mean 37.557488 std 9.109545 min 18.000000 25% 31.000000 50% 37.000000 75% 43.000000 max 61.000000 Name: Age, dtype: float64
sns.displot(df['Age'], kde = True)
plt.axvline(x = df.Age.median(), color = 'black', ls='--')
plt.axvline(x = df.Age.mean(), color = 'red')
<matplotlib.lines.Line2D at 0x24937bbf520>
The median and the mean appear to have been brought closer together in this dataset due to the imputing we just did, but overall the datas structure has not changed much
Type of Contact¶
df.TypeofContact.isna().sum()
25
# df[df['TypeofContact'].isnull()]
| ProdTaken | Age | TypeofContact | CityTier | DurationOfPitch | Occupation | Gender | NumberOfPersonVisiting | NumberOfFollowups | ProductPitched | PreferredPropertyStar | MaritalStatus | NumberOfTrips | Passport | PitchSatisfactionScore | OwnCar | NumberOfChildrenVisiting | Designation | MonthlyIncome | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 224 | 0 | 31.0 | NaN | 1 | NaN | Small Business | Male | 2 | 5.0 | Deluxe | 3.0 | Divorced | 1.0 | 0 | 3 | 1 | 0.0 | Manager | NaN |
| 571 | 0 | 26.0 | NaN | 1 | NaN | Salaried | Female | 3 | 5.0 | Basic | 3.0 | Married | 4.0 | 0 | 4 | 1 | 2.0 | Executive | NaN |
| 572 | 0 | 29.0 | NaN | 1 | NaN | Small Business | Female | 3 | 3.0 | Deluxe | 3.0 | Divorced | 5.0 | 0 | 2 | 1 | 0.0 | Manager | NaN |
| 576 | 0 | 27.0 | NaN | 3 | NaN | Small Business | Male | 2 | 3.0 | Deluxe | 3.0 | Divorced | 1.0 | 0 | 3 | 0 | 1.0 | Manager | NaN |
| 579 | 0 | 34.0 | NaN | 1 | NaN | Small Business | Female | 2 | 4.0 | Basic | 5.0 | Single | 2.0 | 0 | 2 | 1 | 1.0 | Executive | NaN |
| 598 | 1 | 28.0 | NaN | 1 | NaN | Small Business | Male | 2 | 3.0 | Basic | 3.0 | Single | 7.0 | 0 | 3 | 0 | 0.0 | Executive | NaN |
| 622 | 0 | 32.0 | NaN | 3 | NaN | Salaried | Male | 3 | 3.0 | Deluxe | 3.0 | Married | 3.0 | 0 | 2 | 0 | 0.0 | Manager | NaN |
| 724 | 0 | 24.0 | NaN | 1 | NaN | Small Business | Female | 2 | 4.0 | Deluxe | 3.0 | Married | 2.0 | 0 | 3 | 1 | 1.0 | Manager | NaN |
| 843 | 0 | 26.0 | NaN | 1 | NaN | Small Business | Male | 2 | 1.0 | Basic | 3.0 | Divorced | 2.0 | 0 | 5 | 1 | 1.0 | Executive | NaN |
| 1021 | 1 | 25.0 | NaN | 3 | NaN | Salaried | Male | 3 | 4.0 | Basic | 5.0 | Divorced | 4.0 | 0 | 1 | 1 | 0.0 | Executive | NaN |
| 1047 | 0 | 33.0 | NaN | 3 | NaN | Small Business | Male | 2 | 3.0 | Deluxe | 5.0 | Divorced | 1.0 | 0 | 3 | 0 | 0.0 | Manager | NaN |
| 1143 | 0 | 45.0 | NaN | 3 | NaN | Small Business | Male | 2 | 4.0 | Deluxe | 5.0 | Married | 2.0 | 0 | 3 | 0 | 0.0 | Manager | NaN |
| 1182 | 0 | 36.0 | NaN | 1 | NaN | Small Business | Female | 2 | 4.0 | Deluxe | 3.0 | Married | 1.0 | 0 | 5 | 1 | 1.0 | Manager | NaN |
| 1217 | 0 | 24.0 | NaN | 1 | NaN | Small Business | Male | 3 | 1.0 | Basic | 3.0 | Married | 2.0 | 0 | 1 | 0 | 0.0 | Executive | NaN |
| 1356 | 0 | 41.0 | NaN | 3 | NaN | Small Business | Female | 2 | 3.0 | Deluxe | 4.0 | Married | 6.0 | 0 | 3 | 1 | 1.0 | Manager | NaN |
| 1469 | 0 | 34.0 | NaN | 1 | NaN | Small Business | Male | 2 | 1.0 | Deluxe | 3.0 | Married | 3.0 | 0 | 3 | 0 | 1.0 | Manager | NaN |
| 1694 | 0 | 31.0 | NaN | 1 | NaN | Small Business | Male | 2 | 5.0 | Deluxe | 3.0 | Married | 1.0 | 0 | 3 | 0 | 0.0 | Manager | NaN |
| 2041 | 0 | 26.0 | NaN | 1 | NaN | Salaried | Female | 3 | 5.0 | Basic | 3.0 | Married | 4.0 | 0 | 4 | 1 | 0.0 | Executive | NaN |
| 2042 | 0 | 29.0 | NaN | 1 | NaN | Small Business | Female | 3 | 3.0 | Deluxe | 3.0 | Married | 5.0 | 0 | 1 | 0 | 1.0 | Manager | NaN |
| 2046 | 0 | 27.0 | NaN | 3 | NaN | Small Business | Male | 2 | 3.0 | Deluxe | 3.0 | Married | 1.0 | 0 | 3 | 1 | 1.0 | Manager | NaN |
| 2049 | 0 | 34.0 | NaN | 1 | NaN | Small Business | Female | 2 | 4.0 | Basic | 5.0 | Single | 2.0 | 0 | 1 | 1 | 0.0 | Executive | NaN |
| 2068 | 1 | 28.0 | NaN | 1 | NaN | Small Business | Male | 2 | 3.0 | Basic | 3.0 | Single | 7.0 | 0 | 3 | 1 | 1.0 | Executive | NaN |
| 2092 | 0 | 32.0 | NaN | 3 | NaN | Salaried | Male | 3 | 3.0 | Deluxe | 3.0 | Married | 3.0 | 0 | 1 | 0 | 2.0 | Manager | NaN |
| 2194 | 0 | 24.0 | NaN | 1 | NaN | Small Business | Female | 2 | 4.0 | Deluxe | 3.0 | Married | 2.0 | 0 | 3 | 0 | 0.0 | Manager | NaN |
| 2313 | 0 | 26.0 | NaN | 1 | NaN | Small Business | Male | 2 | 1.0 | Basic | 3.0 | Married | 2.0 | 0 | 5 | 1 | 1.0 | Executive | NaN |
df['TypeofContact'].mode()
0 Self Enquiry Name: TypeofContact, dtype: object
df.loc[df.TypeofContact.isnull(), 'TypeofContact'] = 'Self Enquiry'
Notes¶
With only 25 missing fields, and the vast majority of TypeofContact being self enquiry, we just imputed with the mode of the TypeofContact.
df.TypeofContact.isna().sum()
0
Duration of Pitch¶
df.DurationOfPitch.isna().sum()
251
df.DurationOfPitch.describe()
count 4637.000000 mean 15.490835 std 8.519643 min 5.000000 25% 9.000000 50% 13.000000 75% 20.000000 max 127.000000 Name: DurationOfPitch, dtype: float64
sns.displot(df['DurationOfPitch'], kde = True)
plt.axvline(x = df.DurationOfPitch.median(), color = 'black', ls='--')
plt.axvline(x = df.DurationOfPitch.mean(), color = 'red')
<matplotlib.lines.Line2D at 0x24937bd7ca0>
Observations¶
This data is extremely skewed right, we will have to come back when we are examining outliers later on and fix this, as I believe the 120 is an inaccurate number, given the KDE near 1 out to 120. We will have to impute with the median rather than the mean for null values. Due to the large number, splitting by whether they purchased or not seems wise again.
Finding the Values¶
ReturnByPurchase('DurationOfPitch')
Purchased Mean: 16.873143 Purchased Median: 15.0 Purchased Mode: 0 9.0 Name: DurationOfPitch, dtype: float64 ----------- Unpurchased Mean: 15.169325 Unpurchased Median: 13.0 Unpurchased Mode: 0 9.0 Name: DurationOfPitch, dtype: float64
Setting the Values¶
df.loc[(df['DurationOfPitch'].isnull() == True) & (df['ProdTaken'] == 1), ['DurationOfPitch']] = 15
df.loc[(df['DurationOfPitch'].isnull() == True) & (df['ProdTaken'] == 0),['DurationOfPitch']] = 13
Examining Changes¶
df.DurationOfPitch.describe()
count 4888.000000 mean 15.381342 std 8.313127 min 5.000000 25% 9.000000 50% 13.000000 75% 19.000000 max 127.000000 Name: DurationOfPitch, dtype: float64
sns.displot(df['DurationOfPitch'], kde = True)
plt.axvline(x = df.DurationOfPitch.median(), color = 'black', ls='--')
plt.axvline(x = df.DurationOfPitch.mean(), color = 'red')
<matplotlib.lines.Line2D at 0x24937e9bfd0>
There appears to be little difference in the structure of the data after these changes, except the increase in the number of objects in the median
Number of Followups¶
df.NumberOfFollowups.isna().sum()
45
df.NumberOfFollowups.describe()
count 4843.000000 mean 3.708445 std 1.002509 min 1.000000 25% 3.000000 50% 4.000000 75% 4.000000 max 6.000000 Name: NumberOfFollowups, dtype: float64
sns.displot(df['NumberOfFollowups'], kde = True)
plt.axvline(x = df.NumberOfFollowups.median(), color = 'black', ls='--')
plt.axvline(x = df.NumberOfFollowups.mean(), color = 'red')
<matplotlib.lines.Line2D at 0x24937f74310>
Observations¶
This is mainly categorical with an integer, with the median and mode both being 4, and the mean a little off at 3.7. This likely has bearing on the dependent variable, so we will impute with that in mind
Finding the Values¶
ReturnByPurchase('NumberOfFollowups')
Purchased Mean: 3.941886 Purchased Median: 4.0 Purchased Mode: 0 4.0 Name: NumberOfFollowups, dtype: float64 ----------- Unpurchased Mean: 3.654286 Unpurchased Median: 4.0 Unpurchased Mode: 0 4.0 Name: NumberOfFollowups, dtype: float64
Setting the Values¶
df.loc[df['NumberOfFollowups'].isnull(), ['NumberOfFollowups']] = 4
Checking the Results¶
df.NumberOfFollowups.isna().sum()
0
df.NumberOfFollowups.describe()
count 4888.000000 mean 3.711129 std 0.998271 min 1.000000 25% 3.000000 50% 4.000000 75% 4.000000 max 6.000000 Name: NumberOfFollowups, dtype: float64
sns.displot(df['NumberOfFollowups'], kde = True)
plt.axvline(x = df.NumberOfFollowups.median(), color = 'black', ls='--')
plt.axvline(x = df.NumberOfFollowups.mean(), color = 'red')
<matplotlib.lines.Line2D at 0x24937fe0610>
The data looks to mostly be in the same shape as before, due to the low number of changes to fields
Preferred Property Star¶
df.PreferredPropertyStar.isna().sum()
26
df.PreferredPropertyStar.describe()
count 4862.000000 mean 3.581037 std 0.798009 min 3.000000 25% 3.000000 50% 3.000000 75% 4.000000 max 5.000000 Name: PreferredPropertyStar, dtype: float64
sns.displot(df['PreferredPropertyStar'], kde = True)
plt.axvline(x = df.PreferredPropertyStar.median(), color = 'black', ls='--')
plt.axvline(x = df.PreferredPropertyStar.mean(), color = 'red')
<matplotlib.lines.Line2D at 0x249380501f0>
Observations¶
This is another categorical value with an integer base. The median and Mode are both at 3, and the mean is a bit above at 3.58 I will impute with the median once more
Finding the Values¶
ReturnByPurchase('PreferredPropertyStar')
Purchased Mean: 3.746171 Purchased Median: 3.0 Purchased Mode: 0 3.0 Name: PreferredPropertyStar, dtype: float64 ----------- Unpurchased Mean: 3.542806 Unpurchased Median: 3.0 Unpurchased Mode: 0 3.0 Name: PreferredPropertyStar, dtype: float64
df.loc[df['PreferredPropertyStar'].isnull(), ['PreferredPropertyStar']] = 3
Checking the Results¶
df.PreferredPropertyStar.isna().sum()
0
df.PreferredPropertyStar.describe()
count 4888.000000 mean 3.577946 std 0.797005 min 3.000000 25% 3.000000 50% 3.000000 75% 4.000000 max 5.000000 Name: PreferredPropertyStar, dtype: float64
sns.displot(df['PreferredPropertyStar'], kde = True)
plt.axvline(x = df.PreferredPropertyStar.median(), color = 'black', ls='--')
plt.axvline(x = df.PreferredPropertyStar.mean(), color = 'red')
<matplotlib.lines.Line2D at 0x249380d2f10>
As expected at this point, the data appears to mostly be the same as before.
Number of Trips¶
df.NumberOfTrips.isna().sum()
140
df.NumberOfTrips.describe()
count 4748.000000 mean 3.236521 std 1.849019 min 1.000000 25% 2.000000 50% 3.000000 75% 4.000000 max 22.000000 Name: NumberOfTrips, dtype: float64
sns.displot(df['NumberOfTrips'], kde = True)
plt.axvline(x = df.NumberOfTrips.median(), color = 'black', ls='--')
plt.axvline(x = df.NumberOfTrips.mean(), color = 'red')
<matplotlib.lines.Line2D at 0x2493912a880>
Observations¶
The median and mean are close, and there is definitely an outlier up at 22, given the right skew, I will impute our missing values with the Median
Finding the Values¶
ReturnByPurchase('NumberOfTrips')
Purchased Mean: 3.30837 Purchased Median: 3.0 Purchased Mode: 0 2.0 Name: NumberOfTrips, dtype: float64 ----------- Unpurchased Mean: 3.219531 Unpurchased Median: 3.0 Unpurchased Mode: 0 2.0 Name: NumberOfTrips, dtype: float64
Changing the Values¶
df.loc[df['NumberOfTrips'].isnull(), ['NumberOfTrips']] = 3
Checking the Changes¶
df.NumberOfTrips.isna().sum()
0
df.NumberOfTrips.describe()
count 4888.000000 mean 3.229746 std 1.822769 min 1.000000 25% 2.000000 50% 3.000000 75% 4.000000 max 22.000000 Name: NumberOfTrips, dtype: float64
sns.displot(df['NumberOfTrips'], kde = True)
plt.axvline(x = df.NumberOfTrips.median(), color = 'black', ls='--')
plt.axvline(x = df.NumberOfTrips.mean(), color = 'red')
<matplotlib.lines.Line2D at 0x249391926a0>
At the risk of sounding like a broken record, our data looks similar to how it was initially which we expected
Number of Children Visitng¶
df.NumberOfChildrenVisiting.isna().sum()
66
df.NumberOfChildrenVisiting.describe()
count 4822.000000 mean 1.187267 std 0.857861 min 0.000000 25% 1.000000 50% 1.000000 75% 2.000000 max 3.000000 Name: NumberOfChildrenVisiting, dtype: float64
sns.displot(df['NumberOfChildrenVisiting'], kde = True)
plt.axvline(x = df.NumberOfChildrenVisiting.median(), color = 'black', ls='--')
plt.axvline(x = df.NumberOfChildrenVisiting.mean(), color = 'red')
<matplotlib.lines.Line2D at 0x24939393970>
Observations¶
The median and mode are equal at one, and the mean is slightly higher at 1.18, we will impute the missing values with the median
Finding the Values¶
ReturnByPurchase('NumberOfChildrenVisiting')
Purchased Mean: 1.200438 Purchased Median: 1.0 Purchased Mode: 0 1.0 Name: NumberOfChildrenVisiting, dtype: float64 ----------- Unpurchased Mean: 1.18419 Unpurchased Median: 1.0 Unpurchased Mode: 0 1.0 Name: NumberOfChildrenVisiting, dtype: float64
Changing the Values¶
df.loc[df['NumberOfChildrenVisiting'].isnull(), ['NumberOfChildrenVisiting']] = 1
Checking the Changes¶
df.NumberOfChildrenVisiting.isna().sum()
0
df.NumberOfChildrenVisiting.describe()
count 4888.000000 mean 1.184738 std 0.852323 min 0.000000 25% 1.000000 50% 1.000000 75% 2.000000 max 3.000000 Name: NumberOfChildrenVisiting, dtype: float64
sns.displot(df['NumberOfChildrenVisiting'], kde = True)
plt.axvline(x = df.NumberOfChildrenVisiting.median(), color = 'black', ls='--')
plt.axvline(x = df.NumberOfChildrenVisiting.mean(), color = 'red')
<matplotlib.lines.Line2D at 0x2493924f580>
Monthly Income¶
df.MonthlyIncome.isna().sum()
233
df.MonthlyIncome.describe()
count 4655.000000 mean 23619.853491 std 5380.698361 min 1000.000000 25% 20346.000000 50% 22347.000000 75% 25571.000000 max 98678.000000 Name: MonthlyIncome, dtype: float64
sns.displot(df['MonthlyIncome'], kde = True)
plt.axvline(x = df.MonthlyIncome.median(), color = 'black', ls='--')
plt.axvline(x = df.MonthlyIncome.mean(), color = 'red')
<matplotlib.lines.Line2D at 0x2493951fb50>
Observations¶
We have a few pieces of data causing skewing to the right, with a massive max, as well as some outliers more than likely beneath our dataset. Out median and mean are fairly close in the grand scheme of things, so imputing with median will be helpful in bringing them closer.
Finding the Values¶
ReturnByPurchase('MonthlyIncome')
Purchased Mean: 22172.827703 Purchased Median: 21172.0 Purchased Mode: 0 17293.0 1 17404.0 2 20971.0 3 21082.0 Name: MonthlyIncome, dtype: float64 ----------- Unpurchased Mean: 23960.962835 Unpurchased Median: 22729.0 Unpurchased Mode: 0 20855.0 Name: MonthlyIncome, dtype: float64
Changing the Values¶
df.loc[(df['MonthlyIncome'].isnull() == True) & (df['ProdTaken'] == 1), ['MonthlyIncome']] = 21172
df.loc[(df['MonthlyIncome'].isnull() == True) & (df['ProdTaken'] == 0),['MonthlyIncome']] = 22729
Checking the Changes¶
df.MonthlyIncome.isna().sum()
0
df.MonthlyIncome.describe()
count 4888.000000 mean 23567.195376 std 5257.438805 min 1000.000000 25% 20485.000000 50% 22595.500000 75% 25424.750000 max 98678.000000 Name: MonthlyIncome, dtype: float64
sns.displot(df['NumberOfTrips'], kde = True)
plt.axvline(x = df.NumberOfTrips.median(), color = 'black', ls='--')
plt.axvline(x = df.NumberOfTrips.mean(), color = 'red')
<matplotlib.lines.Line2D at 0x2493a708070>
The mean dropped a little bit, and the median increased some, but the overall structure of the data stays mostly unchanged
Confirming no more Null Values¶
df.isna().sum()
ProdTaken 0 Age 0 TypeofContact 0 CityTier 0 DurationOfPitch 0 Occupation 0 Gender 0 NumberOfPersonVisiting 0 NumberOfFollowups 0 ProductPitched 0 PreferredPropertyStar 0 MaritalStatus 0 NumberOfTrips 0 Passport 0 PitchSatisfactionScore 0 OwnCar 0 NumberOfChildrenVisiting 0 Designation 0 MonthlyIncome 0 dtype: int64
Handling Outliers¶
df.describe().T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| ProdTaken | 4888.0 | 0.188216 | 0.390925 | 0.0 | 0.0 | 0.0 | 0.00 | 1.0 |
| Age | 4888.0 | 37.557488 | 9.109545 | 18.0 | 31.0 | 37.0 | 43.00 | 61.0 |
| CityTier | 4888.0 | 1.654255 | 0.916583 | 1.0 | 1.0 | 1.0 | 3.00 | 3.0 |
| DurationOfPitch | 4888.0 | 15.381342 | 8.313127 | 5.0 | 9.0 | 13.0 | 19.00 | 127.0 |
| NumberOfPersonVisiting | 4888.0 | 2.905074 | 0.724891 | 1.0 | 2.0 | 3.0 | 3.00 | 5.0 |
| NumberOfFollowups | 4888.0 | 3.711129 | 0.998271 | 1.0 | 3.0 | 4.0 | 4.00 | 6.0 |
| PreferredPropertyStar | 4888.0 | 3.577946 | 0.797005 | 3.0 | 3.0 | 3.0 | 4.00 | 5.0 |
| NumberOfTrips | 4888.0 | 3.229746 | 1.822769 | 1.0 | 2.0 | 3.0 | 4.00 | 22.0 |
| Passport | 4888.0 | 0.290917 | 0.454232 | 0.0 | 0.0 | 0.0 | 1.00 | 1.0 |
| PitchSatisfactionScore | 4888.0 | 3.078151 | 1.365792 | 1.0 | 2.0 | 3.0 | 4.00 | 5.0 |
| OwnCar | 4888.0 | 0.620295 | 0.485363 | 0.0 | 0.0 | 1.0 | 1.00 | 1.0 |
| NumberOfChildrenVisiting | 4888.0 | 1.184738 | 0.852323 | 0.0 | 1.0 | 1.0 | 2.00 | 3.0 |
| MonthlyIncome | 4888.0 | 23567.195376 | 5257.438805 | 1000.0 | 20485.0 | 22595.5 | 25424.75 | 98678.0 |
Duration of Pitch¶
df.DurationOfPitch.describe()
count 4888.000000 mean 15.381342 std 8.313127 min 5.000000 25% 9.000000 50% 13.000000 75% 19.000000 max 127.000000 Name: DurationOfPitch, dtype: float64
histogramboxplot(data = df, feature = 'DurationOfPitch')
There are some pretty extreme outliers in Duration of Pitch, so cleaning up these outliers will really affect the way this data is structured, and hopefully produce better results. The ones close to the whisker are likely accurate, but the ones at 120 minutes seem like erronous entry, so I will set it to the median.
# df[df['DurationOfPitch'] > 120]
| ProdTaken | Age | TypeofContact | CityTier | DurationOfPitch | Occupation | Gender | NumberOfPersonVisiting | NumberOfFollowups | ProductPitched | PreferredPropertyStar | MaritalStatus | NumberOfTrips | Passport | PitchSatisfactionScore | OwnCar | NumberOfChildrenVisiting | Designation | MonthlyIncome | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1434 | 0 | 37.0 | Company Invited | 3 | 126.0 | Salaried | Male | 2 | 3.0 | Basic | 3.0 | Married | 3.0 | 0 | 1 | 1 | 1.0 | Executive | 18482.0 |
| 3878 | 0 | 53.0 | Company Invited | 3 | 127.0 | Salaried | Male | 3 | 4.0 | Basic | 3.0 | Married | 4.0 | 0 | 1 | 1 | 2.0 | Executive | 22160.0 |
df.groupby("ProdTaken")["DurationOfPitch"].median()
ProdTaken 0 13.0 1 15.0 Name: DurationOfPitch, dtype: float64
df.loc[df['DurationOfPitch'] > 125, 'DurationOfPitch'] = 13
histogramboxplot(data = df, feature = 'DurationOfPitch')
This has improved the look of the dataset immediately, and makes it more approachable for further modification if needed, but ultimately likely will not
Number of Persons Visiting¶
df.DurationOfPitch.describe()
count 4888.000000 mean 15.334902 std 8.003437 min 5.000000 25% 9.000000 50% 13.000000 75% 19.000000 max 36.000000 Name: DurationOfPitch, dtype: float64
histogramboxplot(data = df, feature = 'NumberOfPersonVisiting')
This only has one outlier at the top, and it feels like it is likely an accurate value, so I will be leaving it for the time being
Number of Followups¶
df.DurationOfPitch.describe()
count 4888.000000 mean 15.334902 std 8.003437 min 5.000000 25% 9.000000 50% 13.000000 75% 19.000000 max 36.000000 Name: DurationOfPitch, dtype: float64
histogramboxplot(data = df, feature = 'NumberOfFollowups')
This one has a single datapoint above and a single one below that happens to be out of the whiskers, and seem accurate, so I will be leaving them be.
Number of Trips¶
df.DurationOfPitch.describe()
count 4888.000000 mean 15.334902 std 8.003437 min 5.000000 25% 9.000000 50% 13.000000 75% 19.000000 max 36.000000 Name: DurationOfPitch, dtype: float64
histogramboxplot(data = df, feature = 'NumberOfTrips')
This one was a debate, it is possible that the outliers at the top were accurate, but they also seem a bit extreme for the dataset, and may throw off a model. I have decided to take the values that are above 15 and impute them with the median, but leave the one near ten
# df[df['NumberOfTrips'] > 15]
| ProdTaken | Age | TypeofContact | CityTier | DurationOfPitch | Occupation | Gender | NumberOfPersonVisiting | NumberOfFollowups | ProductPitched | PreferredPropertyStar | MaritalStatus | NumberOfTrips | Passport | PitchSatisfactionScore | OwnCar | NumberOfChildrenVisiting | Designation | MonthlyIncome | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 385 | 1 | 30.0 | Company Invited | 1 | 10.0 | Large Business | Male | 2 | 3.0 | Basic | 3.0 | Single | 19.0 | 1 | 4 | 1 | 1.0 | Executive | 17285.0 |
| 816 | 0 | 39.0 | Company Invited | 1 | 15.0 | Salaried | Male | 3 | 3.0 | Deluxe | 4.0 | Unmarried | 21.0 | 0 | 2 | 1 | 0.0 | Manager | 21782.0 |
| 2829 | 1 | 31.0 | Company Invited | 1 | 11.0 | Large Business | Male | 3 | 4.0 | Basic | 3.0 | Single | 20.0 | 1 | 4 | 1 | 2.0 | Executive | 20963.0 |
| 3260 | 0 | 40.0 | Company Invited | 1 | 16.0 | Salaried | Male | 4 | 4.0 | Deluxe | 4.0 | Unmarried | 22.0 | 0 | 2 | 1 | 1.0 | Manager | 25460.0 |
df.groupby("ProdTaken")["NumberOfTrips"].median()
ProdTaken 0 3.0 1 3.0 Name: NumberOfTrips, dtype: float64
df.loc[df['NumberOfTrips'] > 18, 'NumberOfTrips'] = 3
histogramboxplot(data = df, feature = 'NumberOfTrips')
This seems to keep the shape of the data well intact, while leaving a likely accurate outlier in the data to prevent model overfit
Monthly Income¶
df.DurationOfPitch.describe()
count 4888.000000 mean 15.334902 std 8.003437 min 5.000000 25% 9.000000 50% 13.000000 75% 19.000000 max 36.000000 Name: DurationOfPitch, dtype: float64
histogramboxplot(data = df, feature = 'MonthlyIncome')
There are some outliers low, and a ton of outliers high. The ones below 40,000 seem likely accurate, but the ones approaching 100,000 a month seem likely erronous. I will examine the low values, determine if their data seems erronous, and then decide how to impute them. I will leave the values below 40,000 intact and impute the ones above 80,000 with the median
# df[df['MonthlyIncome'] < 10000]
| ProdTaken | Age | TypeofContact | CityTier | DurationOfPitch | Occupation | Gender | NumberOfPersonVisiting | NumberOfFollowups | ProductPitched | PreferredPropertyStar | MaritalStatus | NumberOfTrips | Passport | PitchSatisfactionScore | OwnCar | NumberOfChildrenVisiting | Designation | MonthlyIncome | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 142 | 0 | 38.0 | Self Enquiry | 1 | 9.0 | Large Business | Female | 2 | 3.0 | Deluxe | 3.0 | Single | 4.0 | 1 | 5 | 0 | 0.0 | Manager | 1000.0 |
| 2586 | 0 | 39.0 | Self Enquiry | 1 | 10.0 | Large Business | Female | 3 | 4.0 | Deluxe | 3.0 | Single | 5.0 | 1 | 5 | 0 | 1.0 | Manager | 4678.0 |
These values seem inaccurate given the fact that they are living in a major city, and are a manager in large busines. I will impute them with the median
# df[df['MonthlyIncome'] > 80000]
| ProdTaken | Age | TypeofContact | CityTier | DurationOfPitch | Occupation | Gender | NumberOfPersonVisiting | NumberOfFollowups | ProductPitched | PreferredPropertyStar | MaritalStatus | NumberOfTrips | Passport | PitchSatisfactionScore | OwnCar | NumberOfChildrenVisiting | Designation | MonthlyIncome | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 38 | 0 | 36.0 | Self Enquiry | 1 | 11.0 | Salaried | Female | 2 | 4.0 | Basic | 3.0 | Divorced | 1.0 | 1 | 2 | 1 | 0.0 | Executive | 95000.0 |
| 2482 | 0 | 37.0 | Self Enquiry | 1 | 12.0 | Salaried | Female | 3 | 5.0 | Basic | 5.0 | Divorced | 2.0 | 1 | 2 | 1 | 1.0 | Executive | 98678.0 |
df.groupby("ProdTaken")["MonthlyIncome"].median()
ProdTaken 0 22729.0 1 21172.0 Name: MonthlyIncome, dtype: float64
df.loc[df['MonthlyIncome'] > 80000, 'MonthlyIncome'] = 22729
df.loc[df['MonthlyIncome'] < 10000, 'MonthlyIncome'] = 22729
histogramboxplot(data = df, feature = 'MonthlyIncome')
This seems much better in terms of data, I will likely log scale this later due to the high number just outside the the upper whisker, but for now, I will leave it be.
Remove erronous Categorical Data¶
df['Gender'].value_counts()
Male 2916 Female 1817 Fe Male 155 Name: Gender, dtype: int64
df.loc[df['Gender'] == 'Fe Male', ['Gender']] = 'Female'
df['Gender'].value_counts()
Male 2916 Female 1972 Name: Gender, dtype: int64
Exploratory Data Analysis¶
Univariate Analysis¶
# df.head(5)
| ProdTaken | Age | TypeofContact | CityTier | DurationOfPitch | Occupation | Gender | NumberOfPersonVisiting | NumberOfFollowups | ProductPitched | PreferredPropertyStar | MaritalStatus | NumberOfTrips | Passport | PitchSatisfactionScore | OwnCar | NumberOfChildrenVisiting | Designation | MonthlyIncome | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 41.0 | Self Enquiry | 3 | 6.0 | Salaried | Female | 3 | 3.0 | Deluxe | 3.0 | Single | 1.0 | 1 | 2 | 1 | 0.0 | Manager | 20993.0 |
| 1 | 0 | 49.0 | Company Invited | 1 | 14.0 | Salaried | Male | 3 | 4.0 | Deluxe | 4.0 | Divorced | 2.0 | 0 | 3 | 1 | 2.0 | Manager | 20130.0 |
| 2 | 1 | 37.0 | Self Enquiry | 1 | 8.0 | Free Lancer | Male | 3 | 4.0 | Basic | 3.0 | Single | 7.0 | 1 | 3 | 0 | 0.0 | Executive | 17090.0 |
| 3 | 0 | 33.0 | Company Invited | 1 | 9.0 | Salaried | Female | 2 | 3.0 | Basic | 3.0 | Divorced | 2.0 | 1 | 5 | 1 | 1.0 | Executive | 17909.0 |
| 4 | 0 | 37.0 | Self Enquiry | 1 | 8.0 | Small Business | Male | 2 | 3.0 | Basic | 4.0 | Divorced | 1.0 | 0 | 5 | 1 | 0.0 | Executive | 18468.0 |
df.describe().T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| ProdTaken | 4888.0 | 0.188216 | 0.390925 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
| Age | 4888.0 | 37.557488 | 9.109545 | 18.0 | 31.00 | 37.0 | 43.00 | 61.0 |
| CityTier | 4888.0 | 1.654255 | 0.916583 | 1.0 | 1.00 | 1.0 | 3.00 | 3.0 |
| DurationOfPitch | 4888.0 | 15.334902 | 8.003437 | 5.0 | 9.00 | 13.0 | 19.00 | 36.0 |
| NumberOfPersonVisiting | 4888.0 | 2.905074 | 0.724891 | 1.0 | 2.00 | 3.0 | 3.00 | 5.0 |
| NumberOfFollowups | 4888.0 | 3.711129 | 0.998271 | 1.0 | 3.00 | 4.0 | 4.00 | 6.0 |
| PreferredPropertyStar | 4888.0 | 3.577946 | 0.797005 | 3.0 | 3.00 | 3.0 | 4.00 | 5.0 |
| NumberOfTrips | 4888.0 | 3.215426 | 1.754188 | 1.0 | 2.00 | 3.0 | 4.00 | 8.0 |
| Passport | 4888.0 | 0.290917 | 0.454232 | 0.0 | 0.00 | 0.0 | 1.00 | 1.0 |
| PitchSatisfactionScore | 4888.0 | 3.078151 | 1.365792 | 1.0 | 2.00 | 3.0 | 4.00 | 5.0 |
| OwnCar | 4888.0 | 0.620295 | 0.485363 | 0.0 | 0.00 | 1.0 | 1.00 | 1.0 |
| NumberOfChildrenVisiting | 4888.0 | 1.184738 | 0.852323 | 0.0 | 1.00 | 1.0 | 2.00 | 3.0 |
| MonthlyIncome | 4888.0 | 23545.010434 | 5026.428314 | 16009.0 | 20486.75 | 22596.5 | 25407.75 | 38677.0 |
Age¶
descviolinbox(df, 'Age')
count 4888.000000 mean 37.557488 std 9.109545 min 18.000000 25% 31.000000 50% 37.000000 75% 43.000000 max 61.000000 Name: Age, dtype: float64
labeldisplot(df, 'Age')
bounds(df, 'Age')
Q1: 31.0 Q3: 43.0 IQR: 12.0 lower limit: 13.0 upper limit: 61.0
(13.0, 61.0, 31.0, 43.0)
numericdata(df, 'Age')
Mean: 37.557488 Median: 37.0 Mode: 37.0 Range: 12.0
Observations¶
Mean is 37.5 median is 37, mode is also 37 and the IQR is 12
- Median and mean are close, with the mean slightly to the right
- There is a slight right skew to the data, but not enough to fix with a log scale down the road
- We will probably want to scale our data, given that it has higher values than most of our data
- Interesting thing that might require looking at the dataset, some ages have double the count to those next to them, not something normally anticipated with age.
Duration of Pitch¶
descviolinbox(df, 'DurationOfPitch')
count 4888.000000 mean 15.334902 std 8.003437 min 5.000000 25% 9.000000 50% 13.000000 75% 19.000000 max 36.000000 Name: DurationOfPitch, dtype: float64
labeldisplot(df, 'DurationOfPitch')
bounds(df, 'DurationOfPitch')
Q1: 9.0 Q3: 19.0 IQR: 10.0 lower limit: -6.0 upper limit: 34.0
(-6.0, 34.0, 9.0, 19.0)
numericdata(df, 'DurationOfPitch')
Mean: 15.334902 Median: 13.0 Mode: 9.0 Range: 10.0
Observations¶
Mean is 15.3, Median is 13, Mode is 9, and the IQR is 10
- We have a pretty significant right skew in our data
- The KDE is important in this graph, because our initial thought was the Mode was 13
- Due to the long right tail, a log scale might be a good idea
- We also may want to just bin the data, as keeping this numeric might overfit the model
Monthly Income¶
descviolinbox(df, 'MonthlyIncome')
count 4888.000000 mean 23545.010434 std 5026.428314 min 16009.000000 25% 20486.750000 50% 22596.500000 75% 25407.750000 max 38677.000000 Name: MonthlyIncome, dtype: float64
labeldisplot(df, 'MonthlyIncome')
bounds(df, 'MonthlyIncome')
Q1: 20486.75 Q3: 25407.75 IQR: 4921.0 lower limit: 13105.25 upper limit: 32789.25
(13105.25, 32789.25, 20486.75, 25407.75)
numericdata(df, 'MonthlyIncome')
Mean: 23545.010434 Median: 22596.5 Mode: 22729.0 Range: 4921.0
Observations¶
Mean is \$23,545, Median is \$22,597, Mode is \$22,729 and the range is \$4,921
- This data has an extremely long right tale, and correspondingly a significant right skew
- The data is not very close to a normal distribution
- most of the right tail is flat, meaning there are roughly the same number of records
- There is a gap just below 20,000 which doesn't seem like accurate data, acquiring information for customers in this range may help us fit our model
- This seems to be a prime candidate for a log scale, as it has a long right tail, and the numeric values are significantly larger than the rest of the dataset
Prod Taken¶
labeledbarplot(df, 'ProdTaken')
Observations¶
Almost 4 times as many people did not purchase a product as those who did.
Type of Contact¶
labeledbarplot(df, 'TypeofContact')
Observations¶
Roughly double the number of people in our dataset contacted by self enquiry than Company Invited
City Tier¶
labeledbarplot(df, 'CityTier')
Observations¶
Most people do not live in a tier 2 city, and double the number of people live in Tier 1 cities than Tier 3
Occupation¶
labeledbarplot(df, 'Occupation')
Observations¶
- Most individuals are Salaried or Small Business
- Free Lancer has no significant bearing on our dataset
- Free Lancer and Large business should merge to prevent overfit, as an "Other" category
Gender¶
labeledbarplot(df, 'Gender', perc= True )
Observations¶
There are more men than women in this dataset, but it is much more representative than the data we normally see. There are roughly 1000 more men than women, or about a 60/40 split between men and women
Number of People Visiting¶
labeledbarplot(df, 'NumberOfPersonVisiting')
Observations¶
- This dataset is extremely close to a normal distribution
- 1 and 5 should be combined into an "other" category to prevent overfit, or 1+2 and 4+5 merge
Number of Followups¶
labeledbarplot(df, 'NumberOfFollowups')
Observations¶
- This again seems normally distributed
- Either 1, 2, and six should merge into "other", or more likely, one and two merge, and five and six merge.
Product Pitched¶
labeledbarplot(df, 'ProductPitched')
Observations¶
- This is not a normal distribution, with Basic and Deluxe topping the chart
- Basic, Deluxe, and Standard should remain the same while King and Super Deluxe merge into an "Other Category
Preferred Property Star¶
labeledbarplot(df, 'PreferredPropertyStar')
Observations¶
There are triple the number of people who prefer 3 star properties to 4 or 5
Marital Status¶
labeledbarplot(df, 'MaritalStatus')
Observations¶
- Married has double the number of records than other categories
- It is probbaly a good idea to simplify just to "Married" or "Unmarried"
Number of Trips¶
labeledbarplot(df, 'NumberOfTrips')
Observations¶
- This is not normally distributed, 2 and 3 have double the number of other records
- Merging categories for 6, 7, and 8 into "6 and Above" seems like a good idea to prevent overfit
Passport¶
labeledbarplot(df, 'Passport', perc = True)
Observations¶
70% of individuals in our dataset do not have a passport, which may significantly impact our sales of overseas packages
Pitch Satisfaction Score¶
labeledbarplot(df, 'PitchSatisfactionScore')
Observations¶
Our pitch satisfaction score records seem to be all fairly statistically significant and fairly normal in distribution. 3 seems to have the highest with roughly 500 records more than our other data. 2 is the lowest with a thousnad less records than 3
Car Ownership¶
labeledbarplot(df, 'OwnCar', perc =True)
Observations¶
Roughly 40% of people in our dataset did not own a car
Number of Children Visiting¶
labeledbarplot(df, 'NumberOfChildrenVisiting')
Observations¶
- Fairly normal distribution
- Combine 2 and 3 together to make 2+
- May be a good idea to merge with other datasets, such as NumberofPersonVisiting to create a simple category of if that group has children or not.
Designation¶
labeledbarplot(df, 'Designation')
Observations¶
- This is a fairly normal distribution
- I think there are too many categories for this, combining AVP, VP, and Senior Manager into just "other" makes sense, or AVP and VP can be combined into executive, and senior manager to manager
Bivariate Analysis¶
fig, ax = plt.subplots(figsize=(12,12))
sns.heatmap(df.corr(), annot = True, cmap = "mako", square = True, ax = ax)
<AxesSubplot:>
Observations¶
- Most correlation with our dependent variable is extremely weak, and finding data to strengthen this is probably needed, maybe in categorical items that don't show up here
- Passport had the strongest correlation with the dependent variable at 0.26, but it did mean people with passports were more likely to purchase a package
- Age had a fairly strong correlation with monthly income at 0.46
- Number of Followups had a fairly weak correlation with number of people visiting
- Number of Person Visiting and Number of Children Visiting had a strong correlation
- The dependent variable had a 0.13 and 0.15 negative correlation with Monthly income and age respectively
sns.pairplot(df, diag_kind="kde", hue = 'ProdTaken')
<seaborn.axisgrid.PairGrid at 0x2493a99fd30>
Observations¶
Age and Duration of pitch have an extremely loose, or no correlation with monthy lincome
Numeric Means by ProdTaken¶
df.groupby(['ProdTaken']).mean().T
| ProdTaken | 0 | 1 |
|---|---|---|
| Age | 38.223286 | 34.685870 |
| CityTier | 1.615927 | 1.819565 |
| DurationOfPitch | 14.999496 | 16.781522 |
| NumberOfPersonVisiting | 2.901714 | 2.919565 |
| NumberOfFollowups | 3.657510 | 3.942391 |
| PreferredPropertyStar | 3.540071 | 3.741304 |
| NumberOfTrips | 3.203125 | 3.268478 |
| Passport | 0.233871 | 0.536957 |
| PitchSatisfactionScore | 3.044355 | 3.223913 |
| OwnCar | 0.622984 | 0.608696 |
| NumberOfChildrenVisiting | 1.181452 | 1.198913 |
| MonthlyIncome | 23871.228831 | 22138.016304 |
Observations¶
- People who purchase the packages are younger
- In general, people who purchased the package had a 17 minute pitch
- People who purchased a package had more followups than those who didn't
- People who purchased a package preferred higher property stars
- Passport holders were more likely to purchase a package
- Those more satisfied with the pitch were more likely to purchase a package
- Car owners were slightly less likely to purchase a package
- Those that made a lower income were more likely to purchase a package
Numeric Columns By ProdTaken¶
Age¶
stackedbox(df, 'Age', 'ProdTaken')
minmeanmax(df, 'Age', 'ProdTaken')
Not Purchased Min Mean and Max: Min: 18.0 Mean: 38.223286 Max: 61.0 Purchased Min Mean and Max: Min: 18.0 Mean: 34.68587 Max: 60.0
Observations¶
Those who purchased a package were slightly younger all around, with a lower median, mean, and Q3 point than those who didn't interestingly though they had the same minimum, but those who purchased had a lower maximum
Duration of Pitch¶
stackedbox(df, 'DurationOfPitch', 'ProdTaken')
minmeanmax(df, 'DurationOfPitch', 'ProdTaken')
Not Purchased Min Mean and Max: Min: 5.0 Mean: 14.999496 Max: 36.0 Purchased Min Mean and Max: Min: 6.0 Mean: 16.781522 Max: 36.0
Observations¶
- Those that purchased packages had a longer pitch in every way except for the maximum
- Those that didn't have some outliers due to having a lower mean, median, a smaller IQR, and a lower Q3 point
- If we saw significant correlation with duration of pitch it might be worth removing those errant values, but for now we will be leaving them
Monthly Income¶
stackedbox(df, 'MonthlyIncome', 'ProdTaken')
minmeanmax(df, 'MonthlyIncome', 'ProdTaken')
Not Purchased Min Mean and Max: Min: 16051.0 Mean: 23871.228831 Max: 38677.0 Purchased Min Mean and Max: Min: 16009.0 Mean: 22138.016304 Max: 38537.0
Observations¶
- We saw above that MonthlyIncome and Prodtaken had a slight negative correlation
- Given the values we see here, with a clear lower min, Q1, mean, median, Q3, and max the outliers are causing issues with that correlation
- I do not wish to remove or cap the outliers, as they are likely accurate, but doing a log scale probably helps us build a better model with the changes to the structure likely bringing in some of those outliers
Categorical Columns by ProdTaken¶
# df.head(5)
| ProdTaken | Age | TypeofContact | CityTier | DurationOfPitch | Occupation | Gender | NumberOfPersonVisiting | NumberOfFollowups | ProductPitched | PreferredPropertyStar | MaritalStatus | NumberOfTrips | Passport | PitchSatisfactionScore | OwnCar | NumberOfChildrenVisiting | Designation | MonthlyIncome | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 41.0 | Self Enquiry | 3 | 6.0 | Salaried | Female | 3 | 3.0 | Deluxe | 3.0 | Single | 1.0 | 1 | 2 | 1 | 0.0 | Manager | 20993.0 |
| 1 | 0 | 49.0 | Company Invited | 1 | 14.0 | Salaried | Male | 3 | 4.0 | Deluxe | 4.0 | Divorced | 2.0 | 0 | 3 | 1 | 2.0 | Manager | 20130.0 |
| 2 | 1 | 37.0 | Self Enquiry | 1 | 8.0 | Free Lancer | Male | 3 | 4.0 | Basic | 3.0 | Single | 7.0 | 1 | 3 | 0 | 0.0 | Executive | 17090.0 |
| 3 | 0 | 33.0 | Company Invited | 1 | 9.0 | Salaried | Female | 2 | 3.0 | Basic | 3.0 | Divorced | 2.0 | 1 | 5 | 1 | 1.0 | Executive | 17909.0 |
| 4 | 0 | 37.0 | Self Enquiry | 1 | 8.0 | Small Business | Male | 2 | 3.0 | Basic | 4.0 | Divorced | 1.0 | 0 | 5 | 1 | 0.0 | Executive | 18468.0 |
Type of Contact¶
stackedbarplot(df, 'TypeofContact', 'ProdTaken')
ProdTaken 0 1 All TypeofContact All 3968 920 4888 Self Enquiry 2859 610 3469 Company Invited 1109 310 1419 ------------------------------------------------------------------------------------------------------------------- ProdTaken 0 1 TypeofContact Company Invited 0.781536 0.218464 Self Enquiry 0.824157 0.175843
sns.countplot(data= df, x='TypeofContact', hue = 'ProdTaken')
<AxesSubplot:xlabel='TypeofContact', ylabel='count'>
Observations¶
- Self Enquiries make up the majority of our data, and consequently share the most accepted products
- If you were invited by a company you were slightly more likely to purchase the product. with a little over 20% of customers purchasing the product
City Tier¶
stackedbarplot(df, 'CityTier', 'ProdTaken')
ProdTaken 0 1 All CityTier All 3968 920 4888 1 2670 520 3190 3 1146 354 1500 2 152 46 198 ------------------------------------------------------------------------------------------------------------------- ProdTaken 0 1 CityTier 3 0.764000 0.236000 2 0.767677 0.232323 1 0.836991 0.163009
sns.countplot(data= df, x='CityTier', hue = 'ProdTaken')
<AxesSubplot:xlabel='CityTier', ylabel='count'>
Obserations¶
- Tier 2 and 3 cities purchased products at a much higher rate than Tier 1 Cities
- Tier 1 Cities were offered many more packages than Tier 2 and 3 cities
- Tier 1 cities purchased the most packages, but Tier 3 is close with significantly fewer products offered
- tier 2 cities have very few records, so observations may be more skewed than Tier 1 or Tier 3 cities
Occupation¶
stackedbarplot(df, 'Occupation', 'ProdTaken')
ProdTaken 0 1 All Occupation All 3968 920 4888 Salaried 1954 414 2368 Small Business 1700 384 2084 Large Business 314 120 434 Free Lancer 0 2 2 ------------------------------------------------------------------------------------------------------------------- ProdTaken 0 1 Occupation Free Lancer 0.000000 1.000000 Large Business 0.723502 0.276498 Small Business 0.815739 0.184261 Salaried 0.825169 0.174831
sns.countplot(data= df, x='Occupation', hue = 'ProdTaken')
<AxesSubplot:xlabel='Occupation', ylabel='count'>
df.loc[df['Occupation'] == 'Free Lancer']
| ProdTaken | Age | TypeofContact | CityTier | DurationOfPitch | Occupation | Gender | NumberOfPersonVisiting | NumberOfFollowups | ProductPitched | PreferredPropertyStar | MaritalStatus | NumberOfTrips | Passport | PitchSatisfactionScore | OwnCar | NumberOfChildrenVisiting | Designation | MonthlyIncome | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 1 | 37.0 | Self Enquiry | 1 | 8.0 | Free Lancer | Male | 3 | 4.0 | Basic | 3.0 | Single | 7.0 | 1 | 3 | 0 | 0.0 | Executive | 17090.0 |
| 2446 | 1 | 38.0 | Self Enquiry | 1 | 9.0 | Free Lancer | Male | 4 | 5.0 | Basic | 3.0 | Single | 8.0 | 1 | 3 | 0 | 1.0 | Executive | 20768.0 |
Observations¶
- Free Lancers take 100% of the packages they are offered, but there are exactly two records for freelancers.
- Removing the Free Lancer records and/or changing the label to one of the others will help us prevent disrupting the model
- Of the other models, large businesses took 30% of the packages offered, but had the smallest count of packages bought
- Small Business and Salaried both had a close purchase rate of about 20%, but salaried were offered more packages for roughly the same count
Gender¶
stackedbarplot(df, 'Gender', 'ProdTaken')
ProdTaken 0 1 All Gender All 3968 920 4888 Male 2338 578 2916 Female 1630 342 1972 ------------------------------------------------------------------------------------------------------------------- ProdTaken 0 1 Gender Male 0.801783 0.198217 Female 0.826572 0.173428
sns.countplot(data= df, x='Gender', hue = 'ProdTaken')
<AxesSubplot:xlabel='Gender', ylabel='count'>
Observations¶
- Men purchased packages at a slightly higher rate than women
- Men were offered significantly more packages than women were
- Our resultant model may be slightly skewed to view men more favorably than women in package sales
Number of Persons Visiting¶
stackedbarplot(df, 'NumberOfPersonVisiting', 'ProdTaken')
ProdTaken 0 1 All NumberOfPersonVisiting All 3968 920 4888 3 1942 460 2402 2 1151 267 1418 4 833 193 1026 1 39 0 39 5 3 0 3 ------------------------------------------------------------------------------------------------------------------- ProdTaken 0 1 NumberOfPersonVisiting 3 0.808493 0.191507 2 0.811707 0.188293 4 0.811891 0.188109 1 1.000000 0.000000 5 1.000000 0.000000
sns.countplot(data= df, x='NumberOfPersonVisiting', hue = 'ProdTaken')
<AxesSubplot:xlabel='NumberOfPersonVisiting', ylabel='count'>
Observations¶
- 1 and 5 people visiting did not purchase any packages
- 2,3, and 4 person groups all purchased about 20% of the packages they were offered
- There are 39 rows for 1 and 3 rows for 5, removing the rows outright is a bad idea
- Grouping 1 and 2, and 4 and 5 seems like a good idea, that does cause a slight negative skew to both, but overall they should stay roughly the same
Number of Followups¶
stackedbarplot(df, 'NumberOfFollowups', 'ProdTaken')
ProdTaken 0 1 All NumberOfFollowups All 3968 920 4888 4.0 1726 387 2113 3.0 1222 244 1466 5.0 577 191 768 6.0 82 54 136 2.0 205 24 229 1.0 156 20 176 ------------------------------------------------------------------------------------------------------------------- ProdTaken 0 1 NumberOfFollowups 6.0 0.602941 0.397059 5.0 0.751302 0.248698 4.0 0.816848 0.183152 3.0 0.833561 0.166439 1.0 0.886364 0.113636 2.0 0.895197 0.104803
sns.countplot(data= df, x='NumberOfFollowups', hue = 'ProdTaken')
<AxesSubplot:xlabel='NumberOfFollowups', ylabel='count'>
Observations¶
- Six had the highest percentage of people purchase a package at roughly 40%
- One and two had the lowest percentage of takers with around 10% each
- There is a pretty significant correlation between number of followups and ProdTaken
Product Pitched¶
stackedbarplot(df, 'ProductPitched', 'ProdTaken')
ProdTaken 0 1 All ProductPitched All 3968 920 4888 Basic 1290 552 1842 Deluxe 1528 204 1732 Standard 618 124 742 King 210 20 230 Super Deluxe 322 20 342 ------------------------------------------------------------------------------------------------------------------- ProdTaken 0 1 ProductPitched Basic 0.700326 0.299674 Standard 0.832884 0.167116 Deluxe 0.882217 0.117783 King 0.913043 0.086957 Super Deluxe 0.941520 0.058480