A fictional EdTech company is focusing on accelerating its growth by increasing the number of enrolled users.
Therefore, I was asked to analyze various aspects of customer acquisition to see the status of new users’ growth in your company. The dataset was obtained from kaggle
The dataset contains 5 CSV files containg:
lead_basic_details
: Contains the basic details of leadssales_managers_assigned_leads_details
: Contains the details of the senior and junior sales managers and their assigned leads.leads_interaction_details
: Contains the details of call interactions of junior sales managers with the leads.Lead interaction detailsleads_demo_watched_details
: Contains the details of the demo session watched by the leads.leads_reasons_for_no_interest
: Contains the details of the reasons given by the leads for their lack of interest.Before analyzing, I explored the data and cleaned the data. Here is a link to a notebook where I cleaned and prepared the data for analysis.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
base_color = sns.color_palette()[0]
I came across outliers, missing values and duplicate rows when analyzing the data. I have attached my notebook where I go through the process of cleaning and wrangling the data.
leads_df = pd.read_csv('clean/leads_basic_details_clean.csv')
interactions_df = pd.read_csv('clean/leads_interaction_details_clean.csv')
demos_df = pd.read_csv('clean/leads_demo_watched_details_clean.csv')
no_interest_df = pd.read_csv('clean/leads_reasons_for_no_interest_clean.csv')
def plot_bar_chart(df,col,xlabel,y=False,ninety=True):
order = df[col].value_counts().index
if y:
sns.countplot(y=col,data=df,order=order,color=base_color)
plt.ylabel(xlabel)
plt.xlabel("Number of leads")
else:
sns.countplot(x=col,data=df,order=order,color=base_color)
plt.xlabel(xlabel)
plt.ylabel("Number of leads")
if ninety:
plt.xticks(rotation=90);
lead_end = no_interest_df.query("stage_dropped == 'lead'")
plot_bar_chart(lead_end,'reason','Reasons')
plt.title('Reasons for not being interested in the demo');
Leads gave the reason of preferring offline classes and affordability as to why they weren't interested in demoing.
awareness_end = no_interest_df.query("stage_dropped == 'awareness'")
plot_bar_chart(awareness_end,'reason','Reasons')
plt.title('Reasons for not being interested in considering the service');
consider_end = no_interest_df.query("stage_dropped == 'consider'")
plot_bar_chart(consider_end,'reason','Reasons')
plt.title('Reasons for not being interested for conversion');
Mostly the leads cited affordability, preferring offline classes and not being in interested in the subject.
afford = no_interest_df[no_interest_df['reason']=="Can't afford"]
afford_details = pd.merge(afford,leads_df,on='lead_id')
plot_bar_chart(afford_details,'current_education','Current Education',y=True)
plt.title("Current Education for leads who cited affordability");
plt.hist(x='age',data=afford_details,bins=6)
plt.xlabel('Age')
plt.ylabel('Count')
plt.title("Distribution of age of leads who cited affordability");
Majority of the leads who cited affordability are aged between 22-25 and are currently looking for a job. Maybe the parents can come in and support
parent_count = afford_details['parent_occupation'].value_counts()
plt.pie(parent_count,labels=parent_count.index,autopct=lambda p : '{:.2f}% ({:,.0f})'.format(p,p * sum(parent_count)/100))
plt.title("Parent Occupation of leads who dropped out due to affordability");
The parents are working as goverment employees or an IT employee or in business.
offline_class = no_interest_df[no_interest_df['reason']=='Wants offline classes']
offline_class_details = pd.merge(leads_df,offline_class,on='lead_id')
plot_bar_chart(offline_class_details,'current_city','Current City',ninety=False)
plt.title("City of residence of leads who preffer offline classes ");
Leads who preffered offline classes were mostly from Visakhapatnam, Kochi and Mumbai. The company could set up centres at these locations if they wanted to better accomodate the clients, or.....
plt.figure(figsize=(8,6))
plot_bar_chart(leads_df,'current_city',"Current City",ninety=False)
plt.title('Current City of residence of all the leads')
Text(0.5, 1.0, 'Current City of residence of all the leads')
or set up at Visakhapatnam, Hyderabad and Kochi if they are looking at the city of residence of all the leads.
no_interest_domain = no_interest_df[no_interest_df['reason']=='Not Interested in domain']
no_interest_domain_details = pd.merge(leads_df,no_interest_domain,on='lead_id')
plot_bar_chart(no_interest_domain_details,'current_education',"Current Education",y=True)
plt.title("Current Education of leads who dropped out because \nthe domain wasn't interesting");
Overwhelming number of the leads are Bachelor of Technology students. The company could perform a survey on B.Tech students to see what subjects are interesting for students and use that to tailor make new courses.
converted_leads_id = interactions_df[interactions_df['lead_stage']=='conversion']['lead_id'].unique()
print(f"{round((converted_leads_id.shape[0] / leads_df.shape[0])*100)}% of leads are successfully converted")
18% of leads are successfully converted
(no_interest_df.stage_dropped.value_counts() / leads_df.shape[0]) * 100
lead 45.810056 awareness 22.067039 consider 14.245810 Name: stage_dropped, dtype: float64
45% of leads drop out at the earliest stage. Remember that the most cited reason for this stage was the preference of offline classes and at the later customer acquistion stages was affordability. If the company could set up centres to accomodate those clients, engage the parents during the customer acquisition process, introduce interesting subjects for B. Tech students or introduces a more affordable pricing model, there will be fewer lead attrition numbers happening at the lowest stage.
dropped_leads_demo = pd.merge(demos_df,no_interest_df,on='lead_id')
converted_leads = interactions_df[interactions_df['lead_stage']=='conversion'][['lead_id']]
converted_leads['stage_dropped'] = "Conversion"
converted_leads_demos = pd.merge(demos_df,converted_leads,on='lead_id')
demo_x = pd.concat([dropped_leads_demo,converted_leads_demos])
plt.figure(figsize=(10,6))
sns.boxplot(x='stage_dropped',y='watched_percentage',data=demo_x,color=base_color)
labels = [
"Dropped at awareness",
"Dropped at consideration",
"Successfully Converted"
]
plt.xticks([0,1,2],labels)
plt.ylabel('Watch Percentage')
plt.xlabel('Stage status')
plt.title("Demo Video Engagement");
Leads who dropped out at the consideration stage had an higher average watch percentage but few of those leads watched more than 75% as compared to the group who dropped out at the awareness stage. Majority of the leads who were successfully converted rarely ever watched the entirety of the demo.
sns.boxplot(x='language',y='watched_percentage',data=demos_df,color=base_color)
plt.xlabel("Demo Language")
plt.ylabel("Watched Percentage")
plt.title("Demo Video Engagement");
English had an overall better engagement as compared to the other languages.
dropped_leads_details = pd.merge(leads_df,no_interest_df,on='lead_id')
converted_leads = interactions_df[interactions_df['lead_stage']=='conversion'][['lead_id']]
converted_leads['converted'] = "Yes"
dropped_leads_details['converted'] = "No"
converted_leads_details = pd.merge(leads_df,converted_leads,on='lead_id')
details_x = pd.concat([dropped_leads_details,converted_leads_details])
plt.figure(figsize=(10,6))
details_x_order = details_x['lead_gen_source'].value_counts().index
sns.countplot(x='lead_gen_source',hue='converted',data=details_x,hue_order=['Yes','No'],order=details_x_order)
plt.ylabel('Number of leads')
plt.xlabel('Source of Lead')
plt.title("Source from which converted leads were generated from");
Even though Social Media, SEO and User Referrals were the sources with the leads generated, Email Marketing has the most successfully converted leads even though it was the lowest generating source for all the leads. followed by Social Media and SEO.
gender_count = converted_leads_details['gender'].value_counts()
plt.pie(gender_count,labels=gender_count.index,autopct=lambda p : '{:.2f}% ({:,.0f})'.format(p,p * sum(gender_count)/100))
plt.title("Gender Composition of Successful Converted leads");
67% of the converted leads are females.