Customer Churn Analysis of a Telecommunications company in California from April to June 2022¶

Project by:
GLADYS LACIA
Freelancer, Professional Website at GitHub: freelancergladyslacia.github.io
CONTENT WRITER | GENERAL VIRTUAL ASSISTANT | DATA ANALYST
onbizgladyslacia@gmail.com | linkedin.com/in/freelancergladyslacia | facebook.com/onbizgladyslacia
Upwork | OLJ
Resume for Data Analyst role: freelancergladyslacia.github.io/niche/data-analyst

Rationale¶

This project aims to perform an in-depth Exploratory Data Analysis (EDA) and visualization of a telecommunications company in California, from April to June 2022. Using Python libraries such as Pandas, Matplotlib, Seaborn, and Scikit-learn, it explores five customer churn analyses based on a dataset of 7,043 customer records with 11 features, including account information, billing and payment, services and usage, and marketing and promotions. Data preprocessing involved standardizing numerical features, one-hot encoding categorical features, and combining preprocessing and model into a pipeline.

  1. Churn Rate by Contract Type using Data Loading and Cleaning, Grouping and Calculation, and Churn Rate Calculation. This determines how customer churn rates differ based on their contract type (month-to-month, one year, or two year). A bar chart visually compares the churn rates across different contract types. The data is grouped by contract type, and churn rates are calculated. Month-to-month contract customers had a churn rate of 46.75%, compared to only 11.27% and 2.83% for one-year and two-year contracts, respectively.
  2. Top Churn Reasons by Customer Segment using Data Segmentation and Churn Reason Analysis. This identifies the primary reasons why customers churn, segmented by characteristics like age group, tenure, and marital status. Bar charts display the top churn reasons by segment. The data is filtered and grouped by customer attributes and churn reasons. Younger, single customers cited network issues as top churn drivers, while older, married customers prioritized cost and billing clarity.
  3. Churn Rate by Internet Type & Streaming Usage using Data Preparation and Churn Rate Calculation. This compares churn behavior based on internet service type and streaming usage, identifying higher-risk customer segments. Stacked bar charts visualize churn and retention counts by internet and streaming service. Customers using fiber optic internet have a churn rate of 17.33%, significantly higher than the overall average. Similarly, customers who use streaming services include TV (11.56%), movies (11.60%), or music (11.45%) also exhibit higher churn rates compared to the average.
  4. Tenure in Months vs. Churn Category using Data Filtering, Grouping, and Median Calculation. This examines the relationship between customer tenure and different churn categories (e.g., voluntary, involuntary, dissatisfaction, pricing). A box plot visualizes tenure distribution across churn categories. Customers who churn due to price tend to leave earlier, with a median tenure of only 5.5 months, compared to those who churn due to competitor offerings (median tenure: 8.0 months) or dissatisfaction (median tenure: 9.0 months). Furthermore, certain churn categories are more prevalent among different tenure groups. For instance, long-tenure churners (37+ months) are more associated with “Other” reasons (median tenure: 19.0 months) and “Attitude” (median tenure: 15.0 months), while short-tenure churners (0–12 months) are more likely to leave due to “Price,” “Competitor,” or “Dissatisfaction,” as indicated by their lower median tenures.
  5. Churn Category Prediction with Logistic Regression using Feature Selection and Encoding, Train-Test Split (80/20), Model Building and Evaluation. A predictive model was developed to classify churn categories based on customer behavior, enabling early identification of churn risk factors. The preprocessing pipeline included feature selection, one-hot encoding for categorical variables, and scaling for numerical features. After splitting the data into training and test sets, a logistic regression model was trained and evaluated.

Model performance was assessed using a confusion matrix and classification report, with metrics including precision, recall, and F1-score for each churn category. Although the overall accuracy reached 22.63%, which is low for a multi-class classification task, the model provided valuable insights into churn drivers. Key predictive features included tenure, monthly charges, contract type, payment method, and promotional offers.

To improve model performance, future solutions include addressing class imbalance by using techniques like SMOTE or assigning class weights, and potentially merging low-frequency churn categories. Furthermore, enhancing the feature set with new variables such as interaction terms, refund ratios, and customer interaction history could provide valuable insights. Exploring alternative models, including tree-based methods like XGBoost or CatBoost, and experimenting with different multi-class strategies like Gradient Boosted Trees or hierarchical classification could also significantly enhance predictive power. These improvements would enable the model to better capture the complexity of customer churn behavior and ultimately lead to more accurate and actionable predictions.

In [22]:
import pandas as pd
from google.colab import drive

drive.mount('/content/drive', force_remount=True)
file_path = '/content/drive/MyDrive/Portfolios/Data-Analyst/Gladys-Lacia-for-Data-Analyst-role/datasets/dataset-Customer-Churn-Telco-California-from-April-to-June-2022.csv'

df = pd.read_csv(file_path, low_memory=False)
df.info()
Mounted at /content/drive
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 38 columns):
 #   Column                             Non-Null Count  Dtype  
---  ------                             --------------  -----  
 0   Customer ID                        7043 non-null   object 
 1   Gender                             7043 non-null   object 
 2   Age                                7043 non-null   int64  
 3   Married                            7043 non-null   object 
 4   Number of Dependents               7043 non-null   int64  
 5   City                               7043 non-null   object 
 6   Zip Code                           7043 non-null   int64  
 7   Latitude                           7043 non-null   float64
 8   Longitude                          7043 non-null   float64
 9   Number of Referrals                7043 non-null   int64  
 10  Tenure in Months                   7043 non-null   int64  
 11  Offer                              3166 non-null   object 
 12  Phone Service                      7043 non-null   object 
 13  Avg Monthly Long Distance Charges  6361 non-null   float64
 14  Multiple Lines                     6361 non-null   object 
 15  Internet Service                   7043 non-null   object 
 16  Internet Type                      5517 non-null   object 
 17  Avg Monthly GB Download            5517 non-null   float64
 18  Online Security                    5517 non-null   object 
 19  Online Backup                      5517 non-null   object 
 20  Device Protection Plan             5517 non-null   object 
 21  Premium Tech Support               5517 non-null   object 
 22  Streaming TV                       5517 non-null   object 
 23  Streaming Movies                   5517 non-null   object 
 24  Streaming Music                    5517 non-null   object 
 25  Unlimited Data                     5517 non-null   object 
 26  Contract                           7043 non-null   object 
 27  Paperless Billing                  7043 non-null   object 
 28  Payment Method                     7043 non-null   object 
 29  Monthly Charge                     7043 non-null   float64
 30  Total Charges                      7043 non-null   float64
 31  Total Refunds                      7043 non-null   float64
 32  Total Extra Data Charges           7043 non-null   int64  
 33  Total Long Distance Charges        7043 non-null   float64
 34  Total Revenue                      7043 non-null   float64
 35  Customer Status                    7043 non-null   object 
 36  Churn Category                     1869 non-null   object 
 37  Churn Reason                       1869 non-null   object 
dtypes: float64(9), int64(6), object(23)
memory usage: 2.0+ MB
In [23]:
df = df.dropna(subset=['Churn Reason'])
df = df.dropna(subset=['Offer'])
df = df.dropna(subset=['Internet Type'])
df = df.dropna(subset=['Avg Monthly Long Distance Charges'])
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 681 entries, 2 to 7039
Data columns (total 38 columns):
 #   Column                             Non-Null Count  Dtype  
---  ------                             --------------  -----  
 0   Customer ID                        681 non-null    object 
 1   Gender                             681 non-null    object 
 2   Age                                681 non-null    int64  
 3   Married                            681 non-null    object 
 4   Number of Dependents               681 non-null    int64  
 5   City                               681 non-null    object 
 6   Zip Code                           681 non-null    int64  
 7   Latitude                           681 non-null    float64
 8   Longitude                          681 non-null    float64
 9   Number of Referrals                681 non-null    int64  
 10  Tenure in Months                   681 non-null    int64  
 11  Offer                              681 non-null    object 
 12  Phone Service                      681 non-null    object 
 13  Avg Monthly Long Distance Charges  681 non-null    float64
 14  Multiple Lines                     681 non-null    object 
 15  Internet Service                   681 non-null    object 
 16  Internet Type                      681 non-null    object 
 17  Avg Monthly GB Download            681 non-null    float64
 18  Online Security                    681 non-null    object 
 19  Online Backup                      681 non-null    object 
 20  Device Protection Plan             681 non-null    object 
 21  Premium Tech Support               681 non-null    object 
 22  Streaming TV                       681 non-null    object 
 23  Streaming Movies                   681 non-null    object 
 24  Streaming Music                    681 non-null    object 
 25  Unlimited Data                     681 non-null    object 
 26  Contract                           681 non-null    object 
 27  Paperless Billing                  681 non-null    object 
 28  Payment Method                     681 non-null    object 
 29  Monthly Charge                     681 non-null    float64
 30  Total Charges                      681 non-null    float64
 31  Total Refunds                      681 non-null    float64
 32  Total Extra Data Charges           681 non-null    int64  
 33  Total Long Distance Charges        681 non-null    float64
 34  Total Revenue                      681 non-null    float64
 35  Customer Status                    681 non-null    object 
 36  Churn Category                     681 non-null    object 
 37  Churn Reason                       681 non-null    object 
dtypes: float64(9), int64(6), object(23)
memory usage: 207.5+ KB

Analysis 1: Churn Rate by Contract Type¶

Objective: To evaluate how customer churn varies across different contract types (Month-to-Month, One Year, Two Year) and identify which contract type is most associated with churn.

Question: Which contract type has the highest churn rate, and by how much compared to others?

Insights¶

  • Month-to-Month contracts have the highest churn rate.
  • One Year and Two Year contracts exhibit lower churn rates.

Recommendations¶

  • Offer discounted pricing, loyalty rewards, or exclusive benefits to encourage customers to choose 1-year or 2-year plans over month-to-month options.
  • Consider bundling services or features only available for longer-term subscribers.
  • Conduct surveys or interviews to understand why short-term users churn more often, such as service quality, pricing, support, or value perception.
  • Use feedback to improve the onboarding and user experience for month-to-month customers.
  • Increase engagement through regular check-ins, educational content, and proactive customer support.
  • Offer 24/7 support or dedicated account managers for short-term users at risk of churn.
  • Consider offering a flexible hybrid plan, such as 6-month contracts or quarterly billing, to strike a balance between commitment and flexibility.
In [24]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from google.colab import drive

drive.mount('/content/drive', force_remount=True)
file_path = '/content/drive/MyDrive/Portfolios/Data-Analyst/Gladys-Lacia-for-Data-Analyst-role/datasets/dataset-Customer-Churn-Telco-California-from-April-to-June-2022.csv'

df = pd.read_csv(file_path, low_memory=False)

contract_counts_before = df['Contract'].value_counts()
print("Contract counts before data cleaning:\n", contract_counts_before)
total_customers_by_contract = contract_counts_before.copy()

df = df.dropna(subset=['Churn Reason', 'Offer', 'Internet Type', 'Avg Monthly Long Distance Charges'])

contract_counts_after = df['Contract'].value_counts()
print("\nContract counts after data cleaning:\n", contract_counts_after)
churned_df = df[df['Customer Status'] == 'Churned']
churned_customers_by_contract = churned_df['Contract'].value_counts()

churn_rate = (churned_customers_by_contract / total_customers_by_contract * 100).round(2)
churn_rate_by_contract = pd.DataFrame({
    'Total Customers': total_customers_by_contract,
    'Churned Customers': churned_customers_by_contract,
    'Churn Rate (%)': churn_rate
}).fillna(0).astype({'Churned Customers': 'int'})
print("\nChurn Rate by Contract Type:\n", churn_rate_by_contract)

plt.figure(figsize=(10, 6))
sns.set_style("whitegrid")
sns.barplot(x=churn_rate_by_contract.index,
            y='Churned Customers',
            data=churn_rate_by_contract,
            color='limegreen',
            label='Churned Customers')
sns.barplot(x=churn_rate_by_contract.index,
            y='Total Customers',
            data=churn_rate_by_contract,
            color='lime',
            alpha=0.5,
            label='Total Customers')
for i, (index, row) in enumerate(churn_rate_by_contract.iterrows()):
    plt.text(i, row['Churned Customers'] + 50, f"{row['Churn Rate (%)']}%",
             ha='center', va='bottom', fontsize=11, fontweight='bold', color='black')

plt.title('Churn Rate by Contract Type', fontsize=14)
plt.ylabel('Number of Customers')
plt.xlabel('Contract Type')
plt.legend()
plt.tight_layout()
plt.show()
Mounted at /content/drive
Contract counts before data cleaning:
 Contract
Month-to-Month    3610
Two Year          1883
One Year          1550
Name: count, dtype: int64

Contract counts after data cleaning:
 Contract
Month-to-Month    597
One Year           65
Two Year           19
Name: count, dtype: int64

Churn Rate by Contract Type:
                 Total Customers  Churned Customers  Churn Rate (%)
Contract                                                          
Month-to-Month             3610                597           16.54
One Year                   1550                 65            4.19
Two Year                   1883                 19            1.01
No description has been provided for this image

Analysis 2: Top Churn Reasons by Customer Segment¶

Objective: To identify the top reasons why customers are churning, segmented by customer characteristics such as age group, tenure, and marital status.

Question: What are the most common churn reasons for different types of customers?

Insights¶

  • The top 3 reasons among almost all customers, regardless of age, tenure, and marital status, are:
    1. Competitor made better offer
    2. Competitor had better devices
    3. Attitude of support person
  • The concern of customers aged 18 to 30 is network reliability.
  • Most customers aged 31 to 60 are looking for offers with higher download speeds.
  • Customers aged 61 and above tend to stop using the service due to high prices and product dissatisfaction.
  • Customers with 0 to 12 months of tenure tend to leave due to the attitude of the service provider, better data offers from competitors, and relocation to another place.
  • Customers with 37+ months of tenure tend to stop using the service due to extra data charges, lack of affordable download/upload speeds, long-distance charges, and service dissatisfaction.
  • The concern of non-married customers is that the price is too high, while the concern of married customers is product dissatisfaction.

Recommendations¶

  • Regularly benchmark your offers and devices against competitors. Launch loyalty-based upgrade programs and offer exclusive bundles or limited-time promotions to counteract competitive offers.
  • Invest in customer service training focused on empathy and resolution speed. Implement feedback loops (CSAT or NPS) after every interaction and reward top-performing support agents.
  • Improve network infrastructure in areas with high youth density and communicate improvements transparently.
  • Introduce tiered speed packages and upgrade incentives with transparent performance benchmarks.
  • Design senior-friendly plans with lower pricing, no hidden fees, and simple features. Provide dedicated senior support reps.
  • Offer relocation-friendly packages, onboarding rewards, and assign a dedicated onboarding support rep for new users.
  • Roll out loyalty discounts, waive select long-distance/extra data fees, and solicit feedback for long-time users via surveys or interviews to uncover hidden dissatisfaction drivers.
  • Provide flexible pricing plans with customizable options for singles or individuals.
  • Offer family-centric plans with add-on features, and improve product usability through bundled tutorials, user guides, or home consultations.
In [25]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from google.colab import drive

drive.mount('/content/drive', force_remount=True)
file_path = '/content/drive/MyDrive/Portfolios/Data-Analyst/Gladys-Lacia-for-Data-Analyst-role/datasets/dataset-Customer-Churn-Telco-California-from-April-to-June-2022.csv'
df = pd.read_csv(file_path, low_memory=False)
df = df.dropna(subset=['Churn Reason', 'Offer', 'Internet Type', 'Avg Monthly Long Distance Charges'])

churned_df = df[df['Customer Status'] == 'Churned'].copy()

bins_age = [0, 30, 45, 60, 100]
labels_age = ['18-30', '31-45', '46-60', '61+']
churned_df['Age Group'] = pd.cut(churned_df['Age'], bins=bins_age, labels=labels_age)

bins_tenure = [0, 12, 24, 36, 100]
labels_tenure = ['0-12', '13-24', '25-36', '37+']
churned_df['Tenure Group'] = pd.cut(churned_df['Tenure in Months'], bins=bins_tenure, labels=labels_tenure)

# ========== 1. Top Churn Reasons by Age Group ==========
age_reason = churned_df.groupby(['Age Group', 'Churn Reason'], observed=True).size().reset_index(name='Count')
top_age_reasons = age_reason.sort_values(['Age Group', 'Count'], ascending=[True, False])

plt.figure(figsize=(12, 6))
sns.barplot(data=top_age_reasons, x='Age Group', y='Count', hue='Churn Reason', dodge=False)
plt.title('Top Churn Reasons by Age Group')
plt.ylabel('Number of Customers')
plt.xticks(rotation=0)
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
plt.show()
Mounted at /content/drive
No description has been provided for this image
In [26]:
age_group_18_30_df = df[(df['Age'] >= 18) & (df['Age'] <= 30)]
churn_reasons_18_30 = age_group_18_30_df['Churn Reason'].value_counts()

total_churn_reasons_18_30 = churn_reasons_18_30.sum()
print("\nTotal churn reasons (customers) in age group 18–30:", total_churn_reasons_18_30)
top_reasons_18_30 = churn_reasons_18_30.head(100)
print("Top churn reasons for customers aged 18-30:\n", top_reasons_18_30)
Total churn reasons (customers) in age group 18–30: 143
Top churn reasons for customers aged 18-30:
 Churn Reason
Competitor made better offer                 23
Competitor had better devices                19
Attitude of support person                   18
Competitor offered more data                 17
Attitude of service provider                 16
Network reliability                          10
Competitor offered higher download speeds     7
Don't know                                    6
Service dissatisfaction                       5
Lack of affordable download/upload speed      3
Price too high                                3
Product dissatisfaction                       3
Moved                                         3
Long distance charges                         2
Extra data charges                            2
Lack of self-service on Website               2
Limited range of services                     1
Poor expertise of phone support               1
Deceased                                      1
Poor expertise of online support              1
Name: count, dtype: int64
In [27]:
age_group_31_45_df = df[(df['Age'] >= 31) & (df['Age'] <= 45)]
churn_reasons_31_45 = age_group_31_45_df['Churn Reason'].value_counts()

total_churn_reasons_31_45 = churn_reasons_31_45.sum()
print("\nTotal churn reasons (customers) in age group 31-45:", total_churn_reasons_31_45)
top_reasons_31_45 = churn_reasons_31_45.head(100)
print("Top churn reasons for customers aged 31-45:\n", top_reasons_31_45)
Total churn reasons (customers) in age group 31-45: 140
Top churn reasons for customers aged 31-45:
 Churn Reason
Competitor had better devices                31
Competitor made better offer                 17
Attitude of support person                   16
Competitor offered higher download speeds    11
Don't know                                   10
Competitor offered more data                  9
Attitude of service provider                  7
Moved                                         6
Product dissatisfaction                       6
Service dissatisfaction                       5
Network reliability                           5
Lack of self-service on Website               5
Lack of affordable download/upload speed      3
Extra data charges                            2
Limited range of services                     2
Long distance charges                         2
Poor expertise of online support              1
Poor expertise of phone support               1
Price too high                                1
Name: count, dtype: int64
In [28]:
age_group_46_60_df = df[(df['Age'] >= 46) & (df['Age'] <= 60)]
churn_reasons_46_60 = age_group_46_60_df['Churn Reason'].value_counts()

total_churn_reasons_46_60 = churn_reasons_46_60.sum()
print("\nTotal churn reasons (customers) in age group 46-60:", total_churn_reasons_46_60)
top_reasons_46_60 = churn_reasons_46_60.head(100)
print("Top churn reasons for customers aged 46-60:\n", top_reasons_46_60)
Total churn reasons (customers) in age group 46-60: 179
Top churn reasons for customers aged 46-60:
 Churn Reason
Competitor had better devices                33
Competitor made better offer                 23
Attitude of support person                   22
Competitor offered higher download speeds    16
Competitor offered more data                 14
Attitude of service provider                 11
Price too high                                9
Network reliability                           8
Moved                                         8
Don't know                                    8
Product dissatisfaction                       6
Lack of affordable download/upload speed      4
Extra data charges                            4
Poor expertise of online support              4
Service dissatisfaction                       4
Long distance charges                         2
Lack of self-service on Website               2
Limited range of services                     1
Name: count, dtype: int64
In [29]:
age_group_61plus_df = df[df['Age'] >= 61]
churn_reasons_61plus = age_group_61plus_df['Churn Reason'].value_counts()

total_churn_reasons_61plus = churn_reasons_61plus.sum()
print("\nTotal churn reasons (customers) in age group 61+:", total_churn_reasons_61plus)
top_reasons_61plus = churn_reasons_61plus.head(100)
print("Top churn reasons for customers aged 61+:\n", top_reasons_61plus)
Total churn reasons (customers) in age group 61+: 219
Top churn reasons for customers aged 61+:
 Churn Reason
Competitor made better offer                 59
Competitor had better devices                31
Attitude of support person                   22
Don't know                                   14
Price too high                               13
Product dissatisfaction                      10
Competitor offered more data                 10
Attitude of service provider                 10
Extra data charges                            9
Limited range of services                     7
Long distance charges                         7
Competitor offered higher download speeds     6
Service dissatisfaction                       6
Network reliability                           6
Lack of affordable download/upload speed      5
Moved                                         2
Poor expertise of online support              2
Name: count, dtype: int64
In [30]:
# ========== 2. Top Churn Reasons by Tenure Group ==========
tenure_reason = churned_df.groupby(['Tenure Group', 'Churn Reason'], observed=True).size().reset_index(name='Count')
top_tenure_reasons = tenure_reason.sort_values(['Tenure Group', 'Count'], ascending=[True, False])

plt.figure(figsize=(12, 6))
sns.barplot(data=top_tenure_reasons, x='Tenure Group', y='Count', hue='Churn Reason', dodge=False)
plt.title('Top Churn Reasons by Tenure in Months')
plt.ylabel('Number of Customers')
plt.xticks(rotation=0)
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
plt.show()
No description has been provided for this image
In [31]:
tenure_group_0_12_df = df[(df['Tenure in Months'] >= 0) & (df['Tenure in Months'] <= 12)]
churn_reasons_0_12 = tenure_group_0_12_df['Churn Reason'].value_counts()

total_churn_reasons_0_12 = churn_reasons_0_12.sum()
print("\nTotal churn reasons (customers) in tenure group 0-12:", total_churn_reasons_0_12)
top_reasons_0_12 = churn_reasons_0_12.head(100)
print("Top churn reasons for customers (0-12 months):\n", top_reasons_0_12)
Total churn reasons (customers) in tenure group 0-12: 372
Top churn reasons for customers (0-12 months):
 Churn Reason
Competitor made better offer                 74
Competitor had better devices                69
Attitude of support person                   46
Attitude of service provider                 28
Competitor offered more data                 22
Don't know                                   22
Competitor offered higher download speeds    21
Network reliability                          17
Moved                                        10
Product dissatisfaction                       9
Price too high                                9
Lack of affordable download/upload speed      8
Service dissatisfaction                       8
Extra data charges                            8
Long distance charges                         5
Lack of self-service on Website               5
Limited range of services                     4
Poor expertise of online support              4
Poor expertise of phone support               2
Deceased                                      1
Name: count, dtype: int64
In [32]:
tenure_group_13_24_df = df[(df['Tenure in Months'] >= 13) & (df['Tenure in Months'] <= 24)]
churn_reasons_13_24 = tenure_group_13_24_df['Churn Reason'].value_counts()

total_churn_reasons_13_24 = churn_reasons_13_24.sum()
print("\nTotal churn reasons (customers) in tenure group 13-24:", total_churn_reasons_13_24)
top_reasons_13_24 = churn_reasons_13_24.head(100)
print("Top churn reasons for customers (13-24 months):\n", top_reasons_13_24)
Total churn reasons (customers) in tenure group 13-24: 106
Top churn reasons for customers (13-24 months):
 Churn Reason
Competitor had better devices                20
Attitude of support person                   15
Competitor made better offer                 12
Competitor offered more data                  7
Don't know                                    6
Service dissatisfaction                       6
Competitor offered higher download speeds     5
Attitude of service provider                  5
Network reliability                           4
Limited range of services                     4
Lack of self-service on Website               4
Product dissatisfaction                       3
Long distance charges                         3
Price too high                                3
Moved                                         3
Extra data charges                            3
Lack of affordable download/upload speed      2
Poor expertise of online support              1
Name: count, dtype: int64
In [33]:
tenure_group_25_36_df = df[(df['Tenure in Months'] >= 25) & (df['Tenure in Months'] <= 36)]
churn_reasons_25_36 = tenure_group_25_36_df['Churn Reason'].value_counts()

total_churn_reasons_25_36 = churn_reasons_25_36.sum()
print("\nTotal churn reasons (customers) in tenure group 25-36:", total_churn_reasons_25_36)
top_reasons_25_36 = churn_reasons_25_36.head(100)
print("Top churn reasons for customers (25-36 months):\n", top_reasons_25_36)
Total churn reasons (customers) in tenure group 25-36: 66
Top churn reasons for customers (25-36 months):
 Churn Reason
Competitor made better offer                 14
Attitude of support person                    9
Competitor had better devices                 7
Competitor offered higher download speeds     5
Extra data charges                            4
Competitor offered more data                  4
Price too high                                4
Don't know                                    4
Product dissatisfaction                       3
Service dissatisfaction                       3
Attitude of service provider                  2
Poor expertise of online support              2
Moved                                         2
Long distance charges                         1
Limited range of services                     1
Network reliability                           1
Name: count, dtype: int64
In [34]:
tenure_group_37plus_df = df[df['Age'] >= 37]
churn_reasons_37plus = tenure_group_37plus_df['Churn Reason'].value_counts()

total_churn_reasons_37plus = churn_reasons_37plus.sum()
print("\nTotal churn reasons (customers) in tenure group 37+:", total_churn_reasons_37plus)
top_reasons_37plus = churn_reasons_37plus.head(100)
print("Top churn reasons for customers (37+ months):\n", top_reasons_37plus)
Total churn reasons (customers) in tenure group 37+: 481
Top churn reasons for customers (37+ months):
 Churn Reason
Competitor made better offer                 89
Competitor had better devices                84
Attitude of support person                   53
Competitor offered higher download speeds    30
Competitor offered more data                 28
Don't know                                   27
Attitude of service provider                 27
Price too high                               23
Product dissatisfaction                      20
Network reliability                          18
Moved                                        14
Extra data charges                           14
Lack of affordable download/upload speed     11
Long distance charges                        11
Service dissatisfaction                      10
Limited range of services                     9
Poor expertise of online support              7
Lack of self-service on Website               5
Poor expertise of phone support               1
Name: count, dtype: int64
In [35]:
# ========== 3. Top Churn Reasons by Marital Status ==========
marital_reason = churned_df.groupby(['Married', 'Churn Reason'], observed=True).size().reset_index(name='Count')
top_marital_reasons = marital_reason.sort_values(['Married', 'Count'], ascending=[True, False])

plt.figure(figsize=(10, 6))
sns.barplot(data=top_marital_reasons, x='Married', y='Count', hue='Churn Reason', dodge=False)
plt.title('Top Churn Reasons by Marital Status')
plt.ylabel('Number of Customers')
plt.xticks(rotation=0)
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
plt.show()
No description has been provided for this image
In [36]:
married_no_df = df[df['Married'] == 'No']
churn_reasons_no = married_no_df['Churn Reason'].value_counts()

total_churn_reasons_no = churn_reasons_no.sum()
print("\nTotal churn reasons of non-married customers:", total_churn_reasons_no)
top_reasons_no = churn_reasons_no.head(100)
print("Top churn reasons for non-married customers:\n", top_reasons_no)
Total churn reasons of non-married customers: 436
Top churn reasons for non-married customers:
 Churn Reason
Competitor made better offer                 79
Competitor had better devices                74
Attitude of support person                   54
Attitude of service provider                 31
Don't know                                   29
Competitor offered more data                 28
Competitor offered higher download speeds    27
Price too high                               17
Network reliability                          16
Service dissatisfaction                      13
Moved                                        12
Lack of affordable download/upload speed     11
Product dissatisfaction                       9
Long distance charges                         8
Extra data charges                            8
Lack of self-service on Website               7
Poor expertise of online support              7
Limited range of services                     3
Poor expertise of phone support               2
Deceased                                      1
Name: count, dtype: int64
In [37]:
married_yes_df = df[df['Married'] == 'Yes']
churn_reasons_yes = married_yes_df['Churn Reason'].value_counts()

total_churn_reasons_yes = churn_reasons_yes.sum()
print("\nTotal churn reasons of married customers:", total_churn_reasons_yes)
top_reasons_yes = churn_reasons_yes.head(100)
print("Top churn reasons for married customers:\n", top_reasons_yes)
Total churn reasons of married customers: 245
Top churn reasons for married customers:
 Churn Reason
Competitor made better offer                 43
Competitor had better devices                40
Attitude of support person                   24
Competitor offered more data                 22
Product dissatisfaction                      16
Competitor offered higher download speeds    13
Attitude of service provider                 13
Network reliability                          13
Don't know                                    9
Extra data charges                            9
Price too high                                9
Limited range of services                     8
Moved                                         7
Service dissatisfaction                       7
Long distance charges                         5
Lack of affordable download/upload speed      4
Lack of self-service on Website               2
Poor expertise of online support              1
Name: count, dtype: int64

Analysis 3: Churn Rate by Internet Type & Streaming Usage¶

Objective: To compare churn behavior based on customers' internet types and their usage of streaming services, identifying which segments are more likely to churn by analyzing churn vs retained counts and rates.

Questions:

  1. How does customer churn vary across different internet service types?
  2. Is there a relationship between churn rates and usage of streaming services?

To answer the questions:

  • Customer churn significantly varies depending on the type of internet service used:
    1. Fiber Optic has the highest churn rate at 17.33%, suggesting that customers using this high-speed internet may have higher expectations or face pricing or service issues.
    2. Cable shows a moderate churn rate of 8.31%, possibly due to it being a more stable or affordable option.
    3. DSL has the lowest churn rate at 5.21%, indicating relatively better customer retention, perhaps due to lower costs or less volatility in service expectations.
  • Yes, there appears to be a consistent relationship: All three streaming services have similar churn rates — Streaming Movies (11.60%), Streaming TV (11.56%), and Streaming Music (11.45%).
  • Insights¶

    • Customers using faster or premium internet services like Fiber Optic are more prone to churn than those using traditional or slower options like DSL.
    • Streaming users are more likely to churn compared to average churn rates, suggesting that customers engaged in high-bandwidth activities like streaming may be more sensitive to service quality or pricing, especially if performance doesn't meet expectations.

    Recommendations¶

    • Ensure consistent speed and reliability by proactively monitoring and upgrading network performance in fiber-enabled areas.
    • Introduce exclusive perks such as priority support, streaming service bundles, and home Wi-Fi optimization visits.
    • Prioritize traffic shaping and load balancing during peak hours for popular streaming platforms like Netflix, YouTube, etc.
    • Provide streaming-optimized plans that guarantee minimum speeds or include unlimited bandwidth during off-peak hours.
    • Offer streaming bundles or discounts such as free months of Netflix and Prime Video, as part of higher-tier plans to justify pricing and add stickiness.
    In [38]:
    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns
    from google.colab import drive
    
    drive.mount('/content/drive', force_remount=True)
    file_path = '/content/drive/MyDrive/Portfolios/Data-Analyst/Gladys-Lacia-for-Data-Analyst-role/datasets/dataset-Customer-Churn-Telco-California-from-April-to-June-2022.csv'
    
    df = pd.read_csv(file_path, low_memory=False)
    df_raw = df.copy()
    
    # --- Combining Internet Type and Streaming Usage Counts (Before Cleaning) ---
    internet_type_counts_before = df['Internet Type'].value_counts()
    streaming_counts_before = df[['Streaming TV', 'Streaming Movies', 'Streaming Music']].apply(lambda x: x == 'Yes').sum()
    
    counts_before_df = pd.concat([internet_type_counts_before, streaming_counts_before])
    counts_before_df = counts_before_df.rename_axis('Internet Type / Streaming Usage').reset_index(name='Total Customers')
    
    print("\n--- Internet Type & Streaming Usage BEFORE Data Cleaning ---\n")
    print(counts_before_df.to_string(index=False))
    
    total_customers_by_internet_type = internet_type_counts_before.copy()
    
    # --- Data Cleaning ---
    df = df.dropna(subset=['Churn Reason', 'Offer', 'Internet Type', 'Avg Monthly Long Distance Charges'])
    churned_df = df[df['Customer Status'] == 'Churned']
    streaming_cols = ['Streaming TV', 'Streaming Movies', 'Streaming Music']
    
    # --- Combining Internet Type and Streaming Usage Counts (After Cleaning) ---
    internet_type_counts_after = df['Internet Type'].value_counts()
    streaming_counts_after = df[['Streaming TV', 'Streaming Movies', 'Streaming Music']].apply(lambda x: x == 'Yes').sum()
    
    counts_after_df = pd.concat([internet_type_counts_after, streaming_counts_after])
    counts_after_df = counts_after_df.rename_axis('Internet Type / Streaming Usage').reset_index(name='Churned Customers')
    
    print("\n--- Internet Type & Streaming Usage AFTER Data Cleaning ---\n")
    print(counts_after_df.to_string(index=False))
    
    churned_customers_by_internet_type = churned_df['Internet Type'].value_counts()
    churned_customers_by_streaming_tv = churned_df['Streaming TV'].value_counts()
    churned_customers_by_streaming_movies = churned_df['Streaming Movies'].value_counts()
    churned_customers_by_streaming_music = churned_df['Streaming Music'].value_counts()
    
    churn_rate_internet_type = (churned_customers_by_internet_type / total_customers_by_internet_type * 100).round(2)
    churn_rate_by_internet_type = pd.DataFrame({
        'Total Customers': total_customers_by_internet_type,
        'Churned Customers': churned_customers_by_internet_type,
        'Churn Rate (%)': churn_rate_internet_type
    }).fillna(0).astype({'Churned Customers': 'int'})
    
    streaming_total_before = df_raw[['Streaming TV', 'Streaming Movies', 'Streaming Music']].apply(lambda x: x == 'Yes').sum()
    streaming_churned_after = churned_df[['Streaming TV', 'Streaming Movies', 'Streaming Music']].apply(lambda x: x == 'Yes').sum()
    churn_rate_streaming_usage = ((streaming_churned_after / streaming_total_before) * 100).round(2)
    
    streaming_df = pd.DataFrame({
        'Total Customers': streaming_total_before,
        'Churned Customers': streaming_churned_after,
        'Churn Rate (%)': churn_rate_streaming_usage
    }).astype({'Churned Customers': 'int'})
    
    churn_rate_by_internet_type_and_streaming_usage = pd.concat([
        churn_rate_by_internet_type,
        streaming_df
    ])
    
    print("\nChurn Rate by Internet Type & Streaming Usage:\n", churn_rate_by_internet_type_and_streaming_usage)
    
    # --- Internet Type Data ---
    internet_type_total = df_raw['Internet Type'].value_counts()
    internet_type_churned = churned_df['Internet Type'].value_counts()
    internet_type_df = pd.DataFrame({
        'Churned': internet_type_churned,
        'Total': internet_type_total
    }).fillna(0).astype(int)
    internet_type_df['Retained'] = internet_type_df['Total'] - internet_type_df['Churned']
    internet_type_df['Churn Rate (%)'] = (internet_type_df['Churned'] / internet_type_df['Total'] * 100).round(2)
    
    # --- Streaming Usage Data ---
    streaming_total = df_raw[streaming_cols].apply(lambda x: x == 'Yes').sum()
    streaming_churned = churned_df[streaming_cols].apply(lambda x: x == 'Yes').sum()
    streaming_df = pd.DataFrame({
        'Churned': streaming_churned,
        'Total': streaming_total
    }).astype(int)
    streaming_df['Retained'] = streaming_df['Total'] - streaming_df['Churned']
    streaming_df['Churn Rate (%)'] = (streaming_df['Churned'] / streaming_df['Total'] * 100).round(2)
    
    # --- Plotting Stacked Bars ---
    fig, axes = plt.subplots(1, 2, figsize=(16, 6), sharey=True)
    
    # Internet Type Stack Bars
    internet_type_df[['Churned', 'Retained']].plot(
        kind='bar',
        stacked=True,
        ax=axes[0],
        color=['lime', 'limegreen'], rot=0
    )
    axes[0].set_title("Churn Rate by Internet Type")
    axes[0].set_ylabel("Number of Customers")
    axes[0].set_xlabel(" ")
    axes[0].legend(["Churned", "Retained"])
    
    # churn rate percentages in Internet Type
    for idx, row in internet_type_df.iterrows():
        axes[0].text(
            x=internet_type_df.index.get_loc(idx),
            y=row['Churned'] + 20,
            s=f"{row['Churn Rate (%)']}%",
            ha='center',
            fontsize=10,
            fontweight='bold'
        )
    
    # Streaming Usage Stack Bars
    streaming_df[['Churned', 'Retained']].plot(
        kind='bar',
        stacked=True,
        ax=axes[1],
        color=['yellow', 'gold'], rot=0
    )
    axes[1].set_title("Churn Rate by Streaming Usage")
    axes[1].set_xlabel(" ")
    axes[1].legend(["Churned", "Retained"])
    
    # churn rate percentages in Streaming Usage
    for idx, row in streaming_df.iterrows():
        axes[1].text(
            x=streaming_df.index.get_loc(idx),
            y=row['Churned'] + 20,
            s=f"{row['Churn Rate (%)']}%",
            ha='center',
            fontsize=10,
            fontweight='bold'
        )
    
    plt.tight_layout()
    plt.show()
    
    Mounted at /content/drive
    
    --- Internet Type & Streaming Usage BEFORE Data Cleaning ---
    
    Internet Type / Streaming Usage  Total Customers
                        Fiber Optic             3035
                                DSL             1652
                              Cable              830
                       Streaming TV             2707
                   Streaming Movies             2732
                    Streaming Music             2488
    
    --- Internet Type & Streaming Usage AFTER Data Cleaning ---
    
    Internet Type / Streaming Usage  Churned Customers
                        Fiber Optic                526
                                DSL                 86
                              Cable                 69
                       Streaming TV                313
                   Streaming Movies                317
                    Streaming Music                285
    
    Churn Rate by Internet Type & Streaming Usage:
                       Total Customers  Churned Customers  Churn Rate (%)
    Fiber Optic                  3035                526           17.33
    DSL                          1652                 86            5.21
    Cable                         830                 69            8.31
    Streaming TV                 2707                313           11.56
    Streaming Movies             2732                317           11.60
    Streaming Music              2488                285           11.45
    
    No description has been provided for this image

    Analysis 4: Tenure in Months vs. Churn Category¶

    Objective: To examine how long customers remain with the company before churning, and to determine whether tenure length varies depending on the high-level reason for leaving. The Churn Categories are Attitude, Competitor, Dissatisfaction, Other, and Price.

    Questions:

    1. Do customers who churn due to Price tend to leave earlier or later than those who churn due to Competitors or Dissatisfaction?

    Customers who churn due to Price tend to leave earlier than those who leave due to Competitors or Dissatisfaction.

    • Median tenure (Price): 5.5 months
    • Median tenure (Competitor): 8.0 months
    • Median tenure (Dissatisfaction): 9.0 months

    This indicates that price-sensitive customers make quick decisions to leave, likely due to immediate affordability concerns or didn't meet cost expectations.

    1. Are certain churn categories more common among long-tenure vs short-tenure customers?

    Yes. Based on the means and medians:

    • Long-tenure churners are more associated with categories Other (Median: 19.0, Mean: 24.0) and Attitude (Median: 15.0, Mean: 21.2). These customers stay longer and leave for personal, lifestyle, or shifting preference reasons.
    • Short-tenure churners are more associated with categories Price (Median: 5.5), Competitor (Median: 8.0), and Dissatisfaction (Median: 9.0). These groups leave early, often within their first year.
    1. Which churn categories should the company address first to improve customer retention?

    The company should prioritize addressing Price, Competitor, and Dissatisfaction churn categories:

    • Price-sensitive customers may respond to better value packages or flexible pricing.
    • Customers who leave because of Competitor may need loyalty programs or service differentiators.
    • Customers who leave because of Dissatisfaction signal a need for improved customer experience, support, or service reliability.

    Insights¶

    • Customers whose reason is in the "Other" category have the highest tenure, implying that the causes of churn here are less service-related such as relocation and life changes.
    • Customers whose reason is in the "Price" category leave fast, implying immediate dissatisfaction with affordability.
    • Customers whose reason is in the "Attitude" category having changing needs or disengagement over time rather than early dissatisfaction.
    • Customers with "Competitor" and "Dissatisfaction" reasons show different tenures in months before leaving.

    Recommendations¶

    • Conduct structured exit interviews to understand and document these non-service-related reasons.
    • Target these customers later with relocation offers, remote-friendly service options, or seasonal promotions.
    • Allow account suspensions or temporary service holds to retain them indirectly.
    • Introduce tiered plans or customizable packages to fit different budgets.
    • Offer initial loyalty incentives to reduce early dropout.
    • Identify disengaged users and re-engage them with tailored content or offers.
    • Recommend new features, upgrades, or bundles based on usage trends.
    • Address common complaints promptly to prevent dissatisfaction.
    • Provide mid-tenure rewards or benefits to reduce attraction to competitors.
    In [39]:
    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns
    
    churned_df = df[df['Customer Status'] == 'Churned']
    churned_df = churned_df.dropna(subset=['Churn Category', 'Tenure in Months'])
    
    median_values = churned_df.groupby('Churn Category')['Tenure in Months'].median()
    mean_values = churned_df.groupby('Churn Category')['Tenure in Months'].mean()
    
    
    sns.set(style="whitegrid")
    plt.figure(figsize=(10, 6))
    sns.boxplot(
        data=churned_df,
        x='Churn Category',
        y='Tenure in Months',
        color='yellow'
    )
    for i, category in enumerate(median_values.index):
        # median
        plt.scatter(i, median_values[category], color='teal', zorder=10, label='Median' if i == 0 else "")
        plt.text(i, median_values[category] + 1, f"Median: {median_values[category]:.1f}",
                 ha='center', va='bottom', color='teal', fontsize=9)
        # mean
        plt.scatter(i, mean_values[category], color='red', zorder=10, label='Mean' if i == 0 else "")
        plt.text(i, mean_values[category] + 1, f"Mean: {mean_values[category]:.1f}",
                 ha='center', va='bottom', color='red', fontsize=9)
    
    plt.title('Customer Tenure by Churn Category', fontsize=14)
    plt.xlabel('Churn Category', fontsize=12)
    plt.ylabel('Tenure in Months', fontsize=12)
    plt.xticks(rotation=0)
    plt.tight_layout()
    plt.show()
    
    No description has been provided for this image

    Analysis 5: Churn Category Prediction with Logistic Regression¶

    Objective: To predict the category of customer churn (Attitude, Competitor, Dissatisfaction, Other, or Price) based on customer behavior data such as contract type, monthly charges, and tenure, using a logistic regression model.

    Questions: Note that all answers here are based on interpretations from the feature importance analysis.

    • What are the key drivers behind each type of churn?
    1. Price-related: Monthly Charge (2nd most important overall), Offer Type (Offer B–E), and Contract Type (Month-to-Month appears to be more volatile). Customers paying higher charges without significant perceived value or discounts are more likely to leave due to price.
    2. Competitor-related: Internet Type (Fiber Optic, DSL, Cable), Streaming Services (TV, Movies, Music), Online Security, and Online Backup. Customers may feel that competitors offer better services, faster internet, and desirable streaming options.
    3. Dissatisfaction-related: Payment Method, Customer Support, and Contract Type. Poor experience with billing, customer support, and inconvenient payment methods contributes to dissatisfaction-based churn.
    4. Attitude-related: Tenure in Months, Contract Type, and Marital Status. Harder to quantify, usual habitual churners who leave regardless of experience, possibly due to lifestyle and psychological factors.
    • Can we proactively identify customers at risk of churning for a specific reason?
    • Yes, especially for Price, Competitor, and Dissatisfaction categories.
    • There are strong and measurable behavioral indicators such as:
      1. High Monthly Charges = Risk of Price Churn
      2. Use of DSL or basic streaming + long Tenure = Risk of Competitor Churn
      3. Use of Mailed Check or Bank Withdrawal + low support engagement = Risk of Dissatisfaction Churn
    • For Attitude-related churn, proactive identification is more difficult due to the vague and psychological nature of the category.
    • How should the company tailor its retention strategies depending on why customers are likely to leave? Here are the answers by churn reason:
    • Price: Offer discounted bundles or loyalty credits. Promote value for money and long-term savings such as annual contracts.
    • Competitor: Highlight service upgrades, better speed/performance. Provide exclusive offers such as 3 months of free streaming.
    • Dissatisfaction: Streamline payment methods, improve customer support accessibility, and send post-call surveys and act on complaints.
    • Attitude: Use engagement campaigns such as birthday offers and re-engagement emails. Conduct surveys to understand underlying behavior.
    • Other: Offer personalized retention calls/emails. Use exit interviews to collect more granular data for this segment.

    Insights¶

    • Most classes are poorly predicted except for "Competitor", which had 20 correct predictions, but also many misclassified.
    • The "Dissatisfaction", "Other", and "Price" categories are especially weak, with low correct predictions and high misclassifications across categories.
    • The accuracy score for multi-class classification analysis is 22.63%, struggling with multiclass separation due to overlapping feature signals and imbalance between classes.
    • Tenure and Monthly Charge are clearly the most influential drivers in predicting churn category.
    • Payment Method and Offer are also strong differentiators, reflecting customer preferences and frustrations.
    • Streaming services, Online Backup/Security, and Internet Type may contribute to specific dissatisfaction or price sensitivity.

    Recommendations¶

    • Launch loyalty rewards for long-term users; provide competitive bundles.
    • Improve service reliability and customer support; conduct exit interviews.
    • Offer discount tiers or custom billing plans to sensitive customers.
    • Collect more feedback via surveys or exit forms to refine this class.
    • Develop customer engagement programs such as exclusive perks and check-ins.

    Notes¶

    • For model performance, multi-class classification analysis using Logistic Regression within a One-vs-Rest (OvR) strategy is used. The purpose of this analysis is to predict the Churn Category (a multi-class target) among customers who have already churned based on selected customer features such as contract type, internet type, usage, and billing. It is One-vs-Rest because LogisticRegression is inherently binary. OneVsRestClassifier allows extension to multi-class problems by training one classifier per class.
    • For model interpretability, feature importance analysis using a Random Forest Classifier is used. It is a post-modeling evaluation specifically for a Random Forest-based classification model. The purpose of this analysis is to understand which features contribute most to the model's ability to classify outcomes such as customer churn categories or similar labels. The analysis helps identify which variables are most influential in the model, guide feature selection for future model improvements, and provide explainability to stakeholders or non-technical audiences.

    Data and Model Problem Oversights and Future Solutions¶

    • Class Imbalance Problem: Low accuracy and high misclassification for classes like Dissatisfaction, Other, and Price.

    Future Solutions: Use SMOTE or class weights to balance minority classes. Consider grouping low-frequency churn categories if they are semantically similar such as merge Other and Price, if justified.

    • Feature Problem: Overlapping signals might confuse the classifier.

    Future Solutions: New features such as Monthly Charge x Tenure, Refund Ratio, and usage intensity. Include customer interaction history, complaint logs, or NPS scores if available.

    • Future Solutions for Model Problem:
      1. Use Tree-Based or Ensemble Models for Interpretability and Power: Try XGBoost or CatBoost for better handling of mixed feature types and imbalance.
      2. Multi-Class Strategy: Evaluate other classifiers like Gradient Boosted Trees (better at handling noise) and Multinomial Logistic Regression (instead of One-vs-Rest). Also, consider hierarchical classification such as first predict churn vs. not, then predict churn reason.
    In [40]:
    from sklearn.linear_model import LogisticRegression
    from sklearn.preprocessing import OneHotEncoder, StandardScaler
    from sklearn.compose import ColumnTransformer
    from sklearn.pipeline import Pipeline
    from sklearn.model_selection import train_test_split
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay, accuracy_score
    import matplotlib.pyplot as plt
    import pandas as pd
    
    # selected features based on importance
    selected_features = [
        'Contract',
        'Tenure in Months',
        'Monthly Charge',
        'Internet Type',
        'Online Security',
        'Offer',
        'Payment Method',
        'Streaming TV',
        'Streaming Movies',
        'Total Refunds',
        'Total Extra Data Charges'
    ]
    
    # filter dataset to include only churned customers
    churned_data = df[df['Customer Status'] == 'Churned']
    
    X = churned_data[selected_features]
    y = churned_data['Churn Category']
    
    # define feature types
    numerical = ['Tenure in Months', 'Monthly Charge', 'Total Refunds', 'Total Extra Data Charges']
    categorical = list(set(selected_features) - set(numerical))
    
    # preprocessing pipeline
    preprocessor = ColumnTransformer([
        ('num', StandardScaler(), numerical),
        ('cat', OneHotEncoder(drop='first'), categorical)
    ])
    
    # train/test split
    X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.2, random_state=42)
    
    # pipeline with OneVsRestClassifier
    model = Pipeline(steps=[
        ('preprocessor', preprocessor),
        ('classifier', OneVsRestClassifier(LogisticRegression(max_iter=1000, class_weight='balanced')))
    ])
    
    # train and predict
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    
    # confusion matrix
    labels = model.named_steps['classifier'].classes_
    cm = confusion_matrix(y_test, y_pred, labels=labels)
    
    # display
    fig, ax = plt.subplots(figsize=(10, 8))
    disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=labels)
    disp.plot(cmap='Greens', ax=ax, xticks_rotation=45)
    ax.grid(False)
    
    plt.title("Churn Category Prediction by Multi-class Classification Model with One-vs-Rest (OvR) Logistic Regression")
    plt.tight_layout()
    plt.show()
    
    accuracy = accuracy_score(y_test, y_pred)
    print(f"\nAccuracy Score: {accuracy:.2%}")
    
    No description has been provided for this image
    Accuracy Score: 22.63%
    
    In [41]:
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns
    from sklearn.metrics import accuracy_score
    from sklearn.ensemble import RandomForestClassifier
    
    # define and fit a RandomForestClassifier since pipeline was not used for training
    rf_classifier = RandomForestClassifier(random_state=42)
    rf_classifier.fit(model.named_steps['preprocessor'].transform(X_train), y_train)
    
    # Get feature importances
    importances = rf_classifier.feature_importances_
    
    # Get encoded feature names
    ohe = model.named_steps['preprocessor'].named_transformers_['cat']
    encoded_cat_cols = ohe.get_feature_names_out()
    
    # Combine feature names
    all_features = numerical + list(encoded_cat_cols)  # using numerical from previous cell
    
    # Create importance DataFrame
    importance_df = pd.DataFrame({
        'Feature': all_features,
        'Importance': importances
    }).sort_values(by='Importance', ascending=False)
    
    print(importance_df)
    
    # Plot
    plt.figure(figsize=(12, 6))
    sns.barplot(x='Importance', y='Feature', data=importance_df, color='seagreen')
    plt.title("Feature Importance by Random Forest Classifier for Customer Churn Patterns")
    plt.xlabel("Importance Levels")
    plt.ylabel("Features")
    plt.tight_layout()
    plt.show()
    
    accuracy = accuracy_score(y_test, y_pred)
    print(f"Accuracy Score: {accuracy:.2%}")
    
                            Feature  Importance
    1                Monthly Charge    0.424992
    0              Tenure in Months    0.222629
    3      Total Extra Data Charges    0.052282
    10   Payment Method_Credit Card    0.037085
    16         Streaming Movies_Yes    0.035567
    13             Streaming TV_Yes    0.035337
    2                 Total Refunds    0.029555
    12          Online Security_Yes    0.028108
    15    Internet Type_Fiber Optic    0.024585
    14            Internet Type_DSL    0.017279
    11  Payment Method_Mailed Check    0.016869
    8             Contract_One Year    0.016689
    6                 Offer_Offer D    0.015649
    5                 Offer_Offer C    0.013337
    7                 Offer_Offer E    0.011715
    4                 Offer_Offer B    0.011628
    9             Contract_Two Year    0.006695
    
    No description has been provided for this image
    Accuracy Score: 22.63%
    
    In [42]:
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    
    ovr_clf = model.named_steps['classifier']
    estimators = ovr_clf.estimators_
    
    ohe = model.named_steps['preprocessor'].named_transformers_['cat']
    encoded_cat_cols = ohe.get_feature_names_out()
    
    all_features = numerical + list(encoded_cat_cols)
    
    coef_df = pd.DataFrame(
        [est.coef_.flatten() for est in estimators],
        columns=all_features,
        index=ovr_clf.classes_
    ).T
    coef_df['mean_abs'] = coef_df.abs().mean(axis=1)
    coef_df_sorted = coef_df.sort_values(by='mean_abs', ascending=True).drop('mean_abs', axis=1)
    
    plt.figure(figsize=(14, 10))
    coef_df_sorted.plot(kind='barh', figsize=(14, 10), colormap='Paired')
    plt.title("Logistic Regression Coefficients by Churn Category")
    plt.xlabel("Coefficient Value")
    plt.ylabel("Feature")
    plt.tight_layout()
    plt.legend(title="Churn Category", bbox_to_anchor=(1.05, 1), loc='upper left')
    plt.grid(False)
    plt.show()
    
    <Figure size 1400x1000 with 0 Axes>
    No description has been provided for this image

    Professional Website at GitHub: freelancergladyslacia.github.io
    Contact at onbizgladyslacia@gmail.com | linkedin.com/in/freelancergladyslacia | facebook.com/onbizgladyslacia
    Project created in October 2024