
Data Analytics: Complete Course Script - From Beginner to Professional

Comprehensive data analytics course covering fundamentals, tools, techniques, real-world projects, and career pathways. Learn the skills to become a professional data analyst in 12 weeks.


Course Overview and Learning Objectives

What is Data Analytics?

Definition: Data Analytics is the practice of examining raw data to discover patterns, draw conclusions, and support decision-making. It combines statistics, programming, and business acumen to transform data into actionable insights.

Course Structure

This comprehensive course is designed to take you from complete beginner to job-ready data analyst in 12 weeks through structured learning, hands-on projects, and real-world applications.

Prerequisite: Basic computer literacy. No prior analytics or programming experience required.

Learning Outcomes

By completing this course, you will be able to:

  • Understand data analytics fundamentals and concepts
  • Use Excel for data analysis and visualization
  • Write SQL queries to extract and transform data
  • Use Python for data manipulation and analysis
  • Create compelling data visualizations
  • Develop analytics dashboards
  • Conduct exploratory data analysis
  • Build predictive models
  • Present insights to stakeholders
  • Execute end-to-end analytics projects

WEEK 1-2: FOUNDATIONS OF DATA ANALYTICS

Module 1: Data Analytics Fundamentals

Lesson 1.1: Introduction to Data Analytics


Key Concepts:

Data: Raw facts and observations. Examples: Sales transactions, customer demographics, website clicks.

Information: Data processed into meaningful context. Example: "Sales increased 15% in Q3 among the 25-35 age group in North India."

Insights: Conclusions drawn from information that drive decisions. Example: "We should target the 25-35 age group with a Q4 campaign in North India for maximum ROI."

The Analytics Process:


Raw Data → Collection → Cleaning → Analysis → Visualization → Insights → Action → Business Impact

Lesson 1.2: Types of Analytics

Descriptive Analytics - "What Happened?" Analyzing historical data to understand patterns.

Example: "Average monthly sales over past 12 months"

Tools: Excel, Power BI, Tableau

Outcome: Reports, dashboards, summaries

Diagnostic Analytics - "Why Did It Happen?" Investigating reasons behind outcomes.

Example: "Why did sales increase 15% in Q3?"

Techniques: Correlation analysis, trend analysis, root cause analysis

Tools: SQL, Python, statistical software

Predictive Analytics - "What Will Happen?" Using historical data to forecast future outcomes.

Example: "Predict next month's sales based on historical trends"

Techniques: Machine learning, regression, time series forecasting

Tools: Python, R, specialized ML platforms

Prescriptive Analytics - "What Should We Do?" Recommending actions to achieve desired outcomes.

Example: "Allocate 40% budget to North India, 35% to South, 25% to East for maximum projected revenue"

Techniques: Optimization, simulation, decision analysis
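
To make the optimization idea concrete, here is a minimal sketch using scipy's linear programming solver; the per-region return rates and the 40% budget cap are illustrative assumptions, not part of the course data:

python
from scipy.optimize import linprog

# Assumed (illustrative) revenue returned per unit of budget in each region
returns = [-0.12, -0.10, -0.08]   # North, South, East; negated because linprog minimizes

A_eq = [[1, 1, 1]]                # total allocation must equal the full budget
b_eq = [100]                      # budget of 100 units
bounds = [(0, 40)] * 3            # assumed cap: no region gets more than 40%

result = linprog(c=returns, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
print(result.x)     # optimal split, e.g. [40, 40, 20]
print(-result.fun)  # maximum projected revenue under these assumptions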

Module 2: Data Types and Structures

Lesson 2.1: Data Types

Structured Data: Organized in tables with rows and columns. Example: Customer database with columns (Name, Age, Email, Purchase Amount).

Unstructured Data: No predefined structure. Examples: Text documents, images, videos, audio.

Semi-Structured Data: A mix of structured and unstructured. Examples: JSON, XML, log files.

Quantitative (Numerical) Data: Measurable quantities. Examples: Age, salary, temperature, revenue.

Categorical (Qualitative) Data: Non-numeric categories. Examples: Color, gender, product category, region.

Lesson 2.2: Data Distribution

Normal Distribution: Bell-shaped curve; most data points lie near the mean. Many natural phenomena follow a normal distribution.

Skewed Distribution: Asymmetrical. Positive skew (tail to the right), negative skew (tail to the left).

Bimodal Distribution: Two peaks. Suggests two distinct groups in the data.

Understanding distributions helps identify:

  • Outliers and anomalies
  • Appropriate statistical tests
  • Data transformation needs
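
These shapes are easy to see in code; a minimal sketch generating each distribution with numpy and checking skewness with scipy (parameters are illustrative):

python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

normal = rng.normal(loc=50, scale=10, size=10_000)       # bell curve around the mean
skewed = rng.exponential(scale=10, size=10_000)          # positive skew (tail to the right)
bimodal = np.concatenate([rng.normal(30, 5, 5_000),
                          rng.normal(70, 5, 5_000)])     # two peaks = two groups

for name, sample in [('Normal', normal), ('Skewed', skewed), ('Bimodal', bimodal)]:
    print(f"{name}: mean={sample.mean():.1f}, median={np.median(sample):.1f}, "
          f"skewness={stats.skew(sample):.2f}")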

Module 3: Key Analytics Concepts

Lesson 3.1: Statistical Foundations

Mean (Average): Sum of values divided by the count.

Mean = (Sum of all values) / (Number of values)

Example: (10 + 20 + 30 + 40 + 50) / 5 = 30

Median: Middle value when the data is sorted. Less affected by outliers.

Example: 10, 20, 30, 40, 50
Median = 30 (middle value)

Mode: Most frequently occurring value.

Example: 10, 20, 20, 30, 30, 30
Mode = 30 (occurs 3 times)

Standard Deviation: Measures the spread/variability of data. High SD = data spread out; low SD = data clustered.

Variance: Square of the standard deviation. Measures how far data points are from the mean.

Correlation: Relationship between two variables. Ranges from -1 to +1.

Correlation = 1: Perfect positive relationship
Correlation = 0: No relationship
Correlation = -1: Perfect negative relationship
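
All of these measures can be checked in a few lines of Python; a minimal sketch using the standard library's statistics module and numpy:

python
import statistics
import numpy as np

values = [10, 20, 20, 30, 30, 30, 40, 50]

print(statistics.mean(values))      # mean: sum / count
print(statistics.median(values))    # middle value of sorted data
print(statistics.mode(values))      # most frequent value: 30
print(statistics.stdev(values))     # sample standard deviation
print(statistics.variance(values))  # square of the standard deviation

# Correlation between two variables (-1 to +1)
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]
print(np.corrcoef(x, y)[0, 1])      # 1.0: perfect positive relationship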

Lesson 3.2: Sampling and Bias

Sampling: Selecting a subset of data for analysis. Used when the full dataset is too large to process.

Sample Size: Larger samples represent the population more accurately. Balance precision against resources.

Bias: Systematic error that skews results. Types: selection bias, measurement bias, response bias.

Mitigation: Random sampling, stratified sampling, careful data collection.
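
A minimal pandas sketch of both sampling approaches, assuming a DataFrame df with a Region column (names are illustrative):

python
import pandas as pd

# Simple random sample: every row has an equal chance of selection
sample = df.sample(n=1000, random_state=42)

# Stratified sample: draw 10% from each region so every group is represented
stratified = df.groupby('Region').sample(frac=0.10, random_state=42)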


WEEK 3-4: EXCEL FOR DATA ANALYSIS

Module 4: Excel Fundamentals and Data Preparation

Lesson 4.1: Excel Basics

Spreadsheet Structure: Rows (horizontal), columns (vertical), cells (intersections).

Data Entry Best Practices:

  • One data point per cell
  • Consistent formatting
  • No merged cells in data ranges
  • Column headers in first row
  • No blank rows/columns within data

Lesson 4.2: Formulas and Functions

Mathematical Functions

=SUM(A1:A10)          // Add values
=AVERAGE(A1:A10)      // Calculate average
=COUNT(A1:A10)        // Count cells with numbers
=MIN(A1:A10)          // Find minimum
=MAX(A1:A10)          // Find maximum

Logical Functions

=IF(A1>100, "High", "Low")        // Conditional logic
=AND(A1>50, B1<100)               // All conditions true?
=OR(A1=1, A1=2, A1=3)             // Any condition true?

Text Functions

=LEN(A1)              // Length of text
=UPPER(A1)            // Convert to uppercase
=CONCATENATE(A1, " ", B1)  // Combine text
=LEFT(A1, 3)          // First 3 characters

Lesson 4.3: Data Cleaning

Common Data Quality Issues:

Missing Values

  • Identify: Look for blank cells
  • Handle: Delete, fill with mean/median, or leave as is, depending on context

Duplicate Records

  • Identify: Sort and visually inspect, or Data > Remove Duplicates
  • Handle: Remove duplicates or investigate root cause

Inconsistent Formatting

  • Standardize text cases (UPPER, LOWER, PROPER functions)
  • Consistent date formats
  • Consistent number formats

Outliers

  • Identify: Values significantly different from others
  • Verify: Confirm if genuine or data error
  • Handle: Keep if genuine, remove if error

Module 5: Pivot Tables and Summarization

Lesson 5.1: Creating Pivot Tables

What is a Pivot Table? A pivot table summarizes large datasets into meaningful insights by rotating (pivoting) data dimensions.

Example Scenario: Raw data: Sales transactions with Date, Region, Product, Amount

Pivot table shows: Revenue by Region and Product

Steps to Create:

  1. Select data range
  2. Insert > Pivot Table
  3. Drag fields to Rows, Columns, Values
  4. Customize as needed

Lesson 5.2: Aggregation Functions in Pivot Tables

SUM:        Total of values
AVERAGE:    Mean value
COUNT:      Number of items
MIN/MAX:    Smallest/largest value
STDEV:      Standard deviation

Practical Example:

Data: Sales by Salesperson, Month, Product

Pivot Table Structure:
─────────────────────────────────────────
      Jan      Feb      Mar    Total
─────────────────────────────────────────
John  $5,000   $6,000   $7,000  $18,000
Jane  $4,500   $5,500   $6,200  $16,200
Bob   $3,200   $3,800   $4,100  $11,100
─────────────────────────────────────────
Total $12,700  $15,300  $17,300 $45,300
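
For comparison, the same summary can be reproduced programmatically; a minimal pandas sketch assuming a DataFrame df with Salesperson, Month, and Amount columns (mirroring the raw data above):

python
import pandas as pd

pivot = pd.pivot_table(
    df,
    index='Salesperson',    # rows
    columns='Month',        # columns
    values='Amount',        # values to aggregate
    aggfunc='sum',          # SUM; 'mean', 'count', 'min', 'max', 'std' also work
    margins=True,           # add row and column totals
    margins_name='Total'
)
print(pivot)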

WEEK 5-6: SQL FOR DATA ANALYSIS

Module 6: SQL Fundamentals

Lesson 6.1: Introduction to SQL

What is SQL? Structured Query Language: the standard language for querying and managing relational databases.

Why SQL?

  • Access large datasets efficiently
  • Extract specific data subsets
  • Perform complex calculations
  • Join data from multiple tables

Lesson 6.2: Basic SELECT Queries

Selecting All Data:

sql
SELECT * FROM customers;

Selecting Specific Columns:

sql
SELECT name, email, age FROM customers;

Filtering with WHERE:

sql
SELECT * FROM customers
WHERE age > 25 AND region = 'North India';

Logical Operators:

AND:  Both conditions true
OR:   At least one condition true
NOT:  Condition not true
IN:   Value in list
BETWEEN: Value in range
LIKE: Pattern matching
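
A sketch combining several of these operators against the customers table above; the status column is an assumption added for illustration:

sql
SELECT name, email, age
FROM customers
WHERE region IN ('North India', 'South India')  -- IN: match any value in a list
  AND age BETWEEN 25 AND 40                     -- BETWEEN: inclusive range
  AND email LIKE '%@gmail.com'                  -- LIKE: % matches any characters
  AND NOT status = 'inactive';                  -- NOT: negate a condition (assumed column)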

Lesson 6.3: Aggregation and Grouping

Aggregate Functions:

sql
SELECT 
  COUNT(*) as total_customers,
  AVG(purchase_amount) as avg_purchase,
  SUM(purchase_amount) as total_revenue,
  MAX(purchase_amount) as max_purchase
FROM orders;

GROUP BY - Aggregate by Category:

sql
SELECT 
  region,
  COUNT(*) as customer_count,
  SUM(purchase_amount) as total_sales,
  AVG(purchase_amount) as avg_sale
FROM orders
GROUP BY region;

HAVING - Filter Groups:

sql
SELECT 
  product_category,
  COUNT(*) as sales_count
FROM orders
GROUP BY product_category
HAVING COUNT(*) > 50;

Module 7: Advanced SQL

Lesson 7.1: Joins

INNER JOIN - Records Present in Both Tables

sql
SELECT 
  o.order_id,
  c.customer_name,
  o.purchase_amount
FROM orders o
INNER JOIN customers c
  ON o.customer_id = c.customer_id;

LEFT JOIN - All from Left Table

sql
SELECT 
  c.customer_name,
  COUNT(o.order_id) as order_count
FROM customers c
LEFT JOIN orders o
  ON c.customer_id = o.customer_id
GROUP BY c.customer_name;

Lesson 7.2: Subqueries

Subquery in WHERE Clause:

sql
SELECT * FROM customers
WHERE customer_id IN (
  SELECT customer_id FROM orders
  WHERE purchase_amount > 10000
);

Subquery in FROM Clause:

sql
SELECT 
  region,
  avg_purchase
FROM (
  SELECT 
    region,
    AVG(purchase_amount) as avg_purchase
  FROM orders
  GROUP BY region
) as regional_avg
WHERE avg_purchase > 5000;

Lesson 7.3: Window Functions

Running Total:

sql
SELECT 
  date,
  amount,
  SUM(amount) OVER (
    ORDER BY date
  ) as running_total
FROM sales
ORDER BY date;

Rank Within Groups:

sql
SELECT 
  salesperson,
  sales,
  -- add PARTITION BY a grouping column (e.g., region) to rank within each group
  RANK() OVER (ORDER BY sales DESC) as sales_rank
FROM sales_performance;

WEEK 7-8: PYTHON FOR DATA ANALYSIS

Module 8: Python Basics for Analytics

Lesson 8.1: Python Setup and Libraries

Essential Libraries:

python
# Data manipulation
import pandas as pd
import numpy as np

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Statistical analysis
from scipy import stats
import statsmodels.api as sm

# Machine learning (later weeks)
from sklearn import preprocessing, ensemble, metrics

Lesson 8.2: Working with Data in Pandas

Creating DataFrames:

python
# From dictionary
data = {
  'Name': ['Alice', 'Bob', 'Charlie'],
  'Age': [25, 30, 35],
  'Salary': [50000, 60000, 75000]
}
df = pd.DataFrame(data)

# From CSV
df = pd.read_csv('customers.csv')

# From Excel
df = pd.read_excel('sales_data.xlsx')

Exploring Data:

python
# First/last rows
df.head()
df.tail()

# Dataset info
df.info()           # Data types, missing values
df.describe()       # Statistical summary
df.shape            # Rows and columns

# Check for missing values
df.isnull().sum()

Data Cleaning:

python
# Remove duplicates
df = df.drop_duplicates()

# Handle missing values (choose one approach per situation)
df = df.dropna()                                 # Option A: remove rows with NaN
df['Age'] = df['Age'].fillna(df['Age'].mean())   # Option B: fill with the mean

# Remove outliers (values > 3 std devs)
df = df[np.abs(stats.zscore(df['Salary'])) < 3]

# Data type conversion
df['Date'] = pd.to_datetime(df['Date'])
df['Age'] = df['Age'].astype(int)

Filtering and Selection:

python
# Filter rows
high_earners = df[df['Salary'] > 60000]

# Multiple conditions
result = df[(df['Age'] > 25) & (df['Salary'] < 80000)]

# Select columns
selected = df[['Name', 'Salary']]

# Select by location
df.loc[0]           # First row
df.iloc[0:5]        # First 5 rows
df.loc[df['Age'] > 30, 'Name']  # Names of people over 30

Module 9: Data Analysis with Python

Lesson 9.1: Descriptive Statistics

python
# Mean, median, mode
df['Salary'].mean()
df['Salary'].median()
df['Salary'].mode()

# Spread
df['Salary'].std()      # Standard deviation
df['Salary'].var()      # Variance
df['Salary'].min()
df['Salary'].max()

# Percentiles
df['Salary'].quantile(0.25)  # 25th percentile
df['Salary'].quantile(0.75)  # 75th percentile

Lesson 9.2: Grouping and Aggregation

python
# Group by single column
by_region = df.groupby('Region')['Salary'].mean()

# Multiple aggregations
summary = df.groupby('Department').agg({
    'Salary': ['mean', 'min', 'max'],
    'Age': 'mean',
    'Name': 'count'
})

# Group by multiple columns
by_dept_region = df.groupby(['Department', 'Region'])['Salary'].sum()

Lesson 9.3: Merging and Joining Data

python
# Merge DataFrames
df_merged = pd.merge(df_customers, df_orders, 
                     on='customer_id', 
                     how='inner')

# Concatenate
df_combined = pd.concat([df1, df2], axis=0)

# Join (expects 'key' to be the index of df2)
df_result = df1.join(df2, on='key')

Lesson 9.4: Correlation and Statistical Testing

python
# Correlation matrix (numeric columns only)
correlation = df.corr(numeric_only=True)

# Correlation with specific column
df.corr(numeric_only=True)['Salary'].sort_values(ascending=False)

# Statistical test - t-test (comparing two groups)
from scipy import stats
group1 = df[df['Region'] == 'North']['Salary']
group2 = df[df['Region'] == 'South']['Salary']
t_stat, p_value = stats.ttest_ind(group1, group2)

print(f"T-statistic: {t_stat}, P-value: {p_value}")
if p_value < 0.05:
    print("Significant difference between regions")

WEEK 8-9: DATA VISUALIZATION

Module 10: Visualization Fundamentals

Lesson 10.1: Visualization Principles

Choose Right Chart Type:

Line Chart - Show trends over time

Sales Over 12 Months (ascending trend)

Bar Chart - Compare categories

Sales by Region (North > South > East)

Pie Chart - Show parts of whole

Market Share: Company A 40%, B 35%, C 25%

Scatter Plot - Show relationship between two variables

Age vs Salary (positive correlation)

Histogram - Show distribution

Customer age distribution (bell curve)

Lesson 10.2: Visualization with Python

Matplotlib - Basic Plotting:

python
import matplotlib.pyplot as plt

# Line chart
plt.figure(figsize=(10, 6))
plt.plot(df['Month'], df['Sales'], marker='o')
plt.xlabel('Month')
plt.ylabel('Sales ($)')
plt.title('Monthly Sales Trend')
plt.grid(True)
plt.show()

# Bar chart
plt.bar(df['Region'], df['Total_Sales'])
plt.xlabel('Region')
plt.ylabel('Sales ($)')
plt.title('Sales by Region')
plt.show()

# Histogram
plt.hist(df['Age'], bins=20)
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.title('Age Distribution')
plt.show()

Seaborn - Advanced Visualization:

python
import seaborn as sns

# Heatmap showing correlations
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm')
plt.title('Correlation Heatmap')
plt.show()

# Box plot by category
sns.boxplot(x='Region', y='Salary', data=df)
plt.title('Salary Distribution by Region')
plt.show()

# Scatter plot with colors
sns.scatterplot(x='Age', y='Salary', hue='Region', data=df)
plt.title('Age vs Salary by Region')
plt.show()

Module 11: Business Intelligence Dashboards

Lesson 11.1: Dashboard Design Principles

KPI (Key Performance Indicator) Selection: Choose the 3-5 most important metrics for your target audience.

Visual Hierarchy: Most important metrics prominent; supporting details secondary.

Color Usage:

  • Green = Good
  • Yellow = Warning
  • Red = Alert
  • Avoid rainbow colors (accessibility)

Interactivity: Filters, drill-downs, date ranges for exploration.

Lesson 11.2: Building Dashboards with Power BI

Dashboard Components:

┌─────────────────────────────────────┐
│    Dashboard: Monthly Performance    │
├─────────────────────────────────────┤
│  Revenue: $2.5M    Growth: +15%     │
│  Customers: 15K    Churn: 2.3%      │
├────────────────────┬────────────────┤
│  Revenue Trend     │ Sales by Region│
│  (Line Chart)      │ (Pie Chart)    │
├────────────────────┼────────────────┤
│  Top Products      │ Forecast       │
│  (Bar Chart)       │ (Area Chart)   │
└────────────────────┴────────────────┘

WEEK 10: EXPLORATORY DATA ANALYSIS (EDA)

Module 12: Conducting Comprehensive EDA

Lesson 12.1: EDA Framework

Step 1: Understand Data Context

  • What does data represent?
  • How was data collected?
  • What are obvious limitations?

Step 2: Explore Structure

  • Rows and columns count
  • Data types
  • Missing values percentage
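
A minimal sketch for this step, reusing the DataFrame conventions from earlier lessons:

python
# Rows and columns count
print(df.shape)

# Data types per column
print(df.dtypes)

# Missing values as a percentage of rows, per column
print((df.isnull().mean() * 100).round(1))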

Step 3: Univariate Analysis Analyze individual variables.

python
# Numerical variables
df['Age'].describe()
plt.hist(df['Age'], bins=30)

# Categorical variables
df['Region'].value_counts()
df['Region'].value_counts().plot(kind='bar')

Step 4: Multivariate Analysis Analyze relationships between variables.

python
# Correlation between continuous variables
df[['Age', 'Salary', 'Experience']].corr()

# Relationship between categorical and continuous
df.groupby('Region')['Salary'].mean()
sns.boxplot(x='Region', y='Salary', data=df)

Step 5: Anomalies and Outliers Identify unusual patterns.

python
# Identify outliers
Q1 = df['Salary'].quantile(0.25)
Q3 = df['Salary'].quantile(0.75)
IQR = Q3 - Q1
outliers = df[(df['Salary'] < Q1 - 1.5*IQR) | 
              (df['Salary'] > Q3 + 1.5*IQR)]

# Flag suspicious patterns
df[df['Age'] > 100]  # Unrealistic ages
df[df['Salary'] < 0]  # Negative salaries

WEEK 11: PREDICTIVE ANALYTICS

Module 13: Introduction to Predictive Modeling

Lesson 13.1: Regression Analysis

Linear Regression - Predicting Continuous Values

Problem: Predict employee salary based on experience years

python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error
import numpy as np

# Prepare data
X = df[['Years_Experience']].values
y = df['Salary'].values

# Split into training (80%) and testing (20%)
X_train, X_test, y_train, y_test = train_test_split(
  X, y, test_size=0.2, random_state=42
)

# Train model
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Evaluate
r2 = r2_score(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))

print(f"R² Score: {r2:.2f}")  # 0 to 1; higher is better
print(f"RMSE: ${rmse:.2f}")

Interpretation:

  • R² = 0.85: Model explains 85% of salary variation
  • RMSE = $5,000: Average prediction error is $5,000
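
The fitted coefficients are interpretable too; a short sketch continuing from the model trained above:

python
# Slope: estimated salary change per additional year of experience
print(f"Slope: ${model.coef_[0]:,.0f} per year of experience")

# Intercept: predicted salary at zero years of experience
print(f"Intercept: ${model.intercept_:,.0f}")

# Predict for a hypothetical candidate with 5 years of experience
print(model.predict([[5]]))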

Lesson 13.2: Classification - Predicting Categories

Problem: Predict if customer will churn (leave)

python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
import pandas as pd

# Prepare features and target
X = df[['Age', 'Tenure_Months', 'Monthly_Spend', 'Support_Tickets']]
y = df['Churned']  # 0 = No, 1 = Yes

# Split data (stratify keeps the churn ratio consistent across splits)
X_train, X_test, y_train, y_test = train_test_split(
  X, y, test_size=0.2, random_state=42, stratify=y
)

# Train model
model = RandomForestClassifier(n_estimators=100)
model.fit(X_train, y_train)

# Predict
y_pred = model.predict(X_test)

# Evaluate
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2%}")  # Percentage correct

# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
print(cm)
# [[TN  FP]   True Negatives & False Positives
#  [FN  TP]]  False Negatives & True Positives
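
Accuracy alone can mislead when churners are rare, so it helps to report precision and recall from the same predictions; a minimal sketch with scikit-learn's built-in metrics:

python
from sklearn.metrics import precision_score, recall_score, f1_score

precision = precision_score(y_test, y_pred)  # of predicted churners, how many actually churned
recall = recall_score(y_test, y_pred)        # of actual churners, how many were caught
f1 = f1_score(y_test, y_pred)                # harmonic mean of precision and recall

print(f"Precision: {precision:.2%}")
print(f"Recall:    {recall:.2%}")
print(f"F1 Score:  {f1:.2%}")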

WEEK 12: CAPSTONE PROJECT & CAREER PATHWAYS

Module 14: Capstone Analytics Project

Project: Retail Sales Analysis and Prediction

Scenario: You're hired as a data analyst for an online retail company. Your task: analyze sales patterns, identify high-value customers, forecast revenue, and recommend strategies.

Dataset Provided:

Columns: Date, Product, Region, Customer_Type, 
         Units_Sold, Unit_Price, Total_Sales, Customer_ID
Rows: 100,000+ transactions over 2 years

Part 1: Exploratory Analysis (Week 12, Days 1-2)

python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Load data
df = pd.read_csv('retail_sales.csv')
df['Date'] = pd.to_datetime(df['Date'])

# Basic exploration
print(f"Dataset shape: {df.shape}")
print(f"\nMissing values:\n{df.isnull().sum()}")
print(f"\nData types:\n{df.dtypes}")

# Statistical summary
print(df.describe())

# Analysis
print("\nTop 5 Products by Revenue:")
print(df.groupby('Product')['Total_Sales'].sum().sort_values(ascending=False).head())

print("\nSales by Region:")
print(df.groupby('Region')['Total_Sales'].sum().sort_values(ascending=False))

# Visualizations
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# Sales trend over time
daily_sales = df.groupby('Date')['Total_Sales'].sum()
axes[0, 0].plot(daily_sales)
axes[0, 0].set_title('Daily Sales Trend')

# Sales by region
df.groupby('Region')['Total_Sales'].sum().plot(kind='bar', ax=axes[0, 1])
axes[0, 1].set_title('Sales by Region')

# Product performance
df.groupby('Product')['Units_Sold'].sum().plot(kind='barh', ax=axes[1, 0])
axes[1, 0].set_title('Units Sold by Product')

# Customer type distribution
df.groupby('Customer_Type')['Total_Sales'].mean().plot(kind='bar', ax=axes[1, 1])
axes[1, 1].set_title('Average Sales by Customer Type')

plt.tight_layout()
plt.show()

Part 2: Customer Segmentation (Days 3-4)

python
# RFM Analysis: Recency, Frequency, Monetary
from datetime import timedelta

reference_date = df['Date'].max() + timedelta(days=1)

# Named aggregation avoids reusing the grouping column inside agg()
rfm = df.groupby('Customer_ID').agg(
    Recency=('Date', lambda x: (reference_date - x.max()).days),  # days since last purchase
    Frequency=('Date', 'count'),       # number of transactions
    Monetary=('Total_Sales', 'sum')    # total spend
)

# Score customers (1=worst, 5=best)
rfm['R_Score'] = pd.qcut(rfm['Recency'], 5, labels=[5,4,3,2,1])
rfm['F_Score'] = pd.qcut(rfm['Frequency'].rank(method='first'), 5, labels=[1,2,3,4,5])
rfm['M_Score'] = pd.qcut(rfm['Monetary'], 5, labels=[1,2,3,4,5])

rfm['RFM_Score'] = rfm['R_Score'].astype(int) + rfm['F_Score'].astype(int) + rfm['M_Score'].astype(int)

# Segment customers
def segment(score):
    if score >= 13: return 'Champions'
    elif score >= 10: return 'Loyal'
    elif score >= 7: return 'At Risk'
    else: return 'Lost'

rfm['Segment'] = rfm['RFM_Score'].apply(segment)

print("\nCustomer Segments:")
print(rfm['Segment'].value_counts())

Part 3: Revenue Forecasting (Days 5-6)

python
from sklearn.linear_model import LinearRegression
from datetime import timedelta  # used for building future dates below
import numpy as np

# Prepare time series data
daily_revenue = df.groupby('Date')['Total_Sales'].sum().reset_index()
daily_revenue['Days_Since_Start'] = (daily_revenue['Date'] - daily_revenue['Date'].min()).dt.days

# Split into train/test (test is held out for checking forecast error against actuals)
train_size = int(len(daily_revenue) * 0.8)
train = daily_revenue[:train_size]
test = daily_revenue[train_size:]

# Train model
X_train = train['Days_Since_Start'].values.reshape(-1, 1)
y_train = train['Total_Sales'].values

model = LinearRegression()
model.fit(X_train, y_train)

# Forecast next 30 days
future_days = np.arange(daily_revenue['Days_Since_Start'].max() + 1, 
                        daily_revenue['Days_Since_Start'].max() + 31).reshape(-1, 1)
forecast = model.predict(future_days)

print(f"30-Day Revenue Forecast: ${forecast.sum():,.2f}")
print(f"Average Daily Forecast: ${forecast.mean():,.2f}")

# Visualize
plt.figure(figsize=(12, 6))
plt.plot(daily_revenue['Date'], daily_revenue['Total_Sales'], label='Actual')
future_dates = [daily_revenue['Date'].max() + timedelta(days=i) for i in range(1, 31)]
plt.plot(future_dates, forecast, label='Forecast', linestyle='--')
plt.xlabel('Date')
plt.ylabel('Daily Revenue ($)')
plt.title('Revenue Forecast')
plt.legend()
plt.show()

Part 4: Insights and Recommendations (Day 7)

Present findings:

  1. Top Performing Products: Products A and C drive 60% of revenue
  2. High-Value Customers: The top 20% of customers generate 80% of revenue
  3. Growth Opportunity: The South region is underperforming but has potential
  4. Forecast: Expected 8% revenue growth next month
  5. Recommendations:
    • Expand Product A in the South region
    • Launch a loyalty program for the Champions segment
    • Run a retention campaign for At Risk customers
    • Optimize pricing of high-margin products

CAREER PATHWAYS AND ADVANCED TOPICS

Module 15: Career Development

Lesson 15.1: Data Analyst Career Paths

Entry-Level Analyst

  • 0-2 years experience
  • Focus: Data collection, cleaning, basic reporting
  • Salary: ₹3-6 lakhs annually
  • Skills: Excel, SQL, basic Python/R

Mid-Level Analyst

  • 2-5 years experience
  • Focus: Advanced analysis, dashboards, project leadership
  • Salary: ₹6-12 lakhs
  • Skills: SQL, Python, Power BI
