
Data Analytics: Complete Course Script - From Beginner to Professional

Comprehensive data analytics course covering fundamentals, tools, techniques, real-world projects, and career pathways. Learn the skills to become a professional data analyst in 12 weeks.


Course Overview and Learning Objectives

What is Data Analytics?

Definition: Data Analytics is the practice of examining raw data to discover patterns, draw conclusions, and support decision-making. It combines statistics, programming, and business acumen to transform data into actionable insights.

Course Structure

This comprehensive course is designed to take you from complete beginner to job-ready data analyst in 12 weeks through structured learning, hands-on projects, and real-world applications.

Prerequisite: Basic computer literacy. No prior analytics or programming experience required.

Learning Outcomes

By completing this course, you will be able to:

  • Understand data analytics fundamentals and concepts
  • Use Excel for data analysis and visualization
  • Write SQL queries to extract and transform data
  • Use Python for data manipulation and analysis
  • Create compelling data visualizations
  • Develop analytics dashboards
  • Conduct exploratory data analysis
  • Build predictive models
  • Present insights to stakeholders
  • Execute end-to-end analytics projects

WEEK 1-2: FOUNDATIONS OF DATA ANALYTICS

Module 1: Data Analytics Fundamentals

Lesson 1.1: Introduction to Data Analytics


Key Concepts:

Data: Raw facts and observations. Examples: Sales transactions, customer demographics, website clicks.

Information: Data processed into meaningful context. Example: "Sales increased 15% in Q3 among the 25-35 age group in North India."

Insights: Conclusions drawn from information that drive decisions. Example: "We should target the 25-35 age group with a Q4 campaign in North India for maximum ROI."

The Analytics Process:


Raw Data → Collection → Cleaning → Analysis → Visualization → Insights → Action → Business Impact

Lesson 1.2: Types of Analytics

Descriptive Analytics - "What Happened?" Analyzing historical data to understand patterns.

Example: "Average monthly sales over past 12 months"

Tools: Excel, Power BI, Tableau

Outcome: Reports, dashboards, summaries

Diagnostic Analytics - "Why Did It Happen?" Investigating reasons behind outcomes.

Example: "Why did sales increase 15% in Q3?"

Techniques: Correlation analysis, trend analysis, root cause analysis

Tools: SQL, Python, statistical software

Predictive Analytics - "What Will Happen?" Using historical data to forecast future outcomes.

Example: "Predict next month's sales based on historical trends"

Techniques: Machine learning, regression, time series forecasting

Tools: Python, R, specialized ML platforms

Prescriptive Analytics - "What Should We Do?" Recommending actions to achieve desired outcomes.

Example: "Allocate 40% budget to North India, 35% to South, 25% to East for maximum projected revenue"

Techniques: Optimization, simulation, decision analysis
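
To make the optimization idea concrete, here is a minimal sketch using scipy's linear programming solver; the per-region return rates and the 40% budget cap are illustrative assumptions, not part of the course data:

python
from scipy.optimize import linprog

# Assumed (illustrative) revenue returned per unit of budget in each region
returns = [-0.12, -0.10, -0.08]   # North, South, East; negated because linprog minimizes

A_eq = [[1, 1, 1]]                # total allocation must equal the full budget
b_eq = [100]                      # budget of 100 units
bounds = [(0, 40)] * 3            # assumed cap: no region gets more than 40%

result = linprog(c=returns, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
print(result.x)     # optimal split, e.g. [40, 40, 20]
print(-result.fun)  # maximum projected revenue under these assumptions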

Module 2: Data Types and Structures

Lesson 2.1: Data Types

Structured Data: Organized in tables with rows and columns. Example: Customer database with columns (Name, Age, Email, Purchase Amount).

Unstructured Data: No predefined structure. Examples: Text documents, images, videos, audio.

Semi-Structured Data: A mix of structured and unstructured. Examples: JSON, XML, log files.

Quantitative (Numerical) Data: Measurable quantities. Examples: Age, salary, temperature, revenue.

Categorical (Qualitative) Data: Non-numeric categories. Examples: Color, gender, product category, region.

Lesson 2.2: Data Distribution

Normal Distribution: Bell-shaped curve; most data points lie near the mean. Many natural phenomena follow a normal distribution.

Skewed Distribution: Asymmetrical. Positive skew (tail to the right), negative skew (tail to the left).

Bimodal Distribution: Two peaks. Suggests two distinct groups in the data.

Understanding distributions helps identify:

  • Outliers and anomalies
  • Appropriate statistical tests
  • Data transformation needs
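
These shapes are easy to see in code; a minimal sketch generating each distribution with numpy and checking skewness with scipy (parameters are illustrative):

python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

normal = rng.normal(loc=50, scale=10, size=10_000)       # bell curve around the mean
skewed = rng.exponential(scale=10, size=10_000)          # positive skew (tail to the right)
bimodal = np.concatenate([rng.normal(30, 5, 5_000),
                          rng.normal(70, 5, 5_000)])     # two peaks = two groups

for name, sample in [('Normal', normal), ('Skewed', skewed), ('Bimodal', bimodal)]:
    print(f"{name}: mean={sample.mean():.1f}, median={np.median(sample):.1f}, "
          f"skewness={stats.skew(sample):.2f}")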

Module 3: Key Analytics Concepts

Lesson 3.1: Statistical Foundations

Mean (Average): Sum of values divided by the count.

Mean = (Sum of all values) / (Number of values)

Example: (10 + 20 + 30 + 40 + 50) / 5 = 30

Median: Middle value when the data is sorted. Less affected by outliers.

Example: 10, 20, 30, 40, 50
Median = 30 (middle value)

Mode: Most frequently occurring value.

Example: 10, 20, 20, 30, 30, 30
Mode = 30 (occurs 3 times)

Standard Deviation: Measures the spread/variability of data. High SD = data spread out; low SD = data clustered.

Variance: Square of the standard deviation. Measures how far data points are from the mean.

Correlation: Relationship between two variables. Ranges from -1 to +1.

Correlation = 1: Perfect positive relationship
Correlation = 0: No relationship
Correlation = -1: Perfect negative relationship
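
All of these measures can be checked in a few lines of Python; a minimal sketch using the standard library's statistics module and numpy:

python
import statistics
import numpy as np

values = [10, 20, 20, 30, 30, 30, 40, 50]

print(statistics.mean(values))      # mean: sum / count
print(statistics.median(values))    # middle value of sorted data
print(statistics.mode(values))      # most frequent value: 30
print(statistics.stdev(values))     # sample standard deviation
print(statistics.variance(values))  # square of the standard deviation

# Correlation between two variables (-1 to +1)
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]
print(np.corrcoef(x, y)[0, 1])      # 1.0: perfect positive relationship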

Lesson 3.2: Sampling and Bias

Sampling: Selecting a subset of data for analysis. Used when the full dataset is too large to process.

Sample Size: Larger samples represent the population more accurately. Balance precision against resources.

Bias: Systematic error that skews results. Types: selection bias, measurement bias, response bias.

Mitigation: Random sampling, stratified sampling, careful data collection.
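
A minimal pandas sketch of both sampling approaches, assuming a DataFrame df with a Region column (names are illustrative):

python
import pandas as pd

# Simple random sample: every row has an equal chance of selection
sample = df.sample(n=1000, random_state=42)

# Stratified sample: draw 10% from each region so every group is represented
stratified = df.groupby('Region').sample(frac=0.10, random_state=42)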


WEEK 3-4: EXCEL FOR DATA ANALYSIS

Module 4: Excel Fundamentals and Data Preparation

Lesson 4.1: Excel Basics

Spreadsheet Structure: Rows (horizontal), columns (vertical), cells (intersections).

Data Entry Best Practices:

  • One data point per cell
  • Consistent formatting
  • No merged cells in data ranges
  • Column headers in first row
  • No blank rows/columns within data

Lesson 4.2: Formulas and Functions

Mathematical Functions

=SUM(A1:A10)          // Add values
=AVERAGE(A1:A10)      // Calculate average
=COUNT(A1:A10)        // Count cells with numbers
=MIN(A1:A10)          // Find minimum
=MAX(A1:A10)          // Find maximum

Logical Functions

=IF(A1>100, "High", "Low")        // Conditional logic
=AND(A1>50, B1<100)               // All conditions true?
=OR(A1=1, A1=2, A1=3)             // Any condition true?

Text Functions

=LEN(A1)              // Length of text
=UPPER(A1)            // Convert to uppercase
=CONCATENATE(A1, " ", B1)  // Combine text
=LEFT(A1, 3)          // First 3 characters

Lesson 4.3: Data Cleaning

Common Data Quality Issues:

Missing Values

  • Identify: Look for blank cells
  • Handle: Delete, fill with mean/median, or leave as is, depending on context

Duplicate Records

  • Identify: Sort and visually inspect, or Data > Remove Duplicates
  • Handle: Remove duplicates or investigate root cause

Inconsistent Formatting

  • Standardize text cases (UPPER, LOWER, PROPER functions)
  • Consistent date formats
  • Consistent number formats

Outliers

  • Identify: Values significantly different from others
  • Verify: Confirm if genuine or data error
  • Handle: Keep if genuine, remove if error

Module 5: Pivot Tables and Summarization

Lesson 5.1: Creating Pivot Tables

What is a Pivot Table? A pivot table summarizes large datasets into meaningful insights by rotating (pivoting) data dimensions.

Example Scenario: Raw data: Sales transactions with Date, Region, Product, Amount

Pivot table shows: Revenue by Region and Product

Steps to Create:

  1. Select data range
  2. Insert > Pivot Table
  3. Drag fields to Rows, Columns, Values
  4. Customize as needed

Lesson 5.2: Aggregation Functions in Pivot Tables

SUM:        Total of values
AVERAGE:    Mean value
COUNT:      Number of items
MIN/MAX:    Smallest/largest value
STDEV:      Standard deviation

Practical Example:

Data: Sales by Salesperson, Month, Product

Pivot Table Structure:
─────────────────────────────────────────
      Jan      Feb      Mar    Total
─────────────────────────────────────────
John  $5,000   $6,000   $7,000  $18,000
Jane  $4,500   $5,500   $6,200  $16,200
Bob   $3,200   $3,800   $4,100  $11,100
─────────────────────────────────────────
Total $12,700  $15,300  $17,300 $45,300
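
For comparison, the same summary can be reproduced programmatically; a minimal pandas sketch assuming a DataFrame df with Salesperson, Month, and Amount columns (mirroring the raw data above):

python
import pandas as pd

pivot = pd.pivot_table(
    df,
    index='Salesperson',    # rows
    columns='Month',        # columns
    values='Amount',        # values to aggregate
    aggfunc='sum',          # SUM; 'mean', 'count', 'min', 'max', 'std' also work
    margins=True,           # add row and column totals
    margins_name='Total'
)
print(pivot)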

WEEK 5-6: SQL FOR DATA ANALYSIS

Module 6: SQL Fundamentals

Lesson 6.1: Introduction to SQL

What is SQL? Structured Query Language: the standard language for querying and managing relational databases.

Why SQL?

  • Access large datasets efficiently
  • Extract specific data subsets
  • Perform complex calculations
  • Join data from multiple tables

Lesson 6.2: Basic SELECT Queries

Selecting All Data:

sql
SELECT * FROM customers;

Selecting Specific Columns:

sql
SELECT name, email, age FROM customers;

Filtering with WHERE:

sql
SELECT * FROM customers
WHERE age > 25 AND region = 'North India';

Logical Operators:

AND:  Both conditions true
OR:   At least one condition true
NOT:  Condition not true
IN:   Value in list
BETWEEN: Value in range
LIKE: Pattern matching
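
A sketch combining several of these operators against the customers table above; the status column is an assumption added for illustration:

sql
SELECT name, email, age
FROM customers
WHERE region IN ('North India', 'South India')  -- IN: match any value in a list
  AND age BETWEEN 25 AND 40                     -- BETWEEN: inclusive range
  AND email LIKE '%@gmail.com'                  -- LIKE: % matches any characters
  AND NOT status = 'inactive';                  -- NOT: negate a condition (assumed column)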

Lesson 6.3: Aggregation and Grouping

Aggregate Functions:

sql
SELECT 
  COUNT(*) as total_customers,
  AVG(purchase_amount) as avg_purchase,
  SUM(purchase_amount) as total_revenue,
  MAX(purchase_amount) as max_purchase
FROM orders;

GROUP BY - Aggregate by Category:

sql
SELECT 
  region,
  COUNT(*) as customer_count,
  SUM(purchase_amount) as total_sales,
  AVG(purchase_amount) as avg_sale
FROM orders
GROUP BY region;

HAVING - Filter Groups:

sql
SELECT 
  product_category,
  COUNT(*) as sales_count
FROM orders
GROUP BY product_category
HAVING COUNT(*) > 50;

Module 7: Advanced SQL

Lesson 7.1: Joins

INNER JOIN - Records Present in Both Tables

sql
SELECT 
  o.order_id,
  c.customer_name,
  o.purchase_amount
FROM orders o
INNER JOIN customers c
  ON o.customer_id = c.customer_id;

LEFT JOIN - All from Left Table

sql
SELECT 
  c.customer_name,
  COUNT(o.order_id) as order_count
FROM customers c
LEFT JOIN orders o
  ON c.customer_id = o.customer_id
GROUP BY c.customer_name;

Lesson 7.2: Subqueries

Subquery in WHERE Clause:

sql
SELECT * FROM customers
WHERE customer_id IN (
  SELECT customer_id FROM orders
  WHERE purchase_amount > 10000
);

Subquery in FROM Clause:

sql
SELECT 
  region,
  avg_purchase
FROM (
  SELECT 
    region,
    AVG(purchase_amount) as avg_purchase
  FROM orders
  GROUP BY region
) as regional_avg
WHERE avg_purchase > 5000;

Lesson 7.3: Window Functions

Running Total:

sql
SELECT 
  date,
  amount,
  SUM(amount) OVER (
    ORDER BY date
  ) as running_total
FROM sales
ORDER BY date;

Rank Within Groups:

sql
SELECT 
  salesperson,
  sales,
  -- add PARTITION BY a grouping column (e.g., region) to rank within each group
  RANK() OVER (ORDER BY sales DESC) as sales_rank
FROM sales_performance;

WEEK 7-8: PYTHON FOR DATA ANALYSIS

Module 8: Python Basics for Analytics

Lesson 8.1: Python Setup and Libraries

Essential Libraries:

python
# Data manipulation
import pandas as pd
import numpy as np

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Statistical analysis
from scipy import stats
import statsmodels.api as sm

# Machine learning (later weeks)
from sklearn import preprocessing, ensemble, metrics

Lesson 8.2: Working with Data in Pandas

Creating DataFrames:

python
# From dictionary
data = {
  'Name': ['Alice', 'Bob', 'Charlie'],
  'Age': [25, 30, 35],
  'Salary': [50000, 60000, 75000]
}
df = pd.DataFrame(data)

# From CSV
df = pd.read_csv('customers.csv')

# From Excel
df = pd.read_excel('sales_data.xlsx')

Exploring Data:

python
# First/last rows
df.head()
df.tail()

# Dataset info
df.info()           # Data types, missing values
df.describe()       # Statistical summary
df.shape            # Rows and columns

# Check for missing values
df.isnull().sum()

Data Cleaning:

python
# Remove duplicates
df = df.drop_duplicates()

# Handle missing values (choose one approach per situation)
df = df.dropna()                                 # Option A: remove rows with NaN
df['Age'] = df['Age'].fillna(df['Age'].mean())   # Option B: fill with the mean

# Remove outliers (values > 3 std devs)
df = df[np.abs(stats.zscore(df['Salary'])) < 3]

# Data type conversion
df['Date'] = pd.to_datetime(df['Date'])
df['Age'] = df['Age'].astype(int)

Filtering and Selection:

python
# Filter rows
high_earners = df[df['Salary'] > 60000]

# Multiple conditions
result = df[(df['Age'] > 25) & (df['Salary'] < 80000)]

# Select columns
selected = df[['Name', 'Salary']]

# Select by location
df.loc[0]           # First row
df.iloc[0:5]        # First 5 rows
df.loc[df['Age'] > 30, 'Name']  # Names of people over 30

Module 9: Data Analysis with Python

Lesson 9.1: Descriptive Statistics

python
# Mean, median, mode
df['Salary'].mean()
df['Salary'].median()
df['Salary'].mode()

# Spread
df['Salary'].std()      # Standard deviation
df['Salary'].var()      # Variance
df['Salary'].min()
df['Salary'].max()

# Percentiles
df['Salary'].quantile(0.25)  # 25th percentile
df['Salary'].quantile(0.75)  # 75th percentile

Lesson 9.2: Grouping and Aggregation

python
# Group by single column
by_region = df.groupby('Region')['Salary'].mean()

# Multiple aggregations
summary = df.groupby('Department').agg({
    'Salary': ['mean', 'min', 'max'],
    'Age': 'mean',
    'Name': 'count'
})

# Group by multiple columns
by_dept_region = df.groupby(['Department', 'Region'])['Salary'].sum()

Lesson 9.3: Merging and Joining Data

python
# Merge DataFrames
df_merged = pd.merge(df_customers, df_orders, 
                     on='customer_id', 
                     how='inner')

# Concatenate
df_combined = pd.concat([df1, df2], axis=0)

# Join (expects 'key' to be the index of df2)
df_result = df1.join(df2, on='key')

Lesson 9.4: Correlation and Statistical Testing

python
# Correlation matrix (numeric columns only)
correlation = df.corr(numeric_only=True)

# Correlation with specific column
df.corr(numeric_only=True)['Salary'].sort_values(ascending=False)

# Statistical test - t-test (comparing two groups)
from scipy import stats
group1 = df[df['Region'] == 'North']['Salary']
group2 = df[df['Region'] == 'South']['Salary']
t_stat, p_value = stats.ttest_ind(group1, group2)

print(f"T-statistic: {t_stat}, P-value: {p_value}")
if p_value < 0.05:
    print("Significant difference between regions")

WEEK 8-9: DATA VISUALIZATION

Module 10: Visualization Fundamentals

Lesson 10.1: Visualization Principles

Choose Right Chart Type:

Line Chart - Show trends over time

Sales Over 12 Months (ascending trend)

Bar Chart - Compare categories

Sales by Region (North > South > East)

Pie Chart - Show parts of whole

Market Share: Company A 40%, B 35%, C 25%

Scatter Plot - Show relationship between two variables

Age vs Salary (positive correlation)

Histogram - Show distribution

Customer age distribution (bell curve)

Lesson 10.2: Visualization with Python

Matplotlib - Basic Plotting:

python
import matplotlib.pyplot as plt

# Line chart
plt.figure(figsize=(10, 6))
plt.plot(df['Month'], df['Sales'], marker='o')
plt.xlabel('Month')
plt.ylabel('Sales ($)')
plt.title('Monthly Sales Trend')
plt.grid(True)
plt.show()

# Bar chart
plt.bar(df['Region'], df['Total_Sales'])
plt.xlabel('Region')
plt.ylabel('Sales ($)')
plt.title('Sales by Region')
plt.show()

# Histogram
plt.hist(df['Age'], bins=20)
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.title('Age Distribution')
plt.show()

Seaborn - Advanced Visualization:

python
import seaborn as sns

# Heatmap showing correlations
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm')
plt.title('Correlation Heatmap')
plt.show()

# Box plot by category
sns.boxplot(x='Region', y='Salary', data=df)
plt.title('Salary Distribution by Region')
plt.show()

# Scatter plot with colors
sns.scatterplot(x='Age', y='Salary', hue='Region', data=df)
plt.title('Age vs Salary by Region')
plt.show()

Module 11: Business Intelligence Dashboards

Lesson 11.1: Dashboard Design Principles

KPI (Key Performance Indicator) Selection: Choose the 3-5 most important metrics for your target audience.

Visual Hierarchy: Most important metrics prominent; supporting details secondary.

Color Usage:

  • Green = Good
  • Yellow = Warning
  • Red = Alert
  • Avoid rainbow colors (accessibility)

Interactivity: Filters, drill-downs, date ranges for exploration.

Lesson 11.2: Building Dashboards with Power BI

Dashboard Components:

┌─────────────────────────────────────┐
│    Dashboard: Monthly Performance    │
├─────────────────────────────────────┤
│  Revenue: $2.5M    Growth: +15%     │
│  Customers: 15K    Churn: 2.3%      │
├────────────────────┬────────────────┤
│  Revenue Trend     │ Sales by Region│
│  (Line Chart)      │ (Pie Chart)    │
├────────────────────┼────────────────┤
│  Top Products      │ Forecast       │
│  (Bar Chart)       │ (Area Chart)   │
└────────────────────┴────────────────┘

WEEK 10: EXPLORATORY DATA ANALYSIS (EDA)

Module 12: Conducting Comprehensive EDA

Lesson 12.1: EDA Framework

Step 1: Understand Data Context

  • What does data represent?
  • How was data collected?
  • What are obvious limitations?

Step 2: Explore Structure

  • Rows and columns count
  • Data types
  • Missing values percentage
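
A minimal sketch for this step, reusing the DataFrame conventions from earlier lessons:

python
# Rows and columns count
print(df.shape)

# Data types per column
print(df.dtypes)

# Missing values as a percentage of rows, per column
print((df.isnull().mean() * 100).round(1))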

Step 3: Univariate Analysis Analyze individual variables.

python
# Numerical variables
df['Age'].describe()
plt.hist(df['Age'], bins=30)

# Categorical variables
df['Region'].value_counts()
df['Region'].value_counts().plot(kind='bar')

Step 4: Multivariate Analysis Analyze relationships between variables.

python
# Correlation between continuous variables
df[['Age', 'Salary', 'Experience']].corr()

# Relationship between categorical and continuous
df.groupby('Region')['Salary'].mean()
sns.boxplot(x='Region', y='Salary', data=df)

Step 5: Anomalies and Outliers Identify unusual patterns.

python
# Identify outliers
Q1 = df['Salary'].quantile(0.25)
Q3 = df['Salary'].quantile(0.75)
IQR = Q3 - Q1
outliers = df[(df['Salary'] < Q1 - 1.5*IQR) | 
              (df['Salary'] > Q3 + 1.5*IQR)]

# Flag suspicious patterns
df[df['Age'] > 100]  # Unrealistic ages
df[df['Salary'] < 0]  # Negative salaries

WEEK 11: PREDICTIVE ANALYTICS

Module 13: Introduction to Predictive Modeling

Lesson 13.1: Regression Analysis

Linear Regression - Predicting Continuous Values

Problem: Predict employee salary based on experience years

python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error
import numpy as np

# Prepare data
X = df[['Years_Experience']].values
y = df['Salary'].values

# Split into training (80%) and testing (20%)
X_train, X_test, y_train, y_test = train_test_split(
  X, y, test_size=0.2, random_state=42
)

# Train model
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Evaluate
r2 = r2_score(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))

print(f"R² Score: {r2:.2f}")  # 0 to 1; higher is better
print(f"RMSE: ${rmse:.2f}")

Interpretation:

  • R² = 0.85: Model explains 85% of salary variation
  • RMSE = $5,000: Average prediction error is $5,000
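
The fitted coefficients are interpretable too; a short sketch continuing from the model trained above:

python
# Slope: estimated salary change per additional year of experience
print(f"Slope: ${model.coef_[0]:,.0f} per year of experience")

# Intercept: predicted salary at zero years of experience
print(f"Intercept: ${model.intercept_:,.0f}")

# Predict for a hypothetical candidate with 5 years of experience
print(model.predict([[5]]))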

Lesson 13.2: Classification - Predicting Categories

Problem: Predict if customer will churn (leave)

python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
import pandas as pd

# Prepare features and target
X = df[['Age', 'Tenure_Months', 'Monthly_Spend', 'Support_Tickets']]
y = df['Churned']  # 0 = No, 1 = Yes

# Split data (stratify keeps the churn ratio consistent across splits)
X_train, X_test, y_train, y_test = train_test_split(
  X, y, test_size=0.2, random_state=42, stratify=y
)

# Train model
model = RandomForestClassifier(n_estimators=100)
model.fit(X_train, y_train)

# Predict
y_pred = model.predict(X_test)

# Evaluate
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2%}")  # Percentage correct

# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
print(cm)
# [[TN  FP]   True Negatives & False Positives
#  [FN  TP]]  False Negatives & True Positives
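
Accuracy alone can mislead when churners are rare, so it helps to report precision and recall from the same predictions; a minimal sketch with scikit-learn's built-in metrics:

python
from sklearn.metrics import precision_score, recall_score, f1_score

precision = precision_score(y_test, y_pred)  # of predicted churners, how many actually churned
recall = recall_score(y_test, y_pred)        # of actual churners, how many were caught
f1 = f1_score(y_test, y_pred)                # harmonic mean of precision and recall

print(f"Precision: {precision:.2%}")
print(f"Recall:    {recall:.2%}")
print(f"F1 Score:  {f1:.2%}")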

WEEK 12: CAPSTONE PROJECT & CAREER PATHWAYS

Module 14: Capstone Analytics Project

Project: Retail Sales Analysis and Prediction

Scenario: You're hired as a data analyst for an online retail company. Your task: analyze sales patterns, identify high-value customers, forecast revenue, and recommend strategies.

Dataset Provided:

Columns: Date, Product, Region, Customer_Type, 
         Units_Sold, Unit_Price, Total_Sales, Customer_ID
Rows: 100,000+ transactions over 2 years

Part 1: Exploratory Analysis (Week 12, Days 1-2)

python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Load data
df = pd.read_csv('retail_sales.csv')
df['Date'] = pd.to_datetime(df['Date'])

# Basic exploration
print(f"Dataset shape: {df.shape}")
print(f"\nMissing values:\n{df.isnull().sum()}")
print(f"\nData types:\n{df.dtypes}")

# Statistical summary
print(df.describe())

# Analysis
print("\nTop 5 Products by Revenue:")
print(df.groupby('Product')['Total_Sales'].sum().sort_values(ascending=False).head())

print("\nSales by Region:")
print(df.groupby('Region')['Total_Sales'].sum().sort_values(ascending=False))

# Visualizations
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# Sales trend over time
daily_sales = df.groupby('Date')['Total_Sales'].sum()
axes[0, 0].plot(daily_sales)
axes[0, 0].set_title('Daily Sales Trend')

# Sales by region
df.groupby('Region')['Total_Sales'].sum().plot(kind='bar', ax=axes[0, 1])
axes[0, 1].set_title('Sales by Region')

# Product performance
df.groupby('Product')['Units_Sold'].sum().plot(kind='barh', ax=axes[1, 0])
axes[1, 0].set_title('Units Sold by Product')

# Customer type distribution
df.groupby('Customer_Type')['Total_Sales'].mean().plot(kind='bar', ax=axes[1, 1])
axes[1, 1].set_title('Average Sales by Customer Type')

plt.tight_layout()
plt.show()

Part 2: Customer Segmentation (Days 3-4)

python
# RFM Analysis: Recency, Frequency, Monetary
from datetime import timedelta

reference_date = df['Date'].max() + timedelta(days=1)

# Named aggregation avoids reusing the grouping column inside agg()
rfm = df.groupby('Customer_ID').agg(
    Recency=('Date', lambda x: (reference_date - x.max()).days),  # days since last purchase
    Frequency=('Date', 'count'),       # number of transactions
    Monetary=('Total_Sales', 'sum')    # total spend
)

# Score customers (1=worst, 5=best)
rfm['R_Score'] = pd.qcut(rfm['Recency'], 5, labels=[5,4,3,2,1])
rfm['F_Score'] = pd.qcut(rfm['Frequency'].rank(method='first'), 5, labels=[1,2,3,4,5])
rfm['M_Score'] = pd.qcut(rfm['Monetary'], 5, labels=[1,2,3,4,5])

rfm['RFM_Score'] = rfm['R_Score'].astype(int) + rfm['F_Score'].astype(int) + rfm['M_Score'].astype(int)

# Segment customers
def segment(score):
    if score >= 13: return 'Champions'
    elif score >= 10: return 'Loyal'
    elif score >= 7: return 'At Risk'
    else: return 'Lost'

rfm['Segment'] = rfm['RFM_Score'].apply(segment)

print("\nCustomer Segments:")
print(rfm['Segment'].value_counts())

Part 3: Revenue Forecasting (Days 5-6)

python
from sklearn.linear_model import LinearRegression
from datetime import timedelta  # used for building future dates below
import numpy as np

# Prepare time series data
daily_revenue = df.groupby('Date')['Total_Sales'].sum().reset_index()
daily_revenue['Days_Since_Start'] = (daily_revenue['Date'] - daily_revenue['Date'].min()).dt.days

# Split into train/test (test is held out for checking forecast error against actuals)
train_size = int(len(daily_revenue) * 0.8)
train = daily_revenue[:train_size]
test = daily_revenue[train_size:]

# Train model
X_train = train['Days_Since_Start'].values.reshape(-1, 1)
y_train = train['Total_Sales'].values

model = LinearRegression()
model.fit(X_train, y_train)

# Forecast next 30 days
future_days = np.arange(daily_revenue['Days_Since_Start'].max() + 1, 
                        daily_revenue['Days_Since_Start'].max() + 31).reshape(-1, 1)
forecast = model.predict(future_days)

print(f"30-Day Revenue Forecast: ${forecast.sum():,.2f}")
print(f"Average Daily Forecast: ${forecast.mean():,.2f}")

# Visualize
plt.figure(figsize=(12, 6))
plt.plot(daily_revenue['Date'], daily_revenue['Total_Sales'], label='Actual')
future_dates = [daily_revenue['Date'].max() + timedelta(days=i) for i in range(1, 31)]
plt.plot(future_dates, forecast, label='Forecast', linestyle='--')
plt.xlabel('Date')
plt.ylabel('Daily Revenue ($)')
plt.title('Revenue Forecast')
plt.legend()
plt.show()

Part 4: Insights and Recommendations (Day 7)

Present findings:

  1. Top Performing Products: Products A and C drive 60% of revenue
  2. High-Value Customers: The top 20% of customers generate 80% of revenue
  3. Growth Opportunity: The South region is underperforming but has potential
  4. Forecast: Expected 8% revenue growth next month
  5. Recommendations:
    • Expand Product A in the South region
    • Launch a loyalty program for the Champions segment
    • Run a retention campaign for At Risk customers
    • Optimize pricing of high-margin products

CAREER PATHWAYS AND ADVANCED TOPICS

Module 15: Career Development

Lesson 15.1: Data Analyst Career Paths

Entry-Level Analyst

  • 0-2 years experience
  • Focus: Data collection, cleaning, basic reporting
  • Salary: ₹3-6 lakhs annually
  • Skills: Excel, SQL, basic Python/R

Mid-Level Analyst

  • 2-5 years experience
  • Focus: Advanced analysis, dashboards, project leadership
  • Salary: ₹6-12 lakhs
  • Skills: SQL, Python, Power BI
