Skip to content

Exercise 9: Complex Queries 🧩

🎯 Objective

Master advanced SQL patterns for complex business requirements and analytics.

📋 Problem Statement

Real-world applications require sophisticated queries combining multiple techniques. This exercise integrates everything you've learned.

📥 Tasks

Task 1: Pivot Table - Sales by Category and Month

Create a pivot table showing total sales by category (rows) and month (columns) for 2024.

Task 2: Cohort Analysis

Calculate customer retention: For each customer's first order month, show how many ordered again in subsequent months.

Task 3: Running Difference

Calculate month-over-month sales growth (absolute and percentage).

Task 4: Top N per Group with Ties

Find top 3 products by sales in each category, including ties.

Task 5: Gap and Island Detection

Find consecutive date ranges where orders were placed daily (no gaps).

Task 6: Conditional Aggregation

Create a summary showing: total orders, total revenue, average order value, and percentage of orders above $100.

💡 Hints

Hint 1: Pivot Table Use CASE statements with aggregation: ```sql SELECT category, SUM(CASE WHEN month = 1 THEN sales ELSE 0 END) AS jan, SUM(CASE WHEN month = 2 THEN sales ELSE 0 END) AS feb FROM ... GROUP BY category; ```
Hint 2: Cohort Analysis 1. Find each customer's first order date 2. Calculate months since first order for each subsequent order 3. Count customers active in each cohort month
Hint 3: Gap Detection Use LAG() to compare consecutive dates, identify gaps where difference > 1 day.

Solutions

Click to reveal solutions

Solution 1: Pivot Table - Sales by Category and Month

sql
WITH monthly_sales AS (
  SELECT 
    p.category,
    strftime('%m', o.order_date) AS month,
    SUM(o.quantity * p.price) AS sales
  FROM orders o
  JOIN products p ON o.product_id = p.product_id
  WHERE strftime('%Y', o.order_date) = '2024'
  GROUP BY p.category, strftime('%m', o.order_date)
)
SELECT 
  category,
  SUM(CASE WHEN month = '01' THEN sales ELSE 0 END) AS Jan,
  SUM(CASE WHEN month = '02' THEN sales ELSE 0 END) AS Feb,
  SUM(CASE WHEN month = '03' THEN sales ELSE 0 END) AS Mar,
  SUM(CASE WHEN month = '04' THEN sales ELSE 0 END) AS Apr,
  SUM(CASE WHEN month = '05' THEN sales ELSE 0 END) AS May,
  SUM(CASE WHEN month = '06' THEN sales ELSE 0 END) AS Jun,
  SUM(CASE WHEN month = '07' THEN sales ELSE 0 END) AS Jul,
  SUM(CASE WHEN month = '08' THEN sales ELSE 0 END) AS Aug,
  SUM(CASE WHEN month = '09' THEN sales ELSE 0 END) AS Sep,
  SUM(CASE WHEN month = '10' THEN sales ELSE 0 END) AS Oct,
  SUM(CASE WHEN month = '11' THEN sales ELSE 0 END) AS Nov,
  SUM(CASE WHEN month = '12' THEN sales ELSE 0 END) AS Dec,
  SUM(sales) AS Total
FROM monthly_sales
GROUP BY category
ORDER BY Total DESC;

Result:

category     | Jan    | Feb    | Mar | ... | Total
-------------|--------|--------|-----|-----|--------
Electronics  | 3899.95| 2599.96| ... | ... | 15000.00
Furniture    | 1599.96| 999.98 | ... | ... | 8000.00

Explanation:

  1. CTE calculates sales per category-month
  2. CASE statements pivot months into columns
  3. SUM aggregates across all months

Alternative (PostgreSQL - CROSSTAB):

sql
SELECT * FROM crosstab(
  'SELECT category, month, sales FROM monthly_sales ORDER BY 1,2',
  'SELECT DISTINCT month FROM monthly_sales ORDER BY 1'
) AS ct(category TEXT, Jan NUMERIC, Feb NUMERIC, ...);

Solution 2: Cohort Analysis

sql
WITH first_orders AS (
  -- Find each customer's first order date
  SELECT 
    customer_id,
    MIN(order_date) AS first_order_date,
    strftime('%Y-%m', MIN(order_date)) AS cohort_month
  FROM orders
  GROUP BY customer_id
),
order_months AS (
  -- Calculate months since first order for each order
  SELECT 
    o.customer_id,
    fo.cohort_month,
    strftime('%Y-%m', o.order_date) AS order_month,
    (strftime('%Y', o.order_date) - strftime('%Y', fo.first_order_date)) * 12 +
    (strftime('%m', o.order_date) - strftime('%m', fo.first_order_date)) AS months_since_first
  FROM orders o
  JOIN first_orders fo ON o.customer_id = fo.customer_id
)
SELECT 
  cohort_month,
  COUNT(DISTINCT CASE WHEN months_since_first = 0 THEN customer_id END) AS Month_0,
  COUNT(DISTINCT CASE WHEN months_since_first = 1 THEN customer_id END) AS Month_1,
  COUNT(DISTINCT CASE WHEN months_since_first = 2 THEN customer_id END) AS Month_2,
  COUNT(DISTINCT CASE WHEN months_since_first = 3 THEN customer_id END) AS Month_3,
  -- Calculate retention rates
  ROUND(100.0 * COUNT(DISTINCT CASE WHEN months_since_first = 1 THEN customer_id END) / 
        NULLIF(COUNT(DISTINCT CASE WHEN months_since_first = 0 THEN customer_id END), 0), 2) AS Retention_Month_1
FROM order_months
GROUP BY cohort_month
ORDER BY cohort_month;

Result:

cohort_month | Month_0 | Month_1 | Month_2 | Month_3 | Retention_Month_1
-------------|---------|---------|---------|---------|------------------
2024-01      | 100     | 45      | 30      | 20      | 45.00%
2024-02      | 80      | 35      | 25      | NULL    | 43.75%
2024-03      | 90      | 40      | NULL    | NULL    | 44.44%

Explanation:

  • Month_0: Customers who made first order in cohort month
  • Month_1: Customers who ordered again 1 month later
  • Retention_Month_1: Percentage who returned in month 1

Business Insight: Track customer retention over time to measure product-market fit.


Solution 3: Running Difference (Month-over-Month Growth)

sql
WITH monthly_sales AS (
  SELECT 
    strftime('%Y-%m', order_date) AS month,
    SUM(quantity * (SELECT price FROM products WHERE product_id = orders.product_id)) AS total_sales
  FROM orders
  GROUP BY strftime('%Y-%m', order_date)
),
sales_with_previous AS (
  SELECT 
    month,
    total_sales,
    LAG(total_sales) OVER (ORDER BY month) AS previous_month_sales
  FROM monthly_sales
)
SELECT 
  month,
  total_sales,
  previous_month_sales,
  total_sales - previous_month_sales AS absolute_growth,
  ROUND(100.0 * (total_sales - previous_month_sales) / NULLIF(previous_month_sales, 0), 2) AS growth_percentage
FROM sales_with_previous
ORDER BY month;

Result:

month   | total_sales | previous_month_sales | absolute_growth | growth_percentage
--------|-------------|----------------------|-----------------|------------------
2024-01 | 5000.00     | NULL                 | NULL            | NULL
2024-02 | 6500.00     | 5000.00              | 1500.00         | 30.00%
2024-03 | 5800.00     | 6500.00              | -700.00         | -10.77%
2024-04 | 7200.00     | 5800.00              | 1400.00         | 24.14%

Explanation:

  1. Calculate total sales per month
  2. Use LAG() to get previous month's sales
  3. Calculate absolute and percentage growth

Key Insight: Negative growth in March indicates seasonal dip or issue to investigate.


Solution 4: Top N per Group with Ties

sql
WITH product_sales AS (
  SELECT 
    p.category,
    p.product_name,
    SUM(o.quantity * p.price) AS total_sales
  FROM orders o
  JOIN products p ON o.product_id = p.product_id
  GROUP BY p.category, p.product_id, p.product_name
),
ranked_products AS (
  SELECT 
    category,
    product_name,
    total_sales,
    RANK() OVER (PARTITION BY category ORDER BY total_sales DESC) AS sales_rank
  FROM product_sales
)
SELECT 
  category,
  product_name,
  total_sales,
  sales_rank
FROM ranked_products
WHERE sales_rank <= 3
ORDER BY category, sales_rank;

Result:

category     | product_name          | total_sales | sales_rank
-------------|-----------------------|-------------|------------
Electronics  | Laptop Pro 15         | 5199.96     | 1
Electronics  | Monitor 27"           | 349.99      | 2
Electronics  | Keyboard Mechanical   | 359.96      | 3
Furniture    | Standing Desk         | 999.98      | 1
Furniture    | Office Chair          | 599.97      | 2
Furniture    | Desk Lamp             | 45.99       | 3

Why RANK() instead of ROW_NUMBER()?

  • RANK(): Includes ties (if two products have same sales, both get rank 2, next is 4)
  • ROW_NUMBER(): Arbitrary ordering of ties (one gets 2, other gets 3)
  • DENSE_RANK(): Includes ties, no gaps (both get 2, next is 3)

Use Case: "Show top 3 best-selling products per category, including ties"


Solution 5: Gap and Island Detection

sql
WITH order_dates AS (
  SELECT DISTINCT DATE(order_date) AS order_date
  FROM orders
  ORDER BY order_date
),
gaps AS (
  SELECT 
    order_date,
    LAG(order_date) OVER (ORDER BY order_date) AS prev_date,
    julianday(order_date) - julianday(LAG(order_date) OVER (ORDER BY order_date)) AS days_diff
  FROM order_dates
),
islands AS (
  SELECT 
    order_date,
    SUM(CASE WHEN days_diff > 1 OR days_diff IS NULL THEN 1 ELSE 0 END) 
      OVER (ORDER BY order_date) AS island_id
  FROM gaps
)
SELECT 
  island_id,
  MIN(order_date) AS streak_start,
  MAX(order_date) AS streak_end,
  COUNT(*) AS consecutive_days
FROM islands
GROUP BY island_id
HAVING COUNT(*) >= 3  -- Only streaks of 3+ days
ORDER BY consecutive_days DESC;

Result:

island_id | streak_start | streak_end | consecutive_days
----------|--------------|------------|------------------
1         | 2024-01-15   | 2024-01-24 | 10
3         | 2024-02-01   | 2024-02-05 | 5
2         | 2024-01-28   | 2024-01-30 | 3

Explanation:

  1. Get distinct order dates
  2. Calculate days between consecutive dates (LAG)
  3. Identify gaps (days_diff > 1)
  4. Assign island_id (increments at each gap)
  5. Group by island to find streaks

Use Case: Find periods of consistent daily activity or identify gaps in data.


Solution 6: Conditional Aggregation

sql
WITH order_values AS (
  SELECT 
    o.order_id,
    o.order_date,
    SUM(o.quantity * p.price) AS order_value
  FROM orders o
  JOIN products p ON o.product_id = p.product_id
  GROUP BY o.order_id, o.order_date
)
SELECT 
  COUNT(*) AS total_orders,
  SUM(order_value) AS total_revenue,
  ROUND(AVG(order_value), 2) AS avg_order_value,
  ROUND(100.0 * SUM(CASE WHEN order_value > 100 THEN 1 ELSE 0 END) / COUNT(*), 2) AS pct_orders_above_100,
  SUM(CASE WHEN order_value > 100 THEN 1 ELSE 0 END) AS orders_above_100,
  SUM(CASE WHEN order_value <= 100 THEN 1 ELSE 0 END) AS orders_below_100,
  MIN(order_value) AS min_order,
  MAX(order_value) AS max_order,
  -- Percentiles (database-specific)
  PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY order_value) AS median_order_value,
  PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY order_value) AS p95_order_value
FROM order_values;

Result:

total_orders | total_revenue | avg_order_value | pct_orders_above_100 | orders_above_100 | orders_below_100 | min_order | max_order | median | p95
-------------|---------------|-----------------|----------------------|------------------|------------------|-----------|-----------|--------|------
10           | 8549.85       | 854.99          | 60.00%               | 6                | 4                | 12.99     | 2599.98   | 524.99 | 2299.99

Explanation:

  • Conditional aggregation: SUM(CASE WHEN...) counts rows meeting condition
  • Percentiles: Show distribution (median = 50th percentile, p95 = 95th percentile)

Business Insight: 60% of orders are high-value (>$100), indicating strong customer engagement.

🚀 Follow-up Challenges

  1. Funnel Analysis: Calculate conversion rates through multi-step process
  2. RFM Segmentation: Segment customers by Recency, Frequency, Monetary value
  3. Time Series Forecasting: Calculate moving averages and trends
  4. ABC Analysis: Classify products by cumulative revenue contribution

🔑 Key Takeaways

Complex Query Patterns

PatternUse CaseKey Technique
Pivot TableCross-tabulationCASE + SUM + GROUP BY
Cohort AnalysisRetention trackingSelf-join + Date math
Running CalculationsTrends, growthWindow functions (LAG, SUM OVER)
Top N per GroupRankingsRANK() + PARTITION BY
Gap DetectionFind missing dataLAG() + Conditional logic
Conditional AggregationComplex metricsSUM(CASE WHEN...)

Query Building Strategy

1. Break Down the Problem

  • What is the final output?
  • What intermediate steps are needed?
  • What data sources are required?

2. Build Incrementally

  • Start with simplest query
  • Add one CTE at a time
  • Test each step independently

3. Optimize Last

  • Get correct results first
  • Then optimize for performance
  • Use EXPLAIN to identify bottlenecks

Advanced Aggregation Techniques

Conditional Counting:

sql
SUM(CASE WHEN condition THEN 1 ELSE 0 END) AS count_matching

Conditional Summing:

sql
SUM(CASE WHEN condition THEN amount ELSE 0 END) AS total_matching

Multiple Conditions:

sql
SUM(CASE 
  WHEN condition1 THEN value1
  WHEN condition2 THEN value2
  ELSE 0 
END) AS complex_metric

Percentage Calculation:

sql
100.0 * SUM(CASE WHEN condition THEN 1 ELSE 0 END) / COUNT(*) AS percentage

Date Math Patterns

Months Between Dates:

sql
(YEAR(date2) - YEAR(date1)) * 12 + (MONTH(date2) - MONTH(date1))

Days Between Dates:

sql
DATEDIFF(date2, date1)  -- MySQL
julianday(date2) - julianday(date1)  -- SQLite
date2 - date1  -- PostgreSQL

First Day of Month:

sql
DATE_TRUNC('month', date)  -- PostgreSQL
DATE_FORMAT(date, '%Y-%m-01')  -- MySQL

📚 Real-World Applications

Application 1: E-commerce Dashboard

sql
-- Daily sales summary with comparisons
WITH daily_sales AS (
  SELECT 
    DATE(order_date) AS date,
    COUNT(*) AS orders,
    SUM(quantity * price) AS revenue
  FROM orders o
  JOIN products p ON o.product_id = p.product_id
  GROUP BY DATE(order_date)
)
SELECT 
  date,
  orders,
  revenue,
  LAG(revenue, 1) OVER (ORDER BY date) AS prev_day_revenue,
  LAG(revenue, 7) OVER (ORDER BY date) AS same_day_last_week,
  AVG(revenue) OVER (ORDER BY date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS rolling_7day_avg
FROM daily_sales;

Application 2: Customer Segmentation (RFM)

sql
WITH customer_metrics AS (
  SELECT 
    customer_id,
    MAX(order_date) AS last_order_date,
    COUNT(*) AS frequency,
    SUM(quantity * price) AS monetary
  FROM orders o
  JOIN products p ON o.product_id = p.product_id
  GROUP BY customer_id
),
rfm_scores AS (
  SELECT 
    customer_id,
    julianday('now') - julianday(last_order_date) AS recency_days,
    frequency,
    monetary,
    NTILE(5) OVER (ORDER BY julianday('now') - julianday(last_order_date) DESC) AS recency_score,
    NTILE(5) OVER (ORDER BY frequency) AS frequency_score,
    NTILE(5) OVER (ORDER BY monetary) AS monetary_score
  FROM customer_metrics
)
SELECT 
  customer_id,
  recency_score || frequency_score || monetary_score AS rfm_segment,
  CASE 
    WHEN recency_score >= 4 AND frequency_score >= 4 THEN 'Champions'
    WHEN recency_score >= 3 AND frequency_score >= 3 THEN 'Loyal'
    WHEN recency_score >= 4 AND frequency_score <= 2 THEN 'Promising'
    WHEN recency_score <= 2 THEN 'At Risk'
    ELSE 'Other'
  END AS segment_name
FROM rfm_scores;

Application 3: Inventory Reorder Alert

sql
WITH product_velocity AS (
  SELECT 
    p.product_id,
    p.product_name,
    p.stock_quantity,
    AVG(o.quantity) AS avg_daily_sales
  FROM products p
  LEFT JOIN orders o ON p.product_id = o.product_id
    AND o.order_date >= DATE('now', '-30 days')
  GROUP BY p.product_id, p.product_name, p.stock_quantity
)
SELECT 
  product_name,
  stock_quantity,
  ROUND(avg_daily_sales, 2) AS avg_daily_sales,
  ROUND(stock_quantity / NULLIF(avg_daily_sales, 0), 1) AS days_of_stock,
  CASE 
    WHEN stock_quantity / NULLIF(avg_daily_sales, 0) < 7 THEN 'URGENT'
    WHEN stock_quantity / NULLIF(avg_daily_sales, 0) < 14 THEN 'WARNING'
    ELSE 'OK'
  END AS reorder_status
FROM product_velocity
WHERE avg_daily_sales > 0
ORDER BY days_of_stock;

⚠️ Common Pitfalls

⚠️ Cạm bẫy

Pitfall 1: Division by Zero

sql
-- WRONG: Crashes if denominator is 0
SELECT revenue / orders AS avg_order_value;

-- CORRECT: Use NULLIF
SELECT revenue / NULLIF(orders, 0) AS avg_order_value;

⚠️ Cạm bẫy

Pitfall 2: Date Comparison with Time

sql
-- WRONG: Misses orders on 2024-01-31 after midnight
SELECT * FROM orders WHERE order_date < '2024-02-01';

-- CORRECT: Use DATE() or explicit time
SELECT * FROM orders WHERE DATE(order_date) < '2024-02-01';

⚠️ Cạm bẫy

Pitfall 3: Aggregating Aggregates

sql
-- WRONG: Can't nest aggregates
SELECT category, AVG(SUM(price)) FROM products GROUP BY category;

-- CORRECT: Use subquery or CTE
WITH category_totals AS (
  SELECT category, SUM(price) AS total
  FROM products GROUP BY category
)
SELECT AVG(total) FROM category_totals;