
Case 1 · Annual Telecom API Peak Load Analysis (Hadoop / Hive)

Project summary (TL;DR)

  • Industry: Telecom operator, ~6.5M active / billing subscribers
  • Problem: Deliver a full-year, minute-level API peak load analysis within 48 hours without impacting production
  • Challenge: ~15.3B rows of logs; a traditional Oracle-based approach is not feasible in terms of time and capacity
  • Approach: Hadoop + Hive with multi-cluster parallel processing (aggressive scale-out from 1 to 5 clusters)
  • Outcome: Core processing finished in ~8 hours, zero production incidents, approved by head office and later promoted as a standard approach in multiple provinces

This case comes from a real telecom production environment: a province with roughly 6.5 million active / billing subscribers and more than 42 million API calls per day. The business team required a full-year, minute-level peak analysis to be delivered within 48 hours. The traditional Oracle-based solution failed on all three fronts (time, capacity, and production risk), and we ultimately delivered on time with a Hadoop + Hive multi-cluster scale-out design.

Tech Stack: Oracle · Hadoop · Hive · Shell / ETL
Focus: API log analysis · Peak load evaluation · Big data parallel computing · Production risk control

Context

This project was part of the operator’s annual “interconnection evaluation” at the provincial level. The goal was to deliver a minute-level peak-load report for all APIs within two days, without impacting the live production system. The Oracle-based approach was not viable across time, capacity, and production risk, so we moved to a Hadoop + Hive multi-cluster design and used an incremental, aggressive scale-out strategy to complete the core computation in about 8 hours.

1. Problem: Why did we have to compute one year of peaks in two days?

In this province, there were roughly 6.5 million active / billing subscribers, and the back-end generated more than 42 million API calls per day, including:

  • Plan / package lookup
  • Data balance queries
  • Top-up / recharge history
  • Customer validation checks
  • Loyalty / points redemption
  • Customer profile lookup
  • … and 100+ other core APIs

At year-end, the client raised an urgent request:

Deliver a full-year, minute-level peak load report for every API within 2 days.

The online system only retained the most recent 3 months of logs; older logs were archived as compressed files on a tertiary storage server. Completing a full-year, minute-level peak analysis within 48 hours was essentially impossible with a traditional architecture.

2. Why was the Oracle-based approach fundamentally not viable?

The original idea from the client was:

Pull all compressed logs back into Oracle → decompress → partition by day → run full-table scans.

After a quick assessment, the conclusion was very clear: this path does not work.

       [Is the Oracle-based approach viable?]
                          │
         ┌────────────────┴────────────────┐
         ▼                                 ▼
 Time-window check              Capacity / design check
   (48h vs 182h)                     (~15.3B rows)
         │                                 │
      [Fail]                            [Fail]
         └────────────────┬────────────────┘
                          ▼
   Production risk check (Tier-1 / Tier-2 servers)
                          │
                       [Fail]
                          ▼
  ⇒ Overall conclusion: the Oracle approach is not viable

(1) Time window: 182 hours vs 48 hours

Observed in testing:

  • One day of logs: decompress + load into Oracle ≈ 30 minutes

For 365 days in a year:

365 × 0.5 hours ≈ 182 hours (about 7.5 days)

→ Just pulling data back into Oracle would already exceed the 48-hour window, without any analysis yet.

(2) Capacity / architecture: Oracle cannot hold ~15.3B rows of raw logs

Full-year log volume:
42M rows/day × 365 days ≈ 15.3B rows
            

Even with daily partitioning, this would lead to:

  • Huge index growth
  • High risk of global locks
  • Exploding undo / redo volumes
  • Severely degraded concurrency
  • Real impact on live API services

(3) Production risk: even the Tier-2 servers could not handle it

The Tier-2 servers were already handling production tasks (points redemption, validation checks, etc.). The constraints were:

  • No heavy workloads allowed outside 23:00–06:00
  • Running for 180+ hours straight was not realistic
  • Any failure would require a partial reload from the failure point
  • We would need complex watchdog scripts for monitoring, alerting, cleanup, and restarts

→ Too risky, too fragile, and not deliverable.

3. Solution: Hadoop + Hive with multi-cluster parallel processing

After revisiting the feasibility, I proposed a completely different architecture:

Instead of pulling 15.3B rows into Oracle, we send the compressed logs directly into Hadoop / Hive and process them in parallel at the file level.

The interface team only needed to hand over the compressed archives: no loading into Oracle, no manual decompression, and no pre-processing on their side.

Standardized 5-step ETL pipeline:

  1. Bulk decompression: unzip / gzip -d
  2. Character conversion: GBK → UTF-8 (iconv)
  3. Load raw logs into Hive (log_tran_m1, delimited by ;)
  4. Clean and structure into log_tran_m2:
    • Parse timestamps
    • Extract API names
    • Mark success / failure (S = success, F = failure)
  5. Aggregate by “day / API / minute” to build m3 (minute-level call counts)

┌─────────────────────────────┐
│   Single-cluster ETL flow   │
│       (m1 → m2 → m3)        │
└──────────────┬──────────────┘
               ▼
┌─────────────────────────────┐
│ Step 1: Bulk decompression  │
│   unzip + gzip -d           │
└──────────────┬──────────────┘
               ▼
┌─────────────────────────────┐
│ Step 2: Charset conversion  │
│   GBK → UTF-8 (iconv)       │
└──────────────┬──────────────┘
               ▼
┌─────────────────────────────┐
│ Step 3: Load into Hive (m1) │
│   LOAD DATA ...             │
└──────────────┬──────────────┘
               ▼
┌─────────────────────────────┐
│ Step 4: Clean & structure   │
│   into m2                   │
│ • Time parsing              │
│ • API name extraction       │
│ • Success / failure flag    │
└──────────────┬──────────────┘
               ▼
┌─────────────────────────────┐
│ Step 5: Aggregate by day /  │
│   API / minute (m3)         │
└──────────────┬──────────────┘
               ▼
┌─────────────────────────────┐
│ Output: minute-level API    │
│   call volume               │
└─────────────────────────────┘
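
To make the five steps concrete, below is a minimal single-cluster sketch of the pipeline, written as a shell driver around the Hive CLI. The table names log_tran_m1 / m2 / m3 follow the flow above, but the archive layout, the raw column names (api_name, call_time, ret_code), and the log_day partition key are illustrative assumptions, not the production schema.

#!/bin/bash
# run_one_day.sh <YYYYMMDD> -- single-cluster ETL for one day of API logs (sketch).
# Assumes one compressed archive per day under $ARCHIVE_DIR and pre-created,
# partitioned Hive tables log_tran_m1 / m2 / m3 (column names are illustrative).
set -euo pipefail

DAY="$1"                              # e.g. 20230115
ARCHIVE_DIR=/data/archive
WORK_DIR="/data/work/${DAY}"
mkdir -p "$WORK_DIR/utf8"

# Step 1: bulk decompression (zip archive, falling back to a gzip archive)
unzip -o "$ARCHIVE_DIR/api_log_${DAY}.zip" -d "$WORK_DIR" 2>/dev/null || \
    gzip -dkc "$ARCHIVE_DIR/api_log_${DAY}.log.gz" > "$WORK_DIR/api_log_${DAY}.log"

# Step 2: character-set conversion GBK -> UTF-8
for f in "$WORK_DIR"/*.log; do
    iconv -f GBK -t UTF-8 "$f" -o "$WORK_DIR/utf8/$(basename "$f")"
done

# Step 3: load the raw ';'-delimited logs into the Hive staging table (m1)
hive -e "LOAD DATA LOCAL INPATH '${WORK_DIR}/utf8'
         INTO TABLE log_tran_m1 PARTITION (log_day='${DAY}');"

# Steps 4-5: clean/structure into m2, then aggregate per day / API / minute into m3
hive -e "
INSERT OVERWRITE TABLE log_tran_m2 PARTITION (log_day='${DAY}')
SELECT  api_name,
        call_time,
        CASE WHEN ret_code = '0' THEN 'S' ELSE 'F' END AS result_flag
FROM    log_tran_m1
WHERE   log_day = '${DAY}';

INSERT OVERWRITE TABLE log_tran_m3 PARTITION (log_day='${DAY}')
SELECT  api_name,
        substr(call_time, 1, 16)         AS call_minute,   -- yyyy-MM-dd HH:mm
        count(*)                         AS call_cnt,
        sum(IF(result_flag = 'S', 1, 0)) AS success_cnt
FROM    log_tran_m2
WHERE   log_day = '${DAY}'
GROUP BY api_name, substr(call_time, 1, 16);
"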
            

4. Key performance breakthrough: one cluster processes one day in 8 minutes

Based on real log statistics, one day of logs contained roughly 43 million rows.

The measured end-to-end processing time on a single Hadoop cluster was:

Step                         Duration (seconds)
File load into Hive (LOAD)          267
Clean & transform (m1 → m2)         138
Minute-level peak aggregation (m3)   72
----------------------------------------
Total:                              477 seconds ≈ 7.95 minutes
            

✔ One cluster processes 1 day in ≈ 8 minutes
✔ 5 days in ≈ 40 minutes
❌ Running all 365 days on a single cluster would take ≈ 48 hours (365 × 477 s ≈ 174,000 s), leaving virtually no buffer in the 48-hour window

5. The critical move: aggressive scale-out — cloning 4 more Hadoop clusters in 2 hours

To meet the “must deliver in two days” hard deadline, I took the most direct and effective path: horizontal scale-out — aggressively cloning more Hadoop clusters and scaling compute capacity to 5×.

Operational steps included:

  • Cloning the existing Hadoop cluster image
  • Renaming the master / slave nodes
  • Updating IP / host mappings
  • Syncing the Hive metastore
  • Fixing HDFS paths
  • Distributing custom MapReduce / HQL scripts
  • Verifying NameNode / DataNode health and job execution

✔ All clusters were brought online in about 2 hours
✔ We scaled from 1 cluster to 5 clusters in a single burst

Multi-cluster parallel “burst” scale-out

                Original Hadoop cluster (1)
                            │
        ┌───────────────────┴───────────────────┐
        ▼                                       ▼
   Clone cluster A (strongest)            Clone cluster B (strong)
        │                                       │
        ▼                                       ▼
   Clone cluster C (medium)               Clone cluster D (weaker)
        │                                       │
        ▼                                       ▼
        ┌───────────────────────────────────────┐
        │ Final cluster count: 5 (capacity × 5) │
        └───────────────────┬───────────────────┘
                            ▼
          Distribute files by cluster capacity
                            ▼
┌───────────┐ ┌───────────┐ ┌───────────┐ ┌───────────┐ ┌───────────┐
│ Cluster A │ │ Cluster B │ │ Cluster C │ │ Cluster D │ │ Original  │
│ more days │ │ more days │ │ medium    │ │ fewer days│ │ medium    │
└─────┬─────┘ └─────┬─────┘ └─────┬─────┘ └─────┬─────┘ └─────┬─────┘
      ▼             ▼             ▼             ▼             ▼
          Parallel processing (~8 minutes / day / cluster)
            

6. Final performance: 5 days (~210M rows) processed in 8 minutes

With the multi-cluster parallel design:

  • Each cluster still processed about one day in ~8 minutes
  • Shard scheduling greatly improved overall throughput
  • File-level parallelism reduced wait time to almost zero

Measured result:

5 days (~210M API records) → fully processed in roughly 8 minutes.
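
The shard scheduling itself can be captured in a small dispatcher, sketched below under several assumptions: the edge-node host names and capacity weights are hypothetical, GNU date is available, and the per-day ETL script from section 3 (run_one_day.sh) is deployed on every cluster under a hypothetical /opt/etl path. In the real project the day ranges were assigned by hand based on each clone's hardware.

#!/bin/bash
# dispatch_days.sh -- split the 365 day-shards across clusters by rough capacity (sketch).
# Assumes passwordless SSH to each cluster's edge node, GNU date, and the per-day
# ETL script deployed on every cluster as /opt/etl/run_one_day.sh.
set -euo pipefail

HOSTS=(cluster-a cluster-b cluster-c cluster-d cluster-orig)   # hypothetical edge nodes
WEIGHTS=(3 3 2 1 2)                                            # rough relative capacity
START_DATE=2023-01-01                                          # first day of the analysis year
mkdir -p logs

# Build the list of day keys (yyyymmdd) for the full year
ALL_DAYS=()
for i in $(seq 0 364); do
    ALL_DAYS+=( "$(date -d "${START_DATE} +${i} day" +%Y%m%d)" )
done

TOTAL_WEIGHT=0
for w in "${WEIGHTS[@]}"; do TOTAL_WEIGHT=$(( TOTAL_WEIGHT + w )); done

# Hand each cluster a contiguous slice proportional to its weight; the last
# cluster also picks up the remainder left by integer division.
start=0
for idx in "${!HOSTS[@]}"; do
    host=${HOSTS[$idx]}
    if [ "$idx" -eq $(( ${#HOSTS[@]} - 1 )) ]; then
        share=$(( ${#ALL_DAYS[@]} - start ))
    else
        share=$(( ${#ALL_DAYS[@]} * WEIGHTS[idx] / TOTAL_WEIGHT ))
    fi
    days=( "${ALL_DAYS[@]:start:share}" )
    start=$(( start + share ))

    # Each cluster works through its own day range sequentially; clusters run in parallel.
    (
        for day in "${days[@]}"; do
            ssh "$host" "bash /opt/etl/run_one_day.sh ${day}" >> "logs/${host}.log" 2>&1
        done
    ) &
done
wait
echo "All day-shards processed."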

Final aggregation and delivery flow

All clusters output m2 / m3 → minute-level aggregation → Hive summary tables → export to Excel → client analysis
        │                                      │
        ▼                                      ▼
╔══════════════════════╗      ╔════════════════════════════╗
║   minute_level_log   ║      ║   final_peak_report_daily  ║
╚══════════════════════╝      ╚════════════════════════════╝
        │                                      │
        └──────────────→        Excel report        ←──────────────┘
                                       │
                                       ▼
      Delivered artefacts: peak minutes, Top 10 high-risk APIs, time-series charts
            

The final output was a minute-level dataset exported to Excel, which the front-line team could immediately use for trend analysis and capacity planning.
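
As a rough illustration of that last aggregation step, the sketch below (in the same shell style as the earlier sketches) keeps the busiest minute per API per day and dumps the summary table as CSV for the Excel report. The table and column names (minute_level_log with api_name / call_minute / call_cnt, and final_peak_report_daily) follow the flow above, but their exact schemas are assumptions.

#!/bin/bash
# build_peak_report.sh -- derive the daily peak minute per API and export it for Excel (sketch).
set -euo pipefail

hive -e "
-- For each API and day, keep the single busiest minute (the 'peak minute')
INSERT OVERWRITE TABLE final_peak_report_daily
SELECT  api_name, call_day, call_minute AS peak_minute, call_cnt AS peak_calls_per_minute
FROM (
    SELECT  api_name,
            substr(call_minute, 1, 10) AS call_day,    -- yyyy-MM-dd
            call_minute,
            call_cnt,
            row_number() OVER (PARTITION BY api_name, substr(call_minute, 1, 10)
                               ORDER BY call_cnt DESC) AS rn
    FROM    minute_level_log
) ranked
WHERE rn = 1;
"

# Export the summary as CSV so the front-line team can open it directly in Excel
hive -e "SET hive.cli.print.header=true;
         SELECT * FROM final_peak_report_daily;" \
    | sed 's/\t/,/g' > final_peak_report_daily.csv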

7. Deliverables

The “Annual API Peak Load Analysis” report included:

  • Minute-level peak call volume for every API
  • Full-year peak load for each API
  • A list of the Top 10 high-risk APIs
  • Detection of high-frequency, sustained-call APIs
  • Daily / weekly / hourly record-volume time series
  • Success rate, failure rate, and exception share
  • Correlation analysis between business peaks and system peaks

The client passed head-office review on the very next day. Feedback from head office:

“This is the most complete, most accurate, and fastest-produced peak-load analysis report we’ve ever had for provincial interconnection.”

8. Final outcome

Approach              Per-day runtime     Full-year runtime
----------------------------------------------------------
Original (Oracle)       30 minutes        182 hours (7.5 days)
Single Hadoop cluster    8 minutes        ~48 hours (still tight)
5 clusters in parallel   1.6 minutes      ≈ 10 hours for the core flow
            

Including communication and deployment:

  • End-to-end communication & solution sign-off: ~1.5 hours
  • Multi-cluster deployment (burst scale-out): ~2 hours
  • Shard scheduling and computation: ~10 hours

  • Delivered comfortably within the 48-hour deadline
  • Zero impact on the production Oracle system
  • Very low incremental cost, reusable design — later adopted as a standard API analysis approach in multiple provinces

9. My role and contributions

In this project, my main responsibilities were:

  • Leading the assessment and formally rejecting the Oracle approach, based on time, capacity, and production risk
  • Designing and implementing the Hadoop + Hive multi-cluster solution, including the ETL pipeline, shell scripts, and scheduling logic
  • Personally running the performance benchmark and proving the single-cluster throughput of ~8 minutes per day
  • Owning the multi-cluster “burst” deployment and shard-allocation strategy, assigning different day ranges to clusters based on capacity
  • Defining the final analysis-report structure and working with the front-line team to complete the Excel / report delivery and interpret the findings