Case 1 · Annual Telecom API Peak Load Analysis (Hadoop / Hive)
Project summary (TL;DR)
- Industry: Telecom operator, ~6.5M active / billing subscribers
- Problem: Deliver a full-year, minute-level API peak load analysis within 48 hours without impacting production
- Challenge: ~15.3B rows of logs; a traditional Oracle-based approach is not feasible in terms of time and capacity
- Approach: Hadoop + Hive with multi-cluster parallel processing (aggressive scale-out from 1 to 5 clusters)
- Outcome: Core processing finished in ~8 hours, zero production incidents, approved by head office and later promoted as a standard approach in multiple provinces
This case comes from a real telecom production environment: a province with roughly 6.5 million active / billing subscribers and more than 42 million API calls per day. The business team required a full-year, minute-level peak analysis to be delivered within 48 hours. The traditional Oracle-based solution was not viable on any of the three axes that mattered (time, capacity, production risk), and we ultimately delivered on schedule by using a Hadoop + Hive multi-cluster scale-out design.
Context
This project was part of the operator’s annual “interconnection evaluation” at the provincial level. The goal was to deliver a minute-level peak-load report for all APIs within two days, without impacting the live production system. The Oracle-based approach was not viable across time, capacity, and production risk, so we moved to a Hadoop + Hive multi-cluster design and used an incremental, aggressive scale-out strategy to complete the core computation in about 8 hours.
1. Problem: Why did we have to compute one year of peaks in two days?
In this province, there were roughly 6.5 million active / billing subscribers, and the back-end generated more than 42 million API calls per day, including:
- Plan / package lookup
- Data balance queries
- Top-up / recharge history
- Customer validation checks
- Loyalty / points redemption
- Customer profile lookup
- … and 100+ other core APIs
At year-end, the client raised an urgent request:
Deliver a full-year, minute-level peak load report for every API within 2 days.
The online system only retained the most recent 3 months of logs; older logs were archived as compressed files on a tertiary storage server. Completing a full-year, minute-level peak analysis within 48 hours was essentially impossible with a traditional architecture.
2. Why was the Oracle-based approach fundamentally not viable?
The original idea from the client was:
Pull all compressed logs back into Oracle → decompress → partition by day → run full-table scans.
After a quick assessment, the conclusion was very clear: this path does not work.
           [Is the Oracle-based approach viable?]
                              │
              ┌───────────────┴───────────────┐
              ▼                               ▼
      Time-window check            Capacity / design check
        (48h vs 182h)                   (~15.3B rows)
              │                               │
           [Fail]                          [Fail]
              └───────────────┬───────────────┘
                              ▼
       Production risk check (Tier-1 / Tier-2 servers)
                              │
                           [Fail]
                              │
                              ▼
   ⇒ Overall conclusion: the Oracle approach is not viable
(1) Time window: 182 hours vs 48 hours
Observed in testing:
- One day of logs: decompress + load into Oracle ≈ 30 minutes
For 365 days in a year:
365 × 0.5 hours ≈ 182 hours (about 7.5 days)
→ Just pulling the data back into Oracle would already blow past the 48-hour window, before any analysis had even started.
(2) Capacity / architecture: Oracle cannot hold ~15.3B rows of raw logs
Full-year log volume:
42M rows/day × 365 days ≈ 15.3B rows
Even with daily partitioning, this would lead to:
- Huge index growth
- High risk of global locks
- Exploding undo / redo volumes
- Severely degraded concurrency
- Real impact on live API services
(3) Production risk: even the Tier-2 servers could not handle it
The Tier-2 servers were already handling production tasks (points redemption, validation checks, etc.). The constraints were:
- No heavy workloads allowed outside 23:00–06:00
- Running for 180+ hours straight was not realistic
- Any failure would require a partial reload from the failure point
- We would need complex watchdog scripts for monitoring, alerting, cleanup, and restarts
→ Too risky, too fragile, and not deliverable.
3. Solution: Hadoop + Hive with multi-cluster parallel processing
After revisiting the feasibility, I proposed a completely different architecture:
Instead of pulling 15.3B rows into Oracle, we send the compressed logs directly into Hadoop / Hive and process them in parallel at the file level.
The interface team only needed to hand over the compressed archives: no loading into Oracle, no manual decompression, and no pre-processing on their side.
Standardized 5-step ETL pipeline:
- Bulk decompression: unzip / gzip -d
- Character conversion: GBK → UTF-8 (iconv)
- Load raw logs into Hive (log_tran_m1, delimited by ";")
- Clean and structure into log_tran_m2:
  - Parse timestamps
  - Extract API names
  - Mark success / failure (S = success, F = failure)
- Aggregate by "day / API / minute" to build m3 (minute-level call counts)
┌─────────────────────────────────────────┐
│ Single-cluster ETL flow (m1 → m2 → m3)  │
└────────────────────┬────────────────────┘
                     ▼
┌─────────────────────────────────────────┐
│ Step 1: Bulk decompression              │
│   unzip + gzip -d                       │
└────────────────────┬────────────────────┘
                     ▼
┌─────────────────────────────────────────┐
│ Step 2: Charset conversion              │
│   GBK → UTF-8 (iconv)                   │
└────────────────────┬────────────────────┘
                     ▼
┌─────────────────────────────────────────┐
│ Step 3: Load into Hive (m1)             │
│   LOAD DATA ...                         │
└────────────────────┬────────────────────┘
                     ▼
┌─────────────────────────────────────────┐
│ Step 4: Clean & structure into m2       │
│   • Time parsing                        │
│   • API name extraction                 │
│   • Success / failure flag              │
└────────────────────┬────────────────────┘
                     ▼
┌─────────────────────────────────────────┐
│ Step 5: Aggregate by day / API / minute │
│   (m3)                                  │
└────────────────────┬────────────────────┘
                     ▼
┌─────────────────────────────────────────┐
│ Output: minute-level API call volume    │
└─────────────────────────────────────────┘
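To make the five steps concrete, here is a minimal per-day driver sketch in shell + HiveQL. The paths, the dt partition column, and the raw column names (api_name, call_time, ret_code) are illustrative assumptions; only the tools (unzip, gzip -d, iconv, LOAD DATA) and the m1 / m2 / m3 table naming come from the flow above.

```bash
#!/usr/bin/env bash
# Per-day driver sketch for the 5-step single-cluster ETL flow above.
# Paths, the dt partition, and raw column names are assumptions.
set -euo pipefail

DAY="$1"                              # day in YYYYMMDD form
SRC="/archive/api_logs/${DAY}"        # hypothetical archive location
WORK="/data/etl/${DAY}"
mkdir -p "${WORK}" "${WORK}/utf8"

# Step 1: bulk decompression (zip and gzip archives)
find "${SRC}" -name '*.zip' -exec unzip -oq {} -d "${WORK}" \;
cp "${SRC}"/*.gz "${WORK}/" 2>/dev/null || true
gzip -df "${WORK}"/*.gz    2>/dev/null || true

# Step 2: charset conversion GBK -> UTF-8 (assumes decompressed *.log files)
for f in "${WORK}"/*.log; do
  iconv -f GBK -t UTF-8 "$f" > "${WORK}/utf8/$(basename "$f")"
done

# Steps 3-5: load raw logs into m1, clean into m2, aggregate into m3
hive -e "
  LOAD DATA LOCAL INPATH '${WORK}/utf8'
  INTO TABLE log_tran_m1 PARTITION (dt='${DAY}');

  INSERT OVERWRITE TABLE log_tran_m2 PARTITION (dt='${DAY}')
  SELECT  api_name,
          substr(call_time, 1, 16)                        AS call_minute,
          CASE WHEN ret_code = '0' THEN 'S' ELSE 'F' END  AS status
  FROM    log_tran_m1
  WHERE   dt = '${DAY}';

  INSERT OVERWRITE TABLE log_tran_m3 PARTITION (dt='${DAY}')
  SELECT  api_name, call_minute, COUNT(*) AS call_cnt
  FROM    log_tran_m2
  WHERE   dt = '${DAY}'
  GROUP BY api_name, call_minute;
"
```

One such driver runs per day of logs, which is what makes file-level parallelism across clusters straightforward later on.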
4. Key performance breakthrough: one cluster processes one day in 8 minutes
Based on actual log statistics, a single day contained roughly 43 million rows.
The measured end-to-end processing time on a single Hadoop cluster was:
Step                                    Duration (seconds)
----------------------------------------------------------
File load into Hive (LOAD)                             267
Clean & transform (m1 → m2)                            138
Minute-level peak aggregation (m3)                      72
----------------------------------------------------------
Total                                                  477 seconds ≈ 7.95 minutes
✔ One cluster processes 1 day in ≈ 8 minutes
✔ 5 days in ≈ 40 minutes
❌ Running all 365 days on a single cluster would take ~48 hours → virtually no buffer left in the 48-hour window
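For reference, per-stage wall-clock numbers like the ones in the table can be captured with a simple wrapper around the Hive calls; the .hql script names below are illustrative, not the actual project files.

```bash
#!/usr/bin/env bash
# Sketch of a per-stage timing wrapper for benchmark runs like the table above.
# load_m1.hql / clean_m2.hql / agg_m3.hql are hypothetical script names.
set -euo pipefail

time_stage() {
  local name="$1"; shift
  local start=$SECONDS
  "$@"
  echo "${name}: $(( SECONDS - start )) s"
}

time_stage "File load into Hive (LOAD)"          hive -f load_m1.hql
time_stage "Clean & transform (m1 -> m2)"        hive -f clean_m2.hql
time_stage "Minute-level peak aggregation (m3)"  hive -f agg_m3.hql
```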
5. The critical move: aggressive scale-out — cloning 4 more Hadoop clusters in 2 hours
To meet the “must deliver in two days” hard deadline, I took the most direct and effective path: horizontal scale-out — aggressively cloning more Hadoop clusters and scaling compute capacity to 5×.
Operational steps included:
- Cloning the existing Hadoop cluster image
- Renaming the master / slave nodes
- Updating IP / host mappings
- Syncing the Hive metastore
- Fixing HDFS paths
- Distributing custom MapReduce / HQL scripts
- Verifying NameNode / DataNode health and job execution
✔ All clusters were brought online in about 2 hours
✔ We scaled from 1 cluster to 5 clusters in a single burst
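For illustration, the per-clone fix-up work can be sketched roughly as follows. Every hostname, IP, and path here is a placeholder rather than the real environment, and the actual scripts covered more cases (metastore sync, HDFS path fixes, slaves files).

```bash
#!/usr/bin/env bash
# Post-clone fix-up sketch for one cloned cluster master node.
# Hostnames, IPs, and config paths are placeholders.
set -euo pipefail

NEW_MASTER="hdp-clone-a-master"   # hypothetical name for the cloned master
NEW_IP="10.0.1.21"                # hypothetical address
OLD_MASTER="hdp-orig-master"      # hypothetical name baked into the image

# 1) Rename the node and refresh host mappings
hostnamectl set-hostname "${NEW_MASTER}"
grep -q "${NEW_MASTER}" /etc/hosts || echo "${NEW_IP} ${NEW_MASTER}" >> /etc/hosts

# 2) Point Hadoop / Hive configs at the new master
sed -i "s/${OLD_MASTER}/${NEW_MASTER}/g" /etc/hadoop/conf/core-site.xml
sed -i "s/${OLD_MASTER}/${NEW_MASTER}/g" /etc/hive/conf/hive-site.xml

# 3) Pull the shared HQL / shell ETL scripts from the original master
scp -r "${OLD_MASTER}:/opt/etl_scripts" /opt/

# 4) Verify NameNode / DataNode health and that Hive answers
hdfs dfsadmin -report | head -n 20
hive -e "SHOW DATABASES;"
```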
Multi-cluster parallel “burst” scale-out
                     Original Hadoop cluster (1)
                                  │
               ┌──────────────────┴──────────────────┐
               ▼                                     ▼
  Clone cluster A (strongest)            Clone cluster B (strong)
               │                                     │
               ▼                                     ▼
   Clone cluster C (medium)              Clone cluster D (weaker)
               │                                     │
               └──────────────────┬──────────────────┘
                                  ▼
             ┌─────────────────────────────────────────┐
             │  Final cluster count: 5 (capacity × 5)  │
             └─────────────────────────────────────────┘
                                  ▼
                Distribute files by cluster capacity
                                  ▼
┌────────────┐ ┌────────────┐ ┌────────────┐ ┌────────────┐ ┌────────────┐
│ Cluster A  │ │ Cluster B  │ │ Cluster C  │ │ Cluster D  │ │ Original   │
│ more days  │ │ more days  │ │ medium     │ │ fewer days │ │ medium     │
└──────┬─────┘ └──────┬─────┘ └──────┬─────┘ └──────┬─────┘ └──────┬─────┘
       ▼              ▼              ▼              ▼              ▼
       └────→ Parallel processing (~8 minutes / day / cluster) ←───┘
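A minimal sketch of the capacity-weighted shard allocation idea follows; the cluster names and weights are made up for illustration and do not reflect the real topology.

```bash
#!/usr/bin/env bash
# Sketch: hand out the 365 day-archives in proportion to cluster capacity.
# Cluster names and relative weights are illustrative only.
set -euo pipefail

CLUSTERS=("cluster-a:30" "cluster-b:25" "cluster-c:20" "cluster-d:10" "original:15")

TOTAL_WEIGHT=0
for c in "${CLUSTERS[@]}"; do
  TOTAL_WEIGHT=$(( TOTAL_WEIGHT + ${c#*:} ))
done

day=1
for c in "${CLUSTERS[@]}"; do
  name=${c%%:*}
  weight=${c#*:}
  share=$(( 365 * weight / TOTAL_WEIGHT ))        # integer share of days
  echo "${name}: day ${day} .. $(( day + share - 1 ))"
  day=$(( day + share ))
done

# Leftover days from integer rounding go to the strongest cluster
if (( day <= 365 )); then
  echo "cluster-a (extra): day ${day} .. 365"
fi
```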
6. Final performance: 5 days (~210M rows) processed in 8 minutes
With the multi-cluster parallel design:
- Each cluster still processed about one day in ~8 minutes
- Shard scheduling greatly improved overall throughput
- File-level parallelism reduced wait time to almost zero
Measured result:
5 days (~210M API records) → fully processed in roughly 8 minutes.
Final aggregation and delivery flow
All clusters output m2 / m3 → minute-level aggregation → Hive summary tables → export to Excel → client analysis
            │                                │
            ▼                                ▼
╔══════════════════════╗      ╔════════════════════════════╗
║   minute_level_log   ║      ║  final_peak_report_daily   ║
╚══════════════════════╝      ╚════════════════════════════╝
            │                                │
            └────────→ Excel report ←────────┘
                            │
                            ▼
Delivered artefacts: peak minutes, Top 10 high-risk APIs, time-series charts
The final output was a minute-level dataset exported to Excel, which the front-line team could immediately use for trend analysis and capacity planning.
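As a sketch of that final roll-up: the per-API daily peak could be derived from the consolidated minute-level table roughly as below, then dumped as a CSV that opens directly in Excel. Only the table names minute_level_log and final_peak_report_daily come from the flow above; the column names are assumptions.

```bash
#!/usr/bin/env bash
# Sketch: build final_peak_report_daily (peak minute per API per day) from the
# consolidated minute-level table, then export an Excel-friendly CSV.
# Column names (api_name, dt, call_minute, call_cnt, ...) are assumptions.
set -euo pipefail

hive -e "
  INSERT OVERWRITE TABLE final_peak_report_daily
  SELECT  m.api_name,
          m.dt,
          m.call_minute AS peak_minute,
          m.call_cnt    AS peak_calls
  FROM    minute_level_log m
  JOIN   (SELECT api_name, dt, MAX(call_cnt) AS max_cnt
          FROM   minute_level_log
          GROUP  BY api_name, dt) x
    ON    m.api_name = x.api_name
   AND    m.dt       = x.dt
   AND    m.call_cnt = x.max_cnt;   -- ties keep multiple rows, acceptable here
"

# Export as CSV for the Excel-based report
hive -e "SET hive.cli.print.header=true;
         SELECT * FROM final_peak_report_daily;" \
  | sed 's/\t/,/g' > final_peak_report_daily.csv
```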
7. Deliverables
The “Annual API Peak Load Analysis” report included:
- Minute-level peak call volume for every API
- Full-year peak load for each API
- A list of the Top 10 high-risk APIs
- Detection of high-frequency, sustained-call APIs
- Daily / weekly / hourly record-volume time series
- Success rate, failure rate, and exception share
- Correlation analysis between business peaks and system peaks
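The "Top 10 high-risk APIs" item above, for example, could be pulled straight from the daily peak table with a query like this sketch (again assuming the hypothetical column layout from the previous example):

```bash
#!/usr/bin/env bash
# Sketch: Top 10 high-risk APIs = the APIs with the highest single-minute peak
# anywhere in the year, based on the assumed final_peak_report_daily layout.
hive -e "
  SELECT   api_name,
           MAX(peak_calls) AS yearly_peak_per_minute
  FROM     final_peak_report_daily
  GROUP BY api_name
  ORDER BY yearly_peak_per_minute DESC
  LIMIT 10;
"
```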
The client passed head-office review on the very next day. Feedback from head office:
“This is the most complete, most accurate, and fastest-produced peak-load analysis report we’ve ever had for provincial interconnection.”
8. Final outcome
Approach                  Per-day runtime   Full-year runtime
--------------------------------------------------------------
Original (Oracle)         30 minutes        182 hours (7.5 days)
Single Hadoop cluster     8 minutes         ~48 hours (still tight)
5 clusters in parallel    1.6 minutes       ≈ 10 hours for the core flow
Including communication and deployment:
- End-to-end communication & solution sign-off: ~1.5 hours
- Multi-cluster deployment (burst scale-out): ~2 hours
- Shard scheduling and computation: ~10 hours
9. My role and contributions
In this project, my main responsibilities were:
- Leading the assessment and formally rejecting the Oracle approach, based on time, capacity, and production risk
- Designing and implementing the Hadoop + Hive multi-cluster solution, including the ETL pipeline, shell scripts, and scheduling logic
- Personally running the performance benchmark and proving the single-cluster throughput of ~8 minutes per day
- Owning the multi-cluster “burst” deployment and shard-allocation strategy, assigning different day ranges to clusters based on capacity
- Defining the final analysis-report structure and working with the front-line team to complete the Excel / report delivery and interpret the findings