Case 1 · Turned a 7.5-Day Oracle Job into a 14-Hour Hadoop Delivery
Project Summary
- Industry context: telecom operator with roughly 6.5 million active / billing subscribers.
- Core data: API logs from online business systems, about 42 million records per day.
- Business request: deliver a full-year, minute-level peak-load analysis for all APIs within 48 hours.
- Core challenge: the Oracle-based path failed on time, capacity, and production-risk checks.
- Redesign: moved compressed logs directly into Hadoop / Hive and processed them with a multi-cluster parallel ETL design.
- Result: core processing finished in about 10 hours, with zero production impact, and the approach was later promoted as a reusable provincial analysis pattern.
This case comes from a real telecom production environment. The business team needed a full-year API peak-load report in two days, but only three months of logs were available online. Older data existed as compressed archives on tertiary storage. The key engineering decision was to avoid loading 15.3B raw records back into Oracle and instead use file-level parallel processing on Hadoop / Hive.
1. Business Context and Delivery Pressure
The provincial telecom environment had about 6.5 million active / billing subscribers and generated more than 42 million API calls per day. These calls covered plan lookup, data-balance queries, recharge history, customer validation, loyalty redemption, customer profile lookup, and more than 100 other core APIs.
At year-end, the client asked for a full-year, minute-level peak-load report for every API within 48 hours. This required reconstructing a full year of historical API activity, calculating minute-level volume, and identifying high-risk peak periods for capacity planning.
2. Why the Oracle-based Approach Was Rejected
The original idea was to pull archived compressed logs back into Oracle, decompress them, partition them by day, and run full-table analysis. A quick feasibility check showed that this path could not meet the deadline or production-safety requirements.
[Is the Oracle-based approach viable?]
                │
        ┌───────┴────────┐
        ▼                ▼
Time-window check    Capacity / design check
  (48h vs 182h)        (~15.3B rows)
        │                │
     [Fail]           [Fail]
        └───────┬────────┘
                ▼
Production risk check (Tier-1 / Tier-2 servers)
                │
             [Fail]
                ▼
⇒ Overall conclusion: the Oracle approach is not viable
Time window: testing showed that one day of logs required about 30 minutes to decompress and load into Oracle.
365 days × 0.5 hours/day ≈ 182 hours (about 7.5 days)
Data volume: the full-year log volume came to:
42M rows/day × 365 days ≈ 15.3B rows
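Both failing checks fall out of back-of-the-envelope arithmetic. A minimal Python sketch, using only the measured load rate and daily volume stated above:

```python
# Feasibility check for the Oracle reload path, using the figures from the text.
LOAD_HOURS_PER_DAY = 0.5       # measured: ~30 min to decompress + load one day
ROWS_PER_DAY = 42_000_000      # ~42M API call records per day
DAYS = 365

total_load_hours = DAYS * LOAD_HOURS_PER_DAY   # sequential reload time
total_rows = DAYS * ROWS_PER_DAY               # rows Oracle would have to hold

print(f"Reload time: {total_load_hours:.1f} h (~{total_load_hours / 24:.1f} days)")
print(f"Row count:   {total_rows / 1e9:.2f} B rows")
# Both numbers blow far past the 48-hour window.
```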
Even with daily partitioning, this design would create huge index growth, high undo / redo pressure, poor concurrency, and real risk to live API services. The Tier-2 servers were also carrying production tasks, and heavy workloads were only allowed during the 23:00–06:00 low-load window.
3. Solution: Hadoop + Hive File-level Parallel Processing
The redesigned approach sent compressed logs directly into Hadoop / Hive. The interface team only needed to provide the archived files. No Oracle reload, no manual decompression, and no pre-processing were required on their side.
- Bulk decompression: unzip / gzip -d
- Character conversion: GBK → UTF-8 using iconv
- Load raw logs into Hive table log_tran_m1
- Clean and structure records into log_tran_m2, including timestamp parsing, API-name extraction, and success / failure marking
- Aggregate by day, API, and minute to build minute-level call counts in m3
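To illustrate the m1 → m2 cleaning step, here is a minimal Python sketch. The pipe-delimited layout, field order, and "0000" success code are assumptions for the example; the production job implemented this step in HiveQL over log_tran_m1:

```python
from datetime import datetime

def clean_record(raw_line: str) -> dict:
    """Parse one raw API log line into a structured record (m1 -> m2).

    Assumed layout: timestamp|api_name|return_code|payload...
    """
    ts_str, api_name, ret_code, *_ = raw_line.rstrip("\n").split("|")
    ts = datetime.strptime(ts_str, "%Y-%m-%d %H:%M:%S")
    return {
        "day": ts.strftime("%Y-%m-%d"),
        "minute": ts.strftime("%H:%M"),      # minute bucket used later in m3
        "api": api_name,
        "success": ret_code == "0000",       # assumed success code
    }

rec = clean_record("2014-06-01 09:15:42|query_balance|0000|payload")
print(rec)
# {'day': '2014-06-01', 'minute': '09:15', 'api': 'query_balance', 'success': True}
```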
Single-cluster ETL flow
Compressed logs
│
▼
Bulk decompression
│
▼
GBK → UTF-8 conversion
│
▼
Load into Hive raw table m1
│
▼
Clean and structure into m2
│
▼
Aggregate by day / API / minute into m3
│
▼
Output minute-level API call volume
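The final step of the flow above (m2 → m3) is just a count grouped by day, API, and minute. A Python equivalent of that Hive GROUP BY, as a sketch over already-cleaned records:

```python
from collections import Counter

def aggregate_minute_peaks(records) -> Counter:
    """Count calls per (day, api, minute) -- the m2 -> m3 aggregation."""
    return Counter((r["day"], r["api"], r["minute"]) for r in records)

# Tiny illustrative sample of cleaned m2 records.
records = [
    {"day": "2014-06-01", "api": "query_balance", "minute": "09:15"},
    {"day": "2014-06-01", "api": "query_balance", "minute": "09:15"},
    {"day": "2014-06-01", "api": "recharge_hist", "minute": "09:16"},
]
m3 = aggregate_minute_peaks(records)
print(m3[("2014-06-01", "query_balance", "09:15")])  # 2
```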
4. Performance Benchmark
Based on real log statistics, one day contained roughly 43 million rows. On a single Hadoop cluster, the end-to-end processing time was about 8 minutes per day.
Step                                    Duration (seconds)
----------------------------------------------------------
File load into Hive (LOAD)              267
Clean & transform (m1 → m2)             138
Minute-level peak aggregation (m3)       72
----------------------------------------------------------
Total: 477 seconds ≈ 7.95 minutes
A single Hadoop cluster could technically process the full year in about 48 hours, but that left almost no buffer for transfer, validation, reruns, report preparation, or unexpected failures. The deadline required more capacity, not just a better query.
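The 48-hour single-cluster estimate follows directly from the per-day benchmark; checking the arithmetic:

```python
# Measured per-day step durations from the benchmark, in seconds.
STEP_SECONDS = {"load": 267, "clean": 138, "aggregate": 72}

per_day_s = sum(STEP_SECONDS.values())   # 477 s ≈ 7.95 min per day of logs
full_year_h = per_day_s * 365 / 3600     # sequential, single cluster

print(f"{per_day_s} s/day -> {full_year_h:.1f} h for 365 days")
# ≈ 48.4 h: inside the window only if nothing else takes any time at all.
```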
5. Critical Move: Burst Scale-out to Five Hadoop Clusters
To create a real delivery buffer, I scaled horizontally by cloning four additional Hadoop clusters and distributing date ranges according to each cluster’s capacity. The cluster burst, host updates, Hive metastore sync, HDFS path fixes, script deployment, and health checks were completed in about 2 hours.
Multi-cluster parallel burst scale-out
Original Hadoop cluster
│
├── Clone cluster A
├── Clone cluster B
├── Clone cluster C
└── Clone cluster D
Final cluster count: 5
│
▼
Distribute files by cluster capacity
│
▼
Parallel processing at file / day level
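Distributing 365 days of files across the five clusters in proportion to capacity can be sketched as follows; the relative capacity weights here are hypothetical, chosen only to show the allocation logic:

```python
def allocate_days(total_days: int, capacities: dict) -> dict:
    """Split a day count across clusters proportionally to capacity weight,
    handing any rounding remainder to the largest cluster."""
    total_cap = sum(capacities.values())
    shares = {name: int(total_days * cap / total_cap)
              for name, cap in capacities.items()}
    remainder = total_days - sum(shares.values())
    largest = max(capacities, key=capacities.get)
    shares[largest] += remainder
    return shares

# Hypothetical relative capacities: the original cluster plus clones A-D.
plan = allocate_days(365, {"orig": 3, "A": 2, "B": 2, "C": 2, "D": 1})
print(plan)  # every cluster gets a date range sized to its weight
```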
Engineering Judgment
The core decision was not simply “use Hadoop.” The important judgment was knowing when Oracle was the wrong execution platform, proving Hadoop throughput with a benchmark, and then scaling out fast enough to protect the 48-hour delivery window.
6. Final Delivery Flow
With five clusters running in parallel, the workload was scheduled by file and by day. The output tables were merged into the final reporting layer, exported to Excel, and handed to the front-line team for capacity analysis.
Final aggregation and delivery flow
All clusters output m2 / m3
│
▼
Minute-level aggregation
│
▼
Hive summary tables
│
▼
Excel report
│
▼
Delivered artefacts: peak minutes, Top 10 high-risk APIs, time-series charts
7. Deliverables
- Minute-level peak call volume for every API
- Full-year peak load for each API
- A list of the Top 10 high-risk APIs
- Detection of high-frequency, sustained-call APIs
- Daily / weekly / hourly record-volume time series
- Success rate, failure rate, and exception share
- Correlation analysis between business peaks and system peaks
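Selecting the Top 10 high-risk APIs from the m3 output reduces to ranking each API by its single worst minute. A sketch of that ranking (with hypothetical sample data and top_n lowered to 2 for brevity):

```python
def top_risk_apis(m3_counts: dict, top_n: int = 10) -> list:
    """Rank APIs by their highest minute-level call count (peak minute)."""
    peaks = {}  # api -> (peak_calls, "day minute")
    for (day, api, minute), calls in m3_counts.items():
        if calls > peaks.get(api, (0,))[0]:
            peaks[api] = (calls, f"{day} {minute}")
    return sorted(peaks.items(), key=lambda kv: kv[1][0], reverse=True)[:top_n]

# Hypothetical m3 counts: (day, api, minute) -> call volume.
m3 = {
    ("2014-06-01", "query_balance", "09:15"): 900,
    ("2014-06-02", "query_balance", "20:01"): 1500,
    ("2014-06-01", "recharge_hist", "12:30"): 1200,
    ("2014-06-03", "plan_lookup", "08:00"): 400,
}
print(top_risk_apis(m3, top_n=2))
# [('query_balance', (1500, '2014-06-02 20:01')), ('recharge_hist', (1200, '2014-06-01 12:30'))]
```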
8. Final Outcome: The Numbers That Matter
Approach                 Per-day runtime   Full-year runtime
------------------------------------------------------------
Original (Oracle)        30 minutes        182 hours (7.5 days)
Single Hadoop cluster    8 minutes         ~48 hours (still tight)
5 clusters in parallel   1.6 minutes       ≈ 10 hours for the core flow
- End-to-end communication and solution sign-off: about 1.5 hours
- Multi-cluster deployment: about 2 hours
- Core shard scheduling and computation: about 10 hours
- Delivered within the 48-hour deadline
- Zero impact on the production Oracle system
- Reusable design later adopted as a standard API analysis approach in multiple provinces
9. My Role and Contributions
- Led the architecture decision and end-to-end delivery under a hard 48-hour deadline.
- Designed and implemented the Hadoop + Hive multi-cluster solution, including ETL pipeline, shell scripts, and scheduling logic.
- Ran the performance benchmark and proved the single-cluster throughput of about 8 minutes per day.
- Owned the burst scale-out deployment and shard-allocation strategy, assigning date ranges based on cluster capacity.
- Defined the final analysis-report structure and worked with the front-line team to deliver and interpret the Excel report.
Notes & Methods
- Data scope: full-year API logs from a provincial telecom production environment.
- Processing method: Hadoop / Hive ETL, file-level parallelism, character conversion, cleaning, and minute-level aggregation.
- Risk control: avoided loading archived raw logs into Oracle and kept heavy analysis away from live production systems.