
Case 1 · Turned a 7.5-Day Oracle Job into a 14-Hour Hadoop Delivery

Processed 15.3B records • Met a 48-hour deadline • Scaled to 5 Hadoop clusters • Delivered the core flow in ~10 hours • Zero production impact • Replaced an infeasible Oracle path

Project Summary

  • Industry context: telecom operator with roughly 6.5 million active / billing subscribers.
  • Core data: API logs from online business systems, about 42 million records per day.
  • Business request: deliver a full-year, minute-level peak-load analysis for all APIs within 48 hours.
  • Core challenge: the Oracle-based path failed on time, capacity, and production-risk checks.
  • Redesign: moved compressed logs directly into Hadoop / Hive and processed them with a multi-cluster parallel ETL design.
  • Result: core processing finished in about 10 hours, with zero production impact, and the approach was later promoted as a reusable provincial analysis pattern.

This case comes from a real telecom production environment. The business team needed a full-year API peak-load report in two days, but only three months of logs were available online. Older data existed as compressed archives on tertiary storage. The key engineering decision was to avoid loading 15.3B raw records back into Oracle and instead use file-level parallel processing on Hadoop / Hive.

Tech Stack: Oracle · Hadoop · Hive · Shell / ETL · Compressed Logs · Parallel Processing
Core Capabilities: Feasibility Assessment · Peak Load Analysis · Big Data ETL · Production Risk Control · Multi-cluster Scale-out

1. Business Context and Delivery Pressure

The provincial telecom environment had about 6.5 million active / billing subscribers and generated more than 42 million API calls per day. These calls covered plan lookup, data-balance queries, recharge history, customer validation, loyalty redemption, customer profile lookup, and more than 100 other core APIs.

At year-end, the client asked for a full-year, minute-level peak-load report for every API within a hard 48-hour window. This required reconstructing a full year of historical API activity, calculating minute-level call volume, and identifying high-risk peak periods for capacity planning.

2. Why the Oracle-based Approach Was Rejected

The original idea was to pull archived compressed logs back into Oracle, decompress them, partition them by day, and run full-table analysis. A quick feasibility check showed that this path could not meet the deadline or production-safety requirements.

[Is the Oracle-based approach viable?]
        │
┌───────┴─────────────────────────┐
▼                                 ▼
Time-window check             Capacity / design check
(48h vs 182h)                 (~15.3B rows)
        │                              │
  [Fail]                            [Fail]
        └───────────────┬───────────────┘
                        ▼
     Production risk check (Tier-1 / Tier-2 servers)
                        │
                     [Fail]
                        │
                        ▼
   ⇒ Overall conclusion: the Oracle approach is not viable
      
Key numbers:

  • 182h — Oracle load-only estimate, before any analysis
  • 23:00–06:00 — the only permitted low-load production window
  • ~8 min/day — single-cluster Hadoop benchmark
  • 0 — production incidents during delivery

Time window: testing showed that one day of logs required about 30 minutes to decompress and load into Oracle.

365 days × 0.5 hours/day ≈ 182 hours (about 7.5 days)

Data volume: 42M rows/day × 365 days ≈ 15.3B rows for the full year.

Even with daily partitioning, this design would create huge index growth, high undo / redo pressure, poor concurrency, and real risk to live API services. The Tier-2 servers were also carrying production tasks, and heavy workloads were only allowed during the 23:00–06:00 low-load window.

3. Solution: Hadoop + Hive File-level Parallel Processing

The redesigned approach sent compressed logs directly into Hadoop / Hive. The interface team only needed to provide the archived files. No Oracle reload, no manual decompression, and no pre-processing were required on their side.

  1. Bulk decompression: unzip / gzip -d
  2. Character conversion: GBK → UTF-8 using iconv
  3. Load raw logs into Hive table log_tran_m1
  4. Clean and structure records into log_tran_m2, including timestamp parsing, API-name extraction, and success / failure marking
  5. Aggregate by day, API, and minute to build m3 minute-level call counts (a shell sketch of the full flow follows the diagram below)
Single-cluster ETL flow

Compressed logs
        │
        ▼
Bulk decompression
        │
        ▼
GBK → UTF-8 conversion
        │
        ▼
Load into Hive raw table m1
        │
        ▼
Clean and structure into m2
        │
        ▼
Aggregate by day / API / minute into m3
        │
        ▼
Output minute-level API call volume
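
A minimal shell sketch of this single-cluster flow, under stated assumptions: the table names log_tran_m1 / log_tran_m2 come from the case (m3 is written here as log_tran_m3), while the directory paths and the column names (log_time, api_name, ret_code) are illustrative, not the production schema.

#!/usr/bin/env bash
# Single-cluster ETL sketch: decompress, convert GBK -> UTF-8, load, clean, aggregate.
set -euo pipefail

RAW_DIR=/data/archive/api_logs        # compressed archives (illustrative path)
STAGE_DIR=/data/stage/api_logs_utf8   # decompressed UTF-8 output (illustrative path)
mkdir -p "$STAGE_DIR"

# Steps 1-2: bulk decompression and character conversion
for f in "$RAW_DIR"/*.gz; do
  gzip -dc "$f" | iconv -f GBK -t UTF-8 > "$STAGE_DIR/$(basename "${f%.gz}")"
done

# Step 3: load raw logs into the Hive raw table
hive -e "LOAD DATA LOCAL INPATH '${STAGE_DIR}' INTO TABLE log_tran_m1;"

# Step 4: clean and structure into m2 (timestamp parsing, API name, success flag)
hive -e "
INSERT OVERWRITE TABLE log_tran_m2
SELECT from_unixtime(unix_timestamp(log_time, 'yyyyMMddHHmmss')) AS call_ts,
       api_name,
       CASE WHEN ret_code = '0' THEN 1 ELSE 0 END                AS is_success
FROM   log_tran_m1;"

# Step 5: aggregate by day / API / minute into m3
hive -e "
INSERT OVERWRITE TABLE log_tran_m3
SELECT to_date(call_ts)              AS call_day,
       api_name,
       date_format(call_ts, 'HH:mm') AS call_minute,
       COUNT(*)                      AS call_cnt
FROM   log_tran_m2
GROUP BY to_date(call_ts), api_name, date_format(call_ts, 'HH:mm');"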
      

4. Performance Benchmark

Based on real log statistics, one day contained roughly 43 million rows (in line with the ~42M daily average). On a single Hadoop cluster, the end-to-end processing time was about 8 minutes per day.

Step                         Duration (seconds)
File load into Hive (LOAD)          267
Clean & transform (m1 → m2)         138
Minute-level peak aggregation (m3)   72
----------------------------------------
Total:                              477 seconds ≈ 7.95 minutes
      

A single Hadoop cluster could technically process the full year in about 48 hours, but that left almost no buffer for transfer, validation, reruns, report preparation, or unexpected failures. The deadline required more capacity, not just a better query.
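
The arithmetic behind that estimate: 477 s/day × 365 days ≈ 174,000 s ≈ 48.4 hours of pure compute, which already consumes the entire 48-hour window before any transfer or validation time.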

5. Critical Move: Burst Scale-out to Five Hadoop Clusters

To create a real delivery buffer, I scaled horizontally by cloning four additional Hadoop clusters and distributing date ranges according to each cluster’s capacity. The cluster burst, host updates, Hive metastore sync, HDFS path fixes, script deployment, and health checks were completed in about 2 hours.

Multi-cluster parallel burst scale-out

Original Hadoop cluster
        │
        ├── Clone cluster A
        ├── Clone cluster B
        ├── Clone cluster C
        └── Clone cluster D

Final cluster count: 5
        │
        ▼
Distribute files by cluster capacity
        │
        ▼
Parallel processing at file / day level
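
To make the distribution step concrete, here is a minimal scheduling sketch. The cluster names and relative-capacity weights are hypothetical; the real allocation assigned date ranges based on each cluster's measured capacity.

#!/usr/bin/env bash
# Shard-allocation sketch: map each day of the year onto one of five clusters
# in proportion to a relative-capacity weight (all values illustrative).
set -euo pipefail

clusters=(orig cloneA cloneB cloneC cloneD)
weights=(3 2 2 2 1)   # hypothetical relative throughput, summing to 10

total=0
for w in "${weights[@]}"; do total=$((total + w)); done

for day in $(seq 0 364); do
  slot=$((day % total))
  acc=0
  for i in "${!weights[@]}"; do
    acc=$((acc + weights[i]))
    if [ "$slot" -lt "$acc" ]; then
      printf 'day %03d -> %s\n' "$day" "${clusters[i]}"
      break
    fi
  done
done

Each cluster then runs the single-cluster ETL flow above over its assigned days.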
      

Engineering Judgment

The core decision was not simply “use Hadoop.” The important judgment was knowing when Oracle was the wrong execution platform, proving Hadoop throughput with a benchmark, and then scaling out fast enough to protect the 48-hour delivery window.

6. Final Delivery Flow

With five clusters running in parallel, the workload was scheduled by file and by day. The output tables were merged into the final reporting layer, exported to Excel, and handed to the front-line team for capacity analysis.

Final aggregation and delivery flow

All clusters output m2 / m3
        │
        ▼
Minute-level aggregation
        │
        ▼
Hive summary tables
        │
        ▼
Excel report
        │
        ▼
Delivered artefacts: peak minutes, Top 10 high-risk APIs, time-series charts
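
A minimal sketch of the merge step, assuming each cluster exports its m3 slice as a TSV file; the host names, paths, and the summary table log_tran_m3_all are hypothetical.

#!/usr/bin/env bash
# Merge sketch: collect each cluster's m3 export and union it into one summary table.
set -euo pipefail

for host in hadoop1 hadoop2 hadoop3 hadoop4 hadoop5; do   # hypothetical hosts
  scp "${host}:/export/m3_part.tsv" "/merge/m3_${host}.tsv"
done

# Load every per-cluster slice into a single reporting table
hive -e "LOAD DATA LOCAL INPATH '/merge' INTO TABLE log_tran_m3_all;"

# Export the merged summary for the Excel handover (tab-separated by default)
hive -e "SELECT * FROM log_tran_m3_all;" > m3_full_year.tsv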
      

7. Deliverables

  • Minute-level peak call volume for every API
  • Full-year peak load for each API
  • A list of the Top 10 high-risk APIs (query sketch after this list)
  • Detection of high-frequency, sustained-call APIs
  • Daily / weekly / hourly record-volume time series
  • Success rate, failure rate, and exception share
  • Correlation analysis between business peaks and system peaks
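
As a sketch of how the headline deliverables fall out of the minute-level counts, using the illustrative log_tran_m3 schema from earlier:

#!/usr/bin/env bash
# Top-10 sketch: full-year peak minute per API, keeping the ten largest peaks.
hive -e "
SELECT api_name,
       MAX(call_cnt) AS peak_minute_calls
FROM   log_tran_m3
GROUP BY api_name
ORDER BY peak_minute_calls DESC
LIMIT 10;" > top10_high_risk_apis.tsv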

8. Final Outcome: The Numbers That Matter

Approach              Per-day runtime     Full-year runtime
----------------------------------------------------------
Original (Oracle)       30 minutes        182 hours (7.5 days)
Single Hadoop cluster    8 minutes        ~48 hours (still tight)
5 clusters in parallel   1.6 minutes      ≈ 10 hours for the core flow
      
  • End-to-end communication and solution sign-off: about 1.5 hours
  • Multi-cluster deployment: about 2 hours
  • Core shard scheduling and computation: about 10 hours
  • Delivered within the 48-hour deadline
  • Zero impact on the production Oracle system
  • Reusable design later adopted as a standard API analysis approach in multiple provinces

9. My Role and Contributions

  • Led the architecture decision and end-to-end delivery under a hard 48-hour deadline.
  • Designed and implemented the Hadoop + Hive multi-cluster solution, including ETL pipeline, shell scripts, and scheduling logic.
  • Ran the performance benchmark and proved the single-cluster throughput of about 8 minutes per day.
  • Owned the burst scale-out deployment and shard-allocation strategy, assigning date ranges based on cluster capacity.
  • Defined the final analysis-report structure and worked with the front-line team to deliver and interpret the Excel report.

Notes & Methods

  • Data scope: full-year API logs from a provincial telecom production environment.
  • Processing method: Hadoop / Hive ETL, file-level parallelism, character conversion, cleaning, and minute-level aggregation.
  • Risk control: avoided loading archived raw logs into Oracle and kept heavy analysis away from live production systems.