
📊 Data Platforms

Level: Core · Solves: choosing the right data platform for analytics workloads, balancing performance, cost, and flexibility

🎯 Goals (Outcomes)

After applying the material on this page, you will be able to:

  • Choose the right data platform for the use case (BigQuery vs Dataproc)
  • Design BigQuery tables with partitioning, clustering, and cost controls
  • Implement the lakehouse pattern with BigLake and open formats
  • Use Dataproc for Spark/Hadoop workloads
  • Optimize cost with storage tiers and query optimization
  • Build data pipelines with the modern data stack

When to Use

| Platform | Use Case | Why |
| --- | --- | --- |
| BigQuery | SQL analytics, BI | Serverless, fast, SQL-native |
| BigQuery + GCS | Lakehouse | Flexibility, open formats |
| Dataproc | Spark/Hadoop | Existing code, ML pipelines |
| Dataproc Serverless | Batch Spark | No cluster management |
| Dataflow | Stream processing | Auto-scaling, exactly-once |

When NOT to Use

| Pattern | Problem | Alternative |
| --- | --- | --- |
| BigQuery for OLTP | Not designed for transactions | Cloud SQL, Spanner |
| Dataproc persistent clusters | Idle cost | Serverless or ephemeral clusters |
| SELECT * on large tables | Huge cost | Select specific columns |
| Unpartitioned BigQuery tables | Full-scan cost | Partition by date |
| Spark for simple SQL | Overkill | BigQuery |

⚠️ Warning from Raizo

"An analyst ran SELECT * on a 50 TB table: $300 for a single query, and no quotas were set up. Enable dry_run first, set project quotas, and train the team on cost awareness."

Platform Landscape

GCP Data Platform Spectrum

┌─────────────────────────────────────────────────────────────────┐
│                 GCP DATA PLATFORM SPECTRUM                      │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  STRUCTURED                                    UNSTRUCTURED     │
│  SQL-FIRST                                     CODE-FIRST       │
│  ◄─────────────────────────────────────────────────────────►    │
│                                                                 │
│  ┌─────────┐  ┌─────────┐  ┌─────────┐  ┌─────────────────┐    │
│  │BigQuery │  │BigQuery │  │Dataproc │  │  Cloud Storage  │    │
│  │   DW    │  │   +GCS  │  │ (Spark) │  │   (Data Lake)   │    │
│  └─────────┘  └─────────┘  └─────────┘  └─────────────────┘    │
│                                                                 │
│  Pure Data      Lakehouse    Processing     Raw Storage         │
│  Warehouse      Pattern      Engine         Layer               │
│                                                                 │
│  Best for:      Best for:    Best for:      Best for:          │
│  • BI/Reports   • Mixed      • Complex ETL  • Landing zone     │
│  • Ad-hoc SQL   • External   • ML pipelines • Archive          │
│  • Dashboards   • data       • Streaming    • Unstructured     │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

BigQuery

BigQuery Architecture

┌─────────────────────────────────────────────────────────────────┐
│                 BIGQUERY ARCHITECTURE                           │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │                    Dremel Engine                        │    │
│  │  ┌─────────────────────────────────────────────────┐    │    │
│  │  │              Query Execution                     │    │    │
│  │  │  • Distributed SQL engine                        │    │    │
│  │  │  • Automatic parallelization                     │    │    │
│  │  │  • In-memory shuffle                             │    │    │
│  │  └─────────────────────────────────────────────────┘    │    │
│  └─────────────────────────────────────────────────────────┘    │
│                          │                                      │
│                          ▼                                      │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │                   Colossus Storage                      │    │
│  │  ┌─────────────────────────────────────────────────┐    │    │
│  │  │              Columnar Format                     │    │    │
│  │  │  • Capacitor (columnar)                          │    │    │
│  │  │  • Automatic compression                         │    │    │
│  │  │  • Encryption at rest                            │    │    │
│  │  └─────────────────────────────────────────────────┘    │    │
│  └─────────────────────────────────────────────────────────┘    │
│                                                                 │
│  KEY INSIGHT: Compute and Storage are SEPARATED                 │
│  • Pay for storage (cheap)                                      │
│  • Pay for compute (on-demand or slots)                         │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

BigQuery Pricing Models

| Model | How it Works | Best For |
| --- | --- | --- |
| On-demand | $6.25/TB scanned | Ad-hoc queries, variable workloads |
| Capacity (Slots) | $0.04/slot-hour | Predictable, heavy workloads |
| Editions | Standard/Enterprise/Enterprise Plus | Different SLAs/features |
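As a rough sanity check on the two models, the break-even point can be computed directly from the list prices above (slot count and hours are illustrative, not a recommendation):

```sql
-- Illustrative break-even: 100 slots reserved 24/7 vs on-demand scanning
-- (prices from the table above; 730 ≈ hours per month)
SELECT
  100 * 730 * 0.04               AS monthly_slot_cost_usd,      -- 100 slots × 730 h × $0.04 = $2,920
  ROUND(100 * 730 * 0.04 / 6.25) AS break_even_tb_per_month;    -- ≈ 467 TB scanned per month
```

Below roughly that scan volume, on-demand is cheaper; above it, reserved capacity wins, which is where the ~500 TB/month rule of thumb in the trade-offs section comes from.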

BigQuery Best Practices

```sql
-- ✅ GOOD: Partition and cluster for performance
CREATE TABLE `project.dataset.events`
PARTITION BY DATE(event_timestamp)
CLUSTER BY user_id, event_type
AS SELECT * FROM source_table;

-- ✅ GOOD: Use approximate functions for large datasets
SELECT APPROX_COUNT_DISTINCT(user_id) AS unique_users
FROM `project.dataset.events`
WHERE DATE(event_timestamp) = '2024-01-01';

-- ❌ BAD: SELECT * on large tables
SELECT * FROM `project.dataset.events`;

-- ✅ GOOD: Select only needed columns
SELECT user_id, event_type, event_timestamp
FROM `project.dataset.events`
WHERE DATE(event_timestamp) = '2024-01-01';
```
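To make the partition filter mandatory rather than a team convention, BigQuery supports `require_partition_filter` (table name reused from the example above):

```sql
-- Reject queries that would scan every partition of the events table
ALTER TABLE `project.dataset.events`
SET OPTIONS (require_partition_filter = TRUE);

-- After this, an unfiltered query fails with an error instead of billing a full scan:
-- SELECT user_id FROM `project.dataset.events`;  -- rejected: no filter on event_timestamp
```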

BigQuery Organization

┌─────────────────────────────────────────────────────────────────┐
│              BIGQUERY RESOURCE HIERARCHY                        │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Project: analytics-prod                                        │
│  ├── Dataset: raw_data (US multi-region)                        │
│  │   ├── Table: events                                          │
│  │   ├── Table: users                                           │
│  │   └── External Table: gcs_logs (→ GCS)                       │
│  │                                                              │
│  ├── Dataset: staging (US multi-region)                         │
│  │   ├── Table: stg_events                                      │
│  │   └── View: v_active_users                                   │
│  │                                                              │
│  ├── Dataset: marts (US multi-region)                           │
│  │   ├── Table: dim_users                                       │
│  │   ├── Table: fact_events                                     │
│  │   └── Materialized View: mv_daily_metrics                    │
│  │                                                              │
│  └── Dataset: ml_models (US multi-region)                       │
│      └── Model: churn_prediction                                │
│                                                                 │
│  NAMING CONVENTION:                                             │
│  • Datasets: {layer}_{domain} (raw_sales, marts_finance)        │
│  • Tables: {prefix}_{entity} (dim_users, fact_orders)           │
│  • Views: v_{description}                                       │
│  • Materialized Views: mv_{description}                         │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
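The hierarchy above can be provisioned with DDL rather than clicks; a minimal sketch (project name, labels, and description are illustrative):

```sql
-- Create the raw_data dataset from the hierarchy above
CREATE SCHEMA `analytics-prod.raw_data`
OPTIONS (
  location    = 'US',
  description = 'Landing layer: raw events and users',
  labels      = [('layer', 'raw'), ('env', 'prod')]
);
```

Keeping dataset creation in DDL (or Terraform) makes the naming convention enforceable in code review instead of relying on memory.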

Dataproc

When to Use Dataproc

┌─────────────────────────────────────────────────────────────────┐
│              DATAPROC vs BIGQUERY                               │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  USE BIGQUERY WHEN:                  USE DATAPROC WHEN:         │
│  ─────────────────                   ────────────────           │
│  • SQL-based analytics               • Complex Spark/Hadoop     │
│  • BI tool integration               • Existing Spark code      │
│  • Ad-hoc exploration                • ML with Spark MLlib      │
│  • Serverless preferred              • Streaming (Spark)        │
│  • Structured data                   • Custom libraries         │
│                                                                 │
│  COST MODEL:                         COST MODEL:                │
│  • Pay per query (TB scanned)        • Pay per cluster time     │
│  • Or reserved slots                 • + Spot VMs discount      │
│                                                                 │
│  SCALING:                            SCALING:                   │
│  • Automatic, serverless             • Manual or autoscaling    │
│  • No cluster management             • Cluster lifecycle mgmt   │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Dataproc Cluster Types

┌─────────────────────────────────────────────────────────────────┐
│              DATAPROC CLUSTER PATTERNS                          │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  EPHEMERAL CLUSTERS (Recommended)                               │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │  • Create → Run Job → Delete                            │    │
│  │  • Use Dataproc Workflows or Cloud Composer             │    │
│  │  • Cost-effective for batch jobs                        │    │
│  │  • No idle cluster costs                                │    │
│  └─────────────────────────────────────────────────────────┘    │
│                                                                 │
│  PERSISTENT CLUSTERS (When needed)                              │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │  • Interactive development (Jupyter)                    │    │
│  │  • Frequent short jobs                                  │    │
│  │  • Use autoscaling to minimize cost                     │    │
│  │  • Consider Dataproc Serverless instead                 │    │
│  └─────────────────────────────────────────────────────────┘    │
│                                                                 │
│  DATAPROC SERVERLESS (Best of both)                             │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │  • No cluster management                                │    │
│  │  • Auto-scaling                                         │    │
│  │  • Pay per job                                          │    │
│  │  • Spark batches and interactive sessions               │    │
│  └─────────────────────────────────────────────────────────┘    │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Lakehouse Pattern

BigQuery + Cloud Storage

┌─────────────────────────────────────────────────────────────────┐
│              LAKEHOUSE ARCHITECTURE                             │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │                    Query Layer                          │    │
│  │  ┌─────────────────────────────────────────────────┐    │    │
│  │  │              BigQuery                            │    │    │
│  │  │  • Native tables (managed)                       │    │    │
│  │  │  • External tables (federated)                   │    │    │
│  │  │  • BigLake tables (governed)                     │    │    │
│  │  └─────────────────────────────────────────────────┘    │    │
│  └─────────────────────────────────────────────────────────┘    │
│                          │                                      │
│                          ▼                                      │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │                   Storage Layer                         │    │
│  │  ┌─────────────────────────────────────────────────┐    │    │
│  │  │           Cloud Storage (Data Lake)             │    │    │
│  │  │  • Parquet, ORC, Avro, JSON                      │    │    │
│  │  │  • Delta Lake, Apache Iceberg                    │    │    │
│  │  │  • Open formats, portable                        │    │    │
│  │  └─────────────────────────────────────────────────┘    │    │
│  └─────────────────────────────────────────────────────────┘    │
│                                                                 │
│  BENEFITS:                                                      │
│  • Query data in place (no ETL to BigQuery)                     │
│  • Open formats (avoid lock-in)                                 │
│  • Fine-grained access control with BigLake                     │
│  • Cost-effective for large, infrequently queried data          │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

BigLake Tables

```sql
-- Create BigLake table with row-level security
CREATE EXTERNAL TABLE `project.dataset.sales_biglake`
WITH CONNECTION `project.region.connection_id`
OPTIONS (
  format = 'PARQUET',
  uris = ['gs://bucket/sales/*.parquet'],
  metadata_cache_mode = 'AUTOMATIC'
);

-- Apply row-level security
CREATE ROW ACCESS POLICY region_filter
ON `project.dataset.sales_biglake`
GRANT TO ('user:analyst@example.com')
FILTER USING (region = 'APAC');
```

Data Pipeline Patterns

Modern Data Stack on GCP

| Layer | Tool | Purpose |
| --- | --- | --- |
| Orchestration | Cloud Composer (Airflow) | Workflow scheduling |
| Ingestion | Dataflow, Pub/Sub | Batch and streaming |
| Transform | dbt, Dataform | SQL transformations |
| Quality | Dataplex | Data quality, lineage |
| Catalog | Data Catalog | Metadata management |
| Visualization | Looker, Looker Studio | BI and dashboards |
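The Transform layer materializes the staging models from the dataset hierarchy shown earlier; a minimal SQL sketch of the kind of model dbt or Dataform would compile (table and column names are illustrative):

```sql
-- Staging view: light cleanup over the raw layer
CREATE OR REPLACE VIEW `project.staging.stg_events` AS
SELECT
  user_id,
  LOWER(event_type) AS event_type,   -- normalize casing at the staging boundary
  event_timestamp
FROM `project.raw_data.events`
WHERE user_id IS NOT NULL;           -- drop unattributable events early
```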

Cost Optimization

BigQuery Cost Control

```sql
-- Set up cost controls
-- 1. Project-level quota
--    In Cloud Console: BigQuery > Quotas > Query usage per day

-- 2. User-level quota
--    Use custom quotas in IAM & Admin

-- 3. Query dry run (estimate cost before running)
--    bq query --dry_run --use_legacy_sql=false 'SELECT ...'

-- 4. Use LIMIT for exploration
--    (note: LIMIT alone does NOT reduce bytes billed on on-demand pricing;
--     combine it with column selection and partition filters, or use table preview)
SELECT * FROM `project.dataset.large_table` LIMIT 1000;

-- 5. Partition pruning (always filter on the partition column)
SELECT * FROM `project.dataset.events`
WHERE DATE(event_timestamp) BETWEEN '2024-01-01' AND '2024-01-31';
```
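A further control worth adding is a hard byte cap per query: if the estimated scan exceeds the cap, the job fails instead of billing (cap value illustrative):

```sql
-- 6. Hard cap on bytes billed (query fails fast instead of overspending)
--    bq query --maximum_bytes_billed=1099511627776 --use_legacy_sql=false 'SELECT ...'
--    (1099511627776 bytes = 1 TiB; also settable per job via the API and client libraries)
```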

Storage Optimization

┌─────────────────────────────────────────────────────────────────┐
│              BIGQUERY STORAGE TIERS                             │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ACTIVE STORAGE: $0.02/GB/month                                 │
│  • Tables modified in last 90 days                              │
│  • Streaming buffer                                             │
│                                                                 │
│  LONG-TERM STORAGE: $0.01/GB/month (50% cheaper!)               │
│  • Tables NOT modified in 90 days                               │
│  • Automatic, no action needed                                  │
│  • Same query performance                                       │
│                                                                 │
│  OPTIMIZATION TIPS:                                             │
│  • Use table expiration for temp tables                         │
│  • Partition by date, cluster by high-cardinality columns       │
│  • Archive old partitions to GCS (external tables)              │
│  • Use materialized views for repeated queries                  │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
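The expiration tips above translate to one-line DDL (table names and retention periods are illustrative):

```sql
-- Auto-drop partitions older than one year
ALTER TABLE `project.dataset.events`
SET OPTIONS (partition_expiration_days = 365);

-- Exploration table that deletes itself after 7 days
CREATE TABLE `project.dataset.tmp_exploration`
OPTIONS (
  expiration_timestamp = TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
)
AS
SELECT user_id, event_type
FROM `project.dataset.events`
LIMIT 1000;
```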

Best Practices Checklist

  • [ ] Use BigQuery for SQL analytics, Dataproc for Spark
  • [ ] Partition tables by date, cluster by query columns
  • [ ] Set up cost controls and quotas
  • [ ] Use Dataproc Serverless for batch Spark jobs
  • [ ] Implement lakehouse pattern for flexibility
  • [ ] Use BigLake for governed external data
  • [ ] Enable Data Catalog for discoverability
  • [ ] Implement data quality checks with Dataplex

⚖️ Trade-offs

Trade-off 1: BigQuery On-Demand vs Capacity

| Aspect | On-Demand | Capacity (Slots) |
| --- | --- | --- |
| Cost model | $6.25/TB scanned | $0.04/slot-hour |
| Predictability | Variable | Fixed |
| Best for | Ad-hoc, variable | Heavy, predictable |
| Break-even | < 500 TB/month | > 500 TB/month |

Trade-off 2: Native Tables vs External Tables

| Aspect | Native Tables | External/BigLake |
| --- | --- | --- |
| Performance | Fastest | Slower |
| Cost | Storage + query | Query only |
| Flexibility | BigQuery only | Open formats |
| Use case | Hot data | Cold/archival |

Trade-off 3: BigQuery vs Dataproc

| Aspect | BigQuery | Dataproc |
| --- | --- | --- |
| Interface | SQL | Spark/Python |
| Ops overhead | Zero | Cluster management |
| ML | BigQuery ML (SQL) | Spark MLlib, custom |
| Cost | Per TB scanned | Per cluster time |
| Best for | SQL analytics | Complex transformations |

🚨 Failure Modes

Failure Mode 1: Runaway Query Costs

🔥 Real-world incident

An analyst ran a CROSS JOIN across two large tables. The query scanned 500 TB, producing a $3,000 bill from a single query. No quotas were configured.

| Detection | Prevention |
| --- | --- |
| Cost spike alerts | User/project quotas |
| Slot utilization spikes | Dry run before execution |
| Monthly bill shock | Per-query cost controls |

Failure Mode 2: Data Quality Issues

| Detection | Prevention |
| --- | --- |
| Downstream report errors | Dataplex data quality |
| NULL counts spike | Schema validation |
| Duplicate records | Deduplication in the pipeline |

Failure Mode 3: Partition/Cluster Inefficiency

| Detection | Prevention |
| --- | --- |
| Query scans unexpected TB | Check partition pruning |
| Slow query performance | Review clustering columns |
| High bytes billed | EXPLAIN plan analysis |

🔐 Security Baseline

Data Security Requirements

| Requirement | Implementation | Verification |
| --- | --- | --- |
| Column-level security | Policy tags | Access audit |
| Row-level security | Row access policies | Query logs |
| Encryption | CMEK for sensitive data | Key audit |
| Data masking | Dynamic data masking | Policy review |
| Access control | IAM + dataset permissions | Access Analyzer |

BigQuery Security Checklist

| Security Item | Status |
| --- | --- |
| Dataset-level IAM | ☑ Required |
| Column-level policy tags | ☑ For PII |
| Row-level access policies | ☑ For multi-tenant |
| CMEK encryption | ☑ For compliance |
| Audit logging enabled | ☑ Required |
| VPC Service Controls | ☑ For sensitive data |

📊 Ops Readiness

Metrics to Monitor

| Metric | Source | Alert Threshold |
| --- | --- | --- |
| Slot utilization | BigQuery | > 80% sustained |
| Bytes billed per day | BigQuery | > budget |
| Query error rate | BigQuery | > 1% |
| Job execution time | BigQuery | P95 > baseline |
| Dataproc cluster utilization | Dataproc | < 50% (idle) |
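The bytes-billed metric can be pulled straight from BigQuery job metadata; a sketch assuming jobs run in the US region (adjust the `region-us` qualifier to match yours):

```sql
-- Daily TiB billed over the last 7 days, across the project
SELECT
  DATE(creation_time)                              AS day,
  ROUND(SUM(total_bytes_billed) / POW(1024, 4), 2) AS tib_billed
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
  AND job_type = 'QUERY'
GROUP BY day
ORDER BY day;
```

Scheduling this query and alerting when `tib_billed` exceeds budget is a lightweight alternative to waiting for the monthly bill.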

Runbook Entry Points

| Situation | Runbook |
| --- | --- |
| High query cost | runbook/bigquery-cost-optimization.md |
| Slow query performance | runbook/bigquery-query-tuning.md |
| Data quality failure | runbook/data-quality-investigation.md |
| Dataproc job failure | runbook/dataproc-troubleshooting.md |
| Access denied | runbook/bigquery-access-debug.md |
| Slot exhaustion | runbook/bigquery-slot-management.md |

Design Review Checklist

Platform Selection

  • [ ] Right platform for the use case
  • [ ] Cost model understood
  • [ ] Team skills matched
  • [ ] Scaling requirements met

BigQuery Design

  • [ ] Tables partitioned
  • [ ] Clustering columns defined
  • [ ] Naming convention followed
  • [ ] Materialized views for common queries

Security

  • [ ] IAM configured per dataset
  • [ ] PII columns tagged
  • [ ] Row-level security if multi-tenant
  • [ ] Audit logging enabled

Operations

  • [ ] Cost quotas configured
  • [ ] Query monitoring set up
  • [ ] Data quality checks
  • [ ] Runbooks documented

📎 Links