
📊 Data Platforms

Level: Core · Solves: choosing the right data platform for analytics workloads, balancing performance, cost, and flexibility

🎯 Goals (Outcomes)

After applying the material on this page, you will be able to:

  • Choose the right data platform for the use case (BigQuery vs Dataproc)
  • Design BigQuery tables with partitioning, clustering, and cost controls
  • Implement the lakehouse pattern with BigLake and open formats
  • Use Dataproc for Spark/Hadoop workloads
  • Optimize cost with storage tiers and query optimization
  • Build data pipelines with the modern data stack

When to Use

| Platform | Use Case | Why |
| --- | --- | --- |
| BigQuery | SQL analytics, BI | Serverless, fast, SQL-native |
| BigQuery + GCS | Lakehouse | Flexibility, open formats |
| Dataproc | Spark/Hadoop | Existing code, ML pipelines |
| Dataproc Serverless | Batch Spark | No cluster management |
| Dataflow | Stream processing | Auto-scaling, exactly-once |

When NOT to Use

| Pattern | Problem | Alternative |
| --- | --- | --- |
| BigQuery for OLTP | Not designed for transactions | Cloud SQL, Spanner |
| Dataproc persistent clusters | Idle cost | Serverless or ephemeral clusters |
| SELECT * on large tables | Huge cost | Select specific columns |
| Unpartitioned BigQuery tables | Full-scan cost | Partition by date |
| Spark for simple SQL | Overkill | BigQuery |

⚠️ Warning from Raizo

"An analyst ran SELECT * on a 50 TB table: $300 for a single query, and no quotas were set up. Enable dry_run first, set project quotas, and train the team on cost awareness."

Platform Landscape

GCP Data Platform Spectrum

┌─────────────────────────────────────────────────────────────────┐
│                 GCP DATA PLATFORM SPECTRUM                      │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  STRUCTURED                                    UNSTRUCTURED     │
│  SQL-FIRST                                     CODE-FIRST       │
│  ◄─────────────────────────────────────────────────────────►    │
│                                                                 │
│  ┌─────────┐  ┌─────────┐  ┌─────────┐  ┌─────────────────┐    │
│  │BigQuery │  │BigQuery │  │Dataproc │  │  Cloud Storage  │    │
│  │   DW    │  │   +GCS  │  │ (Spark) │  │   (Data Lake)   │    │
│  └─────────┘  └─────────┘  └─────────┘  └─────────────────┘    │
│                                                                 │
│  Pure Data      Lakehouse    Processing     Raw Storage         │
│  Warehouse      Pattern      Engine         Layer               │
│                                                                 │
│  Best for:      Best for:    Best for:      Best for:          │
│  • BI/Reports   • Mixed      • Complex ETL  • Landing zone     │
│  • Ad-hoc SQL   • External   • ML pipelines • Archive          │
│  • Dashboards   • data       • Streaming    • Unstructured     │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

BigQuery

BigQuery Architecture

┌─────────────────────────────────────────────────────────────────┐
│                 BIGQUERY ARCHITECTURE                           │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │                    Dremel Engine                        │    │
│  │  ┌─────────────────────────────────────────────────┐    │    │
│  │  │              Query Execution                     │    │    │
│  │  │  • Distributed SQL engine                        │    │    │
│  │  │  • Automatic parallelization                     │    │    │
│  │  │  • In-memory shuffle                             │    │    │
│  │  └─────────────────────────────────────────────────┘    │    │
│  └─────────────────────────────────────────────────────────┘    │
│                          │                                      │
│                          ▼                                      │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │                   Colossus Storage                      │    │
│  │  ┌─────────────────────────────────────────────────┐    │    │
│  │  │              Columnar Format                     │    │    │
│  │  │  • Capacitor (columnar)                          │    │    │
│  │  │  • Automatic compression                         │    │    │
│  │  │  • Encryption at rest                            │    │    │
│  │  └─────────────────────────────────────────────────┘    │    │
│  └─────────────────────────────────────────────────────────┘    │
│                                                                 │
│  KEY INSIGHT: Compute and Storage are SEPARATED                 │
│  • Pay for storage (cheap)                                      │
│  • Pay for compute (on-demand or slots)                         │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

BigQuery Pricing Models

| Model | How it Works | Best For |
| --- | --- | --- |
| On-demand | $6.25/TB scanned | Ad-hoc queries, variable workloads |
| Capacity (Slots) | $0.04/slot-hour | Predictable, heavy workloads |
| Editions | Standard/Enterprise/Enterprise Plus | Different SLAs/features |
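As a rough sanity check on the two models, the break-even point can be computed directly from the list prices above (slot count and hours are illustrative, not a recommendation):

```sql
-- Illustrative break-even: 100 slots reserved 24/7 vs on-demand scanning
-- (prices from the table above; 730 ≈ hours per month)
SELECT
  100 * 730 * 0.04               AS monthly_slot_cost_usd,      -- 100 slots × 730 h × $0.04 = $2,920
  ROUND(100 * 730 * 0.04 / 6.25) AS break_even_tb_per_month;    -- ≈ 467 TB scanned per month
```

Below roughly that scan volume, on-demand is cheaper; above it, reserved capacity wins, which is where the ~500 TB/month rule of thumb in the trade-offs section comes from.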

BigQuery Best Practices

```sql
-- ✅ GOOD: Partition and cluster for performance
CREATE TABLE `project.dataset.events`
PARTITION BY DATE(event_timestamp)
CLUSTER BY user_id, event_type
AS SELECT * FROM source_table;

-- ✅ GOOD: Use approximate functions for large datasets
SELECT APPROX_COUNT_DISTINCT(user_id) AS unique_users
FROM `project.dataset.events`
WHERE DATE(event_timestamp) = '2024-01-01';

-- ❌ BAD: SELECT * on large tables
SELECT * FROM `project.dataset.events`;

-- ✅ GOOD: Select only needed columns
SELECT user_id, event_type, event_timestamp
FROM `project.dataset.events`
WHERE DATE(event_timestamp) = '2024-01-01';
```
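To make the partition filter mandatory rather than a team convention, BigQuery supports `require_partition_filter` (table name reused from the example above):

```sql
-- Reject queries that would scan every partition of the events table
ALTER TABLE `project.dataset.events`
SET OPTIONS (require_partition_filter = TRUE);

-- After this, an unfiltered query fails with an error instead of billing a full scan:
-- SELECT user_id FROM `project.dataset.events`;  -- rejected: no filter on event_timestamp
```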

BigQuery Organization

┌─────────────────────────────────────────────────────────────────┐
│              BIGQUERY RESOURCE HIERARCHY                        │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Project: analytics-prod                                        │
│  ├── Dataset: raw_data (US multi-region)                        │
│  │   ├── Table: events                                          │
│  │   ├── Table: users                                           │
│  │   └── External Table: gcs_logs (→ GCS)                       │
│  │                                                              │
│  ├── Dataset: staging (US multi-region)                         │
│  │   ├── Table: stg_events                                      │
│  │   └── View: v_active_users                                   │
│  │                                                              │
│  ├── Dataset: marts (US multi-region)                           │
│  │   ├── Table: dim_users                                       │
│  │   ├── Table: fact_events                                     │
│  │   └── Materialized View: mv_daily_metrics                    │
│  │                                                              │
│  └── Dataset: ml_models (US multi-region)                       │
│      └── Model: churn_prediction                                │
│                                                                 │
│  NAMING CONVENTION:                                             │
│  • Datasets: {layer}_{domain} (raw_sales, marts_finance)        │
│  • Tables: {prefix}_{entity} (dim_users, fact_orders)           │
│  • Views: v_{description}                                       │
│  • Materialized Views: mv_{description}                         │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
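The hierarchy above can be provisioned with DDL rather than clicks; a minimal sketch (project name, labels, and description are illustrative):

```sql
-- Create the raw_data dataset from the hierarchy above
CREATE SCHEMA `analytics-prod.raw_data`
OPTIONS (
  location    = 'US',
  description = 'Landing layer: raw events and users',
  labels      = [('layer', 'raw'), ('env', 'prod')]
);
```

Keeping dataset creation in DDL (or Terraform) makes the naming convention enforceable in code review instead of relying on memory.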

Dataproc

When to Use Dataproc

┌─────────────────────────────────────────────────────────────────┐
│              DATAPROC vs BIGQUERY                               │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  USE BIGQUERY WHEN:                  USE DATAPROC WHEN:         │
│  ─────────────────                   ────────────────           │
│  • SQL-based analytics               • Complex Spark/Hadoop     │
│  • BI tool integration               • Existing Spark code      │
│  • Ad-hoc exploration                • ML with Spark MLlib      │
│  • Serverless preferred              • Streaming (Spark)        │
│  • Structured data                   • Custom libraries         │
│                                                                 │
│  COST MODEL:                         COST MODEL:                │
│  • Pay per query (TB scanned)        • Pay per cluster time     │
│  • Or reserved slots                 • + Spot VMs discount      │
│                                                                 │
│  SCALING:                            SCALING:                   │
│  • Automatic, serverless             • Manual or autoscaling    │
│  • No cluster management             • Cluster lifecycle mgmt   │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Dataproc Cluster Types

┌─────────────────────────────────────────────────────────────────┐
│              DATAPROC CLUSTER PATTERNS                          │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  EPHEMERAL CLUSTERS (Recommended)                               │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │  • Create → Run Job → Delete                            │    │
│  │  • Use Dataproc Workflows or Cloud Composer             │    │
│  │  • Cost-effective for batch jobs                        │    │
│  │  • No idle cluster costs                                │    │
│  └─────────────────────────────────────────────────────────┘    │
│                                                                 │
│  PERSISTENT CLUSTERS (When needed)                              │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │  • Interactive development (Jupyter)                    │    │
│  │  • Frequent short jobs                                  │    │
│  │  • Use autoscaling to minimize cost                     │    │
│  │  • Consider Dataproc Serverless instead                 │    │
│  └─────────────────────────────────────────────────────────┘    │
│                                                                 │
│  DATAPROC SERVERLESS (Best of both)                             │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │  • No cluster management                                │    │
│  │  • Auto-scaling                                         │    │
│  │  • Pay per job                                          │    │
│  │  • Spark batches and interactive sessions               │    │
│  └─────────────────────────────────────────────────────────┘    │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Lakehouse Pattern

BigQuery + Cloud Storage

┌─────────────────────────────────────────────────────────────────┐
│              LAKEHOUSE ARCHITECTURE                             │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │                    Query Layer                          │    │
│  │  ┌─────────────────────────────────────────────────┐    │    │
│  │  │              BigQuery                            │    │    │
│  │  │  • Native tables (managed)                       │    │    │
│  │  │  • External tables (federated)                   │    │    │
│  │  │  • BigLake tables (governed)                     │    │    │
│  │  └─────────────────────────────────────────────────┘    │    │
│  └─────────────────────────────────────────────────────────┘    │
│                          │                                      │
│                          ▼                                      │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │                   Storage Layer                         │    │
│  │  ┌─────────────────────────────────────────────────┐    │    │
│  │  │           Cloud Storage (Data Lake)             │    │    │
│  │  │  • Parquet, ORC, Avro, JSON                      │    │    │
│  │  │  • Delta Lake, Apache Iceberg                    │    │    │
│  │  │  • Open formats, portable                        │    │    │
│  │  └─────────────────────────────────────────────────┘    │    │
│  └─────────────────────────────────────────────────────────┘    │
│                                                                 │
│  BENEFITS:                                                      │
│  • Query data in place (no ETL to BigQuery)                     │
│  • Open formats (avoid lock-in)                                 │
│  • Fine-grained access control with BigLake                     │
│  • Cost-effective for large, infrequently queried data          │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

BigLake Tables

```sql
-- Create BigLake table with row-level security
CREATE EXTERNAL TABLE `project.dataset.sales_biglake`
WITH CONNECTION `project.region.connection_id`
OPTIONS (
  format = 'PARQUET',
  uris = ['gs://bucket/sales/*.parquet'],
  metadata_cache_mode = 'AUTOMATIC'
);

-- Apply row-level security
CREATE ROW ACCESS POLICY region_filter
ON `project.dataset.sales_biglake`
GRANT TO ('user:analyst@example.com')
FILTER USING (region = 'APAC');
```

Data Pipeline Patterns

Modern Data Stack on GCP

| Layer | Tool | Purpose |
| --- | --- | --- |
| Orchestration | Cloud Composer (Airflow) | Workflow scheduling |
| Ingestion | Dataflow, Pub/Sub | Batch and streaming |
| Transform | dbt, Dataform | SQL transformations |
| Quality | Dataplex | Data quality, lineage |
| Catalog | Data Catalog | Metadata management |
| Visualization | Looker, Looker Studio | BI and dashboards |
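The Transform layer materializes the staging models from the dataset hierarchy shown earlier; a minimal SQL sketch of the kind of model dbt or Dataform would compile (table and column names are illustrative):

```sql
-- Staging view: light cleanup over the raw layer
CREATE OR REPLACE VIEW `project.staging.stg_events` AS
SELECT
  user_id,
  LOWER(event_type) AS event_type,   -- normalize casing at the staging boundary
  event_timestamp
FROM `project.raw_data.events`
WHERE user_id IS NOT NULL;           -- drop unattributable events early
```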

Cost Optimization

BigQuery Cost Control

```sql
-- Set up cost controls
-- 1. Project-level quota
--    In Cloud Console: BigQuery > Quotas > Query usage per day

-- 2. User-level quota
--    Use custom quotas in IAM & Admin

-- 3. Query dry run (estimate cost before running)
--    bq query --dry_run --use_legacy_sql=false 'SELECT ...'

-- 4. Use LIMIT for exploration
--    (note: LIMIT alone does NOT reduce bytes billed on on-demand pricing;
--     combine it with column selection and partition filters, or use table preview)
SELECT * FROM `project.dataset.large_table` LIMIT 1000;

-- 5. Partition pruning (always filter on the partition column)
SELECT * FROM `project.dataset.events`
WHERE DATE(event_timestamp) BETWEEN '2024-01-01' AND '2024-01-31';
```
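A further control worth adding is a hard byte cap per query: if the estimated scan exceeds the cap, the job fails instead of billing (cap value illustrative):

```sql
-- 6. Hard cap on bytes billed (query fails fast instead of overspending)
--    bq query --maximum_bytes_billed=1099511627776 --use_legacy_sql=false 'SELECT ...'
--    (1099511627776 bytes = 1 TiB; also settable per job via the API and client libraries)
```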

Storage Optimization

┌─────────────────────────────────────────────────────────────────┐
│              BIGQUERY STORAGE TIERS                             │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ACTIVE STORAGE: $0.02/GB/month                                 │
│  • Tables modified in last 90 days                              │
│  • Streaming buffer                                             │
│                                                                 │
│  LONG-TERM STORAGE: $0.01/GB/month (50% cheaper!)               │
│  • Tables NOT modified in 90 days                               │
│  • Automatic, no action needed                                  │
│  • Same query performance                                       │
│                                                                 │
│  OPTIMIZATION TIPS:                                             │
│  • Use table expiration for temp tables                         │
│  • Partition by date, cluster by high-cardinality columns       │
│  • Archive old partitions to GCS (external tables)              │
│  • Use materialized views for repeated queries                  │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
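The expiration tips above translate to one-line DDL (table names and retention periods are illustrative):

```sql
-- Auto-drop partitions older than one year
ALTER TABLE `project.dataset.events`
SET OPTIONS (partition_expiration_days = 365);

-- Exploration table that deletes itself after 7 days
CREATE TABLE `project.dataset.tmp_exploration`
OPTIONS (
  expiration_timestamp = TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
)
AS
SELECT user_id, event_type
FROM `project.dataset.events`
LIMIT 1000;
```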

Best Practices Checklist

  • [ ] Use BigQuery for SQL analytics, Dataproc for Spark
  • [ ] Partition tables by date, cluster by query columns
  • [ ] Set up cost controls and quotas
  • [ ] Use Dataproc Serverless for batch Spark jobs
  • [ ] Implement lakehouse pattern for flexibility
  • [ ] Use BigLake for governed external data
  • [ ] Enable Data Catalog for discoverability
  • [ ] Implement data quality checks with Dataplex

⚖️ Trade-offs

Trade-off 1: BigQuery On-Demand vs Capacity

| Aspect | On-Demand | Capacity (Slots) |
| --- | --- | --- |
| Cost model | $6.25/TB scanned | $0.04/slot-hour |
| Predictability | Variable | Fixed |
| Best for | Ad-hoc, variable | Heavy, predictable |
| Break-even | < 500 TB/month | > 500 TB/month |

Trade-off 2: Native Tables vs External Tables

| Aspect | Native Tables | External/BigLake |
| --- | --- | --- |
| Performance | Fastest | Slower |
| Cost | Storage + query | Query only |
| Flexibility | BigQuery only | Open formats |
| Use case | Hot data | Cold/archival |

Trade-off 3: BigQuery vs Dataproc

| Aspect | BigQuery | Dataproc |
| --- | --- | --- |
| Interface | SQL | Spark/Python |
| Ops overhead | Zero | Cluster management |
| ML | BigQuery ML (SQL) | Spark MLlib, custom |
| Cost | Per TB scanned | Per cluster time |
| Best for | SQL analytics | Complex transformations |

🚨 Failure Modes

Failure Mode 1: Runaway Query Costs

🔥 Real-world incident

An analyst ran a CROSS JOIN across two large tables. The query scanned 500 TB, producing a $3,000 bill from a single query. No quotas were configured.

| Detection | Prevention |
| --- | --- |
| Cost spike alerts | User/project quotas |
| Slot utilization spikes | Dry run before execution |
| Monthly bill shock | Per-query cost controls |

Failure Mode 2: Data Quality Issues

| Detection | Prevention |
| --- | --- |
| Downstream report errors | Dataplex data quality |
| NULL counts spike | Schema validation |
| Duplicate records | Deduplication in the pipeline |

Failure Mode 3: Partition/Cluster Inefficiency

| Detection | Prevention |
| --- | --- |
| Query scans unexpected TB | Check partition pruning |
| Slow query performance | Review clustering columns |
| High bytes billed | EXPLAIN plan analysis |

🔐 Security Baseline

Data Security Requirements

| Requirement | Implementation | Verification |
| --- | --- | --- |
| Column-level security | Policy tags | Access audit |
| Row-level security | Row access policies | Query logs |
| Encryption | CMEK for sensitive data | Key audit |
| Data masking | Dynamic data masking | Policy review |
| Access control | IAM + dataset permissions | Access Analyzer |

BigQuery Security Checklist

| Security Item | Status |
| --- | --- |
| Dataset-level IAM | ☑ Required |
| Column-level policy tags | ☑ For PII |
| Row-level access policies | ☑ For multi-tenant |
| CMEK encryption | ☑ For compliance |
| Audit logging enabled | ☑ Required |
| VPC Service Controls | ☑ For sensitive data |

📊 Ops Readiness

Metrics to Monitor

| Metric | Source | Alert Threshold |
| --- | --- | --- |
| Slot utilization | BigQuery | > 80% sustained |
| Bytes billed per day | BigQuery | > budget |
| Query error rate | BigQuery | > 1% |
| Job execution time | BigQuery | P95 > baseline |
| Dataproc cluster utilization | Dataproc | < 50% (idle) |
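The bytes-billed metric can be pulled straight from BigQuery job metadata; a sketch assuming jobs run in the US region (adjust the `region-us` qualifier to match yours):

```sql
-- Daily TiB billed over the last 7 days, across the project
SELECT
  DATE(creation_time)                              AS day,
  ROUND(SUM(total_bytes_billed) / POW(1024, 4), 2) AS tib_billed
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
  AND job_type = 'QUERY'
GROUP BY day
ORDER BY day;
```

Scheduling this query and alerting when `tib_billed` exceeds budget is a lightweight alternative to waiting for the monthly bill.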

Runbook Entry Points

| Situation | Runbook |
| --- | --- |
| High query cost | runbook/bigquery-cost-optimization.md |
| Slow query performance | runbook/bigquery-query-tuning.md |
| Data quality failure | runbook/data-quality-investigation.md |
| Dataproc job failure | runbook/dataproc-troubleshooting.md |
| Access denied | runbook/bigquery-access-debug.md |
| Slot exhaustion | runbook/bigquery-slot-management.md |

Design Review Checklist

Platform Selection

  • [ ] Right platform for the use case
  • [ ] Cost model understood
  • [ ] Team skills matched
  • [ ] Scaling requirements met

BigQuery Design

  • [ ] Tables partitioned
  • [ ] Clustering columns defined
  • [ ] Naming convention followed
  • [ ] Materialized views for common queries

Security

  • [ ] IAM configured per dataset
  • [ ] PII columns tagged
  • [ ] Row-level security if multi-tenant
  • [ ] Audit logging enabled

Operations

  • [ ] Cost quotas configured
  • [ ] Query monitoring set up
  • [ ] Data quality checks
  • [ ] Runbooks documented

📎 Links