📊 Data Platforms
Level: Core
Solves: Choosing the right data platform for analytics workloads, with trade-offs between performance, cost, and flexibility
🎯 Objectives (Outcomes)
After applying the knowledge on this page, you will be able to:
- Choose the right data platform for the use case (BigQuery vs Dataproc)
- Design BigQuery tables with partitioning, clustering, and cost controls
- Deploy the lakehouse pattern with BigLake and open formats
- Use Dataproc for Spark/Hadoop workloads
- Optimize cost with storage tiers and query optimization
- Build data pipelines with the modern data stack
✅ When to use
| Platform | Use Case | Why |
|---|---|---|
| BigQuery | SQL analytics, BI | Serverless, fast, SQL-native |
| BigQuery + GCS | Lakehouse | Flexibility, open formats |
| Dataproc | Spark/Hadoop | Existing code, ML pipelines |
| Dataproc Serverless | Batch Spark | No cluster management |
| Dataflow | Stream processing | Auto-scaling, exactly-once |
❌ When NOT to use
| Pattern | Problem | Alternative |
|---|---|---|
| BigQuery cho OLTP | Not designed for transactions | Cloud SQL, Spanner |
| Dataproc persistent clusters | Idle cost | Serverless or ephemeral |
| SELECT * trên large tables | Huge cost | Select specific columns |
| BigQuery không partition | Full scan cost | Partition by date |
| Spark cho simple SQL | Overkill | BigQuery |
⚠️ Warning from Raizo
"An analyst ran SELECT * on a 50 TB table: $300 for a single query, with no quotas set up. Enable dry_run first, set project quotas, and train the team on cost awareness."
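The arithmetic behind this warning is worth internalizing: on-demand billing charges per byte scanned, so the estimated bytes from a dry run translate directly into dollars. A minimal sketch, assuming the $6.25/TB list price used throughout this page (prices change, so treat it as illustrative):

```python
# Convert bytes scanned (e.g. as reported by `bq query --dry_run`)
# into an estimated on-demand cost. $6.25/TB is an assumed list price.
ON_DEMAND_USD_PER_TB = 6.25

def scan_cost_usd(bytes_scanned: float) -> float:
    """Estimated on-demand cost for a query scanning this many bytes."""
    return bytes_scanned / 1e12 * ON_DEMAND_USD_PER_TB

# The incident above: a full scan of a 50 TB table
print(f"${scan_cost_usd(50e12):.2f}")  # $312.50
```

Running the dry-run estimate through a helper like this before executing the real query is a cheap habit that prevents bill shock.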
Platform Landscape
GCP Data Platform Spectrum
┌─────────────────────────────────────────────────────────────────┐
│ GCP DATA PLATFORM SPECTRUM │
├─────────────────────────────────────────────────────────────────┤
│ │
│ STRUCTURED UNSTRUCTURED │
│ SQL-FIRST CODE-FIRST │
│ ◄─────────────────────────────────────────────────────────► │
│ │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────────────┐ │
│ │BigQuery │ │BigQuery │ │Dataproc │ │ Cloud Storage │ │
│ │ DW │ │ +GCS │ │ (Spark) │ │ (Data Lake) │ │
│ └─────────┘ └─────────┘ └─────────┘ └─────────────────┘ │
│ │
│ Pure Data Lakehouse Processing Raw Storage │
│ Warehouse Pattern Engine Layer │
│ │
│ Best for: Best for: Best for: Best for: │
│ • BI/Reports • Mixed • Complex ETL • Landing zone │
│ • Ad-hoc SQL • External • ML pipelines • Archive │
│ • Dashboards • data • Streaming • Unstructured │
│ │
└─────────────────────────────────────────────────────────────────┘
BigQuery
BigQuery Architecture
┌─────────────────────────────────────────────────────────────────┐
│ BIGQUERY ARCHITECTURE │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Dremel Engine │ │
│ │ ┌─────────────────────────────────────────────────┐ │ │
│ │ │ Query Execution │ │ │
│ │ │ • Distributed SQL engine │ │ │
│ │ │ • Automatic parallelization │ │ │
│ │ │ • In-memory shuffle │ │ │
│ │ └─────────────────────────────────────────────────┘ │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Colossus Storage │ │
│ │ ┌─────────────────────────────────────────────────┐ │ │
│ │ │ Columnar Format │ │ │
│ │ │ • Capacitor (columnar) │ │ │
│ │ │ • Automatic compression │ │ │
│ │ │ • Encryption at rest │ │ │
│ │ └─────────────────────────────────────────────────┘ │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
│ KEY INSIGHT: Compute and Storage are SEPARATED │
│ • Pay for storage (cheap) │
│ • Pay for compute (on-demand or slots) │
│ │
└─────────────────────────────────────────────────────────────────┘
BigQuery Pricing Models
| Model | How it Works | Best For |
|---|---|---|
| On-demand | $6.25/TB scanned | Ad-hoc queries, variable workloads |
| Capacity (Slots) | $0.04/slot-hour | Predictable, heavy workloads |
| Editions | Standard/Enterprise/Enterprise Plus | Different SLA/features |
BigQuery Best Practices
sql
-- ✅ GOOD: Partition and cluster for performance
CREATE TABLE `project.dataset.events`
PARTITION BY DATE(event_timestamp)
CLUSTER BY user_id, event_type
AS SELECT * FROM source_table;
-- ✅ GOOD: Use approximate functions for large datasets
SELECT APPROX_COUNT_DISTINCT(user_id) as unique_users
FROM `project.dataset.events`
WHERE DATE(event_timestamp) = '2024-01-01';
-- ❌ BAD: SELECT * on large tables
SELECT * FROM `project.dataset.events`;
-- ✅ GOOD: Select only needed columns
SELECT user_id, event_type, event_timestamp
FROM `project.dataset.events`
WHERE DATE(event_timestamp) = '2024-01-01';
BigQuery Organization
┌─────────────────────────────────────────────────────────────────┐
│ BIGQUERY RESOURCE HIERARCHY │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Project: analytics-prod │
│ ├── Dataset: raw_data (US multi-region) │
│ │ ├── Table: events │
│ │ ├── Table: users │
│ │ └── External Table: gcs_logs (→ GCS) │
│ │ │
│ ├── Dataset: staging (US multi-region) │
│ │ ├── Table: stg_events │
│ │ └── View: v_active_users │
│ │ │
│ ├── Dataset: marts (US multi-region) │
│ │ ├── Table: dim_users │
│ │ ├── Table: fact_events │
│ │ └── Materialized View: mv_daily_metrics │
│ │ │
│ └── Dataset: ml_models (US multi-region) │
│ └── Model: churn_prediction │
│ │
│ NAMING CONVENTION: │
│ • Datasets: {layer}_{domain} (raw_sales, marts_finance) │
│ • Tables: {prefix}_{entity} (dim_users, fact_orders) │
│ • Views: v_{description} │
│ • Materialized Views: mv_{description} │
│ │
└─────────────────────────────────────────────────────────────────┘
Dataproc
When to Use Dataproc
┌─────────────────────────────────────────────────────────────────┐
│ DATAPROC vs BIGQUERY │
├─────────────────────────────────────────────────────────────────┤
│ │
│ USE BIGQUERY WHEN: USE DATAPROC WHEN: │
│ ───────────────── ──────────────── │
│ • SQL-based analytics • Complex Spark/Hadoop │
│ • BI tool integration • Existing Spark code │
│ • Ad-hoc exploration • ML with Spark MLlib │
│ • Serverless preferred • Streaming (Spark) │
│ • Structured data • Custom libraries │
│ │
│ COST MODEL: COST MODEL: │
│ • Pay per query (TB scanned) • Pay per cluster time │
│ • Or reserved slots • + Spot VMs discount │
│ │
│ SCALING: SCALING: │
│ • Automatic, serverless • Manual or autoscaling │
│ • No cluster management • Cluster lifecycle mgmt │
│ │
└─────────────────────────────────────────────────────────────────┘
Dataproc Cluster Types
┌─────────────────────────────────────────────────────────────────┐
│ DATAPROC CLUSTER PATTERNS │
├─────────────────────────────────────────────────────────────────┤
│ │
│ EPHEMERAL CLUSTERS (Recommended) │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ • Create → Run Job → Delete │ │
│ │ • Use Dataproc Workflows or Cloud Composer │ │
│ │ • Cost-effective for batch jobs │ │
│ │ • No idle cluster costs │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
│ PERSISTENT CLUSTERS (When needed) │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ • Interactive development (Jupyter) │ │
│ │ • Frequent short jobs │ │
│ │ • Use autoscaling to minimize cost │ │
│ │ • Consider Dataproc Serverless instead │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
│ DATAPROC SERVERLESS (Best of both) │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ • No cluster management │ │
│ │ • Auto-scaling │ │
│ │ • Pay per job │ │
│ │ • Spark batches and interactive sessions │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘
Lakehouse Pattern
BigQuery + Cloud Storage
┌─────────────────────────────────────────────────────────────────┐
│ LAKEHOUSE ARCHITECTURE │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Query Layer │ │
│ │ ┌─────────────────────────────────────────────────┐ │ │
│ │ │ BigQuery │ │ │
│ │ │ • Native tables (managed) │ │ │
│ │ │ • External tables (federated) │ │ │
│ │ │ • BigLake tables (governed) │ │ │
│ │ └─────────────────────────────────────────────────┘ │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Storage Layer │ │
│ │ ┌─────────────────────────────────────────────────┐ │ │
│ │ │ Cloud Storage (Data Lake) │ │ │
│ │ │ • Parquet, ORC, Avro, JSON │ │ │
│ │ │ • Delta Lake, Apache Iceberg │ │ │
│ │ │ • Open formats, portable │ │ │
│ │ └─────────────────────────────────────────────────┘ │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
│ BENEFITS: │
│ • Query data in place (no ETL to BigQuery) │
│ • Open formats (avoid lock-in) │
│ • Fine-grained access control with BigLake │
│ • Cost-effective for large, infrequently queried data │
│ │
└─────────────────────────────────────────────────────────────────┘
BigLake Tables
sql
-- Create BigLake table with row-level security
CREATE EXTERNAL TABLE `project.dataset.sales_biglake`
WITH CONNECTION `project.region.connection_id`
OPTIONS (
format = 'PARQUET',
uris = ['gs://bucket/sales/*.parquet'],
metadata_cache_mode = 'AUTOMATIC'
);
-- Apply row-level security
CREATE ROW ACCESS POLICY region_filter
ON `project.dataset.sales_biglake`
GRANT TO ('user:analyst@example.com')
FILTER USING (region = 'APAC');
Data Pipeline Patterns
Modern Data Stack on GCP
Recommended Tools
| Layer | Tool | Purpose |
|---|---|---|
| Orchestration | Cloud Composer (Airflow) | Workflow scheduling |
| Ingestion | Dataflow, Pub/Sub | Batch and streaming |
| Transform | dbt, Dataform | SQL transformations |
| Quality | Dataplex | Data quality, lineage |
| Catalog | Data Catalog | Metadata management |
| Visualization | Looker, Looker Studio | BI and dashboards |
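The transform layer in the table above is typically a set of version-controlled SQL models. A minimal dbt-style model sketch, where the model path, the `stg_events` upstream model, and the materialization settings are all hypothetical:

```sql
-- models/marts/fct_daily_events.sql (hypothetical dbt model)
{{ config(materialized='table', partition_by={'field': 'event_date', 'data_type': 'date'}) }}

SELECT
  DATE(event_timestamp) AS event_date,
  event_type,
  COUNT(*) AS event_count,
  APPROX_COUNT_DISTINCT(user_id) AS unique_users
FROM {{ ref('stg_events') }}
GROUP BY event_date, event_type
```

dbt compiles `ref()` into the fully qualified table name and orders model execution by dependency; Dataform offers an equivalent reference mechanism in SQLX.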
Cost Optimization
BigQuery Cost Control
sql
-- Set up cost controls
-- 1. Project-level quota
-- In Cloud Console: BigQuery > Quotas > Query usage per day
-- 2. User-level quota
-- Use custom quotas in IAM
-- 3. Query dry run (estimate cost before running)
-- bq query --dry_run --use_legacy_sql=false 'SELECT ...'
-- 4. For exploration, prefer the free table preview; note that LIMIT
--    does NOT reduce bytes scanned (except on clustered tables)
SELECT user_id, event_type FROM `project.dataset.large_table` LIMIT 1000;
-- 5. Partition pruning (always filter on partition column)
SELECT * FROM `project.dataset.events`
WHERE DATE(event_timestamp) BETWEEN '2024-01-01' AND '2024-01-31';
Storage Optimization
┌─────────────────────────────────────────────────────────────────┐
│ BIGQUERY STORAGE TIERS │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ACTIVE STORAGE: $0.02/GB/month │
│ • Tables modified in last 90 days │
│ • Streaming buffer │
│ │
│ LONG-TERM STORAGE: $0.01/GB/month (50% cheaper!) │
│ • Tables NOT modified in 90 days │
│ • Automatic, no action needed │
│ • Same query performance │
│ │
│ OPTIMIZATION TIPS: │
│ • Use table expiration for temp tables │
│ • Partition by date, cluster by high-cardinality columns │
│ • Archive old partitions to GCS (external tables) │
│ • Use materialized views for repeated queries │
│ │
└─────────────────────────────────────────────────────────────────┘
Best Practices Checklist
- [ ] Use BigQuery for SQL analytics, Dataproc for Spark
- [ ] Partition tables by date, cluster by query columns
- [ ] Set up cost controls and quotas
- [ ] Use Dataproc Serverless for batch Spark jobs
- [ ] Implement lakehouse pattern for flexibility
- [ ] Use BigLake for governed external data
- [ ] Enable Data Catalog for discoverability
- [ ] Implement data quality checks with Dataplex
⚖️ Trade-offs
Trade-off 1: BigQuery On-Demand vs Capacity
| Aspect | On-Demand | Capacity (Slots) |
|---|---|---|
| Cost model | $6.25/TB scanned | $0.04/slot-hour |
| Predictability | Variable | Fixed |
| Best for | Ad-hoc, variable | Heavy, predictable |
| Break-even | < 500 TB/month | > 500 TB/month |
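The ~500 TB/month break-even can be derived from the two prices. A sketch assuming a 100-slot commitment and the list prices above (both the slot count and prices are assumptions; check current pricing):

```python
# Break-even between BigQuery on-demand and slot-based pricing.
# All prices are assumed list prices for illustration only.
ON_DEMAND_USD_PER_TB = 6.25   # on-demand, per TB scanned
SLOT_HOUR_USD = 0.04          # capacity, per slot-hour
HOURS_PER_MONTH = 730         # average hours in a month

def monthly_slot_cost(num_slots: int) -> float:
    """Fixed monthly cost of an always-on slot commitment."""
    return num_slots * HOURS_PER_MONTH * SLOT_HOUR_USD

def break_even_tb_per_month(num_slots: int) -> float:
    """TB scanned per month above which the slot commitment wins."""
    return monthly_slot_cost(num_slots) / ON_DEMAND_USD_PER_TB

print(monthly_slot_cost(100))               # 2920.0
print(round(break_even_tb_per_month(100)))  # 467
```

A 100-slot commitment costs about $2,920/month and beats on-demand once you scan more than roughly 467 TB/month, which is where the ~500 TB rule of thumb comes from.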
Trade-off 2: Native Tables vs External Tables
| Aspect | Native Tables | External/BigLake |
|---|---|---|
| Performance | Fastest | Slower |
| Cost | Storage + Query | Query only |
| Flexibility | BigQuery only | Open formats |
| Use case | Hot data | Cold/archival |
Trade-off 3: BigQuery vs Dataproc
| Aspect | BigQuery | Dataproc |
|---|---|---|
| Interface | SQL | Spark/Python |
| Ops overhead | Zero | Cluster management |
| ML | BigQuery ML (SQL) | Spark MLlib, custom |
| Cost | Per TB scanned | Per cluster time |
| Best for | SQL analytics | Complex transformations |
🚨 Failure Modes
Failure Mode 1: Runaway Query Costs
🔥 Real incident
An analyst ran a CROSS JOIN across two large tables. The query scanned 500 TB, producing a $3,000 bill from a single query. No quotas were configured.
| How to detect | How to prevent |
|---|---|
| Cost spike alerts | User/project quotas |
| Slot utilization spikes | Dry run before execution |
| Monthly bill shock | Cost controls per query |
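Two of the preventions above map directly onto bq CLI flags: estimate first with a dry run, and enforce a hard cap with a maximum-bytes-billed limit. A sketch, where the query text and the ~10 GB cap are placeholder values:

```shell
# Estimate bytes scanned without running (or paying for) the query
bq query --dry_run --use_legacy_sql=false \
  'SELECT user_id FROM `project.dataset.events` WHERE DATE(event_timestamp) = "2024-01-01"'

# Hard cap: the job fails instead of billing more than ~10 GB
bq query --maximum_bytes_billed=10000000000 --use_legacy_sql=false \
  'SELECT user_id FROM `project.dataset.events`'
```

A failed job costs nothing; a runaway CROSS JOIN under a byte cap is an error message, not an invoice.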
Failure Mode 2: Data Quality Issues
| How to detect | How to prevent |
|---|---|
| Downstream report errors | Dataplex data quality |
| NULL counts spike | Schema validation |
| Duplicate records | Deduplication in pipeline |
Failure Mode 3: Partition/Cluster Inefficiency
| How to detect | How to prevent |
|---|---|
| Query scans unexpected TB | Check partition pruning |
| Slow query performance | Review clustering columns |
| High bytes billed | EXPLAIN plan analysis |
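To locate the queries behind high bytes billed, BigQuery's INFORMATION_SCHEMA jobs views can be queried directly. A sketch assuming the US region (adjust the region qualifier and lookback window to your environment):

```sql
-- Top 10 most expensive queries in the last 7 days (US region assumed)
SELECT
  user_email,
  query,
  total_bytes_billed / POW(1024, 4) AS tib_billed,
  creation_time
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
  AND job_type = 'QUERY'
ORDER BY total_bytes_billed DESC
LIMIT 10;
```

Queries with high `tib_billed` but small result sets are the usual suspects for missing partition filters or unnecessary SELECT *.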
🔐 Security Baseline
Data Security Requirements
| Requirement | Implementation | Verification |
|---|---|---|
| Column-level security | Policy tags | Access audit |
| Row-level security | Row access policies | Query logs |
| Encryption | CMEK for sensitive | Key audit |
| Data masking | Dynamic data masking | Policy review |
| Access control | IAM + dataset permissions | Access Analyzer |
BigQuery Security Checklist
| Security Item | Status |
|---|---|
| Dataset-level IAM | ☑ Required |
| Column-level policy tags | ☑ For PII |
| Row-level access policies | ☑ For multi-tenant |
| CMEK encryption | ☑ For compliance |
| Audit logging enabled | ☑ Required |
| VPC Service Controls | ☑ For sensitive data |
📊 Ops Readiness
Metrics to Monitor
| Metric | Source | Alert Threshold |
|---|---|---|
| Slot utilization | BigQuery | > 80% sustained |
| Bytes billed per day | BigQuery | > budget |
| Query error rate | BigQuery | > 1% |
| Job execution time | BigQuery | P95 > baseline |
| Dataproc cluster utilization | Dataproc | < 50% (idle) |
Runbook Entry Points
| Scenario | Runbook |
|---|---|
| High query cost | runbook/bigquery-cost-optimization.md |
| Slow query performance | runbook/bigquery-query-tuning.md |
| Data quality failure | runbook/data-quality-investigation.md |
| Dataproc job failure | runbook/dataproc-troubleshooting.md |
| Access denied | runbook/bigquery-access-debug.md |
| Slot exhaustion | runbook/bigquery-slot-management.md |
✅ Design Review Checklist
Platform Selection
- [ ] Right platform for the use case
- [ ] Cost model understood
- [ ] Team skills matched
- [ ] Scaling requirements met
BigQuery Design
- [ ] Tables partitioned
- [ ] Clustering columns defined
- [ ] Naming convention followed
- [ ] Materialized views for common queries
Security
- [ ] IAM configured per dataset
- [ ] PII columns tagged
- [ ] Row-level security if multi-tenant
- [ ] Audit logging enabled
Operations
- [ ] Cost quotas configured
- [ ] Query monitoring set up
- [ ] Data quality checks
- [ ] Runbooks documented
📎 Links
- 📎 AWS Storage & Data Protection - Comparison with AWS data services
- 📎 GCP IAM Model - Data access control
- 📎 VPC & Networking - Private connectivity for data services
- 📎 Cost & Quotas - Data platform cost management
- 📎 GCP Security Perimeter - VPC Service Controls for data