
Deploy Model with Flask


1. Purpose

Make your model usable by other systems. A model file sitting on your laptop delivers no value; an API endpoint lets a website, mobile app, or another microservice send data and receive predictions.


2. When to Use / When Not to Use

Use This Workflow When

  • You need a simple, lightweight inference server.
  • The model is CPU-bound and reasonably fast (< 500ms).
  • You are deploying to a standard container runtime (K8s/Docker).

Do NOT Use This Workflow When

  • You need high-throughput batch processing (use Spark/Ray).
  • You need ultra-low latency, < 10 ms (use C++/Triton Inference Server).
  • You need GPU batching (use TorchServe/Triton).

3. Inputs

Required Inputs

  • [[MODEL_PATH]]: Pickled model (model.pkl) or framework format.
  • [[INPUT_SCHEMA]]: JSON structure e.g., {"age": int, "income": float}.
  • [[PORT_NUMBER]]: e.g., 5000 or 8080.

4. Outputs

  • Endpoint: POST /predict.
  • Healthcheck: GET /health.
  • WSGI Server: Production-ready Gunicorn runner.

5. Preconditions

  • Virtual environment active.
  • Model loaded successfully in Python shell.

6. Procedure

Phase 1: Application Structure

  1. Action: Load Model Globally.

    • Expected Output: The model is loaded once at startup (e.g. model = load(path) at module scope), not inside the request handler.
    • Notes: Model loading is slow and memory-intensive; paying that cost on every request would dominate latency.
  2. Action: Define Routes.

    • Expected Output:
      • @app.route('/predict', methods=['POST']).
      • @app.route('/health', methods=['GET']) (Returns 200 OK).
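The two actions above can be sketched in a minimal `app.py`. This is a sketch, not a definitive implementation: it assumes a pickled scikit-learn-style model at `model.pkl` ([[MODEL_PATH]]) and hypothetical `age`/`income` features; the existence guard exists only so the sketch imports cleanly without a model file present.

```python
from pathlib import Path
import pickle

from flask import Flask, jsonify, request

app = Flask(__name__)

# Assumption: the model was saved with pickle at [[MODEL_PATH]].
MODEL_PATH = Path("model.pkl")

# Load ONCE at import time, not inside the request handler.
# (The guard is only so this sketch imports without a model file.)
model = pickle.loads(MODEL_PATH.read_bytes()) if MODEL_PATH.exists() else None


@app.route("/health", methods=["GET"])
def health():
    # Liveness probe: always 200 OK if the process is up.
    return jsonify({"status": "ok"}), 200


@app.route("/predict", methods=["POST"])
def predict():
    if model is None:
        return jsonify({"error": "model not loaded"}), 503
    payload = request.get_json(silent=True) or {}
    # Real validation against [[INPUT_SCHEMA]] belongs here (Phase 2).
    features = [[payload.get("age", 0), payload.get("income", 0.0)]]
    return jsonify({"prediction": model.predict(features).tolist()}), 200
```

Loading at module scope means every Gunicorn worker pays the load cost exactly once, at fork/boot time.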

Phase 2: Request Handling

  1. Action: Validate Input.

    • Expected Output: Check request.json against [[INPUT_SCHEMA]]. Return 400 Bad Request if required keys are missing or have the wrong type.
  2. Action: Inference & Serialization.

    • Expected Output: Convert the validated input into a feature vector, run the model, and return JSON such as {"prediction": 1, "confidence": 0.95}.
    • Notes: Use .tolist() when returning NumPy arrays; NumPy scalars and arrays are not JSON-serializable as-is.
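Validation and serialization are easiest to test as plain functions, separate from the route. A minimal sketch, assuming the example schema `{"age": int, "income": float}` from the Inputs section (the float field also accepts ints, and the `bool` check matters because `isinstance(True, int)` is `True` in Python):

```python
# Hypothetical schema mirroring [[INPUT_SCHEMA]].
INPUT_SCHEMA = {"age": (int,), "income": (int, float)}


def validate(payload):
    """Return a list of error messages; an empty list means valid."""
    if not isinstance(payload, dict):
        return ["body must be a JSON object"]
    errors = []
    for key, allowed in INPUT_SCHEMA.items():
        if key not in payload:
            errors.append(f"missing key: {key}")
        elif isinstance(payload[key], bool) or not isinstance(payload[key], allowed):
            errors.append(f"wrong type for key: {key}")
    return errors


def serialize(prediction, confidence):
    """Coerce NumPy scalars to built-ins (use .tolist() for arrays)."""
    return {"prediction": int(prediction), "confidence": float(confidence)}
```

In the route, a non-empty error list maps to `return jsonify({"errors": errors}), 400`.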

Phase 3: Production Server

  1. Action: Configure Gunicorn.
    • Expected Output: gunicorn -w 4 -b 0.0.0.0:5000 app:app.
    • Notes: The Flask development server (app.run) is NOT for production.
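Equivalently, the flags can live in a `gunicorn_config.py` (run with `gunicorn -c gunicorn_config.py app:app`). A minimal sketch; the worker count follows the common 2×CPU+1 rule of thumb and should be tuned per workload:

```python
# gunicorn_config.py -- minimal sketch, tune per workload
import multiprocessing

bind = "0.0.0.0:5000"                           # [[PORT_NUMBER]]
workers = multiprocessing.cpu_count() * 2 + 1   # 2*CPU + 1 rule of thumb
worker_class = "sync"                           # CPU-bound inference
timeout = 30                                    # recycle workers stuck > 30 s
```

Note each sync worker is a separate process holding its own copy of the model, so memory scales linearly with `workers`.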

7. Quality Gates

  • [ ] Thread Safety: The model's predict method is thread-safe, or each worker runs a single thread.
  • [ ] Error Handling: 500 errors are caught and logged, not exposed as stack traces to user.
  • [ ] Serialization: JSON response is valid (No NaN or Infinity).
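The serialization gate trips easily because Python's `json` module happily emits `NaN`/`Infinity`, which strict JSON parsers reject. One way to sketch a guard is a recursive sanitizer applied to the response dict before `jsonify` (names here are illustrative, not a standard API):

```python
import math


def json_safe(value):
    """Replace non-finite floats with None so the payload is strict-JSON valid."""
    if isinstance(value, float) and not math.isfinite(value):
        return None
    if isinstance(value, dict):
        return {key: json_safe(val) for key, val in value.items()}
    if isinstance(value, list):
        return [json_safe(item) for item in value]
    return value
```

Whether to map non-finite values to `None`, `0.0`, or a 500 error is a product decision; the gate only requires that raw `NaN`/`Infinity` never reach the wire.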

8. Failure Handling

High Latency

  • Symptoms: Inference takes seconds (e.g. 2 s) per request.
  • Recovery: Profile the code. Is the time spent in input preprocessing or in the model itself? If the model, export to ONNX Runtime or apply quantization / a smaller model.

Memory Leak

  • Symptoms: RAM usage grows until crash.
  • Recovery: Do not append request data to a global list (a common debugging leftover). Ensure tensor memory is freed after each request.

9. Paste Prompt

TIP

One-Click Agent Invocation: copy the prompt below, replace the placeholders, and paste it into your agent.

Role: Act as a Backend Engineer.
Task: Execute the Flask Model Serving workflow.

## Objective
Serve [[MODEL_PATH]] on port [[PORT_NUMBER]].

## Inputs
- **Schema**: [[INPUT_SCHEMA]]

## Procedure
Execute the following phases:

1. **Setup**:
   - Initialize Flask app.
   - Load Model (Global scope).

2. **Endpoints**:
   - `/health`: Return status "ok".
   - `/predict`:
     - Parse JSON.
     - Validate Schema.
     - Run `model.predict()`.
     - Return JSON result.

3. **Run**:
   - Create `gunicorn_config.py`.
   - Set workers = 2 * CPU + 1.

## Quality Gates
- [ ] Input validation logic present.
- [ ] No `app.run` in production path.
- [ ] JSON serialization handles Numpy types.

## Constraints
- Output: Python Code.
- Framework: Flask.

## Command
Scaffold `app.py`.

Last updated: