
Deploy Model with Flask


1. Purpose

Make your model usable by other systems. A model file sitting on your laptop delivers no value; an API endpoint lets a website, mobile app, or another microservice send data and receive predictions.


2. When to Use / When Not to Use

Use This Workflow When

  • You need a simple, lightweight inference server.
  • The model is CPU-bound and reasonably fast (< 500ms).
  • You are deploying to a standard container runtime (K8s/Docker).

Do NOT Use This Workflow When

  • You need high-throughput batch processing (use Spark/Ray).
  • You need ultra-low latency, < 10 ms (use C++/Triton Inference Server).
  • You need GPU batching (use TorchServe/Triton).

3. Inputs

Required Inputs

  • [[MODEL_PATH]]: Pickled model (model.pkl) or framework format.
  • [[INPUT_SCHEMA]]: JSON structure e.g., {"age": int, "income": float}.
  • [[PORT_NUMBER]]: e.g., 5000 or 8080.

4. Outputs

  • Endpoint: POST /predict.
  • Healthcheck: GET /health.
  • WSGI Server: Production-ready Gunicorn runner.

5. Preconditions

  • Virtual environment active.
  • Model loaded successfully in Python shell.

6. Procedure

Phase 1: Application Structure

  1. Action: Load Model Globally.

    • Expected Output: The model is loaded once at startup (e.g. model = load(path) at module scope), not inside the request handler.
    • Notes: Model loading is slow and memory-intensive; paying that cost on every request would dominate latency.
  2. Action: Define Routes.

    • Expected Output:
      • @app.route('/predict', methods=['POST']).
      • @app.route('/health', methods=['GET']) (Returns 200 OK).
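The two actions above can be sketched in a minimal `app.py`. This is a sketch, not a definitive implementation: it assumes a pickled scikit-learn-style model at `model.pkl` ([[MODEL_PATH]]) and hypothetical `age`/`income` features; the existence guard exists only so the sketch imports cleanly without a model file present.

```python
from pathlib import Path
import pickle

from flask import Flask, jsonify, request

app = Flask(__name__)

# Assumption: the model was saved with pickle at [[MODEL_PATH]].
MODEL_PATH = Path("model.pkl")

# Load ONCE at import time, not inside the request handler.
# (The guard is only so this sketch imports without a model file.)
model = pickle.loads(MODEL_PATH.read_bytes()) if MODEL_PATH.exists() else None


@app.route("/health", methods=["GET"])
def health():
    # Liveness probe: always 200 OK if the process is up.
    return jsonify({"status": "ok"}), 200


@app.route("/predict", methods=["POST"])
def predict():
    if model is None:
        return jsonify({"error": "model not loaded"}), 503
    payload = request.get_json(silent=True) or {}
    # Real validation against [[INPUT_SCHEMA]] belongs here (Phase 2).
    features = [[payload.get("age", 0), payload.get("income", 0.0)]]
    return jsonify({"prediction": model.predict(features).tolist()}), 200
```

Loading at module scope means every Gunicorn worker pays the load cost exactly once, at fork/boot time.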

Phase 2: Request Handling

  1. Action: Validate Input.

    • Expected Output: Check request.json against [[INPUT_SCHEMA]]. Return 400 Bad Request if required keys are missing or have the wrong type.
  2. Action: Inference & Serialization.

    • Expected Output: Convert the validated input into a feature vector, run the model, and return JSON such as {"prediction": 1, "confidence": 0.95}.
    • Notes: Use .tolist() when returning NumPy arrays; NumPy scalars and arrays are not JSON-serializable as-is.
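Validation and serialization are easiest to test as plain functions, separate from the route. A minimal sketch, assuming the example schema `{"age": int, "income": float}` from the Inputs section (the float field also accepts ints, and the `bool` check matters because `isinstance(True, int)` is `True` in Python):

```python
# Hypothetical schema mirroring [[INPUT_SCHEMA]].
INPUT_SCHEMA = {"age": (int,), "income": (int, float)}


def validate(payload):
    """Return a list of error messages; an empty list means valid."""
    if not isinstance(payload, dict):
        return ["body must be a JSON object"]
    errors = []
    for key, allowed in INPUT_SCHEMA.items():
        if key not in payload:
            errors.append(f"missing key: {key}")
        elif isinstance(payload[key], bool) or not isinstance(payload[key], allowed):
            errors.append(f"wrong type for key: {key}")
    return errors


def serialize(prediction, confidence):
    """Coerce NumPy scalars to built-ins (use .tolist() for arrays)."""
    return {"prediction": int(prediction), "confidence": float(confidence)}
```

In the route, a non-empty error list maps to `return jsonify({"errors": errors}), 400`.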

Phase 3: Production Server

  1. Action: Configure Gunicorn.
    • Expected Output: gunicorn -w 4 -b 0.0.0.0:5000 app:app.
    • Notes: The Flask development server (app.run) is NOT for production.
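Equivalently, the flags can live in a `gunicorn_config.py` (run with `gunicorn -c gunicorn_config.py app:app`). A minimal sketch; the worker count follows the common 2×CPU+1 rule of thumb and should be tuned per workload:

```python
# gunicorn_config.py -- minimal sketch, tune per workload
import multiprocessing

bind = "0.0.0.0:5000"                           # [[PORT_NUMBER]]
workers = multiprocessing.cpu_count() * 2 + 1   # 2*CPU + 1 rule of thumb
worker_class = "sync"                           # CPU-bound inference
timeout = 30                                    # recycle workers stuck > 30 s
```

Note each sync worker is a separate process holding its own copy of the model, so memory scales linearly with `workers`.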

7. Quality Gates

  • [ ] Thread Safety: The model's predict method is thread-safe, or each worker runs a single thread.
  • [ ] Error Handling: 500 errors are caught and logged, not exposed as stack traces to user.
  • [ ] Serialization: JSON response is valid (No NaN or Infinity).
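The serialization gate trips easily because Python's `json` module happily emits `NaN`/`Infinity`, which strict JSON parsers reject. One way to sketch a guard is a recursive sanitizer applied to the response dict before `jsonify` (names here are illustrative, not a standard API):

```python
import math


def json_safe(value):
    """Replace non-finite floats with None so the payload is strict-JSON valid."""
    if isinstance(value, float) and not math.isfinite(value):
        return None
    if isinstance(value, dict):
        return {key: json_safe(val) for key, val in value.items()}
    if isinstance(value, list):
        return [json_safe(item) for item in value]
    return value
```

Whether to map non-finite values to `None`, `0.0`, or a 500 error is a product decision; the gate only requires that raw `NaN`/`Infinity` never reach the wire.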

8. Failure Handling

High Latency

  • Symptoms: Inference takes seconds (e.g. 2 s) per request.
  • Recovery: Profile the code. Is the time spent in input preprocessing or in the model itself? If the model, export to ONNX Runtime or apply quantization / a smaller model.

Memory Leak

  • Symptoms: RAM usage grows until crash.
  • Recovery: Do not append request data to a global list (a common debugging leftover). Ensure tensor memory is freed after each request.

9. Paste Prompt

TIP

One-Click Agent Invocation: copy the prompt below, replace the placeholders, and paste it into your agent.

Role: Act as a Backend Engineer.
Task: Execute the Flask Model Serving workflow.

## Objective
Serve [[MODEL_PATH]] on port [[PORT_NUMBER]].

## Inputs
- **Schema**: [[INPUT_SCHEMA]]

## Procedure
Execute the following phases:

1. **Setup**:
   - Initialize Flask app.
   - Load Model (Global scope).

2. **Endpoints**:
   - `/health`: Return status "ok".
   - `/predict`:
     - Parse JSON.
     - Validate Schema.
     - Run `model.predict()`.
     - Return JSON result.

3. **Run**:
   - Create `gunicorn_config.py`.
   - Set workers = 2 * CPU + 1.

## Quality Gates
- [ ] Input validation logic present.
- [ ] No `app.run` in production path.
- [ ] JSON serialization handles Numpy types.

## Constraints
- Output: Python Code.
- Framework: Flask.

## Command
Scaffold `app.py`.

Last updated: