Deploy Model with Flask
1. Purpose
Make your model usable by others. A model file sitting on your laptop helps no one; an API endpoint lets a website, mobile app, or other microservice send data and receive predictions.
2. When to Use / When Not to Use
Use This Workflow When
- You need a simple, lightweight inference server.
- The model is CPU-bound and reasonably fast (< 500ms).
- You are deploying to a standard container runtime (K8s/Docker).
Do NOT Use This Workflow When
- High-throughput batch processing (Use Spark/Ray).
- Ultra-low latency < 10ms (Use C++/Triton Inference Server).
- You need GPU batching (Use TorchServe/Triton).
3. Inputs
Required Inputs
- [[MODEL_PATH]]: Pickled model (`model.pkl`) or framework format.
- [[INPUT_SCHEMA]]: JSON structure, e.g., `{"age": int, "income": float}`.
- [[PORT_NUMBER]]: e.g., `5000` or `8080`.
4. Outputs
- Endpoint: `POST /predict`.
- Healthcheck: `GET /health`.
- WSGI Server: Production-ready Gunicorn runner.
5. Preconditions
- Virtual environment active.
- Model loaded successfully in Python shell.
6. Procedure
Phase 1: Application Structure
Action: Load Model Globally.
- Expected Output: Model loaded once at startup (`model = load(path)`), not inside the request handler.
- Notes: Loading is slow and memory-intensive.
Action: Define Routes.
- Expected Output: `@app.route('/predict', methods=['POST'])` and `@app.route('/health', methods=['GET'])` (returns 200 OK).
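Phase 1 can be sketched as a minimal `app.py`. This is a sketch, not a definitive implementation: the `_StubModel` class stands in for whatever you load from [[MODEL_PATH]], assuming a scikit-learn-style object with a `.predict()` method.

```python
# app.py -- minimal sketch of Phase 1: global model load + two routes.
from flask import Flask, jsonify, request

app = Flask(__name__)

# Load ONCE at import time (global scope), never inside a request handler.
# Stand-in for: model = pickle.load(open(MODEL_PATH, "rb"))
class _StubModel:
    def predict(self, rows):
        return [1 for _ in rows]

model = _StubModel()

@app.route("/health", methods=["GET"])
def health():
    # Cheap liveness probe; should not touch the model.
    return jsonify({"status": "ok"}), 200

@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json(silent=True)
    if payload is None:
        return jsonify({"error": "body must be valid JSON"}), 400
    prediction = model.predict([payload])[0]
    return jsonify({"prediction": prediction}), 200
```

Because the model is bound at module import, each Gunicorn worker pays the load cost exactly once instead of on every request.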
Phase 2: Request Handling
Action: Validate Input.
- Expected Output: `request.json` checked against [[INPUT_SCHEMA]]; return `400 Bad Request` if keys are missing.
Action: Inference & Serialization.
- Expected Output: Input vector converted to a prediction; JSON returned, e.g. `{"prediction": 1, "confidence": 0.95}`.
- Notes: Use `numpy.tolist()` if returning numpy arrays.
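The validation step above can be sketched in plain Python. The schema shape here mirrors the [[INPUT_SCHEMA]] example from the Inputs section; the `validate` helper name is illustrative, not part of any library.

```python
# Sketch of schema validation: field name -> expected Python type,
# matching the [[INPUT_SCHEMA]] example from the Inputs section.
INPUT_SCHEMA = {"age": int, "income": float}

def validate(payload):
    """Return (True, None) if payload matches the schema, else (False, message)."""
    if not isinstance(payload, dict):
        return False, "body must be a JSON object"
    missing = [k for k in INPUT_SCHEMA if k not in payload]
    if missing:
        # Caller maps this to a 400 Bad Request response.
        return False, f"missing keys: {missing}"
    for key, expected in INPUT_SCHEMA.items():
        if not isinstance(payload[key], expected):
            return False, f"{key} must be {expected.__name__}"
    return True, None
```

Inside the `/predict` handler you would call `ok, err = validate(request.json)` and return `(jsonify({"error": err}), 400)` when `ok` is false.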
Phase 3: Production Server
- Action: Configure Gunicorn.
- Expected Output: `gunicorn -w 4 -b 0.0.0.0:5000 app:app`.
- Notes: The Flask dev server is NOT for production.
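Instead of CLI flags, the same settings can live in a `gunicorn_config.py` (Gunicorn reads module-level names from the file passed to `-c`). A minimal sketch, using the 2×CPU+1 worker rule from the paste prompt below; the timeout value is an assumption, tune it to your model's latency:

```python
# gunicorn_config.py -- sketch; run with: gunicorn -c gunicorn_config.py app:app
import multiprocessing

bind = "0.0.0.0:5000"
workers = multiprocessing.cpu_count() * 2 + 1  # the 2*CPU+1 rule of thumb
timeout = 30  # seconds; assumed value, raise it for slow models
```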
7. Quality Gates
- [ ] Thread Safety: Model prediction method is thread-safe or workers = 1.
- [ ] Error Handling: 500 errors are caught and logged, not exposed as stack traces to user.
- [ ] Serialization: JSON response is valid (no `NaN` or `Infinity`).
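The serialization gate matters because Python's `json.dumps` happily emits `NaN` and `Infinity` by default, which are not valid JSON and break strict client parsers. A small sketch of one way to pass that gate (the `sanitize` helper is illustrative):

```python
# Replace non-finite floats with null before serializing, so the JSON
# response never contains NaN or Infinity tokens.
import json
import math

def sanitize(value):
    """Recursively replace NaN/Infinity with None so the output is valid JSON."""
    if isinstance(value, float) and not math.isfinite(value):
        return None
    if isinstance(value, dict):
        return {k: sanitize(v) for k, v in value.items()}
    if isinstance(value, list):
        return [sanitize(v) for v in value]
    return value

body = sanitize({"prediction": float("nan"), "confidence": 0.95})
print(json.dumps(body))  # -> {"prediction": null, "confidence": 0.95}
```

Alternatively, `json.dumps(..., allow_nan=False)` raises a `ValueError` instead of emitting invalid tokens, which at least fails loudly inside your 500 handler rather than at the client.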
8. Failure Handling
High Latency
- Symptoms: Inference takes ~2s per request.
- Recovery: Profile the code. Is it input preprocessing, or the model itself? If the model, use ONNX Runtime or quantize to a smaller model.
Memory Leak
- Symptoms: RAM usage grows until crash.
- Recovery: Do not append requests to a global list (a common debugging leftover). Ensure tensor memory is freed after each request.
9. Paste Prompt
TIP
One-Click Agent Invocation: Copy the prompt below, replace the placeholders, and paste it into your agent.
Role: Act as a Backend Engineer.
Task: Execute the Flask Model Serving workflow.
## Objective
Serve [[MODEL_PATH]] on port [[PORT_NUMBER]].
## Inputs
- **Schema**: [[INPUT_SCHEMA]]
## Procedure
Execute the following phases:
1. **Setup**:
- Initialize Flask app.
- Load Model (Global scope).
2. **Endpoints**:
- `/health`: Return status "ok".
- `/predict`:
- Parse JSON.
- Validate Schema.
- Run `model.predict()`.
- Return JSON result.
3. **Run**:
- Create `gunicorn_config.py`.
- Set workers = 2 * CPU + 1.
## Quality Gates
- [ ] Input validation logic present.
- [ ] No `app.run` in production path.
- [ ] JSON serialization handles Numpy types.
## Constraints
- Output: Python Code.
- Framework: Flask.
## Command
Scaffold `app.py`.