Merged
27 changes: 27 additions & 0 deletions .gitattributes
@@ -0,0 +1,27 @@
# Auto detect text files and perform LF normalization
* text=auto

# Python files
*.py text eol=lf

# Markdown files
*.md text eol=lf

# JSON files
*.json text eol=lf

# YAML files
*.yml text eol=lf
*.yaml text eol=lf

# Shell scripts
*.sh text eol=lf

# Configuration files
*.toml text eol=lf
*.cfg text eol=lf
*.ini text eol=lf

# Keep Windows batch files with CRLF
*.bat text eol=crlf
*.cmd text eol=crlf
43 changes: 43 additions & 0 deletions .github/workflows/ci.yml
@@ -0,0 +1,43 @@
name: CI

on:
push:
branches: [master, qwen2.5-coder]
pull_request:
branches: [master, qwen2.5-coder]

jobs:
test:
runs-on: ubuntu-latest

steps:
- uses: actions/checkout@v4

- name: Set up Python 3.12
uses: actions/setup-python@v5
with:
python-version: "3.12"
cache: 'pip'

- name: Install dependencies
run: |
python -m pip install --upgrade pip
pip install -e ".[dev]"

- name: Run linter
run: python -m ruff check .
continue-on-error: true

- name: Run tests with coverage
run: |
python -m pytest tests/ -v --tb=short --cov=src --cov-report=term-missing --cov-report=xml

- name: Upload coverage to Codecov
uses: codecov/codecov-action@v4
with:
token: ${{ secrets.CODECOV_TOKEN }}
files: ./coverage.xml
flags: unittests
name: ci-coverage
fail_ci_if_error: false
verbose: true
1 change: 1 addition & 0 deletions .gitignore
@@ -23,6 +23,7 @@ build/

# IDE
.vscode/
.pytest_cache/

# Jupyter
notebooks/.ipynb_checkpoints/
224 changes: 222 additions & 2 deletions README.md
@@ -53,8 +53,11 @@ Test Dataset + Model Predictions --> [benchmark.py] --> Metrics Report
- **`train_lora.py`** - LoRA fine-tuning using HuggingFace Trainer + PEFT. Supports
QLoRA (4-bit quantization) for training on 1-2 A100 GPUs.

- **`serve.py`** - FastAPI inference server that loads the fine-tuned model and
serves docstring generation via HTTP.
- **`serve.py`** - FastAPI inference server that uses the Ollama API to generate
docstrings. Supports multiple Qwen Coder models with model-specific configurations.

- **`models.py`** - Model configuration registry with sampling parameters for
Qwen 2.5 Coder and Qwen3 Coder variants.

### Evaluation (`src/evaluation/`)

@@ -87,6 +90,223 @@ python -m src.data.convert_seed \
--output-dir data/processed/python-method
```

## Serving

The FastAPI inference server provides HTTP endpoints for docstring generation using
ollama as the backend. The server uses a system prompt stored in
`src/training/prompts/system_prompt.md` to generate NumPy-style docstrings.

### Prerequisites

1. **Install ollama**: Make sure [ollama](https://ollama.ai/) is installed and running locally
2. **Pull a model**: Download one of the supported code models:
```bash
# Qwen 2.5 Coder (dense models)
ollama pull qwen2.5-coder:32b # Default, ~18GB Q4
ollama pull qwen2.5-coder:14b # Mid-size, ~8GB Q4
ollama pull qwen2.5-coder:7b # Fast, ~4GB Q4

# Qwen3 Coder (MoE model)
ollama pull qwen3-coder:30b-a3b # Best quality, ~18GB Q4, 256K context
```

### Starting the Server

Start the FastAPI server using uvicorn:

**Linux/macOS:**
```bash
# Using uvicorn directly
uvicorn src.training.serve:app --host 0.0.0.0 --port 8000

# Or run the module directly
python -m src.training.serve
```

**Windows (PowerShell):**
```powershell
uvicorn src.training.serve:app --host 0.0.0.0 --port 8000
```

The server will start on `http://localhost:8000` by default.

### Configuration

The server can be configured using environment variables:

- `OLLAMA_URL` - Ollama API endpoint (default: `http://localhost:11434/api/chat`)
- `OLLAMA_MODEL` - Model key or Ollama model name (default: `qwen2.5-coder-32b`)
- `REQUEST_TIMEOUT` - Request timeout in seconds (default: `120.0`)

**Linux/macOS:**
```bash
OLLAMA_MODEL=qwen3-coder-30b uvicorn src.training.serve:app --port 8000
```

**Windows (PowerShell):**
```powershell
$env:OLLAMA_MODEL="qwen3-coder-30b"; uvicorn src.training.serve:app --port 8000
```

**Windows (CMD):**
```cmd
set OLLAMA_MODEL=qwen3-coder-30b
uvicorn src.training.serve:app --port 8000
```
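For reference, the resolution of these variables can be sketched as a small helper. The function below is illustrative, not the actual `serve.py` code, though the variable names and defaults mirror the ones documented above.

```python
import os

# Illustrative sketch of how the server might resolve its settings; the
# environment variable names and defaults mirror the README, but this is
# not the project's actual implementation.
def load_config(env=None):
    """Read server settings from an environment mapping, with defaults."""
    env = os.environ if env is None else env
    return {
        "ollama_url": env.get("OLLAMA_URL", "http://localhost:11434/api/chat"),
        "model": env.get("OLLAMA_MODEL", "qwen2.5-coder-32b"),
        "timeout": float(env.get("REQUEST_TIMEOUT", "120.0")),
    }

# Passing an explicit mapping makes the resolution easy to test.
config = load_config({"OLLAMA_MODEL": "qwen3-coder-30b"})
```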

### Available Models

| Model Key | Ollama Model | Architecture | Memory (Q4) | Context | Description |
|-----------|--------------|--------------|-------------|---------|-------------|
| `qwen2.5-coder-32b` | `qwen2.5-coder:32b` | Dense | ~18GB | 32K | Default, balanced quality/speed |
| `qwen2.5-coder-14b` | `qwen2.5-coder:14b` | Dense | ~8GB | 32K | Mid-size, good performance |
| `qwen2.5-coder-7b` | `qwen2.5-coder:7b` | Dense | ~4GB | 32K | Fast inference |
| `qwen3-coder-30b` | `qwen3-coder:30b-a3b` | MoE | ~18GB | 256K | Best quality, 3.3B active params |

Each model has optimized sampling parameters:
- **Qwen 2.5 Coder**: temperature=0.7, top_p=0.9, top_k=40
- **Qwen3 Coder**: temperature=1.0, top_p=0.95, top_k=40 (per official recommendations)
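A registry like the one `models.py` provides might look roughly like this. The exact field names here are assumptions, but the values follow the table and sampling parameters above; `resolve_model` mirrors the documented behavior of accepting either a model key or a raw Ollama model name.

```python
# Hypothetical sketch of a model registry; field names are assumptions
# based on the README's model table, not the real models.py schema.
MODEL_REGISTRY = {
    "qwen2.5-coder-32b": {
        "ollama_model": "qwen2.5-coder:32b",
        "context_window": 32768,
        "temperature": 0.7, "top_p": 0.9, "top_k": 40,
    },
    "qwen3-coder-30b": {
        "ollama_model": "qwen3-coder:30b-a3b",
        "context_window": 262144,
        "temperature": 1.0, "top_p": 0.95, "top_k": 40,
    },
}

def resolve_model(name, default="qwen2.5-coder-32b"):
    """Accept either a registry key or a raw Ollama model name."""
    if name in MODEL_REGISTRY:
        return MODEL_REGISTRY[name]
    for cfg in MODEL_REGISTRY.values():
        if cfg["ollama_model"] == name:
            return cfg
    return MODEL_REGISTRY[default]
```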

### Model Selection

You can select a model in two ways:

1. **Environment variable** (applies to all requests):
```bash
OLLAMA_MODEL=qwen3-coder-30b uvicorn src.training.serve:app
```

2. **Per-request** (via API):
```bash
curl -X POST http://localhost:8000/generate \
-H "Content-Type: application/json" \
-d '{"code": "def add(x, y): return x + y", "model": "qwen3-coder-30b"}'
```
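The same per-request override can be issued from Python. The helpers below are illustrative, not part of the project; they assume only the documented `/generate` request and response schema, and `generate_docstring` requires the server to be running.

```python
import json
from urllib import request

def build_payload(code, model=None, max_new_tokens=None):
    """Assemble a /generate request body; optional fields are omitted."""
    payload = {"code": code}
    if model is not None:
        payload["model"] = model
    if max_new_tokens is not None:
        payload["max_new_tokens"] = max_new_tokens
    return payload

def generate_docstring(code, model=None,
                       url="http://localhost:8000/generate"):
    """POST to a running server and return the generated docstring."""
    body = json.dumps(build_payload(code, model=model)).encode()
    req = request.Request(url, data=body,
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return json.loads(resp.read())["docstring"]
```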

### List Available Models

**Via CLI:**
```bash
python scripts/run_ollama.py --list-models
```

**Via API:**
```bash
curl http://localhost:8000/models
```

### API Endpoints

#### Health Check

Check if the service is healthy and ollama is accessible:

```bash
curl http://localhost:8000/health
```

**Response (200 OK):**
```json
{
"status": "healthy",
"service": "ollama",
"active_model": "Qwen 2.5 Coder 32B",
"ollama_model": "qwen2.5-coder:32b"
}
```

**Response (503 Service Unavailable):**
```json
{
"detail": "Service unhealthy: ollama is not running or not accessible"
}
```
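A client can gate work on the health check with a small predicate; this is an illustrative sketch based on the two responses shown above.

```python
# Interpret a /health response: healthy means HTTP 200 plus a "healthy"
# status field, per the documented response bodies.
def is_healthy(status_code, body):
    """Return True when /health reports a healthy Ollama backend."""
    return status_code == 200 and body.get("status") == "healthy"
```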

#### Generate Docstring

Generate a docstring for a Python function:

```bash
curl -X POST http://localhost:8000/generate \
-H "Content-Type: application/json" \
-d '{
"code": "def add(x, y):\n return x + y",
"max_new_tokens": 256
}'
```

**Request Body:**
- `code` (required): Python function code as a string
- `max_new_tokens` (optional): Maximum number of tokens to generate (uses model default if not specified)
- `model` (optional): Model key or Ollama model name to use for this request

**Response (200 OK):**
```json
{
"docstring": "\"\"\"Compute the sum of two numbers.\n\nParameters\n----------\nx : int\n First number.\ny : int\n Second number.\n\nReturns\n-------\nint\n Sum of x and y.\"\"\"",
"model": "qwen2.5-coder:32b"
}
```

**Response (500 Internal Server Error):**
```json
{
"detail": "Failed to generate docstring: <error message>"
}
```
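Since the `docstring` field arrives with its triple quotes included, a caller might splice it straight into the original source. The helper below is a hypothetical sketch, not project code, and only handles single-line `def` headers.

```python
def insert_docstring(func_source, docstring, indent="    "):
    """Splice a docstring (triple quotes included) after the def line.

    Simplification: assumes the signature fits on the first line.
    """
    lines = func_source.splitlines()
    header, body = lines[0], lines[1:]
    doc_lines = [indent + line for line in docstring.splitlines()]
    return "\n".join([header, *doc_lines, *body])
```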

#### List Models

Get available model configurations:

```bash
curl http://localhost:8000/models
```

**Response (200 OK):**
```json
{
"default": "qwen2.5-coder-32b",
"active": "qwen2.5-coder-32b",
"models": [
{
"key": "qwen2.5-coder-32b",
"name": "Qwen 2.5 Coder 32B",
"ollama_model": "qwen2.5-coder:32b",
"context_window": 32768,
"architecture": "dense",
"memory_q4": "~18GB",
"description": "Dense 32B model, good balance of quality and speed"
}
]
}
```
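One way a client might consume this response is to filter models by available VRAM. The helper below is illustrative and assumes the `memory_q4` strings follow the format shown above (e.g. `"~18GB"`).

```python
# Filter the /models response down to entries that fit a VRAM budget.
# Parsing "~18GB"-style strings is an assumption about the field format.
def models_within_budget(models_response, budget_gb):
    def gb(entry):
        return float(entry["memory_q4"].strip("~GB"))
    return [m["key"] for m in models_response["models"]
            if gb(m) <= budget_gb]
```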

### CLI Tool

The CLI tool allows testing docstring generation directly:

```bash
# Use default model
python scripts/run_ollama.py --user "def add(x, y): return x + y"

# Use specific model by key
python scripts/run_ollama.py --model-key qwen3-coder-30b --user "def foo(): pass"

# Use raw Ollama model name
python scripts/run_ollama.py --model qwen2.5-coder:7b --user "def bar(): pass"

# List available models
python scripts/run_ollama.py --list-models
```

### Testing

Run the test suite to verify the API endpoints:

```bash
pytest tests/test_serve.py tests/test_models.py -v
```

## Dataset

The seed dataset comes from the [NeuralCodeSum](https://github.com/wasiahmad/NeuralCodeSum)
11 changes: 11 additions & 0 deletions codecov.yml
@@ -0,0 +1,11 @@
comment:
layout: "reach,diff,flags,tree"
behavior: default
require_changes: false
require_base: no
require_head: yes

# Optional: configure thresholds or ignore patterns below
# coverage:
# precision: 2
# round: down
3 changes: 3 additions & 0 deletions pyproject.toml
@@ -24,12 +24,15 @@ dependencies = [
"safetensors",
"fastapi>=0.104.0",
"uvicorn>=0.24.0",
"requests>=2.31.0",
]

[project.optional-dependencies]
dev = [
"pytest>=7.0",
"pytest-cov>=4.0",
"ruff>=0.1.0",
"httpx>=0.24.0",
]

[tool.hatch.build.targets.wheel]