Merged
27 changes: 27 additions & 0 deletions .gitattributes
@@ -0,0 +1,27 @@
# Auto detect text files and perform LF normalization
* text=auto

# Python files
*.py text eol=lf

# Markdown files
*.md text eol=lf

# JSON files
*.json text eol=lf

# YAML files
*.yml text eol=lf
*.yaml text eol=lf

# Shell scripts
*.sh text eol=lf

# Configuration files
*.toml text eol=lf
*.cfg text eol=lf
*.ini text eol=lf

# Keep Windows batch files with CRLF
*.bat text eol=crlf
*.cmd text eol=crlf
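These rules can be sanity-checked with `git check-attr`. A minimal sketch in a throwaway repository (assumes `git` is installed; `script.py` and `run.bat` are illustrative paths, not files from this repo):

```shell
# Create a throwaway repo and reproduce two of the rules above
tmp=$(mktemp -d) && cd "$tmp" && git init -q
printf '*.py text eol=lf\n*.bat text eol=crlf\n' > .gitattributes
git check-attr text eol -- script.py run.bat
# In a real checkout, re-apply new rules to already-tracked files with:
#   git add --renormalize .
```

`check-attr` prints the attribute each path resolves to (one `path: attr: value` line per attribute), which is handy when patterns overlap.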
136 changes: 125 additions & 11 deletions README.md
@@ -54,8 +54,10 @@ Test Dataset + Model Predictions --> [benchmark.py] --> Metrics Report
QLoRA (4-bit quantization) for training on 1-2 A100 GPUs.

- **`serve.py`** - FastAPI inference server that uses the ollama API to generate
docstrings. Supports multiple Qwen Coder models with model-specific configurations.

- **`models.py`** - Model configuration registry with sampling parameters for
Qwen 2.5 Coder and Qwen3 Coder variants.

### Evaluation (`src/evaluation/`)

@@ -97,15 +99,22 @@ ollama as the backend. The server uses a system prompt stored in
### Prerequisites

1. **Install ollama**: Make sure [ollama](https://ollama.ai/) is installed and running locally
2. **Pull a model**: Download one of the supported code models:
```bash
# Qwen 2.5 Coder (dense models)
ollama pull qwen2.5-coder:32b # Default, ~18GB Q4
ollama pull qwen2.5-coder:14b # Mid-size, ~8GB Q4
ollama pull qwen2.5-coder:7b # Fast, ~4GB Q4

# Qwen3 Coder (MoE model)
ollama pull qwen3-coder:30b-a3b # Best quality, ~18GB Q4, 256K context
```

### Starting the Server

Start the FastAPI server using uvicorn:

**Linux/macOS:**
```bash
# Using uvicorn directly
uvicorn src.training.serve:app --host 0.0.0.0 --port 8000
@@ -114,19 +123,75 @@ uvicorn src.training.serve:app --host 0.0.0.0 --port 8000
python -m src.training.serve
```

**Windows (PowerShell):**
```powershell
uvicorn src.training.serve:app --host 0.0.0.0 --port 8000
```

The server will start on `http://localhost:8000` by default.

### Configuration

The server can be configured using environment variables:

- `OLLAMA_URL` - Ollama API endpoint (default: `http://localhost:11434/api/chat`)
- `OLLAMA_MODEL` - Model key or Ollama model name (default: `qwen2.5-coder-32b`)
- `REQUEST_TIMEOUT` - Request timeout in seconds (default: `120.0`)

Example:
**Linux/macOS:**
```bash
OLLAMA_MODEL=qwen3-coder-30b uvicorn src.training.serve:app --port 8000
```

**Windows (PowerShell):**
```powershell
$env:OLLAMA_MODEL="qwen3-coder-30b"; uvicorn src.training.serve:app --port 8000
```

**Windows (CMD):**
```cmd
set "OLLAMA_MODEL=qwen3-coder-30b" && uvicorn src.training.serve:app --port 8000
```
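On the server side, these variables might be read roughly as follows (a minimal sketch; the variable names match the documentation above, but the exact parsing in `serve.py` is an assumption):

```python
import os

# Defaults mirror the documented values; float() lets REQUEST_TIMEOUT
# accept values such as "90" or "90.5" from the environment.
OLLAMA_URL = os.environ.get("OLLAMA_URL", "http://localhost:11434/api/chat")
OLLAMA_MODEL = os.environ.get("OLLAMA_MODEL", "qwen2.5-coder-32b")
REQUEST_TIMEOUT = float(os.environ.get("REQUEST_TIMEOUT", "120.0"))
```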

### Available Models

| Model Key | Ollama Model | Architecture | Memory (Q4) | Context | Description |
|-----------|--------------|--------------|-------------|---------|-------------|
| `qwen2.5-coder-32b` | `qwen2.5-coder:32b` | Dense | ~18GB | 32K | Default, balanced quality/speed |
| `qwen2.5-coder-14b` | `qwen2.5-coder:14b` | Dense | ~8GB | 32K | Mid-size, good performance |
| `qwen2.5-coder-7b` | `qwen2.5-coder:7b` | Dense | ~4GB | 32K | Fast inference |
| `qwen3-coder-30b` | `qwen3-coder:30b-a3b` | MoE | ~18GB | 256K | Best quality, 3.3B active params |

Each model has optimized sampling parameters:
- **Qwen 2.5 Coder**: temperature=0.7, top_p=0.9, top_k=40
- **Qwen3 Coder**: temperature=1.0, top_p=0.95, top_k=40 (per official recommendations)
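A registry like `models.py` might look roughly like the sketch below (a hypothetical reconstruction, not the actual source; keys and sampling values mirror the table above, and the fallback for raw Ollama names matches the documented behavior):

```python
# Hypothetical model registry sketch; structure is an assumption.
MODEL_CONFIGS = {
    "qwen2.5-coder-32b": {
        "ollama_model": "qwen2.5-coder:32b",
        "context_window": 32768,
        "options": {"temperature": 0.7, "top_p": 0.9, "top_k": 40},
    },
    "qwen3-coder-30b": {
        "ollama_model": "qwen3-coder:30b-a3b",
        "context_window": 262144,
        "options": {"temperature": 1.0, "top_p": 0.95, "top_k": 40},
    },
}

def resolve_model(key_or_name: str) -> dict:
    """Return the config for a registry key, or pass a raw Ollama
    model name through unchanged with empty options."""
    if key_or_name in MODEL_CONFIGS:
        return MODEL_CONFIGS[key_or_name]
    return {"ollama_model": key_or_name, "options": {}}
```

Keeping sampling parameters next to the model name is what lets the server switch between Qwen 2.5 and Qwen3 defaults without per-request tuning.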

### Model Selection

You can select a model in two ways:

1. **Environment variable** (applies to all requests):
```bash
OLLAMA_MODEL=qwen3-coder-30b uvicorn src.training.serve:app
```

2. **Per-request** (via API):
```bash
curl -X POST http://localhost:8000/generate \
-H "Content-Type: application/json" \
-d '{"code": "def add(x, y): return x + y", "model": "qwen3-coder-30b"}'
```

### List Available Models

**Via CLI:**
```bash
python scripts/run_ollama.py --list-models
```

**Via API:**
```bash
curl http://localhost:8000/models
```

### API Endpoints
@@ -143,7 +208,9 @@ curl http://localhost:8000/health
```json
{
"status": "healthy",
"service": "ollama"
"service": "ollama",
"active_model": "Qwen 2.5 Coder 32B",
"ollama_model": "qwen2.5-coder:32b"
}
```

@@ -169,12 +236,14 @@ curl -X POST http://localhost:8000/generate \

**Request Body:**
- `code` (required): Python function code as a string
- `max_new_tokens` (optional): Maximum number of tokens to generate (uses model default if not specified)
- `model` (optional): Model key or Ollama model name to use for this request

**Response (200 OK):**
```json
{
"docstring": "\"\"\"Compute the sum of two numbers.\n\nParameters\n----------\nx : int\n First number.\ny : int\n Second number.\n\nReturns\n-------\nint\n Sum of x and y.\n\"\"\""
"docstring": "\"\"\"Compute the sum of two numbers.\n\nParameters\n----------\nx : int\n First number.\ny : int\n Second number.\n\nReturns\n-------\nint\n Sum of x and y.\"\"\"",
"model": "qwen2.5-coder:32b"
}
```

@@ -185,12 +254,57 @@ curl -X POST http://localhost:8000/generate \
}
```

#### List Models

Get available model configurations:

```bash
curl http://localhost:8000/models
```

**Response (200 OK):**
```json
{
"default": "qwen2.5-coder-32b",
"active": "qwen2.5-coder-32b",
"models": [
{
"key": "qwen2.5-coder-32b",
"name": "Qwen 2.5 Coder 32B",
"ollama_model": "qwen2.5-coder:32b",
"context_window": 32768,
"architecture": "dense",
"memory_q4": "~18GB",
"description": "Dense 32B model, good balance of quality and speed"
}
]
}
```

### CLI Tool

The CLI tool allows testing docstring generation directly:

```bash
# Use default model
python scripts/run_ollama.py --user "def add(x, y): return x + y"

# Use specific model by key
python scripts/run_ollama.py --model-key qwen3-coder-30b --user "def foo(): pass"

# Use raw Ollama model name
python scripts/run_ollama.py --model qwen2.5-coder:7b --user "def bar(): pass"

# List available models
python scripts/run_ollama.py --list-models
```
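The flag surface above can be sketched as an `argparse` setup (a hypothetical reconstruction of `scripts/run_ollama.py`, not its actual source; help strings are illustrative):

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # Flags mirror the documented CLI usage examples
    p = argparse.ArgumentParser(description="Generate docstrings via Ollama")
    p.add_argument("--user", help="Python function source to document")
    p.add_argument("--model-key", help="Registry key, e.g. qwen3-coder-30b")
    p.add_argument("--model", help="Raw Ollama model name, e.g. qwen2.5-coder:7b")
    p.add_argument("--list-models", action="store_true",
                   help="Print available model configurations and exit")
    return p
```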

### Testing

Run the test suite to verify the API endpoints and the model registry:

```bash
pytest tests/test_serve.py tests/test_models.py -v
```

## Dataset