This Harper component is a high-performance content optimization system that serves pre-rendered markdown versions of web page content to LLM bots.
- Component Architecture
- Content Delivery Flow
- Getting Started
- Deployment
- Environment Variables
- API Endpoints
- Database Schema
- Scheduler / Worker / Job Flow
- Performance Characteristics
Deployment Architecture:
This component requires cluster instances to be split into groups (such as regionA01 and regionA02) for optimal performance and job distribution. Instances are configured as follows:
- Delivery instances: Configured to receive incoming traffic via GTM. These instances handle API requests from clients and serve cached content directly. Configure them by setting the `DELIVERY` environment variable to a comma-separated list of instance hostnames (e.g., `DELIVERY=cust-env-regionA.harperdbcloud.com,cust-env-regionB.harperdbcloud.com,cust-env-regionC.harperdbcloud.com`).
- Worker instances: Designated as dedicated render worker nodes that process jobs from the job queue. These instances handle the resource-intensive operations of fetching, filtering, and converting HTML to markdown. Configure them by setting the `WORKERS` environment variable to a comma-separated list of instance hostnames (e.g., `WORKERS=cust-env-regionD.harperdbcloud.com,cust-env-regionE.harperdbcloud.com,cust-env-regionF.harperdbcloud.com`).
- Scheduler instance: One worker instance must be selected as the primary Scheduler by setting the `SCHEDULER` environment variable to that instance's hostname (e.g., `SCHEDULER=cust-env-regionD.harperdbcloud.com`). This instance coordinates job distribution across all workers and manages scheduled refresh operations.
Note: When configuring the GTM (Global Traffic Manager), only include the delivery instances to receive traffic. The worker and scheduler instances operate transparently as renderer nodes and should not be exposed to direct client traffic.
Content Processing Pipeline:
- HTML Fetching - Retrieves content from source URLs with optimized HTTP client configuration including DNS caching and connection pooling
- Content Filtering - Applies CSS selector-based rules to remove unwanted elements (ads, navigation, etc.) before processing
- Markdown Conversion - Transforms filtered HTML to markdown using `node-html-markdown` with strict timeout controls
- Compression & Storage - Gzip compresses content and stores as blobs with metadata in the PageContent table
Performance Optimizations:
- DNS Caching via `CacheableLookup` to avoid repeated DNS lookups under high load
- Connection Pooling using Undici Agent with configurable keep-alive and concurrency settings
- Timeout Protection with different limits for ad-hoc (on-demand) vs scheduled operations
- Graceful Degradation falls back to original HTML when markdown conversion fails or times out
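As an illustration of the DNS caching and connection pooling noted above, here is a minimal sketch of how such an outbound HTTP client could be wired up with `cacheable-lookup` and an Undici `Agent`; the specific option values are assumptions, not the component's actual settings.

```javascript
// Sketch only: a pooled HTTP client with cached DNS lookups.
// Option values below are illustrative, not the component's real configuration.
import { Agent, request } from 'undici';
import CacheableLookup from 'cacheable-lookup';

const cacheable = new CacheableLookup(); // caches DNS answers across requests

const agent = new Agent({
  connections: 128,          // sockets kept per origin (assumed value)
  keepAliveTimeout: 30_000,  // keep idle sockets open between fetches
  connect: { lookup: cacheable.lookup.bind(cacheable) }, // reuse cached DNS results
});

export async function fetchHtml(url) {
  const { statusCode, body } = await request(url, { dispatcher: agent });
  return { statusCode, html: await body.text() };
}
```

Reusing a single agent like this avoids per-request DNS resolution and connection setup when many pages are fetched from the same origin.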
Content Management:
- Smart Caching respects `nextRefresh` timestamps and refresh intervals
- Error Handling serves static markdown error pages when source content is unavailable
- Metadata Extraction captures page titles and publish times for enhanced markdown headers
- Background Refresh automatically updates stale content based on configured intervals
Cache Hit Path (Optimal):
- Request arrives for cached content within refresh window
- Content served directly from PageContent table with appropriate headers
- `lastAccessed` timestamp updated (batched every 60 seconds)
- Analytics recorded for cache hit
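To illustrate the batched `lastAccessed` updates mentioned above, here is a minimal sketch; the 60-second flush interval comes from the description, while the helper name and the persistence callback are hypothetical.

```javascript
// Sketch: accumulate lastAccessed updates in memory and flush them once per minute.
// persist() is a hypothetical callback that would bulk-write the batch to the PageContent table.
export function createLastAccessedBatcher(persist, flushMs = 60_000) {
  const pending = new Map(); // url -> most recent access time

  const timer = setInterval(async () => {
    if (pending.size === 0) return;
    const batch = [...pending.entries()].map(([url, lastAccessed]) => ({ url, lastAccessed }));
    pending.clear();
    await persist(batch); // one bulk write instead of one write per cache hit
  }, flushMs);
  timer.unref?.(); // don't keep the process alive just for the flush timer

  return {
    recordAccess(url) {
      pending.set(url, Date.now());
    },
  };
}
```

On every cache hit the request handler would call `recordAccess(url)` and return immediately, so the write cost is amortized across requests.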
Cache Miss/Stale Path:
- On-demand (ad-hoc) render handled by delivery node
- Content fetched from origin with optimized HTTP client
- HTML filtered using path-specific CSS selectors
- Filtered content converted to markdown with timeout protection
- Result compressed, stored as blob, and served to user
- Background job queued for future refresh based on default `refreshInterval`
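The fetch → filter → convert → compress steps above can be sketched roughly as follows. `node-html-markdown` is the converter named in this README, but `cheerio` is used here only as a stand-in for CSS-selector filtering (the component's actual filtering mechanism may differ), and the storage details are simplified.

```javascript
// Rough sketch of the render steps: filter the HTML, convert to markdown, gzip the result.
// cheerio is an assumed stand-in for the component's CSS-selector filtering.
import { load } from 'cheerio';
import { NodeHtmlMarkdown } from 'node-html-markdown';
import { gzipSync } from 'node:zlib';

export function renderPage(html, filters = ['header', 'footer', '.ads']) {
  // 1. Remove unwanted elements using the configured CSS selectors.
  const $ = load(html);
  for (const selector of filters) $(selector).remove();

  // 2. Convert the filtered HTML to markdown.
  const markdown = NodeHtmlMarkdown.translate($.html());

  // 3. Gzip-compress for blob storage and delivery.
  const pageContent = gzipSync(markdown);

  return {
    pageContent,
    contentType: 'text/markdown; charset=utf-8',
    contentLength: pageContent.byteLength,
  };
}
```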
Error Handling Path:
- Network or processing errors trigger static markdown error page generation
- Higher-priority background job queued for on-demand requests, blob failures, and retries
- Temporary cache headers prevent repeated failed requests
- Graceful degradation maintains service availability
```
git clone https://github.com/HarperFast/template-markdown-prerender.git
cd template-markdown-prerender
npm install
harperdb run .
```
This assumes you already have the Harper stack installed globally (see the HarperDB installation documentation).
Deploy the component using Harper's Operations API via the Harper CLI.
Configure the following environment variables in your .env file:
| Variable | Required | Description |
|---|---|---|
| `DELIVERY` | Yes | Comma-separated list of instance hostnames configured with GTM that serve API requests from clients and return cached content directly. Example: `cust-env-regionA.harperdbcloud.com,cust-env-regionB.harperdbcloud.com,cust-env-regionC.harperdbcloud.com` |
| `WORKERS` | Yes | Comma-separated list of instance hostnames that serve as render workers to process jobs from the queue. Example: `cust-env-regionD.harperdbcloud.com,cust-env-regionE.harperdbcloud.com,cust-env-regionF.harperdbcloud.com` |
| `SCHEDULER` | Yes | Hostname of the Scheduler instance, chosen from the worker list. This instance coordinates job distribution and manages scheduled refresh operations across the cluster. Example: `cust-env-regionD.harperdbcloud.com` |
| `REFRESH_INTERVAL` | No | Default refresh interval for pages in milliseconds. Used when a page does not specify its own refresh interval. Default: `86400000` (24 hours). Example: `86400000` |
| `EVICTION_PERIOD` | No | Number of days before cached pages and job queue records are automatically removed if not accessed. This prevents database bloat and removes stale content. Default: `30` days. Example: `30` |
| `ALLOW_STALE` | No | Whether to serve stale content while refreshing in the background (`true`) or force an on-demand refresh (`false`). Default: `true` |
Example .env file:
```
DELIVERY=cust-env-regionA.harperdbcloud.com,cust-env-regionB.harperdbcloud.com,cust-env-regionC.harperdbcloud.com
WORKERS=cust-env-regionD.harperdbcloud.com,cust-env-regionE.harperdbcloud.com,cust-env-regionF.harperdbcloud.com
SCHEDULER=cust-env-regionD.harperdbcloud.com
REFRESH_INTERVAL=86400000
EVICTION_PERIOD=30
ALLOW_STALE=true
```

| Endpoint | Description |
|---|---|
| `/page_content` | Fetches, caches, and serves optimized page content from the PageContent table. |
| `/page_filter` | Manages CSS selector-based filtering rules for content processing. |
| `/bulk_upload` | Provides bulk upload functionality for adding URLs to the processing queue. |
| `/cache_metrics` | Retrieves page content cache performance metrics from the analytics store. |
| `/PageContent` | Direct REST endpoint for the PageContent table (see below) |
| `/PageFilter` | Direct REST endpoint for the PageFilter table (see below) |
| `/Sitemap` | Direct REST endpoint for the Sitemap table (see below) |
| `/JobQueue` | Direct REST endpoint for the JobQueue table (see below) |
| `/WorkerStatus` | Direct REST endpoint for the WorkerStatus table (see below) |
The Harper REST API gives low-level control over your data. The first four endpoints are component-level and provide higher-level functionality. The last five endpoints are direct access to Harper's REST API. For a full description of what the REST API can do and how to use it, refer to its documentation.
This REST interface for the various tables can be used to manually manipulate the data. See the Schema below for details on the structure of each table.
Authentication: Role-based access control using `allowedReadRoles`
Methods:
- `GET` - Retrieve page content for a given URL
  - Uses cache when available and fresh (checks `nextRefresh` timestamp)
  - Falls back to source fetch on cache miss or stale content
  - Updates `lastAccessed` timestamps (batched every 60 seconds)
  - Handles blob errors by serving static markdown error pages and queuing refresh jobs
  - Returns content with appropriate headers (content-type, content-length, cache-control, etc.)
- `POST`, `PUT`, `PATCH`, `DELETE` - Return 405 Method Not Allowed with a markdown response
Features:
- Automatic error page generation on blob failures
- Background job queuing for failed content retrieval
- Analytics recording for cache hits/misses
- Header normalization and merging
- Passes through origin response status code or sets 504 Gateway Timeout on fetch timeout or 500 for unknown errors
URL Options: There are multiple options for supplying a URL with the GET request:
- Full URL as encoded URL
  ```
  GET /page_content/https%3A%2F%2Fwww.example.com
  ```
- Full URL as `path` query param (recommend using the `X-Query-String` header for the query param portion of the URL to avoid parsing errors)
  ```
  GET /page_content?path=https://www.example.com/some-path
  Headers
  X-Query-String: "?arg1=val"
  ```
- Full URL as `Path` header (can include query params)
  ```
  GET /page_content
  Headers
  Path: "https://www.example.com/some-path?arg1=val"
  ```

Response: Returns either markdown page content, a markdown error page (if fetching fails/times out), or HTML content (if filtering or conversion fails/times out)
Response Headers:
Content-Type:"text/markdown; charset=utf-8"or"text/html; charset=utf-8"Content-Encoding:"gzip"Content-Length: Byte length of compressed contentLast-Modified: UTC timestamp when page last fetched and cachedX-Markdown-Version: Version of node-html-markdown libraryServer-Timing: Fetch and process times for performance diagnostics (cache miss only)Retry-After: Included for fetch, filter, convert, or blob errors to trigger bot retry (300 seconds)Cache-Control: Included for success responses with max-age=refreshInterval(in seconds)
Example Response Headers:
```
Content-Type: text/markdown; charset=utf-8
Content-Encoding: gzip
Content-Length: 1234
Last-Modified: Wed, 03 Sep 2025 14:30:00 GMT
X-Markdown-Version: 1.1.0
Server-Timing: fetch-resolve;dur=150.25, process-resolve;dur=75.50
Cache-Control: public, max-age=86400
```

Methods:
- `GET /{pathname}` - Retrieve filters for a specific path
  - Always includes "default" filters
  - Combines default and path-specific selectors
  - Returns array of CSS selectors to remove from HTML
- `POST` - Create or update filter rules
  - Requires `path` and `filters` (comma-separated CSS selectors)
  - Example:
    ```
    { "path": "blog", "filters": "header, footer, .ads, #popup, figure img.hero" }
    ```
- `DELETE /{pathname}` - Remove filter rules for a specific path
- `PUT` and `PATCH` - Return 405 Method Not Allowed
Features:
- Path name normalization (lowercase, underscores to dashes)
- Default filters applied to all pages
- CSS selector validation
- Error handling with detailed messages
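As a usage illustration, a filter rule for pages under `/blog` could be created with a request like the following (placeholder hostname, auth omitted):

```javascript
// Sketch: create or update the filter rule used when rendering /blog pages.
const res = await fetch('https://cust-env-regionA.harperdbcloud.com/page_filter', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    path: 'blog',
    filters: 'header, footer, .ads, #popup, figure img.hero',
  }),
});
console.log(res.status); // errors come back with detailed messages
```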
Methods:
- `POST` - Add URLs to job queue with three supported input formats:
  - Single URL:
    ```
    { "url": "https://example.com/page", "refreshInterval": 3600000 }
    ```
  - Sitemap Processing:
    ```
    { "sitemap": "https://example.com/sitemap.xml", "isIndex": false, "refreshInterval": 3600000 }
    or
    { "sitemap": "https://example.com/sitemap_index.xml", "isIndex": true, "refreshInterval": 3600000 }
    ```
  - URL List:
    ```
    { "urlList": ["https://example.com/page1", "https://example.com/page2"], "refreshInterval": 3600000 }
    ```
Features:
- Batch processing with configurable batch size (default: 100 URLs)
- Sitemap and sitemap index parsing, with periodic refresh of sitemap URLs to ensure new pages are added to prerender schedule
- Job prioritization (batch uploads get medium low priority - after on-demand, blob failures, and retries)
- Partial success handling with detailed failure reporting for url and urlList methods
- Promise-based concurrent processing with error isolation
Response Format:
- `201` - Full success
- `207` - Partial success (some failures)
- `400` - Invalid input
- `500` - Processing error
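For example, a small batch of URLs could be queued like this (placeholder hostname, auth omitted); a `207` response indicates that some URLs failed and the body reports which ones.

```javascript
// Sketch: queue a list of URLs for prerendering with a 1-hour refresh interval.
const res = await fetch('https://cust-env-regionA.harperdbcloud.com/bulk_upload', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    urlList: ['https://example.com/page1', 'https://example.com/page2'],
    refreshInterval: 3600000,
  }),
});

if (res.status === 207) {
  console.warn('partial success:', await res.json()); // per-URL failure details
}
```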
Authentication: Inherited from Resource base class
Methods:
- `GET` - Retrieve cache metrics from the last 60 seconds
Metrics Returned:
- Timing Metrics (in milliseconds):
  - `page-fetch-time` - Time to fetch content from source
  - `page-filter-time` - Time to apply CSS filters
  - `page-convert-time` - Time to convert HTML to markdown
  - `page-process-time` - Total page processing time
  - `job-complete-time` - Total job processing time
  - Each includes: avg, p50, p90, p95 percentiles
- Cache Performance:
  - `page-cache-hit-rate` - Cache hit percentage
  - Based on `page-cache-hit` and `page-cache-miss` counts
Response Format:
```
{
"timing": [
{
"metric": "page-fetch-time",
"period": [start_timestamp, end_timestamp],
"unit": "ms",
"avg": "123.456",
"p50": "100.000",
"p90": "200.000",
"p95": "250.000"
}
],
"cacheHitRate": {
"metric": "page-cache-hit-rate",
"period": [start_timestamp, end_timestamp],
"unit": "percent",
"rate": 85.5
}
}
```

The system uses two databases to manage content delivery and processing:
PageContent - Stores static page content and metadata for LLM bot-optimized delivery
| Field | Type | Description |
|---|---|---|
| `url` | String (Primary Key) | Full URL of the page |
| `pageContent` | Blob | Either markdown or HTML content |
| `contentType` | String | "text/markdown; charset=utf-8" or "text/html; charset=utf-8" |
| `contentLength` | Int | Length of the content in bytes |
| `statusCode` | Int | HTTP status code of the page fetch |
| `headers` | Any | Object with header key:value pairs |
| `mdVersion` | String | node-html-markdown version for debugging |
| `serverTiming` | String (Optional) | Server timing values for headers |
| `errorMessage` | String (Optional) | Error message if applicable |
| `lastAccessed` | Date (Indexed) | Last time the page was accessed in epoch milliseconds |
| `lastRefreshed` | Date | Last time the page was refreshed in epoch milliseconds |
| `refreshInterval` | Long | Time between refreshes in milliseconds |
| `nextRefresh` | Date (Indexed) | Next scheduled refresh time in epoch milliseconds |
| `createdAt` | Date | When the record was created in epoch milliseconds |
PageFilter - Defines filtering rules for content processing
| Field | Type | Description |
|---|---|---|
| `path` | String (Primary Key) | Path portion of URL to match (e.g., "/blog") |
| `filters` | String | Comma-separated list of CSS selectors to remove from page before rendering |
| `updatedAt` | Date | When the record was last updated in epoch milliseconds |
| `createdAt` | Date | When the record was created in epoch milliseconds |
Notes:
- Filter rules apply to all URLs starting with the same `path`. Default filters use `path` = "default" and apply to all URLs in addition to any other matching path filters
- The `filters` string can include any valid CSS selectors (e.g., "header, footer, .ads, #popup, div#banner, figure img.hero")
Sitemap - Stores sitemaps and sitemap indexes that refresh on intervals to add new pages to PageContent
| Field | Type | Description |
|---|---|---|
| `url` | String (Primary Key) | Full URL of the sitemap |
| `isIndex` | Boolean (Indexed) | Whether sitemap is an index |
| `parentSitemap` | String (Indexed) | URL of parent sitemap index |
| `pageRefresh` | Long | Time between refreshes for sitemap pages in milliseconds |
| `lastRefreshed` | Date | Last time the sitemap was refreshed in epoch milliseconds |
| `refreshInterval` | Long | Time between sitemap refreshes in milliseconds |
| `nextRefresh` | Date (Indexed) | Next scheduled refresh time in epoch milliseconds |
| `createdAt` | Date | When the record was created in epoch milliseconds |
JobQueue - Queue of jobs for workers to process
| Field | Type | Description |
|---|---|---|
| `id` | ID (Primary Key) | Unique job ID |
| `url` | String | URL of page to process |
| `pageConfig` | Any | Page configuration options (object with any PageContent fields to pass to worker) |
| `status` | String (Indexed) | Job status (created, claimed, completed, failed) |
| `attempts` | Int | Number of processing attempts |
| `priority` | Int (Indexed) | Job priority (low value = high priority) |
| `errorMessage` | String (Optional) | Error message if the job failed |
| `assignedTo` | String (Indexed) | Worker hostname assigned to this job |
| `claimedAt` | Date | Time when worker claimed the job in epoch milliseconds |
| `completedAt` | Date | Time when job was completed in epoch milliseconds |
| `createdAt` | Date | Time when job was created in epoch milliseconds |
Note: `priority` options include:
- `0`: highest, for ad-hoc (on-demand) requests
- `1`: medium-high, for blob failures
- `2`: medium, for retries
- `3`: medium-low, for batch uploads
- `4`: lowest, for regular scheduled refresh
WorkerStatus - Status tracking for workers in the cluster
| Field | Type | Description |
|---|---|---|
| `id` | String (Primary Key) | Unique worker hostname |
| `status` | String (Indexed) | Current status (idle, working) |
| `isScheduler` | Boolean | Whether worker is designated as Scheduler |
The system uses a distributed job processing architecture with a Scheduler managing job distribution and workers executing page processing tasks.
Job Lifecycle:
- Created - New job added to queue or expired job released, awaiting assignment
- Claimed - Worker has claimed the job for processing
- Completed - Job finished successfully
- Failed - Job failed after retries or critical errors
Job Priorities (lower number = higher priority):
- `0` - Ad-hoc (on-demand) requests (highest priority)
- `1` - Blob failures (medium-high priority)
- `2` - Automatic retries (medium priority)
- `3` - Batch uploads (medium-low priority)
- `4` - Scheduled refreshes (lowest priority)
Job Management Features:
- Duplicate prevention: Only one pending job per URL allowed
- Configuration updates: Higher priority jobs can update existing job configs
- Retry logic: Jobs can be retried up to 3 times before permanent failure
- Expiration handling: Jobs that remain claimed for too long are automatically released
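To make the duplicate-prevention and configuration-update rules concrete, here is a toy sketch that uses an in-memory map in place of the JobQueue table; the helper and its shape are illustrative only, not the component's implementation.

```javascript
// Toy sketch: one pending job per URL; a higher-priority request (lower number)
// upgrades the existing job instead of creating a duplicate.
const pendingJobs = new Map(); // url -> job (stand-in for the JobQueue table)

export function enqueueJob(url, priority, pageConfig = {}) {
  const existing = pendingJobs.get(url);
  if (existing) {
    if (priority < existing.priority) {
      existing.priority = priority; // promote the pending job
      existing.pageConfig = { ...existing.pageConfig, ...pageConfig };
    }
    return existing; // duplicate prevented
  }
  const job = { url, priority, pageConfig, status: 'created', attempts: 0, createdAt: Date.now() };
  pendingJobs.set(url, job);
  return job;
}

enqueueJob('https://example.com/page', 4); // scheduled refresh
enqueueJob('https://example.com/page', 0); // later on-demand request upgrades the same job
```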
The Scheduler runs on a designated node (set by hostname in .env) and manages job distribution across the worker nodes:
Core Functions:
- Job Assignment: Assigns unassigned jobs to available workers
- Failure Recovery: Reassigns failed jobs for retry (tries to assign to a different worker than original attempt)
- Sitemap Refresh: Crawls sitemaps based on `nextRefresh` to create jobs for new pages
- Scheduled Refresh: Creates jobs for pages needing refresh based on `nextRefresh` timestamps
- Cleanup: Removes old job records and inactive page content based on eviction policies
Scheduling Intervals:
- Orphaned Jobs: 1 minute (for blob failures and bulk uploads)
- Failed Jobs: 10 minutes (matches worker expiration check)
- Unclaimed Jobs: 20 minutes (allows worker retry window)
- Scheduled Refresh: Minimum of 5 minutes or `defaultRefreshInterval`
- Cleanup Tasks: Minimum of 24 hours or `evictionPeriod`
Job Sources Handled:
- Orphaned jobs (no worker assigned)
- Failed jobs (marked for retry)
- Unclaimed jobs (created but not claimed within threshold)
- Scheduled refresh jobs (pages / sitemaps after their refresh interval)
Each worker node processes jobs assigned to it by the Scheduler:
Worker Lifecycle:
- Registration: Worker registers in the `WorkerStatus` table as "idle"
- Job Claiming: Periodically claims assigned jobs (up to set concurrency)
- Processing: Fetches and processes page content using `pageSource`
- Completion: Marks jobs as completed or failed
- Status Updates: Updates status between "idle" and "working"
Key Features:
- Concurrency: Processes jobs simultaneously, up to the configured concurrency limit
- Timeout Protection: Jobs expire after 5 minutes if not completed
- Retry Logic: Failed jobs are released for reassignment (up to 3 attempts)
- Error Handling: Detailed error messages preserved across retries
Processing Flow:
- Worker claims jobs (status: created → claimed)
- For each job:
- Fetch page content via pageSource
- Apply filters and convert to markdown
- Update PageContent table
- Mark job as completed/failed
- Release expired jobs back to queue
- Update worker status to idle
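A highly simplified version of this loop is sketched below. The `claimJobs`, `processPage`, and `markJob` functions are hypothetical stand-ins for the component's JobQueue and PageContent access, and the concurrency value is an assumption; only the claim → process → complete/release shape reflects the flow above.

```javascript
// Sketch of one worker tick: claim assigned jobs, process them, record the outcome.
// claimJobs/processPage/markJob are hypothetical stand-ins, not the component's API.
const CONCURRENCY = 4;   // assumed limit
const MAX_ATTEMPTS = 3;  // retries before permanent failure (per the flow above)

export async function workerTick({ claimJobs, processPage, markJob }) {
  const jobs = await claimJobs(CONCURRENCY); // status: created -> claimed

  await Promise.allSettled(
    jobs.map(async (job) => {
      try {
        await processPage(job.url, job.pageConfig); // fetch, filter, convert, store
        await markJob(job.id, { status: 'completed', completedAt: Date.now() });
      } catch (err) {
        const exhausted = job.attempts + 1 >= MAX_ATTEMPTS;
        await markJob(job.id, {
          status: exhausted ? 'failed' : 'created', // release for reassignment unless exhausted
          attempts: job.attempts + 1,
          errorMessage: String(err),                // error detail preserved across retries
        });
      }
    })
  );
}
```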
Worker Failures:
- Jobs expire after 5 minutes without completion
- Expired jobs are automatically reassigned by Scheduler
- Scheduler avoids reassignment to previously failed workers when possible
Job Failures:
- Jobs retry up to 3 times
- Error messages accumulate across retry attempts
- Persistent failures are marked as permanently failed
Data Consistency:
- Transaction support ensures atomic job state changes
- Duplicate job prevention via URL-based deduplication
- Background cleanup prevents database bloat
Monitoring:
- Worker status tracking (idle/working)
- Job attempt counting and error logging
- Analytics recording for performance metrics
- Automatic cleanup of old records based on eviction periods
| Operation | On-Demand Timeout | Scheduled Refresh Timeout | Fallback Behavior |
|---|---|---|---|
| Fetch | 3000ms | 10000ms | Static error page with 503 status |
| Filtering | 400ms | 1000ms | Return original HTML content |
| Conversion | 400ms | 1000ms | Return original HTML content |
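As an illustration of these budgets, the sketch below applies the on-demand fetch timeout and falls back to the original HTML when conversion fails; it is a simplification (the real component also bounds filter and convert times as shown in the table), and the values mirror the table rather than the component's code.

```javascript
// Sketch: bound the origin fetch with a hard timeout and degrade to the original
// HTML if markdown conversion fails. Values mirror the on-demand column above.
import { NodeHtmlMarkdown } from 'node-html-markdown';

const FETCH_TIMEOUT_MS = 3000; // on-demand fetch budget (10000 for scheduled refresh)

export async function renderOnDemand(url) {
  // A fetch that exceeds its budget is aborted; the caller would then serve the
  // static markdown error page with a Retry-After header.
  const res = await fetch(url, { signal: AbortSignal.timeout(FETCH_TIMEOUT_MS) });
  const html = await res.text();

  try {
    return { body: NodeHtmlMarkdown.translate(html), contentType: 'text/markdown; charset=utf-8' };
  } catch {
    return { body: html, contentType: 'text/html; charset=utf-8' }; // graceful degradation
  }
}
```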
The `Server-Timing` header includes:
- `fetch-resolve;dur=X.XX` - Time to fetch original content (milliseconds)
- `process-resolve;dur=Y.YY` - Time to filter and convert content (milliseconds)