Harper Markdown Prerender Component

Overview

This Harper component is a high-performance content optimization system that serves pre-rendered markdown versions of web page content to LLM bots.

Component Architecture

Deployment Architecture:

This component requires cluster instances to be split into dedicated roles (for example, across instances such as regionA01 and regionA02) for optimal performance and job distribution. Instances are configured as follows:

  • delivery instances: Configured to receive incoming traffic via GTM (Global Traffic Manager). These instances handle API requests from clients and serve cached content directly. This is configured by setting the DELIVERY environment variable to a comma-separated list of instances (e.g. DELIVERY=cust-env-regionA.harperdbcloud.com,cust-env-regionB.harperdbcloud.com,cust-env-regionC.harperdbcloud.com).

  • worker instances: Designated as dedicated render worker nodes that process jobs from the job queue. These instances handle the resource intensive operations of fetching, filtering, and converting HTML to markdown. This must be configured by setting the WORKERS environment variable with the instances in a comma-separated list (e.g. WORKERS=cust-env-regionD.harperdbcloud.com,cust-env-regionE.harperdbcloud.com,cust-env-regionF.harperdbcloud.com).

  • scheduler instance: One worker instance must be selected as the primary Scheduler by setting the SCHEDULER environment variable as the instance (e.g., SCHEDULER=cust-env-regionD.harperdbcloud.com). This instance coordinates job distribution across all workers and manages scheduled refresh operations.

Note: When configuring the GTM, only include the delivery instances to receive traffic. The worker and scheduler instances operate transparently as renderer nodes and should not be exposed to direct client traffic.

Content Processing Pipeline:

  1. HTML Fetching - Retrieves content from source URLs with optimized HTTP client configuration including DNS caching and connection pooling
  2. Content Filtering - Applies CSS selector-based rules to remove unwanted elements (ads, navigation, etc.) before processing
  3. Markdown Conversion - Transforms filtered HTML to markdown using node-html-markdown with strict timeout controls
  4. Compression & Storage - Gzip compresses content and stores as blobs with metadata in the PageContent table
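
A minimal sketch of these four stages, assuming cheerio for the selector-based filtering step (the component's actual implementation is not shown here, so treat the names below as illustrative):

import { gzipSync } from 'node:zlib';
import * as cheerio from 'cheerio';
import { NodeHtmlMarkdown } from 'node-html-markdown';

// Illustrative sketch of the four pipeline stages
async function renderPage(url, filters /* array of CSS selectors */) {
	// 1. HTML fetching
	const res = await fetch(url);
	const html = await res.text();

	// 2. Content filtering: drop unwanted elements before conversion
	const $ = cheerio.load(html);
	for (const selector of filters) $(selector).remove();

	// 3. Markdown conversion
	const markdown = NodeHtmlMarkdown.translate($.html());

	// 4. Compression: gzip the result for storage as a PageContent blob
	return gzipSync(Buffer.from(markdown, 'utf8'));
}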

Performance Optimizations:

  • DNS Caching via CacheableLookup to avoid repeated DNS lookups under high load
  • Connection Pooling using Undici Agent with configurable keep-alive and concurrency settings
  • Timeout Protection with different limits for ad-hoc (on-demand) vs scheduled operations
  • Graceful Degradation falls back to original HTML when markdown conversion fails or times out
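
A hedged sketch of the first three optimizations using cacheable-lookup and undici (the option values below are illustrative, not the component's defaults):

import CacheableLookup from 'cacheable-lookup';
import { Agent, request } from 'undici';

// DNS caching: reuse resolved addresses instead of querying the resolver per request
const dnsCache = new CacheableLookup();

// Connection pooling with keep-alive
const agent = new Agent({
	connections: 64, // max concurrent connections per origin
	keepAliveTimeout: 10_000, // how long idle sockets stay open (ms)
	connect: { lookup: dnsCache.lookup }, // route DNS resolution through the cache
});

// Timeout protection: stricter limit for an ad-hoc (on-demand) fetch
const { statusCode, body } = await request('https://www.example.com/', {
	dispatcher: agent,
	signal: AbortSignal.timeout(3000),
});
const html = await body.text();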

Content Management:

  • Smart Caching respects nextRefresh timestamps and refresh intervals
  • Error Handling serves static markdown error pages when source content is unavailable
  • Metadata Extraction captures page titles and publish times for enhanced markdown headers
  • Background Refresh automatically updates stale content based on configured intervals

Content Delivery Flow

Cache Hit Path (Optimal):

  1. Request arrives for cached content within refresh window
  2. Content served directly from PageContent table with appropriate headers
  3. lastAccessed timestamp updated (batched every 60 seconds)
  4. Analytics recorded for cache hit

Cache Miss/Stale Path:

  1. On-demand (ad-hoc) render handled by delivery node
  2. Content fetched from origin with optimized HTTP client
  3. HTML filtered using path-specific CSS selectors
  4. Filtered content converted to markdown with timeout protection
  5. Result compressed, stored as blob, and served to user
  6. Background job queued for future refresh based on default refreshInterval
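
Taken together, the freshness decision on a delivery node reduces to comparing nextRefresh with the current time. A simplified sketch (table access, header handling, and the ALLOW_STALE behavior are elided; the helper names are assumptions):

// Simplified delivery-node decision logic
async function getPage(url) {
	const record = await PageContent.get(url); // cached row, if any

	if (record && Date.now() < record.nextRefresh) {
		recordMetric('page-cache-hit');
		return servePage(record); // cache hit: serve the stored blob directly
	}

	// Cache miss or stale content: render on demand, then schedule a refresh
	recordMetric('page-cache-miss');
	const rendered = await renderAndStore(url); // fetch → filter → convert → compress
	await enqueueJob(url, 4, {}); // background refresh at scheduled-refresh priority
	return servePage(rendered);
}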

Error Handling Path:

  1. Network or processing errors trigger static markdown error page generation
  2. Higher-priority background job queued for the retry (on-demand, blob-failure, and retry jobs all outrank batch uploads and scheduled refreshes)
  3. Temporary cache headers prevent repeated failed requests
  4. Graceful degradation maintains service availability

Getting Started

  1. git clone https://github.com/HarperFast/template-markdown-prerender.git
  2. cd template-markdown-prerender
  3. npm install
  4. harperdb run .

This assumes you already have the Harper stack installed globally (see the Install HarperDB documentation).

Deployment

Deploy the component using Harper's Operations API via the Harper CLI.
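
For reference, the Operations API call that such a deploy issues looks roughly like the following (a hedged sketch; operation fields vary across Harper versions, so consult the Operations API documentation for the exact shape):

POST https://your-instance.harperdbcloud.com:9925
Content-Type: application/json

{
	"operation": "deploy_component",
	"project": "template-markdown-prerender",
	"package": "HarperFast/template-markdown-prerender"
}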

Environment Variables

Configure the following environment variables in your .env file:

| Variable | Required | Description |
| --- | --- | --- |
| DELIVERY | Yes | Comma-separated list of instance hostnames configured with GTM that serve API requests from clients and return cached content directly. Example: cust-env-regionA.harperdbcloud.com,cust-env-regionB.harperdbcloud.com,cust-env-regionC.harperdbcloud.com |
| WORKERS | Yes | Comma-separated list of instance hostnames that serve as render workers and process jobs from the queue. Example: cust-env-regionD.harperdbcloud.com,cust-env-regionE.harperdbcloud.com,cust-env-regionF.harperdbcloud.com |
| SCHEDULER | Yes | Hostname of the Scheduler instance, chosen from the list of worker instances. This instance coordinates job distribution and manages scheduled refresh operations across the cluster. Example: cust-env-regionD.harperdbcloud.com |
| REFRESH_INTERVAL | No | Default refresh interval for pages, in milliseconds. Used when a page does not specify its own refresh interval. Default: 86400000 (24 hours). |
| EVICTION_PERIOD | No | Number of days before cached pages and job queue records are automatically removed if not accessed. This prevents database bloat and removes stale content. Default: 30. |
| ALLOW_STALE | No | Whether to serve stale content while refreshing in the background (true) or force an on-demand refresh (false). Default: true. |

Example .env file:

DELIVERY=cust-env-regionA.harperdbcloud.com,cust-env-regionB.harperdbcloud.com,cust-env-regionC.harperdbcloud.com
WORKERS=cust-env-regionD.harperdbcloud.com,cust-env-regionE.harperdbcloud.com,cust-env-regionF.harperdbcloud.com
SCHEDULER=cust-env-regionD.harperdbcloud.com
REFRESH_INTERVAL=86400000
EVICTION_PERIOD=30
ALLOW_STALE=true

API Endpoints

| Endpoint | Description |
| --- | --- |
| /page_content | Fetches, caches, and serves optimized page content from the PageContent table. |
| /page_filter | Manages CSS selector-based filtering rules for content processing. |
| /bulk_upload | Provides bulk upload functionality for adding URLs to the processing queue. |
| /cache_metrics | Retrieves page content cache performance metrics from the analytics store. |
| /PageContent | Direct REST endpoint for the PageContent table (see below) |
| /PageFilter | Direct REST endpoint for the PageFilter table (see below) |
| /Sitemap | Direct REST endpoint for the Sitemap table (see below) |
| /JobQueue | Direct REST endpoint for the JobQueue table (see below) |
| /WorkerStatus | Direct REST endpoint for the WorkerStatus table (see below) |

The Harper REST API gives low-level control over your data. The first four endpoints are component-level and provide higher-level functionality. The last five endpoints are direct access to Harper's REST API. For a full description of what the REST API can do and how to use it, refer to its documentation.

This REST interface for the various tables can be used to manually manipulate the data. See the Schema below for details on the structure of each table.
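
For example, a hedged sketch of direct reads against the PageContent table (the exact query syntax is defined by the Harper REST API reference; the URL below is a placeholder):

GET /PageContent/https%3A%2F%2Fwww.example.com%2Fsome-path

returns the record whose primary key is that URL, while a query string such as

GET /PageContent/?statusCode=200

filters records by attribute.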

/page_content - Page Content Retrieval

Authentication: Role-based access control using allowedReadRoles

Methods:

  • GET - Retrieve page content for a given URL
    • Uses cache when available and fresh (checks nextRefresh timestamp)
    • Falls back to source fetch on cache miss or stale content
    • Updates lastAccessed timestamps (batched every 60 seconds)
    • Handles blob errors by serving static markdown error pages and queuing refresh jobs
    • Returns content with appropriate headers (content-type, content-length, cache-control, etc.)
  • POST, PUT, PATCH, DELETE - Returns 405 Method Not Allowed with markdown response

Features:

  • Automatic error page generation on blob failures
  • Background job queuing for failed content retrieval
  • Analytics recording for cache hits/misses
  • Header normalization and merging
  • Passes through the origin response status code, or sets 504 Gateway Timeout on fetch timeouts and 500 for unknown errors

GET Request Example

URL Options: There are multiple options for supplying a URL with the GET request:

  1. Full URL as an encoded path segment:

GET /page_content/https%3A%2F%2Fwww.example.com

  2. Full URL as a path query param (recommend using the X-Query-String header for the query-param portion of the URL to avoid parsing errors):

GET /page_content?path=https://www.example.com/some-path

Headers
X-Query-String: "?arg1=val"

  3. Full URL as a Path header (can include query params):

GET /page_content

Headers
Path: "https://www.example.com/some-path?arg1=val"

Response: Returns either markdown page content, markdown error page (if fetching fails/times out), or HTML content (if filtering or conversion fails/times out)

Response Headers:

  • Content-Type: "text/markdown; charset=utf-8" or "text/html; charset=utf-8"
  • Content-Encoding: "gzip"
  • Content-Length: Byte length of compressed content
  • Last-Modified: UTC timestamp of when the page was last fetched and cached
  • X-Markdown-Version: Version of node-html-markdown library
  • Server-Timing: Fetch and process times for performance diagnostics (cache miss only)
  • Retry-After: Included for fetch, filter, convert, or blob errors to trigger bot retry (300 seconds)
  • Cache-Control: Included for success responses with max-age=refreshInterval (in seconds)

Example Response Headers:

Content-Type: text/markdown; charset=utf-8
Content-Encoding: gzip
Content-Length: 1234
Last-Modified: Wed, 03 Sep 2025 14:30:00 GMT
X-Markdown-Version: 1.1.0
Server-Timing: fetch-resolve;dur=150.25, process-resolve;dur=75.50
Cache-Control: public, max-age=86400

/page_filter - Content Filtering Rules

Methods:

  • GET /{pathname} - Retrieve filters for a specific path
    • Always includes "default" filters
    • Combines default and path-specific selectors
    • Returns array of CSS selectors to remove from HTML
  • POST - Create or update filter rules
    • Requires path and filters (comma-separated CSS selectors)
    • Example:
{ "path": "blog", "filters": "header, footer, .ads, #popup, figure img.hero" }
  • DELETE /{pathname} - Remove filter rules for a specific path
  • PUT and PATCH - Return 405 Method Not Allowed
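
For illustration, given the POST example above plus hypothetical default filters of nav and .sidebar, a GET for that path would return the combined selector list (the response shape here is a hedged assumption):

GET /page_filter/blog

["nav", ".sidebar", "header", "footer", ".ads", "#popup", "figure img.hero"]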

Features:

  • Path name normalization (lowercase, underscores to dashes)
  • Default filters applied to all pages
  • CSS selector validation
  • Error handling with detailed messages

/bulk_upload - Batch URL Processing

Methods:

  • POST - Add URLs to job queue with three supported input formats:

    Single URL:

    {
    	"url": "https://example.com/page",
    	"refreshInterval": 3600000
    }

    Sitemap Processing:

    {
    	"sitemap": "https://example.com/sitemap.xml",
    	"isIndex": false,
    	"refreshInterval": 3600000
    }
    
    or
    
    {
    	"sitemap": "https://example.com/sitemap_index.xml",
    	"isIndex": true,
    	"refreshInterval": 3600000
    }

    URL List:

    {
    	"urlList": ["https://example.com/page1", "https://example.com/page2"],
    	"refreshInterval": 3600000
    }

Features:

  • Batch processing with configurable batch size (default: 100 URLs)
  • Sitemap and sitemap index parsing, with periodic refresh of sitemap URLs to ensure new pages are added to prerender schedule
  • Job prioritization (batch uploads get medium-low priority, after on-demand requests, blob failures, and retries)
  • Partial success handling with detailed failure reporting for url and urlList methods
  • Promise-based concurrent processing with error isolation

Response Format:

  • 201 - Full success
  • 207 - Partial success (some failures)
  • 400 - Invalid input
  • 500 - Processing error

/cache_metrics - Performance Analytics

Authentication: Inherited from Resource base class

Methods:

  • GET - Retrieve cache metrics from the last 60 seconds

Metrics Returned:

  • Timing Metrics (in milliseconds):

    • page-fetch-time - Time to fetch content from source
    • page-filter-time - Time to apply CSS filters
    • page-convert-time - Time to convert HTML to markdown
    • page-process-time - Total page processing time
    • job-complete-time - Total job processing time
    • Each includes: avg, p50, p90, p95 percentiles
  • Cache Performance:

    • page-cache-hit-rate - Cache hit percentage
    • Based on page-cache-hit and page-cache-miss counts

Response Format:

{
  "timing": [
    {
      "metric": "page-fetch-time",
      "period": [start_timestamp, end_timestamp],
      "unit": "ms",
      "avg": "123.456",
      "p50": "100.000",
      "p90": "200.000",
      "p95": "250.000"
    }
  ],
  "cacheHitRate": {
    "metric": "page-cache-hit-rate",
    "period": [start_timestamp, end_timestamp],
    "unit": "percent",
    "rate": 85.5
  }
}

Database Schema

The system uses two databases to manage content delivery and processing:

Markdown Database

PageContent - Stores static page content and metadata for LLM bot-optimized delivery

| Field | Type | Description |
| --- | --- | --- |
| url | String (Primary Key) | Full URL of the page |
| pageContent | Blob | Either markdown or HTML content |
| contentType | String | "text/markdown; charset=utf-8" or "text/html; charset=utf-8" |
| contentLength | Int | Length of the content in bytes |
| statusCode | Int | HTTP status code of the page fetch |
| headers | Any | Object with header key:value pairs |
| mdVersion | String | node-html-markdown version for debugging |
| serverTiming | String (Optional) | Server timing values for headers |
| errorMessage | String (Optional) | Error message if applicable |
| lastAccessed | Date (Indexed) | Last time the page was accessed, in epoch milliseconds |
| lastRefreshed | Date | Last time the page was refreshed, in epoch milliseconds |
| refreshInterval | Long | Time between refreshes, in milliseconds |
| nextRefresh | Date (Indexed) | Next scheduled refresh time, in epoch milliseconds |
| createdAt | Date | When the record was created, in epoch milliseconds |

PageFilter - Defines filtering rules for content processing

| Field | Type | Description |
| --- | --- | --- |
| path | String (Primary Key) | Path portion of URL to match (e.g., "/blog") |
| filters | String | Comma-separated list of CSS selectors to remove from the page before rendering |
| updatedAt | Date | When the record was last updated, in epoch milliseconds |
| createdAt | Date | When the record was created, in epoch milliseconds |

Notes:

  1. Filter rules apply to all URLs starting with the same path. Default filters use path = "default" and apply to all URLs in addition to any other matching path filters
  2. The filter string can include any valid CSS selectors (e.g., "header, footer, .ads, #popup, div#banner, figure img.hero")

Sitemap - Stores sitemaps and sitemap indexes that refresh on intervals to add new pages to PageContent

| Field | Type | Description |
| --- | --- | --- |
| url | String (Primary Key) | Full URL of the sitemap |
| isIndex | Boolean (Indexed) | Whether the sitemap is an index |
| parentSitemap | String (Indexed) | URL of the parent sitemap index |
| pageRefresh | Long | Time between refreshes for the sitemap's pages, in milliseconds |
| lastRefreshed | Date | Last time the sitemap was refreshed, in epoch milliseconds |
| refreshInterval | Long | Time between sitemap refreshes, in milliseconds |
| nextRefresh | Date (Indexed) | Next scheduled refresh time, in epoch milliseconds |
| createdAt | Date | When the record was created, in epoch milliseconds |

Workers Database

JobQueue - Queue of jobs for workers to process

| Field | Type | Description |
| --- | --- | --- |
| id | ID (Primary Key) | Unique job ID |
| url | String | URL of the page to process |
| pageConfig | Any | Page configuration options (object with any PageContent fields to pass to the worker) |
| status | String (Indexed) | Job status (created, claimed, completed, failed) |
| attempts | Int | Number of processing attempts |
| priority | Int (Indexed) | Job priority (low value = high priority) |
| errorMessage | String (Optional) | Error message if the job failed |
| assignedTo | String (Indexed) | Worker hostname assigned to this job |
| claimedAt | Date | Time when the worker claimed the job, in epoch milliseconds |
| completedAt | Date | Time when the job was completed, in epoch milliseconds |
| createdAt | Date | Time when the job was created, in epoch milliseconds |

Note:

  1. priority options include:
  • 0: highest for ad-hoc (on-demand) requests
  • 1: medium high for blob failures
  • 2: medium for retries
  • 3: medium low for batch uploads
  • 4: lowest for regular scheduled refresh

WorkerStatus - Status tracking for workers in the cluster

| Field | Type | Description |
| --- | --- | --- |
| id | String (Primary Key) | Unique worker hostname |
| status | String (Indexed) | Current status (idle, working) |
| isScheduler | Boolean | Whether the worker is designated as Scheduler |

Scheduler / Worker / Job Flow

The system uses a distributed job processing architecture, with a Scheduler managing job distribution and workers executing page processing tasks.

Job Queue System

Job Lifecycle:

  1. Created - New job added to queue or expired job released, awaiting assignment
  2. Claimed - Worker has claimed the job for processing
  3. Completed - Job finished successfully
  4. Failed - Job failed after retries or critical errors

Job Priorities (lower number = higher priority):

  • 0 - Ad-Hoc (On-demand) requests (highest priority)
  • 1 - Blob failures (medium-high priority)
  • 2 - Automatic retries (medium priority)
  • 3 - Batch uploads (medium-low priority)
  • 4 - Scheduled refreshes (lowest priority)

Job Management Features:

  • Duplicate prevention: Only one pending job per URL allowed
  • Configuration updates: Higher priority jobs can update existing job configs
  • Retry logic: Jobs can be retried up to 3 times before permanent failure
  • Expiration handling: Jobs claimed too long are automatically released
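
A sketch of the first two rules, duplicate prevention and priority-based config updates (the table helpers are illustrative; lower numbers mean higher priority):

// Enqueue a job, enforcing one pending job per URL
async function enqueueJob(url, priority, pageConfig) {
	const pending = await findPendingJob(url); // existing job with status "created" or "claimed"

	if (!pending) {
		return JobQueue.create({ url, priority, pageConfig, status: 'created', attempts: 0 });
	}

	// Duplicate prevention: keep the existing job, but let a higher-priority
	// request upgrade its priority and page configuration
	if (priority < pending.priority) {
		await JobQueue.update(pending.id, { priority, pageConfig });
	}
	return pending;
}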

Scheduler Responsibilities

The Scheduler runs on a designated node (set via the SCHEDULER hostname in .env) and manages job distribution across the worker nodes:

Core Functions:

  • Job Assignment: Assigns unassigned jobs to available workers
  • Failure Recovery: Reassigns failed jobs for retry (tries to assign to a different worker than original attempt)
  • Sitemap Refresh: Crawls sitemaps based on nextRefresh to create jobs for new pages
  • Scheduled Refresh: Creates jobs for pages needing refresh based on nextRefresh timestamps
  • Cleanup: Removes old job records and inactive page content based on eviction policies

Scheduling Intervals:

  • Orphaned Jobs: 1 minute (for blob failures and bulk uploads)
  • Failed Jobs: 10 minutes (matches worker expiration check)
  • Unclaimed Jobs: 20 minutes (allows worker retry window)
  • Scheduled Refresh: Minimum of 5 minutes or defaultRefreshInterval
  • Cleanup Tasks: Minimum of 24 hours or evictionPeriod

Job Sources Handled:

  • Orphaned jobs (no worker assigned)
  • Failed jobs (marked for retry)
  • Unclaimed jobs (created but not claimed within threshold)
  • Scheduled refresh jobs (pages / sitemaps after their refresh interval)
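
A condensed sketch of one Scheduler pass over those sources (function names and query helpers are illustrative):

// One scheduling pass: assign work, recover failures, schedule refreshes
async function schedulerTick() {
	const idleWorkers = await WorkerStatus.find({ status: 'idle' });

	// Assign orphaned and unclaimed jobs, highest priority (lowest number) first
	const unassigned = await JobQueue.find({ status: 'created', assignedTo: null });
	unassigned.sort((a, b) => a.priority - b.priority);
	for (const job of unassigned) {
		const worker = pickWorker(idleWorkers, job); // avoid the worker that last failed this job
		if (worker) await JobQueue.update(job.id, { assignedTo: worker.id });
	}

	// Create refresh jobs for pages (and sitemaps) whose nextRefresh has passed
	const due = await PageContent.find({ nextRefresh: { lessThan: Date.now() } });
	for (const page of due) {
		await enqueueJob(page.url, 4, {}); // lowest priority: scheduled refresh
	}
}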

Worker Operations

Each worker node processes jobs assigned to it by the Scheduler:

Worker Lifecycle:

  1. Registration: Worker registers in WorkerStatus table as "idle"
  2. Job Claiming: Periodically claims assigned jobs (up to set concurrency)
  3. Processing: Fetches and processes page content using pageSource
  4. Completion: Marks jobs as completed or failed
  5. Status Updates: Updates status between "idle" and "working"

Key Features:

  • Concurrency: Processes up to concurrency limit of jobs simultaneously
  • Timeout Protection: Jobs expire after 5 minutes if not completed
  • Retry Logic: Failed jobs are released for reassignment (up to 3 attempts)
  • Error Handling: Detailed error messages preserved across retries

Processing Flow:

  1. Worker claims jobs (status: created → claimed)
  2. For each job:
    • Fetch page content via pageSource
    • Apply filters and convert to markdown
    • Update PageContent table
    • Mark job as completed/failed
  3. Release expired jobs back to queue
  4. Update worker status to idle
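
A simplified sketch of that loop (names are illustrative; the 3-attempt limit mirrors the retry logic described above):

// One worker pass: claim assigned jobs, process them, report status
async function workerTick(hostname, concurrency) {
	const assigned = await JobQueue.find({ assignedTo: hostname, status: 'created' });
	const batch = assigned.slice(0, concurrency);
	if (batch.length === 0) return;

	await WorkerStatus.update(hostname, { status: 'working' });
	await Promise.all(
		batch.map(async (job) => {
			await JobQueue.update(job.id, { status: 'claimed', claimedAt: Date.now() });
			try {
				await renderAndStore(job.url, job.pageConfig); // fetch → filter → convert → store
				await JobQueue.update(job.id, { status: 'completed', completedAt: Date.now() });
			} catch (err) {
				// Release for retry (up to 3 attempts), preserving the error message
				const attempts = job.attempts + 1;
				const status = attempts >= 3 ? 'failed' : 'created';
				await JobQueue.update(job.id, { status, attempts, errorMessage: String(err) });
			}
		})
	);
	await WorkerStatus.update(hostname, { status: 'idle' });
}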

Fault Tolerance

Worker Failures:

  • Jobs expire after 5 minutes without completion
  • Expired jobs are automatically reassigned by Scheduler
  • Scheduler avoids reassignment to previously failed workers when possible

Job Failures:

  • Jobs retry up to 3 times
  • Error messages accumulate across retry attempts
  • Persistent failures are marked as permanently failed

Data Consistency:

  • Transaction support ensures atomic job state changes
  • Duplicate job prevention via URL-based deduplication
  • Background cleanup prevents database bloat

Monitoring:

  • Worker status tracking (idle/working)
  • Job attempt counting and error logging
  • Analytics recording for performance metrics
  • Automatic cleanup of old records based on eviction periods

Performance Characteristics

Timeouts

| Operation | On-Demand Timeout | Scheduled Refresh Timeout | Fallback Behavior |
| --- | --- | --- | --- |
| Fetch | 3000 ms | 10000 ms | Static error page with 503 status |
| Filtering | 400 ms | 1000 ms | Return original HTML content |
| Conversion | 400 ms | 1000 ms | Return original HTML content |
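
The filter and convert fallbacks amount to racing each stage against a timer and keeping the original HTML when the timer wins. A sketch under those assumptions (helper names are illustrative):

// Run a stage with a hard time limit; fall back to the original HTML on timeout or error
async function withFallback(stagePromise, ms, fallbackHtml) {
	const timer = new Promise((_, reject) =>
		setTimeout(() => reject(new Error('stage timed out')), ms)
	);
	try {
		return await Promise.race([stagePromise, timer]);
	} catch {
		return fallbackHtml; // graceful degradation: serve HTML instead of markdown
	}
}

// Ad-hoc request: 400 ms budget for conversion (1000 ms for scheduled refreshes)
const output = await withFallback(convertToMarkdown(html), 400, html);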

Server Timing Metrics

The Server-Timing header includes:

  • fetch-resolve;dur=X.XX - Time to fetch original content (milliseconds)
  • process-resolve;dur=Y.YY - Time to filter and convert content (milliseconds)
