Harper Markdown Prerender Component

Overview

This Harper component is a high-performance content optimization system that serves pre-rendered markdown versions of web page content to LLM bots.

Component Architecture

Deployment Architecture:

This component requires cluster instances to be split into dedicated roles (for example, across instances such as regionA01 and regionA02) for optimal performance and job distribution. Instances are configured as follows:

  • delivery instances: Configured to receive incoming traffic via GTM (Global Traffic Manager). These instances handle API requests from clients and serve cached content directly. This is configured by setting the DELIVERY environment variable to a comma-separated list of instances (e.g. DELIVERY=cust-env-regionA.harperdbcloud.com,cust-env-regionB.harperdbcloud.com,cust-env-regionC.harperdbcloud.com).

  • worker instances: Designated as dedicated render worker nodes that process jobs from the job queue. These instances handle the resource intensive operations of fetching, filtering, and converting HTML to markdown. This must be configured by setting the WORKERS environment variable with the instances in a comma-separated list (e.g. WORKERS=cust-env-regionD.harperdbcloud.com,cust-env-regionE.harperdbcloud.com,cust-env-regionF.harperdbcloud.com).

  • scheduler instance: One worker instance must be selected as the primary Scheduler by setting the SCHEDULER environment variable as the instance (e.g., SCHEDULER=cust-env-regionD.harperdbcloud.com). This instance coordinates job distribution across all workers and manages scheduled refresh operations.

Note: When configuring the GTM, only include the delivery instances to receive traffic. The worker and scheduler instances operate transparently as renderer nodes and should not be exposed to direct client traffic.

Content Processing Pipeline:

  1. HTML Fetching - Retrieves content from source URLs with optimized HTTP client configuration including DNS caching and connection pooling
  2. Content Filtering - Applies CSS selector-based rules to remove unwanted elements (ads, navigation, etc.) before processing
  3. Markdown Conversion - Transforms filtered HTML to markdown using node-html-markdown with strict timeout controls
  4. Compression & Storage - Gzip compresses content and stores as blobs with metadata in the PageContent table
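
A minimal sketch of these four stages, assuming cheerio for the selector-based filtering step (the component's actual implementation is not shown here, so treat the names below as illustrative):

import { gzipSync } from 'node:zlib';
import * as cheerio from 'cheerio';
import { NodeHtmlMarkdown } from 'node-html-markdown';

// Illustrative sketch of the four pipeline stages
async function renderPage(url, filters /* array of CSS selectors */) {
	// 1. HTML fetching
	const res = await fetch(url);
	const html = await res.text();

	// 2. Content filtering: drop unwanted elements before conversion
	const $ = cheerio.load(html);
	for (const selector of filters) $(selector).remove();

	// 3. Markdown conversion
	const markdown = NodeHtmlMarkdown.translate($.html());

	// 4. Compression: gzip the result for storage as a PageContent blob
	return gzipSync(Buffer.from(markdown, 'utf8'));
}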

Performance Optimizations:

  • DNS Caching via CacheableLookup to avoid repeated DNS lookups under high load
  • Connection Pooling using Undici Agent with configurable keep-alive and concurrency settings
  • Timeout Protection with different limits for ad-hoc (on-demand) vs scheduled operations
  • Graceful Degradation falls back to original HTML when markdown conversion fails or times out
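
A hedged sketch of the first three optimizations using cacheable-lookup and undici (the option values below are illustrative, not the component's defaults):

import CacheableLookup from 'cacheable-lookup';
import { Agent, request } from 'undici';

// DNS caching: reuse resolved addresses instead of querying the resolver per request
const dnsCache = new CacheableLookup();

// Connection pooling with keep-alive
const agent = new Agent({
	connections: 64, // max concurrent connections per origin
	keepAliveTimeout: 10_000, // how long idle sockets stay open (ms)
	connect: { lookup: dnsCache.lookup }, // route DNS resolution through the cache
});

// Timeout protection: stricter limit for an ad-hoc (on-demand) fetch
const { statusCode, body } = await request('https://www.example.com/', {
	dispatcher: agent,
	signal: AbortSignal.timeout(3000),
});
const html = await body.text();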

Content Management:

  • Smart Caching respects nextRefresh timestamps and refresh intervals
  • Error Handling serves static markdown error pages when source content is unavailable
  • Metadata Extraction captures page titles and publish times for enhanced markdown headers
  • Background Refresh automatically updates stale content based on configured intervals

Content Delivery Flow

Cache Hit Path (Optimal):

  1. Request arrives for cached content within refresh window
  2. Content served directly from PageContent table with appropriate headers
  3. lastAccessed timestamp updated (batched every 60 seconds)
  4. Analytics recorded for cache hit

Cache Miss/Stale Path:

  1. On-demand (ad-hoc) render handled by delivery node
  2. Content fetched from origin with optimized HTTP client
  3. HTML filtered using path-specific CSS selectors
  4. Filtered content converted to markdown with timeout protection
  5. Result compressed, stored as blob, and served to user
  6. Background job queued for future refresh based on default refreshInterval
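
Taken together, the freshness decision on a delivery node reduces to comparing nextRefresh with the current time. A simplified sketch (table access, header handling, and the ALLOW_STALE behavior are elided; the helper names are assumptions):

// Simplified delivery-node decision logic
async function getPage(url) {
	const record = await PageContent.get(url); // cached row, if any

	if (record && Date.now() < record.nextRefresh) {
		recordMetric('page-cache-hit');
		return servePage(record); // cache hit: serve the stored blob directly
	}

	// Cache miss or stale content: render on demand, then schedule a refresh
	recordMetric('page-cache-miss');
	const rendered = await renderAndStore(url); // fetch → filter → convert → compress
	await enqueueJob(url, 4, {}); // background refresh at scheduled-refresh priority
	return servePage(rendered);
}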

Error Handling Path:

  1. Network or processing errors trigger static markdown error page generation
  2. Higher-priority background job queued for the retry (on-demand, blob-failure, and retry jobs all outrank batch uploads and scheduled refreshes)
  3. Temporary cache headers prevent repeated failed requests
  4. Graceful degradation maintains service availability

Getting Started

  1. git clone https://github.com/HarperFast/template-markdown-prerender.git
  2. cd template-markdown-prerender
  3. npm install
  4. harperdb run .

This assumes you already have the Harper stack installed globally (see the Install HarperDB documentation).

Deployment

Deploy the component using Harper's Operations API via the Harper CLI.
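
For reference, the Operations API call that such a deploy issues looks roughly like the following (a hedged sketch; operation fields vary across Harper versions, so consult the Operations API documentation for the exact shape):

POST https://your-instance.harperdbcloud.com:9925
Content-Type: application/json

{
	"operation": "deploy_component",
	"project": "template-markdown-prerender",
	"package": "HarperFast/template-markdown-prerender"
}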

Environment Variables

Configure the following environment variables in your .env file:

| Variable | Required | Description |
| --- | --- | --- |
| DELIVERY | Yes | Comma-separated list of instance hostnames configured with GTM that serve API requests from clients and return cached content directly. Example: cust-env-regionA.harperdbcloud.com,cust-env-regionB.harperdbcloud.com,cust-env-regionC.harperdbcloud.com |
| WORKERS | Yes | Comma-separated list of instance hostnames that serve as render workers and process jobs from the queue. Example: cust-env-regionD.harperdbcloud.com,cust-env-regionE.harperdbcloud.com,cust-env-regionF.harperdbcloud.com |
| SCHEDULER | Yes | Hostname of the Scheduler instance, chosen from the list of worker instances. This instance coordinates job distribution and manages scheduled refresh operations across the cluster. Example: cust-env-regionD.harperdbcloud.com |
| REFRESH_INTERVAL | No | Default refresh interval for pages, in milliseconds. Used when a page does not specify its own refresh interval. Default: 86400000 (24 hours). |
| EVICTION_PERIOD | No | Number of days before cached pages and job queue records are automatically removed if not accessed. This prevents database bloat and removes stale content. Default: 30. |
| ALLOW_STALE | No | Whether to serve stale content while refreshing in the background (true) or force an on-demand refresh (false). Default: true. |

Example .env file:

DELIVERY=cust-env-regionA.harperdbcloud.com,cust-env-regionB.harperdbcloud.com,cust-env-regionC.harperdbcloud.com
WORKERS=cust-env-regionD.harperdbcloud.com,cust-env-regionE.harperdbcloud.com,cust-env-regionF.harperdbcloud.com
SCHEDULER=cust-env-regionD.harperdbcloud.com
REFRESH_INTERVAL=86400000
EVICTION_PERIOD=30
ALLOW_STALE=true

API Endpoints

| Endpoint | Description |
| --- | --- |
| /page_content | Fetches, caches, and serves optimized page content from the PageContent table. |
| /page_filter | Manages CSS selector-based filtering rules for content processing. |
| /bulk_upload | Provides bulk upload functionality for adding URLs to the processing queue. |
| /cache_metrics | Retrieves page content cache performance metrics from the analytics store. |
| /PageContent | Direct REST endpoint for the PageContent table (see below) |
| /PageFilter | Direct REST endpoint for the PageFilter table (see below) |
| /Sitemap | Direct REST endpoint for the Sitemap table (see below) |
| /JobQueue | Direct REST endpoint for the JobQueue table (see below) |
| /WorkerStatus | Direct REST endpoint for the WorkerStatus table (see below) |

The Harper REST API gives low-level control over your data. The first four endpoints are component-level and provide higher-level functionality. The last five endpoints are direct access to Harper's REST API. For a full description of what the REST API can do and how to use it, refer to its documentation.

This REST interface for the various tables can be used to manually manipulate the data. See the Schema below for details on the structure of each table.
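
For example, a hedged sketch of direct reads against the PageContent table (the exact query syntax is defined by the Harper REST API reference; the URL below is a placeholder):

GET /PageContent/https%3A%2F%2Fwww.example.com%2Fsome-path

returns the record whose primary key is that URL, while a query string such as

GET /PageContent/?statusCode=200

filters records by attribute.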

/page_content - Page Content Retrieval

Authentication: Role-based access control using allowedReadRoles

Methods:

  • GET - Retrieve page content for a given URL
    • Uses cache when available and fresh (checks nextRefresh timestamp)
    • Falls back to source fetch on cache miss or stale content
    • Updates lastAccessed timestamps (batched every 60 seconds)
    • Handles blob errors by serving static markdown error pages and queuing refresh jobs
    • Returns content with appropriate headers (content-type, content-length, cache-control, etc.)
  • POST, PUT, PATCH, DELETE - Returns 405 Method Not Allowed with markdown response

Features:

  • Automatic error page generation on blob failures
  • Background job queuing for failed content retrieval
  • Analytics recording for cache hits/misses
  • Header normalization and merging
  • Passes through the origin response status code, or sets 504 Gateway Timeout on fetch timeouts and 500 for unknown errors

GET Request Example

URL Options: There are multiple options for supplying a URL with the GET request:

  1. Full URL as an encoded path segment:

GET /page_content/https%3A%2F%2Fwww.example.com

  2. Full URL as a path query param (recommend using the X-Query-String header for the query-param portion of the URL to avoid parsing errors):

GET /page_content?path=https://www.example.com/some-path

Headers
X-Query-String: "?arg1=val"

  3. Full URL as a Path header (can include query params):

GET /page_content

Headers
Path: "https://www.example.com/some-path?arg1=val"

Response: Returns either markdown page content, markdown error page (if fetching fails/times out), or HTML content (if filtering or conversion fails/times out)

Response Headers:

  • Content-Type: "text/markdown; charset=utf-8" or "text/html; charset=utf-8"
  • Content-Encoding: "gzip"
  • Content-Length: Byte length of compressed content
  • Last-Modified: UTC timestamp of when the page was last fetched and cached
  • X-Markdown-Version: Version of node-html-markdown library
  • Server-Timing: Fetch and process times for performance diagnostics (cache miss only)
  • Retry-After: Included for fetch, filter, convert, or blob errors to trigger bot retry (300 seconds)
  • Cache-Control: Included for success responses with max-age=refreshInterval (in seconds)

Example Response Headers:

Content-Type: text/markdown; charset=utf-8
Content-Encoding: gzip
Content-Length: 1234
Last-Modified: Wed, 03 Sep 2025 14:30:00 GMT
X-Markdown-Version: 1.1.0
Server-Timing: fetch-resolve;dur=150.25, process-resolve;dur=75.50
Cache-Control: public, max-age=86400

/page_filter - Content Filtering Rules

Methods:

  • GET /{pathname} - Retrieve filters for a specific path
    • Always includes "default" filters
    • Combines default and path-specific selectors
    • Returns array of CSS selectors to remove from HTML
  • POST - Create or update filter rules
    • Requires path and filters (comma-separated CSS selectors)
    • Example:
{ "path": "blog", "filters": "header, footer, .ads, #popup, figure img.hero" }
  • DELETE /{pathname} - Remove filter rules for a specific path
  • PUT and PATCH - Return 405 Method Not Allowed
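
For illustration, given the POST example above plus hypothetical default filters of nav and .sidebar, a GET for that path would return the combined selector list (the response shape here is a hedged assumption):

GET /page_filter/blog

["nav", ".sidebar", "header", "footer", ".ads", "#popup", "figure img.hero"]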

Features:

  • Path name normalization (lowercase, underscores to dashes)
  • Default filters applied to all pages
  • CSS selector validation
  • Error handling with detailed messages

/bulk_upload - Batch URL Processing

Methods:

  • POST - Add URLs to job queue with three supported input formats:

    Single URL:

    {
    	"url": "https://example.com/page",
    	"refreshInterval": 3600000
    }

    Sitemap Processing:

    {
    	"sitemap": "https://example.com/sitemap.xml",
    	"isIndex": false,
    	"refreshInterval": 3600000
    }
    
    or
    
    {
    	"sitemap": "https://example.com/sitemap_index.xml",
    	"isIndex": true,
    	"refreshInterval": 3600000
    }

    URL List:

    {
    	"urlList": ["https://example.com/page1", "https://example.com/page2"],
    	"refreshInterval": 3600000
    }

Features:

  • Batch processing with configurable batch size (default: 100 URLs)
  • Sitemap and sitemap index parsing, with periodic refresh of sitemap URLs to ensure new pages are added to prerender schedule
  • Job prioritization (batch uploads get medium-low priority, after on-demand requests, blob failures, and retries)
  • Partial success handling with detailed failure reporting for url and urlList methods
  • Promise-based concurrent processing with error isolation

Response Format:

  • 201 - Full success
  • 207 - Partial success (some failures)
  • 400 - Invalid input
  • 500 - Processing error

/cache_metrics - Performance Analytics

Authentication: Inherited from Resource base class

Methods:

  • GET - Retrieve cache metrics from the last 60 seconds

Metrics Returned:

  • Timing Metrics (in milliseconds):

    • page-fetch-time - Time to fetch content from source
    • page-filter-time - Time to apply CSS filters
    • page-convert-time - Time to convert HTML to markdown
    • page-process-time - Total page processing time
    • job-complete-time - Total job processing time
    • Each includes: avg, p50, p90, p95 percentiles
  • Cache Performance:

    • page-cache-hit-rate - Cache hit percentage
    • Based on page-cache-hit and page-cache-miss counts

Response Format:

{
  "timing": [
    {
      "metric": "page-fetch-time",
      "period": [start_timestamp, end_timestamp],
      "unit": "ms",
      "avg": "123.456",
      "p50": "100.000",
      "p90": "200.000",
      "p95": "250.000"
    }
  ],
  "cacheHitRate": {
    "metric": "page-cache-hit-rate",
    "period": [start_timestamp, end_timestamp],
    "unit": "percent",
    "rate": 85.5
  }
}

Database Schema

The system uses two databases to manage content delivery and processing:

Markdown Database

PageContent - Stores static page content and metadata for LLM bot-optimized delivery

| Field | Type | Description |
| --- | --- | --- |
| url | String (Primary Key) | Full URL of the page |
| pageContent | Blob | Either markdown or HTML content |
| contentType | String | "text/markdown; charset=utf-8" or "text/html; charset=utf-8" |
| contentLength | Int | Length of the content in bytes |
| statusCode | Int | HTTP status code of the page fetch |
| headers | Any | Object with header key:value pairs |
| mdVersion | String | node-html-markdown version for debugging |
| serverTiming | String (Optional) | Server timing values for headers |
| errorMessage | String (Optional) | Error message if applicable |
| lastAccessed | Date (Indexed) | Last time the page was accessed, in epoch milliseconds |
| lastRefreshed | Date | Last time the page was refreshed, in epoch milliseconds |
| refreshInterval | Long | Time between refreshes, in milliseconds |
| nextRefresh | Date (Indexed) | Next scheduled refresh time, in epoch milliseconds |
| createdAt | Date | When the record was created, in epoch milliseconds |

PageFilter - Defines filtering rules for content processing

| Field | Type | Description |
| --- | --- | --- |
| path | String (Primary Key) | Path portion of URL to match (e.g., "/blog") |
| filters | String | Comma-separated list of CSS selectors to remove from the page before rendering |
| updatedAt | Date | When the record was last updated, in epoch milliseconds |
| createdAt | Date | When the record was created, in epoch milliseconds |

Notes:

  1. Filter rules apply to all URLs starting with the same path. Default filters use path = "default" and apply to all URLs in addition to any other matching path filters
  2. The filter string can include any valid CSS selectors (e.g., "header, footer, .ads, #popup, div#banner, figure img.hero")

Sitemap - Stores sitemaps and sitemap indexes that refresh on intervals to add new pages to PageContent

| Field | Type | Description |
| --- | --- | --- |
| url | String (Primary Key) | Full URL of the sitemap |
| isIndex | Boolean (Indexed) | Whether the sitemap is an index |
| parentSitemap | String (Indexed) | URL of the parent sitemap index |
| pageRefresh | Long | Time between refreshes for the sitemap's pages, in milliseconds |
| lastRefreshed | Date | Last time the sitemap was refreshed, in epoch milliseconds |
| refreshInterval | Long | Time between sitemap refreshes, in milliseconds |
| nextRefresh | Date (Indexed) | Next scheduled refresh time, in epoch milliseconds |
| createdAt | Date | When the record was created, in epoch milliseconds |

Workers Database

JobQueue - Queue of jobs for workers to process

| Field | Type | Description |
| --- | --- | --- |
| id | ID (Primary Key) | Unique job ID |
| url | String | URL of the page to process |
| pageConfig | Any | Page configuration options (object with any PageContent fields to pass to the worker) |
| status | String (Indexed) | Job status (created, claimed, completed, failed) |
| attempts | Int | Number of processing attempts |
| priority | Int (Indexed) | Job priority (low value = high priority) |
| errorMessage | String (Optional) | Error message if the job failed |
| assignedTo | String (Indexed) | Worker hostname assigned to this job |
| claimedAt | Date | Time when the worker claimed the job, in epoch milliseconds |
| completedAt | Date | Time when the job was completed, in epoch milliseconds |
| createdAt | Date | Time when the job was created, in epoch milliseconds |

Note:

  1. priority options include:
  • 0: highest for ad-hoc (on-demand) requests
  • 1: medium high for blob failures
  • 2: medium for retries
  • 3: medium low for batch uploads
  • 4: lowest for regular scheduled refresh

WorkerStatus - Status tracking for workers in the cluster

| Field | Type | Description |
| --- | --- | --- |
| id | String (Primary Key) | Unique worker hostname |
| status | String (Indexed) | Current status (idle, working) |
| isScheduler | Boolean | Whether the worker is designated as Scheduler |

Scheduler / Worker / Job Flow

The system uses a distributed job processing architecture, with a Scheduler managing job distribution and workers executing page processing tasks.

Job Queue System

Job Lifecycle:

  1. Created - New job added to queue or expired job released, awaiting assignment
  2. Claimed - Worker has claimed the job for processing
  3. Completed - Job finished successfully
  4. Failed - Job failed after retries or critical errors

Job Priorities (lower number = higher priority):

  • 0 - Ad-Hoc (On-demand) requests (highest priority)
  • 1 - Blob failures (medium-high priority)
  • 2 - Automatic retries (medium priority)
  • 3 - Batch uploads (medium-low priority)
  • 4 - Scheduled refreshes (lowest priority)

Job Management Features:

  • Duplicate prevention: Only one pending job per URL allowed
  • Configuration updates: Higher priority jobs can update existing job configs
  • Retry logic: Jobs can be retried up to 3 times before permanent failure
  • Expiration handling: Jobs claimed too long are automatically released
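
A sketch of the first two rules, duplicate prevention and priority-based config updates (the table helpers are illustrative; lower numbers mean higher priority):

// Enqueue a job, enforcing one pending job per URL
async function enqueueJob(url, priority, pageConfig) {
	const pending = await findPendingJob(url); // existing job with status "created" or "claimed"

	if (!pending) {
		return JobQueue.create({ url, priority, pageConfig, status: 'created', attempts: 0 });
	}

	// Duplicate prevention: keep the existing job, but let a higher-priority
	// request upgrade its priority and page configuration
	if (priority < pending.priority) {
		await JobQueue.update(pending.id, { priority, pageConfig });
	}
	return pending;
}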

Scheduler Responsibilities

The Scheduler runs on a designated node (set via the SCHEDULER hostname in .env) and manages job distribution across the worker nodes:

Core Functions:

  • Job Assignment: Assigns unassigned jobs to available workers
  • Failure Recovery: Reassigns failed jobs for retry (tries to assign to a different worker than original attempt)
  • Sitemap Refresh: Crawls sitemaps based on nextRefresh to create jobs for new pages
  • Scheduled Refresh: Creates jobs for pages needing refresh based on nextRefresh timestamps
  • Cleanup: Removes old job records and inactive page content based on eviction policies

Scheduling Intervals:

  • Orphaned Jobs: 1 minute (for blob failures and bulk uploads)
  • Failed Jobs: 10 minutes (matches worker expiration check)
  • Unclaimed Jobs: 20 minutes (allows worker retry window)
  • Scheduled Refresh: Minimum of 5 minutes or defaultRefreshInterval
  • Cleanup Tasks: Minimum of 24 hours or evictionPeriod

Job Sources Handled:

  • Orphaned jobs (no worker assigned)
  • Failed jobs (marked for retry)
  • Unclaimed jobs (created but not claimed within threshold)
  • Scheduled refresh jobs (pages / sitemaps after their refresh interval)
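
A condensed sketch of one Scheduler pass over those sources (function names and query helpers are illustrative):

// One scheduling pass: assign work, recover failures, schedule refreshes
async function schedulerTick() {
	const idleWorkers = await WorkerStatus.find({ status: 'idle' });

	// Assign orphaned and unclaimed jobs, highest priority (lowest number) first
	const unassigned = await JobQueue.find({ status: 'created', assignedTo: null });
	unassigned.sort((a, b) => a.priority - b.priority);
	for (const job of unassigned) {
		const worker = pickWorker(idleWorkers, job); // avoid the worker that last failed this job
		if (worker) await JobQueue.update(job.id, { assignedTo: worker.id });
	}

	// Create refresh jobs for pages (and sitemaps) whose nextRefresh has passed
	const due = await PageContent.find({ nextRefresh: { lessThan: Date.now() } });
	for (const page of due) {
		await enqueueJob(page.url, 4, {}); // lowest priority: scheduled refresh
	}
}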

Worker Operations

Each worker node processes jobs assigned to it by the Scheduler:

Worker Lifecycle:

  1. Registration: Worker registers in WorkerStatus table as "idle"
  2. Job Claiming: Periodically claims assigned jobs (up to set concurrency)
  3. Processing: Fetches and processes page content using pageSource
  4. Completion: Marks jobs as completed or failed
  5. Status Updates: Updates status between "idle" and "working"

Key Features:

  • Concurrency: Processes up to concurrency limit of jobs simultaneously
  • Timeout Protection: Jobs expire after 5 minutes if not completed
  • Retry Logic: Failed jobs are released for reassignment (up to 3 attempts)
  • Error Handling: Detailed error messages preserved across retries

Processing Flow:

  1. Worker claims jobs (status: created → claimed)
  2. For each job:
    • Fetch page content via pageSource
    • Apply filters and convert to markdown
    • Update PageContent table
    • Mark job as completed/failed
  3. Release expired jobs back to queue
  4. Update worker status to idle
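
A simplified sketch of that loop (names are illustrative; the 3-attempt limit mirrors the retry logic described above):

// One worker pass: claim assigned jobs, process them, report status
async function workerTick(hostname, concurrency) {
	const assigned = await JobQueue.find({ assignedTo: hostname, status: 'created' });
	const batch = assigned.slice(0, concurrency);
	if (batch.length === 0) return;

	await WorkerStatus.update(hostname, { status: 'working' });
	await Promise.all(
		batch.map(async (job) => {
			await JobQueue.update(job.id, { status: 'claimed', claimedAt: Date.now() });
			try {
				await renderAndStore(job.url, job.pageConfig); // fetch → filter → convert → store
				await JobQueue.update(job.id, { status: 'completed', completedAt: Date.now() });
			} catch (err) {
				// Release for retry (up to 3 attempts), preserving the error message
				const attempts = job.attempts + 1;
				const status = attempts >= 3 ? 'failed' : 'created';
				await JobQueue.update(job.id, { status, attempts, errorMessage: String(err) });
			}
		})
	);
	await WorkerStatus.update(hostname, { status: 'idle' });
}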

Fault Tolerance

Worker Failures:

  • Jobs expire after 5 minutes without completion
  • Expired jobs are automatically reassigned by Scheduler
  • Scheduler avoids reassignment to previously failed workers when possible

Job Failures:

  • Jobs retry up to 3 times
  • Error messages accumulate across retry attempts
  • Persistent failures are marked as permanently failed

Data Consistency:

  • Transaction support ensures atomic job state changes
  • Duplicate job prevention via URL-based deduplication
  • Background cleanup prevents database bloat

Monitoring:

  • Worker status tracking (idle/working)
  • Job attempt counting and error logging
  • Analytics recording for performance metrics
  • Automatic cleanup of old records based on eviction periods

Performance Characteristics

Timeouts

| Operation | On-Demand Timeout | Scheduled Refresh Timeout | Fallback Behavior |
| --- | --- | --- | --- |
| Fetch | 3000 ms | 10000 ms | Static error page with 503 status |
| Filtering | 400 ms | 1000 ms | Return original HTML content |
| Conversion | 400 ms | 1000 ms | Return original HTML content |
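
The filter and convert fallbacks amount to racing each stage against a timer and keeping the original HTML when the timer wins. A sketch under those assumptions (helper names are illustrative):

// Run a stage with a hard time limit; fall back to the original HTML on timeout or error
async function withFallback(stagePromise, ms, fallbackHtml) {
	const timer = new Promise((_, reject) =>
		setTimeout(() => reject(new Error('stage timed out')), ms)
	);
	try {
		return await Promise.race([stagePromise, timer]);
	} catch {
		return fallbackHtml; // graceful degradation: serve HTML instead of markdown
	}
}

// Ad-hoc request: 400 ms budget for conversion (1000 ms for scheduled refreshes)
const output = await withFallback(convertToMarkdown(html), 400, html);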

Server Timing Metrics

The Server-Timing header includes:

  • fetch-resolve;dur=X.XX - Time to fetch original content (milliseconds)
  • process-resolve;dur=Y.YY - Time to filter and convert content (milliseconds)
