Files
deflock/serverless/blog_scraper
2025-12-17 18:17:05 -07:00
..
2025-12-14 18:52:53 -07:00
2025-12-17 18:17:05 -07:00
2025-12-14 18:52:53 -07:00
2025-12-14 18:52:53 -07:00
2025-12-14 18:52:53 -07:00

Blog RSS Scraper

This Lambda function ingests RSS feeds into a Directus CMS instance. It's specifically configured to pull from the "Have I Been Flocked?" RSS feed and sync the posts with your Directus blog collection.

Features

  • RSS Feed Parsing: Extracts title, link, pubDate, and description from RSS entries
  • Directus Integration: Creates, updates, and deletes blog posts via Directus API
  • Idempotent Operation: Safe to run multiple times - only makes necessary changes
  • Selective Sync: Only manages RSS-ingested posts (identified by externalUrl field)
  • Error Handling: Comprehensive logging and error recovery

Setup

Environment Variables

Set the following environment variables:

# Required
DIRECTUS_API_TOKEN=your_directus_api_token_here

# Optional (defaults to https://cms.deflock.me)
DIRECTUS_BASE_URL=https://your-directus-instance.com

Directus Collection Schema

Your Directus blog collection should have the following fields:

  • id (integer, auto-increment)
  • title (string, required)
  • description (text)
  • content (rich text, optional - RSS posts will have this as null)
  • externalUrl (string, optional - identifies RSS-ingested posts)
  • published (datetime)

Dependencies

Install dependencies using uv:

uv init

Usage

Local Testing

uv run main.py

AWS Lambda

Deploy as a Python 3.14 Lambda function. The lambda_handler function serves as the entry point.

Sample Lambda Event

The function doesn't require any specific event data:

{}

Sample Response

Success:

{
  "statusCode": 200,
  "body": {
    "message": "RSS synchronization completed successfully",
    "stats": {
      "created": 2,
      "updated": 1,
      "deleted": 0,
      "errors": 0
    }
  }
}

Error:

{
  "statusCode": 500,
  "body": {
    "message": "RSS synchronization failed",
    "error": "DIRECTUS_API_TOKEN environment variable is required"
  }
}

How It Works

  1. Fetch RSS Feed: Downloads and parses the RSS feed from https://haveibeenflocked.com/feed.xml

  2. Get Existing Posts: Queries Directus for all blog posts that have an externalUrl (these are RSS-managed posts)

  3. Synchronization:

    • Create: New RSS entries that don't exist in Directus
    • Update: Existing posts where title or description has changed
    • Delete: Directus posts with externalUrl that no longer exist in the RSS feed
  4. Preserve Manual Posts: Posts without an externalUrl are left untouched

RSS Feed Structure

The scraper expects standard RSS 2.0 format with the following elements:

  • <title>: Post title
  • <link>: Post URL (becomes externalUrl)
  • <pubDate>: Publication date (becomes published)
  • <description> or <content>: Post description (HTML tags are stripped)

Error Handling

  • Invalid dates are logged as warnings but don't stop processing
  • Individual post errors are logged and counted but don't stop the entire sync
  • HTTP errors from Directus API are logged with full details
  • Missing environment variables cause immediate failure with clear error messages

Logging

The function uses Python's standard logging module with INFO level. Key events logged:

  • RSS feed fetch status
  • Number of entries parsed
  • Create/update/delete operations
  • Errors and warnings
  • Final synchronization statistics

Security Considerations

  • Store the Directus API token securely (AWS Secrets Manager recommended for production)
  • Use HTTPS for all API communications (enforced by default)
  • The function only modifies posts with externalUrl - manual posts are safe
  • Consider rate limiting if running frequently

Deployment

AWS Lambda Deployment Package

  1. Navigate to the terraform directory.
  2. Set the required variables in a local copy of terraform.tfvars.
  3. Run terraform apply.

Environment Variables in Lambda

Set in the Lambda function configuration:

  • DIRECTUS_API_TOKEN: Your Directus API token
  • DIRECTUS_BASE_URL: Your Directus instance URL (optional)

Scheduling

The Terraform configutaiton sets up a CloudWatch Events rule to run this function periodically.

Troubleshooting

Common Issues

  1. 401 Unauthorized: Check your DIRECTUS_API_TOKEN
  2. 404 Not Found: Verify DIRECTUS_BASE_URL and collection name (blog)
  3. RSS Parse Errors: Check if the RSS feed is accessible and valid
  4. Date Parse Failures: Usually logged as warnings and don't stop processing

Testing Connection

The function will fail fast if it can't connect to Directus, making debugging easier.

Development

Local Development Setup

# Clone and navigate to the blog_scraper directory
cd serverless/blog_scraper

# Install dependencies
uv init

# Set environment variables
export DIRECTUS_API_TOKEN="your_token"
export DIRECTUS_BASE_URL="https://cms.deflock.me"

# Run locally
uv run main.py

Testing with Different RSS Feeds

To test with a different RSS feed, modify the rss_url in the BlogScraper.__init__ method.