Blog RSS Scraper
This Lambda function ingests RSS feeds into a Directus CMS instance. It's specifically configured to pull from the "Have I Been Flocked?" RSS feed and sync the posts with your Directus blog collection.
Features
- RSS Feed Parsing: Extracts title, link, pubDate, and description from RSS entries
- Directus Integration: Creates, updates, and deletes blog posts via Directus API
- Idempotent Operation: Safe to run multiple times - only makes necessary changes
- Selective Sync: Only manages RSS-ingested posts (identified by the externalUrl field)
- Error Handling: Comprehensive logging and error recovery
Setup
Environment Variables
Set the following environment variables:
# Required
DIRECTUS_API_TOKEN=your_directus_api_token_here
# Optional (defaults to https://cms.deflock.me)
DIRECTUS_BASE_URL=https://your-directus-instance.com
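As a rough illustration of how these two variables might be read at startup, here is a minimal sketch. The helper name load_config and its env parameter are hypothetical and not taken from main.py; only the variable names, the default URL, and the error message come from this README.

```python
import os

# Default taken from the README; used when DIRECTUS_BASE_URL is not set.
DEFAULT_BASE_URL = "https://cms.deflock.me"


def load_config(env=None):
    """Return (base_url, token); raise if the required token is missing.

    `env` is a hypothetical parameter that makes the helper testable
    without touching the real process environment.
    """
    env = os.environ if env is None else env
    token = env.get("DIRECTUS_API_TOKEN")
    if not token:
        raise ValueError("DIRECTUS_API_TOKEN environment variable is required")
    # Strip a trailing slash so later URL joins don't produce "//".
    base_url = env.get("DIRECTUS_BASE_URL", DEFAULT_BASE_URL).rstrip("/")
    return base_url, token
```

Failing immediately on a missing token matches the behavior described under Error Handling.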
Directus Collection Schema
Your Directus blog collection should have the following fields:
- id (integer, auto-increment)
- title (string, required)
- description (text)
- content (rich text, optional; null for RSS-ingested posts)
- externalUrl (string, optional; identifies RSS-ingested posts)
- published (datetime)
Dependencies
Install dependencies using uv:
uv sync
Usage
Local Testing
uv run main.py
AWS Lambda
Deploy as a Python 3.14 Lambda function. The lambda_handler function serves as the entry point.
Sample Lambda Event
The function doesn't require any specific event data:
{}
Sample Response
Success:
{
"statusCode": 200,
"body": {
"message": "RSS synchronization completed successfully",
"stats": {
"created": 2,
"updated": 1,
"deleted": 0,
"errors": 0
}
}
}
Error:
{
"statusCode": 500,
"body": {
"message": "RSS synchronization failed",
"error": "DIRECTUS_API_TOKEN environment variable is required"
}
}
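The two response shapes above could be produced by a handler along these lines. This is a sketch only: run_sync is a hypothetical stand-in for the real synchronization logic, and the extra sync parameter exists purely so the example can be exercised without Directus.

```python
def run_sync():
    """Hypothetical placeholder for the real sync; returns the stats dict."""
    return {"created": 0, "updated": 0, "deleted": 0, "errors": 0}


def lambda_handler(event, context, sync=run_sync):
    """Wrap the sync in the success/error response shapes shown above."""
    try:
        stats = sync()
    except Exception as exc:
        # Any failure (e.g. a missing token) becomes a 500 with the message.
        return {
            "statusCode": 500,
            "body": {"message": "RSS synchronization failed", "error": str(exc)},
        }
    return {
        "statusCode": 200,
        "body": {
            "message": "RSS synchronization completed successfully",
            "stats": stats,
        },
    }
```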
How It Works
- Fetch RSS Feed: Downloads and parses the RSS feed from https://haveibeenflocked.com/feed.xml
- Get Existing Posts: Queries Directus for all blog posts that have an externalUrl (these are the RSS-managed posts)
- Synchronization:
  - Create: RSS entries that don't yet exist in Directus
  - Update: Existing posts whose title or description has changed
  - Delete: Directus posts with an externalUrl that no longer appears in the RSS feed
- Preserve Manual Posts: Posts without an externalUrl are left untouched
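The create/update/delete decision described above can be sketched as a pure function that compares the feed against Directus, keyed by externalUrl. The function name plan_sync and the dict shapes are assumptions made for illustration; the actual implementation in main.py may be organized differently.

```python
def plan_sync(feed_entries, existing_posts):
    """Decide which posts to create, update, and delete.

    feed_entries: dicts with externalUrl, title, description (from RSS).
    existing_posts: dicts from Directus, additionally carrying an id.
    Returns (to_create, to_update, to_delete).
    """
    feed_by_url = {e["externalUrl"]: e for e in feed_entries}
    # Posts without an externalUrl are manual posts and are ignored entirely.
    existing_by_url = {
        p["externalUrl"]: p for p in existing_posts if p.get("externalUrl")
    }

    to_create = [e for url, e in feed_by_url.items() if url not in existing_by_url]

    to_update = []
    for url, post in existing_by_url.items():
        entry = feed_by_url.get(url)
        if entry is None:
            continue
        # Only title and description are compared, as described above.
        changes = {
            key: entry.get(key)
            for key in ("title", "description")
            if entry.get(key) != post.get(key)
        }
        if changes:
            to_update.append((post["id"], changes))

    to_delete = [p["id"] for url, p in existing_by_url.items() if url not in feed_by_url]
    return to_create, to_update, to_delete
```

Because the plan is derived from the current state on both sides, running it a second time with nothing changed yields three empty lists, which is what makes the sync idempotent.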
RSS Feed Structure
The scraper expects standard RSS 2.0 format with the following elements:
- <title>: Post title
- <link>: Post URL (becomes externalUrl)
- <pubDate>: Publication date (becomes published)
- <description> or <content>: Post description (HTML tags are stripped)
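To make the mapping concrete, here is a minimal standard-library sketch of extracting these elements from an RSS 2.0 document. It is an assumption that parsing works roughly like this; the real scraper may use a dedicated feed library, and the names parse_rss and strip_html are hypothetical. Only <description> is handled here, and the tag stripping is a simple regex rather than a full HTML parser.

```python
import html
import re
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime

TAG_RE = re.compile(r"<[^>]+>")


def strip_html(text):
    """Remove HTML tags and decode entities (naive regex approach)."""
    return html.unescape(TAG_RE.sub("", text or "")).strip()


def parse_rss(xml_text):
    """Map each <item> to a dict using the field names from this README."""
    root = ET.fromstring(xml_text)
    entries = []
    for item in root.iter("item"):
        pub_date = item.findtext("pubDate")
        try:
            # RSS 2.0 dates use the RFC 822 format, e.g. "Mon, 06 Jan 2025 12:00:00 +0000".
            published = parsedate_to_datetime(pub_date).isoformat() if pub_date else None
        except (TypeError, ValueError):
            # An invalid date does not abort parsing; the field is left empty.
            published = None
        entries.append({
            "title": (item.findtext("title") or "").strip(),
            "externalUrl": (item.findtext("link") or "").strip(),
            "published": published,
            "description": strip_html(item.findtext("description")),
        })
    return entries
```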
Error Handling
- Invalid dates are logged as warnings but don't stop processing
- Individual post errors are logged and counted but don't stop the entire sync
- HTTP errors from Directus API are logged with full details
- Missing environment variables cause immediate failure with clear error messages
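The "log, count, and continue" behavior for individual posts could look roughly like the sketch below. The helper apply_operations and the (kind, action) pair format are hypothetical; the stats keys match the sample response shown earlier.

```python
import logging

logger = logging.getLogger(__name__)


def apply_operations(operations):
    """Run each operation, counting successes and failures separately.

    operations: iterable of (kind, action) pairs, where kind is one of
    "created", "updated", "deleted" and action is a zero-argument callable.
    """
    stats = {"created": 0, "updated": 0, "deleted": 0, "errors": 0}
    for kind, action in operations:
        try:
            action()
            stats[kind] += 1
        except Exception:
            # Log the traceback and keep going; one bad post should not
            # abort the rest of the sync.
            logger.exception("Failed to apply %s operation", kind)
            stats["errors"] += 1
    return stats
```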
Logging
The function uses Python's standard logging module with INFO level. Key events logged:
- RSS feed fetch status
- Number of entries parsed
- Create/update/delete operations
- Errors and warnings
- Final synchronization statistics
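A minimal sketch of INFO-level logging setup with the standard library follows. The helper get_logger and the logger name blog_scraper are assumptions. Setting the level on the logger itself matters in Lambda, where the runtime installs its own root handler and logging.basicConfig therefore has no effect.

```python
import logging


def get_logger(name="blog_scraper"):
    """Return a logger that emits INFO and above.

    basicConfig only takes effect when the root logger has no handlers
    (e.g. local runs); the explicit setLevel covers the Lambda case.
    """
    logging.basicConfig(
        level=logging.INFO,
        format="%(asctime)s %(levelname)s %(name)s: %(message)s",
    )
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    return logger
```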
Security Considerations
- Store the Directus API token securely (AWS Secrets Manager recommended for production)
- Use HTTPS for all API communications (enforced by default)
- The function only modifies posts with an externalUrl; manual posts are never touched
- Consider rate limiting if running frequently
Deployment
AWS Lambda Deployment Package
- Navigate to the terraform directory.
- Set the required variables in a local copy of terraform.tfvars.
- Run terraform apply.
Environment Variables in Lambda
Set in the Lambda function configuration:
- DIRECTUS_API_TOKEN: Your Directus API token
- DIRECTUS_BASE_URL: Your Directus instance URL (optional)
Scheduling
The Terraform configuration sets up a CloudWatch Events rule to run this function periodically.
Troubleshooting
Common Issues
- 401 Unauthorized: Check your DIRECTUS_API_TOKEN
- 404 Not Found: Verify DIRECTUS_BASE_URL and the collection name (blog)
- RSS Parse Errors: Check that the RSS feed is accessible and valid
- Date Parse Failures: These are logged as warnings and don't stop processing
Testing Connection
The function will fail fast if it can't connect to Directus, making debugging easier.
Development
Local Development Setup
# Clone and navigate to the blog_scraper directory
cd serverless/blog_scraper
# Install dependencies
uv sync
# Set environment variables
export DIRECTUS_API_TOKEN="your_token"
export DIRECTUS_BASE_URL="https://cms.deflock.me"
# Run locally
uv run main.py
Testing with Different RSS Feeds
To test with a different RSS feed, modify the rss_url in the BlogScraper.__init__ method.