# Blog RSS Scraper This Lambda function ingests RSS feeds into a Directus CMS instance. It's specifically configured to pull from the "Have I Been Flocked?" RSS feed and sync the posts with your Directus blog collection. ## Features - **RSS Feed Parsing**: Extracts title, link, pubDate, and description from RSS entries - **Directus Integration**: Creates, updates, and deletes blog posts via Directus API - **Idempotent Operation**: Safe to run multiple times - only makes necessary changes - **Selective Sync**: Only manages RSS-ingested posts (identified by `externalUrl` field) - **Error Handling**: Comprehensive logging and error recovery ## Setup ### Environment Variables Set the following environment variables: ```bash # Required DIRECTUS_API_TOKEN=your_directus_api_token_here # Optional (defaults to https://cms.deflock.me) DIRECTUS_BASE_URL=https://your-directus-instance.com ``` ### Directus Collection Schema Your Directus `blog` collection should have the following fields: - `id` (integer, auto-increment) - `title` (string, required) - `description` (text) - `content` (rich text, optional - RSS posts will have this as null) - `externalUrl` (string, optional - identifies RSS-ingested posts) - `published` (datetime) ### Dependencies Install dependencies using uv: ```bash uv init ``` ## Usage ### Local Testing ```bash uv run main.py ``` ### AWS Lambda Deploy as a Python 3.14 Lambda function. The `lambda_handler` function serves as the entry point. #### Sample Lambda Event The function doesn't require any specific event data: ```json {} ``` #### Sample Response Success: ```json { "statusCode": 200, "body": { "message": "RSS synchronization completed successfully", "stats": { "created": 2, "updated": 1, "deleted": 0, "errors": 0 } } } ``` Error: ```json { "statusCode": 500, "body": { "message": "RSS synchronization failed", "error": "DIRECTUS_API_TOKEN environment variable is required" } } ``` ## How It Works 1. **Fetch RSS Feed**: Downloads and parses the RSS feed from `https://haveibeenflocked.com/feed.xml` 2. **Get Existing Posts**: Queries Directus for all blog posts that have an `externalUrl` (these are RSS-managed posts) 3. **Synchronization**: - **Create**: New RSS entries that don't exist in Directus - **Update**: Existing posts where title or description has changed - **Delete**: Directus posts with `externalUrl` that no longer exist in the RSS feed 4. **Preserve Manual Posts**: Posts without an `externalUrl` are left untouched ## RSS Feed Structure The scraper expects standard RSS 2.0 format with the following elements: - ``: Post title - `<link>`: Post URL (becomes `externalUrl`) - `<pubDate>`: Publication date (becomes `published`) - `<description>` or `<content>`: Post description (HTML tags are stripped) ## Error Handling - Invalid dates are logged as warnings but don't stop processing - Individual post errors are logged and counted but don't stop the entire sync - HTTP errors from Directus API are logged with full details - Missing environment variables cause immediate failure with clear error messages ## Logging The function uses Python's standard logging module with INFO level. Key events logged: - RSS feed fetch status - Number of entries parsed - Create/update/delete operations - Errors and warnings - Final synchronization statistics ## Security Considerations - Store the Directus API token securely (AWS Secrets Manager recommended for production) - Use HTTPS for all API communications (enforced by default) - The function only modifies posts with `externalUrl` - manual posts are safe - Consider rate limiting if running frequently ## Deployment ### AWS Lambda Deployment Package 1. Navigate to [the terraform directory](../../terraform/). 2. Set the required variables in a local copy of `terraform.tfvars`. 3. Run `terraform apply`. ### Environment Variables in Lambda Set in the Lambda function configuration: - `DIRECTUS_API_TOKEN`: Your Directus API token - `DIRECTUS_BASE_URL`: Your Directus instance URL (optional) ### Scheduling The Terraform configutaiton sets up a CloudWatch Events rule to run this function periodically. ## Troubleshooting ### Common Issues 1. **401 Unauthorized**: Check your `DIRECTUS_API_TOKEN` 2. **404 Not Found**: Verify `DIRECTUS_BASE_URL` and collection name (`blog`) 3. **RSS Parse Errors**: Check if the RSS feed is accessible and valid 4. **Date Parse Failures**: Usually logged as warnings and don't stop processing ### Testing Connection The function will fail fast if it can't connect to Directus, making debugging easier. ## Development ### Local Development Setup ```bash # Clone and navigate to the blog_scraper directory cd serverless/blog_scraper # Install dependencies uv init # Set environment variables export DIRECTUS_API_TOKEN="your_token" export DIRECTUS_BASE_URL="https://cms.deflock.me" # Run locally uv run main.py ``` ### Testing with Different RSS Feeds To test with a different RSS feed, modify the `rss_url` in the `BlogScraper.__init__` method.