{{ post.title }}
++ {{ formatDate(post.published) }} +
+ ++ {{ post.description }} +
+
+
diff --git a/.gitignore b/.gitignore
index 7f48f40..e13e093 100644
--- a/.gitignore
+++ b/.gitignore
@@ -43,6 +43,7 @@ terraform.tfstate*
 # Lambda Python Stuff
 serverless/*/lambda.zip
+serverless/*/.build/
 serverless/*/src/*
 !serverless/*/src/alpr_cache.py
 !serverless/*/src/alpr_clusters.py
diff --git a/serverless/blog_scraper/.python-version b/serverless/blog_scraper/.python-version
new file mode 100644
index 0000000..6324d40
--- /dev/null
+++ b/serverless/blog_scraper/.python-version
@@ -0,0 +1 @@
+3.14
diff --git a/serverless/blog_scraper/README.md b/serverless/blog_scraper/README.md
new file mode 100644
index 0000000..51a26af
--- /dev/null
+++ b/serverless/blog_scraper/README.md
@@ -0,0 +1,192 @@
+# Blog RSS Scraper
+
+This Lambda function ingests RSS feeds into a Directus CMS instance. It's specifically configured to pull from the "Have I Been Flocked?" RSS feed and sync the posts with your Directus blog collection.
+
+## Features
+
+- **RSS Feed Parsing**: Extracts title, link, pubDate, and description from RSS entries
+- **Directus Integration**: Creates, updates, and deletes blog posts via Directus API
+- **Idempotent Operation**: Safe to run multiple times - only makes necessary changes
+- **Selective Sync**: Only manages RSS-ingested posts (identified by `externalUrl` field)
+- **Error Handling**: Comprehensive logging and error recovery
+
+## Setup
+
+### Environment Variables
+
+Set the following environment variables:
+
+```bash
+# Required
+DIRECTUS_API_TOKEN=your_directus_api_token_here
+
+# Optional (defaults to https://cms.deflock.me)
+DIRECTUS_BASE_URL=https://your-directus-instance.com
+```
+
+### Directus Collection Schema
+
+Your Directus `blog` collection should have the following fields:
+
+- `id` (integer, auto-increment)
+- `title` (string, required)
+- `description` (text)
+- `content` (rich text, optional - RSS posts will have this as null)
+- `externalUrl` (string, optional - identifies RSS-ingested posts)
+- `published` (datetime)
+
+### Dependencies
+
+Install dependencies using uv:
+
+```bash
+uv init
+```
+
+## Usage
+
+### Local Testing
+
+```bash
+uv run main.py
+```
+
+### AWS Lambda
+
+Deploy as a Python 3.14 Lambda function. The `lambda_handler` function serves as the entry point.
+
+#### Sample Lambda Event
+
+The function doesn't require any specific event data:
+
+```json
+{}
+```
+
+#### Sample Response
+
+Success:
+```json
+{
+  "statusCode": 200,
+  "body": {
+    "message": "RSS synchronization completed successfully",
+    "stats": {
+      "created": 2,
+      "updated": 1,
+      "deleted": 0,
+      "errors": 0
+    }
+  }
+}
+```
+
+Error:
+```json
+{
+  "statusCode": 500,
+  "body": {
+    "message": "RSS synchronization failed",
+    "error": "DIRECTUS_API_TOKEN environment variable is required"
+  }
+}
+```
+
+## How It Works
+
+1. **Fetch RSS Feed**: Downloads and parses the RSS feed from `https://haveibeenflocked.com/feed.xml`
+
+2. **Get Existing Posts**: Queries Directus for all blog posts that have an `externalUrl` (these are RSS-managed posts)
+
+3. **Synchronization**:
+   - **Create**: New RSS entries that don't exist in Directus
+   - **Update**: Existing posts where title or description has changed
+   - **Delete**: Directus posts with `externalUrl` that no longer exist in the RSS feed
+
+4. **Preserve Manual Posts**: Posts without an `externalUrl` are left untouched
+
+## RSS Feed Structure
+
+The scraper expects standard RSS 2.0 format with the following elements:
+- `
+
{{ description }}
+ {{ formatDate(post.published) }} +
+ ++ {{ post.description }} +
+ +Check back later for updates!
+
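The synchronization rules in the README's "How It Works" section can be illustrated with a small planning function. This is only a sketch, not the code from `main.py`: the name `plan_sync`, the dictionary shapes, and the assumption that an RSS entry's `link` is what gets stored in `externalUrl` are all hypothetical. It computes which posts would be created, updated, or deleted, and makes no Directus API calls.

```python
from typing import Any


def plan_sync(
    feed_entries: list[dict[str, Any]],
    existing_posts: list[dict[str, Any]],
) -> dict[str, list]:
    """Compute create/update/delete actions without touching Directus.

    Assumption (not confirmed by the README): an RSS entry's `link`
    is the value stored in the post's `externalUrl` field.
    """
    # Only posts with an externalUrl are RSS-managed; manual posts are skipped.
    managed = {p["externalUrl"]: p for p in existing_posts if p.get("externalUrl")}
    feed = {e["link"]: e for e in feed_entries}

    plan: dict[str, list] = {"create": [], "update": [], "delete": []}
    for link, entry in feed.items():
        post = managed.get(link)
        if post is None:
            # Entry is in the feed but not in Directus yet.
            plan["create"].append(entry)
        elif (post.get("title"), post.get("description")) != (
            entry.get("title"),
            entry.get("description"),
        ):
            # Keep the Directus id so the caller knows which item to patch.
            plan["update"].append({"id": post["id"], **entry})

    # Managed posts whose URL is no longer in the feed are scheduled for deletion.
    plan["delete"] = [p["id"] for url, p in managed.items() if url not in feed]
    return plan
```

Because the plan is derived purely from the current feed and the current Directus state, running it again after the actions have been applied would yield empty lists, which is one way to get the idempotent behavior the README describes.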