Will Freeman 1c241d17c9 Add Blog (#83)
* add rss ingestor

* it works

* blog working

* fix the theme switcher

* finalize blog, re-add store

* fix terraform for blog_scraper

* update sitemap

* update readme
2025-12-14 18:52:53 -07:00


# Blog RSS Scraper
This Lambda function ingests RSS feeds into a Directus CMS instance. It's specifically configured to pull from the "Have I Been Flocked?" RSS feed and sync the posts with your Directus blog collection.
## Features
- **RSS Feed Parsing**: Extracts title, link, pubDate, and description from RSS entries
- **Directus Integration**: Creates, updates, and deletes blog posts via Directus API
- **Idempotent Operation**: Safe to run repeatedly; each run makes only the changes needed to bring Directus in line with the feed
- **Selective Sync**: Only manages RSS-ingested posts (identified by `externalUrl` field)
- **Error Handling**: Logs every operation and continues past failures on individual posts
## Setup
### Environment Variables
Set the following environment variables:
```bash
# Required
DIRECTUS_API_TOKEN=your_directus_api_token_here
# Optional (defaults to https://cms.deflock.me)
DIRECTUS_BASE_URL=https://your-directus-instance.com
```
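As a rough illustration of how these two variables might be read, here is a minimal sketch. The function name `load_config` is hypothetical and not taken from `main.py`; only the variable names, the default URL, and the error message come from this README.

```python
import os

# Default stated in this README; used when DIRECTUS_BASE_URL is unset or empty.
DEFAULT_BASE_URL = "https://cms.deflock.me"


def load_config(env=None):
    """Return (base_url, token), failing fast if the token is missing.

    `load_config` is a hypothetical helper for illustration only.
    """
    env = os.environ if env is None else env
    token = env.get("DIRECTUS_API_TOKEN")
    if not token:
        raise ValueError("DIRECTUS_API_TOKEN environment variable is required")
    base_url = env.get("DIRECTUS_BASE_URL") or DEFAULT_BASE_URL
    # Drop a trailing slash so paths can be appended consistently.
    return base_url.rstrip("/"), token
```

Passing a plain dict instead of `os.environ` makes the helper easy to exercise in tests without touching the real environment.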
### Directus Collection Schema
Your Directus `blog` collection should have the following fields:
- `id` (integer, auto-increment)
- `title` (string, required)
- `description` (text)
- `content` (rich text, optional - RSS posts will have this as null)
- `externalUrl` (string, optional - identifies RSS-ingested posts)
- `published` (datetime)
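To make the field list concrete, the sketch below models an RSS-ingested post as a typed dictionary. The names `BlogPost` and `rss_post_payload` are hypothetical, and sending `published` as an ISO 8601 string is an assumption about how the datetime is serialized. The auto-increment `id` is left out because Directus assigns it.

```python
from datetime import datetime
from typing import Optional, TypedDict


class BlogPost(TypedDict):
    """Hypothetical shape of a `blog` item, mirroring the fields listed above."""

    title: str
    description: Optional[str]
    content: Optional[str]      # null for RSS-ingested posts
    externalUrl: Optional[str]  # set only for RSS-ingested posts
    published: Optional[str]    # assumed ISO 8601 string


def rss_post_payload(title: str, description: str, link: str,
                     published: datetime) -> BlogPost:
    """Build the payload for a post that originates from the RSS feed."""
    return {
        "title": title,
        "description": description,
        "content": None,
        "externalUrl": link,
        "published": published.isoformat(),
    }
```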
### Dependencies
Install dependencies using uv:
```bash
uv sync
```
## Usage
### Local Testing
```bash
uv run main.py
```
### AWS Lambda
Deploy as a Python 3.14 Lambda function. The `lambda_handler` function serves as the entry point.
#### Sample Lambda Event
The function doesn't require any specific event data:
```json
{}
```
#### Sample Response
Success:
```json
{
  "statusCode": 200,
  "body": {
    "message": "RSS synchronization completed successfully",
    "stats": {
      "created": 2,
      "updated": 1,
      "deleted": 0,
      "errors": 0
    }
  }
}
```
Error:
```json
{
  "statusCode": 500,
  "body": {
    "message": "RSS synchronization failed",
    "error": "DIRECTUS_API_TOKEN environment variable is required"
  }
}
```
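The two response shapes above could be produced by a handler along these lines. This is a sketch, not the code in `main.py`: `make_handler` and `run_sync` are hypothetical names, and `run_sync` stands in for whatever performs the actual synchronization.

```python
def make_handler(run_sync):
    """Wrap a sync callable in a Lambda-style handler.

    `run_sync` is a hypothetical zero-argument callable that returns the
    stats dict; it stands in for the real synchronization logic.
    """

    def lambda_handler(event, context):
        # The event is ignored: the function needs no input data.
        try:
            stats = run_sync()
        except Exception as exc:
            return {
                "statusCode": 500,
                "body": {
                    "message": "RSS synchronization failed",
                    "error": str(exc),
                },
            }
        return {
            "statusCode": 200,
            "body": {
                "message": "RSS synchronization completed successfully",
                "stats": stats,
            },
        }

    return lambda_handler
```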
## How It Works
1. **Fetch RSS Feed**: Downloads and parses the RSS feed from `https://haveibeenflocked.com/feed.xml`
2. **Get Existing Posts**: Queries Directus for all blog posts that have an `externalUrl` (these are RSS-managed posts)
3. **Synchronization**:
   - **Create**: New RSS entries that don't exist in Directus
   - **Update**: Existing posts where title or description has changed
   - **Delete**: Directus posts with `externalUrl` that no longer exist in the RSS feed
4. **Preserve Manual Posts**: Posts without an `externalUrl` are left untouched
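The create/update/delete decision in the steps above amounts to a diff keyed on `externalUrl`. The following sketch shows one way to compute that plan; `plan_sync` and its input shapes are assumptions for illustration, not the actual implementation.

```python
def plan_sync(feed_entries, existing_posts):
    """Return (to_create, to_update, to_delete) for one sync run.

    feed_entries:   dicts with "externalUrl", "title", "description"
    existing_posts: Directus items with "id" plus the same keys
    Both shapes are assumptions made for this sketch.
    """
    feed_by_url = {e["externalUrl"]: e for e in feed_entries}
    # Posts without an externalUrl are manual posts and are never touched.
    managed = {p["externalUrl"]: p for p in existing_posts if p.get("externalUrl")}

    to_create = [e for url, e in feed_by_url.items() if url not in managed]

    to_update = []
    for url, post in managed.items():
        entry = feed_by_url.get(url)
        if entry is None:
            continue
        changes = {
            key: entry.get(key)
            for key in ("title", "description")
            if entry.get(key) != post.get(key)
        }
        if changes:
            to_update.append((post["id"], changes))

    to_delete = [p["id"] for url, p in managed.items() if url not in feed_by_url]
    return to_create, to_update, to_delete
```

Because the plan is derived only from the difference between the feed and Directus, running it again after the changes are applied yields three empty lists, which is what makes the sync idempotent.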
## RSS Feed Structure
The scraper expects standard RSS 2.0 format with the following elements:
- `<title>`: Post title
- `<link>`: Post URL (becomes `externalUrl`)
- `<pubDate>`: Publication date (becomes `published`)
- `<description>` or `<content>`: Post description (HTML tags are stripped)
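As an illustration of the mapping above, this sketch parses RSS 2.0 items using only the standard library. The function names are hypothetical, the real scraper may use a different parser, and the regex-based tag stripping is a simplification that suits short descriptions rather than arbitrary HTML.

```python
import re
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime
from html import unescape

TAG_RE = re.compile(r"<[^>]+>")


def strip_html(text):
    """Remove HTML tags and decode entities (a simplified approach)."""
    return unescape(TAG_RE.sub("", text or "")).strip()


def parse_items(rss_xml):
    """Map each <item> to the fields the blog collection uses."""
    root = ET.fromstring(rss_xml)
    entries = []
    for item in root.iter("item"):
        raw_date = item.findtext("pubDate")
        try:
            # RSS 2.0 dates use the RFC 822 format.
            published = parsedate_to_datetime(raw_date) if raw_date else None
        except (TypeError, ValueError):
            # An unparseable date is tolerated rather than fatal.
            published = None
        body = item.findtext("description") or item.findtext("content")
        entries.append({
            "title": (item.findtext("title") or "").strip(),
            "externalUrl": (item.findtext("link") or "").strip(),
            "published": published,
            "description": strip_html(body),
        })
    return entries
```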
## Error Handling
- Invalid dates are logged as warnings but don't stop processing
- Individual post errors are logged and counted but don't stop the entire sync
- HTTP errors from Directus API are logged with full details
- Missing environment variables cause immediate failure with clear error messages
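The "log, count, and continue" behavior for individual posts can be sketched as follows. `apply_all` and the idea of passing operations as zero-argument callables are illustrative assumptions, not the structure of the actual code.

```python
import logging

logger = logging.getLogger(__name__)


def apply_all(operations, stats_key, stats):
    """Run each operation, counting successes and failures separately.

    operations: iterable of zero-argument callables (hypothetical shape)
    stats_key:  "created", "updated", or "deleted"
    """
    for operation in operations:
        try:
            operation()
            stats[stats_key] += 1
        except Exception:
            # Log the traceback and move on to the next post.
            logger.exception("Operation failed while counting %s posts", stats_key)
            stats["errors"] += 1
    return stats
```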
## Logging
The function uses Python's standard `logging` module at the INFO level. Key events logged:
- RSS feed fetch status
- Number of entries parsed
- Create/update/delete operations
- Errors and warnings
- Final synchronization statistics
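A minimal logging setup matching this description might look like the sketch below. The logger name `blog_scraper` and the helper `format_summary` are assumptions. Setting the level on the logger itself matters in Lambda, where the runtime has already attached a handler to the root logger and `logging.basicConfig` therefore has no effect.

```python
import logging

# Configures the root logger for local runs; a no-op if handlers already exist.
logging.basicConfig(level=logging.INFO)

logger = logging.getLogger("blog_scraper")  # hypothetical logger name
logger.setLevel(logging.INFO)


def format_summary(stats):
    """Render the final synchronization statistics as one line."""
    return (
        f"Sync complete: created={stats['created']} updated={stats['updated']} "
        f"deleted={stats['deleted']} errors={stats['errors']}"
    )


def log_summary(stats):
    logger.info(format_summary(stats))
```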
## Security Considerations
- Store the Directus API token securely (AWS Secrets Manager recommended for production)
- Use HTTPS for all API communications (enforced by default)
- The function only modifies posts with `externalUrl` - manual posts are safe
- Consider rate limiting if running frequently
## Deployment
### AWS Lambda Deployment Package
1. Navigate to [the terraform directory](../../terraform/).
2. Set the required variables in a local copy of `terraform.tfvars`.
3. Run `terraform apply`.
### Environment Variables in Lambda
Set in the Lambda function configuration:
- `DIRECTUS_API_TOKEN`: Your Directus API token
- `DIRECTUS_BASE_URL`: Your Directus instance URL (optional)
### Scheduling
The Terraform configuration sets up a CloudWatch Events rule to run this function periodically.
## Troubleshooting
### Common Issues
1. **401 Unauthorized**: Check your `DIRECTUS_API_TOKEN`
2. **404 Not Found**: Verify `DIRECTUS_BASE_URL` and collection name (`blog`)
3. **RSS Parse Errors**: Check if the RSS feed is accessible and valid
4. **Date Parse Failures**: These are logged as warnings and do not stop processing
### Testing Connection
The function will fail fast if it can't connect to Directus, making debugging easier.
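To check connectivity and credentials separately from a full sync, a small probe such as the one below can help. It is a sketch under assumptions: `check_directus` is a hypothetical helper, and it requests a single item from the `blog` collection through the Directus `/items/<collection>` endpoint with a bearer token. The `urlopen` parameter exists only so the call can be replaced in tests.

```python
import urllib.request


def check_directus(base_url, token, timeout=10, urlopen=urllib.request.urlopen):
    """Request one item from the blog collection and return the HTTP status.

    urllib raises HTTPError for 401 and 404 responses, which correspond to
    the first two issues listed under Common Issues.
    """
    request = urllib.request.Request(
        f"{base_url.rstrip('/')}/items/blog?limit=1",
        headers={"Authorization": f"Bearer {token}"},
    )
    with urlopen(request, timeout=timeout) as response:
        return response.status
```

Calling `check_directus(os.environ["DIRECTUS_BASE_URL"], os.environ["DIRECTUS_API_TOKEN"])` from a Python shell should return `200` when the URL, token, and collection name are all correct.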
## Development
### Local Development Setup
```bash
# Clone and navigate to the blog_scraper directory
cd serverless/blog_scraper
# Install dependencies
uv sync
# Set environment variables
export DIRECTUS_API_TOKEN="your_token"
export DIRECTUS_BASE_URL="https://cms.deflock.me"
# Run locally
uv run main.py
```
### Testing with Different RSS Feeds
To test with a different RSS feed, modify the `rss_url` in the `BlogScraper.__init__` method.