diff --git a/.github/workflows/deploy.yml b/.github/workflows/deploy.yml
index a84b1b2..8b44d6e 100644
--- a/.github/workflows/deploy.yml
+++ b/.github/workflows/deploy.yml
@@ -3,6 +3,8 @@ name: hufs-notice-crawler-cicd
 on:
   push:
     branches: ["main"]
+    paths-ignore:
+      - "**/*.md"
 jobs:
   build_push_deploy:

diff --git a/CLAUDE.md b/CLAUDE.md
new file mode 100644
index 0000000..58af718
--- /dev/null
+++ b/CLAUDE.md
@@ -0,0 +1,70 @@

# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Commands

**Run tests:**
```bash
python -m pytest
```

**Run a single test:**
```bash
python -m pytest tests/test_service.py::test_crawl_service_bootstrap_saves_posts_without_returning_them
```

**Run the app locally:**
```bash
uvicorn app.main:app --host 0.0.0.0 --port 8000
```

**Docker build:**
```bash
docker build -t your-dockerhub-id/hufs-notice-crawler:latest .
```

**Setup (first time):**
```bash
python -m venv .venv && source .venv/bin/activate && pip install -r requirements.txt
```

## Architecture

FastAPI web service that crawls three HUFS Computer Science Department notice boards and returns only the posts that are new since the last crawl. State is persisted in PostgreSQL.

**Request flow:** n8n (scheduler) → `POST /api/v1/crawl` → `CrawlService` → `HufsCrawler` → PostgreSQL

**Layer responsibilities:**
- `app/crawler.py` — HTTP + BeautifulSoup scraping. No DB access. Returns raw `PostStub` and `PostDetail` objects. Handles URL encoding to the user-facing `subview.do?enc=...` format.
- `app/service.py` — Orchestration. Compares scraped `article_id`s against the DB to find new posts, fetches details only for the new ones, persists results, and handles bootstrap mode.
- `app/main.py` — FastAPI entrypoint. Two routes: `GET /health`, `POST /api/v1/crawl`. Auto-creates tables on startup via lifespan.
- `app/models.py` / `app/db.py` — SQLAlchemy ORM + session management.
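The orchestration in `app/service.py` can be sketched roughly as follows. This is a minimal illustration, not the actual implementation: method names (`list_posts`, `known_ids`, `save`) and the in-memory fakes are assumptions, loosely mirroring the `FakeCrawler` pattern used in the tests.

```python
from dataclasses import dataclass


@dataclass
class PostStub:
    article_id: str
    title: str


class CrawlService:
    """Illustrative orchestration: compare scraped IDs against the DB."""

    def __init__(self, crawler, repo):
        self.crawler = crawler  # HufsCrawler-like: list_posts(board) -> list[PostStub]
        self.repo = repo        # hypothetical DB gateway: known_ids() / save(stubs)

    def crawl(self, boards):
        known = self.repo.known_ids()
        bootstrap = not known  # empty scraped_posts table => first run
        scraped = [stub for board in boards for stub in self.crawler.list_posts(board)]
        new = [stub for stub in scraped if stub.article_id not in known]
        # The real service fetches PostDetail only for the new stubs before persisting.
        self.repo.save(new)
        # Bootstrap mode: persist everything found but report nothing, so the
        # first run doesn't flood Discord/n8n notifications with old posts.
        return [] if bootstrap else new


# Minimal in-memory stand-ins, analogous to the FakeCrawler stub in the tests:
class FakeCrawler:
    def __init__(self, posts):
        self.posts = posts

    def list_posts(self, board):
        return self.posts


class FakeRepo:
    def __init__(self):
        self.ids, self.saved = set(), []

    def known_ids(self):
        return set(self.ids)

    def save(self, stubs):
        self.saved.extend(stubs)
        self.ids.update(s.article_id for s in stubs)
```

With the fakes, the first crawl returns `[]` while still persisting everything; a second crawl returns only posts whose `article_id` was not yet stored.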
+ +**Bootstrap mode:** On first run (empty `scraped_posts` table), the service saves all found posts but returns `new_posts: []` to prevent flooding Discord/n8n notifications with old posts. Subsequent runs return only genuinely new posts. + +**Three boards crawled:** +| Key | Name | Board ID | +|-----|------|----------| +| `notice` | 공지사항 | 1926 | +| `archive` | 자료실 | 1927 | +| `jobs` | 취업정보 | 1929 | + +## Tests + +Tests use an in-memory SQLite DB (`conftest.py`) and a `FakeCrawler` stub — no real HTTP calls or PostgreSQL required. + +- `test_api.py` — endpoint shape/status tests (service is mocked) +- `test_service.py` — new-post detection logic, bootstrap mode, zero-new-posts path + +## CI/CD + +GitHub Actions (`.github/workflows/deploy.yml`) triggers on push to `main`: +1. SSH into Gitea, clone repo +2. Build and push Docker image to DockerHub (tagged `latest` + optional `[x.y.z]` version from commit message) +3. Deploy via `docker compose -p nkeys-apps -f /nkeysworld/compose.apps.yml pull hufs-notice-crawler` +4. Notify Discord via webhook + +Required secrets: `NKEY_SSH_PRIVATE_KEY`, `DOCKERHUB_USERNAME`, `DOCKERHUB_TOKEN`, `DISCORD_WEBHOOK` + +The app runs on an internal Docker network (`nkeysworld-network`) with no exposed ports — n8n calls it as `http://hufs-notice-crawler:8000`.
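A caller on the internal network (n8n, or any service on `nkeysworld-network`) can trigger a crawl and read the documented `new_posts` field. A rough sketch — the helper names and timeout are illustrative, and any response fields beyond `new_posts` are assumptions:

```python
import json
from urllib.request import Request, urlopen

# Internal Docker DNS name; the service exposes no host ports.
CRAWL_URL = "http://hufs-notice-crawler:8000/api/v1/crawl"


def extract_new_posts(payload: dict) -> list:
    """Pull the new_posts list from a crawl response; bootstrap runs yield []."""
    return payload.get("new_posts", [])


def trigger_crawl(url: str = CRAWL_URL) -> list:
    """POST to the crawl endpoint and return any newly discovered posts."""
    with urlopen(Request(url, method="POST"), timeout=30) as resp:
        return extract_new_posts(json.load(resp))
```

Because bootstrap runs return `new_posts: []`, a caller can treat every response uniformly: an empty list always means "nothing to notify about".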