# upstage-ocr-dev

## Core role
Implement the **Upstage Document AI OCR engine** for GUARDiA ITSM.
The Upstage API integration core (Document Parse, Information Extraction, Document QA)
lives in `workspace/guardia-itsm/routers/upstage_ocr.py`.

## Implementation scope

### New router: `upstage_ocr.py`

```
Endpoints:
  POST /api/ocr/config                - Set the Upstage API key (stored AES-256-GCM encrypted)
  GET  /api/ocr/config                - Read the configuration (API key masked)
  POST /api/ocr/parse                 - Parse a document (PDF/PNG/JPG → structured JSON)
  POST /api/ocr/extract               - Extract information (key-value, schema-based)
  POST /api/ocr/qa                    - Document QA (document + question → answer)
  POST /api/ocr/batch                 - Batch processing (multiple files)
  GET  /api/ocr/history               - OCR processing history
  GET  /api/ocr/usage                 - API usage status
```
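
`GET /api/ocr/config` returns the stored key in masked form. A minimal sketch of one possible masking rule, assuming the last four characters stay visible. The function name `mask_api_key` and the `visible` parameter are hypothetical and not part of the existing code:

```python
def mask_api_key(api_key: str, visible: int = 4) -> str:
    """Hide all but the last `visible` characters of an API key."""
    if len(api_key) <= visible:
        # Too short to reveal anything safely: mask every character.
        return "*" * len(api_key)
    hidden = len(api_key) - visible
    return "*" * hidden + api_key[hidden:]
```

Masking on read means the decrypted key never has to leave the server, even for tenant administrators.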

### Upstage API integration

```python
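import json
import mimetypes

import httpx

UPSTAGE_BASE = "https://api.upstage.ai/v1/document-ai"

def _mime_type(filename: str) -> str:
    """Guess the upload's MIME type from its extension; fall back to a generic binary type."""
    return mimetypes.guess_type(filename)[0] or "application/octet-stream"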

async def parse_document(api_key: str, file_bytes: bytes,
                          filename: str, model: str = "document-parse") -> dict:
    """
    Call the Upstage Document Parse API.
    Returns: {pages, elements, tables, text, html, ...}
    """
    async with httpx.AsyncClient(timeout=60) as client:
        files = {"document": (filename, file_bytes, _mime_type(filename))}
        headers = {"Authorization": f"Bearer {api_key}"}
        r = await client.post(
            f"{UPSTAGE_BASE}/document-digitization",
            files=files, headers=headers,
            data={"model": model, "ocr": "auto", "output_formats": ["text", "html", "markdown"]}
        )
        return r.json() if r.status_code == 200 else {"error": r.text[:200]}

async def extract_information(api_key: str, file_bytes: bytes,
                               filename: str, schema: dict) -> dict:
    """
    Upstage Information Extraction API.
    Example schema: {"contract_no": "계약번호", "amount": "계약금액", "supplier": "공급사명"}
    """
    async with httpx.AsyncClient(timeout=60) as client:
        files = {"document": (filename, file_bytes, _mime_type(filename))}
        headers = {"Authorization": f"Bearer {api_key}"}
        r = await client.post(
            f"{UPSTAGE_BASE}/information-extraction",
            files=files, headers=headers,
            data={"schema": json.dumps(schema, ensure_ascii=False)}
        )
        return r.json() if r.status_code == 200 else {"error": r.text[:200]}
```
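
For `POST /api/ocr/batch`, calls such as `parse_document` can run concurrently with a cap on in-flight requests. A minimal sketch under stated assumptions: the concurrency limit of 3 is not specified anywhere in this document, `run_batch` is a hypothetical helper, and `fake_parse` is a stand-in for the real network call so that the block runs on its own:

```python
import asyncio
from typing import Awaitable, Callable

# Any coroutine taking (file_bytes, filename) and returning a result dict.
ParseFn = Callable[[bytes, str], Awaitable[dict]]

async def run_batch(files: list[tuple[str, bytes]], parse: ParseFn,
                    max_concurrency: int = 3) -> list[dict]:
    """Parse several files concurrently; results keep the input order."""
    sem = asyncio.Semaphore(max_concurrency)

    async def one(filename: str, data: bytes) -> dict:
        async with sem:
            try:
                result = await parse(data, filename)
            except Exception as exc:  # one bad file must not abort the batch
                result = {"error": str(exc)[:200]}
        return {**result, "filename": filename}

    return await asyncio.gather(*(one(name, data) for name, data in files))

async def fake_parse(data: bytes, filename: str) -> dict:
    """Hypothetical stand-in for parse_document(api_key, data, filename)."""
    await asyncio.sleep(0)
    if not data:
        raise ValueError("empty file")
    return {"text": f"{len(data)} bytes"}

results = asyncio.run(run_batch([("a.pdf", b"abc"), ("b.png", b"")], fake_parse))
```

In the router, `parse` could be a small wrapper that binds the tenant's decrypted key, for example `lambda data, name: parse_document(api_key, data, name)`.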

### DB models

```python
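from sqlalchemy import Boolean, Column, DateTime, ForeignKey, Integer, String, Text, func

# Base is the project's existing declarative base (its import path is not shown here).

class UpstageOCRConfig(Base):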
    __tablename__ = "tb_upstage_ocr_config"
    tenant_id   = Column(Integer, primary_key=True)
    api_key_enc = Column(Text, nullable=False)   # AES-256-GCM encrypted
    model       = Column(String(50), default="document-parse")
    is_active   = Column(Boolean, default=True)
    created_at  = Column(DateTime, default=func.now())

class OCRHistory(Base):
    __tablename__ = "tb_ocr_history"
    id           = Column(Integer, primary_key=True)
    tenant_id    = Column(Integer, nullable=False, index=True)
    filename     = Column(String(300), nullable=False)
    file_size    = Column(Integer, default=0)
    ocr_type     = Column(String(30))  # PARSE | EXTRACT | QA
    schema_used  = Column(Text, nullable=True)  # extraction schema as JSON
    result_json  = Column(Text, nullable=True)  # result summary (not the full response)
    linked_to    = Column(String(50), nullable=True)  # sr | contract | cmdb
    linked_id    = Column(Integer, nullable=True)
    pages        = Column(Integer, default=1)
    tokens_used  = Column(Integer, default=0)
    status       = Column(String(20), default="SUCCESS")
    created_by   = Column(Integer, ForeignKey("tb_user.id"))
    created_at   = Column(DateTime, default=func.now())
```
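
`result_json` stores a summary rather than the full response. A minimal sketch of one way to build it, assuming the `text` and `elements` keys named in the `parse_document` docstring. The function name `summarize_result` and the 500-character preview limit are hypothetical choices for illustration:

```python
import json

def summarize_result(response: dict, preview_chars: int = 500) -> str:
    """Build a compact JSON summary suitable for OCRHistory.result_json."""
    text = response.get("text")
    if not isinstance(text, str):
        text = ""  # tolerate missing or non-string text
    summary = {
        "text_preview": text[:preview_chars],
        "text_length": len(text),
        "element_count": len(response.get("elements") or []),
        "truncated": len(text) > preview_chars,
    }
    # ensure_ascii=False keeps Korean text readable in the stored JSON.
    return json.dumps(summary, ensure_ascii=False)
```

Storing only a preview keeps `tb_ocr_history` rows small and limits how much document content is retained after processing.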

### Supported file formats
- PDF (with or without a text layer)
- PNG, JPG, JPEG, TIFF, BMP
- Maximum file size: 20MB
- Maximum page count: 100 pages
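
These limits can be checked before any Upstage call is made. A minimal sketch, assuming 20MB means 20 × 1024 × 1024 bytes and that the caller already knows the page count; `validate_upload` and the constant names are hypothetical:

```python
from pathlib import Path

ALLOWED_EXTENSIONS = {".pdf", ".png", ".jpg", ".jpeg", ".tiff", ".bmp"}
MAX_FILE_BYTES = 20 * 1024 * 1024  # assumption: 20MB read as 20 MiB
MAX_PAGES = 100

def validate_upload(filename: str, size_bytes: int, pages: int = 1) -> list[str]:
    """Return a list of violations; an empty list means the file is acceptable."""
    errors = []
    ext = Path(filename).suffix.lower()
    if ext not in ALLOWED_EXTENSIONS:
        errors.append(f"unsupported extension: {ext or '(none)'}")
    if size_bytes > MAX_FILE_BYTES:
        errors.append("file exceeds 20 MB")
    if pages > MAX_PAGES:
        errors.append("document exceeds 100 pages")
    return errors
```

Returning every violation at once, rather than raising on the first, lets the batch endpoint report a complete reason per rejected file.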

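### Security principles
1. Store the Upstage API key encrypted with AES-256-GCM
2. Manage a separate API key per tenant
3. For sensitive documents (confidential, personal data): recommend the on-premises `multimodal.py` instead
4. Automatically mask resident registration numbers and bank account numbers in OCR results
5. Track API usage and enforce a daily limit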
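
Principle 4 can be approximated with regular expressions. A rough sketch, not a complete PII detector: it assumes hyphenated resident registration numbers (`YYMMDD-GNNNNNN`) and treats any run of 3 to 4 hyphen-separated digit groups totalling 10 to 14 digits as an account number, so phone numbers in that shape are masked too. The names `mask_pii`, `RRN_RE`, and `ACCOUNT_RE` are hypothetical:

```python
import re

# Resident registration number: YYMMDD-GNNNNNN (hyphenated form only).
RRN_RE = re.compile(r"(?<!\d)(\d{6})-([1-8])\d{6}(?!\d)")
# Assumed account shape: 3-4 hyphen-separated digit groups.
ACCOUNT_RE = re.compile(r"(?<![\d-])\d{2,6}(?:-\d{2,8}){2,3}(?![\d-])")

def _mask_account(match: re.Match) -> str:
    raw = match.group(0)
    digits = sum(ch.isdigit() for ch in raw)
    if not 10 <= digits <= 14:
        return raw  # e.g. dates such as 2024-01-15 are left untouched
    # Keep the first group, mask every later digit, preserve hyphens.
    head, _, tail = raw.partition("-")
    return head + "-" + re.sub(r"\d", "*", tail)

def mask_pii(text: str) -> str:
    """Mask resident registration numbers first, then account-like numbers."""
    text = RRN_RE.sub(r"\1-\2******", text)
    return ACCOUNT_RE.sub(_mask_account, text)
```

Lookarounds are used instead of `\b` because Hangul characters count as word characters, so `\b` would miss a number written directly after Korean text.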

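## Team communication protocol
- **Receives**: "Start OCR engine implementation" from the orchestrator
- **Sends**: `_workspace/ocr_api_spec.md` (API spec)
- **Collaborates**: provides the parse/extract function interface to ocr-workflow-dev
- **Reports**: on completion, documents the supported document formats and the API response structure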
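
The parse/extract interface handed to ocr-workflow-dev could be pinned down as a typed contract. A sketch using `typing.Protocol` that mirrors the signatures of `parse_document` and `extract_information`; wrapping them as methods is an assumption, and the names `OCREngine` and `StubEngine` are hypothetical:

```python
import asyncio
from typing import Protocol, runtime_checkable

@runtime_checkable
class OCREngine(Protocol):
    """Contract that the workflow side can depend on."""
    async def parse_document(self, api_key: str, file_bytes: bytes,
                             filename: str, model: str = "document-parse") -> dict: ...
    async def extract_information(self, api_key: str, file_bytes: bytes,
                                  filename: str, schema: dict) -> dict: ...

class StubEngine:
    """Test double to code against before the real engine is available."""
    async def parse_document(self, api_key, file_bytes, filename, model="document-parse"):
        return {"text": "", "elements": []}

    async def extract_information(self, api_key, file_bytes, filename, schema):
        # Echo the schema keys back with empty values.
        return {key: None for key in schema}

engine = StubEngine()
```

With a stub like this, ocr-workflow-dev can build and test its side in parallel, then swap in the real engine once `ocr_api_spec.md` is published.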