pdfquery is a lightweight, CLI-based tool for embedding and querying PDFs using OpenAI and FAISS. It is aimed at researchers, engineers, and anyone who needs to extract structured information from unstructured PDF content.
It chunks and embeds documents using OpenAI's `text-embedding-3-small` embedding model, stores the vectors in a FAISS index, and lets you ask questions via the GPT-4o API.
Features
- CLI tool with Typer
- Chunking with overlap for context preservation
- Embeddings via OpenAI's `text-embedding-3-small`
- Querying with `gpt-4o` or `gpt-4o-mini`
- Dry-run support for inspecting chunks
- Dockerfile + GitHub workflows
- Makefile for consistent development
- Full unit test coverage
- Markdown-based usage documentation
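
To illustrate what "chunking with overlap" means, here is a minimal sketch. It is not pdfquery's actual implementation: the function name `chunk_text`, the character-based windows, and the default sizes are all assumptions made for illustration (the real tool may split on tokens or sentences and use different sizes).

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into fixed-size character windows that share `overlap` characters.

    Hypothetical helper: names and defaults are illustrative, not pdfquery's API.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


if __name__ == "__main__":
    # Rough equivalent of a dry run: print the chunks instead of embedding them.
    for i, chunk in enumerate(chunk_text("abcdefghij", chunk_size=4, overlap=2)):
        print(i, repr(chunk))
```

Because consecutive chunks share their boundary text, a sentence that straddles a chunk boundary still appears whole in at least one chunk as long as it is shorter than the overlap.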
How it works
pdfquery index --source document.pdf --name my-index
pdfquery query --name my-index "What are the key ideas?"
The `query` command returns a GPT-4o-powered answer based only on your document's content.
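
Conceptually, a query retrieves the chunks most similar to the question and passes only those to the model. The sketch below shows that idea using only the standard library. It is not pdfquery's code: a brute-force cosine search stands in for the FAISS index, toy 2-D vectors stand in for real embeddings, and the function names and prompt wording are hypothetical.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_chunks(query_vec, chunk_vecs, chunks, k=3):
    """Return the k chunks whose vectors are most similar to the query vector.

    Brute-force stand-in for a FAISS nearest-neighbour search (assumption).
    """
    scored = sorted(
        zip(chunks, chunk_vecs),
        key=lambda pair: cosine_similarity(query_vec, pair[1]),
        reverse=True,
    )
    return [chunk for chunk, _ in scored[:k]]


def build_prompt(question: str, context_chunks: list[str]) -> str:
    """Assemble a prompt that restricts the model to the retrieved context.

    The wording is illustrative; the README does not show the real prompt.
    """
    context = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(context_chunks))
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )


if __name__ == "__main__":
    chunks = ["cats purr", "dogs bark", "fish swim"]
    vectors = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]  # toy embeddings
    best = top_k_chunks([1.0, 0.1], vectors, chunks, k=2)
    print(build_prompt("What purrs?", best))
```

In the real tool the resulting prompt would be sent to `gpt-4o` or `gpt-4o-mini`; that call is omitted here because it needs an API key and network access.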
Commands
| Command | Description |
|---|---|
| `make venv` | Create a local Python virtual environment |
| `make install` | Install app in editable mode with dev deps |
| `make test` | Run all unit tests via pytest |
| `make lint` | Run linting with ruff |
| `make format` | Auto-format code using black |
| `make docker-build` | Build Docker image locally |
| `make clean` | Remove virtualenv and build artifacts |
License
MIT