2 min read

pdfquery

Table of Contents

pdfquery

pdfquery is a lightweight, CLI-based tool for embedding and querying PDFs using OpenAI and FAISS. Ideal for researchers, engineers, and anyone looking to extract structured information from unstructured PDF content.

It chunks and embeds documents using OpenAI’s latest embedding models, stores them with FAISS, and lets you ask questions via the GPT-4o API.

πŸš€ Deploy your own

πŸ“‹ Features

  • βœ… CLI tool with Typer
  • βœ… Chunking with overlap for context preservation
  • βœ… Embeddings via OpenAI’s text-embedding-3-small
  • βœ… Querying with gpt-4o or gpt-4o-mini
  • βœ… Dry-run support for inspecting chunks
  • βœ… Dockerfile + GitHub workflows
  • βœ… Makefile for consistent development
  • βœ… Full unit test coverage
  • βœ… Markdown-based usage documentation

⚑ How it works

  1. pdfquery index --source document.pdf --name my-index
  2. pdfquery query --name my-index "What are the key ideas?"

It returns a GPT-4o powered response based only on your document content.

πŸ’» Commands

CommandDescription
make venvCreate a local Python virtual environment
make installInstall app in editable mode with dev deps
make testRun all unit tests via pytest
make lintRun linting with ruff
make formatAuto-format code using black
make docker-buildBuild Docker image locally
make cleanRemove virtualenv and build artifacts

πŸ›οΈ License

MIT