Victor Miti

Engineer

Madhav Manoj

Senior Engineer

Will Heinemann

New Business Director

How to get started with wagtail-pdf-converter

PDFs are everywhere, and they're a problem. They break on mobile, resist screen readers, and most of what sits in a typical document library simply isn't being read. We've written before about why this matters and how we're solving it - including our work with the Financial Reporting Council to convert 7,800+ documents into accessible HTML.

wagtail-pdf-converter is the open-source Wagtail package behind that work. It converts PDFs into accessible, searchable HTML - automatically, using AI. The converted HTML lives alongside the original file. Users can still download the PDF; the HTML version is an addition, not a replacement.

The AI does the heavy lifting. Editors review, correct, and publish.

This guide is for developers and content or digital leads alike. It covers what the package does and how to get moving.

What is wagtail-pdf-converter?

It is an opt-in package, completely separate from Wagtail core. Installing it changes nothing until you configure it.

When a PDF is uploaded, the package works in the background: it extracts images, classifying each as meaningful or decorative, converts every page to structured Markdown via an AI model, and stores the result as clean HTML. No extra steps for editors. No change to how they upload documents. The conversion just happens.

1. Choose your AI provider

wagtail-pdf-converter is model- and provider-agnostic. You are not locked into anything.

It ships with Google Gemini as the default backend, but any provider with an API works: Anthropic, OpenAI, Mistral, or a self-hosted model. If data sovereignty matters to your organisation - common in health, legal, or public sector contexts - you can pick a provider that meets your requirements, including European-hosted or on-premises options.

Worth knowing: HTML pages produce measurably less CO₂ than equivalent PDFs. If your organisation has sustainability commitments, choosing a provider that reports per-query energy use gives you numbers to point at.

The AI call happens once per conversion, not on every page view. For most organisations, converting a full document library costs a fraction of what manual remediation would.

2. Get the package installed

The wagtail-pdf-converter getting started guide covers everything you need. You should be up and running within half an hour or so.

For content or digital leads, this is the moment to loop in your development team or your Wagtail partner. If you work with Torchbox, speak to your client partner and we will help you get set up. If you have an in-house team or work with another agency, the documentation linked above is the right place to point them.

3. What your editors get

Once the package is installed, editors don't need to change how they work.

A conversion status indicator on the document edit page shows whether a document is pending, processing, completed, or failed. A Conversion Metrics panel in the Wagtail admin sidebar gives a breakdown across the whole document library, so content leads can see progress at a glance.

If a conversion fails or the output needs work, editors can hit a Retry Conversion button to re-queue it without involving a developer. Once conversion completes, an Edit Generated Content button opens a Markdown editor where they can fix misread text, adjust headings, or clean up artefacts - without re-running the full conversion.

The original PDF always remains intact; the HTML version can be embedded and displayed on your site.

4. Start with your existing document library

Most organisations are sitting on years of reports, policies, and guidance documents that have never been accessible. Your existing library is the best place to begin.

A single management command queues every eligible document in one step. Run it in dry-run mode first to see what would be affected before committing. Once the first batch finishes, review a sample in the admin editor. The AI handles most documents well; scanned PDFs or heavily formatted reports may need some editing.

5. Going further

Converted content is indexed by Wagtail's search backend automatically - no extra configuration needed. This matters because most CMS search tools don't index PDF content at all.

Set your base template and converted documents will inherit your site's layout. If you need finer control over which documents get converted, a custom query helper class lets you restrict conversion by tag, collection, or any other logic. Separately from Wagtail's internal search, converted documents are served as noindex by default, so external search engines won't list them; you can allow search engine indexing per document when the output is ready.

Implementation guide for developers

This section is aimed at developers getting the package running. For full installation instructions and configuration reference, see the documentation. The package is open source: github.com/torchbox/wagtail-pdf-converter.

Installation

Work through these steps in order:

  • Install the package via pip.
  • Add it to INSTALLED_APPS.
  • Wire up a URL.
  • Apply PDFConversionMixin to your Document model.
  • Run migrations.
  • Configure your AI backend.
  • Start the background worker.

The package needs a custom Wagtail Document model. If you don't have one, the docs walk through it - it's a standard Wagtail pattern.

Converting your existing library

If you're installing into an existing Wagtail site, run update_document_conversion_status first. It inspects your current files, flags which ones are PDFs, and marks them as pending so they're ready to queue.

Then use convert_documents --all to queue everything. Run with --dry-run first to see what would be processed. Large PDFs (over 50 pages) are automatically split into chunks and processed in parallel, so an annual report won't block the queue.
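If you'd rather script this sequence than type it by hand, something like the following could work. It is a minimal sketch, assuming a standard manage.py in the current directory and that --all and --dry-run can be combined in one call. The helper names (backfill_steps, ask_to_continue, run_backfill) are hypothetical and not part of the package.

```python
import subprocess
import sys
from typing import Callable


def backfill_steps() -> list[list[str]]:
    """Build the three manage.py invocations, in the order they should run.

    Assumption: manage.py sits in the current working directory.
    """
    manage = [sys.executable, "manage.py"]
    return [
        # 1. Flag existing PDFs and mark them as pending.
        manage + ["update_document_conversion_status"],
        # 2. Preview what would be queued (assumes the flags combine).
        manage + ["convert_documents", "--all", "--dry-run"],
        # 3. Queue every eligible document for real.
        manage + ["convert_documents", "--all"],
    ]


def ask_to_continue() -> bool:
    """Ask the operator to confirm after reading the dry-run output."""
    answer = input("Queue these documents for conversion? [y/N] ")
    return answer.strip().lower() == "y"


def run_backfill(confirm: Callable[[], bool] = ask_to_continue) -> bool:
    """Run the status update and dry run, then queue only if confirmed."""
    status, preview, convert = backfill_steps()
    subprocess.run(status, check=True)   # stops here if the command fails
    subprocess.run(preview, check=True)
    if not confirm():
        return False
    subprocess.run(convert, check=True)
    return True
```

Calling run_backfill() from a one-off script keeps the dry run and the real run in a single place, with a confirmation step between them.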

Expected conversion times

Short documents like letters, policies, and information sheets typically take 30–60 seconds. Reports in the 30–50 page range take 2–4 minutes. Annual reports and longer documents (60+ pages) can take upwards of 7 minutes. Status is visible in the admin throughout.
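For planning a large backlog, these figures can be turned into a rough estimate. The sketch below is back-of-envelope arithmetic, not a benchmark. It assumes "short" means fewer than 30 pages and that 51-59 page documents fall between the neighbouring bands, neither of which is stated above. It also assumes work divides evenly across workers. The function names are hypothetical.

```python
from typing import Optional


def estimate_seconds(pages: int) -> tuple[int, Optional[int]]:
    """Return (low, high) expected conversion time in seconds.

    high is None where the guide only gives a lower bound.
    """
    if pages < 1:
        raise ValueError("pages must be at least 1")
    if pages < 30:
        return (30, 60)     # short documents: 30-60 seconds (page cut-off assumed)
    if pages <= 50:
        return (120, 240)   # 30-50 pages: 2-4 minutes
    if pages < 60:
        return (240, 420)   # assumption: 51-59 pages sits between the two bands
    return (420, None)      # 60+ pages: upwards of 7 minutes


def backlog_lower_bound_minutes(page_counts: list[int], workers: int = 1) -> float:
    """Sum the lower bounds and divide by the number of workers.

    Assumption: work spreads evenly across workers, which real queues
    only approximate.
    """
    total_seconds = sum(estimate_seconds(pages)[0] for pages in page_counts)
    return total_seconds / 60 / workers
```

For example, backlog_lower_bound_minutes([5, 40, 80]) adds 30, 120, and 420 seconds and returns 9.5 minutes for a single worker. Treat the result as a floor, since long documents have no stated upper bound.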

Configuration worth knowing about

  • AUTO_CONVERT_PDFS: Enable this and conversion triggers automatically when a PDF is saved. No extra steps after uploading.
  • BASE_TEMPLATE: Set this to your project's base template and converted documents will inherit your site's layout.
  • allow_indexing: Converted documents default to noindex. Set allow_indexing = True per document when the output is ready for search engines.
  • Custom query helpers: A custom query helper class lets you restrict conversion by tag, collection, or any other logic.
  • Custom backends: To use a provider not natively supported, subclass the base backend and point your settings at it.
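To make the options above more concrete, here is a rough sketch of a settings block and a custom backend. AUTO_CONVERT_PDFS and BASE_TEMPLATE are the names used in this guide; everything else is an assumption for illustration: the WAGTAIL_PDF_CONVERTER dict, the "BACKEND" key, the myproject.pdf_backends path, and the base class and its convert_page method. Check the documentation for the real names before copying any of it.

```python
from abc import ABC, abstractmethod


class BaseConversionBackend(ABC):
    """Hypothetical stand-in for the package's base backend class.

    The real class name, import path, and method signature may differ.
    """

    @abstractmethod
    def convert_page(self, page_bytes: bytes) -> str:
        """Return structured Markdown for one PDF page."""


class SelfHostedBackend(BaseConversionBackend):
    """Example subclass for a provider without built-in support."""

    def convert_page(self, page_bytes: bytes) -> str:
        # Delegate to the provider, then tidy surrounding whitespace.
        return self.call_model(page_bytes).strip()

    def call_model(self, page_bytes: bytes) -> str:
        # Replace with a real request to your provider's API.
        raise NotImplementedError("wire this up to your model endpoint")


# settings.py sketch. The dict name and the "BACKEND" key are assumptions;
# only AUTO_CONVERT_PDFS and BASE_TEMPLATE are named in this guide.
WAGTAIL_PDF_CONVERTER = {
    "AUTO_CONVERT_PDFS": True,       # convert automatically when a PDF is saved
    "BASE_TEMPLATE": "base.html",    # converted documents inherit this layout
    "BACKEND": "myproject.pdf_backends.SelfHostedBackend",  # hypothetical path
}
```

Keeping the provider call in one overridable method means swapping providers later touches a single class rather than the conversion flow.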

Maintenance

Run cleanup_stuck_conversions to mark documents as failed if they've been stuck in processing past a configurable timeout. Worth running hourly in production via cron or a Procfile.
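The idea behind the cleanup is simple enough to sketch. The code below is not the package's implementation, only an illustration of the rule described above. It uses plain Python objects instead of Django models, and the Conversion class, its started_at field, and the fail_stuck function are hypothetical names. The four status values are the ones shown on the document edit page.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from enum import Enum
from typing import Optional


class Status(str, Enum):
    PENDING = "pending"
    PROCESSING = "processing"
    COMPLETED = "completed"
    FAILED = "failed"


@dataclass
class Conversion:
    """Hypothetical record of one document's conversion state."""
    document_id: int
    status: Status
    started_at: Optional[datetime] = None


def fail_stuck(
    conversions: list[Conversion],
    timeout: timedelta,
    now: Optional[datetime] = None,
) -> list[int]:
    """Mark conversions stuck in processing past the timeout as failed.

    Returns the ids of the documents that were changed.
    """
    now = now or datetime.now(timezone.utc)
    failed_ids = []
    for conversion in conversions:
        if conversion.status is not Status.PROCESSING:
            continue  # only in-flight conversions can be stuck
        if conversion.started_at is None:
            continue  # no start time recorded, so age is unknown
        if now - conversion.started_at > timeout:
            conversion.status = Status.FAILED
            failed_ids.append(conversion.document_id)
    return failed_ids
```

Once a document is marked as failed, editors can re-queue it with the Retry Conversion button, so a stuck job never needs a developer to clear it.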

Ready to get started?

Visit the wagtail-pdf-converter documentation for the full installation guide and configuration reference.

The package is open source: github.com/torchbox/wagtail-pdf-converter.

For Torchbox clients who want help getting this set up, speak to your client partner. If you're new to Torchbox, we'd love to hear from you.