
Paperless-ngx on Kubernetes — from scanner to searchable archive

Rico Twesten-Weber, Principal DevOps Engineer
Tags: homelab, paperless-ngx, kubernetes, self-hosting

I used to have a drawer. Tax documents, insurance letters, receipts for things I’d definitely need to return someday. The drawer became a box. The box became two boxes. Finding anything meant fifteen minutes of shuffling paper and mild regret.

Now I have a scanner on my desk and a pipeline that turns paper into searchable, tagged, classified documents without any manual steps after the scan. The whole thing runs on my K3s cluster, backed by NAS storage, and managed through GitOps. Here’s exactly how it works.

The pipeline

The flow is straightforward:

  1. I put a document in my Brother ADS-1700W network scanner and press scan.
  2. The scanner pushes a PDF to a Samba share on my Synology NAS, into a folder called consume.
  3. Paperless-ngx watches that folder. When a new file appears, it picks it up.
  4. Paperless runs OCR (Tesseract) to extract text from the scanned image.
  5. Classification rules match the content against patterns I’ve defined, things like “Stadtwerke” for utility bills or “Finanzamt” for tax documents.
  6. The document gets tagged, dated, and filed into the archive. The original PDF is stored alongside the OCR’d version.

From scanner to searchable archive, it takes about two to three minutes depending on page count. I don’t touch anything after pressing the scan button.

The Kubernetes setup

Paperless-ngx needs a few components: the application itself, a PostgreSQL database for metadata, and a Redis instance for the task queue. Each runs as a separate deployment in a dedicated namespace.
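As a rough sketch of how the three pieces are wired together, the Paperless container can be pointed at the other two through environment variables. PAPERLESS_REDIS, PAPERLESS_DBHOST, PAPERLESS_DBUSER and PAPERLESS_DBPASS are documented Paperless-ngx settings; the Service names, the user name and the Secret name below are assumptions for illustration, not values from my cluster.

```yaml
# Container env fragment for the Paperless Deployment (illustrative sketch).
# Service and Secret names are hypothetical.
env:
  - name: PAPERLESS_REDIS
    value: redis://paperless-redis:6379      # hypothetical Redis Service name
  - name: PAPERLESS_DBHOST
    value: paperless-postgres                # hypothetical PostgreSQL Service name
  - name: PAPERLESS_DBUSER
    value: paperless                         # hypothetical database user
  - name: PAPERLESS_DBPASS
    valueFrom:
      secretKeyRef:
        name: paperless-db                   # hypothetical Secret
        key: password
```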

The Paperless deployment mounts two PersistentVolumeClaims: one for the consume directory (where the scanner drops files) and one for the media directory (where processed documents live). Both are backed by Samba shares on the Synology NAS.
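A minimal sketch of those two mounts, assuming the official image's default paths under /usr/src/paperless. The claim names are made up for the example, and the image tag stands in for whatever image is actually deployed.

```yaml
# Fragment of the Paperless Deployment (illustrative sketch, not a complete manifest).
spec:
  template:
    spec:
      containers:
        - name: paperless
          image: ghcr.io/paperless-ngx/paperless-ngx:latest   # placeholder image
          volumeMounts:
            - name: consume
              mountPath: /usr/src/paperless/consume   # scanner drops files here
            - name: media
              mountPath: /usr/src/paperless/media     # processed documents live here
      volumes:
        - name: consume
          persistentVolumeClaim:
            claimName: paperless-consume              # hypothetical claim name
        - name: media
          persistentVolumeClaim:
            claimName: paperless-media                # hypothetical claim name
```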

PostgreSQL runs with a local NVMe-backed PersistentVolume on a specific node. Database workloads and NAS-backed storage don’t mix well. The write latency over Samba is too inconsistent for a database that cares about fsync. I pin the PostgreSQL pod to one node using a nodeSelector and accept that it’s not highly available. For a homelab document archive, that’s fine.
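For illustration, a local PersistentVolume plus the matching nodeSelector might look roughly like this. The node name, host path, size and storage class are all assumptions; only the overall shape (a local volume with required node affinity, and a pod pinned to the same node) reflects the setup described above.

```yaml
# Local PersistentVolume on one node's NVMe disk (illustrative sketch).
apiVersion: v1
kind: PersistentVolume
metadata:
  name: paperless-postgres-local        # hypothetical name
spec:
  capacity:
    storage: 10Gi                       # hypothetical size
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: local-nvme          # hypothetical storage class
  local:
    path: /mnt/nvme/paperless-postgres  # hypothetical path on the node
  nodeAffinity:                         # required for local volumes
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values:
                - node-1                # hypothetical node name
---
# Fragment of the PostgreSQL pod template: pin the pod to the same node.
spec:
  nodeSelector:
    kubernetes.io/hostname: node-1      # hypothetical node name
```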

Redis uses an emptyDir volume. It’s just a task queue. If the pod restarts, pending tasks get re-queued from the database. No persistent storage needed.
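In manifest form that is only a few lines. This sketch assumes the official redis image, which keeps its data under /data; the image tag is an arbitrary choice for the example.

```yaml
# Pod-spec fragment for Redis with throwaway storage (illustrative sketch).
spec:
  containers:
    - name: redis
      image: redis:7          # assumed image and tag
      volumeMounts:
        - name: data
          mountPath: /data    # data directory of the official redis image
  volumes:
    - name: data
      emptyDir: {}            # wiped whenever the pod is rescheduled
```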

The tricky parts

Samba-backed PersistentVolumes on Kubernetes are where I spent most of my debugging time. The default CIFS mount options don’t work well for Paperless-ngx. Here’s what I learned:

Mount options matter more than you’d expect. I had to set uid=1000,gid=1000 to match the Paperless container’s user, file_mode=0644,dir_mode=0755 for correct permissions, and vers=3.0 because SMB1 is disabled on the Synology (as it should be). Without these, Paperless would either fail to write or create files that it couldn’t read back.
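Here is a sketch of where those options go, assuming the share is mounted through the Kubernetes SMB CSI driver (smb.csi.k8s.io). That driver is an assumption for the example, as are the NAS hostname, share path, size, storage class and Secret name. The mountOptions list itself carries exactly the values mentioned above.

```yaml
# PersistentVolume for the consume share (illustrative sketch, SMB CSI driver assumed).
apiVersion: v1
kind: PersistentVolume
metadata:
  name: paperless-consume
spec:
  capacity:
    storage: 10Gi                          # hypothetical; not enforced for SMB shares
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  storageClassName: smb                    # hypothetical; must match the PVC
  mountOptions:
    - uid=1000                             # match the Paperless container user
    - gid=1000
    - file_mode=0644
    - dir_mode=0755
    - vers=3.0                             # SMB1 is disabled on the NAS
  csi:
    driver: smb.csi.k8s.io
    volumeHandle: paperless-consume        # must be unique per volume in the cluster
    volumeAttributes:
      source: //nas.example.lan/paperless/consume   # hypothetical share path
    nodeStageSecretRef:
      name: smb-credentials                # hypothetical Secret with username/password keys
      namespace: paperless
```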

By default, Paperless-ngx watches the consume directory with inotify, and inotify does not report changes that another client, here the scanner, makes on a CIFS mount. Paperless-ngx has a polling mode that checks the directory on an interval instead. Set PAPERLESS_CONSUMER_POLLING to an interval in seconds; I use 30. That adds a slight delay between the scan completing and Paperless picking it up, but it’s reliable.
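In the deployment this is a single environment variable. The only detail worth a sketch is the quoting: Kubernetes expects env values to be strings, so the number has to be quoted.

```yaml
# Container env fragment: poll the consume folder every 30 seconds.
env:
  - name: PAPERLESS_CONSUMER_POLLING
    value: "30"    # quoted, because env values must be strings
```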

OCR language packs are the difference between “this works” and “this works in German.” The default Tesseract installation only includes English. For German documents, you need deu and possibly deu_frak for older documents with Fraktur typefaces. I bake the language packs into a custom container image and select them in the deployment environment variables with PAPERLESS_OCR_LANGUAGE. Note the singular: PAPERLESS_OCR_LANGUAGES, with a trailing S, is a separate option of the official Docker image that installs extra packs at container start.
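A hedged example of the language selection: PAPERLESS_OCR_LANGUAGE takes Tesseract language codes joined with a plus sign. The combination below (German first, English as a fallback) is an assumption for illustration, and every code listed must already be installed in the image.

```yaml
# Container env fragment: languages Tesseract should use for OCR (illustrative sketch).
env:
  - name: PAPERLESS_OCR_LANGUAGE
    value: deu+eng    # assumed combination; each pack must exist in the image
```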

Classification rules took a few iterations to get right. Paperless-ngx can automatically assign tags, correspondents, and document types by matching a document’s content against the rules you define. I started with broad rules and refined them as I scanned more documents. With the automatic matching mode, the system also gets better over time because it learns from your manual corrections.

GitOps integration

The entire Paperless stack is defined in a single HelmRelease custom resource. The Helm chart packages the deployment, services, PVCs, and ConfigMap. All values, including Samba credentials (stored as a SealedSecret), mount paths, OCR settings, and resource limits, are defined in the HelmRelease values.
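For readers who have not used Sealed Secrets, this is roughly what the committed resource looks like. The name, namespace and key names are assumptions, and the encrypted values are placeholders for the ciphertext that kubeseal produces. Only the controller in the cluster can decrypt them, which is what makes the file safe to keep in Git.

```yaml
# SealedSecret holding the Samba credentials (illustrative sketch, placeholder ciphertext).
apiVersion: bitnami.com/v1alpha1
kind: SealedSecret
metadata:
  name: smb-credentials          # hypothetical name
  namespace: paperless           # hypothetical namespace
spec:
  encryptedData:
    username: "<ciphertext produced by kubeseal>"
    password: "<ciphertext produced by kubeseal>"
  template:
    metadata:
      name: smb-credentials      # name of the Secret the controller creates
      namespace: paperless
```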

FluxCD reconciles this every five minutes. If I need to change a setting, I update the values in Git, push, and FluxCD applies the change. If something breaks, I revert the commit. The cluster state always matches what’s in the repo.
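A trimmed-down sketch of such a HelmRelease, with the five-minute interval from above. The chart path, the GitRepository name and every key under values are hypothetical, since the value keys depend entirely on the chart. The apiVersion shown is the GA one from recent Flux releases; older installations use a beta version of the same API.

```yaml
# HelmRelease for the Paperless stack (illustrative sketch, hypothetical chart and values).
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: paperless
  namespace: paperless
spec:
  interval: 5m                        # reconcile every five minutes
  chart:
    spec:
      chart: ./charts/paperless       # hypothetical chart path inside the repo
      sourceRef:
        kind: GitRepository
        name: homelab                 # hypothetical Flux source
        namespace: flux-system
  values:                             # keys below are made up for the example
    consumer:
      pollingSeconds: 30
    ocr:
      language: deu+eng
    smb:
      existingSecret: smb-credentials
    resources:
      limits:
        memory: 2Gi
```

Changing a setting then means editing one of these values and pushing the commit; reverting the commit rolls it back.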

This is the part that makes self-hosting sustainable. Without GitOps, every configuration change is a manual kubectl edit or helm upgrade that I’ll forget about in three months. With GitOps, the repo is the source of truth. When my cluster had to be rebuilt after a power supply failure, the Paperless stack came back up automatically once FluxCD reconciled. I didn’t have to remember a single setting.

The daily workflow

My actual day-to-day with Paperless-ngx is almost boring, which is exactly the point. A letter arrives, I open it, scan it, and put the paper in a “to shred” pile. Within a few minutes, the document appears in Paperless, already tagged and classified.

When I need to find something, I search. Full-text search across every document I’ve scanned in the past two years. “Mietvertrag” finds my rental contract. “Rechnung Vodafone 2025” finds every Vodafone invoice from last year. The search is fast because it queries a full-text index built from the OCR’d text instead of scanning the files themselves.

I access Paperless through a web UI exposed via an Ingress on my lab VLAN. It’s not exposed to the internet. For remote access, I VPN into my home network. The UI is functional, nothing fancy, but it handles viewing, downloading, bulk editing, and manual tag corrections.
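An Ingress for this could be sketched as follows. The hostname, Service name and ingress class are assumptions (K3s ships Traefik by default, which is why it appears here); port 8000 is the port the Paperless-ngx container listens on by default.

```yaml
# Ingress for the Paperless web UI (illustrative sketch, hypothetical host and Service).
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: paperless
  namespace: paperless
spec:
  ingressClassName: traefik            # assumed ingress controller
  rules:
    - host: paperless.lab.example      # hypothetical internal hostname
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: paperless        # hypothetical Service name
                port:
                  number: 8000         # default Paperless-ngx web port
```

When the UI sits behind a hostname like this, setting PAPERLESS_URL to the external URL is usually needed as well, so that logins are not rejected by the CSRF checks.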

Was it worth the effort?

Setting up the pipeline took about a weekend. Getting classification rules dialed in took another week of incremental tweaks as I scanned my backlog. Total effort: maybe 15-20 hours spread over two weeks.

Self-hosting document management sounds like overkill until you actually need to find a receipt from 2023 for a warranty claim, or your landlord asks for proof of a payment from eighteen months ago. Before Paperless, that meant digging through boxes. Now it takes ten seconds.

The best part is confidence. Every document I’ve scanned is searchable and backed up to an offsite location through the Synology. If the cluster dies, I rebuild it from Git. If the NAS dies, the offsite backup has everything. The drawer is empty. The boxes are recycled. And I haven’t lost a document since.

Rico Twesten-Weber

Principal DevOps Engineer. I build platforms that run themselves, and write about DevOps and AI.

© 2026 Rico Twesten-Weber