[EN] Setting up Prometheus and Thanos to use S3 as storage backend

tl;dr — Prometheus is great at scraping and querying, but it was never designed to be your long-term storage. Once my TSDB grew to 311 GB and the disk hit 87%, I needed a way out that didn't mean buying a bigger and bigger volume forever. The fix: put Thanos in front of Prometheus and ship the blocks to S3. Here's how I did it, including the two permission gotchas that bit me.

Background

I run a self-hosted observability stack — a single Prometheus, Grafana, and a handful of exporters — all on one VM with Docker Compose. Prometheus stores its time-series database (TSDB) on the local disk (EBS), with 90 days of retention. For a long time this was perfectly fine.

The thing is, a single Prometheus is a vertically-scaled box. It keeps everything on local disk, and the only knobs you have are "bigger disk" and "shorter retention". That works until it doesn't.

The Problem

One day I looked at the disk and it was 87% full. The TSDB alone was 311 GB, and growing. That's a real problem for two reasons:

First, a full TSDB disk doesn't just throw a warning — Prometheus stops ingesting, and your whole observability stack goes blind right when you might need it most.

Second, the only native answer is to keep growing the EBS volume, which is expensive per GB and has a hard ceiling. I didn't want to be the person resizing this volume every quarter.

The deeper issue is architectural: Prometheus isn't designed for long-term storage, and it doesn't scale horizontally. It's built to be a reliable, simple, local-first scraper. Durable, cheap, long-term retention is explicitly not its job. And here's the catch I learned quickly — vanilla Prometheus cannot store its TSDB on S3. The --storage.tsdb.path flag needs a real filesystem; there's no object-storage backend. So "just point it at a bucket" isn't a thing.

The Solution

This is exactly the gap Thanos fills. Thanos sits on top of an existing Prometheus and gives it the two things it lacks: cheap, unbounded long-term storage (in object storage like S3), and a path to scale and query globally. Prometheus keeps doing what it's good at — scraping and holding a short, hot window of data locally — while Thanos takes over everything long-term.

A quick tour of the components I deployed:

Sidecar — runs next to Prometheus, uploads each sealed 2h TSDB block to S3, and serves the recent (not-yet-uploaded) data over gRPC.
Store Gateway — serves the historical blocks straight from S3. It needs almost no local disk.
Compactor — compacts, downsamples (raw → 5m → 1h), and enforces retention inside the bucket. Must run as a singleton.
Query — the single PromQL endpoint Grafana talks to. It fans out to the sidecar (recent) and store gateway (historical) and stitches them into one seamless view.
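
As a rough Compose sketch of how these four pieces could be wired together (not a copy of my exact file): the service names, the /etc/thanos/objstore.yml mount path, the compactor retention values, and the 10904 host port are assumptions for illustration. The image tag, the sidecar user, and the host directories match the ones used later in this post.

```yaml
services:
  thanos-sidecar:
    image: thanosio/thanos:v0.38.0
    user: "65534:65534"                # same uid as Prometheus (see the gotchas section)
    command:
      - sidecar
      - --tsdb.path=/prometheus                    # the TSDB dir Prometheus writes to
      - --prometheus.url=http://prometheus:9090    # assumed Prometheus service name
      - --objstore.config-file=/etc/thanos/objstore.yml
      - --grpc-address=0.0.0.0:10901
      - --http-address=0.0.0.0:10902
    volumes:
      - ./prometheus-data:/prometheus
      - ./objstore.yml:/etc/thanos/objstore.yml:ro

  thanos-store:
    image: thanosio/thanos:v0.38.0
    command:
      - store
      - --data-dir=/var/thanos/store               # small local cache, not the data itself
      - --objstore.config-file=/etc/thanos/objstore.yml
      - --grpc-address=0.0.0.0:10901
      - --http-address=0.0.0.0:10902
    volumes:
      - ./thanos/store:/var/thanos/store
      - ./objstore.yml:/etc/thanos/objstore.yml:ro

  thanos-compact:                      # run exactly one of these per bucket
    image: thanosio/thanos:v0.38.0
    command:
      - compact
      - --wait                                     # keep running instead of exiting after one pass
      - --data-dir=/var/thanos/compact
      - --objstore.config-file=/etc/thanos/objstore.yml
      - --retention.resolution-raw=30d             # hypothetical retention values,
      - --retention.resolution-5m=90d              # pick your own
      - --retention.resolution-1h=365d
    volumes:
      - ./thanos/compact:/var/thanos/compact
      - ./objstore.yml:/etc/thanos/objstore.yml:ro

  thanos-query:
    image: thanosio/thanos:v0.38.0
    command:
      - query
      - --http-address=0.0.0.0:10902
      - --grpc-address=0.0.0.0:10901
      - --endpoint=thanos-sidecar:10901            # recent data
      - --endpoint=thanos-store:10901              # historical data from S3
    ports:
      - "10904:10902"                              # assumed host port for the Query HTTP API
```

Only Query needs a published port here; the other components talk to each other over the Compose network on gRPC port 10901.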

Object storage config

The bucket config goes in an objstore.yml. Because my VM has an IAM instance role attached, I don't hardcode any keys — aws_sdk_auth: true tells Thanos to use the AWS SDK credential chain, which picks up the role automatically:

type: S3
config:
  bucket: "your-prometheus-bucket"
  endpoint: "s3.<region>.amazonaws.com"
  region: "<region>"
  aws_sdk_auth: true

💡 Prefer an IAM instance role over static access keys. Fewer secrets to leak, nothing to rotate. And please review the IAM policy against your own security standards — don't blindly copy-paste. Mine grants only ListBucket on the bucket and GetObject/PutObject/DeleteObject on its objects. Nothing more.
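
For reference, here is roughly what that policy looks like when written as a CloudFormation fragment. CloudFormation is an assumption on my part for the example (the same statements work in a plain JSON policy), and InstanceRoleName is a hypothetical parameter standing in for whatever role is attached to your VM.

```yaml
# Sketch only: grants the four S3 actions described above, nothing else.
AWSTemplateFormatVersion: "2010-09-09"
Parameters:
  InstanceRoleName:                    # hypothetical parameter
    Type: String
    Description: Name of the IAM role attached to the VM's instance profile
Resources:
  ThanosBucketAccess:
    Type: AWS::IAM::ManagedPolicy
    Properties:
      Roles:
        - !Ref InstanceRoleName
      PolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Sid: ListBucket
            Effect: Allow
            Action:
              - s3:ListBucket
            Resource: arn:aws:s3:::your-prometheus-bucket      # the bucket itself
          - Sid: ReadWriteObjects
            Effect: Allow
            Action:
              - s3:GetObject
              - s3:PutObject
              - s3:DeleteObject
            Resource: arn:aws:s3:::your-prometheus-bucket/*    # objects inside it
```

Note the two different Resource ARNs: ListBucket applies to the bucket, while the object actions apply to the keys under it.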

The one Prometheus change that matters

For the sidecar to upload safely, you must disable Prometheus' local compaction. The Thanos docs are blunt about it: set --storage.tsdb.min-block-duration and --storage.tsdb.max-block-duration to the same value, otherwise the blocks the sidecar uploads risk being corrupted once the Thanos compactor starts working on them.

# in the prometheus service command:
- '--storage.tsdb.retention.time=2d'      # keep only a short hot window locally
- '--storage.tsdb.min-block-duration=2h'  # equal min == max disables local compaction
- '--storage.tsdb.max-block-duration=2h'

Why 2h and not, say, 1h? Because 2h is both Prometheus' native block cadence and the value Thanos officially recommends; it's the compactor's base unit. A 1h block only buys you a slightly shorter "not yet in S3" window, at the cost of roughly twice the objects and twice the S3 requests (24 blocks a day instead of 12). Not worth it. Stay on 2h.

Also note the retention drop to 2d. Once Thanos owns the long-term data in S3, Prometheus only needs to hold a small hot window. That's what actually reclaims your disk.
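
One companion setting worth checking while you are in the Prometheus config: the Thanos sidecar expects Prometheus to carry external labels that uniquely identify the instance, and it refuses to start if none are set. A minimal sketch of that part of prometheus.yml, where the label names and values are placeholders to adapt:

```yaml
# prometheus.yml (excerpt)
global:
  external_labels:
    cluster: "observability-vm"   # hypothetical value
    replica: "prometheus-0"       # hypothetical value
```

These labels end up in every uploaded block's metadata, so it is easier to choose them before the first block ships than to change them afterwards.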

The gotchas (so you don't lose an hour like I did)

The first one: the Prometheus and Thanos containers bind-mount host directories and run as non-root users, but the host dirs were owned by root. So everything crash-looped with permission errors until I fixed ownership:

# prom/prometheus runs as uid 65534 (nobody); thanos runs as uid 1001
sudo chown -R 65534:65534 prometheus-data
sudo chown -R 1001:1001  thanos/store thanos/compact

The second one was sneakier. The sidecar needs to write thanos.shipper.json into Prometheus' TSDB directory — a directory now owned by uid 65534. But the sidecar defaults to uid 1001, so it couldn't write there. The fix is to run the sidecar as the same user as Prometheus:

thanos-sidecar:
  image: thanosio/thanos:v0.38.0
  user: "65534:65534"   # match prometheus so the shipper can write into the shared TSDB dir
  # ...

Verifying it actually works

After docker compose up -d, I confirmed the full path end-to-end. First, that Query sees both stores:

curl -s http://localhost:10904/api/v1/stores
# both "sidecar" and "store" endpoints should appear, with lastError: null

Then, that blocks are actually landing in S3 (the sidecar logs "upload new block", and the object shows up in the bucket with chunks/, index, and meta.json), and that the store gateway loads them back ("loaded new block"). When a PromQL up query through Query returned live data while the bucket filled with blocks, I knew the pipeline was real.
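
With Query answering, the remaining step is pointing Grafana at it instead of at Prometheus directly. A minimal sketch using Grafana's datasource provisioning, assuming Grafana runs on the same Compose network; the file path, the datasource name, and the thanos-query service name are assumptions for illustration:

```yaml
# grafana/provisioning/datasources/thanos.yml (assumed path)
apiVersion: 1
datasources:
  - name: Thanos
    type: prometheus                  # Thanos Query speaks the Prometheus HTTP API
    access: proxy
    url: http://thanos-query:10902    # assumed service name, Thanos default HTTP port
    isDefault: true
```

Existing dashboards keep working unchanged, since the queries are still plain PromQL.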

What changed, in one line

Prometheus keeps a short hot window locally; the sidecar ships sealed 2h blocks to S3; the store gateway reads old blocks back; Query merges hot and cold for Grafana; and the compactor tidies, downsamples, and ages out data inside the bucket. Cheaper per GB, no capacity ceiling, and history that outlives any single disk.

Up next: migrating the existing 311 GB of on-disk blocks into S3 with the sidecar's --shipper.upload-compacted flag — the backfill story has its own set of "do this before that" ordering rules, so it deserves its own post.
