A couple of days ago, I learned the hard way that using the wrong endpoint in AWS ElastiCache for Redis can take your app down. I hadn’t paid enough attention to the difference between the Primary Endpoint, the Reader Endpoint, and the Node Endpoint. Here’s what happened.

Current setup

I have a Redis OSS instance on AWS ElastiCache configured like this:

Cluster mode: Disabled (single shard)
Nodes: 1
Auto-failover: Disabled
Multi-AZ: Disabled

In short, it’s basically a standalone Redis instance, which is enough for my application. For some reason (my laziness + “it’s working so why change it” + not reading the documentation), my app was connecting directly to the Node Endpoint instead of the Primary Endpoint.

A small change started the issue

I changed a few settings, including turning on Encryption in Transit with the mode set to Preferred, and then checked whether I could still connect to Redis (somehow I could still connect using the old DNS name). Seemed simple enough, right? It wasn’t.

The realization

What I didn’t realize is that changing this setting forces AWS to change the Node Endpoint’s DNS name. The AWS documentation clearly says:

Unlike the primary endpoint, node endpoints resolve to specific endpoints. If you make a change in your cluster, such as adding or deleting a replica, you must update the node endpoints in your application. There is a difference depending upon whether or not In-Transit encryption is enabled.

In-transit encryption not enabled

clusterName.xxxxxx.nodeId.regionAndAz.cache.amazonaws.com:port

example: redis-01.7abc2d.0001.usw2.cache.amazonaws.com:6379

In-transit encryption enabled

master.clusterName.xxxxxx.regionAndAz.cache.amazonaws.com:port

example: master.ncit.ameaqx.use1.cache.amazonaws.com:6379
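To make the difference between the two formats concrete, here is a minimal Python sketch that guesses which kind of endpoint a hostname is. The regular expressions are my own assumption, inferred only from the two examples quoted above (a four-digit node ID in the node endpoint, a `master.` prefix in the other one). They are not an official AWS grammar, and `classify_endpoint` is a hypothetical helper name.

```python
import re

# Assumed patterns, derived only from the two documentation examples above.
NODE_RE = re.compile(
    r"^[a-z0-9-]+\.[a-z0-9]+\.\d{4}\.[a-z0-9]+\.cache\.amazonaws\.com$"
)
PRIMARY_RE = re.compile(
    r"^master\.[a-z0-9-]+\.[a-z0-9]+\.[a-z0-9]+\.cache\.amazonaws\.com$"
)


def classify_endpoint(address: str) -> str:
    """Return 'primary', 'node', or 'unknown' for a host or host:port string."""
    host = address.rsplit(":", 1)[0].lower()  # drop the optional :port suffix
    if PRIMARY_RE.match(host):
        return "primary"
    if NODE_RE.match(host):
        return "node"
    return "unknown"


if __name__ == "__main__":
    print(classify_endpoint("redis-01.7abc2d.0001.usw2.cache.amazonaws.com:6379"))
    print(classify_endpoint("master.ncit.ameaqx.use1.cache.amazonaws.com:6379"))
```

A check like this could run at startup and log a warning when the configured address looks like a node endpoint, which is exactly the kind of address that can change underneath you.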

The result was predictable. My app was still trying to connect to the old node hostname, but Redis was no longer reachable at that address, and the app went down.
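In hindsight, a small check after the change could have caught this before the app did. The sketch below only asks whether the configured hostname still resolves. It uses the standard library's `socket.getaddrinfo`; the function name `endpoint_resolves` and the injectable `resolver` parameter are my own additions for illustration and testing.

```python
import socket


def endpoint_resolves(host, port=6379, resolver=socket.getaddrinfo):
    """Return the sorted IP addresses for host, or [] if the name no longer resolves."""
    try:
        infos = resolver(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror:
        # The DNS name is gone or cannot be looked up.
        return []
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})


if __name__ == "__main__":
    # Hypothetical hostname taken from the documentation example above.
    host = "redis-01.7abc2d.0001.usw2.cache.amazonaws.com"
    addresses = endpoint_resolves(host)
    print(addresses if addresses else f"{host} does not resolve anymore")
```

Note that a name that resolves is not proof that Redis is listening there, so this is only a first sanity check, not a health check.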

Which endpoints to use with Valkey or Redis OSS?

The AWS documentation says:

For a standalone node, use the node's endpoint for both read and write operations.

AFAIK, your Redis instance is considered standalone if there is no replica, cluster mode is disabled, and encryption in transit is disabled (correct me if I’m wrong). In that case, there is only a reader endpoint, so you’re better off connecting directly to the node endpoint.

For Valkey or Redis OSS (cluster mode disabled) clusters, use the Primary Endpoint for all write operations. Use the Reader Endpoint to evenly split incoming connections to the endpoint between all read replicas. Use the individual Node Endpoints for read operations (In the API/CLI these are referred to as Read Endpoints).

If you enable encryption in transit, AWS adds a Primary Endpoint. AWS suggests using the Primary Endpoint for write operations, and either the Reader Endpoint or the individual Node Endpoints for read operations. Personally, I prefer connecting only to the Primary Endpoint for both write and read operations.

For Valkey or Redis OSS (cluster mode enabled) clusters, use the cluster's Configuration Endpoint for all operations that support cluster mode enabled commands. You must use a client that supports either Valkey Cluster, or Redis OSS Cluster on Redis OSS 3.2 and above. You can still read from individual node endpoints (In the API/CLI these are referred to as Read Endpoints).

I haven’t tried this yet, but I believe you can figure it out yourself.
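The three rules quoted in this section can be summarized as a small decision function. This is only a sketch of how I read the documentation: the `Endpoints` dataclass, the `pick_endpoint` name, and the fallback to the Primary Endpoint for reads when no Reader Endpoint is set are my own assumptions, not anything AWS provides.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Endpoints:
    """Addresses copied from the ElastiCache console; unset ones stay None."""
    node: Optional[str] = None
    primary: Optional[str] = None
    reader: Optional[str] = None
    configuration: Optional[str] = None


def pick_endpoint(endpoints, operation, cluster_mode_enabled=False):
    """Pick an address for 'read' or 'write' following the quoted guidance."""
    if operation not in ("read", "write"):
        raise ValueError("operation must be 'read' or 'write'")
    if cluster_mode_enabled:
        chosen = endpoints.configuration   # cluster mode enabled: everything
    elif endpoints.primary is None:
        chosen = endpoints.node            # standalone node: reads and writes
    elif operation == "write":
        chosen = endpoints.primary         # cluster mode disabled: writes
    else:
        # Reads: Reader Endpoint if there is one, else fall back to primary.
        chosen = endpoints.reader or endpoints.primary
    if chosen is None:
        raise LookupError("no suitable endpoint configured")
    return chosen
```

With my preference of using only the Primary Endpoint, I would simply leave `reader` unset so that reads and writes both go to `primary`.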

Lesson learned

Make your client resilient: retry, reconnect, and respect DNS TTLs.
Test thoroughly and pay attention to the details before you change anything in production.
Plan disruptive changes (like enabling TLS) with a maintenance window.
Always read the official documentation.
Connect to the appropriate endpoint for your use case. I prefer using the Primary Endpoint for both write and read operations. This can mean that read operations are not distributed evenly across replicas, but I don’t have any replicas, and cluster mode is disabled.
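For the first lesson, here is a rough idea of what "retry and reconnect" can look like, using only the Python standard library. It always connects by hostname, so every attempt goes through name resolution again instead of reusing an IP address the app looked up earlier. The function name, the defaults, and the injectable `connect` and `sleep` parameters are assumptions for illustration. A real Redis client library usually has its own retry settings, which I would check first.

```python
import socket
import time


def connect_with_retry(host, port=6379, attempts=5, base_delay=0.5,
                       connect=socket.create_connection, sleep=time.sleep):
    """Open a TCP connection, retrying with exponential backoff on failure."""
    last_error = None
    for attempt in range(attempts):
        try:
            # Passing the hostname (not a cached IP) triggers a fresh lookup.
            return connect((host, port), timeout=3)
        except OSError as exc:  # includes DNS failures, refusals, and timeouts
            last_error = exc
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)  # 0.5s, 1s, 2s, ...
    raise ConnectionError(
        f"could not reach {host}:{port} after {attempts} attempts"
    ) from last_error
```

Retries would not have saved me from a hostname that is gone for good, but they do help during the short window when an endpoint is being moved.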
