[EN] Lesson learned from using the wrong AWS ElastiCache Redis endpoint

I started my IT career in technical support at an Indonesian web hosting provider, then progressed through various roles as a Linux SysAdmin, Network Engineer, Product Designer, and DevOps Engineer. I later moved to a SaaS company, and since then I’ve built hands-on experience mainly with AWS and GCP and work daily with popular cloud native tools.

A couple of days ago, I learned the hard way that using the wrong endpoint in AWS ElastiCache for Redis can take your app down. I didn’t pay enough attention to the difference between the Primary Endpoint, the Reader Endpoint, and the individual Node Endpoints. Here’s what happened.

Current setup

I have a Redis OSS instance on AWS ElastiCache configured like this:

  • Cluster mode: Disabled (single shard)

  • Nodes: 1

  • Auto-failover: Disabled

  • Multi-AZ: Disabled

In short, it’s basically a standalone Redis instance, which is enough for my application. For some reason (my laziness + “it’s working so why change it” + not reading the documentation), my app was connecting directly to the Node Endpoint instead of the Primary Endpoint.

A small change started the issue

I changed a few settings, including turning on Encryption in Transit and setting it to Preferred. Afterwards I checked whether I could still connect to Redis, and somehow I could, even through the old DNS name. Seemed simple enough, right? It wasn’t.

The realization

What I didn’t realize is that changing this setting makes AWS change the Node Endpoint’s DNS name. The AWS documentation says so clearly:

Unlike the primary endpoint, node endpoints resolve to specific endpoints. If you make a change in your cluster, such as adding or deleting a replica, you must update the node endpoints in your application. There is a difference depending upon whether or not In-Transit encryption is enabled.

In-transit encryption not enabled

clusterName.xxxxxx.nodeId.regionAndAz.cache.amazonaws.com:port

example: redis-01.7abc2d.0001.usw2.cache.amazonaws.com:6379

In-transit encryption enabled

master.clusterName.xxxxxx.regionAndAz.cache.amazonaws.com:port

example: master.ncit.ameaqx.use1.cache.amazonaws.com:6379
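To make the difference between the two formats concrete, here is a minimal sketch that checks which of the two quoted hostname shapes a configured endpoint matches. The function name `classify_endpoint_format` and the regular expressions are my own assumptions, derived only from the two examples above; AWS may use other shapes (for example for reader endpoints), and those simply come back as `unknown`.

```python
import re

# Patterns inferred from the two documentation examples only (an assumption,
# not an official AWS grammar).
_LABEL = r"[a-z0-9-]+"
_SUFFIX = r"\.cache\.amazonaws\.com$"

# clusterName.xxxxxx.nodeId.regionAndAz.cache.amazonaws.com
ENCRYPTION_DISABLED = re.compile(
    rf"^{_LABEL}\.[a-z0-9]+\.\d{{4}}\.[a-z0-9]+{_SUFFIX}"
)
# master.clusterName.xxxxxx.regionAndAz.cache.amazonaws.com
ENCRYPTION_ENABLED = re.compile(
    rf"^master\.{_LABEL}\.[a-z0-9]+\.[a-z0-9]+{_SUFFIX}"
)


def classify_endpoint_format(endpoint: str) -> str:
    """Return which quoted hostname format an endpoint string matches."""
    # Drop an optional ":port" suffix and normalise case.
    host = endpoint.split(":", 1)[0].lower()
    if ENCRYPTION_ENABLED.match(host):
        return "encryption-enabled"
    if ENCRYPTION_DISABLED.match(host):
        return "encryption-disabled"
    return "unknown"


if __name__ == "__main__":
    for ep in (
        "redis-01.7abc2d.0001.usw2.cache.amazonaws.com:6379",
        "master.ncit.ameaqx.use1.cache.amazonaws.com:6379",
    ):
        print(ep, "->", classify_endpoint_format(ep))
```

A check like this could run at application startup to warn when the configured hostname no longer matches the format you expect for the current encryption setting.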

The result was predictable. My app kept trying to connect to the old node hostname, but Redis was no longer at that address, and the app went down.
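A simple pre-flight check would have caught this before the app did. The sketch below only tests whether a TCP connection to the configured host and port can be opened; it does not speak the Redis protocol or TLS. The function name `endpoint_reachable` and the injectable `connect` parameter are hypothetical, added so the check can be exercised without a real cluster.

```python
import socket


def endpoint_reachable(host, port, timeout=2.0, connect=socket.create_connection):
    """Return True if a TCP connection to (host, port) can be opened.

    DNS failures (socket.gaierror), refused connections, and timeouts are all
    subclasses of OSError, so one except clause covers them.
    """
    try:
        conn = connect((host, port), timeout)
    except OSError:
        return False
    conn.close()
    return True


if __name__ == "__main__":
    # Hostname taken from the documentation example above, not a real cluster.
    host = "redis-01.7abc2d.0001.usw2.cache.amazonaws.com"
    print(host, "reachable:", endpoint_reachable(host, 6379))
```

Running this against the hostname in your app config right after a cluster change gives a quick signal that the endpoint your app uses is still the one Redis answers on.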

Which endpoints to use with Valkey or Redis OSS?

The AWS documentation says:

For a standalone node, use the node's endpoint for both read and write operations.

AFAIK, your Redis instance is considered standalone if it has no replicas, cluster mode is disabled, and encryption in transit is disabled (correct me if I’m wrong). In that case there is only a reader endpoint, so you’re better off connecting directly to the node endpoint.


For Valkey or Redis OSS (cluster mode disabled) clusters, use the Primary Endpoint for all write operations. Use the Reader Endpoint to evenly split incoming connections to the endpoint between all read replicas. Use the individual Node Endpoints for read operations (In the API/CLI these are referred to as Read Endpoints).

If you enable Encryption in Transit, AWS adds a Primary Endpoint. AWS suggests using the Primary Endpoint for write operations, and either the Reader Endpoint or the individual Node Endpoints for read operations. Personally, I prefer connecting only to the Primary Endpoint for both writes and reads.


For Valkey or Redis OSS (cluster mode enabled) clusters, use the cluster's Configuration Endpoint for all operations that support cluster mode enabled commands. You must use a client that supports either Valkey Cluster, or Redis OSS Cluster on Redis OSS 3.2 and above. You can still read from individual node endpoints (In the API/CLI these are referred to as Read Endpoints).

I haven’t tried this one yet, but the documentation should be enough to figure it out.
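The three cases quoted in this section can be summarised as a small lookup. This is only a sketch of the guidance as I read it: the names `EndpointChoice` and `choose_endpoints` are hypothetical, and the returned values are labels for the kind of endpoint, not real hostnames.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EndpointChoice:
    writes: str  # kind of endpoint to use for write operations
    reads: str   # kind of endpoint to use for read operations


def choose_endpoints(cluster_mode_enabled: bool, standalone: bool) -> EndpointChoice:
    """Map a cluster topology to the endpoint kinds suggested in the quotes above."""
    if cluster_mode_enabled:
        # Cluster mode enabled: Configuration Endpoint for all operations.
        return EndpointChoice(writes="configuration", reads="configuration")
    if standalone:
        # Standalone node: the node's endpoint for both reads and writes.
        return EndpointChoice(writes="node", reads="node")
    # Cluster mode disabled: Primary Endpoint for writes, Reader Endpoint
    # to spread reads across replicas.
    return EndpointChoice(writes="primary", reads="reader")


if __name__ == "__main__":
    print(choose_endpoints(cluster_mode_enabled=False, standalone=True))
    print(choose_endpoints(cluster_mode_enabled=False, standalone=False))
    print(choose_endpoints(cluster_mode_enabled=True, standalone=False))
```

My own preference (Primary Endpoint for both reads and writes) is a deliberate deviation from the middle case, which works because I have no replicas to spread reads across.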

Lessons learned

  • Make your client resilient: retry, reconnect, and respect DNS TTLs.

  • Test thoroughly and pay attention to the details before you change production for real.

  • Plan disruptive changes (like enabling TLS) with a maintenance window.

  • Always read the official documentation.

  • Connect to the appropriate endpoint for your use case. I prefer using the Primary Endpoint for both write and read operations. This means reads are not distributed across replicas, but I don’t have any replicas, and cluster mode is disabled.
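As an illustration of the first point, here is a minimal retry-with-backoff sketch. It is not tied to any particular Redis client: `call_with_retry` and its parameters are hypothetical names, and `operation` stands in for whatever opens a connection and runs a command in your app. If `operation` opens a fresh connection by hostname on each call, every attempt also triggers a new DNS lookup, subject to any caching in your resolver or runtime.

```python
import time


def call_with_retry(operation, attempts=5, base_delay=0.1, max_delay=2.0,
                    sleep=time.sleep, retry_on=(OSError,)):
    """Call operation(), retrying with exponential backoff on connection errors.

    ConnectionError, socket.gaierror, and socket timeouts are all OSError
    subclasses, so the default retry_on covers typical network failures.
    """
    delay = base_delay
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the last error to the caller
            sleep(delay)
            delay = min(delay * 2, max_delay)


if __name__ == "__main__":
    calls = {"n": 0}

    def flaky_ping():
        # Stand-in for "connect and PING": fails twice, then succeeds.
        calls["n"] += 1
        if calls["n"] < 3:
            raise ConnectionError("endpoint not reachable yet")
        return "PONG"

    print(call_with_retry(flaky_ping))
```

Retries alone would not have saved me here, because the hostname itself was wrong, but they do cover the short blips that happen during planned changes.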
