Redis connection issues

At my current client, we have been dealing with an ongoing problem while scaling their cloud systems to an ever-increasing customer base. As with any software that needs to scale, we’ve been finding and solving scalability problems along the way. Anyone who has scaled a system to accommodate a lot of concurrent users knows that issues show up in places you aren’t expecting. While scaling the system, we’ve made it more resilient and fault tolerant, and we’ve learned to mitigate many of the problems as they arise. We’ve improved logging and metrics, so we know exactly what is happening and can see stability problems before they affect end users.

Redis loses connection, and will not reconnect

In our system, we use StackExchange.Redis (v1.2.6) to communicate with our Redis server. The Redis instance handles all cached data and all communication between services. It is a vital part of our architecture, which currently depends heavily on this connection.

The odd thing is that we keep losing the connection to it, even though Redis is not even breaking a sweat. StackExchange.Redis is supposed to be able to recover from outages, and in most cases it does. But every once in a blue moon, it doesn’t. It thinks it has recovered, and it thinks it is connected, but in reality all subsequent calls fail, and the pub/sub connection is completely dead until the application is restarted. We have multiple services, and it will always just be one of them that goes down, every...