A Hands-On Case Study of Intermittent Redis Connection Failures

(Editor: jimmy  Date: 2025/1/31)

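Preface

This article walks through a case of intermittent Redis connection failures, shared here for reference and study. Without further ado, here are the details.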

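【Authors】

Zhang Yanjun: veteran DBA at the Ctrip Technical Assurance Center, with a strong interest in database architecture and in analyzing and troubleshooting difficult problems.

Shou Xiangchen: senior DBA at the Ctrip Technical Assurance Center, mainly responsible for operating Ctrip's Redis and database services, with considerable hands-on experience in automated operations, process standardization, and monitoring and troubleshooting. He enjoys digging deep into problems and improving the team's operational efficiency.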

【Problem Description】

CRedis.Client.RExceptions.ExcuteCommandException: Unable to Connect redis server: ---> CRedis.Third.Redis.RedisException: Unable to Connect redis server:
 at CRedis.Third.Redis.RedisNativeClient.CreateConnectionError()
 at CRedis.Third.Redis.RedisNativeClient.SendExpectData(Byte[][] cmdWithBinaryArgs)
 at CRedis.Client.Entities.RedisServer.<>c__DisplayClassd`1.

【Problem Analysis】

At the same time, the counters on the Redis server showed dropped connection requests. Between the two samples below, taken one second apart, the count rose by 345539 - 344683 = 856.

Sat Apr 7 10:41:40 CST 2018
 1699 outgoing packets dropped
 92 dropped because of missing route
 344683 SYNs to LISTEN sockets dropped
 344683 times the listen queue of a socket overflowed

Sat Apr 7 10:41:41 CST 2018
 1699 outgoing packets dropped
 92 dropped because of missing route
 345539 SYNs to LISTEN sockets dropped
 345539 times the listen queue of a socket overflowed
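The arithmetic above can be reproduced from the two samples. The following is a minimal C sketch, not part of the original investigation; the helper names `leading_counter` and `counter_delta` are hypothetical, and it assumes every counter line begins with its numeric value, as in the output shown.

```c
#include <assert.h>
#include <stdlib.h>

/* Parse the leading counter from a line such as
 * " 344683 times the listen queue of a socket overflowed".
 * Returns -1 if the line does not start with a number.
 * (Hypothetical helper, for illustration only.) */
long leading_counter(const char *line) {
    char *end;
    long value = strtol(line, &end, 10); /* strtol skips leading spaces */
    return (end == line) ? -1 : value;
}

/* Difference between the same counter in two samples,
 * or -1 if either line cannot be parsed. */
long counter_delta(const char *earlier, const char *later) {
    long a = leading_counter(earlier);
    long b = leading_counter(later);
    if (a < 0 || b < 0) return -1;
    return b - a;
}
```

Called with the two "listen queue of a socket overflowed" lines, `counter_delta` gives 856, the number of overflows in that one-second interval.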
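
【About backlog overflow】

In BSD-derived kernel TCP implementations, establishing a connection on the server side involves two queues: the SYN queue and the accept queue. The former is the half-open (half-connection) queue; an entry is added when the server receives the client's SYN. (A common network attack keeps sending SYNs without ever sending the ACK, which fills the server's half-open queue until the server refuses service.) The latter is the full-connection queue. The server replies with (SYN,ACK), and once the client's ACK arrives (at this point the client considers the connection established and starts sending PSH packets), the server moves the connection from the SYN queue to the accept queue, provided the accept queue is not full.

If the accept queue has overflowed at that moment, the server's behavior depends on configuration. With tcp_abort_on_overflow set to 0 (the default), the server simply drops the PSH packets the client sends; the client then starts retransmitting, and after a while the server resends (SYN,ACK), restarting from the second step of the handshake. With tcp_abort_on_overflow set to 1, the server sends a reset as soon as it finds the accept queue full.

Searching the capture with wireshark showed more than 2000 connection attempts to the Redis server within a single second. We tried raising the tcp backlog from 511 to 2048, but the problem was not resolved. Tweaks of this kind therefore cannot fix the problem at its root.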
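One thing worth checking when a larger backlog appears to have no effect: on Linux, the backlog passed to listen() is silently capped at net.core.somaxconn. The source does not say whether that limit was raised, so the sketch below only illustrates the cap; `read_int_setting` and `effective_backlog` are hypothetical helper names.

```c
#include <assert.h>
#include <stdio.h>

/* Read an integer kernel setting, e.g. /proc/sys/net/core/somaxconn.
 * Returns -1 if the file cannot be opened or parsed.
 * (Hypothetical helper, for illustration only.) */
int read_int_setting(const char *path) {
    FILE *f = fopen(path, "r");
    int value = -1;
    if (f == NULL) return -1;
    if (fscanf(f, "%d", &value) != 1) value = -1;
    fclose(f);
    return value;
}

/* The accept queue length actually in effect is the smaller of the
 * requested backlog and somaxconn. A negative somaxconn means the
 * limit is unknown, in which case the request is returned unchanged. */
int effective_backlog(int requested, int somaxconn) {
    if (somaxconn < 0) return requested;
    return requested < somaxconn ? requested : somaxconn;
}
```

For example, `effective_backlog(2048, read_int_setting("/proc/sys/net/core/somaxconn"))` shows what a requested backlog of 2048 would really become on a given host.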

【Network Packet Analysis】

We used wireshark to pinpoint the exact time and cause of the network congestion. Since we already knew exactly when the errors occurred, we first used editcap to split the oversized capture file into 30-second slices, then used the wireshark I/O graph at a 100 ms interval to find the exact moment the network stalled.

【Further Analysis】

To find out what the Redis server was doing during those 1.43 seconds, we captured stacks with pstack. pstack is essentially a gdb attach, so sampling at high frequency hurts Redis throughput. We simply ran it in an endless loop, once every 0.5 seconds, and at a moment when redis-server was stuck we captured the following stack (irrelevant frames filtered out):

Thu May 31 11:29:18 CST 2018
Thread 1 (Thread 0x7ff2db6de720 (LWP 8378)):
#0 0x000000000048cec4 in ...

clientsCron(server.c):
#define CLIENTS_CRON_MIN_ITERATIONS 5
void clientsCron(void) {
 /* Make sure to process at least numclients/server.hz of clients
  * per call. Since this function is called server.hz times per second
  * we are sure that in the worst case we process all the clients in 1
  * second. */
 int numclients = listLength(server.clients);
 int iterations = numclients/server.hz;
 mstime_t now = mstime();

 /* Process at least a few clients while we are at it, even if we need
  * to process less than CLIENTS_CRON_MIN_ITERATIONS to meet our contract
  * of processing each client once per second. */
 if (iterations < CLIENTS_CRON_MIN_ITERATIONS)
  iterations = (numclients < CLIENTS_CRON_MIN_ITERATIONS) ?
      numclients : CLIENTS_CRON_MIN_ITERATIONS;
 /* ... */
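The sizing rule in the excerpt above can be isolated into a small function to see how many clients a single clientsCron() call visits. This is a minimal sketch written for illustration, not Redis code; `clients_per_cron_call` is a hypothetical name, and the minimum of 5 is taken from the CLIENTS_CRON_MIN_ITERATIONS value shown in the excerpt.

```c
#include <assert.h>

#define MIN_ITERATIONS 5 /* mirrors CLIENTS_CRON_MIN_ITERATIONS above */

/* Number of clients examined in one cron call: numclients/hz, but never
 * fewer than MIN_ITERATIONS unless there are fewer clients than that.
 * (Hypothetical helper, for illustration only.) */
int clients_per_cron_call(int numclients, int hz) {
    int iterations;
    if (numclients <= 0 || hz <= 0) return 0;
    iterations = numclients / hz;
    if (iterations < MIN_ITERATIONS)
        iterations = (numclients < MIN_ITERATIONS) ? numclients
                                                   : MIN_ITERATIONS;
    return iterations;
}
```

With 5000 connections and hz set to 10, each call walks 500 clients, all within one pass of the single-threaded event loop.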

clientsCronResizeQueryBuffer(server.c):

/* The client query buffer is an sds.c string that can end with a lot of
 * free space not used, this function reclaims space if needed.
 *
 * The function always returns 0 as it never terminates the client. */
int clientsCronResizeQueryBuffer(client *c) {
 size_t querybuf_size = sdsAllocSize(c->querybuf);
 time_t idletime = server.unixtime - c->lastinteraction;

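 /* The query buffer is resized only in the following two cases:
  * 1) Query buffer > BIG_ARG (defined in server.h as
  *    #define PROTO_MBULK_BIG_ARG (1024*32)) and the buffer is more
  *    than twice the client's recent peak usage.
  * 2) The client has been idle for more than 2s and the buffer size
  *    is greater than 1k. */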
 if (((querybuf_size > PROTO_MBULK_BIG_ARG) &&
   (querybuf_size/(c->querybuf_peak+1)) > 2) ||
   (querybuf_size > 1024 && idletime > 2))
 {
  /* Only resize the query buffer if it is actually wasting space. */
  if (sdsavail(c->querybuf) > 1024) {
   c->querybuf = sdsRemoveFreeSpace(c->querybuf);
  }
 }
 /* Reset the peak again to capture the peak memory usage in the next
  * cycle. */
 c->querybuf_peak = 0;
 return 0;
}


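If a redisClient object's query buffer meets the conditions, it is resized on the spot. Qualifying connections fall into two groups: those whose buffer is genuinely large, more than twice the peak the client used recently, and those that are idle (idle > 2). Both groups must also satisfy one further condition: the free part of the buffer exceeds 1k. So redis-server stalled because there happened to be 50 connections, large or idle and each with more than 1k free, that were resized in the same loop. Since Redis is a single-threaded program, this blocked the clients. The fix is then clear: make resize happen less often, or make it run faster.
Since the problem lies in the query buffer, let us first look at where it is modified:


createClient(networking.c):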
redisClient *createClient(int fd) {
 redisClient *c = zmalloc(sizeof(redisClient));

 /* passing -1 as fd it is possible to create a non connected client.
  * This is useful since all the Redis commands needs to be executed
  * in the context of a client. When commands are executed in other
  * contexts (for instance a Lua script) we need a non connected client. */
 if (fd != -1) {
  anetNonBlock(NULL,fd);
  anetEnableTcpNoDelay(NULL,fd);
  if (server.tcpkeepalive)
   anetKeepAlive(NULL,fd,server.tcpkeepalive);
  if (aeCreateFileEvent(server.el,fd,AE_READABLE,
   readQueryFromClient, c) == AE_ERR)
  {
   close(fd);
   zfree(c);
   return NULL;
  }
 }

 selectDb(c,0);
 c->id = server.next_client_id++;
 c->fd = fd;
 c->name = NULL;
 c->bufpos = 0;
 c->querybuf = sdsempty(); /* initialized with length 0 */

readQueryFromClient(networking.c):
void readQueryFromClient(aeEventLoop *el, int fd, void *privdata, int mask) {
 redisClient *c = (redisClient*) privdata;
 int nread, readlen;
 size_t qblen;
 REDIS_NOTUSED(el);
 REDIS_NOTUSED(mask);

 server.current_client = c;
 readlen = REDIS_IOBUF_LEN;
 /* If this is a multi bulk request, and we are processing a bulk reply
  * that is large enough, try to maximize the probability that the query
  * buffer contains exactly the SDS string representing the object, even
  * at the risk of requiring more read(2) calls. This way the function
  * processMultiBulkBuffer() can avoid copying buffers to create the
  * Redis Object representing the argument. */
 if (c->reqtype == REDIS_REQ_MULTIBULK && c->multibulklen && c->bulklen != -1
  && c->bulklen >= REDIS_MBULK_BIG_ARG)
 {
  int remaining = (unsigned)(c->bulklen+2)-sdslen(c->querybuf);

  if (remaining < readlen) readlen = remaining;
 }

 qblen = sdslen(c->querybuf);
 if (c->querybuf_peak < qblen) c->querybuf_peak = qblen;
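 c->querybuf = sdsMakeRoomFor(c->querybuf, readlen); /* the buffer is enlarged here */


This shows that once a connection has read its first command, c->querybuf is allocated at least 1024*32 bytes. Looking back at the resize cleanup logic, there is an obvious problem: every query buffer that has been used is at least 1024*32 in size, yet the cleanup condition is > 1024. In other words, every used connection with idle > 2 gets resized, only to be reallocated to 1024*32 when the next request arrives. This is unnecessary, and on a busy cluster memory ends up being freed and reallocated constantly. So we changed the cleanup condition as follows, which avoids most of the unnecessary resize operations: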


if (((querybuf_size > REDIS_MBULK_BIG_ARG) &&
   (querybuf_size/(c->querybuf_peak+1)) > 2) ||
   (querybuf_size > 1024*32 && idletime > 2))
 {
  /* Only resize the query buffer if it is actually wasting space. */
  if (sdsavail(c->querybuf) > 1024*32) {
   c->querybuf = sdsRemoveFreeSpace(c->querybuf);
  }
 }
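To see what the patch changes, the resize decision can be written as a pure function with the threshold as a parameter. This is a simplified model for illustration, not Redis code; `should_resize` is a hypothetical name, and the buffer sizes used in the note below are illustrative values for a freshly grown 32k buffer plus a few bytes of header.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define BIG_ARG (1024 * 32) /* same value as the *_MBULK_BIG_ARG macros */

/* Model of the resize decision. `threshold` is 1024 in the original
 * code and 1024*32 in the patched code; it is used both for the idle
 * size check and for the free-space check, as in the listings above.
 * (Hypothetical helper, for illustration only.) */
bool should_resize(size_t querybuf_size, size_t querybuf_peak,
                   long idletime, size_t avail, size_t threshold) {
    bool too_big_for_peak = querybuf_size > BIG_ARG &&
                            (querybuf_size / (querybuf_peak + 1)) > 2;
    bool idle_and_large = querybuf_size > threshold && idletime > 2;

    /* Only resize when the buffer is actually wasting space. */
    return (too_big_for_peak || idle_and_large) && avail > threshold;
}
```

For an idle connection holding an empty 32k buffer (size 32777, peak 0, idle 3 s, 32768 bytes free), the model returns true with a threshold of 1024 and false with a threshold of 1024*32, which is the behavior the patch is after. A buffer that really is oversized, with far more than 32k free, is still reclaimed under both thresholds.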


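The side effect of this change is extra memory. At 5k connections per instance, 5000*1024*32 bytes is about 160 MB, which is entirely acceptable on servers with hundreds of GB of memory.

【Problem Reproduced】

After deploying the Redis server built from the modified source, the problem still recurred. Clients reported the same type of error, and server memory still fluctuated whenever the errors occurred. The captured stack traces were as follows: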

Thu Jun 14 21:56:54 CST 2018
#3  0x0000003729ee893d in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x7f2dc108d720 (LWP 27851)):
#0  0x0000003729ee5400 in madvise () from /lib64/libc.so.6
#1  0x0000000000493a1e in je_pages_purge ()
#2  0x000000000048cf40 in arena_purge ()
#3  0x00000000004a7dad in je_tcache_bin_flush_large ()
#4  0x00000000004a85e9 in je_tcache_event_hard ()
#5  0x000000000042c0b5 in decrRefCount ()
#6  0x000000000042744d in resetClient ()
#7  0x000000000042963b in processInputBuffer ()
#8  0x0000000000429762 in readQueryFromClient ()
#9  0x000000000041847c in aeProcessEvents ()
#10 0x000000000041873b in aeMain ()
#11 0x0000000000420fce in main ()

Thu Jun 14 21:56:54 CST 2018
Thread 1 (Thread 0x7f2dc108d720 (LWP 27851)):
#0  0x0000003729ee5400 in madvise () from /lib64/libc.so.6
#1  0x0000000000493a1e in je_pages_purge ()
#2  0x000000000048cf40 in arena_purge ()
#3  0x00000000004a7dad in je_tcache_bin_flush_large ()
#4  0x00000000004a85e9 in je_tcache_event_hard ()
#5  0x000000000042c0b5 in decrRefCount ()
#6  0x000000000042744d in resetClient ()
#7  0x000000000042963b in processInputBuffer ()
#8  0x0000000000429762 in readQueryFromClient ()
#9  0x000000000041847c in aeProcessEvents ()
#10 0x000000000041873b in aeMain ()
#11 0x0000000000420fce in main ()



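Clearly, the frequent query buffer resizing had been optimized away, yet clients were still reporting errors, and we were stuck again. Could some other factor be slowing down the query buffer resize? We captured pstack output once more, and this time jemalloc caught our attention. Recall how Redis allocates memory: to avoid the heavy fragmentation caused by libc not releasing memory, Redis uses jemalloc as its default memory allocator. The stacks captured during these errors all show je_pages_purge(), meaning Redis was calling into jemalloc to reclaim dirty pages. Let us look at what jemalloc does there: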


arena_purge(arena.c)
static void
arena_purge(arena_t *arena, bool all)
{
 arena_chunk_t *chunk;
 size_t npurgatory;
 if (config_debug) {
  size_t ndirty = 0;

  arena_chunk_dirty_iter(&arena->chunks_dirty, NULL,
   chunks_dirty_iter_cb, (void *)&ndirty);
  assert(ndirty == arena->ndirty);
 }
 assert(arena->ndirty > arena->npurgatory || all);
 assert((arena->nactive >> opt_lg_dirty_mult) < (arena->ndirty -
  arena->npurgatory) || all);

 if (config_stats)
  arena->stats.npurge++;
 npurgatory = arena_compute_npurgatory(arena, all);
 arena->npurgatory += npurgatory;

 while (npurgatory > 0) {
  size_t npurgeable, npurged, nunpurged;

  /* Get next chunk with dirty pages. */
  chunk = arena_chunk_dirty_first(&arena->chunks_dirty);
  if (chunk == NULL) {
   arena->npurgatory -= npurgatory;
   return;
  }
  npurgeable = chunk->ndirty;
  assert(npurgeable != 0);

  if (npurgeable > npurgatory && chunk->nruns_adjac == 0) {
 
   arena->npurgatory += npurgeable - npurgatory;
   npurgatory = npurgeable;
  }
  arena->npurgatory -= npurgeable;
  npurgatory -= npurgeable;
  npurged = arena_chunk_purge(arena, chunk, all);
  nunpurged = npurgeable - npurged;
  arena->npurgatory += nunpurged;
  npurgatory += nunpurged;
 }
}


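On every purge, jemalloc examines all the chunks that actually need cleaning and keeps count of what it purges. For a system with strict latency requirements this is an expensive luxury, so we considered upgrading jemalloc to improve purge performance. Redis 4.0 brought major performance improvements and a command for reclaiming memory, and we were already preparing to upgrade in production. The jemalloc that ships with Redis 4.0 is version 4.1. From jemalloc 4.0 onward, arena_purge() was heavily optimized: the counter call was removed, much of the decision logic was simplified, arena_stash_dirty() was added to merge the earlier computation and checks, and purge_runs_sentinel was introduced so that dirty runs are kept in a per-arena LRU instead of in the arena's tree of dirty-run-containing chunks. This greatly reduces the volume of dirty memory purged at a time, and memory blocks are no longer moved during reclamation. The code is as follows: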


arena_purge(arena.c)
static void
arena_purge(arena_t *arena, bool all)
{
 chunk_hooks_t chunk_hooks = chunk_hooks_get(arena);
 size_t npurge, npurgeable, npurged;
 arena_runs_dirty_link_t purge_runs_sentinel;
 extent_node_t purge_chunks_sentinel;

 arena->purging = true;

 /*
  * Calls to arena_dirty_count() are disabled even for debug builds
  * because overhead grows nonlinearly as memory usage increases.
  */
 if (false && config_debug) {
  size_t ndirty = arena_dirty_count(arena);
  assert(ndirty == arena->ndirty);
 }
 assert((arena->nactive >> arena->lg_dirty_mult) < arena->ndirty || all);

 if (config_stats)
  arena->stats.npurge++;

 npurge = arena_compute_npurge(arena, all);
 qr_new(&purge_runs_sentinel, rd_link);
 extent_node_dirty_linkage_init(&purge_chunks_sentinel);

 npurgeable = arena_stash_dirty(arena, &chunk_hooks, all, npurge,
  &purge_runs_sentinel, &purge_chunks_sentinel);
 assert(npurgeable >= npurge);
 npurged = arena_purge_stashed(arena, &chunk_hooks, &purge_runs_sentinel,
  &purge_chunks_sentinel);
 assert(npurged == npurgeable);
 arena_unstash_purged(arena, &chunk_hooks, &purge_runs_sentinel,
  &purge_chunks_sentinel);

 arena->purging = false;
}
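Both versions of arena_purge() assert the same precondition: unless `all` is set, purging is only due when the dirty page count exceeds the active page count shifted right by lg_dirty_mult. The sketch below is a simplified model of that threshold arithmetic and nothing more. `pages_to_purge` is a hypothetical name, the real arena_compute_npurge() contains details that are not modeled here, and the shift value used in the note is only an illustrative setting.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified model: how many dirty pages exceed the allowed threshold.
 * The threshold is nactive >> lg_dirty_mult, i.e. dirty pages are
 * tolerated up to a fixed fraction of the active pages.
 * (Hypothetical helper, for illustration only.) */
size_t pages_to_purge(size_t nactive, size_t ndirty,
                      unsigned lg_dirty_mult, bool all) {
    size_t threshold;
    if (all) return ndirty;           /* purge everything on request */
    threshold = nactive >> lg_dirty_mult;
    return (ndirty > threshold) ? ndirty - threshold : 0;
}
```

With 8000 active pages and a shift of 3, up to 1000 dirty pages are tolerated, so 1500 dirty pages leave 500 to purge, while 900 dirty pages trigger no purge at all.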


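【Solution】

We actually had several options: replace jemalloc with Google's tcmalloc, upgrade jemalloc, and so on. Based on the analysis above, we chose to upgrade jemalloc, which in practice meant upgrading Redis. After upgrading Redis to 4.0.9 and observing production, the thorny client connection timeout problem was resolved.

【Summary】

Redis is widely used in production for its high concurrency, fast responses, and ease of operation. For operations staff, its response-time requirements bring all sorts of problems, and connection timeouts are among the most typical. We went from first noticing the client connection timeouts, to locating the problem by capturing network packets between client and server and then stack traces, were misled by a few red herrings along the way, and finally solved it by upgrading jemalloc (Redis). The most valuable takeaway from this case is the overall line of analysis.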