Problems caused by enabling tcp_tw_recycle on Linux (Part 2)
Reposted from: http://zhangle.is-a-geek.org/2010/11/tcp_tw_recycle和nat/
In RFC 1323, Van Jacobson writes:
An additional mechanism could be added to the TCP, a per-host
cache of the last timestamp received from any connection.
This value could then be used in the PAWS mechanism to reject
old duplicate segments from earlier incarnations of the
connection, if the timestamp clock can be guaranteed to have
ticked at least once since the old connection was open. This
would require that the TIME-WAIT delay plus the RTT together
must be at least one tick of the sender's timestamp clock.
Such an extension is not part of the proposal of this RFC.
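Linux implements this mechanism, but it only takes effect when both TCP timestamps and tcp_tw_recycle are enabled. The implementation lives in the tcp_v4_conn_request function in net/ipv4/tcp_ipv4.c: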
    if (tmp_opt.saw_tstamp &&
        tcp_death_row.sysctl_tw_recycle &&
        (dst = inet_csk_route_req(sk, req)) != NULL &&
        (peer = rt_get_peer((struct rtable *)dst)) != NULL &&
        peer->v4daddr == saddr) {
        if (xtime.tv_sec < peer->tcp_ts_stamp + TCP_PAWS_MSL &&
            (s32)(peer->tcp_ts - req->ts_recent) >
                                    TCP_PAWS_WINDOW) {
            NET_INC_STATS_BH(LINUX_MIB_PAWSPASSIVEREJECTED);
            dst_release(dst);
            goto drop_and_free;
        }
    }
    /* Kill the following clause, if you dislike this way. */
    else if (!sysctl_tcp_syncookies &&
             (sysctl_max_syn_backlog - inet_csk_reqsk_queue_len(sk) <
              (sysctl_max_syn_backlog >> 2)) &&
             (!peer || !peer->tcp_ts_stamp) &&
             (!dst || !dst_metric(dst, RTAX_RTT))) {
        /* Without syncookies last quarter of
         * backlog is filled with destinations,
         * proven to be alive.
         * It means that we continue to communicate
         * to destinations, already remembered
         * to the moment of synflood.
         */
        LIMIT_NETDEBUG(KERN_DEBUG "TCP: drop open "
                       "request from %u.%u.%u.%u/%u\n",
                       NIPQUAD(saddr),
                       ntohs(skb->h.th->source));
        dst_release(dst);
        goto drop_and_free;
    }
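This mechanism relies on the timestamps from a given client machine increasing monotonically. If the server sits behind a load balancer that performs NAT without rewriting the timestamps in the packets, SYN packets from some clients may be dropped and their connection requests will time out. The reason is that the cache is keyed by source IP address, so every client behind the NAT shares one cache entry, while each timestamp value comes from the jiffies of the machine that sent it, and different machines almost never have exactly the same boot time. Besides the client-side timeouts, you can also see the symptom on the server: in the output of netstat -s, the counter on the line "passive connections rejected by timestamp" keeps growing.
So machines behind NAT should not enable tcp_tw_recycle.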
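The scenario above can be sketched in user space roughly as follows. This is a simplified model, not the kernel code: struct peer_cache, syn_rejected, remember and nat_scenario_drops_b are hypothetical names, the TSval numbers are made up, and the constants 60 and 1 are assumed to correspond to TCP_PAWS_MSL and TCP_PAWS_WINDOW. The explicit "empty entry" check is also a simplification of my own.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed to mirror TCP_PAWS_MSL and TCP_PAWS_WINDOW in the kernel. */
#define PAWS_MSL    60  /* seconds a cached timestamp stays relevant */
#define PAWS_WINDOW 1   /* tolerated backwards step, in ticks */

/* Hypothetical stand-in for the per-host (per source IP) cache entry. */
struct peer_cache {
    uint32_t tcp_ts;        /* last TSval seen from this IP */
    long     tcp_ts_stamp;  /* second it was recorded; 0 means "empty" */
};

/* Returns true if a SYN carrying `tsval` at time `now` would be dropped. */
static bool syn_rejected(const struct peer_cache *peer, uint32_t tsval, long now)
{
    return peer->tcp_ts_stamp != 0 &&
           now < peer->tcp_ts_stamp + PAWS_MSL &&
           /* signed view of the modulo-2^32 difference, as in the kernel check */
           (int32_t)(peer->tcp_ts - tsval) > PAWS_WINDOW;
}

/* Record the timestamp of a connection from this IP that just closed. */
static void remember(struct peer_cache *peer, uint32_t tsval, long now)
{
    assert(now != 0);  /* 0 is reserved to mean "empty" in this sketch */
    peer->tcp_ts = tsval;
    peer->tcp_ts_stamp = now;
}

/* Two clients behind one NAT address: A has been up much longer than B. */
static bool nat_scenario_drops_b(void)
{
    struct peer_cache nat_ip = {0, 0};
    uint32_t client_a_ts = 900000000u;  /* long uptime, large TSval */
    uint32_t client_b_ts = 1000u;       /* recently booted, small TSval */

    remember(&nat_ip, client_a_ts, 100);             /* A's connection ends at t=100 */
    return syn_rejected(&nat_ip, client_b_ts, 105);  /* B connects 5 s later */
}
```

In this model nat_scenario_drops_b() returns true: client B's SYN looks like an old duplicate from client A, because both arrive from the same NAT address. Once PAWS_MSL seconds have passed since the cached entry was recorded, the same SYN is accepted again, which would explain why such failures tend to look intermittent.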
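There is one more side note. Look at the expression below. Both operands are unsigned 32-bit integers, so the subtraction can underflow: if the former is smaller than the latter by more than 2^31, the result, once cast to a signed value, comes out positive. That is exactly the case I ran into while analyzing this, and for a while I could hardly make my explanation hold together, which was rather embarrassing.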
(s32)(peer->tcp_ts - req->ts_recent)
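I wrote a patch to try to eliminate this case, but with my approach wrap-around was no longer handled correctly, and the expression was written that way in the first place precisely so that wrap-around is handled correctly. So I'm afraid that, apart from adding a warning, there is not much else that can be done.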
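The trade-off can be made concrete with a few numbers. The sketch below is only an illustration: ts_delta reproduces the cast from the expression above, while naive_is_older is a hypothetical plain comparison that I use as a contrast (it is not the patch mentioned above, whose content is not shown here). The sample values are made up.

```c
#include <assert.h>
#include <stdint.h>

/* How far `cached` is ahead of `incoming`, taken modulo 2^32 and then
 * read as a signed 32-bit value (same shape as the kernel expression). */
static int32_t ts_delta(uint32_t cached, uint32_t incoming)
{
    return (int32_t)(cached - incoming);
}

/* Hypothetical alternative: a plain unsigned comparison. */
static int naive_is_older(uint32_t cached, uint32_t incoming)
{
    return cached > incoming;
}

static void check_examples(void)
{
    /* Wrap-around: the incoming timestamp has wrapped past zero and is
     * really 32 ticks newer. The signed difference sees that correctly. */
    assert(ts_delta(0xFFFFFFF0u, 0x00000010u) == -32);
    /* The plain comparison wrongly calls the incoming value older. */
    assert(naive_is_older(0xFFFFFFF0u, 0x00000010u) == 1);

    /* Ambiguous case: cached is smaller than incoming by more than 2^31,
     * so the signed difference comes out positive. */
    assert(ts_delta(1000u, 3000000000u) == 1294968296);
    /* Here the plain comparison gives the intuitive answer. */
    assert(naive_is_older(1000u, 3000000000u) == 0);
}
```

Each approach gets one of the two cases wrong, and with only 32 bits there is no way to tell a wrapped timestamp from one that is genuinely more than 2^31 ticks away, so the signed-difference form is presumably the lesser evil.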