Environment
- Azure
- CentOS 7.6
- Python 3.6.5
- tornado==5.0.2
- Tornado-MySQL==0.5.1
- tornado-redis==2.4.18
Incident
In the end, every request from the browser was stuck in the pending state.
Checking with lsof -i:9000 showed:
    python3 105792 apps 5u IPv4 2782706 0t0 TCP *:cslistener (LISTEN)
The connections held by the two worker processes were already in the CLOSE_WAIT state.
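The same check can be scripted without lsof. The sketch below is a minimal example, assuming a Linux host and IPv4 sockets only: it reads /proc/net/tcp, where each socket's state is a hex code (08 is CLOSE_WAIT, 0A is LISTEN), and counts the states of sockets whose local or remote port is 9000, roughly mirroring lsof -i:9000. The function name count_states is hypothetical, and IPv6 sockets (listed separately in /proc/net/tcp6) are not covered.

```python
import collections

# Hex state codes used in /proc/net/tcp (from the Linux kernel's TCP state enum).
TCP_STATES = {
    "01": "ESTABLISHED", "02": "SYN_SENT", "03": "SYN_RECV",
    "04": "FIN_WAIT1", "05": "FIN_WAIT2", "06": "TIME_WAIT",
    "07": "CLOSE", "08": "CLOSE_WAIT", "09": "LAST_ACK",
    "0A": "LISTEN", "0B": "CLOSING",
}


def count_states(lines, port):
    """Count TCP socket states where either end uses `port` (hypothetical helper)."""
    counts = collections.Counter()
    for line in lines:
        fields = line.split()
        # Skip the header row ("sl local_address ...") and malformed lines.
        if len(fields) < 4 or fields[0] == "sl":
            continue
        local, remote, state = fields[1], fields[2], fields[3]
        # Addresses look like "0100007F:2328"; the part after ':' is the port in hex.
        local_port = int(local.rsplit(":", 1)[1], 16)
        remote_port = int(remote.rsplit(":", 1)[1], 16)
        if port in (local_port, remote_port):
            counts[TCP_STATES.get(state.upper(), state)] += 1
    return counts
```

Called as count_states(f, 9000) with f opened on /proc/net/tcp, a CLOSE_WAIT count that keeps growing between runs points at the same leak that lsof showed.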
Resolution
I. Updating Linux
For a Linux client, there are four operating system keepalive parameters to change:
- tcp_keepalive_probes: the number of probes that are sent and unacknowledged before the client considers the connection broken and notifies the application layer
- tcp_keepalive_time: the interval between the last data packet sent and the first keepalive probe
- tcp_keepalive_intvl: the interval between subsequent keepalive probes
- tcp_retries2: the maximum number of times a packet is retransmitted before giving up
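The three keepalive parameters combine into a rough bound on how long an idle connection to a vanished peer lingers: the first probe goes out after tcp_keepalive_time seconds of silence, then one probe every tcp_keepalive_intvl seconds until tcp_keepalive_probes go unanswered. The sketch below (helper names are hypothetical) reads the current values from /proc/sys/net/ipv4 and computes that bound. It assumes the socket has SO_KEEPALIVE enabled, since the kernel keepalive settings do not apply otherwise, and it ignores tcp_retries2, which governs connections with unacknowledged data in flight rather than idle ones.

```python
from pathlib import Path

KEEPALIVE_PARAMS = ("tcp_keepalive_time", "tcp_keepalive_intvl", "tcp_keepalive_probes")


def read_keepalive(root="/proc/sys/net/ipv4"):
    """Read the current keepalive settings from procfs as integers (hypothetical helper)."""
    return {name: int(Path(root, name).read_text().strip()) for name in KEEPALIVE_PARAMS}


def dead_peer_detection_seconds(settings):
    """Approximate seconds until an idle SO_KEEPALIVE connection to a dead peer is dropped."""
    # Wait tcp_keepalive_time, then send tcp_keepalive_probes probes,
    # one every tcp_keepalive_intvl seconds, before giving up.
    return (settings["tcp_keepalive_time"]
            + settings["tcp_keepalive_intvl"] * settings["tcp_keepalive_probes"])
```

With the usual Linux defaults (7200, 75 and 9) the bound is 7875 seconds, a little over two hours; lowering tcp_keepalive_time to 60 brings it down to 735 seconds.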
II. On the Linux operating system, update these parameters using the “echo” command:
    echo "60" > /proc/sys/net/ipv4/tcp_keepalive_time
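For reference, the echo command writes the value into the matching file under /proc/sys. A minimal Python equivalent is sketched below; the helper names are hypothetical, it needs root on a real system, and, like echo, the change is lost on reboot. It maps a dotted key such as net.ipv4.tcp_keepalive_time to its procfs path, and the root argument exists only so the function can be exercised outside /proc.

```python
from pathlib import Path


def sysctl_path(key, root="/proc/sys"):
    """Map a dotted sysctl key like 'net.ipv4.tcp_keepalive_time' to its procfs file."""
    return Path(root, *key.split("."))


def write_sysctl(key, value, root="/proc/sys"):
    """Same effect as: echo "VALUE" > /proc/sys/<key with dots as slashes> (not persistent)."""
    path = sysctl_path(key, root)
    path.write_text(f"{value}\n")  # echo appends a trailing newline too
    return path
```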
The tcp_keepalive_time and tcp_keepalive_intvl values are expressed in seconds. To keep these values after a system restart, add them to the /etc/sysctl.conf file:
    net.ipv4.tcp_keepalive_time = 60
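If /etc/sysctl.conf is edited by a script, the persistence step can be made idempotent: rewrite any existing key = value line in place and append keys that are missing. The sketch below works on the file's text only; the function name upsert_sysctl_conf is hypothetical, and only dotted keys (not the slash form) are recognized.

```python
import re


def upsert_sysctl_conf(text, settings):
    """Return sysctl.conf `text` with every key in `settings` set to its new value."""
    lines = text.splitlines()
    found = set()
    for i, line in enumerate(lines):
        # Comment lines start with '#' or ';' and therefore never match.
        match = re.match(r"\s*([\w.]+)\s*=", line)
        if match and match.group(1) in settings:
            key = match.group(1)
            lines[i] = f"{key} = {settings[key]}"  # rewrite in place, duplicates included
            found.add(key)
    # Append keys that were not present yet.
    lines.extend(f"{k} = {v}" for k, v in settings.items() if k not in found)
    return "\n".join(lines) + "\n"
```

After the result is written back to /etc/sysctl.conf, sysctl -p reloads the file without a reboot.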
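Note
CLOSE_WAIT means the peer has already closed its end of the connection but the local process has not yet closed its socket, so when a worker accumulates CLOSE_WAIT connections the problem must lie in the application code. Tornado-MySQL lacks a fail-fast mechanism, which has been a persistent pitfall. Refactoring the code needs a dedicated block of time; changing the system parameters is only a stopgap.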
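One way to approximate fail-fast behaviour until the refactor happens is to put a deadline on every database call. The sketch below assumes Tornado 5 on Python 3, where Tornado's IOLoop runs on asyncio and Tornado Futures are asyncio Futures, so asyncio.wait_for can bound a Tornado-MySQL call. QueryTimeout, fail_fast and hypothetical_slow_query are hypothetical names; the slow query merely stands in for a call that never gets a reply.

```python
import asyncio


class QueryTimeout(Exception):
    """Raised when a database call does not finish within its deadline (hypothetical)."""


async def fail_fast(awaitable, timeout):
    """Await `awaitable`, but give up after `timeout` seconds instead of hanging forever."""
    try:
        # wait_for cancels the pending awaitable when the deadline passes.
        return await asyncio.wait_for(awaitable, timeout)
    except asyncio.TimeoutError:
        raise QueryTimeout(f"query did not finish within {timeout}s") from None


async def hypothetical_slow_query():
    # Stand-in for a Tornado-MySQL call whose server never answers.
    await asyncio.sleep(3600)
    return "rows"
```

Cancelling the future does not repair the underlying MySQL connection, which may be left mid-protocol, so closing it after a QueryTimeout is likely safer than returning it to a pool.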