Background
During stress testing of the application service, nginx began reporting errors after roughly one minute of sustained load. It took some time to investigate the errors and finally locate the problem, so the process is summarized here.
Stress testing tool
The stress test here uses siege, which makes it very easy to specify the number of concurrent users and the test duration, and gives very clear result feedback: number of successful requests, number of failures, throughput, and other performance figures.
Stress test parameters
Single-interface stress test: 100 concurrent users, lasting 1 minute.
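For reference, a typical siege invocation for this scenario looks roughly like the following (the URL is a placeholder taken from the nginx log further down, not necessarily the exact command used):
# 100 concurrent users, run for 1 minute; -b runs in benchmark mode with no delay between requests
siege -c 100 -t 1M -b http://xx-qa.xx.com/guide/v1/activities/1107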
Error reported by the stress testing tool
The server is now under siege...
[error] socket: unable to connect sock.c:249: Connection timed out
[error] socket: unable to connect sock.c:249: Connection timed out
Errors in the nginx error.log
2018/11/21 17:31:23 [error] 15622#0: *24993920 connect() failed (110: Connection timed out) while connecting to upstream, client: 192.168.xx.xx, server: xx-qa.xx.com, request: "GET /guide/v1/activities/1107 HTTP/1.1", upstream: "http://192.168.xx.xx:8082/xx/v1/activities/1107", host: "192.168.86.90"
2018/11/21 18:21:09 [error] 4469#0: *25079420 connect() failed (110: Connection timed out) while connecting to upstream, client: 192.168.xx.xx, server: xx-qa.xx.com, request: "GET /guide/v1/activities/1107 HTTP/1.1", upstream: "http://192.168.xx.xx:8082/xx/v1/activities/1107", host: "192.168.86.90"
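These entries come from nginx's error log; during the test it was simply followed in another terminal (the log path is an assumption and depends on the nginx configuration):
# Watch upstream connection errors as they happen (adjust the path to your setup)
tail -f /var/log/nginx/error.log | grep 'connect() failed'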
Troubleshooting
Seeing timed out, the first impression is that the application service has a performance problem and cannot keep up with the concurrent requests. However, checking the application service logs shows no errors at all.
Observing the CPU load of the application container (docker stats <container ID>) shows that CPU usage rises during the concurrent requests, with no other anomalies, which is normal. Continued observation, however, shows that once the stress test starts reporting errors, the CPU load of the application service drops and no new request logs appear in the application log. So, for the moment, it can be concluded that the unanswered requests never reach the application service and are stuck at the previous node in the chain, namely nginx.
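For reference, the container load above was watched with something along these lines (the container ID is a placeholder):
# Stream live CPU / memory usage of the application container
docker stats <container_id>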
During the stress test, check the TCP connections on the server where nginx runs:
# View the current number of connections on port 80
netstat -nat|grep -i "80"|wc -l
5407
# View the status of the current TCP connection
netstat -na | awk '/^tcp/ {++S[$NF]} END {for(a in S) print a, S[a]}'
LISTEN 12
SYN_RECV 1
ESTABLISHED 454
FIN_WAIT1 1
TIME_WAIT 5000
Two anomalies stand out in the TCP connections:
There are more than 5,000 connections
The number of connections in the TIME_WAIT state reaches 5000 and then stops growing
Analyzing each point in turn:
In theory, a test with 100 concurrent users should only ever hold about 100 connections. The most likely explanation is that siege itself created the 5000+ connections during the test.
# View siege configuration
vim ~/.siege/siege.conf
# The truth becomes clear: by default siege closes the connection after each request during the test and opens a new one for the next request. That explains why the server running nginx accumulates more than 5000 TCP connections instead of 100.
# Connection directive. Options "close" and "keep-alive" Starting with
# version 2.57, siege implements persistent connections in accordance
# to RFC 2068 using both chunked encoding and content-length directives
# to determine the page size.
#
# To run siege with persistent connections set this to keep-alive.
#
# CAUTION: Use the keep-alive directive with care.
# DOUBLE CAUTION: This directive does not work well on HPUX
# TRIPLE CAUTION: We don't recommend you set this to keep-alive
# ex: connection = close
# connection = keep-alive
#
connection = close
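As a quick cross-check (not part of the original investigation), switching siege to persistent connections should keep the client-side connection count close to the concurrency level instead of piling up thousands of short-lived connections:
# Assumption: in ~/.siege/siege.conf change the directive to
#   connection = keep-alive
# then rerun the test and re-count the connections on port 80
siege -c 100 -t 1M -b http://xx-qa.xx.com/guide/v1/activities/1107
netstat -nat | grep -i "80" | wc -l    # expected to stay close to the concurrency level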
Analysis of TIME_WAIT reaching 5000. First, what does the TCP TIME_WAIT state actually mean?
TIME_WAIT: wait long enough to be sure the remote TCP has received the acknowledgement of its connection-termination request. TCP must guarantee that all data is delivered correctly under all circumstances, so when a socket is closed, the side that actively closes it enters the TIME_WAIT state, while the passively closed side moves to CLOSED; this is what guarantees that all data has been transmitted.
Based on this definition, when the stress tool's connection is closed, the corresponding connection on the machine running nginx is not released immediately; it enters the TIME_WAIT state first. There are plenty of articles online about dropped packets caused by too many TIME_WAIT sockets, which matches exactly what happened during this stress test.
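A quicker way to watch just the TIME_WAIT count while the test runs (an alternative to the awk one-liner above, assuming iproute2's ss is available):
# Count sockets currently in TIME_WAIT (the output includes one header line)
ss -tan state time-wait | wc -l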
# Check the configuration of the server on which Nginx is running
cat /etc/sysctl.conf
# sysctl settings are defined through files in
# /usr/lib/sysctl.d/, /run/sysctl.d/, and /etc/sysctl.d/.
#
# Vendors settings live in /usr/lib/sysctl.d/.
# To override a whole file, create a new file with the same in
# /etc/sysctl.d/ and put new settings there. To override
# only specific settings, add a file with a lexically later
# name in /etc/sysctl.d/ and put new settings there.
#
# For more information, see sysctl.conf(5) and sysctl.d(5).
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1
vm.swappiness = 0
net.ipv4.neigh.default.gc_stale_time=120
# see details in https://help.aliyun.com/knowledge_detail/39428.html
net.ipv4.conf.all.rp_filter=0
net.ipv4.conf.default.rp_filter=0
net.ipv4.conf.default.arp_announce = 2
net.ipv4.conf.lo.arp_announce=2
net.ipv4.conf.all.arp_announce=2
# see details in https://help.aliyun.com/knowledge_detail/41334.html
net.ipv4.tcp_max_tw_buckets = 5000
net.ipv4.tcp_syncookies = 1
net.ipv4.tcp_max_syn_backlog = 1024
net.ipv4.tcp_synack_retries = 2
kernel.sysrq = 1
fs.file-max = 65535
net.ipv4.ip_forward = 1
net.ipv4.tcp_fin_timeout = 30
net.ipv4.tcp_max_syn_backlog = 10240
net.ipv4.tcp_keepalive_time = 1200
net.ipv4.tcp_synack_retries = 3
net.ipv4.tcp_syn_retries = 3
net.ipv4.tcp_max_orphans = 8192
net.ipv4.tcp_max_tw_buckets = 5000
net.ipv4.tcp_window_scaling = 0
net.ipv4.tcp_sack = 0
net.ipv4.tcp_timestamps = 0
net.ipv4.tcp_syncookies = 1
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_tw_recycle = 1
net.ipv4.ip_local_port_range = 1024 65000
net.ipv4.icmp_echo_ignore_all = 0
net.ipv4.tcp_max_tw_buckets = 5000
The value 5000 is the maximum number of TIME_WAIT sockets the system keeps at the same time. Once this number is exceeded, the excess TIME_WAIT sockets are cleared immediately and a warning message is printed.
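To confirm the limit is actually being hit, the current value and the kernel warning can be checked like this (the exact warning text may differ between kernel versions):
# Current limit on simultaneous TIME_WAIT sockets
sysctl net.ipv4.tcp_max_tw_buckets
# Look for the overflow warning in the kernel log
dmesg | grep -i 'time wait bucket table overflow'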
Optimization scheme
Based on information found online, the Linux kernel parameters were tuned as follows (a sketch of applying the changes follows this list):
net.ipv4.tcp_syncookies = 1 enables SYN cookies, so that when the SYN wait queue overflows, cookies are used to handle the connections; this protects against small-scale SYN attacks. The default is 0 (off).
net.ipv4.tcp_tw_reuse = 1 enables reuse, allowing sockets in the TIME-WAIT state to be reused for new TCP connections. The default is 0 (off).
net.ipv4.tcp_tw_recycle = 1 enables fast recycling of TIME-WAIT sockets. The default is 0 (off).
net.ipv4.tcp_fin_timeout = 30 determines how long a socket stays in the FIN-WAIT-2 state after this end has requested the close.
net.ipv4.tcp_keepalive_time = 1200 sets how often TCP sends keepalive probes when keepalive is enabled. The default is 2 hours; here it is lowered to 20 minutes.
net.ipv4.ip_local_port_range = 1024 65000 sets the port range used for outbound connections. The default range is quite small (32768 to 61000); here it is widened to 1024 to 65000.
net.ipv4.tcp_max_syn_backlog = 8192 sets the length of the SYN queue. The default is 1024; increasing it to 8192 accommodates more connections waiting to complete the handshake.
net.ipv4.tcp_max_tw_buckets = 5000 is the maximum number of TIME_WAIT sockets the system keeps at the same time; beyond that, TIME_WAIT sockets are cleared immediately and a warning is printed. The default is 180000; here it is set to 5000.
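A minimal sketch of applying these changes (values as listed above; requires root):
# Edit /etc/sysctl.conf (or a file under /etc/sysctl.d/) with the parameters above, then reload them without rebooting
sysctl -p
# Spot-check that a value took effect
sysctl net.ipv4.tcp_fin_timeout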