I should have mentioned earlier that our production architecture has been rebuilt from the ground up, with Spring Boot at the application level. Recently, for third-party reasons, we started the online internal test ahead of schedule.
Then the operations team found a problem: the server's HTTPS port had a large number of connections in CLOSE_WAIT.
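For context, CLOSE_WAIT means the remote side has already closed its end of the connection while the local application has not yet closed its own socket. A rough way to keep an eye on this state is to count the matching lines in netstat output. The sketch below is only an illustration: it assumes the Linux `netstat -ant` column layout (Proto, Recv-Q, Send-Q, Local Address, Foreign Address, State), the class name `CloseWaitCounter` is hypothetical, and the sample lines are made up rather than taken from our server.

```java
import java.util.Arrays;
import java.util.List;

final class CloseWaitCounter {

    // Counts netstat-style lines whose local address ends with ":<port>"
    // and whose state column is CLOSE_WAIT.
    // Assumption: columns are Proto, Recv-Q, Send-Q, Local, Foreign, State.
    static long count(List<String> netstatLines, int port) {
        String portSuffix = ":" + port;
        return netstatLines.stream()
                .map(line -> line.trim().split("\\s+"))
                .filter(cols -> cols.length >= 6)
                .filter(cols -> cols[3].endsWith(portSuffix))
                .filter(cols -> "CLOSE_WAIT".equals(cols[5]))
                .count();
    }

    public static void main(String[] args) {
        // Made-up sample lines for illustration only.
        List<String> sample = Arrays.asList(
                "tcp 0 0 10.0.0.5:443 203.0.113.7:51234 CLOSE_WAIT",
                "tcp 0 0 10.0.0.5:443 203.0.113.8:40022 ESTABLISHED",
                "tcp 1 0 10.0.0.5:443 203.0.113.9:38811 CLOSE_WAIT",
                "tcp 0 0 10.0.0.5:80 203.0.113.7:51300 CLOSE_WAIT");
        System.out.println(count(sample, 443));
    }
}
```

In practice the lines would come from running netstat on the server; a count that keeps growing on the HTTPS port is the symptom described here.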
My first reaction was that this was a Spring Boot bug. The project is split into two services, HTTP and HTTPS, both started as jars, and the HTTP service had no problem. Meanwhile, the old-architecture services that serve HTTPS from Tomcat had no problem either. So at the time I concluded that nothing was wrong at the socket level, and I started analyzing the Spring Boot code.
After some debugging and analysis (if I get the chance, I'll write up the process in another article), I had not found the cause, but I had found a pattern. In the doRun method of SocketProcessor, an inner class of org.apache.tomcat.util.net.NioEndpoint, the handshake state is always handshake == SelectionKey.OP_READ, so the socket keeps being monitored and is never closed.
At this point the problem did seem to be at the socket level, but I still suspected Spring Boot. The Tomcat code that Spring Boot pulls in for this part is the embedded build (tomcat-embed-core-8.5.4), yet that code is no different from the full version, and the full version did not show this problem.
Then, for two reasons, I decided to file an issue directly. First, it would take a lot of time to analyze the relevant code well enough to be sure that a fix would not cause other problems. Second, I wanted to be sure that the cause was not our new architecture or our own development. So I opened an issue on GitHub: https://github.com/spring-projects/spring-boot/issues/7780 The next day, however, I was advised to raise the issue with Tomcat instead.
It still felt like a runaround to me, but I had no proof that it was not a Tomcat problem. So I went through the code again, trying to prove it, and found nothing.
Finally, I filed a bug with Tomcat: https://bz.apache.org/bugzilla/show_bug.cgi?id=60555. The reply pointed to another bug and confirmed that this version does have the problem:
The problem occurs for TLS connections when the connection is dropped after the socket has been accepted but before the handshake is complete. The socket ended up in a loop:
- timeout -> ERROR event
- process ERROR (this is the new bit from r1746551)
- try to finish handshake
- need more data from client
- register with poller for READ
- wait for timeout
- timeout ...
... and around you go.
Well, since Tomcat had taken up the report, I won't say much more, but I compared the code in my local jar with r1746551. After debugging, I found that the problem was not caused by the code he pointed to, because the problem was still there after I debugged with the r1746551 code. There was, however, a barely acceptable workaround for the production environment: replace embedded Tomcat with embedded Jetty. As expected, the problem went away.
First, exclude the embedded Tomcat dependency of spring-boot-starter-web in build.gradle:
compile('org.springframework.boot:spring-boot-starter-web:1.4.0.RELEASE') {
    exclude module: "spring-boot-starter-tomcat"
}
Then add Jetty in its place:
compile group: 'org.springframework.boot', name: 'spring-boot-starter-jetty', version: '1.4.0.RELEASE'
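Spring Boot chooses the embedded container based on what is on the classpath, so after a dependency change like this it can be worth confirming which server classes are actually visible at runtime. The helper below is a minimal sketch, not part of our project: `EmbeddedContainerCheck` and `isPresent` are hypothetical names, while `org.apache.catalina.startup.Tomcat` and `org.eclipse.jetty.server.Server` are the main server classes of the two containers.

```java
final class EmbeddedContainerCheck {

    // Returns true if the named class can be loaded, without initializing it.
    static boolean isPresent(String className) {
        try {
            Class.forName(className, false, EmbeddedContainerCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // After the exclusion, the first line is expected to print false
        // and the second true when run on the application's classpath.
        System.out.println("Tomcat on classpath: "
                + isPresent("org.apache.catalina.startup.Tomcat"));
        System.out.println("Jetty on classpath: "
                + isPresent("org.eclipse.jetty.server.Server"));
    }
}
```

Run with the application's runtime classpath; on a bare JVM both lines print false.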
As for the question I raised with Tomcat, I'll take some time to think it through carefully before following up. That said, I have just tested with an upgraded Tomcat version, and the problem does not occur there.
After debugging for a while longer, I am fairly sure that r1746551 is not what fixes the problem. Here is what I found when reading the code: the part that directly fixes it is not included in r1746551. The original, problematic code is as follows:
if (socket.isHandshakeComplete() || event == SocketEvent.STOP) {
    handshake = 0;
} else {
    handshake = socket.handshake(key.isReadable(), key.isWritable());
    // The handshake process reads/writes from/to the
    // socket. status may therefore be OPEN_WRITE once
    // the handshake completes. However, the handshake
    // happens when the socket is opened so the status
    // must always be OPEN_READ after it completes. It
    // is OK to always set this as it is only used if
    // the handshake completes.
    event = SocketEvent.OPEN_READ;
}
The current code, which works correctly, is:
if (socket.isHandshakeComplete()) {
    // No TLS handshaking required. Let the handler
    // process this socket/event combination.
    handshake = 0;
} else if (event == SocketEvent.STOP || event == SocketEvent.DISCONNECT ||
        event == SocketEvent.ERROR) {
    // Unable to complete the TLS handshake. Treat it as
    // if the handshake failed.
    handshake = -1;
} else {
    handshake = socket.handshake(key.isReadable(), key.isWritable());
    // The handshake process reads/writes from/to the
    // socket. status may therefore be OPEN_WRITE once
    // the handshake completes. However, the handshake
    // happens when the socket is opened so the status
    // must always be OPEN_READ after it completes. It
    // is OK to always set this as it is only used if
    // the handshake completes.
    event = SocketEvent.OPEN_READ;
}
The problem arises when a connection is dropped while the handshake is still being established. With the check changed as above, a handshake that cannot complete because the socket failed sets handshake = -1 and the socket goes to the close method, which the original check could never do. That is why the problem is solved. As for where this code lives, I mentioned that at the beginning. If I have missed anything, please let me know.
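To make the difference between the two checks concrete, here is a small stand-alone model of them. It is only a sketch under simplifying assumptions, not Tomcat code: the `Event` enum stands in for Tomcat's SocketEvent, the class and method names are hypothetical, and the call to socket.handshake(...) is replaced by a constant that models a client that has gone away and will never send more data.

```java
import java.nio.channels.SelectionKey;

final class HandshakeDecisionSketch {

    // Simplified stand-in for Tomcat's SocketEvent (assumption, not the real enum).
    enum Event { OPEN_READ, STOP, DISCONNECT, ERROR }

    static final int DONE = 0;     // hand the socket to the handler
    static final int FAILED = -1;  // treat the handshake as failed and close the socket

    // Models socket.handshake(...) for a vanished client: it always needs more data.
    static int handshakeWithVanishedClient() {
        return SelectionKey.OP_READ;
    }

    // Original check: ERROR falls through and the handshake is retried.
    static int oldDecision(boolean handshakeComplete, Event event) {
        if (handshakeComplete || event == Event.STOP) {
            return DONE;
        }
        return handshakeWithVanishedClient();
    }

    // Fixed check: STOP, DISCONNECT and ERROR are treated as a failed handshake.
    static int newDecision(boolean handshakeComplete, Event event) {
        if (handshakeComplete) {
            return DONE;
        }
        if (event == Event.STOP || event == Event.DISCONNECT || event == Event.ERROR) {
            return FAILED;
        }
        return handshakeWithVanishedClient();
    }

    // Simulates repeated poller timeouts, each raising ERROR on an unfinished handshake.
    // Returns the cycle in which the socket is closed, or -1 if it never is within maxCycles.
    static int cyclesUntilClosed(boolean useFixedCheck, int maxCycles) {
        for (int cycle = 1; cycle <= maxCycles; cycle++) {
            int result = useFixedCheck
                    ? newDecision(false, Event.ERROR)
                    : oldDecision(false, Event.ERROR);
            if (result == FAILED) {
                return cycle;
            }
            // Otherwise: re-register for READ and wait for the next timeout.
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println("old check closes after: " + cyclesUntilClosed(false, 1000));
        System.out.println("new check closes after: " + cyclesUntilClosed(true, 1000));
    }
}
```

In this model the old check never reaches the close path no matter how many timeouts occur, while the new check closes the socket on the first ERROR event, which matches the loop described in the Tomcat reply.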
==========================================================
My recent GitHub: https://github.com/saaavsaaa