The Problem
The particular problem I experienced was this: the application maintained an open session with an LDAP server so that it could retrieve LDAP information for application clients on an as-needed basis. The firewall between the application server and the LDAP server was set to kill any session that had been idle for longer than 30 minutes.
The problem manifested itself any time more than 30 minutes passed between LDAP searches. In a development environment this was quite frequent, but it was extremely troublesome to track down because virtually any change we made in our environment seemed to make the problem disappear for the duration of our testing cycle. The problem would then reappear at seemingly unpredictable intervals. Sometimes it would reappear after an hour or so. Sometimes it would reappear the next day. Sometimes it would reappear a few days later.
Laying the Blame
The problem isn't necessarily a firewall problem in and of itself. Applications should be able to discern and deal with broken TCP connections. Should and will, however, are two different stories. Not all applications out there are mature enough to have worked out all of their low-level idiosyncrasies. While I'm not willing to lay the blame entirely on the firewall, the action of dropping a session and discontinuing traffic forwarding over a previously working TCP connection is a "feature" that I very much frown upon.
I can certainly see why a need for such a feature arose: firewall resources are typically very limited, and in high-volume environments maintaining open and unused connections can have performance implications. Additionally, leaving sessions open indefinitely can have security implications where session hijacking is concerned. I do, however, disagree with the idea that a network resource that has a major responsibility of pushing traffic through the network should arbitrarily stop doing just that.
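To make "discern and deal with broken TCP connections" a little more concrete, here is a minimal Python sketch of a client that throws away a failed connection and retries on a fresh one. The names `ReconnectingClient`, `connect`, and `operation` are hypothetical and do not come from the application described above. The sketch also assumes the underlying socket has a timeout set: when a firewall silently drops packets, a read without a timeout can simply hang rather than fail.

```python
class ReconnectingClient:
    """Runs operations on one connection, reconnecting if it turns out to be dead.

    `connect` is a hypothetical zero-argument factory returning an object
    with a close() method (a socket, an LDAP handle, and so on).
    """

    def __init__(self, connect, attempts=2):
        if attempts < 1:
            raise ValueError("attempts must be at least 1")
        self._connect = connect
        self._attempts = attempts
        self._conn = None

    def run(self, operation):
        """Call operation(conn); on a connection error, reconnect and retry."""
        last_error = None
        for _ in range(self._attempts):
            try:
                if self._conn is None:
                    self._conn = self._connect()
                return operation(self._conn)
            except OSError as exc:
                # Resets, broken pipes, and socket timeouts are all OSError
                # subclasses in Python 3; treat each as a dead connection.
                last_error = exc
                self._discard()
        raise last_error

    def _discard(self):
        """Drop the current connection, ignoring errors while closing it."""
        conn, self._conn = self._conn, None
        if conn is not None:
            try:
                conn.close()
            except OSError:
                pass
```

Retrying is only safe for operations that can be repeated without side effects, such as a read-only search. A library that raises its own exception type instead of `OSError` would need that type added to the `except` clause.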
Dealing with the Firewall Killing Idle Sessions
Once the problem was determined to be an issue with the firewall dropping idle connections, dealing with it was not very difficult. There are four ways to deal with this issue:
- TCP Keepalives - Out of the box, at least with RHEL/CentOS systems, the default keepalive timing kicks in after 2 hours of idle time. Reducing this timing at the OS level to a value shorter than the firewall's idle timeout resolves the problem. The downside is that by reducing this time frame, you increase the keepalive traffic on every other connection that sits idle for that long. It's not much, but with a lot of open connections, a lot of equipment, and a lot of services, those little packets turn into something that has to be accounted for.
- Arbitrary Traffic Generation - This is the duct-tape-and-hammer solution. It works, but it's not pretty, and it's an abuse of network resources. You are generating traffic on a connection for the sole purpose of preventing the firewall from killing it, and that traffic serves no other purpose.
- Increase Firewall Timeouts - You could increase the firewall timeouts to be greater than the 2-hour keepalive timer that the OS has in place by default. This saves you the trouble of reconfiguring existing equipment, but it also means that your live-session count could grow substantially, depending on how many idle sessions the firewall is actively killing on a day-to-day basis.
- Don't Maintain Idle Connections - Persistent connections can provide significant performance gains, but on both the client and the server side, your applications should let you configure an idle-timeout value. You end up playing cat and mouse with each deployment at each enterprise to find the appropriate idle timeouts, but it is not unreasonable for the first request in over 30 minutes to pay the overhead of setting up a new TCP session.
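The last option in the list, an application-side idle timeout, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `IdleAwareConnection`, `connect`, and the 25-minute default are hypothetical, chosen to sit under the 30-minute firewall timeout described earlier, and are not taken from the application in question.

```python
import time


class IdleAwareConnection:
    """Holds one connection and replaces it once it has been idle too long.

    `connect` is a hypothetical zero-argument factory returning an object
    with a close() method. `clock` is injectable so the logic can be tested.
    """

    def __init__(self, connect, max_idle_seconds=25 * 60, clock=time.monotonic):
        self._connect = connect
        self._max_idle = max_idle_seconds
        self._clock = clock
        self._conn = None
        self._last_used = None

    def get(self):
        """Return a usable connection, reconnecting if the old one went stale."""
        now = self._clock()
        if self._conn is not None and now - self._last_used > self._max_idle:
            # Idle for longer than the limit: assume a middlebox may have
            # dropped the session, so close it instead of trusting it.
            self._close()
        if self._conn is None:
            self._conn = self._connect()
        self._last_used = now
        return self._conn

    def _close(self):
        """Close and forget the current connection, ignoring close errors."""
        try:
            self._conn.close()
        except OSError:
            pass
        self._conn = None
```

The trade-off is the one described in the list: the first request after a long quiet period pays for a new connection, in exchange for never sending a request down a session the firewall has already forgotten.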
In our case, we went with Arbitrary Traffic Generation to verify our hypothesis, and TCP Keepalives to resolve the issue to our satisfaction. We talked about setting up server-side idle timeouts, but given that the TCP Keepalives worked, I don't know that the server-side configuration changes will be made.
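The fix above was made at the OS level; on Linux the 2-hour default corresponds to the `net.ipv4.tcp_keepalive_time` sysctl, which defaults to 7200 seconds. As an alternative that avoids changing the timing for every connection on the host, keepalive timing can also be set per socket. The following Python sketch shows that alternative, not the change we actually made. The function name and the default values are hypothetical, with the 20-minute idle time chosen to fall inside a 30-minute firewall timeout. Note that keepalive probes are only sent on sockets that have `SO_KEEPALIVE` enabled in the first place.

```python
import socket


def enable_keepalive(sock, idle_seconds=20 * 60, interval_seconds=60, probe_count=5):
    """Enable TCP keepalive on one socket and tune its timing where supported."""
    # Turn keepalive on for this socket.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

    # The timing options below are platform dependent (they exist on Linux),
    # so each one is applied only if this Python build exposes it.
    if hasattr(socket, "TCP_KEEPIDLE"):
        # Seconds of idle time before the first keepalive probe is sent.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle_seconds)
    if hasattr(socket, "TCP_KEEPINTVL"):
        # Seconds between probes when no reply has been received.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval_seconds)
    if hasattr(socket, "TCP_KEEPCNT"):
        # Unanswered probes allowed before the connection is reported dead.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probe_count)
    return sock
```

A caller would create the socket, pass it through `enable_keepalive`, and then connect as usual. This only helps when the application or its library gives you access to the underlying socket, which is not always the case.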
With so many applications these days living in multi-tiered, shared-resource environments, application developers will need to become more and more aware of the network equipment that surrounds us. This becomes even more true as network appliances take on more responsibilities and become more complex: the more features they have, the greater the potential for problems.
