This week was one of those where you spend too much time trying to solve a bug. It was an interesting process though, as it lets you learn and dive deep into things. Here is the story of that journey.
I was adding an LDAP login feature to a Flask application. The application was running in Docker, and the LDAP server was running in another private network, reachable over a VPN.
The login form was ready: if you submitted a username that doesn't exist in LDAP, you received a "user not found" error. However, if you submitted a username that does exist in LDAP, you got a timeout, regardless of the password.
High-level debugging
At the very beginning, we needed to check whether the LDAP server responds correctly over the VPN. Let's start with an ldapsearch from my host:
ldapsearch -d 1 -vv -x -LLL -H ldap://10.10.0.10 -D 'CN=root,CN=Users,DC=Project,DC=domain,DC=com' -b 'DC=Project,DC=domain,DC=com' -W 2>&1 | less
It works, so the problem is probably in the Python library. I was using flask-simpleldap, so let's try another one. I chose flask-ldapconnect, refactored the code and … it failed again, with the exact same pattern. It could have been the same bug if both libraries relied on the same underlying package, but that wasn't the case.
Now let's remove any intermediates: we ran the application directly inside the private network, still in Docker, and the bug disappeared. It was becoming clear that the bug was related to my network, the VPN, or Docker.
Low-level debugging
Now we need to investigate what's going on in the network. What is the difference between the request sent from Docker and the one sent from my host?
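The two captures compared below can be reproduced with a packet sniffer; something like the following tcpdump invocations would give equivalent output (this is a sketch: it assumes LDAP on its standard port 389 and the default interface names).

```
# From the host, through the VPN interface
tcpdump -n -i ppp0 'tcp port 389'

# From the Docker side, on the default bridge
tcpdump -n -i docker0 'tcp port 389'
```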
We can see here that the connection to the LDAP server is properly established (line 5). We then send a searchRequest (line 7), 321 bytes long, and receive a valid response (line 11). The remaining lines contain no useful debugging information, so we can ignore them for now.
We can see here that the connection is also established (l.7) and we send the exact same 321-byte searchRequest (l.9), and then … nothing. On l.12, my machine sends a FIN flag to close the connection. Between l.11 and l.12, about 10 seconds elapse, which corresponds to the timeout.
Still, no idea why this one fails while the other succeeds :(
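As an aside, the failing pattern seen in the capture (request sent, no reply ever arrives, the client gives up and sends a FIN) is easy to reproduce with a toy socket pair. This is just an illustration, not the application's actual code:

```python
import socket
import threading

# A toy "server" that accepts the connection and reads the request,
# but never sends a response -- mimicking what the capture shows.
srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.bind(("127.0.0.1", 0))
srv.listen(1)

def silent_server():
    conn, _ = srv.accept()
    conn.recv(1024)            # read the "searchRequest"
    threading.Event().wait(2)  # hold the connection open, answer nothing
    conn.close()

threading.Thread(target=silent_server, daemon=True).start()

cli = socket.create_connection(srv.getsockname())
cli.settimeout(0.5)            # the real client waited ~10 s
cli.sendall(b"searchRequest")
try:
    cli.recv(1024)             # no reply ever arrives...
    timed_out = False
except socket.timeout:
    timed_out = True           # ...so the client gives up
cli.close()                    # closing sends a FIN, as seen in the capture
print(timed_out)               # True
```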
Then I decided to run the application outside Docker, directly on my host. I created the virtualenv, adjusted the configuration a bit, and after a few minutes the application was running and … the LDAP login was working fine. So the VPN was working correctly, and the bridge between my host and my container was the issue. I decided to change the network mode of my container: instead of bridge, which is the default, I tried host mode. This mode avoids creating a separate network layer for Docker containers. With host mode, there is no port mapping; the container's ports are reachable directly on the host's IP. This step made my login feature work.
We were getting closer.
After running Wireshark on the LDAP server, we found out that packets over 1400 bytes seemed to be dropped. We then checked the MTU of the interface used by the VPN, aka ppp0:
$ ifconfig | grep ppp0
ppp0: flags=4305<UP,POINTOPOINT,RUNNING,NOARP,MULTICAST> mtu 1400
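A quick way to confirm such a path limit from the host is to send non-fragmentable pings just below and just above it (Linux ping flags; the IP is the LDAP server from the ldapsearch above):

```
# 1372 bytes of ICMP payload + 28 bytes of headers = 1400: fits the link
ping -c 1 -M do -s 1372 10.10.0.10

# 1472 + 28 = 1500: exceeds the 1400-byte MTU, should fail
ping -c 1 -M do -s 1472 10.10.0.10
```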
MTU 1400, while Docker interfaces all default to 1500. So here is what was happening: when the application ran directly on my host, it used ppp0 and advertised packets no bigger than 1400 bytes, and the LDAP server was fine with that. But when the request came from Docker, where the maximum size was 1500, the LDAP server responded with packets up to 1500 bytes; the VPN link dropped those packets, the application got no response, and the request timed out.
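The size mismatch can be spelled out with a bit of arithmetic (a sketch, assuming standard 20-byte IP and 20-byte TCP headers with no options; the actual negotiated MSS can be slightly smaller):

```python
MTU_VPN = 1400     # ppp0, the VPN link
MTU_DOCKER = 1500  # default Docker bridge
IP_HEADER = 20
TCP_HEADER = 20

# Maximum TCP payload (MSS) each path advertises per packet:
mss_host = MTU_VPN - IP_HEADER - TCP_HEADER
mss_docker = MTU_DOCKER - IP_HEADER - TCP_HEADER
print(mss_host, mss_docker)  # 1360 1460

# A full-sized 1500-byte packet negotiated via the Docker bridge
# does not fit on the 1400-byte VPN link:
print(MTU_DOCKER > MTU_VPN)  # True
```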
So the solution was now obvious: I needed to start the container with its network configured with an MTU of 1400. As we were working with docker-compose, the following snippet does the trick:
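The snippet itself is missing from this copy of the post, so here is a sketch of the usual way to do it: in a Compose file, the MTU of the default bridge network can be set through the driver option `com.docker.network.driver.mtu`.

```yaml
networks:
  default:
    driver: bridge
    driver_opts:
      # Cap the container network's MTU at the VPN link's MTU
      com.docker.network.driver.mtu: 1400
```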
It was a really frustrating bug, but it was very interesting to start from a very high level and slowly dive down into the networking stack.
I'd like to thank my colleagues Victor and Jean for helping me solve this issue.