LDAP TLS certificate auth didn't work on leap day
TL;DR: pfSense failed to validate a valid LDAP server certificate, but only on February 29.
Friday night (Feb 28, a few hours after the start of Feb 29 UTC), I started getting reports that users couldn't authenticate to the VPN service hosted on a pfsense instance. The authentication server was reporting that the LDAP server's TLS certificate was not valid. Nothing had changed, but I nuked and rebuilt the installation anyway on the assumption that it was corrupt. That didn't fix it.
What did fix it was waiting until Monday, when LDAP over TLS worked again.
My usual assumption when something goes weirdly wrong is that I made a mistake, but the timing of this event is suspicious. Did anyone else experience a similar problem over the weekend?
It's more likely a problem on the server side, but I suppose it could go either way. On pfSense, though, the PHP+LDAP stuff just uses OpenSSL to validate, so if that were going to fail it would have failed for everything, for everyone, everywhere, not just in LDAP.
What is the date on the certificate? What type of server certificate is it?
Is there a chance the server certificate expired late on the 28th (something mistakenly thought it was the end of that month) but it didn't renew until the first? I could see that happening with ACME if it were manually set up with some odd end of month/start of month timing on its renewals.
@jimp Those are all reasonable questions. I really appreciate that you're helping with what is most likely a mistake on my part.
The cert is valid from 8/5/2019 to 8/4/2020.
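As a quick sanity check, the validity window quoted above can be compared against leap day. This is only a sketch: it assumes the dates are in month/day/year order and that the window starts and ends at midnight UTC, since the thread gives no time of day. The name `in_validity_window` is made up for illustration.

```python
from datetime import datetime, timezone

# Validity window quoted in the thread. The time of day (midnight UTC) is an
# assumption; only the dates were given.
NOT_BEFORE = datetime(2019, 8, 5, tzinfo=timezone.utc)
NOT_AFTER = datetime(2020, 8, 4, tzinfo=timezone.utc)


def in_validity_window(when, not_before=NOT_BEFORE, not_after=NOT_AFTER):
    """Return True if `when` falls inside the certificate's validity window."""
    return not_before <= when <= not_after


# Noon UTC on leap day sits comfortably inside the window.
leap_day = datetime(2020, 2, 29, 12, 0, tzinfo=timezone.utc)
print(in_validity_window(leap_day))  # prints True
```

So the dates alone do not explain a failure on February 29; the window is months wide on either side.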
I'm not sure what you mean by "type of server certificate". It was issued from a Win2012R2 certificate authority to a Win2012R2 domain controller, using the DomainController template. The key usage flags are Digital Signature, Key Encipherment (a0), Client Authentication, and Server Authentication. The key is 1024-bit RSA. As far as I can tell, there is nothing special about it.
Again, it works just fine today. It worked fine (most of) Friday. It didn't work starting some time Friday night 2/28 (UTC-7) and ending some time before Monday morning 3/2.
Hmm, so nothing likely to be an issue there.
Since the date has already passed, it's probably not viable to reproduce this directly, but as an experiment you could disable NTP, set the clock on both the server and pfSense to some time on the 29th, and try again. If it fails, I'd be curious to see what a packet capture of their exchange looks like.
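For an experiment like this, a small standalone probe run from a sandbox host with its clock rolled back might help separate "the certificate fails validation on that date" from "something specific to pfSense". The sketch below uses Python's standard `ssl` module rather than pfSense's PHP LDAP client, so it only approximates what pfSense does. The function names and the idea of passing the internal root CA as a PEM file are assumptions for illustration.

```python
import socket
import ssl


def make_context(cafile=None):
    """Build a client context that verifies the chain against `cafile`.

    With cafile=None the system trust store is used instead.
    """
    # create_default_context enables certificate and hostname verification.
    return ssl.create_default_context(cafile=cafile)


def probe(host, port=3269, cafile=None, timeout=5.0):
    """Attempt a TLS handshake and return a one-line result string."""
    ctx = make_context(cafile)
    try:
        with socket.create_connection((host, port), timeout=timeout) as raw:
            with ctx.wrap_socket(raw, server_hostname=host) as tls:
                cert = tls.getpeercert()
                return "ok: notBefore=%s notAfter=%s" % (
                    cert.get("notBefore"), cert.get("notAfter"))
    except ssl.SSLCertVerificationError as exc:
        # Raised when chain, date, or hostname validation fails.
        return "verify failed: %s (code %s)" % (
            exc.verify_message, exc.verify_code)
    except OSError as exc:
        # Covers refused connections, timeouts, and other TLS errors.
        return "connect/handshake failed: %s" % exc
```

Calling something like `probe("dc.example.internal", cafile="internal-root-ca.pem")` (both names are placeholders) once with the real clock and once with the clock set to the 29th would show whether the result changes with the date. One caveat: depending on the OpenSSL build's security level, a 1024-bit RSA key may be rejected as too weak, which would be unrelated to the date.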
How is the auth set up on pfSense? Is pfSense talking directly to the LDAP server? Or is it going through something else like FreeRADIUS first?
@jimp I intend to set up a sandbox to (try to) reproduce the problem. I'll report back with the results. Other than setting the clock back on all of the sandboxed hosts, is there anything that you think would be useful to change, watch, or capture?
I have a packet capture from the incident. The router connects to the server (TCP 3269) and establishes a TLS session. Immediately (123 usec) after establishing the TLS session, the client closes the TCP connection. There is no traffic between establishing the session and closing the connection.
The authentication server connects directly to Active Directory LDAP services on TCP:3269 (Global Catalog over TLS). It uses my organization's internal root CA certificate. I don't think there is anything interesting about the query settings, since we never actually got to that point during the failure, and authentication worked just fine when I switched to an unencrypted transport.
Maybe save/check the logs on both pfSense and the AD system to see what both say at the time.
Look in the packet capture with Wireshark, check the certificate and see what, if any, timestamps are shown and so on.
We had another brief instance today where pfSense stopped authenticating over LDAP, so I can rule out leap-day shenanigans. My best guess is that our virtual infrastructure is doing something funky during backups. I have no idea why that would cause a problem that persisted for hours last time but only a few minutes today, but I think it's safe to rule out pfSense.
Thanks for your help @jimp !