Gmail/Google services unresponsive
I'm using the latest pfSense in a multi-WAN setup: 4 gateways, each on its own interface and subnet, each passing traffic to a different ISP router. 4 gateways, 4 interfaces, 4 routers.
My main gateway is passing traffic to a 300mbps/30mbps router and from time to time we're having huge issues accessing Gmail/Google services (Gmail won't open, google.com won't open, etc.). All other websites load with no lag whatsoever. I've plugged a laptop directly into the ISP's router and no issues with Google so it's not the ISP at fault.
When the problem comes up, I've deactivated/activated the interface that passes traffic to the router and the problem immediately disappears (only to reappear the next day). I've swapped the interface for a new one and the problem is still popping up from time to time...
The 3 remaining gateways have no issues passing traffic to Google via their respective routers.
I'm not using load balancing and there's no Squid caching stuff, and DNS Forwarding is activated (all gateways are using DNS from the same provider - CleanBrowsing.org).
Any ideas where the problem could be coming from? Any setting I could check or change?
I've done PING and tracert thru pfsense to Google services when the problem comes up, but latency is fine in the 20-30ms range. All web services are working perfectly fine except Google.
Thanks for your input.
Thanks for the information.
How does it actually fail when it does? Can you get a packet capture of a failed connection? Do you see anything come back at all?
If you reload the firewall at Status > Filter Reload does that re-establish connectivity?
You mean capturing some packets off the LAN using Wireshark? Or does pfsense have anything to capture packets?
When the connection fails, the web browser displays a "timeout" page or the browser hangs indefinitely showing "connecting to ssl.gstatic.com..." in the bottom left-hand corner (gstatic.com belongs to Google).
I've contacted Google about this, but according to them, there's no issue on their side. Besides, my remaining 3 gateways have no issues, while when going thru the main gateway the browser keeps timing out or hangs forever...
I will try the "Filter Reload" next time this happens (if and when it does; it's completely random).
pfSense can capture that traffic directly:
The problem showed up again this morning... Gmail/Google services were unresponsive going thru our main gateway. All other traffic was fine.
I did the Filter Reload (twice) but that did not help.
What did fix the problem was deactivating/activating the interface (WAN1) that passes traffic to the ISP's router.
Before I did the off/on of the interface, I did a packet capture from my computer (my Gmail was unresponsive) to the pfsense LAN interface. Anything I should look for in the packet capture?
EDIT: problem has appeared again. This time I grabbed some packets being exchanged between Google and a computer opening a file on Google Drive. Google Drive was unresponsive.
There are some "TCP Retransmission" errors in packet capture, and client computer and Google server keep bouncing back and forth between TLS version 1.2 and 1.3.
Server keeps saying "server hello, change cipher spec".
Is the problem appearing because the client and server cannot negotiate an encryption cipher? Anyone?
When you bring the link down like that does it change the public IP?
Something changes if it then allows the traffic.
What happens if you clear the state table in pfSense?
Problem appeared this morning and I cleared the pfsense state table but Gmail remained unresponsive.
So I toggled the interface WAN1 off/on and that fixed the problem.
When I turn off WAN1, failover kicks in and traffic is routed thru WAN2 with a different IP. A few seconds later, I turn WAN1 back on, traffic is again routed thru WAN1 and Gmail works fine.
Bear in mind that WAN1 is passing traffic to a router with a static IP.
The only time the IP changes is those few seconds when I turn off WAN1 and failover kicks in passing traffic to WAN2.
So toggling WAN1 off/on does not change the public IP of WAN1. However, turning it off and turning it back on again seems to "reset" something else, which allows traffic to flow to Gmail.
Reloading the filter or clearing the state table does not change anything.
I did the following experiment: I set up a single workstation to send its traffic directly to the ISP's router (the one used by WAN1), however, pfsense was still doing DNS Forwarding for that workstation. Guess what happened? Gmail became unresponsive, just like on the workstations which route everything thru pfsense's WAN1.
So am I dealing with some sort of DNS problem?
Potentially. Though I would expect to see some sort of resolving error if so.
Try setting a different DNS server on the failing client directly and see if that corrects the issue.
Yeah, I'd expect the web browser to give me a resolving error, but that's not the case. The browser hangs or times out. It never gives me a "unable to resolve name" error. Weird.
But the workstation experiment is pretty indicative of a DNS problem, since all non-DNS traffic is sent directly to the ISP's router, and only DNS requests go thru pfsense. Hmm...
When I set a DNS server on the failing client directly thru the ISP's router, Gmail works fine. It's like plugging a laptop directly into the ISP's router, which I already did before.
On monday I will pass all DNS traffic directly to the ISP's router, and all non-DNS traffic thru pfsense. I'm curious what will happen.
Funny thing is, I have a second pfsense box in another part of town and no problems. Latest pfsense, the same ISP (different static IP), same DNS servers and no issues with Google/Gmail...
Which WAN is Unbound using to access the DNS servers? It will use the system default gateway unless you've set it to use something else.
It's possible the outgoing IP will return different DNS results causing a failure. If any of the WANs are over a VPN for example.
I haven't touched the default settings of the DNS Resolver service. For Network Interfaces it says "all", and for Outgoing Network Interfaces it also says "all".
Since this morning all clients are getting their DNS queries directly thru the ISP's router and all is well. All non-DNS traffic is still going thru pfsense. Let's see how long this lasts.
UPDATE: "happy monday" did not last long. Gmail just became unresponsive with all DNS queries going directly to the ISP's router.
So I started sending all DNS to a different ISP - did not help.
I also changed the DNS server to Google's 18.104.22.168 - did not help.
OK, I'm thinking now this is NOT a DNS issue.
I agree. It seems unlikely to be a DNS issue from that evidence.
Previously you speculated it might be an encryption negotiation issue. Do you see this problem on all clients? All browsers?
If it was something like that I would expect it to affect only something outdated.
I think the encryption negotiation idea leads nowhere. All my faulty clients are running latest Firefox or Chrome, so TLS cipher negotiation shouldn't be a problem. I was just thinking out loud...
I did another test which may indicate the ISP (or Google?) is at fault somehow after all.
I split up 40 faulty clients into 2 groups of 20, group A and B.
Both groups have DNS taken out of the equation, that is, their DNS queries go directly to the ISP's router and hit Google's 22.214.171.124.
As for non-DNS traffic, both groups are going thru pfsense, but:
Group A is going thru the "faulty" 300mbps cable internet, while Group B is going thru my backup ISP's 40mbps ADSL pipe.
And what happened was that Group A suffered from unresponsive Gmail, while Group B was completely fine.
I checked Group A for any lag, but Ping and tracert to Gmail came back fine, even though Gmail itself was unresponsive thru web browser.
I know Ping and tracert are UDP/ICMP, while Gmail is TCP with TLS overhead, but that obviously shouldn't cause Gmail to be unresponsive.
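Since ping and tracert only exercise ICMP, a probe that follows the same path as the browser (a TCP connection to port 443 followed by a TLS handshake) may be more telling when Gmail hangs. Below is a minimal sketch in Python using only the standard library. The function names `probe_tls` and `failure_rate`, the 5 second timeout, and the count of 20 probes are illustrative assumptions; `mail.google.com` is simply the host discussed in this thread.

```python
import socket
import ssl
import time


def probe_tls(host, port=443, timeout=5.0):
    """Time one TCP connect and one TLS handshake; network errors are recorded, not raised."""
    result = {"host": host, "ok": False, "connect_s": None,
              "tls_s": None, "tls_version": None, "error": None}
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout) as raw:
            result["connect_s"] = time.monotonic() - start
            context = ssl.create_default_context()
            tls_start = time.monotonic()
            with context.wrap_socket(raw, server_hostname=host) as tls:
                result["tls_s"] = time.monotonic() - tls_start
                result["tls_version"] = tls.version()
                result["ok"] = True
    except OSError as exc:  # timeouts, refused connections, DNS and TLS errors
        result["error"] = repr(exc)
    return result


def failure_rate(results):
    """Fraction of probes that did not complete the TLS handshake."""
    if not results:
        return 0.0
    failed = sum(1 for r in results if not r["ok"])
    return failed / len(results)


if __name__ == "__main__":
    results = []
    for _ in range(20):
        results.append(probe_tls("mail.google.com"))
        time.sleep(1)
    for r in results:
        if not r["ok"]:
            print("failed:", r["error"])
    print(f"{failure_rate(results):.0%} of {len(results)} probes failed")
```

Running it from a client in Group A and from one in Group B at the same time would show whether handshakes time out only on the cable path while ICMP still looks clean.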
So both groups are going thru pfsense, hitting different ISPs, and one group ends up being faulty, while the other is fine.
ISP at fault after all?
I've contacted the ISP a few times, but they told me everything is fine on their side, and they said to "check my firewall for port blocking"...
Yeah right, port blocking, when everything that is HTTPS/443 works fine except for Google services...
So tomorrow I will reprogram Group A to go directly to the ISP's router and I'll wait for Gmail to crap out again (or maybe it won't).
UPDATE: I sent traffic from Group A directly to the "faulty" ISP's router and user feedback was that Gmail/Drive was very slow or unresponsive...
So when I plug a single laptop into the ISP's router no problems, but a group of 20 clients "crashes" Google services? Very odd...
Hmm, that is strange. Triggering some blocking somewhere upstream? Something in the ISPs router maybe?
An anti DOS service or similar perhaps.
I've done some more packet tracing with software called WinMTR.
I've traced packets going to mail.google.com and I have as much as 10% packet loss at the last hop or second to last hop. Apparently the issue is on Google's end. They're dropping about 10% of our traffic, no wonder Gmail/Drive doesn't work properly. Ugh.
I've sent screenshots of the traces to Google. I wonder if and when they will find the reason for the 10% drops.
It seems pfsense is not the culprit. And WinMTR is your best friend when there's packet loss. :)
There is an MTR package for pfSense in case you weren't aware.
The webgui page for it runs for a limited number of packets but you can run it at the console with the usual options:
[2.4.5-RC][email@example.com]/root: mtr --help
Usage:
 mtr [options] hostname

 -F, --filename FILE        read hostname(s) from a file
 -4                         use IPv4 only
 -6                         use IPv6 only
 -u, --udp                  use UDP instead of ICMP echo
 -T, --tcp                  use TCP instead of ICMP echo
 -a, --address ADDRESS      bind the outgoing socket to ADDRESS
 -f, --first-ttl NUMBER     set what TTL to start
 -m, --max-ttl NUMBER       maximum number of hops
 -U, --max-unknown NUMBER   maximum unknown host
 -P, --port PORT            target port number for TCP, SCTP, or UDP
 -L, --localport LOCALPORT  source port number for UDP
 -s, --psize PACKETSIZE     set the packet size used for probing
 -B, --bitpattern NUMBER    set bit pattern to use in payload
 -i, --interval SECONDS     ICMP echo request interval
 -G, --gracetime SECONDS    number of seconds to wait for responses
 -Q, --tos NUMBER           type of service field in IP header
 -e, --mpls                 display information from ICMP extensions
 -Z, --timeout SECONDS      seconds to keep probe sockets open
 -r, --report               output using report mode
 -w, --report-wide          output wide report
 -c, --report-cycles COUNT  set the number of pings sent
 -j, --json                 output json
 -x, --xml                  output xml
 -C, --csv                  output comma separated values
 -l, --raw                  output raw format
 -p, --split                split output
 -t, --curses               use curses terminal interface
     --displaymode MODE     select initial display mode
 -n, --no-dns               do not resove host names
 -b, --show-ips             show IP numbers and host names
 -o, --order FIELDS         select output fields
 -y, --ipinfo NUMBER        select IP information in output
 -z, --aslookup             display AS number
 -h, --help                 display this help and exit
 -v, --version              output version information and exit

See the 'man 8 mtr' for details.
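Because the failing traffic is HTTPS, a TCP trace to port 443 in report mode may be closer to what the browser experiences than the default ICMP probes. The sketch below only assembles a command line from flags that appear in the help output above (`-n`, `-r`, `-c`, `-T`, `-P`). The helper name `build_mtr_command` and the default of 100 cycles are illustrative assumptions, not anything pfSense prescribes.

```python
import shlex


def build_mtr_command(host, port=443, cycles=100, use_tcp=True):
    """Assemble an mtr report-mode command as an argument list."""
    cmd = ["mtr", "-n", "-r", "-c", str(cycles)]  # numeric output, report mode
    if use_tcp:
        cmd += ["-T", "-P", str(port)]  # TCP probes to the given port
    cmd.append(host)
    return cmd


if __name__ == "__main__":
    # Print a copy-and-paste command for the console.
    print(shlex.join(build_mtr_command("mail.google.com")))
```

Running it prints `mtr -n -r -c 100 -T -P 443 mail.google.com`, which can be pasted at the console while the problem is occurring.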
Gertjan last edited by Gertjan
Just an idea ...
I saw a YouTube presentation by a well-known guy who uses the free Gmail/Google cloud disk space as a huge free remote NAS backup.
They even combined several Google accounts together, to make it even bigger.
They discovered that Google will limit you when you (that is, a user on a given IP) upload more than xxxxxxxxxxx bytes a day.
I'll post that video here when my memory comes back. .. edit : https://www.youtube.com/watch?v=y2F0wjoKEhg
Could this explain what you are seeing ?
For example: is your cable WAN IP shared with others? If so, it may not even be you doing the uploading ...
UPDATE: Google engineer has replied saying that after analyzing my WinMTR results and "extensive research" he believes the problem is on the ISP's side...
How can it be that WinMTR is showing packet loss on the very last hop of the trace, and Google says it's the ISP?
Besides, I also have "TCP Retransmission errors" in my packet captures which point to Google's IP.
No packet loss according to WinMTR from any of the in-between hops. Google's conclusion makes no sense to me.
If it were an ISP problem there should be packet loss at the in-between hops, right or wrong?
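A common rule of thumb for reading MTR-style output (stated here as an assumption, not something Google or the ISP said in this thread): loss at a hop points to a real path problem only if it persists through every later hop up to the destination. Loss that appears at one middle hop and then vanishes is usually just a router de-prioritizing its replies to probes. By that rule, loss seen only on the last hop means probes to the destination are being dropped, but it does not say on whose side of the final link. A sketch with made-up loss figures; the function name and the 1% threshold are hypothetical:

```python
def first_persistent_loss_hop(loss_percent, threshold=1.0):
    """Return the 1-based hop where loss starts and then persists
    through the final hop, or None if no such hop exists."""
    first = None
    for index, loss in enumerate(loss_percent, start=1):
        if loss >= threshold:
            if first is None:
                first = index  # candidate start of persistent loss
        else:
            first = None  # loss did not carry on, so discard the candidate
    return first


if __name__ == "__main__":
    # Loss only at the last hop, like the WinMTR traces described above.
    print(first_persistent_loss_hop([0, 0, 0, 0, 10]))   # 5
    # Loss at one middle hop that does not persist.
    print(first_persistent_loss_hop([0, 20, 0, 0, 0]))   # None
    # Loss that starts mid-path and continues to the destination.
    print(first_persistent_loss_hop([0, 0, 5, 6, 10]))   # 3
```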
Man, this problem is going to drive me mad...
I've also checked the DNS logs over at my DNS provider cleanbrowsing.org
I don't know why, but about 16% of all DNS queries were for AAAA records. AAAA records are IPv6 host records, but the ISP told me IPv6 is turned off. Besides, my pfSense is not allowing IPv6.
I have no idea why these AAAA are popping up in the DNS logs... And quite a few are AAAA records belonging to Google. Hmm...
Gertjan last edited by
"... very last hop of the trace, and Google says it's the ISP?"
ISPs and major content providers like Google have an ongoing dispute about who pays for the POP between them.
Or that link is simply too small right now ...
AAAA: most if not all Google apps and services you use on your devices prefer IPv6, and thus resolve AAAA first.
Only later do they find out that an IPv6 connection can't be opened, so they fall back to A (IPv4).
If you have any IPv6 connectivity at all but not full connectivity that can really bork stuff.
I have seen sites appear to fail because clients think they can connect over v6 but cannot. Triple check that!
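One way to check for partial IPv6 from an affected client is to resolve the host for both address families and then try a TCP connection to each address. A minimal sketch using only the Python standard library; the function names, the 3 second timeout, and the wording of the verdicts are illustrative assumptions, and what the resolver returns will depend on the client's OS and DNS settings.

```python
import socket


def resolve_by_family(host, port=443):
    """Return the addresses the local resolver gives for each family."""
    found = {"IPv4": [], "IPv6": []}
    try:
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror:
        return found
    labels = {socket.AF_INET: "IPv4", socket.AF_INET6: "IPv6"}
    for family, _type, _proto, _canon, sockaddr in infos:
        label = labels.get(family)
        if label and sockaddr[0] not in found[label]:
            found[label].append(sockaddr[0])
    return found


def can_connect(address, port=443, timeout=3.0):
    """True if a TCP connection to address:port opens within the timeout."""
    try:
        with socket.create_connection((address, port), timeout=timeout):
            return True
    except OSError:
        return False


def diagnose(reachable):
    """reachable maps "IPv4"/"IPv6" to a list of booleans, one per address tried."""
    v4_ok = any(reachable.get("IPv4", []))
    v6_tried = bool(reachable.get("IPv6"))
    v6_ok = any(reachable.get("IPv6", []))
    if not v6_tried:
        return "no AAAA records returned"
    if v6_ok:
        return "IPv6 reachable"
    if v4_ok:
        return "partial IPv6: AAAA resolves but only IPv4 connects"
    return "neither family reachable"


if __name__ == "__main__":
    addresses = resolve_by_family("mail.google.com")
    reachable = {fam: [can_connect(a) for a in addrs]
                 for fam, addrs in addresses.items()}
    print(addresses)
    print(diagnose(reachable))
```

A "partial IPv6" verdict on a client behind the failing WAN would match the situation described above, where clients believe they can use v6 but the connections never open.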