[Bug?] pfSense empirically causing legacy WordPress sites to fail
-
Are you up for a remote session to see if anything sticks out in the config?
-
@johnpoz So with TLS disabled to keep everything nice and simple.
Client side
Client established
- 212.159.x.x establishes session to 81.150.196.82
- 81.150.196.82 SYN ACK's
- Client sends GET to 81.150.196.82 with correct host header
- 81.150.196.82 and client exchange TCP keep-alive requests every 45 seconds
- Until after about 300-400 seconds the HTTPd watchdog kills the request and IIS sends the error page triggering a FIN ACK
Server Side
- Client 212.159.x.x TCP session is SYN ACK'd with httpd 172.16.1.1
- HTTP GET is received from 212.159.x.x
- MySQL transactions fire and conclude successfully with RDBMS server IP
- 172.16.1.1 negotiates congestion notification with 81.150.196.82 and fires off a WP cron job to its own host header - WordPress does this
- More database IO for the cron job
- WP CRON job is HTTP 200'd
- Every 45 seconds the TCP keep-alive triggers and is ACK'd
- A couple more CRON requests are fired by the thread
- Until 300 seconds when 172.16.1.1 sends to 212.159.x.x the HTTP 500 timeout after which it is FIN ACK'd
- On receiving the error page, the browser asks for the favicon and WordPress responds with its programmed PNG file (at some non-standard arbitrary storage location)
IIS logs the transaction with the successful CRON jobs in the 5 minutes preceding the 500, and the two favicon requests:
2022-05-31 09:47:55 172.16.1.1 POST /wp-cron.php doing_wp_cron=1653990475.2987639904022216796875 80 - 172.16.1.2 WordPress/6.0;+http://www.businesscoachspecialist.co.uk http://www.businesscoachspecialist.co.uk/wp-cron.php?doing_wp_cron=1653990475.2987639904022216796875 200 0 0 537
2022-05-31 09:47:55 172.16.1.1 GET /wp-content/uploads/2016/06/comodo_secure_seal_113x59_transp-2.png - 80 - 172.16.1.2 - - 200 0 0 2
2022-05-31 09:48:46 172.16.1.1 POST /wp-cron.php doing_wp_cron=1653990526.1701300144195556640625 80 - 172.16.1.2 WordPress/6.0;+http://www.businesscoachspecialist.co.uk http://www.businesscoachspecialist.co.uk/wp-cron.php?doing_wp_cron=1653990526.1701300144195556640625 200 0 0 555
2022-05-31 09:48:46 172.16.1.1 GET /robots.txt - 80 - 34.76.25.117 DnBCrawler-Analytics - 200 0 0 1069
2022-05-31 09:48:46 172.16.1.1 GET /wp-content/uploads/2016/06/comodo_secure_seal_113x59_transp-2.png - 80 - 172.16.1.2 - - 200 0 0 1
2022-05-31 09:48:56 172.16.1.1 GET /wp-content/uploads/2017/10/BusinessCoach.jpg - 80 - 172.16.1.2 - - 200 0 0 2
2022-05-31 09:49:47 172.16.1.1 GET /wp-content/uploads/2017/10/BusinessCoach.jpg - 80 - 172.16.1.2 - - 200 0 0 9
2022-05-31 09:49:56 172.16.1.1 GET /wp-content/uploads/2017/10/SalesTraining.jpg - 80 - 172.16.1.2 - - 200 0 0 2
2022-05-31 09:50:47 172.16.1.1 GET /wp-content/uploads/2017/10/SalesTraining.jpg - 80 - 172.16.1.2 - - 200 0 0 4
2022-05-31 09:50:49 172.16.1.1 POST /wp-cron.php doing_wp_cron=1653990648.4238090515136718750000 80 - 172.16.1.2 WordPress/6.0;+http://www.businesscoachspecialist.co.uk http://www.businesscoachspecialist.co.uk/wp-cron.php?doing_wp_cron=1653990648.4238090515136718750000 200 0 0 574
2022-05-31 09:50:49 172.16.1.1 GET /wp-content/uploads/2016/06/comodo_secure_seal_113x59_transp-2.png - 80 - 172.16.1.2 - - 200 0 0 1
2022-05-31 09:50:56 172.16.1.1 GET /wp-content/uploads/2017/10/StaffDevelopment.jpg - 80 - 172.16.1.2 - - 200 0 0 1
2022-05-31 09:51:47 172.16.1.1 GET /wp-content/uploads/2017/10/StaffDevelopment.jpg - 80 - 172.16.1.2 - - 200 0 0 2
2022-05-31 09:51:49 172.16.1.1 GET /wp-content/uploads/2017/10/BusinessCoach.jpg - 80 - 172.16.1.2 - - 200 0 0 2
2022-05-31 09:51:56 172.16.1.1 GET /wp-content/uploads/2016/06/FreeCoachingSession.jpg - 80 - 172.16.1.2 - - 200 0 0 2
2022-05-31 09:52:18 172.16.1.1 POST /wp-cron.php doing_wp_cron=1653990737.7030880451202392578125 80 - 172.16.1.2 WordPress/6.0;+http://www.businesscoachspecialist.co.uk http://www.businesscoachspecialist.co.uk/wp-cron.php?doing_wp_cron=1653990737.7030880451202392578125 200 0 0 570
2022-05-31 09:52:18 172.16.1.1 GET /wp-content/uploads/2016/06/comodo_secure_seal_113x59_transp-2.png - 80 - 172.16.1.2 - - 200 0 0 0
2022-05-31 09:52:47 172.16.1.1 GET /wp-content/uploads/2016/06/FreeCoachingSession.jpg - 80 - 172.16.1.2 - - 200 0 0 2
2022-05-31 09:52:49 172.16.1.1 GET /wp-content/uploads/2017/10/SalesTraining.jpg - 80 - 172.16.1.2 - - 200 0 0 4
2022-05-31 09:52:54 172.16.1.1 GET / - 80 - 212.159.x.x Mozilla/5.0+(Windows+NT+10.0;+Win64;+x64)+AppleWebKit/537.36+(KHTML,+like+Gecko)+Chrome/102.0.5005.61+Safari/537.36 - 500 0 258 300143
2022-05-31 09:52:54 172.16.1.1 GET /favicon.ico - 80 - 212.159.x.x Mozilla/5.0+(Windows+NT+10.0;+Win64;+x64)+AppleWebKit/537.36+(KHTML,+like+Gecko)+Chrome/102.0.5005.61+Safari/537.36 http://www.businesscoachspecialist.co.uk/ 302 0 0 512
2022-05-31 09:52:54 172.16.1.1 GET /wp-includes/images/w-logo-blue-white-bg.png - 80 - 212.159.x.x Mozilla/5.0+(Windows+NT+10.0;+Win64;+x64)+AppleWebKit/537.36+(KHTML,+like+Gecko)+Chrome/102.0.5005.61+Safari/537.36 http://www.businesscoachspecialist.co.uk/ 200 0 0 16
The highlighted time-taken on the GET / request being 300143 ms, i.e. 300.143 seconds.
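To pick the stalled request out of a longer log without reading every line, something like the following could be used. It is only a sketch: it assumes the log uses the default IIS W3C field order (the real order is declared in the `#Fields:` header, which is not shown in the post), and the names `parse_line` and `slow_requests` are made up for illustration.

```python
from typing import NamedTuple, Optional

# Assumed field order (default IIS W3C selection). Check the "#Fields:"
# header of the actual log file before relying on this.
FIELDS = [
    "date", "time", "s_ip", "cs_method", "cs_uri_stem", "cs_uri_query",
    "s_port", "cs_username", "c_ip", "cs_user_agent", "cs_referer",
    "sc_status", "sc_substatus", "sc_win32_status", "time_taken",
]


class Entry(NamedTuple):
    timestamp: str
    method: str
    uri: str
    client_ip: str
    status: int
    seconds: float


def parse_line(line: str) -> Optional[Entry]:
    """Parse one W3C log line; return None for comments or malformed lines."""
    if not line.strip() or line.startswith("#"):
        return None
    parts = line.split()
    if len(parts) != len(FIELDS):
        return None
    row = dict(zip(FIELDS, parts))
    return Entry(
        timestamp=f"{row['date']} {row['time']}",
        method=row["cs_method"],
        uri=row["cs_uri_stem"],
        client_ip=row["c_ip"],
        status=int(row["sc_status"]),
        seconds=int(row["time_taken"]) / 1000.0,  # time-taken is logged in ms
    )


def slow_requests(lines, threshold_seconds=30.0):
    """Return the parsed entries whose time-taken exceeds the threshold."""
    entries = (parse_line(line) for line in lines)
    return [e for e in entries if e is not None and e.seconds > threshold_seconds]
```

Fed the lines above, only the `GET /` from 212.159.x.x should come back, since every other request completes in well under a second.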
I repeated it with WP-CRON disabled, no difference other than it not firing that traffic.
So all that tells us what we already knew: the request is getting black holed in the httpd worker process. The question is why does the guys over there swapping the router fix it!
-
@cool_corona The admin over there is not comfortable with that at the moment. They're happy to share screenshots and cli output.
-
@c-amie I understand. But it will take 5 mins to solve if anything is misconfigured or something has been overlooked.
It's up to you. Catch me on PM if you need assistance.
-
@cool_corona Thanks, I'll let the team know that your offer stands.
Cheers and have a good day.
-
@c-amie said in [Bug?] pfSense empirically causing legacy WordPress sites to fail:
The question is why does the guys over there swapping the router fix it!
No idea - but that sure isn't pfsense doing something to the packets..
There isn't anything to do with the packets, and if pfsense was doing something to them or not doing something - why would any other site work?
-
@johnpoz Quite! That's why I reached out to see if anyone else had any bright ideas. It isn't making any sense to me either.
-
@johnpoz Okay, having burned the entire day on this now, I've half worked it out.
It does seem to be a NAT reflection problem that is not present on the DrayTek.
The host has its own public DNS on it, on the same server 172.16.1.1
All TCP/UDP 53 traffic from 172.16.1.1 goes out via 81.150.196.83
All TCP/UDP 53 traffic to 81.150.196.83 goes to 172.16.1.1
All other traffic from 172.16.1.1 goes out via 81.150.196.82
All TCP 80/443 traffic to 81.150.196.82 goes to 172.16.1.1
I surmise (note I am assuming) that there is a DNS request for itself being made in the code in the template file. This is dying 90% of the time.
If I stop the public DNSd service and set the hosts file to resolve the site's FQDN to 172.16.1.1 everything magically works both internally and externally. The second I clear the entry in the hosts file, everything goes back to grinding to a halt again.
There are no issues with DNS resolution on the server via the CLI when the hosts file is disabled. Stick the DrayTek back in and the problem ceases.
As sheer speculation it might appear that the public .83 and .82 addresses cannot talk to each other for whatever reason.
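One way to test that speculation from the web server itself would be to time a name lookup and try opening a TCP connection to the .83 address on port 53. The sketch below uses only the Python standard library. The function names are hypothetical, and it only checks TCP (most DNS queries go over UDP), so a success here would not rule the theory out.

```python
import socket
import time


def timed_lookup(hostname, resolver=socket.gethostbyname):
    """Resolve hostname and return (address or None, elapsed seconds)."""
    start = time.monotonic()
    try:
        address = resolver(hostname)
    except OSError:  # socket.gaierror and timeouts are OSError subclasses
        address = None
    return address, time.monotonic() - start


def tcp_reachable(host, port, timeout=3.0, connect=socket.create_connection):
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with connect((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Run on 172.16.1.1 with the hosts entry removed, `timed_lookup("www.businesscoachspecialist.co.uk")` taking many seconds, or `tcp_reachable("81.150.196.83", 53)` returning `False` while the same check succeeds from outside, would point at the reflected path rather than at the DNS service.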
-
@c-amie said in [Bug?] pfSense empirically causing legacy WordPress sites to fail:
that the public .83 and .82 addresses cannot talk to each other for whatever reason.
Yeah that would be a nat reflection problem - which not to say I told you, but I did bring that up ;)
If a client resolves something to the public IP, it would have to be reflected back in.. While you can set up nat reflection in pfsense - it's a hack if you ask me.. And only reason would be if something is hard coded to an IP and can not be changed.. If it's using fqdn, then host override to point to the local IP would be better solution - why send traffic to pfsense, just for pfsense to send it back - when the server is right there local anyway.
Other reason you would have to use nat reflection, if device behind pfsense has to use public dns, and no way for it to use local, so there is no way to put in a record to resolve whatever fqdn to the local IP vs the public IP.
So I take it you're all sorted now?
-
@johnpoz You did :) And so did I; I've just not had a handle on that 'what' until now.
While I support your reasoning, the necessity to modify the hosts file and rein in the DNS server lookup is in itself a messy hack that needs to be institutionally remembered and maintained. I cannot control the actions of third party code, nor have any desire to modify it.
If the code has to self-reference its own FQDN and the DNS server is inside the NAT envelope, then it is an application layer decision. I cannot put another line and router in in order to stop the necessity for NAT reflection.
So while it is hacked to a working state, it is not sorted. The need to modify hosts files for all affected tenants is a manual chore that does not scale with host migration. As the issue is not present on another manufacturer's device, I would stand by my initial assertion that this is looking like a bug in the pfSense implementation. If DrayTek can fix it, then I would have every confidence that the folks at Netgate must be able to derive a solution should there be willingness.
-
@c-amie https://docs.netgate.com/pfsense/en/latest/nat/reflection.html#internal-dns-servers
-
@heper said in [Bug?] pfSense empirically causing legacy WordPress sites to fail:
https://docs.netgate.com/pfsense/en/latest/nat/reflection.html#internal-dns-servers
The system is using internal DNS servers. Web server queries caching internal DNS servers, internal DNS servers query root servers. This is related to the public DNS so it is flowing:
Web server (172.16.1.1) > internal DNS (172.16.1.x) > root DNS system > public DNS server (.83)
Whatever the PHP code is doing, it is forcing a DNS re-query and then trying to query from the web server to the Internet and then back to the DNS server inside the NAT envelope on a different public IP address.
-
@c-amie said in [Bug?] pfSense empirically causing legacy WordPress sites to fail:
internal DNS (172.16.1.x)
Put a record here for what you want it to resolve to, 172.16.1.y for example.
-
@johnpoz Yes, that will of course work, and it reduces the replication needed to that between DNS groups rather than every hosts file on every httpd host. However, it is still a hack to workaround a bug.
-
@c-amie said in [Bug?] pfSense empirically causing legacy WordPress sites to fail:
However, it is still a hack to workaround a bug.
No, it's resolving to what is local when you're local, and public when you're public.
Go ahead do nat reflection if you like processing traffic over your firewall for zero reason.
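The "local when you're local, public when you're public" idea (split-horizon DNS) can be sketched as a lookup that picks its answer from the client's source address. This is purely illustrative and not how pfSense or any particular DNS server implements it. The /24 LAN range is an assumption, and the record reuses addresses mentioned earlier in the thread.

```python
import ipaddress

# Assumption for illustration: the internal network is a /24.
INTERNAL_NET = ipaddress.ip_network("172.16.1.0/24")

# Hypothetical record table: one name, two views.
RECORDS = {
    "www.businesscoachspecialist.co.uk": {
        "internal": "172.16.1.1",
        "public": "81.150.196.82",
    },
}


def resolve(name, client_ip):
    """Answer with the internal address for LAN clients, public otherwise."""
    record = RECORDS.get(name.lower().rstrip("."))
    if record is None:
        return None
    is_internal = ipaddress.ip_address(client_ip) in INTERNAL_NET
    return record["internal" if is_internal else "public"]
```

With this table, a query from 172.16.1.2 would get 172.16.1.1 and never touch the firewall, while a query from an outside address would get 81.150.196.82 as before.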
-
@johnpoz I hear what you are saying johnpoz, but the reality of the situation is that you cannot expect tenants to know that they need to contact support and ask for the setup of split horizon DNS as a means to mask a functionality discrepancy. They see that their site doesn't work and that's that. We have no control over what code internal to someone's WordPress site is doing.
It's the old adage that a developer will never encounter a bug in their own code, but give it to an inexperienced end user for half an hour and you can guarantee that they'll break it.
In an ideal world, of course, it would be preferred if we didn't need it to double hit the firewall. However the way this stack is currently architected - rightly or wrongly - means that there is an issue. It has been architected like this for many, many years because it has worked fine; and yes, it could be architected better. No arguments from us. The point though isn't that it could be redesigned to work around the problem. The point is that there is a problem in the first place. We merely seek to report it to the community and note that other vendors have overcome the issue.
It is up to far cleverer people than I as to whether it is practical, possible or prudent to do anything about it - or even if anyone cares.
Having spent a few man days looking into it, we've done our bit by letting people in the community know about our issue in good faith and subsequently what we did to hack it back into a working state.
Our thanks to you and the others who have posted for your comments and insights.
-
@c-amie said in [Bug?] pfSense empirically causing legacy WordPress sites to fail:
We have no control over what code internal to someone's WordPress site is doing.
Valid point.. One way to fix it would be to put the servers actually on a public IP via routed network behind pfsense.