[Bug?] pfSense empirically causing legacy WordPress sites to fail
-
You will baulk at this one, because I did too. However, all the evidence so far supports the unlikely conclusion that the presence of pfSense is causing a small number of WordPress sites to fail.
In 2021 we had to have a telco change at a site, so we drained the site's server farms over to a different site 100 miles away. On completion, a few - but not all - WordPress instances immediately started playing up.
Page load requests would take up to 5 minutes to execute. PHP timeouts were of course a regular occurrence, but the httpd worker processes would just sit there doing absolutely nothing whatsoever until the request either eventually succeeded for no apparent reason or failed outright when the httpd scavenged the memory.
IIS logs would indicate that the requests were active with no issues. We were unable to get to the bottom of it in the 3-4 days that the main site was down, and the problems stopped instantly when we moved everything back after the line works.
At the time we put it down to old WordPress code and something to do with our nested virtualisation stack. As the migration was done, we were not in a position to troubleshoot any further.
The offlined site was running DrayTek kit, the temporary site pfSense (current to the time).
Flash forward to a couple of weeks ago and the site that had the line works replaced its DrayTek hardware with pfSense (current to today).
The same WordPress sites instantly fell back to the same non-working states. Exact same symptoms.
The guys over there have done a fail back to the DrayTek this afternoon and (as unlikely as this feels) the WordPress sites instantaneously started working again just fine.
The kit is all properly capacity planned, with no architectural problems. Router RAM and CPU are idle. HTTPD RAM and CPUs are idle while waiting for these requests to complete. There are no issues on any other services. It doesn't impact all WordPress installs, just certain ones that all tend to be with tenants who are using old, bloated theme code.
pfSense configs are vanilla; no IDS, extensions or add-on services are installed. If they were, it would be easily explainable. They are simply vanilla installs.
I was wondering if anyone might have any thoughts or ideas towards resolving this?
Thanks,
-
@c-amie said in [Bug?] pfSense empirically causing legacy WordPress sites to fail:
I was wondering if anyone might have any thoughts or ideas towards resolving this?
How about posting up the FQDN of one of these sites, so we can test it going through our pfSense and not via, say, a cell connection on a phone. Then we could do sniffs and see what might be going on.
MTU issues would be one thing that "could" cause issues.
I would suggest a sniff (packet capture); it might give some insight into what is actually going on. pfSense has zero clue whether it is a WordPress site or google.com etc. It passes packets.
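On the MTU point, here is a rough sketch of the arithmetic. It assumes standard 20-byte IPv4 and TCP headers with no options and the usual 8-byte PPPoE overhead. The thread does not give the ISP's actual PPPoE MTU, so 1492 is only the common default and is an assumption; the function name is made up for illustration.

```python
# Illustrative arithmetic only; header sizes assume IPv4 and TCP without options.
IPV4_HEADER = 20
TCP_HEADER = 20
PPPOE_OVERHEAD = 8  # PPPoE header (6) + PPP protocol ID (2) inside a 1500-byte Ethernet payload


def max_tcp_payload(mtu: int) -> int:
    """Largest TCP segment payload (MSS) that fits in one packet of the given MTU."""
    return mtu - IPV4_HEADER - TCP_HEADER


lan_mtu = 1500
pppoe_mtu = lan_mtu - PPPOE_OVERHEAD  # assumed common default, not taken from the thread

print(max_tcp_payload(lan_mtu))    # 1460
print(max_tcp_payload(pppoe_mtu))  # 1452
```

If a server on a 1500-byte LAN sends full-size segments with the don't-fragment bit set, and the "fragmentation needed" ICMP messages never make it back, large responses can stall while small ones get through. A capture on both sides of the router would show whether that is happening.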
-
@johnpoz Thank-you for your reply, much appreciated.
Doubt aside, MTU was my first guess too, but there doesn't seem to be anything up with it. It certainly feels as though it is an issue framed in an application layer context. So far we have torn one of the sites to pieces - MySQL versions, PHP versions 5.6.40 - 7.4.29, WordPress versions 5.4.0 - 6.0, plugins. The only two things that fix it are removing the themes and styling engines from the sites (nuking the client site) and the shift back to the DrayTek.
The site is currently still running on the DrayTek, I will post back with all of the requested information as soon as they have swapped back to pfSense.
Cheers,
-
@c-amie so you're saying that a user out on the internet has issues getting to this when the site hosting the servers has pfSense as its router.
So if I went to the site now it would be fine, but if you put in pfSense, then my connection would not be fine.
So you are port forwarding from your WAN to some RFC 1918 address? Or do you have these servers hosting the sites on their own routed public space through pfSense?
Or you have some reverse proxy you're running through. Again, pfSense routes packets; it has no care what these are or what the site is, be it WordPress, Drupal or Joomla, or just some site on Apache, IIS, nginx, lighttpd, etc.
I have no idea how your network is set up. Maybe you're asymmetrical and your DrayTek doesn't care, but pfSense, being a stateful firewall, does.
Could you post up a URL so I could check how it works when you say it's fine, and then, when you put pfSense back and it's not fine, see the difference in the browser via the web tools: what might be loading or not loading, or where the actual delays are.
-
@johnpoz Thanks for your reply. Yes, that's exactly what is happening.
The sites are port forwarding to an IIS server on both sites. No reverse proxy.
The url is https://www.businesscoachspecialist.co.uk. The site should currently be working just fine. Let me know when and we can swop back to pfsense.
-
@c-amie Loads quickly here :)
-
@c-amie site loads very quickly here, I snagged a profile with firefox web developer. and I grabbed a pcap..
So sure if you want to move it over to pfsense, can see if spot any differences.. You don't answer ping..
I'm getting a 404 for this jpg
https://www.businesscoachspecialist.co.uk/wp-content/uploads/2016/11/row2.jpg
And your SSL score is only a B because you are still supporting TLS 1.1
-
@johnpoz Thanks John. I have swopped back to the pfsense router and site will no longer load.....
We've been ripping this site to pieces all day trying to troubleshoot this, it is currently running on a backup restored version from over a year ago. Things are very down-level at the moment.
-
@c-amie well, it just stops. I get a completed handshake for SSL, but then just don't get an answer back. A new test on SSL Labs shows no answer to HTTP:
HTTP request to this server failed, see below for details.
There is clearly something wrong - but again, pfSense doesn't care, it's just packets. It's passing some - I can get the handshake completed, etc.
So again - what is doing the SSL? Are you offloading this, or do you have some reverse proxy set up?
I did get this back
So again, pfSense just passes packets. You're going to have to give us some more details of your setup. You have this in the NAT section - how are you doing NAT? If the server is doing the SSL, the successful HTTPS handshake we get happens after the NAT.
Is your server trying to load stuff from a different server? How is it accessing that - is that something pfSense has to allow? Without more details of the overall network setup, there is not much help we can give. But clearly pfSense passed the 443 to something that did the handshake - and then, from that server's 500 error, it had a problem doing something.
You stated you're not using any packages - so I take it pfSense is not doing SSL offload via HAProxy, etc. What else changes in your network when you swap out the DrayTek for pfSense?
-
That is what we get internally as well as nationally. Every now and then it will load just fine, but it takes many minutes.
Equally, I understand your frustration. I spent most of the morning saying the router couldn't have anything to do with it, yet they swapped it to eliminate it and there was the proof. Instant restoration of service.
Nothing is jumping out in the pfSense logs. There are no PPPoE errors, signs of dropped frames or CRC issues being recorded etc.
That node is isolated now; it's IIS on Server 2019. WAN has a /26 on it with a PAT forward for 443 and 80 to the LAN IP, and an outbound NAT affinity back to its public IP, ending .82 in this case.
WAN performance benchmarks are consistent between both setups.
It doesn't do it on any other site currently on the HTTPd running on the same network, the same VIP. There is no offloading, no reverse proxy, no LBL and no IDS. Security software on the httpd is consistent.
The only other machine involved in this transaction on this test setup is the RDBMS which is on the same subnet connected via an internal hypervisor core switch. The traffic never leaves the hypervisor.
No other websites on this particular httpd are having issues currently. Same RDBMS, same WordPress versions; the only difference is the theme file and its application layer rendering stack. If we disable HTTPS enforcement, the same issue occurs over plain old HTTP.
MTUs from the HTTPd to the router are straight 1500. PPPoE MTUs are as per the ISP.
The swap itself is a physical LAN disconnect: switch the media converter into the DrayTek and patch the DrayTek back in to the same port in the distribution layer switch.
The setup in the failover site last year was more complicated as it involved more VLANs, more firewalls and virtualised hypervisors... but fundamentally it was the same sort of setup showing the same issues for the exact same https sites.
We very much appreciate your thoughts. Please be aware that we are in the UK and it is nearly 10pm, so we will be signing off.
Cheers,
-
@c-amie the pfsense webgui isn't running on port 80/443 right?
-
@c-amie not buying pfSense as the cause - some sort of configuration, NAT reflection? You mention it does access another server - well, the error it presents is clearly saying it cannot access something.
the only difference is the theme file and its application layer rendering stack.
And where is that? How would pfSense have any clue about any of that inside an SSL tunnel? Where are those files - how are they accessed: a hard-coded local IP, or an FQDN and then via NAT reflection?
No other websites on this particular httpd are having issues currently
So pfSense dicks with traffic somehow inside an SSL tunnel, but only for this 1 site - how does that make any sense? You need to look at the details of what could be different in the setup. NAT reflection comes to mind - because pfSense sure wouldn't do that without configuration.
-
@heper Correct
-
@johnpoz NAT reflection would be pfSense as the cause, as pfSense is handling the NAT reflection. However, neither you nor I are accessing it via NAT reflection; we are both external to the site.
No, the timeout you got isn't saying it cannot access something. The PHP worker process idles for too long so the service watchdog kills it. If you offline the RDBMS the error is instant. If you keep retrying, in our experience it will eventually load.
Files are on local disk. Everything has been simplified in troubleshooting. When I said the only external dependency was the RDBMS, I was accurate. There is no iSCSI, NFS, SAN involved.
As said, it doesn't have to be an SSL tunnel. I do not think that it is a stateful packet inspection problem. It 'feels' more like MTU, repeated failed retransmission or more succinctly like an asymmetric route that is failing just for HTTP responses from that site. Something in the pfSense config is causing it to get chewed, I agree. The question is what? Does anyone have any config change ideas?
-
@c-amie said in [Bug?] pfSense empirically causing legacy WordPress sites to fail:
repeated failed retransmission or more succinctly like an asymmetric route that is failing just for HTTP responses from that site
Well do a sniff on pfsense then, both on the wan side and the lan side - what do you see.
But you stated no other sites having any issues - what is different about them?
No other websites on this particular httpd are having issues currently.
-
@johnpoz Theme and rendering stack
-
Are you up for a remote session to see if anything sticks out in the config?
-
@johnpoz So with TLS disabled to keep everything nice and simple.
Client side
Client established
- 212.159.x.x establishes session to 81.150.196.82
- 81.150.196.82 SYN ACKs
- Client sends GET to 81.150.196.82 with correct host header
- 81.150.196.82 and client exchange TCP keep-alive requests every 45 seconds
- Until after about 300-400 seconds the HTTPd watchdog kills the request and IIS sends the error page triggering a FIN ACK
Server Side
- Client 212.159.x.x TCP session is SYN ACK'd with httpd 172.16.1.1
- HTTP GET is received from 212.159.x.x
- MySQL transactions fire and conclude successfully with RDBMS server IP
- 172.16.1.1 negotiates congestion notification with 81.150.196.82 and fires off a WP cron job to its own host header - WordPress does this
- More database IO for the cron job
- WP CRON job is HTTP 200'd
- Every 45 seconds the TCP keep-alive triggers and is ACK'd
- A couple more CRON requests are fired by the thread
- Until 300 seconds, when 172.16.1.1 sends 212.159.x.x the HTTP 500 timeout, after which it is FIN ACK'd
- On receiving the error page, the browser asks for the favicon and wordpress responds with its programmed PNG file (at some non-standard arbitrary storage location)
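The cron request in the server-side trace goes to the site's own host header. Whether a self-request like that stays on the box or has to go out via the router depends on what the name resolves to from the server itself. A minimal sketch for checking that, using only the two addresses quoted in the trace; the function names are made up for illustration and `resolve_all` would need to be run on the web server to mean anything:

```python
import ipaddress
import socket


def classify(address):
    """Label an IP address as 'private' or 'public' using the stdlib's definition."""
    return "private" if ipaddress.ip_address(address).is_private else "public"


def resolve_all(hostname):
    """Return the sorted set of addresses a hostname resolves to from this machine."""
    infos = socket.getaddrinfo(hostname, 80, proto=socket.IPPROTO_TCP)
    return sorted({info[4][0] for info in infos})


# Addresses as quoted in the trace above.
for addr in ("172.16.1.1", "81.150.196.82"):
    print(addr, classify(addr))
```

If the site's name resolves to the public address from the server, its requests to itself would have to hairpin through the router (NAT reflection), which is one place the two routers could behave differently. That is something to test, not a finding.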
IIS logs the transaction with the successful CRON jobs in the 5 minutes preceding the 500 and the two favicon requests
2022-05-31 09:47:55 172.16.1.1 POST /wp-cron.php doing_wp_cron=1653990475.2987639904022216796875 80 - 172.16.1.2 WordPress/6.0;+http://www.businesscoachspecialist.co.uk http://www.businesscoachspecialist.co.uk/wp-cron.php?doing_wp_cron=1653990475.2987639904022216796875 200 0 0 537
2022-05-31 09:47:55 172.16.1.1 GET /wp-content/uploads/2016/06/comodo_secure_seal_113x59_transp-2.png - 80 - 172.16.1.2 - - 200 0 0 2
2022-05-31 09:48:46 172.16.1.1 POST /wp-cron.php doing_wp_cron=1653990526.1701300144195556640625 80 - 172.16.1.2 WordPress/6.0;+http://www.businesscoachspecialist.co.uk http://www.businesscoachspecialist.co.uk/wp-cron.php?doing_wp_cron=1653990526.1701300144195556640625 200 0 0 555
2022-05-31 09:48:46 172.16.1.1 GET /robots.txt - 80 - 34.76.25.117 DnBCrawler-Analytics - 200 0 0 1069
2022-05-31 09:48:46 172.16.1.1 GET /wp-content/uploads/2016/06/comodo_secure_seal_113x59_transp-2.png - 80 - 172.16.1.2 - - 200 0 0 1
2022-05-31 09:48:56 172.16.1.1 GET /wp-content/uploads/2017/10/BusinessCoach.jpg - 80 - 172.16.1.2 - - 200 0 0 2
2022-05-31 09:49:47 172.16.1.1 GET /wp-content/uploads/2017/10/BusinessCoach.jpg - 80 - 172.16.1.2 - - 200 0 0 9
2022-05-31 09:49:56 172.16.1.1 GET /wp-content/uploads/2017/10/SalesTraining.jpg - 80 - 172.16.1.2 - - 200 0 0 2
2022-05-31 09:50:47 172.16.1.1 GET /wp-content/uploads/2017/10/SalesTraining.jpg - 80 - 172.16.1.2 - - 200 0 0 4
2022-05-31 09:50:49 172.16.1.1 POST /wp-cron.php doing_wp_cron=1653990648.4238090515136718750000 80 - 172.16.1.2 WordPress/6.0;+http://www.businesscoachspecialist.co.uk http://www.businesscoachspecialist.co.uk/wp-cron.php?doing_wp_cron=1653990648.4238090515136718750000 200 0 0 574
2022-05-31 09:50:49 172.16.1.1 GET /wp-content/uploads/2016/06/comodo_secure_seal_113x59_transp-2.png - 80 - 172.16.1.2 - - 200 0 0 1
2022-05-31 09:50:56 172.16.1.1 GET /wp-content/uploads/2017/10/StaffDevelopment.jpg - 80 - 172.16.1.2 - - 200 0 0 1
2022-05-31 09:51:47 172.16.1.1 GET /wp-content/uploads/2017/10/StaffDevelopment.jpg - 80 - 172.16.1.2 - - 200 0 0 2
2022-05-31 09:51:49 172.16.1.1 GET /wp-content/uploads/2017/10/BusinessCoach.jpg - 80 - 172.16.1.2 - - 200 0 0 2
2022-05-31 09:51:56 172.16.1.1 GET /wp-content/uploads/2016/06/FreeCoachingSession.jpg - 80 - 172.16.1.2 - - 200 0 0 2
2022-05-31 09:52:18 172.16.1.1 POST /wp-cron.php doing_wp_cron=1653990737.7030880451202392578125 80 - 172.16.1.2 WordPress/6.0;+http://www.businesscoachspecialist.co.uk http://www.businesscoachspecialist.co.uk/wp-cron.php?doing_wp_cron=1653990737.7030880451202392578125 200 0 0 570
2022-05-31 09:52:18 172.16.1.1 GET /wp-content/uploads/2016/06/comodo_secure_seal_113x59_transp-2.png - 80 - 172.16.1.2 - - 200 0 0 0
2022-05-31 09:52:47 172.16.1.1 GET /wp-content/uploads/2016/06/FreeCoachingSession.jpg - 80 - 172.16.1.2 - - 200 0 0 2
2022-05-31 09:52:49 172.16.1.1 GET /wp-content/uploads/2017/10/SalesTraining.jpg - 80 - 172.16.1.2 - - 200 0 0 4
2022-05-31 09:52:54 172.16.1.1 GET / - 80 - 212.159.x.x Mozilla/5.0+(Windows+NT+10.0;+Win64;+x64)+AppleWebKit/537.36+(KHTML,+like+Gecko)+Chrome/102.0.5005.61+Safari/537.36 - 500 0 258 300143
2022-05-31 09:52:54 172.16.1.1 GET /favicon.ico - 80 - 212.159.x.x Mozilla/5.0+(Windows+NT+10.0;+Win64;+x64)+AppleWebKit/537.36+(KHTML,+like+Gecko)+Chrome/102.0.5005.61+Safari/537.36 http://www.businesscoachspecialist.co.uk/ 302 0 0 512
2022-05-31 09:52:54 172.16.1.1 GET /wp-includes/images/w-logo-blue-white-bg.png - 80 - 212.159.x.x Mozilla/5.0+(Windows+NT+10.0;+Win64;+x64)+AppleWebKit/537.36+(KHTML,+like+Gecko)+Chrome/102.0.5005.61+Safari/537.36 http://www.businesscoachspecialist.co.uk/ 200 0 0 16
The time taken on the GET / request that returned the 500 is 300143 ms, i.e. 300.143 seconds.
I repeated it with WP-CRON disabled, no difference other than it not firing that traffic.
So all that tells us is what we already knew: the request is getting black-holed in the httpd worker process. The question is why swapping the router fixes it!
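For anyone wanting to pull the slow requests out of a larger IIS log, a minimal sketch follows. The field order is inferred from the log lines pasted above, since the #Fields header is not shown, so treat it as an assumption; the function names and the 10-second threshold are arbitrary.

```python
from collections import Counter

# Assumed field order, inferred from the pasted lines (no #Fields header shown).
FIELDS = ["date", "time", "s_ip", "method", "uri_stem", "uri_query", "port",
          "username", "c_ip", "user_agent", "referer", "status", "substatus",
          "win32_status", "time_taken_ms"]


def parse_line(line):
    """Split one log line into a dict, or return None for headers and odd lines."""
    parts = line.split()
    if line.startswith("#") or len(parts) != len(FIELDS):
        return None
    record = dict(zip(FIELDS, parts))
    record["status"] = int(record["status"])
    record["time_taken_ms"] = int(record["time_taken_ms"])
    return record


def slow_requests(lines, threshold_ms=10_000):
    """Return parsed records whose time-taken is at or above the threshold."""
    records = (parse_line(line) for line in lines)
    return [r for r in records if r and r["time_taken_ms"] >= threshold_ms]


def requests_per_client(lines):
    """Count parsed requests by client IP."""
    records = (parse_line(line) for line in lines)
    return Counter(r["c_ip"] for r in records if r)


# Two lines copied from the log above.
sample = [
    "2022-05-31 09:52:49 172.16.1.1 GET /wp-content/uploads/2017/10/SalesTraining.jpg - 80 - 172.16.1.2 - - 200 0 0 4",
    "2022-05-31 09:52:54 172.16.1.1 GET / - 80 - 212.159.x.x Mozilla/5.0+(Windows+NT+10.0;+Win64;+x64)+AppleWebKit/537.36+(KHTML,+like+Gecko)+Chrome/102.0.5005.61+Safari/537.36 - 500 0 258 300143",
]

for r in slow_requests(sample):
    print(r["uri_stem"], r["status"], f"{r['time_taken_ms'] / 1000:.3f}s")
print(requests_per_client(sample))
```

Grouping by client IP may be worth a look here, because most of the requests logged during the 5-minute wait come from 172.16.1.2 rather than from the external client. What that address belongs to is not stated in the thread.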
-
@cool_corona The admin over there is not comfortable with that at the moment. They're happy to share screenshots and cli output.
-
@c-amie I understand. But it will take 5 minutes to solve if anything is misconfigured or something has been overlooked.
It's up to you. Catch me on PM if you need assistance.