Netgate Discussion Forum

    [Bug?] pfSense empirically causing legacy WordPress sites to fail

    NAT
    33 Posts 4 Posters 3.3k Views
      C-Amie @johnpoz

      @johnpoz

      That is what we get internally as well as nationally. Every now and then it will load just fine, but it takes many minutes.

      Equally, I understand your frustration. I spent most of the morning saying the router couldn't have anything to do with it, yet they swapped it to eliminate it and there was the proof. Instant restoration of service.
      Nothing is jumping out in the pfSense logs. There are no PPPoE errors, signs of dropped frames or CRC issues being recorded etc.

      That node is isolated now; it's IIS on Server 2019. The WAN has a /26 on it, with a PAT forward for 443 and 80 to the LAN IP and an outbound-NAT affinity back to its public IP, ending .82 in this case.

      WAN performance benchmarks are consistent between both setups.

      It doesn't do it on any other site currently on the HTTPd running on the same network, the same VIP. There is no offloading, no reverse proxy, no LBL and no IDS. Security software on the httpd is consistent.

      The only other machine involved in this transaction on this test setup is the RDBMS which is on the same subnet connected via an internal hypervisor core switch. The traffic never leaves the hypervisor.
      No other websites on this particular httpd are having issues currently. Same RDBMS, same WordPress versions; the only difference is the theme file and its application-layer rendering stack.

      If we disable https enforcement, the same issue occurs over plain old HTTP.

      MTUs from the HTTPd to the router are straight 1500. PPPoE MTUs are as per the ISP.
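
For reference, the arithmetic behind the MTU figures can be sketched as below. This is only a rough illustration: it assumes the ISP runs standard PPPoE with its usual 8-byte overhead (the thread only says the PPPoE MTU is "as per the ISP"), and the function name `tcp_mss` is made up for the example.

```python
# Header sizes in bytes (IPv4 and TCP, both without options).
IPV4_HEADER = 20
TCP_HEADER = 20
PPPOE_OVERHEAD = 8  # 6-byte PPPoE header + 2-byte PPP protocol ID


def tcp_mss(mtu: int) -> int:
    """Largest TCP payload that fits in one unfragmented IPv4 packet."""
    return mtu - IPV4_HEADER - TCP_HEADER


lan_mtu = 1500                        # stated for the HTTPd-to-router path
pppoe_mtu = lan_mtu - PPPOE_OVERHEAD  # assumption: no RFC 4638 "baby jumbo" frames

print(tcp_mss(lan_mtu))    # 1460
print(tcp_mss(pppoe_mtu))  # 1452
```

If the WAN really is at 1492, a server advertising an MSS of 1460 depends on MSS clamping or working path MTU discovery for full-size responses to get through.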

      The swap is a physical LAN disconnect: move the media converter onto the DrayTek and patch the DrayTek back into the same port on the distribution-layer switch.

      The setup in the failover site last year was more complicated as it involved more VLANs, more firewalls and virtualised hypervisors... but fundamentally it was the same sort of setup showing the same issues for the exact same https sites.

      We very much appreciate your thoughts. Please be aware that we are in the UK and it is nearly 10pm, so we will be signing off.

      Cheers,

        heper @C-Amie

        @c-amie the pfsense webgui isn't running on port 80/443 right?

          johnpoz LAYER 8 Global Moderator @C-Amie

          @c-amie not buying pfsense as cause - some sort of configuration, nat reflection? You mention it does access another server - well clearly in the error it presents as saying it can not access something..

          the only difference is the theme file and its application-layer rendering stack.

          And where is that.. How would pfsense have any clue to any of that inside an SSL tunnel? Where are those files - how are they accessed: a hard-coded local IP, or an FQDN and then via NAT reflection?

          No other websites on this particular httpd are having issues currently

          So pfsense dicks with traffic somehow inside a ssl tunnel, but only for this 1 site - how does that make any sense. You need to look at details of what could be different in the setup, nat reflection comes to mind - because pfsense sure wouldn't do that without configuration.

          An intelligent man is sometimes forced to be drunk to spend time with his fools
          If you get confused: Listen to the Music Play
          Please don't Chat/PM me for help, unless mod related
          SG-4860 24.11 | Lab VMs 2.7.2, 24.11

            C-Amie @heper

            @heper Correct

              C-Amie @johnpoz

              @johnpoz NAT reflection would still make pfSense the cause, as pfSense is what handles the NAT reflection. However, neither you nor I are accessing it via NAT reflection; we are both external to the site.

              No, the timeout you got isn't saying it cannot access something. The PHP worker process idles for too long so the service watchdog kills it. If you offline the RDBMS the error is instant. If you keep retrying, in our experience it will eventually load.

              Files are on local disk. Everything has been simplified in troubleshooting. When I said the only external dependency was the RDBMS, I was accurate. There is no iSCSI, NFS or SAN involved.

              As said, it doesn't have to be an SSL tunnel. I do not think that it is a stateful packet inspection problem. It 'feels' more like MTU, repeated failed retransmission or, more succinctly, like an asymmetric route that is failing just for HTTP responses from that site. Something in the pfSense config is causing it to get chewed, I agree. The question is what? Does anyone have any config change ideas?

                johnpoz LAYER 8 Global Moderator @C-Amie

                @c-amie said in [Bug?] pfSense empirically causing legacy WordPress sites to fail:

                repeated failed retransmission or more succinctly like an asymmetric route that is failing just for HTTP responses from that site

                Well do a sniff on pfsense then, both on the wan side and the lan side - what do you see.

                But you stated no other sites having any issues - what is different about them?

                No other websites on this particular httpd are having issues currently.

                  C-Amie @johnpoz

                  @johnpoz Theme and rendering stack

                    Cool_Corona

                    Are you up for a remote session to see if anything sticks out in the config?

                      C-Amie @johnpoz

                      @johnpoz So with TLS disabled to keep everything nice and simple.

                      Client side

                      Client established

                      • 212.159.x.x establishes session to 81.150.196.82
                      • 81.150.196.82 SYN ACKs
                      • Client sends GET to 81.150.196.82 with correct host header
                      • 81.150.196.82 and client exchange TCP keep-alive requests every 45 seconds
                      • After about 300-400 seconds, the HTTPd watchdog kills the request and IIS sends the error page, triggering a FIN ACK

                      Server Side

                      • Client 212.159.x.x TCP session is SYN ACK'd with httpd 172.16.1.1
                      • HTTP GET is received from 212.159.x.x
                      • MySQL transactions fire and conclude successfully with RDBMS server IP
                      • 172.16.1.1 negotiates congestion notification with 81.150.196.82 and fires off a WP cron job to its own host header - WordPress does this
                      • More database IO for the cron job
                      • WP CRON job is HTTP 200'd
                      • Every 45 seconds the TCP keep-alive triggers and is ACK'd
                      • A couple more CRON requests are fired by the thread
                      • At 300 seconds, 172.16.1.1 sends the HTTP 500 timeout to 212.159.x.x, after which it is FIN ACK'd
                      • On receiving the error page, the browser asks for the favicon and wordpress responds with its programmed PNG file (at some non-standard arbitrary storage location)

                      IIS logs the transaction with the successful CRON jobs in the 5 minutes preceding the 500 and the two favicon requests:
                      2022-05-31 09:47:55 172.16.1.1 POST /wp-cron.php doing_wp_cron=1653990475.2987639904022216796875 80 - 172.16.1.2 WordPress/6.0;+http://www.businesscoachspecialist.co.uk http://www.businesscoachspecialist.co.uk/wp-cron.php?doing_wp_cron=1653990475.2987639904022216796875 200 0 0 537
                      2022-05-31 09:47:55 172.16.1.1 GET /wp-content/uploads/2016/06/comodo_secure_seal_113x59_transp-2.png - 80 - 172.16.1.2 - - 200 0 0 2
                      2022-05-31 09:48:46 172.16.1.1 POST /wp-cron.php doing_wp_cron=1653990526.1701300144195556640625 80 - 172.16.1.2 WordPress/6.0;+http://www.businesscoachspecialist.co.uk http://www.businesscoachspecialist.co.uk/wp-cron.php?doing_wp_cron=1653990526.1701300144195556640625 200 0 0 555
                      2022-05-31 09:48:46 172.16.1.1 GET /robots.txt - 80 - 34.76.25.117 DnBCrawler-Analytics - 200 0 0 1069
                      2022-05-31 09:48:46 172.16.1.1 GET /wp-content/uploads/2016/06/comodo_secure_seal_113x59_transp-2.png - 80 - 172.16.1.2 - - 200 0 0 1
                      2022-05-31 09:48:56 172.16.1.1 GET /wp-content/uploads/2017/10/BusinessCoach.jpg - 80 - 172.16.1.2 - - 200 0 0 2
                      2022-05-31 09:49:47 172.16.1.1 GET /wp-content/uploads/2017/10/BusinessCoach.jpg - 80 - 172.16.1.2 - - 200 0 0 9
                      2022-05-31 09:49:56 172.16.1.1 GET /wp-content/uploads/2017/10/SalesTraining.jpg - 80 - 172.16.1.2 - - 200 0 0 2
                      2022-05-31 09:50:47 172.16.1.1 GET /wp-content/uploads/2017/10/SalesTraining.jpg - 80 - 172.16.1.2 - - 200 0 0 4
                      2022-05-31 09:50:49 172.16.1.1 POST /wp-cron.php doing_wp_cron=1653990648.4238090515136718750000 80 - 172.16.1.2 WordPress/6.0;+http://www.businesscoachspecialist.co.uk http://www.businesscoachspecialist.co.uk/wp-cron.php?doing_wp_cron=1653990648.4238090515136718750000 200 0 0 574
                      2022-05-31 09:50:49 172.16.1.1 GET /wp-content/uploads/2016/06/comodo_secure_seal_113x59_transp-2.png - 80 - 172.16.1.2 - - 200 0 0 1
                      2022-05-31 09:50:56 172.16.1.1 GET /wp-content/uploads/2017/10/StaffDevelopment.jpg - 80 - 172.16.1.2 - - 200 0 0 1
                      2022-05-31 09:51:47 172.16.1.1 GET /wp-content/uploads/2017/10/StaffDevelopment.jpg - 80 - 172.16.1.2 - - 200 0 0 2
                      2022-05-31 09:51:49 172.16.1.1 GET /wp-content/uploads/2017/10/BusinessCoach.jpg - 80 - 172.16.1.2 - - 200 0 0 2
                      2022-05-31 09:51:56 172.16.1.1 GET /wp-content/uploads/2016/06/FreeCoachingSession.jpg - 80 - 172.16.1.2 - - 200 0 0 2
                      2022-05-31 09:52:18 172.16.1.1 POST /wp-cron.php doing_wp_cron=1653990737.7030880451202392578125 80 - 172.16.1.2 WordPress/6.0;+http://www.businesscoachspecialist.co.uk http://www.businesscoachspecialist.co.uk/wp-cron.php?doing_wp_cron=1653990737.7030880451202392578125 200 0 0 570
                      2022-05-31 09:52:18 172.16.1.1 GET /wp-content/uploads/2016/06/comodo_secure_seal_113x59_transp-2.png - 80 - 172.16.1.2 - - 200 0 0 0
                      2022-05-31 09:52:47 172.16.1.1 GET /wp-content/uploads/2016/06/FreeCoachingSession.jpg - 80 - 172.16.1.2 - - 200 0 0 2
                      2022-05-31 09:52:49 172.16.1.1 GET /wp-content/uploads/2017/10/SalesTraining.jpg - 80 - 172.16.1.2 - - 200 0 0 4
                      2022-05-31 09:52:54 172.16.1.1 GET / - 80 - 212.159.x.x Mozilla/5.0+(Windows+NT+10.0;+Win64;+x64)+AppleWebKit/537.36+(KHTML,+like+Gecko)+Chrome/102.0.5005.61+Safari/537.36 - 500 0 258 300143
                      2022-05-31 09:52:54 172.16.1.1 GET /favicon.ico - 80 - 212.159.x.x Mozilla/5.0+(Windows+NT+10.0;+Win64;+x64)+AppleWebKit/537.36+(KHTML,+like+Gecko)+Chrome/102.0.5005.61+Safari/537.36 http://www.businesscoachspecialist.co.uk/ 302 0 0 512
                      2022-05-31 09:52:54 172.16.1.1 GET /wp-includes/images/w-logo-blue-white-bg.png - 80 - 212.159.x.x Mozilla/5.0+(Windows+NT+10.0;+Win64;+x64)+AppleWebKit/537.36+(KHTML,+like+Gecko)+Chrome/102.0.5005.61+Safari/537.36 http://www.businesscoachspecialist.co.uk/ 200 0 0 16

                      The time-taken on the GET / request (the final field, in milliseconds) is 300143, i.e. 300.143 seconds.
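
To pull the stalled request out of a longer log, something along these lines might help. It is a rough sketch, not the IIS tooling itself: the 15-field layout is inferred from the sample lines above (no `#Fields` header was posted), and the names `IisEntry`, `parse_iis_line`, `slow_requests` and the 30-second threshold are made up for illustration.

```python
from typing import NamedTuple


class IisEntry(NamedTuple):
    timestamp: str
    method: str
    uri: str
    client_ip: str
    status: int
    time_taken_s: float


def parse_iis_line(line: str) -> IisEntry:
    """Parse one space-delimited IIS log line in the 15-field layout seen above."""
    f = line.split()
    if len(f) != 15:
        raise ValueError(f"expected 15 fields, got {len(f)}")
    return IisEntry(
        timestamp=f"{f[0]} {f[1]}",
        method=f[3],
        uri=f[4],
        client_ip=f[8],
        status=int(f[11]),
        time_taken_s=int(f[14]) / 1000,  # time-taken is logged in milliseconds
    )


def slow_requests(lines, threshold_s=30.0):
    """Return entries whose time-taken exceeds threshold_s (hypothetical cut-off)."""
    return [e for e in map(parse_iis_line, lines) if e.time_taken_s > threshold_s]


sample = [
    "2022-05-31 09:52:18 172.16.1.1 POST /wp-cron.php doing_wp_cron=1653990737.7030880451202392578125 80 - 172.16.1.2 WordPress/6.0;+http://www.businesscoachspecialist.co.uk http://www.businesscoachspecialist.co.uk/wp-cron.php?doing_wp_cron=1653990737.7030880451202392578125 200 0 0 570",
    "2022-05-31 09:52:54 172.16.1.1 GET / - 80 - 212.159.x.x Mozilla/5.0+(Windows+NT+10.0;+Win64;+x64)+AppleWebKit/537.36+(KHTML,+like+Gecko)+Chrome/102.0.5005.61+Safari/537.36 - 500 0 258 300143",
]

for entry in slow_requests(sample):
    print(entry.timestamp, entry.uri, entry.status, entry.time_taken_s)
```

Run against the two sample lines, only the `GET /` request from 212.159.x.x is reported; the wp-cron POST completes in well under a second.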

                      I repeated it with WP-CRON disabled, no difference other than it not firing that traffic.

                      So all that tells us what we already knew: the request is getting black-holed in the httpd worker process. The question is why does the guys over there swapping the router fix it!

                      😵

                        C-Amie @Cool_Corona

                        @cool_corona The admin over there is not comfortable with that at the moment. They're happy to share screenshots and cli output.

                          Cool_Corona @C-Amie

                          @c-amie I understand. But it will take 5 mins to solve if anything is misconfigured or something has been overlooked.

                          It's up to you. Catch me on PM if you need assistance.

                            C-Amie @Cool_Corona

                            @cool_corona Thanks, I'll let the team know that your offer stands.

                            Cheers and have a good day.

                              johnpoz LAYER 8 Global Moderator @C-Amie

                              @c-amie said in [Bug?] pfSense empirically causing legacy WordPress sites to fail:

                              The question is why does the guys over there swapping the router fix it!

                              No idea - but that sure isn't pfsense doing something to the packets..

                              There isn't anything to do with the packets, and if pfsense was doing something to them or not doing something - why would any other site work?

                                C-Amie @johnpoz

                                @johnpoz Quite! That's why I reached out to see if anyone else had any bright ideas. It isn't making any sense to me either.

                                  C-Amie @johnpoz

                                  @johnpoz Okay, having burned the entire day on this now, I've half worked it out.

                                  It does seem to be a NAT reflection problem that is not present on the DrayTek.

                                  The host has its own public DNS on it, on the same server 172.16.1.1

                                  All TCP/UDP 53 traffic from 172.16.1.1 goes out via 81.150.196.83
                                  All TCP/UDP 53 traffic to 81.150.196.83 goes to 172.16.1.1

                                  All other traffic from 172.16.1.1 goes out via 81.150.196.82
                                  All TCP 80/443 traffic to 81.150.196.82 goes to 172.16.1.1
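
The four rules above can be modelled roughly as below, to show which flows would depend on NAT reflection. This is a sketch under assumptions, not the pfSense rule set: the /24 prefix for the LAN is a guess (the thread never states it), 203.0.113.10 is a documentation address standing in for an external visitor, and the names `outbound_source` and `needs_reflection` are made up for the example.

```python
import ipaddress

LAN = ipaddress.ip_network("172.16.1.0/24")  # assumption: prefix length is not given in the thread
WEB_AND_DNS_HOST = "172.16.1.1"

# Inbound forwards: (public IP, destination port) -> internal host.
PORT_FORWARDS = {
    ("81.150.196.83", 53): WEB_AND_DNS_HOST,
    ("81.150.196.82", 80): WEB_AND_DNS_HOST,
    ("81.150.196.82", 443): WEB_AND_DNS_HOST,
}


def outbound_source(dst_port: int) -> str:
    """Public address that traffic from 172.16.1.1 is translated to on the way out."""
    return "81.150.196.83" if dst_port == 53 else "81.150.196.82"


def needs_reflection(src_ip: str, dst_ip: str, dst_port: int) -> bool:
    """True when a LAN host targets a public address that is forwarded back into the LAN."""
    is_internal = ipaddress.ip_address(src_ip) in LAN
    return is_internal and (dst_ip, dst_port) in PORT_FORWARDS


# A LAN host querying the public DNS address:
print(needs_reflection("172.16.1.1", "81.150.196.83", 53))    # True
# An external visitor (hypothetical documentation address) hitting the site:
print(needs_reflection("203.0.113.10", "81.150.196.82", 80))  # False
```

Any flow for which `needs_reflection` is true has to be turned around by the router, which is where the two devices could plausibly differ.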

                                  I surmise (note I am assuming) that there is a DNS request for itself being made in the code in the template file. This is dying 90% of the time.

                                  If I stop the public DNSd service and set the hosts file to resolve the site's FQDN to 172.16.1.1, everything magically works both internally and externally. The second I clear the entry in the hosts file, everything goes back to grinding to a halt again.
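
To confirm which tenants currently have such an override in place, the hosts file can be checked with a small script along these lines. It is an illustrative sketch only: the sample entry uses the FQDN seen in the IIS logs, the function name `parse_hosts` is made up, and the "first entry wins" behaviour is an assumption about the resolver.

```python
def parse_hosts(text: str) -> dict[str, str]:
    """Map each hostname in hosts-file text to its IP address."""
    table = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and surrounding whitespace
        if not line:
            continue
        ip, *names = line.split()
        for name in names:
            table.setdefault(name.lower(), ip)  # assumption: first entry wins
    return table


sample_hosts = """
# hypothetical override for the affected site
172.16.1.1  www.businesscoachspecialist.co.uk
"""

overrides = parse_hosts(sample_hosts)
print(overrides.get("www.businesscoachspecialist.co.uk"))  # 172.16.1.1
```

On the server itself, the text would be read from the real hosts file instead of `sample_hosts`.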

                                  There are no issues with DNS resolution on the server via the CLI when the hosts file is disabled. Stick the DrayTek back in and the problem ceases.

                                  As sheer speculation it might appear that the public .83 and .82 addresses cannot talk to each other for whatever reason.

                                    johnpoz LAYER 8 Global Moderator @C-Amie

                                    @c-amie said in [Bug?] pfSense empirically causing legacy WordPress sites to fail:

                                    that the public .83 and .82 addresses cannot talk to each other for whatever reason.

                                    Yeah that would be a nat reflection problem - which not to say I told you, but I did bring that up ;)

                                    If a client resolves something to the public IP, it would have to be reflected back in.. While you can set up NAT reflection in pfsense - it's a hack if you ask me.. And the only reason would be if something is hard-coded to an IP and cannot be changed.. If it's using an FQDN, then a host override to point to the local IP would be the better solution - why send traffic to pfsense, just for pfsense to send it back - when the server is right there local anyway.

                                    Other reason you would have to use nat reflection, if device behind pfsense has to use public dns, and no way for it to use local, so there is no way to put in a record to resolve whatever fqdn to the local IP vs the public IP.

                                    So I take it you're all sorted now?

                                      C-Amie @johnpoz

                                      @johnpoz You did :) And so did I; I've just not had a handle on that 'what' until now.

                                      While I support your reasoning, the necessity to modify the hosts file and rein in the DNS server lookup is in itself a messy hack that needs to be institutionally remembered and maintained. I cannot control the actions of third party code, nor have any desire to modify it.

                                      If the code has to self-reference its own FQDN and the DNS server is inside the NAT envelope, then it is an application layer decision. I cannot put another line and router in in order to stop the necessity for NAT reflection.

                                      So while it is hacked to a working state, it is not sorted. The need to modify hosts files for all affected tenants is a manual chore that does not scale with host migration. As the issue is not present on another manufacturer's device, I would stand by my initial assertion that this is looking like a bug in the pfSense implementation. If DrayTek can fix it, then I would have every confidence that the folks at Netgate must be able to derive a solution should there be willingness.

                                        heper @C-Amie

                                        @c-amie https://docs.netgate.com/pfsense/en/latest/nat/reflection.html#internal-dns-servers

                                          C-Amie @heper

                                          @heper said in [Bug?] pfSense empirically causing legacy WordPress sites to fail:

                                          https://docs.netgate.com/pfsense/en/latest/nat/reflection.html#internal-dns-servers

                                          The system is using internal DNS servers. Web server queries caching internal DNS servers, internal DNS servers query root servers. This is related to the public DNS so it is flowing:

                                          Web server (172.16.1.1) > internal DNS (172.16.1.x) > root DNS system > public DNS server (.83)

                                          Whatever the PHP code is doing, it is forcing a DNS re-query and then trying to query from the web server to the Internet and then back to the DNS server inside the NAT envelope on a different public IP address.

                                            johnpoz LAYER 8 Global Moderator @C-Amie

                                            @c-amie said in [Bug?] pfSense empirically causing legacy WordPress sites to fail:

                                            internal DNS (172.16.1.x)

                                            Put a record here for what you want it to resolve to, 172.16.1.y for example.
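
The idea of a local record on the internal DNS (split-horizon resolution) can be sketched roughly as follows. This is a toy model rather than any DNS server's configuration: the /24 LAN prefix is assumed, the public answer of 81.150.196.82 is taken from the capture earlier in the thread, and the names `LOCAL_RECORDS`, `PUBLIC_RECORDS` and `answer` are made up for illustration.

```python
import ipaddress

LAN = ipaddress.ip_network("172.16.1.0/24")  # assumption: prefix length is not given in the thread

# Records served only to LAN clients by the internal DNS.
LOCAL_RECORDS = {"www.businesscoachspecialist.co.uk": "172.16.1.1"}
# What the public DNS hands out to everyone else.
PUBLIC_RECORDS = {"www.businesscoachspecialist.co.uk": "81.150.196.82"}


def answer(client_ip: str, name: str):
    """Return the address a client would receive for name, or None if unknown."""
    name = name.lower()
    if ipaddress.ip_address(client_ip) in LAN and name in LOCAL_RECORDS:
        return LOCAL_RECORDS[name]  # stays on the LAN, no reflection needed
    return PUBLIC_RECORDS.get(name)


print(answer("172.16.1.1", "www.businesscoachspecialist.co.uk"))    # 172.16.1.1
print(answer("203.0.113.10", "www.businesscoachspecialist.co.uk"))  # 81.150.196.82
```

With the local record in place, the web server's self-referencing requests never leave the LAN, so they no longer depend on the router turning the traffic around.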

                                            Copyright 2025 Rubicon Communications LLC (Netgate). All rights reserved.