Any known issues with HAproxy on 2.5.2?

stephenw10

40% of how much? With HAProxy running?

lewis

This one has only 4GB in it because it's very low traffic and on a 50Mbps connection. Maybe I never noticed it was at 40% but I think it would have gotten my attention.

stephenw10

You can check the process list in Diag > System Activity to see if any one thing is using it.

If not and it is not actually exhausted it's probably not an issue.

Steve

lewis

Nothing really obvious other than this:

2275 root 20 0 9988K 1368K select 1 0:00 0.00% /sbin/devd -q -f /etc/pfSense-devd.conf

stephenw10

Can we see the actual usage screen? 1.4MB is nothing, something must be using more than that.

lewis

Do you mean the dashboard or all of the processes?

stephenw10

The processes. So for example the output of top -aSPo res after a few cycles, like:

last pid: 79792;  load averages:  0.31,  0.35,  0.30                                                     up 1+05:39:48  19:49:48
148 processes: 2 running, 145 sleeping, 1 waiting
CPU 0:  0.0% user,  0.0% nice,  0.8% system,  0.0% interrupt, 99.2% idle
CPU 1:  0.0% user,  0.0% nice,  0.4% system,  1.2% interrupt, 98.4% idle
Mem: 97M Active, 717M Inact, 655M Wired, 1840M Free
ARC: 431M Total, 120M MFU, 289M MRU, 32K Anon, 3266K Header, 19M Other
     354M Compressed, 740M Uncompressed, 2.09:1 Ratio

  PID USERNAME    THR PRI NICE   SIZE    RES STATE    C   TIME    WCPU COMMAND
95987 root          2  20    0   417M   373M bpf      1   5:55   0.12% /usr/local/bin/snort -R _28847 -D -q --suppress-config-lo
48404 root          6  52    0   113M    86M kqread   0   0:00   0.00% /usr/local/sbin/radiusd
42053 root          1  52    0   140M    48M accept   0   1:16   0.00% php-fpm: pool nginx (php-fpm)
 1262 root          1  52    0   140M    48M accept   1   1:03   0.00% php-fpm: pool nginx (php-fpm)
12485 root          1  52    0   141M    48M accept   1   1:14   0.00% php-fpm: pool nginx (php-fpm)
 1261 root          1  52    0   140M    47M accept   0   1:43   0.00% php-fpm: pool nginx (php-fpm)
 1466 root          1  20    0   141M    47M accept   1   1:06   0.00% php-fpm: pool nginx (php-fpm)
81073 squid         1  20    0   105M    37M kqread   1   4:17   0.03% (squid-1) --kid squid-1 -f /usr/local/etc/squid/squid.con
 1260 root          1  20    0   100M    26M kqread   0   0:05   0.01% php-fpm: master process (/usr/local/lib/php-fpm.conf) (ph
39411 unbound       2  52    0    40M    20M kqread   1   0:00   0.00% /usr/local/sbin/unbound -c /var/unbound/unbound.conf
80523 squid         1  20    0    79M    19M wait     0   0:00   0.00% /usr/local/sbin/squid -f /usr/local/etc/squid/squid.conf
93560 root         17  52    0    50M    17M sigwai   1   0:12   0.00% /usr/local/libexec/ipsec/charon --use-syslog
44082 www           1  20    0    26M    14M kqread   1   0:01   0.00% /usr/local/sbin/haproxy -f /var/etc/haproxy/haproxy.cfg -
51992 root         10  20    0    65M    12M select   1   0:13   0.00% /usr/local/sbin/zebra -d
63253 dhcpd         1  20    0    22M    12M select   0   0:20   0.02% /usr/local/sbin/dhcpd -user dhcpd -group _dhcp -chroot /v
53603 root          4  20    0    33M    10M select   0   0:06   0.00% /usr/local/sbin/bgpd -d
32357 root          2  20    0    25M  9792K kqread   0   0:00   0.00% /usr/local/sbin/syslog-ng -p /var/run/syslog-ng.pid
23079 root          1  20    0    19M  9104K select   0   0:00   0.02% sshd: admin@pts/0 (sshd)
53811 root          1  20    0    28M  8480K kqread   1   0:05   0.00% nginx: worker process (nginx)
53681 root          1  20    0    28M  8312K kqread   0   0:02   0.00% nginx: worker process (nginx)
32255 root          1  52    0    18M  8220K wait     0   0:00   0.00% /usr/local/sbin/syslog-ng -p /var/run/syslog-ng.pid
25756 squid         1  20    0    17M  8084K select   1   0:11   0.02% (pinger) (pinger)
57293 squid         1  20    0    17M  8084K select   0   0:10   0.02% (pinger) (pinger)
97532 squid         1  20    0    17M  8084K select   1   0:11   0.02% (pinger) (pinger)
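
If it helps, a listing like the one above can be totalled per command so that several small processes (the php-fpm workers, for example) show up as one figure. This is only a rough sketch in Python: the function names are made up for illustration, and it assumes the 12-column layout shown above with RES values ending in K, M or G.

```python
from collections import defaultdict

# Multipliers to convert top(1) size suffixes to MiB.
UNIT_TO_MIB = {"K": 1 / 1024, "M": 1, "G": 1024}


def res_to_mib(value):
    """Convert a RES string such as '373M' or '9792K' to MiB."""
    return float(value[:-1]) * UNIT_TO_MIB[value[-1]]


def res_by_command(rows):
    """Sum resident memory (MiB) per command from top process rows."""
    totals = defaultdict(float)
    for row in rows:
        # Columns: PID USER THR PRI NICE SIZE RES STATE C TIME WCPU COMMAND
        fields = row.split(None, 11)
        if len(fields) < 12 or not fields[0].isdigit():
            continue  # skip headers and anything that is not a process row
        command = fields[11].split()[0]  # first word of the command line
        totals[command] += res_to_mib(fields[6])
    return dict(totals)


# Rows copied from the output above.
rows = [
    "95987 root          2  20    0   417M   373M bpf      1   5:55   0.12% /usr/local/bin/snort -R _28847 -D -q --suppress-config-lo",
    "42053 root          1  52    0   140M    48M accept   0   1:16   0.00% php-fpm: pool nginx (php-fpm)",
    " 1262 root          1  52    0   140M    48M accept   1   1:03   0.00% php-fpm: pool nginx (php-fpm)",
    "32357 root          2  20    0    25M  9792K kqread   0   0:00   0.00% /usr/local/sbin/syslog-ng -p /var/run/syslog-ng.pid",
]
totals = res_by_command(rows)
for command, mib in sorted(totals.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{mib:8.1f} MiB  {command}")
```

Pasting the full process list into `rows` gives a quick view of which service, rather than which single process, holds the most resident memory.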

lewis

That's what I thought but wasn't sure :).
Nothing too unusual.
I never noticed that before, 42M active, 118M Inact, 1471M Wired.
Is the system holding some memory in some sort of buffer or something?

I've never seen that on CentOS or other flavors I've worked with.

stephenw10

Mmm, so just wired memory from the kernel (probably).
It's not an issue as far as I know. If the actual free memory runs low the kernel will start releasing wired memory. It is different behaviour to 2.5.2 though.
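
To put a number on that, the Mem: summary line from top can be broken down into shares. A minimal sketch in Python, assuming the line format from the output earlier in the thread; the function name is hypothetical, and the listed fields may not add up exactly to the installed RAM.

```python
import re

# Multipliers to convert top(1) size suffixes to MiB.
UNIT_TO_MIB = {"K": 1 / 1024, "M": 1, "G": 1024}


def parse_mem_line(line):
    """Parse a top(1) 'Mem:' line into {field name: MiB}."""
    fields = {}
    for amount, unit, name in re.findall(r"(\d+)([KMG]) (\w+)", line):
        fields[name] = int(amount) * UNIT_TO_MIB[unit]
    return fields


# Mem: line taken from the example output earlier in the thread.
mem = parse_mem_line("Mem: 97M Active, 717M Inact, 655M Wired, 1840M Free")
total = sum(mem.values())
for name, mib in mem.items():
    print(f"{name:<7}{mib:>7.0f} MiB  {100 * mib / total:5.1f}%")
```

Feeding in the Mem: line from the affected box shows how much of the listed memory is wired rather than free.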

Steve

lewis

Sorry it took so long to get back to this but there is definitely something wrong with haproxy, at least on our device.

For the past while, we've been testing everything possible inside our network thinking something between the web connections, the application and the database must be wrong.
After an insane amount of hours troubleshooting, we could simply find nothing whatsoever wrong with the application. The only clue was that clients were not communicating at the intervals they are set to.

Eventually, we decided that maybe it's the Internet. Maybe because of the Ukraine war and lots of extra world wide hacking, maybe governments are filtering the net so much that it's caused some latency.

Yes, we started thinking it must be the Internet! :).

Then something dawned on me tonight after spending the entire day on this again. I remembered that I took haproxy out of the mix (as posted above) and things got way better there. Users are no longer getting gateway timeouts. I've been monitoring the logs since then.

This evening, I decided to take this other set of servers off haproxy, put just one online and give traffic direct access. Guess what? The timing is now almost dead on, no longer random and no more missing connections.
All data that is supposed to come in, is coming in, no missing data. It's haproxy causing the loss somehow.

Here is a snip of us watching the logs and everything else a while ago. See the difference in timing? I'm only showing a snip but before haproxy was taken out, this client kept missing sending data, now it's dead on.

With load balancer
# tail -f /var/log/httpd/access_log | grep "1.1.1.1"
www.domain.com 1.1.1.1 - - [12/May/2022:20:22:10 -0700] "POST /app/test.php HTTP/1.1" 200 199351 747 "-" "curl/7.43.0"
www.domain.com 1.1.1.1 - - [12/May/2022:20:22:40 -0700] "POST /app/test.php HTTP/1.1" 200 212418 747 "-" "curl/7.43.0"
www.domain.com 1.1.1.1 - - [12/May/2022:20:23:50 -0700] "POST /app/test.php HTTP/1.1" 200 178076 747 "-" "curl/7.43.0"
www.domain.com 1.1.1.1 - - [12/May/2022:20:24:21 -0700] "POST /app/test.php HTTP/1.1" 200 181307 747 "-" "curl/7.43.0"
www.domain.com 1.1.1.1 - - [12/May/2022:20:24:32 -0700] "POST /app/test.php HTTP/1.1" 200 193764 747 "-" "curl/7.43.0"
www.domain.com 1.1.1.1 - - [12/May/2022:20:24:36 -0700] "POST /app/test.php HTTP/1.1" 200 252216 1 "-" "curl/7.43.0"
www.domain.com 1.1.1.1 - - [12/May/2022:20:24:41 -0700] "POST /app/test.php HTTP/1.1" 200 230704 747 "-" "curl/7.43.0"
www.domain.com 1.1.1.1 - - [12/May/2022:20:25:10 -0700] "POST /app/test.php HTTP/1.1" 200 175718 747 "-" "curl/7.43.0"
www.domain.com 1.1.1.1 - - [12/May/2022:20:25:21 -0700] "POST /app/test.php HTTP/1.1" 200 255809 747 "-" "curl/7.43.0"
www.domain.com 1.1.1.1 - - [12/May/2022:20:25:31 -0700] "POST /app/test.php HTTP/1.1" 200 217827 747 "-" "curl/7.43.0"
www.domain.com 1.1.1.1 - - [12/May/2022:20:26:19 -0700] "POST /app/test.php HTTP/1.1" 200 272213 1 "-" "curl/7.43.0"

Without load balancer
# tail -f /var/log/httpd/access_log | grep "1.1.1.1"
www.domain.com 1.1.1.1 - - [12/May/2022:21:11:21 -0700] "POST /app/test.php HTTP/1.1" 200 580819 747 "-" "curl/7.43.0"
www.domain.com 1.1.1.1 - - [12/May/2022:21:11:31 -0700] "POST /app/test.php HTTP/1.1" 200 430671 747 "-" "curl/7.43.0"
www.domain.com 1.1.1.1 - - [12/May/2022:21:11:41 -0700] "POST /app/test.php HTTP/1.1" 200 550884 747 "-" "curl/7.43.0"
www.domain.com 1.1.1.1 - - [12/May/2022:21:11:51 -0700] "POST /app/test.php HTTP/1.1" 200 564128 747 "-" "curl/7.43.0"
www.domain.com 1.1.1.1 - - [12/May/2022:21:12:01 -0700] "POST /app/test.php HTTP/1.1" 200 418494 747 "-" "curl/7.43.0"
www.domain.com 1.1.1.1 - - [12/May/2022:21:12:06 -0700] "POST /app/test.php HTTP/1.1" 200 303744 1 "-" "curl/7.43.0"
www.domain.com 1.1.1.1 - - [12/May/2022:21:12:11 -0700] "POST /app/test.php HTTP/1.1" 200 364427 747 "-" "curl/7.43.0"
www.domain.com 1.1.1.1 - - [12/May/2022:21:12:20 -0700] "POST /app/test.php HTTP/1.1" 200 285843 747 "-" "curl/7.43.0"
www.domain.com 1.1.1.1 - - [12/May/2022:21:12:30 -0700] "POST /app/test.php HTTP/1.1" 200 234948 747 "-" "curl/7.43.0"
www.domain.com 1.1.1.1 - - [12/May/2022:21:12:37 -0700] "POST /app/test.php HTTP/1.1" 200 310208 1 "-" "curl/7.43.0"
www.domain.com 1.1.1.1 - - [12/May/2022:21:12:40 -0700] "POST /app/test.php HTTP/1.1" 200 182248 747 "-" "curl/7.43.0"
www.domain.com 1.1.1.1 - - [12/May/2022:21:12:51 -0700] "POST /app/test.php HTTP/1.1" 200 381602 747 "-" "curl/7.43.0"
www.domain.com 1.1.1.1 - - [12/May/2022:21:13:00 -0700] "POST /app/test.php HTTP/1.1" 200 246661 747 "-" "curl/7.43.0"
www.domain.com 1.1.1.1 - - [12/May/2022:21:13:05 -0700] "POST /app/test.php HTTP/1.1" 200 258953 1 "-" "curl/7.43.0"
www.domain.com 1.1.1.1 - - [12/May/2022:21:13:10 -0700] "POST /app/test.php HTTP/1.1" 200 225073 747 "-" "curl/7.43.0"
www.domain.com 1.1.1.1 - - [12/May/2022:21:13:20 -0700] "POST /app/test.php HTTP/1.1" 200 185570 747 "-" "curl/7.43.0"
www.domain.com 1.1.1.1 - - [12/May/2022:21:13:30 -0700] "POST /app/test.php HTTP/1.1" 200 296611 747 "-" "curl/7.43.0"
www.domain.com 1.1.1.1 - - [12/May/2022:21:13:40 -0700] "POST /app/test.php HTTP/1.1" 200 259110 747 "-" "curl/7.43.0"
www.domain.com 1.1.1.1 - - [12/May/2022:21:13:50 -0700] "POST /app/test.php HTTP/1.1" 200 210109 747 "-" "curl/7.43.0"
www.domain.com 1.1.1.1 - - [12/May/2022:21:14:01 -0700] "POST /app/test.php HTTP/1.1" 200 392396 747 "-" "curl/7.43.0"
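
To compare the two runs with numbers rather than by eye, the gaps between consecutive requests can be computed from the timestamps. This is a rough Python sketch that assumes the log format above; the function name is made up, and the three sample lines are copied from the "With load balancer" snippet.

```python
import re
from datetime import datetime

# The timestamp is the text inside the first pair of square brackets.
TIMESTAMP_RE = re.compile(r"\[([^\]]+)\]")


def request_gaps(lines):
    """Return the gaps in seconds between consecutive access-log lines."""
    stamps = []
    for line in lines:
        match = TIMESTAMP_RE.search(line)
        if match:
            stamps.append(datetime.strptime(match.group(1), "%d/%b/%Y:%H:%M:%S %z"))
    return [(later - earlier).total_seconds() for earlier, later in zip(stamps, stamps[1:])]


sample = [
    'www.domain.com 1.1.1.1 - - [12/May/2022:20:22:10 -0700] "POST /app/test.php HTTP/1.1" 200 199351 747 "-" "curl/7.43.0"',
    'www.domain.com 1.1.1.1 - - [12/May/2022:20:22:40 -0700] "POST /app/test.php HTTP/1.1" 200 212418 747 "-" "curl/7.43.0"',
    'www.domain.com 1.1.1.1 - - [12/May/2022:20:23:50 -0700] "POST /app/test.php HTTP/1.1" 200 178076 747 "-" "curl/7.43.0"',
]
gaps = request_gaps(sample)
print(gaps)
```

A client posting on a fixed interval should give a nearly constant list; large or uneven gaps point at requests that arrived late or never arrived.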

lewis

BTW, just went to update.... got a seg fault.

 0) Logout (SSH only)                  9) pfTop
 1) Assign Interfaces                 10) Filter Logs
 2) Set interface(s) IP address       11) Restart webConfigurator
 3) Reset webConfigurator password    12) PHP shell + pfSense tools
 4) Reset to factory defaults         13) Update from console
 5) Reboot system                     14) Disable Secure Shell (sshd)
 6) Halt system                       15) Restore recent configuration
 7) Ping host                         16) Restart PHP-FPM
 8) Shell

Enter an option: 13

>>> Updating repositories metadata...
Updating pfSense-core repository catalogue...
Fetching meta.conf: . done
Fetching packagesite.pkg: . done
Processing entries: . done
pfSense-core repository update completed. 7 packages processed.
Updating pfSense repository catalogue...
Fetching meta.conf: . done
Fetching packagesite.pkg: .......... done
Processing entries:
Processing entries............. done
pfSense repository update completed. 515 packages processed.
All repositories are up to date.

Child process pid=55390 terminated abnormally: Segmentation fault

**** WARNING ****
Reboot will be required!!
Proceed with upgrade? (y/N) y
>>> Removing vital flag from php74... done.
>>> Downloading upgrade packages...
Updating pfSense-core repository catalogue...
pfSense-core repository is up to date.
Updating pfSense repository catalogue...
pfSense repository is up to date.
All repositories are up to date.
Checking for upgrades (201 candidates): ....
Child process pid=61867 terminated abnormally: Segmentation fault
pfSense - Netgate Device ID: xxx

From command line, with -d or not, I get to the same point of being told all is up to date but no option to upgrade.

[2.5.2-RELEASE][root@]/root: pkg update
Updating pfSense-core repository catalogue...
pfSense-core repository is up to date.
Updating pfSense repository catalogue...
pfSense repository is up to date.
All repositories are up to date.

And (Not sure I want to say yes since I see seg fault)

[2.5.2-RELEASE][root@c]/root: pfSense-upgrade
>>> Setting vital flag on php74... done.
>>> Updating repositories metadata...
Updating pfSense-core repository catalogue...
Fetching meta.conf: . done
Fetching packagesite.pkg: . done
Processing entries: . done
pfSense-core repository update completed. 7 packages processed.
Updating pfSense repository catalogue...
Fetching meta.conf: . done
Fetching packagesite.pkg: .......... done
Processing entries:
Processing entries............. done
pfSense repository update completed. 515 packages processed.
All repositories are up to date.

Child process pid=48857 terminated abnormally: Segmentation fault

**** WARNING ****
Reboot will be required!!
Proceed with upgrade? (y/N)

lewis

Maybe the seg fault warnings are related to the haproxy problem.
I don't know and I'm not sure what to think now. Feels like the firewall might fail at some point which would be bad.

stephenw10

If it's HAProxy not passing the traffic somehow then you should be able to prove that easily using a packet capture on the WAN side. You would still see the connections coming in from the client at regular intervals.
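
As a sketch of how that comparison could be done: take a capture on WAN, print it as text with epoch timestamps (for example with tcpdump -tt -n -r on the capture file), and list the gaps between new connections from one client. The Python below is only an illustration; the function name, the sample lines and the 203.0.113.10 address are made up, and the pattern assumes tcpdump's usual one-line output for IPv4 TCP packets.

```python
import re

# Matches lines such as:
# 1652412130.000000 IP 1.1.1.1.50001 > 203.0.113.10.443: Flags [S], seq 1, win 65535, length 0
SYN_RE = re.compile(r"^(\d+\.\d+) IP (\d+\.\d+\.\d+\.\d+)\.\d+ > \S+ Flags \[S\]")


def syn_gaps(lines, client_ip):
    """Return gaps in seconds between SYN packets sent by client_ip."""
    times = []
    for line in lines:
        match = SYN_RE.match(line)
        if match and match.group(2) == client_ip:
            times.append(float(match.group(1)))
    return [round(later - earlier, 3) for earlier, later in zip(times, times[1:])]


# Hypothetical capture text; the SYN-ACK reply on the second line is ignored.
capture = [
    "1652412130.000000 IP 1.1.1.1.50001 > 203.0.113.10.443: Flags [S], seq 1, win 65535, length 0",
    "1652412130.000400 IP 203.0.113.10.443 > 1.1.1.1.50001: Flags [S.], seq 1, ack 2, win 65535, length 0",
    "1652412140.000000 IP 1.1.1.1.50002 > 203.0.113.10.443: Flags [S], seq 1, win 65535, length 0",
    "1652412150.500000 IP 1.1.1.1.50003 > 203.0.113.10.443: Flags [S], seq 1, win 65535, length 0",
]
wan_gaps = syn_gaps(capture, "1.1.1.1")
print(wan_gaps)
```

If the gaps on WAN are regular while the gaps in the web server's access log are not, the requests are being delayed or lost somewhere behind the WAN interface. Note that this only counts new connections; a client reusing an open connection would not send a SYN per request.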

One thing to bear in mind is that HAProxy is an actual proxy, so it changes the path of the connection compared to a port forward. That might be relevant with the multiple WAN setup you have.

Is there anything logged from that segfault? That doesn't seem familiar at all in 2.5.2.

Steve

lewis

Well, whatever I do, it cannot be disruptive. This problem plagued us to the point that members never came back, blaming our services.

I don't have any multiple WANs on this, traffic coming into this firewall is simply port forwarded to devices on its LAN and we were using haproxy.

The multiple WAN you mention are simply other firewalls that have their LAN side on the same 'cable' but simply co-exist with each other but in different networks.

I didn't get a chance to check for proof but I could set up another test to an isolated server on the LAN. It would just show the irregular traffic as I showed. I don't have physical access to inject something between the WAN and a device to monitor. Everything else seems to work as it should, just haproxy is showing this behavior.

As for segfaults, that was the first time I've seen those last night but it was also the first time I tried upgrading from the cli as you had suggested.

stephenw10

HAProxy is running on the same firewall as the traffic is passing through, right?

And when you disable HAProxy as a test you are simply enabling a port forward there instead to just one backend server?

Doing that makes several changes. You are using just one backend, are you sure the load-balancing is not breaking the required connection between the server and client? Have you tried just enabling one backend server through HAProxy?

When you are running the proxy the TCP connection is terminated in HAProxy and a separate connection is created to the backend. That means the backend will reply to HAProxy directly in its own subnet rather than try to reply to the client via its default route. That may or may not make a difference but I could certainly imagine it might in your network setup. Though I would expect it to fail without HAProxy in that case.
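
To illustrate the "terminated and re-created" part, here is a toy Python sketch, not HAProxy itself: a tiny proxy on localhost accepts one client connection and opens its own connection to a backend, and the backend reports which peer it sees. All names are made up for the example. Since everything runs on 127.0.0.1, the difference shows up in the source port rather than the IP address.

```python
import socket
import threading


def start_backend():
    """Backend: answers one connection with the peer address it sees."""
    srv = socket.create_server(("127.0.0.1", 0))  # port 0 = any free port

    def serve():
        conn, peer = srv.accept()
        with conn:
            conn.recv(1024)
            conn.sendall(f"{peer[0]}:{peer[1]}".encode())
        srv.close()

    threading.Thread(target=serve, daemon=True).start()
    return srv.getsockname()[1]


def start_proxy(backend_port):
    """Proxy: accepts one client, then opens a separate connection to the backend."""
    srv = socket.create_server(("127.0.0.1", 0))

    def serve():
        client, _ = srv.accept()
        with client, socket.create_connection(("127.0.0.1", backend_port), timeout=5) as upstream:
            upstream.sendall(client.recv(1024))  # relay the request
            client.sendall(upstream.recv(1024))  # relay the reply
        srv.close()

    threading.Thread(target=serve, daemon=True).start()
    return srv.getsockname()[1]


def peer_seen_by_backend():
    """Return (client source port, peer address as reported by the backend)."""
    proxy_port = start_proxy(start_backend())
    with socket.create_connection(("127.0.0.1", proxy_port), timeout=5) as client:
        client_port = client.getsockname()[1]
        client.sendall(b"hello")
        seen = client.recv(1024).decode()
    return client_port, seen


client_port, seen = peer_seen_by_backend()
print(f"client connected from port {client_port}, backend saw peer {seen}")
```

The peer the backend reports belongs to the proxy's own outgoing connection, normally with a different source port from the one the client used, which is why the backend's reply goes back to the proxy and not to the client.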

Steve

lewis

Correct, on the same 2.5.2 firewall I'm having a hard time updating and seeing segfaults on.

Correct, I've not actually disabled the haproxy service, I simply bypassed it by creating a new forward rule that goes to just one of the back end servers.

Yes, we've done a crazy amount of testing and that also included using haproxy with just one back end server. That's how it's been configured for a week as part of our testing. Only last night did I decide to bypass haproxy and send traffic direct to the single back end server using a NAT rule.

> That may or may not make a difference but I could certainly imagine it might in your network setup. Though I would expect it to fail without HAProxy in that case.

There really isn't anything that unusual about this network, it's just that it has multiple gateways on the same cable but clients and devices use their own gateways so there are no conflicts that I've noticed at least.

Plus, the traffic to the second pfsense firewall is minimal, it's just device management traffic. All production stuff goes through the firewall we're talking about.

It's the same as if you had one LAN switch, not using VLAN, nothing fancy, just one LAN switch. On that switch, you'd have some devices that are configured with 172.16.1.x IPs talking to each other and other devices with say 10.0.0.x talking to other 10.0.0.x devices on the same network. The only thing is, they are sharing the same physical cable and switch.

stephenw10

Yes, multiple gateways on one network segment opens numerous possibilities to fail!

Using HAProxy you could use a backend that was not using the firewall as its default route. That would fail if you then tried to use a port forward instead.
Since that's not what you're seeing here you are not hitting that particular failure mode but you need to be aware of it when using a network setup like that.
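
A small Python sketch of that failure mode, assuming HAProxy connects to the backend from the firewall's own LAN address. The backend address 10.0.0.20 and the second gateway 10.0.0.254 are hypothetical, as is the function name; the logic is simply "on-link destinations are reached directly, everything else goes to the default gateway".

```python
import ipaddress


def reply_next_hop(backend_interface, default_gateway, destination):
    """Return the address the backend hands a reply to for a given destination."""
    network = ipaddress.ip_interface(backend_interface).network
    if ipaddress.ip_address(destination) in network:
        return destination  # on-link: delivered directly, gateway not involved
    return default_gateway  # off-link: sent to the default gateway


FIREWALL = "10.0.0.1"         # firewall running HAProxy or the port forward
OTHER_GATEWAY = "10.0.0.254"  # hypothetical second gateway on the same segment
BACKEND = "10.0.0.20/24"      # hypothetical backend using OTHER_GATEWAY as default route
CLIENT = "1.1.1.1"            # external client, as in the logs

# Through HAProxy the backend's peer is the firewall itself.
via_proxy = reply_next_hop(BACKEND, OTHER_GATEWAY, FIREWALL)
# Through a port forward the backend's peer is the real client.
via_forward = reply_next_hop(BACKEND, OTHER_GATEWAY, CLIENT)

print("via HAProxy, reply goes to:", via_proxy)
print("via port forward, reply goes to:", via_forward)
```

In the proxied case the reply returns to the firewall that handled the request. In the port-forward case it leaves through the other gateway, so the firewall holding the connection state never sees it.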

Steve

lewis

I guess it means there's something I'm not understanding :).

I've always had devices on different subnets communicating together with and without firewalls.

Devices using 172.16.x.x talk with others on 172.16.x.x, 10.0.0.x talk with devices on 10.0.0.x and 192.168.0.x talk with others on the same.
Some use a gateway, some don't depending on what their tasks are.

In this case, it's that there is only one physical cable to get a LAN from one point to another but on that cable, there are now two firewalls, 10.0.0.x and 10.1.1.x. Only one firewall has DHCP, the other doesn't.

Devices on the first firewall have that one as their gw and communicate with other devices on the same network subnet.
Devices on the second firewall have that one as their gw and communicate with other devices on the same network subnet.

This conversation keeps diluting the problem with haproxy but you think there is a possibility that haproxy is not working well because of the above network.

I've not seen any problems so based on your input, there must be something I am missing.

Devices communicate with their own gw. The only time it was weird was while ARP was cached all over and one left over rule was overlooked.

I've not seen any problems since other than this haproxy and not being able to update the firewall.

> Using HAProxy you could use a backend that was not using the firewall as its default route.

This firewall is only working with devices that have the same network which is 10.0.0.x/24.
The back end servers are all on the same 10.0.0.x/24 and have the above as their gw.

> That would fail if you then tried to use a port forward instead.

I think you are saying if I used 10.0.0.1 firewall with haproxy and sent traffic from that to 10.1.1.x/24 devices? Not doing that for sure :).

It would not work anyhow since the devices on 10.1.1.x have their gw as 10.1.1.1, so traffic would not get to them without a funky config using VIPs or something, and their outgoing traffic would want to go out the 10.1.1.1 gw.

> Since that's not what you're seeing here you are not hitting that particular failure mode but you need to be aware of it when using a network setup like that.

Ok, I think you're just warning me not to do stuff like that. I agree, I won't be doing that.

I believe you helped me when I was setting all this up and with some other problems, and I've learned quite a lot, even if I don't remember it all just yet.

stephenw10

Yes, just be aware it would be very easy to introduce asymmetry and I've seen that bite people many, many times!

If you really are seeing an issue in HAProxy then a pcap should prove it.

I would expect to see something logged though.

Was this working in the old network setup?

Steve

lewis

We had the proxy going for the past couple of years approximately.
During that time, we've had lots of complaints about 500/504 but always blamed our own resources, never once thinking it could be the proxy.

So to answer your question, there is really no way to know other than when I posted this, that was around the time we realized what was happening.

We had taken the proxy out of the mix to do some testing so it was off for maybe a week. Then when we re-enabled it, the timeout complaints started again which got me wondering what was going on. That's when I disabled it again and since then, the complaints stopped and we too were no longer getting them.

We know one problem was on the back end: there was an issue with the database not responding fast enough, which caused 504s, but we were aware of those and could see them in the logs.