Load balancer pool query
-
I got no response from an earlier post about this, so I'll ask the question in a different way in case someone might understand it better this way.
What could cause several of my load balancer pools to be marked as down?
All my pools use the same two servers, so it isn't because the machines/web service is down.
I can telnet to the ports in question from my pfsense box no problem.
There is definitely a rule in place on the WAN to allow traffic to these servers/ports. It only seems to affect the SSL virtual server for these pools though, judging by the status shown in the load balancer/virtual servers status screen.
Everything had been fine up until I added the last three load balancers. Attempts to get to the web sites for these virtual servers ended up with the DNS Rebind error screen being displayed. I've got round this with a couple of extra ticks in the Advanced config screen, but now it just reverts to the UI login screen.
Anyone have any clues as to what else I could check to see why this is happening?
Thanks.
-
I don't know but the only time I had issues where it showed down and the server was really up was when I tried to load balance different server ports.
This is with a single pool with 1 server and then creating a second pool with 1 server as a BACKUP for failover only.
EXTIP1:443 -> Pool1 - InternalServer1:8000
           -> Pool2 - InternalServer2:8001 (set as backup pool for Pool1)

The status showed the last one as down when I knew they were both up.
It turns out the relayd.conf file gets written with both of them as port 8000. I don't know if that is a limitation in relayd that pfsense deliberately works around by making them the same, or if it simply wasn't anticipated that someone would want to mix server ports in an LB setup.
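To illustrate the symptom, the generated config ended up looking something like the sketch below (a hypothetical excerpt using relayd's redirect/forward syntax - the table names, IPs and layout are made up, not copied from my actual file):

```
table <Pool1> { 10.0.0.10 }
table <Pool2> { 10.0.0.11 }

redirect "websrv" {
        listen on $ext_ip port 443
        # Both forwards get written with port 8000, even though
        # Pool2's server was configured in the GUI with port 8001
        forward to <Pool1> port 8000 check https "/" code 200
        forward to <Pool2> port 8000 check https "/" code 200
}
```

With both forwards probing port 8000, the health check against InternalServer2 fails and that pool shows as down even though the server itself is up on 8001.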
If you haven't already done so I would take a look at /var/etc/relayd.conf and see if it looks correct. Maybe there is another issue where the config doesn't get written correctly.
-
I've checked relayd.conf and it looks fine to me. I used exactly the same routine to add these new load balancers as I have in the past, so was pretty confident it would work. I've even deleted the failing LBs and added them fresh - still the same outcome.
I can't help thinking something is going on for these LBs to give the DNS Rebind error screen instead of redirecting as I expected them to.
-
You mention you did exactly the same routine, so I wonder if you made sure to select https as the Monitor for SSL servers instead of http. If you selected the http monitor for an SSL site you would get the same results (the SSL sites would show as down).
That is pretty basic so I doubt you made that mistake but sometimes simple things get overlooked (I know that from my own silly mistakes :)).
-
Yes, definitely HTTPS as the monitor. I did try ICMP as well, just in case, but that didn't work either.
As an extra check, I can confirm that I can telnet to the appropriate ports on both web servers from the pfsense firewall, and if I try using the public IP address of one of the web servers, I get right through to the lighttpd page I expect to see, so I'm pretty sure the backend is set up correctly.
I've also tried extending the timeouts on the relayd global settings page (Services: Load Balancer: Settings), with no success.
-
Still the same situation here …
I did see one previous post that said rebooting the firewall cleared up the problem. This is a production firewall that I need to get written permission to take down, so I haven't rebooted it yet.
My question is this - is a reboot likely to have any positive effect? If not, I'd rather avoid the red tape of getting authorisation for one.
A reboot is now the only troubleshooting option I have left. Anyone else got any ideas?
-
For posterity, and anyone else googling in the future, this is now resolved.
The problem I had was that the web sites running against the pools were all password protected to prevent unauthorised access before production could begin. This meant that when relayd sent out its http(s) checks every few seconds, these pools would return a 401 code rather than the expected 200 code.
The fix was simply a case of adding a new monitor called HTTPS401 that checks for a 401 return code, assigning the new monitor to the new pools, restarting the load balancer service, and watching the new pools become available.
So simple now I can stand back and see it.
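In relayd.conf terms, the fix boils down to changing the expected response code on the health check. A hedged sketch of what that looks like (the table name, IPs and path here are hypothetical - pfsense generates the real file from the HTTPS401 monitor settings):

```
table <newpool> { 10.0.0.10, 10.0.0.11 }

redirect "ssl-site" {
        listen on $ext_ip port 443
        # Expect 401 (authentication required) instead of 200, because
        # the sites are password protected and relayd's periodic probe
        # is unauthenticated - a 401 here means the server is healthy
        forward to <newpool> port 443 check https "/" code 401
}
```

The general lesson: relayd marks a server down whenever the check response doesn't match the configured code exactly, so the monitor has to expect whatever an unauthenticated probe of a healthy server actually returns.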