Upgrade 2.5.2 to 2.6.0, upgrade success, Limiters not passing
-
@thiasaef said in upgrade 2.5.2 to 2.6.0, upgrade success, no internet connection:
I bet my ass that the issue is related to the DNS Resolver being fucked up once again.
I stand corrected, I have the exact same issue in 2.5.2, sorry!
By the way: When I set the DNS server via DHCP to something other than the firewall itself, it works fine on all LAN interfaces.
-
@thiasaef said in upgrade 2.5.2 to 2.6.0, upgrade success, no internet connection:
I stand corrected, I have the exact same issue in 2.5.2, sorry!
Which does not make it any better, to be honest. How is it possible that a major issue like this, known for at least two months, made it into a release like 2.6.0?
-
This thread is getting very confusing; there are at least three different issues being discussed.
It's mostly about Limiters not passing traffic though, so let's keep it for that. Please open a new thread for Unbound problems.
Steve
-
Hello,
Looks like I am late to the party, but we are also experiencing this issue with our limiters since the upgrade to 2.6.0. I have two different limiters configured on two different inside interfaces. No NAT is in use on either interface. Each interface has a subnet of publicly routable IPs. The limiters were configured to rate-limit specific host IPs within each subnet and worked as expected under 2.5.2. Now they block traffic; I cannot even ping from the limited host to the interface IP. Removing the limiter from the In/Out Pipes immediately restores full connectivity.
I have watched the Limiter Info output, and it appears to be hard limiting at 50 packets and then dropping everything after that point, if I am reading this correctly. Every time I re-add the limiter I can ping until this number hits 50, and then everything stops.
00008: 27.000 Mbit/s 0 ms burst 0
q131080 50 sl. 0 flows (1 buckets) sched 65544 weight 0 lmax 0 pri 0 droptail
sched 65544 type FIFO flags 0x1 256 buckets 1 active
mask: 0x00 0xfffffff8/0x0000 -> 0x00000000/0x0000
BKT Prot Source IP/port Dest. IP/port Tot_pkt/bytes Pkt/Byte Drp
168 ip 66.###.###.40/0 0.0.0.0/0 4975779 1355919070 50 10861 2291
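For anyone following along, the same counters can also be pulled from a shell session rather than the GUI; a minimal sketch, assuming the stock dnctl utility that backs Diagnostics > Limiter Info (reading the column layout, the 50 in that row looks like the current backlog in packets and the final column the cumulative drops):

    dnctl pipe show     # pipes: bandwidth, current backlog (Pkt/Byte) and drop (Drp) counters
    dnctl queue show    # child queues, if any are configured under the limiter

-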
Hmm, interesting. Seems likely that's because the queue length is 50 packets by default.
If you can, try setting the queue length to something longer and see if that changes the number of passed packets.
To be clear, you see replies to the first 50 pings?
Hard to see how it could be filling the queue but passing at the same time.
Steve
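For a quick test, the queue length can also be poked from the shell; a rough sketch only, assuming dnctl accepts the usual ipfw-style pipe config syntax and using the pipe number from the output above (the GUI queue length field is the supported way, and any manual change is overwritten on the next filter reload):

    # illustrative only: re-apply the same 27 Mbit/s rate but with a 500-slot queue on pipe 8
    dnctl pipe 8 config bw 27Mbit/s queue 500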
-
Steve, I have an update...
I use a good number of pfSense firewalls (more than 20), and I can confirm that the problem occurs on version 2.6.0.
But I also have a good number of pfSense Plus installs updated to the latest version, 22.01, and they seem to work perfectly. I'm talking about at least five pfSense Plus installs on Netgate appliances (Netgate bare metal). These use limiters and work perfectly. Can this help in understanding the cause?
Luca
-
Possibly. I've so far been unable to replicate it on anything. There must be some combination of things that causes it, because some people seem to be hitting it with a very basic Limiter setup.
Are you able to get me a status output or config file from any of the installs that are seeing this?
Or just some way to set up something that repeatably hits it?
Steve
-
Hi Steve,
The pfSense Plus installs that use limiters and have NO problems (in fact, none of my pfSense Plus installations have problems) use a simple configuration on official Netgate bare-metal systems (typically the 7100 rack model).
The pfSense CE 2.6.0 installs that do have problems share almost identical characteristics: VMs in Proxmox, virtio NICs, a WAN, an MPLS link, a dozen segments with appropriate ACLs, a few ports NATed from the WAN to the DMZ (with limiters), and all internal segments limited towards the WAN.
The limiters are configured with Tail Drop, the default scheduler, no mask, and a queue with source address / destination address masks. Very standard. This configuration works perfectly on version 2.5.2. In my case there are no problems immediately after a reboot; only after some time is traffic dropped (both from the WAN to the DMZ, via NAT, and from the internal segments towards the WAN). The problem occurs on multiple firewalls.
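To make that concrete, at the dummynet level a setup like the one described corresponds roughly to the following (a sketch only; pipe numbers, bandwidth, and masks are illustrative and not taken from the real configuration):

    # upload side: pipe with no mask (tail drop is the default), child queue masked per source address
    dnctl pipe 1 config bw 50Mbit/s
    dnctl queue 1 config pipe 1 mask src-ip 0xffffffff
    # download side: mirror of the above, child queue masked per destination address
    dnctl pipe 2 config bw 50Mbit/s
    dnctl queue 2 config pipe 2 mask dst-ip 0xffffffff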
Since these are production systems, I rolled back to 2.5.2.
Luca
-
I am also using the DNS Forwarder, pointed at OpenDNS. Maybe 50% of DNS calls seem to be failing through my updated pfSense 2.6.0 box. No Limiters, very simple rules. If it's not the DNS Forwarder, could something be wrong with the default block-bogons rule somehow?
I've been using this same configuration for multiple releases. I have another box nearby running 2.5.2 with the same configuration (I just hadn't upgraded that one yet), and all computers connected to it are having no problems.
The DNS issue seems to be tied to the 2.6.0 version, and I don't believe it's related just to Limiters.
Thanks; I've been using pfSense for about 5 years now with no problems like this. Easily a 50% failure rate connecting to anything on 2.6.0.
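If it helps quantify that failure rate before the DNS discussion gets split into its own thread, a rough sketch of a repeat-query test run from the firewall shell (assuming the drill utility from the FreeBSD base system; 192.0.2.1 is a placeholder for the forwarder address):

    ok=0; fail=0
    for i in $(seq 1 100); do
        # count a query as failed unless the forwarder answers with NOERROR
        if drill example.com @192.0.2.1 2>/dev/null | grep -q 'rcode: NOERROR'; then
            ok=$((ok+1))
        else
            fail=$((fail+1))
        fi
    done
    echo "ok=$ok fail=$fail"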
-
You should open a new thread (or use a different open one) for issues with dnsmasq if you're seeing that.
Please keep this thread for issues with Limiters, which I have still been unable to replicate in 22.01 or 2.6.
Steve
-
Steve,
If you want, I can send you a full configuration, now running in production on 2.5.2, that presented problems after upgrading to version 2.6, if you think it might be useful.
Luca
-
@luca-de-andreis Yes please, hit me in a PM with a link if you can.
-
I am initially receiving ping replies with the limiter enabled, but not for 50 pings.
I increased the queue length to 1000 on the limiter. When first applied, I was able to ping 15 times with replies. Then the queue started filling up. This time the queue did in fact fill to 1000 packets before the Limiter Info output indicated anything was being dropped. However, the ping replies stopped as soon as packets started entering the queue. That's the end of the road: the packets never pass out of the queue. Even when I disconnect my test host from the network completely, so there is absolutely no more input on the interface, the queue remains full indefinitely until the rules are reloaded.
I tried each of the various Queue Management Algorithms available in the Limiter properties, although I must admit I am not familiar with these options. Random Early Detection seems to allow traffic to continue flowing; I see packets being dropped rather than queued in the Limiter Info output. I was able to browse and run several speed tests without losing connectivity. The speed test results were noticeably lower than the rates configured in the limiter, but they completed successfully. All other algorithm choices produced the same result of blocking traffic as soon as packets began entering the queue.
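For reference, the GUI algorithm choice maps down to the dummynet queue-management keywords; a rough dnctl-level sketch of the difference between the Tail Drop and Random Early Detection cases (the numbers are illustrative only, using the ipfw-style red w_q/min_th/max_th/max_p form):

    # Tail Drop (the default): packets queue up to the slot limit, then everything further is dropped
    dnctl pipe 8 config bw 27Mbit/s queue 1000
    # RED: drop probabilistically once the queue sits between the min and max thresholds
    dnctl pipe 8 config bw 27Mbit/s queue 1000 red 0.002/50/500/0.1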
-
OK Steve, I've just sent the link and password to download the XML file.
I've removed public IPs and altered the passwords and key digests...
Luca
-
I found out that using the 2.7.0.a.20220305.0600 devel build on a VMware test bench, the limiter issue is gone and it seems to be working normally.
The previous devel version had the same issue as 2.6.
I also tried the captive portal patch on 2.6 without success regarding the limiter failure. Perhaps this all ties into the IPFW function?
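One quick data point that may help answer that is which dummynet/ipfw kernel modules are actually loaded on an affected box; a minimal check, assuming shell access:

    # dummynet backs the limiters; whether ipfw is loaded at all depends on what else (e.g. the portal code) pulls it in
    kldstat | grep -iE 'ipfw|dummynet'

-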
Is it possible that everything functions correctly on pfSense Plus 22.01? I have several firewalls running pfSense Plus 22.01 with active limiters and queues, configured exactly like the 2.6.0 installs that have problems. But... what are the differences between pfSense Plus 22.01 and pfSense 2.6.0?
Luca
-
@blikie said in Upgrade 2.5.2 to 2.6.0, upgrade success, Limiters not passing:
I found out that using the 2.7.0.a.20220305.0600 devel build on a VMware test bench, the limiter issue is gone and it seems to be working normally.
Previous devel version had the same issue as 2.6.
You mean literally between the 4th and 5th of March snapshots?
That is when the root fix for the captive portal issue went in, which would be very interesting!
Steve
-
@stephenw10
Yes, I went from a clean install of 2.6 to 2.7 devel a week ago and it had the same issue. Fast forward to that day: I updated to that build and was surprised to find the same config, unchanged, working with limiters on. I thought it might be something to do with the captive portal patch, so I ran another fresh 2.6 with the patch applied, but the issue persisted, only on 2.6. All of these run in VMware Workstation 16.2.2.
-
Ah, that's great to know. Yeah, it's not the same patch.
OK, that should make it much easier to replicate now that we have some idea of what's triggering it.
Steve
-
Finally, it can be said that the problem has been identified.
stephenw10 of the Netgate development team has identified the cause of the problem.
The problem occurs when the Captive Portal is active and limiters are in use. Due to a captive portal bug, even if the portal is configured on a different interface, the limiters do not work correctly on any interface (including the WAN, and therefore the ports NATed towards the internal segments that are subject to limiters).
If you don't use the Captive Portal, the limiters work without problems.
Luca
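For anyone who wants to confirm the interaction on an affected box before a fix is available, a rough procedure (commands are illustrative only, assuming shell access and the stock dnctl/pfctl tools):

    # 1. with the Captive Portal enabled, note the backlog and Drp counters on the affected pipe
    dnctl pipe show
    # 2. disable the Captive Portal zone in the GUI, then reload the generated ruleset
    pfctl -f /tmp/rules.debug
    # 3. re-test the limited host; the backlog should drain and the drop counter should stop climbing
    dnctl pipe show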