pfSense not routing or assigning connections properly after a reboot

oldlongjohnson

I took down my pfSense machine for some maintenance (replaced the power supply) and after bringing it back up, I'm having some really strange issues.

From what I have been able to figure out, some machines are no longer able to get DHCP, and some machines are no longer able to access the gateway at all.

My configuration is 2 network cards, one 2-port optical and one 4-port NIC. One port on the optical card is used to connect to the ISP and that's working fine. I can connect to my ISP, and working machines on my network can use that connection without issue. The other optical port is used to connect to my NAS, and it is "working", but in a degraded state. The 4-port NIC connects the rest of my network through some switches. All of these interfaces except the WAN interface are bridged so I can have the machines on the same subnet. There are firewall rules allowing "all" across the bridge, like so:

Same rule for each port that's part of the bridge, OPT1-5.

The machines that were having issues with DHCP were all connected to the 4-port NIC. Assigning them static IPs allowed me to bring them back online. When they try to get a DHCP lease, I see dozens of these messages in the firewall logs:

OPT3 	Block IPv4 link-local (1000000101) 	169.254.149.23:137		169.254.255.255:137		UDP

Or similar, depending on which port they're connected to. 169.* is not my subnet, I'm using 192.168.1.0/24.

The NAS with the degraded connection has a static IP. I can connect to the local network from it no problem, and the local network machines (that are working) can all see it as well. I can ping every other online machine except the firewall. I can SSH freely to everything, so the network connection is working. However, I have no internet connectivity on that machine, and any connection to the pfSense box fails. I can, however, ping from the pfSense machine to that machine over the connection, so it seems like something on the pfSense side is blocking it. I also see those strange link-local messages in the firewall logs, even though I'm using a static IP, which has me really confused.

I can get a working connection over ethernet to the NAS through a switch connected to a port on the 4 port NIC, I can even get a DHCP lease. The very strange part is a machine connected to the port on the switch right beside it is one of the ones failing to get a DHCP lease...

Anybody have any idea what might have happened here, or even some suggestions as to what to look at next? This was working without issue before the maintenance, and there are no hardware issues that I can find. All the logs look OK except for those strange link-local messages. There are no messages about any driver or service failures, and the DHCP logs aren't showing any failures either. The power supply I swapped in is known-good. It was powering a much more demanding machine before I swapped it into the pfSense box, and I double-checked that all the connections to the motherboard and drives are secure.

oldlongjohnson

One more strange thing I've noticed: it seems like the optical connection is working in some capacity on the NAS. If I have the ethernet plugged in to obtain the connection, while leaving the fiber plugged in, and run a speed test, I see some of the traffic is being sent across that port:

Which aligns with the results because obviously this is impossible for gigabit ethernet...

I noticed this because in the pfSense dashboard I can see OPT5 (the optical connection) was sending some data across it, up into the Mbps range, which it shouldn't be if it was just pings or random packets across the network. But if I disconnect the ethernet connection, there's no more internet connection on that machine. I'm very confused.

stephenw10

Did you set/change the bridge filtering sysctls?

Did you do that after creating the bridge and have not rebooted since?

It's possible they were only applied to the bridge when it was then re-created at boot.

If you're running DHCP on the bridge and have not spoofed the MAC address, Windows clients will see it as a new network, as the bridge will have generated a new random MAC.

Those rules you have on OPT5 are showing 0 states and 0 bytes, so nothing is hitting them.

Steve

oldlongjohnson

I had rebooted a ton of times recently actually, which is why I didn't expect any problems this time. I use PPPoE to connect to my ISP, and with 2.4.5 there's the bug where, when you change anything related to interfaces, the PPPoE doesn't reconnect until a reboot. I'd actually just rebooted yesterday after backing up my config in preparation for this, in case of a worst-case scenario where I need to restore to new hardware.

The screenshot was just an example since they're all identical. Nothing had actually run on that port yet when I took it; I'd just rebooted and hadn't turned that other machine on. There is traffic against them now.

I actually partially figured it out, though I'm still really confused about what happened.

I had a kind of... reverse loopback? I accidentally switched the cables on ports 1 and 4 around on the NIC when I reconnected everything (they're identical at that end, easy mistake). Port 4 was originally feeding the switch which had most of the "bad" DHCP clients on it. Port 1 was a dead port on the client end that was previously used to feed gigabit directly to the NAS (and that I was using again now for testing).

Somehow, switching the cables caused everything to try to feed through NIC port 4 out to the NAS, and then use the optical connection in the NAS as a gateway to the pfSense gateway. Looking through my configuration, I have absolutely no idea how this was possible or even worked at all. 192.168.1.1 is my pfSense, 192.168.1.149 is my NAS address on ethernet, and 192.168.1.100 is the NAS address on the fiber port. Everything showed 192.168.1.1 as the gateway, traceroutes showed all traffic using that as the first hop, and the connections are NOT bridged on the NAS side. Unplugging the ethernet at the NAS caused all the problematic machines to lose connection and completely drop off the internet, even the ones with static IPs. They still stayed connected via the LAN though, which is how I missed that in the first place.

I switched the cables back to their proper place and everything is working fine now. I need to hunt through my config settings because there is absolutely no way that should happen, and I still don't even understand how it did after reviewing everything again. I'm not using any kind of static assignment; all static IPs are assigned at the client side, so from pfSense's perspective a port is a port is a port. Or at least that's how I thought I had it configured...

Also I'm going to add labels to all the cables just in case for the future :)

johnpoz

@oldlongjohnson said in pfSense not routing or assigning connections properly after a reboot:

Or similar, depending on which port they're connected to. 169.* is not my subnet, I'm using 192.168.1.0/24.

169.254 is APIPA. Pretty much any OS will assign itself a random 169.254 address when it's set to DHCP and doesn't get a lease.
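To sort addresses from the firewall log into "link-local" versus "on the LAN", something like the following minimal Python sketch could be used. It relies only on the standard `ipaddress` module; the `classify` helper name is hypothetical, and the subnet and sample addresses are the ones quoted earlier in the thread.

```python
import ipaddress

# LAN subnet as stated in the original post (assumption: a single /24).
LAN = ipaddress.ip_network("192.168.1.0/24")


def classify(addr, lan=LAN):
    """Label an address as link-local (APIPA), on the LAN, or other.

    `classify` is a hypothetical helper for illustration only.
    """
    ip = ipaddress.ip_address(addr)
    # 169.254.0.0/16 is the IPv4 link-local block that clients fall back to
    # when DHCP gives them no lease.
    if ip.is_link_local:
        return "link-local (APIPA)"
    if ip in lan:
        return "LAN"
    return "other"


if __name__ == "__main__":
    # Source address from the blocked log entry, then the NAS and pfSense.
    for addr in ("169.254.149.23", "192.168.1.149", "192.168.1.1"):
        print(addr, "->", classify(addr))
```

A source address that lands in the link-local bucket suggests that some interface on that host is set to DHCP and never received a lease, even if another interface on the same host has a static address.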

oldlongjohnson

Yeah, that's actually why I'm even more confused, because 2 of the machines (the NAS and another machine) were showing up in the logs with addresses in that range, even though they have static IPs assigned.

I still haven't been able to figure out why my network falls in on itself when those 2 ports are swapped. Everything is looking good so far in the configs. I'm sure it's going to be something really obvious I'm overlooking...

stephenw10

It's probably some 'bridge as a switch' weirdness. A bridge is not a switch, and though it mostly functions the same, an actual switch is much better there.

The NAS is configured to route traffic?

Which PPPoE issue are you referring to there? In 2.4.5p1?

I use PPPoE connections (over VLANs) here and don't see any issue with it. I haven't seen a problem there since this was fixed in 2.4.4p2: https://redmine.pfsense.org/issues/9148

Steve

oldlongjohnson

Yeah, that's the PPPoE issue. I saw it in the fixed issues list for 2.5.0 and that it's targeting that release, so I assumed it wasn't in yet? That's the exact same behavior I'm seeing on 2.4.5-RELEASE-p1: if I make a change to any interface, PPPoE goes down and there's no way to recover (reliably) without a reboot. I am also doing PPPoE over VLAN.

The NAS is not configured to route traffic as far as I can tell. I didn't set that up, or at least not intentionally. It used to use just the gigabit ethernet connection, but I got a 10gig card for it a few months ago and set that up. Rather than remove the old networking config, I just unplugged the cable.

I agree it's probably a bridge-as-a-switch issue. Even after 2 hours combing through every config and every log, I still can't make heads or tails of it. The only thing I can think of is that because Port 1 is the "main" bridge interface, maybe it didn't like having so many different machines connecting on it? Because aside from it being the main interface, there's absolutely no difference in configs between it and Port 4 that I can see. The only difference physically is that Port 1 has a single, non-switched connection, whereas Port 4 has 10 different machines across 2 switches on it.

At some point I will get a 10GbE SFP+ capable switch so I can have just one each of WAN/LAN interfaces in pfSense and really simplify the config, but they're just too expensive to justify right now when this config works, at least when I'm not breaking it by being dumb :)