Host not accessible from a subnet

dxun

I've been trying to chase down an odd problem: I can't get through to a particular host from a particular subnet. I'd appreciate any ideas as to why this might be.

My basic LAN setup is this:
-> L1 (192.168.1.0/24) - LAN
-> L2 (192.168.2.0/24) - OPT1
-> L3 (192.168.3.0/24) - OPT2

There is only one firewall rule per subnet - pass all traffic - so nothing is blocked or rejected; any block/reject rules are disabled. The firewall state table was reset prior to these attempts.
All nodes in L2 access the LAN via a wireless router (Linksys EA2700) which I've stripped of any router responsibilities (disabled NAT, DHCP and so forth) and connected directly to the pfSense node via its LAN port - so it's more of a wireless switch. No other ports on this router are occupied - it's here only as a wireless AP, nothing else.

I am trying to reach a host on L1 with IP 192.168.1.220 (let's call it host X) from L2 and for reasons that escape me - I am unable to do so.
On L2:

pinging the host X's IP doesn't get a response; pinging any other IP on L1 succeeds,
pinging the DNS entry associated with host X's IP resolves the IP correctly but doesn't return the ping; doing the same with any other node in L1 is successful and returns the ping,
trying to establish TCP connection to that host (by entering the address in the browser) times out; accessing any other node in L1 is successful.

Trying the same from other subnets works just fine and host X is accessible.

I've triple-checked the wireless router config and couldn't find any spurious configuration - I've performed a factory reset and again disabled any routing capabilities. It did not help.

I've tried to root out the problem by looking at the firewall logs and I am confused - there is absolutely no entry in the logs about any traffic towards host X, irrespective of whether I try to access it from L2 (and fail) or from any other subnet (which succeeds). In fact, the attached picture is a search query run just after I pinged the machine and accessed it via browser - I would've expected at least some TCP traffic logged against host X. If nothing else, I interpret this as the firewall not blocking host X with any rule or remaining state.

I've tried capturing the traffic of L2 machines from pfSense - no help there, I basically saw the pings getting no replies and that's it.

Finally, I fired up Wireshark and inspected the traffic while trying to access host X from an L2 node - all I can see is ICMP echo requests getting no response and TCP SYNs being retransmitted, with TCP backing off exponentially on each retransmission; I could discern no root cause.
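
For reference, the retransmission pattern described above (each wait doubling the previous one) can be sketched as follows. This is only an illustration: the 1-second initial timeout follows the RFC 6298 recommendation, and both it and the retry count are assumptions that vary by operating system.

```python
def syn_retransmit_times(initial_rto=1.0, retries=5):
    """Seconds after the first SYN at which each retransmission is sent.

    initial_rto and retries are assumed values; real stacks differ.
    """
    times = []
    elapsed = 0.0
    rto = initial_rto
    for _ in range(retries):
        elapsed += rto      # wait one full timeout before retransmitting
        times.append(elapsed)
        rto *= 2            # exponential backoff: double the next timeout
    return times


print(syn_retransmit_times())  # [1.0, 3.0, 7.0, 15.0, 31.0]
```

Seeing this pattern only tells you that no SYN-ACK ever came back; it does not say where along the path the packet or its reply was lost.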

In a going-quite-mad move, I've even enabled UPnP in pfSense and the wireless router in hopes of either rooting out some setting on the router I've overlooked or just some miraculous negotiation occurring - of course, it didn't do anything for host X.

Finally, I have restarted the pfSense machine, the wireless router and the L2 nodes - nothing.

At this point, I am stumped - this should by all accounts work and I cannot imagine why it doesn't. I don't know if I can coax anything more out of logs - if I can, I don't know how, so if you have any suggestions, let me know. If you have any other ideas I could try, I am all ears - I am really out of them.

Derelict

Probably not going to find the source of your problem in pfSense if everything is as you say it is.

You might want to spend some time examining the problem host and everything else you can think of.

kejianshi

Probably the problem is the wireless router.

Derelict

Host X somehow having a local route to/interface on 192.168.2.0/24 (L2) could also cause this behavior.

~~As could a /23 netmask on Host X.~~ (nm. that would be 192.168.0 and 192.168.1)
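
Both points above can be checked with a short sketch using Python's standard `ipaddress` module. The routing table is hypothetical: the interface names `nic-a`/`nic-b`, the /24 masks, the gateway 192.168.1.1 and the client address 192.168.2.30 are assumptions for illustration, not taken from host X.

```python
import ipaddress


def covers(interface_cidr, peer_ip):
    """True if peer_ip falls inside the network of the given interface address."""
    network = ipaddress.ip_interface(interface_cidr).network
    return ipaddress.ip_address(peer_ip) in network


def pick_route(routes, destination):
    """Longest-prefix match: the most specific matching network wins."""
    dest = ipaddress.ip_address(destination)
    matches = [
        (ipaddress.ip_network(net), via)
        for net, via in routes
        if dest in ipaddress.ip_network(net)
    ]
    return max(matches, key=lambda m: m[0].prefixlen)[1]


# A /23 on host X spans 192.168.0.0 - 192.168.1.255, so it does not swallow L2.
print(ipaddress.ip_interface("192.168.1.220/23").network)  # 192.168.0.0/23
print(covers("192.168.1.220/23", "192.168.2.30"))          # False

# Hypothetical routing table for host X with a stray second interface on L2.
ROUTES = [
    ("192.168.1.0/24", "nic-a (directly connected)"),
    ("192.168.2.0/24", "nic-b (directly connected)"),
    ("0.0.0.0/0", "gateway 192.168.1.1 via nic-a"),
]
print(pick_route(ROUTES, "192.168.2.30"))  # nic-b (directly connected)
print(pick_route(ROUTES, "192.168.3.10"))  # gateway 192.168.1.1 via nic-a
```

With a table like this, replies to L2 clients would leave through the stray interface instead of going back through the gateway, while replies to every other subnet would still take the default route. That would match a host that is unreachable from exactly one subnet.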

kejianshi

"I've even enabled UPnP in pfSense and wireless router"

That router shouldn't be doing anything. It should be acting as a switch.
If its incapable of ONLY acting as a switch, you should replace it with a simple AP.

dxun

I agree with your assessments, my current focus is on the wireless router - I just don't see the pfSense VM as being a factor in this problem, unless I am misreading the logs or not looking where I am supposed to.

The thing is - it all worked perfectly a couple of days before. Then I added a new interface to pfSense (a new vmnic) and a new subnet (L3), configured it and it was only yesterday that I discovered the wireless router had gone completely haywire and lost all wireless settings (it had worked perfectly before I added the L3 subnet and traffic was normally flowing to host X). I couldn't access its admin console no matter what I tried so I disconnected it from LAN, performed a factory reset and re-configured it to work in router-less mode by connecting just the router and a single L2 node via LAN cable.

I configured the router to have its address on 192.168.2.100 (so as to be able to access its configuration console) and disabled every other service I could find on the router. What's perhaps worth mentioning is that I can access the admin console from L2 subnet only. I can see the router in pfSense ARP table and I can see its static DHCP lease as active. I don't know if this is orthogonal to my current issue (probably is, but still) so the router is probably allowing access to the admin console from its own subnet. Do you think this could indicate some other problem?

Just to be more specific - host X is NAS and its configuration has been constant throughout. As I said, no firewall block/reject rules exist and the IP itself is not special in any way.
I have also disabled UPnP - it was a desperate troubleshooting manoeuvre and it yielded nothing so I reverted it. I agree with you - the router should do no routing and only switching.

Is there any screenshot I could post or anywhere in particular where you'd look yourself?

Derelict

@dxun:

I configured the router to have its address on 192.168.2.100 (so as to be able to access its configuration console) and disabled every other service I could find on the router. What's perhaps worth mentioning is that I can access the admin console from L2 subnet only. I can see the router in pfSense ARP table and I can see its static DHCP lease as active. I don't know if this is orthogonal to my current issue (probably is, but still) so the router is probably allowing access to the admin console from its own subnet. Do you think this could indicate some other problem?

Are the other L2 devices in pfSense's ARP table?

What happens when you unplug the wireless and connect a test host instead? Does it exhibit the same behavior?

Yes, being able to access the admin interface from the L2 subnet only means you have something wrong in the wireless device. My first guess would be default gateway but there might be ACLs in it too. If it's really set up as a bridge/access point, that shouldn't affect wireless clients, only access to the web interface.

dxun

Good idea, let's set the admin interface issue aside for the moment.
Here are the results - in the first ARP screenshot, there is only one host (red @ 2.30) on L2, the wireless router (blue) and the host X on L1 (red @ 1.220). Ping fails for 220 and passes for the rest.

The second ARP screenshot was taken after disconnecting the wireless router from OPT2, connecting the host previously @ 2.30 (green @ 2.41*) directly to OPT2 and pinging host X. Ping fails for 220 and passes for the rest.

This is unexpected - it seems the problem lies somewhere in pfSense after all.

Even though the packet log is still empty, I am attaching my current LAN firewall rules (there are no floating rules and the WAN interface has only the two default blocking rules defined, nothing else).

What next?

=======

* It's been given 2.41 because its wireless MAC has a static DHCP lease and its wired LAN MAC does not.

Capture-over-wireless-router.png_thumb

Capture-directly-to-pfSense-OPT2.png_thumb

LAN.PNG_thumb

OPT1.PNG_thumb

Derelict

I'd say the problem lies in Host X, not in pfSense.

You should be able to eliminate pfSense by taking traffic captures on interface L1. Either the echo request / tcp syn / etc to host x is sent or it isn't.

What's the ipconfig /all or ifconfig -a or equivalent on host X? And maybe its routing table (netstat -r).

Does it have AV/software firewall? How is it configured? Try disabling it?

If you take host X off the network and put something else on with the same IP does the same thing happen?

If you change the IP of Host X does the problem follow it?

dxun

Success! Sir, you were right - many thanks, the problem was in host X. As soon as I fired up ifconfig -a on X, it struck me.

Namely, X has two LAN ports, both of which should've been assigned to L1. As it turns out, only one had been assigned to L1 (via DHCP); the other LAN port was statically assigned (and now you'll laugh) to 192.168.2.100, exactly the same IP I assigned to the wireless router. Once I assigned both LAN ports of X to L1, ping started working and X became accessible from L2. The wireless router admin interface is still inaccessible from L1 but I can live with it (the Linksys router is commodity hardware and I couldn't figure out how to make it accept access to the admin console from different subnets - not from the web UI, at least).
Curious to see an IP address on L1 being inaccessible because of an IP conflict on L2. Could you explain why this issue was confined to this particular IP?

I have another issue I didn't want to tackle until we got to the bottom of this - it's about DHCP leases.

Currently, all known devices on the LAN are statically leased - which means their MAC address is bound to an IP address on the DHCP server. And it works as I'd expect - mostly.
It appears DHCPD is assigning one device to the L3 subnet (OPT2 interface) even though I've assigned the device to L1 - I cannot understand why this would be happening, as the device is not statically configured (I've double-checked as I type these words). Also, take note that on the third screenshot I've left parts of the MAC addresses unobfuscated to signal those ARE exactly the same MAC addresses... except DHCPD is ignoring the binding and decides to give the device a dynamic lease from the range pool. More precisely, I'd expect the device to occupy 1.20 instead of 3.254.

So I've tried to narrow down the range pool to a single IP address and additionally Deny unknown clients from ever joining the L3 subnet (at least that's how I understand this feature). DHCPD ignores these settings and I am unsure how to proceed.

The reason I am still mentioning this in this thread (instead of opening up a separate one) is that this whole thing started to unravel as I added a new interface to pfSense (OPT2) - that's where I noticed the ping problem from above and that's also the first time DHCPD started to "misbehave", initially assigning the host belonging on 1.10 to 3.1, which is when I added the constraints to L3. By "rattling the box", I've been able to force DHCPD to understand where it should place the host but I dislike that kind of brute-force approach as it is treating the problem symptomatically instead of attacking the root cause.

Any suggestions/thoughts?

OPT2-DHCP_lease-header.PNG_thumb

OPT2-DHCP_lease-detail.PNG_thumb

DHCP-leases.PNG_thumb

Derelict

I'm not going to even try with all that obfuscation. Nobody cares what your mac addresses are and it makes it impossible to tell if you're missing something.

TIP: There is nothing wrong with the DHCP server in pfSense. Look elsewhere for a problem if your config is valid. The absolute worst case would be stop the dhcp service and restart it.

TIP #2: release/renew the host in question while monitoring the dhcp log. I prefer to do this on the command line: clog -f /var/log/dhcpd.log

If there's lots of activity pipe it through grep.

dxun

Agreed about obfuscation - it's pointless to do it. DHCP inner workings were never in question, I was just at a loss as to where to look.

I've just inspected dhcpd.log and I am seeing something strange and confusing - it seems the DHCP request is coming through the wrong network adapter. Why would a request for an IP on L1 arrive through an adapter responsible for L3? It doesn't make any sense. Where would this be configured?

Here's the log excerpt:


Sep 23 12:42:36 pfsense dhcpd: DHCPDISCOVER from 00:08:9b:df:21:23 via em1
Sep 23 12:42:36 pfsense dhcpd: DHCPOFFER on 192.168.1.230 to 00:08:9b:df:21:23 via em1
Sep 23 12:42:36 pfsense dhcpd: DHCPREQUEST for 192.168.1.230 (192.168.1.1) from 00:08:9b:df:21:23 via em3: wrong network.
Sep 23 12:42:36 pfsense dhcpd: DHCPNAK on 192.168.1.230 to 00:08:9b:df:21:23 via em3
Sep 23 12:42:36 pfsense dhcpd: DHCPREQUEST for 192.168.1.230 (192.168.1.1) from 00:08:9b:df:21:23 via em1
Sep 23 12:42:36 pfsense dhcpd: DHCPACK on 192.168.1.230 to 00:08:9b:df:21:23 via em1
Sep 23 12:42:36 pfsense dhcpd: DHCPDISCOVER from 00:08:9b:df:21:23 via em3: network 192.168.3.0/24: no free leases

Derelict

You have to figure out why DHCP requests are being received on em1 and em3 from the same MAC at the same time. Check your layer 2 config.

Or in all the gyrations trying to fix the other problem did you end up with the same MAC cloned on multiple devices?

Or are your multiple NICs plugged into the wrong ports, or ???
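
A minimal sketch of how the symptom in the log excerpt above (one MAC arriving on two interfaces) could be spotted mechanically. The log lines are copied from the excerpt; the regular expression and the function names are assumptions made for this illustration, not part of pfSense or dhcpd.

```python
import re
from collections import defaultdict

# Log lines copied from the excerpt earlier in this thread.
LOG = """\
Sep 23 12:42:36 pfsense dhcpd: DHCPDISCOVER from 00:08:9b:df:21:23 via em1
Sep 23 12:42:36 pfsense dhcpd: DHCPOFFER on 192.168.1.230 to 00:08:9b:df:21:23 via em1
Sep 23 12:42:36 pfsense dhcpd: DHCPREQUEST for 192.168.1.230 (192.168.1.1) from 00:08:9b:df:21:23 via em3: wrong network.
Sep 23 12:42:36 pfsense dhcpd: DHCPNAK on 192.168.1.230 to 00:08:9b:df:21:23 via em3
Sep 23 12:42:36 pfsense dhcpd: DHCPREQUEST for 192.168.1.230 (192.168.1.1) from 00:08:9b:df:21:23 via em1
Sep 23 12:42:36 pfsense dhcpd: DHCPACK on 192.168.1.230 to 00:08:9b:df:21:23 via em1
Sep 23 12:42:36 pfsense dhcpd: DHCPDISCOVER from 00:08:9b:df:21:23 via em3: network 192.168.3.0/24: no free leases
"""

# Client-originated messages only: "... from <mac> via <interface>".
CLIENT_MSG = re.compile(
    r"(?:DHCPDISCOVER|DHCPREQUEST).* from ([0-9a-f:]{17}) via (\w+)"
)


def interfaces_per_mac(log_text):
    """Map each client MAC to the set of interfaces its requests arrived on."""
    seen = defaultdict(set)
    for line in log_text.splitlines():
        match = CLIENT_MSG.search(line)
        if match:
            mac, iface = match.groups()
            seen[mac].add(iface)
    return seen


def macs_on_multiple_interfaces(log_text):
    """Return only the MACs whose requests arrived on more than one interface."""
    return {
        mac: sorted(ifaces)
        for mac, ifaces in interfaces_per_mac(log_text).items()
        if len(ifaces) > 1
    }


print(macs_on_multiple_interfaces(LOG))
# {'00:08:9b:df:21:23': ['em1', 'em3']}
```

Any MAC this reports points to a layer 2 problem (shared broadcast domain, cloned MAC, miscabling) rather than to the DHCP configuration.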

dxun

Well, I acted on a semi-hunch and it seems I was correct - sometimes less is more.

I was confused about virtual and physical adapters and was... fixated on having a mandatory physical adapter for each of the ESXi Standard Switches. I think I now understand that is not a requirement - quite the opposite.
I'll try to explain, do tell me if you think the explanation is nonsensical.

As you can see on the first screenshot, I've removed the physical adapter from the DMZ port group - vmnic1 was previously attached to it, and on the second screenshot you can see both vmnic0 and vmnic1 were on the same physical network (192.168.1.0/24 - in fact they were connected to the same physical switch!) but were associated with different logical networks (L1 and L3). I think that was the main reason why Ethernet frames from L1 (which is backed by a physical network) ended up on L3 (which is actually a fully logical network, existing only on the ESXi host).

So I disconnected the vmnic1 from the box - and that did the trick. I can see no "wrong network" logs in dhcpd.log and devices are indeed assigned expected IPs.
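
To illustrate the explanation above, here is a rough sketch that models the setup as a graph and asks which pfSense NICs a broadcast from an L1 client can reach. The topology is an assumption pieced together from this thread: only vmnic0, vmnic1, em1 and em3 are real names, the other node names are made up, and the vSwitches are treated as simple hubs.

```python
from collections import deque


def reachable(links, start):
    """Return every node a broadcast frame from `start` can reach over `links`."""
    neighbours = {}
    for a, b in links:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in neighbours.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


# Assumed layout: both uplinks cabled to the same physical switch,
# each backing a different vSwitch / pfSense interface.
BEFORE = [
    ("client", "physical-switch"),
    ("physical-switch", "vmnic0"),
    ("physical-switch", "vmnic1"),
    ("vmnic0", "vswitch-lan"),
    ("vmnic1", "vswitch-dmz"),
    ("vswitch-lan", "em1"),
    ("vswitch-dmz", "em3"),
]
# Same layout with vmnic1 removed, leaving the DMZ vSwitch purely virtual.
AFTER = [link for link in BEFORE if "vmnic1" not in link]


def pfsense_nics_hit(links):
    """pfSense interfaces that see a broadcast sent by the L1 client."""
    return sorted(reachable(links, "client") & {"em1", "em3"})


print(pfsense_nics_hit(BEFORE))  # ['em1', 'em3']
print(pfsense_nics_hit(AFTER))   # ['em1']
```

Under these assumptions, a DHCP broadcast from an L1 device reaches both em1 and em3 while vmnic1 is attached, which would explain the "wrong network" entries, and only em1 once it is removed.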

I wish I could thank you again for your help and patience - I had already backed up most of pfSense configuration and was ready to reinstall it from scratch, which would've been a total waste of time.

networking.PNG_thumb

ESXi_network_adapters_zpscc4b99c5.png_thumb

network_adapters.PNG_thumb