pfSense is unable to communicate with certain devices over VLAN; OpenWRT switch



  • Hello!

    Sorry if this is a cross-post with the OpenWRT forum (https://forum.openwrt.org/viewtopic.php?id=37609).
    I recently ran into a rare problem and have received no answer so far:

    Short description:
    Any client connected to a VLAN untagged port of an OpenWRT-Router cannot connect to a virtualized pfSense 2.1 and vice versa.

    Details:
    OpenWRT-Router (TL-WR1043ND) latest Snapshots since the problem first occurred around mid-April (AFAIR).
    Virtual Machine: Debian, KVM, bridged network - Tested and working with several other OSes (!) - each using e1000 emulation.
    Virtual Machine: Test setup with Ubuntu 12.04, KVM, bridged network - Tested and working with several other OSes (!) - each using e1000 emulation.
    Host machines equipped with Realtek Gigabit adapters RTL8111/8168 (rev 01 and rev 09).

    Conditions:
    All other hosts on the network can connect to each other as planned.
    Only hosts connected to the switches running OpenWRT cannot connect to pfSense and vice versa, while they can reach every other host on the net and vice versa. Even the host machine is reachable. ("Reach" means: ARP, ICMP, IP.) Even clients on untagged ports of either TL-WR1043ND can reach (or not reach) each other, as expected from a working VLAN switch in a configuration that is presumed to be correct.

    Wireshark on the host bridge shows ARP who-has requests and even the replies, both on the bridge and on the VLAN in question, as expected.
    The pfSense ARP table shows "incomplete" for the affected machines. Manual setting of ARP entries - of course - does not help.

    Circumventing the OpenWRT switch is effective - packets get answered.

    The effect is reproducible on another TL-WR1043ND.

    Side effects: The other managed switches on the network stopped responding from time to time while I was debugging the main problem. They were inaccessible by TCP or SNMP until they were power-cycled. This problem may have other causes (a very cheap, low-quality so-called broadband home router on the far end).

    My assumptions for the main problem are, that:

    1. either something in FreeBSD does not understand the IEEE 802.1q packets from OpenWRT routers and drops them. This conclusion may be partially incorrect, since the OpenWRT-Router itself and tagged ports are working well - only untagged ports on the router's switch cause this behaviour;

    2. or something in the RTL8366RB switch code on OpenWRT is malformed, or FreeBSD regards it to be malformed;

    3. or something on the Linux bridge or KVM affects this. It's working with identical settings on Linux-VMs. Probably something having to do with KVM and FreeBSD - yet to be tested with a pure FreeBSD setup.

    4. or FreeBSD or OpenWRT implements something more strictly;

    5. or some changes in both Pfsense and OpenWRT having to do with Jumbo Frames, MTU or similar are the culprits.

    Unfortunately I am still a FreeBSD novice and have reached my current limits in debugging this problem. I am looking for advice on how to continue in this matter. Any help would be appreciated.

    Thanks in advance!

    Regards
    Epek



  • @epek:

    1. either something in FreeBSD does not understand the IEEE 802.1q packets from OpenWRT routers and drops them. This conclusion may be partially incorrect, since the OpenWRT-Router itself and tagged ports are working well - only untagged ports on the router's switch cause this behaviour;

    Are you sending to pfSense traffic which is both VLAN tagged and untagged on the same pfSense interface? This is not recommended in FreeBSD.



  • Hello Wallabybob!

    Thanks for your answer!

    This (mixed mode) was the case originally, when I first encountered the problem. Since various forum posts advised against it, I have already changed this to a pure vlan configuration using only em0_vlan*, but not the plain em0 anymore. I even tried this before posting here.
    Unfortunately it has made no difference. The problem is still there.

    If it were that simple, the error would, in my experience, have occurred more randomly. But as stated before, the error occurs reproducibly with those very popular TL-WR1043ND WLAN routers when running OpenWRT. It does not affect the routers themselves, only the machines behind them. (These machines send untagged packets to the switches, which tag them and hand them over to pfSense.) pfSense uses the same tagged connection to talk to the switches. This connection works reliably.

    Since all other tagged traffic from other machines (and other switches - Cisco, Netgear) works perfectly, I am really puzzled.

    I suspect something less trivial, something more substantial :-)



  • Are you looking for a solution? If so, I would want more details: configuration, interconnections between devices, IP addresses and netmasks, specific details of one system that can communicate and one that can't (application used to test communication and what it reports when communication is attempted).

    Are you looking for debugging hints? If so, check the pfSense firewall logs for signs the attempted communication is "firewall blocked" and try packet capture on pfSense to verify the traffic is actually getting to pfSense. More specific hints will require more details such as those suggested above.



  • Hello Wallabybob!

    Thanks again for replying.

    Yes, I am looking for a solution, but I am also looking for other users who can confirm my observations.

    Regarding Configurations and other information:

    Host 1:
    Quite simple :-) One Opt-VLAN in the 10.0.0.0/8 network for an ADSL modem in bridging mode, WAN with PPPoE resulting in one hookup IP address and one routed subnet on another Opt-VLAN device. Various internal VLANs with no overlapping ranges. Some networks may NAT to the outside world, some may not. Some NAT between internal ranges also applies.

    Host 2 (test setup):
    The other setup (Ubuntu) is even simpler: two VLANs, no NAT.

    Commons:
    Both with KVM in bridged mode. It's always the same. Devices don't respond to pfSense if and only if they are connected to untagged ports of OpenWRT-powered devices (only tested with TL-WR1043ND so far). Other virtual hosts on the same bridge can communicate well even with those devices.

    Specific detail: Any device (computer, router, print server, VoIP ATA, …) connected to an untagged port of an OpenWRT-powered switch will show the described effects. Connecting it to any other switch will eliminate the described effects, as will using untagging on capable devices themselves.

    Again: Only the pfSense machine cannot be reached from these devices and vice versa. All other hosts work perfectly well (!) That's the strangeness in all this.

    No, the firewall does not come into play at this level - we already fail at the ARP level, which is - to my knowledge - not affected by the firewall on a clean installation of pfSense, as in the case of the test machine. The logs show nothing blocked so far that would be relevant in this matter. Interfering packet filters on the host system were eliminated prior to testing, and functionality has been confirmed using a virtualized Debian guest on the same host.

    I will try the packet capture - will the feature in the web interface suffice for this?
    I would prefer Wireshark over ssh, iptraf, sniffit or something like that. Is any of these available?

    The packets nevertheless traverse the bridge, but I have not yet confirmed if they reach the pfSense guest by low level debugging.

    Stay tuned.

    Thanks again!
    Epek

    Update:
    11:51:38.262666 xx:yy:zz:b9:01:f5 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Ethernet (len 6), IPv4 (len 4), Request who-has 192.168.100.16 tell 192.168.100.2, length 28
    11:51:39.281127 xx:yy:zz:b9:01:f5 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Ethernet (len 6), IPv4 (len 4), Request who-has 192.168.100.16 tell 192.168.100.2, length 28
    11:51:40.267116 xx:yy:zz:b9:01:f5 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Ethernet (len 6), IPv4 (len 4), Request who-has 192.168.100.16 tell 192.168.100.2, length 28
    11:51:41.271933 xx:yy:zz:b9:01:f5 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Ethernet (len 6), IPv4 (len 4), Request who-has 192.168.100.16 tell 192.168.100.2, length 28
    11:51:42.331215 xx:yy:zz:b9:01:f5 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Ethernet (len 6), IPv4 (len 4), Request who-has 192.168.100.16 tell 192.168.100.2, length 28

    The Mac address and bits 16-23 of the subnet address have been altered in this post.


    192.168.100.2 represents the pfSense host. Ping from another host on the network to 192.168.100.16 (a cheap router with print server on port 1 of the TL-WR1043).


    11:59:41.577101 xx:yy:zz:b9:01:f5 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Ethernet (len 6), IPv4 (len 4), Request who-has 192.168.100.16 tell 192.168.100.2, length 28
    11:59:42.614769 xx:yy:zz:b9:01:f5 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Ethernet (len 6), IPv4 (len 4), Request who-has 192.168.100.16 tell 192.168.100.2, length 28
    11:59:43.659512 xx:yy:zz:b9:01:f5 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Ethernet (len 6), IPv4 (len 4), Request who-has 192.168.100.16 tell 192.168.100.2, length 28
    11:59:44.710617 xx:yy:zz:b9:01:f5 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Ethernet (len 6), IPv4 (len 4), Request who-has 192.168.100.16 tell 192.168.100.2, length 28
    11:59:45.749881 xx:yy:zz:b9:01:f5 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Ethernet (len 6), IPv4 (len 4), Request who-has 192.168.100.16 tell 192.168.100.2, length 28
    11:59:46.777466 xx:yy:zz:b9:01:f5 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Ethernet (len 6), IPv4 (len 4), Request who-has 192.168.100.16 tell 192.168.100.2, length 28
    11:59:47.789489 xx:yy:zz:b9:01:f5 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Ethernet (len 6), IPv4 (len 4), Request who-has 192.168.100.16 tell 192.168.100.2, length 28
    11:59:48.812295 xx:yy:zz:b9:01:f5 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Ethernet (len 6), IPv4 (len 4), Request who-has 192.168.100.16 tell 192.168.100.2, length 28
    11:59:49.854673 xx:yy:zz:b9:01:f5 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Ethernet (len 6), IPv4 (len 4), Request who-has 192.168.100.16 tell 192.168.100.2, length 28

    When pinging from pfSense to the router print server.

    Update 2 - I forgot:
    ping error message: "ping: sendto: Host is down"


    12:03:34.117591 xx:yy:zz:b9:01:f5 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Ethernet (len 6), IPv4 (len 4), Request who-has 192.168.100.16 tell 192.168.100.2, length 28
    12:03:35.145311 xx:yy:zz:b9:01:f5 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Ethernet (len 6), IPv4 (len 4), Request who-has 192.168.100.16 tell 192.168.100.2, length 28
    12:03:36.195305 xx:yy:zz:b9:01:f5 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Ethernet (len 6), IPv4 (len 4), Request who-has 192.168.100.16 tell 192.168.100.2, length 28
    12:03:37.235007 xx:yy:zz:b9:01:f5 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Ethernet (len 6), IPv4 (len 4), Request who-has 192.168.100.16 tell 192.168.100.2, length 28
    12:03:38.270972 xx:yy:zz:b9:01:f5 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Ethernet (len 6), IPv4 (len 4), Request who-has 192.168.100.16 tell 192.168.100.2, length 28
    12:03:41.988268 xx:yy:zz:b9:01:f5 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Ethernet (len 6), IPv4 (len 4), Request who-has 192.168.100.20 tell 192.168.100.2, length 28
    12:03:41.988791 00:c0:02:a1:be:32 > xx:yy:zz:b9:01:f5, ethertype ARP (0x0806), length 60: Ethernet (len 6), IPv4 (len 4), Reply 192.168.100.20 is-at 00:c0:02:a1:be:32, length 46

    Now pinging a print server connected to another switch (untagged port too).

    12:07:37.494744 xx:yy:zz:b9:01:f5 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Ethernet (len 6), IPv4 (len 4), Request who-has 192.168.100.244 tell 192.168.100.2, length 28
    12:07:37.495580 f4:ec:38:aa:bb:22 > xx:yy:zz:b9:01:f5, ethertype ARP (0x0806), length 56: Ethernet (len 6), IPv4 (len 4), Reply 192.168.100.244 is-at f4:ec:38:aa:bb:22, length 42
    12:07:42.505741 f4:ec:38:aa:bb:22 > xx:yy:zz:b9:01:f5, ethertype ARP (0x0806), length 56: Ethernet (len 6), IPv4 (len 4), Request who-has 192.168.100.2 tell 192.168.100.244, length 42
    12:07:42.505816 xx:yy:zz:b9:01:f5 > f4:ec:38:aa:bb:22, ethertype ARP (0x0806), length 42: Ethernet (len 6), IPv4 (len 4), Reply 192.168.100.2 is-at xx:yy:zz:b9:01:f5, length 28

    Pinging the TL-WR1043ND from pfSense...
    Mac address of TL-WR1043ND changed for posting (aa:bb)

    Proxy-ARP on the OpenWRT-switch will of course repair the pings

    Update 2: Nothing in the firewall logs on the interface or IP-range in question.
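
The captures above can be checked mechanically. The following is a minimal sketch, not something from the original thread: it assumes the capture was saved as text in the tcpdump format shown above, and the function name `unanswered_arp` is made up for illustration. It counts who-has requests for which no matching reply ever appears, which is the symptom seen for 192.168.100.16.

```python
import re

# Patterns for the ARP request/reply lines as they appear in the captures above.
REQUEST = re.compile(r"Request who-has (\S+) tell (\S+),")
REPLY = re.compile(r"Reply (\S+) is-at (\S+),")


def unanswered_arp(lines):
    """Return {target_ip: request_count} for who-has requests that never got a reply."""
    requests = {}
    answered = set()
    for line in lines:
        match = REQUEST.search(line)
        if match:
            target = match.group(1)
            requests[target] = requests.get(target, 0) + 1
            continue
        match = REPLY.search(line)
        if match:
            answered.add(match.group(1))
    return {ip: count for ip, count in requests.items() if ip not in answered}


# Three lines taken from the 12:03 capture in the post.
capture = [
    "12:03:38.270972 xx:yy:zz:b9:01:f5 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: "
    "Ethernet (len 6), IPv4 (len 4), Request who-has 192.168.100.16 tell 192.168.100.2, length 28",
    "12:03:41.988268 xx:yy:zz:b9:01:f5 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: "
    "Ethernet (len 6), IPv4 (len 4), Request who-has 192.168.100.20 tell 192.168.100.2, length 28",
    "12:03:41.988791 00:c0:02:a1:be:32 > xx:yy:zz:b9:01:f5, ethertype ARP (0x0806), length 60: "
    "Ethernet (len 6), IPv4 (len 4), Reply 192.168.100.20 is-at 00:c0:02:a1:be:32, length 46",
]

print(unanswered_arp(capture))
```

Running the same function over a capture taken on the host bridge and one taken inside the pfSense guest would show whether the replies are lost before or after they reach the guest.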



  • The problem does not seem to be related to Linux, KVM and FreeBSD. A fresh and clean FreeBSD 8.3 install (virtualized) works well with both tagged and untagged packets, even in "mixed" (tagged/untagged) mode. Updating initial post.



  • That problem sounds familiar somehow…
    http://forum.pfsense.org/index.php?topic=34661.0
    ... but I don't see error packets (since "em0 itself is not in use" [as plain opt device].)



  • I compared values on pure FreeBSD and pfSense and tracked it down to vlan_pcp.

    ifconfig em0_vlan100 shows a vlanpcp of 0.
    Setting this manually to vlanpcp 1 by doing ifconfig em0_vlan100 vlanpcp 1 made the hosts pingable temporarily.

    I am still not sure if this is a pfSense or OpenWRT matter.

    So this is a workaround, not a solution.

    I also found two sysctl keys in pfSense
    net.link.vlan.mtag_pcp: 0
    net.link.vlan.soft_pad: 0

    FreeBSD defaults only know of net.link.vlan.soft_pad which defaults to 0.

    I guess that setting mtag_pcp through sysctl will make the change permanent?

    The question is: is this a bug, a feature or a misconfiguration?
    Please give me some background information on that change (when was it made) and why would "0" interfere with openwrt, when it means "best effort"? See: http://www.watson.org/~robert/freebsd/20120117-ieee8021q.diff

    Still dazzled and confused, thanks so far.

    Regards
    Epek

    Update: of course setting the pcp will "repair" the untagged connections, but in turn all other connections on that interface die…

    Update 2: It's an openwrt/rtl8366rb problem: https://forum.openwrt.org/viewtopic.php?id=38631
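
For readers unfamiliar with the PCP field discussed above: in an IEEE 802.1Q tag, the 16-bit TPID (0x8100) is followed by a 16-bit TCI made up of 3 bits of PCP, 1 bit of DEI and 12 bits of VLAN ID. The sketch below only illustrates how the tag bytes on the wire differ between vlanpcp 0 and vlanpcp 1. It assumes VLAN ID 100 (inferred from the interface name em0_vlan100), and the helper names `vlan_tag` and `parse_tag` are made up for illustration. It does not explain why the RTL8366RB treats the two cases differently.

```python
import struct

TPID_8021Q = 0x8100  # EtherType value that marks an IEEE 802.1Q tag


def vlan_tag(vid, pcp=0, dei=0):
    """Pack the 4-byte 802.1Q tag: 16-bit TPID, then 3-bit PCP, 1-bit DEI, 12-bit VID."""
    if not 0 <= vid <= 0xFFF:
        raise ValueError("VLAN ID must fit in 12 bits")
    if not 0 <= pcp <= 7:
        raise ValueError("PCP must fit in 3 bits")
    if dei not in (0, 1):
        raise ValueError("DEI must be 0 or 1")
    tci = (pcp << 13) | (dei << 12) | vid
    return struct.pack("!HH", TPID_8021Q, tci)


def parse_tag(tag):
    """Split a 4-byte 802.1Q tag back into its fields."""
    tpid, tci = struct.unpack("!HH", tag)
    return {"tpid": tpid, "pcp": tci >> 13, "dei": (tci >> 12) & 1, "vid": tci & 0xFFF}


# VLAN 100 is an assumption based on the interface name em0_vlan100.
print(vlan_tag(100, pcp=0).hex())  # 81000064
print(vlan_tag(100, pcp=1).hex())  # 81002064
print(parse_tag(vlan_tag(100, pcp=1)))
```

The only difference between the two tags is the top three bits of the TCI, so a switch that mishandles one of them is reacting to the priority bits and not to the VLAN ID.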



  • nevertheless - how can I deactivate the VLAN PCP patches in pfSense? I don't need them anyway.

