How can I be the only one with a broken RC2?

berniem

I've seen no one else reporting this, so I'm really confused. At first, I thought it might be flaky hardware, but I've tried it on 2.5 different platforms and I'm seeing the same thing. Am I missing something here?

Problem: About every 5-10 boots, the pfSense box mostly ignores clients. Specifically, clients cannot get new (or renewed) DHCP addresses from pfSense. Further, even those machines which still have their old DHCP addresses are ignored (or, at least, I see no responses back to the clients). The one exception I see in Wireshark is that when a client asks via ARP who has 192.168.1.1, the pfSense box responds with the correct information (and the MAC addresses on both ends are correct). But then, when the client sends out its properly-formed DHCP request, no response comes from the pfSense box. A reboot (soft reboot or power cycle; doesn't matter which) will clear it up. I've had it fail with only one "good boot" in between failures, but more usually there are 5-10 boots between "bad boots".

When it's a "good boot", it will work fine and continue to work without problems. If it was a "bad boot", it never gets itself working and a reboot is required.

Hardware: the 2.5 motherboards on which I can repro this:
*) A Jetway J7F5M1G2E with a 3-gigabit-port daughterboard
*) A Jetway J7F2WE1G5D with the same 3-gigabit-port daughterboard (hence the .5)
*) A Jetway J7F4xxxxxx-something with an Intel Pro dual NIC.

I'm running the November 19 "official" RC2 build. It's a pretty vanilla installation - no extra packages installed, and no patches. The one thing that is common which is out of the "default" is that the installations are installed as a full install and then the /etc/platform file is edited for "embedded". However, I can say that in at least one of the test passes, I changed that back to "pfSense" and still had the problem (after a few boots). Dunno if that has any significance, but it's the only thing I could think of that was non-standard.

Anyone care to venture a guess as to where I might look for a clue? Or is there more information I could supply to help someone else take a guess at what's up?

Thanks.

wallabybob

Do you see the DHCP requests on the pfSense box? On the console or in an ssh session, type

tcpdump -i <interface name>

e.g. # tcpdump -i re0

Do you see the incoming DHCP requests? If not, you most likely have a physical layer problem.

If you see DHCP requests, check the DHCP server log to see if they have been logged: in the web GUI, Status -> System Logs -> DHCP tab. If not, check whether they are blocked by the firewall (click on the Firewall tab next to the DHCP tab) and that the DHCP server is running. In an ssh session, type

ps ax | grep dhcp

That should give some more information to work with.
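
The "is dhcpd running" check above can be made a little less error-prone, since a plain grep for dhcp also matches the grep process itself. A rough sketch only: the helper name dhcpd_running is made up for illustration, and re1 as the LAN interface is an assumption.

```shell
# Hypothetical helper: succeeds if a dhcpd process appears in `ps ax`
# output read from stdin. The [d] bracket keeps the grep command line
# itself from matching the pattern.
dhcpd_running() {
    grep -q '[d]hcpd'
}

# Usage on the pfSense console or over ssh (re1 as LAN is an assumption):
#   ps ax | dhcpd_running && echo "dhcpd is running"
#   tcpdump -n -i re1 port 67 or port 68    # show only DHCP traffic
```

The narrowed tcpdump filter in the last comment limits the capture to UDP ports 67 and 68, which is where DHCP requests and replies travel.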

berniem

Ah! Excellent suggestions. It turns out that tcpdump has likely answered the original question, but then also prompts the next question:

I put the pfSense box in the state where it seems to be ignoring packets (other than ARP) from the client. As soon as I run tcpdump -i re1 (re1 is the LAN), the DHCP renewal works immediately. To me, this implies that maybe the card had not been in promiscuous mode. So, I get the box to the bad state again and, BEFORE running tcpdump, I run ifconfig and I do see the line:
pflog0: flags=100<PROMISC> metric 0 mtu 33204
which, I assume, means that the system thinks the driver/card is already in promiscuous mode.

Is there a different way I can see whether or not the driver/card is -really- in promiscuous mode?

In either case, anyone have a guess as to what could cause these situations? (IE: case #1: the driver/card is not in promiscuous mode even though it should be and the ifconfig thinks it is; or case #2: the card IS in promiscuous mode but not responding to packets (other than ARP) until tcpdump is started.)

Thanks.

eri--

What happens after you kill tcpdump?

wallabybob

You can access the FreeBSD man pages at http://www.freebsd.org/cgi/man.cgi

tcpdump -p
doesn't put the interface into promiscuous mode.

tcpdump -e
displays the link layer headers including source and destination MAC address.

Perhaps you could use these options together to see if promiscuous mode really is a factor in this. If so, check the link layer addresses. I'm thinking that when DHCP doesn't work, the packets coming to pfSense don't have the correct link layer (MAC) address and so are discarded until you switch the interface into promiscuous mode by running tcpdump. Perhaps you have another system on the network that is supplying the wrong MAC address in response to an ARP query. Checking the MAC addresses in the packets should give you a clue as to whether or not this is happening.
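
To make the MAC check concrete, here is a minimal sketch of the rule being applied: outside promiscuous mode, an interface should only pick up frames addressed to its own MAC or to the broadcast address. The helper name dst_mac_accepted is hypothetical, multicast is ignored for simplicity, and the address below is only an example; substitute the "ether" value from your own interface.

```shell
# Hypothetical helper: succeed if a frame's destination MAC (first argument)
# is one the interface (second argument) would accept in normal mode:
# its own address or the Ethernet broadcast address.
dst_mac_accepted() {
    dst=$(printf '%s' "$1" | tr 'A-F' 'a-f')   # normalise case before comparing
    own=$(printf '%s' "$2" | tr 'A-F' 'a-f')
    [ "$dst" = "$own" ] || [ "$dst" = "ff:ff:ff:ff:ff:ff" ]
}

# Example: a broadcast frame, such as a DHCP discover, passes the check.
dst_mac_accepted "FF:FF:FF:FF:FF:FF" "00:30:18:b0:50:fb" && echo "broadcast frame would be accepted"
```

Feed it the destination address printed by tcpdump -e and the interface's own address; a frame that fails the check would normally be dropped by the hardware before the driver sees it.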

Is your pfSense connected to a hub or a switch? (If a switch, promiscuous mode shouldn't make a difference unless pfSense is plugged into a "monitor port" which is getting all the traffic for 'monitoring' purposes.) What is the configuration of your local network?

berniem

*) When I run tcpdump as previously described, all works well. If I, for example, run a continuous ping, the ping will run as long as tcpdump is running. If I stop tcpdump, the pings time out while tcpdump is not running. If I restart tcpdump, the pings once again start working just fine.

*) Running tcpdump with the -p flag does nothing to help packets being ignored.

*) As to MAC addresses: in the ARP message where the client asks "Who has 192.168.1.1?" and when I run tcpdump with -e, all MAC addresses look good on both incoming and outgoing on both the client and the pfSense box. When packets are being ignored, running Wireshark on the client shows correct MAC addresses everywhere.

After getting these results, I tried something else: I did the "Reset to Factory Defaults". And went back to the config of which network (LAN, WAN, etc) was plugged into which port (RE0, RE1, DC0, etc). After this, everything worked correctly. Next, I changed /etc/platform to "embedded". Still fine. I then reassigned LAN and WAN to different net ports that were unused in my original config. (Specifically, I moved LAN&WAN from FXP0&FXP1 to RE1&RE0, respectively.) Now the box is back to the problem state where packets are ignored unless tcpdump is running.

(I'm not being a complete dolt, am I? Doing that SHOULD work, right?)

Could it be something weird like the firewall rules not being correctly updated on "embedded" when the LAN / WAN assignments are changed? I've added no firewall rules or NAT stuff, only the defaults exist. The GUI shows the default firewall rules. (With the pfSense in the "ignoring packets" state, I -did- add a new LAN firewall rule to allow all LAN subnet packets (ie: a duplicate of the default LAN firewall rule) to see if that would clear it up (thinking that maybe a new rule would pull the new info), but it did not seem to help, so maybe I'm way off base…)

I think I'm running out of things to try...

jsenay

berniem,

I am having the same issue on an Asus motherboard. Everything was working fine, then the dhcpd server kept stopping.

Look under the Status tab and look at Services. It even states that dhcpd has stopped. If you try to restart it, it claims it did, but nothing shows up in any log.

Check in /etc/fstab where the devices and mount points are. The reason may be the partition where the dhcpd leases are written.

Do a df command and see if the partition has any space left on it.

Hope this helps you!!

JJS

berniem

In my case, DHCPD is just fine - if/when it actually gets packets all is well.

Thanks for the attempt, though.

wallabybob

@berniem:

*) When I run tcpdump as previously described, all works well. If I, for example, run a continuous ping, the ping will run as long as tcpdump is running. If I stop tcpdump, the pings time out while tcpdump is not running. If I restart tcpdump, the pings once again start working just fine.

*) Running tcpdump with the -p flag does nothing to help packets being ignored.

Since the tcpdump option -p doesn't use promiscuous mode, it seems that promiscuous mode is necessary for your interface to see the packets. If my memory is correct, one of the differences between promiscuous mode and "normal" mode is that in promiscuous mode the hardware gives the driver ALL packets it sees on the wire, while in "normal" mode the hardware gives the driver only packets whose destination MAC address is the same as the hardware MAC address (with some exceptions which shouldn't be relevant here). The MAC address of an interface is reported by (e.g.) # ifconfig re1; see the "ether" line in the output from my system:


# ifconfig rl0
rl0: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=8<VLAN_MTU>
        ether 00:30:18:b0:50:fb
        inet6 fe80::230:18ff:feb0:50fb%rl0 prefixlen 64 scopeid 0x2
        inet 192.168.211.173 netmask 0xffffff80 broadcast 192.168.211.255
        media: Ethernet autoselect (100baseTX <full-duplex>)
        status: active
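
If you want to check the promiscuous flag from a script rather than by eye, something like the following might do. It is only a sketch: the helper name has_promisc_flag is made up, it assumes ifconfig prints its flags as flags=NNNN<...>, and it only tells you what the kernel believes, not what the hardware is actually doing.

```shell
# Hypothetical helper: succeed if ifconfig-style text on stdin lists the
# PROMISC flag inside the flags=NNNN<...> field (matched case-insensitively).
has_promisc_flag() {
    grep -Eiq 'flags=[0-9a-f]+<[^>]*promisc[^>]*>'
}

# Usage (the interface name is an assumption; substitute your own):
#   ifconfig re1 | has_promisc_flag && echo "re1: kernel reports PROMISC"
```

Pass it the output for one interface at a time, otherwise a PROMISC flag on some other interface in the full ifconfig listing will also match.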

*) As to MAC addresses: in the ARP message where the client asks "Who has 192.168.1.1?" and when I run tcpdump with -e, all MAC addresses look good on both incoming and outgoing on both the client and the pfSense box. When packets are being ignored, running Wireshark on the client shows correct MAC addresses everywhere.

"Look good" is not a technical term with a precise meaning. The destination MAC address should be either the MAC address of the interface that should be receiving the packet or the broadcast MAC address.

After getting these results, I tried something else: I did the "Reset to Factory Defaults". And went back to the config of which network (LAN, WAN, etc) was plugged into which port (RE0, RE1, DC0, etc). After this, everything worked correctly. Next, I changed /etc/platform to "embedded". Still fine. I then reassigned LAN and WAN to different net ports that were unused in my original config. (Specifically, I moved LAN&WAN from FXP0&FXP1 to RE1&RE0, respectively.) Now the box is back to the problem state where packets are ignored unless tcpdump is running.

(I'm not being a complete dolt, am I? Doing that SHOULD work, right?)

It depends a bit on the details as to whether it WILL work. You used fxp0 as your LAN interface and then changed the LAN interface to re1. Suppose you also moved the IP address of fxp0 to re1. Now anyone who previously knew the MAC address of pfSense's LAN IP address will have a wrong view of the world; in sending to the IP address of the LAN interface they will send to the MAC address of fxp0 INSTEAD of the MAC address of re1. In this case re1 will probably ignore the incoming frames with fxp0's MAC address unless it's in promiscuous mode. A client that previously conversed with "fxp0" will have stale knowledge of the MAC address it should use to get to the LAN interface's IP address. ARP entries will normally time out so that a situation like this doesn't persist, but the timeout can be suppressed.
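
One way to look for a stale entry like this from the client side, as a rough sketch: it assumes the client has a BSD- or Linux-style arp -an whose lines look like "? (192.168.1.1) at <mac> ...", and the helper name arp_mac_for is made up for illustration.

```shell
# Hypothetical helper: print the MAC address that an ARP table (read from
# stdin) holds for the given IP. Assumes "? (IP) at MAC ..." lines as
# printed by `arp -an` on BSD and Linux systems.
arp_mac_for() {
    awk -v ip="($1)" '$2 == ip && $3 == "at" { print $4; exit }'
}

# Usage on the client (192.168.1.1 is the pfSense LAN address in this thread):
#   arp -an | arp_mac_for 192.168.1.1
# Compare the result with the "ether" line of ifconfig re1 on pfSense.
# If they differ, deleting the entry as root forces a fresh ARP lookup:
#   arp -d 192.168.1.1
```

If the client still holds the old interface's MAC address for the LAN IP, that would explain frames being ignored until the interface goes promiscuous.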

Could it be something weird like the firewall rules not being correctly updated on "embedded" when the LAN / WAN assignments are changed? I've added no firewall rules or NAT stuff, only the defaults exist. The GUI shows the default firewall rules. (With the pfSense in the "ignoring packets" state, I -did- add a new LAN firewall rule to allow all LAN subnet packets (ie: a duplicate of the default LAN firewall rule) to see if that would clear it up (thinking that maybe a new rule would pull the new info), but it did not seem to help, so maybe I'm way off base…)

I believe the firewall rules are applied AFTER tcpdump sees the incoming packets. At least that was the case in another problem I looked at: DHCP requests were visible to tcpdump but were blocked by the firewall (so didn't appear in the dhcpd log).

I think I'm running out of things to try…

I think it's a pretty weird problem and I don't have much of an idea other than MAC address mismatches (which is why I have been so insistent on the details) or broken hardware. I can't recall any other reports in these forums of the Realtek GigE interfaces needing tcpdump (or to be set into promiscuous mode) to work, so I'm not inclined to be hasty in concluding it's a hardware problem. But you could be the unlucky one to have uncovered some members of a bad batch.

The Jetway boards all have Realtek GigE interfaces on the motherboard, don't they? Have you seen the same problem using the motherboard interface rather than the daughterboard interface? Do you see the same problem going from (say) the daughterboard interfaces to the fxp interfaces?

Just to add some more information to the report, have you watched the interface counters (# netstat -i) to see if the incoming packet count goes up or the error count goes up while the pfSense box is ignoring the pings?
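
To compare the counters before and after a run of pings without reading the whole table, a sketch along these lines might help. It assumes the first line of netstat -i output is a header containing Ipkts and Ierrs columns and that the interface's link-level row shows <Link#N> in the Network column, as on FreeBSD; the helper name iface_in_counters is made up.

```shell
# Hypothetical helper: print "Ipkts Ierrs" for one interface from
# `netstat -i` output on stdin. Column positions are taken from the header
# line rather than hard-coded, since the layout varies between releases.
iface_in_counters() {
    awk -v ifn="$1" '
        NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }
        $1 == ifn && $3 ~ /^<Link/ { print $col["Ipkts"], $col["Ierrs"]; exit }
    '
}

# Usage (re1 as the LAN interface is an assumption):
#   netstat -i | iface_in_counters re1    # run once before the pings
#   netstat -i | iface_in_counters re1    # and again while they time out
```

If the first number keeps climbing while the pings time out, the frames are reaching the driver and are being dropped further up; if neither number moves, the hardware is not handing them over at all.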

A network diagram might also be helpful, showing what machines are involved in these conversations and showing what other machines are on your LAN.

naughtyusmaximus

I already bumped another thread before I noticed this one, but I thought I would add a 'me too' to this. Mine is happening on a VIA Epia board.

wallabybob

@naughtyusmaximus:

I already bumped another thread before I noticed this one, but I thought I would add a 'me too' to this. Mine is happening on a VIA Epia board.

I'm confused. I saw you replied to another thread about dhcpd dying and said you have that problem. Now you reply to this thread about dhcp not responding and say you have this problem. Please clarify what problem you are seeing.

It may be useful to know a bit more about your hardware and the NICs you are using and the particular version of software on which you are seeing your problem.

naughtyusmaximus

I don't know for a fact that DHCP is dying, as I didn't think to check that the one time I was in the office while the symptoms described occurred; otherwise I have just had someone on location reset the router. The symptoms described here best match what I believe I have witnessed. Setting my servers to use static IPs has been an OK workaround for the time being, but it would be nice if I could use the "static dhcp" addresses allocated in pfSense.

berniem

"'look good' is not a technical term with a precise meaning. The destination MAC addresses should be the MAC address of the interface that should be receiving the packet or the broadcast MAC address."

Yes, sorry 'bout that - I was using "look good" as shorthand for "for all packets sent or received to/from both the client and pfSense box, the MAC addresses for the sender and recipient were correct on both the sending and receiving computers." In other words, it's not a problem with MAC addresses.

"A network diagram might also be helpful, showing what machines are involved in these conversations and showing what other machines are on your LAN."

In all of these cases, I wanted to eliminate as many variables as possible so the client and pfSense are connected via a single net cable. There are no other clients / hubs /switches / etc in between pfSense and the client.

As to the possibility of bad hardware and/or drivers: these interfaces work just fine with m0n0wall 1.3b15 which is using FreeBSD 6.3-RELEASE-p5 (a version rather close to what pfSense uses…), so I'm inclined to think that the drivers and hardware are not the issue.

(And, of course, by this time, I'm using RC4 for these tests and am seeing no difference.)

berniem

I'm out of ideas and am hearing nothing else to try. :-(

The build I have that's pre-RC1 from 2008-09-13 works without any of these problems so I guess I'll just fall back to that indefinitely. :-(

sullrich

Sounds like driver changes. Try changing your nics out for Intel (em nics). We do NOT recommend realtek and the other assortment of nics that you mentioned.

berniem

Ok, so far it's looking like the Jetway MB with the built-in Via nic and a dual-port Intel Pro card is working fine with no problems.

Going on the assumption that it was indeed a driver change sometime between September and RC2, what are the available options for a remedy? Is it possible to revert the driver in the build to some version before this regression?

Failing that, would it be reasonable for an end-user (eg: me) to pull the necessary driver files from the September build and drop them into a current build? If so, where would I look for a clue as to what files I'd need to copy over?

Or is there some other alternative (other than users throwing away all nics and motherboards with chipsets that begin with 'R')?

Thanks!

cmb

It's very, very unlikely there was a driver regression between September and December; almost nothing changed in FreeBSD 7.0 in that period. It's also highly unlikely any changes in the pfSense code base would have caused anything like that, since nothing at all related to Ethernet interface code has changed in a long time, longer ago than September.

Can you reliably replicate this being resolved going back to an earlier release, then breaking going back to 1.2.1 release? If so, what date is the one that works? Given the lack of changes in any related areas, I doubt if that's truly the case. If so, I have a bunch of 1.2.1 snapshots that I can make available to help narrow down further.