Malfunction?



  • Two months ago I set up this production environment, using a small pfSense appliance (AMD Geode, 1GB RAM / 512MB CF) as the router. It worked well for the first month, but I didn't get as much bandwidth as I wanted (I'm on a 100Mbit fibre connection). I then ordered another pfSense appliance with an Intel Atom 1.6GHz CPU, 2GB RAM and 2GB CF.

    I had problems from day one. The DL/UL is usually around 15/87 Mbit/s, and after a day in use the router starts throwing all kinds of error messages in the GUI. Nothing in the GUI comes up, and I can't write to config.xml, so I can't make any changes. It acts as though the CF is full, which shouldn't be possible since the config is identical to the one on the AMD box, which only had 512MB. I sent the box back under warranty; they said it was a software/config mismatch, exchanged the CF and sent it back. I admit I had restored the config from the AMD box to the Intel box, but I won't make that mistake again.
    After getting it back, I rebuilt the entire config by hand instead of restoring it like last time. After installing it I saw exactly the same symptoms: the GUI throws error messages and acts as if the CF is full, and speeds are around 15/87 again.
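
    A quick way to test the "CF is full" theory is to look at filesystem usage from the shell. The sketch below is only an illustration: `full_filesystems` is a made-up helper name, and it assumes POSIX-style `df -kP` output with the capacity in column 5 and the mount point in column 6. On the NanoBSD images `/var` and `/tmp` are, as far as I know, small memory disks, so they are worth checking as well as the CF itself.

    ```shell
    # Written for /bin/sh; start `sh` first if the login shell is not sh-compatible.
    # Hypothetical helper: print mount points at or above a usage threshold.
    # Reads `df -kP` style output on stdin; $1 is the threshold in percent (default 90).
    full_filesystems() {
        awk -v limit="${1:-90}" '
            NR > 1 {
                use = $5
                sub(/%/, "", use)              # "97%" -> "97"
                if (use + 0 >= limit) print $6 # column 6 is the mount point
            }'
    }

    # Usage on the box (assumes console or SSH access):
    #   df -kP | full_filesystems 90
    ```

    If nothing is listed while the GUI still complains, the space theory is probably wrong and the write failures have another cause.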

    I'm running these services: apinger, dhcpd, dnsmasq, igmpproxy, ntpd. No packages installed.

    I sent the config and this description to the supplier but have had no answer yet. As far as they're concerned, they've tested the ports with iperf (so did I) and the box performs well on all 3 interfaces. I agree with that, but an iperf run obviously doesn't stress it the way a production environment does, so I don't think it says much.

    Any suggestions?



  • Okay so this is turning out to be rather weird…

    I ordered a new Intel Atom router and installed it today. The old Intel router usually took about a day to die on me, but this new one took just 20 minutes. That's no less than two identical routers acting the same, in the same network, with identical config.
    It doesn't take a scientist to realise the problem isn't the router, but now to the really weird part: why the hell does the AMD Geode router work where the faster Intel Atom doesn't?! I just plugged the AMD back in and sure enough, everything is back to normal.

    What is the difference (that could matter in this case) between the AMD Geode and the Intel Atom, apart from the gigabit interfaces and performance?



  • Your odd speeds may be the result of a duplex mismatch on the WAN interface. Try forcing 100Mb full duplex (100 FD) on both your pfSense box and whatever hardware your ISP supplied.
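
    A duplex mismatch typically shows up as one fast and one slow direction, because the half-duplex side backs off on collisions that the full-duplex side never sees. A minimal sketch for checking what the WAN port negotiated, assuming FreeBSD's `ifconfig` prints a `media:` line containing `<full-duplex>` or `<half-duplex>`; `link_duplex` is a made-up helper name and `re2` is an assumed interface name:

    ```shell
    # Written for /bin/sh.
    # Hypothetical helper: report the duplex shown in `ifconfig <if>` output on stdin.
    link_duplex() {
        awk '/media:/ {
            if ($0 ~ /full-duplex/)      print "full"
            else if ($0 ~ /half-duplex/) print "half"
            else                         print "unknown"
        }'
    }

    # Usage (re2 is an assumed WAN interface name, adjust to your box):
    #   ifconfig re2 | link_duplex
    # To force the link for a test, FreeBSD's ifconfig accepts:
    #   ifconfig re2 media 100baseTX mediaopt full-duplex
    # The far end has to be forced to the same setting, otherwise the test
    # creates exactly the mismatch it is meant to rule out.
    ```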


  • Netgate Administrator

    I think we already looked for a duplex mismatch in another thread, but it would be worth trying to force both sides to 100Mb-FD as a test.

    @Phatsta:

    That's no less than two identical routers acting the same, in the same network, with identical config.

    In fact that is encouraging. Two different routers with different configs showing identical behaviour would be far more worrying.  ;)

    What other hardware is this box connected to?

    Steve



  • The box is connected on the LAN to the backbone switch and on WAN to a fibre converter then out to the world. Nothing else.

    The reeeeaaaaally fun part is that I manually configured another brand new opn pfSense box (out of the box this Monday), set it up in the troubled environment, and after 20 minutes it failed and stopped routing altogether.

    I then brought the same router to another environment at another site, for another customer, and booted it up - and it still messed up! It's like it's been infected, if you get my point…
    Even more fun is that the default config.xml is gone. Without a trace. I logged into the shell, ran cd /conf.default and then ls, and the folder was empty. So I couldn't even restore the box. Argh!

    Improvising a bit, I set up a virtual pfSense, configured some basic stuff and saved the config from the GUI to a USB stick. I put that into the opn box, logged into the shell and ran these commands:

    mkdir /tmp/usb
    mount_msdosfs /dev/da0s1 /tmp/usb
    /etc/rc.conf_mount_rw
    cp /tmp/usb/config.xml /conf.default/
    /etc/rc.conf_mount_ro
    umount /dev/da0s1
    exit

    Then I chose reset to factory default and presto! It's alive! At least for the moment. I'm not really certain that it's not damaged in some other way, so I'll have to test it. But at least I got somewhere, maybe.


  • Netgate Administrator

    @Phatsta:

    after 20 minutes it failed and stopped routing altogether.

    Is it still responding at the console when that happens? It seems like it might be a NIC issue, like running out of MBUFs or a 'watchdog timeout'. What NICs is the box running?
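
    Both of those suspicions can be checked from the console once the box stops routing. The sketch below assumes that FreeBSD's `netstat -m` prints a "requests for mbufs denied" line with slash-separated counters, and that the NIC driver logs "watchdog timeout" to the kernel message buffer; the helper names are made up for illustration.

    ```shell
    # Written for /bin/sh.
    # Hypothetical helper: sum the counters on the "requests for mbufs denied"
    # line of `netstat -m` output read from stdin. Anything above 0 is a warning sign.
    mbufs_denied() {
        awk '/requests for mbufs denied/ {
            n = split($1, f, "/")          # e.g. "0/12/0" -> 0, 12, 0
            for (i = 1; i <= n; i++) total += f[i]
        }
        END { print total + 0 }'
    }

    # Hypothetical helper: count watchdog messages in kernel log text on stdin.
    watchdog_count() {
        grep -c 'watchdog timeout' || true # grep exits 1 on zero matches
    }

    # Usage on the box:
    #   netstat -m | mbufs_denied
    #   dmesg | watchdog_count
    ```

    A non-zero result from either one points at the NIC or its driver rather than at the config.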

    Steve



  • @stephenw10:

    @Phatsta:

    after 20 minutes it failed and stopped routing altogether.

    Is it still responding at the console when that happens? It seems like it might be a NIC issue, like running out of MBUFs or a 'watchdog timeout'. What NICs is the box running?

    Steve

    See, that's a funny story too… Until the stores opened today, all I had was what turned out to be a faulty serial cable, so I didn't get any response on the console at all. For a while I thought the box was simply dead, but a short while after a reboot I could actually log into the GUI. If I made any changes it died on me, but it did write the changes at least, so at the next boot I could remove one more setting, reboot, remove one more and so on. I thought maybe I'd eventually remove the faulty setting or whatever caused this, but nope.

    As soon as the stores opened I bought a new serial cable and got a connection, no problem. Even after the drops, yes. My colleague is now stress testing the (hopefully) mended router. Time will tell if it behaves.

    The only thing common to the three routers I've tried (where 2 have failed and 1 is still in production) is that they were obviously connected to the same equipment, and they all had identical configs. And when I say identical I mean manually reproduced, not restored. So they've never ever been in actual "contact" with each other.

    What my brain is having a really hard time understanding is how the lesser AMD Geode box can survive in that troubled environment whereas the greater Intel Atoms don't. And what could possibly corrupt the read-only file system of pfSense so that the default config.xml disappears?



  • Also, attached is a screenshot of what the GUI looks like just before that router drops dead.

    ![pfsense out of order.png](/public/imported_attachments/1/pfsense out of order.png)



  • Oh, I see I didn't answer your question:

    The NICs in use are re0 for LAN (where all the VLANs reside) and, I think, re2 for WAN. re1 isn't configured.


  • Netgate Administrator

    Hmm, well that screenshot looks pretty seriously unpleasant! It can't read any of the sysctls to generate the vbar graphs.  :-\

    I'm going to suggest it's a NIC problem, mostly because you're running Realtek NICs. Maybe an interrupt storm, some watchdog issue or perhaps a hardware offloading problem. I assume your lack of a console cable and the fact you're running Nano mean you haven't been able to access the logs until now? Check them once it's stopped routing. You could also run 'top -aSH' at the console and wait for it to lock up, then check for anything causing a large load.
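
    For the interrupt storm theory, `vmstat -i` lists per-source interrupt totals and rates. A minimal sketch, assuming the usual FreeBSD layout (a header line, the rate in the last column, a final "Total" row); `busy_irqs` is a made-up helper name and the default threshold is an arbitrary figure, not a known limit:

    ```shell
    # Written for /bin/sh.
    # Hypothetical helper: list interrupt sources whose rate (last column of
    # `vmstat -i` output on stdin) is at or above $1 (default 10000, arbitrary).
    busy_irqs() {
        awk -v limit="${1:-10000}" '
            NR > 1 && $1 != "Total" && $NF + 0 >= limit {
                name = $1
                # the source name can span several words; the last two
                # fields are always total and rate
                for (i = 2; i <= NF - 2; i++) name = name " " $i
                print name, $NF
            }'
    }

    # Usage on the box:
    #   vmstat -i | busy_irqs 10000
    #   dmesg | grep -i 'interrupt storm'
    # To rule out hardware offloading as a test, FreeBSD's ifconfig accepts
    # (re0 is the LAN NIC named in this thread):
    #   ifconfig re0 -rxcsum -txcsum -tso -lro
    ```

    If a `re` interrupt source shows up here while 'top -aSH' shows an interrupt thread eating the CPU, that would support the NIC theory.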

    Steve



  • Well… it's going to be a bit more difficult now. I'll have to give you the background if you're going to understand.

    I've tried 3 routers; let's call them R1, R2 and R3. R1 is the AMD Geode, R2 is the first Intel Atom, and R3 is the second (out of the box this Monday).
    R1 is currently the only one working in customer A's network, although it was actually supposed to be installed at customer B. I couldn't remove the only working router from A, so R1 is still at A. Instead I brought R3 to customer B, where it acted the same and wouldn't factory restore. That's when I installed the virtual one, pulled the config and inserted it into the file system manually. Since that factory restore, R3 has been working all day. No interruptions, no error messages, no nothing, just simple routing at high speeds (well over 80 on a shared fibre line, which is great). That project is now closed and untouchable.

    Hence I only have R2 in my possession at the moment, and since it won't work in customer A's network (as first planned), I'll have to figure that out first. I couldn't face the customer if I went up there stirring up trouble, so I'll have to work out what's wrong and fix it before swapping routers. My money is on the backbone switch though. It's the only unit in the network that reports loops, and on top of that I just found out that 8 of its 48 ports are dead. Completely dead. Opened a warranty case with HP and it looks like they'll replace it.

    So the logical step is to await the replacement switch, swap it out, look at the logs, test port speeds and so on. If all looks good, we can replace R1 with R2 and we should be good to go.

    I'm afraid I'll never see the error again though, nor do I want to ;) But obviously I'll post any updates here. Might be good to have documented what was actually going on.


  • Netgate Administrator

    @Phatsta:

    It's the only unit in the network that reports loops, and on top of that I just found out that 8 of its 48 ports are dead. Completely dead.

    Mmm, that doesn't look good. Anything that starts misbehaving like that cannot be trusted.
    Let us know what happens.  :)

    Steve



  • @stephenw10:

    Anything that starts misbehaving like that cannot be trusted.

    +1  :)



  • Sorry for late post, I was on a much needed vacation. I have the solution though.

    It does not happen very often, but this time it was actually the net supplier (for lack of a better word), i.e. the ones who supply the physical network that ISPs deliver their services over, that was at fault. We're talking about the government fibre project called "stadsnät" in Sweden, though it's run by private contractors. Apparently the company that's been maintaining the entire network for the last 10-15 years has lost the deal and another company is gradually taking over. Due to equipment exchanges we got this intermittent (well, intermittent at first) speed error.

    That's what they say, at least. Sounds more like "someone f-ed up, sorry". If only they could have seen it earlier with their fancy surveillance software and hardware and whatnot… I mean DAMN, the number of hours I put in in vain, thinking "the ISP is never wrong". However, I DID actually report the error, but no one could see anything wrong. Not until the error was persistent, and they saw it the second time around.

    So there you go. Even the super-pros can make mistakes. Special thanks to you Steve!


  • Netgate Administrator

    Thanks for reporting back. Yep, it's interesting what you take for granted until it stops working. Sometimes you have to assume nothing!  ;)

    Steve