2.1-RC1 issues with modern hardware?



  • Ok, this one's a bit puzzling to me, and I'm not entirely sure how often other folks see it happen, as I get the sense I tend to run a bit newer hardware than the typical pfsense installation (for reference my current 2.0.3-release install runs on an i3-2120T…overkill yes, but low power-usage).

    Anyway, I was setting up to experiment with the 2.1-RC1 amd64 builds (I originally tried  20130830-1339  but have also tried  20130826-0113 , 20130901-0142 , and 20130902-0701 ).  I'd mostly been sitting out the long beta/RC process, but I'd been wanting to test again with several of my wireless cards that were marginal under 2.0.x, to see if 2.1 would make them play nicer (I'm not talking 11n, here, either, I'd be happy with stable 11g).  My test system was set up as follows:

    MSI Z77A-GD65, i7-2600S, 32GB of DDR3-1600, LG bd-r drive, and an Intel gigabit desktop CT card.  The board also has an Intel Gigabit chip onboard, though testing with or without this makes no difference  (yes, I know this system's overkill for pfsense...it's more often used as my testing system for my primary server).

    Booting into the LiveCD runs normally, up through and past interface assignments (to really streamline I've even just assigned em0 as wan with nothing else; this makes no difference), at which point the bootup resumes until it gets to the second instance of "Configuring firewall…" (the one right after "setting up dhcpv6"), at which point the LiveCD hangs and does not proceed.  I've let it sit for extended periods of time (30+ minutes) and it never moves.  For reference, I've used FreeBSD (9.x) installs on this machine, as well as esxi 5.x, various ubuntu flavors, and yes, even Windows...it's been quite a stable system.

    Thinking perhaps that my test system might be a bit quirky, I loaded the LiveCD into my desktop system (Intel DH77EB, i7-3770S, 16GB ram...again, with an onboard Intel NIC), only to have it hang at the same spot.

    Just to make sure I wasn't totally and/or somehow screwing things up  (I've been using pfsense since the early 2.0 beta days), I pulled my oldest system, an Atom D510, off the shelf, installed the gigabit CT card (for consistency) and proceeded to use the same amd64 snapshot disc as before…only to have it load all the way just fine (connection functions, console works, etc).

    As a further sanity check, just to make sure that I wasn't somehow running on "too new" of hardware, I pulled out my old 2.0.3-release amd64 livecd, and loaded that onto the test system (the MSI).  Again, it loaded without any problems whatsoever (clear through to the console menu, etc...I'd almost forgotten how loud the pc speaker can be in that system!).

    So, does anyone have any ideas...or suggestions...as to what's going on here?  Yes, I am technically sitting on a LAN, behind another pfsense box (the 2.0.3 system), but the Atom loaded just fine (both on 2.1 and 2.0.3) as did the 2.0.3 image on the test system.  Again, I don't think it's a simple case of,  "hardware too new" here...because the 8.1-based 2.0.3 works, but the 8.3-based 2.1 does not...at least, assuming there hasn't been some sort of odd regression in things.  I'm at a loss here and open to suggestions.



  • For interest, can you install 2.0.3 on the i7 hardware, then do an upgrade to 2.1?
    If the upgrade fails somewhere (maybe when it tries to boot after upgrading!) then there is the potential that people are currently running 2.0.3 on some combination/s of i7 hardware and when 2.1 is released they will upgrade and break. That needs to be resolved.

    The next thing to know is, does FreeBSD 8.1 install fine on your i7 hardware (should do, since pfSense installs). Then, does FreeBSD 8.3 install on the same hardware - if not then there is some FreeBSD 8.3 regression, if 8.3 installs then presumably there is something in the pfSense 2.1 customisations for FreeBSD 8.3 that is an issue.

    If you have time to try a few combinations then it might help track this down, and potentially save grief for others.



  • @Lightningfire:

    Booting into the LiveCD runs normally, up through and past interface assignments (to really streamline I've even just assigned em0 as wan with nothing else; this makes no difference), at which point the bootup resumes until it gets to the second instance of "Configuring firewall…" (the one right after "setting up dhcpv6"), at which point the LiveCD hangs and does not proceed.  I've let it sit for extended periods of time (30+ minutes) and it never moves.

    At the hang, if you type Ctrl-T you should get a status display of at least one line. For example on my system when a find command was running:
    load: 0.41  cmd: find 93809 [runnable] 0.85r 0.03u 0.26s 2% 1052k

    If you type a few Ctrl-Ts a few seconds apart does the PID or command or "u" (user) or "s" (system) time change? If not, does a tap on the enter key cause a change?

    Does your motherboard have USB3 devices? Some readers have reported install problems if such devices weren't disabled in the BIOS.



  • Ok, an update for folks on this one based on what my subsequent testing has shown so far.

    On the MSI Z77 system the results are as follows:
    1)  2.0.3 has an odd issue, but ONLY with the livecd.  Namely, it will boot/reboot successfully exactly three times and then on the fourth boot hang right after "Bootup Complete."  ctrl-T in this case indicates  load:  1.18  cmd:  php 41071 [ufs]  37.49r  0.01u  0.01s  0%  17484K  (note the r-value changes with time but u and s stay constant).  At this point, a full cold-off (shut off power supply for ~30s) will make the system boot the livecd successfully again.

    2)  With (1) above, this only applies to the liveCD.  Once the system is installed to the system's hard disk, it can be rebooted any number of times, at will, without any bootup problems, with all interfaces working as expected.

    3)  The events in (1) seem to be unaffected by BIOS configuration of AHCI vs. IDE, HPET enabled or disabled, USB enabled or disabled, full cores w/ HT vs. no HT and/or limited cores, and changing RAM amounts/modules (I've got a few spare sets to play with here, and have tried 32 -> 16 -> 8GB).  The hangs do seem to happen more frequently on this system if the onboard ethernet (Intel 82579V) is enabled, though they still happen if only the gigabit desktop CT (Intel 82574L) is installed.

    4)  Getting the 2.1-RC liveCD to boot, on all the snapshots I've tried (most recent was 0901), is erratic at best, most typically hanging on "Configuring firewall…"  Ctrl-T at this point indicates ntpdate as the relevant running process.  However...

    5)  Doing an upgrade of the above system from the hard disk installed 2.0.3 to 2.1-RC works normally.    The update installs, reboots, does all of its migration processes, then proceeds to function normally.  All interfaces (Intel 82574L, Intel 82579V, even the Atheros 9280 in 11g-mode that I added after the fact as a test) work properly and without errors.  All subsequent reboots are fine as well.

    On the Intel H77 board, the limited results so far are as follows:

    1)  Onboard ethernet is the same Intel 82579V chip.  It seems to experience the same hangs as with the MSI Z77 system, especially the happens-every-time hang on the 2.1-RC1 disc at "Configuring firewall…"  with ntpdate.

    2)  I haven't tested this system with the Gigabit CT (82574L) card yet, as it was being tested in the MSI system at the time.

    I also haven't tried the full installs of 8.1/8.3 on either system yet, but given that Pfsense works once installed, and only has problems from the LiveCD, I expect that they'll work as intended.

    I've also got another Pro1000/PT card on order here, as this is the card currently functioning in my i3-2120T-based (H67 chipset) 2.0.3 install and I don't recall it giving any problems during install and booting has always been flawless, so I'll likely do some more tests to see if I can figure out if this is a likely 7-series issue, an Intel nic issue, or what.    I do wonder, given that the installs run fine, if it's not something odd that the liveCD is doing differently, though.  Thoughts?

    Edit:  I've been poking around at the ntpdate stuff a bit more, and I've found numerous earlier references to it having hangup problems in 2.0.  In 2.0.3 it's called in rc.bootup as follows:

    /* Do an initial time sync */
    echo "Starting NTP time client...";
    /* At bootup this will just write the config, ntpd will launch from ntpdate_sync_once.sh */
    system_ntp_configure(false);
    shell_exec("echo /usr/local/sbin/ntpdate_sync_once.sh | tcsh");

    But in the 2.1-RC discs it's called as

    /* Do an initial time sync */
    echo "Starting NTP time client...";
    /* At bootup this will just write the config, ntpd will launch from ntpdate_sync_once.sh */
    system_ntp_configure(false);
    mwexec_bg("/usr/local/sbin/ntpdate_sync_once.sh", true);
    echo "done.\n";

    Is it possible this change is causing the hang, somehow? My script-fu isn't the best, here...
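For anyone comparing the two call styles above: PHP's shell_exec() waits for the command to finish and returns its output, while mwexec_bg(), going by its name, is meant to launch the command in the background and return right away. The sketch below is only a Python analogy of that difference, not pfSense code; the one-second child process is a stand-in assumption for ntpdate_sync_once.sh, and nothing here shows which behaviour (if either) is involved in the hang.

```python
import subprocess
import sys
import time

# Hypothetical stand-in for a slow helper script: a child that sleeps 1 second.
CHILD = [sys.executable, "-c", "import time; time.sleep(1.0)"]


def run_blocking() -> float:
    """Run the child and wait for it, like a blocking shell call.

    Returns the seconds the caller spent waiting.
    """
    start = time.monotonic()
    subprocess.run(CHILD, check=True)
    return time.monotonic() - start


def run_background():
    """Start the child and return immediately, like a background launch.

    Returns (seconds the caller spent, process handle).
    """
    start = time.monotonic()
    proc = subprocess.Popen(
        CHILD, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    return time.monotonic() - start, proc


if __name__ == "__main__":
    print(f"blocking call held the caller for {run_blocking():.2f}s")
    elapsed, proc = run_background()
    print(f"background launch held the caller for {elapsed:.2f}s")
    proc.wait()  # reap the child so it does not linger
```

With a blocking call, a helper that never returns stalls the boot script at that line; with a background launch, the boot script moves on and any stall would show up later, in whatever step next depends on the helper.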



  • @Lightningfire:

    1)  2.0.3 has an odd issue, but ONLY with the livecd.  Namely, it will boot/reboot successfully exactly three times and then on the fourth boot hang right after "Bootup Complete."  ctrl-T in this case indicates  load:  1.18  cmd:  php 41071 [ufs]  37.49r  0.01u  0.01s  0%  17484K  (note the r-value changes with time but u and s stay constant).

    The r-value is wall clock ("real") time since process start, u is process CPU time in user mode, and s is process CPU time in system mode. I guess (haven't looked it up) the "[ufs]" indicates the process is waiting ("u" and "s" times don't increase but "r" does) for notification that some file system operation has completed.
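If you want to compare several Ctrl-T samples without eyeballing them, something like the following minimal sketch could help. It assumes the status line has the layout seen in this thread (load, cmd, PID, bracketed state, then r/u/s times); the helper names `parse_status` and `looks_blocked` are made up for illustration, and the second sample in the usage section is hypothetical.

```python
import re
from typing import NamedTuple


class SigInfo(NamedTuple):
    load: float
    cmd: str
    pid: int
    state: str     # text inside the brackets, e.g. "ufs" or "runnable"
    real: float    # wall-clock seconds since process start
    user: float    # CPU seconds in user mode
    system: float  # CPU seconds in system mode


# Layout assumed from the samples quoted in this thread.
_PATTERN = re.compile(
    r"load:\s*(?P<load>[\d.]+)\s+cmd:\s*(?P<cmd>\S+)\s+(?P<pid>\d+)\s+"
    r"\[(?P<state>[^\]]+)\]\s+"
    r"(?P<real>[\d.]+)r\s+(?P<user>[\d.]+)u\s+(?P<system>[\d.]+)s"
)


def parse_status(line: str) -> SigInfo:
    """Parse one Ctrl-T status line into its fields."""
    m = _PATTERN.search(line)
    if m is None:
        raise ValueError(f"unrecognised status line: {line!r}")
    return SigInfo(
        float(m["load"]), m["cmd"], int(m["pid"]), m["state"],
        float(m["real"]), float(m["user"]), float(m["system"]),
    )


def looks_blocked(first: SigInfo, second: SigInfo) -> bool:
    """True if the same process aged in wall time but used no more CPU."""
    same_process = first.pid == second.pid
    wall_advanced = second.real > first.real
    cpu_idle = (second.user == first.user
                and second.system == first.system)
    return same_process and wall_advanced and cpu_idle


if __name__ == "__main__":
    a = parse_status(
        "load:  1.18  cmd:  php 41071 [ufs]  37.49r  0.01u  0.01s  0%  17484K"
    )
    b = a._replace(real=52.10)  # hypothetical later sample, same CPU times
    print(a.cmd, a.state, looks_blocked(a, b))
```

A process that keeps the same PID, gains "r" time and gains no "u" or "s" time between samples is waiting on something rather than spinning, which matches the reading of the php/[ufs] line above.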


  • Rebel Alliance Developer Netgate

    Usually those things are caused by IRQ conflicts between the ATA system and the NICs.

    If you can boot from a USB stick rather than CD, and disable the CD in the BIOS, it may behave differently.



  • @jimp:

    Usually those things are caused by IRQ conflicts between the ATA system and the NICs.

    If you can boot from a USB stick rather than CD, and disable the CD in the BIOS, it may behave differently.

    So, I tried this.  It boots.  However, I also tried booting from a usb stick with the CD left in place, and it boots just fine then, too.  It also boots just fine off the hard drive, once it's actually installed there.  It seems the only thing that has any problem with this system in any fashion is the LiveCD itself.

    But just for completeness, in terms of what pfsense is reporting in its detection, irqwise, it shows the following:

    Kbd is on IRQ1
    RTC is on IRQ8
    ATA is on IRQ19
    EHCI1 is on IRQ23
    ATH0 is on IRQ18
    EHCI0 is on IRQ16
    EM1 is on IRQ18  (same as ATH0, yes, but I can remove ATH0 entirely and it makes zero difference)
    EM0 is on IRQ17.
    VGA is on IRQ16.
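As a quick cross-check of the list above, this small sketch groups the devices by IRQ and reports which lines are shared. The numbers are copied from the list; the function name `shared_irqs` is just an illustrative choice.

```python
from collections import defaultdict

# IRQ assignments as reported in the list above.
IRQS = {
    "kbd": 1, "rtc": 8, "ata": 19, "ehci1": 23, "ath0": 18,
    "ehci0": 16, "em1": 18, "em0": 17, "vga": 16,
}


def shared_irqs(assignments):
    """Return {irq: [devices]} for every IRQ used by more than one device."""
    by_irq = defaultdict(list)
    for device, irq in assignments.items():
        by_irq[irq].append(device)
    return {irq: sorted(devs) for irq, devs in by_irq.items() if len(devs) > 1}


SHARED = shared_irqs(IRQS)
# Devices (other than ATA itself) that sit on the ATA controller's IRQ.
ATA_SHARES_WITH = [
    dev for dev, irq in IRQS.items() if irq == IRQS["ata"] and dev != "ata"
]

if __name__ == "__main__":
    print("shared:", SHARED)
    print("on the ATA IRQ:", ATA_SHARES_WITH)
```

On these numbers the only shared lines are IRQ18 (ath0 and em1) and IRQ16 (ehci0 and vga); nothing shares IRQ19 with the ATA controller.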

    Again, I can readily boot either 2.0.3 or 2.1-RC, repeatedly and without issue, from either hard disk or usb stick. It's only the LiveCD that causes issues.  Admittedly, this isn't the worst thing to happen, if the standard suggestion on newer systems turns out to be installing from usb key instead of CD.  I had planned to poke at the 8.3-release-dvd1 disc to see whether it has issues, as I'm curious whether this is a quirk of some bit of customization, or some quirk of how BSD handles CD-related stuff in general.  Update: I've since tested with the 8.3-release-dvd1 disc, and it boots and installs just fine (whether pulling files from disc or from network).

    Puzzling stuff…

    Edit:  So I've tested on some additional systems, with the following results where the liveCD is concerned.

    Additional test system #1:  Asus P8Z68 Deluxe, i7-2600k system, onboard Intel NIC.  Hangs at the same spot (configuring firewall...) if booting from liveCD.

    Additional test system #2:  1st-gen i3, Asus H55-based board.  Realtek onboard nic.  Boots the liveCD w/o problems.



  • So, I seem to have narrowed this down to an interaction, of all things, between cd9660 and/or the liveCD customizations and the number of active cores/threads on 6-series and 7-series systems.  At least, that's the only thing I can figure it is, after the testing I've done.  To summarize:

    • Bootup issue consistently causes hangs on 2.1-RC on the second "configuring firewall…" but -only- on 6-series and 7-series systems that I've tested.  Older systems are fine (I've tested on 5-series and on Atoms).  After a suggestion by another forum member, I have since acquired and tested a Realtek 8111-based pcie card as a further sanity check, and have verified this effect also occurs using that card instead of the Intel variants (82579V, Gigabit CT, Pro1000/PT) I had been using.

    • A previous poster suggested this was an ATA/NIC irq conflict but that does not seem to be the case, as leaving all ATA devices in place (hard drive, CD) and booting from memstick allows successful boot.  Conversely, disabling ATA entirely and booting from a usb-based cd drive produces the same hang at the exact same spot.  Also, booting from hard drive once an install is done via memstick allows successful, reliable boots at normal functionality.

    • There is a similar issue that manifests with 2.0.3, but not as consistently, again involving the livecd, so this doesn't seem to be a super-new issue.

    • FreeBSD 8.3-RELEASE-dvd1 boots without any issues, and will allow a successful install to hard disk, whether from dvd media install or ftp/network install.

    Using system bios to enable/disable cores/threads gives the following results:

    • Four cores, 8 threads = hang

    • Four cores, 4 threads = hang

    • Two cores, 4 threads = boot ok multiple times

    • Two cores, 2 threads = boot ok multiple times

    • Three cores, 3 threads = hang

    I'm not sure where the above leaves us, other than that I can only think it's some quirk of the liveCD customization and/or the liveCD itself that causes these issues, as it's only the LiveCD that produces the problem, regardless of what interface it's using to connect, but only (as best I can tell) if more than two cores are present.  As an upside, using the memstick boot to get around this does work consistently, so perhaps that'll need to be the standard install method on newer systems, though it'd be useful to convey that to folks if that's the case.
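To make the core/thread results easier to reason about, here is a small sketch that tabulates the five BIOS configurations listed above and checks whether the outcome tracks the core count or the hyperthreading setting. The data is copied from the list; the function names are illustrative only, and five data points are of course a thin basis for any conclusion.

```python
# (cores, threads, outcome) for each BIOS configuration tested above.
RESULTS = [
    (4, 8, "hang"),
    (4, 4, "hang"),
    (2, 4, "ok"),
    (2, 2, "ok"),
    (3, 3, "hang"),
]


def outcomes_by_cores(results):
    """Map each core count to the set of outcomes seen at that count."""
    seen = {}
    for cores, _threads, outcome in results:
        seen.setdefault(cores, set()).add(outcome)
    return seen


def hyperthreading_matters(results):
    """True if some core count gave different outcomes across thread counts."""
    return any(len(o) > 1 for o in outcomes_by_cores(results).values())


def core_threshold(results):
    """Return (largest core count that booted, smallest that hung)."""
    ok = [c for c, _t, o in results if o == "ok"]
    hung = [c for c, _t, o in results if o == "hang"]
    return max(ok), min(hung)


if __name__ == "__main__":
    print("HT changes the outcome:", hyperthreading_matters(RESULTS))
    print("boots up to / hangs from (cores):", core_threshold(RESULTS))
```

On this data, toggling HT never changes the result at a given core count, and the dividing line falls between two and three cores.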

    This is about the limit of what I can test here, at this point, unless one of the devs gets involved…



  • I too have this problem. I have a MSI P55-GD80 w/ Intel i7 860. 16GB / onboard NIC using a Sata CD using pfSense-LiveCD-2.1-RC2-amd64-20130906-2049.iso

    Following the same information in this thread, I can repeat the same failure and success.

    All cores on, HT on: Hangs at the second Configuring Firewall, right after Starting NTPD.

    Turn off HT and limit to 2 cores, and I have a system that boots off cdrom each and every time.



  • I had similar hanging problems and the reason was a BIOS option called "large disk access mode" (or something like this, as I remember), which had two options, "DOS" and "Other":

    • when it's set to "DOS", 2.0.3 works, but after an upgrade to 2.1 the server hangs during boot

    • when it's set to "Other", 2.0.3 hangs a few steps after setup begins, but 2.1 works OK



  • @nothing:

    I had similar hanging problems and the reason was a BIOS option called "large disk access mode" (or something like this, as I remember), which had two options, "DOS" and "Other":

    • when it's set to "DOS", 2.0.3 works, but after an upgrade to 2.1 the server hangs during boot

    • when it's set to "Other", 2.0.3 hangs a few steps after setup begins, but 2.1 works OK

    Except in this case, the ATA subsystem can be disabled entirely, and this issue will still occur at the "Configuring firewall.." step.  Interesting to note about the large disk option, but I'm not sure a lot of the newer bios variants even offer this anymore.  Also, booting from a memstick to do the install will bypass this issue entirely, and the issue also does not occur once pfsense (of either 2.0.x or 2.1) is actually installed to disk.  Again, interesting to note, but I'm pretty sure that's a separate issue entirely to the one in question (as the issue in question seems to be a quirk of livecd and the system core count).



  • It's probably worth noting that Lightningfire and I spent quite a bit of time working out this issue off forum.

    We started with why one of my 7-series systems worked just fine and his didn't, then tried to figure out why my old i7 wouldn't work when a similar machine of his would. We were quite stumped until we joked that it had something to do with too many cores, which, faced with no other similarities, we decided to test.

    My newer 7 series was an i3, his an i7. His old rig a dual core, while mine was an i7.
    We spent a few hours looking at different permutations of what did and did not work.

    It's also worth noting that Lightning opened a bug report on redmine, which is probably quite clear.
    http://redmine.pfsense.com/issues/3187



  • Give 2.1-RC2 a try.

    Been using since day 5, working fine so far.



  • We were.
    Do you have your pfsense box running an i7? Was your wan interface connected during the install? Did you install from a cdrom, or another means?

    If all these things are true, which chipset are you using?


  • Rebel Alliance Moderator

    Hi all,

    @lightningfire: Your post with the multicores and -threads rang a bell with me: We had some issues with our new server-grade hardware (IBM servers), too, hanging at some point during the boot procedure. With luck I had captured an error message in the dmesg earlier, indicating that the problem was with the mbuffers of the Intel NICs. Those get really screwed up when having new CPUs with multiple cores AND multithreading. So the system sees like 8 or 16 cores and assigns buffers per core.

    See: https://redmine.pfsense.org/issues/1221

    What fixed the issue (or worked around it) for us was to define the 3 configuration parameters mentioned there in /boot/loader.conf. Afterwards the system booted just fine. CD boot was also a problem, but when intercepting the boot and defining those variables manually before booting, CD boot worked, too. Perhaps it's worth a try?

    Greets
    Jens



  • @JeGr:

    Hi all,

    @lightningfire: Your post with the multicores and -threads rang a bell with me: We had some issues with our new server-grade hardware (IBM servers), too, hanging at some point during the boot procedure. With luck I had captured an error message in the dmesg earlier, indicating that the problem was with the mbuffers of the Intel NICs. Those get really screwed up when having new CPUs with multiple cores AND multithreading. So the system sees like 8 or 16 cores and assigns buffers per core.

    See: https://redmine.pfsense.org/issues/1221

    What fixed the issue (or worked around it) for us was to define the 3 configuration parameters mentioned there in /boot/loader.conf. Afterwards the system booted just fine. CD boot was also a problem, but when intercepting the boot and defining those variables manually before booting, CD boot worked, too. Perhaps it's worth a try?

    Greets
    Jens

    I don't think it's the same problem, but it's interesting to note.  For one, Hyperthreading doesn't seem to affect this problem…only core count does.  So 2 cores 4 threads will boot fine, but 3 cores 3 threads, or 4 cores 4 threads will not.  Also, I've replicated the issue using a Realtek 8111 card as well; while I'm not exactly fond of Realtek, to see the issue occur in exactly the same way at exactly the same spot, and only on the livecd and only resolvable by changing core count, would seem to say that this is a different problem.

    I'm also pretty sure I've not seen any of the "cannot receive structures" issue.  The odd thing, if anything, is that the interface (well, both interfaces if I've got multiple nics enabled) seems to be working at that point (I didn't take full notes for -that- part of the procedure, but from memory I'm pretty sure I had my laptop connected to the LAN side and it was getting packets ok)….it's just, on the livecd the bootup never finishes, so the webconfigurator is never accessible, etc.  It's interesting to note, though, for sure.


  • Rebel Alliance Moderator

    Ah that's good to hear - somehow anyways :-
    But it's interesting and a bit disturbing to see another issue arise which is more or less depending on multi-cores and/or threads. Hopefully that won't be the start of a trend.

    Greets

