Page fault while in kernel mode



  • I have two HP ProLiant DL360 G7 machines running the 2.0.1 amd64 release.
    They are configured with WAN, SYNC, and LAN interfaces, plus 30 VLAN interfaces on top.
    CARP is configured on the WAN, LAN, and all VLAN interfaces.

    These machines are supposed to provide a high-availability service under heavy load, so I am running failover and load tests to make sure they can cope with it.
    Unfortunately, I have been getting kernel panics from the very beginning of my testing.
    The machines panicked under various conditions:

    • the master server's interfaces were disconnected from the network to simulate a connectivity issue; the master panicked first and the slave followed shortly after
    • the master server was restarted and the slave panicked
    • CARP was disabled on the master; the master panicked, shortly followed by the slave
    • jmeter was set up to simulate heavy network traffic, and after tens of minutes both machines panicked

    As the issue seemed to be in the interrupt handling of the bce driver, I decided to try different network cards, so I installed a quad-port Intel PRO/1000 card and disabled the onboard Broadcom ones. After this change the servers are more stable, but I can still trigger a kernel panic on the master server by disabling CARP on it while network interface utilisation is high.

    
    Fatal trap 12: page fault while in kernel mode
    cpuid = 1; apic id = 02
    fault virtual address   = 0x290
    fault code              = supervisor read data, page not present
    instruction pointer     = 0x20:0xffffffff806fb793
    stack pointer           = 0x28:0xffffff80000faa90
    frame pointer           = 0x28:0xffffff80000faab0
    code segment            = base 0x0, limit 0xfffff, type 0x1b
                            = DPL 0, pres 1, long 1, def32 0, gran 1
    <5>vip64: link state changed to DOWN
    
    processor eflags        = interrupt enabled, resume, IOPL = 0
    current process         = 0 (em0 taskq)
    db:0:kdb.enter.default>  run lockinfo
    db:1:lockinfo> show locks
    No such command
    db:1:locks>  show alllocks
    No such command
    db:1:alllocks>  show lockedvnods
    Locked vnodes
    db:0:kdb.enter.default>  show pcpu
    cpuid        = 1
    dynamic pcpu    = 0xffffff807ef0f300
    curthread    = 0xffffff00025e2000: pid 0 "em0 taskq"
    curpcb       = 0xffffff80000fad40
    fpcurthread  = none
    idlethread   = 0xffffff00023c6000: pid 11 "idle: cpu1"
    curpmap         = 0
    tssp            = 0xffffffff811dab68
    commontssp      = 0xffffffff811dab68
    rsp0            = 0xffffff80000fad40
    gs32p           = 0xffffffff811d99a0
    ldt             = 0xffffffff811d99e0
    tss             = 0xffffffff811d99d0
    db:0:kdb.enter.default>  bt
    Tracing pid 0 tid 64037 td 0xffffff00025e2000
    _rw_rlock() at _rw_rlock+0x83
    carp_forus() at carp_forus+0x58
    ether_input() at ether_input+0x144
    em_rxeof() at em_rxeof+0x1d0
    em_handle_que() at em_handle_que+0x4d
    taskqueue_run() at taskqueue_run+0x93
    taskqueue_thread_loop() at taskqueue_thread_loop+0x46
    fork_exit() at fork_exit+0x118
    fork_trampoline() at fork_trampoline+0xe
    --- trap 0, rip = 0, rsp = 0xffffff80000fad30, rbp = 0 ---
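    The backtrace shows the em(4) receive path calling carp_forus(), which faults while taking a read lock, with a fault virtual address of 0x290. A fault address that small usually means a field at that offset was read through a NULL structure pointer, which fits a race where the CARP state is torn down (CARP disabled, interface down) while packets are still in flight. A minimal userspace illustration of why the fault address comes out small but non-zero (the struct layout here is hypothetical, not the kernel's):

    ```c
    #include <stddef.h>
    #include <stdio.h>

    /* Hypothetical layout: a per-interface CARP structure whose lock sits
     * some way into the struct. If the receive path dereferences the lock
     * through a NULL pointer (the state was just torn down), the fault
     * virtual address is offsetof(..., lock), not 0 itself -- which is why
     * the panic reports a small non-zero address like 0x290. */
    struct carp_state {
        char pad[0x290];   /* fields preceding the lock (illustrative) */
        int  lock;         /* stand-in for the rwlock carp_forus() takes */
    };

    int main(void) {
        printf("NULL-pointer fault address would be 0x%zx\n",
               offsetof(struct carp_state, lock));
        return 0;
    }
    ```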
    
    

    I also tried changing the mbuf configuration according to the Tuning_and_Troubleshooting_Network_Cards guide, but it didn't change a thing.
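    For anyone else landing here, the mbuf change from that guide boils down to a loader tunable; a minimal sketch, assuming a pfSense 2.x box (the value is illustrative and should be sized to your interface count and traffic):

    ```
    # /boot/loader.conf.local -- illustrative value, not a recommendation
    kern.ipc.nmbclusters="131072"   # raise the system-wide mbuf cluster pool
    ```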



  • Definitely something specific to that combination of hardware, regardless of NIC. A couple of potential things I'd suggest trying, one or both:

    1. If you don't have > 4 GB RAM, try i386. We've seen a small number of scenarios where driver problems only exist on amd64.
    2. Try 2.1, since it has a newer FreeBSD base that at times works much better with newer servers.

    I'd personally go with #2 first, based on my experience with some newer servers.



  • I should have mentioned that I tried 2.0.1 i386 as well, with the same results.
    So I went for the second option and upgraded both firewalls to 2.1-BETA0-amd64-20121106-0059.
    This time, when I disabled CARP, the master FW panicked and the backup one hung (even the console stopped working, so I had to force a reboot), and I lost the network completely. The kernel panic on the master was a little different, but that might be due to changes in FreeBSD:

    
    Fatal trap 12: page fault while in kernel mode
    cpuid = 1; apic id = 02
    fault virtual address   = 0x308
    fault code              = supervisor read data, page not present
    instruction pointer     = 0x20:0xffffffff80768bd2
    stack pointer           = 0x28:0xffffff80000fba00
    <5>opt164_vip64: link state changed to DOWN
    
    frame pointer           = 0x28:0xffffff80000fba40
    code segment            = base 0x0, limit 0xfffff, type 0x1b
                            = DPL 0, pres 1, long 1, def32 0, gran 1
    processor eflags        = interrupt enabled, resume, 
    <5>opt165_vip65: link state changed to DOWN
    IOPL = 0
    current process         = 0 (em0 que)
    
    0xffffff00048b0588: tag ufs, type VDIR
        usecount 1, writecount 0, refcount 4 mountedhere 0
        flags ()
        v_object 0xffffff00048070d8 ref 0 pages 1
        lock type ufs: EXCL by thread 0xffffff00047c0460 (pid 61142)
            ino 10386432, on dev da0s1a
    
    0xffffff001bc85000: tag ufs, type VREG
        usecount 2, writecount 1, refcount 2 mountedhere 0
        flags ()
        v_object 0xffffff0012f65870 ref 0 pages 0
        lock type ufs: EXCL by thread 0xffffff00047c0460 (pid 61142)
            ino 10386521, on dev da0s1a
    
    db:0:kdb.enter.default>  run lockinfo
    db:1:lockinfo> show locks
    No such command
    db:1:locks>  show alllocks
    No such command
    db:1:alllocks>  show lockedvnods
    Locked vnodes
    db:0:kdb.enter.default>  show pcpu
    cpuid        = 1
    dynamic pcpu = 0xffffff807ece5180
    curthread    = 0xffffff00025798c0: pid 0 "em0 que"
    curpcb       = 0xffffff80000fbd10
    fpcurthread  = none
    idlethread   = 0xffffff00023d68c0: tid 64006 "idle: cpu1"
    curpmap      = 0xffffffff8137c6d0
    tssp         = 0xffffffff814011e8
    commontssp   = 0xffffffff814011e8
    rsp0         = 0xffffff80000fbd10
    gs32p        = 0xffffffff81400020
    ldt          = 0xffffffff81400060
    tss          = 0xffffffff81400050
    db:0:kdb.enter.default>  bt
    Tracing pid 0 tid 64038 td 0xffffff00025798c0
    _rw_rlock() at _rw_rlock+0x92
    carp_forus() at carp_forus+0x5c
    ether_input() at ether_input+0x15f
    em_rxeof() at em_rxeof+0x1c2
    em_handle_que() at em_handle_que+0x5b
    taskqueue_run_locked() at taskqueue_run_locked+0x85
    taskqueue_thread_loop() at taskqueue_thread_loop+0x4e
    fork_exit() at fork_exit+0x11f
    fork_trampoline() at fork_trampoline+0xe
    
    

    As it all points to the network card/driver or interface utilisation, I moved pfSync to one of the Broadcom interfaces and left just WAN and LAN on the Intel card, to spread the load a little. The machines have now been running for around 12 hours and I'm generating around 900 Mbit/s of traffic through them.
    During that time I disabled/enabled CARP 10 times, but I wasn't able to reproduce the issue.

    Do you think it is safe to use the 2.1 BETA, or is it better to stay on 2.0.1?



  • @Lenny:

    Do you think it is safe to use the 2.1 BETA, or is it better to stay on 2.0.1?

    Every production system we run internally (3 colo datacenters, the office, all of our boxes at home) is on 2.1. The biggest risk is the one inherent in any nightly snapshot build of anything: upgrading. If everything works on the particular snapshot you're on, it's not going to break; but unless you follow development very closely, there's always some risk in upgrading to snapshot builds. When you're running a pair you can mitigate that: upgrade the secondary, disable CARP on the primary, and after verifying the secondary is good, upgrade the primary. Or, if possible, just don't upgrade at all until an RC or release comes out, since those are QAed and automatic snapshot builds aren't.
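    Assuming shell access to the boxes, that sequence can be sketched with the stock FreeBSD CARP sysctl (pfSense's "Temporarily Disable CARP" option does much the same; treat this as an outline, not exact commands):

    ```
    # On the primary, once the upgraded secondary has been verified:
    sysctl net.inet.carp.allow=0   # stop CARP processing; secondary takes over
    # ... upgrade and reboot the primary, confirm it comes back clean ...
    sysctl net.inet.carp.allow=1   # resume CARP; primary can take back MASTER
    ```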

