Netgate Discussion Forum

    Page fault while in kernel mode

    • Lenny

      I have two HP ProLiant 360 G7 machines running the 2.0.1 amd64 release.
      Each is configured with WAN, SYNC and LAN interfaces, with 30 VLAN interfaces configured on the LAN interface.
      CARP is configured on the WAN, LAN and all VLAN interfaces.

      These machines are supposed to provide a high-availability service under heavy load, so I am running failover and load tests to make sure they can cope with it.
      Unfortunately, I have been experiencing kernel panics from the very beginning of my testing.
      The machines panicked under various conditions:

      • the master's interfaces were disconnected from the network to simulate a connectivity issue: the master panicked first, and the slave followed shortly afterwards
      • the master was restarted and the slave panicked
      • CARP was disabled on the master: the master panicked, shortly followed by the slave
      • JMeter was set up to simulate heavy network traffic, and after tens of minutes both machines panicked

      As the issue seemed to be in the interrupt handling of the bce driver, I decided to try different network cards, so I installed a quad-port Intel PRO/1000 network card and disabled the onboard Broadcom cards. After this change the servers are more stable, but I can still trigger a kernel panic on the master by disabling CARP on it while network interface utilisation is high.

      
      Fatal trap 12: page fault while in kernel mode
      cpuid = 1; apic id = 02
      fault virtual address   = 0x290
      fault code              = supervisor read data, page not present
      instruction pointer     = 0x20:0xffffffff806fb793
      stack pointer           = 0x28:0xffffff80000faa90
      frame pointer           = 0x28:0xffffff80000faab0
      code segment            = base 0x0, limit 0xfffff, type 0x1b
                              = DPL 0, pres 1, long 1, def32 0, gran 1
      <5>vip64: link state changed to DOWN
      
      processor eflags        = interrupt enabled, resume, IOPL = 0
      current process         = 0 (em0 taskq)
      db:0:kdb.enter.default>  run lockinfo
      db:1:lockinfo> show locks
      No such command
      db:1:locks>  show alllocks
      No such command
      db:1:alllocks>  show lockedvnods
      Locked vnodes
      db:0:kdb.enter.default>  show pcpu
      cpuid        = 1
      dynamic pcpu    = 0xffffff807ef0f300
      curthread    = 0xffffff00025e2000: pid 0 "em0 taskq"
      curpcb       = 0xffffff80000fad40
      fpcurthread  = none
      idlethread   = 0xffffff00023c6000: pid 11 "idle: cpu1"
      curpmap         = 0
      tssp            = 0xffffffff811dab68
      commontssp      = 0xffffffff811dab68
      rsp0            = 0xffffff80000fad40
      gs32p           = 0xffffffff811d99a0
      ldt             = 0xffffffff811d99e0
      tss             = 0xffffffff811d99d0
      db:0:kdb.enter.default>  bt
      Tracing pid 0 tid 64037 td 0xffffff00025e2000
      _rw_rlock() at _rw_rlock+0x83
      carp_forus() at carp_forus+0x58
      ether_input() at ether_input+0x144
      em_rxeof() at em_rxeof+0x1d0
      em_handle_que() at em_handle_que+0x4d
      taskqueue_run() at taskqueue_run+0x93
      taskqueue_thread_loop() at taskqueue_thread_loop+0x46
      fork_exit() at fork_exit+0x118
      fork_trampoline() at fork_trampoline+0xe
      --- trap 0, rip = 0, rsp = 0xffffff80000fad30, rbp = 0 ---
      
      

      I also tried changing the mbuf configuration according to the Tuning_and_Troubleshooting_Network_Cards guide, but it made no difference.

      • cmb

        Definitely something specific to that combination of hardware, regardless of NIC. A couple of things I'd suggest trying, either one or both:

        1. If you don't have > 4 GB RAM, try i386. We've seen a small number of scenarios where driver problems only exist in amd64.
        2. Try 2.1, since it has a newer FreeBSD base that at times works much better with new servers.

        I'd go with #2 first personally, from my experience with some new servers.

        • Lenny

          I should have mentioned that I tried 2.0.1 i386 as well, with the same results.
          So I went for the second option and upgraded both firewalls to 2.1-BETA0-amd64-20121106-0059.
          This time, when I disabled CARP, the master firewall panicked and the backup one hung (even the console stopped responding, so I had to force a reboot), so I lost the network completely. The kernel panic on the master was a little different, but that might be due to changes in FreeBSD:

          
          Fatal trap 12: page fault while in kernel mode
          cpuid = 1; apic id = 02
          fault virtual address   = 0x308
          fault code              = supervisor read data, page not present
          instruction pointer     = 0x20:0xffffffff80768bd2
          stack pointer           = 0x28:0xffffff80000fba00
          <5>opt164_vip64: link state changed to DOWN
          
          frame pointer           = 0x28:0xffffff80000fba40
          code segment            = base 0x0, limit 0xfffff, type 0x1b
                                  = DPL 0, pres 1, long 1, def32 0, gran 1
          processor eflags        = interrupt enabled, resume, 
          <5>opt165_vip65: link state changed to DOWN
          IOPL = 0
          current process         = 0 (em0 que)
          
          0xffffff00048b0588: tag ufs, type VDIR
              usecount 1, writecount 0, refcount 4 mountedhere 0
              flags ()
              v_object 0xffffff00048070d8 ref 0 pages 1
              lock type ufs: EXCL by thread 0xffffff00047c0460 (pid 61142)
                  ino 10386432, on dev da0s1a
          
          0xffffff001bc85000: tag ufs, type VREG
              usecount 2, writecount 1, refcount 2 mountedhere 0
              flags ()
              v_object 0xffffff0012f65870 ref 0 pages 0
              lock type ufs: EXCL by thread 0xffffff00047c0460 (pid 61142)
                  ino 10386521, on dev da0s1a
          
          db:0:kdb.enter.default>  run lockinfo
          db:1:lockinfo> show locks
          No such command
          db:1:locks>  show alllocks
          No such command
          db:1:alllocks>  show lockedvnods
          Locked vnodes
          db:0:kdb.enter.default>  show pcpu
          cpuid        = 1
          dynamic pcpu = 0xffffff807ece5180
          curthread    = 0xffffff00025798c0: pid 0 "em0 que"
          curpcb       = 0xffffff80000fbd10
          fpcurthread  = none
          idlethread   = 0xffffff00023d68c0: tid 64006 "idle: cpu1"
          curpmap      = 0xffffffff8137c6d0
          tssp         = 0xffffffff814011e8
          commontssp   = 0xffffffff814011e8
          rsp0         = 0xffffff80000fbd10
          gs32p        = 0xffffffff81400020
          ldt          = 0xffffffff81400060
          tss          = 0xffffffff81400050
          db:0:kdb.enter.default>  bt
          Tracing pid 0 tid 64038 td 0xffffff00025798c0
          _rw_rlock() at _rw_rlock+0x92
          carp_forus() at carp_forus+0x5c
          ether_input() at ether_input+0x15f
          em_rxeof() at em_rxeof+0x1c2
          em_handle_que() at em_handle_que+0x5b
          taskqueue_run_locked() at taskqueue_run_locked+0x85
          taskqueue_thread_loop() at taskqueue_thread_loop+0x4e
          fork_exit() at fork_exit+0x11f
          fork_trampoline() at fork_trampoline+0xe
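
One way to check how similar the 2.0.1 and 2.1 panics are is to compare the two `bt` outputs frame by frame. The following is only a minimal sketch: the frame lines are copied from the two backtraces in this thread, while the helper names (`parse_frames`, `common_prefix`) and the regular expression are illustrative assumptions, not part of any pfSense or FreeBSD tooling.

```python
import re

# Frame lines copied verbatim from the two DDB backtraces in this thread.
BT_2_0_1 = """\
_rw_rlock() at _rw_rlock+0x83
carp_forus() at carp_forus+0x58
ether_input() at ether_input+0x144
em_rxeof() at em_rxeof+0x1d0
em_handle_que() at em_handle_que+0x4d
taskqueue_run() at taskqueue_run+0x93
taskqueue_thread_loop() at taskqueue_thread_loop+0x46
fork_exit() at fork_exit+0x118
fork_trampoline() at fork_trampoline+0xe
"""

BT_2_1 = """\
_rw_rlock() at _rw_rlock+0x92
carp_forus() at carp_forus+0x5c
ether_input() at ether_input+0x15f
em_rxeof() at em_rxeof+0x1c2
em_handle_que() at em_handle_que+0x5b
taskqueue_run_locked() at taskqueue_run_locked+0x85
taskqueue_thread_loop() at taskqueue_thread_loop+0x4e
fork_exit() at fork_exit+0x11f
fork_trampoline() at fork_trampoline+0xe
"""

# Hypothetical pattern for lines of the form "func() at func+0xNN".
FRAME_RE = re.compile(r"^\s*(\w+)\(\) at (\w+)\+0x([0-9a-fA-F]+)\s*$")


def parse_frames(text):
    """Return (function, offset) pairs, innermost frame first."""
    frames = []
    for line in text.splitlines():
        match = FRAME_RE.match(line)
        if match:
            frames.append((match.group(1), int(match.group(3), 16)))
    return frames


def common_prefix(a, b):
    """Function names shared by both traces, starting at the faulting frame."""
    shared = []
    for (name_a, _), (name_b, _) in zip(a, b):
        if name_a != name_b:
            break
        shared.append(name_a)
    return shared


old = parse_frames(BT_2_0_1)
new = parse_frames(BT_2_1)
shared = common_prefix(old, new)
print(shared)
# → ['_rw_rlock', 'carp_forus', 'ether_input', 'em_rxeof', 'em_handle_que']
```

Both traces fault in `_rw_rlock()` called from `carp_forus()` on the em receive path; they first diverge at the taskqueue frame, where the function name differs between the two builds.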
          
          

          As it all seems to be somewhere in the network card/driver or interface utilisation, I moved pfsync to one of the Broadcom interfaces and left just WAN and LAN on the Intel card, just to spread the load a little. The machines have now been running for around 12 hours while I push around 900 Mbit/s through them.
          During that time I disabled and re-enabled CARP 10 times, but I wasn't able to reproduce the issue.

          Do you think that it is safe to use 2.1 BETA or better to use 2.0.1?

          • cmb

            @Lenny:

            Do you think that it is safe to use 2.1 BETA or better to use 2.0.1?

            Every production system we run internally (3 colo datacenters, office, all of our boxes at home) is on 2.1. The biggest risk is the one inherent in nightly snapshot builds of anything: upgrading. If everything works on the particular snapshot you're on, it's not going to break. Unless you follow development very closely, there's always risk in upgrading to snapshot builds. Though when you're running a pair you can mitigate that: upgrade the secondary, disable CARP on the primary, and after verifying the secondary is good, upgrade the primary. Or if possible, just don't upgrade at all until an RC or release comes out, since those are QAed and automatic snapshot builds aren't.
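
The ordering of that pair upgrade could be sketched roughly as below. This is an illustration of the sequence only, not a pfSense tool: `rolling_upgrade` and its three callables (`upgrade`, `set_carp`, `is_healthy`) are hypothetical stand-ins for whatever you do by hand or script, and re-enabling CARP on the primary afterwards is an assumption that the post above does not spell out.

```python
from typing import Callable


def rolling_upgrade(
    upgrade: Callable[[str], None],
    set_carp: Callable[[str, bool], None],
    is_healthy: Callable[[str], bool],
    primary: str = "primary",
    secondary: str = "secondary",
) -> bool:
    """Upgrade a CARP pair one node at a time; True if both were upgraded."""
    upgrade(secondary)              # 1. new snapshot goes on the secondary first
    set_carp(primary, False)        # 2. move traffic onto the upgraded secondary
    if not is_healthy(secondary):   # 3. verify before touching the primary
        set_carp(primary, True)     #    assumption: fall back to the old primary
        return False
    upgrade(primary)                # 4. only now upgrade the primary
    set_carp(primary, True)         #    assumption: hand the master role back
    return True


# Dry run with stand-in callables that only record what would happen.
log = []
ok = rolling_upgrade(
    upgrade=lambda node: log.append(f"upgrade {node}"),
    set_carp=lambda node, on: log.append(f"carp {'on' if on else 'off'} {node}"),
    is_healthy=lambda node: True,
)
print(ok, log)
```

The point of the ordering is that the primary is never upgraded until the secondary has carried traffic on the new snapshot, so a bad build leaves one known-good node to fall back to.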

            Copyright 2025 Rubicon Communications LLC (Netgate). All rights reserved.