pfSense crashed on ALIX

jimp

ahh… well then that wouldn't explain the ucom then. The ALIX would only see serial, not a USB device on its end.

jlepthien

Again :-(

Build from 01/06

db> bt
Tracing pid 11 tid 64025 td 0xc2456d80
pim_input(c30a5816,c24fa000,c308cd00,0,0,...) at pim_input+0xb8c
ip_input(c308cd00,246,c24d8700,c2378bcc,c06fd9b1,...) at ip_input+0x604
netisr_dispatch_src(1,0,c308cd00,c2378c04,c08e3ecf,...) at netisr_dispatch_src+0x89
netisr_dispatch(1,c308cd00,c24fa000,c24fa000,c30a5808,...) at netisr_dispatch+0x20
ether_demux(c24fa000,c308cd00,3,0,3,...) at ether_demux+0x16f
ether_vlanencap(c24fa000,c308cd00,c2456d80,c2378c5c,c0853f81,...) at ether_vlanencap+0x43f
ucom_attach(c0d56e6d,c0cd10c0,c2378cb0,c2378c98,0,...) at ucom_attach+0x542b
ucom_attach(c24ab000,0,109,82593edb,132,...) at ucom_attach+0x89d7
intr_event_execute_handlers(c2436aa0,c2434680,c0b5910d,4f6,c24346f0,...) at intr_event_execute_handlers+0x14b
intr_getaffinity(c24f9b50,c2378d38,0,0,0,...) at intr_getaffinity+0x14a
fork_exit(c080dfe0,c24f9b50,c2378d38) at fork_exit+0x90
fork_trampoline() at fork_trampoline+0x8
--- trap 0, eip = 0, esp = 0xc2378d70, ebp = 0 ---
db>

jlepthien

And again. These are the last lines I could grab:

processor eflags = interrupt enabled, resume, IOPL = 0
current process = 11 (irq10: vr0)

jlepthien

Okay. Now again. I am testing igmpproxy right now, perhaps it has something to do with it?

Fatal trap 12: page fault while in kernel mode
fault virtual address = 0x72636524
fault code = supervisor write, page not present
instruction pointer = 0x20:0xc096993c
stack pointer = 0x28:0xc2378b10
frame pointer = 0x28:0xc2378b64
code segment = base 0x0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, def32 1, gran 1
processor eflags = interrupt enabled, resume, IOPL = 0
current process = 11 (irq10: vr0)

cmb

@jlepthien:

Okay. Now again. I am testing igmpproxy right now, perhaps it has something to do with it?

Possibly. Did it happen at all before you started testing it?

We'll get Ermal or someone to take a look at the backtraces when time permits.

jlepthien

Yeah, I think the first two happened before, or at least the first one. All the ones from yesterday happened while I was testing igmpproxy…

jlepthien

Another one again. But I can only see stuff starting with the prompt most of the time….

db> bt
Tracing pid 11 tid 64025 td 0xc2456d80
rn_match(c0cd4fcc,c2842200,0,0,c23788a8,...) at rn_match+0x17
pfr_match_addr(c288d9b0,c31e5822,2,c2378894,c2378890,...) at pfr_match_addr+0x63
pf_test_udp(c2378990,c237898c,1,c2562c00,c278a800,...) at pf_test_udp+0x4db
pf_test(1,c24fa000,c2378b54,0,0,...) at pf_test+0xbb5
init_pf_mutex(0,c2378b54,c24fa000,1,0,...) at init_pf_mutex+0x5e6
pfil_run_hooks(c0cfd140,c2378ba4,c24fa000,1,0,...) at pfil_run_hooks+0x7e
ip_input(c278a800,246,c24d2ac0,c2378bcc,c06fd9b1,...) at ip_input+0x278
netisr_dispatch_src(1,0,c278a800,c2378c04,c08e3ecf,...) at netisr_dispatch_src+0x89
netisr_dispatch(1,c278a800,c24fa000,c24fa000,c31e5808,...) at netisr_dispatch+0x20
ether_demux(c24fa000,c278a800,3,0,3,...) at ether_demux+0x16f
ether_vlanencap(c24fa000,c278a800,c2456d80,c2378c5c,c0853f81,...) at ether_vlanencap+0x43f
ucom_attach(c0d56e6d,c0cd10c0,c2378cb0,c2378c98,0,...) at ucom_attach+0x542b
ucom_attach(c24ab000,0,109,cd9a2d5d,38ea,...) at ucom_attach+0x89d7
intr_event_execute_handlers(c2436aa0,c2434680,c0b5910d,4f6,c24346f0,...) at intr_event_execute_handlers+0x14b
intr_getaffinity(c24f9b50,c2378d38,0,0,0,...) at intr_getaffinity+0x14a
fork_exit(c080dfe0,c24f9b50,c2378d38) at fork_exit+0x90
fork_trampoline() at fork_trampoline+0x8
--- trap 0, eip = 0, esp = 0xc2378d70, ebp = 0 ---
db>

jlepthien

Do we have any info yet? Today this happened again and I really would like to know what this is. I can simply install 1.2.3 again and wait until 2.0 is out of beta, but I want to help the project. So devs, what could be the problem? Anything else I should check?

jlepthien

Would you guys please be so kind as to give me an answer? Otherwise it is no fun posting these backtraces…
Today it happened again...

jimp

Unfortunately with the snapshot server out of commission until the new one is put in place there isn't much to do or try except keep track of the traces.

jlepthien

Yeah well but this doesn't answer my question what the real problem is. Or do you think that these problems were silently fixed in a new snapshot?

jimp

It's hard to say with any certainty until someone with more in-depth knowledge of the freebsd kernel (such as ermal) can have a look and see if he can tell what is going on.

jlepthien

Yep. That's what I'm waiting for ;)

wallabybob

I've looked at a lot of FreeBSD dumps. This sort of problem is sometimes fairly straightforward to find but can also be very difficult to find. It can have a variety of causes, including passing the wrong type of data structure to a function, or freeing a data structure and then continuing to use it after it has been reallocated for another purpose.

If I were looking at this problem, I expect the most useful items of information to me would be:

a precise identification of the build on which the problem was observed
a way of making it happen, even if it makes it happen only one time in four

One of the back traces shows:
ucom_attach(c0d56e6d,c0cd10c0,c2378cb0,c2378c98,0,...) at ucom_attach+0x542b
ucom_attach(c24ab000,0,109,cd9a2d5d,38ea,...) at ucom_attach+0x89d7
The offsets can be misleading, in that static functions don't appear in the symbol table available to the crash-time debugger. Since 0x1000 is 4k, 0x89d7 is at least 32k, and it's pretty unlikely that an attach function would have anything like that amount of code. This offset is likely in some static function whose code starts at a higher address than the code for ucom_attach.
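
The offset arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function names and the 4 KiB cut-off are assumptions made for the example, not anything defined by the kernel or the debugger. It pulls the symbol and offset out of each frame and flags offsets that look too large to sit inside the named function.

```python
import re

# Matches the tail of a DDB backtrace frame, e.g. "... at ucom_attach+0x89d7".
FRAME_RE = re.compile(r"\bat\s+(\w+)\+0x([0-9a-fA-F]+)\s*$")

# Assumed cut-off: offsets beyond 4 KiB (0x1000) are treated as suspicious.
# This threshold is an illustrative guess, not a rule from the kernel.
SUSPICIOUS_OFFSET = 0x1000


def parse_frame(line):
    """Return (symbol, offset) for a backtrace line, or None if it is not a frame."""
    match = FRAME_RE.search(line)
    if match is None:
        return None
    return match.group(1), int(match.group(2), 16)


def suspicious_frames(lines, limit=SUSPICIOUS_OFFSET):
    """Return the frames whose offset from the reported symbol exceeds limit."""
    result = []
    for line in lines:
        frame = parse_frame(line)
        if frame is not None and frame[1] > limit:
            result.append(frame)
    return result


# Frames copied from the first backtrace in this thread.
trace = [
    "ether_demux(c24fa000,c308cd00,3,0,3,...) at ether_demux+0x16f",
    "ether_vlanencap(c24fa000,c308cd00,c2456d80,c2378c5c,c0853f81,...) at ether_vlanencap+0x43f",
    "ucom_attach(c0d56e6d,c0cd10c0,c2378cb0,c2378c98,0,...) at ucom_attach+0x542b",
    "ucom_attach(c24ab000,0,109,82593edb,132,...) at ucom_attach+0x89d7",
    "fork_exit(c080dfe0,c24f9b50,c2378d38) at fork_exit+0x90",
]

for symbol, offset in suspicious_frames(trace):
    print(f"{symbol}+{offset:#x} is {offset} bytes past the symbol")
```

A frame flagged this way is not necessarily wrong; it only suggests that the name shown is the nearest preceding symbol the debugger knows about, rather than the function actually executing.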

Another of the reports shows:
Fatal trap 12: page fault while in kernel mode
fault virtual address = 0x72636524
fault code = supervisor write, page not present
instruction pointer = 0x20:0xc096993c
stack pointer = 0x28:0xc2378b10
frame pointer = 0x28:0xc2378b64

If you look at the virtual address you might notice that it could be considered to be printable text: "$ecr" (0x24 is the ASCII code for "$"; the bytes are read in memory order, lowest byte first, since x86 is little-endian). From the reported code it would appear that a data structure referenced by rn_match has a text string where rn_match is expecting it to hold the address of another data structure. The challenge is to find out how that happened.
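
The printable-text reading of the fault address can be checked with a short Python sketch. The helper name is made up for this example; the only assumption about the hardware is that the ALIX is a little-endian x86 machine, so the lowest byte of the address comes first in memory.

```python
def address_as_text(address, width=4):
    """Render a fault address as the ASCII text its bytes would spell in memory.

    The address is split into `width` bytes in little-endian order (as stored
    on x86). Bytes outside the printable ASCII range are shown as "?".
    """
    raw = address.to_bytes(width, byteorder="little")
    return "".join(chr(b) if 0x20 <= b < 0x7F else "?" for b in raw)


# Fault virtual address from the "Fatal trap 12" report in this thread.
fault_address = 0x72636524
print(address_as_text(fault_address))
```

When all four bytes of a faulting address decode to printable characters like this, it is a hint (not proof) that string data has overwritten a pointer field.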

jlepthien

What I see now is that this happens every 3-4 days. So I guess I will do a reboot now every night via cron to see if this then stops until I have better builds…

jlepthien

With the daily reboot in place I am not seeing this problem anymore. So what is the status of these problems? Has anyone (ermal) taken a look at the bt's? Is this "problem" fixed in newer snaps?

jlepthien

Today this happened again. So I cannot use this workaround :(

Here is the bt:

rn_match(c0cd504c,c283fd00,0,c2981718,e2992850,...) at rn_match+0x17
pfr_match_addr(c288b9b0,c2741034,2,e299283c,e2992838,...) at pfr_match_addr+0x63
pf_test_tcp(e2992938,e2992934,1,c26c4600,c272bd00,...) at pf_test_tcp+0x4cb
pf_test(1,c2610400,e2992afc,0,0,...) at pf_test+0x8d2
init_pf_mutex(0,e2992afc,c2610400,1,0,...) at init_pf_mutex+0x5e6
pfil_run_hooks(c0cfd1c0,e2992b4c,c2610400,1,0,...) at pfil_run_hooks+0x7e
ip_input(c272bd00,246,c24d38c0,e2992b74,c06fd9a1,...) at ip_input+0x278
netisr_dispatch_src(1,0,c272bd00,e2992bac,c08e3f0f,...) at netisr_dispatch_src+0x89
netisr_dispatch(1,c272bd00,c2610400,c2610400,c274101a,...) at netisr_dispatch+0x20
ether_demux(c2610400,c272bd00,3,0,3,...) at ether_demux+0x16f
ether_vlanencap(c2610400,c272bd00,ece0,18,c272bd00,...) at ether_vlanencap+0x43f
ieee80211_hostap_detach(c2700000,c315a000,c272bd00,c2532480,c2438d80,...) at ieee80211_hostap_detach+0x362
ieee80211_hostap_detach(c315a000,c272bd00,17,ffffffa0,0,...) at ieee80211_hostap_detach+0x29a7
ath_suspend(c2514000,1,0,c0ca937c,0,...) at ath_suspend+0x1f67
taskqueue_run(c251d100,c251d118,0,c0b53f14,0,...) at taskqueue_run+0x132
taskqueue_thread_loop(c2514270,e2992d38,0,0,0,...) at taskqueue_thread_loop+0x88
fork_exit(c086b060,c2514270,e2992d38) at fork_exit+0x90
fork_trampoline() at fork_trampoline+0x8
--- trap 0, eip = 0, esp = 0xe2992d70, ebp = 0 ---

Please guys. Give me any info. What else do you need? Does nobody use 2.0-beta1 on Alix boards? Can't be...

jimp

I use 2.0-beta1 on my ALIX, but it has not crashed on me yet. I haven't passed much traffic through it, though, as it's just been used for light testing and such.

xbipin

I use the snapshot from the 22nd on an ALIX and it hasn't crashed for me so far, so it might be a hardware issue or something like that.

jlepthien

I don't think it is hardware related since 1.2.3 is running fine on this box. This just happened now with 2.0-beta1…