Torrents crash my pfsense. How can I fix this?

elementalwindx

Downloading a large number of torrents, or a few torrents with many sources, seems to crash my pfsense box. How can I fix this? I imagine it is exhausting the firewall states, but I have the firewall set up for 1 million states.

Watching it closely, the firewall states don't even reach 20% before it crashes. Memory doesn't appear to fill up either.

Here is a paste of the crash log:

http://pastebin.com/bYttQsUi

elementalwindx

51 views and no ideas? :(

Nachtfalke

Perhaps something of the following could help you:

Check that the MBUF Usage isn't reaching the limit - could be increased with kern.ipc.nmbclusters
Check kern.ipc.somaxconn. The description I found for that is: The kern.ipc.somaxconn sysctl variable limits the size of the listen queue for accepting new TCP connections. The default value of 128 is typically too low for robust handling of new connections in a heavily loaded web server environment. For such environments, it is recommended to increase this value to 1024 or higher. I set it to 2048
I increased net.inet.tcp.sendbuf_max to 16777216
I increased net.inet.tcp.recvbuf_max to 16777216
I increased net.inet.ip.intr_queue_maxlen to 3000

I put these sysctls into System > Advanced > System Tunables. Some people in the forum say that they will not work there and that you need to put them in /boot/loader.conf.local or /boot/loader.conf.
What is certain is that you need to reboot after making these changes.
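
A minimal sketch of how one might check, after the reboot, whether the suggested values actually took effect. It assumes the output of `sysctl` for these names has been captured as text in the usual `name: value` form. The function names and the sample output are made up for illustration; the target values are the ones from the list above.

```python
# Target values taken from the suggestions above. kern.ipc.nmbclusters is
# left out because no specific number was suggested for it.
SUGGESTED = {
    "kern.ipc.somaxconn": 2048,
    "net.inet.tcp.sendbuf_max": 16777216,
    "net.inet.tcp.recvbuf_max": 16777216,
    "net.inet.ip.intr_queue_maxlen": 3000,
}


def parse_sysctl_output(text):
    """Parse 'name: value' lines into a dict of integer values."""
    values = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        if not sep:
            continue  # not a 'name: value' line
        value = value.strip()
        if value.lstrip("-").isdigit():
            values[name.strip()] = int(value)
    return values


def find_mismatches(current, suggested=SUGGESTED):
    """Return {name: (current value or None, suggested value)} where they differ."""
    return {
        name: (current.get(name), want)
        for name, want in suggested.items()
        if current.get(name) != want
    }


# Hypothetical captured output, for illustration only.
sample = """\
kern.ipc.somaxconn: 128
net.inet.tcp.sendbuf_max: 16777216
net.inet.tcp.recvbuf_max: 16777216
net.inet.ip.intr_queue_maxlen: 50
"""

if __name__ == "__main__":
    mismatches = find_mismatches(parse_sysctl_output(sample))
    for name, (have, want) in mismatches.items():
        print(f"{name}: currently {have}, suggested {want}")
```

Any name that shows up in the result either was not applied or was overridden somewhere else, which is a hint to try the other location for the setting.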

elementalwindx

@Nachtfalke:

In the /boot/loader.conf file I found kern.ipc.nmbclusters="0" already in it. Does that mean there is no limit set? I will make these other changes and report back.

I ended up having to reinstall pfsense. It looks like the config may have become corrupted. It was extremely flaky today and would crash even with no load. I reinstalled it and tried to restore a backup .xml I had just made, and it had exactly the same issues. I restored all defaults and it started to work perfectly again. I haven't tried the torrents yet though!

Here are the changes I have made: yours plus some.

autoboot_delay="1"
vm.kmem_size="435544320"
vm.kmem_size_max="535544320"
kern.ipc.nmbclusters="0"
# ----- altered below this line, stock above -----
kern.ipc.nmbclusters=262144
kern.ipc.somaxconn=4096
net.inet.tcp.sendbuf_max=16777216
net.inet.tcp.recvbuf_max=16777216
net.inet.ip.intr_queue_maxlen=3000
kern.maxfiles=204800
kern.maxfilesperproc=200000
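
Note that the listing above assigns kern.ipc.nmbclusters twice. A small sketch that flags names assigned more than once in a loader.conf-style listing, assuming simple `name=value` lines with optional quotes. The helper names are hypothetical, and the sketch only reports the duplicates; it does not claim which assignment the loader ends up using.

```python
from collections import defaultdict


def parse_assignments(text):
    """Collect every value assigned to each name, in file order."""
    seen = defaultdict(list)
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if "=" not in line:
            continue  # skip blank lines and separator lines
        name, value = line.split("=", 1)
        seen[name.strip()].append(value.strip().strip('"'))
    return dict(seen)


def duplicates(assignments):
    """Return only the names that were assigned more than once."""
    return {name: vals for name, vals in assignments.items() if len(vals) > 1}


# The listing from the post above.
listing = '''\
autoboot_delay="1"
vm.kmem_size="435544320"
vm.kmem_size_max="535544320"
kern.ipc.nmbclusters="0"
# ----- altered below this line, stock above -----
kern.ipc.nmbclusters=262144
kern.ipc.somaxconn=4096
net.inet.tcp.sendbuf_max=16777216
net.inet.tcp.recvbuf_max=16777216
net.inet.ip.intr_queue_maxlen=3000
kern.maxfiles=204800
kern.maxfilesperproc=200000
'''

if __name__ == "__main__":
    for name, vals in duplicates(parse_assignments(listing)).items():
        print(f"{name} is assigned {len(vals)} times: {vals}")
```

Removing the stock line, or editing it in place, avoids any doubt about which value is in effect.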

matguy

I would suggest testing your hardware. Maybe run MemTest86 from a CD or something.

wallabybob

The crash report says the system tried to access a "no access" page of memory in kernel address space:

kernel trap 12 with interrupts disabled

Fatal trap 12: page fault while in kernel mode
cpuid = 0; apic id = 00
fault virtual address = 0xeab08d80
fault code = supervisor read, page not present
instruction pointer = 0x20:0xc0e7b41b
stack pointer = 0x28:0xeeb0ec78
frame pointer = 0x28:0xeeb0ecc0
code segment = base 0x0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, def32 1, gran 1
processor eflags = resume, IOPL = 0
current process = 3939 (awk)

while awk was the current process on the CPU.

The stack trace:

db:0:kdb.enter.default> bt
Tracing pid 3939 tid 64130 td 0xc7ccd000
cpu_switch(c7ccd000,0,207,d7f88b4a,1f7,…) at cpu_switch+0x8b
mi_switch(207,0,c0f899d0,d3,eeb0ed18,...) at mi_switch+0xd6
ast(eeb0ed28) at ast+0x1ba
doreti_ast() at doreti_ast+0x17
db:0:kdb.enter.default> ps

is not particularly helpful in identifying what caused the code to attempt this access. It is possible you have a memory problem causing corruption (e.g. a stuck bit) of a pointer, or your particular combination of hardware, configuration options and traffic has exposed a bug, which I expect will be difficult to find. If you are lucky it will be a memory problem. matguy's suggestion to run memtest86 is a good one. I suggest you let it run for at least a few passes. If memtest86 reports any errors you should replace the memory. At current memory prices it might not be worth attempting to identify the specific stick with the problem.

memtest86 (or memtest86+) can be found on a number of Linux live CDs or System Rescue CD which can be fairly easily written to a USB stick if that is more convenient.
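
If the box panics again, it may help to compare the key fields of each crash report side by side. A minimal sketch that pulls them out of the trap text, assuming the `key = value` layout shown in the report above; the function names are made up for illustration and the sketch draws no conclusions of its own.

```python
import re

# Trap fields copied from the crash report quoted above.
TRAP_TEXT = """\
Fatal trap 12: page fault while in kernel mode
cpuid = 0; apic id = 00
fault virtual address = 0xeab08d80
fault code = supervisor read, page not present
instruction pointer = 0x20:0xc0e7b41b
stack pointer = 0x28:0xeeb0ec78
frame pointer = 0x28:0xeeb0ecc0
current process = 3939 (awk)
"""


def parse_trap(text):
    """Return the 'key = value' fields of a trap report as a dict of strings."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep and key.strip():
            fields[key.strip()] = value.strip()
    return fields


def summarize(fields):
    """Pick out the fields most useful for comparing one crash with another."""
    match = re.fullmatch(r"(\d+) \((.+)\)", fields.get("current process", ""))
    pid, name = (int(match.group(1)), match.group(2)) if match else (None, None)
    return {
        "fault_address": int(fields["fault virtual address"], 16),
        "fault_code": fields["fault code"],
        "instruction_pointer": fields["instruction pointer"],
        "pid": pid,
        "process": name,
    }


if __name__ == "__main__":
    for key, value in summarize(parse_trap(TRAP_TEXT)).items():
        print(f"{key}: {value}")
```

Running this over several reports and lining up the results makes it easier to see whether the faulting address, instruction pointer and current process are the same each time or different.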

elementalwindx

@wallabybob:

Just before I did this install, I replaced both memory sticks with two known-good sticks I had lying around.

cmb

Even if they really are "known good" sticks, that doesn't eliminate many other possibilities of bad hardware causing memory corruption. That's definitely the most likely cause from the looks of it.

wallabybob

@elementalwindx:

"Known good" by what test? "Known good" in that system? "Known good" together in that system?

elementalwindx

@wallabybob:

Known good by the fact that they were in packaging from my stock room that has never been opened lol. Yes I know even brand new memory can be bad. If that were the case then I could have used another set of sticks…

I have not had a single problem all day and I have certainly put it to the test today all day non stop.

Thanks guys for helping me out.

Every time pfsense crashed, it would claim the disks were dirty. I've also swapped the hard drive for another used drive in decent condition. So either the memory or the hard drive could have caused this issue. I think both are to blame, as both were 4+ year old parts.

cmb

When you have an unclean shutdown (yanking the power cord, a kernel panic), the disk will be marked dirty and fsck will run to fix it. That's not related to the cause, just normal after a kernel panic.