Out of memory during startup on Alix 2D13
I have Alix 2D13 systems, 2GB CF card, 256MB RAM. When most of my systems start, the "flashing LED" sequence that indicates the pfSense startup is running never stops. Office staff keep asking "why are the lights still flashing?"
During the startup some process or other gets killed with "out of swap space". Since there is zero swap space in this configuration, the message really means "out of memory". The startup never does the led_normalise call. (and probably misses other more critical things)
Now I have a test config that can easily reproduce this, no packages installed, no special rules, 4 OpenVPN peer-ro-peer shared-key servers listening on 4 different WAN ports.
If I have:
2 OpenVPN servers, the startup is good.
3 OpenVPN servers, "out of swap space" happens sometimes.
4 OpenVPN servers, "out of swap space" happens every time.
Once the system is started and idling, memory use is:
43% (2 servers)
44% (3 servers)
46% (4 servers)
I guess that some processes are happening in parallel during the startup, and the concurrent processes are filling up memory.
Here is the console output:
Welcome to pfSense 2.1-BETA0 ... Creating symlinks......done. External config loader 1.0 is now starting... ad0s3 Launching the init system... done. Initializing............................. done. Starting device manager (devd)...done. Loading configuration......done. Updating configuration...done. Cleaning backup cache........done. Setting up extended sysctls...done. Setting timezone...done. Configuring loopback interface...done. Starting Secure Shell Services...done. Setting up polling defaults...done. Setting up interfaces microcode...done. Configuring loopbavr1: link state changed to DOWN ck interface...done. Configuring LAGG interfaces...done. Configuring VLAN interfaces...done. Configuring QinQ interfaces...done. Configuring WAN interface...vr1: link state changed to UP pflog0: promiscuous mode enabled tun1: changing name to 'ovpns1' tun2: changing name to 'ovpns2' ovpns1: link state changed to UP tun3: changing name to 'ovpns3' ovpns2: link state changed to UP ovpns3: link state changed to UP tun4: changing name to 'ovpns4' ovpns4: link state changed to UP vr0: link state changed to DOWN done. Configuring LAN interface...done. Syncing OpenVPN settings...ovpns1: link state changed to DOWN ovpns1: link state changed to UP ovpns2: link state changed to DOWN ovpns2: link state changed to UP ovpns3: link state changed to DOWN ovpns3: link state changed to UP ovpns4: link state changed to DOWN done. vr0: link state changed to UP ovpns4: link state changed to UP Starting syslog...done. Configuring firewall......done. Starting PFLOG...done. Setting up gateway monitors...done. Synchronizing user settings...done. Starting webConfigurator...done. Configuring CRON...done. Starting NTP time client...done. Starting DHCP service...done. Starting DHCPv6 service...done. Starting DNS forwarder...done. Configuring firewall......done. Generating RRD graphs...Killed Starting CRON... done. Bootup complete
and syslog entries straight after boot:
Sep 28 13:04:57 kernel: Timecounter "TSC" frequency 498052748 Hz quality 800 Sep 28 13:04:57 kernel: Timecounters tick every 10.000 msec Sep 28 13:04:57 kernel: IPsec: Initialized Security Association Processing. Sep 28 13:04:57 kernel: usbus0: 12Mbps Full Speed USB v1.0 Sep 28 13:04:57 kernel: usbus1: 480Mbps High Speed USB v2.0 Sep 28 13:04:57 kernel: ugen0.1: <amd> at usbus0 Sep 28 13:04:57 kernel: uhub0: <amd 1="" 9="" ohci="" root="" hub,="" class="" 0,="" rev="" 1.00="" 1.00,="" addr=""> on usbus0 Sep 28 13:04:57 kernel: ugen1.1: <amd> at usbus1 Sep 28 13:04:57 kernel: uhub1: <amd 1="" 9="" ehci="" root="" hub,="" class="" 0,="" rev="" 2.00="" 1.00,="" addr=""> on usbus1 Sep 28 13:04:57 kernel: ad0: 1907MB <sandisk sdcfh-002g="" hdx="" 6.02=""> at ata0-master PIO4 Sep 28 13:04:57 kernel: Root mount waiting for: usbus1 usbus0 Sep 28 13:04:57 kernel: uhub0: 4 ports with 4 removable, self powered Sep 28 13:04:57 kernel: Root mount waiting for: usbus1 Sep 28 13:04:57 kernel: uhub1: 4 ports with 4 removable, self powered Sep 28 13:04:57 kernel: Trying to mount root from ufs:/dev/ufs/pfsense1 Sep 28 13:04:57 kernel: ad0: FAILURE - SETFEATURES 0x85 status=51 <ready,dsc,error>error=4 <aborted>Sep 28 13:04:57 check_reload_status: Linkup starting vr0 Sep 28 13:05:01 php: : ROUTING: setting default route to 10.49.80.250 Sep 28 13:05:06 check_reload_status: Updating all dyndns Sep 28 13:05:09 check_reload_status: Restarting ipsec tunnels Sep 28 13:05:16 kernel: pid 288 (php), uid 0, was killed: out of swap space Sep 28 13:05:18 kernel: pid 36793 (php), uid 0, was killed: out of swap space Sep 28 13:05:19 php: : Restarting/Starting all packages. Sep 28 13:05:21 php: : rc.newwanip: Informational is starting ovpns2. Sep 28 13:05:21 php: : rc.newwanip: on (IP address: 10.202.1.1) (interface: ) (real interface: ovpns2). Sep 28 13:05:21 php: : rc.newwanip: Informational is starting ovpns3. Sep 28 13:05:21 php: : rc.newwanip: Informational is starting ovpns4. Sep 28 13:05:21 check_reload_status: Reloading filter Sep 28 13:05:21 php: : rc.newwanip: on (IP address: 10.203.1.1) (interface: ) (real interface: ovpns3). Sep 28 13:05:21 php: : rc.newwanip: on (IP address: 10.204.1.1) (interface: ) (real interface: ovpns4). Sep 28 13:05:21 php: : pfSense package system has detected an ip change -> ... Restarting packages. Sep 28 13:05:21 check_reload_status: Starting packages Sep 28 13:05:21 login: login on console as root Sep 28 13:05:21 sshlockout: sshlockout/webConfigurator v3.0 starting up Sep 28 13:05:22 php: : pfSense package system has detected an ip change -> ... Restarting packages. Sep 28 13:05:22 php: : Hotplug event detected for lan but ignoring since interface is configured with static IP (192.168.1.1) Sep 28 13:05:22 php: : pfSense package system has detected an ip change -> ... Restarting packages. Sep 28 13:05:24 check_reload_status: rc.newwanip starting vr0 Sep 28 13:05:25 php: : Restarting/Starting all packages. Sep 28 13:05:32 php: : Restarting/Starting all packages. Sep 28 13:05:33 php: : rc.newwanip: Informational is starting vr0. Sep 28 13:05:33 php: : rc.newwanip: on (IP address: 192.168.1.1) (interface: lan) (real interface: vr0). Sep 28 13:05:34 check_reload_status: Reloading filter Sep 28 13:05:39 php: : Resyncing OpenVPN instances for interface LAN. Sep 28 13:05:39 php: : Creating rrd update script Sep 28 13:05:39 php: : pfSense package system has detected an ip change 0.0.0.0 -> ... Restarting packages. Sep 28 13:05:39 check_reload_status: Starting packages Sep 28 13:05:41 php: /index.php: Successful login for user 'admin' from: 192.168.1.101 Sep 28 13:05:41 php: /index.php: Successful login for user 'admin' from: 192.168.1.101 Sep 28 13:05:46 php: : Restarting/Starting all packages.</aborted></ready,dsc,error></sandisk></amd></amd></amd></amd>
The config is attached. It is simple and hopefully will cause the problem on any Alix 2D13.
built on Thu Sep 27 13:07:59 EDT 2012
But I also had this issue on 2.0.1 - it is not new, but I hoped that something in 8.3 FreeBSD and 2.1 pfSense might be a little different in the startup processing and avoid this. Now I want to track it down and fix it.
I will have a good look through the way the startup sequence happens, but any suggestions about good places to put some "ps" or "syscrl -a | grep -i memory" or other debugging commands into the startup flow are welcome - then I can see which part of the startup is overloading memory.
If you have this issue also then please confirm that I am not just crazy!
I've experienced this as well and back then I just disabled OpenVPN connections I rarely use to work around it. I think it is the result of something during the course of 2.0's development that significantly increased PHP's memory usage (but I have no idea what it was). I wasn't paying as much attention to some of the changes at the time, so I don't know exactly when it happened, though I'm thinking sometime around end of third to beginning of fourth quarter of 2010.
I finally got back to looking at this. I put some little loops in /etc/rc.bootup that display the available memory and sleep 2 seconds, doing this 20 times for a total 40 second delay each time. This shows where things startup, gobble up memory during their intialisation, then give much of it back again as the initialisation code exits.
After FreeBSD is up, rc.bootup is running and the memory file systems (/tmp /var) are created, there is 45MB available of the 256MB real memory on the Alix 2D13.
As more things are started, the available memory eventually stabilises down at 14MB, when "Generating RRD graphs" starts. After that, there is only 9MB available. Then the boot process finishes.
Once that is all done, there is about 30MB available as the system runs.
If I let the boot process run without the sleeps, then too much starts at once and things get killed (including the boot process itself).
I am thinking of designing a routine that can be called at various points during rc.bootup - it would check the available memory, and if it was too low, loop around sleeping until the available memory recovered or it had slept for some maximum limit. With suitable selection of the required minimum memory and maximum sleep times, this would put automatically calibrated delays into the bootup of low-memory systems, without delaying bootup on bigger systems.
Would this be an acceptable design?
Note: The test config I am using for this has 4 OpenVPN servers and 1 OpenVPN client. They are just "dummy" ones I setup for testing. I have attached the rc.bootup that I use to display the available memory, the console output of a boot and the test config.
@efonne: Yes, I also had the same symptoms on 2.0.1 and that was one reason that I moved to 2.1 - I wanted to see if it was any better on 2.1 (FreeBSD 8.3). And also, if I can fix it then I can easily submit the fix, get it into a snapshot and use it.
If you disable the OpenVPN instances is that enough to make the problem go away? Curious if that's the root of the issue or if there's something else that's contributing or causing it.
Yes, 1 or 2 OpenVPN instances and the boot succeeds.
3 instances - it works sometimes and sometimes runs out of memory
4 or more instances and a "Killed - out of swap space" happens every time
The test config has 5 instances, to really make sure the problem happens all the time.
I will spit out some "ps" info as well to a file from rc.bootup - it would be interesting to see how many copies of various scripts are firing off as all the OpenVPN instances kick off. Maybe there is some filter configure or rc.newwanip or something that fights itself 5 times (is not seeing $g[booting] and is reacting to every OpenVPN event that happens even though it is still during the boot).
Note for readers: On nanoBSD there is no swap space, so "out of swap space" really means "out of real memory".
This was a problem for me in 2.0.1 and still is in 2.1 - I am surprised that lots of others do not report it also. But I have systems in remote places with bad power, that routinely get shutdown at night and turned on each morning. Most business and personal routers in western countries are left up 24/7 and there are rarely power fails, and even with power fails, businesses usually have UPS for network infrastructure and servers. So maybe not many people boot their systems!
Few people run 3+ instances of OpenVPN on an ALIX, 1 or 2 is very common. I'm pretty sure having that many OpenVPN instances must be the only thing at least in the base system that will trigger that.
If you just add a delay between starting OpenVPN instances only does that suffice to work around it? Haven't checked the code but off the top of my head they'll all get launched more or less instantaneously, and apparently OpenVPN chews up quite a bit more RAM at startup than it does during normal runtime. We may just need some logic there so if you have > 2 OpenVPN instances and <= 256 MB RAM, it sleeps a couple seconds between launching each instance. Startup doesn't really run much in parallel, so other things shouldn't be interfering much.
Might I suggest >= 2 instances of OpenVPN. I encounter the same issue with just 2 servers. If I disable 1 then the bootup succeeds without issue. With 2, then it exhibits the behavior of something being killed during startup. It could just be that in my situation one of the servers quickly becomes established during bootup rather than just waiting for connections because it is a site-to-site server. I'd expect a server with a successful link would use more memory.
I have 2 main sites that all the others need to connect to, so rather than making a full mesh of OpenVPN links, I have now made each of the 6 smaller sites just have 2 OpenVPN site-to-site links, to the 2 main sites. This means that these 6 Alix systems boot nicely (With 2 links I don't see the memory issue during boot).
The 2 main sites have 7 links each (6 to the smaller sites and 1 to the other main site) - so I am aiming to be able to boot reliably with 7 OpenVPN site-to-site shared-key links defined. I am away from home at one of the smaller sites at present, so I can't play with my test system right now, maybe at the weekend.
On a related note, I spoke to Netgate about whether there is any chance of Alix systems with more on-board memory being available. They have other customers who have asked a similar question. For me, I just want more real memory and keep the 12V DC input, my sites all have slow links in remote areas, so I don't care about Gb ethernet or CPU speed. Netgate are looking out for something, but no news yet.
i had spoken to creater of alix for a upgraded board and the below is what he said
apu = AMD 1 GHz dual core, 2 or 4 GB DDR3, 3 RTL8111E Gigabit Ethernet,
2 miniPCI express, 1 m-SATA SSD, 1 SD card connected via USB, same
form factor as alix2d13. Expected ~ Q1 2013.