Working on getting OpenVPN server bridging to fly.

Numbski

I think I may have had an epiphany on the random-lockup thing with OpenVPN and CARP. This doesn't affect the bridge problem itself (bridge just stops working).

I've been having openvpn listen on the WAN carp interface's IP address.

So visualize this - you have two firewalls listening on the CARP IP, so only one can answer. You make a change or do something that causes a filter reload. Temporarily, the CARP IP address becomes unavailable, so the second box takes over. Now, you have OpenVPN set up on a "keep state", so it has a state going on the first box, but now suddenly the second box answers. The CARP IP becomes available on the first box again. So now you flip back, all the while, we're tunneling layer two traffic, both boxes bridged onto the LAN.

Something tells me that this exchange is far from graceful, and in fact we're really hacking off OpenVPN, and causing an exception that the kernel just can't deal with. I have to keep reminding myself that tap is a kernel driver, so making tap unhappy makes the kernel unhappy. Thus the unresponsive kernel.

So the way to deploy this is probably to have both boxes listen on the "real" WAN interface, and have two remote statements on the clients.

http://openvpn.net/howto.html#client

If you look, it has provisions for load balancing between OpenVPN servers. I'll give this a try, see if it resolves our issues.
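For anyone following along, here is a minimal sketch of what the client side of that might look like. It is just a small Python helper that prints the relevant directives. The host names are made-up placeholders, and the function name is hypothetical. The `remote`, `remote-random` and `resolv-retry` directives are the ones the howto uses in its load-balancing/failover section, and port 1194 is the port the servers here listen on.

```python
def client_remote_block(servers, port=1194, randomize=True):
    """Return the OpenVPN client directives that list several servers.

    servers   -- host names or IPs of the "real" WAN interfaces (placeholders here)
    randomize -- if True, add remote-random so clients pick a server at random;
                 if False, clients try the servers in the order listed
    """
    lines = [f"remote {host} {port}" for host in servers]
    if randomize:
        lines.append("remote-random")
    # Keep retrying name resolution for 60 seconds before giving up on a remote.
    lines.append("resolv-retry 60")
    return "\n".join(lines)


# Hypothetical host names, one per firewall.
print(client_remote_block(["fw1.example.com", "fw2.example.com"]))
```

Leaving out `remote-random` (pass `randomize=False`) gives plain failover instead of load balancing: the client tries the first box and only moves to the second if that fails.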

Numbski

Update - this does not fix the arbitrary system lockup problem. :(

Numbski

Ah ha!!!

http://www.sigmasoft.com/~openbsd/archives/html/openbsd-bugs/2005-08/msg00018.html

Just running out of time to dig into this right now….but this looks like we might have a winner. (Or loser, depending.)

Just realized I hadn't explained myself. I had remote syslogging going, and for the first time I caught the openvpn logging right at the time of the crash. The last line read:

Sep 14 19:10:46 lbfw1 openvpn[367]: event_wait : Interrupted system call (code=4)

Then nothing until I rebooted, which is what led me to this article. This happens to be a triple gigabit Hacom box.

More complete log:


Sep 14 11:23:58 lbfw1 openvpn[96002]: TUN/TAP device /dev/tap0 opened
Sep 14 11:23:58 lbfw1 openvpn[96002]: /sbin/ifconfig tap0 192.168.168.169 netmask 192.168.168.170 mtu 1500 up
Sep 14 11:23:58 lbfw1 openvpn[96002]: UDPv4 link local (bound): x.x.x.x:1194
Sep 14 11:23:58 lbfw1 openvpn[96002]: UDPv4 link remote: [undef]
Sep 14 11:23:58 lbfw1 openvpn[96002]: Initialization Sequence Completed
Sep 14 11:24:02 lbfw1 openvpn[96002]: 208.231.66.99:52385 Re-using SSL/TLS context
Sep 14 11:24:02 lbfw1 openvpn[96002]: 208.231.66.99:52385 LZO compression initialized
Sep 14 11:24:03 lbfw1 openvpn[96002]: 208.231.66.99:52385 [Tony_Shadwick] Peer Connection Initiated with 208.231.66.99:52385
Sep 14 11:56:39 lbfw1 openvpn[96002]: 208.231.66.99:52398 Re-using SSL/TLS context
Sep 14 11:56:39 lbfw1 openvpn[96002]: 208.231.66.99:52398 LZO compression initialized
Sep 14 11:56:40 lbfw1 openvpn[96002]: 208.231.66.99:52398 [Tony_Shadwick] Peer Connection Initiated with 208.231.66.99:52398
Sep 14 18:32:41 lbfw1 openvpn[371]: 208.231.66.99:52637 Re-using SSL/TLS context
Sep 14 18:32:41 lbfw1 openvpn[371]: 208.231.66.99:52637 LZO compression initialized
Sep 14 18:32:42 lbfw1 openvpn[371]: 208.231.66.99:52637 [Tony_Shadwick] Peer Connection Initiated with 208.231.66.99:52637
Sep 14 18:36:00 lbfw1 openvpn[371]: Tony_Shadwick/208.231.66.99:52637 [Tony_Shadwick] Inactivity timeout (--ping-restart), restarting
Sep 14 18:48:32 lbfw1 openvpn[367]: 208.231.66.99:52663 Re-using SSL/TLS context
Sep 14 18:48:32 lbfw1 openvpn[367]: 208.231.66.99:52663 LZO compression initialized
Sep 14 18:48:33 lbfw1 openvpn[367]: 208.231.66.99:52663 [Tony_Shadwick] Peer Connection Initiated with 208.231.66.99:52663
Sep 14 18:51:09 lbfw1 openvpn[367]: Tony_Shadwick/208.231.66.99:52663 [Tony_Shadwick] Inactivity timeout (--ping-restart), restarting
Sep 14 19:10:46 lbfw1 openvpn[367]: event_wait : Interrupted system call (code=4)
Sep 14 21:00:31 lbfw1 openvpn[371]: 208.231.66.99:52809 Re-using SSL/TLS context
Sep 14 21:00:31 lbfw1 openvpn[371]: 208.231.66.99:52809 LZO compression initialized
Sep 14 21:00:32 lbfw1 openvpn[371]: 208.231.66.99:52809 [Tony_Shadwick] Peer Connection Initiated with 208.231.66.99:52809
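
In case it helps anyone else sift through a remote syslog for the same thing, here is a rough Python sketch that finds each "Interrupted system call" line and reports how long the log stayed silent afterwards. The function names are my own, and the year is an arbitrary placeholder because syslog timestamps don't carry one. It assumes the classic syslog line layout shown above.

```python
import re
from datetime import datetime

# Matches classic syslog lines such as:
# "Sep 14 19:10:46 lbfw1 openvpn[367]: event_wait : Interrupted system call (code=4)"
LINE_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}) \S+ "
    r"openvpn\[(?P<pid>\d+)\]: (?P<msg>.*)$"
)


def parse_line(line, year):
    """Return (timestamp, pid, message), or None if the line is not an openvpn entry."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    stamp = datetime.strptime(f"{year} {m.group('stamp')}", "%Y %b %d %H:%M:%S")
    return stamp, int(m.group("pid")), m.group("msg")


def silence_after_interrupt(lines, year=2000):
    """For each 'Interrupted system call' entry, return (timestamp, seconds of silence).

    The silence is the time until the next openvpn entry, or None if it was the last one.
    year is a placeholder; syslog lines do not record it.
    """
    parsed = [p for p in (parse_line(l, year) for l in lines) if p is not None]
    gaps = []
    for i, (stamp, _pid, msg) in enumerate(parsed):
        if "Interrupted system call" in msg:
            following = parsed[i + 1][0] if i + 1 < len(parsed) else None
            seconds = (following - stamp).total_seconds() if following is not None else None
            gaps.append((stamp, seconds))
    return gaps


sample = [
    "Sep 14 18:51:09 lbfw1 openvpn[367]: Tony_Shadwick/208.231.66.99:52663 [Tony_Shadwick] Inactivity timeout (--ping-restart), restarting",
    "Sep 14 19:10:46 lbfw1 openvpn[367]: event_wait : Interrupted system call (code=4)",
    "Sep 14 21:00:31 lbfw1 openvpn[371]: 208.231.66.99:52809 Re-using SSL/TLS context",
]
for stamp, seconds in silence_after_interrupt(sample):
    print(stamp.strftime("%b %d %H:%M:%S"), "followed by", seconds, "seconds of silence")
```

Note that a long silence on its own proves nothing (there is a quiet stretch earlier in the day with no crash), which is why this keys on the event_wait line rather than on gaps in general.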

sullrich

We run FreeBSD. Are you seeing this exact error?

Numbski

Now that you mention it, no. :( kif == NULL probably is shorthand for kernel interface equals null. So likely not it.

That event_wait: Interrupted system call line bugs me though. It's almost as though something tried to interrupt openvpn, failed to do so, and the entire system just sits there in an endless loop waiting for some event to occur that never will. All interfaces stop responding. Occasionally I can hit ctrl-alt-del and the system will attempt a halt, but it will never actually be able to fully halt itself.

Numbski

I really should quit this and start winding down for bed, but it's driving me nuts.

Okay, we're dealing with two enigmas here: 1, a tap interface that the system may or may not know what to do with, and 2, a bridge utilizing that tap interface.

Now I seem to recall that every time there is a change to config.xml, the openvpn process gets killed and relaunched. Is it possible that there's a condition that may be seemingly random to me that might come along and try to reap bridges or individual interfaces, or even the openvpn process itself, fails to do so, and then chases its tail until there are no more resources available to consume?

Numbski

/etc/rc.bootup, line 181.

See any harm in moving that down two commands so it comes after openvpn_resync_all();? Theoretically it would mean openvpn would be up and tap0 would be created prior to bridges being brought online, right?

Only thing that comes to mind that runs all the time is /usr/local/sbin/check_reload_status, which is a binary daemon, not php, not a cron job. It appears that it just keeps checking /tmp/check_reload_status, which usually says "sleeping", unless something more interesting is going on. I don't know what it does if there's something more interesting going on though.

Numbski

Promise, last post for the night.

The shellcmd tags DO work, but it takes not one, but two reboots for them to take effect. I haven't the slightest idea why that is, but on the first reboot, nothing happens. Reboot again, it works. ???

Really. Going to go rest now.

Really.

sullrich

It's handy to remove /tmp/config.cache before signaling a reload.

Ie: rm /tmp/config.cache from the command prompt after making config.xml changes.

Numbski

Had another lockup today. Lasted for about 18 hours, and then same behavior. I'm seriously going to have to recompile the kernel with sw_watchdog until I figure this out. It's completely maddening, as the second firewall never picks up because "technically" the first box is still up, but not really. :\

Numbski

Okay. I just finally got the kernel with sw_watchdog enabled built. I'll share it here just in case someone else finds it useful.

http://www.numbski.net.nyud.net:8080/downloads/pfSense/kernel-with-sw_watchdog.tar.gz

Numbski

Since I have now given the sw_watchdog a proper workout, now I need to figure wtf is causing this. grrr….

Please note I'm now officially grasping at straws. This post is from 2004:

http://lists.osdl.org/pipermail/bridge/2004-January/000146.html

I've added a crontab entry to remove stp from both bridge interfaces. We'll see how it goes.

Also, if anyone has a good idea of what I can do to get a proper dump of the kernel when the watchdog fires, please let me know. Nothing useful is getting logged.

sullrich

Did you ever send your configuration to Andrew Thompson?

Numbski

No. :(

Part of the problem is that the main config I'm doing this on is kinda confidential. I have another one I can send him, but I've been too tied up to get it over to him.

I'll make a concerted effort to get that over to him "soon".

sullrich

No offense, but we cannot help you until you send the configuration.

Andrew IS the maintainer of the if_bridge subsystem and he expressed his willingness to help but you continue to post messages at an alarming rate, not sending him the information he needs.

It will never get fixed at this rate. Please send him the information he needs or just accept the fact that this will not work.

Numbski

I just e-mailed him asking if a sanitized version of the config.xml would suffice. I would really prefer not to go giving out password hashes and IP addresses. :(

sullrich

Sanitize the passwords, but your fear about IP addresses is kinda silly.

If you trust the code that we put into this product then I don't see why you cannot trust someone knowing your IP address.

Numbski

Sent.

Numbski

Just made an observation.

These hangups seem to occur consistently when I'm sending a whole lot of traffic through the firewalls, such as a cvsup. Doesn't have to be traffic across the vpn, just traffic in general.

Numbski

Heh, sullrich. You're not going to believe this.

I fully understand what you told me in irc about you guys not doing anything to or with tun/tap interfaces, and that everything is done via openvpn.

That said, after setting sysctl net.link.tap.user_open to 1, I've had the most uptime since I've started this whole debugging fiasco. Totally odd. Just thought I'd point it out in case someone might have an explanation for it.

To bring people who might be reading this up to speed, net.link.tap.user_open is set to 0 by default. What that means is that only root (or a similarly privileged user) has permission to make changes to, or significantly impact, a tap interface. When set to 1, non-privileged users can do the same. This might be construed as a security concern, but for testing purposes there's no harm. If indeed this "fixes" my problem, it raises more questions than it answers, as OpenVPN runs as root right now, meaning that either something else is touching the tap interface, OR openvpn is somehow dropping privs at some point.