IPsec fails to renegotiate after loss of a peer
-
Whoooooooooo, so glad this has been looked at. This caused me serious problems. Now I just need the traffic shaper fix.
-
Jimp,
Some news for you: if the partner VPN starts a ping, the tunnel reconnects, but if it drops from my end, it will not establish. Thought I'd let you know. I'll try to get some information on the type of firewall they are using.
RC
-
At the moment we're in a holding pattern waiting on ipsec-tools 0.8 to get some fixes in, and I believe that's the way things are going to go.
There are too many of these issues to fix in ipsec-tools 0.7.2 with patches, unfortunately.
-
That's cool. I'm not updating at the moment; I'm more concerned with a stable environment. I can't afford any more downtime.
I am in the process of moving and have a ton of stuff going on.
RC
-
Now that ipsec-tools 0.7.3 is out, any thoughts on using it?
-
It was even worse off than 0.8 in some respects. We had to stay at 0.7.2 but drop NAT-T to get some semblance of stability.
-
Is IPsec renegotiating properly now in 1.2.3-RC3?
-
Yes, it is working now as far as all my tests have shown, both in actual tests and in running it at home through some Internet stability issues. It seems to work fine as far as I can tell.
-
Jimp, where would I find RC3? It's not on the official mirrors, only RC1. Or is RC3 a current snapshot? I'm using embedded.
-
It's only in snapshots at the moment, but there will probably be an "official" cut of RC3 (or perhaps RC4?) before release.
Here are the NanoBSD (new embedded system) snapshots:
http://snapshots.pfsense.org/FreeBSD_RELENG_7_2/pfSense_RELENG_1_2/nanobsd/?C=M;O=D
-
I am seeing an issue that seems to be the same. I am testing 1.2.3-RC 20090924. I have tunnels set up to two separate sites, each with a Netopia router on the other end. The tunnels are working when I leave work in the evening, but when I get to work in the morning, they are not working. The IPsec status page (SAD) shows that the tunnels are up. If I restart racoon, the tunnel status goes down. I then ping a site, everything gets renegotiated, and it works again.
I currently have a lifetime of 28800 seconds for phase 1 and 86400 seconds for phase 2.
I am willing to test a couple of things for a day or two if someone has a suggestion. After that, I will need to put my pfSense box into production without the tunnels, and I will be limited in what I can try.
-
It might help to know more about these tunnels, at least this much: Are they static tunnels or mobile clients? Are they using main mode or aggressive mode? Do you have DPD enabled? Keep Alive? What shows up in the logs when the tunnels are broken?
And anything else you can think of.
-
I'm using the latest 2.0 snapshot. Do you recommend leaving DPD enabled? I don't have access to the logs right now, so I can't post them, but it appears that when a tunnel goes down because of the Internet connection on my end or the other end, I have to restart the racoon service on both ends for the tunnel to re-establish. This is between two pfSense boxes… I don't even want to get started on my Linksys VPN tunnel issues.
-
im using the latest 2.0 snapshot
Don't. That's not going to be stable. Pretty sure the 7.2/2.0 builds still use NAT-T which has renegotiation issues, and the 8 snapshots likely don't have a proper ipsec-tools either.
-
Damn… well, now that I think of it, I really only switched to check out the new GUI and multiple DynDNS accounts for each of my WANs. So IPsec is much more stable and up to date in the RC versions? I would have figured newer code with more features was included in 2.0 for VPNs, specifically IPsec.
-
OK, I enabled DPD (60 sec) and I believe this fixed part of the problem. A couple of my tunnels stayed up overnight, or at least reconnected. One of the tunnels wouldn't reconnect, though, until I deleted the numerous SAD entries and restarted racoon. (Actually, a second tunnel stopped after restarting racoon, and I again had to delete the SAD entries, but did not restart racoon.) It appears that some of the SAs were not getting dropped properly, even though the IPsec status page showed that the tunnels were up. It may have coincided with the lifetime setting. I may change my lifetimes to a shorter time frame so that I can try to duplicate the behavior. Below are some of the error messages.
racoon: ERROR: fatal INVALID-SPI notify messsage, phase1 should be deleted.
Sep 29 12:23:18 racoon: ERROR: fatal INVALID-SPI notify messsage, phase1 should be deleted.
Sep 29 12:22:48 racoon: ERROR: fatal INVALID-SPI notify messsage, phase1 should be deleted.
Sep 29 12:22:41 racoon: [VPN Name]: INFO: IPsec-SA established: ESP MyWanIPxx.xx[0]->RemoteIPxx.xx[0] spi=2670541922(0x9f2d3c62)
Sep 29 12:22:41 racoon: [VPN Name]: INFO: IPsec-SA established: ESP RemoteIPxx.xx[0]->MyWanIPxx.xx[0] spi=37322644(0x2397f94)
Sep 29 12:22:40 racoon: [VPN Name]: INFO: respond new phase 2 negotiation: MyWanIPxx.xx[0]<=>RemoteIPxx.xx[0]
Background: The tunnels are static tunnels to Netopia routers, one tunnel to each site or router.
Aggressive mode is used. For the Keep Alive, I am using the remote LAN address of the Netopia router.
After I duplicate the problem, I will provide more log info. Most of it was overwritten before I thought of copying it.
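If you want to pull the useful bits out of a copied log excerpt like this one, a small parser helps. This is a minimal Python sketch, assuming lines look exactly like the excerpt (syslog-style timestamp, `racoon:`, an optional `[peer name]:`, then a level such as `INFO` or `ERROR`); the function names are made up for illustration. Because the GUI excerpt is newest-first, it reverses the lines to get chronological order, and it checks that each SPI's decimal and hex forms agree (they do here: 2670541922 is 0x9f2d3c62).

```python
import re
from collections import Counter

# Matches racoon lines such as:
#   Sep 29 12:22:41 racoon: [VPN Name]: INFO: IPsec-SA established: ...
#   Sep 29 12:23:18 racoon: ERROR: fatal INVALID-SPI notify messsage, ...
# The timestamp is optional because one excerpted line was truncated.
LINE_RE = re.compile(
    r"^(?:(?P<ts>[A-Z][a-z]{2} +\d{1,2} \d\d:\d\d:\d\d) )?"
    r"racoon: (?:\[(?P<peer>[^\]]+)\]: )?(?P<level>[A-Z]+): (?P<msg>.*)$"
)
SPI_RE = re.compile(r"spi=(?P<dec>\d+)\(0x(?P<hex>[0-9a-f]+)\)")


def parse(lines):
    """Return dicts (ts, peer, level, msg), skipping non-racoon lines."""
    records = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m:
            records.append(m.groupdict())
    return records


def spis(records):
    """Collect SPIs from 'IPsec-SA established' lines, checking dec == hex."""
    found = []
    for rec in records:
        m = SPI_RE.search(rec["msg"])
        if m and "IPsec-SA established" in rec["msg"]:
            dec, hx = int(m["dec"]), int(m["hex"], 16)
            if dec != hx:
                raise ValueError(f"inconsistent SPI {dec} vs 0x{m['hex']}")
            found.append(dec)
    return found


# The excerpt as posted (newest first), with placeholder addresses kept.
sample = [
    "racoon: ERROR: fatal INVALID-SPI notify messsage, phase1 should be deleted.",
    "Sep 29 12:23:18 racoon: ERROR: fatal INVALID-SPI notify messsage, phase1 should be deleted.",
    "Sep 29 12:22:48 racoon: ERROR: fatal INVALID-SPI notify messsage, phase1 should be deleted.",
    "Sep 29 12:22:41 racoon: [VPN Name]: INFO: IPsec-SA established: ESP MyWanIPxx.xx[0]->RemoteIPxx.xx[0] spi=2670541922(0x9f2d3c62)",
    "Sep 29 12:22:41 racoon: [VPN Name]: INFO: IPsec-SA established: ESP RemoteIPxx.xx[0]->MyWanIPxx.xx[0] spi=37322644(0x2397f94)",
    "Sep 29 12:22:40 racoon: [VPN Name]: INFO: respond new phase 2 negotiation: MyWanIPxx.xx[0]<=>RemoteIPxx.xx[0]",
]
records = parse(reversed(sample))  # oldest first
levels = Counter(r["level"] for r in records)
print(levels, spis(records))
```

Counting levels this way makes it easy to see whether INVALID-SPI errors start piling up right after a new IPsec-SA is established.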
-
I would have figured newer code with more features was included in 2.0 for vpns specifically ipsec
Not in this case. Even so, newer code and more features don't usually translate to more stability, especially in an alpha release :)
-
The full log is in /var/log/ipsec.log, and you can view it by executing the command: clog /var/log/ipsec.log
It may have just been gone from the GUI, which only shows a limited number of lines. Also, it would help to show the logs in normal order, not reverse order. If you have the reverse order box checked on Status > System Logs, Settings tab, uncheck it and save, then copy/paste the logs. There was an old bug that caused the IPsec logs to ignore this setting, but it was fixed before the snapshot you said you were running. You may also want to update to the most recent snapshot to be sure you really have the most current updates.
One more thing: When testing these settings, be sure to stop and restart racoon after making your changes, to be on the safe side and to be sure the SAD and SPD are clear.
-
Sorry if this sounds ignorant, but do you mean delete all the SPD and SAD entries?
-
Stopping and restarting racoon should do this. If you want to do it by hand, run:
setkey -F
setkey -F -P
-
Thanks for the advice. I knew the logs had to be somewhere. I will also restart raccoon from now on after changes are made.
"sorry if this sounds ignorant but do you mean delete all the SPD and SAD entries?"
In the GUI, I would go to Status > IPsec, then the SAD tab, and delete the entries that corresponded to the tunnel I was having problems with. Normally, I see two entries for each tunnel, MyWanIP to RemoteWanIP and RemoteWanIP to MyWanIP. When I was having problems, this area would fill up with entries while it was trying to connect. There is an X beside each one where it can be deleted. Restarting racoon also deletes them, though. (I was experimenting a little when deleting manually.) I did not change anything on the SPD tab.
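If you copy the SAD entries out (by hand from the GUI, or from `setkey -D`), a short script can spot the pile-up described here. This Python sketch takes simplified `(src, dst, spi)` tuples rather than parsing real `setkey -D` output, and the addresses and SPIs are placeholders. It flags any direction that holds more than one SA; briefly having an old and a new SA side by side can be normal around a rekey, so treat a flag as a hint, not proof of a stale entry.

```python
from collections import defaultdict


def crowded_pairs(sad_entries, normal=1):
    """Group SAD entries by direction and return directions with too many SAs.

    sad_entries: iterable of (src, dst, spi) tuples (hypothetical format).
    normal: how many SAs per direction you expect (one in steady state).
    """
    by_pair = defaultdict(list)
    for src, dst, spi in sad_entries:
        by_pair[(src, dst)].append(spi)
    return {pair: spis for pair, spis in by_pair.items() if len(spis) > normal}


healthy = [("MyWanIP", "RemoteWanIP", 1001), ("RemoteWanIP", "MyWanIP", 1002)]
piled_up = healthy + [
    ("MyWanIP", "RemoteWanIP", 1003),
    ("MyWanIP", "RemoteWanIP", 1005),
]
print(crowded_pairs(healthy))   # {}
print(crowded_pairs(piled_up))  # {('MyWanIP', 'RemoteWanIP'): [1001, 1003, 1005]}
```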
I did change the lifetime settings to 1800 and 3600 on two of the tunnels so that I could see whether they came back up after an expiration without waiting until tomorrow. They did come up this time. If a tunnel is down in the morning, I will post a few log entries.
-
One of the tunnels went down again last night. Attached is my IPSec log. It was working at 17:00 on 9/29/09 and was not working at the end of the log on 9/30/09. The log was modified to remove the IP addresses. My Wan IP address is listed as MyWanIP. The remote Wan IP of the tunnel that went down is listed as RemoteWanIPTunnel1. Any advice welcome. Thanks.
-
Well, the tunnel came back up without any action from me, other than trying to ping the site twice. The first time it timed out; an hour later I tried and got an immediate reply. For the Keep Alive address, I am using the remote LAN IP. Should I be using the remote WAN IP instead?
-
I ran a ping test on my tunnels last night to see how long and how often they were down. The test ran for about 15 hours. Tunnel 1 went down 6 times averaging 30-35 minutes each time. Tunnel 2 went down 3 times also averaging 30-35 minutes each time. Tunnel 3 stayed up the whole time. I didn't count the times where the tunnel was only down for about a minute, however this would also be a problem for our VPNs. I cannot verify that times under a couple minutes were not due to a line problem since I alternated my pings between sites. The range for the downtime was between 25 and 45 minutes. Most occurrences appeared to be at 30-35 minutes.
My lifetime settings for tunnel 1 and 2 are at 1800 (30 min) for phase 1 and 3600 (60 min) for phase 2. My lifetimes for tunnel 3 are 28800 and 86400. Tunnel 3's phase 2 lifetime did not expire during the test.
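A ping test like this is easy to summarize in code. The Python sketch below assumes you have logged chronologically ordered `(minute, replied)` samples (a hypothetical format) and reports each outage as a `(start, end)` pair, dropping outages shorter than a threshold, in the same spirit as ignoring the roughly one-minute drops that might be line problems. The resulting durations can then be lined up against the lifetimes (1800 s = 30 min, 3600 s = 60 min).

```python
def outages(samples, min_minutes=2.0):
    """Return (start, end) minute pairs for runs of failed pings.

    samples: chronologically ordered (minute, replied) tuples.
    An outage starts at the first failed sample and ends at the next
    successful one; outages shorter than min_minutes are dropped.
    """
    result = []
    start = None
    for minute, replied in samples:
        if not replied and start is None:
            start = minute
        elif replied and start is not None:
            if minute - start >= min_minutes:
                result.append((start, minute))
            start = None
    if start is not None and samples:  # still down when the test ended
        last = samples[-1][0]
        if last - start >= min_minutes:
            result.append((start, last))
    return result


# Made-up test: one ping per minute for two hours, down from minute 30 to 64,
# plus a one-minute blip at minute 90 that should be ignored.
samples = [(m, not (30 <= m < 65 or m == 90)) for m in range(121)]
found = outages(samples)
durations = [end - start for start, end in found]
print(found, durations)  # [(30, 65)] [35]
```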
Attached is a partial IPSec log. Tunnel 1 went down sometime between 5:56 and 6:01. It came back up between 6:41 and 6:47 in case someone wants to look. The clocks between the two machines could be slightly off.
I will try to upgrade again in the next day or so and test again, but I am not very hopeful at this point.
Any suggestions are welcome.
Thanks
-
I have noticed similar behavior with a pfSense<->Linksys VPN.
Basically, at some unspecified time, the tunnel appears to die and stays down for the length of the phase 2 lifetime (for me).
I've seen the same behavior without pfSense in the mix, though, using just the Linksys VPN routers (which I am slowly replacing with pfSense boxes).
-
I'm making some progress on my problem, but I am not going to trust it for production yet. I did have a tunnel stay up all night. I am posting this in case it helps someone else. I don't profess to completely understand all of the intricacies of IPSec, but I do have a basic understanding of what happens.
One of the problems with IPsec is that vendors understand and implement the standard differently. Because the standard is not clear on some issues, such as re-keying, different vendors choose different methods.
One of my problems was related to "dangling phase 2 SAs". The other end of my tunnels was set to not allow dangling phase 2 SAs. There is no GUI option for this in pfSense, nor is there any option for choosing between using the newest SAs immediately or waiting until the old ones expire. Either of these can cause a tunnel to go down after the timeout expires.
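The "newest vs. old SAs" mismatch can be illustrated with a toy model. This Python sketch is a simplification, not how the FreeBSD kernel or the Netopia firmware actually choose SAs, and the class and function names are hypothetical: during a rekey, an old and a new SA exist for the same direction, the sender picks one, and the peer can only decrypt SPIs it still holds. If the peer drops the old SA as soon as the new one is up (no dangling SAs) while our side keeps preferring the old one, traffic goes to an SPI the peer no longer knows, which is plausibly the kind of situation an INVALID-SPI notify reports.

```python
from dataclasses import dataclass
from operator import attrgetter


@dataclass(frozen=True)
class SA:
    spi: int
    created: int  # seconds since an arbitrary reference point


def pick_outbound(sas, prefer_old):
    """Choose which of several valid SAs to send with (simplified model)."""
    by_age = attrgetter("created")
    return min(sas, key=by_age) if prefer_old else max(sas, key=by_age)


def peer_accepts(sa, peer_spis):
    """The peer can only decrypt traffic whose SPI it still holds."""
    return sa.spi in peer_spis


old = SA(spi=100, created=0)
new = SA(spi=200, created=3300)
peer_keeps = {200}  # peer deleted the old SA right away (no dangling SAs)

for prefer_old in (True, False):
    chosen = pick_outbound([old, new], prefer_old)
    print(prefer_old, chosen.spi, peer_accepts(chosen, peer_keeps))
# True 100 False   -> traffic sent on an SPI the peer already dropped
# False 200 True
```

In this model, the fix is simply making both ends agree: either the peer allows dangling SAs, or our side switches to the newest SA.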
pfSense is not wrong in its method; the documentation just hasn't made clear which method works with it.
I changed one of my tunnels to "Allow Dangling Phase 2 SAs" and it has made a huge difference. This will not help those with a pfsense to pfsense tunnel, but for those who have this option on one end of their tunnel, they should experiment with it.
If either of these options is available in a config file somewhere, I would be interested in knowing.
Thanks
-
Actually, there is an option for using the newest or older SAs. I haven't found anything related to dangling phase 2 SAs though.
-
Has there been any progress on this issue in 1.2.3-RC3? I am setting up a new router and heard on the mailing list that not much will change between RC3 and the final release. I am currently using 1.2.2, and everything works great on most of my installations. The major feature I want from 1.2.3 is the ability to make changes to one IPsec tunnel without restarting all 30 running tunnels.
-
It appears to be resolved now. I've not had any trouble with my usually-volatile tunnels that used to give me no end of grief making me restart racoon all the time. It's been working like a dream with RC3.
-
Has there been any progress with this issue is 1.2.3RC3?
It's been resolved since NAT-T was removed over a month ago.
-
Thanks to both of you for the information. I already put RC3 on the new router but I was starting to second guess myself because of some other issues that may be unrelated. Hearing that these problems are at the very least not nearly as common gives me the confidence to continue with the project.
-
I'm not sure if I would consider this resolved… I'm running the current 1.2.3-RC3 build and having plenty of IPsec issues that fit this bill. So far, enabling DPD (30 Sec) and 'Prefer old IPsec SA's' seem to have resolved it. It's been ~20 hours and no tunnels have gone down, but I'll give it another day before I feel like it is stable. In any case, this is more of a work around than a resolution.
-
Depending on your scenario, you may need DPD. It's not a work around, it's a desirable thing to have enabled most of the time. It would be sensible to always use it where the other end supports it.
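To get a feel for how quickly DPD notices a dead peer, here is a back-of-the-envelope Python sketch. It assumes racoon-style parameters (`dpd_delay`, `dpd_retry`, `dpd_maxfail`): after `dpd_delay` seconds of silence, up to `dpd_maxfail` R-U-THERE probes are sent `dpd_retry` seconds apart. The defaults of 5 and 5 used here are assumptions, and racoon's exact timing may differ, so treat the result as a rough estimate only.

```python
def dpd_detection_seconds(dpd_delay, dpd_retry=5, dpd_maxfail=5):
    """Rough worst-case time before a silent peer is declared dead.

    Assumed model: wait dpd_delay seconds without hearing from the peer,
    then send up to dpd_maxfail R-U-THERE probes spaced dpd_retry apart.
    """
    if min(dpd_delay, dpd_retry, dpd_maxfail) <= 0:
        raise ValueError("DPD parameters must be positive")
    return dpd_delay + dpd_retry * dpd_maxfail


for delay in (60, 30):  # the two DPD settings mentioned in this thread
    print(delay, dpd_detection_seconds(delay))
# 60 85
# 30 55
```

Either way, under this model a dead peer is noticed within a minute or two, far sooner than waiting out a 3600- or 28800-second lifetime.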
There are no known issues, all our major deployments are running RC3 with no problems.
-
Well, I've been running my ipsec tunnels on RC1 for several months in a production environment without any issues until I upgraded to RC3, so I'd say that's an issue. No? The configuration worked fine without DPD, until the upgrade. If you're interested, here is my post on my issue: http://forum.pfsense.org/index.php/topic,20043.0.html
I'd love to have some additional feedback on my issue.
-
I'm actually wondering how bkm is faring now, because I am having the exact same issue with one location that is still using a Linksys router:
In the GUI, I would go to Status-IPSec, then the SAD tab and delete the entries that corresponded to the tunnel that I was having problems with. Normally, I see two entries for each tunnel, MyWanIP to RemoteWanIP and RemoteWanIP to MyWanIP. When I was having problems, this area would get filled with entries where it was trying to connect. There is an X beside each one where they can be deleted. Restarting raccoon also deletes them though. (I was experimenting a little when deleting manually) I did not change anything on the SPD tab.
Manually deleting these SAD items brings the site right back up… but only for the lifetime of phase2 which is 3600 at this time. I still have another router with 1.2.2 connecting to the same Linksys BEFVP41 routers and they do not have this problem.
-
Well, I'm still experimenting and trying to find the best settings for my situation. I currently have nine tunnels in production. I have slightly different settings on a few of them to try to find out what works best with my equipment. The tunnels had been up for 4 1/2 days (some a little longer), but they all went down today for some reason. I had one other occurrence of this after they had been up for a few days, but I thought it was a fluke.
My tunnels have been renegotiating after the lifetime is up most of the time. It seems like pfSense has some trouble starting a larger number of tunnels after they all go down or if you are restarting racoon. (maybe I am not waiting long enough) I usually have to start disabling some of the tunnels before any will start working. Once they start, I can enable the others and they seem to work fine.
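The manual workaround here, bringing tunnels up a few at a time instead of all at once, can be planned with a tiny helper. This Python sketch only computes a schedule; it does not talk to pfSense or racoon, and the batch size, gap, and tunnel names are hypothetical. Whether staggering actually helps depends on why racoon struggles with many simultaneous negotiations, which this thread doesn't establish.

```python
def bring_up_plan(tunnels, batch_size, gap_seconds):
    """Split tunnels into batches, each with a start offset in seconds."""
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    plan = []
    for i in range(0, len(tunnels), batch_size):
        offset = (i // batch_size) * gap_seconds
        plan.append((offset, tunnels[i:i + batch_size]))
    return plan


# Nine hypothetical tunnels, enabled three at a time, one minute apart.
tunnels = [f"site{n}" for n in range(1, 10)]
for offset, batch in bring_up_plan(tunnels, batch_size=3, gap_seconds=60):
    print(offset, batch)
# 0 ['site1', 'site2', 'site3']
# 60 ['site4', 'site5', 'site6']
# 120 ['site7', 'site8', 'site9']
```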
Since my tunnels are staying up longer, each change takes longer to test. Many of my tweaks are being made on the far router side (non-pfSense), so they will probably not be very helpful to most users. I think the main thing users should know is that if you are using a non-pfSense box on one side of a tunnel, do not assume that everyone else's settings are the ones you should use. They are a good starting point, and if they work for you, that's great. If not, you will need to test various settings.
One setting that I believe has helped me was disabling PFS. This does reduce security, though. If I can get my tunnels to stay up for a few weeks, I may try enabling it again. I have also set the lifetimes on each of my tunnels to slightly different values so that they do not all expire at the same time. I don't really know if that helps.
Does anyone know when tunnels are actually renegotiated? For instance, if the lifetime is set to 28800 seconds, does renegotiation start at half that time or a certain number of seconds before expiration?
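As a rough way to think about the renegotiation question, here is a Python sketch that assumes phase 2 rekeying starts when a soft lifetime expires, and that the soft lifetime is a fixed percentage of the hard (configured) lifetime. The 80% default below is an assumption for illustration, not something confirmed for this racoon build, so check racoon.conf(5) and the logs before relying on it. It also flags tunnels whose soft expiries land within a minute of each other, which is what staggering the lifetimes is meant to avoid.

```python
def rekey_schedule(lifetimes, soft_percent=80):
    """Return (name, soft_seconds, hard_seconds), sorted by soft expiry.

    lifetimes: mapping of tunnel name -> configured lifetime in seconds.
    soft_percent: ASSUMED fraction (in percent) at which rekeying starts.
    """
    schedule = [(name, hard * soft_percent // 100, hard)
                for name, hard in lifetimes.items()]
    return sorted(schedule, key=lambda row: row[1])


def clashes(schedule, window=60):
    """Adjacent tunnels whose soft expiries are less than `window` s apart."""
    return [(a, b)
            for (a, soft_a, _), (b, soft_b, _) in zip(schedule, schedule[1:])
            if soft_b - soft_a < window]


print(rekey_schedule({"tunnel1": 28800}))  # [('tunnel1', 23040, 28800)]
same = rekey_schedule({"tunnel1": 3600, "tunnel2": 3600})
staggered = rekey_schedule({"tunnel1": 3600, "tunnel2": 3720, "tunnel3": 3840})
print(clashes(same))       # [('tunnel1', 'tunnel2')]
print(clashes(staggered))  # []
```

Integer percentages are used instead of a float like 0.8 so the computed times stay exact whole seconds.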
If I get everything to work completely stable, I will post my settings. I am currently using the released version of RC3. I have not updated to any of the snapshots since RC3 was released.
-
I would like to see others post a message if they are having IPSec problems after upgrading to RC3. I am not as interested in seeing posts from those who have never set up a tunnel before, but from people like netmethods who had a stable system before RC3.
The developers are not going to fix anything if they believe that everything has been resolved. This is not meant as a criticism to any developer. I think all of you are great for putting your time into this project. I would just like to know if my problems are unique because of the equipment I am using or if problems are still widespread.
Thanks
-
I posted an update on the thread I started, if anyone is interested. It might help some people that are still having issues.
http://forum.pfsense.org/index.php/topic,20043.0.html
bkm, what error messages are you getting? I think you also posted somewhere about the tunnels taking a long time to come back up? I've noticed that it is taking much longer now to bring them up as well. It used to be pretty quick, but now it takes several minutes (5-10) to bring 6 tunnels up if I restart racoon.
-
I'm interested in what other devices everyone is using at the other end of the tunnels. I see in your other thread netmethods, you say
All ipsec tunnels are to sonicwalls with standard and enhanced os and one watchguard
I currently have a pfSense 1.2.2 box connecting to about 28 remote locations which are all using a mixture of pfSense 1.2.2 (1 - 1.2.3-PRE-testing but it's working so I don't want to upgrade just yet) and Linksys BEFVP41 and Linksys BEFSX41. These are stable for the most part, meaning they rarely go down and if they do it is usually something other than the equipment that causes it.
I am setting up a new box now because we are changing over to a new ISP and doing it slowly starting with our site-to-site VPNs. I built the box using 1.2.3-RC3 and moved 3 of the remote locations to it. 2 of these are pfSense (1.2.2 and the 1.2.3-PRE-testing) and one is a BEFVP41. The BEFVP41 has been going down every hour when the phase 2 expires. I could increase the lifetime but I want to know sooner when my changes don't work so I can try something else. I'm going to try the "prefer old SAs" later tonight now that the office is closed and see what happens.