So many filterdns instances…
-
2.1-BETA1 (i386)
built on Sat Jan 5 17:06:02 EST 2013
FreeBSD 8.3-RELEASE-p5
Now I should definitely have all the recent filterdns code changes. Still have the same symptoms, the table gets the correct 11 IP addresses translated from the names at boot. 5 minutes later, filterdns dies:
[2.1-BETA1][admin@imp-rt-01.imp.infn]/var/log(6): clog system.log | grep filterdns
Jan 6 11:55:27 imp-rt-01 kernel: pid 27624 (filterdns), uid 0: exited on signal 11
-
Hrm, strange that you see that.
5 minutes is the default update interval for rechecking names. I have run tests here with 5-second and 10-second update intervals and saw no issues in that regard!
That still makes me think the snaps do not have the latest version of filterdns. Can you make an md5 of your filterdns?
-
@ermal:
Can you make a md5 of your filterdns ?
MD5 (/usr/local/sbin/filterdns) = b25470f1942956d6f887ff87c99761c4
-
2.1-BETA1 (i386)
built on Sun Jan 6 11:15:50 EST 2013
FreeBSD 8.3-RELEASE-p5
MD5 (/usr/local/sbin/filterdns) = b25470f1942956d6f887ff87c99761c4
5 minutes after startup:
[2.1-BETA1][admin@rt-01.mydomain]/root(2): clog /var/log/system.log | grep filterdns
Jan 7 08:07:02 rt-01 kernel: pid 28781 (filterdns), uid 0: exited on signal 11
-
Just bumping up this thread, since filterdns is still exiting + dumping core (note: I had just updated to latest 2.1-BETA1 snapshot)
-
Bump from me also, now on:
2.1-BETA1 (i386)
built on Sun Jan 13 19:34:21 EST 2013
FreeBSD 8.3-RELEASE-p5
and still getting:
Jan 14 12:09:19 imp-rt-01 kernel: pid 34114 (filterdns), uid 0: exited on signal 11 (core dumped)
-
Some more information. filterdns only crashes if SIGHUP is received and it goes through the "Cleaning up previous hostnames" code:
Jan 16 08:57:26 imp-rt-01 filterdns: Received signal SIGHUP(1).
Jan 16 08:57:26 imp-rt-01 filterdns: Cleaning up previous hostnames
This happens as various interfaces and OpenVPN links come up during startup - filter reloads happen a few times, and are fed to filterdns. It dies with sig 11 at the next scheduled 5 minute wakeup.
Something in the reload of filterdns.conf and attempted preservation of existing threads, removal of threads no longer needed, and addition of threads to monitor new IPs, is freeing memory that is still needed. In filterdns.c, merge_config calls clear_config:
static void clear_config(struct thread_list *thrlist)
{
    struct thread_data *thr;

    pthread_mutex_lock(&sig_mtx);
    while ((thr = TAILQ_FIRST(thrlist)) != NULL) {
        if (debug >= 4)
            syslog(LOG_ERR, "Cleaning up hostname %s", thr->hostname);
        TAILQ_REMOVE(thrlist, thr, next);
        if (thr->thr_pid != 0)
            pthread_cancel(thr->thr_pid);
        clear_hostname_addresses(thr);
        if (thr->hostname)
            free(thr->hostname);
        if (thr->tablename)
            free(thr->tablename);
        free(thr);
    }
    pthread_rwlock_unlock(&main_lock);
}
merge_config sets thr_pid to 0 for threads that should continue on (do not need to be cancelled). But clear_config frees various data for the thread (hostname and tablename) and the thread data itself, even when the thread is not cancelled.
When the thread awakes in check_hostname at the 5 minute timer, it will have lost its thr data structure - reference to it will cause sig 11.
Perhaps it just needs this code for clear_config:
static void clear_config(struct thread_list *thrlist)
{
    struct thread_data *thr;

    pthread_mutex_lock(&sig_mtx);
    while ((thr = TAILQ_FIRST(thrlist)) != NULL) {
        if (debug >= 4)
            syslog(LOG_ERR, "Cleaning up hostname %s", thr->hostname);
        TAILQ_REMOVE(thrlist, thr, next);
        if (thr->thr_pid != 0) {
            pthread_cancel(thr->thr_pid);
            clear_hostname_addresses(thr);
            if (thr->hostname)
                free(thr->hostname);
            if (thr->tablename)
                free(thr->tablename);
            free(thr);
        }
    }
    pthread_rwlock_unlock(&main_lock);
}
Also, "pthread_rwlock_unlock(&main_lock);" at the end seems odd. Shouldn't it be "pthread_mutex_unlock(&sig_mtx);" - to match the "pthread_mutex_lock(&sig_mtx);" at the start of the routine?
@ermal: I don't have an environment to compile in, but this might give enough clues for you to track this down.
-
Thanks for the analysis, I pushed a fix.
-
Thanks, now it doesn't crash. But somewhere in the boot process, with OpenVPN links etc coming up, it has a point where it deletes all the table entries then does not recover them again. After boot, my table that should have 11 IP addresses is empty. The log indicates entries being deleted at one point.
As a side issue:
syslog(LOG_WARNING, "\t DELETED %d addresses(%d) to table %s.", io.pfrio_nadd, address->sa_family, pfd->tablename);
should be:
syslog(LOG_WARNING, "\t DELETED %d addresses(%d) to table %s.", io.pfrio_ndel, address->sa_family, pfd->tablename);
(the debug line is reporting pfrio_nadd when it needs to report pfrio_ndel)
If I restart filterdns (kill it by hand, then use Diagnostics: Execute Command: PHP Execute to run the following):
mwexec("/usr/local/sbin/filterdns -p {$g['varrun_path']}/filterdns.pid -i 300 -c {$g['varetc_path']}/filterdns.conf -d 10");
it comes up nicely and puts all 11 IPs in the table.
After this, the entries survive when I stop and start an OpenVPN client process - the log looks good.
@ermal: I will PM you a log of filterdns behaviour at boot with -d 10 set.
-
Also, in filterdns.c main, it:
a) reads the config, filling in thread_list
b) loops creating a check_hostname thread for each host
c) inits main_lock
d) creates the thread for merge_config
TAILQ_FOREACH(thr, &thread_list, next) {
    error = pthread_create(&thr->thr_pid, &attr, check_hostname, thr);
    if (error != 0) {
        if (debug >= 1)
            syslog(LOG_ERR, "Unable to create monitoring thread for host %s", thr->hostname);
    }
    pthread_set_name_np(thr->thr_pid, thr->hostname);
}
pthread_rwlock_init(&main_lock, NULL);
sig_mtx = PTHREAD_MUTEX_INITIALIZER;
sig_condvar = PTHREAD_COND_INITIALIZER;
error = pthread_create(&sig_thr, &attr, merge_config, NULL);
if (error != 0) {
    if (debug >= 1)
        syslog(LOG_ERR, "Unable to create signal thread %s", thr->hostname);
}
pthread_set_name_np(sig_thr, "signal-thread");
But check_hostname uses main_lock. So it is possible that main_lock is not initialized when check_hostname runs the first time.
Maybe that could cause some early accesses to thread_list to be inconsistent?
Maybe:
pthread_rwlock_init(&main_lock, NULL);
should be moved earlier in main.
-
I did correct the code, but I think the issue was mostly related to the getaddrinfo code not correctly reporting the EAGAIN error.
This made entries expire, though it does not explain why it does not re-enter them.
-
Upgraded to:
2.1-BETA1 (i386)
built on Fri Jan 18 03:21:43 EST 2013
It puts the 11 IP address entries in the table at the start, then sometime over the next few minutes, the addresses are all deleted from the table. The problem comes from when this message is reported 11 times (site names 1 to 11):
Jan 18 20:19:43 imp-rt-01 filterdns: Creating a new thread for host site1.dyndns-ip.com!
It already has all 11 threads for the 11 names in the table. Then, for whatever reason, it decides to create 11 new threads. In the process, it ends up clearing out the 11 table entries and never actually putting them back.
@ermal: I will send another full debug log.
-
Upgraded to today's latest snapshot, I'm still getting "exited on signal 11 (core dumped)" and I see only one filterdns process running (whereas in the past there used to be more filterdns processes – for ipsec / CP / etc)
-
I have been following this thread because of similar problems with filterdns crash/core dumps and I have an observation:
My problem seems to be related to the filterdns that gets started through the vpn/ipsec stuff.
After updating to the latest snapshot today:
2.1-BETA1 (amd64)
built on Fri Jan 18 04:21:30 EST 2013
FreeBSD 8.3-RELEASE-p5
I increased the filterdns debug level to 10 (in vpn.inc, line 984, '-d 10' switch) and clicked save on the VPN -> IPsec page to restart the filterdns process monitoring the vpn endpoints.
Here is the log output I get after this:
Jan 18 12:29:51 pfs check_reload_status: Syncing firewall
Jan 18 12:29:51 pfs filterdns: Found hostname vpn.net.loc with netmask 32.
Jan 18 12:29:51 pfs filterdns: found entry 10.5.0.6 for (null)
Jan 18 12:29:51 pfs filterdns: found entry 10.5.0.6 for (null)
Jan 18 12:29:51 pfs filterdns: entry 10.5.0.6 exists in table (null)
Jan 18 12:29:51 pfs filterdns: found entry 10.5.0.6 for (null)
Jan 18 12:29:51 pfs filterdns: entry 10.5.0.6 exists in table (null)
Jan 18 12:29:51 pfs filterdns: Found 1 entries for vpn.net.loc
Jan 18 12:29:51 pfs check_reload_status: Restarting ipsec tunnels
Jan 18 12:29:51 pfs filterdns: Ran command /usr/local/sbin/pfSctl -c "service reload ipsecdns" with exit status 0 because a dns change on hostname vpn.net.loc was detected.
Jan 18 12:29:53 pfs php: : IPSEC: One or more IPsec tunnel endpoints has changed its IP. Refreshing.
Jan 18 12:29:58 pfs php: : Could not determine VPN endpoint for 'WAN IPv4 IPsec Mobile Phase1 '
Jan 18 12:30:03 pfs php: : Could not determine VPN endpoint for 'WAN IPv4 IPsec Mobile Phase1 '
Jan 18 12:30:03 pfs filterdns: Received signal SIGHUP(1).
Jan 18 12:30:03 pfs kernel: pid 61925 (filterdns), uid 0: exited on signal 11 (core dumped)
This is probably not causing any real problems on my system, because my remote VPN endpoint DNS doesn't change. If it's related to the mobile IPsec phase 1 not having an endpoint, I am not sure how that would affect me. But I have noticed the core dump syslog messages, and I have read that there can be up to three running filterdns processes (filter, vpn, captiveportal).
Hope this helps…
-
I think all this happens because a filter reload overwrites the contents of the table with what the filter config sends in.
I changed filterdns again to force an update of the addresses in the table when a SIGHUP happens. Hopefully the updated filterdns will be in Monday's snapshot.
-
2.1-BETA1 (i386)
built on Sat Jan 19 20:44:40 EST 2013
Looking good - Alix nanoBSD test system has been up 9 hours. The table that should translate 11 names to 11 IPs now has 14 IP address entries. (3 of the names have dynamically switched IP in this time.) filterdns is adding to the table and not removing old entries, but I don't really care about that (feature or bug?)
-
2.1-BETA1 (i386)
built on Sat Jan 19 20:44:40 EST 2013
There have been a few more changes after that date, you will have to try again tomorrow or so with a newer snapshot.
-
I just upgraded to latest snapshot but still get filterdns problems:
FreeBSD fw.localdomain 8.3-RELEASE-p5 FreeBSD 8.3-RELEASE-p5 #1: Sat Jan 19 21:12:44 EST 2013 root@snapshots-8_3-i386.builders.pfsense.org:/usr/obj./usr/pfSensesrc/src/sys/pfSense_SMP.8 i386
MD5 (/usr/local/sbin/filterdns) = 6949816348947b7762586fe3c59b356e
…
Jan 21 00:05:28 fw kernel: pid 47308 (filterdns), uid 0: exited on signal 11 (core dumped)
Jan 21 00:05:29 fw check_reload_status: Restarting ipsec tunnels
Jan 21 00:05:30 fw login: login on ttyv0 as root
Jan 21 00:05:36 fw php: : IPSEC: One or more IPsec tunnel endpoints has changed its IP. Refreshing.
Jan 21 00:05:37 fw check_reload_status: Updating all dyndns
Jan 21 00:05:37 fw check_reload_status: Restarting OpenVPN tunnels/interfaces
Jan 21 00:05:38 fw check_reload_status: Reloading filter
Jan 21 00:05:40 fw kernel: pid 83611 (filterdns), uid 0: exited on signal 11 (core dumped)
-
dhatz, that probably happens because the upgrade is not replacing the filterdns process.
Can you kill all your filterdns processes before running an upgrade and try again, or extract the archive of the upgrade and manually install the filterdns binary? It is located in usr/local/sbin, iirc.
I am also tracking this issue of the upgrade sometimes not replacing binaries.
-
Indeed it seems that the filterdns binary is not replaced by the upgrade process.
I will upgrade as soon as a new 2.1 snapshot image becomes available (currently the latest snapshot is from 19-Jan) and let you know how it goes.
-
nanobsd upgrade to 2.1-BETA1 (i386) built on Tue Jan 22 05:52:55 EST 2013 gets the version feature (1.1), but that is kind of obvious since nanoBSD is provided with a full slice. We will see what dhatz gets with an upgrade of a full install.
filterdns working well for me - but it does accumulate all the IP addresses known to it over time for the list of names. My table now has 15 IPs for 11 names.
-
@ermal:
dhatz that happens probably because of upgrade is not replacing the filterdns process.
Can you kill all you filterdns processes before running an upgrade and try again or
I am tracking even this issue of upgrade not replacing binaries at some time.
Just a quick reminder that doing an upgrade still won't replace the old filterdns binary.
Btw I have tried killing all filterdns processes before running an upgrade (and verified they had been killed just before starting the upgrade procedure).
-
Is the issue fixed for you dhatz?
-
Ermal, I upgraded the filterdns binary to
MD5 (/usr/local/sbin/filterdns) = af355106eef6aff00d9e6653cca696eb
However it seems that the new filterdns needs too much memory at system startup, causing errors like:
swap_pager_getswapspace(16): failed
swap_pager_getswapspace(12): failed
swap_pager_getswapspace(6): failed
swap_pager_getswapspace(16): failed
swap_pager_getswapspace(12): failed
swap_pager_getswapspace(9): failed
and it dies shortly after…
I've been running the latest pfsense-2.1 snap in a 256MB VM for the past ~10 months and never had this type of problem before.
-
Do you have a long list of aliases in the one where you have the hostname?
-
However it seems that the new filterdns needs too much memory at system startup, causing errors like:
swap_pager_getswapspace(16): failed
swap_pager_getswapspace(12): failed
swap_pager_getswapspace(6): failed
swap_pager_getswapspace(16): failed
swap_pager_getswapspace(12): failed
swap_pager_getswapspace(9): failed
and it dies shortly after…
I'm seeing this too on my Atom box but not on my Whitebox. Same snap from yesterday and both amd64. Filterdns eats up all my memory and then uses up all the swap space before it dies.
-
@ermal:
Do you have a long list of aliases in the one where you have the hostname?
I have:
- two (2) aliases in fw -> aliases: www_google_com and www_paypal_com (note: this was just for testing)
- one (1) hostname in IPsec
- no "allowed hostnames" in CP
In the past (until ~2 months ago) these settings resulted in two filterdns processes: one for fw-aliases and one for ipsec, none for CP.
-
It's chewing through 100% cpu and all my RAM until it runs out of swap space for me, and I only have three aliases that contain hostnames.
truss -p on the filterdns proc shows it calling mmap over and over again.
mmap(0x0,1048576,PROT_READ|PROT_WRITE,MAP_PRIVATE|MAP_ANON,-1,0x0) = -1953497088 (0x8b900000)
mmap(0x0,1048576,PROT_READ|PROT_WRITE,MAP_PRIVATE|MAP_ANON,-1,0x0) = -1952448512 (0x8ba00000)
mmap(0x0,1048576,PROT_READ|PROT_WRITE,MAP_PRIVATE|MAP_ANON,-1,0x0) = -1951399936 (0x8bb00000)
mmap(0x0,1048576,PROT_READ|PROT_WRITE,MAP_PRIVATE|MAP_ANON,-1,0x0) = -1950351360 (0x8bc00000)
mmap(0x0,1048576,PROT_READ|PROT_WRITE,MAP_PRIVATE|MAP_ANON,-1,0x0) = -1949302784 (0x8bd00000)
I have a few decent-sized aliases but nothing huge. The filterdns.conf file on this box was only three lines. The size of the aliases involved in the filterdns thread were 1, 63, and 3. So it shouldn't have been all that busy/large.
68313 root 124 20 545M 543M RUN 0:05 35.35% filterdns{github.com}
68313 root 124 20 545M 543M RUN 0:05 35.35% filterdns{filterdns}
68313 root 124 20 545M 543M RUN 0:05 35.25% filterdns{some.other.hostname.you.dont.need.to.see}
262 root 76 20 3416K 736K kqread 1:14 12.35% check_reload_status
68313 root 76 20 545M 543M ucond 0:00 10.99% filterdns{signal-thread}
filterdns -v shows 1.1.
Size and sha256 match the one on the builder so it is the most current build. (tar is set to preserve old creation times, so of course the date doesn't update…)
-r-xr-xr-x 1 root wheel 24160 Nov 19 05:07 /usr/local/sbin/filterdns
SHA256 (/usr/local/sbin/filterdns) = 193ebd8250147041b79385d84efe0f5d09f9ce868ba666b18f91b5098ecce1f3
It was being run with the following parameters:
/usr/local/sbin/filterdns -p /var/run/filterdns.pid -i 300 -c /var/etc/filterdns.conf -d 1
-
@ermal - when you sort this out, and each time filterdns is updated, can you bump the version number in filterdns.c - that will make it very easy for us all to quickly see which version we have.
-
Should be better on the later snapshots.
Sorry for the noise.
-
Seems a bit better so far. I copied a binary off the builder to my box and it isn't constantly using that much CPU now. It did still spike up and use 100% total for about 20-30 sec, but it eventually slowed down and fell off the first screen of top output.
-
Latest snap v1.2 seems better (no more out-of-swap issues) but it still exits and dumps core:
Jan 28 06:34:45 fw kernel: pid 18434 (filterdns), uid 0: exited on signal 11 (core dumped)
Jan 28 06:34:53 fw kernel: pid 26566 (filterdns), uid 0: exited on signal 11 (core dumped)
Jan 28 06:35:23 fw kernel: pid 21538 (filterdns), uid 0, was killed: out of swap space
Jan 28 08:25:37 fw kernel: pid 49708 (filterdns), uid 0: exited on signal 11 (core dumped)
Jan 28 08:25:50 fw kernel: pid 71990 (filterdns), uid 0: exited on signal 11 (core dumped)
Jan 28 08:25:52 fw kernel: pid 81465 (filterdns), uid 0: exited on signal 11 (core dumped)
Jan 28 18:26:29 fw kernel: pid 10297 (filterdns), uid 0: exited on signal 11 (core dumped) <-- updated to latest snap
MD5 (/usr/local/sbin/filterdns) = aea0850239de6ab9817f9330f1807cec
SHA256 (/usr/local/sbin/filterdns) = f2c43ff8e8d6f21047c351e071a203df48bc2899ca7f1564a9cd1998e690081d
On my system there is currently only one filterdns process, whereas there should be a second one handling ipsec hostname(s) – at least that was the case until ~2 months ago.
Edit: There are only two filterdns-related files on my system:
/var/etc/filterdns.conf
pf www.google.com www_google_com
pf www.paypal.com www_paypal_com
and
/var/etc/ipsec/filterdns-ipsec.hosts
cmd vpn.example.com '/usr/local/sbin/pfSctl -c "service reload ipsecdns"'
(whereas vpn.example.com is the name used in P1 remote gw)
Finally /var/run/filterdns-ipsec.pid shows 10297 and timestamp 18:26 which is the process that had crashed earlier (see syslog extract copied above)
-
Found the issue: the ipsec instance is crashing for you.
Should be fixed in the next snapshot.
-
I'm afraid that even the latest snap is still crashing on my system, same symptoms as in my last post.
-
Some more protections will be in the next snapshots.
Though it runs happily here.
-
Sorry latest snap filterdns v1.2 still bombs out on my VM:
MD5 (/usr/local/sbin/filterdns) = feb00f677248ba323cfdf6398660653a
syslog:
Jan 29 23:56:14 fw kernel: pid 48762 (filterdns), uid 0: exited on signal 11 (core dumped)
Jan 29 23:56:30 fw kernel: pid 80109 (filterdns), uid 0: exited on signal 11 (core dumped)
ls -lR /var | fgrep filterdns:
-rw-r--r-- 1 root wheel 66 Jan 29 23:56 filterdns.conf
-rw-r--r-- 1 root wheel 75 Jan 29 23:56 filterdns-ipsec.hosts
-rw-r--r-- 1 root wheel 6 Jan 29 23:56 filterdns-ipsec.pid
-rw-r--r-- 1 root wheel 6 Jan 29 22:24 filterdns.pid <--- strange time-stamp
ps:
22425 ?? Is 0:00.03 /usr/local/sbin/filterdns -p /var/run/filterdns.pid -
filterdns.pid:
22425
filterdns-ipsec.pid:
80109
But if filterdns works fine for everyone else, maybe I should re-install my pfsense from scratch, or I can send you my /filterdns.core file (3.4MB) …
-
That's probably the best choice, I guess!
-
For the record, my filterdns is working OK on 3 systems running 2.1-BETA1 (i386) built on Tue Jan 29 16:42:56 EST 2013
My 11-entry table now has 12 entries, I guess one of the names in the list has changed its IP address, and the old value is also left in the table.
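For what it is worth, working out which addresses are left over is a simple set difference between the previous lookup result and the current one. The sketch below only illustrates that comparison and says nothing about how filterdns stores addresses internally; `find_stale` and its parameters are hypothetical, and addresses are plain strings to keep it short.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Collect every address that was in old_set but is missing from new_set.
 * At most max_stale results are stored in stale[]; returns how many.
 */
static size_t find_stale(const char *const *old_set, size_t n_old,
                         const char *const *new_set, size_t n_new,
                         const char **stale, size_t max_stale)
{
    size_t i, j, count = 0;

    assert(old_set != NULL && new_set != NULL && stale != NULL);
    for (i = 0; i < n_old; i++) {
        int still_present = 0;

        for (j = 0; j < n_new; j++) {
            if (strcmp(old_set[i], new_set[j]) == 0) {
                still_present = 1;
                break;
            }
        }
        if (!still_present && count < max_stale)
            stale[count++] = old_set[i];   /* candidate for removal */
    }
    return count;
}
```

Whether such leftovers should actually be removed from the table is a separate question (feature or bug?).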
I only have the 1 ordinary filterdns for pf.
-
I only have the 1 ordinary filterdns for pf.
Phil: Well, that might be difference, since in my test-VM I (should) have 2 filterdns processes (one for pf-fw-aliases and another for ipsec). The "ordinary" filterdns seems to work for me too, it's the ipsec-related one that bombs out …
Ermal: I don't see what good a full re-install from scratch will do (I guess in IT it's standard procedure LOL), but I'll try it anyway.
Update: I'm happy to report that I just upgraded the existing VM to the very latest snap (from 29-Jan to 30-Jan-2013 04:20:11 EST) and filterdns now seems to work correctly for ipsec too! Only odd thing I've noticed is that the /var/run/filterdns*.pid files seem to have old time-stamps.