Serial console not working on Alix after upgrading from 2.1.4 to 2.1.5

stephenw10

There was initially an issue with the Nano update files for 2.1.5. It was switching to Nano+VGA which removes the serial console. It was fixed though so if you actually upgraded yesterday from the webgui you won't have seen this problem.
Is it just the console menu you aren't seeing? You still see the boot messages up to 'bootup complete'?

Steve

silvertip257

@ MaxPF:
gotcha - I'm not extremely worried about the flashing LEDs.
I'm sure if I took the time, I could use the libraries pfsense uses to flash the LEDs and write something to stop them from flashing. Would be fun, but I have to figure out my other current problems first! :-\

@stephenw10:

There was initially an issue with the Nano update files for 2.1.5. It was switching to Nano+VGA which removes the serial console. It was fixed though so if you actually upgraded yesterday from the webgui you won't have seen this problem.
Is it just the console menu you aren't seeing? You still see the boot messages up to 'bootup complete'?

Steve

Hi Steve!

Yes, I did "just" upgrade yesterday.
I saw a thread about the 2.1.4 nanobsd URLs pointing at the VGA instead of serial ones (or whatever exactly). I figured I'd be in the clear since we're on 2.1.5 now…

Your reply to this thread made me think to check ... and "darn it!" -- I have a pound prompt # once again. Instead of the expected pfsense menu.


# echo $SHELL
/bin/sh

# uptime
 8:35PM  up 1 day, 36 mins, 1 user, load averages: 0.01, 0.02, 0.08

# uname -a
FreeBSD fw0.localdomain 8.3-RELEASE-p16 FreeBSD 8.3-RELEASE-p16 #0: Mon Aug 25 08:27:41 EDT 2014     root@pf2_1_1_i386.pfsense.org:/usr/obj.i386/usr/pfSensesrc/src/sys/pfSense_wrap.8.i386  i386

# cat /etc/version*
2.1.5-RELEASE
Mon Aug 25 07:44:26 EDT 2014
a94b9befbdee3beaea5741e105336c7fba77e3de
8.3
8.1

; now jump into the ALIX via SSH
[2.1.5-RELEASE][admin@fw0.localdomain]/root(2): fstat | awk '\!/CMD/{print $2}' | sort | uniq -c | sort -n | tail -n5
  25 racoon
  36 minicron
  36 syslogd
  52 php
 355 filterdns

[2.1.5-RELEASE][admin@fw0.localdomain]/root(4): top -d1 | head -n5
last pid: 95974;  load averages:  0.03,  0.05,  0.07  up 1+00:42:50    20:41:35
41 processes:  2 running, 39 sleeping

Mem: 37M Active, 11M Inact, 71M Wired, 1008K Cache, 33M Buf, 110M Free
Swap:

[2.1.5-RELEASE][admin@fw0.localdomain]/root(17): grep "`date +'%b %d'`" /var/log/system.log | grep kill
Sep 10 10:13:36 fw0 kernel: pid 5517 (php), uid 0, was killed: out of swap space
Sep 10 10:13:38 fw0 kernel: pid 6078 (php), uid 0, was killed: out of swap space
Sep 10 10:13:40 fw0 kernel: pid 5853 (php), uid 0, was killed: out of swap space
Sep 10 10:13:42 fw0 kernel: pid 7292 (php), uid 0, was killed: out of swap space
Sep 10 12:26:53 fw0 kernel: pid 86265 (php), uid 0, was killed: out of swap space
Sep 10 12:26:55 fw0 kernel: pid 87438 (php), uid 0, was killed: out of swap space
Sep 10 12:26:57 fw0 kernel: pid 86911 (php), uid 0, was killed: out of swap space
Sep 10 12:26:59 fw0 kernel: pid 85831 (php), uid 0, was killed: out of swap space
Sep 10 13:18:26 fw0 kernel: pid 51718 (php), uid 0, was killed: out of swap space
Sep 10 13:18:28 fw0 kernel: pid 52134 (php), uid 0, was killed: out of swap space
Sep 10 13:18:30 fw0 kernel: pid 51889 (php), uid 0, was killed: out of swap space
Sep 10 17:39:45 fw0 kernel: pid 66578 (php), uid 0, was killed: out of swap space
Sep 10 17:39:46 fw0 kernel: pid 66237 (php), uid 0, was killed: out of swap space

Holy cow … what in the blazes is going on with filterdns!?!?!?
355 open file handles in just over 24 hours.

And the notion of "out of swap space" is just not right, there is no swap space. I guess the likely case is that the system is low on RAM, throws this error, and starts OOM killing procs.

I'm open to ideas from the pfSense community as well as pfSene admins/developers.
I suspect my next step will be to re-examine my firewall rules … I don't have many and there aren't that many aliases.

stephenw10

Your use of awk and piped commands is impresive! :)
Check /etc/platform to make sure you're still on Nano. You are seeing far more on the serial port than you would in nano+vga.
Yes the swap errors are really out of memory in Nano. Are you running any packages?

Steve

silvertip257

@stephenw10:

Your use of awk and piped commands is impresive! :)
Check /etc/platform to make sure you're still on Nano. You are seeing far more on the serial port than you would in nano+vga.
Yes the swap errors are really out of memory in Nano. Are you running any packages?

Steve

awk, sed, regexes are part of my day job (I'm no expert, but I'm exercising my mind!) :)

Below includes my uptime (just as a reference, doesn't mean much as I had only a # prompt on serial console two days ago) and the output from /etc/platform … and yep it's nanobsd as expected.


# uptime
 8:30PM  up 3 days, 32 mins, 1 user, load averages: 0.00, 0.00, 0.03

# cat /etc/platform
nanobsd

I am not running any additional packages.
I have the following daemons running:

apinger
dhcpd
dnsmasq
ntpd
racoon (one IPSec tunnel to work)

I noticed the following in logs, which may indicate the source of the problem I'm seeing. But again, why in the heck does filterdns have so many open file handles!?


[2.1.5-RELEASE][admin@fw0.localdomain]/root(3): fstat | awk '\!/CMD/{print $2}' | sort | uniq -c | sort -n | tail -n5
  25 racoon
  36 minicron
  36 syslogd
  53 php
 969 filterdns               <------- holy smokes!

[2.1.5-RELEASE][admin@fw0.localdomain]/root(3): grep "`date +'%b %d'`" /var/log/system.log | grep -i ipsec

<...snipped...>

Sep 12 20:07:02 fw0 check_reload_status: Restarting ipsec tunnels
Sep 12 20:07:02 fw0 check_reload_status: Restarting ipsec tunnels
Sep 12 20:07:28 fw0 php: rc.newipsecdns: IPSEC: One or more IPsec tunnel endpoints has changed its IP. Refreshing.
Sep 12 20:07:29 fw0 php: rc.newipsecdns: IPSEC: One or more IPsec tunnel endpoints has changed its IP. Refreshing.
Sep 12 20:38:33 fw0 check_reload_status: Restarting ipsec tunnels
Sep 12 20:38:33 fw0 check_reload_status: Restarting ipsec tunnels
Sep 12 20:39:01 fw0 php: rc.newipsecdns: IPSEC: One or more IPsec tunnel endpoints has changed its IP. Refreshing.

I expect the large amount of IPSec instability is relative to low memory, which for now I'll blame on filterdns.

I do have a domain name in my IPSec config, which is largely unnecessary since that host has a static IP. I'm setting it to the corresponding static IP and will probably reboot my pfsense box to see if that remedies the ~24 hour OOM killing and the serial console droping to a # pound prompt.

Also some of my firewall rules have pf aliases, but I don't expect they'd cause filterdns to go ape.

silvertip257

Less than 12 hours later (I removed the host name from my IPSec config at nearly 21:00 and it's 08:15 now), I'm looking at a # pound prompt again.


[2.1.5-RELEASE][admin@fw0.localdomain]/root(2): uptime
 8:13AM  up 11:18, 2 users, load averages: 0.07, 0.35, 0.33

[2.1.5-RELEASE][admin@fw0.localdomain]/root(3): fstat | awk '\!/CMD/{print $2}' | sort | uniq -c | sort -n | tail -n5
  23 dhclient
  25 racoon
  36 minicron
  36 syslogd
  54 php

[2.1.5-RELEASE][admin@fw0.localdomain]/root(5): grep "`date +'%b %d'`" /var/log/system.log | egrep -i 'ipsec|check_reload_status'

<...snipped...>

Sep 13 08:08:55 fw0 check_reload_status: updating dyndns <apinger_gw_name>Sep 13 08:08:55 fw0 check_reload_status: Restarting ipsec tunnels
Sep 13 08:08:55 fw0 check_reload_status: Restarting OpenVPN tunnels/interfaces
Sep 13 08:08:55 fw0 check_reload_status: Reloading filter
Sep 13 08:08:55 fw0 check_reload_status: updating dyndns <apinger_gw_name>Sep 13 08:08:55 fw0 check_reload_status: Restarting ipsec tunnels
Sep 13 08:08:55 fw0 check_reload_status: Restarting OpenVPN tunnels/interfaces
Sep 13 08:08:55 fw0 check_reload_status: Reloading filter
Sep 13 08:09:22 fw0 php: rc.newipsecdns: IPSEC: One or more IPsec tunnel endpoints has changed its IP. Refreshing.
Sep 13 08:09:22 fw0 php: rc.newipsecdns: IPSEC: One or more IPsec tunnel endpoints has changed its IP. Refreshing.
Sep 13 08:16:05 fw0 check_reload_status: updating dyndns <apinger_gw_name>Sep 13 08:16:05 fw0 check_reload_status: Restarting ipsec tunnels
Sep 13 08:16:05 fw0 check_reload_status: Restarting OpenVPN tunnels/interfaces
Sep 13 08:16:05 fw0 check_reload_status: Reloading filter
Sep 13 08:16:05 fw0 check_reload_status: updating dyndns <apinger_gw_name>Sep 13 08:16:05 fw0 check_reload_status: Restarting ipsec tunnels
Sep 13 08:16:05 fw0 check_reload_status: Restarting OpenVPN tunnels/interfaces
Sep 13 08:16:05 fw0 check_reload_status: Reloading filter
Sep 13 08:16:32 fw0 php: rc.newipsecdns: IPSEC: One or more IPsec tunnel endpoints has changed its IP. Refreshing.
Sep 13 08:16:32 fw0 php: rc.newipsecdns: IPSEC: One or more IPsec tunnel endpoints has changed its IP. Refreshing.</apinger_gw_name></apinger_gw_name></apinger_gw_name></apinger_gw_name>

apinger logs show the custom/user-defined VPN end-point that I'm monitoring is down, then alarm canceled - but those are within no-less-than 12 minute periods (which doesn't match up with other logs).

–---

One positive result though, there are not a bunch of open file handles by filterdns since I removed the domain name from the IPSec config. Unfortunately the problem still raises its head, so I'm back to square one.

I have gateway monitoring set up for a host at the other end of the IPSec VPN tunnel. (I toned down the ping interval and all, so it's not steady every second.) In logs, it is odd that check_reload_status refers to the name of my apinger/gateway as a dyndns name (which it is not), then says the IP changed (it didn't), and then reloads my tunnel at insanely short intervals.

My next step is to disable that gateway monitoring to:

see if the IPSec tunnel is reloaded at the short intervals still
see if the pfSense serial console is again killed off in 12 hours (which might be closer to the actual elapsed time after reboot)

silvertip257

So I'm looking at just over 60 hours uptime and the serial console is back to a pound symbol (not the menu as expected). I did check at various intervals the previous day(s) and the menu displayed just fine.

I'm about out of things to try (short of disabling my IPSec tunnel, which is not a long-term option – defeats my reasons for running pfSense at that point).

Open to ideas.


[2.1.5-RELEASE][admin@fw0.localdomain]/root(1): uptime
10:32PM  up 2 days, 13:48, 2 users, load averages: 0.40, 0.20, 0.08

[2.1.5-RELEASE][admin@fw0.localdomain]/root(3): top -d1 | head -n5
last pid:  6435;  load averages:  0.16,  0.17,  0.07  up 2+13:48:56    22:33:39
40 processes:  2 running, 38 sleeping

Mem: 41M Active, 49M Inact, 72M Wired, 33M Buf, 68M Free
Swap:

[2.1.5-RELEASE][admin@fw0.localdomain]/root(4): fstat | awk '\!/CMD/{print $2}' | sort | uniq -c | sort -n | tail -n5
  24 tcsh
  25 racoon
  36 minicron
  36 syslogd
  53 php

# date
Mon Sep 15 22:36:09 EDT 2014
# uptime
10:36PM  up 2 days, 13:52, 2 users, load averages: 0.07, 0.13, 0.07

stephenw10

Hmm, puzzling. Are you still seeing 'out of swap space' errors? What about the filesystem? Are either of the ramdisk devices full, what does df show?

Steve

silvertip257

@stephenw10:

Hmm, puzzling. Are you still seeing 'out of swap space' errors? What about the filesystem? Are either of the ramdisk devices full, what does df show?

So last evening I rebooted my Alix before going to bed … just to see if I could get a 2+ day pattern out of the serial console menu/shell dying. Well here I am to say no … between then and now - shy of 20 hours it died off.


# uptime
 6:42PM  up 19:58, 1 user, load averages: 0.00, 0.02, 0.00

I do not see swap space errors (or the kill messages) in /var/log/system.log

As a recap, I have removed a DNS host name from my IPSec configuration and disabled apinger from monitoring my VPN end point.

There are not any physical ramdisk devices.


[2.1.5-RELEASE][admin@fw0.localdomain]/root(9): gpart show
=>     63  2006865  ad0  MBR  (980M)
       63   922257    1  freebsd  [active]  (450M)
   922320       63       - free -  (31k)
   922383   922257    2  freebsd  (450M)
  1844640   102816    3  freebsd  (50M)
  1947456    59472       - free -  (29M)

=>     0  922257  ad0s1  BSD  (450M)
       0      16         - free -  (8.0k)
      16  922241      1  !0  (450M)

=>     0  922257  ad0s2  BSD  (450M)
       0      16         - free -  (8.0k)
      16  922241      1  !0  (450M)

[2.1.5-RELEASE][admin@fw0.localdomain]/root(11): df
Filesystem        1K-blocks   Used  Avail Capacity  Mounted on
/dev/ufs/pfsense0    453327 150964 266097    36%    /
devfs                     1      1      0   100%    /dev
/dev/ufs/cf           50527   4015  42470     9%    /cf
/dev/md0              39406    404  35850     1%    /tmp
/dev/md1              59246  21396  33112    39%    /var
devfs                     1      1      0   100%    /var/dhcpd/dev

devfs file systems are 100% utilized – not sure if that's normal (ex: immediately post-reboot for example).

stephenw10

In nano /var and /tmp run in ram drives as standard. You can see them listed as md0 and md1 in df. If either of those become full you get problems but you would normally also see errors.

[2.1.5-RELEASE][root@pfsense.fire.box]/root(1): df
Filesystem        1K-blocks   Used  Avail Capacity  Mounted on
/dev/ufs/pfsense0    453327 246254 170807    59%    /
devfs                     1      1      0   100%    /dev
/dev/ufs/cf           50527   8345  38140    18%    /cf
/dev/md0              39406   3004  33250     8%    /tmp
/dev/md1              79086  26334  46426    36%    /var
devfs                     1      1      0   100%    /var/dhcpd/dev

Steve

nwebber

@silvertip257:

Holy cow … what in the blazes is going on with filterdns!?!?!?
355 open file handles in just over 24 hours.

I have this same issue - seems to be triggered by using names (FQDN) in firewall IP aliases and I've worked around it (avoided it) by using hardcoded IP addresses instead. The IPs I need to put into those rules are static anyway but it was a (conceptually) nice feature to be able to refer to them by name instead of number.

silvertip257

Been a bit busy lately and haven't rebooted my pfsense device since I last said I did. That was almost 13 days ago, but of course after 20 hours it had already dropped the serial console to a /bin/sh shell with a # pound prompt.

@nwebber:
Yeah, that's precisely what I ended up doing for my IPSec tunnel – no FQDNs. And I'll agree, being able to use FQDNs was a nice feature so I could use them as documentation of sorts.

@stephenw10:
md[01] are not full, none of them have utilization greater than 39% (md1 which is /var).
I have a 1GB Compact Flash card in my ALIX 2D13 which does not likely have any bearing on the problems. Thank you for your continued suggestions.

This is quite irritating … while I still have functional SSH access, in case of emergency the serial console is of little use.

At this point, I probably should:

create another complete thread (without the [SOLVED] portion in the title)
link back to this thread
continue troubleshooting

Community feedback and suggestions are welcome … I'm running out of ideas.

In a way, I'd almost be pacified knowing how to restart the serial console menu when it dies! (Clearly the serial shell dying should not happen, but so far I don't have a root cause. And being able to fix it is almost good enough for the time being.)

https://www.freebsd.org/doc/handbook/serialconsole-setup.html
https://www.freebsd.org/doc/en_US.ISO8859-1/articles/console-server/freebsd.html
https://www.freebsd.org/doc/faq/serial.html

FreeBSD's init

stephenw10

Hmm, confusing indeed. It looks like the console itself is not failing rather the shell is. The only vaguely similar thing I have seen is where the menu fails to spawn completely at the end of boot and you're left with no prompt at all. There were a few workarounds in that case. One was to simply change the serial baudrate. Seems unlikely to make a difference here but it's an gui setting now so an easy test. For example:
https://forum.pfsense.org/index.php?topic=34814.0
Could try running at the #:

/etc/rc.initial

Not entirely sure what that might do to an already running system so probably best not to do it in 'production'!

Steve

MaxPF

@silvertip257:

At this point, I probably should:

create another complete thread (without the [SOLVED] portion in the title)

link back to this thread

continue troubleshooting

I removed [SOLVED] from the title.

silvertip257

@stephenw10:

@stephenw10:

Hmm, confusing indeed. It looks like the console itself is not failing rather the shell is. The only vaguely similar thing I have seen is where the menu fails to spawn completely at the end of boot and you're left with no prompt at all. There were a few workarounds in that case. One was to simply change the serial baudrate. Seems unlikely to make a difference here but it's an gui setting now so an easy test. For example:
https://forum.pfsense.org/index.php?topic=34814.0
Could try running at the #:
/etc/rc.initial
Not entirely sure what that might do to an already running system so probably best not to do it in 'production'!

Steve

I previously had the baud rate set to 9600 (default), but my ALIX BIOS uses 38400 … so a few weeks back I switched it to 38400 on pfsense as well.

In my case the menu does spawn on boot, but dies some hours later. Each time (in this thread) I've said I rebooted my ALIX, I did verify I had a workable menu on the serial console.

I just tested running /etc/rc.initial from the serial console ... which so far appears not to have any ill effects (though I'm typing this up a few minutes after running it!). And restores the serial console menu.


[2.1.5-RELEASE][admin@fw0.localdomain]/root(4): date
Sat Oct 11 18:17:45 EDT 2014

[2.1.5-RELEASE][admin@fw0.localdomain]/root(5): uptime
 6:17PM  up 25 days, 19:33, 2 users, load averages: 0.00, 0.00, 0.00

[2.1.5-RELEASE][admin@fw0.localdomain]/root(10): ps auxff | grep rc.init
root   83274  0.0  0.6  3644  1512  u0  I+    6:11PM   0:00.02 /bin/sh /etc/rc.initial
root   36527  0.0  0.4  2444  1060   0  R+    6:21PM   0:00.00 grep rc.init
root   80618  0.0  0.5  3644  1320   0  Is    6:11PM   0:00.02 /bin/sh /etc/rc.initial

@MaxPF:

@MaxPF:

<…snipped...>

I removed [SOLVED] from the title.

Thanks Max!
I'll continue to use this thread at this point.