LCDProc 0.5.4-dev

mdima

hehe! sorry buddy, if some watchguard representative sends me a couple of Fireboxes I can test them also! :D

tix

New test driver:

https://github.com/downloads/fmertz/sdeclcd/sdeclcd.so

I removed the call to the process scheduler. Give it a try…

Downloaded this driver and going to try it. Left everything else unchanged from the .9 dev package and will see if the driver alone makes any difference in the morning. If the driver doesn't help, I will restore the original .9 driver and change to LCDd 0.53 in stephenw10's manual package. Seems a methodical approach should help me narrow this down.

I haven't had any resource issues other than LCDd locking CPU to 100% until I kill it. Even in my box with only 256M, I still have over 128M free and no swap in use. In fact, today it ran for 8 hours at 100% while I was at work and continued to route and firewall properly.


load averages: 10.06,  9.81,  9.30
101 processes: 13 running, 76 sleeping, 12 waiting
CPU: 20.4% user,  0.0% nice, 78.6% system,  1.0% interrupt,  0.0% idle
Mem: 62M Active, 12M Inact, 35M Wired, 25M Buf, 125M Free
Swap: 512M Total, 512M Free

  PID USERNAME PRI NICE   SIZE    RES STATE    TIME   WCPU COMMAND
12019 nobody    74  r30  3368K  1496K RUN    528:54 100.00% LCDd

Lastly, my math was off in a previous post: my failures all seem to start at around 9 hours (+/- 1 hour) of uptime (not 16 as previously reported).

I will report status in the morning and with any luck the new driver resolves this.

stephenw10

I'm trying to run stuff for at least 24hrs so I can be relatively sure there's no problem / is a problem.
For reference I'm now coming up to 21hrs running LCDd 0.53 with the old driver, no problems.


last pid: 21474;  load averages:  1.65,  1.37,  1.43                                up 10+14:12:56  14:02:28
107 processes: 5 running, 85 sleeping, 1 zombie, 16 waiting
CPU: 46.3% user,  0.0% nice, 53.4% system,  0.4% interrupt,  0.0% idle
Mem: 58M Active, 18M Inact, 59M Wired, 152K Cache, 59M Buf, 350M Free
Swap:

  PID USERNAME PRI NICE   SIZE    RES STATE    TIME   WCPU COMMAND
   10 root     171 ki31     0K     8K RUN    168.6H 33.98% idle
31442 root      76    0 43356K 14908K RUN      0:14  3.96% php
30124 root      76    0 43356K 15060K accept   0:27  2.98% php
55054 root      76    0 43356K 15060K ppwait   0:27  2.98% php
51320 root      76    0 43356K 14908K accept   0:16  2.98% php
   11 root     -32    -     0K   128K WAIT    26:28  0.00% {swi4: clock}
   11 root     -68    -     0K   128K WAIT    19:58  0.00% {irq18: em0 ath0+}
63361 nobody    44    0  3364K  1460K RUN     10:14  0.00% LCDd

The important thing here is ~10 minutes of CPU time in 21 hours.

Steve

Edit: Actually that looked like too much CPU time (though not by much). Something odd happened at 4:24am which may have used some cycles:


Jan 26 04:24:34 	LCDd: Connect from host 127.0.0.1:17658 on socket 11
Jan 26 04:24:32 	php: lcdproc: Connection to LCDd process lost ()
Jan 26 04:24:24 	LCDd: sock_send: socket write error
Jan 26 04:24:24 	LCDd: sock_send: socket write error
Jan 26 04:24:24 	LCDd: sock_send: socket write error
Jan 26 04:24:24 	LCDd: Client on socket 11 disconnected
Jan 26 04:24:24 	LCDd: error: huh? Too much data received... quiet down!

It doesn't seem to have affected it though. It recovered from said event without issue.

stephenw10

Ok, so 24 hours is up running 0.53 LCDd and the old sdec driver and some interesting results are in!
First off I have had no problem accessing the box during that time and the lcdclient and LCDd processes have run solidly with only one instance of each.

Much more interestingly, twice during the 24hr period the logs show the error from the post above, 'too much data received'. It is clear that the process recovers and carries on with seemingly no other effects; however, it's also clear that during that 'event' the LCDd process uses far more CPU cycles.
After 24hrs:


 PID USERNAME PRI NICE   SIZE    RES STATE    TIME   WCPU COMMAND
   10 root     171 ki31     0K     8K RUN    171.4H 98.00% idle
   11 root     -32    -     0K   128K WAIT    26:54  0.00% {swi4: clock}
63361 nobody    44    0  3364K  1460K nanslp  25:56  0.00% LCDd

My original estimate was that it should consume around 2minutes in 24hrs.
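To put the top figures side by side, here is a rough Python sketch that converts the TIME column into a share of wall-clock time. The helper names are made up for illustration, and it assumes the MM:SS form that top shows for LCDd above (not the "168.6H" form used for the idle process).

```python
def top_time_to_seconds(field: str) -> int:
    """Convert a top TIME field of the form MM:SS (e.g. "25:56") to seconds."""
    minutes, seconds = field.split(":")
    return int(minutes) * 60 + int(seconds)


def cpu_share(time_field: str, wall_hours: float) -> float:
    """Fraction of wall-clock time spent on CPU over the given number of hours."""
    return top_time_to_seconds(time_field) / (wall_hours * 3600)


# Samples taken from the posts above, plus the 2-minute estimate.
samples = [
    ("LCDd after ~21h", "10:14", 21),
    ("LCDd after 24h", "25:56", 24),
    ("original estimate", "2:00", 24),
]
for label, field, hours in samples:
    print(f"{label}: {cpu_share(field, hours):.2%}")
```

That works out to roughly 0.8% after 21 hours and 1.8% after 24 hours, against about 0.14% for the 2-minute estimate, so most of the extra CPU time seems to come from the 'too much data received' events.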

So is it possible that either the newer driver or LCDd process is unable to recover from the 'too much data received' event?

One other thing I noted is that the LCDd process is run with nice level 0 which doesn't seem right.

Steve

mdima

Hi, following Steve's experience, I am trying to "slow down" the panel a bit… for example I changed TitleSpeed to 5 (as it was in the configuration of the tarball package). This affects the screens that have "scrolling"... let's see how it goes, I will test it for some days...

Ciao,
Michele

mdima

reading some documentation about this "LCDd: error: huh? Too much data received… quiet down" error, maybe I could add some delay (10, 20ms) between each command the client sends to LCDd...
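A minimal sketch of that idea, in Python rather than the PHP the package client actually uses. The function name `send_throttled` and the 20 ms default are illustrative assumptions, not code from the package; `sock` is anything with a `sendall` method, such as a connected TCP socket.

```python
import time


def send_throttled(sock, commands, delay_s=0.02, sleep=time.sleep):
    """Send each command as one newline-terminated line, pausing between sends.

    delay_s is the pause between consecutive commands (0.02 s = 20 ms).
    sleep is injectable so the pacing can be checked without real waiting.
    Returns the number of commands sent.
    """
    sent = 0
    for index, command in enumerate(commands):
        if index > 0:
            sleep(delay_s)  # pause before every command except the first
        sock.sendall((command + "\n").encode("ascii"))
        sent += 1
    return sent
```

With a real connection this would be called on a socket connected to LCDd (by default it listens on 127.0.0.1 port 13666). Whether a delay actually avoids the error is untested here; it only spaces the writes out.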

stephenw10

This is certainly a great learning experience! :)
It seems that the 'too much data received' message only exists in 0.53.
It seems to be an error related to the amount of data sent rather than the speed. More than 7168B (perhaps bits?).
Subsequent versions attempt to read all the data into a buffer and process it. My guess is that the buffer is filled and it gets stuck in a loop but we're not seeing any of the warning messages for some reason (even though I have turned the logging level up to 5).
Even in 0.53:


} else if (nbytes > (MAXMSG - (MAXMSG / 8)))	/* Very noisy client...*/
	{
		sock_send_string(clientSocketMap->socket, "huh? Too much data received... quiet down!\n");
		report(RPT_WARNING, "%s: Too much data received on socket %d", 
                       __FUNCTION__, clientSocketMap->socket);
		return -1;

We are seeing the error message sent back to the client but not the warning report.
I'm not sure how internal 'sockets' work so I'm guessing here.
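For what it's worth, the 7168 figure falls straight out of the condition in that snippet if MAXMSG is 8192. That value is an assumption here, picked because it reproduces 7168; check it against the headers of the version you build. A small Python sketch of the same arithmetic (the function names are made up for illustration):

```python
MAXMSG = 8192  # assumption: the value that reproduces the 7168 figure above


def noisy_threshold(maxmsg: int = MAXMSG) -> int:
    """Mirror of the C expression MAXMSG - (MAXMSG / 8) using integer division."""
    return maxmsg - maxmsg // 8


def is_too_noisy(nbytes: int, maxmsg: int = MAXMSG) -> bool:
    """True when a single read would trigger the 'quiet down' branch."""
    return nbytes > noisy_threshold(maxmsg)


print(noisy_threshold())
```

Since the variable compared is called nbytes, the limit is most likely in bytes rather than bits.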

Steve

mdima

@stephenw10:

It seems to be an error related to the amount of data sent rather than the speed. More than 7168B (perhaps bits?).

could be, but a little delay in sending the data could help in flushing the buffer… I don't know, but since fortunately I am now having the problem too, I only changed the "scrolling" delay. If this change solves the problem I will post an update... but since I can now reproduce the error I also have some investigating to do...

Thanks,
Michele

jpsb

@fmertz:

@jpsb:

Hi, I'm having a problem with the display running on the alix2d13 hardware.
U204FB-A1 20x4 Display

What is the driver for this LCD?

I use the hd44780 driver, on a usb port.
The system runs pfsense 2.0.1 i386 on a 4 Gb CF-card
U204FB-A1 20x4 Display (LCD2USB)(Controller hd44780)

I have another setup with a
Asus Hummingbird Atom D510
4Gb Ram
250Gb HD
U204FB-A1 20x4 Display (LCD2USB)(Controller hd44780)
pfsense 2.0.1 64bit

I've no problem with this system.

tix

Well I had no luck with the "test" sdeclcd.so driver. Hit 100% CPU after 10:35 uptime. Interestingly I watched it go from 72% at 10 hours to 100% 35 mins later.

I'm now going to install the same config as stephenw10 as well and try. stephenw would you mind reposting the tarball in this thread for ease of finding?

Hi, following Steve's experience, I am trying to "slow down" the panel a bit… for example I changed TitleSpeed to 5 (as it was in the configuration of the tarball package). This affects the screens that have "scrolling"... let's see how it goes, I will test it for some days...

Ciao,
Michele

My screens' default refresh interval is 5 seconds.

stephenw10

Here you go. Remove the .png extension.

We need to test either the new driver compiled against 0.53 or the old driver compiled against 0.55. I'm sure I have one of those here somewhere.
Bah! I have many files all named sdeclcd.so. ::)

Steve

lcdd5.tar.png

tix

Which one is this tarball compiled against? I have just completed installing the sdeclcd.so and LCDd from this tarball; all other files are unchanged from the -DEV v. 0.9 (lcdproc-0.5.5) package. Should know something in about 10 hours ;D

I will be happy to coordinate testing with everyone - Just let me know what version you're using and I will run another configuration…

Brak

Anyone with a compiler setup want to help test this EZIO-100/MTB134 driver? I found it online, but it appears abandoned - I'm not sure if it will work or not. I tried to get it to compile, but clearly pfSense isn't meant to be used for compiling.

Attachments have .png appended to satisfy the attachment rules.

mtb124.h.png
mtb-134.c.png

mdima

@tix:

Well I had no luck with the "test" sdeclcd.so driver. Hit 100% CPU after 10:35 uptime. Interestingly I watched it go from 72% at 10 hours to 100% 35 mins later.
…
My screens' default refresh interval is 5 seconds.

Guys, I have 2 servers running pfSense: one with a refresh of 1 second, on which I have NO PROBLEMS, and one with a refresh of 5 seconds, on which I get the problem. The servers use the same panel (sureelect).

The client goes to "sleep" for the refresh interval in seconds multiplied by the number of screens available (I thought this was the best way not to waste resources, since every screen is shown once in that many seconds).

Can you please ALL try a refresh of 1 second??

Thanks,
Michele

tix

I think we are making progress for the sdeclcd driver. I installed the sdeclcd.so and LCDd versions provided by Steve and I'm happy to report that after 13 hours of uptime I still have a working LCD display and a responsive machine.

This may be short-lived as I am seeing the usage of LCDd climb, though not as quickly as with the newer versions: after 13 hours, LCDd has run for 10:15 and is showing 0% CPU.

I'm going to stay with the current configuration until I reach 24 hours uptime or LCDd hits 100% before I change to a refresh interval of 1 sec as suggested by Michele.

I will post the status later when I get back home….. but it's looking better ;D

stephenw10

Here's something perhaps of note:


[2.0.1-RELEASE][root@pfsense.fire.box]/root(11): clog /var/log/system.log | grep huh
Jan 26 04:24:24 pfsense LCDd: error: huh? Too much data received... quiet down!
Jan 26 15:41:46 pfsense LCDd: error: huh? Too much data received... quiet down!
Jan 27 03:45:35 pfsense LCDd: error: huh? Too much data received... quiet down!
Jan 27 15:01:05 pfsense LCDd: error: huh? Too much data received... quiet down!

Because I was able to predict when it would happen I could watch top, and found that even though the logs show the event taking only 10 seconds, in fact LCDd is stuck at 100% for 15 minutes before that.
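The spacing between those four log entries can be checked with a short Python sketch. Syslog lines carry no year, so 2012 is assumed here purely to make the timestamps parseable; the function names are illustrative.

```python
from datetime import datetime

ASSUMED_YEAR = 2012  # assumption: syslog timestamps omit the year

LOG_LINES = [
    "Jan 26 04:24:24 pfsense LCDd: error: huh? Too much data received... quiet down!",
    "Jan 26 15:41:46 pfsense LCDd: error: huh? Too much data received... quiet down!",
    "Jan 27 03:45:35 pfsense LCDd: error: huh? Too much data received... quiet down!",
    "Jan 27 15:01:05 pfsense LCDd: error: huh? Too much data received... quiet down!",
]


def event_times(lines, year=ASSUMED_YEAR):
    """Timestamps of every 'Too much data received' line, in file order."""
    times = []
    for line in lines:
        if "Too much data received" not in line:
            continue
        stamp = " ".join(line.split()[:3])  # e.g. "Jan 26 04:24:24"
        times.append(datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S"))
    return times


def gaps(times):
    """Time differences between consecutive events."""
    return [later - earlier for earlier, later in zip(times, times[1:])]


for gap in gaps(event_times(LOG_LINES)):
    print(gap)
```

The gaps come out at 11:17:22, 12:03:49 and 11:15:30, so the event recurs roughly every 11 to 12 hours on this box.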

That is with LCDd 0.53, old sdec driver, 0.8 package code and refresh set to 5 seconds.

Testing now as above but refresh set to 2 seconds. Can't set to 1 second with 0.53:


Jan 27 15:09:39 	LCDd: Waittime should be at least 2 (seconds). Set to 2 seconds.

Steve

@tix: Are you seeing errors in the logs?

mdima

Steve,
looking at my secondary machine, I have the feeling that the problems are related to the "scrolling" feature of the panel.

In fact I sometimes see frozen screens where there is scrolling… I will keep an eye on it and try to see if it is the problem...

Ciao,
Michele

tix

Steve I get the same log entries but they occur at the same time yet the display continues to work unlike with the newer code.


Jan 27 05:45:18 pfsense LCDd: error: huh? Too much data received... quiet down!
Jan 27 05:45:18 pfsense LCDd: Client on socket 11 disconnected
Jan 27 05:45:18 pfsense LCDd: sock_send: socket write error
Jan 27 05:45:18 pfsense LCDd: sock_send: socket write error
Jan 27 05:45:18 pfsense LCDd: sock_send: socket write error
Jan 27 05:45:43 pfsense php: lcdproc: Connection to LCDd process lost  ()
Jan 27 05:45:44 pfsense LCDd: Connect from host 127.0.0.1:8170 on socket 11

What's interesting to me is that this is right at the 10 hour uptime mark where the newer versions stopped working. I wonder if there is something time related causing this as anything newer than 0.53 version of LCDd breaks on my system after 10 hours?? I wouldn't think so but it's strange it was always around 10 hours before reverting…. weird...

stephenw10

Interesting that your box (X700?) takes a lot longer than 10 seconds to sort itself out in the log.
The 0.53 code just gives up and errors out, whereas newer versions include code to handle the extra data so they keep trying.

Steve

fmertz

@tix:

Well I had no luck with the "test" sdeclcd.so driver. Hit 100% CPU after 10:35 uptime. Interestingly I watched it go from 72% at 10 hours to 100% 35 mins later.

Ok, so leaving the process out of "realtime round robin", and leaving it with default priority had no effect.

Long shot: When running at 100%, try to "kill" LCDd with signal 6 (kill -6 <pid of LCDd>). This should give a memory image of the process (core dump). If you can make the core file available, I can try loading it up in the debugger and see where the execution ended. The trick is that this needs to be a version of LCDd I have the code for, like V0.5.5, so the debugger can match the binary with the source. I have never done this, so this will probably lead nowhere…