LCDProc 0.5.4-dev
-
@tix:
Well I had no luck with the "test" sdeclcd.so driver. Hit 100% CPU after 10:35 uptime. Interestingly I watched it go from 72% at 10 hours to 100% 35 mins later.
Ok, so leaving the process out of "realtime round robin", and leaving it with default priority had no effect.
Long shot: When running at 100%, try to "kill" LCDd with signal 6 (kill -6 <pid of LCDd>). This should give a memory image of the process (core dump). If you can make the core file available, I can try loading it up in the debugger to see where the execution ended. The trick is that this needs to be a version of LCDd I have the code for, like V0.5.5, so the debugger can match the binary with the source. I have never done this, so this will probably lead nowhere…
-
Could try compiling LCDd with the debug option enabled to get far more logging output.
Steve
-
Could try compiling LCDd with the debug option enabled to get far more logging output.
MyCommand = YourWish;
-
I will try using kill -6 tomorrow; for now I'm enjoying everything working on my x700. ;D
I'm still hung up on the idea of some kind of time issue. I see a problem every 10 hours. Here is the log from this morning and after running during the day today:
Jan 27 05:45:18 pfsense LCDd: error: huh? Too much data received... quiet down!
Jan 27 05:45:18 pfsense LCDd: Client on socket 11 disconnected
Jan 27 05:45:18 pfsense LCDd: sock_send: socket write error
Jan 27 05:45:18 pfsense LCDd: sock_send: socket write error
Jan 27 05:45:18 pfsense LCDd: sock_send: socket write error
Jan 27 05:45:43 pfsense php: lcdproc: Connection to LCDd process lost ()
Jan 27 05:45:44 pfsense LCDd: Connect from host 127.0.0.1:8170 on socket 11
...
Jan 27 15:48:23 pfsense LCDd: error: huh? Too much data received... quiet down!
Jan 27 15:48:23 pfsense LCDd: Client on socket 11 disconnected
Jan 27 15:48:23 pfsense LCDd: sock_send: socket write error
Jan 27 15:48:49 pfsense php: lcdproc: Connection to LCDd process lost ()
Jan 27 15:48:50 pfsense LCDd: Connect from host 127.0.0.1:8576 on socket 11
10 hours apart and the 05:45 was 10 hours of uptime!
As it stands, everything is working great (excluding the log entries) on v0.53 kernel module and v0.53 LCDd. The display continues to work with the default refresh of 5 secs and the webif and ssh connections are responsive. In fact, I would happily accept this level of functionality permanently. :)
But in the interest of perfection, I will apply the v0.9 package kernel mod and LCDd and when it stops responding on the webif after what I believe will be 10 hours of uptime, will kill it with the -6 option (instead of 15). The next step for me after that will be to use the debug-enabled LCDd and wait.
-
A very interesting result:
[2.0.1-RELEASE][root@pfsense.fire.box]/root(2): clog /var/log/system.log | grep huh
Jan 26 04:24:24 pfsense LCDd: error: huh? Too much data received... quiet down!
Jan 26 15:41:46 pfsense LCDd: error: huh? Too much data received... quiet down!
Jan 27 03:45:35 pfsense LCDd: error: huh? Too much data received... quiet down!
Jan 27 15:01:05 pfsense LCDd: error: huh? Too much data received... quiet down!
Jan 27 17:13:45 pfsense LCDd: error: huh? Too much data received... quiet down!
Jan 27 19:16:44 pfsense LCDd: error: huh? Too much data received... quiet down!
Jan 27 21:18:07 pfsense LCDd: error: huh? Too much data received... quiet down!
Jan 27 23:23:00 pfsense LCDd: error: huh? Too much data received... quiet down!
I changed the refresh time from 5 seconds to 1 second at 15.09. (1 second was seemingly auto changed to 2)
The logs show that gap between errors reduced from ~11 hours to ~ 2 hours.
This implies that the problem lies in the total data or number of screen refreshes sent, not the actual time or uptime.
Steve
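For anyone who wants to check the spacing of these errors on their own box, here is a minimal sketch that computes the gap between consecutive "Too much data received" entries. The timestamps are copied from the grep output above; the year is an assumption (syslog lines do not carry one), and the function names are made up for illustration.

```python
from datetime import datetime

# Timestamps of the "Too much data received" errors, copied from the grep output.
ERROR_TIMES = [
    "Jan 26 04:24:24",
    "Jan 26 15:41:46",
    "Jan 27 03:45:35",
    "Jan 27 15:01:05",
    "Jan 27 17:13:45",
    "Jan 27 19:16:44",
    "Jan 27 21:18:07",
    "Jan 27 23:23:00",
]

ASSUMED_YEAR = 2012  # assumption: syslog timestamps do not include the year


def parse_syslog_time(stamp, year=ASSUMED_YEAR):
    """Parse a 'Mon DD HH:MM:SS' syslog timestamp, supplying the missing year."""
    return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")


def gaps_in_hours(stamps):
    """Return the gap in hours between each pair of consecutive timestamps."""
    times = [parse_syslog_time(s) for s in stamps]
    return [(b - a).total_seconds() / 3600 for a, b in zip(times, times[1:])]


if __name__ == "__main__":
    for stamp, gap in zip(ERROR_TIMES[1:], gaps_in_hours(ERROR_TIMES)):
        print(f"{stamp}: {gap:.2f} h since previous error")
```

With these eight entries the first three gaps come out above 11 hours and the last four just over 2 hours, which matches the reading above.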
-
Steve,
can you please try this: Add only screens that do not have any scrolling. When I stopped using "scrolling screens", the problem looked solved on my machine.
By "scroll" I mean when the text is wider than your screen, so it scrolls left/right.
Thanks,
Michele
-
Long shot: When running at 100%, try to "kill" LCDd with signal 6 (kill -6 <pid of LCDd>). This should give a memory image of the process (core dump). If you can make the core file available, I can try loading it up in the debugger to see where the execution ended. The trick is that this needs to be a version of LCDd I have the code for, like V0.5.5, so the debugger can match the binary with the source. I have never done this, so this will probably lead nowhere…
fmertz - LCDd hit 100% after 10 hours as suspected. I killed LCDd with "kill -6 <pid>" but it did not leave a core file, or not one I can find. I assume it would be named core–-- or similar, and a find on the filesystem doesn't locate any core files. Am I just looking in the wrong place?
My next step is to test with the debug-enabled LCDd, leaving the rest of v0.9 untouched.
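One possible reason for a missing core file is the core size limit of the environment that started the process. Here is a minimal sketch that reports that limit for the current process and confirms that signal 6 is SIGABRT. It is only a hint: it does not inspect LCDd itself, and the function name is made up. On FreeBSD, as far as I know, the core file name is controlled by the kern.corefile sysctl, and processes that changed user ID are not dumped unless kern.sugid_coredump is set. That may matter if LCDd drops privileges, but it is an assumption to verify on the box.

```python
import resource
import signal


def core_dump_report():
    """Describe the core-file size limit of the current process."""
    soft, hard = resource.getrlimit(resource.RLIMIT_CORE)

    def show(limit):
        # RLIM_INFINITY means no size limit is applied to core files.
        return "unlimited" if limit == resource.RLIM_INFINITY else f"{limit} bytes"

    return {
        "signal_6_is_SIGABRT": int(signal.SIGABRT) == 6,
        "soft_limit": show(soft),
        "hard_limit": show(hard),
        # A soft limit of 0 means no core file is written at all.
        "core_files_disabled": soft == 0,
    }


if __name__ == "__main__":
    for key, value in core_dump_report().items():
        print(f"{key}: {value}")
```

The limit is inherited, so the result is only meaningful when the script is run from the same shell or startup environment that launches LCDd.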
A very interesting result:
I changed the refresh time from 5 seconds to 1 second at 15.09. (1 second was seemingly auto changed to 2)
The logs show that gap between errors reduced from ~11 hours to ~ 2 hours.
This implies that the problem lies in the total data or number of screen refreshes sent, not the actual time or uptime.
Steve
By my calculations, you are hitting the problem at (7200 [2 hrs in secs] / 2 secs per update =) 3600 'updates' and I'm hitting it at (36000 [10 hrs in secs] / 5 secs per update =) 7200 'updates'. That is interesting as well, since 3600 is half of 7200.
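The same arithmetic as a small sketch, in case others want to plug in their own refresh interval and time to failure. The function name is made up, and the inputs are the rounded figures from this thread (about 2 hours at an effective 2 second refresh, about 10 hours at a 5 second refresh).

```python
def updates_before_failure(hours_to_failure, refresh_seconds):
    """Number of screen updates sent before the error shows up."""
    return hours_to_failure * 3600 // refresh_seconds


at_2s = updates_before_failure(2, 2)    # ~2 h between errors, 2 s refresh
at_5s = updates_before_failure(10, 5)   # ~10 h between errors, 5 s refresh

if __name__ == "__main__":
    print(at_2s, at_5s, at_5s / at_2s)
```

The two counts differ by a factor of two, so the figures do not point to one fixed update count; they only suggest that the refresh rate matters.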
-
I ran into that twice installing the pfSense LCDproc 5.5 Dev v0.8 package. I had to manually install the package file after installing the pfSense package, because there were no LCDproc 5.5 core files, just the pfSense PHP front end.
So first install the pfSense LCDproc 5.5 Dev package and then do the following.
Here is the link to the core files. To install them, go to the console and run:
pkg_add -r http://files.pfsense.org/packages/8/All/lcdproc-0.5.5.tbz
-Joe Cowboy
-
I ran into that twice installing the pfSense LCDproc 5.5 Dev package. I had to manually install the package file after installing the pfSense package, because there were no LCDproc 5.5 core files, just the pfSense PHP front end.
So first install the pfSense LCDproc 5.5 Dev package and then do the following.
Here is the link to the core files. To install them, go to the console and run:
pkg_add -r http://files.pfsense.org/packages/8/All/lcdproc-0.5.5.tbz
-Joe Cowboy
what ver of pfsense are you running? i'm using 2.1-dev and have to manually install binaries because the box is trying to install pbi instead… gets annoying but i've gotten used to it..
-
I am running 2.1-dev – LCDProc 0.5.5-dev v0.8. I didn't realize he had just updated to v0.9..... So I just did a reinstall and it seemed to install correctly this time. Sorry for not posting the version last time; I now have v0.9 installed. Unless something was fixed in one of the last gitsyncs for 2.1-dev??? Thanks for all your hard work...
-Joe Cowboy
-
Steve,
looking at my secondary machine, I have the feeling that the problems are related to the "scrolling" feature of the panel. In fact, I sometimes see frozen screens where there is scrolling… I will keep an eye on it and try to see if it is the problem...
Ciao,
Michele
Hello,
I have been running LCDproc 0.5.5 with the package 0.9 and only the "traffic (wan)" screen for 2 days and everything is going fine… Has anyone else tried to avoid screens that scroll, with a positive result?
Thanks,
Michele
-
None of the screens I have enabled scroll (Uptime, States, Mbuf, & WAN with 5 second refresh), yet the display will still stop responding and the system cannot be connected to after 10 hours. Other than that, the firewall continues to function normally as near as I can tell: DHCP still works, and existing hosts can continue to send/recv traffic and initiate new traffic. Only the webif and SSH access can no longer connect, due to the high load averages.
I have tried all combinations of sdeclcd.so and LCDd (v0.53, v0.55, debug-enabled LCDd) and have had success ONLY with v0.53 of both the module and LCDd. Every other mix of versions sends the load to 100% at the 10 hour uptime mark. All testing was performed with the LCDdproc-dev-v0.9 package files, changing only the sdeclcd.so and LCDd files - no other file was changed or modified.
I have over 48 hours of uptime without any visible problem running v0.53. This version will generate the following log entries every 10 hours but load stays less than 0.20 and the display works.
Jan 30 19:16:18 LCDd: error: huh? Too much data received... quiet down!
Jan 30 19:16:18 LCDd: Client on socket 11 disconnected
Jan 30 19:16:18 LCDd: sock_send: socket write error
Jan 30 19:16:18 LCDd: sock_send: socket write error
Jan 30 19:16:18 LCDd: sock_send: socket write error
Jan 30 19:16:18 LCDd: sock_send: socket write error
Jan 30 19:16:18 LCDd: sock_send: socket write error
Jan 30 19:16:43 php: lcdproc: Connection to LCDd process lost ()
Jan 30 19:16:45 LCDd: Connect from host 127.0.0.1:5248 on socket 11
I'm not sure what was changed between 0.53 and 0.55 and would be willing to test 0.54 if someone can provide those files.
-
@tix:
None of the screens I have enabled scroll (Uptime, States, Mbuf, & WAN with 5 second refresh) yet the display will still stop responding and the system cannot be connected to after 10 hours.
Hi,
in my case the states screen definitely scrolls… I have a 20x4 LCD display, max states: 500'000. When the states are more than 10'000 the screen scrolls.
Can you pls tell me what your display size is and what your max states setting is?
Thanks,
Michele
-
Hi,
in my case the states screen definitely scrolls… I have a 20x4 LCD display, max states: 500'000. When the states are more than 10'000 the screen scrolls.
Can you pls tell me what your display size is and what your max states setting is?
Thanks,
Michele
The display is the 2x20 standard included on the Firebox X series (X700). My states are only 50000, so it doesn't scroll.
My display finally stopped working on v0.53, but it took 50 hours, i.e. the 5th 10-hour interval. Interestingly, LCDd just died but the client continued to function and the box is as responsive as normal. The log shows the 'normal for me on this version' entries, except that the 'reconnect' entry is missing.
Jan 31 05:19:29 LCDd: error: huh? Too much data received... quiet down!
Jan 31 05:19:29 LCDd: Client on socket 11 disconnected
Jan 31 05:19:29 LCDd: sock_send: socket write error
Jan 31 05:19:29 LCDd: sock_send: socket write error
Jan 31 05:19:29 LCDd: sock_send: socket write error
Jan 31 05:19:29 LCDd: sock_send: socket write error
Jan 31 05:19:29 LCDd: sock_send: socket write error
Jan 31 05:19:29 LCDd: sock_send: socket write error
Jan 31 05:19:29 LCDd: sock_send: socket write error
Jan 31 05:19:54 php: lcdproc: Connection to LCDd process lost ()
-
Has anyone else tried to avoid screens that scroll, with a positive result?
Yes. Running interface traffic with WAN selected as the only screen has eliminated the 100% CPU problem (after 15hours testing at 1sec refresh).
I am also running the LCDd that fmertz compiled with debugging enabled, but it gives me only a tiny amount of extra information. This is with logging set to level 5. I think I've not understood how that's supposed to work. Time to re-read the developer guide!
I also tried running LCDd with different nice levels in order to be able to access the box during the problem event, but it made no difference. It seems like you should be able to set it to nice 20 and it would be very low priority, but that doesn't happen.
In fact, if you look at the output of top, it runs at 'r30', which I cannot find any reference to anywhere. :-\
last pid: 27372;  load averages:  0.02,  0.08,  0.06  up 0+23:30:46  12:43:59
48 processes:  1 running, 47 sleeping
CPU:  0.0% user,  0.0% nice,  0.4% system,  0.4% interrupt, 99.3% idle
Mem: 55M Active, 16M Inact, 55M Wired, 1060K Cache, 49M Buf, 359M Free
Swap:
  PID USERNAME  THR PRI NICE   SIZE    RES STATE    TIME   WCPU COMMAND
44832 root        1  76   20  3656K  1396K wait     1:01  0.00% sh
 4015 root        1  45    0 47452K 17464K nanslp   0:57  0.00% php
 2024 nobody      1  74  r30  3368K  1500K nanslp   0:39  0.00% LCDd
Steve
-
Yes. Running interface traffic with WAN selected as the only screen has eliminated the 100% CPU problem (after 15hours testing at 1sec refresh).
I also tried running LCDd with different nice levels
Maybe another test: run the normal lcdproc client provided by the project. FWIW, I run lcdproc on 2 hosts (a NAS and the router itself), and LCDd on the router itself, and it seems to run just fine for weeks. lcdproc has a bunch of screens with scrolling, vbars, hbars, icons, big nums… This is Linux, but the same code. This could help isolate more info about the problem.
If you need it: https://github.com/downloads/fmertz/sdeclcd/lcdproc
For nice, the driver code sets the process priority to "realtime round robin" as part of the initialization for the portable "wait" routines. Maybe this is the "r" you are seeing. The call to set the priority was removed in the driver I posted earlier.
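To check whether a process is really in a realtime class, here is a minimal sketch that reads the scheduling policy of a process. It assumes a platform where Python exposes the `os.sched_*` calls (Linux does; availability elsewhere varies), and the helper names are made up. Reading the 'r30' in FreeBSD's top as a realtime priority is my assumption, consistent with the "realtime round robin" call described above.

```python
import os

# Map the policy constants this platform exposes to readable names.
POLICY_NAMES = {
    getattr(os, name): name
    for name in ("SCHED_OTHER", "SCHED_FIFO", "SCHED_RR", "SCHED_BATCH", "SCHED_IDLE")
    if hasattr(os, name)
}


def policy_name(pid=0):
    """Return the scheduling policy name of a process (0 means the caller)."""
    return POLICY_NAMES.get(os.sched_getscheduler(pid), "unknown")


def is_realtime(pid=0):
    """True when the process runs under a realtime policy (FIFO or round robin)."""
    return policy_name(pid) in ("SCHED_FIFO", "SCHED_RR")


if __name__ == "__main__":
    print(policy_name(), is_realtime())
```

A process under a realtime policy is scheduled ahead of normal processes regardless of its nice value, which would explain why changing the nice level made no difference.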
-
Folks,
I would like to kick off the effort to bring the LED support into the driver again. We have some support already, but only for the box I own (the X-Core-e). I was hoping folks with the other models could run a command to help me identify the EXACT ICH we need to code for. Best I can figure, this command is already in pfSense and should be run as root:
pciconf -r pci0:31:0 0:256
This command reads the PCI configuration area (256 bytes) for the Low Pin Count (LPC) device. The LPC device does GPIO, and can control the LEDs. Based on the exact device ID, I can look up the spec, and find out the offset for the GPIO base register, etc.
I would like the output of the command, for the X-Core and X-Peak models. The key is the first 8 digits, the last 4 being 8086, Intel's vendor ID. Thanks.
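For reference, here is a minimal sketch of how the first word of the dump splits into vendor and device ID, plus a helper that picks the word at a given byte offset out of pasted pciconf output. The function names are made up; the sample value 25a18086 is the first word of the X-Peak dump posted in this thread.

```python
def split_pci_id(first_word):
    """Split the first 32-bit config word into (vendor_id, device_id)."""
    value = int(first_word, 16)
    vendor_id = value & 0xFFFF          # low 16 bits, 0x8086 for Intel
    device_id = (value >> 16) & 0xFFFF  # high 16 bits identify the exact chip
    return vendor_id, device_id


def read_dword(dump, offset):
    """Return the 32-bit word at a byte offset in whitespace-separated output."""
    if offset % 4:
        raise ValueError("offset must be a multiple of 4")
    words = dump.split()
    return int(words[offset // 4], 16)


if __name__ == "__main__":
    vendor, device = split_pci_id("25a18086")
    print(f"vendor {vendor:04x}, device {device:04x}")
```

Once the device ID is known, `read_dword` can be pointed at whatever offset the matching datasheet gives for the GPIO base register.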
-
X-Peak:
[2.0.1-RELEASE][root@pfsense.fire.box]/root(7): pciconf -r pci0:31:0 0:256
25a18086 0280000f 06010002 00800000
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000401 00000000 00000000 00000000
00000000 00000000 00000481 00000010
050a0c0b 000000d0 09808080 00000000
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
000054d5 00000000 00000000 00000000
00000220 00000000 0000000d 00000300
00000000 00000000 05415555 00000000
00000000 00000000 00000000 00000000
00002186 00000f02 00000004 00000000
c0000000 34040000 00112233 45670291
00e40000 00000000 00020f66 00010000
ffffffff
I'll have to try your driver without priority setting and see what happens.
Steve
Edit: You were correct. The driver with priority removed can be set to nice 20. I am testing now with several scrolling screens to see if I can still access the box.
-
Hello everybody,
I am running some tests to improve the stability of the package from the "client side". From what I have read, all the attempts so far have been made on the binary package and the driver; I think that a little help from the client may also solve some problems. I am testing these changes on my boxes and have found no problems, so I would like to share them with you.
The changes are:
- Added a 20ms delay between each command sent from the client to LCDproc.
- Better error management. The client now resets the error counter after every successful communication session with LCDproc (before, it was a global counter). The error counter is managed inside the client (lcdproc_client.php).
- Because of the above change, the "client script" (lcdclient.sh) does not cycle anymore.
I hope at least some of the problems will be solved… I await your feedback. The new version is XXX.0.9.1.
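The first two changes can be sketched roughly as follows. This is an illustration in Python, not the code from lcdproc_client.php (which is PHP); the class name, the `send` callable and the error threshold are all hypothetical.

```python
import time


class PacedClient:
    """Send commands with a pause in between and track consecutive failed sessions."""

    def __init__(self, send, delay=0.020, max_errors=3, sleep=time.sleep):
        self.send = send              # callable that transmits one command, raises OSError on failure
        self.delay = delay            # 20 ms pause after each command
        self.max_errors = max_errors  # hypothetical threshold, not taken from the package
        self.sleep = sleep
        self.errors = 0

    def run_session(self, commands):
        """Send all commands; return True on success, False if the session failed."""
        for command in commands:
            try:
                self.send(command)
            except OSError:
                self.errors += 1      # failed session: keep counting
                return False
            self.sleep(self.delay)
        self.errors = 0               # successful session: reset the counter
        return True

    def should_give_up(self):
        """True once enough sessions in a row have failed."""
        return self.errors >= self.max_errors
```

The point of the reset is that an occasional failed session no longer accumulates towards the limit; only a run of consecutive failures does.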
Thanks,
Michele
-
… and we are with 0.9.2.
I didn't realize that there were some clients pending which, with the new error counter management, could keep running in the background. So now all the lcdproc_client.php processes are killed during the package resync.
Sorry to the people who were already upgrading to 0.9.1...