Log file manipulation (and crash issue)



  • First and foremost, I must say that pfSense is such a wonderful and powerful server. However, its features are more than I can understand with my humble tiny little brain, so I have some question marks to crack here.

    I noticed that my portalauth.log file is being cleared occasionally; it appears to be cleared at every restart.

    This log file is needed because I need to keep the captive portal log. But I risk losing information during a power outage due to, e.g., a nuclear reactor going down or something.

    Here are my questions:

    • How exactly do these logs work?
    • Is there any way I can retain the log files?
    • Is there any way to keep 30 days of log files for pfSense logs, like squid does?

    Finally, here is the million-dollar question (it is not a bounty, just that the dollar sign here is "?"):

    • When pfSense is running captive portal authentication against RADIUS and squid is in transparent mode, how do you make lightsquid generate usage reports based on the user ID from the captive portal?
      (The goal is to authenticate everybody on an open wifi network without any configuration on the user's rig.)
      (Edit: everyone is on DHCP too.)

    Thank you very much!

    Btw, pray for Japan…


  • Netgate Administrator

    If you want to keep the logs for any length of time, you need to use a syslog server. pfSense logs to RAM only, as far as I am aware. (That's certainly true for the nanobsd install I run.)

    See: http://doc.pfsense.org/index.php/Copying_Logs_to_a_Remote_Host_with_Syslog

    Steve



  • Thank you for the prompt reply.

    The syslog server is a very fascinating Linux service!

    But I have some concerns about a syslog server:

    • if the syslog server is down while all my pfSense servers are running, I will lose all the logs
    • it will increase network traffic, pushing our limited bandwidth closer to the limit.

    FYI, I am implementing a more "tedious" method to acquire these logs: a Linux system runs scheduled custom scripts that copy them over remotely during off-peak hours. I just want to know if I am doing a simple thing the hard way…


  • Netgate Administrator

    I must admit that having a complete set of logs is the one thing I miss from my previous firewall software, IPCop.
    It seems it's possible to run a syslog server on pfSense itself. Seems like a roundabout way of doing things but:

    @cmb:

    Only if you install a syslog server on the firewall. Some people use syslog-ng for that, you can pkg_add it.

    I imagine this would be an unsupported configuration that may not survive across an update for example.

    I can't believe the traffic from syslogging to a remote server would amount to much.

    Steve

    Edit: A post with some how-to, unfortunately from 2008, is here.



  • I don't think syslogging will take any noticeable bandwidth either…

    Anyway, I am using a script to remotely copy the logs over to a Linux box via ssh; it feels like reinventing the wheel.
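
A minimal sketch of the "pull over ssh" approach described above, run from the Linux box. It is not the poster's actual script. It assumes key-based ssh login to the firewall and that the log is still in pfSense's circular clog format, so it is dumped with `clog` on the firewall side. The function name, host name, and destination directory are hypothetical.

```sh
#!/bin/sh
# Sketch: pull one pfSense log to this Linux box over ssh.
# Assumptions: key-based ssh login works, and the log is still in clog
# (circular) format, so it is dumped with `clog` on the firewall.

# pull_log HOST REMOTE_LOG DEST_DIR
pull_log() {
    host=$1
    remote_log=$2
    dest_dir=$3
    stamp=$(date +%Y%m%d)
    out="$dest_dir/$(basename "$remote_log").$stamp"
    mkdir -p "$dest_dir" || return 1
    # Write to a temp file first so a failed transfer leaves no partial copy.
    if ssh "$host" "clog $remote_log" > "$out.tmp"; then
        mv "$out.tmp" "$out"
    else
        rm -f "$out.tmp"
        return 1
    fi
}
```

A call such as `pull_log root@firewall.example /var/log/portalauth.log /srv/pfsense-logs` could then be scheduled from cron during off-peak hours. Because the circular log is dumped in full each time, a second pull on the same day simply replaces that day's copy.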


  • Rebel Alliance Developer Netgate

    Syslog is the best way, push them off to that same Linux box with syslog and you don't have to do anything manually. You can store and rotate them there.

    Alternately, you could edit the file that generates pfSense's syslog config file and change it so it doesn't use clog for logging, but some other steps and care are needed for that. It's sort of on my to-do list for 2.1 or later to make switching that easier.



  • Thanks for the hat-tip. I found this, and that's exactly what I meant.

    Will use a script to trim and process old logs routinely.

    Indeed, more options for managing the logs would be nice, both for displaying them in the GUI and for how the text files are kept.

    All the best.



  • The portalauth.log was cleared again after a power surge… it seems to make no difference after I removed the %, or have I understood it wrongly?


    Eventually I worked out a sh script to copy new entries from portalauth.log to a text file saved at a custom location, and added an entry to crontab to run it every minute. I noticed that the WebGUI captive portal log no longer displays any information. Will continue to monitor the performance.

    A little bit on the sh script: don't use while read line, it is very slow. Use awk, it is much more efficient.
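
To illustrate the "while read versus awk" remark, here is a hedged sketch of both ways of copying matching lines from one file to the end of another. The function names are made up for this example, and the pattern is matched as a fixed string in both versions. The loop version does its work line by line in the shell, while the awk version scans the file in a single process, which is why it tends to be much faster on large logs.

```sh
#!/bin/sh
# Two ways to append the lines containing PATTERN from SRC to DEST.

# Slow: the shell reads one line at a time and tests each line itself.
filter_with_loop() {
    pattern=$1; src=$2; dest=$3
    while IFS= read -r line; do
        case $line in
            *"$pattern"*) printf '%s\n' "$line" >> "$dest" ;;
        esac
    done < "$src"
}

# Faster: awk scans the whole file in a single process.
filter_with_awk() {
    pattern=$1; src=$2; dest=$3
    # index() does a plain substring match, so PATTERN is not a regex here.
    awk -v pat="$pattern" 'index($0, pat) > 0' "$src" >> "$dest"
}
```

For example, `filter_with_awk logportalauth /var/log/portalauth.log /root/any_other_filename.log` appends only the captive portal entries to the second file.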



  • Rexis,

    are you willing to share your script with the community?

    Would be great, not to invent the wheel again.

    Thanks
    WK



  • @wk:

    Rexis,

    are you willing to share your script with the community?

    Would be great, not to invent the wheel again.

    Thanks
    WK

    Sorry for the late reply. I am sure there must be a better way to do this (this is like reinventing a little wheel to make the old one spin); anyhow, here it is:

    #!/bin/sh
    # Append the captive portal entries to a separate log file.
    awk '/logportalauth/' /var/log/portalauth.log >> /var/log/any_other_filename.log
    # Remove the copied entries from the original log.
    sed -e '/logportalauth/d' /var/log/portalauth.log > /var/log/portalauth2.log
    mv /var/log/portalauth2.log /var/log/portalauth.log
    (Note: I just added the last line. Without it, the sed would wipe the whole file. I did intend to clear the log, but what the original script did was not exactly what I had in mind: it simply created an empty file. Take note that I have not tested the new code above yet!)

    Save it as an executable file, put it in crontab, and make it run every minute. It is pretty efficient and doesn't affect CPU usage much.

    Also, for this to work you have to remove the "%" in front of "{$g['varlog_path']}/portalauth.log" in system.inc (otherwise the script won't retain anything, or it causes some errors I can't remember).

    All logs will be found in "any_other_filename.log", and it will survive through reboots and nuclear winter.

    (Please note that I am on 1.2.3 and have yet to try this on 2.0 RC1.)
    (Please make sure you clear it regularly, otherwise it will accumulate and take up HDD space.)



  • I tried the script and it looks promising, but I had to make some changes. I don't know whether that is because I'm on 2.0 RC2.
    First, in /etc/inc/system.inc there is not a '%' in front of "{$g['varlog_path']}/portalauth.log", but '{$log_directive}'. I deleted this.

    Then I had to move /var/log/any_other_filename.log to /root/any_other_filename.log because it was deleted after rebooting.

    If there are other glitches, I will report them.



  • Found another glitch.

    When I upgraded to the newest build of 2.0 RC2 today, /etc/inc/system.inc was rewritten and my changes were lost.
    So we have to keep in mind that it is necessary to correct system.inc after every version upgrade.

    The crontab and the script file in /root were untouched during the upgrade  :)



  • Great observation there on 2.0 :)

    Does that hint that 2.0 sort of recreates the folders upon reboot?

    I created a subfolder within /var/log/ to keep my script and secondary log, something like:
    #ls /var/log/log2/
    script.sh any_other_filename.log
    Anyhow, it might be better to put it in /root/

    Hint: you can also modify the script to name output files based on the date, so you won't have 10 years' worth of information in one little basket, such as:
    any_other_filename.20110709.log
    any_other_filename.20110710.log
    any_other_filename.20110711.log
    any_other_filename.20110712.log
    (Am still learning but sh is a really useful thing.)

    ---

    (Tested: mkdir /var/log/somerandomfoldername will survive reboot)
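
The date-in-the-filename hint, combined with the earlier advice to clear old logs regularly, could look roughly like the sketch below. It is an illustration only, not the script used in this thread: the function names are hypothetical, the file name pattern follows the `any_other_filename.YYYYMMDD.log` examples above, and it assumes portalauth.log is already a plain-text file (that is, system.inc has been changed as described earlier).

```sh
#!/bin/sh
# archive_by_date SRC DEST_DIR
# Move the captive portal lines from SRC into today's dated file in DEST_DIR.
archive_by_date() {
    src=$1; dest_dir=$2
    mkdir -p "$dest_dir" || return 1
    out="$dest_dir/any_other_filename.$(date +%Y%m%d).log"
    awk '/logportalauth/' "$src" >> "$out" || return 1
    # Drop the copied lines from SRC so the next run does not duplicate them.
    sed -e '/logportalauth/d' "$src" > "$src.tmp" && mv "$src.tmp" "$src"
}

# prune_old DEST_DIR DAYS
# Delete dated files that were last modified more than DAYS days ago.
prune_old() {
    find "$1" -type f -name 'any_other_filename.*.log' -mtime +"$2" -exec rm -f {} +
}
```

Run from cron, `archive_by_date /var/log/portalauth.log /root/portal-logs` every minute and `prune_old /root/portal-logs 30` once a day would keep roughly 30 days of per-day files; the directory and the 30-day figure are placeholders taken from the question at the top of the thread.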



  • Rexis,

    I found the lines in /etc/rc:

    # make some directories in /var

    /bin/mkdir -p /var/run /var/log /var/etc /var/db/entropy /var/at/jobs/ /var/empty 2>/dev/null
    /bin/rm /var/log/* 2>/dev/null

    That's where all the logs are deleted. So it's safer to store the logs in root's home directory.

    I love the idea of putting the date in the filename.
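
One detail of that `/bin/rm /var/log/*` line may explain the earlier observation that a subfolder under /var/log survived a reboot: without `-r`, rm refuses to remove directories, and the error is hidden by `2>/dev/null`. The sketch below reproduces this in a scratch directory; the function name is made up, and it only models the case where /var/log lives on disk rather than in RAM.

```sh
#!/bin/sh
# rc_style_clean DIR
# Same form as the /etc/rc line: plain files directly in DIR are removed,
# subdirectories (and anything inside them) are left alone.
rc_style_clean() {
    rm "$1"/* 2>/dev/null || true
}

# Try it on a throwaway directory, never on the real /var/log.
scratch=$(mktemp -d)
mkdir "$scratch/log2"
echo "secondary log" > "$scratch/log2/any_other_filename.log"
echo "primary log" > "$scratch/portalauth.log"
rc_style_clean "$scratch"
ls -R "$scratch"
rm -rf "$scratch"
```

After the call, the listing should show only the `log2` subdirectory and the file inside it. Storing the logs under /root still looks like the safer choice, since it does not depend on this side effect.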



  • Not sure if this has anything to do with the log file script here, but I am posting it here since it is the same machine that has this problem.

    Setting: pfSense + Squid running as a transparent proxy; the captive portal uses RADIUS authentication.
    Scenario: a couple of identical servers are running, covering various wifi spots.
    Usage: up to 70+ simultaneous user logins per server.
    Problem: the server occasionally hangs (it stops answering pings and the console freezes), usually under heavy usage. Some hang a few times per day, some have hung once since they were started, and some have never had this problem. The server happily recovers after a cold reset.

    Still no idea what causes this; the only logs I can refer to are the squid log and portalauth.log (retained with the above-mentioned script).

    I should use the same method to retain the system log for troubleshooting, but I am also unsure whether the script itself causes the hang. Do let me know if you face this issue as well; dual core is better than single core :) Let's think this out.



  • A bit of an update…

    I swapped in another rig with the exact same version (1.2.3), settings and customization (as mentioned above). It has been running happily for 4 days straight without any problem, with over 100 concurrent CP users, on a tiny 512MB of RAM.

    So it's likely caused by a hardware issue; they are 5-year-old PCs anyway :/ pfSense has been extremely stable.

    PS: The uptime is about 23 days now, yay... but I am still puzzled about exactly which part of the hardware caused the hang on the original rig.



  • Further update:

    Perhaps this is already off topic, but since these are the same machines I am referring to, I am putting it here.

    Finally, I have replicated the crash under test conditions and found that it is the NIC that has been causing the problem. All the problematic servers use the same NIC, reported as: dev.dc.0.%desc: Macronix 98715AEC-C 10/100BaseTX

    The crash occurs after the "TX underrun - using store and forward" message appears during a network stress test (downloading large files). Hence I started looking at the NIC, replaced it with another brand, and everything works like a charm, without even a TX underrun message popping up. Both RealTek and DLink NICs work happily.

    Poor thing, I spent weeks torturing it with CPU and HDD stress test scripts, and it ended up that the real culprit was the NIC :/
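
For anyone checking other machines for the same symptom, a small sketch like the one below counts "TX underrun" lines in kernel message output. The function name is hypothetical, and the match is deliberately limited to the two words quoted in the post, since the exact wording of the driver message may differ between versions.

```sh
#!/bin/sh
# count_tx_underruns
# Read kernel messages on stdin and print how many lines mention "TX underrun".
count_tx_underruns() {
    # grep -c prints 0 and exits non-zero when nothing matches; treat that as success.
    grep -c 'TX underrun' || true
}
```

Typical use would be `dmesg | count_tx_underruns` on the firewall, alongside `sysctl dev.dc.0.%desc` to confirm which NIC the dc driver is attached to.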

