Firewall went down? (Solved)



  • OK, so I ran into a weird issue today.
    I remoted into a machine at a client's site via TeamViewer, then logged into the firewall. I noticed it needed the new update from 2.2.5 to 2.2.6, so I thought what the heck, might as well get'r done.

    As a practice I remove pfBlockerNG and ntopng before upgrading, because they sometimes stall out during install on an upgrade, and then proceed with the upgrade. After the upgrade I reinstalled them.

    Mind you this firewall has 2 WANs, 3 LANs and is running the following packages…
    arpwatch
    autoconfigbackup
    bandwidthd
    lightsquid
    mailreport
    ntopng
    nut
    pfblockerng
    rrd summary
    snort
    squid
    squidguard

    So after the update completes I usually see a small pause in my remote session and then everything comes back. If I get disconnected I know the firewall is inaccessible. How do I know this? I've lost remote connectivity to the firewall before.
    Well it went inaccessible! So crap, I have to drive down there AGAIN. Thank God this was on a weekend and early in the morning!

    So I get down there, log into the firewall and everything looks fine but I am unable to pass traffic from anywhere on the network to the internet. Then I look at the system logs and HOLY CRAP! BRUTE FORCE ATTACKS on ssh. Mind you I leave SSH on only to one local network strictly for my use. No external access allowed. There are no rules in place to allow external access to ssh AT ALL!

    So I turn SSH off so I can continue trying to figure out why no traffic is being passed. Basically proceeding with the 1 problem at a time rule.

    I noticed alerts were coming up about the traffic shaper queues having errors, so I deleted the traffic shaping rules and WHAM! The internet traffic started to move!
    Now my question was why was traffic passing from the external world through the firewall to allow attempted ssh logins?
    I looked around, and I'm still not sure why the firewall was passing SSH traffic.

    NAT has the following rules... (See NAT.jpg attachment)
    The DSL has the following rules... (See DSL.jpg attachment)
    and the Cable has the following rules... (see Cable.jpg attachment)

    As you can see nothing allows SSH. But maybe I'm missing something.
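
    One way to narrow this down is to check where the failed logins actually came from. The sketch below is only an illustration: it assumes the log contains standard OpenSSH "Failed password" lines, and the function name is made up for this example. If every source turns out to be a private address, the attempts came from inside the network rather than through the WAN rules.

```python
import ipaddress
import re
from collections import Counter

# Assumes standard OpenSSH "Failed password for ... from <address>" log lines.
FAILED_RE = re.compile(
    r"sshd\[\d+\]: Failed password for (?:invalid user )?\S+ from (\S+)"
)


def summarize_ssh_failures(lines):
    """Count failed SSH logins per source and label each source's scope."""
    counts = Counter()
    for line in lines:
        match = FAILED_RE.search(line)
        if match:
            counts[match.group(1)] += 1

    summary = {}
    for source, hits in counts.items():
        try:
            # is_private is True for RFC 1918 space and other reserved ranges.
            private = ipaddress.ip_address(source).is_private
            scope = "internal" if private else "external"
        except ValueError:
            scope = "unknown"  # hostname or unparsable address
        summary[source] = (hits, scope)
    return summary
```

    Feed it the lines of an exported system log, for example `summarize_ssh_failures(open(path))` with whatever path the log was saved to.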

    Now, here's where it gets weird. Everything again is looking fine. The SSH attacks have stopped, the firewall looks to be working, traffic is moving. So I leave to go do other things. Then I get a call about 20 minutes later, "The internet is down". I'm thinking, WTF took the whole internet down?

    I get there and the firewall is inaccessible via the local network. I can't even ping it. I turn the monitor on to look at the console and log in, there are NO interfaces. NONE!!!!!
    I'm thinking, "Crap! what took my config away!!!"
    So I reboot... Nothing, I do a factory reset, get logged into the web gui and restore from backup, reboot and everything comes online.
    I check the logs but I don't see any brute force attempts, or anyone or anything logging in. I do, however, see the time jumping all over the place, from 10am to 18:00 hours, throughout the log at the time of the event and after it. At the time the firewall dumped I see "Jan 2 10:11:59 sshlockout[32465]: sshlockout/webConfigurator v3.0 starting up"

    I assume that's the ssh lockout service firing up.
    But I'm not sure why the traffic shaper broke internet traffic during the upgrade, or why ssh traffic was being passed, or why the firewall went completely offline, inaccessible or why the time is jumping around.
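
    The time jumps can be pinned down more precisely than by eyeballing the log. This is a rough sketch, assuming BSD-style syslog lines that start with a "Jan 2 10:11:59" timestamp. Those timestamps carry no year, so the `year` default is a guess, and the one-hour threshold and function name are arbitrary choices for this example.

```python
import re
from datetime import datetime, timedelta

# Leading BSD syslog timestamp, e.g. "Jan 2 10:11:59" or "Jan  2 10:11:59".
STAMP_RE = re.compile(r"^([A-Z][a-z]{2}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2})")


def find_time_jumps(lines, year=2016, max_forward=timedelta(hours=1)):
    """Return (line_number, previous, current) for each suspicious time step.

    A step is suspicious if the clock goes backwards or moves forward by
    more than max_forward between two consecutive timestamped lines.
    """
    jumps = []
    previous = None
    for number, line in enumerate(lines, start=1):
        match = STAMP_RE.match(line)
        if not match:
            continue  # skip lines without a leading timestamp
        current = datetime.strptime(
            f"{year} {match.group(1)}", "%Y %b %d %H:%M:%S"
        )
        if previous is not None:
            delta = current - previous
            if delta < timedelta(0) or delta > max_forward:
                jumps.append((number, previous, current))
        previous = current
    return jumps
```

    A quiet firewall can legitimately go more than an hour between log entries, so forward jumps deserve a second look before blaming the clock. Backward jumps are the stronger signal.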

    Any thoughts anyone?

    Edit: Added LAN Rules.



  • No one has any ideas?

    Firewall went down again. This time I have no idea why, because I had to wipe it out and reload it from scratch.



  • The only other thing I did notice is that the up and down tones, for when the firewall comes up and goes down were faster than normal. Maybe that means something.



  • Do you have the serial port disabled in the BIOS?



  • no. I usually leave it enabled as it is by default.



  • Try disabling it and let me know.



  • Hi @Visseroth,

    Did you also try disabling first some of the packages and try to do a test for about a day or so, if it is possible? Maybe one of the packages caused the problem?



  • I have not as of yet.

    Today it happened again. I was working on another machine and an employee asked why a web site was inaccessible. I investigated and found ALL the interfaces were gone from the console. Nothing unusual was on the console other than my own failed login attempt from earlier in the day, so I factory reset the firewall and loaded a configuration backup to bring the network back online.

    I had no time to turn off the serial port.

    I suspect a hardware failure or hardware failing but I'm not sure. I do have a new Supermicro Atom C2758 MiniITX server on the way to replace this machine with.

    Any ideas anyone??

    I even checked the logs and didn't see anything unusual.



  • You know, in hindsight I had never seen this problem until I updated to 2.2.6, so I ASSUME it's a bug in 2.2.6, though I may be a special case. So I think I'm going to roll the firewall back to 2.2.5 and, when I replace the machine with the new server board, try 2.2.6 again but put this machine on the bench and see if I can make it fail. It seems that it works fine for about a week and then craps out, but it sure would be nice to be able to duplicate the issue consistently so as to make sure it doesn't happen again.

    I am extremely curious as to what is going on but also very concerned because this is a production machine that needs to stay online!



  • @Visseroth:

    Production machine, definitely rollback if that was stable.
    Unless I missed it, did you give information about the hardware?  NICs, motherboard, etc.  That would help narrow down a search if it is a problem with 2.2.6.



  • No I did not. But I will when I have a chance to look at it closer.

    I know it has an onboard Intel NIC, a PCI Express Intel NIC and 2 Realtek NICs, an Intel Pentium D and 4GB of RAM, and is a repurposed Lenovo. The new machine will be a Supermicro Mini-ITX server board with 16GB of ECC. Yeah, a bit overkill, but I wanted dual sticks in case one errored.

    As soon as I am able I will roll it back. As soon as I pull it I'll do some testing and report back.



  • @Visseroth:

    It is quite possible that driver changes occurred affecting the NIC hardware. Realteks have a history of being "interesting", but you say "no interfaces". Nothing showing up if you do ifconfig -a? On some systems a plain ifconfig shows only interfaces that are UP, so it may be that they are simply all down. Any indications in syslog/dmesg about the interfaces having problems?



  • It is possible that it is a driver issue, though I'm not completely certain at the moment and am unable to test. The replacement hardware is on the way and then I'll be able to do some more testing.

    When I say the interfaces are gone I mean they are gone! Nothing shows on the console. No interfaces at all, only the menu. If I try and set them up again there will still be no interfaces showing on the console. If I set the IP it still doesn't ping and there are still no interfaces on the console. Literally no list of interfaces at all.

    It acted up yesterday and again this morning. However this morning I was prepped and had a USB installer prepped and ready with 2.2.5. Install time… 1 minute 14 seconds from startup to reboot.

    So, hopefully it's just a driver issue between 2.2.6 and 2.2.5, that should make it stable until the new hardware comes in.

    Ironically we ordered new hardware because of the glitches and age of the machine and to set up CARP. We plan on ordering another unit, exactly the same, and setting up failover in case of situations like this, which should give them 99+% uptime. Well, that's the theory anyhow, and the new system should be more stable with Intel-only NICs.



  • You should be able to exit the menu to a shell.  If you can get to that, the output of ifconfig -a would be interesting.
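
    If the output is long, something like the following can boil it down to which interfaces exist and whether each is UP. Treat it as a sketch: it assumes the FreeBSD-style ifconfig layout (a "name: flags=...<...>" header followed by indented detail lines), and the function names are invented for this example.

```python
import re
import subprocess

# Header line of each interface stanza, e.g. "em0: flags=8843<UP,...> ..."
HEADER_RE = re.compile(r"^(\S+): flags=[0-9a-fA-F]+<([^>]*)>")


def parse_ifconfig(text):
    """Map interface name -> {"up": bool, "status": str or None}."""
    interfaces = {}
    current = None
    for line in text.splitlines():
        header = HEADER_RE.match(line)
        if header:
            current = header.group(1)
            flags = header.group(2).split(",") if header.group(2) else []
            interfaces[current] = {"up": "UP" in flags, "status": None}
        elif current and line.strip().startswith("status:"):
            # Indented "status: active" / "status: no carrier" detail line.
            interfaces[current]["status"] = line.split(":", 1)[1].strip()
    return interfaces


def read_ifconfig():
    """Run `ifconfig -a`; only works on a host that has ifconfig installed."""
    result = subprocess.run(
        ["ifconfig", "-a"], capture_output=True, text=True, check=True
    )
    return result.stdout
```

    An empty result from `parse_ifconfig(read_ifconfig())` would mean the system itself sees no NICs at all, which is a different problem from interfaces that exist but are down or unassigned.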



  • That I will do, but it seems rolling the firewall back to 2.2.5 has thus far resolved the problem. I'm still waiting on pieces for the new firewall but as soon as the new firewall is in I'll be putting this one on the bench for testing.



  • There were no driver changes at all between 2.2.5 and 2.2.6, so something else is going on there.

    The speed of the beeps is dependent on the system clock. That combined with what you mentioned re: other apparent time oddities make the system's ability to accurately keep time seem suspect.



  • If that is the case then I've seen at least 3 systems, all completely different builds, with a time problem, since their startup and shutdown beeps were REALLY fast. I find that highly unlikely, though not impossible.

    Last I checked NTP, it didn't seem like it was adjusting its own clock all that much.



  • OK, well, the "rolling it back to 2.2.5" theory is blown away.
    I'm waiting on a couple pieces of the hardware to show up this week then it's going to get swapped out and put on the bench for further testing. I'm going to figure out exactly what is going on in this thing!

    It went down again. It always seems to happen on a Friday night or Saturday morning. I've only seen it go down during business hours once while I was there, and I believe that was a Thursday about 3pm.

    Anyhow, I managed to run the ifconfig before I proceeded to blow away the install, reinstall, reload the config and reinstall the packages. Here is what it looks like….

    I keep trying to attach the image but I keep getting...
    500 Internal Server Error
    nginx



  • Probably not enough information to go on, but I can't swap the machine out yet as I don't have anything to replace it with.



  • Well, found the problem. Some hardware went bad and was corrupting things, including the config!!! Hard to say if it is motherboard related but it's definitely memory related...
    Yet another reason to go ECC. I'm glad the new system has ECC RAM!
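
    For anyone who lands here with similar symptoms, a cheap early warning is to check that the live config still parses and compare it against a known-good backup. This is a minimal sketch, not a pfSense feature: the function names are made up, and both paths are whatever you pass in (on pfSense the live config is normally /conf/config.xml, but treat that as an assumption for your install).

```python
import hashlib
import xml.etree.ElementTree as ET
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def check_config(live_path, backup_path):
    """Return a list of problems with the live config, empty if none found."""
    problems = []
    try:
        ET.parse(str(live_path))  # raises ParseError on malformed XML
    except ET.ParseError as error:
        problems.append(f"live config is not well-formed XML: {error}")
    if sha256_of(live_path) != sha256_of(backup_path):
        problems.append("live config differs from backup")
    return problems
```

    A mismatch against the backup is expected after any legitimate change, so it only means something when nobody has touched the settings. A parse failure is a red flag either way.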


