Updated today to latest snap - very slow

jimp

Is 2.1 still disabling DMA correctly? Does it need to for FreeBSD 8.3?

That hasn't changed

Any power saving 'spin down' type options enabled?

No, in fact we actually force APM off now since otherwise some silly HDDs will kill themselves with load cycles. (But that's only done if the drive reports that feature as supported)

Either way - those two types of things would be fairly universal, they would affect everyone. It wouldn't be fast for me and slow for you.

stephenw10

Agreed.
Perhaps a combination of pfSense and BIOS options then?
Though that wouldn't account for introducing the problem by simply switching to 2.1.
Config slice become corrupt or fragmented somehow? :-\

Steve

moullas

Looking around in the /boot/loader.conf defaults again… I'm no BSD expert by a long shot, but I've read that increasing the kmem size reduces the chance of a kernel panic.
vm.kmem_size="435544320"
vm.kmem_size_max="535544320"

That size looks like 415MB, and my firewall is running with 256MB of RAM. Could it be related?
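
As a quick sanity check on those two numbers, here is a minimal shell sketch of the conversion. The helper name bytes_to_mib is made up for illustration, and it uses integer division, so results are rounded down.

```shell
#!/bin/sh
# bytes_to_mib BYTES: print BYTES converted to whole mebibytes
# (integer division, so the fractional part is dropped).
bytes_to_mib() {
    echo $(( $1 / 1048576 ))
}

bytes_to_mib 435544320   # vm.kmem_size from loader.conf, prints 415
bytes_to_mib 535544320   # vm.kmem_size_max from loader.conf, prints 510
```

On a running box, `sysctl vm.kmem_size hw.physmem` should show the values actually in effect, assuming those sysctls are exposed on this build.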

barista

Sorry for the late reply, but working on the Alix meant I had no Internet.

Tests were done:

zero the 4GB card, re-image 2.1 from yesterday, re-test : same sluggish results
take the 2GB card that came with the units, image it with 2.1, re-test : same result
re-image 4GB card with 2.0.1 : smooth as butter, fast. 9 seconds to enable SSH, 8 seconds to disable

I posit something IO related is very wonky.

Sadly, until my 2nd Alix is back in my hands, I shall have to remain at 2.0.1

eweri

Hello!

Tested with ALIX.2 v0.99h board and 8GB CF and got the same results as barista:

"Build on Tue Apr 17:30:20 EDT 2012" - DHCP on the WAN interface does not work, and setting anything up takes forever (several minutes to get a response).
I connected via serial console and ran top: the system was idle. I then tried to reboot via the web GUI and start top on the serial console, but top would not start. The system appeared to hang for minutes and then went back to normal (top started in the console and I got the shutdown message, but the system did not shut down, so I unplugged the power).

After the reboot I switched the slice (which took minutes…) and rebooted (again waiting minutes... I could not wait and unplugged the power).
"Build on Mon Apr 16 08:51:47 EDT 2012" - WAN got an IP via DHCP, but changing the configuration takes a long time.

Both builds take a long time to configure WAN, but the damaged one takes a lot longer.

At the moment I'm updating to "build Wed Apr 18 18:57:19 EDT 2012" - will report later.

Bye,
eweri

phil.davis

Timings from 2 of my Alix 2D13 256MB systems for "mount -uw /" and "mount -ur /":

a) 2.1-DEVELOPMENT (i386)
built on Wed Mar 28 20:25:36 EDT 2012
FreeBSD 8.3-RC2
Both mounts are quick (0:00.04 then 0:01.27)

b) 2.1-DEVELOPMENT (i386)
built on Mon Apr 16 16:52:53 EDT 2012
FreeBSD 8.3-RELEASE
"mount -uw /" is quick (even reporting 0:00.00)
"mount -ur /" is always around 1:02.00 (4 samples between 1 minute and 1:05.00)

I also tried "mount -ufr /" in the hope that the "force" flag might force something closed that was accidentally left open, but it made no difference to the timing.

Is this a feature of 8.3-RELEASE?

Can others confirm if the issue is only on 8.3-RELEASE builds?
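
For anyone who wants to collect comparable numbers, here is a rough sketch of a timing wrapper in plain sh. The function name elapsed_seconds is invented for this example, and it only has one-second resolution (it relies on `date +%s`), so the csh `time` builtin remains the better tool for sub-second figures.

```shell
#!/bin/sh
# elapsed_seconds CMD [ARGS...]: run CMD and print how many whole seconds
# it took. The command's own stdout is redirected to stderr so that the
# only thing printed on stdout is the elapsed time.
elapsed_seconds() {
    start=$(date +%s)
    "$@" >&2
    rc=$?
    end=$(date +%s)
    echo $(( end - start ))
    return $rc
}

# Example use on the firewall (needs root, so it is left commented out):
#   elapsed_seconds mount -uw /
#   elapsed_seconds mount -ur /
```

Running the read-only remount several times and noting each figure gives the same kind of samples as the ones quoted above.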

eweri

Okay, here are my results with "build on Wed Apr 18 18:57:19 EDT 2012":

WAN does not get an IP address via DHCP, and configuration changes take a very long time to complete.

I tried to reboot via the web GUI and this is what I got on the serial console (it hung for a long time):
Apr 19 09:45:43 init: timeout expired for /bin/sh on /etc/rc.shutdown: Interrupted system call; going to single user mode
Waiting (max 60 seconds) for system process `vnlru' to stop...done
Waiting (max 60 seconds) for system process `bufdaemon' to stop...done
Waiting (max 60 seconds) for system process `syncer' to stop... sh

Hope this helps,
eweri

jimp

: cat /etc/version
2.1-DEVELOPMENT
: cat /etc/nanosize.txt 
4g
: cat /etc/version.buildtime 
Tue Apr 17 06:39:44 EDT 2012
: uname -a
FreeBSD alix.pingle.org 8.3-RELEASE FreeBSD 8.3-RELEASE #0: Tue Apr 17 06:39:26 EDT 2012     root@FreeBSD_8.3_pfSense_2.1.snaps.pfsense.org:/usr/obj./usr/pfSensesrc/src/sys/pfSense_wrap.8.i386  i386
: time mount -uw /
0.000u 0.006s 0:00.00 0.0%      0+0k 0+5io 0pf+0w
: time mount -ur /
0.005u 0.015s 0:01.76 0.5%      64+2680k 0+578io 0pf+0w

DHCP WAN works fine for me, as does IPv6 and such.

What packages are installed on the ones having issues? I have none installed.

stephenw10

Update or fresh install might be pertinent here.

Steve

jimp

That's possible. I've got a ton of CF's around, I'll image a new one and swap it in, see what happens.

phil.davis

My test system with 8.3-RELEASE also had Squid and SquidGuard packages running. For some reason, between "Configuring WAN interface" and "Configuring LAN interface", Squid processes would start. Then later the package startup would also start Squid. There is some sort of double-startup of Squid with the current 2.1-DEVELOPMENT and whatever is happening with the PBI installation process - I will look at that later and start another thread for it.

I removed SquidGuard and Squid.

Now "mount -uw /" is instant (as always) and "mount -ur /" is about 2 seconds.

My 8.3-RC2 snapshot didn't have any packages installed, so that seems to be why it is fast to mount read-only.

After adding packages (at least PBIs like Squid/SquidGuard) the time to mount read-only increases substantially.

jimp

The freshly imaged card was indeed much slower, taking ~30 seconds to mount ro for me.

I don't think it's FreeBSD 8.3 in general though, there were other changes made to how the nanobsd images are generated which could be contributing.

markky

@jimp:

Things are smooth/snappy on my ALIX with a snapshot from today, I've not seen this slowness myself. Saving does take about 6-7 seconds but it always has on ALIX.

Using HTTPS to get to the GUI, everything seems happy.

I have a Soekris 4801 (and a couple of 4501s, but they don't have enough RAM). It works fine with 2.0.1, but every 2.1 snapshot I've tried in the last week or two exhibits the stall problem.

A number of people have mentioned shell commands like 'ls' hanging. If you type ^T (Ctrl-T) during one of these stalls, you'll see that the process is waiting on suspfs. That wait message is used by two functions in kern/vfs_vnops.c: both vn_start_write and vn_start_secondary_write call msleep with the same wmesg string. Usually the wait message is unique, so there is no confusion about which [tm]sleep we're waiting in. I think this is a bug in itself, and the msleep in vn_start_secondary_write should be changed to "suspf2" or something similar to differentiate it.

Anyway, seeing processes stall in either of these functions leads back to the mount updates being the cause. If I run /etc/rc.conf_mount_ro from a shell, followed by /etc/rc.conf_mount_rw, I see a stall on the second command every time. ^T reports that the mount process is waiting on biowr.

I put some logging into /etc/inc/config.lib.php (where the conf_mount_ro and conf_mount_rw functions reside) to determine which of the 4 mount updates was stalling (syslog conveniently timestamps the messages), and it's the read-only mount update of /. On my system it's a stall of between 107 and 110 seconds.
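
The same idea can be tried from a shell without touching the PHP. What follows is only a sketch of the approach, not the change actually made to config.lib.php: the names trace_step and LOG_CMD and the mount-trace tag are invented for illustration, and the only real tool assumed is logger(1) with its -t option.

```shell
#!/bin/sh
# Command used to emit a log line. Defaults to logger(1) so syslog adds
# the timestamps; set LOG_CMD=echo to print to the terminal instead.
LOG_CMD=${LOG_CMD:-"logger -t mount-trace"}

# trace_step LABEL CMD [ARGS...]: log a line before and after running CMD,
# then return CMD's exit status. The gap between the two syslog
# timestamps shows how long that step stalled.
trace_step() {
    label=$1
    shift
    $LOG_CMD "begin: $label"
    "$@"
    rc=$?
    $LOG_CMD "end: $label (exit $rc)"
    return $rc
}

# Example use on the firewall (needs root, so it is left commented out):
#   trace_step "remount / read-write" mount -uw /
#   trace_step "remount / read-only"  mount -ur /
```

Wrapping each mount update separately makes it obvious from the system log which one accounts for the delay.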

I can't reproduce the behaviour using the same image under VirtualBox.

The CF card in this 4801 is a cheap card from a brand I had not heard of; I'm going to try a SanDisk card when I can find it.

Mark

jimp

Recently we changed the builder to use conv=sparse when making image files to save time/space on the builders. Not sure if that's related, but I backed it out today. We'll have to wait and see if the snapshots from tomorrow, when freshly imaged, work better.

I never did have a problem with my card that I had running 2.0.1 and firmware upgraded to 2.1. Only with a fresh image.

moullas

jimp, I guess that means tomorrow when new images are created a full re-flash is in order right?

Can you let us know when we can test? (weekend coming up, good opportunity) :)

thanks !

bardelot

@jimp:

I never did have a problem with my card that I had running 2.0.1 and firmware upgraded to 2.1. Only with a fresh image.

On my system it's exactly the other way around. The fresh image (same snapshot) is much faster. Although it might also be the CF card which I replaced from 1G to 4G. The datarate while downloading a snapshot increased from 20kbps up to 500kbps, and any saving operation only takes a few seconds again.

markky

I forgot to say that this is likely a race condition, probably triggered by the longer flash write time. It sounds to me very similar to
http://www.freebsd.org/cgi/query-pr.cgi?pr=149022
which apparently was seen on normal drives under heavy load. The bug wasn't resolved and was closed due to lack of feedback.
There was a query about whether a fix to the soft-update code fixed it.
Soft-updates aren't enabled on the flash device. However I did try enabling soft-updates just to see if a change in the write timing might work around the issue I was seeing.

Would be much easier to debug if I could get a 2.1 build environment working, but that's the topic of another post…

"show mount" from the kernel debugger would be very interesting.

Mark

bao

@jimp:

Recently we changed the builder to use conv=sparse when making image files to save time/space on the builders. Not sure if that's related, but I backed it out today. We'll have to wait and see if the snapshots from tomorrow, when freshly imaged, work better.

I noticed something different, which may be related.

In create_nanobsd_diskimage(), a change was made to the zeroing of the nanobsd image:

dd if=/dev/zero of=${IMG} bs=${NANO_SECTS}b \
count=0 seek=`expr ${NANO_MEDIASIZE} / ${NANO_SECTS}`

That did not really do anything.

Later on, the dd command was used to write the image, with "conv=sparse". I don't think it matters at this time, since the image probably contains a lot of garbage.

We changed it back for our purposes, since we build 8GB and 16GB nanobsd images and we can't compress them to reasonable sizes.

dd if=/dev/zero of=${IMG} bs=${NANO_SECTS}b \
count=`expr ${NANO_MEDIASIZE} / ${NANO_SECTS}`

On another note, the last patch from ermalluci broke the build process in the last couple of days. We made changes by hand to the headers to get it going again, i.e.

From
….....
diff --git a/sys/netinet/ip_carp.c b/sys/netinet/ip_carp.c
index a4890dd..5b5fb19 100644
--- a/sys/netinet/ip_carp.c
+++ b/sys/netinet/ip_carp.c
.........
To
.........
Index: sys/netinet/ip_carp.c

--- sys/netinet/ip_carp.c
+++ sys/netinet/ip_carp.c
.........

databeestje

The count=0 and seek=number combination means we don't actually write the image out to disk with zeros.

When reading the file, it will have the correct size and everything will be returned as 0.

But you don't actually write an entire 4-16GB file, so it's much faster.
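
A small sketch to see the difference for yourself, using a 1MB file instead of a multi-gigabyte image. The function names are made up for this example, and it uses a plain bs=512 rather than the builder's ${NANO_SECTS}b value.

```shell
#!/bin/sh
# make_sparse FILE BLOCKS: give FILE an apparent size of BLOCKS * 512
# bytes without writing any data. count=0 copies nothing; seek= moves
# the output position, and dd sets the file length to that offset.
make_sparse() {
    dd if=/dev/zero of="$1" bs=512 count=0 seek="$2" 2>/dev/null
}

# make_zeroed FILE BLOCKS: actually write BLOCKS * 512 bytes of zeros.
make_zeroed() {
    dd if=/dev/zero of="$1" bs=512 count="$2" 2>/dev/null
}

# apparent_size FILE: print the file length in bytes.
apparent_size() {
    wc -c < "$1" | tr -d ' '
}

# Example:
#   make_sparse /tmp/sparse.img 2048
#   make_zeroed /tmp/zeroed.img 2048
#   apparent_size /tmp/sparse.img           # same length as zeroed.img
#   du -k /tmp/sparse.img /tmp/zeroed.img   # allocated space may differ
```

Both files read back as the same run of zeros. Whether `du` shows less space for the sparse one depends on the filesystem supporting holes.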

podilarius

On another note, the last patch from ermalluci broke the build process in the last couple of days. We made changes by hand to the headers to get it going again, i.e.

The latest updates seem to have fixed the issue. I ran through the patches and they seem to be fine now. I am going to complete a build tonight and I am guessing the build servers will also produce a snapshot sometime later tonight.
