linux-kernel - Re: howto combat highly pathologic latencies on a server?


Message-Id: <201003110220.28535.hpj@urpla.net>
Date:	Thu, 11 Mar 2010 02:20:28 +0100
From:	"Hans-Peter Jansen" <hpj@...la.net>
To:	linux-kernel@...r.kernel.org
Cc:	David Rees <drees76@...il.com>
Subject: Re: howto combat highly pathologic latencies on a server?

On Thursday 11 March 2010, 00:44:54 David Rees wrote:
> On Wed, Mar 10, 2010 at 9:17 AM, Hans-Peter Jansen <hpj@...la.net> wrote:
> > While this system usually operates fine, it suffers from delays, that
> > are displayed in latencytop as: "Writing page to disk:     8425,5 ms":
> > ftp://urpla.net/lat-8.4sec.png, but we see them also in the 1.7-4.8 sec
> > range: ftp://urpla.net/lat-1.7sec.png, ftp://urpla.net/lat-2.9sec.png,
> > ftp://urpla.net/lat-4.6sec.png and ftp://urpla.net/lat-4.8sec.png.
> >
> > From other observations, this issue "feels" like it is induced by
> > single synchronisation points in the block layer, e.g. if I create heavy
> > IO load on one RAID array, say resizing a VMware disk image, it can
> > take up to a minute to log in by ssh, although the ssh login does not
> > touch this area at all (different RAID arrays). Note, that the
> > latencytop snapshots above are made during normal operation, not this
> > kind of load..
> >
> > Might later kernels mitigate this problem? As this is a production
> > system, that is used 6.5 days a week, I cannot do dangerous
> > experiments, also switching to 64 bit is a problem due to the legacy
> > stuff described above... OTOH, my users suffer from this, and anything
> > helping in this respect is highly appreciated.
>
> Seems like a 2.6.32 based kernel which has per-BDI writeback and "CFQ
> low latency mode" changes might help a good deal.  I know that on one
> of my bigger machines (similar in specs to yours) which has a lot of
> processes which do a decent amount of IO, latency and load average has
> gone down after going to a 2.6.32 kernel from a 2.6.31 kernel (Fedora
> 11 system).
>
> Like Chris suggested, I've also heard that using the noop IO scheduler
> can work well on Areca controllers on some kernels and workloads.
> It's worth a shot and you can even try changing it at run-time.

Yes, already done: the server now runs the noop scheduler. Hopefully my
users will notice. As I upgraded this server and the clients only two
weeks ago, calming things down has the highest priority.
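
For reference, a minimal sketch of the run-time switch discussed above. It
assumes the usual sysfs interface, where /sys/block/<dev>/queue/scheduler
lists the available elevators with the active one in square brackets. The
device name sda and both helper names are only illustrative, and writing
the file requires root. Kernels of this era offer noop; newer multi-queue
kernels list different names, so check what the file offers first.

```python
from pathlib import Path


def parse_schedulers(text):
    """Split sysfs text such as 'noop deadline [cfq]' into (available, active)."""
    available = []
    active = None
    for name in text.split():
        # The kernel marks the currently selected scheduler with brackets.
        if name.startswith("[") and name.endswith("]"):
            name = name[1:-1]
            active = name
        available.append(name)
    return available, active


def set_scheduler(name, path="/sys/block/sda/queue/scheduler"):
    """Select an I/O scheduler by writing its name to the sysfs file.

    The default path is an assumption (device sda); pass the file that
    belongs to the block device you actually want to change.
    """
    sched_file = Path(path)
    available, _ = parse_schedulers(sched_file.read_text())
    if name not in available:
        raise ValueError(f"{name!r} is not offered here: {available}")
    sched_file.write_text(name + "\n")
```

Run as root, set_scheduler("noop") changes the scheduler for that one
device only, and the setting does not survive a reboot.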

Switching kernel versions on production systems is always painful, so I
try to avoid it, but this time I already had to roll my own kernel for
the clients due to some aufs2 vs. apparmor disharmony. That led to the loss
of the latter - I can live without apparmor, but certainly not without a
reliable layered filesystem¹.
 
Anyway, thanks for your suggestion and confirmation, David. It is 
appreciated.

Cheers,
Pete

¹) In a way, this is my primary justification to also use Linux on the
desktops²! Install one, and get the rest (nearly) free:
http://download.opensuse.org/repositories/home:/frispete:/aufs2 and below.
²) Don't tell anybody that I don't like the other OS ;-)