Message-ID: <20081002120444.GA25164@mit.edu>
Date:	Thu, 2 Oct 2008 08:04:44 -0400
From:	Theodore Tso <tytso@....edu>
To:	Andrew Morton <akpm@...ux-foundation.org>
Cc:	Jens Axboe <jens.axboe@...cle.com>,
	Arjan van de Ven <arjan@...radead.org>,
	linux-kernel@...r.kernel.org, Alan Cox <alan@...rguk.ukuu.org.uk>
Subject: Re: [PATCH] Give kjournald a IOPRIO_CLASS_RT io priority

On Thu, Oct 02, 2008 at 01:03:15AM -0700, Andrew Morton wrote:
> 
> An async atime update gets recorded into the current transaction. 
> kjournald is working on the committing transaction.  We try to keep
> those separated, to prevent user processes from getting blocked behind
> kjournald activity.
> 

This is true unless the journal gets too full, and we need to do a
checkpoint operation --- at which point, everything stops.  If this
was a metadata-intensive benchmark, and the journal wasn't large
enough, this could be the problem.  (And if you make the journal
bigger, then when you *do* finally get forced to do a checkpoint
operation, things get stalled for even longer.)
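
The tradeoff in the parenthetical above can be sketched with a toy
user-space model. This is not jbd code: ToyJournal, log() and run() are
hypothetical names, and the model assumes a forced checkpoint flushes
everything in the journal at once, which is a simplification.

```python
class ToyJournal:
    """Toy model of a fixed-size journal (hypothetical, not jbd)."""

    def __init__(self, capacity_blocks):
        self.capacity = capacity_blocks
        self.used = 0            # blocks logged but not yet checkpointed
        self.stall_blocks = []   # blocks flushed by each forced checkpoint

    def log(self, nblocks):
        """Log nblocks of metadata; force a checkpoint if they do not fit."""
        if nblocks > self.capacity:
            raise ValueError("transaction larger than the journal")
        if self.used + nblocks > self.capacity:
            # Assumption: everything stops while all logged blocks are
            # written back, so the stall grows with self.used.
            self.stall_blocks.append(self.used)
            self.used = 0
        self.used += nblocks


def run(capacity, writes=1000, nblocks=1):
    """Return (number of forced checkpoints, largest checkpoint in blocks)."""
    journal = ToyJournal(capacity)
    for _ in range(writes):
        journal.log(nblocks)
    return len(journal.stall_blocks), max(journal.stall_blocks, default=0)
```

With 1000 single-block updates, run(100) gives (9, 100) and run(500)
gives (1, 500): the larger journal checkpoints less often, but each
forced checkpoint has five times as much to write back.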

Arjan, is this *really* about atime updates?  I thought most people
these days run with noatime or relatime.  If people *really* want true
atime semantics, the best way to solve this problem would be to have
two dirty flags in the inode --- an "atime dirty" and a "dirty" flag.
The atime dirty bit would not actually cause the inode to get written
to disk, unless either (a) we are unmounting the filesystem, or (b) we
are trying to shrink the inode cache due to memory pressure.  If only
the atime dirty bit is set when we write the inode out to disk, we
can also skip journalling the inode table block.  So if there are
people who really care about true atime semantics, without getting
killed by the I/O writes, there are some solutions we can pursue.
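
A minimal sketch of the two-flag idea above, as a user-space model
rather than kernel code. The flag names, ToyInode and the helper
functions are all hypothetical; they only encode the rules stated in
the paragraph.

```python
I_DIRTY = 0x1        # hypothetical: inode has changes that must reach disk
I_ATIME_DIRTY = 0x2  # hypothetical: only the in-memory atime is newer


class ToyInode:
    """Stand-in for an in-memory inode; holds just the dirty flags."""

    def __init__(self):
        self.flags = 0


def touch_atime(inode):
    """A read updated atime: set only the atime dirty bit."""
    inode.flags |= I_ATIME_DIRTY


def modify(inode):
    """Any other inode change: set the ordinary dirty bit."""
    inode.flags |= I_DIRTY


def needs_writeout(inode, unmounting=False, evicting=False):
    """Should this inode be written to disk now?"""
    if inode.flags & I_DIRTY:
        return True
    if inode.flags & I_ATIME_DIRTY:
        # atime-only changes wait for (a) unmount or (b) the inode
        # being shrunk out of the cache under memory pressure.
        return unmounting or evicting
    return False


def needs_journal(inode):
    """Skip journalling the inode table block for atime-only changes."""
    return bool(inode.flags & I_DIRTY)
```

An inode that has only been read reports needs_writeout() as false
until unmount or eviction, and needs_journal() as false throughout;
once modify() is called, both become true.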

But if this is really about the "entangled fsync problem", where we
have a large number of processes writing a large amount of async data,
and then we have a single process writing a small amount of data and
then calling fsync(), then that's a different (and very long-standing)
problem in ext3/4.  Raising the I/O priority is probably the only
thing we can do in this circumstance.  We could try to do some kind of
complex priority inheritance scheme, but it would certainly be much
simpler to raise the I/O priority.  We could choose a level just below
realtime priority, but the reality is that if a real-time process is
trying to write to the filesystem, and we are doing a checkpointing
operation, we're going to be blocking the real-time process anyway, and
it will be a priority inversion.  So perhaps the simplest and best
algorithm would be to use a priority level just below real-time when
doing a normal commit, but if we start to do a checkpoint, we go to
IOPRIO_CLASS_RT.
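
The policy proposed above could be sketched as follows. The class
numbers and the 13-bit shift mirror the Linux ioprio encoding;
journal_ioprio() is a hypothetical helper. Two assumptions are made
here: "just below real-time" is read as the highest best-effort level,
and level 0 is used within IOPRIO_CLASS_RT, since the text does not
name a level.

```python
IOPRIO_CLASS_SHIFT = 13   # class lives above the 13 priority-data bits
IOPRIO_CLASS_RT = 1       # real-time class
IOPRIO_CLASS_BE = 2       # best-effort class


def ioprio_value(ioclass, level):
    """Pack a class and a level (0 = highest) into one ioprio value."""
    return (ioclass << IOPRIO_CLASS_SHIFT) | level


def journal_ioprio(checkpointing):
    """Pick the journal thread's I/O priority for the current work."""
    if checkpointing:
        # Everyone is blocked behind a checkpoint anyway, so go to RT.
        # Level 0 is an assumption.
        return ioprio_value(IOPRIO_CLASS_RT, 0)
    # Normal commit: assumed to mean the top of the best-effort class.
    return ioprio_value(IOPRIO_CLASS_BE, 0)
```

A value like this is what would be handed to ioprio_set(2) for the
journal thread; the sketch only computes it and does not make the
system call.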

> But sometimes that doesn't work (including the place where I knowingly
> broke it).  If we can find and fix the offending piece of jbd logic (a
> big if) then all is peachy.

Do we have workloads that can easily demonstrate this problem?  If so,
we can add some tracing code which will allow us to see which theory
is correct, and what is actually happening.

						- Ted