linux-kernel - Re: [PATCH] writeback: Fix broken sync writeback

Message-ID: <20100222210112.GE23832@thunk.org>
Date:	Mon, 22 Feb 2010 16:01:12 -0500
From:	tytso@....edu
To:	Jan Kara <jack@...e.cz>
Cc:	Linus Torvalds <torvalds@...ux-foundation.org>,
	Jens Axboe <jens.axboe@...cle.com>,
	Linux Kernel <linux-kernel@...r.kernel.org>,
	jengelh@...ozas.de, stable@...nel.org, gregkh@...e.de
Subject: Re: [PATCH] writeback: Fix broken sync writeback

On Mon, Feb 22, 2010 at 06:29:38PM +0100, Jan Kara wrote:
> 
> a) ext4_da_writepages returns after writing 32 MB even in WB_SYNC_ALL mode
> (when 1024 is passed in nr_to_write). Writeback code kind of expects that
> in WB_SYNC_ALL mode all dirty pages in the given range are written (the
> same way as write_cache_pages does that).

Well, we return after writing 128MB because of the magic
s_max_writeback_mb_bump.  The fact that nr_to_write limits the number
of pages which are written is intentional in the writeback code.
I've disagreed with it, but I don't think it would be legit to
completely ignore nr_to_write in WB_SYNC_ALL mode --- is that what
you are saying we should do?  (If it were indeed legit to ignore
nr_to_write, I would have done it a long time ago; I introduced
s_max_writeback_mb_bump instead as a workaround for what I consider
to be a serious misfeature in the writeback code.)
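
For the arithmetic behind the numbers above, here is a rough userspace
sketch, not the actual ext4 code.  It assumes 4 KiB pages, and the
helper names (mb_to_pages, bumped_budget, write_some) are made up for
illustration; the "raise the budget to the cap" rule is a simplified
model of what a bump like s_max_writeback_mb_bump does.

```c
#include <assert.h>

/* Assumption: 4 KiB pages (shift of 12); the kernel gets this from the arch. */
#define SKETCH_PAGE_SHIFT 12

/* Convert a size in megabytes into a number of pages. */
static long mb_to_pages(long mb)
{
    assert(SKETCH_PAGE_SHIFT <= 20);
    return mb << (20 - SKETCH_PAGE_SHIFT);
}

/*
 * Hypothetical model of a bumped budget: if the caller's nr_to_write
 * is smaller than the cap (given in MB), raise it to the cap.
 */
static long bumped_budget(long nr_to_write, long max_bump_mb)
{
    long cap = mb_to_pages(max_bump_mb);

    return nr_to_write < cap ? cap : nr_to_write;
}

/* Simulate one ->writepages call; returns the pages still dirty. */
static long write_some(long dirty_pages, long budget)
{
    long written = dirty_pages < budget ? dirty_pages : budget;

    return dirty_pages - written;
}
```

With these assumptions, a caller passing nr_to_write = 1024 and a
128MB bump gets a budget of 32768 pages, so an inode with more dirty
pages than that still has pages left over after a single call, even
in WB_SYNC_ALL mode.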

> b) because of delayed allocation, inode is redirtied during ->writepages
> call and thus writeback_single_inode calls redirty_tail at it. Thus each
> inode will be written at least twice (synchronously, which means a
> transaction commit and a disk cache flush for each such write).

Hmm, does this happen with XFS, too?  If not, I wonder how they handle
it?  And whether we need to push a solution into the generic layers.

> d) ext4_writepage never succeeds to write a page with delayed-allocated
> data. So pageout() function never succeeds in cleaning a page on ext4.
> I think that when other problems in writeback code make writeout slow (like
> in Jan Engelhardt's case), this can bite us and I assume this might be the
> reason why Jan saw kswapd active doing some work during his problems.

Yeah, I've noticed this.  What it means is that if we have massive
memory pressure in a particular zone, pages which are subject to
delayed allocation won't get written out by mm/vmscan.c.  Anonymous
pages will be written out to swap, and data pages which are re-written
via random access mmap() (and so we know where they will be written on
disk) will get written, and that's not a problem.  So even with
relatively large zones, it happens, but most of the time I don't think
it's a major problem.
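
To make the three cases above concrete, here is a toy userspace model,
not kernel code.  The enum and function names are hypothetical; the
only thing it encodes is the rule described above: reclaim can clean
anonymous pages (to swap) and file pages whose on-disk location is
already known, but not pages that are still delayed-allocated.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical classification of dirty pages seen by reclaim. */
enum page_kind {
    PAGE_ANON,               /* anonymous memory, backed by swap */
    PAGE_FILE_MAPPED_BLOCKS, /* file data with blocks already allocated */
    PAGE_FILE_DELALLOC       /* file data still waiting for allocation */
};

/* Can reclaim write this page out on its own, per the rule above? */
static bool reclaim_can_write(enum page_kind kind)
{
    switch (kind) {
    case PAGE_ANON:
        return true;   /* written out to swap */
    case PAGE_FILE_MAPPED_BLOCKS:
        return true;   /* we know where it goes on disk */
    case PAGE_FILE_DELALLOC:
        return false;  /* skipped until blocks are allocated */
    }
    return false;
}

/* Count how many of the n dirty pages in a zone reclaim could clean. */
static int count_cleanable(const enum page_kind *pages, int n)
{
    int cleanable = 0;

    assert(n >= 0);
    for (int i = 0; i < n; i++) {
        if (reclaim_can_write(pages[i]))
            cleanable++;
    }
    return cleanable;
}
```

In this model, a zone whose dirty pages are mostly PAGE_FILE_DELALLOC
gives reclaim very little it can clean by itself, which is the
situation that matters for small zones.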

I am worried about this issue in certain configurations where pseudo
NUMA zones have been created and are artificially really tiny (128MB)
for container support, but that's not a standard upstream thing.

ext4_writepage refuses to write delayed-allocated pages in order to
avoid a lock inversion, and so this is an ext4-specific thing (at
least I don't think XFS's delayed allocation has this misfeature).
It would be interesting to see documented evidence that this is
easily triggered under normal situations.  If so, we should look into
figuring out how to fix this...

       	      	   		     	 - Ted