Message-Id: <200810031232.23836.nickpiggin@yahoo.com.au>
Date:	Fri, 3 Oct 2008 12:32:23 +1000
From:	Nick Piggin <nickpiggin@yahoo.com.au>
To:	Andrew Morton <akpm@linux-foundation.org>
Cc:	Mikulas Patocka <mpatocka@redhat.com>,
	linux-kernel@vger.kernel.org, linux-mm@vger.kernel.org,
	agk@redhat.com, mbroz@redhat.com, chris@arachsys.com
Subject: Re: [PATCH] Memory management livelock

On Wednesday 24 September 2008 08:49, Andrew Morton wrote:
> On Tue, 23 Sep 2008 18:34:20 -0400 (EDT) Mikulas Patocka
> <mpatocka@redhat.com> wrote:
> > > On Mon, 22 Sep 2008 17:10:04 -0400 (EDT) Mikulas Patocka
> > > <mpatocka@redhat.com> wrote:
> > > > The bug happens when one process is doing sequential buffered writes
> > > > to a block device (or file) and another process is attempting to
> > > > execute sync(), fsync() or direct-IO on that device (or file). This
> > > > syncing process will wait indefinitely, until the first writing
> > > > process finishes.
> > > >
> > > > For example, run these two commands:
> > > > dd if=/dev/zero of=/dev/sda1 bs=65536 &
> > > > dd if=/dev/sda1 of=/dev/null bs=4096 count=1 iflag=direct
> > > >
> > > > The bug is caused by sequential walking of address space in
> > > > write_cache_pages and wait_on_page_writeback_range: if some other
> > > > process is constantly making dirty and writeback pages while these
> > > > functions run, the functions will wait on every new page, resulting
> > > > in indefinite wait.

I think the problem has been misidentified, or else I have misread the
code. See below. I hope I'm right, because I think the patches are pretty
heavy on complexity in these already complex paths...

It would help if you explicitly identify the exact livelock. Ie. give a
sequence of behaviour that leads to our progress rate falling to zero.
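
For what it's worth, the two dd commands above can be restated as a
small user-space reproducer (a sketch only: the file name and sizes
are arbitrary, nothing here is from the original report). If the
livelock is real, the fsync() below never returns while the writer
keeps dirtying pages:

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	static char buf[65536];
	const char *path = "testfile";	/* hypothetical test file */

	if (fork() == 0) {
		/* writer: dirty pages as fast as possible, forever */
		int wfd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
		if (wfd < 0) { perror("open writer"); exit(1); }
		for (;;) {
			if (write(wfd, buf, sizeof(buf)) < 0)
				lseek(wfd, 0, SEEK_SET);  /* wrap when full */
		}
	}

	sleep(1);			/* let dirty pages build up */
	int fd = open(path, O_RDWR);
	if (fd < 0) { perror("open syncer"); exit(1); }
	fprintf(stderr, "calling fsync()...\n");
	fsync(fd);			/* does this ever return? */
	fprintf(stderr, "fsync() returned\n");
	return 0;
}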


> > > Shouldn't happen. All the data-syncing functions should have an upper
> > > bound on the number of pages which they attempt to write. In the
> > > example above, we end up in here:
> > >
> > > int __filemap_fdatawrite_range(struct address_space *mapping,
> > > 		loff_t start, loff_t end, int sync_mode)
> > > {
> > > 	int ret;
> > > 	struct writeback_control wbc = {
> > > 		.sync_mode = sync_mode,
> > > 		.nr_to_write = mapping->nrpages * 2,	<<--
> > > 		.range_start = start,
> > > 		.range_end = end,
> > > 	};
> > >
> > > so generic_file_direct_write()'s filemap_write_and_wait() will attempt
> > > to write at most 2* the number of pages which are in cache for that
> > > inode.
> >
> > See write_cache_pages:
> >
> > if (wbc->sync_mode != WB_SYNC_NONE)
> >         wait_on_page_writeback(page);	(1)
> > if (PageWriteback(page) ||
> >     !clear_page_dirty_for_io(page)) {
> >         unlock_page(page);		(2)
> >         continue;
> > }
> > ret = (*writepage)(page, wbc, data);
> > if (unlikely(ret == AOP_WRITEPAGE_ACTIVATE)) {
> >         unlock_page(page);
> >         ret = 0;
> > }
> > if (ret || (--(wbc->nr_to_write) <= 0))
> >         done = 1;
> >
> > --- so if it goes through points (1) and (2), the counter is not
> > decremented, yet the function waits for the page. If there is a constant
> > stream of writeback pages being generated, it waits on each of them ---
> > that is, forever.

*What* is forever? Data integrity syncs should have pages operated on
in-order, until we get to the end of the range. Circular writeback could
go through again, possibly, but no more than once.
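
To spell out the termination argument (a simplified sketch, not real
kernel code; find_next_writeback_page() is an illustrative stand-in
for the pagevec tag lookup):

/*
 * Data-integrity wait over [start, end]: `end` is fixed by the caller
 * and `index` only moves forward, so each page in the range is waited
 * on at most once per pass.  A concurrent writer can put new pages
 * under writeback *behind* the index, but those are only revisited by
 * the (at most one) circular re-pass, not unboundedly.
 */
pgoff_t index = start;
while (index <= end) {
	struct page *page = find_next_writeback_page(mapping, index, end);
	if (!page)
		break;				/* nothing left in range */
	wait_on_page_writeback(page);		/* may block, but ... */
	index = page->index + 1;		/* ... progress is monotonic */
	put_page(page);
}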


> > I have seen livelock in this function. Doesn't that example with the
> > two dd's (one doing buffered writes, the other a direct-IO read)
> > reproduce it for you? For me it livelocks here.
> >
> > wait_on_page_writeback_range is another example where the livelock
> > happened; there is no protection at all against starvation there.
>
> um, OK.  So someone else is initiating IO for this inode and this
> thread *never* gets to initiate any writeback.  That's a bit of a
> surprise.
>
> How do we fix that?  Maybe decrement nr_to_write for these pages as
> well?

What's the actual problem, though? nr_to_write should not be used for
data integrity operations, and it should not be critical for other
writeout. Upper layers should be able to deal with it rather than
have us lying to them.
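
For reference, the change being floated above would look something
like this against the write_cache_pages excerpt --- an untested
sketch, not a proposed patch:

if (wbc->sync_mode != WB_SYNC_NONE)
	wait_on_page_writeback(page);
if (PageWriteback(page) ||
    !clear_page_dirty_for_io(page)) {
	unlock_page(page);
	/* charge skipped pages to the budget too, so a constant
	 * stream of pages under writeback cannot keep the walk
	 * alive with nr_to_write never decremented */
	if (--(wbc->nr_to_write) <= 0)
		done = 1;
	continue;
}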


> > BTW. that .nr_to_write = mapping->nrpages * 2 looks like a dangerous
> > thing to me.
> >
> > Imagine this case: You have two pages with indices 4 and 5 dirty in a
> > file. You call fsync(). It sets nr_to_write to 4.
> >
> > Meanwhile, another process makes pages 0, 1, 2, 3 dirty.
> >
> > The fsync() process goes to write_cache_pages, writes the first 4 dirty
> > pages and exits because it goes over the limit.
> >
> > The result: you violate fsync() semantics --- pages that were dirty
> > before the call to fsync() are not written out when fsync() exits.

Wow, that's really nasty. Sad we still have known data integrity problems
in such core functions.
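
One way to get the semantics right would be to pin down the set of
pages before the walk begins, so pages dirtied after the fsync() call
can neither extend the scan nor eat into the budget. A sketch of the
idea --- the tag name and both helpers are made-up names, and this is
not necessarily what the attached patch does:

/*
 * Snapshot the dirty set at sync time.  Pages dirtied after this
 * point are not tagged TOWRITE, so a concurrent writer can neither
 * starve the walk nor consume nr_to_write.
 * tag_dirty_pages_for_sync() and find_next_towrite_page() are
 * illustrative, not existing kernel interfaces.
 */
tag_dirty_pages_for_sync(mapping, start, end);	/* DIRTY -> TOWRITE */

pgoff_t index = start;
struct page *page;
while ((page = find_next_towrite_page(mapping, &index, end)) != NULL) {
	lock_page(page);
	if (clear_page_dirty_for_io(page))
		(*writepage)(page, wbc, data);	/* ->writepage unlocks */
	else
		unlock_page(page);
	put_page(page);
}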


> yup, that's pretty much unfixable, really, unless new locks are added
> which block threads which are writing to unrelated sections of the
> file, and that could hurt some workloads quite a lot, I expect.

Why is it unfixable? Just ignore nr_to_write, and write out everything
properly, I would have thought.
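
Concretely, the exit test in write_cache_pages could make the budget
advisory for integrity writeback (again, only a sketch):

/* let nr_to_write bound only background writeback; WB_SYNC_ALL
 * callers keep going until the whole range has been written */
if (--(wbc->nr_to_write) <= 0 && wbc->sync_mode == WB_SYNC_NONE)
	done = 1;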

Some things may go a tad slower, but those are going to be the things
that are using fsync, in which case they are going to hurt much more
from the loss of data integrity than from a slowdown.

Unfortunately, because we have played fast and loose for so long,
applications now expect this behaviour: they were tested and optimised
with it, systems were designed and deployed around it, and they will
notice performance regressions if we start trying to do things
properly. This is one of my main arguments for doing things correctly
up-front, even if it means a massive slowdown in some real or imagined
workload: at least then we will hear the complaints and be able to try
to improve things, rather than setting ourselves up for failure later.
/rant

Anyway, in this case, I don't think there would be really big problems.
Also, I think there is a reasonable optimisation that might improve it
(the second-to-last point in the attached patch).

OK, so after glancing at the code... wow, it seems like there are a lot
of bugs in there.

View attachment "mm-fsync-fix.patch" of type "text/x-diff" (7795 bytes)
