linux-kernel - Re: [PATCH] Memory management livelock

Message-ID: <Pine.LNX.4.64.0809231817390.11559@hs20-bc2-1.build.redhat.com>
Date:	Tue, 23 Sep 2008 18:34:20 -0400 (EDT)
From:	Mikulas Patocka <mpatocka@...hat.com>
To:	linux-kernel@...r.kernel.org, akpm@...ux-foundation.org,
	linux-mm@...r.kernel.org
cc:	Alasdair G Kergon <agk@...hat.com>, Milan Broz <mbroz@...hat.com>,
	Chris Webb <chris@...chsys.com>
Subject: Re: [PATCH] Memory management livelock

> On Mon, 22 Sep 2008 17:10:04 -0400 (EDT)
> Mikulas Patocka <mpatocka@...xxxxxxx> wrote:
> 
> > The bug happens when one process is doing sequential buffered writes to
> > a block device (or file) and another process is attempting to execute
> > sync(), fsync() or direct-IO on that device (or file). This syncing
> > process will wait indefinitely, until the first writing process
> > finishes.
> >
> > For example, run these two commands:
> > dd if=/dev/zero of=/dev/sda1 bs=65536 &
> > dd if=/dev/sda1 of=/dev/null bs=4096 count=1 iflag=direct
> >
> > The bug is caused by sequential walking of address space in
> > write_cache_pages and wait_on_page_writeback_range: if some other
> > process is constantly making dirty and writeback pages while these
> > functions run, the functions will wait on every new page, resulting in
> > indefinite wait.
> 
> Shouldn't happen. All the data-syncing functions should have an upper
> bound on the number of pages which they attempt to write. In the
> example above, we end up in here:
> 
> int __filemap_fdatawrite_range(struct address_space *mapping, loff_t start,
> 				loff_t end, int sync_mode)
> {
> 	int ret;
> 	struct writeback_control wbc = {
> 		.sync_mode = sync_mode,
> 		.nr_to_write = mapping->nrpages * 2,	<<--
> 		.range_start = start,
> 		.range_end = end,
> 	};
> 
> so generic_file_direct_write()'s filemap_write_and_wait() will attempt
> to write at most 2* the number of pages which are in cache for that inode.

See write_cache_pages:

if (wbc->sync_mode != WB_SYNC_NONE)
        wait_on_page_writeback(page);	(1)
if (PageWriteback(page) ||
    !clear_page_dirty_for_io(page)) {
        unlock_page(page);		(2)
        continue;
}
ret = (*writepage)(page, wbc, data);
if (unlikely(ret == AOP_WRITEPAGE_ACTIVATE)) {
        unlock_page(page);
        ret = 0;
}
if (ret || (--(wbc->nr_to_write) <= 0))
        done = 1;

So if the loop passes through points (1) and (2), nr_to_write is not
decremented, yet the function has still waited for the page. If there is a
constant stream of writeback pages being generated, it waits on each of
them, that is, forever. I have seen a livelock in this function. Does that
example with two dd's (one buffered write, the other a direct-I/O read) not
reproduce for you? For me it livelocks here.
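
To make the path through (1) and (2) concrete, here is a small user-space
sketch of the loop. It is only a model, not kernel code: struct sim_page,
struct sim_result and sim_write_cache_pages() are names invented for this
illustration, and "waiting" is reduced to a counter. It assumes a page that
someone else put under writeback is already clean by the time we have waited
for it, so it is skipped without touching the budget.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space model of a page; not the kernel's struct page. */
struct sim_page {
	bool writeback;	/* under writeback started by another writer */
	bool dirty;	/* dirty and not yet submitted */
};

struct sim_result {
	long waits;	/* how often the loop blocked at point (1) */
	long written;	/* pages this caller submitted itself */
};

/*
 * Walk pages[0..n) the way the WB_SYNC_ALL branch of the quoted loop does.
 * nr_to_write is decremented only when this caller writes a page.
 */
struct sim_result sim_write_cache_pages(struct sim_page *pages, size_t n,
					long nr_to_write)
{
	struct sim_result r = { 0, 0 };

	assert(pages != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		struct sim_page *p = &pages[i];

		if (p->writeback) {
			/* point (1): block until the other writer's I/O ends */
			r.waits++;
			p->writeback = false;
		}
		if (!p->dirty)
			continue;	/* point (2): skipped, budget untouched */

		p->dirty = false;	/* stands in for (*writepage)() */
		r.written++;
		if (--nr_to_write <= 0)
			break;
	}
	return r;
}
```

With 1000 pages that are all under writeback and nr_to_write = 4, this model
waits 1000 times and writes nothing, so the budget never ends the walk. In
the real case the other writer keeps adding such pages, so the walk has no
end at all.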

wait_on_page_writeback_range is another place where the livelock happened;
it has no protection at all against starvation.


BTW, that .nr_to_write = mapping->nrpages * 2 looks dangerous to me.

Imagine this case: you have two dirty pages, with indices 4 and 5, in a
file. You call fsync(). It sets nr_to_write to 4 (with only those two pages
in the page cache, nrpages * 2 = 4).

Meanwhile, another process makes pages 0, 1, 2, 3 dirty.

The fsync() process goes to write_cache_pages, writes the first 4 dirty
pages (0 to 3) and exits because it has reached the limit.

Result: fsync() semantics are violated. Pages that were dirty before the
call to fsync() are not written when fsync() returns.
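
The scenario above can be replayed with a few lines of user-space C. Again
this is only a sketch under stated assumptions: sim_sync_pass() and
sim_fsync_race() are hypothetical names, the page cache is modelled as an
array of dirty flags, and pages are written strictly in ascending index
order until the budget runs out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SIM_NPAGES 8

/*
 * Write dirty pages in ascending index order until nr_to_write is used up.
 * Returns the number of pages written.
 */
long sim_sync_pass(bool dirty[], size_t n, long nr_to_write)
{
	long written = 0;

	assert(dirty != NULL);
	for (size_t i = 0; i < n && nr_to_write > 0; i++) {
		if (!dirty[i])
			continue;
		dirty[i] = false;
		written++;
		nr_to_write--;
	}
	return written;
}

/*
 * Replay the race. Returns how many of the pages that were dirty before
 * fsync() was called are still dirty when it returns.
 */
int sim_fsync_race(void)
{
	bool dirty[SIM_NPAGES] = { false };
	long nr_to_write;

	dirty[4] = true;		/* dirty before fsync() */
	dirty[5] = true;
	nr_to_write = 2 * 2;		/* nrpages * 2, assuming nrpages == 2 */

	for (int i = 0; i < 4; i++)	/* the other process dirties 0..3 */
		dirty[i] = true;

	sim_sync_pass(dirty, SIM_NPAGES, nr_to_write);

	return (dirty[4] ? 1 : 0) + (dirty[5] ? 1 : 0);
}
```

In this model sim_fsync_race() returns 2: the whole budget is spent on pages
0 to 3, and pages 4 and 5 are still dirty afterwards.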

> I'd say that either a) that logic got broken or b) you didn't wait long
> enough, and we might need to do something to make it not wait so long.
> 
> But before we patch anything we should fully understand what is
> happening and why the current anti-livelock code isn't working in this
> case.

Mikulas