linux-kernel - Re: [patch 1/3] fix illogical behavior in balance_dirty

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-Id: <E1HVkWw-0000yj-00@dorka.pomaz.szeredi.hu>
Date:	Mon, 26 Mar 2007 10:26:18 +0200
From:	Miklos Szeredi <miklos@...redi.hu>
To:	akpm@...ux-foundation.org
CC:	dgc@....com, linux-kernel@...r.kernel.org
Subject: Re: [patch 1/3] fix illogical behavior in balance_dirty_pages()

> > This is a slightly different take on the fix for the deadlock in fuse
> > with dirty balancing.  David Chinner convinced me, that per-bdi
> > counters are too expensive, and that it's not worth trying to account
> > the number of pages under writeback, as they will be limited by the
> > queue anyway.
> > 
> > ----
> > From: Miklos Szeredi <mszeredi@...e.cz>
> > 
> > Current behavior of balance_dirty_pages() is to try to start writeout
> > into the specified queue for at least "write_chunk" number of pages.
> > If "write_chunk" pages have been submitted, then return.
> > 
> > However if there are less than "write_chunk" dirty pages for this
> > queue, then it doesn't return, waiting for the global dirty+writeback
> > counters to subside, but without doing any actual work.
> > 
> > This is illogical behavior: it allows more dirtyings while there are
> > dirty pages, but stops further dirtying completely if there are no
> > more dirty pages.
> 
> That behaviour is perfectly logical.  It prevents the number of
> dirty+writeback pages from exceeding dirty_ratio.

But it does it in an illogical way.  While it's cleaning your pages,
it allows more dirtyings, but when it has cleaned ALL the pages
destined for this queue, you hit a wall and stand still waiting for
other queues to get their act together even if this queue is
_completely idle_.

I think this accounts for 90% of the problems experienced with copying
to slow devices.

> > It also makes a deadlock possible when one filesystem is writing data
> > through another, and the balance_dirty_pages() for the lower
> > filesystem is stalling the writeback for the upper filesystem's
> > data (*).
> 
> I still don't understand this one.  I got lost when belatedly told that
> i_mutex had something to do with it.

This deadlock only happens, if there's some bottleneck for writing
data to the lower filesystem.   This bottleneck could be 

  - i_mutex, preventing parallel writes to the same inode
  - limited number of filesystem threads
  - limited request queue length in the upper filesystem

Imagine it this way: balance_dirty_pages() for the lower filesystem is
stalling a write() because dirty pages in the upper filesystem are
over the limit.  Because there's a bottleneck for writing to the lower
filesystem, this is stalling _other_ writes from completing.  So
there's no progress in writing back pages from the upper filesystem.

The problem is not that dirty pages are not being submitted for
writeback, rather that the writeback itself is stalling.

Hmm?

> > Index: linux/mm/page-writeback.c
> > ===================================================================
> > --- linux.orig/mm/page-writeback.c	2007-03-24 22:06:56.000000000 +0100
> > +++ linux/mm/page-writeback.c	2007-03-24 22:29:02.000000000 +0100
> > @@ -207,7 +207,15 @@ static void balance_dirty_pages(struct a
> >  		 * written to the server's write cache, but has not yet
> >  		 * been flushed to permanent storage.
> >  		 */
> > -		if (nr_reclaimable) {
> > +		if (!nr_reclaimable) {
> > +			/*
> > +			 * If there's nothing more to write back and this queue
> > +			 * is uncongested,  then it is possible to quickly
> > +			 * write out some more data, so let's not wait
> > +			 */
> > +			if (!bdi_write_congested(bdi))
> > +				break;
> > +		} else {
> 
> This says "if there are no dirty pages in the machine at all, then go back
> and dirty some more pages, regardless of the present number of
> under-writeback pages".

No, this says, that "there are no dirty pages in the machine at all,
and this queue is uncongested, so you can safely go back and dirty
some more data, because we are sure it can be written back quickly".

Ditto for the case, when there are no more dirty pages destined for
this queue.

I understand, that this can fill up the memory with under writeback
pages, but only if the data sitting in all the device queues is
comparable to the total memory.  I don't know what the realistic
chance of that is but David Chinner convinced me, that it's not
something that happens in real life.  Quoting him from an earlier
mail:

| Right, and most ppl don't have enough devices in their system for
| this to be a problem. Even those of us that do have enough devices
| for this to potentially be a problem usually have enough RAM in
| the machine so that it is not a problem....

Miklos

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/