Message-ID: <Z0YWrvnz5rYcYrjV@casper.infradead.org>
Date: Tue, 26 Nov 2024 18:42:54 +0000
From: Matthew Wilcox <willy@...radead.org>
To: Anders Blomdell <anders.blomdell@...il.com>
Cc: Jan Kara <jack@...e.cz>, Philippe Troin <phil@...i.org>,
	Andrew Morton <akpm@...ux-foundation.org>,
	linux-fsdevel@...r.kernel.org, linux-mm@...ck.org,
	linux-kernel@...r.kernel.org, NeilBrown <neilb@...e.de>
Subject: Re: Regression in NFS probably due to very large amounts of readahead

On Tue, Nov 26, 2024 at 06:26:13PM +0100, Anders Blomdell wrote:
> On 2024-11-26 17:55, Matthew Wilcox wrote:
> > On Tue, Nov 26, 2024 at 04:28:04PM +0100, Anders Blomdell wrote:
> > > On 2024-11-26 16:06, Jan Kara wrote:
> > > > Hum, checking the history the update of ra->size has been added by Neil two
> > > > years ago in 9fd472af84ab ("mm: improve cleanup when ->readpages doesn't
> > > > process all pages"). Neil, the changelog seems as there was some real
> > > > motivation behind updating of ra->size in read_pages(). What was it? Now I
> > > > somewhat disagree with reducing ra->size in read_pages() because it seems
> > > > like a wrong place to do that and if we do need something like that,
> > > > readahead window sizing logic should rather be changed to take that into
> > > > account? But it all depends on what was the real rationale behind reducing
> > > > ra->size in read_pages()...
> > > 
> > > My (rather limited) understanding of the patch is that it was intended to read those pages
> > > that didn't get read because the allocation of a bigger folio failed, while not redoing what
> > > readpages already did; how it was actually going to accomplish that is still unclear to me,
> > > but I don't even quite understand the comment...
> > > 
> > > 	/*
> > > 	 * If there were already pages in the page cache, then we may have
> > > 	 * left some gaps.  Let the regular readahead code take care of this
> > > 	 * situation.
> > > 	 */
> > > 
> > > The reason for leaving async_size unchanged is also beyond my understanding.
> > 
> > This isn't because we couldn't allocate a folio; this is when we
> > allocated folios, tried to read them, and failed to submit the I/O.
> > This is a pretty rare occurrence under normal conditions.
> 
> I beg to differ: the code is reached when there is
> no large-folio support, or when ra->size < 4 (not considered in
> this discussion), or by falling through when !err; err
> is set by:
> 
>         err = ra_alloc_folio(ractl, index, mark, order, gfp);
>         if (err)
>                 break;
> 
> Isn't the reading done by:
> 
>         read_pages(ractl);
> 
> which does not set err?

You're misunderstanding.  Yes, read_pages() is called when we fail to
allocate a fresh folio, either because there's already one in the
page cache, because of -ENOMEM, or because we raced with someone else
installing one; but it's also called when all folios have been
allocated normally.  Here:

        /*
         * Now start the IO.  We ignore I/O errors - if the folio is not
         * uptodate then the caller will launch read_folio again, and
         * will then handle the error.
         */
        read_pages(ractl);

So at the point that read_pages() is called, all folios that ractl
describes are present in the page cache, locked and !uptodate.
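
For reference, here is a rough sketch of ra_alloc_folio() from
mm/readahead.c (paraphrased from memory, so the details will vary by
kernel version) showing all three outcomes mentioned above:

        static int ra_alloc_folio(struct readahead_control *ractl, pgoff_t index,
                        pgoff_t mark, unsigned int order, gfp_t gfp)
        {
                int err;
                struct folio *folio = filemap_alloc_folio(gfp, order);

                if (!folio)
                        return -ENOMEM;         /* allocation failed */
                if (index == round_down(mark, 1UL << order))
                        folio_set_readahead(folio);
                /*
                 * Adds the folio to the page cache locked and
                 * !uptodate.  Fails with -EEXIST if a folio is
                 * already present at this index (including when we
                 * raced with someone else installing one).
                 */
                err = filemap_add_folio(ractl->mapping, folio, index, gfp);
                if (err) {
                        folio_put(folio);
                        return err;
                }
                ractl->_nr_pages += 1UL << order;
                return 0;       /* folio is now part of what ractl describes */
        }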

After calling aops->readahead() in read_pages(), most filesystems will
have consumed all folios described by ractl.  It seems that NFS is
choosing not to submit some folios, so rather than leave them sitting
around locked and !uptodate, Neil decided that we should remove them
from the page cache.
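
The cleanup in question, as added by 9fd472af84ab (again paraphrased
from mm/readahead.c, so it may not match your tree line for line):

        if (aops->readahead) {
                aops->readahead(rac);
                /*
                 * Clean up the remaining folios.  The sizes in ->ra
                 * may be used to size the next readahead, so make sure
                 * they accurately reflect what happened.
                 */
                while ((folio = readahead_folio(rac)) != NULL) {
                        unsigned long nr = folio_nr_pages(folio);

                        folio_get(folio);
                        rac->ra->size -= nr;
                        if (rac->ra->async_size >= nr) {
                                rac->ra->async_size -= nr;
                                filemap_remove_folio(folio);
                        }
                        folio_unlock(folio);
                        folio_put(folio);
                }
        }

This loop is where ra->size and ra->async_size get reduced for every
folio that ->readahead() left unconsumed.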
