linux-kernel - Re: [PATCH v8 03/10] readahead: allocate folios with mapping_min

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20240703141042.ihpay2xxsya44yys@quentin>
Date: Wed, 3 Jul 2024 14:10:42 +0000
From: "Pankaj Raghav (Samsung)" <kernel@...kajraghav.com>
To: "Darrick J. Wong" <djwong@...nel.org>
Cc: david@...morbit.com, willy@...radead.org, chandan.babu@...cle.com,
	brauner@...nel.org, akpm@...ux-foundation.org,
	linux-kernel@...r.kernel.org, yang@...amperecomputing.com,
	linux-mm@...ck.org, john.g.garry@...cle.com,
	linux-fsdevel@...r.kernel.org, hare@...e.de, p.raghav@...sung.com,
	mcgrof@...nel.org, gost.dev@...sung.com, cl@...amperecomputing.com,
	linux-xfs@...r.kernel.org, hch@....de, Zi Yan <zi.yan@...t.com>
Subject: Re: [PATCH v8 03/10] readahead: allocate folios with
 mapping_min_order in readahead

On Tue, Jul 02, 2024 at 12:38:30PM -0700, Darrick J. Wong wrote:
> On Tue, Jun 25, 2024 at 11:44:13AM +0000, Pankaj Raghav (Samsung) wrote:
> > From: Pankaj Raghav <p.raghav@...sung.com>
> > 
> > page_cache_ra_unbounded() was allocating single pages (0 order folios)
> > if there was no folio found in an index. Allocate mapping_min_order folios
> > as we need to guarantee the minimum order if it is set.
> > While we are at it, rework the loop in page_cache_ra_unbounded() to
> > advance with the number of pages in a folio instead of just one page at
> > a time.
> 
> Ok, sounds pretty straightforward so far.
> 
> > page_cache_ra_order() tries to allocate folio to the page cache with a
> > higher order if the index aligns with that order. Modify it so that the
> > order does not go below the mapping_min_order requirement of the page
> > cache. This function will do the right thing even if the new_order passed
> > is less than the mapping_min_order.
> 
> Hmm.  So if I'm understanding this correctly: Currently,
> page_cache_ra_order tries to allocate higher order folios if the
> readahead index happens to be aligned to one of those higher orders.
> With the minimum mapping order requirement, it now expands the readahead
> range upwards and downwards to maintain the mapping_min_order
> requirement.  Right?
> 
Yes. We only expand because the index that was passed needs to included
and not excluded.

> > When adding new folios to the page cache we must also ensure the index
> > used is aligned to the mapping_min_order as the page cache requires the
> > index to be aligned to the order of the folio.
> > 
> > readahead_expand() is called from readahead aops to extend the range of
> > the readahead so this function can assume ractl->_index to be aligned with
> > min_order.
> 
> ...and I guess this function also has to be modified to expand the ra
> range even further if necessary to align with mapping_min_order.  Right?

Yes! This function is a bit different from other two because this is
called from readahead aops callback while the other two are responsible
for calling the readahead aops. That is why we can assume the index
passed is aligned to min order.

> 
> > Signed-off-by: Pankaj Raghav <p.raghav@...sung.com>
> > Co-developed-by: Hannes Reinecke <hare@...e.de>
> > Signed-off-by: Hannes Reinecke <hare@...e.de>
> > ---
> >  mm/readahead.c | 81 +++++++++++++++++++++++++++++++++++++++-----------
> >  1 file changed, 63 insertions(+), 18 deletions(-)
> > 
> > diff --git a/mm/readahead.c b/mm/readahead.c
> > index 66058ae02f2e..2acfd6447d7b 100644
> > --- a/mm/readahead.c
> > +++ b/mm/readahead.c
> > @@ -206,9 +206,10 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
> >  		unsigned long nr_to_read, unsigned long lookahead_size)
> >  {
> >  	struct address_space *mapping = ractl->mapping;
> > -	unsigned long index = readahead_index(ractl);
> > +	unsigned long ra_folio_index, index = readahead_index(ractl);
> >  	gfp_t gfp_mask = readahead_gfp_mask(mapping);
> > -	unsigned long i;
> > +	unsigned long mark, i = 0;
> > +	unsigned int min_nrpages = mapping_min_folio_nrpages(mapping);
> >  
> >  	/*
> >  	 * Partway through the readahead operation, we will have added
> > @@ -223,10 +224,26 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
> 
> I'm not as familiar with this function since xfs/iomap don't use it.

This function ultimately invokes xfs_vm_readahead through read_pages().
So it sort of sits above xfs aops.

> >  	unsigned int nofs = memalloc_nofs_save();
> >  
> >  	filemap_invalidate_lock_shared(mapping);
> > +	index = mapping_align_index(mapping, index);
> > +
> > +	/*
> > +	 * As iterator `i` is aligned to min_nrpages, round_up the
> > +	 * difference between nr_to_read and lookahead_size to mark the
> > +	 * index that only has lookahead or "async_region" to set the
> > +	 * readahead flag.
> > +	 */
> > +	ra_folio_index = round_up(readahead_index(ractl) + nr_to_read - lookahead_size,
> > +				  min_nrpages);
> 
> So at this point we've rounded index down and the readahead region up to
> fit the min_nrpages requirement.  I'm not sure what the lookahead region
> does, since nobody passes nonzero.  Judging from the other functions, I
> guess that's the region that we're allowed to do asynchronously?
> 
> > +	mark = ra_folio_index - index;
> 
> Ah, ok, yes.  We mark the first folio in the "async" region so that we
> (re)start readahead when someone accesses that folio.

Yes. I think we consider it as a hit, so we could expand the readahead
window now.

> 
> > +	if (index != readahead_index(ractl)) {
> > +		nr_to_read += readahead_index(ractl) - index;
> > +		ractl->_index = index;
> > +	}
> 
> So then if we rounded inded down, now we have to add that to the ra
> region.

Yes. I could also make it unconditional because if index ==
readahead_index(ractl), nr_to_read and ractl->_index will just remain
the same. Probably that is more efficient than having a conditinal and
then a substraction.

> 
> > +
> >  	/*
> > +	if (!mapping_large_folio_support(mapping) || ra->size < min_ra_size)
> >  		goto fallback;
> >  
> >  	limit = min(limit, index + ra->size - 1);
> > @@ -507,11 +532,20 @@ void page_cache_ra_order(struct readahead_control *ractl,
> >  		new_order += 2;
> >  		new_order = min(mapping_max_folio_order(mapping), new_order);
> >  		new_order = min_t(unsigned int, new_order, ilog2(ra->size));
> > +		new_order = max(new_order, min_order);
> >  	}
> >  
> >  	/* See comment in page_cache_ra_unbounded() */
> >  	nofs = memalloc_nofs_save();
> >  	filemap_invalidate_lock_shared(mapping);
> > +	/*
> > +	 * If the new_order is greater than min_order and index is
> > +	 * already aligned to new_order, then this will be noop as index
> > +	 * aligned to new_order should also be aligned to min_order.
> > +	 */
> > +	ractl->_index = mapping_align_index(mapping, index);
> > +	index = readahead_index(ractl);
> 
> I guess this also rounds index down to mapping_min_order...
Yes.
> 
> > +
> >  	while (index <= limit) {
> >  		unsigned int order = new_order;
> >  
> > @@ -519,7 +553,7 @@ void page_cache_ra_order(struct readahead_control *ractl,
> >  		if (index & ((1UL << order) - 1))
> >  			order = __ffs(index);
> >  		/* Don't allocate pages past EOF */
> > -		while (index + (1UL << order) - 1 > limit)
> > +		while (order > min_order && index + (1UL << order) - 1 > limit)
> >  			order--;
> 
> ...and then we try to find an order that works and doesn't go below
> min_order.  We already rounded index down to mapping_min_order, so that
> will always succeed.  Right?

Yes. We never go less than min order, even if it means extending
slightly beyond the isize (hence also the mmap fix later on in the
series).

> 
> >  		err = ra_alloc_folio(ractl, index, mark, order, gfp);
> >  		if (err)
> >  
> > @@ -821,9 +864,11 @@ void readahead_expand(struct readahead_control *ractl,
> >  		if (folio && !xa_is_value(folio))
> >  			return; /* Folio apparently present */
> >  
> > -		folio = filemap_alloc_folio(gfp_mask, 0);
> > +		folio = filemap_alloc_folio(gfp_mask, min_order);
> >  		if (!folio)
> >  			return;
> > +
> > +		index = mapping_align_index(mapping, index);
> >  		if (filemap_add_folio(mapping, folio, index, gfp_mask) < 0) {
> >  			folio_put(folio);
> >  			return;
> > @@ -833,10 +878,10 @@ void readahead_expand(struct readahead_control *ractl,
> >  			ractl->_workingset = true;
> >  			psi_memstall_enter(&ractl->_pflags);
> >  		}
> > -		ractl->_nr_pages++;
> > +		ractl->_nr_pages += min_nrpages;
> >  		if (ra) {
> > -			ra->size++;
> > -			ra->async_size++;
> > +			ra->size += min_nrpages;
> > +			ra->async_size += min_nrpages;
> 
> ...and here we are expanding the ra window yet again, only now taking
> min order into account.  Right?  Looks ok to me, though again, iomap/xfs
> don't use this function so I'm not that familiar with it.

It is used only by few filesystems but that is the idea. Before we used
to add single pages but now we add min_order worth of pages if it is
set. Very similar to previous patch.

> 
> If the answer to /all/ the questions is 'yes' then
> 
> Acked-by: Darrick J. Wong <djwong@...nel.org>

Thanks, I will add it :)