Message-ID: <Z3OeJ9KGLQOt1KOI@gourry-fedora-PF4VCD3F>
Date: Tue, 31 Dec 2024 02:32:55 -0500
From: Gregory Price <gourry@...rry.net>
To: "Huang, Ying" <ying.huang@...ux.alibaba.com>
Cc: linux-mm@...ck.org, linux-kernel@...r.kernel.org, nehagholkar@...a.com,
	abhishekd@...a.com, kernel-team@...a.com, david@...hat.com,
	nphamcs@...il.com, akpm@...ux-foundation.org, hannes@...xchg.org,
	kbusch@...a.com
Subject: Re: [RFC v2 PATCH 0/5] Promotion of Unmapped Page Cache Folios.

On Fri, Dec 27, 2024 at 10:38:45PM -0500, Gregory Price wrote:
> On Fri, Dec 27, 2024 at 02:09:50PM -0500, Gregory Price wrote:
> 
> This seems to imply that the overhead we're seeing from read() even
> when filecache is on the remote node isn't actually related to the
> memory speed, but instead likely related to some kind of stale
> metadata in the filesystem or filecache layers.
> 
> ~Gregory

Mystery solved

> +void promotion_candidate(struct folio *folio)
> +{
... snip ...
> +	list_add(&folio->lru, promo_list);
> +}

read(file, length) will do a linear read, and promotion_candidate will
add those pages to the head of the promotion list, resulting in a
reversed promotion order.

So if you read folios [1,2,3,4], you'll promote them in [4,3,2,1] order.
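
To make the ordering concrete, here is a minimal userspace sketch (not
the kernel code).  The list helpers are re-implemented to mirror the
semantics of the kernel's list_add() and list_add_tail().  The names
struct fake_folio, NFOLIOS and promotion_order() are hypothetical and
exist only for illustration, and the front-to-back walk assumes the
consumer of promo_list processes entries starting from the head.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-in for the kernel's struct list_head (not the real header). */
struct list_head {
	struct list_head *next, *prev;
};

static void init_list_head(struct list_head *head)
{
	head->next = head;
	head->prev = head;
}

/* Insert right after the head: the newest entry is seen first (LIFO). */
static void list_add(struct list_head *entry, struct list_head *head)
{
	entry->next = head->next;
	entry->prev = head;
	head->next->prev = entry;
	head->next = entry;
}

/* Insert right before the head: entries keep their arrival order (FIFO). */
static void list_add_tail(struct list_head *entry, struct list_head *head)
{
	entry->prev = head->prev;
	entry->next = head;
	head->prev->next = entry;
	head->prev = entry;
}

/* Hypothetical stand-in for a folio: an index plus its list linkage. */
struct fake_folio {
	int index;
	struct list_head lru;
};

#define NFOLIOS 4

/*
 * Queue folios 1..NFOLIOS in read order, then walk the list front to
 * back and record the indices in out[].  use_tail selects which of the
 * two insertion helpers is used.
 */
static void promotion_order(int use_tail, int out[NFOLIOS])
{
	struct fake_folio folios[NFOLIOS];
	struct list_head promo_list;
	struct list_head *pos;
	int i, n = 0;

	init_list_head(&promo_list);
	for (i = 0; i < NFOLIOS; i++) {
		folios[i].index = i + 1;
		if (use_tail)
			list_add_tail(&folios[i].lru, &promo_list);
		else
			list_add(&folios[i].lru, &promo_list);
	}

	for (pos = promo_list.next; pos != &promo_list; pos = pos->next) {
		/* Recover the containing fake_folio from its lru member. */
		struct fake_folio *f = (struct fake_folio *)
			((char *)pos - offsetof(struct fake_folio, lru));
		out[n++] = f->index;
	}
	assert(n == NFOLIOS);
}
```

Calling promotion_order(0, out) fills out[] with 4,3,2,1 (head
insertion), while promotion_order(1, out) fills it with 1,2,3,4 (tail
insertion, matching the read order).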

The result of this, on an unloaded system, is essentially that pages end
up in the worst possible configuration for the prefetcher, and therefore
for TLB hits.  I figured this out because I was seeing the additional
~30% overhead show up purely in `copy_page_to_iter()` (i.e. copy_to_user).

Swapping this for list_add_tail results in the following test result:

initializing
Read loop took 9.41 seconds  <- reading from CXL
Read loop took 31.74 seconds <- migration enabled
Read loop took 10.31 seconds
Read loop took 7.71 seconds  <-  migration finished
Read loop took 7.71 seconds
Read loop took 7.70 seconds
Read loop took 7.75 seconds
Read loop took 19.34 seconds <- dropped caches
Read loop took 13.68 seconds <- cache refilling to DRAM
Read loop took 7.37 seconds
Read loop took 7.68 seconds
Read loop took 7.65 seconds  <- back to DRAM baseline

On our CXL devices, we're seeing a 22-27% performance penalty for a file
being hosted entirely out of CXL.  When we promote this file out of CXL,
we see a 22-27% performance boost.

list_add_tail is probably right here: since files *tend to* be read
linearly with `read()`, this should *tend toward* optimal.  That said, we
can probably make this more reliable by adding a batch migration function
`mpol_migrate_misplaced_batch()` which also tries to do bulk allocation
of destination folios.  This will also probably save us a bunch of
invalidation overhead.

I'm also noticing that the migration limit (256mbps) is not being
respected, probably because we're doing 1 folio at a time instead of a
batch.  Will probably look at changing promotion_candidate to limit the
number of selected pages to promote per read-call.

---

diff --git a/mm/migrate.c b/mm/migrate.c
index f965814b7d40..99b584f22bcb 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2675,7 +2675,7 @@ void promotion_candidate(struct folio *folio)
                folio_putback_lru(folio);
                return;
        }
-       list_add(&folio->lru, promo_list);
+       list_add_tail(&folio->lru, promo_list);

        return;
 }
