Message-ID: <CALf+9YcyxRisLbPqn0uy-tRhtUFWNxjyzxSwyONmNe2AV-EV=Q@mail.gmail.com>
Date: Tue, 21 Jan 2025 12:03:20 -0600
From: Vinay Banakar <vny@...gle.com>
To: Byungchul Park <byungchul@...com>, linux-mm@...ck.org, linux-kernel@...r.kernel.org, 
	willy@...radead.org
Cc: akpm@...ux-foundation.org, mgorman@...e.de, Wei Xu <weixugc@...gle.com>, 
	Greg Thelen <gthelen@...gle.com>, kernel_team@...ynix.com
Subject: Re: [PATCH] mm: Optimize TLB flushes during page reclaim

On Mon, Jan 20, 2025 at 7:44 PM Byungchul Park <byungchul@...com> wrote:
> The *interesting* IPIs will be reduced by 1/512 at most.  Can we see the
> improvement number?

Yes, we reduce IPIs by up to a factor of 512 by sending one IPI (for
the TLB flush) per PMD rather than one per page. Since
shrink_folio_list() operates on one PMD's worth of pages at a time, I
believe we can safely batch these flushes here.
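
To make the shape of the change concrete, here is a toy userspace
illustration of the two flushing strategies (not the actual patch;
send_flush_ipi() is a made-up stand-in for the kernel's IPI-based
remote TLB flush):

#include <stdio.h>

#define PTRS_PER_PMD 512    /* 4 KiB PTEs per 2 MiB PMD */

static unsigned long ipis;

/* Hypothetical stand-in for the IPI-based remote TLB flush. */
static void send_flush_ipi(void)
{
    ipis++;
}

/* Current behaviour: flush remote TLBs once per reclaimed page. */
static void reclaim_per_page(unsigned long npages)
{
    for (unsigned long i = 0; i < npages; i++) {
        /* ... unmap page i ... */
        send_flush_ipi();
    }
}

/* Batched behaviour: defer and flush once per PMD-sized batch. */
static void reclaim_per_pmd(unsigned long npages)
{
    unsigned long pending = 0;

    for (unsigned long i = 0; i < npages; i++) {
        /* ... unmap page i, only note that a flush is owed ... */
        if (++pending == PTRS_PER_PMD) {
            send_flush_ipi();
            pending = 0;
        }
    }
    if (pending)
        send_flush_ipi();   /* flush the partial tail batch */
}

int main(void)
{
    unsigned long npages = 4 * PTRS_PER_PMD + 17;   /* arbitrary example */

    ipis = 0;
    reclaim_per_page(npages);
    printf("per-page flushes: %lu IPIs\n", ipis);

    ipis = 0;
    reclaim_per_pmd(npages);
    printf("per-PMD flushes:  %lu IPIs\n", ipis);
    return 0;
}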

Here's a concrete example:
When swapping out 20 GiB (5.2M 4 KiB pages):
- Current: Each page triggers an IPI to all cores
  - With 6 cores: 31.4M total interrupts (6 cores × ~5.24M pages)
- With patch: One IPI per PMD (512 pages)
  - Only 10.2K IPIs required (5.2M/512)
  - With 6 cores: 61.4K total interrupts
  - Results in ~99% reduction in total interrupts
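
For completeness, those figures are just the following arithmetic
(assuming 4 KiB pages; the exact values truncate to the rounded
numbers quoted above):

#include <stdio.h>

int main(void)
{
    unsigned long pages = (20UL << 30) / 4096;  /* 20 GiB of 4 KiB pages */
    unsigned long cores = 6;

    printf("pages:               %lu\n", pages);               /* 5242880  -> ~5.2M  */
    printf("per-page interrupts: %lu\n", pages * cores);       /* 31457280 -> ~31.4M */
    printf("per-PMD IPIs:        %lu\n", pages / 512);         /* 10240    -> ~10.2K */
    printf("per-PMD interrupts:  %lu\n", pages / 512 * cores); /* 61440    -> ~61.4K */
    return 0;
}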

Application performance impact varies by workload, but here's a
representative test case (a rough reproducer sketch follows the
results below):
- Thread 1: Continuously accesses a 2 GiB private anonymous map (64 B
chunks at random offsets)
- Thread 2: Pinned to a different core, uses MADV_PAGEOUT on a 20 GiB
private anonymous map to swap it out to SSD
- The threads only access their respective maps.
Results:
  - Without patch: Thread 1 sees a ~53% throughput reduction during
the swap-out. With multiple worker threads like thread 1, the
cumulative throughput degradation would be much higher
  - With patch: Thread 1 maintains normal throughput
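
For reference, a rough userspace sketch of that scenario (not the
exact harness behind the numbers above; sizes are scaled down, and
core pinning and per-interval throughput measurement are omitted):

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#ifndef MADV_PAGEOUT
#define MADV_PAGEOUT 21           /* Linux 5.4+ */
#endif

/* Sizes scaled down from the 2 GiB / 20 GiB maps used in the test. */
#define HOT_SIZE  (256UL << 20)   /* hot private anonymous map */
#define COLD_SIZE (2UL << 30)     /* cold map pushed out with MADV_PAGEOUT */
#define CHUNK     64UL            /* access granularity of thread 1 */

static atomic_ulong accesses;
static atomic_int done;

/* Thread 1: touch random 64 B chunks of the hot map until reclaim ends. */
static void *hot_worker(void *arg)
{
    volatile char *map = arg;
    unsigned long seed = 1;

    while (!atomic_load(&done)) {
        seed = seed * 6364136223846793005UL + 1;
        map[(seed >> 16) % (HOT_SIZE / CHUNK) * CHUNK] = 1;
        atomic_fetch_add(&accesses, 1);
    }
    return NULL;
}

/* Thread 2: ask the kernel to reclaim (swap out) the cold map. */
static void *reclaim_worker(void *arg)
{
    if (madvise(arg, COLD_SIZE, MADV_PAGEOUT))
        perror("madvise(MADV_PAGEOUT)");
    atomic_store(&done, 1);
    return NULL;
}

int main(void)
{
    char *hot = mmap(NULL, HOT_SIZE, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    char *cold = mmap(NULL, COLD_SIZE, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    pthread_t t1, t2;

    if (hot == MAP_FAILED || cold == MAP_FAILED) {
        perror("mmap");
        return 1;
    }
    memset(hot, 1, HOT_SIZE);     /* populate both maps */
    memset(cold, 1, COLD_SIZE);

    pthread_create(&t1, NULL, hot_worker, hot);
    pthread_create(&t2, NULL, reclaim_worker, cold);
    pthread_join(t2, NULL);
    pthread_join(t1, NULL);

    /* The real test tracked throughput over time; this just totals it. */
    printf("hot-map accesses during pageout: %lu\n",
           atomic_load(&accesses));
    return 0;
}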

I expect a similar application performance impact when memory reclaim
is triggered by kswapd.
