Message-Id: <20250310172318.653630-1-sj@kernel.org>
Date: Mon, 10 Mar 2025 10:23:09 -0700
From: SeongJae Park <sj@...nel.org>
To: Andrew Morton <akpm@...ux-foundation.org>
Cc: SeongJae Park <sj@...nel.org>,
	"Liam R. Howlett" <howlett@...il.com>,
	David Hildenbrand <david@...hat.com>,
	Lorenzo Stoakes <lorenzo.stoakes@...cle.com>,
	Shakeel Butt <shakeel.butt@...ux.dev>,
	Vlastimil Babka <vbabka@...e.cz>,
	kernel-team@...a.com,
	linux-kernel@...r.kernel.org,
	linux-mm@...ck.org
Subject: [PATCH 0/9] mm/madvise: batch tlb flushes for MADV_DONTNEED and MADV_FREE

When process_madvise() is called to do MADV_DONTNEED[_LOCKED] or
MADV_FREE with multiple address ranges, a tlb flush happens for each of
the given address ranges.  Because those tlb flushes are for the same
process, doing them in a batch is more efficient while still being
safe.  Modify the process_madvise() entry level code path to do such
batched tlb flushes, while the internal unmap logic does only the
gathering of the tlb entries to flush.

In more detail, modify the entry functions to initialize an mmu_gather
object and pass it to the internal logic.  Make the internal logic only
gather the tlb entries to flush into the received mmu_gather object.
After all internal function calls are done, the entry functions flush
the gathered tlb entries at once.
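
The entry-level flow conceptually looks like the following.  This is a
minimal sketch only: tlb_gather_mmu() and tlb_finish_mmu() are the
existing mmu_gather API, but madvise_do_behavior_batched() and
madvise_one_range() are placeholder names for illustration, not the
functions this series actually adds.

	static int madvise_do_behavior_batched(struct mm_struct *mm,
			const struct iovec *ranges, size_t nr, int behavior)
	{
		struct mmu_gather tlb;
		size_t i;
		int err = 0;

		tlb_gather_mmu(&tlb, mm);	/* entry level: initialize once */
		for (i = 0; i < nr && !err; i++)
			/* internal logic only gathers entries into 'tlb' */
			err = madvise_one_range(mm, &ranges[i], behavior, &tlb);
		tlb_finish_mmu(&tlb);		/* entry level: flush once */

		return err;
	}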

The inefficiency is smaller for the madvise() use case, since it
receives only a single address range.  But if the range spans multiple
vmas, the same problem can happen.  It is unclear whether such a use
case is common and the inefficiency is significant.  Make the change
for madvise(), too, since it doesn't really change madvise() internal
behavior while it helps keep the code shared between process_madvise()
and madvise() internal logic clean.

Patch Sequence
==============

The first four patches are minor cleanups of madvise.c for readability.

The fifth patch defines a new data structure for managing the
information required for batched tlb flushes (mmu_gather and behavior),
and updates the internal code paths for MADV_DONTNEED[_LOCKED] and
MADV_FREE handling to receive it.
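
Conceptually, the structure might look like the following sketch (the
field names here are illustrative only, not necessarily those the patch
uses):

	struct madvise_behavior {
		int behavior;		/* e.g. MADV_DONTNEED, MADV_FREE */
		struct mmu_gather *tlb;	/* gathered entries for a batched flush */
	};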

The sixth and seventh patches make the internal logic for handling
MADV_DONTNEED[_LOCKED] and MADV_FREE ready for batched tlb flushing.
The patches keep support for the unbatched tlb flushes use case, for
fine-grained and safe transitions.

The eighth patch updates the madvise() and process_madvise() code to do
the batched tlb flushes, utilizing the changes the previous patches
introduced.

The final ninth patch removes the internal logic's support code for the
unbatched tlb flushes use case, which is no longer used.

Test Results
============

I measured the latency to apply MADV_DONTNEED advice to 256 MiB of
memory using multiple process_madvise() calls.  I applied the advice at
4 KiB region granularity, but with varying batch size per
process_madvise() call (vlen) from 1 to 1024.  The source code for the
measurement is available at GitHub[1].  To reduce measurement errors, I
did the measurement five times.
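
Each measured call roughly corresponds to the following simplified
sketch.  This is not the exact benchmark code in [1]; dontneed_batch()
is just an illustrative helper name, and whether glibc provides a
process_madvise() wrapper may vary, so the raw syscall is used here.

	#include <sys/mman.h>
	#include <sys/syscall.h>
	#include <sys/uio.h>
	#include <unistd.h>

	/* Apply MADV_DONTNEED to 'vlen' 4 KiB regions with one syscall. */
	static long dontneed_batch(int pidfd, char *base, size_t vlen)
	{
		struct iovec iov[1024];	/* largest batch size tested */
		size_t i;

		for (i = 0; i < vlen; i++) {
			iov[i].iov_base = base + i * 4096;
			iov[i].iov_len = 4096;
		}
		return syscall(SYS_process_madvise, pidfd, iov, vlen,
			       MADV_DONTNEED, 0);
	}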

The measurement results are as below.  The 'sz_batch' column shows the
batch size of process_madvise() calls.  The 'Before' and 'After'
columns show the average of the latencies in nanoseconds, measured five
times, on kernels built without and with the tlb flushes batching
patches of this series, respectively.  For the baseline, the
mm-unstable tree of 2025-03-07[2] has been used.  The 'B-stdev' and
'A-stdev' columns show the ratio of the standard deviation of the
latency measurements to the average, in percent, for 'Before' and
'After', respectively.  'Latency_reduction' shows the reduction of the
latency that the series has achieved, in percent.  Higher
'Latency_reduction' values mean bigger efficiency improvements.

    sz_batch   Before        B-stdev   After         A-stdev   Latency_reduction
    1          128691595.4   6.09      106878698.4   2.76      16.95
    2          94046750.8    3.30      68691907      2.79      26.96
    4          80538496.8    5.35      50230673.8    5.30      37.63
    8          72672225.2    5.50      43918112      3.54      39.57
    16         66955104.4    4.25      36549941.2    1.62      45.41
    32         65814679      5.89      33060291      3.30      49.77
    64         65099205.2    2.69      26003756.4    1.56      60.06
    128        62591307.2    4.02      24008180.4    1.61      61.64
    256        64124368.6    2.93      23868856      2.20      62.78
    512        62325618      5.72      23311330.6    1.74      62.60
    1024       64802138.4    5.05      23651695.2    3.40      63.50

Interestingly, the latency has reduced (improved) even with batch size
1.  I think some compiler optimizations have affected that, as was also
observed with the previous process_madvise() mmap_lock optimization
patch series[3].

So, let's focus on the relation between the improvement and the batch
size.  As expected, tlb flushes batching provides a latency reduction
that is proportional to the batch size.  The efficiency gain ranges
from about 27 percent with batch size 2 up to about 63 percent with
batch size 1,024.

Please note that this is a very simple microbenchmark, so the real
efficiency gain on real workloads could be very different.

Changes from RFC
(https://lore.kernel.org/20250305181611.54484-1-sj@kernel.org)
- Clarify motivation of the change on the cover letter
- Add average and stdev of evaluation results
- Show latency reduction on evaluation results
- Fix !CONFIG_MEMORY_FAILURE build error
- Rename is_memory_populate() to is_madvise_populate()
- Squash patches 5-8
- Add kerneldoc for unmap_vm_area_struct()
- Squash patches 10 and 11
- Squash patches 12-14
- Squash patches 15 and 16

References
==========

[1] https://github.com/sjp38/eval_proc_madvise
[2] commit e664d7d28a7c ("selftest: test system mappings are sealed") # mm-unstable
[3] https://lore.kernel.org/20250211182833.4193-1-sj@kernel.org

SeongJae Park (9):
  mm/madvise: use is_memory_failure() from madvise_do_behavior()
  mm/madvise: split out populate behavior check logic
  mm/madvise: deduplicate madvise_do_behavior() skip case handlings
  mm/madvise: remove len parameter of madvise_do_behavior()
  mm/madvise: define and use madvise_behavior struct for
    madvise_do_behavior()
  mm/memory: split non-tlb flushing part from zap_page_range_single()
  mm/madvise: let madvise_{dontneed,free}_single_vma() caller batches
    tlb flushes
  mm/madvise: batch tlb flushes for
    [process_]madvise(MADV_{DONTNEED[_LOCKED],FREE})
  mm/madvise: remove !tlb support from
    madvise_{dontneed,free}_single_vma()

 mm/internal.h |   3 +
 mm/madvise.c  | 221 +++++++++++++++++++++++++++++++++-----------------
 mm/memory.c   |  38 ++++++---
 3 files changed, 176 insertions(+), 86 deletions(-)


base-commit: e993f5f5b0ac851cf60578cfee5488031dfaa80c
-- 
2.39.5
