Message-ID: <0f90d56e-5960-4478-803e-1054696c0cde@lucifer.local>
Date: Tue, 11 Mar 2025 12:49:38 +0000
From: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
To: SeongJae Park <sj@...nel.org>
Cc: Andrew Morton <akpm@...ux-foundation.org>,
        "Liam R. Howlett" <howlett@...il.com>,
        David Hildenbrand <david@...hat.com>,
        Shakeel Butt <shakeel.butt@...ux.dev>,
        Vlastimil Babka <vbabka@...e.cz>, kernel-team@...a.com,
        linux-kernel@...r.kernel.org, linux-mm@...ck.org
Subject: Re: [PATCH 0/9] mm/madvise: batch tlb flushes for MADV_DONTNEED and
 MADV_FREE

The below commit message talks about tlb bits, but you spend a lot of this
series refactoring mm/madvise.c.

Can we just separate out the two into separate series please?

I am doing the same kind of thing with mremap() at the moment, but sent the
refactor _first_ before the work that builds upon it :)

Your refactoring is great, so I want to be able to take that (and Andrew's
objections obviously don't apply there), and then we can address tlb stuff
separately and in a more focused way.

On Mon, Mar 10, 2025 at 10:23:09AM -0700, SeongJae Park wrote:
> When process_madvise() is called to do MADV_DONTNEED[_LOCKED] or
> MADV_FREE with multiple address ranges, tlb flushes happen for each of
> the given address ranges.  Because such tlb flushes are for the same
> process, doing those in a batch is more efficient while still being
> safe.  Modify the process_madvise() entry-level code path to do such
> batched tlb flushes, while the internal unmap logic does only the
> gathering of the tlb entries to flush.
>
> In more detail, modify the entry functions to initialize an mmu_gather
> object and pass it to the internal logic.  And make the internal logic
> do only the gathering of the tlb entries to flush into the received
> mmu_gather object.  After all internal function calls are done, the
> entry functions flush the gathered tlb entries at once.
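>
> A minimal sketch of the intended flow (the helper and type names below
> are illustrative only, not the ones used by this series):
>
>	/* Sketch: the entry function owns the mmu_gather, internals only gather. */
>	static void madvise_dontneed_batched(struct mm_struct *mm,
>					     struct madvise_ranges *ranges)
>	{
>		struct mmu_gather tlb;
>
>		tlb_gather_mmu(&tlb, mm);
>		/* internal logic only records the tlb entries to flush ... */
>		do_madvise_ranges(&tlb, mm, ranges);
>		/* ... and the entry function flushes them all at once. */
>		tlb_finish_mmu(&tlb);
>	}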
>
> The inefficiency should be smaller for the madvise() use case, since it
> receives only a single address range.  But if there are multiple vmas
> for the range, the same problem can happen.  It is unclear whether such
> a use case is common and the inefficiency is significant.  Make the
> change for madvise(), too, since it doesn't really change madvise()
> internal behavior while it helps keep the code shared between
> process_madvise() and madvise() internal logic clean.
>
> Patch Sequence
> ==============
>
> The first four patches are minor cleanups of madvise.c for readability.
>
> The fifth patch defines a new data structure for managing the
> information required for batched tlb flushes (mmu_gather and behavior),
> and updates the internal code paths for MADV_DONTNEED[_LOCKED] and
> MADV_FREE handling to receive it.
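>
> Roughly, the idea is something like the following (the field layout and
> comments are illustrative; see the fifth patch for the real definition):
>
>	struct madvise_behavior {
>		int behavior;		/* e.g. MADV_DONTNEED or MADV_FREE */
>		struct mmu_gather *tlb;	/* gather state shared across ranges */
>	};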
>
> The sixth and seventh patches make the internal logic for handling
> MADV_DONTNEED[_LOCKED] and MADV_FREE ready for batched tlb flushing.
> The patches keep support for the unbatched tlb flush use case, for
> fine-grained and safe transitions.
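>
> The transition pattern is roughly the following (a sketch only, with an
> illustrative function name and signature): if the caller does not hand
> in an mmu_gather, the internal function falls back to a local,
> unbatched one.
>
>	static void madvise_free_single_vma_sketch(struct mmu_gather *caller_tlb,
>						   struct vm_area_struct *vma,
>						   unsigned long start,
>						   unsigned long end)
>	{
>		struct mmu_gather own_tlb;
>		struct mmu_gather *tlb = caller_tlb;
>
>		if (!tlb) {
>			tlb_gather_mmu(&own_tlb, vma->vm_mm);
>			tlb = &own_tlb;
>		}
>		/* ... gather the tlb entries to flush into *tlb ... */
>		if (!caller_tlb)
>			tlb_finish_mmu(&own_tlb);
>	}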
>
> The eighth patch updates the madvise() and process_madvise() code to do
> the batched tlb flushes, utilizing the changes introduced by the
> previous patches.
>
> The final, ninth patch removes the internal logic's support code for
> the unbatched tlb flush use case, which is no longer used.
>
> Test Results
> ============
>
> I measured the latency of applying MADV_DONTNEED advice to 256 MiB of
> memory using multiple process_madvise() calls.  I apply the advice at
> 4 KiB region granularity, while varying the batch size per
> process_madvise() call (vlen) from 1 to 1024.  The source code for the
> measurement is available on GitHub[1].  To reduce measurement error, I
> repeated the measurement five times.
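>
> For reference, the call pattern being exercised looks roughly like the
> snippet below (illustrative only; the real benchmark is in [1]):
>
>	#include <sys/mman.h>
>	#include <sys/syscall.h>
>	#include <sys/uio.h>
>	#include <unistd.h>
>
>	/* Apply MADV_DONTNEED to vlen 4 KiB regions with one syscall. */
>	static long dontneed_batch(int pidfd, char *base, size_t vlen)
>	{
>		struct iovec iov[1024];
>		size_t i;
>
>		for (i = 0; i < vlen; i++) {
>			iov[i].iov_base = base + i * 4096;
>			iov[i].iov_len = 4096;
>		}
>		return syscall(SYS_process_madvise, pidfd, iov, vlen,
>			       MADV_DONTNEED, 0);
>	}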
>
> The measurement results are as below.  The 'sz_batch' column shows the
> batch size of the process_madvise() calls.  The 'Before' and 'After'
> columns show the average of the latencies in nanoseconds, measured five
> times on kernels built without and with the tlb flush batching patches
> of this series, respectively.  For the baseline, the mm-unstable tree
> of 2025-03-07[2] has been used.  The 'B-stdev' and 'A-stdev' columns
> show the ratio of the standard deviation of the latency measurements to
> the average, in percent, for 'Before' and 'After', respectively.
> 'Latency_reduction' shows the reduction of the latency that the series
> has achieved, in percent.  Higher 'Latency_reduction' values mean
> larger efficiency improvements.
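>
> For example, for sz_batch 1, Latency_reduction is
> (128691595.4 - 106878698.4) / 128691595.4 * 100 =~ 16.95 percent.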
>
>     sz_batch   Before        B-stdev   After         A-stdev   Latency_reduction
>     1          128691595.4   6.09      106878698.4   2.76      16.95
>     2          94046750.8    3.30      68691907      2.79      26.96
>     4          80538496.8    5.35      50230673.8    5.30      37.63
>     8          72672225.2    5.50      43918112      3.54      39.57
>     16         66955104.4    4.25      36549941.2    1.62      45.41
>     32         65814679      5.89      33060291      3.30      49.77
>     64         65099205.2    2.69      26003756.4    1.56      60.06
>     128        62591307.2    4.02      24008180.4    1.61      61.64
>     256        64124368.6    2.93      23868856      2.20      62.78
>     512        62325618      5.72      23311330.6    1.74      62.60
>     1024       64802138.4    5.05      23651695.2    3.40      63.50
>
> Interestingly, the latency has reduced (improved) even with batch size
> 1.  I think some compiler optimizations have affected that, as was also
> observed with the previous process_madvise() mmap_lock optimization
> patch series[3].
>
> So, let's focus on the proportion between the improvement and the batch
> size.  As expected, tlb flush batching provides a latency reduction
> that is proportional to the batch size.  The efficiency gain ranges
> from about 27 percent with batch size 2 up to 63 percent with batch
> size 1,024.
>
> Please note that this is a very simple microbenchmark, so the real
> efficiency gain on real workloads could be very different.
>
> Changes from RFC
> (https://lore.kernel.org/20250305181611.54484-1-sj@kernel.org)
> - Clarify motivation of the change on the cover letter
> - Add average and stdev of evaluation results
> - Show latency reduction on evaluation results
> - Fix !CONFIG_MEMORY_FAILURE build error
> - Rename is_memory_populate() to is_madvise_populate()
> - Squash patches 5-8
> - Add kerneldoc for unmap_vm_area_struct()
> - Squash patches 10 and 11
> - Squash patches 12-14
> - Squash patches 15 and 16
>
> References
> ==========
>
> [1] https://github.com/sjp38/eval_proc_madvise
> [2] commit e664d7d28a7c ("selftest: test system mappings are sealed") # mm-unstable
> [3] https://lore.kernel.org/20250211182833.4193-1-sj@kernel.org
>
> SeongJae Park (9):
>   mm/madvise: use is_memory_failure() from madvise_do_behavior()
>   mm/madvise: split out populate behavior check logic
>   mm/madvise: deduplicate madvise_do_behavior() skip case handlings
>   mm/madvise: remove len parameter of madvise_do_behavior()
>   mm/madvise: define and use madvise_behavior struct for
>     madvise_do_behavior()
>   mm/memory: split non-tlb flushing part from zap_page_range_single()
>   mm/madvise: let madvise_{dontneed,free}_single_vma() caller batches
>     tlb flushes
>   mm/madvise: batch tlb flushes for
>     [process_]madvise(MADV_{DONTNEED[_LOCKED],FREE})
>   mm/madvise: remove !tlb support from
>     madvise_{dontneed,free}_single_vma()
>
>  mm/internal.h |   3 +
>  mm/madvise.c  | 221 +++++++++++++++++++++++++++++++++-----------------
>  mm/memory.c   |  38 ++++++---
>  3 files changed, 176 insertions(+), 86 deletions(-)
>
>
> base-commit: e993f5f5b0ac851cf60578cfee5488031dfaa80c
> --
> 2.39.5
