lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAOUHufZ6LGyBoPBkniB63-77r5=1POWpEWmUTESFtJo2bwbi-w@mail.gmail.com>
Date:   Thu, 1 Sep 2022 19:17:26 -0600
From:   Yu Zhao <yuzhao@...gle.com>
To:     Nadav Amit <nadav.amit@...il.com>
Cc:     Andrew Morton <akpm@...ux-foundation.org>,
        Andi Kleen <ak@...ux.intel.com>,
        Aneesh Kumar <aneesh.kumar@...ux.ibm.com>,
        Catalin Marinas <catalin.marinas@....com>,
        Dave Hansen <dave.hansen@...ux.intel.com>,
        Hillf Danton <hdanton@...a.com>, Jens Axboe <axboe@...nel.dk>,
        Johannes Weiner <hannes@...xchg.org>,
        Jonathan Corbet <corbet@....net>,
        Linus Torvalds <torvalds@...ux-foundation.org>,
        Matthew Wilcox <willy@...radead.org>,
        Mel Gorman <mgorman@...e.de>,
        Michael Larabel <Michael@...haellarabel.com>,
        Michal Hocko <mhocko@...nel.org>,
        Mike Rapoport <rppt@...nel.org>,
        Peter Zijlstra <peterz@...radead.org>,
        Tejun Heo <tj@...nel.org>, Vlastimil Babka <vbabka@...e.cz>,
        Will Deacon <will@...nel.org>,
        Linux ARM <linux-arm-kernel@...ts.infradead.org>,
        "open list:DOCUMENTATION" <linux-doc@...r.kernel.org>,
        LKML <linux-kernel@...r.kernel.org>,
        Linux MM <linux-mm@...ck.org>, X86 ML <x86@...nel.org>,
        Kernel Page Reclaim v2 <page-reclaim@...gle.com>,
        Barry Song <baohua@...nel.org>,
        Brian Geffon <bgeffon@...gle.com>,
        Jan Alexander Steffens <heftig@...hlinux.org>,
        Oleksandr Natalenko <oleksandr@...alenko.name>,
        Steven Barrett <steven@...uorix.net>,
        Suleiman Souhlal <suleiman@...gle.com>,
        Daniel Byrne <djbyrne@....edu>,
        Donald Carr <d@...os-reins.com>,
        Holger Hoffstätte <holger@...lied-asynchrony.com>,
        Konstantin Kharlamov <Hi-Angel@...dex.ru>,
        Shuang Zhai <szhai2@...rochester.edu>,
        Sofia Trinh <sofia.trinh@....works>,
        Vaibhav Jain <vaibhav@...ux.ibm.com>
Subject: Re: [PATCH v14 07/14] mm: multi-gen LRU: exploit locality in rmap

On Thu, Sep 1, 2022 at 3:18 AM Nadav Amit <nadav.amit@...il.com> wrote:
>
>
>
> > On Aug 15, 2022, at 12:13 AM, Yu Zhao <yuzhao@...gle.com> wrote:
> >
> > Searching the rmap for PTEs mapping each page on an LRU list (to test
> > and clear the accessed bit) can be expensive because pages from
> > different VMAs (PA space) are not cache friendly to the rmap (VA
> > space). For workloads mostly using mapped pages, searching the rmap
> > can incur the highest CPU cost in the reclaim path.
>
> Impressive work. Sorry if my feedback is not timely.
>
> Just one minor point for thought, that can be left for a later cleanup.
>
> >
> > +     for (i = 0, addr = start; addr != end; i++, addr += PAGE_SIZE) {
> > +             unsigned long pfn;
> > +
> > +             pfn = get_pte_pfn(pte[i], pvmw->vma, addr);
> > +             if (pfn == -1)
> > +                     continue;
> > +
> > +             if (!pte_young(pte[i]))
> > +                     continue;
> > +
> > +             folio = get_pfn_folio(pfn, memcg, pgdat);
> > +             if (!folio)
> > +                     continue;
> > +
> > +             if (!ptep_test_and_clear_young(pvmw->vma, addr, pte + i))
> > +                     continue;
> > +
>
> You have already checked that the PTE is old (not young) so this check
> seems redundant.

You are right, for x86, which belongs to category 1: hardware and
OS share the same paging data structure.

> I do not see a way in which the access-bit can be cleared
> since you hold the ptl.

There is also category 2: the OS paging data structure is a shadow of what
hardware actually uses, e.g., POWER9 radix.

To make both categories work, the general rule is that the OS paging
data structure must be more strict, i.e., it can have A/D bits set
while the hardware paging data structure may not. The opposite is not
allowed, even for the A bit, because the A bit can also be used to
determine whether a TLB flush is required. The Linux kernel doesn't do
this but there are other OSes that do.

For prefaulted PTEs, we generally mark them young unless
arch_wants_old_prefaulted_pte() returns true (currently only ARMv8.2+
do). On POWER9, we'd see those PTEs pass the first check but fail the
second.

> IOW, there is no need for the “if" and “continue".
>
> Makes me also wonder whether having a separate ptep_clear_young() can
> slightly help, since anyhow the access-bit is more of an estimation,
> and having a separate ptep_clear_young() can enable optimizations.
>
> On x86, for instance, if the PTE is dirty, we may be able to clear the
> access-bit without an atomic operation, which should be faster.

Agreed.

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ