linux-kernel - Re: [PATCH] arm64: mm: drop tlb flush operation when clearing the access bit

Message-ID: <CAOUHufZ84aDmiW3Efh87q1oMJr-zk5cyaebucCFzevFHx77ngQ@mail.gmail.com>
Date:   Wed, 25 Oct 2023 00:17:14 -0600
From:   Yu Zhao <yuzhao@...gle.com>
To:     Alistair Popple <apopple@...dia.com>
Cc:     Baolin Wang <baolin.wang@...ux.alibaba.com>,
        Barry Song <21cnbao@...il.com>, catalin.marinas@....com,
        will@...nel.org, akpm@...ux-foundation.org, v-songbaohua@...o.com,
        linux-mm@...ck.org, linux-arm-kernel@...ts.infradead.org,
        linux-kernel@...r.kernel.org
Subject: Re: [PATCH] arm64: mm: drop tlb flush operation when clearing the
 access bit

On Tue, Oct 24, 2023 at 9:21 PM Alistair Popple <apopple@...dia.com> wrote:
>
>
> Baolin Wang <baolin.wang@...ux.alibaba.com> writes:
>
> > On 10/25/2023 9:58 AM, Alistair Popple wrote:
> >> Barry Song <21cnbao@...il.com> writes:
> >>
> >>> On Wed, Oct 25, 2023 at 9:18 AM Alistair Popple <apopple@...dia.com> wrote:
> >>>>
> >>>>
> >>>> Barry Song <21cnbao@...il.com> writes:
> >>>>
> >>>>> On Wed, Oct 25, 2023 at 7:16 AM Barry Song <21cnbao@...il.com> wrote:
> >>>>>>
> >>>>>> On Tue, Oct 24, 2023 at 8:57 PM Baolin Wang
> >>>>>> <baolin.wang@...ux.alibaba.com> wrote:
> >> [...]
> >>
> >>>>>> (A). Constant flush cost vs. (B). very very occasional reclaimed hot
> >>>>>> page, B might be a correct choice.
> >>>>>
> >>>>> Plus, I doubt B is really going to happen. as after a page is promoted to
> >>>>> the head of lru list or new generation, it needs a long time to slide back
> >>>>> to the inactive list tail or to the candidate generation of mglru. the time
> >>>>> should have been large enough for tlb to be flushed. If the page is really
> >>>>> hot, the hardware will get second, third, fourth etc opportunity to set an
> >>>>> access flag in the long time in which the page is re-moved to the tail
> >>>>> as the page can be accessed multiple times if it is really hot.
> >>>>
> >>>> This might not be true if you have external hardware sharing the page
> >>>> tables with software through either HMM or hardware supported ATS
> >>>> though.
> >>>>
> >>>> In those cases I think it's much more likely hardware can still be
> >>>> accessing the page even after a context switch on the CPU say. So those
> >>>> pages will tend to get reclaimed even though hardware is still actively
> >>>> using them which would be quite expensive and I guess could lead to
> >>>> thrashing as each page is reclaimed and then immediately faulted back
> >>>> in.
> >
> > That's possible, but the chance should be relatively low. At least on
> > x86, I have not heard of this issue.
>
> Personally I've never seen any x86 system that shares page tables with
> external devices, other than with HMM. More on that below.
>
> >>> i am not quite sure i got your point. has the external hardware sharing cpu's
> >>> pagetable the ability to set access flag in page table entries by
> >>> itself? if yes,
> >>> I don't see how our approach will hurt as folio_referenced can notify the
> >>> hardware driver and the driver can flush its own tlb. If no, i don't see
> >>> either as the external hardware can't set access flags, that means we
> >>> have ignored its reference and only knows cpu's access even in the current
> >>> mainline code. so we are not getting worse.
> >>>
> >>> so the external hardware can also see cpu's TLB? or cpu's tlb flush can
> >>> also broadcast to external hardware, then external hardware sees the
> >>> cleared access flag, thus, it can set access flag in page table when the
> >>> hardware access it?  If this is the case, I feel what you said is true.
> >> Perhaps it would help if I gave a concrete example. Take for example
> >> the ARM SMMU. It has its own TLB. Invalidating this TLB is done in one
> >> of two ways depending on the specific HW implementation.
> >>
> >> If broadcast TLB maintenance (BTM) is supported it will snoop CPU TLB
> >> invalidations. If BTM is not supported it relies on SW to explicitly
> >> forward TLB invalidations via MMU notifiers.
> >
> > On our ARM64 hardware, we rely on BTM to maintain TLB coherency.
>
> Lucky you :-)
>
> ARM64 SMMU architecture specification supports the possibility of both,
> as does the driver. Not that I think whether or not BTM is supported has
> much relevance to this issue.
>
> >> Now consider the case where some external device is accessing mappings
> >> via the SMMU. The access flag will be cached in the SMMU TLB. If we
> >> clear the access flag without a TLB invalidate the access flag in the
> >> CPU page table will not get updated because it's already set in the SMMU
> >> TLB.
> >>
> >> As an aside access flag updates happen in one of two ways. If the SMMU
> >> HW supports hardware translation table updates (HTTU) then hardware will
> >> manage updating access/dirty flags as required. If this is not supported
> >> then SW is relied on to update these flags which in practice means
> >> taking a minor fault. But I don't think that is relevant here - in
> >> either case without a TLB invalidate neither of those things will
> >> happen.
> >>
> >> I suppose drivers could implement the clear_flush_young() MMU notifier
> >> callback (none do at the moment AFAICT) but then won't that just lead to
> >> the opposite problem - that every page ever used by an external device
> >> remains active and unavailable for reclaim because the access flag never
> >> gets cleared? I suppose they could do the flush then which would ensure
> >
> > Yes, I think so too. The reason there is currently no problem, perhaps
> > I think, there are no actual use cases at the moment? At least on our
> > Alibaba's fleet, SMMU and MMU do not share page tables now.
>
> We have systems that do.

Just curious: do those systems run the Linux kernel? If so, are pages
shared with SMMU pinned? If not, then how are IO PFs handled after
pages are reclaimed?
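
For readers following the quoted SMMU example, here is a toy model of the access-flag behaviour Alistair describes. It is only a sketch, not kernel code: ToyPTE, ToyTLB, clear_young() and device_keeps_page_young() are hypothetical names made up for illustration. It assumes the access flag is written only on a page-table walk (a TLB miss) and that cached entries are never evicted on their own. Real TLBs do evict entries over time, which is the basis of Barry's argument in the quoted text.

```python
class ToyPTE:
    """Hypothetical page-table entry carrying only an access ("young") flag."""

    def __init__(self):
        self.young = False


class ToyTLB:
    """Hypothetical TLB holding at most one cached translation.

    Assumption: the access flag is set only when the translation is
    fetched by a table walk, i.e. on a TLB miss.
    """

    def __init__(self):
        self.cached = False

    def access(self, pte):
        if not self.cached:
            # Miss: walk the page table, set the access flag, cache the entry.
            pte.young = True
            self.cached = True
        # Hit: the cached entry is used and the page table is not touched.

    def invalidate(self):
        self.cached = False


def clear_young(pte, tlbs, flush):
    """Clear the access flag; optionally invalidate every TLB caching it."""
    was_young = pte.young
    pte.young = False
    if flush:
        for tlb in tlbs:
            tlb.invalidate()
    return was_young


def device_keeps_page_young(flush):
    """Return the access flag after a device touches the page again."""
    pte = ToyPTE()
    smmu_tlb = ToyTLB()
    smmu_tlb.access(pte)                 # first device access: walk sets the flag
    clear_young(pte, [smmu_tlb], flush)  # reclaim scan clears the flag
    smmu_tlb.access(pte)                 # device keeps using the page
    return pte.young
```

In this model device_keeps_page_young(flush=False) leaves the flag clear although the device is still using the page, while flush=True forces a new walk that sets it again. The model says nothing about how often a real SMMU TLB entry survives long enough for this to matter.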