linux-kernel - Re: [PATCH v7 04/12] mm: multigenerational LRU: groundwork

Message-ID: <CAOUHufbN_56UJBkgA2LjAfbTt9nzPOCHaSeS4P3GHcYst+Y+eg@mail.gmail.com>
Date:   Mon, 14 Mar 2022 10:45:21 -0600
From:   Yu Zhao <yuzhao@...gle.com>
To:     Barry Song <21cnbao@...il.com>
Cc:     Johannes Weiner <hannes@...xchg.org>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Mel Gorman <mgorman@...e.de>, Michal Hocko <mhocko@...nel.org>,
        Andi Kleen <ak@...ux.intel.com>,
        Aneesh Kumar <aneesh.kumar@...ux.ibm.com>,
        Catalin Marinas <catalin.marinas@....com>,
        Dave Hansen <dave.hansen@...ux.intel.com>,
        Hillf Danton <hdanton@...a.com>, Jens Axboe <axboe@...nel.dk>,
        Jesse Barnes <jsbarnes@...gle.com>,
        Jonathan Corbet <corbet@....net>,
        Linus Torvalds <torvalds@...ux-foundation.org>,
        Matthew Wilcox <willy@...radead.org>,
        Michael Larabel <Michael@...haellarabel.com>,
        Mike Rapoport <rppt@...nel.org>,
        Rik van Riel <riel@...riel.com>,
        Vlastimil Babka <vbabka@...e.cz>,
        Will Deacon <will@...nel.org>,
        Ying Huang <ying.huang@...el.com>,
        LAK <linux-arm-kernel@...ts.infradead.org>,
        Linux Doc Mailing List <linux-doc@...r.kernel.org>,
        LKML <linux-kernel@...r.kernel.org>,
        Linux-MM <linux-mm@...ck.org>,
        Kernel Page Reclaim v2 <page-reclaim@...gle.com>,
        x86 <x86@...nel.org>, Brian Geffon <bgeffon@...gle.com>,
        Jan Alexander Steffens <heftig@...hlinux.org>,
        Oleksandr Natalenko <oleksandr@...alenko.name>,
        Steven Barrett <steven@...uorix.net>,
        Suleiman Souhlal <suleiman@...gle.com>,
        Daniel Byrne <djbyrne@....edu>,
        Donald Carr <d@...os-reins.com>,
        Holger Hoffstätte <holger@...lied-asynchrony.com>,
        Konstantin Kharlamov <Hi-Angel@...dex.ru>,
        Shuang Zhai <szhai2@...rochester.edu>,
        Sofia Trinh <sofia.trinh@....works>
Subject: Re: [PATCH v7 04/12] mm: multigenerational LRU: groundwork

On Mon, Mar 14, 2022 at 5:12 AM Barry Song <21cnbao@...il.com> wrote:
>
> > > > >
> > > > > > We used to put a faulted file page in inactive, if we access it a
> > > > > > second time, it can be promoted
> > > > > > to active. then in recent years, we have also applied this to anon
> > > > > > pages while kernel adds
> > > > > > workingset protection for anon pages. so basically both anon and file
> > > > > > pages go into the inactive
> > > > > > list for the 1st time, if we access it for the second time, they go to
> > > > > > the active list. if we don't access
> > > > > > it any more, they are likely to be reclaimed as they are inactive.
> > > > > > we do have some special fastpath for code section, executable file
> > > > > > pages are kept on active list
> > > > > > as long as they are accessed.
> > > > >
> > > > > Yes.
> > > > >
> > > > > > so all of the above concerns are actually not that correct?
> > > > >
> > > > > They are valid concerns but I don't know any popular workloads that
> > > > > care about them.
> > > >
> > > > Hi Yu,
> > > > here we can get a workload in Kim's patchset while he added workingset
> > > > protection
> > > > for anon pages:
> > > > https://patchwork.kernel.org/project/linux-mm/cover/1581401993-20041-1-git-send-email-iamjoonsoo.kim@lge.com/
> > >
> > > Thanks. I wouldn't call that a workload because it's not a real
> > > application. By popular workloads, I mean applications that the
> > > majority of people actually run on phones, in cloud, etc.
> > >
> > > > anon pages used to go to active rather than inactive, but kim's patchset
> > > > moved to use inactive first. then only after the anon page is accessed
> > > > second time, it can move to active.
> > >
> > > Yes. To clarify, the A-bit doesn't really mean the first or second
> > > access. It can be many accesses each time it's set.
> > >
> > > > "In current implementation, newly created or swap-in anonymous page is
> > > >
> > > > started on the active list. Growing the active list results in rebalancing
> > > > active/inactive list so old pages on the active list are demoted to the
> > > > inactive list. Hence, hot page on the active list isn't protected at all.
> > > >
> > > > Following is an example of this situation.
> > > >
> > > > Assume that 50 hot pages on active list and system can contain total
> > > > 100 pages. Numbers denote the number of pages on active/inactive
> > > > list (active | inactive). (h) stands for hot pages and (uo) stands for
> > > > used-once pages.
> > > >
> > > > 1. 50 hot pages on active list
> > > > 50(h) | 0
> > > >
> > > > 2. workload: 50 newly created (used-once) pages
> > > > 50(uo) | 50(h)
> > > >
> > > > 3. workload: another 50 newly created (used-once) pages
> > > > 50(uo) | 50(uo), swap-out 50(h)
> > > >
> > > > As we can see, hot pages are swapped-out and it would cause swap-in later."
> > > >
> > > > Is MGLRU able to avoid the swap-out of the 50 hot pages?
> > >
> > > I think the real question is why the 50 hot pages can be moved to the
> > > inactive list. If they are really hot, the A-bit should protect them.
> >
> > This is a good question.
> >
> > I guess it is probably because the current LRU tries to maintain a balance
> > between the sizes of the active and inactive lists. Thus, it can shrink the
> > active list even though its pages might still be "hot", just not the most
> > recently accessed ones.
> >
> > 1. 50 hot pages on active list
> > 50(h) | 0
> >
> > 2. workload: 50 newly created (used-once) pages
> > 50(uo) | 50(h)
> >
> > 3. workload: another 50 newly created (used-once) pages
> > 50(uo) | 50(uo), swap-out 50(h)
> >
> > the old kernel without anon workingset protection put workload 2 on active, so
> > pushed 50 hot pages from active to inactive. workload 3 would further contribute
> > to evict the 50 hot pages.
> >
> > it seems mglru doesn't demote pages from the youngest generation to older
> > generation only in order to balance the list size? so mglru is probably safe
> > in these cases.
> >
> > I will run some tests mentioned in Kim's patchset and report the result to you
> > afterwards.
> >
>
> Hi Yu,
> I did find that putting faulted pages into the youngest generation leads to
> some regression in the ebizzy case that Kim's patchset mentioned when he
> added workingset protection for anon pages.
> I made a small modification to rand_chunk(), which is probably similar
> to the modification Kim mentioned in his patchset. The modification
> can be found here:
> https://github.com/21cnbao/ltp/commit/7134413d747bfa9ef
>
> The test env is an x86 machine on which I have set the memory size to 2.5GB,
> set zRAM to 2GB and disabled external disk swap.
>
> with the vanilla kernel:
> \time -v ./a.out -vv -t 4 -s 209715200 -S 200000
>
> so we have 10 chunks and 4 threads; each chunk is 209715200 bytes (200MB)
>
> typical result:
>         Command being timed: "./a.out -vv -t 4 -s 209715200 -S 200000"
>         User time (seconds): 36.19
>         System time (seconds): 229.72
>         Percent of CPU this job got: 371%
>         Elapsed (wall clock) time (h:mm:ss or m:ss): 1:11.59
>         Average shared text size (kbytes): 0
>         Average unshared data size (kbytes): 0
>         Average stack size (kbytes): 0
>         Average total size (kbytes): 0
>         Maximum resident set size (kbytes): 2166196
>         Average resident set size (kbytes): 0
>         Major (requiring I/O) page faults: 9990128
>         Minor (reclaiming a frame) page faults: 33315945
>         Voluntary context switches: 59144
>         Involuntary context switches: 167754
>         Swaps: 0
>         File system inputs: 2760
>         File system outputs: 8
>         Socket messages sent: 0
>         Socket messages received: 0
>         Signals delivered: 0
>         Page size (bytes): 4096
>         Exit status: 0
>
> with gen_lru and lru_gen/enabled=0x3:
> typical result:
> Command being timed: "./a.out -vv -t 4 -s 209715200 -S 200000"
> User time (seconds): 36.34
> System time (seconds): 276.07
> Percent of CPU this job got: 378%
> Elapsed (wall clock) time (h:mm:ss or m:ss): 1:22.46
>            **** 15% time +
> Average shared text size (kbytes): 0
> Average unshared data size (kbytes): 0
> Average stack size (kbytes): 0
> Average total size (kbytes): 0
> Maximum resident set size (kbytes): 2168120
> Average resident set size (kbytes): 0
> Major (requiring I/O) page faults: 13362810
>              ***** 30% page fault +
> Minor (reclaiming a frame) page faults: 33394617
> Voluntary context switches: 55216
> Involuntary context switches: 137220
> Swaps: 0
> File system inputs: 4088
> File system outputs: 8
> Socket messages sent: 0
> Socket messages received: 0
> Signals delivered: 0
> Page size (bytes): 4096
> Exit status: 0
>
> with gen_lru and lru_gen/enabled=0x7:
> typical result:
> Command being timed: "./a.out -vv -t 4 -s 209715200 -S 200000"
> User time (seconds): 36.13
> System time (seconds): 251.71
> Percent of CPU this job got: 378%
> Elapsed (wall clock) time (h:mm:ss or m:ss): 1:16.00
>          *****better than enabled=0x3, worse than vanilla
> Average shared text size (kbytes): 0
> Average unshared data size (kbytes): 0
> Average stack size (kbytes): 0
> Average total size (kbytes): 0
> Maximum resident set size (kbytes): 2120988
> Average resident set size (kbytes): 0
> Major (requiring I/O) page faults: 12706512
> Minor (reclaiming a frame) page faults: 33422243
> Voluntary context switches: 49485
> Involuntary context switches: 126765
> Swaps: 0
> File system inputs: 2976
> File system outputs: 8
> Socket messages sent: 0
> Socket messages received: 0
> Signals delivered: 0
> Page size (bytes): 4096
> Exit status: 0
>
> I can also reproduce the problem on arm64.
>
> I am not saying this is going to block mglru from being mainlined. But I am
> still curious whether this is an issue worth addressing somehow in mglru.

You've missed something very important: *throughput* :)

Dollars to doughnuts there was a large increase in throughput -- I
haven't tried this benchmark but I've seen many reports similar to
this one.
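
For reference, the relative changes between the three quoted `time -v`
runs can be worked out with a small script. This is only a sketch: the
numbers are copied from the quoted output above, the helper names
(to_seconds, pct_change, compare) are made up for illustration, and it
says nothing about throughput, which the quoted output does not include.

```python
# Figures copied from the quoted `time -v` output; nothing here is measured anew.
RUNS = {
    "vanilla":     {"elapsed": "1:11.59", "system": 229.72, "major_faults": 9990128},
    "enabled=0x3": {"elapsed": "1:22.46", "system": 276.07, "major_faults": 13362810},
    "enabled=0x7": {"elapsed": "1:16.00", "system": 251.71, "major_faults": 12706512},
}


def to_seconds(elapsed):
    """Convert a 'm:ss.ss' or 'h:mm:ss' wall-clock string to seconds."""
    seconds = 0.0
    for part in elapsed.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds


def pct_change(baseline, value):
    """Relative change of value against baseline, in percent."""
    return (value - baseline) / baseline * 100.0


def compare(runs, baseline="vanilla"):
    """Return the percent change of each non-baseline run against the baseline."""
    base = runs[baseline]
    rows = {}
    for name, run in runs.items():
        if name == baseline:
            continue
        rows[name] = {
            "elapsed_pct": pct_change(to_seconds(base["elapsed"]),
                                      to_seconds(run["elapsed"])),
            "system_pct": pct_change(base["system"], run["system"]),
            "major_faults_pct": pct_change(base["major_faults"],
                                           run["major_faults"]),
        }
    return rows


if __name__ == "__main__":
    for name, row in compare(RUNS).items():
        print(f"{name}: elapsed {row['elapsed_pct']:+.1f}%, "
              f"system {row['system_pct']:+.1f}%, "
              f"major faults {row['major_faults_pct']:+.1f}%")
```

With these inputs the 0x3 run comes out at roughly 15% more wall-clock
time and about a third more major faults than vanilla, in line with
the annotations in the quoted output. Whether that is a regression
still depends on how much work each run completed in that time.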