linux-kernel - Re: [RFC 0/4] Introduce unbalance proactive reclaim

Message-ID: <ZVNwFV7Fid34pU-M@tiehlicka>
Date:   Tue, 14 Nov 2023 14:03:17 +0100
From:   Michal Hocko <mhocko@...e.com>
To:     Huan Yang <link@...o.com>
Cc:     "Huang, Ying" <ying.huang@...el.com>, Tejun Heo <tj@...nel.org>,
        Zefan Li <lizefan.x@...edance.com>,
        Johannes Weiner <hannes@...xchg.org>,
        Jonathan Corbet <corbet@....net>,
        Roman Gushchin <roman.gushchin@...ux.dev>,
        Shakeel Butt <shakeelb@...gle.com>,
        Muchun Song <muchun.song@...ux.dev>,
        Andrew Morton <akpm@...ux-foundation.org>,
        David Hildenbrand <david@...hat.com>,
        Matthew Wilcox <willy@...radead.org>,
        Kefeng Wang <wangkefeng.wang@...wei.com>,
        Peter Xu <peterx@...hat.com>,
        "Vishal Moola (Oracle)" <vishal.moola@...il.com>,
        Yosry Ahmed <yosryahmed@...gle.com>,
        Liu Shixin <liushixin2@...wei.com>,
        Hugh Dickins <hughd@...gle.com>, cgroups@...r.kernel.org,
        linux-doc@...r.kernel.org, linux-kernel@...r.kernel.org,
        linux-mm@...ck.org, opensource.kernel@...o.com
Subject: Re: [RFC 0/4] Introduce unbalance proactive reclaim

On Tue 14-11-23 20:37:07, Huan Yang wrote:
> 
> On 2023/11/14 18:04, Michal Hocko wrote:
> > On Mon 13-11-23 09:54:55, Huan Yang wrote:
> > > On 2023/11/10 20:32, Michal Hocko wrote:
> > > > On Fri 10-11-23 14:21:17, Huan Yang wrote:
> > > > [...]
> > > > > > BTW: how do you know the number of pages to be reclaimed proactively in
> > > > > > memcg proactive reclaiming based solution?
> > > > > One point here is that we are not sure how long the frozen application
> > > > > will be opened, it could be 10 minutes, an hour, or even days.  So we
> > > > > need to predict and try, gradually reclaim anonymous pages in
> > > > > proportion, preferably based on the LRU algorithm.  For example, if
> > > > > the application has been frozen for 10 minutes, reclaim 5% of
> > > > > anonymous pages; 30min:25%anon, 1hour:75%, 1day:100%.  It is even more
> > > > > complicated as it requires adding a mechanism for predicting failure
> > > > > penalties.
> > > > Why would you make your reclaim decisions based on time rather
> > > > than the actual memory demand? I can see how a pro-active reclaim
> > > > could make headroom for an unexpected memory pressure but applying
> > > > more pressure just because of inactivity sounds rather dubious to
> > > > me TBH. Why cannot you simply wait for the external memory pressure
> > > > (e.g. from kswapd) to deal with that based on the demand?
> > > Because the current kswapd and direct memory reclamation are a passive
> > > memory reclamation based on the watermark, and in the event of triggering
> > > these reclamation scenarios, the smoothness of the phone application cannot
> > > be guaranteed.
> > OK, so you are worried about latencies during memory usage spikes.
> > 
> > > (We often observe that when the above reclamation is triggered, there
> > > is a delay in the application startup, usually accompanied by block
> > > I/O, and some concurrency issues caused by lock design.)
> > Does that mean you do not have enough head room for kswapd to keep with
>
> Yes, but if we set the high watermark a little higher, the power
> consumption will be very high.  We usually observe that kswapd will
> run frequently.  Even if we have set a low kswapd watermark, kswapd
> CPU usage can still be high in some extreme scenarios. (For example,
> when starting a large application that needs to acquire a large amount
> of memory in a short period of time.) However, we will not discuss it
> in detail here, the reasons are quite complex, and we have not yet
> sorted out a complete understanding of them.

This is definitely worth investigating further before resorting to
proposing a new interface. If kswapd consumes CPU cycles
unproductively then we should look into why.

If there is a big peak memory demand then that surely requires CPU
capacity for the memory reclaim. The work has to be done, whether that
is in kswapd or the pro-active reclaimer context. I can imagine the
latter one could be invoked with a better timing in mind but that is not
a trivial thing to do. There are examples where this could be driven by
a PSI feedback loop but from what you have mentioned earlier you are
doing an idle time based reclaim. Anyway, this is mostly a tuning related
discussion. I wanted to learn more about what you are trying to achieve
and so far it seems to me you are trying to work around some issues and
a) we would like to learn about those issues and b) a new interface is
unlikely to be a good fit to paper over a suboptimal behavior.
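
To make the PSI-driven variant a bit more concrete, here is a minimal
userspace sketch in C. It is only an illustration under stated
assumptions, not anything from the RFC: the helper names, the threshold
and the step size are all hypothetical. It parses the "some avg10="
field of a PSI line (the format used by /proc/pressure/memory and the
cgroup v2 memory.pressure file) and, while pressure stays below a
threshold, asks the existing cgroup v2 memory.reclaim file to reclaim a
fixed step.

```c
#include <stdio.h>
#include <string.h>

/*
 * Extract the avg10 value from one PSI line, e.g.
 *   "some avg10=1.50 avg60=0.20 avg300=0.00 total=12345"
 * 'kind' is "some" or "full". Returns -1.0 if the line does not match.
 */
double psi_avg10(const char *line, const char *kind)
{
    size_t n = strlen(kind);
    const char *p;
    double v;

    if (strncmp(line, kind, n) != 0 || line[n] != ' ')
        return -1.0;
    p = strstr(line + n, "avg10=");
    if (!p || sscanf(p, "avg10=%lf", &v) != 1)
        return -1.0;
    return v;
}

/*
 * Hypothetical policy: keep reclaiming in fixed steps while the group
 * shows less pressure than 'threshold' (in percent), back off otherwise.
 */
size_t reclaim_step_bytes(double some_avg10, double threshold, size_t step)
{
    if (some_avg10 < 0.0 || some_avg10 >= threshold)
        return 0;
    return step;
}

/*
 * Write a byte count to <memcg_dir>/memory.reclaim. Returns 0 on
 * success, -1 on any error (including the kernel reporting that it
 * could not reclaim the full amount).
 */
int request_reclaim(const char *memcg_dir, size_t bytes)
{
    char path[4096];
    FILE *f;
    int ok;
    int len = snprintf(path, sizeof(path), "%s/memory.reclaim", memcg_dir);

    if (len < 0 || (size_t)len >= sizeof(path))
        return -1;
    f = fopen(path, "w");
    if (!f)
        return -1;
    ok = fprintf(f, "%zu", bytes) > 0;
    if (fclose(f) != 0)     /* buffered write errors surface here */
        ok = 0;
    return ok ? 0 : -1;
}
```

A controller would call these periodically: read the group's
memory.pressure, feed the "some" line through psi_avg10(), and pass the
result of reclaim_step_bytes() to request_reclaim() when it is non-zero.
Picking the threshold and step is exactly the tuning problem discussed
above.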

> > This would suggest that MADV_PAGEOUT is really what you are looking
> > for.
> 
> Yes, I agree, especially to avoid reclaiming shared anonymous pages.
> 
> However, I did some shallow research and found that MADV_PAGEOUT does
> not reclaim pages with mapcount != 1. Our applications are usually
> composed of multiple processes, and some anonymous pages are shared
> among them. When the application is frozen, the memory that is only
> shared among the processes within the application should be released,
> but MADV_PAGEOUT seems not to be suitable for this scenario? (If I
> misunderstood anything, please correct me.)

Hmm, OK it seems that we are hitting some terminology problems. The
discussion was about private memory so far (essentially MAP_PRIVATE)
and now you are talking about shared anonymous memory. That would imply
shmem and that is indeed not supported by MADV_PAGEOUT. The reason for
that is that this poses a security risk for timing based attacks. I can
imagine, though, that we could extend the behavior to support shared
mappings if they do not cross a security boundary (e.g. mapped by the
same user). This would require some analysis though.
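
For reference, the private (MAP_PRIVATE) anonymous case looks roughly
like the following C sketch. The function name and its return
convention are made up for illustration. The fallback definition of
MADV_PAGEOUT is only there for older libc headers. Because
MADV_PAGEOUT is a hint, the sketch reports the madvise() result
separately and only treats lost data as a failure.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

#ifndef MADV_PAGEOUT
#define MADV_PAGEOUT 21   /* asm-generic UAPI value, Linux 5.4+ */
#endif

/*
 * Map 'len' bytes of private anonymous memory, fault it in, hint the
 * kernel to page it out, then read it back.
 * Returns 0 if the contents survived, -1 on mmap failure or corruption.
 * *madvise_rc receives the raw madvise() return value (0 or -1), since
 * an old kernel may reject the hint.
 */
int pageout_private_anon(size_t len, int *madvise_rc)
{
    unsigned char *buf;
    size_t i;
    int intact = 1;

    buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return -1;

    memset(buf, 0xA5, len);               /* touch every page */
    *madvise_rc = madvise(buf, len, MADV_PAGEOUT);

    /* Reading back faults in anything that was actually paged out. */
    for (i = 0; i < len; i++) {
        if (buf[i] != 0xA5) {
            intact = 0;
            break;
        }
    }

    munmap(buf, len);
    return intact ? 0 : -1;
}
```

Whether pages really leave memory depends on swap being configured and
on the pages being eligible, so a successful call says nothing about
how much was reclaimed.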
 
> In addition, I still have doubts that this approach will consume a lot
> of strategy resources, but it is worth studying.

> > If you really aim at compressing a specific type of memory then
> > tweaking reclaim to achieve that sounds like a shortcut because
> > madvise based solution is more involved. But that is not a solid
> > justification for adding a new interface.
> Yes, but this RFC is just adding an additional configuration option to
> the proactive reclaim interface. And in the reclaim path, prioritize
> processing these requests with reclaim tendencies. However, using
> `unlikely` judgment should not have much impact.

Just adding an additional configuration option means a user interface
contract that needs to be maintained forever. Our future reclaim
algorithm might change (and in fact it has already changed quite a bit
with MGLRU) and an explicit request for LRU type specific reclaim might
not even make any sense. See that point?

-- 
Michal Hocko
SUSE Labs