Message-ID: <CAF8kJuPj6-gZ4H+VQtJpJj_MutTgTcR-9BfDQnweayOrXk-NCQ@mail.gmail.com>
Date: Tue, 9 Sep 2025 17:26:57 -0700
From: Chris Li <chrisl@...nel.org>
To: YoungJun Park <youngjun.park@....com>
Cc: Michal Koutný <mkoutny@...e.com>,
akpm@...ux-foundation.org, hannes@...xchg.org, mhocko@...nel.org,
roman.gushchin@...ux.dev, shakeel.butt@...ux.dev, muchun.song@...ux.dev,
shikemeng@...weicloud.com, kasong@...cent.com, nphamcs@...il.com,
bhe@...hat.com, baohua@...nel.org, cgroups@...r.kernel.org,
linux-mm@...ck.org, linux-kernel@...r.kernel.org, gunho.lee@....com,
iamjoonsoo.kim@....com, taejoon.song@....com,
Matthew Wilcox <willy@...radead.org>, David Hildenbrand <david@...hat.com>, Kairui Song <ryncsn@...il.com>,
Wei Xu <weixugc@...gle.com>
Subject: Re: [PATCH 1/4] mm/swap, memcg: Introduce infrastructure for cgroup-based swap priority

On Sun, Sep 7, 2025 at 10:51 AM YoungJun Park <youngjun.park@....com> wrote:
>
> > On Fri, Sep 5, 2025 at 4:45 PM Chris Li <chrisl@...nel.org> wrote:
> > > > - Mask computation: precompute at interface write-time vs runtime
> > > > recomputation. (TBD; preference?)
> > >
> > > Let's start with runtime. We can have a runtime and cached with
> > > generation numbers on the toplevel. Any change will reset the top
> > > level generation number then the next lookup will drop the cache value
> > > and re-evaluate.
> >
> > Scratch that cache value idea. I found the run time evaluation can be
> > very simple and elegant.
> > Each memcg just needs to store the tier onoff value for the local
> > swap.tiers operation. Also a mask to indicate which of those tiers
> > present.
> > e.g. bits 0-1: default, on bit 0 and off bit 1
> > bits 2-3: zswap, on bit 2 and off bit 3
> > bits 4-6: first custom tier
> > ...
> >
> > The evaluation of the current tier "memcg" to the parent with the
> > default tier shortcut can be:
> >
> > onoff = memcg->tiers_onoff;
> > mask = memcg->tiers_mask;
> >
> > for (p = memcg->parent; p && !has_default(onoff); p = p->parent) {
> > merge = mask | p->tiers_mask;
> > new = merge ^ mask;
> > onoff |= p->tiers_onoff & new;
> > mask = merge;
> > }
> > if (onoff & DEFAULT_OFF) {
> > // default off, look for the on tiers to turn on
> > } else {
> > // default on, look for the off tiers to turn off
> > }
> >
> > It is an all bit operation that does not need caching at all. This can
> > take advantage of the short cut of the default tier. If the default
> > tier overwrite exists, no need to search the parent further.
> >
> > Chris
> >
>
> Hi Chris,
>
> Thanks a lot for the clear code and explanation.
>
> I’ll proceed with the runtime evaluation approach you suggested.
> I was initially leaning toward precomputing at write-time since (1)
> cgroup depth might be deep, and (2) swap I/O paths are far more frequent than config

Cgroup depth is typically not deep, though there might be a lot of top-level
cgroups. That is the more common setup I am familiar with. If you know of
other use cases that contradict this, please let me know.

We can turn this into an LPC discussion question to ask the audience as well.

> writes. Is your preference for runtime for implementation simplicity?
> (Any other reasons I don't know?)

Oh, I think it provides the most flexibility with minimal code
complexity. It is kind of the best of both worlds. If the child overrides
the default value with a leading "-" or "+" without a tier name, it will
trigger the shortcut path and there is no need to look up the parent.

However, if the child has a default empty "swap.tiers" file, a change to
the parent will impact every child cgroup. We can have it both ways
with what I consider pretty minimal code. That is actually the most
common use case. K8s pods would be changed from the top level.

It is a good trade-off in terms of ROI, weighing complexity against
feature flexibility.
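
A minimal userspace sketch of the runtime evaluation discussed above, to make the two cases concrete (default override shortcut vs. inheriting from the parent). It is not kernel code: `struct cg`, the `TIER_*` macros, `effective_onoff()` and `tier_enabled()` are hypothetical names, and the two-bits-per-tier layout is assumed from the quoted example (on bit, then off bit, with the default tier in bits 0-1).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed layout: tier t uses bit 2t as "on" and bit 2t+1 as "off". */
#define TIER_ON(t)   (1u << (2 * (t)))
#define TIER_OFF(t)  (1u << (2 * (t) + 1))
#define TIER_BITS(t) (TIER_ON(t) | TIER_OFF(t))

enum { TIER_DEFAULT = 0, TIER_ZSWAP = 1, TIER_CUSTOM0 = 2 };

/* Hypothetical stand-in for the per-memcg state. */
struct cg {
    struct cg *parent;
    uint32_t tiers_onoff; /* on/off bits written locally via swap.tiers */
    uint32_t tiers_mask;  /* TIER_BITS(t) for every tier set locally */
};

static int has_default(uint32_t onoff)
{
    return (onoff & TIER_BITS(TIER_DEFAULT)) != 0;
}

/* Walk up the hierarchy; stop early once a default override is seen. */
static uint32_t effective_onoff(const struct cg *cg)
{
    uint32_t onoff = cg->tiers_onoff;
    uint32_t mask = cg->tiers_mask;

    for (const struct cg *p = cg->parent; p && !has_default(onoff); p = p->parent) {
        uint32_t merge = mask | p->tiers_mask;
        uint32_t new_bits = merge ^ mask;   /* tiers only the ancestor sets */

        onoff |= p->tiers_onoff & new_bits; /* nearer settings take priority */
        mask = merge;
    }
    return onoff;
}

/* Decide whether a non-default tier is usable given the merged bits. */
static int tier_enabled(uint32_t onoff, int tier)
{
    assert(tier != TIER_DEFAULT);
    if (onoff & TIER_OFF(TIER_DEFAULT))
        return (onoff & TIER_ON(tier)) != 0;  /* default off: need explicit on */
    return (onoff & TIER_OFF(tier)) == 0;     /* default on: unless explicit off */
}
```

With this sketch, a child with an empty setting picks up whatever its ancestors set, while a child that writes a default override never enters the loop, so the walk costs nothing for it.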

BTW, the "swap.tiers" file should require root or some kind of capability
(CAPS) so that non-root users can't write to it by themselves. Otherwise
they can abuse their own settings, rendering the QoS aspect ineffective
for other cgroups.

Chris