linux-kernel - Re: [PATCH v2 6/6] zswap: memcg accounting

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20220516143459.GA17557@blackbody.suse.cz>
Date:   Mon, 16 May 2022 16:34:59 +0200
From:   Michal Koutný <mkoutny@...e.com>
To:     Johannes Weiner <hannes@...xchg.org>
Cc:     Andrew Morton <akpm@...ux-foundation.org>,
        Michal Hocko <mhocko@...e.com>, Roman Gushchin <guro@...com>,
        Shakeel Butt <shakeelb@...gle.com>,
        Seth Jennings <sjenning@...hat.com>,
        Dan Streetman <ddstreet@...e.org>,
        Minchan Kim <minchan@...nel.org>, linux-mm@...ck.org,
        cgroups@...r.kernel.org, linux-kernel@...r.kernel.org,
        kernel-team@...com
Subject: Re: [PATCH v2 6/6] zswap: memcg accounting

On Fri, May 13, 2022 at 01:08:13PM -0400, Johannes Weiner <hannes@...xchg.org> wrote:
> Right, it's accounted as a subset rather than fully disjointed. But it
> is a limitable counter of its own, so I exported it as such, with a
> current and a max knob. This is comparable to the kmem counter in v1.

That counter and limit didn't turn out well. I liked the analogy to
writeback (and dirty limit) better.

> From an API POV it would be quite strange to have max for a counter
> that has no current. Likewise it would be strange for a major memory
> consumer to be missing from memory.stat.

My understanding would be to have all memory.stat entries as you
propose, no extra .current counter and the .max knob for zswap
configuration.

> It needs to be configured to the workload's access frequency curve,
> which can be done with trial-and-error (reasonable balance between
> zswpins and pswpins) or in a more targeted manner using tools such as
> page_idle, damon etc.
> [...]
> Because for load tuning, bytes make much more sense. That's how you
> measure the workingset, so a percentage is an awkward indirection. At
> the cgroup level, it makes even less sense: all memcg tunables are in
> bytes, it would be quite weird to introduce a "max" that is 0-100. Add
> the confusion of how percentages would propagate down the hierarchy...

Thanks for the explanation. I guess there's no simple tranformation of
in-kernel available information that'd allow a more semantic
configuration of this value. The rather crude absolute value requires
(but also simply allows) some calibration or responsive tuning.

> I don't traverse all ancestors, I bail on disabled groups and skip
> unlimited ones.

I admit I missed that.

> Flushing unnecessary groups with a ratelimit doesn't sound like an
> improvement to me.

Then I'm only concerned about a situation when there's a single deep
memcg that undergoes both workingset_refault() and zswap querying.
The latter (bare call to cgroup_rstat_flush()) won't reset
stats_flush_threshold, so the former (or the async flush more likely)
would attempt a flush too. The flush work (on the leaf memcg) would be
done twice even though it may be within the tolerance of cumulated
error the second time.

This is a thing that might require attention in the future (depending on
some data how it actually performs). I see how the current approach is
justified.

Regards,
Michal