Message-Id: <20220312190715.cx4aznnzf6zdp7wv@google.com>
Date:   Sat, 12 Mar 2022 19:07:15 +0000
From:   Shakeel Butt <shakeelb@...gle.com>
To:     "Michal Koutný" <mkoutny@...e.com>
Cc:     Johannes Weiner <hannes@...xchg.org>,
        Michal Hocko <mhocko@...nel.org>,
        Roman Gushchin <roman.gushchin@...ux.dev>,
        Ivan Babrou <ivan@...udflare.com>,
        Frank Hofmann <fhofmann@...udflare.com>,
        Andrew Morton <akpm@...ux-foundation.org>,
        cgroups@...r.kernel.org, linux-mm@...ck.org,
        linux-kernel@...r.kernel.org, Daniel Dao <dqminh@...udflare.com>,
        stable@...r.kernel.org
Subject: Re: [PATCH] memcg: sync flush only if periodic flush is delayed

Hi Michal,

On Fri, Mar 11, 2022 at 05:00:51PM +0100, Michal Koutný wrote:
> Hello.

> TL;DR rstats are slow but accurate on reader side. To tackle the
> performance regression no flush seems simpler than this patch.


The term 'simpler' is very subjective here, and I would argue this patch
is not much more complicated than no flush, but there is no benefit in
arguing about this since none of it is a stable API that cannot be
changed later. We can always come back and change things based on new
findings.

Before going further, I do want to mention the main reason to move to
rstat infrastructure was to decouple the error rate in the stats from
the number of memory cgroups and the number of stat items. So, I will
focus on the error rate in this email.

[...]

> The benefit this was traded for was the greater accuracy, the possible
> error is:
> - before
>    - O(nr_cpus * nr_cgroups(subtree) * MEMCG_CHARGE_BATCH)	(1)

Please note that (1) is the possible error for each stat item, and it
has no time bound.
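
To put rough, purely illustrative numbers on (1) (made up for scale,
not measured): with 100 CPUs, 1000 cgroups in the subtree and a per-cpu
batch of 32, a single stat item could be off by up to

   100 * 1000 * 32 = 3,200,000 updates

and there is no bound on how long that error can persist.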

> - after
>      O(nr_cpus * MEMCG_CHARGE_BATCH) // sync. flush

The above is across all the stat items.
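
With the same illustrative numbers, the sync flush bound would be
100 * 32 = 3,200 updates, and that is the total across all stat items
rather than per item.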

>      or
>      O(flush_period * max_cr) // periodic flush only		(2)
>                               // max_cr is per-counter max change rate

And this one, I am assuming, is for the performance-critical readers
(the workingset_refault() path changed by this patch), where
flush_period is 4 seconds. Please correct me if I misunderstood this.
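
For concreteness, a minimal sketch of the mechanism (simplified; the
identifiers below are illustrative rather than copied from the patch):
the flusher records when the next flush is expected, and the refault
path only flushes synchronously when that expectation has already been
missed.

  #include <linux/jiffies.h>
  #include <linux/cgroup.h>
  #include <linux/memcontrol.h>

  #define STATS_FLUSH_TOLERANCE (4UL * HZ)  /* the 4 seconds above */

  static u64 flush_next_time;

  static void example_flush_stats(void)
  {
          /* every flush pushes the expected-flush deadline forward */
          flush_next_time = get_jiffies_64() + STATS_FLUSH_TOLERANCE;
          cgroup_rstat_flush_irqsafe(root_mem_cgroup->css.cgroup);
  }

  /* for performance-critical readers like workingset_refault() */
  static void example_flush_stats_if_delayed(void)
  {
          /* flush synchronously only if the periodic flusher has not
           * run within the tolerated staleness window */
          if (time_after64(get_jiffies_64(), flush_next_time))
                  example_flush_stats();
  }

So as long as the periodic flusher keeps running on time, the refault
path never flushes at all; it only pays the sync flush cost when the
periodic flush is delayed.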


> So we could argue that if the pre-rstat kernels did just fine with the
> error (1), they would not be worse with periodic flush if we can compare
> (1) and (2).

I agree with this assessment, but please note that pre-rstat kernels
were not doing well on machines with a large number of CPUs running a
large number of workloads.

[...]

> I'm not sure whether your patch attempts to solve the problem of
> (a) periodic flush getting stuck or (b) limiting error on refault path.
> If it's (a), it should be tackled more systematically (dedicated wq?).
> If it's (b), why not just rely on periodic flush (self answer: (1) and
> (2) comparison is workload dependent).


It is (b) that I am aiming for with this patch. At least (a) was not
happening in the Cloudflare experiments. Are you suggesting that a
dedicated high-priority wq would solve both (a) and (b)?
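
(For reference, I imagine a dedicated wq would look roughly like the
sketch below; the wq name and the WQ_HIGHPRI / WQ_UNBOUND choice are
only illustrative, not a concrete proposal.)

  #include <linux/init.h>
  #include <linux/workqueue.h>

  #define FLUSH_TIME (2UL * HZ)

  static void flush_memcg_stats_dwork(struct work_struct *w);

  static struct workqueue_struct *stats_flush_wq;
  static DECLARE_DEFERRABLE_WORK(stats_flush_dwork, flush_memcg_stats_dwork);

  static void flush_memcg_stats_dwork(struct work_struct *w)
  {
          /* ...flush the stats as we do today... */
          queue_delayed_work(stats_flush_wq, &stats_flush_dwork, FLUSH_TIME);
  }

  static int __init memcg_stats_flush_wq_init(void)
  {
          /* dedicated unbound high-priority wq so the periodic flusher
           * cannot be starved behind other deferred work */
          stats_flush_wq = alloc_workqueue("memcg_stats_flush",
                                           WQ_UNBOUND | WQ_HIGHPRI, 0);
          if (!stats_flush_wq)
                  return -ENOMEM;
          queue_delayed_work(stats_flush_wq, &stats_flush_dwork, FLUSH_TIME);
          return 0;
  }
  late_initcall(memcg_stats_flush_wq_init);

That would address (a) directly, but by itself it would not bound the
error seen on the refault path, which is why I still think the check in
this patch is useful for (b).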

> > Now the question: what are the side-effects of this change? The worst
> > that can happen is the refault codepath will see 4sec old lruvec stats
> > and may cause false (or missed) activations of the refaulted page which
> > may under-or-overestimate the workingset size. Though that is not very
> > concerning as the kernel can already miss or do false activations.

> We can't argue what's the effect of periodic only flushing so this
> newly introduced factor would inherit that too. I find it superfluous.


Sorry I didn't get your point. What is superfluous?


> Michal

> [1] This is worth looking at in more detail.


Oh you did some awesome analysis here.

>  From the flush condition we have
>    cr * Δt = nr_cpus * MEMCG_CHARGE_BATCH
> where Δt is time between flushes and cr is global change rate.

> cr composes of all updates together (corresponds to stats_updates in
> memcg_rstat_updated(), max_cr is change rate per counter)
>    cr = Σ cr_i <= nr_counters * max_cr

I don't get the reason for breaking 'cr' into individual stat items or
counters. What is the benefit? We want to keep the error rate decoupled
from the number of counters (or stat items).


> By combining these two we get shortest time between flushes:
>    cr * Δt <= nr_counters * max_cr * Δt
>    nr_cpus * MEMCG_CHARGE_BATCH <= nr_counters * max_cr * Δt
>    Δt >= (nr_cpus * MEMCG_CHARGE_BATCH) / (nr_counters * max_cr)

> We are interested in
>    R_amort = flush_work / Δt
> which is
>    R_amort <= flush_work * nr_counters * max_cr / (nr_cpus * MEMCG_CHARGE_BATCH)

> R_amort: O( nr_cpus * nr_cgroups(subtree) * nr_counters * (nr_counters * max_cr) / (nr_cpus * MEMCG_CHARGE_BATCH) )
> R_amort: O( nr_cgroups(subtree) * nr_counters^2 * max_cr / MEMCG_CHARGE_BATCH )

> The square looks interesting given there are already tens of counters.
> (As data from Ivan have shown, we can hardly restore the pre-rstat
> performance on the read side even with mere mod_delayed_work().)
> This is what you partially solved with introduction of NR_MEMCG_EVENTS

My main reason for trying NR_MEMCG_EVENTS was to reduce flush_work by
reducing nr_counters, and I don't think nr_counters should have an
impact on Δt.

> but the stats_updates was still sum of all events, so the flush might
> have still triggered too frequently.

> Maybe that would be better long-term approach, splitting into accurate
> and approximate counters and reflect that in the error estimator  
> stats_updates.

> Or some other optimization of mem_cgroup_css_rstat_flush().


Thanks for your insights. This is really awesome, and the long-term
approach is definitely worth exploring. Do you have any strong concerns
with the current patch? I think we should proceed with it and focus
more on the long-term approach.

thanks,
Shakeel


