linux-kernel - Re: [PATCH v1 3/3] cgroup/rstat: introduce ratelimited rstat flushing

Message-ID: <CAJD7tkbNvo4nDek5HV7rpZRbARE7yc3y=ufVY5WMBkNH6oL4Mw@mail.gmail.com>
Date: Thu, 18 Apr 2024 14:00:28 -0700
From: Yosry Ahmed <yosryahmed@...gle.com>
To: Jesper Dangaard Brouer <hawk@...nel.org>
Cc: tj@...nel.org, hannes@...xchg.org, lizefan.x@...edance.com, 
	cgroups@...r.kernel.org, longman@...hat.com, netdev@...r.kernel.org, 
	linux-mm@...ck.org, linux-kernel@...r.kernel.org, shakeel.butt@...ux.dev, 
	kernel-team@...udflare.com, Arnaldo Carvalho de Melo <acme@...nel.org>, 
	Sebastian Andrzej Siewior <bigeasy@...utronix.de>, mhocko@...nel.org, Wei Xu <weixugc@...gle.com>
Subject: Re: [PATCH v1 3/3] cgroup/rstat: introduce ratelimited rstat flushing

On Thu, Apr 18, 2024 at 4:00 AM Jesper Dangaard Brouer <hawk@...nel.org> wrote:
>
>
>
> On 18/04/2024 04.21, Yosry Ahmed wrote:
> > On Tue, Apr 16, 2024 at 10:51 AM Jesper Dangaard Brouer <hawk@...nel.org> wrote:
> >>
> >> This patch aims to reduce userspace-triggered pressure on the global
> >> cgroup_rstat_lock by introducing a mechanism to limit how often reading
> >> stat files causes cgroup rstat flushing.
> >>
> >> In the memory cgroup subsystem, memcg_vmstats_needs_flush() combined with
> >> mem_cgroup_flush_stats_ratelimited() already limits pressure on the
> >> global lock (cgroup_rstat_lock). As a result, reading memory-related stat
> >> files (such as memory.stat, memory.numa_stat, zswap.current) is already
> >> a less userspace-triggerable issue.
> >>
> >> However, other userspace users of cgroup_rstat_flush(), such as when
> >> reading io.stat (blk-cgroup.c) and cpu.stat, lack a similar system to
> >> limit pressure on the global lock. Furthermore, userspace can easily
> >> trigger this issue by reading those stat files.
> >>
> >> Typically, normal userspace stats tools (e.g., cadvisor, nomad, systemd)
> >> spawn threads that read io.stat, cpu.stat, and memory.stat (even from the
> >> same cgroup) without realizing that on the kernel side, they share the
> >> same global lock. This limitation also helps prevent malicious userspace
> >> applications from harming the kernel by reading these stat files in a
> >> tight loop.
> >>
> >> To address this, the patch introduces cgroup_rstat_flush_ratelimited(),
> >> similar to memcg's mem_cgroup_flush_stats_ratelimited().
> >>
> >> Flushing occurs per cgroup (even though the lock remains global), so a
> >> variable named rstat_flush_last_time is introduced to track when a given
> >> cgroup was last flushed. This variable, which contains the jiffies of the
> >> flush, shares properties and a cache line with rstat_flush_next and is
> >> updated simultaneously.
> >>
> >> For cpu.stat, we need to acquire the lock (via cgroup_rstat_flush_hold)
> >> because other data is read under the lock, but we skip the expensive
> >> flushing if it occurred recently.
> >>
> >> Regarding io.stat, there is an opportunity outside the lock to skip the
> >> flush, but inside the lock, we must recheck to handle races.
> >>
> >> Signed-off-by: Jesper Dangaard Brouer <hawk@...nel.org>
> >
> > As I mentioned in another thread, I really don't like time-based
> > rate-limiting [1]. Would it be possible to generalize the
> > magnitude-based rate-limiting instead? Have something like
> > memcg_vmstats_needs_flush() in the core rstat code?
> >
>
> I've taken a closer look at memcg_vmstats_needs_flush(), and I'm
> concerned about the overhead of maintaining the stats (that are used as
> a filter).
>
>    static bool memcg_vmstats_needs_flush(struct memcg_vmstats *vmstats)
>    {
>         return atomic64_read(&vmstats->stats_updates) >
>                 MEMCG_CHARGE_BATCH * num_online_cpus();
>    }
>
> I looked at `vmstats->stats_updates` to see how often this is getting
> updated.  It is updated in memcg_rstat_updated(), but that gets inlined
> into a number of functions (__mod_memcg_state, __mod_memcg_lruvec_state,
> __count_memcg_events), plus it calls cgroup_rstat_updated().
> Counting invocations per sec (via funccount):
>
>    10:28:09
>    FUNC                                    COUNT
>    __mod_memcg_state                      377553
>    __count_memcg_events                   393078
>    __mod_memcg_lruvec_state              1229673
>    cgroup_rstat_updated                  2632389
>
>
> I'm surprised to see how many times per sec this is getting invoked.
> Originating from memcg_rstat_updated() = 2,000,304 times per sec.
> (On a 128 CPU core machine with 39% idle CPU-load.)
> Maintaining these stats seems excessive...

Well, the number of calls to memcg_rstat_updated() is not affected by
maintaining stats_updates, and this only adds a few percpu updates in
the common path. I did not see any regressions (after all
optimizations) in any benchmarks with this, including will-it-scale
and netperf.

>
> Then how often does the filter lower pressure on the lock:
>
>    MEMCG_CHARGE_BATCH(64) * 128 CPU = 8192
>    2000304/(64*128) = 244 times per sec (every ~4ms)
>    (assuming memcg_rstat_updated val=1)

This does not tell the whole story though because:

1. The threshold (8192 in this case) is per-memcg. I am assuming
2,000,304 is the number of calls per second for the entire system. In
this case, the filtering should be more effective.

2. This assumes that updates and flushes are uniform; I am not sure
this applies in practice.

3. In my experiments, this thresholding drastically improved userspace
read latency under heavy contention (100s or 1000s of readers),
especially the tail latencies.
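
To make the arithmetic in points 1 and 2 concrete, here is a minimal
user-space sketch (not kernel code). The helper names are made up for
illustration, and it assumes every update has magnitude 1, as in the
quoted estimate:

```c
#include <assert.h>

/* Hypothetical helper: per-memcg flush threshold, i.e. batch * CPUs. */
static long flush_threshold(long charge_batch, long ncpus)
{
    assert(charge_batch > 0 && ncpus > 0);
    return charge_batch * ncpus;
}

/*
 * Hypothetical helper: rough upper bound on how many times per second a
 * single memcg can cross the threshold, given the updates per second
 * that land on that memcg (each assumed to have magnitude 1).
 */
static long crossings_per_sec(long updates_per_sec, long threshold)
{
    assert(updates_per_sec >= 0 && threshold > 0);
    return updates_per_sec / threshold;
}
```

With the numbers from this thread, flush_threshold(64, 128) is 8192 and
crossings_per_sec(2000304, 8192) is 244, but only if all updates land on
one memcg. If the same updates were spread evenly over, say, 100 memcgs
(a made-up number), each memcg would see about 20003 updates per second
and cross its threshold roughly twice per second.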

Generally, I think magnitude-based thresholding is better than
time-based, especially in larger systems where a lot can change in a
short amount of time. I did not observe any regressions from this
scheme, and I observed very noticeable improvements in flushing
latency.
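
As a rough illustration of the difference between the two schemes, a
user-space sketch under stated assumptions: the struct and function
names are hypothetical, and time is a plain millisecond counter rather
than jiffies.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy per-cgroup state; field names are illustrative, not kernel ones. */
struct toy_cgroup {
    long pending_updates;   /* magnitude of updates since the last flush */
    long last_flush_ms;     /* time of the last flush, in milliseconds */
};

/* Magnitude-based: flush only when enough updates have accumulated. */
static bool needs_flush_magnitude(const struct toy_cgroup *cg, long threshold)
{
    assert(cg);
    return cg->pending_updates > threshold;
}

/* Time-based: flush only when the window since the last flush has passed. */
static bool needs_flush_time(const struct toy_cgroup *cg, long now_ms,
                             long window_ms)
{
    assert(cg);
    return now_ms - cg->last_flush_ms >= window_ms;
}
```

A cgroup with 100000 pending updates that was flushed 10ms ago is
skipped by a 50ms time window but flushed by an 8192 magnitude
threshold; a nearly idle cgroup flushed 100ms ago is the reverse case.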

Taking a step back, I think this series is trying to address two
issues in one go: interrupt handling latency and lock contention.
While the two are related, because reducing flushing reduces irq
disablement, I think it would be better if we can fix the interrupt
latency issue separately with a more fundamental solution (e.g. using
a mutex or dropping the lock at each CPU boundary).
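
To sketch what "dropping the lock at each CPU boundary" could look like,
here is a user-space toy, not the rstat code: every name is
hypothetical, the lock is a C11 atomic_flag spinlock, and there is no
irq disabling in user space.

```c
#include <assert.h>
#include <stdatomic.h>

#define TOY_NR_CPUS 4

static atomic_flag toy_lock = ATOMIC_FLAG_INIT;
static long toy_percpu[TOY_NR_CPUS];   /* per-CPU pending deltas */
static long toy_total;                 /* flushed, aggregated value */
static int toy_lock_acquisitions;      /* how often the lock was taken */

static void toy_lock_acquire(void)
{
    while (atomic_flag_test_and_set_explicit(&toy_lock, memory_order_acquire))
        ;   /* spin until the lock is free */
    toy_lock_acquisitions++;
}

static void toy_lock_release(void)
{
    atomic_flag_clear_explicit(&toy_lock, memory_order_release);
}

/* Record an update on one CPU (no locking needed in this toy). */
static void toy_add(int cpu, long delta)
{
    assert(cpu >= 0 && cpu < TOY_NR_CPUS);
    toy_percpu[cpu] += delta;
}

/* Flush all CPUs, holding the lock only while one CPU is processed. */
static void toy_flush(void)
{
    for (int cpu = 0; cpu < TOY_NR_CPUS; cpu++) {
        toy_lock_acquire();
        toy_total += toy_percpu[cpu];
        toy_percpu[cpu] = 0;
        toy_lock_release();
        /* The lock is free here, so a waiter gets a chance to run. */
    }
}
```

The trade-off is that the lock is taken once per CPU instead of once per
flush, in exchange for a bounded hold time per acquisition.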

After that, we can more clearly evaluate the lock contention problem
with data purely about flushing latency, without taking into
consideration the irq handling problem.

Does this make sense to you?

>
>
> > Also, why do we keep the memcg time rate-limiting with this patch? Is
> > it because we use a much larger window there (2s)? Having two layers
> > of time-based rate-limiting is not ideal imo.
> >
>
> I'm also not a fan of having two layers of time-based rate-limiting, but
> they do operate at different time scales *and* are not active at the
> same time with the current patch. If you noticed the details, I excluded
> memcg from using this, as I commented "memcg have own ratelimit layer"
> (in do_flush_stats).

Right, I meant generally having two schemes doing very similar things,
even if they are not active at the same time.

I think this is an artifact of different subsystems sharing the same
rstat tree for no specific reason. I think almost all flushing calls
really need the stats from one subsystem after all.

If we have separate trees, lock contention gets slightly better as
different subsystems do not compete. We can also have different
subsystems "customize" their trees, e.g. by setting different
time-based or magnitude-based rate-limiting thresholds.
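
Purely as a thought experiment, per-subsystem trees with their own
policy might look like the toy below. Nothing here exists in the
kernel: the types, names, and threshold values are all made up.

```c
#include <assert.h>
#include <stdbool.h>

enum toy_subsys { TOY_MEMORY, TOY_IO, TOY_CPU, TOY_NR_SUBSYS };

/* One toy "tree" per subsystem, each with its own flush policy. */
struct toy_rstat_tree {
    long threshold;         /* flush only above this many pending updates */
    long pending_updates;   /* updates accumulated since the last flush */
    long flushes;           /* number of flushes actually performed */
};

static struct toy_rstat_tree toy_trees[TOY_NR_SUBSYS] = {
    [TOY_MEMORY] = { .threshold = 8192 },   /* made-up values */
    [TOY_IO]     = { .threshold = 0 },
    [TOY_CPU]    = { .threshold = 64 },
};

/* Record updates against one subsystem's tree only. */
static void toy_tree_update(enum toy_subsys ss, long delta)
{
    assert(ss >= 0 && ss < TOY_NR_SUBSYS);
    toy_trees[ss].pending_updates += delta;
}

/* Flush one tree if its own policy says so; other trees are untouched. */
static bool toy_tree_try_flush(enum toy_subsys ss)
{
    assert(ss >= 0 && ss < TOY_NR_SUBSYS);
    struct toy_rstat_tree *t = &toy_trees[ss];

    if (t->pending_updates <= t->threshold)
        return false;
    t->pending_updates = 0;
    t->flushes++;
    return true;
}
```

In this toy, 100 pending updates trigger a flush of the io tree but not
of the memory tree, and a reader of one tree never touches the state of
another.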

I know this is a bigger lift, just thinking out loud :)

>
> I would prefer removing the memcg time rate-limiting and using this more
> granular 50ms (20 times/sec) for memcg also.  And I was planning to do
> that in a followup patchset.  The 50ms (20 times/sec) limit will be per
> cgroup in the system, which then "scales"/increases with the number of
> cgroups, but is better than unbounded read/access locks per sec.
>
> --Jesper
>
>
> > [1]https://lore.kernel.org/lkml/CAJD7tkYnSRwJTpXxSnGgo-i3-OdD7cdT-e3_S_yf7dSknPoRKw@mail.gmail.com/
>
>
> sudo ./bcc/tools/funccount -Ti 1 -d 10 '__mod_memcg_state|__mod_memcg_lruvec_state|__count_memcg_events|cgroup_rstat_updated'