netdev - Re: [PATCH v1 2/3] cgroup/rstat: convert cgroup_rstat

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <CAJD7tkbU1PB6ocRUVM0mw_q-c6kq1r9WsgkZwe1ppNkZG8KdQA@mail.gmail.com>
Date: Thu, 18 Apr 2024 13:38:20 -0700
From: Yosry Ahmed <yosryahmed@...gle.com>
To: Jesper Dangaard Brouer <hawk@...nel.org>
Cc: tj@...nel.org, hannes@...xchg.org, lizefan.x@...edance.com, 
	cgroups@...r.kernel.org, longman@...hat.com, netdev@...r.kernel.org, 
	linux-mm@...ck.org, linux-kernel@...r.kernel.org, shakeel.butt@...ux.dev, 
	kernel-team@...udflare.com, Arnaldo Carvalho de Melo <acme@...nel.org>, 
	Sebastian Andrzej Siewior <bigeasy@...utronix.de>, mhocko@...nel.org, Wei Xu <weixugc@...gle.com>
Subject: Re: [PATCH v1 2/3] cgroup/rstat: convert cgroup_rstat_lock back to mutex

On Thu, Apr 18, 2024 at 2:02 AM Jesper Dangaard Brouer <hawk@...nel.org> wrote:
>
>
>
> On 18/04/2024 04.19, Yosry Ahmed wrote:
> > On Tue, Apr 16, 2024 at 10:51 AM Jesper Dangaard Brouer <hawk@...nel.org> wrote:
> >>
> >> Since kernel v4.18, cgroup_rstat_lock has been an IRQ-disabling spinlock,
> >> as introduced by commit 0fa294fb1985 ("cgroup: Replace cgroup_rstat_mutex
> >> with a spinlock").
> >>
> >> Despite efforts in cgroup_rstat_flush_locked() to yield the lock when
> >> necessary during the collection of per-CPU stats, this approach has led
> >> to several scaling issues observed in production environments. Holding
> >> this IRQ lock has caused starvation of other critical kernel functions,
> >> such as softirq (e.g., timers and netstack). Although kernel v6.8
> >> introduced optimizations in this area, we continue to observe instances
> >> where the spin_lock is held for 64-128 ms in production.
> >>
> >> This patch converts cgroup_rstat_lock back to being a mutex lock. This
> >> change is made possible thanks to the significant effort by Yosry Ahmed
> >> to eliminate all atomic context use-cases through multiple commits,
> >> ending in 0a2dc6ac3329 ("cgroup: removecgroup_rstat_flush_atomic()"),
> >> included in kernel v6.5.
> >>
> >> After this patch lock contention will be less obvious, as converting this
> >> to a mutex avoids multiple CPUs spinning while waiting for the lock, but
> >> it doesn't remove the lock contention. It is recommended to use the
> >> tracepoints to diagnose this.
> >
> > I will keep the high-level conversation about using the mutex here in
> > the cover letter thread, but I am wondering why we are keeping the
> > lock dropping logic here with the mutex?
> >
>
> I agree that yielding the mutex in the loop makes less sense.
> Especially since the raw_spin_unlock_irqrestore(cpu_lock, flags) call
> will be a preemption point for my softirq.   But I kept it because, we
> are running a CONFIG_PREEMPT_VOLUNTARY kernel, so I still worried that
> there was no sched point for other userspace processes while holding the
> mutex, but I don't fully know the sched implication when holding a mutex.

I guess dropping the lock before rescheduling could be more preferable
in this case since we do not need to keep holding the lock for
correctness.

>
> > If this is to reduce lock contention, why does it depend on
> > need_resched()? spin_needbreak() is a good indicator for lock
> > contention, but need_resched() isn't, right?
> >
>
> As I said, I'm unsure of the semantics of holding a mutex.
>
>
> > Also, how was this tested?
> >
>
> I tested this in a testlab, prior to posting upstream, with parallel
> reader of the stat files.

I believe high concurrency is a key point here. CC'ing Wei who
reported regressions on previous attempts of mine before to address
the lock contention from userspace.

> As I said in other mail, I plan to experiment
> with these patches(2+3) in production, as micro-benchmarking will not
> reveal the corner cases we care about.

Right, but micro-benchmarking should give us a signal about
regressions. It was very useful for me when working with this code
before to use synthetic benchmarks with high concurrency of userspace
reads and/or kernel flushers.