Message-ID: <gckdqiczjtyd5qdod6a7uyaxppbglg3fkgx2pideuscsyhdrmy@by6rlly6crmz>
Date: Fri, 19 Apr 2024 09:11:35 -0700
From: Shakeel Butt <shakeel.butt@...ux.dev>
To: Jesper Dangaard Brouer <hawk@...nel.org>
Cc: Yosry Ahmed <yosryahmed@...gle.com>, tj@...nel.org, hannes@...xchg.org,
lizefan.x@...edance.com, cgroups@...r.kernel.org, longman@...hat.com,
netdev@...r.kernel.org, linux-mm@...ck.org, linux-kernel@...r.kernel.org,
kernel-team@...udflare.com, Arnaldo Carvalho de Melo <acme@...nel.org>,
Sebastian Andrzej Siewior <bigeasy@...utronix.de>, mhocko@...nel.org
Subject: Re: [PATCH v1 2/3] cgroup/rstat: convert cgroup_rstat_lock back to
mutex
On Fri, Apr 19, 2024 at 03:15:01PM +0200, Jesper Dangaard Brouer wrote:
>
>
> On 18/04/2024 22.39, Yosry Ahmed wrote:
> > On Thu, Apr 18, 2024 at 7:49 AM Shakeel Butt <shakeel.butt@...ux.dev> wrote:
> > >
> > > On Thu, Apr 18, 2024 at 11:02:06AM +0200, Jesper Dangaard Brouer wrote:
> > > >
> > > >
> > > > On 18/04/2024 04.19, Yosry Ahmed wrote:
> > > [...]
> > > > >
> > > > > I will keep the high-level conversation about using the mutex here in
> > > > > the cover letter thread, but I am wondering why we are keeping the
> > > > > lock dropping logic here with the mutex?
> > > > >
> > > >
> > > > I agree that yielding the mutex in the loop makes less sense,
> > > > especially since the raw_spin_unlock_irqrestore(cpu_lock, flags) call
> > > > will already be a preemption point for my softirq. But I kept it
> > > > because we are running a CONFIG_PREEMPT_VOLUNTARY kernel, so I was
> > > > still worried that there would be no sched point for other userspace
> > > > processes while holding the mutex; I don't fully know the scheduling
> > > > implications of holding a mutex.
> > > >
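
(For anyone following along, the lock dropping logic being discussed is
roughly the following. This is a simplified sketch of
cgroup_rstat_flush_locked() from my reading of kernel/cgroup/rstat.c with
Jesper's lock helpers applied, not the exact code of any particular tree,
so details may differ.)

static void cgroup_rstat_flush_locked(struct cgroup *cgrp)
{
	int cpu;

	for_each_possible_cpu(cpu) {
		/*
		 * Takes and releases the per-cpu raw spinlock internally;
		 * the irqrestore there is the preemption point for softirqs
		 * raised on this CPU that Jesper mentions above.
		 */
		struct cgroup *pos = cgroup_rstat_updated_list(cgrp, cpu);

		for (; pos; pos = pos->rstat_flush_next) {
			cgroup_base_stat_flush(pos, cpu);
			/* ... plus the subsystem (memcg, io, bpf) flushes ... */
		}

		/*
		 * The yield in question: drop the global lock between CPUs
		 * when a task wants the CPU or someone waits on the lock.
		 */
		if (need_resched() || spin_needbreak(&cgroup_rstat_lock)) {
			__cgroup_rstat_unlock(cgrp, cpu);
			if (!cond_resched())
				cpu_relax();
			__cgroup_rstat_lock(cgrp, cpu);
		}
	}
}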
> > >
> > > Are the softirqs you are interested in raised from the same cpu or a
> > > remote cpu? What about adding a local_softirq_pending() check in
> > > addition to the need_resched() and spin_needbreak() checks? If the
> > > softirq can only be raised on the local cpu, then convert the
> > > spin_lock to a non-irq one (please correct me if I am wrong, but on
> > > return from hard irq, when not within bh or an irq-disabled spin_lock,
> > > the kernel will run the pending softirqs, right?). Did you get the
> > > chance to test these two changes, or something similar, in your prod
> > > environment?
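
To be concrete, the extra check I had in mind above is something along
these lines (untested, just to illustrate the idea):

		/*
		 * Illustrative only: also yield when this CPU has pending
		 * softirqs, not just when a task or a lock waiter shows up.
		 */
		if (need_resched() || spin_needbreak(&cgroup_rstat_lock) ||
		    local_softirq_pending()) {
			__cgroup_rstat_unlock(cgrp, cpu);
			if (!cond_resched())
				cpu_relax();
			__cgroup_rstat_lock(cgrp, cpu);
		}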
> >
> > I tried making the spinlock a non-irq lock before, but Tejun objected [1].
> > [1] https://lore.kernel.org/lkml/ZBz%2FV5a7%2F6PZeM7S@slm.duckdns.org/
> >
>
> After reading [1], I think using a mutex is a better approach (than non-irq
> spinlock).
>
>
> > Perhaps we could experiment with always dropping the lock at CPU
> > boundaries instead?
> >
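
(If we experiment with that, I imagine it would mean replacing the
conditional yield at the end of each per-CPU iteration with an
unconditional one, roughly the following, again untested:)

		/* Illustrative: always give up the lock once per CPU. */
		__cgroup_rstat_unlock(cgrp, cpu);
		cond_resched();
		__cgroup_rstat_lock(cgrp, cpu);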
>
> I don't think this will be enough (always dropping the lock at CPU
> boundaries). My measured "lock-hold" times show that IRQs (and
> softirqs) are blocked for too long. Looking at prod with my new cgroup
> tracepoint script[2], I see many Yields happening when contention
> occurs, and with the same magnitude as Contended, but I still see
> events with long "lock-hold" times even though the yield count is high.
>
> [2] https://github.com/xdp-project/xdp-project/blob/master/areas/latency/cgroup_rstat_tracepoint.bt
>
> Example output:
>
> 12:46:56 High Lock-contention: wait: 739 usec (0 ms) on CPU:56 comm:kswapd7
> 12:46:56 Long lock-hold time: 6381 usec (6 ms) on CPU:27 comm:kswapd3
> 12:46:56 Long lock-hold time: 18905 usec (18 ms) on CPU:100 comm:kworker/u261:12
>
> 12:46:56 time elapsed: 36 sec (interval = 1 sec)
> Flushes(2051) 15/interval (avg 56/sec)
> Locks(44464) 1340/interval (avg 1235/sec)
> Yields(42413) 1325/interval (avg 1178/sec)
> Contended(42112) 1322/interval (avg 1169/sec)
>
> The report shows 15 flushes/sec, but the locks are yielded quickly.
>
> More problematically (for softirq latency), we see a Long lock-hold
> time reaching 18 ms. For the network RX softirq I need latency below
> 0.5 ms to avoid RX-ring HW queue overflows.
>
>
> --Jesper
> p.s. I'm seeing a pattern with kswapdN contending on this lock.
>
> @stack[697, kswapd3]:
> __cgroup_rstat_lock+107
> __cgroup_rstat_lock+107
> cgroup_rstat_flush_locked+851
> cgroup_rstat_flush+35
> shrink_node+226
> balance_pgdat+807
> kswapd+521
> kthread+228
> ret_from_fork+48
> ret_from_fork_asm+27
>
> @stack[698, kswapd4]:
> __cgroup_rstat_lock+107
> __cgroup_rstat_lock+107
> cgroup_rstat_flush_locked+851
> cgroup_rstat_flush+35
> shrink_node+226
> balance_pgdat+807
> kswapd+521
> kthread+228
> ret_from_fork+48
> ret_from_fork_asm+27
>
> @stack[699, kswapd5]:
> __cgroup_rstat_lock+107
> __cgroup_rstat_lock+107
> cgroup_rstat_flush_locked+851
> cgroup_rstat_flush+35
> shrink_node+226
> balance_pgdat+807
> kswapd+521
> kthread+228
> ret_from_fork+48
> ret_from_fork_asm+27
>
Can you simply replace mem_cgroup_flush_stats() in
prepare_scan_control() with the ratelimited version and see if the issue
still persists for your production traffic?
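
Something like the following, i.e. just switch that call site to the
ratelimited flavor (untested, and I am going from memory on the exact
call site and signatures, so treat it as a sketch rather than a real
patch):

--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ static void prepare_scan_control(pg_data_t *pgdat, struct scan_control *sc)
-	mem_cgroup_flush_stats(sc->target_mem_cgroup);
+	mem_cgroup_flush_stats_ratelimited(sc->target_mem_cgroup);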
Also, were you able to find out which specific stats are getting the
most updates?