Date:   Tue, 22 Aug 2023 14:48:50 +0800
From:   Chen Yu <yu.c.chen@...el.com>
To:     Peter Zijlstra <peterz@...radead.org>,
        Mike Galbraith <umgwanakikbuti@...il.com>
CC:     kernel test robot <oliver.sang@...el.com>,
        <oe-lkp@...ts.linux.dev>, <lkp@...el.com>,
        <linux-kernel@...r.kernel.org>, <x86@...nel.org>,
        "Ingo Molnar" <mingo@...nel.org>,
        K Prateek Nayak <kprateek.nayak@....com>,
        "Feng Tang" <feng.tang@...el.com>
Subject: Re: [tip:sched/eevdf] [sched/fair]  e0c2ff903c:
 phoronix-test-suite.blogbench.Write.final_score -34.8% regression

Hi Peter, Mike,

On 2023-08-18 at 09:09:29 +0800, Chen Yu wrote:
> On 2023-08-16 at 15:40:59 +0200, Peter Zijlstra wrote:
> > On Wed, Aug 16, 2023 at 02:37:16PM +0200, Peter Zijlstra wrote:
> > > On Mon, Aug 14, 2023 at 08:32:55PM +0200, Mike Galbraith wrote:
> > > 
> > > > --- a/kernel/sched/fair.c
> > > > +++ b/kernel/sched/fair.c
> > > > @@ -875,6 +875,12 @@ static struct sched_entity *pick_eevdf(s
> > > >  	if (curr && (!curr->on_rq || !entity_eligible(cfs_rq, curr)))
> > > >  		curr = NULL;
> > > >  
> > > > +	/*
> > > > +	 * Once selected, run the task to parity to avoid overscheduling.
> > > > +	 */
> > > > +	if (sched_feat(RUN_TO_PARITY) && curr)
> > > > +		return curr;
> > > > +
> > > >  	while (node) {
> > > >  		struct sched_entity *se = __node_2_se(node);
> > > >  
> > > 
> > > So I read it wrong last night... but I rather like this idea. But
> > > there's something missing. When curr starts a new slice it should
> > > probably do a full repick and not stick with it.
> > > 
> > > Let me poke at this a bit.. nice
> > 
> > Something like so.. it shouldn't matter much now, but might make a
> > difference once we start mixing different slice lengths.
> > 
> > ---
> >  kernel/sched/fair.c     | 12 ++++++++++++
> >  kernel/sched/features.h |  1 +
> >  2 files changed, 13 insertions(+)
> > 
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index fe5be91c71c7..128a78f3f264 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -873,6 +873,13 @@ static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq)
> >  	if (curr && (!curr->on_rq || !entity_eligible(cfs_rq, curr)))
> >  		curr = NULL;
> >  
> > +	/*
> > +	 * Once selected, run a task until it either becomes non-eligible or
> > +	 * until it gets a new slice. See the HACK in set_next_entity().
> > +	 */
> > +	if (sched_feat(RUN_TO_PARITY) && curr && curr->vlag == curr->deadline)
> > +		return curr;
> > +
> >  	while (node) {
> >  		struct sched_entity *se = __node_2_se(node);
> >  
> > @@ -5168,6 +5175,11 @@ set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
> >  		update_stats_wait_end_fair(cfs_rq, se);
> >  		__dequeue_entity(cfs_rq, se);
> >  		update_load_avg(cfs_rq, se, UPDATE_TG);
> > +		/*
> > +		 * HACK, stash a copy of deadline at the point of pick in vlag,
> > +		 * which isn't used until dequeue.
> > +		 */
> > +		se->vlag = se->deadline;
> >  	}
> >  
> >  	update_stats_curr_start(cfs_rq, se);
> > diff --git a/kernel/sched/features.h b/kernel/sched/features.h
> > index 61bcbf5e46a4..f770168230ae 100644
> > --- a/kernel/sched/features.h
> > +++ b/kernel/sched/features.h
> > @@ -6,6 +6,7 @@
> >   */
> >  SCHED_FEAT(PLACE_LAG, true)
> >  SCHED_FEAT(PLACE_DEADLINE_INITIAL, true)
> > +SCHED_FEAT(RUN_TO_PARITY, true)
> >  
> >  /*
> >   * Prefer to schedule the task we woke last (assuming it failed
> 
> I had a test on top of this, and it restored some performance:
> 
> uname -r
> 6.5.0-rc2-mixed-slice-run-parity
> 
> echo NO_RUN_TO_PARITY > /sys/kernel/debug/sched/features 
> hackbench -g 112 -f 20 --threads -l 30000 -s 100
> Running in threaded mode with 112 groups using 40 file descriptors each (== 4480 tasks)
> Each sender will pass 30000 messages of 100 bytes
> Time: 118.220  <---
> 
> 
> echo RUN_TO_PARITY > /sys/kernel/debug/sched/features 
> hackbench -g 112 -f 20 --threads -l 30000 -s 100
> Running in threaded mode with 112 groups using 40 file descriptors each (== 4480 tasks)
> Each sender will pass 30000 messages of 100 bytes
> Time: 100.873  <---
> 
> Then I switched to the first commit of the EEVDF series, the one that
> introduces avg_vruntime for cfs_rq (but does not use it yet).
> It seems that there is still about 50% of the performance left to be restored:
> 
> uname -r
> 6.5.0-rc2-add-cfs-avruntime
> 
> hackbench -g 112 -f 20 --threads -l 30000 -s 100
> Running in threaded mode with 112 groups using 40 file descriptors each (== 4480 tasks)
> Each sender will pass 30000 messages of 100 bytes
> Time: 52.569  <---
> 
> I'll collect the statistics to check what's remaining there.
> 
> [Besides, with the $subject benchmark, I tested it with PLACE_DEADLINE_INITIAL
> disabled, and it did not show much difference. I'll work with lkp
> to test it with RUN_TO_PARITY enabled.]
> 
>

To figure out the remaining hackbench performance gap, I pulled
tip/sched/core and generated flamegraphs to track the perf profile.
The bottleneck according to each flamegraph is as follows:

Baseline:
"sched/fair: Add cfs_rq::avg_vruntime"
unix_stream_sendmsg() -> slab_alloc_node() -> memcg_slab_pre_alloc_hook()
                      -> obj_cgroup_charge_pages() -> atomic_add()
The atomic write to the per-memory-cgroup counter is the bottleneck [1].

Compare:
"sched/eevdf: Curb wakeup-preemption"
unix_stream_sendmsg() -> slab_alloc_node() -> ___slab_alloc()
                      -> get_partial()  -> get_partial_node()
                      -> spinlock_irq()
The per-kmem_cache_node list_lock spinlock is the bottleneck [2].

In theory the negative impact of the atomic write should be smaller than
that of the spinlock, because the latter increases the busy-wait time a lot.
This might be the reason why we see lower throughput with EEVDF.
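
As a toy illustration of that comparison (userspace only, not the kernel
paths above; the thread count, iteration count and the use of pthread
spinlocks are arbitrary choices), the sketch below pits a shared atomic
add against a spinlock-protected increment. Under contention the spinlock
makes every waiter burn cycles spinning, which is the effect suspected here:

/* build: gcc -O2 -pthread atomic_vs_spinlock.c  (toy example, not kernel code) */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <time.h>

#define THREADS 8
#define ITERS   (1 << 20)

static atomic_long counter;		/* stands in for the memcg page counter */
static pthread_spinlock_t lock;		/* stands in for list_lock in get_partial_node() */
static long plain_counter;

static void *atomic_worker(void *arg)
{
	(void)arg;
	for (long i = 0; i < ITERS; i++)
		atomic_fetch_add(&counter, 1);
	return NULL;
}

static void *spinlock_worker(void *arg)
{
	(void)arg;
	for (long i = 0; i < ITERS; i++) {
		pthread_spin_lock(&lock);
		plain_counter++;
		pthread_spin_unlock(&lock);
	}
	return NULL;
}

static double run(void *(*fn)(void *))
{
	pthread_t tid[THREADS];
	struct timespec t0, t1;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (int i = 0; i < THREADS; i++)
		pthread_create(&tid[i], NULL, fn, NULL);
	for (int i = 0; i < THREADS; i++)
		pthread_join(tid[i], NULL);
	clock_gettime(CLOCK_MONOTONIC, &t1);

	return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(void)
{
	pthread_spin_init(&lock, PTHREAD_PROCESS_PRIVATE);
	printf("contended atomic add: %.3f s\n", run(atomic_worker));
	printf("contended spinlock  : %.3f s\n", run(spinlock_worker));
	return 0;
}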

Since EEVDF does not change any code in mm, I have no idea why the
bottleneck switches from the atomic write to the spinlock. One guess is
timing: before EEVDF, tasks entered the spinlock critical section at
different times, while after EEVDF, many tasks enter the critical
section at the same time.

Reducing the spinlock contention could help improve the situation.
Feng Tang told me that there is a discussion on LKML related to high
slub spinlock contention [2]. Feng also provided a hack patch that
enlarges the slab percpu partial list size so as to reduce the lock
contention on the node's partial list (if I understand the patch
correctly). A rough sketch of that idea is below.
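
Purely as an illustration of the kind of knob involved (this is not
Feng's actual patch, and the sizing below is from my reading of recent
mm/slub.c, so treat the exact values as an assumption): set_cpu_partial()
caps how many objects a CPU may keep on its per-cpu partial list, and
scaling those caps up lets more allocations be satisfied per-CPU instead
of falling back to get_partial_node(), which takes the node's list_lock:

/*
 * Hypothetical sketch only, not Feng's patch: bump the per-cpu partial
 * list limits chosen in mm/slub.c's set_cpu_partial() (values as I
 * recall them from recent kernels) by an arbitrary factor of 4.
 */
static void set_cpu_partial(struct kmem_cache *s)
{
#ifdef CONFIG_SLUB_CPU_PARTIAL
	unsigned int nr_objects;

	if (!kmem_cache_has_cpu_partial(s))
		nr_objects = 0;
	else if (s->size >= PAGE_SIZE)
		nr_objects = 6 * 4;	/* was 6 */
	else if (s->size >= 1024)
		nr_objects = 24 * 4;	/* was 24 */
	else if (s->size >= 256)
		nr_objects = 52 * 4;	/* was 52 */
	else
		nr_objects = 120 * 4;	/* was 120 */

	slub_set_cpu_partial(s, nr_objects);
#endif
}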

With this patch applied, the spinlock contention is reduced a lot [3],
and the performance improves to:

hackbench -g 112 -f 20 --threads -l 30000 -s 100
Running in threaded mode with 112 groups using 40 file descriptors each (== 4480 tasks)
Each sender will pass 30000 messages of 100 bytes
Time: 71.838

I guess that if we enlarge the percpu slab partial list further, it could
reduce the spinlock contention further and bring more improvement. Meanwhile
I'm still wondering what causes multiple tasks to enter the spinlock
section at the same time.

BTW, with RUN_TO_PARITY enabled, blogbench has also recovered some
of its performance.


[1] https://raw.githubusercontent.com/yu-chen-surf/schedtests/master/results/eevdf/flamegraph_add_avg_runtime.svg
[2] https://raw.githubusercontent.com/yu-chen-surf/schedtests/master/results/eevdf/flamegraph_eevdf_parity.svg
[3] https://raw.githubusercontent.com/yu-chen-surf/schedtests/master/results/eevdf/flamegraph_slab_enlarge_order_eevdf_parity.svg

thanks,
Chenyu
