[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20200611105811.5q5rga2cmy6ypq7e@e107158-lin.cambridge.arm.com>
Date:   Thu, 11 Jun 2020 11:58:12 +0100
From:   Qais Yousef <qais.yousef@....com>
To:     Mel Gorman <mgorman@...e.de>
Cc:     Dietmar Eggemann <dietmar.eggemann@....com>,
        Peter Zijlstra <peterz@...radead.org>,
        Ingo Molnar <mingo@...hat.com>,
        Randy Dunlap <rdunlap@...radead.org>,
        Jonathan Corbet <corbet@....net>,
        Juri Lelli <juri.lelli@...hat.com>,
        Vincent Guittot <vincent.guittot@...aro.org>,
        Steven Rostedt <rostedt@...dmis.org>,
        Ben Segall <bsegall@...gle.com>,
        Luis Chamberlain <mcgrof@...nel.org>,
        Kees Cook <keescook@...omium.org>,
        Iurii Zaikin <yzaikin@...gle.com>,
        Quentin Perret <qperret@...gle.com>,
        Valentin Schneider <valentin.schneider@....com>,
        Patrick Bellasi <patrick.bellasi@...bug.net>,
        Pavan Kondeti <pkondeti@...eaurora.org>,
        linux-doc@...r.kernel.org, linux-kernel@...r.kernel.org,
        linux-fsdevel@...r.kernel.org
Subject: Re: [PATCH 1/2] sched/uclamp: Add a new sysctl to control RT default
 boost value
On 06/04/20 14:40, Mel Gorman wrote:
> On Wed, Jun 03, 2020 at 01:41:13PM +0100, Qais Yousef wrote:
> > > > netperf-udp
> > > >                                 ./5.7.0-rc7            ./5.7.0-rc7            ./5.7.0-rc7
> > > >                               without-clamp             with-clamp      with-clamp-tskgrp
> > > > 
> > > > Hmean     send-64         153.62 (   0.00%)      151.80 *  -1.19%*      155.60 *   1.28%*
> > > > Hmean     send-128        306.77 (   0.00%)      306.27 *  -0.16%*      309.39 *   0.85%*
> > > > Hmean     send-256        608.54 (   0.00%)      604.28 *  -0.70%*      613.42 *   0.80%*
> > > > Hmean     send-1024      2395.80 (   0.00%)     2365.67 *  -1.26%*     2409.50 *   0.57%*
> > > > Hmean     send-2048      4608.70 (   0.00%)     4544.02 *  -1.40%*     4665.96 *   1.24%*
> > > > Hmean     send-3312      7223.97 (   0.00%)     7158.88 *  -0.90%*     7331.23 *   1.48%*
> > > > Hmean     send-4096      8729.53 (   0.00%)     8598.78 *  -1.50%*     8860.47 *   1.50%*
> > > > Hmean     send-8192     14961.77 (   0.00%)    14418.92 *  -3.63%*    14908.36 *  -0.36%*
> > > > Hmean     send-16384    25799.50 (   0.00%)    25025.64 *  -3.00%*    25831.20 *   0.12%*
> > > > Hmean     recv-64         153.62 (   0.00%)      151.80 *  -1.19%*      155.60 *   1.28%*
> > > > Hmean     recv-128        306.77 (   0.00%)      306.27 *  -0.16%*      309.39 *   0.85%*
> > > > Hmean     recv-256        608.54 (   0.00%)      604.28 *  -0.70%*      613.42 *   0.80%*
> > > > Hmean     recv-1024      2395.80 (   0.00%)     2365.67 *  -1.26%*     2409.50 *   0.57%*
> > > > Hmean     recv-2048      4608.70 (   0.00%)     4544.02 *  -1.40%*     4665.95 *   1.24%*
> > > > Hmean     recv-3312      7223.97 (   0.00%)     7158.88 *  -0.90%*     7331.23 *   1.48%*
> > > > Hmean     recv-4096      8729.53 (   0.00%)     8598.78 *  -1.50%*     8860.47 *   1.50%*
> > > > Hmean     recv-8192     14961.61 (   0.00%)    14418.88 *  -3.63%*    14908.30 *  -0.36%*
> > > > Hmean     recv-16384    25799.39 (   0.00%)    25025.49 *  -3.00%*    25831.00 *   0.12%*
> > > > 
> > > > netperf-tcp
> > > >  
> > > > Hmean     64              818.65 (   0.00%)      812.98 *  -0.69%*      826.17 *   0.92%*
> > > > Hmean     128            1569.55 (   0.00%)     1555.79 *  -0.88%*     1586.94 *   1.11%*
> > > > Hmean     256            2952.86 (   0.00%)     2915.07 *  -1.28%*     2968.15 *   0.52%*
> > > > Hmean     1024          10425.91 (   0.00%)    10296.68 *  -1.24%*    10418.38 *  -0.07%*
> > > > Hmean     2048          17454.51 (   0.00%)    17369.57 *  -0.49%*    17419.24 *  -0.20%*
> > > > Hmean     3312          22509.95 (   0.00%)    22229.69 *  -1.25%*    22373.32 *  -0.61%*
> > > > Hmean     4096          25033.23 (   0.00%)    24859.59 *  -0.69%*    24912.50 *  -0.48%*
> > > > Hmean     8192          32080.51 (   0.00%)    31744.51 *  -1.05%*    31800.45 *  -0.87%*
> > > > Hmean     16384         36531.86 (   0.00%)    37064.68 *   1.46%*    37397.71 *   2.37%*
> > > > 
> > > > The diffs are smaller than on openSUSE Leap 15.1 and some of the
> > > > uclamp taskgroup results are better?
> > > > 
> > > 
> > > I don't see the stddev and coeff but these look close to borderline.
> > > Sure, they are marked with a * so it passed a significant test but it's
> > > still a very marginal difference for netperf. It's possible that the
> > > systemd configurations differ in some way that is significant for uclamp
> > > but I don't know what that is.
> > 
> > Hmm so what you're saying is that Dietmar didn't reproduce the same problem
> > you're observing? I was hoping to use that to dig more into it.
> > 
> 
> Not as such, I'm saying that for whatever reason the problem is not as
> visible with Dietmar's setup. It may be machine-specific or distribution
> specific. There are alternative suggestions for testing just the fast
> paths with a pipe test that may be clearer.
I have regained access to the same machine Dietmar ran his tests on. And I got
some weird results to share..
First I tried with `perf bench -r 20 sched pipe -T` command to identify the
cause of the overhead. And indeed I do see the activate/deactivate_task
overhead going up when uclamp is enabled
With uclamp run #1:
   2.44%  sched-pipe  [kernel.vmlinux]  [k] activate_task
   1.59%  sched-pipe  [kernel.vmlinux]  [k] deactivate_task
With uclamp run #2:
   4.55%  sched-pipe  [kernel.vmlinux]  [k] deactivate_task
   2.34%  sched-pipe  [kernel.vmlinux]  [k] activate_task
Without uclamp run #1:
   0.12%  sched-pipe  [kernel.vmlinux]  [k] activate_task
   0.11%  sched-pipe  [kernel.vmlinux]  [k] deactivate_task
Without uclamp run #2:
   0.11%  sched-pipe  [kernel.vmlinux]  [k] activate_task
   0.07%  sched-pipe  [kernel.vmlinux]  [k] deactivate_task
Looking at the annotation I see in the enqueue
  0.08 │ c5:   mov    %ecx,%esi                                                                                           ▒
  0.99 │       and    $0xfffff800,%r9d                                                                                    ▒
       │       and    $0x7ff,%eax                                                                                         ▒
       │       lea    0xd4(%rsi),%r13                                                                                     ▒
  0.10 │       or     %r9d,%eax                                                                                           ▒
       │     bucket->tasks++;                                                                                             ▒
  0.03 │       lea    (%rsi,%rsi,1),%r11                                                                                  ▒
       │     p->uclamp[clamp_id] = uclamp_eff_get(p, clamp_id);                                                           ▒
  0.02 │       mov    %eax,0x358(%rbx,%rdi,4)                                                                             ▒
       │     bucket = &uc_rq->bucket[uc_se->bucket_id];                                                                   ▒
  0.02 │       movzbl 0x9(%rbx,%r13,4),%ecx                                                                               ▒
       │     bucket->tasks++;                                                                                             ▒
  0.74 │       add    %rsi,%r11                                                                                           ▒
       │     bucket = &uc_rq->bucket[uc_se->bucket_id];                                                                   ▒
  0.02 │       shr    $0x3,%cl                                                                                            ▒
       │     bucket->tasks++;                                                                                             ▒
       │       and    $0x7,%ecx                                                                                           ◆
  0.05 │       lea    0x8(%rcx,%r11,2),%rax                                                                               ▒
  3.52 │       addq   $0x800,0x8(%r12,%rax,8)                                                                             ▒
       │     uc_se->active = true;                                                                                        ▒
       │       orb    $0x40,0x9(%rbx,%r13,4)                                                                              ▒
       │     uclamp_idle_reset(rq, clamp_id, uc_se->value);                                                               ▒
 73.34 │       movzwl 0x8(%rbx,%r13,4),%eax                     <--- XXXXX                                                ▒
       │     uclamp_idle_reset():                                                                                         ▒
       │     if (!(rq->uclamp_flags & UCLAMP_FLAG_IDLE))                                                                  ▒
       │       mov    0xa0(%r12),%r9d                                                                                     ▒
       │       mov    %r9d,%r10d                                                                                          ▒
       │     uclamp_rq_inc_id():                                                                                          ▒
       │     uclamp_idle_reset(rq, clamp_id, uc_se->value)
and at the dequeue
  0.07 │       mov    0x8(%rax),%ecx                                                                                      ▒
       │       test   %ecx,%ecx                                                                                           ▒
       │     ↑ je     60                                                                                                  ▒
  0.30 │       xor    %r14d,%r14d                                                                                         ▒
       │     uclamp_rq_dec_id():                                                                                          ▒
       │     bucket->tasks--;                                                                                             ▒
       │       mov    $0xfffffffffffff800,%r15                                                                            ▒
       │     bucket = &uc_rq->bucket[uc_se->bucket_id];                                                                   ▒
  0.07 │ 90:   mov    %r14d,%ecx                                                                                          ▒
       │       mov    %r14d,%r12d                                                                                         ▒
  0.04 │       lea    0xd4(%rcx),%rax                                                                                     ▒
 20.06 │       movzbl 0x9(%rsi,%rax,4),%r13d                     <--- XXXXX                                               ▒
       │     SCHED_WARN_ON(!bucket->tasks);                                                                               ▒
       │       lea    (%rcx,%rcx,1),%rax                                                                                  ▒
       │       add    %rcx,%rax                                                                                           ▒
       │     bucket = &uc_rq->bucket[uc_se->bucket_id];                                                                   ▒
  0.07 │       shr    $0x3,%r13b                                                                                          ▒
       │     SCHED_WARN_ON(!bucket->tasks);                                                                               ▒
  0.07 │       and    $0x7,%r13d                                                                                          ▒
  0.17 │       lea    0x8(%r13,%rax,2),%rax                                                                               ▒
 24.52 │       testq  $0xfffffffffffff800,0x8(%rbx,%rax,8)       <--- XXXXX                                               ▒
  0.34 │     ↓ je     172
.
.
.
  0.14 │       mov    %ecx,0x40(%rax)                                                                                     ▒
       │     ↑ jmpq   f8                                                                                                  ▒
  1.25 │250:   sub    $0x8,%rcx                                                                                           ▒
       │     uclamp_rq_max_value():                                                                                       ▒
       │     for ( ; bucket_id >= 0; bucket_id--) {                                                                       ▒
  4.97 │       cmp    %rcx,%rax                                                                                           ▒
       │     ↑ jne    22b                                                                                                 ▒
       │     uclamp_idle_value():                                                                                         ▒
       │     return uclamp_none(UCLAMP_MIN);                                                                              ▒
       │       xor    %ecx,%ecx                                                                                           ▒
       │     if (clamp_id == UCLAMP_MAX) {                                                                                ▒
  0.74 │       cmp    $0x1,%r14                                                                                           ▒
       │     ↑ jne    23d                                                                                                 ▒
       │     uclamp_rq_dec_id():                                                                                          ▒
       │     bkt_clamp = uclamp_rq_max_value(rq, clamp_id, uc_se->value);                                                 ▒
 20.10 │       movzwl 0x35c(%rsi),%ecx                           <--- XXXXXX                                              ▒
       │     uclamp_idle_value():                                                                                         ▒
       │     rq->uclamp_flags |= UCLAMP_FLAG_IDLE;                                                                        ▒
       │       orl    $0x1,0xa0(%rbx)                                                                                     ▒
       │     uclamp_rq_dec_id():                                                                                          ▒
       │     bkt_clamp = uclamp_rq_max_value(rq, clamp_id, uc_se->value);                                                 ▒
  0.14 │       and    $0x7ff,%ecx                                                                                         ▒
       │     ↑ jmp    23d
Which I interpreted as accesses to rq->uclamp[].bucket and p->uclamp[] structs.
The movzwl shanangians promoted me to remove the bitfields in case this is
causing some weird effect, and I shortend struct uclamp_bucket to reduce the
potential cache pressure to make it all fit in a single cache line
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 4418f5cb8324..7d0acf250573 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -601,10 +601,10 @@ struct sched_dl_entity {
  * default boost can still drop its own boosting to 0%.
  */
 struct uclamp_se {
-	unsigned int value		: bits_per(SCHED_CAPACITY_SCALE);
-	unsigned int bucket_id		: bits_per(UCLAMP_BUCKETS);
-	unsigned int active		: 1;
-	unsigned int user_defined	: 1;
+	unsigned int value;
+	unsigned int bucket_id;
+	unsigned int active;
+	unsigned int user_defined;
 };
 #endif /* CONFIG_UCLAMP_TASK */
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index db3a57675ccf..a1e7080c48e8 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -833,8 +833,8 @@ extern void rto_push_irq_work_func(struct irq_work *work);
  * clamp value.
  */
 struct uclamp_bucket {
-	unsigned long value : bits_per(SCHED_CAPACITY_SCALE);
-	unsigned long tasks : BITS_PER_LONG - bits_per(SCHED_CAPACITY_SCALE);
+	unsigned short value;
+	unsigned short tasks;
 };
 
 /*
This make the perf output look like this now
With patch run #1:
   1.34%  sched-pipe  [kernel.vmlinux]  [k] deactivate_task
   0.44%  sched-pipe  [kernel.vmlinux]  [k] activate_task
with patch run #2:
   2.41%  sched-pipe  [kernel.vmlinux]  [k] deactivate_task
   0.32%  sched-pipe  [kernel.vmlinux]  [k] activate_task
So it seems to help the activate_task path a lot, but not as much for the
deactivate_task path. Note that activate_task path was hotter than
deactivate_task without this patch.
Further, I have tried adding a static key like you suggested for psi
With static key disabling uclamp:
   0.13%  sched-pipe  [kernel.vmlinux]  [k] activate_task
   0.07%  sched-pipe  [kernel.vmlinux]  [k] deactivate_task
diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index d4f6215ee03f..1814baa95c81 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -84,6 +84,9 @@ extern int sched_rt_handler(struct ctl_table *table, int write,
 extern int sysctl_sched_uclamp_handler(struct ctl_table *table, int write,
 				       void __user *buffer, size_t *lenp,
 				       loff_t *ppos);
+extern int sysctl_sched_uclamp_disabled(struct ctl_table *table, int write,
+				        void __user *buffer, size_t *lenp,
+				        loff_t *ppos);
 #endif
 
 extern int sysctl_numa_balancing(struct ctl_table *table, int write,
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 9a2fbf98fd6f..8d932b3922c9 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -793,6 +793,8 @@ unsigned int sysctl_sched_uclamp_util_max = SCHED_CAPACITY_SCALE;
 /* All clamps are required to be less or equal than these values */
 static struct uclamp_se uclamp_default[UCLAMP_CNT];
 
+DEFINE_STATIC_KEY_FALSE(sched_uclamp_disabled);
+
 /* Integer rounded range for each bucket */
 #define UCLAMP_BUCKET_DELTA DIV_ROUND_CLOSEST(SCHED_CAPACITY_SCALE, UCLAMP_BUCKETS)
 
@@ -1020,6 +1022,9 @@ static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p)
 {
 	enum uclamp_id clamp_id;
 
+	if (static_branch_likely(&sched_uclamp_disabled))
+		return;
+
 	if (unlikely(!p->sched_class->uclamp_enabled))
 		return;
 
@@ -1035,6 +1040,9 @@ static inline void uclamp_rq_dec(struct rq *rq, struct task_struct *p)
 {
 	enum uclamp_id clamp_id;
 
+	if (static_branch_likely(&sched_uclamp_disabled))
+		return;
+
 	if (unlikely(!p->sched_class->uclamp_enabled))
 		return;
 
@@ -1164,6 +1172,30 @@ int sysctl_sched_uclamp_handler(struct ctl_table *table, int write,
 	return result;
 }
 
+int sysctl_sched_uclamp_disabled(struct ctl_table *table, int write,
+			 void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+	struct ctl_table t;
+	int err;
+	int state = static_branch_likely(&sched_uclamp_disabled);
+
+	if (write && !capable(CAP_SYS_ADMIN))
+		return -EPERM;
+
+	t = *table;
+	t.data = &state;
+	err = proc_dointvec_minmax(&t, write, buffer, lenp, ppos);
+	if (err < 0)
+		return err;
+	if (write) {
+		if (state)
+			static_branch_enable(&sched_uclamp_disabled);
+		else
+			static_branch_disable(&sched_uclamp_disabled);
+	}
+	return err;
+}
+
 static int uclamp_validate(struct task_struct *p,
 			   const struct sched_attr *attr)
 {
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 8a176d8727a3..ef842cbf1f91 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -453,6 +453,15 @@ static struct ctl_table kern_table[] = {
 		.mode		= 0644,
 		.proc_handler	= sysctl_sched_uclamp_handler,
 	},
+	{
+		.procname	= "sched_uclamp_disabled",
+		.data		= NULL,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= sysctl_sched_uclamp_disabled,
+		.extra1		= SYSCTL_ZERO,
+		.extra2		= SYSCTL_ONE,
+	},
 #endif
 #ifdef CONFIG_SCHED_AUTOGROUP
 	{
I stopped over here and decided to run with your netperf test to see what
effect this has, and this is where things gets weird.
Running with uclamp gives better results than without uclamp :-/
The nouclamp column is with uclamp disabled at the config level.
The '.disabled' postfix is uclamp disabled via sysctl patch (static key).
uclamp-opt is with the above patch that improved the D$ performance for perf
bench.
I ran uclamp and uclamp-opt with the static key patch applied. I ran them twice
as after noticing the nouclamp is worse than with uclamp I wanted to see how
the numbers differ between runs.
                                    nouclam               nouclamp                  uclam                 uclamp         uclamp.disable                 uclamp                 uclamp                 uclamp
                                   nouclamp              recompile                 uclamp                uclamp2        uclamp.disabled                    opt                   opt2           opt.disabled
Hmean     send-64         158.07 (   0.00%)      156.99 *  -0.68%*      163.83 *   3.65%*      160.97 *   1.83%*      163.93 *   3.71%*      159.62 *   0.98%*      161.79 *   2.36%*      161.14 *   1.94%*
Hmean     send-128        314.86 (   0.00%)      314.41 *  -0.14%*      329.05 *   4.51%*      322.88 *   2.55%*      327.88 *   4.14%*      317.56 *   0.86%*      320.72 *   1.86%*      319.62 *   1.51%*
Hmean     send-256        629.98 (   0.00%)      625.78 *  -0.67%*      652.67 *   3.60%*      639.98 *   1.59%*      643.99 *   2.22%*      631.96 *   0.31%*      635.75 *   0.92%*      644.10 *   2.24%*
Hmean     send-1024      2465.04 (   0.00%)     2452.29 *  -0.52%*     2554.66 *   3.64%*     2509.60 *   1.81%*     2540.71 *   3.07%*     2495.82 *   1.25%*     2490.50 *   1.03%*     2509.86 *   1.82%*
Hmean     send-2048      4717.57 (   0.00%)     4713.17 *  -0.09%*     4923.98 *   4.38%*     4811.01 *   1.98%*     4881.87 *   3.48%*     4793.82 *   1.62%*     4820.28 *   2.18%*     4824.60 *   2.27%*
Hmean     send-3312      7412.33 (   0.00%)     7433.42 *   0.28%*     7717.76 *   4.12%*     7522.97 *   1.49%*     7620.99 *   2.82%*     7522.89 *   1.49%*     7614.51 *   2.73%*     7568.51 *   2.11%*
Hmean     send-4096      9021.55 (   0.00%)     8988.71 *  -0.36%*     9337.62 *   3.50%*     9075.49 *   0.60%*     9258.34 *   2.62%*     9117.17 *   1.06%*     9175.85 *   1.71%*     9079.50 *   0.64%*
Hmean     send-8192     15370.36 (   0.00%)    15467.63 *   0.63%*    15999.52 *   4.09%*    15467.80 *   0.63%*    15978.69 *   3.96%*    15619.84 *   1.62%*    15395.09 *   0.16%*    15779.73 *   2.66%*
Hmean     send-16384    26512.35 (   0.00%)    26498.18 *  -0.05%*    26931.86 *   1.58%*    26513.18 *   0.00%*    26873.98 *   1.36%*    26456.38 *  -0.21%*    26467.77 *  -0.17%*    26975.04 *   1.75%*
Happy to try more things if you have any suggestions. I am getting a bit
stumped by the netperf results, but haven't tried to profile them. I might try
that but thought I'll report this first as it's time consuming.
Thanks
--
Qais Yousef
Powered by blists - more mailing lists
 
