lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:   Thu, 11 Jun 2020 11:58:12 +0100
From:   Qais Yousef <qais.yousef@....com>
To:     Mel Gorman <mgorman@...e.de>
Cc:     Dietmar Eggemann <dietmar.eggemann@....com>,
        Peter Zijlstra <peterz@...radead.org>,
        Ingo Molnar <mingo@...hat.com>,
        Randy Dunlap <rdunlap@...radead.org>,
        Jonathan Corbet <corbet@....net>,
        Juri Lelli <juri.lelli@...hat.com>,
        Vincent Guittot <vincent.guittot@...aro.org>,
        Steven Rostedt <rostedt@...dmis.org>,
        Ben Segall <bsegall@...gle.com>,
        Luis Chamberlain <mcgrof@...nel.org>,
        Kees Cook <keescook@...omium.org>,
        Iurii Zaikin <yzaikin@...gle.com>,
        Quentin Perret <qperret@...gle.com>,
        Valentin Schneider <valentin.schneider@....com>,
        Patrick Bellasi <patrick.bellasi@...bug.net>,
        Pavan Kondeti <pkondeti@...eaurora.org>,
        linux-doc@...r.kernel.org, linux-kernel@...r.kernel.org,
        linux-fsdevel@...r.kernel.org
Subject: Re: [PATCH 1/2] sched/uclamp: Add a new sysctl to control RT default
 boost value

On 06/04/20 14:40, Mel Gorman wrote:
> On Wed, Jun 03, 2020 at 01:41:13PM +0100, Qais Yousef wrote:
> > > > netperf-udp
> > > >                                 ./5.7.0-rc7            ./5.7.0-rc7            ./5.7.0-rc7
> > > >                               without-clamp             with-clamp      with-clamp-tskgrp
> > > > 
> > > > Hmean     send-64         153.62 (   0.00%)      151.80 *  -1.19%*      155.60 *   1.28%*
> > > > Hmean     send-128        306.77 (   0.00%)      306.27 *  -0.16%*      309.39 *   0.85%*
> > > > Hmean     send-256        608.54 (   0.00%)      604.28 *  -0.70%*      613.42 *   0.80%*
> > > > Hmean     send-1024      2395.80 (   0.00%)     2365.67 *  -1.26%*     2409.50 *   0.57%*
> > > > Hmean     send-2048      4608.70 (   0.00%)     4544.02 *  -1.40%*     4665.96 *   1.24%*
> > > > Hmean     send-3312      7223.97 (   0.00%)     7158.88 *  -0.90%*     7331.23 *   1.48%*
> > > > Hmean     send-4096      8729.53 (   0.00%)     8598.78 *  -1.50%*     8860.47 *   1.50%*
> > > > Hmean     send-8192     14961.77 (   0.00%)    14418.92 *  -3.63%*    14908.36 *  -0.36%*
> > > > Hmean     send-16384    25799.50 (   0.00%)    25025.64 *  -3.00%*    25831.20 *   0.12%*
> > > > Hmean     recv-64         153.62 (   0.00%)      151.80 *  -1.19%*      155.60 *   1.28%*
> > > > Hmean     recv-128        306.77 (   0.00%)      306.27 *  -0.16%*      309.39 *   0.85%*
> > > > Hmean     recv-256        608.54 (   0.00%)      604.28 *  -0.70%*      613.42 *   0.80%*
> > > > Hmean     recv-1024      2395.80 (   0.00%)     2365.67 *  -1.26%*     2409.50 *   0.57%*
> > > > Hmean     recv-2048      4608.70 (   0.00%)     4544.02 *  -1.40%*     4665.95 *   1.24%*
> > > > Hmean     recv-3312      7223.97 (   0.00%)     7158.88 *  -0.90%*     7331.23 *   1.48%*
> > > > Hmean     recv-4096      8729.53 (   0.00%)     8598.78 *  -1.50%*     8860.47 *   1.50%*
> > > > Hmean     recv-8192     14961.61 (   0.00%)    14418.88 *  -3.63%*    14908.30 *  -0.36%*
> > > > Hmean     recv-16384    25799.39 (   0.00%)    25025.49 *  -3.00%*    25831.00 *   0.12%*
> > > > 
> > > > netperf-tcp
> > > >  
> > > > Hmean     64              818.65 (   0.00%)      812.98 *  -0.69%*      826.17 *   0.92%*
> > > > Hmean     128            1569.55 (   0.00%)     1555.79 *  -0.88%*     1586.94 *   1.11%*
> > > > Hmean     256            2952.86 (   0.00%)     2915.07 *  -1.28%*     2968.15 *   0.52%*
> > > > Hmean     1024          10425.91 (   0.00%)    10296.68 *  -1.24%*    10418.38 *  -0.07%*
> > > > Hmean     2048          17454.51 (   0.00%)    17369.57 *  -0.49%*    17419.24 *  -0.20%*
> > > > Hmean     3312          22509.95 (   0.00%)    22229.69 *  -1.25%*    22373.32 *  -0.61%*
> > > > Hmean     4096          25033.23 (   0.00%)    24859.59 *  -0.69%*    24912.50 *  -0.48%*
> > > > Hmean     8192          32080.51 (   0.00%)    31744.51 *  -1.05%*    31800.45 *  -0.87%*
> > > > Hmean     16384         36531.86 (   0.00%)    37064.68 *   1.46%*    37397.71 *   2.37%*
> > > > 
> > > > The diffs are smaller than on openSUSE Leap 15.1 and some of the
> > > > uclamp taskgroup results are better?
> > > > 
> > > 
> > > I don't see the stddev and coeff but these look close to borderline.
> > > Sure, they are marked with a * so it passed a significant test but it's
> > > still a very marginal difference for netperf. It's possible that the
> > > systemd configurations differ in some way that is significant for uclamp
> > > but I don't know what that is.
> > 
> > Hmm so what you're saying is that Dietmar didn't reproduce the same problem
> > you're observing? I was hoping to use that to dig more into it.
> > 
> 
> Not as such, I'm saying that for whatever reason the problem is not as
> visible with Dietmar's setup. It may be machine-specific or distribution
> specific. There are alternative suggestions for testing just the fast
> paths with a pipe test that may be clearer.

I have regained access to the same machine Dietmar ran his tests on. And I got
some weird results to share..

First I tried with `perf bench -r 20 sched pipe -T` command to identify the
cause of the overhead. And indeed I do see the activate/deactivate_task
overhead going up when uclamp is enabled

With uclamp run #1:

   2.44%  sched-pipe  [kernel.vmlinux]  [k] activate_task
   1.59%  sched-pipe  [kernel.vmlinux]  [k] deactivate_task

With uclamp run #2:

   4.55%  sched-pipe  [kernel.vmlinux]  [k] deactivate_task
   2.34%  sched-pipe  [kernel.vmlinux]  [k] activate_task

Without uclamp run #1:

   0.12%  sched-pipe  [kernel.vmlinux]  [k] activate_task
   0.11%  sched-pipe  [kernel.vmlinux]  [k] deactivate_task

Without uclamp run #2:

   0.11%  sched-pipe  [kernel.vmlinux]  [k] activate_task
   0.07%  sched-pipe  [kernel.vmlinux]  [k] deactivate_task

Looking at the annotation I see in the enqueue

  0.08 │ c5:   mov    %ecx,%esi                                                                                           ▒
  0.99 │       and    $0xfffff800,%r9d                                                                                    ▒
       │       and    $0x7ff,%eax                                                                                         ▒
       │       lea    0xd4(%rsi),%r13                                                                                     ▒
  0.10 │       or     %r9d,%eax                                                                                           ▒
       │     bucket->tasks++;                                                                                             ▒
  0.03 │       lea    (%rsi,%rsi,1),%r11                                                                                  ▒
       │     p->uclamp[clamp_id] = uclamp_eff_get(p, clamp_id);                                                           ▒
  0.02 │       mov    %eax,0x358(%rbx,%rdi,4)                                                                             ▒
       │     bucket = &uc_rq->bucket[uc_se->bucket_id];                                                                   ▒
  0.02 │       movzbl 0x9(%rbx,%r13,4),%ecx                                                                               ▒
       │     bucket->tasks++;                                                                                             ▒
  0.74 │       add    %rsi,%r11                                                                                           ▒
       │     bucket = &uc_rq->bucket[uc_se->bucket_id];                                                                   ▒
  0.02 │       shr    $0x3,%cl                                                                                            ▒
       │     bucket->tasks++;                                                                                             ▒
       │       and    $0x7,%ecx                                                                                           ◆
  0.05 │       lea    0x8(%rcx,%r11,2),%rax                                                                               ▒
  3.52 │       addq   $0x800,0x8(%r12,%rax,8)                                                                             ▒
       │     uc_se->active = true;                                                                                        ▒
       │       orb    $0x40,0x9(%rbx,%r13,4)                                                                              ▒
       │     uclamp_idle_reset(rq, clamp_id, uc_se->value);                                                               ▒
 73.34 │       movzwl 0x8(%rbx,%r13,4),%eax                     <--- XXXXX                                                ▒
       │     uclamp_idle_reset():                                                                                         ▒
       │     if (!(rq->uclamp_flags & UCLAMP_FLAG_IDLE))                                                                  ▒
       │       mov    0xa0(%r12),%r9d                                                                                     ▒
       │       mov    %r9d,%r10d                                                                                          ▒
       │     uclamp_rq_inc_id():                                                                                          ▒
       │     uclamp_idle_reset(rq, clamp_id, uc_se->value)

and at the dequeue

  0.07 │       mov    0x8(%rax),%ecx                                                                                      ▒
       │       test   %ecx,%ecx                                                                                           ▒
       │     ↑ je     60                                                                                                  ▒
  0.30 │       xor    %r14d,%r14d                                                                                         ▒
       │     uclamp_rq_dec_id():                                                                                          ▒
       │     bucket->tasks--;                                                                                             ▒
       │       mov    $0xfffffffffffff800,%r15                                                                            ▒
       │     bucket = &uc_rq->bucket[uc_se->bucket_id];                                                                   ▒
  0.07 │ 90:   mov    %r14d,%ecx                                                                                          ▒
       │       mov    %r14d,%r12d                                                                                         ▒
  0.04 │       lea    0xd4(%rcx),%rax                                                                                     ▒
 20.06 │       movzbl 0x9(%rsi,%rax,4),%r13d                     <--- XXXXX                                               ▒
       │     SCHED_WARN_ON(!bucket->tasks);                                                                               ▒
       │       lea    (%rcx,%rcx,1),%rax                                                                                  ▒
       │       add    %rcx,%rax                                                                                           ▒
       │     bucket = &uc_rq->bucket[uc_se->bucket_id];                                                                   ▒
  0.07 │       shr    $0x3,%r13b                                                                                          ▒
       │     SCHED_WARN_ON(!bucket->tasks);                                                                               ▒
  0.07 │       and    $0x7,%r13d                                                                                          ▒
  0.17 │       lea    0x8(%r13,%rax,2),%rax                                                                               ▒
 24.52 │       testq  $0xfffffffffffff800,0x8(%rbx,%rax,8)       <--- XXXXX                                               ▒
  0.34 │     ↓ je     172
.
.
.
  0.14 │       mov    %ecx,0x40(%rax)                                                                                     ▒
       │     ↑ jmpq   f8                                                                                                  ▒
  1.25 │250:   sub    $0x8,%rcx                                                                                           ▒
       │     uclamp_rq_max_value():                                                                                       ▒
       │     for ( ; bucket_id >= 0; bucket_id--) {                                                                       ▒
  4.97 │       cmp    %rcx,%rax                                                                                           ▒
       │     ↑ jne    22b                                                                                                 ▒
       │     uclamp_idle_value():                                                                                         ▒
       │     return uclamp_none(UCLAMP_MIN);                                                                              ▒
       │       xor    %ecx,%ecx                                                                                           ▒
       │     if (clamp_id == UCLAMP_MAX) {                                                                                ▒
  0.74 │       cmp    $0x1,%r14                                                                                           ▒
       │     ↑ jne    23d                                                                                                 ▒
       │     uclamp_rq_dec_id():                                                                                          ▒
       │     bkt_clamp = uclamp_rq_max_value(rq, clamp_id, uc_se->value);                                                 ▒
 20.10 │       movzwl 0x35c(%rsi),%ecx                           <--- XXXXXX                                              ▒
       │     uclamp_idle_value():                                                                                         ▒
       │     rq->uclamp_flags |= UCLAMP_FLAG_IDLE;                                                                        ▒
       │       orl    $0x1,0xa0(%rbx)                                                                                     ▒
       │     uclamp_rq_dec_id():                                                                                          ▒
       │     bkt_clamp = uclamp_rq_max_value(rq, clamp_id, uc_se->value);                                                 ▒
  0.14 │       and    $0x7ff,%ecx                                                                                         ▒
       │     ↑ jmp    23d


Which I interpreted as accesses to rq->uclamp[].bucket and p->uclamp[] structs.

The movzwl shanangians promoted me to remove the bitfields in case this is
causing some weird effect, and I shortend struct uclamp_bucket to reduce the
potential cache pressure to make it all fit in a single cache line


diff --git a/include/linux/sched.h b/include/linux/sched.h
index 4418f5cb8324..7d0acf250573 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -601,10 +601,10 @@ struct sched_dl_entity {
  * default boost can still drop its own boosting to 0%.
  */
 struct uclamp_se {
-	unsigned int value		: bits_per(SCHED_CAPACITY_SCALE);
-	unsigned int bucket_id		: bits_per(UCLAMP_BUCKETS);
-	unsigned int active		: 1;
-	unsigned int user_defined	: 1;
+	unsigned int value;
+	unsigned int bucket_id;
+	unsigned int active;
+	unsigned int user_defined;
 };
 #endif /* CONFIG_UCLAMP_TASK */
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index db3a57675ccf..a1e7080c48e8 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -833,8 +833,8 @@ extern void rto_push_irq_work_func(struct irq_work *work);
  * clamp value.
  */
 struct uclamp_bucket {
-	unsigned long value : bits_per(SCHED_CAPACITY_SCALE);
-	unsigned long tasks : BITS_PER_LONG - bits_per(SCHED_CAPACITY_SCALE);
+	unsigned short value;
+	unsigned short tasks;
 };
 
 /*


This make the perf output look like this now

With patch run #1:

   1.34%  sched-pipe  [kernel.vmlinux]  [k] deactivate_task
   0.44%  sched-pipe  [kernel.vmlinux]  [k] activate_task

with patch run #2:

   2.41%  sched-pipe  [kernel.vmlinux]  [k] deactivate_task
   0.32%  sched-pipe  [kernel.vmlinux]  [k] activate_task


So it seems to help the activate_task path a lot, but not as much for the
deactivate_task path. Note that activate_task path was hotter than
deactivate_task without this patch.

Further, I have tried adding a static key like you suggested for psi

With static key disabling uclamp:

   0.13%  sched-pipe  [kernel.vmlinux]  [k] activate_task
   0.07%  sched-pipe  [kernel.vmlinux]  [k] deactivate_task



diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index d4f6215ee03f..1814baa95c81 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -84,6 +84,9 @@ extern int sched_rt_handler(struct ctl_table *table, int write,
 extern int sysctl_sched_uclamp_handler(struct ctl_table *table, int write,
 				       void __user *buffer, size_t *lenp,
 				       loff_t *ppos);
+extern int sysctl_sched_uclamp_disabled(struct ctl_table *table, int write,
+				        void __user *buffer, size_t *lenp,
+				        loff_t *ppos);
 #endif
 
 extern int sysctl_numa_balancing(struct ctl_table *table, int write,
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 9a2fbf98fd6f..8d932b3922c9 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -793,6 +793,8 @@ unsigned int sysctl_sched_uclamp_util_max = SCHED_CAPACITY_SCALE;
 /* All clamps are required to be less or equal than these values */
 static struct uclamp_se uclamp_default[UCLAMP_CNT];
 
+DEFINE_STATIC_KEY_FALSE(sched_uclamp_disabled);
+
 /* Integer rounded range for each bucket */
 #define UCLAMP_BUCKET_DELTA DIV_ROUND_CLOSEST(SCHED_CAPACITY_SCALE, UCLAMP_BUCKETS)
 
@@ -1020,6 +1022,9 @@ static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p)
 {
 	enum uclamp_id clamp_id;
 
+	if (static_branch_likely(&sched_uclamp_disabled))
+		return;
+
 	if (unlikely(!p->sched_class->uclamp_enabled))
 		return;
 
@@ -1035,6 +1040,9 @@ static inline void uclamp_rq_dec(struct rq *rq, struct task_struct *p)
 {
 	enum uclamp_id clamp_id;
 
+	if (static_branch_likely(&sched_uclamp_disabled))
+		return;
+
 	if (unlikely(!p->sched_class->uclamp_enabled))
 		return;
 
@@ -1164,6 +1172,30 @@ int sysctl_sched_uclamp_handler(struct ctl_table *table, int write,
 	return result;
 }
 
+int sysctl_sched_uclamp_disabled(struct ctl_table *table, int write,
+			 void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+	struct ctl_table t;
+	int err;
+	int state = static_branch_likely(&sched_uclamp_disabled);
+
+	if (write && !capable(CAP_SYS_ADMIN))
+		return -EPERM;
+
+	t = *table;
+	t.data = &state;
+	err = proc_dointvec_minmax(&t, write, buffer, lenp, ppos);
+	if (err < 0)
+		return err;
+	if (write) {
+		if (state)
+			static_branch_enable(&sched_uclamp_disabled);
+		else
+			static_branch_disable(&sched_uclamp_disabled);
+	}
+	return err;
+}
+
 static int uclamp_validate(struct task_struct *p,
 			   const struct sched_attr *attr)
 {
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 8a176d8727a3..ef842cbf1f91 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -453,6 +453,15 @@ static struct ctl_table kern_table[] = {
 		.mode		= 0644,
 		.proc_handler	= sysctl_sched_uclamp_handler,
 	},
+	{
+		.procname	= "sched_uclamp_disabled",
+		.data		= NULL,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= sysctl_sched_uclamp_disabled,
+		.extra1		= SYSCTL_ZERO,
+		.extra2		= SYSCTL_ONE,
+	},
 #endif
 #ifdef CONFIG_SCHED_AUTOGROUP
 	{



I stopped over here and decided to run with your netperf test to see what
effect this has, and this is where things gets weird.


Running with uclamp gives better results than without uclamp :-/


The nouclamp column is with uclamp disabled at the config level.

The '.disabled' postfix is uclamp disabled via sysctl patch (static key).

uclamp-opt is with the above patch that improved the D$ performance for perf
bench.

I ran uclamp and uclamp-opt with the static key patch applied. I ran them twice
as after noticing the nouclamp is worse than with uclamp I wanted to see how
the numbers differ between runs.

                                    nouclam               nouclamp                  uclam                 uclamp         uclamp.disable                 uclamp                 uclamp                 uclamp
                                   nouclamp              recompile                 uclamp                uclamp2        uclamp.disabled                    opt                   opt2           opt.disabled
Hmean     send-64         158.07 (   0.00%)      156.99 *  -0.68%*      163.83 *   3.65%*      160.97 *   1.83%*      163.93 *   3.71%*      159.62 *   0.98%*      161.79 *   2.36%*      161.14 *   1.94%*
Hmean     send-128        314.86 (   0.00%)      314.41 *  -0.14%*      329.05 *   4.51%*      322.88 *   2.55%*      327.88 *   4.14%*      317.56 *   0.86%*      320.72 *   1.86%*      319.62 *   1.51%*
Hmean     send-256        629.98 (   0.00%)      625.78 *  -0.67%*      652.67 *   3.60%*      639.98 *   1.59%*      643.99 *   2.22%*      631.96 *   0.31%*      635.75 *   0.92%*      644.10 *   2.24%*
Hmean     send-1024      2465.04 (   0.00%)     2452.29 *  -0.52%*     2554.66 *   3.64%*     2509.60 *   1.81%*     2540.71 *   3.07%*     2495.82 *   1.25%*     2490.50 *   1.03%*     2509.86 *   1.82%*
Hmean     send-2048      4717.57 (   0.00%)     4713.17 *  -0.09%*     4923.98 *   4.38%*     4811.01 *   1.98%*     4881.87 *   3.48%*     4793.82 *   1.62%*     4820.28 *   2.18%*     4824.60 *   2.27%*
Hmean     send-3312      7412.33 (   0.00%)     7433.42 *   0.28%*     7717.76 *   4.12%*     7522.97 *   1.49%*     7620.99 *   2.82%*     7522.89 *   1.49%*     7614.51 *   2.73%*     7568.51 *   2.11%*
Hmean     send-4096      9021.55 (   0.00%)     8988.71 *  -0.36%*     9337.62 *   3.50%*     9075.49 *   0.60%*     9258.34 *   2.62%*     9117.17 *   1.06%*     9175.85 *   1.71%*     9079.50 *   0.64%*
Hmean     send-8192     15370.36 (   0.00%)    15467.63 *   0.63%*    15999.52 *   4.09%*    15467.80 *   0.63%*    15978.69 *   3.96%*    15619.84 *   1.62%*    15395.09 *   0.16%*    15779.73 *   2.66%*
Hmean     send-16384    26512.35 (   0.00%)    26498.18 *  -0.05%*    26931.86 *   1.58%*    26513.18 *   0.00%*    26873.98 *   1.36%*    26456.38 *  -0.21%*    26467.77 *  -0.17%*    26975.04 *   1.75%*


Happy to try more things if you have any suggestions. I am getting a bit
stumped by the netperf results, but haven't tried to profile them. I might try
that but thought I'll report this first as it's time consuming.

Thanks

--
Qais Yousef

Powered by blists - more mailing lists