linux-kernel - [Discussion v2] Usecases for the per-task latency-nice attribute

Message-Id: <2bd46086-43ff-f130-8720-8eec694eb55b@linux.ibm.com>
Date:   Mon, 30 Sep 2019 16:13:53 +0530
From:   Parth Shah <parth@...ux.ibm.com>
To:     linux-kernel@...r.kernel.org, patrick.bellasi@...bug.net,
        tim.c.chen@...ux.intel.com, valentin.schneider@....com,
        qais.yousef@....com, linux-pm@...r.kernel.org
Cc:     peterz@...radead.org, vincent.guittot@...aro.org, pavel@....cz,
        David.Laight@...LAB.COM, mingo@...hat.com,
        morten.rasmussen@....com, pjt@...gle.com, dietmar.eggemann@....com,
        tj@...nel.org, rafael.j.wysocki@...el.com,
        daniel.lezcano@...aro.org, dhaval.giani@...cle.com,
        quentin.perret@....com,
        subhra mazumdar <subhra.mazumdar@...cle.com>,
        ggherdovich@...e.cz, viresh.kumar@...aro.org,
        Doug Smythies <dsmythies@...us.net>
Subject: [Discussion v2] Usecases for the per-task latency-nice attribute

Hello everyone,

This is v2 of the discussion on introducing a per-task latency-nice
attribute for providing hints to the scheduler.

v1: https://lkml.org/lkml/2019/9/18/555

In brief, we face two challenges in introducing such an attribute.

1. Name:
==============
( Should be relevant to all the possible usecases, not confuse the end-user,
and reflect the functionality it provides to the scheduler )

Curated list of proposed names:

1. latency-nice:
   should be easy to understand, since it builds on the pre-existing
   niceness concept

- But it is open to two interpretations:
  a) -20 (least nice to latency, i.e. sacrifice latency for throughput)
     +19 (most nice to latency, i.e. sacrifice throughput for latency)
  b) -20 (least nice to other tasks in terms of sacrificing latency, i.e.
	  latency-sensitive)
     +19 (most nice to other tasks in terms of sacrificing latency, i.e.
	  latency-forgoing)

2. latency-tolerant:
   decouples its meaning somewhat from niceness, which may give a bit
   more freedom in its complete definition and perhaps avoids any
   confusion over interpretation

3. latency-nasty

4. latency-sensible



2. Value(s):
==============
( Boolean/Ternary, Range of values, profile tagging )

- Recent discussion suggests that the range [-20, 19] is the most agreed upon.

1. Range:
- [-20, 19]:
    This is similar to the niceness concept and gives a minimal
    continuous range. It can come in handy for things like scaling the
    vruntime normalization [3]

2. Profile tagging:
- Can be used just like a flag attribute
  e.g., Background, foreground, latency-sensible, reduce-idle-search, etc.

3. Binary:
- 0 for: latency sensitive/sensible/intolerant/hungry/...
- 1 for: latency insensitive/insensible/tolerant/nice-to-others/...

  Ternary:
-  0: no effect
- -1: require least latency
- +1: no restrictions in terms of lower/higher latency
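
To illustrate how a range could subsume the binary and ternary options, here
is a toy Python model (not kernel code). It assumes interpretation (b) above,
i.e. -20 is latency-sensitive and +19 is latency-forgoing. The function names
and the default threshold of 0 are hypothetical, chosen only for this sketch.

```python
# Toy model only; none of these names exist in the kernel.
LATENCY_NICE_MIN = -20
LATENCY_NICE_MAX = 19


def clamp_latency_nice(value):
    """Clamp an arbitrary integer into the proposed [-20, 19] range."""
    return max(LATENCY_NICE_MIN, min(LATENCY_NICE_MAX, value))


def to_ternary(value):
    """Collapse the range: -1 least latency, 0 no effect, +1 no restrictions."""
    value = clamp_latency_nice(value)
    if value < 0:
        return -1
    if value > 0:
        return 1
    return 0


def to_binary(value, threshold=0):
    """Collapse the range: 0 latency-sensitive, 1 latency-tolerant.

    The threshold is an assumption; any cut-off inside the range would do.
    """
    return 1 if clamp_latency_nice(value) > threshold else 0
```

A consumer that only needs a flag can thus derive it from the range, while
other consumers keep the finer granularity.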


------------------
**Usecases**
-----------------

1> Reduce search scan time for idle Cores
( -Subhra Mazumdar )
=====================================
Currently, CFS searches across the LLC domain for an idle core, and this
exhaustive search gets expensive once the core count grows beyond a certain
point. This hurts latency-sensitive tasks, since the scheduler spends much
of its time searching for an idle core on which to wake up the task. This
could potentially be solved by limiting the idle core search for tasks which
require the least latency. Having userland provide hints to the scheduler by
tagging such tasks is a solution proposed in the community and has shown
positive results [1].
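
A toy Python model of the idea (not the patch in [1]): the number of CPUs
scanned in the LLC domain shrinks as the task becomes more latency-sensitive.
The linear scaling and the names search_budget() and find_idle_cpu() are
assumptions made for this sketch.

```python
from typing import Optional, Sequence


def search_budget(latency_nice: int, llc_cpus: int) -> int:
    """CPUs to scan: 1 at -20, the whole LLC at +19 (linear scaling assumed)."""
    return max(1, (latency_nice + 20) * llc_cpus // 39)


def find_idle_cpu(idle: Sequence[bool], start: int,
                  latency_nice: int) -> Optional[int]:
    """Scan at most search_budget() CPUs, wrapping around from `start`.

    Returns the first idle CPU found, or None if the budget runs out.
    """
    n = len(idle)
    for i in range(search_budget(latency_nice, n)):
        cpu = (start + i) % n
        if idle[cpu]:
            return cpu
    return None
```

With a budget of one CPU the caller falls back quickly (e.g. to the previous
CPU), trading placement quality for a shorter wakeup path.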


2> TurboSched
( -Parth Shah )
====================
TurboSched [2] tries to minimize the number of active cores in a socket by
packing unimportant, low-utilization tasks (named jitter tasks) onto an
already active core, and thus refrains from waking up a new core where
possible. This requires userspace to tag tasks, hinting which ones are
unimportant, so that waking up a new core to minimize latency is
unnecessary for them.
As per the discussion on the posted RFC, it would be appropriate to use the
task latency property, where a task with the highest latency-nice value can
be packed.
For this specific use-case, a binary value indicating whether a task is
latency-sensitive or not is sufficient, but a range is also a good way to
go, where a task above some threshold can be packed.
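
A toy Python model of threshold-based packing (not the TurboSched
implementation in [2]). The threshold value, the per-core capacity of 100 and
the choice of the least-loaded active core are all assumptions made for this
sketch.

```python
PACK_THRESHOLD = 15  # hypothetical cut-off: at or above this, pack the task


def pick_core(core_util, task_util, latency_nice, capacity=100):
    """Return the index of the core on which to place the task.

    core_util[i] == 0 means core i is idle (not yet active).
    """
    if latency_nice >= PACK_THRESHOLD:
        # Jitter task: try an already active core that still has room.
        active = [c for c, u in enumerate(core_util)
                  if u > 0 and u + task_util <= capacity]
        if active:
            return min(active, key=lambda c: core_util[c])
    # Latency-sensitive task, or no room to pack: prefer waking an idle core.
    idle_cores = [c for c, u in enumerate(core_util) if u == 0]
    if idle_cores:
        return idle_cores[0]
    # Nothing idle: fall back to the least-loaded core.
    return min(range(len(core_util)), key=lambda c: core_util[c])
```

For example, with core_util = [60, 0, 30, 0] and a task of utilization 10, a
task with latency-nice 19 is packed on core 2, while a task with latency-nice
0 wakes up idle core 1.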


3> Wakeup path tunings
( -Patrick Bellasi )
==========================
Some additional possible use-cases were already discussed in [3]:

 - dynamically tune the policy of a task among SCHED_{OTHER,BATCH,IDLE}
   depending on crossing certain pre-configured threshold of latency
   niceness.

 - dynamically bias the vruntime updates we do in place_entity()
   depending on the actual latency niceness of a task.
   This would make the tweaks we already have for:
	 - START_DEBIT
	 - GENTLE_FAIR_SLEEPERS
   a bit more parametric and proportional to the latency-nice of a task.

 - bias the decisions we take in check_preempt_tick() still depending
   on a relative comparison of the current and wakeup task latency
   niceness values.
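
Two of these ideas can be sketched as a toy Python model (not kernel code,
and not the actual logic of check_preempt_tick()). The policy thresholds, the
base granularity and the linear scaling are assumptions made for this sketch.

```python
BATCH_THRESHOLD = 10  # hypothetical pre-configured thresholds
IDLE_THRESHOLD = 19


def policy_for(latency_nice):
    """Pick a policy name once latency niceness crosses a threshold."""
    if latency_nice >= IDLE_THRESHOLD:
        return "SCHED_IDLE"
    if latency_nice >= BATCH_THRESHOLD:
        return "SCHED_BATCH"
    return "SCHED_OTHER"


def should_preempt(curr_ln, wakee_ln, vruntime_lag_ns,
                   base_gran_ns=1_000_000):
    """Bias preemption by the relative latency niceness of the two tasks.

    vruntime_lag_ns is how far the current task's vruntime is ahead of the
    woken task's. The granularity scales from 0 (wakee far more
    latency-sensitive than current) to 2x base (wakee far less sensitive).
    """
    delta = wakee_ln - curr_ln  # in [-39, 39]
    gran_ns = base_gran_ns * (39 + delta) // 39
    return vruntime_lag_ns > gran_ns
```

With equal latency niceness the sketch behaves like a fixed granularity;
only the relative difference between the two tasks shifts the decision.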


4> Load balance tuning
( -Valentin Schneider )
======================
These were already mentioned in [4]:
- Increase (reduce) nr_balance_failed threshold when trying to active
  balance a latency-sensitive (non-latency-sensitive) task.

- Increase (decrease) sched_migration_cost factor in task_hot() for
  latency-sensitive (non-latency-sensitive) tasks.
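
The second point can be sketched as a toy Python model (not the kernel's
task_hot()). The base cost of 500000 ns and the linear scaling factor are
assumptions made for this sketch.

```python
BASE_MIGRATION_COST_NS = 500_000  # assumed base value for this sketch


def migration_cost_ns(latency_nice, base_ns=BASE_MIGRATION_COST_NS):
    """Scale the cost: 2x at -20, 1x at 0, 1/20 at +19 (linear, assumed)."""
    return base_ns * (20 - latency_nice) // 20


def task_hot(ns_since_last_run, latency_nice):
    """A task counts as cache-hot if it ran more recently than its cost.

    Latency-sensitive tasks get a larger cost, so they stay hot for longer
    and are less likely to be migrated by the load balancer.
    """
    return ns_since_last_run < migration_cost_ns(latency_nice)
```

For example, a task that last ran 800000 ns ago is still considered hot at
latency-nice -20, but not at latency-nice 0.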


5> Separating AVX512 tasks and latency sensitive tasks on separate cores
( -Tim Chen )
===========================================================================
Another usecase we are considering is to segregate workloads that pull down
the core cpu frequency (e.g. AVX512) from workloads that are latency
sensitive. Certain tasks need to provide a fast response time (latency
sensitive), and they are best scheduled on a cpu that has a lighter load and
has no other tasks running on the sibling cpu that could pull down the cpu
core frequency.

Some users are running machine learning batch tasks with AVX512, and have
observed that these tasks affect the tasks needing a fast response.  They
have to rely on manual CPU affinity to separate these tasks.  With an
appropriate latency hint on the task, the scheduler can be taught to
separate them.
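
A toy Python model of such a placement (not kernel code). The Core record,
its fields and the fallback when every core runs AVX512 work are assumptions
made for this sketch.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Core:
    core_id: int
    load: int
    runs_avx512: bool  # some sibling cpu on this core runs AVX512 work


def pick_core_for_latency_sensitive(cores: Sequence[Core]) -> int:
    """Prefer the lightest-loaded core without AVX512 work on a sibling.

    If every core runs AVX512 work, fall back to the lightest-loaded core.
    """
    clean = [c for c in cores if not c.runs_avx512]
    candidates = clean if clean else list(cores)
    return min(candidates, key=lambda c: c.load).core_id
```

Here the lightly loaded core 0 would be skipped if it runs AVX512 work, even
though it has the lowest load overall.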


6> EAS
( -Qais Yousef )
====================
The new knob can help the EAS path switch to spreading behavior when
latency-nice is set, instead of packing tasks on the most energy efficient
CPU, i.e. pick the most energy efficient idle CPU.
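
A toy Python model of this switch (not the EAS code path). The Cpu record, a
single per-CPU energy cost and the boolean latency_sensitive flag are
assumptions made for this sketch.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Cpu:
    cpu_id: int
    energy_cost: int  # assumed estimated cost of placing the task here
    idle: bool


def select_cpu(cpus: Sequence[Cpu], latency_sensitive: bool) -> int:
    """Spread latency-sensitive tasks to idle CPUs; otherwise pack."""
    if latency_sensitive:
        idle_cpus = [c for c in cpus if c.idle]
        if idle_cpus:
            # Spreading: the most energy efficient CPU among the idle ones.
            return min(idle_cpus, key=lambda c: c.energy_cost).cpu_id
    # Packing: the most energy efficient CPU overall, busy or not.
    return min(cpus, key=lambda c: c.energy_cost).cpu_id
```

If no CPU is idle, the sketch falls back to the packing choice.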



Open questions requiring community attention
--------------------------------------------
1. Who is the intended user for setting this value? (- Qais Yousef)
   - system admin or application developer?


Thanks, everyone, for your valuable inputs so far; I am asking for the
same again. (◠﹏◠)

---------------
**References**
---------------
[1]. https://lkml.org/lkml/2019/8/30/829
[2]. https://lkml.org/lkml/2019/7/25/296
[3]. Message-ID: <20190905114709.GM2349@...ez.programming.kicks-ass.net>
https://lore.kernel.org/lkml/20190905114709.GM2349@hirez.programming.kicks-ass.net/
[4]. https://lkml.kernel.org/r/3d3306e4-3a78-5322-df69-7665cf01cc43@arm.com