linux-kernel - Re: [RFC 1/8] sched: Add nice value change notifier

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <6818e34e-d41c-67b7-85dd-76d2e47bc078@linux.intel.com>
Date:   Thu, 7 Oct 2021 10:09:51 +0100
From:   Tvrtko Ursulin <tvrtko.ursulin@...ux.intel.com>
To:     Barry Song <21cnbao@...il.com>
Cc:     "Wanghui (John)" <john.wanghui@...wei.com>,
        Intel-gfx@...ts.freedesktop.org, dri-devel@...ts.freedesktop.org,
        LKML <linux-kernel@...r.kernel.org>,
        Tvrtko Ursulin <tvrtko.ursulin@...el.com>,
        Ingo Molnar <mingo@...hat.com>,
        Peter Zijlstra <peterz@...radead.org>,
        Juri Lelli <juri.lelli@...hat.com>,
        Vincent Guittot <vincent.guittot@...aro.org>
Subject: Re: [RFC 1/8] sched: Add nice value change notifier


On 07/10/2021 09:50, Tvrtko Ursulin wrote:
> 
> On 06/10/2021 21:21, Barry Song wrote:
>> On Thu, Oct 7, 2021 at 2:44 AM Tvrtko Ursulin
>> <tvrtko.ursulin@...ux.intel.com> wrote:
>>>
>>>
>>> Hi,
>>>
>>> On 06/10/2021 08:58, Barry Song wrote:
>>>> On Wed, Oct 6, 2021 at 5:15 PM Wanghui (John) 
>>>> <john.wanghui@...wei.com> wrote:
>>>>>
>>>>> HI Tvrtko
>>>>>
>>>>> On 2021/10/4 22:36, Tvrtko Ursulin wrote:
>>>>>>     void set_user_nice(struct task_struct *p, long nice)
>>>>>>     {
>>>>>>         bool queued, running;
>>>>>> -     int old_prio;
>>>>>> +     int old_prio, ret;
>>>>>>         struct rq_flags rf;
>>>>>>         struct rq *rq;
>>>>>>
>>>>>> @@ -6915,6 +6947,9 @@ void set_user_nice(struct task_struct *p, 
>>>>>> long nice)
>>>>>>
>>>>>>     out_unlock:
>>>>>>         task_rq_unlock(rq, p, &rf);
>>>>>> +
>>>>>> +     ret = atomic_notifier_call_chain(&user_nice_notifier_list, 
>>>>>> nice, p);
>>>>>> +     WARN_ON_ONCE(ret != NOTIFY_DONE);
>>>>>>     }
>>>>> How about adding a new "io_nice" to task_struct，and move the call 
>>>>> chain to
>>>>> sched_setattr/getattr, there are two benefits:
>>>>
>>>> We already have an ionice for block io scheduler. hardly can this 
>>>> new io_nice
>>>> be generic to all I/O. it seems the patchset is trying to link
>>>> process' nice with
>>>> GPU's scheduler, to some extent, it makes more senses than having a
>>>> common ionice because we have a lot of IO devices in the systems, we 
>>>> don't
>>>> know which I/O the ionice of task_struct should be applied to.
>>>>
>>>> Maybe we could have an ionice dedicated for GPU just like ionice for 
>>>> CFQ
>>>> of bio/request scheduler.
>>>
>>> Thought crossed my mind but I couldn't see the practicality of a 3rd
>>> nice concept. I mean even to start with I struggle a bit with the
>>> usefulness of existing ionice vs nice. Like coming up with practical
>>> examples of usecases where it makes sense to decouple the two 
>>> priorities.
>>>
>>>   From a different angle I did think inheriting CPU nice makes sense for
>>> GPU workloads. This is because today, and more so in the future,
>>> computations on a same data set do flow from one to the other.
>>>
>>> Like maybe a simple example of batch image processing where CPU decodes,
>>> GPU does a transform and then CPU encodes. Or a different mix, doesn't
>>> really matter, since the main point it is one computing pipeline from
>>> users point of view.
>>>
>>
>> I am on it. but I am also seeing two problems here:
>> 1. nice is not global in linux. For example, if you have two cgroups, 
>> cgroup A
>> has more quota then cgroup B. Tasks in B won't win even if it has a 
>> lower nice.
>> cgroups will run proportional-weight time-based division of CPU.
>>
>> 2. Historically, we had dynamic nice which was adjusted based on the 
>> average
>> sleep/running time; right now, we don't have dynamic nice, but virtual 
>> time
>> still make tasks which sleep more preempt other tasks with the same nice
>> or even lower nice.
>> virtual time += physical time/weight by nice
>> so, static nice number doesn't always make sense to decide preemption.
>>
>> So it seems your patch only works under some simple situation for example
>> no cgroups, tasks have similar sleep/running time.
> 
> Yes, I broadly agree with your assessment. Although there are plans for 
> adding cgroup support to i915 scheduling, I doubt as fine grained 
> control and exact semantics as there are on the CPU side will happen.
> 
> Mostly because the drive seems to be for more micro-controller managed 
> scheduling which adds further challenges in connecting the two sides 
> together.
> 
> But when you say it is a problem, I would characterize it more a 
> weakness in terms of being only a subset of possible control. It is 
> still richer (better?) than what currently exists and as demonstrated 
> with benchmarks in my cover letter it can deliver improvements in user 
> experience. If in the mid term future we can extend it with cgroup 
> support then the concept should still apply and get closer to how you 
> described nice works in the CPU world.
> 
> Main question in my mind is whether the idea of adding the 
> sched_attr/priority notifier to the kernel can be justified. Because as 
> mentioned before, everything apart from adjusting currently running GPU 
> jobs could be done purely in userspace. Stack changes would be quite 
> extensive and all, but that is not usually a good enough reason to put 
> something in the kernel. That's why it is an RFC an invitation to discuss.
> 
> Even ionice inherits from nice (see task_nice_ioprio()) so I think 
> argument can be made for drivers as well.

Now that I wrote this, I had a little bit of a light bulb moment. If I 
abandon the idea of adjusting the priority of already submitted work 
items, then I can do much of what I want purely from within the confines 
of i915.

I simply add code to inherit from current task nice on every new work 
item submission. This should probably bring the majority of the benefit 
I measured.

Regards,

Tvrtko