Message-ID: <878r3freza.fsf@yhuang6-desk2.ccr.corp.intel.com>
Date: Tue, 20 Feb 2024 11:28:41 +0800
From: "Huang, Ying" <ying.huang@...el.com>
To: Daniel Bristot de Oliveira <bristot@...nel.org>
Cc: Ingo Molnar <mingo@...hat.com>, Peter Zijlstra <peterz@...radead.org>,
Juri Lelli <juri.lelli@...hat.com>, Vincent Guittot
<vincent.guittot@...aro.org>, Dietmar Eggemann
<dietmar.eggemann@....com>, Steven Rostedt <rostedt@...dmis.org>, Ben
Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>, Daniel
Bristot de Oliveira <bristot@...hat.com>, Valentin Schneider
<vschneid@...hat.com>, linux-kernel@...r.kernel.org, Luca Abeni
<luca.abeni@...tannapisa.it>, Tommaso Cucinotta
<tommaso.cucinotta@...tannapisa.it>, Thomas Gleixner
<tglx@...utronix.de>, Joel Fernandes <joel@...lfernandes.org>, Vineeth
Pillai <vineeth@...byteword.org>, Shuah Khan <skhan@...uxfoundation.org>,
Phil Auld <pauld@...hat.com>, Aaron Lu <aaron.lu@...el.com>, Kairui
Song <kasong@...cent.com>, Guo Ziliang <guo.ziliang@....com.cn>
Subject: Re: [PATCH v5 0/7] SCHED_DEADLINE server infrastructure
Daniel Bristot de Oliveira <bristot@...nel.org> writes:
> Hi
>
> On 2/19/24 08:33, Huang, Ying wrote:
>> Hi, Daniel,
>>
>> Thanks a lot for your great patchset!
>>
>> We have a similar starvation issue in mm subsystem too. Details are in
>> the patch description of the below commit. In short, task A is busy
>> looping on some event, while task B will signal the event after some
>> work. If the priority of task A is higher than that of task B, task B
>> may be starved.
>
> ok...
>
>>
>> IIUC, if task A is an RT task while task B is a fair task, then your
>> patchset will solve the issue.
>
> This patch set will not solve the issue. It will mitigate the effect of the
> problem. Still, the system will perform very poorly...
I don't think that it's common (or even reasonable) for real-time tasks
to use swap. So, IMHO, performance isn't very important here. But we
need to avoid the live-lock anyway, and I think that your patchset solves
the live-lock issue.
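To make the pattern concrete, below is a minimal userspace sketch of the
starvation scenario (an illustration under my own assumptions, not the mm
code): a SCHED_FIFO "task A" busy-polls a flag that only a SCHED_OTHER
"task B" on the same CPU can set. With both pinned to one CPU and RT
throttling disabled, B never runs and A spins forever unless A sleeps in
the loop.

/* starve.c - hypothetical illustration only; needs root for SCHED_FIFO.
 * Build: gcc -O2 -pthread starve.c -o starve
 * Assumes RT throttling is disabled (kernel.sched_rt_runtime_us=-1);
 * with the default 950ms/1s cap, task B is eventually allowed to run.
 * Warning: while spinning, it keeps one CPU at 100% until killed.
 */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdio.h>
#include <unistd.h>

static atomic_int event_signalled;

static void pin_to_cpu0(void)
{
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(0, &set);
	pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

/* "task A": high-priority busy waiter */
static void *task_a(void *arg)
{
	pin_to_cpu0();
	while (!atomic_load(&event_signalled)) {
		/* cond_resched()-like no-op for the top-priority task;
		 * replacing this with usleep(1000) lets task B run. */
	}
	puts("task A: event seen");
	return NULL;
}

/* "task B": low-priority signaller */
static void *task_b(void *arg)
{
	pin_to_cpu0();
	usleep(1000);			/* "some work" */
	atomic_store(&event_signalled, 1);
	return NULL;
}

int main(void)
{
	pthread_t a, b;
	pthread_attr_t attr;
	struct sched_param sp = { .sched_priority = 50 };

	pthread_attr_init(&attr);
	pthread_attr_setinheritsched(&attr, PTHREAD_EXPLICIT_SCHED);
	pthread_attr_setschedpolicy(&attr, SCHED_FIFO);
	pthread_attr_setschedparam(&attr, &sp);

	if (pthread_create(&a, &attr, task_a, NULL)) {	/* FIFO prio 50 */
		perror("pthread_create(SCHED_FIFO)");
		return 1;
	}
	pthread_create(&b, NULL, task_b, NULL);		/* SCHED_OTHER */
	pthread_join(a, NULL);
	pthread_join(b, NULL);
	return 0;
}

With default RT throttling the spin is bounded to roughly a second, which
is essentially the 950ms scenario you describe below.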
>> If both task A and task B are RT tasks, is there
>> some way to solve the issue?
>
> I would say reworking the swap algorithm, as it is not meant to be used when
> real-time tasks are in place.
>
> As an exercise, let's say that we add a server per priority on FIFO, with a default
> 50ms/1s runtime/period. Your "real-time" workload would suffer a 950ms latency,
> busy looping in vain.
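(Spelling out the arithmetic behind that number, assuming a single 50ms/1s
server for the starved class: the busy-looping FIFO task owns the CPU
whenever the server is not running, so the task that would signal the event
is only guaranteed runtime = 50ms out of every period = 1s, i.e. it can be
kept off the CPU for up to 1000ms - 50ms = 950ms in each period.)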
If the target is only live-lock avoidance, is it possible to run
lower-priority runnable tasks for a short while once we have run long
enough in the busy loop?
> Then one would say, let's lower the parameters, so the granularity of
> the server would provide lower latencies. The same problem would still
> exist, as it exists with sched fair....
>
> So, the solution is not in the scheduler. Busy-loop waiting is not right when you
> have RT tasks. That is why PREEMPT_RT reworks the locking pattern to remove
> spin_locks that do busy waiting. spin_locks do not have this problem you
> show because they disable preemption... but disabling preemption is not
> the best solution either.
>
> So, a first duct-tape attempt would be using (local_?) locks like in
> PREEMPT_RT to make things sleepable...
>
> AFAICS, this was already discussed in the previous link, right?
>
>>
>> Now, we use an ugly schedule_timeout_uninterruptible(1) in the loop to
>> resolve the issue. Is there something better?
>
> I am not a swap/mm expert... my guesses would be all on sleepable locking.
> But I know there are many smart people on the mm side with better guesses...
>
> It is just that the DL server or any type of starvation avoidance does not
> seem to be a solution for your problem.
Yes. To improve the performance, we need something else.
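For the sleepable-locking direction, the shape of the change is roughly the
following userspace analogue (again just a sketch under my assumptions, not
a proposal for the mm code): the waiter blocks on a condition variable
instead of spinning, so the relative priorities no longer matter for
forward progress.

/* sleepwait.c - hypothetical illustration; build: gcc -pthread sleepwait.c */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
static bool event_signalled;

/* was: the busy loop; now the waiter sleeps and frees the CPU */
static void *waiter(void *arg)
{
	pthread_mutex_lock(&lock);
	while (!event_signalled)
		pthread_cond_wait(&cond, &lock);
	pthread_mutex_unlock(&lock);
	puts("waiter: event seen");
	return NULL;
}

static void *signaller(void *arg)
{
	pthread_mutex_lock(&lock);
	event_signalled = true;
	pthread_cond_signal(&cond);
	pthread_mutex_unlock(&lock);
	return NULL;
}

int main(void)
{
	pthread_t w, s;

	pthread_create(&w, NULL, waiter, NULL);
	pthread_create(&s, NULL, signaller, NULL);
	pthread_join(w, NULL);
	pthread_join(s, NULL);
	return 0;
}

In the kernel, I guess the same idea would map to a waitqueue or a
completion in place of the busy loop in __read_swap_cache_async(), but that
is exactly the kind of rework that needs the mm experts.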
--
Best Regards,
Huang, Ying
> -- Daniel
>
>
>> Best Regards,
>> Huang, Ying
>>
>> --------------------------8<---------------------------------------
>> commit 029c4628b2eb2ca969e9bf979b05dc18d8d5575e
>> Author: Guo Ziliang <guo.ziliang@....com.cn>
>> Date: Wed Mar 16 16:15:03 2022 -0700
>>
>> mm: swap: get rid of livelock in swapin readahead
>>
>> In our testing, a livelocked task was found. Through sysrq printing, the same
>> stack was found every time, as follows:
>>
>> __swap_duplicate+0x58/0x1a0
>> swapcache_prepare+0x24/0x30
>> __read_swap_cache_async+0xac/0x220
>> read_swap_cache_async+0x58/0xa0
>> swapin_readahead+0x24c/0x628
>> do_swap_page+0x374/0x8a0
>> __handle_mm_fault+0x598/0xd60
>> handle_mm_fault+0x114/0x200
>> do_page_fault+0x148/0x4d0
>> do_translation_fault+0xb0/0xd4
>> do_mem_abort+0x50/0xb0
>>
>> The reason for the livelock is that swapcache_prepare() always returns
>> EEXIST, indicating that SWAP_HAS_CACHE has not been cleared, so that it
>> cannot jump out of the loop. We suspect that the task that clears the
>> SWAP_HAS_CACHE flag never gets a chance to run. We try to lower the
>> priority of the task stuck in a livelock so that the task that clears
>> the SWAP_HAS_CACHE flag will run. The results show that the system
>> returns to normal after the priority is lowered.
>>
>> In our testing, multiple real-time tasks are bound to the same core, and
>> the task in the livelock is the highest priority task of the core, so
>> the livelocked task cannot be preempted.
>>
>> Although cond_resched() is used by __read_swap_cache_async(), it is an
>> empty function on a preemptive kernel and cannot achieve the purpose of
>> releasing the CPU. A high-priority task cannot release the CPU unless
>> preempted by a higher-priority task, and when this task is already the
>> highest-priority task on this core, other tasks will not be able to be
>> scheduled. So we think we should replace cond_resched() with
>> schedule_timeout_uninterruptible(1), which calls set_current_state()
>> first to set the task state, so the task is removed from the run queue.
>> This gives up the CPU and prevents the task from running in kernel mode
>> for too long.
>>
>> (akpm: ugly hack becomes uglier. But it fixes the issue in a
>> backportable-to-stable fashion while we hopefully work on something
>> better)
>>
>> Link: https://lkml.kernel.org/r/20220221111749.1928222-1-cgel.zte@gmail.com
>> Signed-off-by: Guo Ziliang <guo.ziliang@....com.cn>
>> Reported-by: Zeal Robot <zealci@....com.cn>
>> Reviewed-by: Ran Xiaokai <ran.xiaokai@....com.cn>
>> Reviewed-by: Jiang Xuexin <jiang.xuexin@....com.cn>
>> Reviewed-by: Yang Yang <yang.yang29@....com.cn>
>> Acked-by: Hugh Dickins <hughd@...gle.com>
>> Cc: Naoya Horiguchi <naoya.horiguchi@....com>
>> Cc: Michal Hocko <mhocko@...nel.org>
>> Cc: Minchan Kim <minchan@...nel.org>
>> Cc: Johannes Weiner <hannes@...xchg.org>
>> Cc: Roger Quadros <rogerq@...nel.org>
>> Cc: Ziliang Guo <guo.ziliang@....com.cn>
>> Cc: <stable@...r.kernel.org>
>> Signed-off-by: Andrew Morton <akpm@...ux-foundation.org>
>> Signed-off-by: Linus Torvalds <torvalds@...ux-foundation.org>
>>
>> diff --git a/mm/swap_state.c b/mm/swap_state.c
>> index 8d4104242100..ee67164531c0 100644
>> --- a/mm/swap_state.c
>> +++ b/mm/swap_state.c
>> @@ -478,7 +478,7 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
>> * __read_swap_cache_async(), which has set SWAP_HAS_CACHE
>> * in swap_map, but not yet added its page to swap cache.
>> */
>> - cond_resched();
>> + schedule_timeout_uninterruptible(1);
>> }
>>
>> /*