Message-ID: <878r3freza.fsf@yhuang6-desk2.ccr.corp.intel.com>
Date: Tue, 20 Feb 2024 11:28:41 +0800
From: "Huang, Ying" <ying.huang@...el.com>
To: Daniel Bristot de Oliveira <bristot@...nel.org>
Cc: Ingo Molnar <mingo@...hat.com>, Peter Zijlstra <peterz@...radead.org>,
Juri Lelli <juri.lelli@...hat.com>, Vincent Guittot
<vincent.guittot@...aro.org>, Dietmar Eggemann
<dietmar.eggemann@....com>, Steven Rostedt <rostedt@...dmis.org>, Ben
Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>, Daniel
Bristot de Oliveira <bristot@...hat.com>, Valentin Schneider
<vschneid@...hat.com>, linux-kernel@...r.kernel.org, Luca Abeni
<luca.abeni@...tannapisa.it>, Tommaso Cucinotta
<tommaso.cucinotta@...tannapisa.it>, Thomas Gleixner
<tglx@...utronix.de>, Joel Fernandes <joel@...lfernandes.org>, Vineeth
Pillai <vineeth@...byteword.org>, Shuah Khan <skhan@...uxfoundation.org>,
Phil Auld <pauld@...hat.com>, Aaron Lu <aaron.lu@...el.com>, Kairui
Song <kasong@...cent.com>, Guo Ziliang <guo.ziliang@....com.cn>
Subject: Re: [PATCH v5 0/7] SCHED_DEADLINE server infrastructure
Daniel Bristot de Oliveira <bristot@...nel.org> writes:
> Hi
>
> On 2/19/24 08:33, Huang, Ying wrote:
>> Hi, Daniel,
>>
>> Thanks a lot for your great patchset!
>>
>> We have a similar starvation issue in mm subsystem too. Details are in
>> the patch description of the below commit. In short, task A is busy
>> looping on some event, while task B will signal the event after some
>> work. If the priority of task A is higher than that of task B, task B
>> may be starved.
>
> ok...
>
>>
>> IIUC, if task A is an RT task while task B is a fair task, then your
>> patchset will solve the issue.
>
> This patch set will not solve the issue. It will mitigate the effect of the
> problem. Still, the system will perform very poorly...
I don't think that it's common (or even reasonable) for real-time tasks
to use swap. So, IMHO, performance isn't very important here. But we
need to avoid the live-lock anyway, and I think that your patchset solves
the live-lock issue.
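To make the pattern concrete, below is a minimal userspace sketch of the
starvation scenario (an illustration under my own assumptions, not the mm
code): a SCHED_FIFO "task A" busy-polls a flag that only a SCHED_OTHER
"task B" on the same CPU can set. With both pinned to one CPU and RT
throttling disabled, B never runs and A spins forever unless A sleeps in
the loop.

/* starve.c - hypothetical illustration only; needs root for SCHED_FIFO.
 * Build: gcc -O2 -pthread starve.c -o starve
 * Assumes RT throttling is disabled (kernel.sched_rt_runtime_us=-1);
 * with the default 950ms/1s cap, task B is eventually allowed to run.
 * Warning: while spinning, it keeps one CPU at 100% until killed.
 */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdio.h>
#include <unistd.h>

static atomic_int event_signalled;

static void pin_to_cpu0(void)
{
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(0, &set);
	pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

/* "task A": high-priority busy waiter */
static void *task_a(void *arg)
{
	pin_to_cpu0();
	while (!atomic_load(&event_signalled)) {
		/* cond_resched()-like no-op for the top-priority task;
		 * replacing this with usleep(1000) lets task B run. */
	}
	puts("task A: event seen");
	return NULL;
}

/* "task B": low-priority signaller */
static void *task_b(void *arg)
{
	pin_to_cpu0();
	usleep(1000);			/* "some work" */
	atomic_store(&event_signalled, 1);
	return NULL;
}

int main(void)
{
	pthread_t a, b;
	pthread_attr_t attr;
	struct sched_param sp = { .sched_priority = 50 };

	pthread_attr_init(&attr);
	pthread_attr_setinheritsched(&attr, PTHREAD_EXPLICIT_SCHED);
	pthread_attr_setschedpolicy(&attr, SCHED_FIFO);
	pthread_attr_setschedparam(&attr, &sp);

	if (pthread_create(&a, &attr, task_a, NULL)) {	/* FIFO prio 50 */
		perror("pthread_create(SCHED_FIFO)");
		return 1;
	}
	pthread_create(&b, NULL, task_b, NULL);		/* SCHED_OTHER */
	pthread_join(a, NULL);
	pthread_join(b, NULL);
	return 0;
}

With default RT throttling the spin is bounded to roughly a second, which
is essentially the 950ms scenario you describe below.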
>> If both task A and task B are RT tasks, is there
>> some way to solve the issue?
>
> I would say reworking the swap algorithm, as it is not meant to be used when
> real-time tasks are in place.
>
> As an exercise, let's say that we add a server per priority on FIFO, with a default
> 50ms/1s runtime/period. Your "real-time" workload would suffer a 950ms latency,
> busy looping in vain.
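(Spelling out the arithmetic behind that number, assuming a single 50ms/1s
server for the starved class: the busy-looping FIFO task owns the CPU
whenever the server is not running, so the task that would signal the event
is only guaranteed runtime = 50ms out of every period = 1s, i.e. it can be
kept off the CPU for up to 1000ms - 50ms = 950ms in each period.)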
If the target is only live-lock avoidance, is it possible to run
lower-priority runnable tasks for a short while once we have run long
enough in the busy loop?
> Then one would say, let's lower the parameters, so the granularity of
> the server would provide lower latencies. The same problem would still
> exist, as it exists with sched fair....
>
> So, the solution is not in the scheduler. Busy-loop waiting is not right when you
> have RT tasks. That is why PREEMPT_RT reworks the locking pattern to remove
> spin_locks that do busy waiting. spin_locks do not have this problem you
> show because they disable preemption... but disabling preemption is not
> the best solution either.
>
> So, a first duct-tape attempt would be using (local_?) locks like in
> PREEMPT_RT to make things sleepable...
>
> AFAICS, this was already discussed in the previous link, right?
>
>>
>> Now, we use an ugly schedule_timeout_uninterruptible(1) in the loop to
>> resolve the issue. Is there something better?
>
> I am not a swap/mm expert... my guesses would be all on sleepable locking.
> But I know there are many smart people on the mm side with better guesses...
>
> It is just that the DL server or any type of starvation avoidance does not
> seem to be a solution for your problem.
Yes. To improve the performance, we need something else.
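For the sleepable-locking direction, the shape of the change is roughly the
following userspace analogue (again just a sketch under my assumptions, not
a proposal for the mm code): the waiter blocks on a condition variable
instead of spinning, so the relative priorities no longer matter for
forward progress.

/* sleepwait.c - hypothetical illustration; build: gcc -pthread sleepwait.c */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
static bool event_signalled;

/* was: the busy loop; now the waiter sleeps and frees the CPU */
static void *waiter(void *arg)
{
	pthread_mutex_lock(&lock);
	while (!event_signalled)
		pthread_cond_wait(&cond, &lock);
	pthread_mutex_unlock(&lock);
	puts("waiter: event seen");
	return NULL;
}

static void *signaller(void *arg)
{
	pthread_mutex_lock(&lock);
	event_signalled = true;
	pthread_cond_signal(&cond);
	pthread_mutex_unlock(&lock);
	return NULL;
}

int main(void)
{
	pthread_t w, s;

	pthread_create(&w, NULL, waiter, NULL);
	pthread_create(&s, NULL, signaller, NULL);
	pthread_join(w, NULL);
	pthread_join(s, NULL);
	return 0;
}

In the kernel, I guess the same idea would map to a waitqueue or a
completion in place of the busy loop in __read_swap_cache_async(), but that
is exactly the kind of rework that needs the mm experts.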
--
Best Regards,
Huang, Ying
> -- Daniel
>
>
>> Best Regards,
>> Huang, Ying
>>
>> --------------------------8<---------------------------------------
>> commit 029c4628b2eb2ca969e9bf979b05dc18d8d5575e
>> Author: Guo Ziliang <guo.ziliang@....com.cn>
>> Date: Wed Mar 16 16:15:03 2022 -0700
>>
>> mm: swap: get rid of livelock in swapin readahead
>>
>> In our testing, a livelocked task was found. Through sysrq printing, the same
>> stack was found every time, as follows:
>>
>> __swap_duplicate+0x58/0x1a0
>> swapcache_prepare+0x24/0x30
>> __read_swap_cache_async+0xac/0x220
>> read_swap_cache_async+0x58/0xa0
>> swapin_readahead+0x24c/0x628
>> do_swap_page+0x374/0x8a0
>> __handle_mm_fault+0x598/0xd60
>> handle_mm_fault+0x114/0x200
>> do_page_fault+0x148/0x4d0
>> do_translation_fault+0xb0/0xd4
>> do_mem_abort+0x50/0xb0
>>
>> The reason for the livelock is that swapcache_prepare() always returns
>> EEXIST, indicating that SWAP_HAS_CACHE has not been cleared, so that it
>> cannot jump out of the loop. We suspect that the task that clears the
>> SWAP_HAS_CACHE flag never gets a chance to run. We try to lower the
>> priority of the task stuck in a livelock so that the task that clears
>> the SWAP_HAS_CACHE flag will run. The results show that the system
>> returns to normal after the priority is lowered.
>>
>> In our testing, multiple real-time tasks are bound to the same core, and
>> the task in the livelock is the highest priority task of the core, so
>> the livelocked task cannot be preempted.
>>
>> Although cond_resched() is used by __read_swap_cache_async(), it is an
>> empty function on a preemptive kernel and cannot achieve the purpose of
>> releasing the CPU. A high-priority task cannot release the CPU unless
>> preempted by a higher-priority task, and when this task is already the
>> highest-priority task on this core, other tasks will not be able to be
>> scheduled. So we think we should replace cond_resched() with
>> schedule_timeout_uninterruptible(1), which calls set_current_state()
>> first to set the task state, so the task is removed from the run queue.
>> This gives up the CPU and prevents the task from running in kernel mode
>> for too long.
>>
>> (akpm: ugly hack becomes uglier. But it fixes the issue in a
>> backportable-to-stable fashion while we hopefully work on something
>> better)
>>
>> Link: https://lkml.kernel.org/r/20220221111749.1928222-1-cgel.zte@gmail.com
>> Signed-off-by: Guo Ziliang <guo.ziliang@....com.cn>
>> Reported-by: Zeal Robot <zealci@....com.cn>
>> Reviewed-by: Ran Xiaokai <ran.xiaokai@....com.cn>
>> Reviewed-by: Jiang Xuexin <jiang.xuexin@....com.cn>
>> Reviewed-by: Yang Yang <yang.yang29@....com.cn>
>> Acked-by: Hugh Dickins <hughd@...gle.com>
>> Cc: Naoya Horiguchi <naoya.horiguchi@....com>
>> Cc: Michal Hocko <mhocko@...nel.org>
>> Cc: Minchan Kim <minchan@...nel.org>
>> Cc: Johannes Weiner <hannes@...xchg.org>
>> Cc: Roger Quadros <rogerq@...nel.org>
>> Cc: Ziliang Guo <guo.ziliang@....com.cn>
>> Cc: <stable@...r.kernel.org>
>> Signed-off-by: Andrew Morton <akpm@...ux-foundation.org>
>> Signed-off-by: Linus Torvalds <torvalds@...ux-foundation.org>
>>
>> diff --git a/mm/swap_state.c b/mm/swap_state.c
>> index 8d4104242100..ee67164531c0 100644
>> --- a/mm/swap_state.c
>> +++ b/mm/swap_state.c
>> @@ -478,7 +478,7 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
>> * __read_swap_cache_async(), which has set SWAP_HAS_CACHE
>> * in swap_map, but not yet added its page to swap cache.
>> */
>> - cond_resched();
>> + schedule_timeout_uninterruptible(1);
>> }
>>
>> /*