[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <5fee2cc2-e6de-6bc2-7644-778ef2185803@virtuozzo.com>
Date: Fri, 20 Apr 2018 17:30:36 +0300
From: Kirill Tkhai <ktkhai@...tuozzo.com>
To: Juri Lelli <juri.lelli@...il.com>
Cc: mingo@...hat.com, peterz@...radead.org,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH v2] sched/rt: Rework for_each_process_thread() iterations
in tg_has_rt_tasks()
On 20.04.2018 17:11, Juri Lelli wrote:
> On 20/04/18 13:06, Kirill Tkhai wrote:
>> From: Kirill Tkhai <ktkhai@...tuozzo.com>
>>
>> tg_rt_schedulable() iterates over all child task groups,
>> while tg_has_rt_tasks() iterates over all linked tasks.
>> In case of systems with big number of tasks, this may
>> take a lot of time.
>>
>> I observed hard LOCKUP on machine with 20000+ processes
>> after write to "cpu.rt_period_us" of cpu cgroup with
>> 39 children. The problem occurred because of tasklist_lock
>> is held for a long time and other processes can't do fork().
>>
>> PID: 1036268 TASK: ffff88766c310000 CPU: 36 COMMAND: "criu"
>> #0 [ffff887f7f408e48] crash_nmi_callback at ffffffff81050601
>> #1 [ffff887f7f408e58] nmi_handle at ffffffff816e0cc7
>> #2 [ffff887f7f408eb0] do_nmi at ffffffff816e0fb0
>> #3 [ffff887f7f408ef0] end_repeat_nmi at ffffffff816e00b9
>> [exception RIP: tg_rt_schedulable+463]
>> RIP: ffffffff810bf49f RSP: ffff886537ad7d50 RFLAGS: 00000202
>> RAX: 0000000000000000 RBX: 000000003b9aca00 RCX: ffff883e9cb4b1b0
>> RDX: ffff887d0be43608 RSI: ffff886537ad7dd8 RDI: ffff8840a6ad0000
>> RBP: ffff886537ad7d68 R8: ffff887d0be431b0 R9: 00000000000e7ef0
>> R10: ffff88164fc39400 R11: 0000000000023380 R12: ffffffff81ef8d00
>> R13: ffffffff810bea40 R14: 0000000000000000 R15: ffff8840a6ad0000
>> ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
>> --- <NMI exception stack> ---
>> #4 [ffff886537ad7d50] tg_rt_schedulable at ffffffff810bf49f
>> #5 [ffff886537ad7d70] walk_tg_tree_from at ffffffff810c6c91
>> #6 [ffff886537ad7dc0] tg_set_rt_bandwidth at ffffffff810c6dd0
>> #7 [ffff886537ad7e28] cpu_rt_period_write_uint at ffffffff810c6eea
>> #8 [ffff886537ad7e38] cgroup_file_write at ffffffff8111cfd3
>> #9 [ffff886537ad7ec8] vfs_write at ffffffff8121eced
>> #10 [ffff886537ad7f08] sys_write at ffffffff8121faff
>> #11 [ffff886537ad7f50] system_call_fastpath at ffffffff816e8a7d
>>
>> The patch reworks tg_has_rt_tasks() and makes it to iterate over
>> task group process list instead of iteration over all tasks list.
>> This makes the function to scale well, and reduces its execution
>> time.
>>
>> Note, that since tasklist_lock doesn't protect a task against
>> sched_class changing, we don't introduce new races in comparison
>> to that we had before.
>
> This seems to be true. However, I wonder why we are OK with current racy
> code (against tasks moving between groups). :/
>
> Can't a task join the group while we are iterating and we miss that?
Yes, it can, but I'm not sure either this should be considered as problem,
seeing the race design we already have. It's not a real protection, this
place is to warn a person, he does something wrong. We check for zero written
there, but really written "1" will invent the same problems.
There are cgroup_threadgroup_change_begin(task)/cgroup_threadgroup_change_end(task)
to protect from cgroup change, but it seems we can't use it as they have task argument
and aimed to protect single task thread group in the future (despite they take global
percpu_rwsem).
Kirill
Powered by blists - more mailing lists