linux-kernel - Re: [PATCH RFC] sched/fair: fix sudden expiration of cfq quota in put_prev

Message-ID: <551E8CC5.30906@yandex-team.ru>
Date:	Fri, 03 Apr 2015 15:51:17 +0300
From:	Konstantin Khlebnikov <khlebnikov@...dex-team.ru>
To:	Peter Zijlstra <peterz@...radead.org>,
	Ingo Molnar <mingo@...hat.com>, linux-kernel@...r.kernel.org
CC:	Ben Segall <bsegall@...gle.com>,
	Roman Gushchin <klamm@...dex-team.ru>
Subject: Re: [PATCH RFC] sched/fair: fix sudden expiration of cfq quota in
 put_prev_task()

On 03.04.2015 15:41, Konstantin Khlebnikov wrote:
> pick_next_task_fair() must be sure that there is at least one runnable
> task before calling put_prev_task(), but put_prev_task() can expire
> the last remains of the cfs quota and throttle all currently runnable
> tasks. As a result, pick_next_task_fair() cannot find a next task and
> crashes.

The kernel crash looks like this:

<1>[50288.719491] BUG: unable to handle kernel NULL pointer dereference at 0000000000000038
<1>[50288.719538] IP: [<ffffffff81097b8c>] set_next_entity+0x1c/0x80
<4>[50288.719567] PGD 0
<4>[50288.719578] Oops: 0000 [#1] SMP
<4>[50288.719594] Modules linked in: vhost_net macvtap macvlan vhost 8021q mrp garp ip6table_filter ip6_tables nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack ipt_REJECT nf_reject_ipv4 xt_CHECKSUM iptable_mangle xt_tcpudp iptable_filter ip_tables x_tables bridge stp llc netconsole configfs x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm mgag200 crc32_pclmul ghash_clmulni_intel aesni_intel ablk_helper cryptd lrw ttm gf128mul drm_kms_helper drm glue_helper aes_x86_64 i2c_algo_bit sysimgblt sysfillrect i2c_core sb_edac edac_core syscopyarea microcode ipmi_si ipmi_msghandler lpc_ich ioatdma dca mlx4_en mlx4_core vxlan udp_tunnel ip6_udp_tunnel tcp_htcp e1000e ptp pps_core ahci libahci raid10 raid456 async_pq async_xor xor async_memcpy async_raid6_recov raid6_pq async_tx raid1 raid0 multipath
<4>[50288.719956]  linear
<4>[50288.719964] CPU: 27 PID: 11505 Comm: kvm Not tainted 3.18.10-7 #7
<4>[50288.719987] Hardware name:
<4>[50288.720015] task: ffff880036acbaa0 ti: ffff8808445f8000 task.ti: ffff8808445f8000
<4>[50288.720041] RIP: 0010:[<ffffffff81097b8c>]  [<ffffffff81097b8c>] set_next_entity+0x1c/0x80
<4>[50288.720072] RSP: 0018:ffff8808445fbbb8  EFLAGS: 00010086
<4>[50288.720091] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 000000000000bcb8
<4>[50288.720116] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88107fd72af0
<4>[50288.720141] RBP: ffff8808445fbbd8 R08: 0000000000000000 R09: 0000000000000001
<4>[50288.720165] R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000000
<4>[50288.720190] R13: 0000000000000000 R14: ffff880b6f250030 R15: ffff88107fd72af0
<4>[50288.720214] FS:  00007f55467fc700(0000) GS:ffff88107fd60000(0000) knlGS:ffff8802175e0000
<4>[50288.720242] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4>[50288.720262] CR2: 0000000000000038 CR3: 0000000324ede000 CR4: 00000000000427e0
<4>[50288.720287] Stack:
<4>[50288.720296]  ffff88107fd72a80 ffff88107fd72a80 0000000000000000 0000000000000000
<4>[50288.720327]  ffff8808445fbc68 ffffffff8109ead8 ffff880800000000 ffffffffa1438990
<4>[50288.720357]  ffff880b6f250000 0000000000000000 0000000000012a80 ffff880036acbaa0
<4>[50288.720388] Call Trace:
<4>[50288.720402]  [<ffffffff8109ead8>] pick_next_task_fair+0x88/0x5d0
<4>[50288.720429]  [<ffffffffa1438990>] ? vmx_fpu_activate.part.63+0x90/0xb0 [kvm_intel]
<4>[50288.720457]  [<ffffffff81096b95>] ? sched_clock_cpu+0x85/0xc0
<4>[50288.720479]  [<ffffffff816b5b99>] __schedule+0xf9/0x7d0
<4>[50288.720500]  [<ffffffff816bb210>] ? reboot_interrupt+0x80/0x80
<4>[50288.720522]  [<ffffffff816b630a>] _cond_resched+0x2a/0x40
<4>[50288.720549]  [<ffffffffa03dd8c5>] __vcpu_run+0xd35/0xf30 [kvm]
<4>[50288.720573]  [<ffffffff81075fc7>] ? __set_task_blocked+0x37/0x80
<4>[50288.720595]  [<ffffffff8109387e>] ? try_to_wake_up+0x21e/0x360
<4>[50288.720622]  [<ffffffffa03ddb65>] kvm_arch_vcpu_ioctl_run+0xa5/0x220 [kvm]
<4>[50288.720650]  [<ffffffffa03c48b2>] kvm_vcpu_ioctl+0x2c2/0x620 [kvm]
<4>[50288.720675]  [<ffffffff811c01c6>] do_vfs_ioctl+0x86/0x4f0
<4>[50288.720697]  [<ffffffff810d14a2>] ? SyS_futex+0x142/0x1a0
<4>[50288.720717]  [<ffffffff811c06c1>] SyS_ioctl+0x91/0xb0
<4>[50288.720737]  [<ffffffff816ba489>] system_call_fastpath+0x12/0x17
<4>[50288.720758] Code: c7 47 60 00 00 00 00 45 31 c0 e9 0c ff ff ff 66 66 66 66 90 55 48 89 e5 48 83 ec 20 48 89 5d e8 4c 89 65 f0 48 89 f3 4c 89 6d f8 <44> 8b 4e 38 49 89 fc 45 85 c9 74 17 4c 8d 6e 10 4c 39 6f 30 74
<1>[50288.722636] RIP  [<ffffffff81097b8c>] set_next_entity+0x1c/0x80
<4>[50288.723533]  RSP <ffff8808445fbbb8>
<4>[50288.724406] CR2: 0000000000000038
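
A note on reading the fault address: RSI is 0 and CR2 is 0x38, and the
faulting bytes "<44> 8b 4e 38" decode to a 32-bit load from 0x38(%rsi).
That fits set_next_entity() being called with a NULL entity and reading
a field at offset 0x38. The sketch below is only a mock-up: the mock_*
types are hypothetical stand-ins for the leading fields of struct
sched_entity, the field order is my assumption about the 3.18 layout,
and pointers and unsigned long are modeled as uint64_t (x86-64 sizes).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins; assumed to mirror the 3.18 x86-64 layout. */
struct mock_load_weight {
	uint64_t weight;
	uint32_t inv_weight;		/* padded to 16 bytes total */
};

struct mock_rb_node {
	uint64_t parent_color;
	uint64_t rb_right;
	uint64_t rb_left;
};

struct mock_list_head {
	uint64_t next;
	uint64_t prev;
};

struct mock_sched_entity {
	struct mock_load_weight load;		/* 0x00 */
	struct mock_rb_node run_node;		/* 0x10 */
	struct mock_list_head group_node;	/* 0x28 */
	uint32_t on_rq;				/* 0x38 under these assumptions */
};

static_assert(sizeof(struct mock_load_weight) == 16,
	      "mock assumes 8-byte alignment of uint64_t");

/* Address touched when reading se->on_rq through the given pointer. */
static size_t mock_fault_address(const struct mock_sched_entity *se)
{
	return (size_t)(uintptr_t)se +
	       offsetof(struct mock_sched_entity, on_rq);
}
```

Under these assumptions mock_fault_address(NULL) evaluates to 0x38, which
matches CR2 in the oops.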

In pick_next_task_fair(), cfs_rq->nr_running was non-zero, but after
put_prev_task(rq, prev) the kernel cannot find any task to schedule next.
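
For illustration only, the check-then-put sequence can be reduced to a
toy model in plain C. Nothing here is kernel code: struct toy_cfs_rq,
toy_put_prev(), toy_pick_next() and the TOY_PICK_* values are made-up
names, the model has a single group with no refill available, and it
covers only the "expiration has passed" branch of the expiry logic.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model, not kernel code: one throttle-able group. */
struct toy_cfs_rq {
	int64_t runtime_remaining;	/* local quota left, ns */
	int64_t runtime_expires;	/* local deadline, ns */
	unsigned int nr_running;	/* runnable tasks the picker can see */
	bool throttled;
};

enum toy_pick {
	TOY_PICK_IDLE,		/* nothing runnable, caught by the check */
	TOY_PICK_TASK,		/* an entity was found */
	TOY_PICK_NULL_ENTITY,	/* check passed, then nothing was left */
};

/* Unpatched expiry: once the deadline has passed, drop the quota to 0. */
static void toy_expire(struct toy_cfs_rq *cfs_rq, int64_t now)
{
	if (now - cfs_rq->runtime_expires < 0)
		return;
	if (cfs_rq->runtime_remaining < 0)
		return;
	cfs_rq->runtime_remaining = 0;
}

/* Stand-in for put_prev_task(): charge time, expire, maybe throttle. */
static void toy_put_prev(struct toy_cfs_rq *cfs_rq, int64_t now,
			 int64_t delta_exec)
{
	assert(delta_exec >= 0);
	cfs_rq->runtime_remaining -= delta_exec;
	toy_expire(cfs_rq, now);
	if (cfs_rq->runtime_remaining <= 0) {
		/* No refill in this model: the whole group goes away. */
		cfs_rq->throttled = true;
		cfs_rq->nr_running = 0;
	}
}

/* Stand-in for the pick path: check first, put afterwards. */
static enum toy_pick toy_pick_next(struct toy_cfs_rq *cfs_rq, int64_t now,
				   int64_t delta_exec)
{
	if (!cfs_rq->nr_running)
		return TOY_PICK_IDLE;
	toy_put_prev(cfs_rq, now, delta_exec);
	/* No second check here: this is where a NULL entity would be used. */
	return cfs_rq->nr_running ? TOY_PICK_TASK : TOY_PICK_NULL_ENTITY;
}
```

With runtime_remaining = 500, runtime_expires = 1000 and nr_running = 1,
toy_pick_next(&rq, 1000, 100) passes the nr_running check, the put step
then zeroes the quota and throttles the group, and the result is
TOY_PICK_NULL_ENTITY. With now = 900 the deadline is still ahead and the
result is TOY_PICK_TASK.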

It crashes from time to time on an unusual libvirt/kvm setup where
cfs_quota is set on two levels: on the parent cgroup which contains kvm
and on each per-vcpu child cgroup.

This patch hasn't been verified yet, but I haven't found any other
possible cause for this crash.

>
> This patch leaves 1 in ->runtime_remaining when the current assignment
> expires and tries to refill it right after that. In the worst case, the
> task will be scheduled once and throttled at the end of its slice.
>
> Signed-off-by: Konstantin Khlebnikov <khlebnikov@...dex-team.ru>
> ---
>   kernel/sched/fair.c |   19 +++++++++++++------
>   1 file changed, 13 insertions(+), 6 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 7ce18f3c097a..91785d077db4 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3447,11 +3447,12 @@ static void expire_cfs_rq_runtime(struct cfs_rq *cfs_rq)
>   {
>   	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
>
> -	/* if the deadline is ahead of our clock, nothing to do */
> -	if (likely((s64)(rq_clock(rq_of(cfs_rq)) - cfs_rq->runtime_expires) < 0))
> +	/* nothing to expire */
> +	if (cfs_rq->runtime_remaining <= 0)
>   		return;
>
> -	if (cfs_rq->runtime_remaining < 0)
> +	/* if the deadline is ahead of our clock, nothing to do */
> +	if (likely((s64)(rq_clock(rq_of(cfs_rq)) - cfs_rq->runtime_expires) < 0))
>   		return;
>
>   	/*
> @@ -3469,8 +3470,14 @@ static void expire_cfs_rq_runtime(struct cfs_rq *cfs_rq)
>   		/* extend local deadline, drift is bounded above by 2 ticks */
>   		cfs_rq->runtime_expires += TICK_NSEC;
>   	} else {
> -		/* global deadline is ahead, expiration has passed */
> -		cfs_rq->runtime_remaining = 0;
> +		/*
> +		 * Global deadline is ahead, expiration has passed.
> +		 *
> +		 * Do not expire runtime completely. Otherwise put_prev_task()
> +		 * can throttle all tasks when we already checked nr_running or
> +		 * put_prev_entity() can throttle already chosen next entity.
> +		 */
> +		cfs_rq->runtime_remaining = 1;
>   	}
>   }
>
> @@ -3480,7 +3487,7 @@ static void __account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec)
>   	cfs_rq->runtime_remaining -= delta_exec;
>   	expire_cfs_rq_runtime(cfs_rq);
>
> -	if (likely(cfs_rq->runtime_remaining > 0))
> +	if (likely(cfs_rq->runtime_remaining > 1))
>   		return;
>
>   	/*
>
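
To make the intended effect of the patch concrete, here is a rough
user-space sketch of the patched accounting. It is a simplified model,
not the kernel code: struct toy_rq_patched and the toy_*() helpers are
made-up names, only the "expiration has passed" branch is modeled, the
refill is reduced to a "refill wanted" return value, and the throttle
check (runtime_remaining > 0 means "do not throttle") is my assumption
about what the put path tests.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model, not kernel code. */
struct toy_rq_patched {
	int64_t runtime_remaining;	/* local quota left, ns */
	int64_t runtime_expires;	/* local deadline, ns */
	bool throttled;
};

/* Patched expiry: keep 1 ns instead of dropping the quota to 0. */
static void toy_expire_patched(struct toy_rq_patched *cfs_rq, int64_t now)
{
	/* nothing to expire */
	if (cfs_rq->runtime_remaining <= 0)
		return;
	/* deadline is still ahead of the clock */
	if (now - cfs_rq->runtime_expires < 0)
		return;
	cfs_rq->runtime_remaining = 1;
}

/* Charge delta_exec; returns true when a refill should be attempted. */
static bool toy_account_patched(struct toy_rq_patched *cfs_rq, int64_t now,
				int64_t delta_exec)
{
	assert(delta_exec >= 0);
	cfs_rq->runtime_remaining -= delta_exec;
	toy_expire_patched(cfs_rq, now);
	return cfs_rq->runtime_remaining <= 1;
}

/* Assumed throttle check on the put path: throttle only at <= 0. */
static bool toy_check_throttle(struct toy_rq_patched *cfs_rq)
{
	if (cfs_rq->runtime_remaining > 0)
		return false;
	cfs_rq->throttled = true;
	return true;
}
```

Starting from runtime_remaining = 500 and runtime_expires = 1000, one
toy_account_patched(&rq, 1000, 100) call leaves runtime_remaining at 1
and asks for a refill, and toy_check_throttle() does not throttle. If no
refill arrives, the next toy_account_patched(&rq, 1100, 100) drives the
value to -99 and the following check throttles, which is the "scheduled
once and throttled at the end of slice" worst case.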


-- 
Konstantin