linux-kernel - Re: [PROBLEM] possible divide by 0 in kernel/sched/cputime.c scale

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20131118140224.GA3330@redhat.com>
Date:	Mon, 18 Nov 2013 15:02:24 +0100
From:	Stanislaw Gruszka <sgruszka@...hat.com>
To:	Christian Engelmayer <cengelma@....at>
Cc:	Ingo Molnar <mingo@...hat.com>, linux-kernel@...r.kernel.org,
	Peter Zijlstra <peterz@...radead.org>,
	Thomas Gleixner <tglx@...utronix.de>
Subject: Re: [PROBLEM] possible divide by 0 in kernel/sched/cputime.c
 scale_stime()

Hi Christian 

On Sat, Nov 16, 2013 at 10:37:40PM +0100, Christian Engelmayer wrote:
> Since upgrading from v3.8 to v3.12 I see random crashes in function scale_stime()
> in kernel/sched/cputime.c:
> 
>    divide error: 0000 [#1] SMP 
>    Modules linked in: parport_pc(F) ppdev(F) bnep rfcomm bluetooth binfmt_misc(F)
>    zl10353 cx88_dvb cx88_vp3054_i2c videobuf_dvb dvb_core intel_powerclamp coretemp
>    kvm_intel(F) tuner_xc2028 kvm(F) i915 snd_hda_codec_hdmi snd_hda_codec_realtek
>    cx8800 cx8802 tuner snd_hda_intel snd_hda_codec cx88_alsa crct10dif_pclmul(F)
>    crc32_pclmul(F) snd_hwdep(F) snd_pcm(F) snd_page_alloc(F) ghash_clmulni_intel(F)
>    aesni_intel(F) snd_seq_midi(F) snd_seq_midi_event(F) snd_rawmidi(F) snd_seq(F)
>    joydev(F) cx88xx snd_seq_device(F) snd_timer(F) aes_x86_64(F) lrw(F) gf128mul(F)
>    glue_helper(F) video(F) btcx_risc drm_kms_helper ablk_helper(F) tveeprom cryptd(F)
>    lp(F) videobuf_dma_sg rc_core drm v4l2_common videobuf_core mei_me parport(F)
>    snd(F) mei soundcore(F) videodev i2c_algo_bit serio_raw(F) microcode(F) mac_hid
>    lpc_ich asus_atk0110 hid_generic usbhid hid usb_storage(F) firewire_ohci ahci(F)
>    libahci(F) firewire_core r8169 crc_itu_t(F) mii(F)
>    CPU: 3 PID: 15367 Comm: htop Tainted: GF            3.12.0-031200-generic #201311031935
>    Hardware name: System manufacturer System Product Name/P7H55-M PRO, BIOS 1709    01/04/2011
>    task: ffff8800cc09e000 ti: ffff8800af620000 task.ti: ffff8800af620000
>    RIP: 0010:[<ffffffff81099de0>]  [<ffffffff81099de0>] cputime_adjust+0xf0/0x110
>    RSP: 0018:ffff8800af621cc8  EFLAGS: 00010847
>    RAX: 85fdc1fef4047c00 RBX: 0000000000000000 RCX: ffff8800af621df8
>    RDX: 0000000000000000 RSI: ffff8800cc0634d0 RDI: 0000000000000000
>    RBP: ffff8800af621cd8 R08: 00000000fffffffe R09: 0000000000000000
>    R10: 0000000000000000 R11: fffffffe03427acc R12: ffff8800af621df0
>    R13: ffff8800af621df8 R14: 0000000000000000 R15: ffff8800cc063300
>    FS:  00007f22a387d740(0000) GS:ffff880117c60000(0000) knlGS:0000000000000000
>    CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>    CR2: 00007f22a3892000 CR3: 0000000097023000 CR4: 00000000000007e0
>    Stack:
>     ffff8800c37f0000 ffff8800af621df0 ffff8800af621d18 ffffffff8109aa51
>     000000000a7d8c00 0000000042fee100 fffffffe03427acc ffff8800bf112a80
>     ffff8800c37f0000 ffff8800c307c280 ffff8800af621e50 ffffffff8121f74b
>    Call Trace:
>     [<ffffffff8109aa51>] thread_group_cputime_adjusted+0x41/0x50
>     [<ffffffff8121f74b>] do_task_stat+0x8eb/0xb60
>     [<ffffffff81176400>] ? vma_compute_subtree_gap+0x50/0x50
>     [<ffffffff81220314>] proc_tgid_stat+0x14/0x20
>     [<ffffffff8121b16d>] proc_single_show+0x4d/0x90
>     [<ffffffff811d6eee>] seq_read+0x14e/0x390
>     [<ffffffff811b3725>] vfs_read+0x95/0x160
>     [<ffffffff811b4239>] SyS_read+0x49/0xa0
>     [<ffffffff81723ced>] system_call_fastpath+0x1a/0x1f
>    Code: 89 fa 49 c1 ea 20 4d 85 d2 74 ca 4c 89 c2 48 d1 ef 49 89 c0 48 d1 ea 48
>          89 d0 eb 9f 0f 1f 80 00 00 00 00 89 c0 31 d2 49 0f af c0 <48> f7 f7 4c
>          89 df 48 29 c7 49 89 c3 e9 31 ff ff ff 66 66 66 66 
>    RIP  [<ffffffff81099de0>] cputime_adjust+0xf0/0x110
>     RSP <ffff8800af621cc8>
>    ---[ end trace dbafd2159a385dd6 ]---
> 
> The affected LOC performing the division by 0 was introduced in commit
> 
>    commit 55eaa7c1f511af5fb6ef808b5328804f4d4e5243
>    Author: Stanislaw Gruszka <sgruszka@...hat.com>
>    Date:   Tue Apr 30 17:14:42 2013 +0200
>    sched: Avoid cputime scaling overflow
> 
> For the problem to occur the function is called eg. with the following
> input parameters
> 
>    stime: 0x3567e00
>    rtime: 0xffffffffbf1abfdb
>    total: 0x3938700
> 
> which leads to 'total' being shifted to 0 during the adaption of the precision
> and is then used without further check in
> 
>    scaled = div_u64((u64) (u32) stime * (u64) (u32) rtime, (u32)total);
> 
> The root cause triggering this issue seems to be an overflowed value of
> 
>    rtime = nsecs_to_cputime(curr->sum_exec_runtime);
> 
> On the affected machine the problem can be triggered by loading the
> previously idle system by starting a full kernel build. The problem occurs
> within a minute after the ondemand frequency scaling governor adjusts the
> frequency from the minimum to the maximum.
> 
> The x86 init check whether all booted CPUs have their TSC's synchronized, never
> failed so far, however, the tsc clocksource is sporadically marked unstable.
> 
>    Clocksource tsc unstable (delta = -74994678 ns)
> 
> The used CPU provides an Intel Invariant TSC as stated by CPUID.80000007H:EDX[8]:
> 
>     eax in    eax      ebx      ecx      edx
>    00000000 0000000b 756e6547 6c65746e 49656e69
>    00000001 00020652 04100800 0298e3ff bfebfbff
>    00000002 55035a01 00f0b2e3 00000000 09ca212c
>    00000003 00000000 00000000 00000000 00000000
>    00000004 00000000 00000000 00000000 00000000
>    00000005 00000040 00000040 00000003 00001120
>    00000006 00000005 00000002 00000001 00000000
>    00000007 00000000 00000000 00000000 00000000
>    00000008 00000000 00000000 00000000 00000000
>    00000009 00000000 00000000 00000000 00000000
>    0000000a 07300403 00000004 00000000 00000603
>    0000000b 00000000 00000000 0000002c 00000004
>    80000000 80000008 00000000 00000000 00000000
>    80000001 00000000 00000000 00000001 28100800
>    80000002 65746e49 2952286c 726f4320 4d542865
>    80000003 35692029 55504320 20202020 20202020
>    80000004 30353620 20402020 30322e33 007a4847
>    80000005 00000000 00000000 00000000 00000000
>    80000006 00000000 00000000 01006040 00000000
>    80000007 00000000 00000000 00000000 00000100
>    80000008 00003024 00000000 00000000 00000000
> 
>    Vendor ID: "GenuineIntel"; CPUID level 11
> 
>    Intel-specific functions:
>    Version 00020652:
>    Type 0 - Original OEM
>    Family 6 - Pentium Pro
>    Model 5 - Pentium II Model 5/Xeon/Celeron
>    Stepping 2
>    Reserved 8
> 
>    Extended brand string: "Intel(R) Core(TM) i5 CPU         650  @ 3.20GHz"
>    CLFLUSH instruction cache line size: 8
>    Initial APIC ID: 4
>    Hyper threading siblings: 16
> 
>    Feature flags bfebfbff:
>    FPU    Floating Point Unit
>    VME    Virtual 8086 Mode Enhancements
>    DE     Debugging Extensions
>    PSE    Page Size Extensions
>    TSC    Time Stamp Counter
>    MSR    Model Specific Registers
>    PAE    Physical Address Extension
>    MCE    Machine Check Exception
>    CX8    COMPXCHG8B Instruction
>    APIC   On-chip Advanced Programmable Interrupt Controller present and enabled
>    SEP    Fast System Call
>    MTRR   Memory Type Range Registers
>    PGE    PTE Global Flag
>    MCA    Machine Check Architecture
>    CMOV   Conditional Move and Compare Instructions
>    FGPAT  Page Attribute Table
>    PSE-36 36-bit Page Size Extension
>    CLFSH  CFLUSH instruction
>    DS     Debug store
>    ACPI   Thermal Monitor and Clock Ctrl
>    MMX    MMX instruction set
>    FXSR   Fast FP/MMX Streaming SIMD Extensions save/restore
>    SSE    Streaming SIMD Extensions instruction set
>    SSE2   SSE2 extensions
>    SS     Self Snoop
>    HT     Hyper Threading
>    TM     Thermal monitor
>    31     reserved
> 
> Nevertheless, when looking into the issue I saw occurences of sched_clock going
> backwards as if the TSCs were read out of sync. Accordingly the problem does
> not occur when either booting with option 'nosmp' or when forcing the TSC to be
> marked as unstable for sched_clock. Booting with 'acpi=off' and no frequency
> scaling works too.
> 
> Having a look at the scheduler code I see the following pattern that would also
> catch a time warp, eg. kernel/sched/rt.c update_curr_rt():
> 
>    u64 delta_exec;
>    delta_exec = rq_clock_task(rq) - curr->se.exec_start;
>    if (unlikely((s64)delta_exec <= 0))
>            return;
> 
> However, there are still vulnerable places, eg. kernel/sched/fair.c update_curr():
> 
>    /*
>     * Get the amount of time the current task was running
>     * since the last time we changed load (this cannot
>     * overflow on 32 bits):
>     */
>    delta_exec = (unsigned long)(now - curr->exec_start);
>    if (!delta_exec)
>            return;

Thanks for great analyse. I'm not sure where this problem should be
fixed (in TSC clocksource or sched/fair.c or sched/cputime.c), however
since RT scheduler is protected for sched_clock going backward, we
probably can protect fair scheduler as well (also do_task_delta_exec()
has similar check).

Does the below patch fixes the issue on affected machine?

Stanislaw

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7c70201..f02a567 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -727,7 +727,7 @@ static void update_curr(struct cfs_rq *cfs_rq)
 	u64 now = rq_clock_task(rq_of(cfs_rq));
 	unsigned long delta_exec;
 
-	if (unlikely(!curr))
+	if (unlikely(!curr || now <= curr->exec_start))
 		return;
 
 	/*
@@ -736,8 +736,6 @@ static void update_curr(struct cfs_rq *cfs_rq)
 	 * overflow on 32 bits):
 	 */
 	delta_exec = (unsigned long)(now - curr->exec_start);
-	if (!delta_exec)
-		return;
 
 	__update_curr(cfs_rq, curr, delta_exec);
 	curr->exec_start = now;
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/