linux-kernel - Re: [lkp-robot] [x86] ed3ce2a917: BUG:unable_to_handle


Message-ID: <20170309023052.grzkmd2m6cttnesv@wfg-t540p.sh.intel.com>
Date:   Thu, 9 Mar 2017 10:30:52 +0800
From:   Fengguang Wu <fengguang.wu@...el.com>
To:     Ye Xiaolong <xiaolong.ye@...el.com>
Cc:     Borislav Petkov <bp@...en8.de>, X86 ML <x86@...nel.org>,
        Andy Lutomirski <luto@...capital.net>,
        Peter Zijlstra <peterz@...radead.org>,
        LKML <linux-kernel@...r.kernel.org>, lkp@...org
Subject: Re: [lkp-robot] [x86]  ed3ce2a917: BUG:unable_to_handle_kernel

On Thu, Mar 09, 2017 at 10:13:10AM +0800, Ye Xiaolong wrote:
>On 03/02, Borislav Petkov wrote:
>>Hi,
>>
>>On Thu, Mar 02, 2017 at 09:09:34AM +0800, kernel test robot wrote:
>>>
>>> FYI, we noticed the following commit:
>>>
>>> commit: ed3ce2a9172457ef7dbaa9f964e63dfde2bdcb5f ("x86: Optimize clear_page()")
>>> url: https://github.com/0day-ci/linux/commits/Borislav-Petkov/x86-Optimize-clear_page/20170215-193441
>>>
>>>
>>> in testcase: will-it-scale
>>> with following parameters:
>>>
>>> 	test: poll2
>>> 	cpufreq_governor: performance
>>>
>>> test-description: Will It Scale takes a testcase and runs it from 1 through to n parallel copies to see if the testcase will scale. It builds both a process and threads based test in order to see any differences between the two.
>>> test-url: https://github.com/antonblanchard/will-it-scale
>>
>>thanks for the report, I was able to reproduce.
>>
>>BUT(!) this report is misleading because it talks about will-it-scale
>>but your splat happens when you kexec the kernel:
>>
>>  [  336.340747] LKP: kexec loading...
>>  [  336.340852]
>>  [  336.343323] kexec --noefi -l /tmp/cache/pkg/linux/x86_64-rhel-7.2/gcc-6/ed3ce2a9172457ef7dbaa9f964e63dfde2bdcb5f/vmlinuz-4.9.0-rc6-00134-ged3ce2a --initrd=/tmp/cache/initrd-concatenated
>>  [  336.343758]
>>  [  337.893471] --append=ip=::::lkp-ivb-d01::dhcp root=/dev/ram0 user=lkp job=/lkp/scheduled/lkp-ivb-d01/will-it-scale-poll2-performance-debian-x86_64-2016-08-31.cgz-ed3ce2a9172457ef7dbaa9f964e63dfde2bdcb5f-20170301-28072-1dqjyhl-11.yaml ARCH=x86_64 kconfig=x86_64-rhel-7.2 branch=linux-devel/devel-hourly-2017022612 commit=ed3ce2a9172457ef7dbaa9f964e63dfde2bdcb5f BOOT_IMAGE=/pkg/linux/x86_64-rhel-7.2/gcc-6/ed3ce2a9172457ef7dbaa9f964e63dfde2bdcb5f/vmlinuz-4.9.0-rc6-00134-ged3ce2a max_uptime=1500 RESULT_ROOT=/result/will-it-scale/poll2-performance/lkp-ivb-d01/debian-x86_64-2016-08-31.cgz/x86_64-rhel-7.2/gcc-6/ed3ce2a9172457ef7dbaa9f964e63dfde2bdcb5f/11 LKP_SERVER=inn debug apic=debug sysrq_always_enabled rcupdate.rcu_cpu_stall_timeout=100 net.ifnames=0 printk.devkmsg=on panic=-1 softlockup_panic=1 nmi_watchdog=panic oops=panic load_ramdisk=2 prompt_ramdisk=0 drbd.minor_count=8 systemd.log_level=err ignore_
>>  [  337.895521]
>>  [  339.467661] BUG: unable to handle kernel paging request at ffff8803cf2e2008
>>  [  339.468000] IP: [<ffffffff81061e71>] native_set_pmd+0x1/0x10
>>  ...
>>
>>
>>Maybe Fengguang has an idea what to do here, maybe something like add
>>markers to the log to denote where the test environment is prepared and
>>when the actual test starts. Then grep for those and generate the report
>>based on that...
>
>Thanks for the suggestions, we'll keep improving the reports to avoid confusion
>or misleading.

One possible improvement is to provide "lkp qemu" reproduction steps for
kernel oopses -- they would be far more convenient and safer to follow
than "lkp run", since the latter risks hanging the physical machine.

As for the test description, the dmesg carries markers for the user
space test start/stop points, so the robot can easily tell whether the
oops happened during the test or before/after it -- the latter may
well (but does not always) indicate that the oops is related not to the
testcase but to the regular kernel boot/reboot/kexec process.
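
A rough sketch of how the robot could use those markers, in Python. The
marker strings TEST_START and TEST_END and the function name
classify_oops are made up for illustration; the text LKP actually writes
to dmesg may differ, and the oops patterns are only a small sample.

```python
import re

# Hypothetical marker strings; the real LKP dmesg markers may differ.
TEST_START = "LKP: test start"
TEST_END = "LKP: test end"

# A few common oops signatures; a real report tool would need more.
OOPS_RE = re.compile(r"BUG: unable to handle kernel|Oops:|Kernel panic")


def classify_oops(dmesg_lines):
    """Return the phase in which the first oops appears.

    The result is 'before-test', 'during-test' or 'after-test',
    or None when no oops signature is found in the log.
    """
    phase = "before-test"
    for line in dmesg_lines:
        if TEST_START in line:
            phase = "during-test"
        elif TEST_END in line:
            phase = "after-test"
        elif OOPS_RE.search(line):
            return phase
    return None


if __name__ == "__main__":
    log = [
        "[  100.000000] LKP: test start",
        "[  300.000000] LKP: test end",
        "[  336.340747] LKP: kexec loading...",
        "[  339.467661] BUG: unable to handle kernel paging request at ffff8803cf2e2008",
    ]
    print(classify_oops(log))
```

With a log shaped like the one in this report, the oops lands after the
end marker, so the report could say it came from the kexec/reboot path
rather than blame the testcase.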

Thanks,
Fengguang