linux-kernel - Re: kvm splat in mmu_spte_clear_track

Message-Id: <B72727D4-0E1C-4DE5-AFDA-A1D1259AC138@gmail.com>
Date:   Tue, 29 Aug 2017 08:53:46 -0700
From:   Nadav Amit <nadav.amit@...il.com>
To:     Bernhard Held <berny156@....de>
Cc:     Adam Borowski <kilobyte@...band.pl>,
        Paolo Bonzini <pbonzini@...hat.com>,
        Wanpeng Li <kernellwp@...il.com>,
        Radim Krčmář <rkrcmar@...hat.com>,
        kvm <kvm@...r.kernel.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: Re: kvm splat in mmu_spte_clear_track_bits

Bernhard Held <berny156@....de> wrote:

> On 08/28/2017 at 06:56 PM, Nadav Amit wrote:
>> Bernhard Held <berny156@....de> wrote:
>>> On 08/27/2017 at 02:35 PM, Adam Borowski wrote:
>>>> 4.13-rc5 retested fails
>>>> Crashed only after two hours or so of testing.
>>>> 4.13-rc4 apparently works
>>>> It survived several hours of varied tests (like 5 debian-installer runs, a
>>>> win10 point release upgrade, some hurd package building, openbsd, etc),
>>>> all while the host was likewise busy.
>>>> Thus: to the best of my knowledge, the problem is between 4.13-rc4 and 4.13-rc5
>>>> but I wouldn't bet my life on it.
>>> 
>>> I get crashes with Win10 in kvm with 4.13-rc5. 4.13-rc4 works for me. THP seems to accelerate the crash, but that's not 100% sure.
>>> 
>>> There's still no crash after reverting merge 27df70 on 4.13-rc7. There are 21 commits in this merge, 10 are mm-related:
>>> 
>>> $ git log 4e082e9ba7cd..e86b298bebf7 --pretty=oneline --abbrev-commit
>>> e86b298bebf7 userfaultfd: replace ENOSPC with ESRCH in case mm has gone during copy/zeropage
>>> f357e345eef7 zram: rework copy of compressor name in comp_algorithm_store()
>>> aac2fea94f7a rmap: do not call mmu_notifier_invalidate_page() under ptl
>>> d041353dc98a mm: fix list corruptions on shmem shrinklist
>>> af54aed94bf3 mm/balloon_compaction.c: don't zero ballooned pages
>>> c0a6a5ae6b5d MAINTAINERS: copy virtio on balloon_compaction.c
>>> b3a81d0841a9 mm: fix KSM data corruption
>>> 99baac21e458 mm: fix MADV_[FREE|DONTNEED] TLB flush miss problem
>>> 0a2dd266dd6b mm: make tlb_flush_pending global
>>> 56236a59556c mm: refactor TLB gathering API
>>> a9b802500ebb Revert "mm: numa: defer TLB flush for THP migration as long as possible"
>>> 0a2c40487f3e mm: migrate: fix barriers around tlb_flush_pending
>>> 16af97dc5a89 mm: migrate: prevent racy access to tlb_flush_pending
>>> 9eeb52ae712e fault-inject: fix wrong should_fail() decision in task context
>>> 4e98ebe5f435 test_kmod: fix small memory leak on filesystem tests
>>> 9c56771316ef test_kmod: fix the lock in register_test_dev_kmod()
>>> 434b06ae23ba test_kmod: fix bug which allows negative values on two config options
>>> a4afe8cdec16 test_kmod: fix spelling mistake: "EMTPY" -> "EMPTY"
>>> 5af10dfd0afc userfaultfd: hugetlbfs: remove superfluous page unlock in VM_SHARED case
>>> 75dddef32514 mm: ratelimit PFNs busy info message
>>> d507e2ebd2c7 mm: fix global NR_SLAB_.*CLAIMABLE counter reads
>> Don’t blame me for the TLB stuff... My money is on aac2fea94f7a.
> 
> Amit, thanks for your courage to expose your patch!

Just for the record, aac2fea94f7a is not mine (some others are).

> 
> I'm more and more confident that aac2fea94f7a is the culprit. Maybe it just accelerates the triggering of the splat. To be more sure, the kernel needs to be tested for a couple of days. It would be great if others could assist in testing aac2fea94f7a.
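
For anyone who wants to help test, one way to isolate aac2fea94f7a might look like the sketch below. It is only a sketch: revert_suspect and the branch name are hypothetical names chosen for illustration, the v4.13-rc7 tag is assumed to exist in the local tree, and the revert may not apply cleanly without manual fixups.

```shell
#!/bin/sh
# Sketch: create a throwaway branch from a base ref with one suspect
# commit reverted. revert_suspect is a hypothetical helper name and is
# not taken from the thread.
revert_suspect() {
    base=$1      # base ref to test on, e.g. v4.13-rc7
    suspect=$2   # commit to revert, e.g. aac2fea94f7a
    branch=$3    # name of the throwaway test branch
    git checkout -q -b "$branch" "$base" &&
    git revert --no-edit "$suspect"
}

# Intended use inside a kernel tree (assumes the v4.13-rc7 tag is present):
#   revert_suspect v4.13-rc7 aac2fea94f7a test-revert-aac2fea
# Then rebuild, boot the host on that kernel, and rerun the KVM workload.
```

Since the crash can take hours to show up, a run without a splat only counts as evidence after a long soak, ideally compared against an unmodified 4.13-rc7 under the same load.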