linux-kernel - Regression in linux-next

Message-ID: <SJ1PR11MB61296D265E3407D447188EF6B903A@SJ1PR11MB6129.namprd11.prod.outlook.com>
Date:   Tue, 25 Jul 2023 06:42:54 +0000
From:   "Borah, Chaitanya Kumar" <chaitanya.kumar.borah@...el.com>
To:     "apopple@...dia.com" <apopple@...dia.com>
CC:     "Yedireswarapu, SaiX Nandan" <saix.nandan.yedireswarapu@...el.com>,
        "Saarinen, Jani" <jani.saarinen@...el.com>,
        "Kurmi, Suresh Kumar" <suresh.kumar.kurmi@...el.com>,
        "Nikula, Jani" <jani.nikula@...el.com>,
        "intel-gfx@...ts.freedesktop.org" <intel-gfx@...ts.freedesktop.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        "linux-mm@...ck.org" <linux-mm@...ck.org>
Subject: Regression in linux-next

Hello Alistair,

Hope you are doing well. I am Chaitanya from the Linux graphics team at Intel.

This mail is regarding a regression we are seeing in our CI runs [1] on the
linux-next repository.

On next-20230720 [2], we are seeing the following error:

<4>[   76.189375] Hardware name: Intel Corporation Meteor Lake Client Platform/MTL-P DDR5 SODIMM SBS RVP, BIOS MTLPFWI1.R00.3271.D81.2307101805 07/10/2023
<4>[   76.202534] RIP: 0010:__mmu_notifier_register+0x40/0x210
<4>[   76.207804] Code: 1a 71 5a 01 85 c0 0f 85 ec 00 00 00 48 8b 85 30 01 00 00 48 85 c0 0f 84 04 01 00 00 8b 85 cc 00 00 00 85 c0 0f 8e bb 01 00 00 <49> 8b 44 24 10 48 83 78 38 00 74 1a 48 83 78 28 00 74 0c 0f 0b b8
<4>[   76.226368] RSP: 0018:ffffc900019d7ca8 EFLAGS: 00010202
<4>[   76.231549] RAX: 0000000000000001 RBX: 0000000000001000 RCX: 0000000000000001
<4>[   76.238613] RDX: 0000000000000000 RSI: ffffffff823ceb7b RDI: ffffffff823ee12d
<4>[   76.245680] RBP: ffff888102ec9b40 R08: 00000000ffffffff R09: 0000000000000001
<4>[   76.252747] R10: 0000000000000001 R11: ffff8881157cd2c0 R12: 0000000000000000
<4>[   76.259811] R13: ffff888102ec9c70 R14: ffffffffa07de500 R15: ffff888102ec9ce0
<4>[   76.266875] FS:  00007fbcabe11c00(0000) GS:ffff88846ec00000(0000) knlGS:0000000000000000
<4>[   76.274884] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4>[   76.280578] CR2: 0000000000000010 CR3: 000000010d4c2005 CR4: 0000000000f70ee0
<4>[   76.287643] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
<4>[   76.294711] DR3: 0000000000000000 DR6: 00000000ffff07f0 DR7: 0000000000000400
<4>[   76.301775] PKRU: 55555554
<4>[   76.304463] Call Trace:
<4>[   76.306893]  <TASK>
<4>[   76.308983]  ? __die_body+0x1a/0x60
<4>[   76.312444]  ? page_fault_oops+0x156/0x450
<4>[   76.316510]  ? do_user_addr_fault+0x65/0x980
<4>[   76.320747]  ? exc_page_fault+0x68/0x1a0
<4>[   76.324643]  ? asm_exc_page_fault+0x26/0x30
<4>[   76.328796]  ? __mmu_notifier_register+0x40/0x210
<4>[   76.333460]  ? __mmu_notifier_register+0x11c/0x210
<4>[   76.338206]  ? preempt_count_add+0x4c/0xa0
<4>[   76.342273]  mmu_notifier_register+0x30/0xe0
<4>[   76.346509]  mmu_interval_notifier_insert+0x74/0xb0
<4>[   76.351344]  i915_gem_userptr_ioctl+0x21a/0x320 [i915]
<4>[   76.356565]  ? __pfx_i915_gem_userptr_ioctl+0x10/0x10 [i915]
<4>[   76.362271]  drm_ioctl_kernel+0xb4/0x150
<4>[   76.366159]  drm_ioctl+0x21d/0x420
<4>[   76.369537]  ? __pfx_i915_gem_userptr_ioctl+0x10/0x10 [i915]
<4>[   76.375242]  ? find_held_lock+0x2b/0x80
<4>[   76.379046]  __x64_sys_ioctl+0x79/0xb0
<4>[   76.382766]  do_syscall_64+0x3c/0x90
<4>[   76.386312]  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
<4>[   76.391317] RIP: 0033:0x7fbcae63f3ab
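
As a cross-check of the register dump above, here is a minimal Python sketch that parses the register lines and compares the fault address in CR2 with the registers that hold zero. It is only an illustration: the helper names (parse_registers, zero_registers) are hypothetical, the 4 KiB page size is an assumption, and reading the marked instruction bytes (<49> 8b 44 24 10) as a load from R12 + 0x10 is our own decode, not something the CI log states.

```python
import re

# Register lines copied from the trace above, with the "<4>[ time ]" prefixes dropped.
REGISTER_DUMP = """
RAX: 0000000000000001 RBX: 0000000000001000 RCX: 0000000000000001
RDX: 0000000000000000 RSI: ffffffff823ceb7b RDI: ffffffff823ee12d
RBP: ffff888102ec9b40 R08: 00000000ffffffff R09: 0000000000000001
R10: 0000000000000001 R11: ffff8881157cd2c0 R12: 0000000000000000
R13: ffff888102ec9c70 R14: ffffffffa07de500 R15: ffff888102ec9ce0
CR2: 0000000000000010 CR3: 000000010d4c2005 CR4: 0000000000f70ee0
"""

PAGE_SIZE = 4096  # assumption: 4 KiB pages, so any address below this is in the NULL page


def parse_registers(text):
    """Return {name: value} for every 'NAME: <16 hex digits>' pair in the dump."""
    pairs = re.findall(r"\b([A-Z][A-Z0-9]{1,2}): ([0-9a-f]{16})\b", text)
    return {name: int(value, 16) for name, value in pairs}


def zero_registers(regs):
    """General-purpose registers that hold 0, i.e. candidate NULL base pointers."""
    return [name for name, value in regs.items()
            if value == 0 and not name.startswith("CR")]


regs = parse_registers(REGISTER_DUMP)
fault_address = regs["CR2"]

print(f"fault address (CR2): {fault_address:#x}")
print("inside the NULL page:", fault_address < PAGE_SIZE)
print("registers holding 0:", zero_registers(regs))
# Assumed decode of the marked instruction bytes: a load from R12 + 0x10.
print("R12 + 0x10 == CR2:", regs["R12"] + 0x10 == fault_address)
```

If that decode is right, the fault would be consistent with a NULL pointer being dereferenced at offset 0x10 inside __mmu_notifier_register, but this has not been confirmed against the source.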

The detailed log can be found in [3].

After bisecting the tree, the following patch seems to be causing the
regression.

commit 828fe4085cae77acb3abf7dd3d25b3ed6c560edf
Author: Alistair Popple <apopple@...dia.com>
Date:   Wed Jul 19 22:18:46 2023 +1000

    mmu_notifiers: rename invalidate_range notifier

    There are two main use cases for mmu notifiers.  One is by KVM which uses
    mmu_notifier_invalidate_range_start()/end() to manage a software TLB.

    The other is to manage hardware TLBs which need to use the
    invalidate_range() callback because HW can establish new TLB entries at
    any time.  Hence using start/end() can lead to memory corruption as these
    callbacks happen too soon/late during page unmap.

    mmu notifier users should therefore either use the start()/end() callbacks
    or the invalidate_range() callbacks.  To make this usage clearer rename
    the invalidate_range() callback to arch_invalidate_secondary_tlbs() and
    update documention.

    Link: https://lkml.kernel.org/r/9a02dde2f8ddaad2db31e54706a80c12d1817aaf.1689768831.git-series.apopple@nvidia.com


We also verified this by reverting the patch in the tree.

Could you please check why this patch causes the regression and whether a
fix can be found soon?

[1] https://intel-gfx-ci.01.org/tree/linux-next/combined-alt.html?
[2] https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?h=next-20230720 
[3] https://intel-gfx-ci.01.org/tree/linux-next/next-20230720/bat-mtlp-6/dmesg0.txt