lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20210429075300.GA31265@xsang-OptiPlex-9020>
Date:   Thu, 29 Apr 2021 15:53:00 +0800
From:   Oliver Sang <oliver.sang@...el.com>
To:     Linus Torvalds <torvalds@...ux-foundation.org>
Cc:     "Aneesh Kumar K.V" <aneesh.kumar@...ux.ibm.com>,
        Harish Sriram <harish@...ux.ibm.com>,
        Andrew Morton <akpm@...ux-foundation.org>,
        LKML <linux-kernel@...r.kernel.org>, lkp@...ts.01.org,
        kernel test robot <lkp@...el.com>
Subject: Re: [mm/vunmap] e47110e905: WARNING:at_mm/vmalloc.c:#__vunmap

Hi, Linus,

On Sat, Apr 24, 2021 at 07:04:00PM -0700, Linus Torvalds wrote:
> On Sat, Apr 24, 2021 at 6:31 PM Oliver Sang <oliver.sang@...el.com> wrote:
> > >
> > > Oliver - how reliable is that bisection?
> >
> > we will check further if any issue in our test env.
> >
> > by bot auto tests, we saw 12 issue instances out of 74 runs. but not happen
> > out of 100 runs of parent.
> 
> Oh, that's interesting. So only 12 out of 74 runs saw that __vunmap
> warning, but if I understand your table correctly, there were some
> _other_ issues in there?

yes, there are.

> 
> Are those also for that same commit? (ie those RIP:kfree /

yes. the 1st column (fail:runs) is for parent,
the 3rd colum (fail:runs) is for e47110e905

> RIP:kobject_add_internal / etc)?

the RIP:kfree happens just after __vunmap warning (as attached dmesg):
===================================================================
[  198.730073] ------------[ cut here ]------------
[  198.730597] Trying to vfree() bad address (0000000070935066)
[  198.731223] WARNING: CPU: 0 PID: 1948 at mm/vmalloc.c:2247 __vunmap+0x663/0x990
..
[  198.763940] ---[ end trace 572fd76a7879a124 ]---
[  198.764449] stack segment: 0000 [#1] SMP KASAN
[  198.764933] CPU: 0 PID: 1948 Comm: systemd-udevd Tainted: G        W         5.9.0-rc1-00107-ge47110e90584a #1
[  198.765985] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
[  198.766872] RIP: 0010:kfree+0x64/0x360
==================================================================

and it did not happen on parent:
f3f99d63a8156c7a e47110e90584a22e9980510b00d
---------------- ---------------------------
       fail:runs  %reproduction    fail:runs
           |             |             |
           :998         11%         108:432   dmesg.RIP:kfree

RIP:kobject_add_internal happens on both commits after more runs,
but much lower happen rate than __vunmap (or kfree) on e47110e905.

           :998         13%         125:432   dmesg.RIP:__vunmap
           :998         11%         108:432   dmesg.RIP:kfree
         11:998         -1%           6:432   dmesg.RIP:kobject_add_internal


> 
> I'm not sure how to read that table of yours - if I understand it
> correctly, it looks like the parent commit had some different ones

yes, the parent has some different ones that e47110e905 did not have
(below there is a full table after we run both parent and the e47110e905
more times which has more details)

> that the child did not (eg 2 cases of BUG_at_mm/usercopy.c?)

after more runs, the BUG_at_mm/usercopy.c is also reproduced on e47110e905
          9:998         -1%           3:432   dmesg.kernel_BUG_at_mm/usercopy.c

> 
> So it feels to me like there's some memory corruption somewhere, and
> that commit that it bisected to likely just changed the failure case
> (due to timing differences or allocation ordering changes).
> 
> IOW, there seem to be other panics even in the parent.
> 
> Yes/No?

Yes. we also concerned that the panics on parent could cover the __vunmap issue,
so we ran it up to almost 1000 times.


f3f99d63a8156c7a e47110e90584a22e9980510b00d
---------------- ---------------------------
       fail:runs  %reproduction    fail:runs
           |             |             |
          3:998          0%           6:432   dmesg.BUG:kernel_NULL_pointer_dereference,address
          1:998          0%           2:432   dmesg.BUG:non-zero_pgtables_bytes_on_freeing_mm
         24:998          0%          25:432   dmesg.BUG:unable_to_handle_page_fault_for_address
          1:998         -0%            :432   dmesg.Bad_pagetable:#[##]
          2:998         -0%            :432   dmesg.INFO:rcu_sched_detected_stalls_on_CPUs/tasks
         40:998         14%         180:432   dmesg.Kernel_panic-not_syncing:Fatal_exception
           :998          0%           1:432   dmesg.Kernel_panic-not_syncing:Fatal_exception_in_interrupt
         27:998          0%          31:432   dmesg.Oops:#[##]
          1:998         -0%            :432   dmesg.RIP:__asan_report_load8_noabort
          6:998         -0%           4:432   dmesg.RIP:__is_module_percpu_address
          1:998         -0%            :432   dmesg.RIP:__kasan_check_read
          1:998          0%           5:432   dmesg.RIP:__lock_acquire
           :998          1%           7:432   dmesg.RIP:__migration_entry_wait
          1:998          0%           1:432   dmesg.RIP:__module_address
          2:998          0%           3:432   dmesg.RIP:__rb_insert_augmented
           :998         13%         125:432   dmesg.RIP:__vunmap
          2:998         -0%            :432   dmesg.RIP:anon_vma_interval_tree_insert
           :998          0%           2:432   dmesg.RIP:apply_relocate_add
          1:998         -0%            :432   dmesg.RIP:check_memory_region
          1:998          0%           1:432   dmesg.RIP:copy_pte_range
          1:998         -0%            :432   dmesg.RIP:copy_user_generic_string
          2:998         -0%           1:432   dmesg.RIP:deactivate_slab
           :998          0%           1:432   dmesg.RIP:del_usage_links
          1:998         -0%            :432   dmesg.RIP:do_raw_spin_trylock
           :998          0%           1:432   dmesg.RIP:ep_send_events_proc
           :998          0%           2:432   dmesg.RIP:find_vma
           :998          1%           6:432   dmesg.RIP:free_percpu
          1:998          0%           1:432   dmesg.RIP:handle_mm_fault
          1:998         -0%            :432   dmesg.RIP:kasan_report
          1:998          0%           1:432   dmesg.RIP:kernfs_find_ns
          1:998         -0%            :432   dmesg.RIP:kernfs_link_sibling
           :998         11%         108:432   dmesg.RIP:kfree
          1:998          0%           1:432   dmesg.RIP:kmem_cache_alloc
           :998          0%           2:432   dmesg.RIP:kmem_cache_alloc_trace
         11:998         -1%           6:432   dmesg.RIP:kobject_add_internal
           :998          0%           1:432   dmesg.RIP:llist_del_first
           :998          0%           2:432   dmesg.RIP:load_module
          1:998         -0%            :432   dmesg.RIP:lock_release
          7:998         -1%           1:432   dmesg.RIP:module_put
           :998          0%           1:432   dmesg.RIP:module_remove_modinfo_attrs
          1:998         -0%            :432   dmesg.RIP:native_queued_spin_lock_slowpath
           :998          0%           1:432   dmesg.RIP:native_safe_halt
           :998          0%           1:432   dmesg.RIP:nmi_uaccess_okay
           :998          0%           1:432   dmesg.RIP:paravirt_patch_default.cold
         23:998         -1%          17:432   dmesg.RIP:print_modules
          2:998         -0%           1:432   dmesg.RIP:qlist_free_all
          1:998         -0%            :432   dmesg.RIP:rb_erase
           :998          0%           1:432   dmesg.RIP:rb_next
         26:998          0%          27:432   dmesg.RIP:skip_spaces
           :998          0%           1:432   dmesg.RIP:strncmp
           :998          0%           1:432   dmesg.RIP:sysfs_file_ops
          1:998         -0%            :432   dmesg.RIP:sysfs_remove_group
           :998          0%           1:432   dmesg.RIP:update_sd_lb_stats
          7:998         -1%           1:432   dmesg.RIP:usercopy_abort
           :998          0%           2:432   dmesg.RIP:vmalloc_to_page
           :998          0%           1:432   dmesg.RIP:zap_p4d_range
          1:998         -0%            :432   dmesg.WARNING:at_fs/sysfs/group.c:#sysfs_remove_group
           :998          0%           2:432   dmesg.WARNING:at_kernel/locking/lockdep.c:#__lock_acquire
          7:998         -1%           1:432   dmesg.WARNING:at_kernel/module.c:#module_put
         11:998         -1%           6:432   dmesg.WARNING:at_lib/kobject.c:#kobject_add_internal
           :998          0%           1:432   dmesg.WARNING:at_mm/memory.c:#wp_page_copy
           :998         13%         125:432   dmesg.WARNING:at_mm/vmalloc.c:#__vunmap
           :998          0%           2:432   dmesg.WARNING:at_mm/vmalloc.c:#vmalloc_to_page
          1:998         -0%            :432   dmesg.WARNING:stack_going_in_the_wrong_direction?at_asm_exc_double_fault/0x
          1:998         -0%            :432   dmesg.WARNING:stack_recursion
         42:998          3%          67:432   dmesg.canonical_address#:#[##]
          9:998         -1%           4:432   dmesg.invalid_opcode:#[##]
          1:998         -0%            :432   dmesg.kernel_BUG_at_arch/x86/entry/common.c
           :998          0%           1:432   dmesg.kernel_BUG_at_arch/x86/kernel/paravirt.c
          9:998         -1%           3:432   dmesg.kernel_BUG_at_mm/usercopy.c
           :998         10%         103:432   dmesg.stack_segment:#[##]


BTW, we noticed the e91d8d7823 ("mm/zsmalloc.c: drop ZSMALLOC_PGTABLE_MAPPING")
is a fix for e47110e90584, so we also tested it, as well as the v5.12-rc8,
found the issue still exists though lower happen rate.

f3f99d63a8156c7a e47110e90584a22e e91d8d78237de8d7        v5.12-rc8
---------------- ---------------- ---------------- ----------------
       fail:runs        fail:runs        fail:runs        fail:runs
           |                |                |                |
           :998          125:432           46:299           35:334   dmesg.RIP:__vunmap
           :998          125:432           46:299           35:334   dmesg.WARNING:at_mm/vmalloc.c:#__vunmap
> 
>              Linus

Download attachment "dmesg.xz" of type "application/x-xz" (40868 bytes)

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ