lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CA+CK2bBLAD58Q545r6W9eSwKJ3-BgzUF5oLAn6wHUcDi=jBpdw@mail.gmail.com>
Date:   Fri, 17 May 2019 10:09:57 -0400
From:   Pavel Tatashin <pasha.tatashin@...een.com>
To:     Dan Williams <dan.j.williams@...el.com>
Cc:     "Verma, Vishal L" <vishal.l.verma@...el.com>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        "jmorris@...ei.org" <jmorris@...ei.org>,
        "tiwai@...e.de" <tiwai@...e.de>,
        "sashal@...nel.org" <sashal@...nel.org>,
        "linux-mm@...ck.org" <linux-mm@...ck.org>,
        "dave.hansen@...ux.intel.com" <dave.hansen@...ux.intel.com>,
        "david@...hat.com" <david@...hat.com>, "bp@...e.de" <bp@...e.de>,
        "akpm@...ux-foundation.org" <akpm@...ux-foundation.org>,
        "linux-nvdimm@...ts.01.org" <linux-nvdimm@...ts.01.org>,
        "jglisse@...hat.com" <jglisse@...hat.com>,
        "zwisler@...nel.org" <zwisler@...nel.org>,
        "mhocko@...e.com" <mhocko@...e.com>,
        "Jiang, Dave" <dave.jiang@...el.com>,
        "bhelgaas@...gle.com" <bhelgaas@...gle.com>,
        "Busch, Keith" <keith.busch@...el.com>,
        "thomas.lendacky@....com" <thomas.lendacky@....com>,
        "Huang, Ying" <ying.huang@...el.com>,
        "Wu, Fengguang" <fengguang.wu@...el.com>,
        "baiyaowei@...s.chinamobile.com" <baiyaowei@...s.chinamobile.com>
Subject: Re: [v5 0/3] "Hotremove" persistent memory

>
> I would think that ACPI hotplug would have a similar problem, but it does this:
>
>                 acpi_unbind_memory_blocks(info);
>                 __remove_memory(nid, info->start_addr, info->length);

ACPI does have exactly the same problem, so this is not a bug for this
series, I will submit a new version of my series with comments
addressed, but without fix for this issue.

I was able to reproduce this issue on the current mainline kernel.
Also, I been thinking more about how to fix it, and there is no easy
fix without a major hotplug redesign. Basically, we have to remove
sysfs memory entries before or after memory is hotplugged/hotremoved.
But, we also have to guarantee that hotplug/hotremove will succeed or
reinstate sysfs entries.

Qemu script:

qemu-system-x86_64                                                      \
        -enable-kvm                                                     \
        -cpu host                                                       \
        -parallel none                                                  \
        -echr 1                                                         \
        -serial none                                                    \
        -chardev stdio,id=console,signal=off,mux=on                     \
        -serial chardev:console                                         \
        -mon chardev=console                                            \
        -vga none                                                       \
        -display none                                                   \
        -kernel pmem/native/arch/x86/boot/bzImage                       \
        -m 8G,slots=1,maxmem=16G                                        \
        -smp 8                                                          \
        -fsdev local,id=virtfs1,path=/,security_model=none              \
        -device virtio-9p-pci,fsdev=virtfs1,mount_tag=hostfs            \
        -append 'earlyprintk=serial,ttyS0,115200 console=ttyS0
TERM=xterm ip=dhcp loglevel=7'

Config is attached.

Steps to reproduce:
#
# QEMU 4.0.0 monitor - type 'help' for more information
(qemu) object_add memory-backend-ram,id=mem1,size=1G
(qemu) device_add pc-dimm,id=dimm1,memdev=mem1
(qemu)

# echo online_movable > /sys/devices/system/memory/memory79/state
[   23.029552] Built 1 zonelists, mobility grouping on.  Total pages: 2045370
[   23.032591] Policy zone: Normal
# (qemu) device_del dimm1
(qemu) [   32.013950] Offlined Pages 32768
[   32.014307] Built 1 zonelists, mobility grouping on.  Total pages: 2031022
[   32.014843] Policy zone: Normal
[   32.015733]
[   32.015881] ======================================================
[   32.016390] WARNING: possible circular locking dependency detected
[   32.016881] 5.1.0_pt_pmem #38 Not tainted
[   32.017202] ------------------------------------------------------
[   32.017680] kworker/u16:4/380 is trying to acquire lock:
[   32.018096] 00000000675cc7e1 (kn->count#18){++++}, at:
kernfs_remove_by_name_ns+0x3b/0x80
[   32.018745]
[   32.018745] but task is already holding lock:
[   32.019201] 0000000053e50a99 (mem_sysfs_mutex){+.+.}, at:
unregister_memory_section+0x1d/0xa0
[   32.019859]
[   32.019859] which lock already depends on the new lock.
[   32.019859]
[   32.020499]
[   32.020499] the existing dependency chain (in reverse order) is:
[   32.021080]
[   32.021080] -> #4 (mem_sysfs_mutex){+.+.}:
[   32.021522]        __mutex_lock+0x8b/0x900
[   32.021843]        hotplug_memory_register+0x26/0xa0
[   32.022231]        __add_pages+0xe7/0x160
[   32.022545]        add_pages+0xd/0x60
[   32.022835]        add_memory_resource+0xc3/0x1d0
[   32.023207]        __add_memory+0x57/0x80
[   32.023530]        acpi_memory_device_add+0x13a/0x2d0
[   32.023928]        acpi_bus_attach+0xf1/0x200
[   32.024272]        acpi_bus_scan+0x3e/0x90
[   32.024597]        acpi_device_hotplug+0x284/0x3e0
[   32.024972]        acpi_hotplug_work_fn+0x15/0x20
[   32.025342]        process_one_work+0x2a0/0x650
[   32.025755]        worker_thread+0x34/0x3d0
[   32.026077]        kthread+0x118/0x130
[   32.026442]        ret_from_fork+0x3a/0x50
[   32.026766]
[   32.026766] -> #3 (mem_hotplug_lock.rw_sem){++++}:
[   32.027261]        get_online_mems+0x39/0x80
[   32.027600]        kmem_cache_create_usercopy+0x29/0x2c0
[   32.028019]        kmem_cache_create+0xd/0x10
[   32.028367]        ptlock_cache_init+0x1b/0x23
[   32.028724]        start_kernel+0x1d2/0x4b8
[   32.029060]        secondary_startup_64+0xa4/0xb0
[   32.029447]
[   32.029447] -> #2 (cpu_hotplug_lock.rw_sem){++++}:
[   32.030007]        cpus_read_lock+0x39/0x80
[   32.030360]        __offline_pages+0x32/0x790
[   32.030709]        memory_subsys_offline+0x3a/0x60
[   32.031089]        device_offline+0x7e/0xb0
[   32.031425]        acpi_bus_offline+0xd8/0x140
[   32.031821]        acpi_device_hotplug+0x1b2/0x3e0
[   32.032202]        acpi_hotplug_work_fn+0x15/0x20
[   32.032576]        process_one_work+0x2a0/0x650
[   32.032942]        worker_thread+0x34/0x3d0
[   32.033283]        kthread+0x118/0x130
[   32.033588]        ret_from_fork+0x3a/0x50
[   32.033919]
[   32.033919] -> #1 (&device->physical_node_lock){+.+.}:
[   32.034450]        __mutex_lock+0x8b/0x900
[   32.034784]        acpi_get_first_physical_node+0x16/0x60
[   32.035217]        acpi_companion_match+0x3b/0x60
[   32.035594]        acpi_device_uevent_modalias+0x9/0x20
[   32.036012]        platform_uevent+0xd/0x40
[   32.036352]        dev_uevent+0x85/0x1c0
[   32.036674]        kobject_uevent_env+0x1e2/0x640
[   32.037044]        kobject_synth_uevent+0x2b7/0x324
[   32.037428]        uevent_store+0x17/0x30
[   32.037752]        kernfs_fop_write+0xeb/0x1a0
[   32.038112]        vfs_write+0xb2/0x1b0
[   32.038417]        ksys_write+0x57/0xd0
[   32.038721]        do_syscall_64+0x4b/0x1a0
[   32.039053]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
[   32.039491]
[   32.039491] -> #0 (kn->count#18){++++}:
[   32.039913]        lock_acquire+0xaa/0x180
[   32.040242]        __kernfs_remove+0x244/0x2d0
[   32.040593]        kernfs_remove_by_name_ns+0x3b/0x80
[   32.040991]        device_del+0x14a/0x370
[   32.041309]        device_unregister+0x9/0x20
[   32.041653]        unregister_memory_section+0x69/0xa0
[   32.042059]        __remove_pages+0x112/0x460
[   32.042402]        arch_remove_memory+0x6f/0xa0
[   32.042758]        __remove_memory+0xab/0x130
[   32.043103]        acpi_memory_device_remove+0x67/0xe0
[   32.043537]        acpi_bus_trim+0x50/0x90
[   32.043889]        acpi_device_hotplug+0x2fa/0x3e0
[   32.044300]        acpi_hotplug_work_fn+0x15/0x20
[   32.044686]        process_one_work+0x2a0/0x650
[   32.045044]        worker_thread+0x34/0x3d0
[   32.045381]        kthread+0x118/0x130
[   32.045679]        ret_from_fork+0x3a/0x50
[   32.046005]
[   32.046005] other info that might help us debug this:
[   32.046005]
[   32.046636] Chain exists of:
[   32.046636]   kn->count#18 --> mem_hotplug_lock.rw_sem --> mem_sysfs_mutex
[   32.046636]
[   32.047514]  Possible unsafe locking scenario:
[   32.047514]
[   32.047976]        CPU0                    CPU1
[   32.048337]        ----                    ----
[   32.048697]   lock(mem_sysfs_mutex);
[   32.048983]                                lock(mem_hotplug_lock.rw_sem);
[   32.049519]                                lock(mem_sysfs_mutex);
[   32.050004]   lock(kn->count#18);
[   32.050270]
[   32.050270]  *** DEADLOCK ***
[   32.050270]
[   32.050736] 7 locks held by kworker/u16:4/380:
[   32.051087]  #0: 00000000a22fe78e
((wq_completion)kacpi_hotplug){+.+.}, at: process_one_work+0x21e/0x650
[   32.051830]  #1: 00000000944f2dca
((work_completion)(&hpw->work)){+.+.}, at:
process_one_work+0x21e/0x650
[   32.052577]  #2: 0000000024bbe147 (device_hotplug_lock){+.+.}, at:
acpi_device_hotplug+0x2e/0x3e0
[   32.053271]  #3: 000000005cb50027 (acpi_scan_lock){+.+.}, at:
acpi_device_hotplug+0x3c/0x3e0
[   32.053916]  #4: 00000000b8d06992 (cpu_hotplug_lock.rw_sem){++++},
at: __remove_memory+0x3b/0x130
[   32.054602]  #5: 00000000897f0ef4 (mem_hotplug_lock.rw_sem){++++},
at: percpu_down_write+0x1d/0x110
[   32.055315]  #6: 0000000053e50a99 (mem_sysfs_mutex){+.+.}, at:
unregister_memory_section+0x1d/0xa0
[   32.056016]
[   32.056016] stack backtrace:
[   32.056355] CPU: 4 PID: 380 Comm: kworker/u16:4 Not tainted 5.1.0_pt_pmem #38
[   32.056923] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS 1.12.0-20181126_142135-anatol 04/01/2014
[   32.057720] Workqueue: kacpi_hotplug acpi_hotplug_work_fn
[   32.058144] Call Trace:
[   32.058344]  dump_stack+0x67/0x90
[   32.058604]  print_circular_bug.cold.60+0x15c/0x195
[   32.058989]  __lock_acquire+0x17de/0x1d30
[   32.059308]  ? find_held_lock+0x2d/0x90
[   32.059611]  ? __kernfs_remove+0x199/0x2d0
[   32.059937]  lock_acquire+0xaa/0x180
[   32.060223]  ? kernfs_remove_by_name_ns+0x3b/0x80
[   32.060596]  __kernfs_remove+0x244/0x2d0
[   32.060908]  ? kernfs_remove_by_name_ns+0x3b/0x80
[   32.061283]  ? kernfs_name_hash+0xd/0x80
[   32.061596]  ? kernfs_find_ns+0x68/0xf0
[   32.061907]  kernfs_remove_by_name_ns+0x3b/0x80
[   32.062266]  device_del+0x14a/0x370
[   32.062548]  ? unregister_mem_sect_under_nodes+0x4f/0xc0
[   32.062973]  device_unregister+0x9/0x20
[   32.063285]  unregister_memory_section+0x69/0xa0
[   32.063651]  __remove_pages+0x112/0x460
[   32.063949]  arch_remove_memory+0x6f/0xa0
[   32.064271]  __remove_memory+0xab/0x130
[   32.064579]  ? walk_memory_range+0xa1/0xe0
[   32.064907]  acpi_memory_device_remove+0x67/0xe0
[   32.065274]  acpi_bus_trim+0x50/0x90
[   32.065560]  acpi_device_hotplug+0x2fa/0x3e0
[   32.065900]  acpi_hotplug_work_fn+0x15/0x20
[   32.066249]  process_one_work+0x2a0/0x650
[   32.066591]  worker_thread+0x34/0x3d0
[   32.066925]  ? process_one_work+0x650/0x650
[   32.067275]  kthread+0x118/0x130
[   32.067542]  ? kthread_create_on_node+0x60/0x60
[   32.067909]  ret_from_fork+0x3a/0x50

>
> I wonder if that ordering prevents going too deep into the
> device_unregister() call stack that you highlighted below.
>
>
> >
> > Here is the problem:
> >
> > When we offline pages we have the following call stack:
> >
> > # echo offline > /sys/devices/system/memory/memory8/state
> > ksys_write
> >  vfs_write
> >   __vfs_write
> >    kernfs_fop_write
> >     kernfs_get_active
> >      lock_acquire                       kn->count#122 (lock for
> > "memory8/state" kn)
> >     sysfs_kf_write
> >      dev_attr_store
> >       state_store
> >        device_offline
> >         memory_subsys_offline
> >          memory_block_action
> >           offline_pages
> >            __offline_pages
> >             percpu_down_write
> >              down_write
> >               lock_acquire              mem_hotplug_lock.rw_sem
> >
> > When we unbind dax0.0 we have the following  stack:
> > # echo dax0.0 > /sys/bus/dax/drivers/kmem/unbind
> > drv_attr_store
> >  unbind_store
> >   device_driver_detach
> >    device_release_driver_internal
> >     dev_dax_kmem_remove
> >      remove_memory                      device_hotplug_lock
> >       try_remove_memory                 mem_hotplug_lock.rw_sem
> >        arch_remove_memory
> >         __remove_pages
> >          __remove_section
> >           unregister_memory_section
> >            remove_memory_section        mem_sysfs_mutex
> >             unregister_memory
> >              device_unregister
> >               device_del
> >                device_remove_attrs
> >                 sysfs_remove_groups
> >                  sysfs_remove_group
> >                   remove_files
> >                    kernfs_remove_by_name
> >                     kernfs_remove_by_name_ns
> >                      __kernfs_remove    kn->count#122
> >
> > So, lockdep found the ordering issue with the above two stacks:
> >
> > 1. kn->count#122 -> mem_hotplug_lock.rw_sem
> > 2. mem_hotplug_lock.rw_sem -> kn->count#122

Download attachment "x86.config.bz2" of type "application/x-bzip" (24701 bytes)

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ