Message-ID: <b879d191-a07c-e808-e48f-2b9bd8ba4fa3@nvidia.com>
Date: Fri, 27 Dec 2019 13:56:48 -0800
From: John Hubbard <jhubbard@...dia.com>
To: Leon Romanovsky <leon@...nel.org>
CC: Jason Gunthorpe <jgg@...pe.ca>,
Andrew Morton <akpm@...ux-foundation.org>,
Al Viro <viro@...iv.linux.org.uk>,
Alex Williamson <alex.williamson@...hat.com>,
Benjamin Herrenschmidt <benh@...nel.crashing.org>,
Björn Töpel <bjorn.topel@...el.com>,
Christoph Hellwig <hch@...radead.org>,
Dan Williams <dan.j.williams@...el.com>,
Daniel Vetter <daniel@...ll.ch>,
Dave Chinner <david@...morbit.com>,
David Airlie <airlied@...ux.ie>,
"David S . Miller" <davem@...emloft.net>,
Ira Weiny <ira.weiny@...el.com>, Jan Kara <jack@...e.cz>,
Jens Axboe <axboe@...nel.dk>, Jonathan Corbet <corbet@....net>,
Jérôme Glisse <jglisse@...hat.com>,
Magnus Karlsson <magnus.karlsson@...el.com>,
Mauro Carvalho Chehab <mchehab@...nel.org>,
Michael Ellerman <mpe@...erman.id.au>,
Michal Hocko <mhocko@...e.com>,
Mike Kravetz <mike.kravetz@...cle.com>,
Paul Mackerras <paulus@...ba.org>,
Shuah Khan <shuah@...nel.org>,
Vlastimil Babka <vbabka@...e.cz>, <bpf@...r.kernel.org>,
<dri-devel@...ts.freedesktop.org>, <kvm@...r.kernel.org>,
<linux-block@...r.kernel.org>, <linux-doc@...r.kernel.org>,
<linux-fsdevel@...r.kernel.org>, <linux-kselftest@...r.kernel.org>,
<linux-media@...r.kernel.org>, <linux-rdma@...r.kernel.org>,
<linuxppc-dev@...ts.ozlabs.org>, <netdev@...r.kernel.org>,
<linux-mm@...ck.org>, LKML <linux-kernel@...r.kernel.org>,
Maor Gottlieb <maorg@...lanox.com>,
"Ran Rozenstein" <ranro@...lanox.com>
Subject: Re: [PATCH v11 00/25] mm/gup: track dma-pinned pages: FOLL_PIN
On 12/24/19 9:26 PM, Leon Romanovsky wrote:
...
>>>> The branch is here (I just tested it and it seems healthy):
>>>>
>>>> git@...hub.com:johnhubbard/linux.git pin_user_pages_tracking_v11_with_diags
>>>
>>> Hi,
>>>
>>> We tested the following branch, and here are the results:
>>
>> Thanks for this testing run!
>>
>>> [root@...ver consume_mtts]# (master) $ grep foll_pin /proc/vmstat
>>> nr_foll_pin_requested 0
>>> nr_foll_pin_returned 0
>>>
>>
>> Zero pinned pages!
>
> Maybe we are missing some CONFIG_* option?
> https://lore.kernel.org/linux-rdma/12a28917-f8c9-5092-2f01-92bb74714cae@nvidia.com/T/#mf900896f5dfc86cdee9246219990c632ed77115f
Ah OK, it must be that CONFIG_DEBUG_VM is not set, thanks!
>> ...now I'm confused. Somehow FOLL_PIN and pin_user_pages*() calls are
>> not happening. And although the backtraces below show some of my new
>> routines (like try_grab_page), they also confirm the above: there is no
>> pin_user_page*() call in the stack.
>>
>> In particular, it looks like ib_umem_get() is calling through to
>> get_user_pages*(), rather than pin_user_pages*(). I don't see how this
>> is possible, because the code on my screen shows ib_umem_get() calling
>> pin_user_pages_fast().
>>
It must be that pin_user_pages_fast() is in the call stack, but just not getting
printed (most likely inlined, or elided as a tail call). There's no other way to
explain this.
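(That would also be consistent with the backtraces: pin_user_pages_fast() in
this series is just a thin wrapper that ORs in FOLL_PIN and then tail-calls the
common fast-path routine, so the unwinder can skip right over it. Roughly,
paraphrasing from memory, so the exact diags-branch code may differ slightly:)

int pin_user_pages_fast(unsigned long start, int nr_pages,
			unsigned int gup_flags, struct page **pages)
{
	/* FOLL_GET and FOLL_PIN are mutually exclusive */
	if (WARN_ON_ONCE(gup_flags & FOLL_GET))
		return -EINVAL;

	gup_flags |= FOLL_PIN;
	return internal_get_user_pages_fast(start, nr_pages, gup_flags, pages);
}

...and internal_get_user_pages_fast() is exactly what shows up in the traces.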
>> Any thoughts or ideas are welcome here.
>>
>> However, glossing over all of that and assuming that the new
>> GUP_PIN_COUNTING_BIAS of 256 is applied, it's interesting that we still
>> see any overflow. I'm less confident now that this is a true refcount
>> overflow.
>
> Earlier in this email thread, I posted a possible function call chain which
> doesn't involve a refcount overflow, but for some reason the refcount
> overflow was chosen as the avenue to explore.
>
Well, both of the WARN() calls fire because the refcount went negative
(strictly speaking, one checks for "< 0" and the other for "<= 0"). So it's
pretty hard to talk our way out of a refcount overflow here.
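To spell that out, the two splats (mm/gup.c:61 and mm/gup.c:150) boil down to
checks of roughly this shape. This is paraphrased from memory, and the diags
branch adds more instrumentation around them, so the exact lines may differ:

/*
 * gup-fast path: the check behind the mm/gup.c:61 splat, inlined into
 * try_grab_compound_head():
 */
static inline struct page *try_get_compound_head(struct page *page, int refs)
{
	struct page *head = compound_head(page);

	if (WARN_ON_ONCE(page_ref_count(head) < 0))
		return NULL;
	if (unlikely(!page_cache_add_speculative(head, refs)))
		return NULL;
	return head;
}

/*
 * slow path: the mm/gup.c:150 splat, reached via follow_trans_huge_pmd()
 * in the second trace below:
 */
bool __must_check try_grab_page(struct page *page, unsigned int flags)
{
	if (flags & FOLL_GET)
		return try_get_page(page);
	else if (flags & FOLL_PIN) {
		page = compound_head(page);

		if (WARN_ON_ONCE(page_ref_count(page) <= 0))
			return false;
		/* a pin bumps _refcount by the bias, not by 1: */
		page_ref_add(page, GUP_PIN_COUNTING_BIAS);
	}
	return true;
}

Both of those only fire if page->_refcount is already at or below zero at the
moment we try to take the reference.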
>>
>> Also, any information that would get me closer to being able to attempt
>> my own reproduction of the problem are *very* welcome. :)
>
> It is an ancient verification test (~10 years old), and it is not an easy
> task to make it understandable and standalone :).
>
Is this the only test that fails, btw? No other test failures or hints of
problems?
(Also, maybe hopeless, but can *anyone* on the RDMA list provide some
characterization of the test, such as how many pins per page, what page
sizes are used? I'm still hoping to write a test to trigger something
close to this...)
I do have a couple more ideas for test runs:
1. Reduce GUP_PIN_COUNTING_BIAS to 1 (a rough sketch of the idea follows below).
That would turn the whole override of page->_refcount into a no-op, and so if
all is well (it may not be!) with the rest of the patch, then we'd expect this
problem not to reappear.
2. Activate the /proc/vmstat *foll_pin* statistics unconditionally (just for these
tests, of course), so we can see if there is a get/put mismatch. However, that
will change the timing, and so it must be attempted independently of (1), in
order to see whether it ends up hiding the repro.
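For (1), I just mean something along these lines in mm.h, so that the bias
becomes invisible in page->_refcount (illustrative value only):

/*
 * With FOLL_PIN, each pin does page_ref_add(page, GUP_PIN_COUNTING_BIAS)
 * and each unpin subtracts the same amount, so defining the bias as 1
 * makes a pin look exactly like a plain get_page()/put_page() pair as far
 * as page->_refcount is concerned.
 */
#define GUP_PIN_COUNTING_BIAS 1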
I've updated this branch to implement (1), but not (2). Hoping you can give
this one a spin?
git@...hub.com:johnhubbard/linux.git pin_user_pages_tracking_v11_with_diags
thanks,
--
John Hubbard
NVIDIA
>>
>>> [root@...er consume_mtts]# (master) $ dmesg
>>> [ 425.221459] ------------[ cut here ]------------
>>> [ 425.225894] WARNING: CPU: 1 PID: 6738 at mm/gup.c:61 try_grab_compound_head+0x90/0xa0
>>> [ 425.228021] Modules linked in: mlx5_ib mlx5_core mlxfw mlx4_ib mlx4_en ptp pps_core mlx4_core bonding ip6_gre ip6_tunnel tunnel6 ip_gre gre ip_tunnel rdma_rxe ip6_udp_tunnel udp_tunnel rdma_ucm ib_uverbs ib_ipoib ib_umad ib_srp scsi_transport_srp rpcrdma ib_iser libiscsi scsi_transport_iscsi rdma_cm iw_cm ib_cm ib_core [last unloaded: mlxfw]
>>> [ 425.235266] CPU: 1 PID: 6738 Comm: consume_mtts Tainted: G O 5.5.0-rc2+ #1
>>> [ 425.237480] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
>>> [ 425.239738] RIP: 0010:try_grab_compound_head+0x90/0xa0
>>> [ 425.241170] Code: 06 48 8d 4f 34 f0 0f b1 57 34 74 cd 85 c0 74 cf 8d 14 06 f0 0f b1 11 74 c0 eb f1 8d 14 06 f0 0f b1 11 74 b5 85 c0 75 f3 eb b5 <0f> 0b 31 c0 c3 90 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 41
>>> [ 425.245739] RSP: 0018:ffffc900006878a8 EFLAGS: 00010082
>>> [ 425.247124] RAX: 0000000080000001 RBX: 00007f780488a000 RCX: 0000000000000bb0
>>> [ 425.248956] RDX: ffffea000e031087 RSI: 0000000000008a00 RDI: ffffea000dc58000
>>> [ 425.250761] RBP: ffffea000e031080 R08: ffffc90000687974 R09: 000fffffffe00000
>>> [ 425.252661] R10: 0000000000000000 R11: ffff888362560000 R12: 000000000000008a
>>> [ 425.254487] R13: 80000003716000e7 R14: 00007f780488a000 R15: ffffc90000687974
>>> [ 425.256309] FS: 00007f780d9d3740(0000) GS:ffff8883b1c80000(0000) knlGS:0000000000000000
>>> [ 425.258401] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> [ 425.259949] CR2: 0000000002334048 CR3: 000000039c68c001 CR4: 00000000001606a0
>>> [ 425.261884] Call Trace:
>>> [ 425.262735] gup_pgd_range+0x517/0x5a0
>>> [ 425.263819] internal_get_user_pages_fast+0x210/0x250
>>> [ 425.265193] ib_umem_get+0x298/0x550 [ib_uverbs]
>>> [ 425.266476] mr_umem_get+0xc9/0x260 [mlx5_ib]
>>> [ 425.267699] mlx5_ib_reg_user_mr+0xcc/0x7e0 [mlx5_ib]
>>> [ 425.269134] ? xas_load+0x8/0x80
>>> [ 425.270074] ? xa_load+0x48/0x90
>>> [ 425.271038] ? lookup_get_idr_uobject.part.10+0x12/0x70 [ib_uverbs]
>>> [ 425.272757] ib_uverbs_reg_mr+0x127/0x280 [ib_uverbs]
>>> [ 425.274120] ib_uverbs_handler_UVERBS_METHOD_INVOKE_WRITE+0xc2/0xf0 [ib_uverbs]
>>> [ 425.276058] ib_uverbs_cmd_verbs.isra.6+0x5be/0xbe0 [ib_uverbs]
>>> [ 425.277657] ? uverbs_disassociate_api+0xd0/0xd0 [ib_uverbs]
>>> [ 425.279155] ? __alloc_pages_nodemask+0x148/0x2b0
>>> [ 425.280445] ib_uverbs_ioctl+0xc0/0x120 [ib_uverbs]
>>> [ 425.281755] do_vfs_ioctl+0x9d/0x650
>>> [ 425.282766] ksys_ioctl+0x70/0x80
>>> [ 425.283745] __x64_sys_ioctl+0x16/0x20
>>> [ 425.284912] do_syscall_64+0x42/0x130
>>> [ 425.285973] entry_SYSCALL_64_after_hwframe+0x44/0xa9
>>> [ 425.287377] RIP: 0033:0x7f780d2df267
>>> [ 425.288449] Code: b3 66 90 48 8b 05 19 3c 2c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d e9 3b 2c 00 f7 d8 64 89 01 48
>>> [ 425.293073] RSP: 002b:00007ffce49a88a8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
>>> [ 425.295034] RAX: ffffffffffffffda RBX: 00007ffce49a8938 RCX: 00007f780d2df267
>>> [ 425.296895] RDX: 00007ffce49a8920 RSI: 00000000c0181b01 RDI: 0000000000000003
>>> [ 425.298689] RBP: 00007ffce49a8900 R08: 0000000000000003 R09: 00007f780d9a1010
>>> [ 425.300480] R10: 00000000ffffffff R11: 0000000000000246 R12: 00007f780d9a1150
>>> [ 425.302290] R13: 00007ffce49a8900 R14: 00007ffce49a8ad8 R15: 00007f780468a000
>>> [ 425.304113] ---[ end trace 1ecbefdb403190dd ]---
>>> [ 425.305434] ------------[ cut here ]------------
>>> [ 425.307147] WARNING: CPU: 1 PID: 6738 at mm/gup.c:150 try_grab_page+0x56/0x60
>>> [ 425.309111] Modules linked in: mlx5_ib mlx5_core mlxfw mlx4_ib mlx4_en ptp pps_core mlx4_core bonding ip6_gre ip6_tunnel tunnel6 ip_gre gre ip_tunnel rdma_rxe ip6_udp_tunnel udp_tunnel rdma_ucm ib_uverbs ib_ipoib ib_umad ib_srp scsi_transport_srp rpcrdma ib_iser libiscsi scsi_transport_iscsi rdma_cm iw_cm ib_cm ib_core [last unloaded: mlxfw]
>>> [ 425.316461] CPU: 1 PID: 6738 Comm: consume_mtts Tainted: G W O 5.5.0-rc2+ #1
>>> [ 425.318582] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
>>> [ 425.320958] RIP: 0010:try_grab_page+0x56/0x60
>>> [ 425.322167] Code: 7e 28 f0 81 47 34 00 01 00 00 c3 48 8b 47 08 48 8d 50 ff a8 01 48 0f 45 fa 8b 47 34 85 c0 7e 0f f0 ff 47 34 b8 01 00 00 00 c3 <0f> 0b 31 c0 c3 0f 0b 31 c0 c3 0f 1f 44 00 00 41 57 41 56 41 55 41
>>> [ 425.326814] RSP: 0018:ffffc90000687830 EFLAGS: 00010282
>>> [ 425.328226] RAX: 0000000000000001 RBX: ffffea000dc58000 RCX: ffffea000e031087
>>> [ 425.330104] RDX: 0000000080000001 RSI: 0000000000040000 RDI: ffffea000dc58000
>>> [ 425.331980] RBP: 00007f7804800000 R08: 000ffffffffff000 R09: 80000003716000e7
>>> [ 425.333898] R10: ffff88834af80120 R11: ffff8883ac16f000 R12: ffff88834af80120
>>> [ 425.335704] R13: ffff88837c0915c0 R14: 0000000000050201 R15: 00007f7804800000
>>> [ 425.337638] FS: 00007f780d9d3740(0000) GS:ffff8883b1c80000(0000) knlGS:0000000000000000
>>> [ 425.339734] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> [ 425.341369] CR2: 0000000002334048 CR3: 000000039c68c001 CR4: 00000000001606a0
>>> [ 425.343160] Call Trace:
>>> [ 425.343967] follow_trans_huge_pmd+0x16f/0x2e0
>>> [ 425.345263] follow_p4d_mask+0x51c/0x630
>>> [ 425.346344] __get_user_pages+0x1a1/0x6c0
>>> [ 425.347463] internal_get_user_pages_fast+0x17b/0x250
>>> [ 425.348918] ib_umem_get+0x298/0x550 [ib_uverbs]
>>> [ 425.350174] mr_umem_get+0xc9/0x260 [mlx5_ib]
>>> [ 425.351383] mlx5_ib_reg_user_mr+0xcc/0x7e0 [mlx5_ib]
>>> [ 425.352849] ? xas_load+0x8/0x80
>>> [ 425.353776] ? xa_load+0x48/0x90
>>> [ 425.354730] ? lookup_get_idr_uobject.part.10+0x12/0x70 [ib_uverbs]
>>> [ 425.356410] ib_uverbs_reg_mr+0x127/0x280 [ib_uverbs]
>>> [ 425.357843] ib_uverbs_handler_UVERBS_METHOD_INVOKE_WRITE+0xc2/0xf0 [ib_uverbs]
>>> [ 425.359749] ib_uverbs_cmd_verbs.isra.6+0x5be/0xbe0 [ib_uverbs]
>>> [ 425.361405] ? uverbs_disassociate_api+0xd0/0xd0 [ib_uverbs]
>>> [ 425.362898] ? __alloc_pages_nodemask+0x148/0x2b0
>>> [ 425.364206] ib_uverbs_ioctl+0xc0/0x120 [ib_uverbs]
>>> [ 425.365564] do_vfs_ioctl+0x9d/0x650
>>> [ 425.366567] ksys_ioctl+0x70/0x80
>>> [ 425.367537] __x64_sys_ioctl+0x16/0x20
>>> [ 425.368698] do_syscall_64+0x42/0x130
>>> [ 425.369782] entry_SYSCALL_64_after_hwframe+0x44/0xa9
>>> [ 425.371117] RIP: 0033:0x7f780d2df267
>>> [ 425.372159] Code: b3 66 90 48 8b 05 19 3c 2c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d e9 3b 2c 00 f7 d8 64 89 01 48
>>> [ 425.376774] RSP: 002b:00007ffce49a88a8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
>>> [ 425.378740] RAX: ffffffffffffffda RBX: 00007ffce49a8938 RCX: 00007f780d2df267
>>> [ 425.380598] RDX: 00007ffce49a8920 RSI: 00000000c0181b01 RDI: 0000000000000003
>>> [ 425.382411] RBP: 00007ffce49a8900 R08: 0000000000000003 R09: 00007f780d9a1010
>>> [ 425.384312] R10: 00000000ffffffff R11: 0000000000000246 R12: 00007f780d9a1150
>>> [ 425.386132] R13: 00007ffce49a8900 R14: 00007ffce49a8ad8 R15: 00007f780468a000
>>> [ 425.387964] ---[ end trace 1ecbefdb403190de ]---
>>>