Message-ID: <e2f897f8-b6af-0d4a-6a69-b47da5564165@amazon.com>
Date:   Wed, 3 Feb 2021 16:47:20 +0200
From:   Gal Pressman <galpress@...zon.com>
To:     Jason Gunthorpe <jgg@...pe.ca>
CC:     <aarcange@...hat.com>, <akpm@...ux-foundation.org>,
        <gokhale2@...l.gov>, <hch@....de>, <jack@...e.cz>,
        <jannh@...gle.com>, <jhubbard@...dia.com>, <kirill@...temov.name>,
        <ktkhai@...tuozzo.com>, <leonro@...dia.com>,
        <linux-kernel@...r.kernel.org>, <linux-mm@...ck.org>,
        <mcfadden8@...l.gov>, <oleg@...hat.com>, <peterx@...hat.com>,
        <torvalds@...ux-foundation.org>, <wzam@...zon.com>,
        <yang.shi@...ux.alibaba.com>
Subject: Re: [PATCH 1/4] mm: Trial do_wp_page() simplification

On 03/02/2021 16:00, Jason Gunthorpe wrote:
> On Wed, Feb 03, 2021 at 02:43:58PM +0200, Gal Pressman wrote:
>>> On Tue, Feb 02, 2021 at 12:05:36PM -0500, Peter Xu wrote:
>>>
>>>>> Gal, you could also MADV_DONTFORK this range if you are explicitly
>>>>> allocating them via special mmap.
>>>>
>>>> Yeah I wanted to mention this one too but I just forgot when replying: the issue
>>>> thread previously pasted smells like some people would like to drop
>>>> MADV_DONTFORK, but if it's able to still be applied I don't know why
>>>> not..
>>>
>>> I want to drop the MADV_DONTFORK for dynamic data memory allocated by
>>> the application layers (eg with malloc) without knowledge of how they
>>> will be used.
>>>
>>> This case is a buffer internal to the communication system that we
>>> know at allocation time how it will be used; so an explicit,
>>> deliberate, MADV_DONTFORK is fine
>>
>> We are referring to libfabric's bounce buffers, correct?
>> Libfabric could be considered the "app" here; it's not clear why these
>> buffers should be DONTFORK'd before ibv_reg_mr() but others shouldn't.
> 
> I assumed they were internal to the EFA code itself.

The hugepage allocation is part of libfabric's generic bufpool implementation:
https://github.com/ofiwg/libfabric/blob/cde8665ca5ec2fb957260490d0c8700d8ac69863/include/linux/osd.h#L64

I guess we could madvise them at the libfabric provider's layer.

>> Anyway, it should be simple enough to madvise them after allocation, although I
>> think it's part of libfabric's generic code (which isn't necessarily used on
>> top of rdma-core).
> 
> Ah, so that is a reasonable justification for wanting to fix this in
> the kernel..
> 
> Let's give Peter some time first.
> 
> The other direction to validate this approach is to remove the
> MAP_HUGETLB flags and rely on THP instead, and/or mark them as
> MAP_SHARED.
> 
> I'm not sure generic code should be using MAP_HUGETLB..

It's using MAP_HUGETLB but has a fallback in case it fails:

		ret = ofi_alloc_hugepage_buf((void **) &buf_region->alloc_region,
					     pool->alloc_size);
		/* If we can't allocate huge pages, fall back to normal
		 * allocations for all future attempts.
		 */
		if (ret) {
			pool->attr.flags &= ~OFI_BUFPOOL_HUGEPAGES;
			goto retry;
		}
		buf_region->flags = OFI_BUFPOOL_HUGEPAGES;

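For illustration only, here is a minimal sketch of how the hugepage attempt,
the fallback to normal pages, and an explicit MADV_DONTFORK could be combined
in one allocation helper. This is not libfabric code: alloc_dontfork_buf() and
its used_hugepages out-parameter are hypothetical names, and the sketch assumes
the caller passes a length that is a multiple of the system's huge page size
(required for madvise() and munmap() on a MAP_HUGETLB mapping).

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * Hypothetical helper (not part of libfabric or rdma-core): map an anonymous
 * private buffer, preferring huge pages, then mark it MADV_DONTFORK so the
 * range is not inherited by a child across fork().
 *
 * Assumption: len is a non-zero multiple of the huge page size.
 * Returns NULL on failure; *used_hugepages reports which path succeeded.
 */
static void *alloc_dontfork_buf(size_t len, int *used_hugepages)
{
	int flags = MAP_PRIVATE | MAP_ANONYMOUS;
	void *buf;

	assert(len > 0);

	/* First attempt: explicit huge pages. */
	buf = mmap(NULL, len, PROT_READ | PROT_WRITE, flags | MAP_HUGETLB, -1, 0);
	*used_hugepages = (buf != MAP_FAILED);

	/* If huge pages are unavailable, fall back to a normal mapping. */
	if (buf == MAP_FAILED) {
		buf = mmap(NULL, len, PROT_READ | PROT_WRITE, flags, -1, 0);
		if (buf == MAP_FAILED)
			return NULL;
	}

	/* Explicit, deliberate DONTFORK on a buffer whose use is known. */
	if (madvise(buf, len, MADV_DONTFORK)) {
		munmap(buf, len);
		return NULL;
	}

	return buf;
}
```

A caller would do this before registering the region with ibv_reg_mr(), and
release it with munmap(buf, len) using the same length.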

> This would be enough to confirm that everything else is working as
> expected
Agree.