Message-ID: <20240726084809.gdz2axvawwwekpu6@oppo.com>
Date: Fri, 26 Jul 2024 16:48:09 +0800
From: Hailong Liu <hailong.liu@...o.com>
To: Baoquan He <bhe@...hat.com>
CC: Barry Song <21cnbao@...il.com>, Andrew Morton <akpm@...ux-foundation.org>,
Uladzislau Rezki <urezki@...il.com>, Christoph Hellwig <hch@...radead.org>,
Lorenzo Stoakes <lstoakes@...il.com>, Vlastimil Babka <vbabka@...e.cz>,
Michal Hocko <mhocko@...e.com>, Matthew Wilcox <willy@...radead.org>,
Tangquan Zheng <zhengtangquan@...o.com>, <linux-mm@...ck.org>,
<linux-kernel@...r.kernel.org>
Subject: Re: [RFC PATCH v2] mm/vmalloc: fix incorrect
__vmap_pages_range_noflush() if vm_area_alloc_pages() from high order
fallback to order0
On Fri, 26. Jul 16:37, Baoquan He wrote:
> On 07/26/24 at 05:29pm, Barry Song wrote:
> > On Fri, Jul 26, 2024 at 5:04 PM Hailong Liu <hailong.liu@...o.com> wrote:
> > >
> > > On Fri, 26. Jul 12:00, Hailong Liu wrote:
> > > > On Fri, 26. Jul 10:31, Baoquan He wrote:
> > > > [...]
> > > > > > > The logic of this patch is somewhat similar to my first one. If the high-order
> > > > > > > allocation fails, it falls back to normal mapping.
> > > > > > >
> > > > > > > However, I also save the fallback position. The pages before this position are
> > > > > > > used for huge mapping, and the ones >= position for normal mapping, as Barry said:
> > > > > > > "support the combination of PMD and PTE mapping". This will take some
> > > > > > > time, as it needs to address the corner cases and do some tests.
> > > > >
> > > > > Hmm, we may not need to worry about the imperfect mapping. Currently
> > > > > there are two places setting VM_ALLOW_HUGE_VMAP: __kvmalloc_node_noprof()
> > > > > and vmalloc_huge().
> > > > >
> > > > > > vmalloc_huge() is called in the three interfaces below, which are all
> > > > > > invoked during boot. Basically they can succeed in getting the required contiguous
> > > > > > physical memory. I guess that's why Tangquan only spotted this issue on kvmalloc
> > > > > > invocations when the required size exceeds e.g. 2M. For kvmalloc_node(),
> > > > > > we have stated in the code comment above __kvmalloc_node_noprof() that
> > > > > > it's a best-effort behaviour.
> > > > >
> > > > Take __vmalloc_node_range(2.1M, VM_ALLOW_HUGE_VMAP) as an example.
> > > > Because of the huge-mapping alignment requirement, the real size is 4M.
> > > > If the first order-9 allocation succeeds and the next one fails, then because of
> > > > the fallback the page layout would be: one order-9 block + 512 order-0 pages.
> > > > The order-9 block supports huge mapping, but the order-0 pages do not.
> > > > With the patch above, vmap_small_pages_range_noflush() would be called to do a
> > > > normal mapping, so the huge mapping would not exist.
> > > >
> > > > > mm/mm_init.c <<alloc_large_system_hash>>
> > > > > table = vmalloc_huge(size, gfp_flags);
> > > > > net/ipv4/inet_hashtables.c <<inet_pernet_hashinfo_alloc>>
> > > > > new_hashinfo->ehash = vmalloc_huge(ehash_entries * sizeof(struct inet_ehash_bucket),
> > > > > net/ipv4/udp.c <<udp_pernet_table_alloc>>
> > > > > udptable->hash = vmalloc_huge(hash_entries * 2 * sizeof(struct udp_hslot)
> > > > >
> > > > > > Maybe we should add a code comment or documentation to let people know that
> > > > > > contiguous physical pages are not guaranteed for vmalloc_huge() if it is
> > > > > > used after boot.
> > > > >
> > > > > >
> > > > > > IMO, the draft can fix the current issue, it also does not have significant side
> > > > > > effects. Barry, what do you think about this patch? If you think it's okay,
> > > > > > I will split this patch into two: one to remove the VM_ALLOW_HUGE_VMAP and the
> > > > > > other to address the current mapping issue.
> > > > > >
> > > > > > --
> > > > > > help you, help me,
> > > > > > Hailong.
> > > > > >
> > > > >
> > > > >
> > > I checked the code. The issue only happens when gfp_mask contains __GFP_NOFAIL and
> > > the allocation falls back to order-0. Actually, without this commit
> > > e9c3cda4d86e ("mm, vmalloc: fix high order __GFP_NOFAIL allocations"),
> > > if the __vmalloc_area_node allocation fails, it will goto fail and try order-0.
> > >
> > > fail:
> > > 	if (shift > PAGE_SHIFT) {
> > > 		shift = PAGE_SHIFT;
> > > 		align = real_align;
> > > 		size = real_size;
> > > 		goto again;
> > > 	}
> > >
> > > So do we really need to fall back to order-0 in the nofail case?
> >
> > Good catch, this is what I missed. I feel we can revert Michal's fix
> > and just remove the __GFP_NOFAIL bit while we are still allocating
> > at high order. When "goto again" happens, we will allocate at
> > order-0, and in that case we keep __GFP_NOFAIL.
>
> With Michal's patch, the fallback will be able to satisfy the allocation
> for the nofail case because it falls back to order-0 plus __GFP_NOFAIL. The
Hi Baoquan:
int __vmap_pages_range_noflush(unsigned long addr, unsigned long end,
		pgprot_t prot, struct page **pages, unsigned int page_shift)
{
	unsigned int i, nr = (end - addr) >> PAGE_SHIFT;

	WARN_ON(page_shift < PAGE_SHIFT);

	if (!IS_ENABLED(CONFIG_HAVE_ARCH_HUGE_VMALLOC) ||
			page_shift == PAGE_SHIFT)
		return vmap_small_pages_range_noflush(addr, end, prot, pages);

	for (i = 0; i < nr; i += 1U << (page_shift - PAGE_SHIFT)) { ---> huge mapping
		int err;

		err = vmap_range_noflush(addr, addr + (1UL << page_shift),
					page_to_phys(pages[i]), prot, ---------> incorrect mapping would occur here if nofail and fallback to order0
					page_shift);
		if (err)
			return err;

		addr += 1UL << page_shift;
	}

	return 0;
}
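
To make the problem concrete, here is a minimal userspace sketch (not kernel
code) of what the huge-mapping loop above effectively assumes. It reuses the 4M
layout from earlier in the thread: one order-9 block followed by 512 order-0
pages. struct fake_page, count_mismapped(), fill_fallback_layout() and the
physical addresses are all made up for illustration, and PMD_SHIFT = 21 assumes
4K pages with 2M huge mappings.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT	12
#define PMD_SHIFT	21	/* assumption: 2M huge mapping with 4K base pages */
#define PAGES_PER_HUGE	(1u << (PMD_SHIFT - PAGE_SHIFT))	/* 512 */

/* Hypothetical stand-in for struct page: only the physical address matters here. */
struct fake_page {
	uint64_t phys;
};

/*
 * Walk pages[] with the same stride as the huge-mapping loop and count the
 * 4K slots that would end up backed by the wrong physical address, i.e. the
 * slots where pages[i + j] is not located at page_to_phys(pages[i]) + j * 4K.
 */
static size_t count_mismapped(const struct fake_page *pages, size_t nr,
			      unsigned int page_shift)
{
	size_t step, bad = 0;

	assert(page_shift >= PAGE_SHIFT);	/* mirrors the WARN_ON() above */
	step = (size_t)1 << (page_shift - PAGE_SHIFT);

	for (size_t i = 0; i < nr; i += step) {
		uint64_t base = pages[i].phys;

		for (size_t j = 0; j < step && i + j < nr; j++) {
			if (pages[i + j].phys != base + ((uint64_t)j << PAGE_SHIFT))
				bad++;
		}
	}
	return bad;
}

/* Fill 2 * PAGES_PER_HUGE entries: one order-9 block, then 512 order-0 pages. */
static void fill_fallback_layout(struct fake_page *pages)
{
	/* First 2M: physically contiguous, as an order-9 allocation would be. */
	for (size_t i = 0; i < PAGES_PER_HUGE; i++)
		pages[i].phys = 0x40000000ull + ((uint64_t)i << PAGE_SHIFT);

	/* Second 2M: order-0 pages, deliberately non-contiguous (every other frame). */
	for (size_t i = 0; i < PAGES_PER_HUGE; i++)
		pages[PAGES_PER_HUGE + i].phys =
			0x80000000ull + ((uint64_t)(2 * i) << PAGE_SHIFT);
}
```

With this made-up layout, walking the array with page_shift == PMD_SHIFT leaves
every slot of the second 2M except the first one pointing at the wrong physical
page, while walking it with page_shift == PAGE_SHIFT maps each page on its own
and gets none wrong. That is the difference between the huge-mapping loop and
vmap_small_pages_range_noflush() for the fallback case.
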
> 'if (shift > PAGE_SHIFT)' conditional checking and handling may be
> problematic since it could jump to fail because the vmap_pages_range()
> invocation failed, or after partially allocating huge pages and breaking down;
> then it will ignore the already allocated pages and do everything again.
>
> The only part of the 'if (shift > PAGE_SHIFT)' checking and handling that makes
> sense is that it falls back to real_size and real_align. BUT we need to handle
> the failures separately, e.g.
> 1) __get_vm_area_node() failed;
> 2) vm_area_alloc_pages() failed when shift > PAGE_SHIFT and non-nofail;
> 3) vmap_pages_range() failed;
>
> Honestly, I didn't see where the nofail case is mishandled. Could you point
> it out specifically? I may have missed it.
>
> Thanks
> Baoquan
>
--
help you, help me,
Hailong.