Message-ID: <2B66E2E7-7D32-418C-9DFD-1E17180300B4@fb.com>
Date: Tue, 11 Oct 2022 16:25:51 +0000
From: Song Liu <songliubraving@...a.com>
To: "Edgecombe, Rick P" <rick.p.edgecombe@...el.com>
CC: "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"peterz@...radead.org" <peterz@...radead.org>,
Kernel Team <Kernel-team@...com>,
"linux-mm@...ck.org" <linux-mm@...ck.org>,
"song@...nel.org" <song@...nel.org>, "hch@....de" <hch@....de>,
"x86@...nel.org" <x86@...nel.org>,
"akpm@...ux-foundation.org" <akpm@...ux-foundation.org>,
"Hansen, Dave" <dave.hansen@...el.com>,
"urezki@...il.com" <urezki@...il.com>
Subject: Re: [RFC v2 4/4] vmalloc_exec: share a huge page with kernel text
> On Oct 10, 2022, at 1:09 PM, Edgecombe, Rick P <rick.p.edgecombe@...el.com> wrote:
>
> On Mon, 2022-10-10 at 19:08 +0000, Song Liu wrote:
>>> On Oct 10, 2022, at 11:32 AM, Edgecombe, Rick P <
>>> rick.p.edgecombe@...el.com> wrote:
>>>
>>> On Fri, 2022-10-07 at 16:43 -0700, Song Liu wrote:
>>>> On x86 kernel, we allocate 2MB pages for kernel text up to
>>>> round_down(_etext, 2MB). Therefore, some of the kernel text is
>>>> still
>>>> on 4kB pages. With vmalloc_exec, we can allocate 2MB pages up to
>>>> round_up(_etext, 2MB), and use the rest of the page for modules
>>>> and
>>>> BPF programs.
>>>>
>>>> Here is an example:
>>>>
>>>> [root@...50-1 ~]# grep _etext /proc/kallsyms
>>>> ffffffff82202a08 T _etext
>>>>
>>>> [root@...50-1 ~]# grep bpf_prog_ /proc/kallsyms | tail -n 3
>>>> ffffffff8220f920 t
>>>> bpf_prog_cc61a5364ac11d93_handle__sched_wakeup [bpf]
>>>> ffffffff8220fa28 t
>>>> bpf_prog_cc61a5364ac11d93_handle__sched_wakeup_new [bpf]
>>>> ffffffff8220fad4 t
>>>> bpf_prog_3bf73fa16f5e3d92_handle__sched_switch [bpf]
>>>>
>>>> [root@...50-1 ~]# grep 0xffffffff82200000
>>>> /sys/kernel/debug/page_tables/kernel
>>>> 0xffffffff82200000-
>>>> 0xffffffff82400000 2M ro PSE x pmd
>>>>
>>>> [root@...50-1 ~]# grep xfs_flush_inodes /proc/kallsyms
>>>> ffffffff822ba910 t xfs_flush_inodes_worker [xfs]
>>>> ffffffff822bc580 t xfs_flush_inodes [xfs]
>>>>
>>>> ffffffff82200000-ffffffff82400000 is a 2MB page, serving kernel
>>>> text,
>>>> xfs
>>>> module, and bpf programs.
>>>
>>> Can this memory range be freed as part of a vfree_exec() call then?
>>> Does vmalloc actually try to unmap it? If so, it could get
>>> complicated
>>> with PTI.
>>>
>>> It probably should be a special case that never gets fully freed.
>>
>> Right, this is never freed.
>
> Can we get a comment somewhere highlighting how this is avoided?
vfree_exec() only frees a range once it contains a PMD-aligned sub-range
of at least PMD_SIZE. Since this range is smaller than PMD_SIZE, it is
never freed. I will add a comment about this in the next version.
>
> Maybe this is just me missing some vmalloc understanding, but this
> pointer to an all zero vm_struct seems weird too. Are there other vmap
> allocations like this? Which vmap APIs work with this and which don't?
There are two vmap trees at the moment: the free_area_ tree and the
vmap_area_ tree. The free_area_ tree uses vmap_area->subtree_max_size,
while the vmap_area_ tree contains vmap_areas backed by vm_struct, and
thus uses vmap_area->vm.
This set adds a new tree, free_text_area_. This tree is different from
the other two: it uses subtree_max_size, but it is also backed by
vm_struct. To handle this requirement without growing struct vmap_area,
we introduced all_text_vm to store the vm_struct for the
free_text_area_ tree.
The free_text_area_ tree is also different from the vmap_area_ tree.
Each vmap_area in the vmap_area_ tree has its own vm_struct (a 1:1
mapping), while multiple vmap_areas in the free_text_area_ tree map to
a single vm_struct. Also, free_text_area_ handles granularity smaller
than PAGE_SIZE, while the other two trees only work with
PAGE_SIZE-aligned memory.
Does this answer your questions?
>
>>
>>>
>>>>
>>>> Signed-off-by: Song Liu <song@...nel.org>
>>>> ---
>>>> arch/x86/mm/init_64.c | 3 ++-
>>>> mm/vmalloc.c | 24 ++++++++++++++++++++++++
>>>> 2 files changed, 26 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
>>>> index 0fe690ebc269..d94f196c541a 100644
>>>> --- a/arch/x86/mm/init_64.c
>>>> +++ b/arch/x86/mm/init_64.c
>>>> @@ -1367,12 +1367,13 @@ int __init
>>>> deferred_page_init_max_threads(const struct cpumask
>>>> *node_cpumask)
>>>>
>>>> int kernel_set_to_readonly;
>>>>
>>>> +#define PMD_ALIGN(x) (((unsigned long)(x) + (PMD_SIZE -
>>>> 1)) &
>>>> PMD_MASK)
>>>> void mark_rodata_ro(void)
>>>> {
>>>> unsigned long start = PFN_ALIGN(_text);
>>>> unsigned long rodata_start = PFN_ALIGN(__start_rodata);
>>>> unsigned long end = (unsigned
>>>> long)__end_rodata_hpage_align;
>>>> - unsigned long text_end = PFN_ALIGN(_etext);
>>>> + unsigned long text_end = PMD_ALIGN(_etext);
>>>
>>> This should probably have more logic and adjustments. If etext is
>>> PMD
>>> aligned, some of the stuff outside the diff won't do anything.
>>
>> Hmm.. I don't quite follow this comment. If the etext is PMD
>> aligned,
>> we can still use vmalloc_exec to allocate memory. So it shouldn't
>> matter, no?
>
> Maybe this doesn't matter since PMD alignment must happen naturally
> sometimes. I was just noticing the attempts to operate on this region
> between etext and start_rodata (free_init_pages(), etc). If this was
> never not PMD aligned they could be dropped. But if you are going to
> adjust the behavior for !CONFIG_MODULES, etc, then it is still needed.
I guess we can add special handling for the !CONFIG_MODULES &&
!CONFIG_BPF && !CONFIG_FTRACE case, in which we would not allocate this
memory at all.
Thanks,
Song