lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:   Mon, 10 Oct 2022 20:09:38 +0000
From:   "Edgecombe, Rick P" <rick.p.edgecombe@...el.com>
To:     "songliubraving@...com" <songliubraving@...com>
CC:     "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        "peterz@...radead.org" <peterz@...radead.org>,
        "Kernel-team@...com" <Kernel-team@...com>,
        "linux-mm@...ck.org" <linux-mm@...ck.org>,
        "song@...nel.org" <song@...nel.org>, "hch@....de" <hch@....de>,
        "x86@...nel.org" <x86@...nel.org>,
        "akpm@...ux-foundation.org" <akpm@...ux-foundation.org>,
        "Hansen, Dave" <dave.hansen@...el.com>,
        "urezki@...il.com" <urezki@...il.com>
Subject: Re: [RFC v2 4/4] vmalloc_exec: share a huge page with kernel text

On Mon, 2022-10-10 at 19:08 +0000, Song Liu wrote:
> > On Oct 10, 2022, at 11:32 AM, Edgecombe, Rick P <
> > rick.p.edgecombe@...el.com> wrote:
> > 
> > On Fri, 2022-10-07 at 16:43 -0700, Song Liu wrote:
> > > On x86 kernel, we allocate 2MB pages for kernel text up to
> > > round_down(_etext, 2MB). Therefore, some of the kernel text is
> > > still
> > > on 4kB pages. With vmalloc_exec, we can allocate 2MB pages up to
> > > round_up(_etext, 2MB), and use the rest of the page for modules
> > > and
> > > BPF programs.
> > > 
> > > Here is an example:
> > > 
> > > [root@...50-1 ~]# grep _etext /proc/kallsyms
> > > ffffffff82202a08 T _etext
> > > 
> > > [root@...50-1 ~]# grep bpf_prog_ /proc/kallsyms  | tail -n 3
> > > ffffffff8220f920 t
> > > bpf_prog_cc61a5364ac11d93_handle__sched_wakeup       [bpf]
> > > ffffffff8220fa28 t
> > > bpf_prog_cc61a5364ac11d93_handle__sched_wakeup_new   [bpf]
> > > ffffffff8220fad4 t
> > > bpf_prog_3bf73fa16f5e3d92_handle__sched_switch       [bpf]
> > > 
> > > [root@...50-1 ~]#  grep 0xffffffff82200000
> > > /sys/kernel/debug/page_tables/kernel
> > > 0xffffffff82200000-
> > > 0xffffffff82400000     2M     ro   PSE         x  pmd
> > > 
> > > [root@...50-1 ~]# grep xfs_flush_inodes /proc/kallsyms
> > > ffffffff822ba910 t xfs_flush_inodes_worker      [xfs]
> > > ffffffff822bc580 t xfs_flush_inodes     [xfs]
> > > 
> > > ffffffff82200000-ffffffff82400000 is a 2MB page, serving kernel
> > > text,
> > > xfs
> > > module, and bpf programs.
> > 
> > Can this memory range be freed as part of a vfree_exec() call then?
> > Does vmalloc actually try to unmap it? If so, it could get
> > complicated
> > with PTI.
> > 
> > It probably should be a special case that never gets fully freed.
> 
> Right, this is never freed. 

Can we get a comment somewhere highlighting how this is avoided?

Maybe this is just me missing some vmalloc understanding, but this
pointer to an all zero vm_struct seems weird too. Are there other vmap
allocations like this? Which vmap APIs work with this and which don't?

> 
> > 
> > > 
> > > Signed-off-by: Song Liu <song@...nel.org>
> > > ---
> > > arch/x86/mm/init_64.c |  3 ++-
> > > mm/vmalloc.c          | 24 ++++++++++++++++++++++++
> > > 2 files changed, 26 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
> > > index 0fe690ebc269..d94f196c541a 100644
> > > --- a/arch/x86/mm/init_64.c
> > > +++ b/arch/x86/mm/init_64.c
> > > @@ -1367,12 +1367,13 @@ int __init
> > > deferred_page_init_max_threads(const struct cpumask
> > > *node_cpumask)
> > > 
> > > int kernel_set_to_readonly;
> > > 
> > > +#define PMD_ALIGN(x)        (((unsigned long)(x) + (PMD_SIZE -
> > > 1)) &
> > > PMD_MASK)
> > > void mark_rodata_ro(void)
> > > {
> > >       unsigned long start = PFN_ALIGN(_text);
> > >       unsigned long rodata_start = PFN_ALIGN(__start_rodata);
> > >       unsigned long end = (unsigned
> > > long)__end_rodata_hpage_align;
> > > -    unsigned long text_end = PFN_ALIGN(_etext);
> > > +    unsigned long text_end = PMD_ALIGN(_etext);
> > 
> > This should probably have more logic and adjustments. If etext is
> > PMD
> > aligned, some of the stuff outside the diff won't do anything.
> 
> Hmm.. I don't quite follow this comment. If the etext is PMD
> aligned, 
> we can still use vmalloc_exec to allocate memory. So it shouldn't 
> matter, no?

Maybe this doesn't matter since PMD alignment must happen naturally
sometimes. I was just noticing the attempts to operate on this region
between etext and start_rodata (free_init_pages(), etc). If this was
never not PMD aligned they could be dropped. But if you are going to
adjust the behavior for !CONFIG_MODULES, etc, then it is still needed.

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ