Message-ID: <CAADnVQJ8-XVYb21bFRgsaoj7hzd89NSbSOBj0suwsYSL89pxsg@mail.gmail.com>
Date: Tue, 25 Jan 2022 14:48:30 -0800
From: Alexei Starovoitov <alexei.starovoitov@...il.com>
To: Song Liu <song@...nel.org>
Cc: Song Liu <songliubraving@...com>,
Ilya Leoshkevich <iii@...ux.ibm.com>,
bpf <bpf@...r.kernel.org>,
Network Development <netdev@...r.kernel.org>,
LKML <linux-kernel@...r.kernel.org>,
Alexei Starovoitov <ast@...nel.org>,
Daniel Borkmann <daniel@...earbox.net>,
Andrii Nakryiko <andrii@...nel.org>,
Kernel Team <Kernel-team@...com>,
Peter Zijlstra <peterz@...radead.org>, X86 ML <x86@...nel.org>
Subject: Re: [PATCH v6 bpf-next 6/7] bpf: introduce bpf_prog_pack allocator
On Tue, Jan 25, 2022 at 2:25 PM Song Liu <song@...nel.org> wrote:
>
> On Tue, Jan 25, 2022 at 12:00 PM Alexei Starovoitov
> <alexei.starovoitov@...il.com> wrote:
> >
> > On Mon, Jan 24, 2022 at 11:21 PM Song Liu <song@...nel.org> wrote:
> > >
> > > On Mon, Jan 24, 2022 at 9:21 PM Alexei Starovoitov
> > > <alexei.starovoitov@...il.com> wrote:
> > > >
> > > > On Mon, Jan 24, 2022 at 10:27 AM Song Liu <songliubraving@...com> wrote:
> > > > > >
> > > > > > Are arches expected to allocate rw buffers in different ways? If not,
> > > > > > I would consider putting this into the common code as well. Then
> > > > > > arch-specific code would do something like
> > > > > >
> > > > > > header = bpf_jit_binary_alloc_pack(size, &prg_buf, &prg_addr, ...);
> > > > > > ...
> > > > > > /*
> > > > > > * Generate code into prg_buf, the code should assume that its first
> > > > > > * byte is located at prg_addr.
> > > > > > */
> > > > > > ...
> > > > > > bpf_jit_binary_finalize_pack(header, prg_buf);
> > > > > >
> > > > > > where bpf_jit_binary_finalize_pack() would copy prg_buf to header and
> > > > > > free it.
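
For illustration, the arch side of that flow would look roughly like this
(only the two function names come from the proposal above; the argument
lists, emit_prog() and the error handling are placeholders, not a settled
API):

    struct bpf_binary_header *header;
    u8 *prg_buf;    /* RW scratch buffer the JIT emits into    */
    u8 *prg_addr;   /* final RO address the code will run from */

    header = bpf_jit_binary_alloc_pack(size, &prg_buf, &prg_addr);
    if (!header)
            return -ENOMEM;

    /*
     * Emit into prg_buf, but compute every absolute address
     * (calls, extable entries, prog->bpf_func) against prg_addr,
     * where the code will actually execute.
     */
    emit_prog(prg_buf, prg_addr);

    /* copy prg_buf into the RO region and free the RW buffer */
    bpf_jit_binary_finalize_pack(header, prg_buf);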
> > > >
> > > > It feels right, but bpf_jit_binary_finalize_pack() sounds 100% arch
> > > > dependent. The only thing it will do is perform a copy via text_poke.
> > > > What else?
> > > >
> > > > > I think this should work.
> > > > >
> > > > > We will need an API like: bpf_arch_text_copy, which uses text_poke_copy()
> > > > > for x86_64 and s390_kernel_write() for s390. We will use bpf_arch_text_copy
> > > > > to
> > > > > 1) write header->size;
> > > > > 2) do the final copy in bpf_jit_binary_finalize_pack().
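
A sketch of what that helper could look like (the bpf_arch_text_copy name
is from the message above; the signature, the __weak default and the exact
error handling are assumptions for illustration only):

    /* kernel/bpf/core.c: default for archs without a text-copy primitive */
    void * __weak bpf_arch_text_copy(void *dst, void *src, size_t len)
    {
            return ERR_PTR(-ENOTSUPP);
    }

    /* arch/x86/net/bpf_jit_comp.c: route the copy through text_poke_copy() */
    void *bpf_arch_text_copy(void *dst, void *src, size_t len)
    {
            void *ret = text_poke_copy(dst, src, len);

            return ret ? ret : ERR_PTR(-EINVAL);
    }

    /* s390 would wrap s390_kernel_write() the same way */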
> > > >
> > > > we can combine all text_poke operations into one.
> > > >
> > > > Can we add an 'image' pointer into struct bpf_binary_header ?
> > >
> > > There is a 4-byte hole in bpf_binary_header. How about we put
> > > image_offset there? Actually we only need 2 bytes for offset.
> > >
> > > > Then do:
> > > > int bpf_jit_binary_alloc_pack(size, &ro_hdr, &rw_hdr);
> > > >
> > > > ro_hdr->image would be the address used to compute offsets by JIT.
> > >
> > > If we only do one text_poke(), we cannot write ro_hdr->image yet. We
> > > can use ro_hdr + rw_hdr->image_offset instead.
> >
> > Good points.
> > Maybe let's go back to Ilya's suggestion and return 4 pointers
> > from bpf_jit_binary_alloc_pack ?
>
> How about we use image_offset, like:
>
> struct bpf_binary_header {
> u32 size;
> u32 image_offset;
> u8 image[] __aligned(BPF_IMAGE_ALIGNMENT);
> };
>
> Then we can use
>
> image = (void *)header + header->image_offset;
I'm not excited about it, since it leaks header details into JITs.
Looks like we don't need the JIT to be aware of it.
How about we do random() % roundup(sizeof(struct bpf_binary_header), 64)
to pick the image start, and fill the gap between the end of
struct bpf_binary_header and the image start with 'int 3'.
This way we can completely hide binary_header inside generic code.
The bpf_jit_binary_alloc_pack() would return ro_image and rw_image only.
And JIT would pass them back into bpf_jit_binary_finalize_pack().
From the image pointer it would be trivial to get back to the binary_header
by masking off the low bits (& ~63).
The 128-byte offset that we use today was chosen arbitrarily.
We were burning a whole page for a single program, so a 128-byte zone
at the front was ok.
Now we will be packing progs rounded up to 64 bytes, so it's better
to avoid wasting those 128 bytes regardless.
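
Concretely, the generic allocator could do something like this (a sketch
only: the 64-byte chunk constant reflects the packing granularity discussed
above, but the helper names, get_random_u32() and the exact math are just
illustrative, not settled code):

    #define BPF_PROG_CHUNK_SIZE     64

    /* pick a random image start inside the first chunk, past the header */
    static u8 *pick_image_start(struct bpf_binary_header *ro_header,
                                unsigned int alignment)
    {
            u32 hole, start;

            /* room between the hidden header and the end of the chunk */
            hole = BPF_PROG_CHUNK_SIZE - sizeof(*ro_header);
            /* random offset, rounded down to the JIT's (power-of-2) alignment */
            start = (get_random_u32() % hole) & ~(alignment - 1);

            /* the gap in front of the image is pre-filled with 'int 3' */
            return &ro_header->image[start];
    }

    /* generic code can always recover the header from the image pointer */
    static struct bpf_binary_header *image_to_hdr(const u8 *image)
    {
            return (void *)((unsigned long)image & ~(BPF_PROG_CHUNK_SIZE - 1UL));
    }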