lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAAH8bW_GCviYSadmf5CUxgJixkXbq+SfL63ZJt7Lsm9OAmPjVQ@mail.gmail.com>
Date:   Tue, 14 Jun 2022 09:40:08 -0700
From:   Yury Norov <yury.norov@...il.com>
To:     wang.yi59@....com.cn, Andrew Morton <akpm@...ux-foundation.org>,
        linux-mm <linux-mm@...ck.org>
Cc:     andriy.shevchenko@...ux.intel.com, linux@...musvillemoes.dk,
        linux-kernel@...r.kernel.org, xue.zhihong@....com.cn,
        wang.liang82@....com.cn, Liu.Jianjun3@....com.cn,
        Yury Norov <yury.norov@...il.com>
Subject: Re: [PATCH] bitmap: fix a unproper remap when mpol_rebind_nodemask()

+ Andrew Morton <akpm@...ux-foundation.org>
+ linux-mm@...ck.org

On Mon, Jun 13, 2022 at 8:45 PM <wang.yi59@....com.cn> wrote:
>
> Hi Yury,
>
> Thanks for your quick and clear response!
>
> > On Mon, Jun 13, 2022 at 4:31 AM Yi Wang <wang.yi59@....com.cn> wrote:
> > >
> > > Consider one situation:
> > >
> > > The app have two vmas which mbind() to node 1 and node3 respectively,
> > > and its cpuset.mems is 0-3, now set its cpuset.mems to 1,3, according
> > > to current bitmap_remap(), we got:
> >
> > Regarding the original problem - can you please confirm that
> > it's reproduced on current kernels, show the execution path etc.
> > From what I see on modern kernel, the only user of nodes_remap()
> > is mpol_rebind_nodemask(). Is that the correct path?
>
> Yes, it's mpol_rebind_nodemask() calls nodes_remap() from
> mpol_rebind_policy(). The stacks are as follow:
> [  290.836747]  bitmap_remap+0x84/0xe0
> [  290.836753]  mpol_rebind_nodemask+0x64/0x2a0
> [  290.836764]  mpol_rebind_mm+0x3a/0x90
> [  290.836768]  update_tasks_nodemask+0x8a/0x1e0
> [  290.836774]  cpuset_write_resmask+0x563/0xa00
> [  290.836780]  cgroup_file_write+0x81/0x150
> [  290.836784]  kernfs_fop_write_iter+0x12d/0x1c0
> [  290.836791]  new_sync_write+0x109/0x190
> [  290.836800]  vfs_write+0x218/0x2a0
> [  290.836809]  ksys_write+0x59/0xd0
> [  290.836812]  do_syscall_64+0x37/0x80
> [  290.836818]  entry_SYSCALL_64_after_hwframe+0x46/0xb0
>
> To reproduce this situation, I write a program which seems like this:
>     unsigned int flags = MAP_PRIVATE | MAP_ANONYMOUS;
>     unsigned long size = 262144 << 12;
>     unsigned long node1 = 2;   // node 1
>     unsigned long node2 = 8;   // node 3
>
>     p1 = vma1 = mmap(NULL, size, PROT_READ | PROT_WRITE, flags, -1, 0);
>     p2 = vma2 = mmap(NULL, size, PROT_READ | PROT_WRITE, flags, -1, 0);
>
>     assert(!mbind(vma1, size, MPOL_BIND, &node1, MAX_NODES, MPOL_MF_STRICT | MPOL_MF_MOVE));
>     assert(!mbind(vma2, size, MPOL_BIND, &node2, MAX_NODES, MPOL_MF_STRICT | MPOL_MF_MOVE));
>
> Start the program whos name is mbind_tester, and do follow steps:
>
>         mkdir && cd /sys/fs/cgroup/cpuset/mbind
>         echo 0-31 > cpuset.cpus
>         echo 0-3 > cpuset.mems
>
>         cat /proc/`pidof mbind_tester`/numa_maps |grep bind -w
>         7ff73e200000 bind:3 anon=262144 dirty=262144 active=0 N3=262144 kernelpagesize_kB=4
>         7ff77e200000 bind:1 anon=262144 dirty=262144 active=0 N1=262144 kernelpagesize_kB=4
>
>         echo 1,3 > cpuset.mems
>         cat /proc/`pidof mbind_tester`/numa_maps |grep bind -w
>         7ff73e200000 bind:3 anon=262144 dirty=262144 active=0 N3=262144 kernelpagesize_kB=4
>         7ff77e200000 bind:3 anon=262144 dirty=262144 active=0 N1=262144 kernelpagesize_kB=4
>
> As you see, after set cpuset.mems to 1,3, the nodes which one of vma
> binded to changed from 1 to 3.
>
> This maybe confused, the original nodes binded is 1, after modify
> cpuset.mems to 1,3 which include the node 3, it changed to 3...

Ok, thanks for the reproducer. I'll take a look at it closer to the weekend.

> > Anyways, as per name, bitmap_remap() is intended to change bit
> > positions, and it doesn't look wrong if it does so.
> >
> > This is not how the function is supposed to work. For example,
> >         old: 00111000
> >         new: 00011100
> >
> > means:
> >         old: 00111 000
> >              || \\\|||
> >         new: 000 11100
> >
> > And after this patch it would be:
> >         old: 001 11000
> >              || \|||||
> >         new: 000 11100
> >
> > Which is not the same, right?
>
> Right.

So, we both agree that bitmap_remap() works as advertised. This is
good. Let's try figuring out a solution without touching it.

> Actually this is what makes me embarrassed. If we want to fix this
> situtation, we can:
>
> - change the bitmap_remap() as this patch did, but this changed the
> behavior of this routine which looks does the right thing. One good
> news is this function is only called by mpol_rebind_nodemask().

There are users of bitmap_remap() in drivers/gpio/gpio-xilinx.c

> - don't change the bitmap_remap(), to be honest, I didn't figure out
> a way. Any suggestions?

I haven't had a chance to play with it (because of my dayjob), but I
have a strong feeling that the proper solution should come from
existing functionality.

Did you experiment with MPOL_F_{STATIC,RELATIVE}_NODES?
Those flags enable nodes_and() and mpol_relative_nodemask()
paths correspondingly.

> > If mpol_rebind() wants to keep previous relations, then according to
> > the comment:
> >  * The positions of unset bits in @old are mapped to themselves
> >  * (the identify map).
> >
> > , you can just clear @old bits that already have good relations
> > you'd like to preserve.
>
> Actually this does not work for me :)

What I suggested is:

 328                 node_clear(1, pol->w.cpuset_mems_allowed);
 329                 node_clear(3, pol->w.cpuset_mems_allowed);
 330                 nodes_remap(tmp, pol->nodes, pol->w.cpuset_mems_allowed,
 331                                                                 *nodes);
 332                 pol->w.cpuset_mems_allowed = *nodes;

> In the example above, if set cpuset.mems to 0,2 firstly, the nodes
> binds will change from 1 to 2. And then set cpuset.mems to 1,3, it will
> change from 2 to 3 again.

I bet you can find a sequence that will finally give you the desired binding.
And probably somebody does this black magic in production. For me it
looks just scary. Can you try those static/relative flags, and if it doesn't
work, we'll have to invent another policy for nodes binding.

Adding Andrew and linux-mm, as it's definitely beyond bitmaps scope.

Thanks,
Yury

> ---
> Best wishes
> Yi Wang

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ