Message-ID: <CAHS8izNgmwjwyTyFzXWKrM==nTO0CJEW3+mUoKmtYjPushL5-g@mail.gmail.com>
Date:   Mon, 14 Nov 2022 11:14:26 -0800
From:   Mina Almasry <almasrymina@...gle.com>
To:     Pankaj Gupta <pankaj.gupta.linux@...il.com>
Cc:     Dan Williams <dan.j.williams@...el.com>,
        Michael Sammler <sammler@...gle.com>,
        Vishal Verma <vishal.l.verma@...el.com>,
        Dave Jiang <dave.jiang@...el.com>,
        Ira Weiny <ira.weiny@...el.com>,
        Pasha Tatashin <pasha.tatashin@...een.com>,
        nvdimm@...ts.linux.dev, linux-kernel@...r.kernel.org
Subject: Re: [PATCH v1] virtio_pmem: populate numa information

On Sun, Nov 13, 2022 at 9:44 AM Pankaj Gupta
<pankaj.gupta.linux@...il.com> wrote:
>
> > > Pankaj Gupta wrote:
> > > > > > > Compute the numa information for a virtio_pmem device from the memory
> > > > > > > range of the device. Previously, the target_node was always 0 since
> > > > > > > the ndr_desc.target_node field was never explicitly set. The code for
> > > > > > > computing the numa node is taken from cxl_pmem_region_probe in
> > > > > > > drivers/cxl/pmem.c.
> > > > > > >
> > > > > > > Signed-off-by: Michael Sammler <sammler@...gle.com>
> >
> > Tested-by: Mina Almasry <almasrymina@...gle.com>
> >
> > I don't have much expertise on this driver, but with the help of this
> > patch I was able to get memory tiering [1] emulation going on qemu. As
> > far as I know there is no alternative to this emulation, and so I
> > would love to see this or equivalent merged, if possible.
> >
> > This is what I have going to get memory tiering emulation:
> >
> > In qemu, added these configs:
> >       -object memory-backend-file,id=m4,share=on,mem-path="$path_to_virtio_pmem_file",size=2G \
> >       -smp 2,sockets=2,maxcpus=2  \
> >       -numa node,nodeid=0,memdev=m0 \
> >       -numa node,nodeid=1,memdev=m1 \
> >       -numa node,nodeid=2,memdev=m2,initiator=0 \
> >       -numa node,nodeid=3,initiator=0 \
> >       -device virtio-pmem-pci,memdev=m4,id=nvdimm1 \
> >
> > On boot, ran these commands:
> >     ndctl_static create-namespace -e namespace0.0 -m devdax -f > /dev/null 2>&1
> >     echo dax0.0 > /sys/bus/dax/drivers/device_dax/unbind
> >     echo dax0.0 > /sys/bus/dax/drivers/kmem/new_id
> >     for i in `ls /sys/devices/system/memory/`; do
> >       state=$(cat "/sys/devices/system/memory/$i/state" 2>/dev/null)
> >       if [ "$state" = "offline" ]; then
> >         echo online_movable > "/sys/devices/system/memory/$i/state"
> >       fi
> >     done
>
> Nice to see a way to handle the virtio-pmem device memory through the kmem
> driver and online the corresponding memory blocks to 'zone_movable'.
>
> This also opens a way to use this memory range directly, irrespective of the
> attached block device. Of course there won't be any persistent data guarantee,
> but it is a good way to simulate memory tiering inside the guest, as
> demonstrated below.
> >
> > Without this CL, I see the memory always onlined in node 0, and it is
> > not a separate memory tier. With this CL and the qemu configs above, the
> > memory is onlined in node 3 and is set as a separate memory tier, which
> > enables qemu-based development:
> >
> > ==> /sys/devices/virtual/memory_tiering/memory_tier22/nodelist <==
> > 3
> > ==> /sys/devices/virtual/memory_tiering/memory_tier4/nodelist <==
> > 0-2
> >
> > AFAIK there is no alternative to enabling memory tiering emulation in
> > qemu, and would love to see this or equivalent merged, if possible.
>
> Just wondering if Qemu vNVDIMM device can also achieve this?
>

I spent a few minutes on this. Please note I'm really not familiar
with these drivers, but as far as I can tell the qemu vNVDIMM device
has the same problem and needs a fix similar to what Michael did
here. What I did with the vNVDIMM qemu device:

- Added these qemu configs:
      -object memory-backend-file,id=m4,share=on,mem-path=./hello,size=2G,readonly=off \
      -device nvdimm,id=nvdimm1,memdev=m4,unarmed=off \

- Ran the same commands as in my previous email (they seem to apply to
the vNVDIMM device without modification):
    ndctl_static create-namespace -e namespace0.0 -m devdax -f > /dev/null 2>&1
    echo dax0.0 > /sys/bus/dax/drivers/device_dax/unbind
    echo dax0.0 > /sys/bus/dax/drivers/kmem/new_id
    for i in `ls /sys/devices/system/memory/`; do
      state=$(cat "/sys/devices/system/memory/$i/state" 2>/dev/null)
      if [ "$state" = "offline" ]; then
        echo online_movable > "/sys/devices/system/memory/$i/state"
      fi
    done

I see the memory from the vNVDIMM device get onlined on node 0, and it is
not detected as a separate memory tier. I suspect that driver needs a
fix similar to this one.
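
In case it helps anyone reproduce this, here is a rough sketch of how the
result can be checked from inside the guest. The function names and the
optional directory argument are made up for illustration. The sketch assumes
the memory tier nodelist files under /sys/devices/virtual/memory_tiering and
the per-node memoryN links under /sys/devices/system/node are present (i.e. a
kernel built with memory tiers and NUMA support).

```shell
#!/bin/sh
# Sketch only: report memory tiers and per-node online memory blocks.
# Function names and the optional root-directory arguments are
# illustrative assumptions, not part of any existing tool.

# Print "<tier>: <nodelist>" for every memory tier found under $1
# (defaults to the sysfs memory tiering directory).
show_tiers() {
    root="${1:-/sys/devices/virtual/memory_tiering}"
    for f in "$root"/memory_tier*/nodelist; do
        # Skip the unexpanded glob when no tier exists.
        [ -r "$f" ] || continue
        tier=$(basename "$(dirname "$f")")
        printf '%s: %s\n' "$tier" "$(cat "$f")"
    done
}

# Print "<node>: <count> online blocks" for every NUMA node under $1
# (defaults to the sysfs node directory).
show_online_blocks() {
    root="${1:-/sys/devices/system/node}"
    for n in "$root"/node[0-9]*; do
        [ -d "$n" ] || continue
        count=0
        for s in "$n"/memory[0-9]*/state; do
            [ -r "$s" ] || continue
            if [ "$(cat "$s")" = "online" ]; then
                count=$((count + 1))
            fi
        done
        printf '%s: %s online blocks\n' "$(basename "$n")" "$count"
    done
}
```

Calling show_tiers and show_online_blocks with no arguments reads the real
sysfs tree. With the vNVDIMM setup above I would expect all online blocks to
be reported under node 0 and no separate tier for the device memory.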

> In any case, this patch is useful, So,
> Reviewed-by: Pankaj Gupta <pankaj.gupta@....com>
>
> >
> >
> >
> > [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/ABI/testing/sysfs-kernel-mm-memory-tiers
> >
> > > > > > > ---
> > > > > > >  drivers/nvdimm/virtio_pmem.c | 11 +++++++++--
> > > > > > >  1 file changed, 9 insertions(+), 2 deletions(-)
> > > > > > >
> > > > > > > diff --git a/drivers/nvdimm/virtio_pmem.c b/drivers/nvdimm/virtio_pmem.c
> > > > > > > index 20da455d2ef6..a92eb172f0e7 100644
> > > > > > > --- a/drivers/nvdimm/virtio_pmem.c
> > > > > > > +++ b/drivers/nvdimm/virtio_pmem.c
> > > > > > > @@ -32,7 +32,6 @@ static int init_vq(struct virtio_pmem *vpmem)
> > > > > > >  static int virtio_pmem_probe(struct virtio_device *vdev)
> > > > > > >  {
> > > > > > >         struct nd_region_desc ndr_desc = {};
> > > > > > > -       int nid = dev_to_node(&vdev->dev);
> > > > > > >         struct nd_region *nd_region;
> > > > > > >         struct virtio_pmem *vpmem;
> > > > > > >         struct resource res;
> > > > > > > @@ -79,7 +78,15 @@ static int virtio_pmem_probe(struct virtio_device *vdev)
> > > > > > >         dev_set_drvdata(&vdev->dev, vpmem->nvdimm_bus);
> > > > > > >
> > > > > > >         ndr_desc.res = &res;
> > > > > > > -       ndr_desc.numa_node = nid;
> > > > > > > +
> > > > > > > +       ndr_desc.numa_node = memory_add_physaddr_to_nid(res.start);
> > > > > > > +       ndr_desc.target_node = phys_to_target_node(res.start);
> > > > > > > +       if (ndr_desc.target_node == NUMA_NO_NODE) {
> > > > > > > +               ndr_desc.target_node = ndr_desc.numa_node;
> > > > > > > +               dev_dbg(&vdev->dev, "changing target node from %d to %d",
> > > > > > > +                       NUMA_NO_NODE, ndr_desc.target_node);
> > > > > > > +       }
> > > > > >
> > > > > > As this memory later gets hotplugged using "devm_memremap_pages". I don't
> > > > > > see if 'target_node' is used for fsdax case?
> > > > > >
> > > > > > It seems to me "target_node" is used mainly for volatile range above
> > > > > > persistent memory ( e.g kmem driver?).
> > > > > >
> > > > > I am not sure if 'target_node' is used in the fsdax case, but it is
> > > > > indeed used by the devdax/kmem driver when hotplugging the memory (see
> > > > > 'dev_dax_kmem_probe' and '__dax_pmem_probe').
> > > >
> > > > Yes, but not currently for FS_DAX iiuc.
> > >
> > > The target_node is only used by the dax_kmem driver. In the FSDAX case
> > > the memory (persistent or otherwise) is mapped behind a block-device.
> > > That block-device has affinity to a CPU initiator, but that memory does
> > > not itself have any NUMA affinity or identity as a target.
> > >
> > > So:
> > >
> > > block-device NUMA node == closest CPU initiator node to the device
> > >
> > > dax-device target node == memory only NUMA node target, after onlining
