lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAPcyv4i-sW9gu4nrRvvb24=uAUQms9=+Yx5=EQSj+CxpmoNkSw@mail.gmail.com>
Date:   Wed, 6 Feb 2019 18:48:18 -0800
From:   Dan Williams <dan.j.williams@...el.com>
To:     Doug Ledford <dledford@...hat.com>
Cc:     Jason Gunthorpe <jgg@...pe.ca>, Dave Chinner <david@...morbit.com>,
        Christopher Lameter <cl@...ux.com>,
        Matthew Wilcox <willy@...radead.org>, Jan Kara <jack@...e.cz>,
        Ira Weiny <ira.weiny@...el.com>,
        lsf-pc@...ts.linux-foundation.org,
        linux-rdma <linux-rdma@...r.kernel.org>,
        Linux MM <linux-mm@...ck.org>,
        Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
        John Hubbard <jhubbard@...dia.com>,
        Jerome Glisse <jglisse@...hat.com>,
        Michal Hocko <mhocko@...nel.org>,
        linux-nvdimm <linux-nvdimm@...ts.01.org>
Subject: Re: [LSF/MM TOPIC] Discuss least bad options for resolving
 longterm-GUP usage by RDMA

On Wed, Feb 6, 2019 at 5:57 PM Doug Ledford <dledford@...hat.com> wrote:
[..]
> > > > Dave, you said the FS is responsible to arbitrate access to the
> > > > physical pages..
> > > >
> > > > Is it possible to have a filesystem for DAX that is more suited to
> > > > this environment? Ie designed to not require block reallocation (no
> > > > COW, no reflinks, different approach to ftruncate, etc)
> > >
> > > Can someone give me a real world scenario that someone is *actually*
> > > asking for with this?
> >
> > I'll point to this example. At the 6:35 mark Kodi talks about the
> > Oracle use case for DAX + RDMA.
> >
> > https://youtu.be/ywKPPIE8JfQ?t=395
>
> Thanks for the link, I'll review the panel.
>
> > Currently the only way to get this to work is to use ODP capable
> > hardware, or Device-DAX. Device-DAX is a facility to map persistent
> > memory statically through device-file. It's great for statically
> > allocated use cases, but loses all the nice things (provisioning,
> > permissions, naming) that a filesystem gives you. This debate is what
> > to do about non-ODP capable hardware and Filesystem-DAX facility. The
> > current answer is "no RDMA for you".
> >
> > > Are DAX users demanding xfs, or is it just the
> > > filesystem of convenience?
> >
> > xfs is the only Linux filesystem that supports DAX and reflink.
>
> Is it going to be clear from the link above why reflink + DAX + RDMA is
> a good/desirable thing?
>

No, unfortunately it will only clarify the DAX + RDMA use case, but
you don't need to look very far to see that the trend for storage
management is more COW / reflink / thin-provisioning etc in more
places. Users want the flexibility to be able delay, change, and
consolidate physical storage allocation decisions, otherwise
device-dax would have solved all these problems and we would not be
having this conversation.

> > > Do they need to stick with xfs?
> >
> > Can you clarify the motivation for that question?
>
> I did a little googling and research before I asked that question.
> According to the documentation, other FSes can work with DAX too (namely
> ext2 and ext4).  The question was more or less pondering whether or not
> ext2 or ext4 + RDMA + DAX would solve people's problems without the
> issues that xfs brings.

No, ext4 also supports hole punch, and the ext2 support is a toy. We
went through quite a bit of work to solve this problem for the
O_DIRECT pinned page case.

6b2bb7265f0b sched/wait: Introduce wait_var_event()
d6dc57e251a4 xfs, dax: introduce xfs_break_dax_layouts()
69eb5fa10eb2 xfs: prepare xfs_break_layouts() for another layout type
c63a8eae63d3 xfs: prepare xfs_break_layouts() to be called with
XFS_MMAPLOCK_EXCL
5fac7408d828 mm, fs, dax: handle layout changes to pinned dax mappings
b1f382178d15 ext4: close race between direct IO and ext4_break_layouts()
430657b6be89 ext4: handle layout changes to pinned DAX mappings
cdbf8897cb09 dax: dax_layout_busy_page() warn on !exceptional

So the fs is prepared to notify RDMA applications of the need to
evacuate a mapping (layout change), and the timeout to respond to that
notification can be configured by the administrator. The debate is
about what to do when the platform owner needs to get a mapping out of
the way in bounded time.

> >  This problem exists
> > for any filesystem that implements an mmap that where the physical
> > page backing the mapping is identical to the physical storage location
> > for the file data. I don't see it as an xfs specific problem. Rather,
> > xfs is taking the lead in this space because it has already deployed
> > and demonstrated that leases work for the pnfs4 block-server case, so
> > it seems logical to attempt to extend that case for non-ODP-RDMA.
> >
> > > Are they
> > > really trying to do COW backed mappings for the RDMA targets?  Or do
> > > they want a COW backed FS but are perfectly happy if the specific RDMA
> > > targets are *not* COW and are statically allocated?
> >
> > I would expect the COW to be broken at registration time. Only ODP
> > could possibly support reflink + RDMA. So I think this devolves the
> > problem back to just the "what to do about truncate/punch-hole"
> > problem in the specific case of non-ODP hardware combined with the
> > Filesystem-DAX facility.
>
> If that's the case, then we are back to EBUSY *could* work (despite the
> objections made so far).

I linked it in my response to Jason [1], but the entire reason ext2,
ext4, and xfs scream "experimental" when DAX is enabled is because DAX
makes typical flows fail that used to work in the page-cache backed
mmap case. The failure of a data space management command like
fallocate(punch_hole) is more risky than just not allowing the memory
registration to happen in the first place. Leases result in a system
that has a chance at making forward progress.

The current state of disallowing RDMA for FS-DAX is one of the "if
(dax) goto fail;" conditions that needs to be solved before filesystem
developers graduate DAX from experimental status.

[1]: https://lists.01.org/pipermail/linux-nvdimm/2019-February/019884.html

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ