Message-ID: <CAPcyv4hi08KCQHFV0aorVmZZ0YXo=wGzsXbrnTSAySXirNjzrA@mail.gmail.com>
Date:   Wed, 26 Feb 2020 09:54:12 -0800
From:   Dan Williams <dan.j.williams@...el.com>
To:     Jan Kara <jack@...e.cz>
Cc:     Jonathan Halliday <jonathan.halliday@...hat.com>,
        Jeff Moyer <jmoyer@...hat.com>, Christoph Hellwig <hch@....de>,
        Dave Chinner <david@...morbit.com>,
        "Weiny, Ira" <ira.weiny@...el.com>,
        Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
        Alexander Viro <viro@...iv.linux.org.uk>,
        "Darrick J. Wong" <darrick.wong@...cle.com>,
        "Theodore Y. Ts'o" <tytso@....edu>,
        linux-ext4 <linux-ext4@...r.kernel.org>,
        linux-xfs <linux-xfs@...r.kernel.org>,
        linux-fsdevel <linux-fsdevel@...r.kernel.org>
Subject: Re: [PATCH V4 07/13] fs: Add locking for a dynamic address space
 operations state

On Wed, Feb 26, 2020 at 9:20 AM Jan Kara <jack@...e.cz> wrote:
>
> On Wed 26-02-20 08:46:42, Dan Williams wrote:
> > On Wed, Feb 26, 2020 at 1:29 AM Jonathan Halliday
> > <jonathan.halliday@...hat.com> wrote:
> > >
> > >
> > > Hi All
> > >
> > > I'm a middleware developer, focused on how Java (JVM) workloads can
> > > benefit from app-direct mode pmem. Initially the target is apps that
> > > need a fast binary log for fault tolerance: the classic database WAL use
> > > case; transaction coordination systems; enterprise message bus
> > > persistence and suchlike. Critically, there are cases where we use log
> > > based storage, i.e. it's not the strict 'read rarely, only on recovery'
> > > model that a classic db may have, but more of an 'append only, read many
> > > times' event stream model.
> > >
> > > Think of the log oriented data storage as having logical segments (let's
> > > implement them as files), of which the most recent is being appended to
> > > (read_write) and the remaining N-1 older segments are full and sealed,
> > > so effectively immutable (read_only) until discarded. The tail segment
> > > needs to be in DAX mode for optimal write performance, as the size of
> > > the append may be sub-block and we don't want the overhead of the kernel
> > > call anyhow. So that's clearly a good fit for putting on a DAX fs mount
> > > and using mmap with MAP_SYNC.
> > >
> > > However, we want fast read access into the segments, to retrieve stored
> > > records. The small access index can be built in volatile RAM (assuming
> > > we're willing to take the startup overhead of a full file scan at
> > > recovery time) but the data itself is big and we don't want to move it
> > > all off pmem. Which means the requirements are now different: we want
> > > the O/S cache to pull hot data into fast volatile RAM for us, which DAX
> > > explicitly won't do. Effectively a poor man's 'memory mode' pmem, rather
> > > than app-direct mode, except here we're using the O/S rather than the
> > > hardware memory controller to do the cache management for us.
> > >
> > > Currently this requires closing the full (read_write) file, then copying
> > > it to a non-DAX device and reopening it (read_only) there. Clearly
> > > that's expensive and rather tedious. Instead, I'd like to close the
> > > MAP_SYNC mmap, then, leaving the file where it is, reopen it in a mode
> > > that will instead go via the O/S cache in the traditional manner. Bonus
> > > points if I can do it over non-overlapping ranges in a file without
> > > closing the DAX mode mmap, since then the segments are entirely logical
> > > instead of needing separate physical files.
> >
> > Hi John,
> >
> > IIRC we chatted about this at PIRL, right?
> >
> > At the time it sounded more like mixed mode dax, i.e. dax writes, but
> > cached reads. To me that's an optimization to optionally use dax for
> > direct-I/O writes, with its existing set of page-cache coherence
> > warts, and not a capability to dynamically switch the dax-mode.
> > mmap+MAP_SYNC seems the wrong interface for this. This writeup
> > mentions bypassing kernel call overhead, but I don't see how a
> > dax-write syscall is cheaper than an mmap syscall plus fault. If
> > direct-I/O to a dax capable file bypasses the block layer, isn't that
> > about the maximum of kernel overhead that can be cut out of this use
> > case? Otherwise MAP_SYNC is a facility to achieve efficient sub-block
> > update-in-place writes not append writes.
>
> Well, even for appends you'll pay the cost only once per page (or maybe even
> once per huge page) when using MAP_SYNC. With a syscall you'll pay once per
> write. So although it would be good to check real numbers, the design isn't
> nonsensical to me.

True. Jonathan, how many writes per page are we talking about in this case?
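
To make the "once per page vs. once per write" comparison concrete, here is a rough counting sketch. It is only arithmetic, not a measurement: the function name, the 1 MiB log size, and the 256-byte record size are hypothetical, and it assumes purely sequential appends where each mapped page is faulted exactly once. It also ignores that a fault and a write syscall do not cost the same, and ignores the CPU cache flushes the MAP_SYNC path still needs.

```python
PAGE_SIZE = 4096                   # base page size in bytes (assumed)
HUGE_PAGE_SIZE = 2 * 1024 * 1024   # 2 MiB huge page (assumed)


def kernel_entries(total_bytes, record_size, page_size=PAGE_SIZE):
    """Count kernel entries for appending total_bytes in record_size chunks.

    Returns (writes, faults):
      writes - one write syscall per appended record
      faults - one page fault per page touched through the mapping
    """
    if total_bytes < 0 or record_size <= 0 or page_size <= 0:
        raise ValueError("sizes must be positive")
    # Integer ceiling division: -(-a // b) == ceil(a / b) for positive b.
    writes = -(-total_bytes // record_size)
    faults = -(-total_bytes // page_size)
    return writes, faults


if __name__ == "__main__":
    log_bytes = 1 << 20   # hypothetical 1 MiB segment
    record = 256          # hypothetical sub-block append size
    print(kernel_entries(log_bytes, record, PAGE_SIZE))
    print(kernel_entries(log_bytes, record, HUGE_PAGE_SIZE))
```

With those made-up numbers, the syscall path enters the kernel 4096 times, while the mapped path faults 256 times with 4 KiB pages or once with a 2 MiB huge page. The ratio is just page size over record size, so the answer to the question above decides how much the mapped path can save.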
