linux-ext4 - RE: Subtle races between DAX mmap fault and write path

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CS1PR84MB0119ACB424699154BDA197B28E1B0@CS1PR84MB0119.NAMPRD84.PROD.OUTLOOK.COM>
Date:	Mon, 8 Aug 2016 12:30:18 +0000
From:	"Boylston, Brian" <brian.boylston@....com>
To:	Jan Kara <jack@...e.cz>
CC:	Dave Chinner <david@...morbit.com>,
	"Kani, Toshimitsu" <toshi.kani@....com>,
	"linux-nvdimm@...ts.01.org" <linux-nvdimm@...ts.01.org>,
	"xfs@....sgi.com" <xfs@....sgi.com>,
	"linux-fsdevel@...r.kernel.org" <linux-fsdevel@...r.kernel.org>,
	"linux-ext4@...r.kernel.org" <linux-ext4@...r.kernel.org>,
	Ross Zwisler <ross.zwisler@...ux.intel.com>
Subject: RE: Subtle races between DAX mmap fault and write path

Jan Kara wrote on 2016-08-08:
> On Fri 05-08-16 19:58:33, Boylston, Brian wrote:
>> Dave Chinner wrote on 2016-08-05:
>>> [ cut to just the important points ]
>>> On Thu, Aug 04, 2016 at 06:40:42PM +0000, Kani, Toshimitsu wrote:
>>>> On Tue, 2016-08-02 at 10:21 +1000, Dave Chinner wrote:
>>>>> If I drop the fsync from the
>>>>> buffered IO path, bandwidth remains the same but runtime drops to
>>>>> 0.55-0.57s, so again the buffered IO write path is faster than DAX
>>>>> while doing more work.
>>>> 
>>>> I do not think the test results are relevant on this point because both
>>>> buffered and dax write() paths use uncached copy to avoid clflush.  The
>>>> buffered path uses cached copy to the page cache and then use uncached copy to
>>>> PMEM via writeback.  Therefore, the buffered IO path also benefits from using
>>>> uncached copy to avoid clflush.
>>> 
>>> Except that I tested without the writeback path for buffered IO, so
>>> there was a direct comparison for single cached copy vs single
>>> uncached copy.
>>> 
>>> The undenial fact is that a write() with a single cached copy with
>>> all the overhead of dirty page tracking is /faster/ than a much
>>> shorter, simpler IO path that uses an uncached copy. That's what the
>>> numbers say....
>>> 
>>>> Cached copy (req movq) is slightly faster than uncached copy,
>>> 
>>> Not according to Boaz - he claims that uncached is 20% faster than
>>> cached. How about you two get together, do some benchmarking and get
>>> your story straight, eh?
>>> 
>>>> and should be
>>>> used for writing to the page cache.  For writing to PMEM, however, additional
>>>> clflush can be expensive, and allocating cachelines for PMEM leads to evict
>>>> application's cachelines.
>>> 
>>> I keep hearing people tell me why cached copies are slower, but
>>> no-one is providing numbers to back up their statements. The only
>>> numbers we have are the ones I've published showing cached copies w/
>>> full dirty tracking is faster than uncached copy w/o dirty tracking.
>>> 
>>> Show me the numbers that back up your statements, then I'll listen
>>> to you.
>> 
>> Here are some numbers for a particular scenario, and the code is below.
>> 
>> Time (in seconds) to copy a 16KiB buffer 1M times to a 4MiB NVDIMM buffer
>> (1M total memcpy()s).  For the cached+clflush case, the flushes are done
>> every 4MiB (which seems slightly faster than flushing every 16KiB):
>> 
>>                   NUMA local    NUMA remote
>> Cached+clflush      13.5           37.1
>> movnt                1.0            1.3
> 
> Thanks for the test Brian. But looking at the current source of libpmem
> this seems to be comparing apples to oranges. Let me explain the details
> below:
> 
>> In the code below, pmem_persist() does the CLFLUSH(es) on the given range,
>> and pmem_memcpy_persist() does non-temporal MOVs with an SFENCE:
> 
> Yes. libpmem does what you describe above and the name
> pmem_memcpy_persist() is thus currently misleading because it is not
> guaranteed to be persistent with the current implementation of DAX in
> the kernel.
> 
> It is important to know which kernel version and what filesystem have you
> used for the test to be able judge the details but generally pmem_persist()
> does properly tell the filesystem to flush all metadata associated with the
> file, commit open transactions etc. That's the full cost of persistence.

I used NVML 1.1 for the measurements.  In this version and with the hardware
that I used, the pmem_persist() flow is:

  pmem_persist()
    pmem_flush()
      Func_flush() == flush_clflush
        CLFLUSH
    pmem_drain()
      Func_predrain_fence() == predrain_fence_empty
        no-op

So, I don't think that pmem_persist() does anything to cause the filesystem
to flush metadata as it doesn't make any system calls?

> pmem_memcpy_persist() makes sure the data writes have reached persistent
> storage but nothing guarantees associated metadata changes have reached
> persistent storage as well.

While metadata is certainly important, my goal with this specific test was
to measure the "raw" performance of cached+flush vs uncached, without
anything else in the way.

> To assure that, fsync() (or pmem_persist()
> if you wish) is currently the only way from userspace.

Perhaps you mean pmem_msync() here?  pmem_msync() calls msync(), but
pmem_persist() does not.

> At which point
> you've lost most of the advantages using movnt. Ross researches into
> possibilities of allowing more efficient userspace implementation but
> currently there are none.

Apart from the current performance discussion, if the metadata for a file
is already established (file created, space allocated by explicit writes(),
and everything synced), then if I map it and do pmem_memcpy_persist(),
are there any "ongoing" metadata updates that would need to be flushed
(besides timestamps)?


Brian

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html