linux-ext4 - Re: Subtle races between DAX mmap fault and write path

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <1470335997.8908.128.camel@hpe.com>
Date:	Thu, 4 Aug 2016 18:40:42 +0000
From:	"Kani, Toshimitsu" <toshi.kani@....com>
To:	"david@...morbit.com" <david@...morbit.com>,
	"boaz@...xistor.com" <boaz@...xistor.com>
CC:	"linux-ext4@...r.kernel.org" <linux-ext4@...r.kernel.org>,
	"jack@...e.cz" <jack@...e.cz>,
	"linux-nvdimm@...ts.01.org" <linux-nvdimm@...ts.01.org>,
	"xfs@....sgi.com" <xfs@....sgi.com>,
	"linux-fsdevel@...r.kernel.org" <linux-fsdevel@...r.kernel.org>
Subject: Re: Subtle races between DAX mmap fault and write path

On Tue, 2016-08-02 at 10:21 +1000, Dave Chinner wrote:
> On Mon, Aug 01, 2016 at 01:13:45PM +0300, Boaz Harrosh wrote:
> > 
> > On 07/30/2016 03:12 AM, Dave Chinner wrote:
> > <>
> > > If we track the dirty blocks from write in the radix tree like we
> > > for mmap, then we can just use a normal memcpy() in dax_do_io(),
> > > getting rid of the slow cache bypass that is currently run. Radix
> > > tree updates are much less expensive than a slow memcpy of large
> > > amounts of data, ad fsync can then take care of persistence, just
> > > like we do for mmap.
> > > 
> >
> > No! 
> > 
> > mov_nt instructions, That "slow cache bypass that is currently run" above
> > is actually faster then cached writes by 20%, and if you add the dirty
> > tracking and cl_flush instructions it becomes x2 slower in the most
> > optimal case and 3 times slower in the DAX case.
>
> IOWs, we'd expect writing to a file with DAX to be faster than when
> buffered through the page cache and fsync()d, right?
>
> The numbers I get say otherwise. Filesystem on 8GB pmem block device:
> 
> $ sudo mkfs.xfs -f /dev/pmem1
> meta-data=/dev/pmem1             isize=512    agcount=4, agsize=524288 blks
>          =                       sectsz=4096  attr=2, projid32bit=1
>          =                       crc=1        finobt=1, sparse=0, rmapbt=0,
> reflink=0
> data     =                       bsize=4096   blocks=2097152, imaxpct=25
>          =                       sunit=0      swidth=0 blks
> naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
> log      =internal log           bsize=4096   blocks=2560, version=2
>          =                       sectsz=4096  sunit=1 blks, lazy-count=1
> realtime =none                   extsz=4096   blocks=0, rtextents=0
> 
> Test command that writes 1GB to the filesystem:
> 
> $ sudo time xfs_io -f -c "pwrite 0 1g" -c "sync" /mnt/scratch/foo
> wrote 1073741824/1073741824 bytes at offset 0
> 1 GiB, 262144 ops; 0:00:01.00 (880.040 MiB/sec and 225290.3317 ops/sec)
> 0.02user 1.13system 0:02.27elapsed 51%CPU (0avgtext+0avgdata
> 2344maxresident)k
> 0inputs+0outputs (0major+109minor)pagefaults 0swaps
> 
> Results:
> 
> 	    pwrite B/W (MiB/s)			runtime
> run	no DAX		DAX		no DAX		DA
> X
>  1	880.040		236.352		2.27s		
> 4.34s
>  2	857.094		257.297		2.18s		
> 3.99s
>  3	865.820		236.087		2.13s		
> 4.34s
> 
> It is quite clear that *DAX is much slower* than normal buffered
> IO through the page cache followed by a fsync().
> 
> Stop and think why that might be. We're only doing one copy with
> DAX, so why is the pwrite() speed 4x lower than for a copy into the
> page cache? We're not copying 4x the data here. We're copying it
> once. But there's another uncached write to each page during
> allocation to zero each block first, so we're actually doing two
> uncached writes to the page. And we're doing an allocation per page
> with DAX, whereas we're using delayed allocation in the buffered IO
> case which has much less overhead.
> 
> The only thing we can do here to speed the DAX case up is do cached
> memcpy so that the data copy after zeroing runs at L1 cache speed
> (i.e. 50x faster than it currently does).
> 
> Let's take the allocation out of it, eh? Let's do overwrite instead,
> fsync in the buffered Io case, no fsync for DAX:
> 
> 	    pwrite B/W (MiB/s)			runtime
> run	no DAX		DAX		no DAX		DA
> X
>  1	1119		1125		1.85s		0.93s
>  2	1113		1121		1.83s		0.91s
>  3	1128		1078		1.80s		0.94s
> 
> So, pwrite speeds are no different for DAX vs page cache IO. Also,
> now we can see the overhead of writeback - a second data copy to
> the pmem for the IO during fsync. If I take the fsync() away from
> the buffered IO, the runtime drops to 0.89-0.91s, which is identical
> to the DAX code. Given the DAX code has a short IO path than
> buffered IO, it's not showing any advantage speed for using uncached
> IO....
>
> Let's go back to the allocation case, but this time take advantage
> of the new iomap based Io path in XFS to amortise the DAX allocation
> overhead by using a 16MB IO size instead of 4k:
> 
> $ sudo time xfs_io -f -c "pwrite 0 1g -b 16m" -c sync /mnt/scratch/foo
> 
> 
> 	    pwrite B/W (MiB/s)			runtime
> run	no DAX		DAX		no DAX		DA
> X
>  1	1344		1028		1.63s		1.03s
>  2	1410		 980		1.62s		1.06s
>  3	1399		1032		1.72s		0.99s
> 
> So, pwrite bandwidth of the copy into the page cache is still much
> higher than that of the DAX path, but now the allocation overhead
> is minimised and hence the double copy in the buffered IO writeback
> path shows up. For completeness, lets just run the overwrite case
> here which is effectively just competing  memcpy implementations,
> fsync for buffered, no fsync for DAX:
> 
> 	    pwrite B/W (MiB/s)			runtime
> run	no DAX		DAX		no DAX		DA
> X
>  1	1791		1727		1.53s		0.59s
>  2	1768		1726		1.57s		0.59s
>  3	1799		1729		1.55s		0.59s
> 
> Again, runtime shows the overhead of the double copy in the buffered
> IO/writeback path. It also shows the overhead in the DAX path of the
> allocation zeroing vs overwrite. If I drop the fsync from the
> buffered IO path, bandwidth remains the same but runtime drops to
> 0.55-0.57s, so again the buffered IO write path is faster than DAX
> while doing more work.
> 
> IOws, the overhead of dirty page tracking in the page cache mapping
> tree is not significant in terms of write() performance. Hence
> I fail to see why it should be significant in the DAX path - it will
> probably have less overhead because we have less to account for in
> the DAX write path. The only performance penalty for dirty tracking
> is in the fsync writeback path itself, and that a separate issue
> for optimisation.

I mostly agree with the analysis, but I have a few comments about the use of
cached / uncached (movnt) copy to PMEM.

I do not think the test results are relevant on this point because both
buffered and dax write() paths use uncached copy to avoid clflush.  The
buffered path uses cached copy to the page cache and then use uncached copy to
PMEM via writeback.  Therefore, the buffered IO path also benefits from using
uncached copy to avoid clflush.

Cached copy (req movq) is slightly faster than uncached copy, and should be
used for writing to the page cache.  For writing to PMEM, however, additional
clflush can be expensive, and allocating cachelines for PMEM leads to evict
application's cachelines.

Thanks,
-Toshi