linux-kernel - Re: [PATCH 6/6] dax: update I/O path to do proper PMEM flushing

Message-ID: <1438974515.2293.4.camel@linux.intel.com>
Date:	Fri, 07 Aug 2015 13:08:35 -0600
From:	Ross Zwisler <ross.zwisler@...ux.intel.com>
To:	Dave Chinner <david@...morbit.com>
Cc:	linux-kernel@...r.kernel.org, linux-nvdimm@...ts.01.org,
	dan.j.williams@...el.com, Matthew Wilcox <willy@...ux.intel.com>,
	Alexander Viro <viro@...iv.linux.org.uk>,
	linux-fsdevel@...r.kernel.org
Subject: Re: [PATCH 6/6] dax: update I/O path to do proper PMEM flushing

On Fri, 2015-08-07 at 07:04 +1000, Dave Chinner wrote:
> On Thu, Aug 06, 2015 at 11:43:20AM -0600, Ross Zwisler wrote:
> > Update the DAX I/O path so that all operations that store data (I/O
> > writes, zeroing blocks, punching holes, etc.) properly synchronize the
> > stores to media using the PMEM API.  This ensures that the data DAX is
> > writing is durable on media before the operation completes.
> > 
> > Signed-off-by: Ross Zwisler <ross.zwisler@...ux.intel.com>
> ....
> > +		if (pgsz < PAGE_SIZE) {
> >  				memset(addr, 0, pgsz);
> > -			else
> > +				wb_cache_pmem((void __pmem *)addr, pgsz);
> > +			} else {
> >  				clear_page(addr);
> > +				wb_cache_pmem((void __pmem *)addr, PAGE_SIZE);
> > +			}
> 
> I'd much prefer to see these wrapped up in helper functions e.g.
> clear_page_pmem() rather than scatter them around randomly.
> Especially the barriers - the way they've been optimised is asking
> for people to get it wrong in the future.  I'd much prefer to see
> the operations paired properly in a helper first (i.e. obviously
> correct) and then it can be optimised later if workloads start to
> show the barrier as a bottleneck...
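A minimal user-space sketch of the pairing suggested above. It assumes stub versions of wb_cache_pmem() and wmb_pmem() that only count their calls; the real kernel functions write back CPU cache lines and make earlier stores durable. The helper name zero_pmem_range() is hypothetical and is not an existing kernel API. It zeroes a range, writes it back, and issues the barrier as one unit, so no caller can forget the last step.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define PAGE_SIZE 4096

/* Hypothetical user-space stubs for the PMEM API used in the patch.
 * They only record calls; the kernel versions write back cache lines
 * (wb_cache_pmem) and make earlier stores durable (wmb_pmem). */
static unsigned long wb_calls, wmb_calls;
static size_t wb_bytes;

static void wb_cache_pmem(void *addr, size_t size)
{
	(void)addr;
	wb_calls++;
	wb_bytes += size;
}

static void wmb_pmem(void)
{
	wmb_calls++;
}

/* Hypothetical helper in the spirit of the suggested clear_page_pmem():
 * zeroing, writeback and barrier are paired in one place. The kernel
 * would use clear_page() instead of memset() for a full page. */
static void zero_pmem_range(void *addr, size_t size)
{
	assert(size <= PAGE_SIZE);
	memset(addr, 0, size);
	wb_cache_pmem(addr, size);
	wmb_pmem();
}
```

Because every call carries its own barrier, two back-to-back clears cost two fences. That is the trade-off the review accepts in exchange for obvious correctness, leaving any batching until a workload shows the barrier to be a bottleneck.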
> 
> > +/*
> > + * This function's stores and flushes need to be synced to media by a
> > + * wmb_pmem() in the caller. We flush the data instead of writing it back
> > + * because we don't expect to read this newly zeroed data in the near future.
> > + */
> 
> That seems suboptimal. dax_new_buf() is called on newly allocated or
> unwritten buffers we are about to write to. Immediately after this
> we write the new data to the page, so we are effectively writing
> the whole page here.
> 
> So why wouldn't we simply commit the whole page during the write and
> capture all this zeroing in the one flush/commit/barrier op?
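The single-commit alternative proposed above can be sketched as follows. This assumes the same counting stubs for wb_cache_pmem() and wmb_pmem(), plus a hypothetical write_new_block_pmem() that receives the new data directly, rather than zeroing in dax_new_buf() and copying later. Here `first` and `final` have the same meaning as in dax_new_buf(): `first` is the offset where new data starts, and `final` is the offset just past where it ends.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical counting stubs standing in for the kernel PMEM calls. */
static unsigned long wb_calls, wmb_calls;
static size_t wb_bytes;

static void wb_cache_pmem(void *addr, size_t size)
{
	(void)addr;
	wb_calls++;
	wb_bytes += size;
}

static void wmb_pmem(void)
{
	wmb_calls++;
}

/* Hypothetical: fill a newly allocated block of 'size' bytes with 'len'
 * bytes of new data at offset 'first', zero the head [0, first) and the
 * tail [final, size), then commit the whole block with one writeback
 * and one barrier instead of flushing each zeroed region separately. */
static void write_new_block_pmem(void *addr, size_t size, size_t first,
				 const void *src, size_t len)
{
	char *p = addr;
	size_t final = first + len;	/* one past the last new byte */

	assert(final <= size);
	memset(p, 0, first);
	memcpy(p + first, src, len);
	memset(p + final, 0, size - final);
	wb_cache_pmem(p, size);
	wmb_pmem();
}
```

The block is committed with one writeback over the whole range and one barrier, whether or not there are zeroed head and tail regions.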
> 
> >  static void dax_new_buf(void *addr, unsigned size, unsigned first, loff_t pos,
> >  			loff_t end)
> >  {
> >  	loff_t final = end - pos + first; /* The final byte of the buffer */
> >  
> > -	if (first > 0)
> > +	if (first > 0) {
> >  		memset(addr, 0, first);
> > -	if (final < size)
> > +		flush_cache_pmem((void __pmem *)addr, first);
> > +	}
> > +	if (final < size) {
> >  		memset(addr + final, 0, size - final);
> > +		flush_cache_pmem((void __pmem *)addr + final, size - final);
> > +	}
> >  }
> >  
> >  static bool buffer_written(struct buffer_head *bh)
> > @@ -108,6 +123,7 @@ static ssize_t dax_io(struct inode *inode, struct iov_iter *iter,
> >  	loff_t bh_max = start;
> >  	void *addr;
> >  	bool hole = false;
> > +	bool need_wmb = false;
> >  
> >  	if (iov_iter_rw(iter) != WRITE)
> >  		end = min(end, i_size_read(inode));
> > @@ -145,18 +161,23 @@ static ssize_t dax_io(struct inode *inode, struct iov_iter *iter,
> >  				retval = dax_get_addr(bh, &addr, blkbits);
> >  				if (retval < 0)
> >  					break;
> > -				if (buffer_unwritten(bh) || buffer_new(bh))
> > +				if (buffer_unwritten(bh) || buffer_new(bh)) {
> >  					dax_new_buf(addr, retval, first, pos,
> >  									end);
> > +					need_wmb = true;
> > +				}
> >  				addr += first;
> >  				size = retval - first;
> >  			}
> >  			max = min(pos + size, end);
> >  		}
> >  
> > -		if (iov_iter_rw(iter) == WRITE)
> > +		if (iov_iter_rw(iter) == WRITE) {
> >  			len = copy_from_iter_nocache(addr, max - pos, iter);
> > -		else if (!hole)
> > +			if (!iter_is_iovec(iter))
> > +				wb_cache_pmem((void __pmem *)addr, max - pos);
> > +			need_wmb = true;
> 
> Conditional pmem cache writeback after a "nocache" copy to the pmem?
> Comments, please.
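A sketch of one plausible answer to the question above. Our understanding is that copy_from_iter_nocache() uses non-temporal (cache-bypassing) stores, at least for the bulk of the copy, only for user iovec iterators. For other iterator types it falls back to ordinary cached copies, so only those leave dirty lines that need wb_cache_pmem(). The model below assumes that behaviour and stubs out the PMEM calls as counters. The names copy_to_pmem(), write_chunks() and enum iter_kind are hypothetical. The model defers a single wmb_pmem() to the end of the loop, which is how we read the patch's need_wmb flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical counting stubs standing in for the kernel PMEM calls. */
static unsigned long wb_calls, wmb_calls;

static void wb_cache_pmem(void *addr, size_t size)
{
	(void)addr;
	(void)size;
	wb_calls++;
}

static void wmb_pmem(void)
{
	wmb_calls++;
}

/* Assumed model of the two store flavours behind copy_from_iter_nocache():
 * user iovecs get non-temporal stores, other iterators get cached ones. */
enum iter_kind { MODEL_ITER_IOVEC, MODEL_ITER_KVEC };

static size_t copy_to_pmem(void *dst, const void *src, size_t len,
			   enum iter_kind kind)
{
	memcpy(dst, src, len);	/* stands in for either store flavour */
	/* Cached stores leave dirty lines behind; non-temporal ones do not. */
	if (kind != MODEL_ITER_IOVEC)
		wb_cache_pmem(dst, len);
	return len;
}

static void write_chunks(char *dst, const char *src, size_t nchunks,
			 size_t chunk, enum iter_kind kind)
{
	bool need_wmb = false;

	for (size_t i = 0; i < nchunks; i++) {
		if (copy_to_pmem(dst + i * chunk, src + i * chunk, chunk, kind))
			need_wmb = true;	/* stores issued: a barrier is owed */
	}
	/* Both store flavours still need one barrier before completion. */
	if (need_wmb)
		wmb_pmem();
}
```

A code comment at the wb_cache_pmem() call site, stating which iterator types bypass the cache, would answer the review question directly.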
> 
> Cheers,
> 
> Dave.

I agree with all your comments, and will address them in v2.  Thank you for
the feedback.