linux-kernel - Re: [PATCH 0/7] crypto: SHA256 multibuffer implementation

lists.openwall.net
Open Source and information security mailing list archives

Message-ID: <20160627090455.GA7140@gondor.apana.org.au>
Date:	Mon, 27 Jun 2016 17:04:55 +0800
From:	Herbert Xu <herbert@...dor.apana.org.au>
To:	Megha Dey <megha.dey@...el.com>
Cc:	tim.c.chen@...ux.intel.com, davem@...emloft.net,
	linux-crypto@...r.kernel.org, linux-kernel@...r.kernel.org,
	fenghua.yu@...el.com, Megha Dey <megha.dey@...ux.intel.com>
Subject: Re: [PATCH 0/7] crypto: SHA256 multibuffer implementation

On Thu, Jun 23, 2016 at 06:40:41PM -0700, Megha Dey wrote:
> From: Megha Dey <megha.dey@...ux.intel.com>
> 
> In this patch series, we introduce the multi-buffer crypto algorithm on
> x86_64 and apply it to SHA256 hash computation.  The multi-buffer technique
> takes advantage of the 8 data lanes in the AVX2 registers and allows
> computation to be performed on data from multiple jobs in parallel.
> This lets us parallelize computation even when data dependencies within
> a single crypto job prevent us from fully parallelizing it.
> The algorithm can be extended to other hashing and encryption schemes
> in the future.
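To make the "8 data lanes" idea concrete, here is a minimal scalar sketch (not taken from the patch set) that applies one SHA-256 step function, Sigma0 from FIPS 180-4, to one 32-bit word from each of eight independent jobs. The names MB_LANES, rotr32 and sigma0_x8 are hypothetical. Because no lane depends on another, each loop iteration corresponds to one lane of a 256-bit AVX2 register; a compiler targeting AVX2 may auto-vectorize the loop, whereas the real implementation uses hand-written assembly.

```c
#include <stdint.h>

#define MB_LANES 8  /* 256-bit AVX2 register / 32-bit SHA-256 word */

/* Rotate right; n is always 2, 13 or 22 here, so the shifts are well defined. */
static inline uint32_t rotr32(uint32_t x, unsigned n)
{
    return (x >> n) | (x << (32 - n));
}

/* SHA-256 Sigma0 applied to one word from each of eight independent jobs.
 * Each lane reads only its own input, which is what allows the same
 * instruction to process all eight jobs at once. */
static void sigma0_x8(uint32_t out[MB_LANES], const uint32_t a[MB_LANES])
{
    for (int i = 0; i < MB_LANES; i++)
        out[i] = rotr32(a[i], 2) ^ rotr32(a[i], 13) ^ rotr32(a[i], 22);
}
```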
> 
> On multi-buffer SHA256 computation with AVX2, we see a throughput increase
> of up to 2.2x over the existing x86_64 single-buffer AVX2 algorithm.
> 
> The multi-buffer crypto algorithm is described in the following paper:
> Processing Multiple Buffers in Parallel to Increase Performance on
> Intel® Architecture Processors
> http://www.intel.com/content/www/us/en/communications/communications-ia-multi-buffer-paper.html
> 
> The outline of the algorithm is sketched below:
> Any driver requesting the crypto service will place an async
> crypto request on the workqueue.  The multi-buffer crypto daemon will
> pull requests from the workqueue and put each request in an empty data
> lane for multi-buffer crypto computation.  When all the empty lanes are
> filled, computation will commence on the jobs in parallel, and the job
> with the shortest remaining buffer will be completed and returned.  To
> prevent a prolonged stall when no new jobs are arriving, we will flush a
> crypto job if it has not been completed after a maximum allowable delay.
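The lane scheduling described above can be modelled in a few lines. This is a simplified sketch, not the kernel code: struct mb_job, struct mb_mgr, mb_submit and mb_flush are hypothetical names, jobs are reduced to a count of remaining 64-byte blocks, and the SIMD kernel is replaced by a loop that advances every occupied lane. Work only starts once all eight lanes are full; the job with the shortest remaining buffer completes first and its lane is freed. A flush, triggered by the caller when the maximum delay expires, runs with whatever lanes are occupied.

```c
#include <stddef.h>
#include <stdint.h>

#define MB_LANES 8  /* AVX2: 8 x 32-bit lanes per 256-bit register */

/* Hypothetical job descriptor: only tracks block counts, no real hashing. */
struct mb_job {
    int id;
    uint32_t blocks_left;
    uint32_t blocks_done;
};

/* Hypothetical lane manager; a NULL lane is empty. */
struct mb_mgr {
    struct mb_job *lane[MB_LANES];
    int occupied;
};

/* Stand-in for the SIMD kernel: advance every occupied lane by n blocks. */
static void mb_run_lanes(struct mb_mgr *m, uint32_t n)
{
    for (int i = 0; i < MB_LANES; i++) {
        if (m->lane[i]) {
            m->lane[i]->blocks_left -= n;
            m->lane[i]->blocks_done += n;
        }
    }
}

/* Run until the job with the shortest remaining buffer finishes; return it. */
static struct mb_job *mb_complete_shortest(struct mb_mgr *m)
{
    int min_i = -1;
    for (int i = 0; i < MB_LANES; i++)
        if (m->lane[i] && (min_i < 0 ||
            m->lane[i]->blocks_left < m->lane[min_i]->blocks_left))
            min_i = i;
    if (min_i < 0)
        return NULL;             /* nothing to flush */
    mb_run_lanes(m, m->lane[min_i]->blocks_left);
    struct mb_job *done = m->lane[min_i];
    m->lane[min_i] = NULL;
    m->occupied--;
    return done;
}

/* Place a job in an empty lane. Returns NULL while lanes are still free;
 * filling the last lane starts work and frees one lane again, so the
 * manager never holds more than MB_LANES jobs. */
static struct mb_job *mb_submit(struct mb_mgr *m, struct mb_job *job)
{
    for (int i = 0; i < MB_LANES; i++) {
        if (!m->lane[i]) {
            m->lane[i] = job;
            m->occupied++;
            break;
        }
    }
    if (m->occupied < MB_LANES)
        return NULL;
    return mb_complete_shortest(m);
}

/* Called when the maximum allowable delay expires: process a partly
 * filled set of lanes instead of stalling. */
static struct mb_job *mb_flush(struct mb_mgr *m)
{
    return mb_complete_shortest(m);
}
```

Note that every occupied lane advances by the same number of blocks, which is why the shortest job determines when a lane becomes free.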
> 
> To accommodate the fragmented nature of scatter-gather, we will keep
> submitting the next scatter-buffer fragment for a job for multi-buffer
> computation until a job is completed and no more buffer fragments remain.
> At that time we will pull a new job to fill the now empty data slot.
> When no new jobs arrive, we call a get_completed_job function to check
> whether any other jobs have been completed, to avoid extra delay in
> returning them.
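The fragment handling above amounts to keeping a per-job cursor into its scatter-gather list. The sketch below is illustrative only; struct sg_job, sg_start and sg_fragment_done are hypothetical names, and it tracks only byte counts rather than real scatterlist entries. Each time the lane finishes a fragment, the next fragment of the same job is resubmitted to that lane; the lane is released for a new job only after the last fragment.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical request split across scatter-gather fragments.
 * Assumes nfrags >= 1. */
struct sg_job {
    const size_t *frag_len;  /* length of each fragment, in bytes */
    size_t nfrags;
    size_t cur;              /* fragment currently loaded in the lane */
    size_t bytes_hashed;
};

/* Load the first fragment of a job into a lane. */
static void sg_start(struct sg_job *j)
{
    j->cur = 0;
    j->bytes_hashed = 0;
}

/* Called when the lane holding j finishes its current fragment.
 * Returns false if another fragment remains (it is resubmitted to the
 * same lane), true once the last fragment is done and the lane is free. */
static bool sg_fragment_done(struct sg_job *j)
{
    j->bytes_hashed += j->frag_len[j->cur];
    j->cur++;
    return j->cur == j->nfrags;
}
```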
> 
> The multi-buffer algorithm should be used when crypto jobs are
> submitted at a reasonably high rate.  At low submission rates, this
> algorithm will not be beneficial: the data lanes are not filled before
> the maximum allowable latency expires, so jobs are flushed instead of
> being processed with all the data lanes full.  We lose the benefit of
> parallel computation while also adding delay to each crypto job.
> Some tuning of the maximum latency parameter may be needed to get the
> best performance.
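A rough way to reason about that tuning: under the simplifying assumption that jobs arrive evenly spaced at a fixed rate, and that the flush deadline is measured from the first job placed in an empty set of lanes, the number of lanes occupied when work starts is about 1 + rate × delay, capped at 8. The helper lanes_filled is hypothetical and ignores burstiness and processing time.

```c
#define MB_LANES 8

/* Lanes occupied when work starts, assuming evenly spaced arrivals at
 * rate_per_s jobs/second and a flush deadline of max_delay_us
 * microseconds after the first job arrives (jobs arrive at t = 0,
 * 1/rate, 2/rate, ...). Integer math avoids floating-point rounding. */
static int lanes_filled(unsigned long long rate_per_s,
                        unsigned long long max_delay_us)
{
    unsigned long long n = 1 + rate_per_s * max_delay_us / 1000000ULL;
    return n < MB_LANES ? (int)n : MB_LANES;
}
```

For example, at 1000 jobs/s with a 2 ms deadline only 3 of the 8 lanes are busy at flush time, while at 1,000,000 jobs/s all 8 lanes fill well within the deadline. Raising the delay fills more lanes at low rates, but at the cost of the added latency described above.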
> 
> Note that in the tcrypt SHA256 speed test, we wait for a previous job to
> be completed before submitting a new job.  Hence it is not a valid
> test for the multi-buffer algorithm, which requires multiple outstanding
> jobs to fill all the data lanes to be effective (i.e. 8 outstanding
> jobs for the AVX2 case).  An updated version of the tcrypt test, with a
> more appropriate test for this scenario, is also included.
> 
> As this is the first algorithm in the kernel's crypto library
> for which we have tried multi-buffer optimizations, feedback
> and testing will be much appreciated.

All applied.  Thanks.
-- 
Email: Herbert Xu <herbert@...dor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt