Date:   Mon, 30 Mar 2020 06:14:07 +0000
From:   Omer Shpigelman <oshpigelman@...ana.ai>
To:     Oded Gabbay <oded.gabbay@...il.com>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        Tomer Tayar <ttayar@...ana.ai>
CC:     "gregkh@...uxfoundation.org" <gregkh@...uxfoundation.org>
Subject: RE: [PATCH 6/6] habanalabs: increase timeout during reset

On Sat, Mar 28, 2020 at 11:53 AM, Oded Gabbay <oded.gabbay@...il.com> wrote:
> During training, the DL framework (e.g. TensorFlow) performs hundreds of
> thousands of memory allocations and mappings. If the driver needs to
> perform a hard reset during training, it kills the application and unmaps
> all of those memory allocations. Because of the large number of mappings,
> the driver cannot finish unmapping within the current timeout (5 seconds).
> Therefore, increase the timeout to 30 seconds to avoid a situation where
> the driver resets the device with active mappings, which can sometimes
> cause a kernel bug.
>
> Note that this does not mean the reset always takes the full 30 seconds,
> because the reset thread checks once per second whether the unmap
> operation is done.
>
> Signed-off-by: Oded Gabbay <oded.gabbay@...il.com>

Reviewed-by: Omer Shpigelman <oshpigelman@...ana.ai>
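
For readers following the thread, the polling behavior described in the
commit message can be sketched roughly as below. This is a user-space
illustration only, not the driver's code: the function name
wait_for_unmap(), the callback types, the fake_* helpers and the two
macros are all hypothetical, and only the 30-second limit, the previous
5-second limit and the one-second poll interval come from the commit
message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Values taken from the commit message; the macro names are hypothetical. */
#define RESET_WAIT_TIMEOUT_SEC  30
#define RESET_POLL_INTERVAL_SEC 1

typedef bool (*unmap_done_fn)(void *ctx);
typedef void (*sleep_fn)(unsigned int seconds);

/*
 * Poll once per interval until the unmap work reports completion or the
 * timeout expires. Returns the number of seconds waited on success, or -1
 * if the timeout expired while mappings were still active.
 */
static int wait_for_unmap(unmap_done_fn done, void *ctx, sleep_fn do_sleep,
                          unsigned int timeout_sec)
{
    unsigned int waited = 0;

    assert(done != NULL && do_sleep != NULL);

    while (!done(ctx)) {
        if (waited >= timeout_sec)
            return -1;              /* gave up: mappings still active */
        do_sleep(RESET_POLL_INTERVAL_SEC);
        waited += RESET_POLL_INTERVAL_SEC;
    }
    return (int)waited;             /* finished early, no need to wait longer */
}

/* Stand-in for a teardown that needs a fixed number of polls to finish. */
struct fake_teardown {
    unsigned int polls_left;
};

static bool fake_done(void *ctx)
{
    struct fake_teardown *t = ctx;

    if (t->polls_left == 0)
        return true;
    t->polls_left--;
    return false;
}

/* No-op sleep so the sketch runs instantly; a real caller would block here. */
static void fake_sleep(unsigned int seconds)
{
    (void)seconds;
}
```

With these stand-ins, a teardown that needs 7 polls returns after 7
simulated seconds under the 30-second limit, while the same teardown
would hit the timeout under the old 5-second limit.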

