linux-kernel - RE: [PATCH 6/6] habanalabs: increase timeout during reset

Message-ID: <AM0PR02MB5523AC0E528B32AD791AC9A9B8CB0@AM0PR02MB5523.eurprd02.prod.outlook.com>
Date:   Mon, 30 Mar 2020 06:14:07 +0000
From:   Omer Shpigelman <oshpigelman@...ana.ai>
To:     Oded Gabbay <oded.gabbay@...il.com>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        Tomer Tayar <ttayar@...ana.ai>
CC:     "gregkh@...uxfoundation.org" <gregkh@...uxfoundation.org>
Subject: RE: [PATCH 6/6] habanalabs: increase timeout during reset

On Sat, Mar 28, 2020 at 11:53 AM, Oded Gabbay <oded.gabbay@...il.com> wrote:
> When doing training, the DL framework (e.g. TensorFlow) performs hundreds of
> thousands of memory allocations and mappings. If the driver needs to
> perform a hard reset during training, it kills the application and unmaps all
> those memory allocations. Unfortunately, because of the large number of
> mappings, the driver cannot finish unmapping within the current timeout
> (5 seconds). Therefore, increase the timeout to 30 seconds to avoid a situation
> where the driver resets the device with active mappings, which can sometimes
> cause a kernel bug.
>
> BTW, this doesn't mean we will spend the full 30 seconds, because the reset
> thread checks once per second whether the unmap operation is done.
>
> Signed-off-by: Oded Gabbay <oded.gabbay@...il.com>

Reviewed-by: Omer Shpigelman <oshpigelman@...ana.ai>
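
For readers of the archive, the polling behaviour described in the quoted commit message (check once per second, give up after 30 seconds) can be sketched in plain userspace C. This is only an illustration under stated assumptions, not the habanalabs driver code: every name below (wait_for_unmap, fake_unmap_done, no_sleep, the two macros) is hypothetical, and only the 30-second budget and the 1-second interval come from the message itself.

```c
#include <stdbool.h>

/* Values taken from the commit message; the macro names are hypothetical. */
#define RESET_TIMEOUT_SEC 30
#define POLL_INTERVAL_SEC 1

/* Hypothetical callback: returns true once all mappings have been removed. */
typedef bool (*unmap_done_fn)(void *ctx);
/* Hypothetical sleep hook, so the loop can be exercised without real delays. */
typedef void (*sleep_fn)(unsigned int seconds);

/*
 * Check for completion once per interval until the work is done or the
 * time budget is used up. Returns the number of seconds waited, or -1 if
 * the budget ran out while the work was still pending.
 */
static int wait_for_unmap(unmap_done_fn done, void *ctx, sleep_fn do_sleep)
{
    int waited = 0;

    for (;;) {
        if (done(ctx))
            return waited;          /* finished early: no need to wait longer */
        if (waited >= RESET_TIMEOUT_SEC)
            return -1;              /* budget exhausted, still not done */
        do_sleep(POLL_INTERVAL_SEC);
        waited += POLL_INTERVAL_SEC;
    }
}

/* Simulated unmap job that reports "not done" for a fixed number of checks. */
struct fake_unmap {
    int polls_left;
};

static bool fake_unmap_done(void *ctx)
{
    struct fake_unmap *u = ctx;

    if (u->polls_left == 0)
        return true;
    u->polls_left--;
    return false;
}

/* Sleep hook that returns immediately; substitute a real sleep if desired. */
static void no_sleep(unsigned int seconds)
{
    (void)seconds;
}
```

With a simulated job that needs three checks, wait_for_unmap() returns after three intervals rather than consuming the whole budget, which is the point made in the commit message. A job that is still pending after 30 intervals yields -1.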