Debugging TensorRT Accuracy Issues

Accuracy issues in TensorRT, especially with large networks, can be challenging to debug. One way to make them manageable is to reduce the problem size or pinpoint the source of failure.

This guide aims to provide a general approach to doing so; it is structured as a flattened flowchart - at each branch, two links are provided so you can choose the one that best matches your situation.

If you're using an ONNX model, try sanitizing it before proceeding, as this may solve the problem in some cases.

Intermittent Or Not?

Is the issue intermittent between engine builds?

Debugging Intermittent Accuracy Issues

Since the engine building process is non-deterministic, different tactics (i.e. layer implementations) may be selected each time the engine is built. When one of the tactics is faulty, this may manifest as an intermittent failure. Polygraphy includes a debug build subtool to help you find such tactics.

For more information, refer to debug example 01.
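
The repeated-build idea can be sketched in a few lines of Python. This is only an illustration of the technique, not Polygraphy's implementation: `sort_builds`, `suspect_tactics`, `fake_build`, and `fake_check` are hypothetical names, and the "tactics" are plain strings standing in for whatever the builder records.

```python
from typing import Callable, Dict, List, Tuple

Tactics = Dict[str, str]  # layer name -> tactic chosen for that layer


def sort_builds(
    build: Callable[[int], Tactics],
    check: Callable[[Tactics], bool],
    iterations: int,
) -> Dict[str, List[Tactics]]:
    """Build repeatedly and sort each build's tactic choices into good/bad."""
    results: Dict[str, List[Tactics]] = {"good": [], "bad": []}
    for i in range(iterations):
        tactics = build(i)
        results["good" if check(tactics) else "bad"].append(tactics)
    return results


def suspect_tactics(results: Dict[str, List[Tactics]]) -> List[Tuple[str, str]]:
    """Return (layer, tactic) pairs seen in bad builds but never in good ones."""
    good = {pair for t in results["good"] for pair in t.items()}
    bad = {pair for t in results["bad"] for pair in t.items()}
    return sorted(bad - good)


# Hypothetical stand-ins: every third build picks a faulty tactic for "conv1".
def fake_build(i: int) -> Tactics:
    return {"conv1": "tactic_B" if i % 3 == 0 else "tactic_A", "relu1": "tactic_C"}


def fake_check(tactics: Tactics) -> bool:
    return tactics["conv1"] != "tactic_B"


results = sort_builds(fake_build, fake_check, iterations=6)
print(suspect_tactics(results))  # [('conv1', 'tactic_B')]
```

In a real workflow, the build and check steps would be the engine build and your accuracy comparison; the point of the sketch is that only tactics unique to failing builds are worth investigating.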

Were you able to find the failing tactic?

Is Layerwise An Option?

If the accuracy issue is consistently reproducible, the best next step is to figure out which layer is causing the failure. Polygraphy includes a mechanism to mark all tensors in the network as outputs so that they can be compared; however, this can potentially affect TensorRT's optimization process. Hence, we need to determine if we still observe the accuracy issue when all tensors are marked as outputs.

Refer to this example for details on how to compare per-layer outputs before proceeding.

Were you able to reproduce the accuracy failure when comparing layer-wise outputs?

Extracting A Failing Subgraph

Since we're able to compare layerwise outputs, we should be able to determine which layer first introduces the error by looking at the output comparison logs. Once we know which layer is problematic, we can extract it from the model.
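
As a rough sketch of what "first layer that introduces the error" means, the following Python walks two sets of per-tensor outputs in order and reports the first mismatch. It assumes the outputs are plain lists of floats keyed by tensor name and recorded in topological order; the function name, tensor names, and tolerances are hypothetical, and `math.isclose` is used in place of whatever comparison your tooling performs.

```python
import math
from typing import Dict, List, Optional


def first_divergent_layer(
    trt_outputs: Dict[str, List[float]],
    ref_outputs: Dict[str, List[float]],
    rtol: float = 1e-5,
    atol: float = 1e-5,
) -> Optional[str]:
    """Return the name of the first tensor whose values differ, or None."""
    # Dicts preserve insertion order, so this visits tensors in the order
    # they were recorded (assumed here to be topological order).
    for name, ref in ref_outputs.items():
        out = trt_outputs.get(name)
        if out is None:
            continue  # tensor missing on one side (e.g. fused away); skip it
        if len(out) != len(ref):
            return name
        if any(
            not math.isclose(a, b, rel_tol=rtol, abs_tol=atol)
            for a, b in zip(out, ref)
        ):
            return name
    return None


# Hypothetical per-layer outputs: the error first appears at "relu1_out".
ref = {"conv1_out": [0.5, 1.0], "relu1_out": [0.5, 1.0], "fc_out": [2.0]}
trt = {"conv1_out": [0.5, 1.0], "relu1_out": [0.5, 1.25], "fc_out": [2.6]}
print(first_divergent_layer(trt, ref))  # relu1_out
```

Later tensors such as `fc_out` also mismatch, but they only inherit the error, which is why the first divergence is the one to extract.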

In order to figure out the input and output tensors for the layer in question, we can use polygraphy inspect model. Refer to one of these examples for details:

Next, we can extract a subgraph including just the problematic layer. For more information, refer to surgeon example 01.

Does this isolated subgraph reproduce the problem?

Reducing A Failing ONNX Model

When we're unable to pinpoint the source of failure using a layerwise comparison, we can use a brute-force method of reducing the ONNX model - iteratively generate smaller and smaller subgraphs to find the smallest possible one that still fails. The debug reduce tool helps automate this process.

For more information, refer to debug example 02.
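
To make the reduction idea concrete, here is a minimal Python sketch that bisects a linear chain of layers. It is not how the debug reduce tool is implemented: the model is simplified to a list of layer names, `fails` is a hypothetical callback standing in for your check command, and bisection is only valid under the assumption that any subgraph containing the faulty layers keeps failing.

```python
from typing import Callable, List


def reduce_model(layers: List[str], fails: Callable[[List[str]], bool]) -> List[str]:
    """Shrink a contiguous range of layers while fails(range) stays True."""
    if not fails(layers):
        raise ValueError("the full model must fail before it can be reduced")

    # Phase 1: bisect the end to find the shortest failing prefix.
    lo, hi = 1, len(layers)  # invariant: layers[:hi] fails
    while lo < hi:
        mid = (lo + hi) // 2
        if fails(layers[:mid]):
            hi = mid
        else:
            lo = mid + 1
    end = hi

    # Phase 2: bisect the start to find the latest start that still fails.
    lo, hi = 0, end - 1  # invariant: layers[lo:end] fails
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if fails(layers[mid:end]):
            lo = mid
        else:
            hi = mid - 1
    return layers[lo:end]


# Hypothetical model in which "conv2" alone triggers the failure.
model = ["conv1", "relu1", "conv2", "relu2", "fc"]
print(reduce_model(model, lambda sub: "conv2" in sub))  # ['conv2']
```

When the failure does not behave monotonically, removing one layer at a time is the slower but safer alternative to bisection.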

Does the reduced model fail?

Double Check Your Reduce Options

If the reduced model no longer fails, or fails in a different way, ensure that your --check command is correct. You may also want to use --fail-regex to ensure that you're only considering the accuracy failure (and not other, unrelated failures) when reducing the model.
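
The effect of filtering failures with a regular expression can be illustrated with a small Python sketch. The function name and the log strings below are made up for illustration and are not real Polygraphy or TensorRT output; the sketch only shows why an unfiltered check can steer the reduction toward an unrelated error.

```python
import re
from typing import Optional


def is_target_failure(
    returncode: int, output: str, fail_regex: Optional[str] = None
) -> bool:
    """Decide whether a check run counts as the failure being reduced."""
    if returncode == 0:
        return False  # the check passed
    if fail_regex is None:
        return True  # without a filter, any non-zero exit counts as a failure
    # With a filter, only failures whose output matches the pattern count.
    return re.search(fail_regex, output) is not None


# Hypothetical check outputs.
accuracy_log = "accuracy check failed: output mismatch"
unrelated_log = "parse error: unsupported operator"

print(is_target_failure(1, unrelated_log))                     # True
print(is_target_failure(1, unrelated_log, r"output mismatch"))  # False
print(is_target_failure(1, accuracy_log, r"output mismatch"))   # True
```

Without the pattern, the unrelated parse error would be treated as a reproduction of the bug, and the reduced model could end up failing for the wrong reason.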

You Have A Minimal Failing Case!

If you've made it to this point, you now have a minimal failing case! Further debugging should be significantly easier.

If you are a TensorRT developer, you'll need to dive into the code at this point. If not, please report your bug!