Single-Model Performance Tuning with AOE

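The AOE tool tunes model execution and backend compilation during model conversion. Note that AOE is only suitable for models with static shapes. AOE tuning is easily affected by existing caches, so run the conversion twice to get a good result: the first run generates the AOE knowledge base, and the second run reuses it. In this scenario, AOE brings little improvement for models such as text_encoder. The main performance bottleneck is the unet model, so tuning focuses on unet. The overall procedure is as follows: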

Clear the caches before conversion so that they do not affect the result.

# shell
# Delete the existing AOE knowledge base, or back it up first.
rm -rf /root/Ascend/latest/data/aoe
# Delete the compilation cache.
rm -rf /root/atc_data/*
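If you prefer to back up the existing knowledge base rather than delete it, the move could be sketched as follows. The function name `backup_aoe_kb` and the timestamped `.bak` suffix are illustrative assumptions and are not part of the AOE tooling.

```shell
# Hypothetical helper: move a directory aside instead of deleting it.
# Usage: backup_aoe_kb <dir>
backup_aoe_kb() {
    # Strip a trailing slash so the backup name sits next to the original.
    src="${1%/}"
    # Nothing to back up if the directory does not exist.
    [ -d "$src" ] || return 0
    dest="${src}.bak.$(date +%Y%m%d%H%M%S)"
    # Print the backup location so it can be restored later.
    mv "$src" "$dest" && echo "$dest"
}

# Example (path taken from the step above):
# backup_aoe_kb /root/Ascend/latest/data/aoe
```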

Create the AOE working directory and change into it.

mkdir -p /home_host/work/aoe
cd /home_host/work/aoe

Enable AOE auto-tuning in the configuration file.
Configure unet.ini to turn on AOE tuning (aoe_mode and op_select_impl_mode).
```
# unet.ini
[ascend_context]
input_shape=sample:[2,4,64,64];timestep:[1];encoder_hidden_states:[2,77,768]
input_format=NCHW

aoe_mode="subgraph tuning, operator tuning"
op_select_impl_mode=high_performance
```
Enable Ascend log printing. The values of ASCEND_GLOBAL_LOG_LEVEL map to log levels as follows: 0 = debug, 1 = info, 2 = warning, 3 = error.
```
# shell
export ASCEND_GLOBAL_LOG_LEVEL=1
export ASCEND_SLOG_PRINT_TO_STDOUT=1
```
Specify the AOE tuning configuration file when converting the model.
```
# shell
# Pass the AOE tuning configuration file to the converter and write the tuning log to aoe_unet.log.
mkdir aoe_output
converter_lite --modelFile=/home_host/work/runwayml/onnx_models/unet/model.onnx --outputFile=./aoe_output/aoe_unet --configFile=unet.ini  --fmk=ONNX --saveType=MINDIR --optimize=ascend_oriented > aoe_unet.log
```
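With AOE tuning enabled, model conversion can take several hours because it includes the time-consuming AOE tuning process. You can also specify the tuning time. Longer tuning generally gives better results, and up to 10 hours is usually enough. Running the conversion in the background is recommended. When tuning finishes, the knowledge base generated by AOE is saved under "/root/Ascend/latest/data/aoe" by default, and the corresponding MindIR model is written to aoe_output. This model has not yet absorbed the knowledge base, so its performance is poor. Convert the model again with the AOE knowledge base retained to reach better performance.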
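Because the tuning run can last for hours, one way to run it in the background is with `nohup`, as in the sketch below. The helper name `run_detached` is an assumption made for illustration. Any equivalent mechanism, such as a terminal multiplexer, works as well.

```shell
# Hypothetical helper: run a long command detached from the terminal.
# Usage: run_detached <log_file> <command> [args...]
run_detached() {
    log_file="$1"
    shift
    # nohup keeps the job alive after logout; stdout and stderr go to the log.
    nohup "$@" > "$log_file" 2>&1 &
    # Print the PID so the job can be checked later with: kill -0 <pid>
    echo "$!"
}

# Example, reusing the conversion command from this step:
# run_detached aoe_unet.log converter_lite \
#     --modelFile=/home_host/work/runwayml/onnx_models/unet/model.onnx \
#     --outputFile=./aoe_output/aoe_unet --configFile=unet.ini \
#     --fmk=ONNX --saveType=MINDIR --optimize=ascend_oriented
```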
Delete the compilation cache atc_data.

Note that, unlike the first cache cleanup, this one keeps the AOE knowledge base.
```
#shell
# Delete the compilation cache.
rm -rf /root/atc_data/*
```
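Before converting again, it may be worth confirming that the knowledge base from the first run is still in place. The check below is a minimal sketch. The helper name `kb_present` is hypothetical, and it only tests that the directory is non-empty, not that its contents are valid.

```shell
# Hypothetical helper: succeed only if the directory exists and is non-empty.
# Usage: kb_present <dir>
kb_present() {
    [ -d "$1" ] && [ -n "$(ls -A "$1" 2>/dev/null)" ]
}

# Example (default knowledge base path from the first run):
# kb_present /root/Ascend/latest/data/aoe || echo "AOE knowledge base missing; rerun tuning first"
```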

Run the model conversion command again and make sure that the AOE knowledge base is hit.

Configure unet.ini with AOE tuning turned off:

# unet.ini
[ascend_context]
input_shape=sample:[2,4,64,64];timestep:[1];encoder_hidden_states:[2,77,768]
input_format=NCHW
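The tuning and non-tuning variants of unet.ini differ only in the two AOE lines, so both can be generated from one place. The following is a sketch under that assumption. The function name `write_unet_ini` and its `on`/`off` argument are illustrative, and the settings are copied from this guide.

```shell
# Hypothetical helper: write unet.ini with AOE tuning on or off.
# Usage: write_unet_ini <path> <on|off>
write_unet_ini() {
    ini_path="$1"
    mode="$2"
    # Settings shared by both conversion runs.
    cat > "$ini_path" <<'EOF'
[ascend_context]
input_shape=sample:[2,4,64,64];timestep:[1];encoder_hidden_states:[2,77,768]
input_format=NCHW
EOF
    # The tuning switches are appended only for the first (tuning) run.
    if [ "$mode" = "on" ]; then
        cat >> "$ini_path" <<'EOF'

aoe_mode="subgraph tuning, operator tuning"
op_select_impl_mode=high_performance
EOF
    fi
}

# Example: first run with tuning, second run without.
# write_unet_ini unet.ini on
# write_unet_ini unet.ini off
```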

Run the model conversion command again (AOE is off for this run, so it finishes faster):

#shell
converter_lite --modelFile=/home_host/work/runwayml/onnx_models/unet/model.onnx --outputFile=./aoe_output/aoe_unet --configFile=unet.ini  --fmk=ONNX --saveType=MINDIR --optimize=ascend_oriented > aoe_unet2.log

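The aoe_output directory now contains the corresponding MindIR model, which incorporates the AOE knowledge base. Use the benchmark tool to measure the performance of the newly generated MindIR model and compare it with the model from before AOE tuning. The comparison shows improved performance.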

#shell
# Command for the model before tuning:
benchmark --modelFile=/home_host/work/static_shape_convert/mindir_models/unet_graph.mindir --device=Ascend --numThreads=1 --parallelNum=1 --workersNum=1 --warmUpLoopCount=100 --loopCount=100

# Command for the model after tuning:
benchmark --modelFile=/home_host/work/aoe/aoe_output/aoe_unet_graph.mindir --device=Ascend --numThreads=1 --parallelNum=1 --workersNum=1 --warmUpLoopCount=100 --loopCount=100
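To put a number on the comparison, the average latencies reported by the two benchmark runs can be turned into a percentage improvement. The helper below is a sketch. The name `improvement_percent` is hypothetical, and the values in the example are placeholders, not measured results.

```shell
# Hypothetical helper: percentage latency reduction between two runs.
# Usage: improvement_percent <latency_before> <latency_after>
improvement_percent() {
    awk -v before="$1" -v after="$2" \
        'BEGIN { printf "%.1f\n", (before - after) / before * 100 }'
}

# Example with placeholder latencies (not measured results):
# improvement_percent 200 150    # prints 25.0
```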

Figure 1 Model before tuning