Slow weight loading on the first run
Quantization first has to load the full model weights. The method below speeds up that load; note that load time also depends on other factors such as disk read speed.
Add the set_initialized_submodules function shown below to the quant_deepseek_w8a8.py script and register it in the main entry point (via a patch) so that the model weights load faster.
# Add the following set_initialized_submodules function
def set_initialized_submodules(model, state_dict_keys):
    """
    Sets the `_is_hf_initialized` flag on every submodule of the given model whose weights are all present in the
    loaded state dict.
    """
    state_dict_keys = set(state_dict_keys)
    not_initialized_submodules = {}
    for module_name, module in model.named_modules():
        if module_name == "":
            # When checking whether the root module is loaded, there is no module_name prefix to prepend.
            module_keys = set(module.state_dict())
        else:
            module_keys = {f"{module_name}.{k}" for k in module.state_dict()}
        if module_keys.issubset(state_dict_keys):
            module._is_hf_initialized = True
        else:
            not_initialized_submodules[module_name] = module
    return not_initialized_submodules
......
# The patch() call below needs "from unittest.mock import patch"; add it to the
# script's imports if it is not already there.
if __name__ == "__main__":
    # torch_npu forks a new process for initialization; its lazy init can fail
    # after a large model has been loaded, so initialize it explicitly here.
    torch_npu.npu.init()
    # Patch the set_initialized_submodules function defined above into
    # transformers so that it is used during weight loading.
    patch("transformers.modeling_utils.set_initialized_submodules", new=set_initialized_submodules).start()
    # Invoke the main process
    main()
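
Why this helps: during from_pretrained, transformers skips its weight re-initialization step (_init_weights) for any submodule flagged with _is_hf_initialized, and the set-based subset check above keeps that flagging cheap even when the checkpoint contains a very large number of weight tensors. The sketch below is illustrative only and is not part of quant_deepseek_w8a8.py; it runs the function on a tiny torch model to show which submodules get flagged and which are reported as still needing initialization.

# Illustrative sketch only: exercise set_initialized_submodules on a small model.
import torch

model = torch.nn.Sequential(torch.nn.Linear(4, 4), torch.nn.Linear(4, 2))

# Case 1: every weight is present in the loaded state dict.
missing = set_initialized_submodules(model, model.state_dict().keys())
print(missing)                        # {} -> nothing left to initialize
print(model[0]._is_hf_initialized)    # True -> _init_weights will be skipped for this Linear

# Case 2: the second Linear's weights are missing from the state dict.
partial_keys = [k for k in model.state_dict() if not k.startswith("1.")]
missing = set_initialized_submodules(model, partial_keys)
print(list(missing))                  # ['', '1'] -> the root and the second Linear still need init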