Updated: 2025-10-22 GMT+08:00
Saving dist Weights Fails When SFS_Turbo Is Linked with OBS
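In VeRL training scenarios where an SFS_Turbo file system linked with OBS is in use and the training output is configured to be saved on SFS_Turbo, the following error is reported when the weights are saved at the end of training: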
PermissionError: [Errno 1] Operation not permitted: '/xxx/xxx/.metadata.tmp' -> '/xxx/xxx/.metadata'
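Root cause: an SFS_Turbo file system linked with OBS does not currently support the rename operation. The checkpoint writer first writes .metadata.tmp and then renames it to .metadata, and that rename is the step that fails.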
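To check whether a given mount point is affected before starting a long training run, a small probe like the one below may help. It is only a sketch: the function name `supports_rename` is made up for illustration, and it assumes the process is allowed to create and delete a temporary file in the target directory.

```python
import os
import tempfile


def supports_rename(directory: str) -> bool:
    """Return True if a file created in `directory` can be renamed there.

    Hypothetical helper, not part of VeRL or Megatron-LM.
    """
    # Create an empty temporary file inside the directory under test.
    fd, src = tempfile.mkstemp(dir=directory, suffix=".tmp")
    os.close(fd)
    dst = src[: -len(".tmp")] + ".renamed"
    try:
        os.rename(src, dst)
    except OSError:
        # PermissionError (Errno 1) is a subclass of OSError.
        os.remove(src)
        return False
    os.remove(dst)
    return True
```

Calling `supports_rename` on the checkpoint output directory and getting `False` would suggest that one of the workarounds is needed.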
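Use either of the following solutions:
- Use an SFS_Turbo file system that is not linked with OBS.
- Run pip show megatron-core to find the Megatron-LM installation path, then add the following method to the FileSystemWriterAsync class in Megatron-LM/megatron/core/dist_checkpointing/strategies/filesystem_async.py: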
    # Requires `import os` and `import pickle` at the top of filesystem_async.py.
    def finish(self, metadata, results):
        # _FileSystemWriter.finish performs a rename, which is not supported
        # when SFS_Turbo is linked with OBS, so write .metadata directly.
        storage_md = {}
        for wr_list in results:
            storage_md.update({wr.index: wr.storage_data for wr in wr_list})
        metadata.storage_data = storage_md
        metadata.storage_meta = self.storage_meta()
        # Delete in case other checkpoints were present.
        if self.fs.exists(self.metadata_path):
            self.fs.rm_file(self.metadata_path)
        with self.fs.create_stream(self.metadata_path, "wb") as metadata_file:
            pickle.dump(metadata, metadata_file)
            if self.sync_files:
                try:
                    os.fsync(metadata_file.fileno())
                except AttributeError:
                    os.sync()
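As an alternative to reading the pip show output by hand, the file to edit can usually be located from Python. The following is a minimal sketch: `find_module_file` is a hypothetical helper, and it assumes the module is importable in the same environment that runs the training job. Note that resolving a dotted module name imports its parent packages.

```python
import importlib.util
from pathlib import Path
from typing import Optional


def find_module_file(module_name: str) -> Optional[Path]:
    """Return the source file of `module_name`, or None if it cannot be found.

    Hypothetical helper for illustration only.
    """
    try:
        # For dotted names, find_spec imports the parent packages first.
        spec = importlib.util.find_spec(module_name)
    except ImportError:
        return None
    if spec is None or spec.origin is None:
        return None
    return Path(spec.origin)


if __name__ == "__main__":
    target = "megatron.core.dist_checkpointing.strategies.filesystem_async"
    print(find_module_file(target))
```

If the result is None, the package is probably installed in a different environment, and the path shown by pip show should be used instead.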