Last updated: 2025-10-22 GMT+08:00

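Saving dist Checkpoint Weights Fails When SFS_Turbo Is Linked with OBS

In a VeRL training scenario, an SFS_Turbo file system linked with OBS is enabled and the training outputs are configured to be saved to SFS_Turbo. When the weights are saved at the end of training, the following error is reported: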

PermissionError: [Errno 1] Operation not permitted: '/xxx/xxx/.metadata.tmp' -> '/xxx/xxx/.metadata'

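Root cause: an SFS_Turbo file system linked with OBS does not currently support the rename operation. The checkpoint writer first writes .metadata.tmp and then renames it to .metadata, and that rename is the step that fails.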
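To confirm that the target directory is affected before starting a long training job, a small probe can try the same create-then-rename sequence. The following is a minimal sketch, not part of VeRL or Megatron-LM; the function name supports_rename and the probe file prefix are hypothetical and chosen only for illustration.

```python
import os
import tempfile


def supports_rename(directory: str) -> bool:
    """Return True if a file created in `directory` can be renamed there.

    Hypothetical helper for illustration: it mimics the
    '.metadata.tmp' -> '.metadata' rename done by the checkpoint writer.
    """
    # Create a throwaway file inside the directory under test.
    fd, src = tempfile.mkstemp(dir=directory, prefix=".rename_probe_")
    os.close(fd)
    dst = src + ".renamed"
    try:
        os.rename(src, dst)
    except OSError:
        # PermissionError ([Errno 1]) is a subclass of OSError.
        os.remove(src)
        return False
    # Rename worked; clean up the renamed probe file.
    os.remove(dst)
    return True
```

Calling supports_rename on the checkpoint output directory is expected to return False on a file system that rejects rename, in which case one of the solutions in this section is needed.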

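Use either of the following solutions:

  1. Use an SFS_Turbo file system that is not linked with OBS.
  2. Run pip show megatron-core to find the Megatron-LM installation path, and then add the following method to the FileSystemWriterAsync class in Megatron-LM/megatron/core/dist_checkpointing/strategies/filesystem_async.py: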
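        # Requires os and pickle to be imported at the top of this file;
        # add the imports if they are missing.
        def finish(self, metadata, results):
            # _FileSystemWriter.finish writes a temporary file and renames it.
            # Rename is not supported when SFS_Turbo is linked with OBS,
            # so this override writes the metadata file directly instead.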
            storage_md = {}
            for wr_list in results:
                storage_md.update({wr.index: wr.storage_data for wr in wr_list})
            metadata.storage_data = storage_md
            metadata.storage_meta = self.storage_meta()
            # delete in-case other checkpoints were present.
            if self.fs.exists(self.metadata_path):
                self.fs.rm_file(self.metadata_path)
            with self.fs.create_stream(self.metadata_path, "wb") as metadata_file:
                pickle.dump(metadata, metadata_file)
                if self.sync_files:
                    try:
                        os.fsync(metadata_file.fileno())
                    except AttributeError:
                        os.sync()
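As an alternative to reading the pip show output in solution 2, the file to edit can be located from Python. The following is a minimal sketch using only the standard library; the helper name locate_module_file is hypothetical, and the default module name is inferred from the file path given above. Note that resolving a dotted module name imports its parent packages.

```python
import importlib.util
from typing import Optional

# Module name inferred from the path named in solution 2 (assumption).
MODULE = "megatron.core.dist_checkpointing.strategies.filesystem_async"


def locate_module_file(module_name: str = MODULE) -> Optional[str]:
    """Return the source file path of `module_name`, or None if not found."""
    try:
        # For dotted names, find_spec imports the parent packages first.
        spec = importlib.util.find_spec(module_name)
    except ImportError:
        # A parent package is missing or failed to import.
        return None
    return spec.origin if spec is not None else None
```

Calling locate_module_file() in the training environment should return the path of the installed filesystem_async.py, or None if megatron-core is not installed there.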
