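Shell/Python Node Execution Fails and the Backend Reports "session is down"
This guide uses the Shell operator as an example.
Symptom
The Shell node is reported as failed, even though the Shell script itself ran successfully.
Cause Analysis
- Obtain the run log of the Shell node.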
[2021/11/17 02:00:36 GMT+0800] [INFO] No job-level agency is set, Workspace-level agency is dlg_agency, Execute job use agency dlg_agency, job id is 07572F197E4642E5BE549C2B656F157Ctm7cHkHd
[2021/11/17 02:00:36 GMT+0800] [DEBUG] ===============================================
[2021/11/17 02:00:36 GMT+0800] [INFO] Get response from agent when try to submit shell running job :
[2021/11/17 02:00:36 GMT+0800] [INFO] { "jobResultList":[ { "jobId":"a567f7f5-3c9e-4dfc-a464-bd477ac5b1ea", "status":"created", "errorCode":0, "failCount":0, "result":[ ] } ], "agentId":"614853ee-c1c6-456d-9aa6-fc84ad1281ed" }
[2021/11/17 02:00:36 GMT+0800] [DEBUG] ===============================================
[2021/11/17 02:05:56 GMT+0800] [DEBUG] ===============================================
[2021/11/17 02:05:56 GMT+0800] [INFO] Job Run finish , the raw output is :
[2021/11/17 02:05:56 GMT+0800] [INFO] { "jobId":"a567f7f5-3c9e-4dfc-a464-bd477ac5b1ea", "status":"failed", "errorCode":3427, "errorMessage":"Shell script job execute failed.", "failCount":0, "result":[ { "is_success":false, "exeTime":300.609 } ] }
[2021/11/17 02:05:56 GMT+0800] [DEBUG] ===============================================
[2021/11/17 02:05:56 GMT+0800] [DEBUG] ===============================================
[2021/11/17 02:05:56 GMT+0800] [INFO] The return code is : [-1].
[2021/11/17 02:05:56 GMT+0800] [DEBUG] ===============================================
[2021/11/17 02:05:56 GMT+0800] [INFO] Execute shell script job finished.
[2021/11/17 02:05:56 GMT+0800] [ERROR] Shell exit code is not 0
[2021/11/17 02:05:56 GMT+0800] [DEBUG] ===============================================
[2021/11/17 02:05:56 GMT+0800] [ERROR] Shell script job execute failed. Please contact ECS Service.
[2021/11/17 02:05:56 GMT+0800] [ERROR] Exception message: RuntimeException: Shell script job execute failed. Please contact ECS Service.
[2021/11/17 02:05:56 GMT+0800] [ERROR] Root Cause message:RuntimeException: Shell script job execute failed. Please contact ECS Service.
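- Check the sshd_config parameters on the ECS.
Cause: The SSH session timed out and was disconnected, so the Shell node was marked as failed. The exeTime of 300.609 in the run log, roughly 300 seconds, is consistent with a session timeout.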
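The sshd_config check in the step above can be done from a terminal on the ECS. The following is a minimal sketch, not an official tool: the function name show_clientalive is hypothetical, and it assumes the settings are written in the usual "Keyword value" form with whitespace between them.

```shell
# show_clientalive [FILE]  (hypothetical helper name)
# Prints the active (uncommented) ClientAlive* lines of an sshd_config file.
# FILE defaults to /etc/ssh/sshd_config.
show_clientalive() {
    grep -Ei '^[[:space:]]*ClientAlive(Interval|CountMax)[[:space:]]' \
        "${1:-/etc/ssh/sshd_config}" \
        || echo "no active ClientAlive* settings (sshd defaults apply)"
}
```

Running show_clientalive with no argument reads /etc/ssh/sshd_config. Commented-out lines such as "#ClientAliveInterval 0" are deliberately ignored, because sshd does not apply them.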
Solution
- Edit the /etc/ssh/sshd_config file on the ECS and add or update the following two values.
ClientAliveInterval 300
ClientAliveCountMax 3
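ClientAliveInterval specifies the interval, in seconds, at which the server sends a keepalive request to the client. The default is 0, which means no requests are sent. With ClientAliveInterval 300, the server sends a request every five minutes and the client responds, which keeps the connection alive. ClientAliveCountMax, whose default value is 3, is the number of consecutive requests that may go unanswered before the server disconnects the client automatically. Under normal conditions the client responds, so the connection stays open.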
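The "add or update" step can also be scripted. The sketch below is one possible approach, not the documented procedure: the function name set_sshd_option is hypothetical, it only replaces active (uncommented) lines, and it assumes the file contains no Match blocks, since a line appended after a Match block would apply only inside that block.

```shell
# set_sshd_option FILE KEY VALUE  (hypothetical helper name)
# Replaces the active "KEY ..." line in FILE with "KEY VALUE", or appends
# the line when no active entry exists. Commented-out lines are left alone.
# Assumption: FILE has no Match blocks.
set_sshd_option() {
    file=$1 key=$2 value=$3
    tmp=$(mktemp) || return 1
    if awk -v k="$key" -v v="$value" '
        BEGIN { done = 0; lk = tolower(k) }
        {
            line = $0
            sub(/^[ \t]+/, "", line)            # ignore leading indentation
            split(line, f, /[ \t]+/)
            if (tolower(f[1]) == lk) {          # sshd keywords are case-insensitive
                if (!done) { print k " " v; done = 1 }
                next                            # drop any duplicate entries
            }
            print
        }
        END { if (!done) print k " " v }
    ' "$file" > "$tmp"; then
        cat "$tmp" > "$file"                    # overwrite in place, keeping file permissions
        status=$?
    else
        status=1
    fi
    rm -f "$tmp"
    return $status
}

# Example (run as root, after backing up the original file):
#   cp /etc/ssh/sshd_config /etc/ssh/sshd_config.bak
#   set_sshd_option /etc/ssh/sshd_config ClientAliveInterval 300
#   set_sshd_option /etc/ssh/sshd_config ClientAliveCountMax 3
```

With these two values, a client that stops responding is disconnected after roughly 300 × 3 = 900 seconds, while a client that answers the keepalive requests stays connected.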
- After making the change, restart sshd on the ECS by running the following command:
systemctl restart sshd.service
- Check whether sshd started successfully (the figure below shows a successful start):
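The same check can be made from the command line. This is a hedged sketch: the function name verify_sshd is hypothetical, and it assumes a systemd-based ECS image where the unit is named sshd.service, as in the restart step (some distributions name it ssh.service instead). It must be run as root.

```shell
# verify_sshd  (hypothetical helper name): run as root after restarting sshd.
# Assumes a systemd host whose SSH unit is called sshd.service.
verify_sshd() {
    sshd -t || return 1                                   # syntax-check sshd_config
    systemctl is-active --quiet sshd.service || return 1  # succeeds only if the unit is active
    sshd -T | grep -i '^clientalive'                      # print the effective keepalive values
}
```

If the function returns 0 and prints clientaliveinterval 300 and clientalivecountmax 3, the new settings are in effect for new SSH sessions.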