Updated: 2023-06-21 GMT+08:00

Shell/Python Node Fails and the Backend Reports "session is down"

This guide uses the Shell node as an example.

Background and Symptom

The Shell node is reported as failed, although the Shell script itself ran successfully.

Cause Analysis

  1. Obtain the run log of the Shell node.
    [2021/11/17 02:00:36 GMT+0800] [INFO] No job-level agency is set, Workspace-level agency is dlg_agency, Execute job use agency dlg_agency, job id is 07572F197E4642E5BE549C2B656F157Ctm7cHkHd
    [2021/11/17 02:00:36 GMT+0800] [DEBUG] ===============================================
    [2021/11/17 02:00:36 GMT+0800] [INFO] Get response from agent when try to submit shell running job :
    [2021/11/17 02:00:36 GMT+0800] [INFO]
    {
    "jobResultList":[
    {
    "jobId":"a567f7f5-3c9e-4dfc-a464-bd477ac5b1ea",
    "status":"created",
    "errorCode":0,
    "failCount":0,
    "result":[
    
    ]
    }
    ],
    "agentId":"614853ee-c1c6-456d-9aa6-fc84ad1281ed"
    }
    [2021/11/17 02:00:36 GMT+0800] [DEBUG] ===============================================
    [2021/11/17 02:05:56 GMT+0800] [DEBUG] ===============================================
    [2021/11/17 02:05:56 GMT+0800] [INFO] Job Run finish , the raw output is :
    [2021/11/17 02:05:56 GMT+0800] [INFO]
    {
    "jobId":"a567f7f5-3c9e-4dfc-a464-bd477ac5b1ea",
    "status":"failed",
    "errorCode":3427,
    "errorMessage":"Shell script job execute failed.",
    "failCount":0,
    "result":[
    {
    "is_success":false,
    "exeTime":300.609
    }
    ]
    }
    [2021/11/17 02:05:56 GMT+0800] [DEBUG] ===============================================
    [2021/11/17 02:05:56 GMT+0800] [DEBUG] ===============================================
    [2021/11/17 02:05:56 GMT+0800] [INFO] The return code is : [-1].
    [2021/11/17 02:05:56 GMT+0800] [DEBUG] ===============================================
    [2021/11/17 02:05:56 GMT+0800] [INFO] Execute shell script job finished.
    [2021/11/17 02:05:56 GMT+0800] [ERROR] Shell exit code is not 0
    [2021/11/17 02:05:56 GMT+0800] [DEBUG] ===============================================
    [2021/11/17 02:05:56 GMT+0800] [ERROR] Shell script job execute failed. Please contact ECS Service.
    [2021/11/17 02:05:56 GMT+0800] [ERROR] Exception message: RuntimeException: Shell script job execute failed. Please contact ECS Service.
    [2021/11/17 02:05:56 GMT+0800] [ERROR] Root Cause message:RuntimeException: Shell script job execute failed. Please contact ECS Service.
  2. Check the sshd_config parameters on the ECS.

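    Cause: the SSH session timed out and was disconnected, so the Shell node was marked as failed. In the log above, the job is reported as failed with return code -1 after about 300 seconds (exeTime 300.609), which is consistent with the session being dropped while the job was still in progress.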
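To see which keepalive values are currently set on the ECS, a minimal sketch such as the following may help. The helper name `show_keepalive` is hypothetical and is not part of DataArts Studio or OpenSSH; it only reads the file and prints any active (uncommented) ClientAliveInterval and ClientAliveCountMax lines. If it prints nothing, sshd is presumably using its built-in defaults.

```shell
#!/bin/sh
# Hypothetical helper: print the active (uncommented) keepalive settings
# found in an sshd configuration file. Defaults to /etc/ssh/sshd_config.
show_keepalive() {
    config_file="${1:-/etc/ssh/sshd_config}"
    # Match only lines that begin with the keyword (optionally indented),
    # so commented-out entries such as "#ClientAliveInterval 0" are skipped.
    # sshd keywords are case-insensitive, hence -i.
    grep -Ei '^[[:space:]]*(ClientAliveInterval|ClientAliveCountMax)[[:space:]]' "$config_file"
}
```

On the ECS this would be called as `show_keepalive` with no arguments. As an alternative, running `sshd -T | grep -i clientalive` as root prints the effective values, including defaults, with the keywords in lowercase.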

Solution

  1. Edit the /etc/ssh/sshd_config file on the ECS and add or update the following two values.

    ClientAliveInterval 300

    ClientAliveCountMax 3

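    ClientAliveInterval specifies the interval, in seconds, at which the server sends a request message to the client. The default is 0, which means no requests are sent. ClientAliveInterval 300 makes the server send a request every five minutes; the client responds, and the connection is kept alive. ClientAliveCountMax (default: 3) is the number of server requests that may go unanswered before the server disconnects the client automatically. Under normal conditions the client responds, so this limit is not reached.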

  2. After making the change, restart sshd on the ECS.

  3. Check that sshd has started successfully.
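The three steps above can be sketched as a small script. This is only an illustration under stated assumptions: the helper `set_sshd_option` is hypothetical, the ECS is assumed to run a systemd-based Linux distribution, and the service unit is assumed to be named `sshd` (some distributions name it `ssh`). The source page does not show the exact restart and check commands, so the ones in the comments are common choices rather than the documented ones.

```shell
#!/bin/sh
# Hypothetical helper (not part of DataArts Studio or OpenSSH): set KEY to
# VALUE in an sshd configuration file, replacing any active entry for KEY.
set_sshd_option() {
    config_file="$1"
    key="$2"
    value="$3"
    tmp_file="$(mktemp)" || return 1
    {
        # Write the new setting first: sshd uses the first value it finds
        # for a keyword, so this also keeps it ahead of any Match block.
        printf '%s %s\n' "$key" "$value"
        # Copy every other line, dropping existing active entries for KEY.
        # Commented-out entries are left untouched.
        grep -Eiv "^[[:space:]]*${key}([[:space:]=]|\$)" "$config_file" || true
    } > "$tmp_file"
    # Overwrite the contents in place so the file keeps its owner and mode.
    cat "$tmp_file" > "$config_file"
    rm -f "$tmp_file"
}

# Example run on the ECS as root (commented out so that sourcing this file
# changes nothing). The unit name "sshd" is an assumption.
#   cp /etc/ssh/sshd_config /etc/ssh/sshd_config.bak
#   set_sshd_option /etc/ssh/sshd_config ClientAliveInterval 300
#   set_sshd_option /etc/ssh/sshd_config ClientAliveCountMax 3
#   sshd -t                      # validate the file before restarting
#   systemctl restart sshd
#   systemctl is-active sshd     # prints "active" when sshd is running
```

With these values the server probes an idle connection every 300 seconds and disconnects a client only after 3 consecutive probes go unanswered, that is, after roughly 900 seconds without any response.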
