更新时间:2022-02-22 GMT+08:00

在Spark应用执行过程中NodeManager出现OOM异常

问题

当开启Yarn External Shuffle服务时,在Spark应用执行过程中,如果当前shuffle连接过多,Yarn External Shuffle会出现“java.lang.OutofMemoryError: Direct buffer Memory”的异常,该异常说明内存不足。错误日志如下:

2016-12-06 02:01:00,768 | WARN  | shuffle-server-38 | Exception in connection from /192.168.101.95:53680 | TransportChannelHandler.java:79
io.netty.handler.codec.DecoderException: java.lang.OutOfMemoryError: Direct buffer memory
        at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:153)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
        at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:787)
        at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:130)
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
        at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.OutOfMemoryError: Direct buffer memory
        at java.nio.Bits.reserveMemory(Bits.java:693)
        at java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:123)
        at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:311)
        at io.netty.buffer.PoolArena$DirectArena.newChunk(PoolArena.java:434)
        at io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:179)
        at io.netty.buffer.PoolArena.allocate(PoolArena.java:168)
        at io.netty.buffer.PoolArena.reallocate(PoolArena.java:277)
        at io.netty.buffer.PooledByteBuf.capacity(PooledByteBuf.java:108)
        at io.netty.buffer.AbstractByteBuf.ensureWritable(AbstractByteBuf.java:251)
        at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:849)
        at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:841)
        at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:831)
        at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:146)
        ... 10 more

回答

对于Yarn的Shuffle Service,其启动的线程数为机器可用CPU核数的两倍,而默认配置的Direct buffer Memory为128M,因此当有较多shuffle同时连接时,平均分配到各线程所能使用的Direct buffer Memory将较低(例如,当机器的CPU为40核,Yarn的Shuffle Service启动的线程数为80,80个线程共享进程里的Direct buffer Memory,这种场景下每个线程分配到的内存将不足2MB)。

因此建议根据集群中的NodeManager节点的CPU核数适当调整Direct buffer Memory,例如在CPU核数为40时,将Direct buffer Memory配置为512M。即配置集群NodeManger的“GC_OPTS”参数,如:

-XX:MaxDirectMemorySize=512M

GC_OPTS参数中-XX:MaxDirectMemorySize默认没有配置,如需配置,用户可在GC_OPTS参数中自定义添加。

具体的配置方法如下:

用户可登录FusionInsight Manager,单击集群 > 待操作集群的名称 > 服务 > Yarn > 配置,单击全部配置,单击NodeManager > 系统,在“GC_OPTS”参数中修改配置。

表1 参数说明

参数

描述

默认值

GC_OPTS

Yarn NodeManger的GC参数。

128M