在Spark应用执行过程中NodeManager出现OOM异常
问题
当开启Yarn External Shuffle服务时,在Spark应用执行过程中,如果当前shuffle连接过多,Yarn External Shuffle会出现“java.lang.OutofMemoryError: Direct buffer Memory”的异常,该异常说明内存不足。错误日志如下:
2016-12-06 02:01:00,768 | WARN | shuffle-server-38 | Exception in connection from /192.168.101.95:53680 | TransportChannelHandler.java:79 io.netty.handler.codec.DecoderException: java.lang.OutOfMemoryError: Direct buffer memory at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:153) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319) at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:787) at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:130) at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511) at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468) at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382) at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354) at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116) at java.lang.Thread.run(Thread.java:745) Caused by: java.lang.OutOfMemoryError: Direct buffer memory at java.nio.Bits.reserveMemory(Bits.java:693) at java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:123) at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:311) at io.netty.buffer.PoolArena$DirectArena.newChunk(PoolArena.java:434) at io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:179) at io.netty.buffer.PoolArena.allocate(PoolArena.java:168) at io.netty.buffer.PoolArena.reallocate(PoolArena.java:277) at io.netty.buffer.PooledByteBuf.capacity(PooledByteBuf.java:108) at io.netty.buffer.AbstractByteBuf.ensureWritable(AbstractByteBuf.java:251) at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:849) at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:841) at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:831) at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:146) ... 10 more
回答
对于Yarn的Shuffle Service,其启动的线程数为机器可用CPU核数的两倍,而默认配置的Direct buffer Memory为128M,因此当有较多shuffle同时连接时,平均分配到各线程所能使用的Direct buffer Memory将较低(例如,当机器的CPU为40核,Yarn的Shuffle Service启动的线程数为80,80个线程共享进程里的Direct buffer Memory,这种场景下每个线程分配到的内存将不足2MB)。
因此建议根据集群中的NodeManager节点的CPU核数适当调整Direct buffer Memory,例如在CPU核数为40时,将Direct buffer Memory配置为512M。即配置集群NodeManager的“GC_OPTS”参数,如:
-XX:MaxDirectMemorySize=512M
GC_OPTS参数中-XX:MaxDirectMemorySize默认没有配置,如需配置,用户可在GC_OPTS参数中自定义添加。
具体的配置方法如下:
用户可登录FusionInsight Manager,单击“集群 > 服务 > Yarn > 配置”,单击 ,单击 ,在“GC_OPTS”参数中修改配置。
参数 |
描述 |
默认值 |
---|---|---|
GC_OPTS |
Yarn NodeManager的GC参数。 |
128M |