NodeManager OOM Occurs During Spark Application Execution
Question
If YARN's External Shuffle Service is enabled and a Spark application opens an excessive number of shuffle connections, the error "java.lang.OutOfMemoryError: Direct buffer memory" is reported, which indicates that the direct buffer memory is insufficient. The error log is as follows:
2016-12-06 02:01:00,768 | WARN | shuffle-server-38 | Exception in connection from /192.168.101.95:53680 | TransportChannelHandler.java:79
io.netty.handler.codec.DecoderException: java.lang.OutOfMemoryError: Direct buffer memory
    at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:153)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
    at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:787)
    at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:130)
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
    at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.OutOfMemoryError: Direct buffer memory
    at java.nio.Bits.reserveMemory(Bits.java:693)
    at java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:123)
    at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:311)
    at io.netty.buffer.PoolArena$DirectArena.newChunk(PoolArena.java:434)
    at io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:179)
    at io.netty.buffer.PoolArena.allocate(PoolArena.java:168)
    at io.netty.buffer.PoolArena.reallocate(PoolArena.java:277)
    at io.netty.buffer.PooledByteBuf.capacity(PooledByteBuf.java:108)
    at io.netty.buffer.AbstractByteBuf.ensureWritable(AbstractByteBuf.java:251)
    at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:849)
    at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:841)
    at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:831)
    at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:146)
    ... 10 more
Answer
YARN's External Shuffle Service starts twice as many threads as there are available vCPUs, but the default direct buffer memory is only 128 MB. When a large number of shuffle connections are established simultaneously, each thread therefore gets only a small share of the direct buffer memory. For instance, on a node with 40 vCPUs, the External Shuffle Service starts 80 threads that share the 128 MB of direct buffer memory, so each thread gets less than 2 MB on average (128 MB / 80 threads = 1.6 MB).
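The arithmetic above can be checked with a short sketch. The function name `per_thread_direct_memory_mb` is hypothetical and not part of YARN or Spark. The sketch assumes only what this answer states: two shuffle threads per vCPU, all sharing one direct buffer memory pool evenly.

```python
def per_thread_direct_memory_mb(vcpus: int, direct_memory_mb: float = 128.0) -> float:
    """Average direct buffer memory available to each shuffle thread, in MB.

    Assumes the shuffle service starts two threads per vCPU and that the
    threads share the direct buffer memory evenly (a simplification).
    """
    if vcpus <= 0:
        raise ValueError("vcpus must be positive")
    threads = 2 * vcpus  # twice the number of available vCPUs
    return direct_memory_mb / threads


if __name__ == "__main__":
    # 40 vCPUs with the default 128 MB of direct buffer memory
    print(per_thread_direct_memory_mb(40))
    # The same node after raising direct buffer memory to 512 MB
    print(per_thread_direct_memory_mb(40, 512))
```

With 40 vCPUs, the per-thread average rises from 1.6 MB at 128 MB to 6.4 MB at 512 MB.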
Therefore, you are advised to set the direct buffer memory based on the number of vCPUs of the NodeManager nodes in the cluster. For example, if a node has 40 vCPUs, set the direct buffer memory to 512 MB by adding the following option to the GC_OPTS parameter of NodeManager:
-XX:MaxDirectMemorySize=512M
The -XX:MaxDirectMemorySize option is not included in GC_OPTS by default. To set it, add it to the existing value of the GC_OPTS parameter.
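Because the option is added to an existing GC_OPTS string, it helps to avoid ending up with two copies of the flag. The following is a minimal sketch of that edit as a string operation. The helper `with_max_direct_memory` and the sample GC_OPTS values are hypothetical; the real value is edited in FusionInsight Manager, not through this code.

```python
import re

# Matches an existing -XX:MaxDirectMemorySize=<value> option in a JVM option string.
_MAX_DIRECT_FLAG = re.compile(r"-XX:MaxDirectMemorySize=\S+")


def with_max_direct_memory(gc_opts: str, size: str = "512M") -> str:
    """Return gc_opts with -XX:MaxDirectMemorySize set to size.

    Replaces the option if it is already present; otherwise appends it.
    """
    flag = f"-XX:MaxDirectMemorySize={size}"
    if _MAX_DIRECT_FLAG.search(gc_opts):
        return _MAX_DIRECT_FLAG.sub(flag, gc_opts)
    return f"{gc_opts} {flag}".strip()


if __name__ == "__main__":
    # Hypothetical GC_OPTS value, used for illustration only
    print(with_max_direct_memory("-Xms2G -Xmx4G"))
```

Replacing an existing occurrence instead of appending a second one keeps the option string unambiguous.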
Perform the following operations to configure this parameter:
Log in to FusionInsight Manager, click Cluster, click the name of the cluster, choose Services > Yarn, click Configurations, and click All Configurations. Click NodeManager, select System, and modify the configuration in the GC_OPTS parameter in the right pane.
Parameter | Description | Default Value |
---|---|---|
GC_OPTS | GC parameter of YARN NodeManager | 128M |