Skip to content

CompressionCodec

CompressionCodec is an abstraction of <>.

A concrete CompressionCodec is supposed to <>.

The default compression codec is configured using configuration-properties.md#spark.io.compression.codec[spark.io.compression.codec] configuration property.

== [[implementations]][[shortCompressionCodecNames]] Available CompressionCodecs

[cols="30,10m,60",options="header",width="100%"] |=== | CompressionCodec | Alias | Description

| LZ4CompressionCodec | lz4 a| [[LZ4CompressionCodec]] https://github.com/lz4/lz4-java[LZ4 compression]

  • The default compression codec based on configuration-properties.md#spark.io.compression.codec[spark.io.compression.codec] configuration property

  • Uses configuration-properties.md#spark.io.compression.lz4.blockSize[spark.io.compression.lz4.blockSize] configuration property for the block size

| LZFCompressionCodec | lzf | [[LZFCompressionCodec]] https://github.com/ning/compress[LZF compression]

| SnappyCompressionCodec | snappy a| [[SnappyCompressionCodec]] https://google.github.io/snappy/[Snappy compression]

  • Uses configuration-properties.md#spark.io.compression.snappy.blockSize[spark.io.compression.snappy.blockSize] configuration property for the block size

| ZStdCompressionCodec | zstd a| [[ZStdCompressionCodec]] https://facebook.github.io/zstd/[ZStandard compression]

  • configuration-properties.md#spark.io.compression.zstd.bufferSize[spark.io.compression.zstd.bufferSize] for the buffer size

  • configuration-properties.md#spark.io.compression.zstd.level[spark.io.compression.zstd.level] for the compression level

|===

== [[compressedOutputStream]] Compressing Output Stream

[source,scala]

compressedOutputStream( s: OutputStream): OutputStream


compressedOutputStream is used when:

  • TorrentBroadcast is requested to blockifyObject

  • ReliableCheckpointRDD is requested to writePartitionToCheckpointFile

  • EventLoggingListener is requested to spark-history-server:EventLoggingListener.md#start[start]

  • GenericAvroSerializer is requested to compress a schema

  • SerializerManager is requested to serializer:SerializerManager.md#wrapForCompression[wrap an output stream of a block for compression]

  • UnsafeShuffleWriter is requested to shuffle:UnsafeShuffleWriter.md#mergeSpillsWithFileStream[mergeSpillsWithFileStream]

== [[compressedInputStream]] Compressing Input Stream

[source,scala]

compressedInputStream( s: InputStream): InputStream


compressedInputStream is used when:

  • TorrentBroadcast is requested to unBlockifyObject

  • ReliableCheckpointRDD is requested to readCheckpointFile

  • EventLoggingListener is requested to spark-history-server:EventLoggingListener.md#openEventLog[openEventLog]

  • GenericAvroSerializer is requested to decompress a schema

  • SerializerManager is requested to serializer:SerializerManager.md#wrapForCompression[wrap an input stream of a block for compression]

  • UnsafeShuffleWriter is requested to shuffle:UnsafeShuffleWriter.md#mergeSpillsWithFileStream[mergeSpillsWithFileStream]

== [[createCodec]] Creating CompressionCodec

[source, scala]

createCodec( conf: SparkConf): CompressionCodec createCodec( conf: SparkConf, codecName: String): CompressionCodec


createCodec creates an instance of the compression codec by the given name (using a constructor that accepts a SparkConf.md[SparkConf]).

createCodec uses <> utility to find the codec name unless specified explicitly.

createCodec finds the class name in the <> internal lookup table or assumes that the codec name is already a fully-qualified class name.

createCodec throws an IllegalArgumentException exception if a compression codec could not be found:

[source,plaintext]

Codec [codecName] is not available. Consider setting spark.io.compression.codec=snappy

createCodec is used when:

  • TorrentBroadcast is requested to setConf

  • ReliableCheckpointRDD is requested to writePartitionToCheckpointFile and readCheckpointFile

  • spark-history-server:EventLoggingListener.md[EventLoggingListener] is created and requested to spark-history-server:EventLoggingListener.md#openEventLog[openEventLog]

  • GenericAvroSerializer is created

  • SerializerManager is created

  • UnsafeShuffleWriter is requested to shuffle:UnsafeShuffleWriter.md#mergeSpills[merge spills]

Finding Compression Codec Name

getCodecName(
  conf: SparkConf): String

getCodecName takes the name of a compression codec based on configuration-properties.md#spark.io.compression.codec[spark.io.compression.codec] configuration property (using the SparkConf.md[SparkConf]) if available or defaults to lz4.

getCodecName is used when:

== [[supportsConcatenationOfSerializedStreams]] supportsConcatenationOfSerializedStreams Method

[source, scala]

supportsConcatenationOfSerializedStreams( codec: CompressionCodec): Boolean


supportsConcatenationOfSerializedStreams returns true when the given CompressionCodec is one of the <>.

supportsConcatenationOfSerializedStreams is used when UnsafeShuffleWriter is requested to shuffle:UnsafeShuffleWriter.md#mergeSpills[merge spills].

Back to top