= CompressionCodec
CompressionCodec is an abstraction of <<implementations, compression codecs>>.

A concrete CompressionCodec is supposed to implement the <<compressedInputStream, compressedInputStream>> and <<compressedOutputStream, compressedOutputStream>> methods.

The default compression codec is configured using the configuration-properties.md#spark.io.compression.codec[spark.io.compression.codec] configuration property.
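For instance, switching the codec for a Spark application comes down to setting the property on a SparkConf (a minimal sketch; any of the <<shortCompressionCodecNames, aliases>> below, or a fully-qualified codec class name, works):

[source,scala]
----
import org.apache.spark.SparkConf

// Use Zstandard instead of the default lz4 for internal I/O compression
val conf = new SparkConf()
  .setAppName("compression-demo")
  .set("spark.io.compression.codec", "zstd")
----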
== [[implementations]][[shortCompressionCodecNames]] Available CompressionCodecs
[cols="30,10m,60",options="header",width="100%"]
|===
| CompressionCodec
| Alias
| Description

| LZ4CompressionCodec
| lz4
a| [[LZ4CompressionCodec]] https://github.com/lz4/lz4-java[LZ4 compression]

- The default compression codec, based on the configuration-properties.md#spark.io.compression.codec[spark.io.compression.codec] configuration property
- Uses the configuration-properties.md#spark.io.compression.lz4.blockSize[spark.io.compression.lz4.blockSize] configuration property for the block size

| LZFCompressionCodec
| lzf
| [[LZFCompressionCodec]] https://github.com/ning/compress[LZF compression]

| SnappyCompressionCodec
| snappy
a| [[SnappyCompressionCodec]] https://google.github.io/snappy/[Snappy compression]

- Uses the configuration-properties.md#spark.io.compression.snappy.blockSize[spark.io.compression.snappy.blockSize] configuration property for the block size

| ZStdCompressionCodec
| zstd
a| [[ZStdCompressionCodec]] https://facebook.github.io/zstd/[ZStandard compression]

- Uses the configuration-properties.md#spark.io.compression.zstd.bufferSize[spark.io.compression.zstd.bufferSize] configuration property for the buffer size
- Uses the configuration-properties.md#spark.io.compression.zstd.level[spark.io.compression.zstd.level] configuration property for the compression level

|===
== [[compressedOutputStream]] Compressing Output Stream
[source,scala]
----
compressedOutputStream(
  s: OutputStream): OutputStream
----
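As an illustration, the following sketch wraps an in-memory stream for compression. Note that CompressionCodec is a private[spark] API, so the snippet assumes it is compiled in a package under org.apache.spark (e.g. in Spark's own test sources):

[source,scala]
----
// Assumes a package under org.apache.spark (CompressionCodec is private[spark])
import java.io.ByteArrayOutputStream
import org.apache.spark.SparkConf
import org.apache.spark.io.CompressionCodec

val codec = CompressionCodec.createCodec(new SparkConf()) // lz4 by default
val bytes = new ByteArrayOutputStream()
val out = codec.compressedOutputStream(bytes)
out.write("hello, spark".getBytes("UTF-8"))
out.close() // flushes and finishes the compressed block
println(s"compressed size: ${bytes.size()} bytes")
----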
compressedOutputStream is used when:

- TorrentBroadcast is requested to blockifyObject
- ReliableCheckpointRDD is requested to writePartitionToCheckpointFile
- EventLoggingListener is requested to spark-history-server:EventLoggingListener.md#start[start]
- GenericAvroSerializer is requested to compress a schema
- SerializerManager is requested to serializer:SerializerManager.md#wrapForCompression[wrap an output stream of a block for compression]
- UnsafeShuffleWriter is requested to shuffle:UnsafeShuffleWriter.md#mergeSpillsWithFileStream[mergeSpillsWithFileStream]
== [[compressedInputStream]] Decompressing Input Stream
[source,scala]
----
compressedInputStream(
  s: InputStream): InputStream
----
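A round-trip sketch combining both methods (same private[spark] caveat as above; InputStream.readAllBytes requires Java 9+):

[source,scala]
----
import java.io.{ByteArrayInputStream, ByteArrayOutputStream}
import org.apache.spark.SparkConf
import org.apache.spark.io.CompressionCodec

val codec = CompressionCodec.createCodec(new SparkConf(), "lz4")

// Compress
val buffer = new ByteArrayOutputStream()
val out = codec.compressedOutputStream(buffer)
out.write("round trip".getBytes("UTF-8"))
out.close()

// Decompress
val in = codec.compressedInputStream(new ByteArrayInputStream(buffer.toByteArray))
assert(new String(in.readAllBytes(), "UTF-8") == "round trip")
----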
compressedInputStream is used when:

- TorrentBroadcast is requested to unBlockifyObject
- ReliableCheckpointRDD is requested to readCheckpointFile
- EventLoggingListener is requested to spark-history-server:EventLoggingListener.md#openEventLog[openEventLog]
- GenericAvroSerializer is requested to decompress a schema
- SerializerManager is requested to serializer:SerializerManager.md#wrapForCompression[wrap an input stream of a block for compression]
- UnsafeShuffleWriter is requested to shuffle:UnsafeShuffleWriter.md#mergeSpillsWithFileStream[mergeSpillsWithFileStream]
== [[createCodec]] Creating CompressionCodec
[source,scala]
----
createCodec(
  conf: SparkConf): CompressionCodec
createCodec(
  conf: SparkConf,
  codecName: String): CompressionCodec
----
createCodec creates a CompressionCodec for the given codec name (using a constructor that accepts a SparkConf.md[SparkConf]).

When no codec name is given, createCodec uses <<getCodecName, getCodecName>> to find it (in the given SparkConf.md[SparkConf]).

createCodec finds the codec's class name in the <<shortCompressionCodecNames, shortCompressionCodecNames>> registry (or falls back to the given codec name, assuming it is a fully-qualified class name).
createCodec throws an IllegalArgumentException if a compression codec could not be found:

[source,plaintext]
----
Codec [codecName] is not available. Consider setting spark.io.compression.codec=snappy
----
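A sketch of the lookup behavior (same private[spark] caveat as above): a short alias and a fully-qualified class name are both accepted, while an unknown name throws the IllegalArgumentException shown above.

[source,scala]
----
import org.apache.spark.SparkConf
import org.apache.spark.io.CompressionCodec

val conf = new SparkConf()
val byAlias = CompressionCodec.createCodec(conf, "zstd")
val byClassName = CompressionCodec.createCodec(
  conf, "org.apache.spark.io.LZ4CompressionCodec")

// Throws IllegalArgumentException:
// Codec [unknown] is not available. Consider setting spark.io.compression.codec=snappy
// CompressionCodec.createCodec(conf, "unknown")
----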
createCodec is used when:

- TorrentBroadcast is requested to setConf
- ReliableCheckpointRDD is requested to writePartitionToCheckpointFile and readCheckpointFile
- spark-history-server:EventLoggingListener.md[EventLoggingListener] is created and requested to spark-history-server:EventLoggingListener.md#openEventLog[openEventLog]
- GenericAvroSerializer is created
- SerializerManager is created
- UnsafeShuffleWriter is requested to shuffle:UnsafeShuffleWriter.md#mergeSpills[merge spills]
== [[getCodecName]] Finding Compression Codec Name

[source,scala]
----
getCodecName(
  conf: SparkConf): String
----

getCodecName takes the name of the compression codec from the configuration-properties.md#spark.io.compression.codec[spark.io.compression.codec] configuration property (in the given SparkConf.md[SparkConf]) if set, or defaults to lz4.
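In other words (a sketch, with the same private[spark] caveat as elsewhere on this page):

[source,scala]
----
import org.apache.spark.SparkConf
import org.apache.spark.io.CompressionCodec

// Defaults to lz4 when the property is not set
assert(CompressionCodec.getCodecName(new SparkConf()) == "lz4")

// Honours spark.io.compression.codec when set
val conf = new SparkConf().set("spark.io.compression.codec", "snappy")
assert(CompressionCodec.getCodecName(conf) == "snappy")
----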
getCodecName is used when:

- SparkContext is created
- CompressionCodec utility is used to <<createCodec, create a CompressionCodec>>
== [[supportsConcatenationOfSerializedStreams]] supportsConcatenationOfSerializedStreams Method
[source,scala]
----
supportsConcatenationOfSerializedStreams(
  codec: CompressionCodec): Boolean
----
supportsConcatenationOfSerializedStreams returns true when the given CompressionCodec is one of the <<implementations, built-in CompressionCodecs>> (Snappy, LZF, LZ4 or ZStd), and false otherwise.
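For example (a sketch; the private[spark] caveat applies):

[source,scala]
----
import org.apache.spark.SparkConf
import org.apache.spark.io.CompressionCodec

val lz4 = CompressionCodec.createCodec(new SparkConf(), "lz4")
assert(CompressionCodec.supportsConcatenationOfSerializedStreams(lz4))
----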
supportsConcatenationOfSerializedStreams is used when UnsafeShuffleWriter is requested to shuffle:UnsafeShuffleWriter.md#mergeSpills[merge spills].