Parquet
Bases: ReadWriteFileFormat
Parquet file format (columnar).
Based on Spark Parquet Files file format.
Supports reading/writing files with .parquet extension.
Added in 0.9.0
Examples
Note
You can pass any option mentioned in the
official Spark documentation.
Option names must be in camelCase!
The set of supported options depends on the Spark version.
You may also set options mentioned in the
parquet-hadoop documentation.
Their names are prefixed with parquet. and contain dots,
so they cannot be passed as keyword arguments: Parquet(parquet.option=True) is invalid Python.
Pass them as a dictionary instead: Parquet.parse({"parquet.option": True}).
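The restriction on dotted option names comes from Python syntax, not from the library. The sketch below uses a hypothetical stand-in function, `make_options` (not part of onETL), to show that a dotted name is rejected as a keyword argument at compile time, while the same name is accepted as a dictionary key.

```python
# Hypothetical stand-in for any callable that accepts arbitrary options.
# It is NOT part of onETL; it only illustrates the Python syntax rule.
def make_options(**options):
    return dict(options)


# A dotted name cannot be used as a keyword argument:
# the expression fails to compile with a SyntaxError.
try:
    compile("make_options(parquet.option=True)", "<example>", "eval")
    dotted_keyword_is_valid = True
except SyntaxError:
    dotted_keyword_is_valid = False

# The same name is fine as a dictionary key, which is why
# dotted options are passed as a dict.
options = make_options(
    **{
        "parquet.option": True,
        "parquet.bloom.filter.enabled#id": True,
    }
)
```

This is why options with dots or `#` in their names are passed as a dictionary, while plain camelCase options such as `mergeSchema` can be passed directly as keyword arguments.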
from onetl.file.format import Parquet

parquet = Parquet(mergeSchema=True)

from onetl.file.format import Parquet

parquet = Parquet.parse(
    {
        "compression": "snappy",
        # Enable Bloom filter for columns 'id' and 'name'
        "parquet.bloom.filter.enabled#id": True,
        "parquet.bloom.filter.enabled#name": True,
        # Set expected number of distinct values for column 'id'
        "parquet.bloom.filter.expected.ndv#id": 10_000_000,
        # other options
    }
)
Source code in onetl/file/format/parquet.py
mergeSchema = None
class-attribute
instance-attribute
Merge schemas of all Parquet files being read into a single schema.
By default, the value of the Spark config option spark.sql.parquet.mergeSchema is used (false unless changed).
Note
Used only for reading files.
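To make "merging schemas" concrete, here is a toy model in plain Python. It is not how Spark implements the feature: schemas are modelled as simple column-to-type dictionaries, the helper `merge_schemas` is hypothetical, and any type mismatch is treated as an error, which is a simplification.

```python
def merge_schemas(schemas):
    """Toy model: union the columns of several per-file schemas.

    Each schema is a dict mapping column name to a type name.
    Simplifying assumption: a column seen with two different types is an error.
    """
    merged = {}
    for schema in schemas:
        for column, dtype in schema.items():
            if column in merged and merged[column] != dtype:
                raise ValueError(
                    f"Column {column!r} has conflicting types: "
                    f"{merged[column]} and {dtype}"
                )
            merged.setdefault(column, dtype)
    return merged


# Two files written at different times, with different sets of columns.
file_a = {"id": "long", "name": "string"}
file_b = {"id": "long", "age": "int"}

merged = merge_schemas([file_a, file_b])
```

In this model the merged result contains `id`, `name` and `age`. Without merging, a reader would typically see only the columns of whichever file's schema it picked.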
compression = None
class-attribute
instance-attribute
Compression codec of the Parquet files.
By default, the value of the Spark config option spark.sql.parquet.compression.codec is used (snappy unless changed).
Note
Used only for writing files.
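The fallback described for both attributes can be pictured as a three-step lookup: an explicitly set option wins, then the Spark session config, then Spark's built-in default. The sketch below is a conceptual model only. The helper `effective_value` is hypothetical, and the session config is modelled as a plain dictionary rather than a real Spark session.

```python
from typing import Optional

# Built-in defaults for the two Spark settings named above,
# as stated in the attribute descriptions.
SPARK_DEFAULTS = {
    "spark.sql.parquet.mergeSchema": "false",
    "spark.sql.parquet.compression.codec": "snappy",
}


def effective_value(explicit: Optional[str], session_conf: dict, key: str) -> str:
    """Hypothetical helper modelling the lookup order.

    1. An explicitly passed option (not None) wins.
    2. Otherwise the value from the session config is used, if present.
    3. Otherwise the built-in default applies.
    """
    if explicit is not None:
        return explicit
    return session_conf.get(key, SPARK_DEFAULTS[key])


codec_key = "spark.sql.parquet.compression.codec"

# Nothing set anywhere: the built-in default applies.
default_codec = effective_value(None, {}, codec_key)

# Session config overrides the built-in default.
session_codec = effective_value(None, {codec_key: "zstd"}, codec_key)

# An explicit option overrides the session config.
explicit_codec = effective_value("gzip", {codec_key: "zstd"}, codec_key)
```

Leaving `compression` or `mergeSchema` as `None` therefore means "defer to the Spark session", not "disabled".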