ORC
Bases: ReadWriteFileFormat
ORC file format (columnar).
Based on the `Spark ORC Files <https://spark.apache.org/docs/latest/sql-data-sources-orc.html>`_ file format.
Supports reading/writing files with .orc extension.
.. versionadded:: 0.9.0
Examples
.. note::

    You can pass any option mentioned in the
    `official documentation <https://spark.apache.org/docs/latest/sql-data-sources-orc.html>`_.

    **Option names should be in** ``camelCase``!

    The set of supported options depends on the Spark version.

    You may also set options mentioned in the `orc-java documentation <https://orc.apache.org/docs/core-java-config.html>`_.
    These are prefixed with ``orc.`` and contain dots in their names, so instead of calling the constructor
    as ``ORC(orc.option=True)`` (which is invalid Python syntax), call ``ORC.parse({"orc.option": True})``.
.. tabs::

    .. code-tab:: py Reading files

        from onetl.file.format import ORC

        orc = ORC(mergeSchema=True)

    .. code-tab:: py Writing files

        from onetl.file.format import ORC

        orc = ORC.parse(
            {
                "compression": "snappy",
                # Enable Bloom filter for columns 'id' and 'name'
                "orc.bloom.filter.columns": "id,name",
                # Set Bloom filter false positive probability
                "orc.bloom.filter.fpp": 0.01,
                # Do not use dictionary encoding for 'highly_selective_column'
                "orc.column.encoding.direct": "highly_selective_column",
                # other options
            }
        )
``mergeSchema = None``
Merge the schemas of all ORC files being read into a single schema.

By default, the value of the Spark config option ``spark.sql.orc.mergeSchema`` is used (``false``).
.. note::

    Used only for reading files.
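This simplified sketch illustrates what schema merging means. It is plain Python with a hypothetical ``merge_schemas`` helper; the real merging is performed by Spark over ORC file metadata:

```python
def merge_schemas(*schemas):
    """Union of columns across files; first-seen type wins (simplified)."""
    merged = {}
    for schema in schemas:
        for column, dtype in schema.items():
            merged.setdefault(column, dtype)
    return merged


# Two ORC files with partially overlapping columns,
# represented here as {column: type} dicts:
schema_a = {"id": "bigint", "name": "string"}
schema_b = {"id": "bigint", "created_at": "timestamp"}

# With mergeSchema=True, Spark produces a schema covering all columns;
# files missing a column yield nulls for it.
merged = merge_schemas(schema_a, schema_b)
```

Without merging, Spark infers the schema from a subset of files, so columns present only in other files may be silently dropped.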
``compression = None``
Compression codec of the ORC files.

By default, the value of the Spark config option ``spark.sql.orc.compression.codec`` is used (``snappy``).
.. note::

    Used only for writing files.