
ORC

Bases: ReadWriteFileFormat

ORC file format (columnar). |support_hooks|

Based on `Spark ORC Files <https://spark.apache.org/docs/latest/sql-data-sources-orc.html>`_ file format.

Supports reading/writing files with ``.orc`` extension.

.. versionadded:: 0.9.0

Examples

.. note ::

    You can pass any option mentioned in the
    `official documentation <https://spark.apache.org/docs/latest/sql-data-sources-orc.html>`_.
    **Option names should be in** ``camelCase``!

    The set of supported options depends on the Spark version.

    You may also set options mentioned in the `orc-java documentation <https://orc.apache.org/docs/core-java-config.html>`_.
    Their names are prefixed with ``orc.`` and contain dots, so instead of calling the constructor ``ORC(orc.option=True)`` (invalid Python syntax),
    call the method ``ORC.parse({"orc.option": True})``.

.. tabs::

    .. code-tab:: py Reading files

        from onetl.file.format import ORC

        orc = ORC(mergeSchema=True)

    .. tab:: Writing files

        .. code:: python

            from onetl.file.format import ORC

            orc = ORC.parse(
                {
                    "compression": "snappy",
                    # Enable Bloom filter for columns 'id' and 'name'
                    "orc.bloom.filter.columns": "id,name",
                    # Set Bloom filter false positive probability
                    "orc.bloom.filter.fpp": 0.01,
                    # Do not use dictionary for 'highly_selective_column'
                    "orc.column.encoding.direct": "highly_selective_column",
                    # other options
                }
            )
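
The restriction on dotted option names comes from Python's grammar rather than from the library. The sketch below uses only the standard ``ast`` module to check which of the call forms mentioned in the note parse at all. ``is_valid_expression`` is a hypothetical helper introduced for illustration, and nothing here imports or executes onETL: the strings are parsed, never evaluated.

```python
import ast


def is_valid_expression(source: str) -> bool:
    """Return True if `source` parses as a single Python expression."""
    try:
        ast.parse(source, mode="eval")
    except SyntaxError:
        return False
    return True


# A keyword argument name must be a plain identifier, so a dotted name is rejected
dotted_kwarg_ok = is_valid_expression("ORC(orc.option=True)")

# camelCase option names are ordinary identifiers and parse fine
camel_kwarg_ok = is_valid_expression("ORC(mergeSchema=True)")

# A dict key is just a string literal, so dots in it are not a problem
dict_key_ok = is_valid_expression('ORC.parse({"orc.option": True})')

print(dotted_kwarg_ok, camel_kwarg_ok, dict_key_ok)
```

This is why options with dots in their names are passed as dictionary keys to ``ORC.parse`` instead of as keyword arguments.
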
Source code in onetl/file/format/orc.py
@support_hooks
class ORC(ReadWriteFileFormat):
    """
    ORC file format (columnar). |support_hooks|

    Based on `Spark ORC Files <https://spark.apache.org/docs/latest/sql-data-sources-orc.html>`_ file format.

    Supports reading/writing files with ``.orc`` extension.

    .. versionadded:: 0.9.0

    Examples
    --------

    .. note ::

        You can pass any option mentioned in
        `official documentation <https://spark.apache.org/docs/latest/sql-data-sources-orc.html>`_.
        **Option names should be in** ``camelCase``!

        The set of supported options depends on Spark version.

        You may also set options mentioned in the `orc-java documentation <https://orc.apache.org/docs/core-java-config.html>`_.
        Their names are prefixed with ``orc.`` and contain dots, so instead of calling the constructor ``ORC(orc.option=True)`` (invalid Python syntax),
        call the method ``ORC.parse({"orc.option": True})``.

    .. tabs::

        .. code-tab:: py Reading files

            from onetl.file.format import ORC

            orc = ORC(mergeSchema=True)

        .. tab:: Writing files

            .. code:: python

                from onetl.file.format import ORC

                orc = ORC.parse(
                    {
                        "compression": "snappy",
                        # Enable Bloom filter for columns 'id' and 'name'
                        "orc.bloom.filter.columns": "id,name",
                        # Set Bloom filter false positive probability
                        "orc.bloom.filter.fpp": 0.01,
                        # Do not use dictionary for 'highly_selective_column'
                        "orc.column.encoding.direct": "highly_selective_column",
                        # other options
                    }
                )

    """

    name: ClassVar[str] = "orc"

    mergeSchema: Optional[bool] = None
    """
    Merge schemas of all ORC files being read into a single schema.
    By default, Spark config option ``spark.sql.orc.mergeSchema`` value is used (``False``).

    .. note::

        Used only for reading files.
    """

    compression: Union[
        str,
        Literal["uncompressed", "snappy", "zlib", "lzo", "zstd", "lz4"],
        None,
    ] = None
    """
    Compression codec of the ORC files.
    By default, Spark config option ``spark.sql.orc.compression.codec`` value is used (``snappy``).

    .. note::

        Used only for writing files.
    """

    class Config:
        known_options = ORC_JAVA_OPTIONS
        prohibited_options = PROHIBITED_OPTIONS
        extra = "allow"

    @slot
    def check_if_supported(self, spark: SparkSession) -> None:
        # always available
        pass

    def __repr__(self):
        options_dict = self.dict(by_alias=True, exclude_none=True)
        options_dict = dict(sorted(options_dict.items()))
        if any("." in field for field in options_dict.keys()):
            return f"{self.__class__.__name__}.parse({options_dict})"

        options_kwargs = ", ".join(f"{k}={v!r}" for k, v in options_dict.items())
        return f"{self.__class__.__name__}({options_kwargs})"
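
The branching in ``__repr__`` can be tried out without Spark or onETL installed. The following is a minimal standalone sketch, not the library's code: ``format_options`` is a hypothetical function that takes a plain dict, and a simple ``None`` filter stands in for ``self.dict(by_alias=True, exclude_none=True)``.

```python
from typing import Any, Dict


def format_options(class_name: str, options: Dict[str, Any]) -> str:
    """Mimic the repr branching above for a plain dict of options."""
    # Drop unset options and sort the rest by name
    cleaned = dict(sorted((k, v) for k, v in options.items() if v is not None))

    # Dotted names cannot be keyword arguments, so fall back to the parse() form
    if any("." in key for key in cleaned):
        return f"{class_name}.parse({cleaned})"

    kwargs = ", ".join(f"{k}={v!r}" for k, v in cleaned.items())
    return f"{class_name}({kwargs})"


plain = format_options("ORC", {"mergeSchema": True, "compression": "snappy"})
dotted = format_options(
    "ORC",
    {"orc.bloom.filter.columns": "id,name", "compression": "snappy", "mergeSchema": None},
)

print(plain)   # ORC(compression='snappy', mergeSchema=True)
print(dotted)  # ORC.parse({'compression': 'snappy', 'orc.bloom.filter.columns': 'id,name'})
```

In both forms the printed text is a valid Python expression, which appears to be the reason for switching to ``.parse({...})`` as soon as any option name contains a dot.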

mergeSchema = None class-attribute instance-attribute

Merge schemas of all ORC files being read into a single schema. By default, the value of the Spark config option ``spark.sql.orc.mergeSchema`` is used (``False``).

.. note::

    Used only for reading files.
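
As a conceptual illustration of what merging schemas into a single schema means, the sketch below takes the union of the columns found in several files. It is plain Python and not Spark's implementation: ``merge_schemas``, the ``Schema`` alias, and the sample columns are hypothetical, and treating any type mismatch as an error is a simplifying assumption (Spark's actual type-compatibility rules are not reproduced here).

```python
from typing import Dict, List

# Simplified schema: column name -> type name
Schema = Dict[str, str]


def merge_schemas(schemas: List[Schema]) -> Schema:
    """Union the columns of all schemas, keeping first-seen column order."""
    merged: Schema = {}
    for schema in schemas:
        for column, dtype in schema.items():
            known = merged.setdefault(column, dtype)
            # Assumption: identical types are required for a column to merge
            if known != dtype:
                raise ValueError(
                    f"Column {column!r} has conflicting types: {known} vs {dtype}"
                )
    return merged


file_a = {"id": "bigint", "name": "string"}
file_b = {"id": "bigint", "created_at": "timestamp"}

merged = merge_schemas([file_a, file_b])
print(merged)  # {'id': 'bigint', 'name': 'string', 'created_at': 'timestamp'}
```
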

compression = None class-attribute instance-attribute

Compression codec of the ORC files. By default, the value of the Spark config option ``spark.sql.orc.compression.codec`` is used (``snappy``).

.. note::

    Used only for writing files.