
ORC

Bases: ReadWriteFileFormat

ORC file format (columnar). Supports hooks.

Based on Spark ORC Files file format.

Supports reading/writing files with .orc extension.

Added in 0.9.0

Examples

Note

You can pass any option mentioned in the official documentation. Option names should be in camelCase!

The set of supported options depends on Spark version.

You may also set options mentioned in the orc-java documentation. Their names are prefixed with orc. and contain dots, so instead of calling the constructor ORC(orc.option=True) (invalid in Python), call the method ORC.parse({"orc.option": True}).

Reading files

from onetl.file.format import ORC

orc = ORC(mergeSchema=True)

Writing files

from onetl.file.format import ORC

orc = ORC.parse(
    {
        "compression": "snappy",
        # Enable Bloom filter for columns 'id' and 'name'
        "orc.bloom.filter.columns": "id,name",
        # Set Bloom filter false positive probability
        "orc.bloom.filter.fpp": 0.01,
        # Do not use dictionary for 'highly_selective_column'
        "orc.column.encoding.direct": "highly_selective_column",
        # other options
    }
)
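The restriction on dotted option names comes from Python syntax, not from the library. A minimal stdlib-only sketch (the source string and variable names below are illustrative assumptions) shows that a dotted keyword argument does not even compile, while the same name is a perfectly valid dictionary key:

```python
# A dotted name cannot be used as a keyword argument: the call below
# is rejected by the Python compiler before any code runs.
try:
    compile("ORC(orc.option=True)", "<example>", "eval")
    dotted_keyword_is_valid = True
except SyntaxError:
    dotted_keyword_is_valid = False

# The same name works as a dictionary key, which is the shape
# of the argument passed to ORC.parse(...) in the example above.
options = {"compression": "snappy", "orc.bloom.filter.columns": "id,name"}

print(dotted_keyword_is_valid)  # False
print("orc.bloom.filter.columns" in options)  # True
```

This is why options taken from the orc-java documentation are passed as a dictionary rather than as keyword arguments.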
Source code in onetl/file/format/orc.py
@support_hooks
class ORC(ReadWriteFileFormat):
    """
    ORC file format (columnar). [![support hooks](https://img.shields.io/badge/%20-support%20hooks-blue)](/hooks/)

    Based on [Spark ORC Files](https://spark.apache.org/docs/latest/sql-data-sources-orc.html) file format.

    Supports reading/writing files with `.orc` extension.

    !!! success "Added in 0.9.0"

    Examples
    --------

    !!! note

        You can pass any option mentioned in
        [official documentation](https://spark.apache.org/docs/latest/sql-data-sources-orc.html).
        **Option names should be in** `camelCase`!

        The set of supported options depends on Spark version.

        You may also set options mentioned in the
        [orc-java documentation](https://orc.apache.org/docs/core-java-config.html).
        Their names are prefixed with `orc.` and contain dots,
        so instead of calling the constructor `ORC(orc.option=True)` (invalid in Python),
        call the method `ORC.parse({"orc.option": True})`.

    === "Reading files"

        ```python
        from onetl.file.format import ORC

        orc = ORC(mergeSchema=True)
        ```
    === "Writing files"

        ```python
        from onetl.file.format import ORC

        orc = ORC.parse(
            {
                "compression": "snappy",
                # Enable Bloom filter for columns 'id' and 'name'
                "orc.bloom.filter.columns": "id,name",
                # Set Bloom filter false positive probability
                "orc.bloom.filter.fpp": 0.01,
                # Do not use dictionary for 'highly_selective_column'
                "orc.column.encoding.direct": "highly_selective_column",
                # other options
            }
        )
        ```
    """

    name: ClassVar[str] = "orc"

    mergeSchema: Optional[bool] = None
    """
    Merge schemas of all ORC files being read into a single schema.
    By default, the value of the Spark config option `spark.sql.orc.mergeSchema` is used (`False`).

    !!! note

        Used only for reading files.
    """

    compression: Union[
        str,
        Literal["uncompressed", "snappy", "zlib", "lzo", "zstd", "lz4"],
        None,
    ] = None
    """
    Compression codec of the ORC files.
    By default, the value of the Spark config option `spark.sql.orc.compression.codec` is used (`snappy`).

    !!! note

        Used only for writing files.
    """

    class Config:
        known_options = ORC_JAVA_OPTIONS
        prohibited_options = PROHIBITED_OPTIONS
        extra = "allow"

    @slot
    def check_if_supported(self, spark: SparkSession) -> None:
        # always available
        pass

    def __repr__(self):
        options_dict = self.dict(by_alias=True, exclude_none=True)
        options_dict = dict(sorted(options_dict.items()))
        if any("." in field for field in options_dict):
            return f"{self.__class__.__name__}.parse({options_dict})"

        options_kwargs = ", ".join(f"{k}={v!r}" for k, v in options_dict.items())
        return f"{self.__class__.__name__}({options_kwargs})"
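
The branching in `__repr__` above can be tried without Spark or onETL installed. The following is a minimal sketch, assuming a plain dictionary stands in for the model's options; `render_repr` is a hypothetical helper, not part of the library:

```python
def render_repr(class_name, options):
    """Mimic the branching of ORC.__repr__ on a plain dict of options."""
    # Drop unset options and sort by name, as the original method does.
    options = dict(sorted((k, v) for k, v in options.items() if v is not None))
    # Dotted names cannot be keyword arguments, so fall back to the parse() form.
    if any("." in name for name in options):
        return f"{class_name}.parse({options})"
    kwargs = ", ".join(f"{k}={v!r}" for k, v in options.items())
    return f"{class_name}({kwargs})"


print(render_repr("ORC", {"mergeSchema": True, "compression": None}))
# ORC(mergeSchema=True)
print(render_repr("ORC", {"orc.bloom.filter.fpp": 0.01, "compression": "snappy"}))
# ORC.parse({'compression': 'snappy', 'orc.bloom.filter.fpp': 0.01})
```

In both cases the printed string is a valid Python expression, which appears to be the point of switching to the `parse` form when any option name contains a dot.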

mergeSchema = None class-attribute instance-attribute

Merge schemas of all ORC files being read into a single schema. By default, the value of the Spark config option spark.sql.orc.mergeSchema is used (False).

Note

Used only for reading files.
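
To illustrate what merging schemas means, here is a conceptual sketch in plain Python. It is not Spark's implementation: schemas are modelled as hypothetical column-to-type dictionaries, and the sketch assumes that a column appearing with two different types cannot be merged, whereas Spark's actual compatibility rules are more detailed.

```python
def merge_schemas(schemas):
    """Union column -> type mappings, keeping first-seen column order."""
    merged = {}
    for schema in schemas:
        for column, dtype in schema.items():
            # Assumption: a column with two different types cannot be merged.
            if merged.setdefault(column, dtype) != dtype:
                raise ValueError(f"Conflicting types for column {column!r}")
    return merged


# Two files written at different times; the newer one has an extra column.
old_file = {"id": "bigint", "name": "string"}
new_file = {"id": "bigint", "name": "string", "email": "string"}

print(merge_schemas([old_file, new_file]))
# {'id': 'bigint', 'name': 'string', 'email': 'string'}
```

With merging disabled, the resulting schema would not be built from every file this way, so a column such as `email` that exists only in some files could be missing from the result.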

compression = None class-attribute instance-attribute

Compression codec of the ORC files. By default, the value of the Spark config option spark.sql.orc.compression.codec is used (snappy).

Note

Used only for writing files.