ORC

Bases: ReadWriteFileFormat

ORC file format (columnar).

Based on Spark ORC Files file format.

Supports reading/writing files with .orc extension.

Added in 0.9.0

Examples

Note

You can pass any option mentioned in official documentation. Option names should be in camelCase!

The set of supported options depends on Spark version.

You may also set options described in the orc-java documentation. Their names start with orc. and contain dots, so they cannot be passed as keyword arguments: ORC(orc.option=True) is invalid Python. Call ORC.parse({"orc.option": True}) instead.

Reading files

from onetl.file.format import ORC

orc = ORC(mergeSchema=True)

Writing files

from onetl.file.format import ORC

orc = ORC.parse(
    {
        "compression": "snappy",
        # Enable Bloom filter for columns 'id' and 'name'
        "orc.bloom.filter.columns": "id,name",
        # Set Bloom filter false positive probability
        "orc.bloom.filter.fpp": 0.01,
        # Do not use dictionary for 'highly_selective_column'
        "orc.column.encoding.direct": "highly_selective_column",
        # other options
    }
)
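
The note above says that dotted option names cannot be written as keyword arguments. The following minimal sketch illustrates why, using only the standard library. `is_valid_expression` and the `ORC(...)` strings are illustrative: the strings are only parsed, never executed, so onETL is not imported here.

```python
def is_valid_expression(source: str) -> bool:
    """Return True if `source` parses as a Python expression."""
    try:
        compile(source, "<example>", "eval")
    except SyntaxError:
        return False
    return True


# A keyword argument name must be a plain identifier, so a dotted name
# is rejected by the parser before any code runs.
dotted_keyword_ok = is_valid_expression("ORC(orc.option=True)")

# The same name is an ordinary string when used as a dictionary key.
dict_literal_ok = is_valid_expression('ORC.parse({"orc.option": True})')
```

Here `dotted_keyword_ok` is `False` and `dict_literal_ok` is `True`, which is why options with dots in their names go through a dictionary.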

Source code in onetl/file/format/orc.py

@support_hooks
class ORC(ReadWriteFileFormat):
    """
    ORC file format (columnar). [![support hooks](https://img.shields.io/badge/%20-support%20hooks-blue)](/hooks/)

    Based on [Spark ORC Files](https://spark.apache.org/docs/latest/sql-data-sources-orc.html) file format.

    Supports reading/writing files with `.orc` extension.

    !!! success "Added in 0.9.0"

    Examples
    --------

    !!! note

        You can pass any option mentioned in
        [official documentation](https://spark.apache.org/docs/latest/sql-data-sources-orc.html).
        **Option names should be in** `camelCase`!

        The set of supported options depends on Spark version.

        You may also set options described in the
        [orc-java documentation](https://orc.apache.org/docs/core-java-config.html).
        Their names start with `orc.` and contain dots, so they cannot be passed
        as keyword arguments: `ORC(orc.option=True)` is invalid Python.
        Call `ORC.parse({"orc.option": True})` instead.

    === "Reading files"

        ```python
        from onetl.file.format import ORC

        orc = ORC(mergeSchema=True)
        ```

    === "Writing files"

        ```python
        from onetl.file.format import ORC

        orc = ORC.parse(
            {
                "compression": "snappy",
                # Enable Bloom filter for columns 'id' and 'name'
                "orc.bloom.filter.columns": "id,name",
                # Set Bloom filter false positive probability
                "orc.bloom.filter.fpp": 0.01,
                # Do not use dictionary for 'highly_selective_column'
                "orc.column.encoding.direct": "highly_selective_column",
                # other options
            }
        )
        ```
    """

    name: ClassVar[str] = "orc"

    mergeSchema: Optional[bool] = None
    """
    Merge schemas of all ORC files being read into a single schema.
    By default, the value of Spark config option `spark.sql.orc.mergeSchema` is used (`False`).

    !!! note

        Used only for reading files.
    """

    compression: Union[
        str,
        Literal["uncompressed", "snappy", "zlib", "lzo", "zstd", "lz4"],
        None,
    ] = None
    """
    Compression codec of the ORC files.
    By default, the value of Spark config option `spark.sql.orc.compression.codec` is used (`snappy`).

    !!! note

        Used only for writing files.
    """

    class Config:
        known_options = ORC_JAVA_OPTIONS
        prohibited_options = PROHIBITED_OPTIONS
        extra = "allow"

    @slot
    def check_if_supported(self, spark: SparkSession) -> None:
        # always available
        pass

    def __repr__(self):
        options_dict = self.dict(by_alias=True, exclude_none=True)
        options_dict = dict(sorted(options_dict.items()))
        if any("." in field for field in options_dict):
            return f"{self.__class__.__name__}.parse({options_dict})"

        options_kwargs = ", ".join(f"{k}={v!r}" for k, v in options_dict.items())
        return f"{self.__class__.__name__}({options_kwargs})"
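
The `__repr__` method above chooses between constructor syntax and `ORC.parse(...)` syntax depending on whether any option name contains a dot. The sketch below reproduces that branching as a standalone function so it can be run without Spark or onETL. `format_repr` is a hypothetical helper, and filtering out `None` values is only an approximation of `self.dict(by_alias=True, exclude_none=True)`.

```python
def format_repr(class_name: str, options: dict) -> str:
    """Hypothetical stand-in mirroring the branching in ORC.__repr__."""
    # Drop unset options and sort by name, as the method above does
    options = dict(sorted((k, v) for k, v in options.items() if v is not None))

    # Dotted names cannot be keyword arguments, so fall back to parse({...})
    if any("." in key for key in options):
        return f"{class_name}.parse({options})"

    kwargs = ", ".join(f"{k}={v!r}" for k, v in options.items())
    return f"{class_name}({kwargs})"


plain = format_repr("ORC", {"mergeSchema": True, "compression": None})
dotted = format_repr(
    "ORC",
    {"orc.bloom.filter.fpp": 0.01, "compression": "snappy"},
)
print(plain)   # ORC(mergeSchema=True)
print(dotted)  # ORC.parse({'compression': 'snappy', 'orc.bloom.filter.fpp': 0.01})
```

In both cases the resulting string is valid Python, so it can be pasted back into code to recreate an equivalent object.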

`mergeSchema = None` `class-attribute` `instance-attribute`

Merge schemas of all ORC files being read into a single schema. By default, the value of Spark config option spark.sql.orc.mergeSchema is used (False).

Note

Used only for reading files.

`compression = None` `class-attribute` `instance-attribute`

Compression codec of the ORC files. By default, the value of Spark config option spark.sql.orc.compression.codec is used (snappy).

Note

Used only for writing files.
