Reading from MongoDB using `MongoDB.pipeline`

`MongoDB.pipeline` lets you pass a custom aggregation pipeline, but it does not support incremental strategies.

Warning

Please take into account MongoDB data types.

References

`onetl.connection.db_connection.mongodb.connection.MongoDB.pipeline(collection, pipeline=None, df_schema=None, options=None)`

Execute a pipeline for a specific collection, and return DataFrame.

The syntax is almost the same as MongoDB's Aggregation pipeline syntax:

db.collection_name.aggregate([{"$match": ...}, {"$group": ...}])

but the pipeline is executed on Spark executors, in a distributed way.

Note

This method does not support strategies; use DBReader instead.

Added in 0.7.0

Parameters

collection : str

Collection name.

pipeline : dict | list[dict], optional

Pipeline containing a database query. See Aggregation pipeline syntax.

df_schema : StructType, optional

Schema describing the resulting DataFrame.

options : PipelineOptions | dict, optional

Additional pipeline options, see MongoDB.PipelineOptions.
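
Because `pipeline` accepts either a single stage or a list of stages, a multi-stage query can be written as a plain Python list. The sketch below is illustrative only: `build_pipeline` and the field names `some_int` and `some_string` are hypothetical, and the final call is commented out because it needs a live `connection` object.

```python
from __future__ import annotations


def build_pipeline(min_value: int) -> list[dict]:
    """Return aggregation stages in the order MongoDB should apply them."""
    return [
        # Stage 1: keep only documents whose some_int exceeds the threshold.
        {"$match": {"some_int": {"$gt": min_value}}},
        # Stage 2: group the remaining documents by some_string.
        {
            "$group": {
                "_id": "$some_string",
                "max_int": {"$max": "$some_int"},
            },
        },
    ]


stages = build_pipeline(999)

# With an existing MongoDB connection object the stages would be passed as-is:
# df = connection.pipeline(collection="collection_name", pipeline=stages)
```

Keeping the stages in a function makes the threshold a parameter rather than a literal buried in the query, which may help when the same pipeline is reused across jobs.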

Examples

Get documents with a specific `field` value:

df = connection.pipeline(
    collection="collection_name",
    pipeline={"$match": {"field": {"$eq": 1}}},
)

Calculate aggregation and get result:

df = connection.pipeline(
    collection="collection_name",
    pipeline={
        "$group": {
            "_id": 1,
            "min": {"$min": "$column_int"},
            "max": {"$max": "$column_int"},
        }
    },
)

Explicitly pass DataFrame schema:

from pyspark.sql.types import (
    DoubleType,
    IntegerType,
    StringType,
    StructField,
    StructType,
    TimestampType,
)

df_schema = StructType(
    [
        StructField("_id", StringType()),
        StructField("some_string", StringType()),
        StructField("some_int", IntegerType()),
        StructField("some_datetime", TimestampType()),
        StructField("some_float", DoubleType()),
    ],
)

df = connection.pipeline(
    collection="collection_name",
    df_schema=df_schema,
    pipeline={"$match": {"some_int": {"$gt": 999}}},
)

Pass additional options to pipeline execution:

from onetl.connection import MongoDB

df = connection.pipeline(
    collection="collection_name",
    pipeline={"$match": {"field": {"$eq": 1}}},
    options=MongoDB.PipelineOptions(hint={"field": 1}),
)

Source code in onetl/connection/db_connection/mongodb/connection.py

@slot
def pipeline(
    self,
    collection: str,
    pipeline: dict | list[dict] | None = None,
    df_schema: StructType | None = None,
    options: MongoDBPipelineOptions | dict | None = None,
):
    """
    Execute a pipeline for a specific collection, and return DataFrame. [![support hooks](https://img.shields.io/badge/%20-support%20hooks-blue)](/hooks/)

    Almost like [Aggregation pipeline syntax](https://www.mongodb.com/docs/manual/core/aggregation-pipeline/)
    in MongoDB:

    ```js
    db.collection_name.aggregate([{"$match": ...}, {"$group": ...}])
    ```
    but pipeline is executed on Spark executors, in a distributed way.

    !!! note

        This method does not support [strategy][],
        use [DBReader][onetl.db.db_reader.db_reader.DBReader] instead

    !!! success "Added in 0.7.0"

    Parameters
    ----------

    collection : str
        Collection name.

    pipeline : dict | list[dict], optional
        Pipeline containing a database query.
        See [Aggregation pipeline syntax](https://www.mongodb.com/docs/manual/core/aggregation-pipeline/).

    df_schema : StructType, optional
        Schema describing the resulting DataFrame.

    options : PipelineOptions | dict, optional
        Additional pipeline options,
        see [MongoDB.PipelineOptions][onetl.connection.db_connection.mongodb.options.MongoDBPipelineOptions].

    Examples
    --------

    Get document with a specific `field` value:

    ```python
    df = connection.pipeline(
        collection="collection_name",
        pipeline={"$match": {"field": {"$eq": 1}}},
    )
    ```
    Calculate aggregation and get result:

    ```python
    df = connection.pipeline(
        collection="collection_name",
        pipeline={
            "$group": {
                "_id": 1,
                "min": {"$min": "$column_int"},
                "max": {"$max": "$column_int"},
            }
        },
    )
    ```
    Explicitly pass DataFrame schema:

    ```python
    from pyspark.sql.types import (
        DoubleType,
        IntegerType,
        StringType,
        StructField,
        StructType,
        TimestampType,
    )

    df_schema = StructType(
        [
            StructField("_id", StringType()),
            StructField("some_string", StringType()),
            StructField("some_int", IntegerType()),
            StructField("some_datetime", TimestampType()),
            StructField("some_float", DoubleType()),
        ],
    )

    df = connection.pipeline(
        collection="collection_name",
        df_schema=df_schema,
        pipeline={"$match": {"some_int": {"$gt": 999}}},
    )
    ```
    Pass additional options to pipeline execution:

    ```python
    df = connection.pipeline(
        collection="collection_name",
        pipeline={"$match": {"field": {"$eq": 1}}},
        options=MongoDB.PipelineOptions(hint={"field": 1}),
    )
    ```
    """
    log.info("|%s| Executing aggregation pipeline:", self.__class__.__name__)

    read_options = self.PipelineOptions.parse(options).dict(by_alias=True, exclude_none=True)
    if pipeline:
        pipeline = self.dialect.prepare_pipeline(pipeline)

    log_with_indent(log, "collection = %r", collection)
    log_json(log, pipeline, name="pipeline")

    if df_schema:
        empty_df = self.spark.createDataFrame([], df_schema)
        log_dataframe_schema(log, empty_df)

    log_options(log, read_options)

    # exclude from the log
    read_options.update(self._get_connection_params(collection))
    if pipeline:
        read_options["aggregation.pipeline"] = json.dumps(pipeline)

    with override_job_description(self.spark, f"{self}.pipeline()"):
        spark_reader = self.spark.read.format("mongodb").options(**read_options)

        if df_schema:
            spark_reader = spark_reader.schema(df_schema)

        return spark_reader.load()
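
The option handling in the listing above can be traced with a small stand-alone sketch. It only mirrors the merge order visible in the source (user options first, then connection parameters, then the serialized pipeline). `assemble_read_options` and the sample connection parameter values are hypothetical, and `prepare_pipeline` is not reproduced here.

```python
from __future__ import annotations

import json


def assemble_read_options(
    user_options: dict,
    connection_params: dict,
    pipeline: list[dict] | None = None,
) -> dict:
    # Start from the options supplied by the caller.
    read_options = dict(user_options)
    # Connection parameters are applied afterwards, so they win on key clashes.
    read_options.update(connection_params)
    # The pipeline is only added when it is non-empty, serialized as JSON text.
    if pipeline:
        read_options["aggregation.pipeline"] = json.dumps(pipeline)
    return read_options


options = assemble_read_options(
    user_options={"hint": {"field": 1}},
    # Hypothetical values standing in for the connection's own parameters.
    connection_params={"collection": "collection_name"},
    pipeline=[{"$match": {"field": {"$eq": 1}}}],
)
print(options["aggregation.pipeline"])
```

Applying the connection parameters after the user options is what makes the connection's values take precedence in this sketch.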

`onetl.connection.db_connection.mongodb.options.MongoDBPipelineOptions`

Bases: GenericOptions

Aggregation pipeline options for MongoDB connector.

The only difference from MongoDB.ReadOptions is that the latter does not allow passing the `hint` parameter.

Warning

Options uri, database, collection, pipeline are populated from connection attributes, and cannot be overridden by the user in PipelineOptions to avoid issues.
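
To illustrate the warning, here is a minimal sketch of such a guard. It is not onETL's implementation: `PROHIBITED` simply lists the four names from the warning (the library's own `PIPELINE_PROHIBITED_OPTIONS` may differ), and `check_options` is a hypothetical helper.

```python
# Names taken from the warning above; the library's real list may be longer.
PROHIBITED = frozenset({"uri", "database", "collection", "pipeline"})


def check_options(options: dict) -> dict:
    """Reject option names that are reserved for connection attributes."""
    clashes = sorted(PROHIBITED.intersection(options))
    if clashes:
        raise ValueError(
            f"Options {clashes} are set by the connection and cannot be overridden"
        )
    return options


check_options({"hint": {"some_field": 1}})  # passes: no reserved names used

try:
    check_options({"collection": "other_collection"})
except ValueError as error:
    print(error)
```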

Added in 0.7.0

Examples

Note

You can pass any value supported by the connector, even if it is not mentioned in this documentation. Option names should be in camelCase!

The set of supported options depends on the connector version.

from onetl.connection import MongoDB

options = MongoDB.PipelineOptions(
    hint={"some_field": 1},
)

Source code in onetl/connection/db_connection/mongodb/options.py

class MongoDBPipelineOptions(GenericOptions):
    """Aggregation pipeline options for MongoDB connector.

    The only difference from [MongoDB.ReadOptions][MongoDBReadOptions]
    that latter does not allow to pass the `hint` parameter.

    !!! warning

        Options `uri`, `database`, `collection`, `pipeline` are populated from connection attributes,
        and cannot be overridden by the user in `PipelineOptions` to avoid issues.

    !!! success "Added in 0.7.0"

    Examples
    --------

    !!! note

        You can pass any value
        [supported by connector](https://www.mongodb.com/docs/spark-connector/current/batch-mode/batch-read-config/),
        even if it is not mentioned in this documentation. **Option names should be in** `camelCase`!

        The set of supported options depends on connector version.

    ```python
    from onetl.connection import MongoDB

    options = MongoDB.PipelineOptions(
        hint={"some_field": 1},
    )
    ```
    """

    class Config:
        prohibited_options = PIPELINE_PROHIBITED_OPTIONS
        known_options = KNOWN_READ_OPTIONS
        extra = "allow"
