Commit e206f09

DOCSP-36546 Scan Multiple Collections (#193)
(cherry picked from commit 23deda1)
1 parent eda72ae commit e206f09

File tree

3 files changed: +136 −8 lines changed

source/release-notes.txt

Lines changed: 32 additions & 0 deletions
@@ -2,6 +2,38 @@
 Release Notes
 =============
 
+MongoDB Connector for Spark 10.3
+--------------------------------
+
+The 10.3 connector release includes the following new features:
+
+- Added support for reading multiple collections when using micro-batch or
+  continuous streaming modes.
+
+.. warning:: Breaking Change
+
+   Support for reading multiple collections introduces the following breaking
+   changes:
+
+   - If the name of a collection used in your ``collection`` configuration
+     option contains a comma, the {+connector-short+} treats it as two
+     different collections. To avoid this, you must escape the comma by
+     preceding it with a backslash (\\).
+
+   - If the name of a collection used in your ``collection`` configuration
+     option is "*", the {+connector-short+} interprets it as a specification
+     to scan all collections. To avoid this, you must escape the asterisk by
+     preceding it with a backslash (\\).
+
+   - If the name of a collection used in your ``collection`` configuration
+     option contains a backslash (\\), the {+connector-short+} treats the
+     backslash as an escape character, which might change how it interprets
+     the value. To avoid this, you must escape the backslash by preceding it
+     with another backslash.
+
+To learn more about scanning multiple collections, see the :ref:`collection
+configuration property <spark-streaming-input-conf>` description.
+
 MongoDB Connector for Spark 10.2
 --------------------------------
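The escaping rules in the release note above can be sketched as a small helper. This is only an illustration of the documented rules; the class and method names are hypothetical and not part of the connector API.

```java
// Hypothetical helper (not part of the connector API): escapes a raw
// collection name so the connector's ``collection`` option treats it as a
// single, literal collection name.
public final class CollectionNameEscaper {
    public static String escape(String name) {
        // Escape backslashes first so the escapes added below are not
        // themselves re-escaped.
        String escaped = name.replace("\\", "\\\\");
        // An unescaped comma would split the name into two collections.
        escaped = escaped.replace(",", "\\,");
        // Per the docs, only the exact name "*" means "scan all collections".
        if (escaped.equals("*")) {
            escaped = "\\*";
        }
        return escaped;
    }

    public static void main(String[] args) {
        System.out.println(escape("my,collection")); // my\,collection
        System.out.println(escape("*"));             // \*
    }
}
```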

source/streaming-mode/streaming-read-config.txt

Lines changed: 93 additions & 2 deletions
@@ -46,6 +46,10 @@ You can configure the following properties when reading data from MongoDB in str
    * - ``collection``
      - | **Required.**
        | The collection name configuration.
+       | You can specify multiple collections by separating the collection names
+         with a comma.
+       |
+       | To learn more about specifying multiple collections, see :ref:`spark-specify-multiple-collections`.
 
    * - ``comment``
      - | The comment to append to the read operation. Comments appear in the
@@ -168,7 +172,7 @@ You can configure the following properties when reading a change stream from Mon
         omit the ``fullDocument`` field and publishes only the value of the
         field.
         If you don't specify a schema, the connector infers the schema
-        from the change stream document rather than from the underlying collection.
+        from the change stream document.
 
       **Default**: ``false``
@@ -203,4 +207,91 @@ You can configure the following properties when reading a change stream from Mon
 Specifying Properties in ``connection.uri``
 -------------------------------------------
 
-.. include:: /includes/connection-read-config.rst
+.. include:: /includes/connection-read-config.rst
+
+.. _spark-specify-multiple-collections:
+
+Specifying Multiple Collections in the ``collection`` Property
+--------------------------------------------------------------
+
+You can specify multiple collections in the ``collection`` change stream
+configuration property by separating the collection names with a comma. Do
+not add a space between the collections unless the space is part of the
+collection name.
+
+Specify multiple collections as shown in the following example:
+
+.. code-block:: java
+
+   ...
+   .option("spark.mongodb.collection", "collectionOne,collectionTwo")
+
+If a collection name is "*", or if the name includes a comma or a backslash
+(\\), you must escape the character as follows:
+
+- If the name of a collection used in your ``collection`` configuration
+  option contains a comma, the {+connector-short+} treats it as two different
+  collections. To avoid this, you must escape the comma by preceding it with
+  a backslash (\\). Escape a collection named "my,collection" as follows:
+
+  .. code-block:: java
+
+     "my\,collection"
+
+- If the name of a collection used in your ``collection`` configuration
+  option is "*", the {+connector-short+} interprets it as a specification
+  to scan all collections. To avoid this, you must escape the asterisk by
+  preceding it with a backslash (\\). Escape a collection named "*" as
+  follows:
+
+  .. code-block:: java
+
+     "\*"
+
+- If the name of a collection used in your ``collection`` configuration
+  option contains a backslash (\\), the {+connector-short+} treats the
+  backslash as an escape character, which might change how it interprets the
+  value. To avoid this, you must escape the backslash by preceding it with
+  another backslash. Escape a collection named "\\collection" as follows:
+
+  .. code-block:: java
+
+     "\\collection"
+
+.. note::
+
+   When specifying the collection name as a string literal in Java, you must
+   further escape each backslash with another one. For example, escape a
+   collection named "\\collection" as follows:
+
+   .. code-block:: java
+
+      "\\\\collection"
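As a quick sanity check of the double-escaping rule in the note above, the following standalone snippet (illustrative only, not connector code) prints the character sequences that each Java string literal actually contains:

```java
public class EscapeDemo {
    public static void main(String[] args) {
        // The rst example "my\,collection" written as a Java string literal:
        String comma = "my\\,collection";    // characters: my\,collection
        // The rst example "\\collection" written as a Java string literal:
        String backslash = "\\\\collection"; // characters: \\collection
        System.out.println(comma);
        System.out.println(backslash);
    }
}
```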
+
+You can stream from all collections in the database by passing an
+asterisk (*) as a string for the collection name.
+
+Specify all collections as shown in the following example:
+
+.. code-block:: java
+
+   ...
+   .option("spark.mongodb.collection", "*")
+
+If you create a collection while streaming from all collections, the new
+collection is automatically included in the stream.
+
+You can drop collections at any time while streaming from multiple collections.
+
+.. important:: Inferring the Schema with Multiple Collections
+
+   If you set the ``change.stream.publish.full.document.only``
+   option to ``true``, the {+connector-short+} infers the schema of a
+   ``DataFrame`` by using the schema of the scanned documents.
+
+   Schema inference happens at the beginning of streaming, and does not take
+   into account collections that are created during streaming.
+
+   When streaming from multiple collections and inferring the schema, the
+   connector samples each collection sequentially. Streaming from a large
+   number of collections can cause the schema inference to have noticeably
+   slower performance. This performance impact occurs only while inferring
+   the schema.
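Putting the comma and escape rules together, the splitting behavior described above can be sketched as a small parser. This is a sketch under the assumption that the documented semantics (backslash escapes the next character, unescaped commas separate names) are exactly what the connector applies; the connector's actual parser is internal and may differ, and wildcard expansion of an unescaped "*" is out of scope here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the documented splitting rules for the
// ``collection`` option value; not the connector's real parser.
public final class CollectionOptionParser {
    public static List<String> parse(String option) {
        List<String> names = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean escaped = false;
        for (char c : option.toCharArray()) {
            if (escaped) {
                current.append(c); // an escaped character is taken literally
                escaped = false;
            } else if (c == '\\') {
                escaped = true;    // the next character is literal
            } else if (c == ',') {
                names.add(current.toString()); // unescaped comma ends a name
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        names.add(current.toString());
        return names;
    }

    public static void main(String[] args) {
        System.out.println(parse("collectionOne,collectionTwo")); // two names
        System.out.println(parse("my\\,collection")); // one name: my,collection
        System.out.println(parse("\\*")); // one literal name: *
    }
}
```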

source/streaming-mode/streaming-read.txt

Lines changed: 11 additions & 6 deletions
@@ -15,6 +15,13 @@ Read from MongoDB in Streaming Mode
    :depth: 1
    :class: singlecol
 
+.. facet::
+   :name: genre
+   :values: reference
+
+.. meta::
+   :keywords: change stream
+
 Overview
 --------
@@ -344,12 +351,10 @@ The following example shows how to stream data from MongoDB to your console.
 
 .. important:: Inferring the Schema of a Change Stream
 
-   When the {+connector-short+} infers the schema of a DataFrame
-   read from a change stream, by default,
-   it uses the schema of the underlying collection rather than that
-   of the change stream. If you set the ``change.stream.publish.full.document.only``
-   option to ``true``, the connector uses the schema of the
-   change stream instead.
+   If you set the ``change.stream.publish.full.document.only``
+   option to ``true``, the {+connector-short+} infers the schema of a
+   ``DataFrame`` by using the schema of the scanned documents. If you set the
+   option to ``false``, you must specify a schema.
 
    For more information about this setting, and to see a full list of change stream
    configuration options, see the
