Commit e206f09

DOCSP-36546 Scan Multiple Collections (#193)
(cherry picked from commit 23deda1)
1 parent eda72ae commit e206f09

File tree

3 files changed: +136 −8 lines changed

source/release-notes.txt

Lines changed: 32 additions & 0 deletions
@@ -2,6 +2,38 @@
 Release Notes
 =============
 
+MongoDB Connector for Spark 10.3
+--------------------------------
+
+The 10.3 connector release includes the following new features:
+
+- Added support for reading multiple collections when using micro-batch or
+  continuous streaming modes.
+
+.. warning:: Breaking Change
+
+   Support for reading multiple collections introduces the following breaking
+   changes:
+
+   - If the name of a collection used in your ``collection`` configuration
+     option contains a comma, the {+connector-short+} treats it as two
+     different collections. To avoid this, you must escape the comma by
+     preceding it with a backslash (\\).
+
+   - If the name of a collection used in your ``collection`` configuration
+     option is "*", the {+connector-short+} interprets it as a specification
+     to scan all collections. To avoid this, you must escape the asterisk by
+     preceding it with a backslash (\\).
+
+   - If the name of a collection used in your ``collection`` configuration
+     option contains a backslash (\\), the {+connector-short+} treats the
+     backslash as an escape character, which might change how it interprets
+     the value. To avoid this, you must escape the backslash by preceding it
+     with another backslash.
+
+To learn more about scanning multiple collections, see the :ref:`collection
+configuration property <spark-streaming-input-conf>` description.
+
 MongoDB Connector for Spark 10.2
 --------------------------------
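The escaping rules in the release note above can be sketched as a small helper. This is only an illustration of the documented rules; the class and method names are hypothetical and not part of the connector API.

```java
// Hypothetical helper (not part of the connector API): escapes a raw
// collection name so the connector's ``collection`` option treats it as a
// single, literal collection name.
public final class CollectionNameEscaper {
    public static String escape(String name) {
        // Escape backslashes first so the escapes added below are not
        // themselves re-escaped.
        String escaped = name.replace("\\", "\\\\");
        // An unescaped comma would split the name into two collections.
        escaped = escaped.replace(",", "\\,");
        // Per the docs, only the exact name "*" means "scan all collections".
        if (escaped.equals("*")) {
            escaped = "\\*";
        }
        return escaped;
    }

    public static void main(String[] args) {
        System.out.println(escape("my,collection")); // my\,collection
        System.out.println(escape("*"));             // \*
    }
}
```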

source/streaming-mode/streaming-read-config.txt

Lines changed: 93 additions & 2 deletions
@@ -46,6 +46,10 @@ You can configure the following properties when reading data from MongoDB in str
    * - ``collection``
      - | **Required.**
        | The collection name configuration.
+       | You can specify multiple collections by separating the collection names
+         with a comma.
+       |
+       | To learn more about specifying multiple collections, see :ref:`spark-specify-multiple-collections`.
 
    * - ``comment``
      - | The comment to append to the read operation. Comments appear in the
@@ -168,7 +172,7 @@ You can configure the following properties when reading a change stream from Mon
         omit the ``fullDocument`` field and publishes only the value of the
         field.
         If you don't specify a schema, the connector infers the schema
-        from the change stream document rather than from the underlying collection.
+        from the change stream document.
 
       **Default**: ``false``
@@ -203,4 +207,91 @@ You can configure the following properties when reading a change stream from Mon
 Specifying Properties in ``connection.uri``
 -------------------------------------------
 
-.. include:: /includes/connection-read-config.rst
+.. include:: /includes/connection-read-config.rst
+
+.. _spark-specify-multiple-collections:
+
+Specifying Multiple Collections in the ``collection`` Property
+--------------------------------------------------------------
+
+You can specify multiple collections in the ``collection`` change stream
+configuration property by separating the collection names with a comma. Do
+not add a space between the collections unless the space is part of the
+collection name.
+
+Specify multiple collections as shown in the following example:
+
+.. code-block:: java
+
+   ...
+   .option("spark.mongodb.collection", "collectionOne,collectionTwo")
+
+If a collection name is "*", or if the name includes a comma or a backslash
+(\\), you must escape the character as follows:
+
+- If the name of a collection used in your ``collection`` configuration
+  option contains a comma, the {+connector-short+} treats it as two different
+  collections. To avoid this, you must escape the comma by preceding it with
+  a backslash (\\). Escape a collection named "my,collection" as follows:
+
+  .. code-block:: java
+
+     "my\,collection"
+
+- If the name of a collection used in your ``collection`` configuration
+  option is "*", the {+connector-short+} interprets it as a specification
+  to scan all collections. To avoid this, you must escape the asterisk by
+  preceding it with a backslash (\\). Escape a collection named "*" as
+  follows:
+
+  .. code-block:: java
+
+     "\*"
+
+- If the name of a collection used in your ``collection`` configuration
+  option contains a backslash (\\), the {+connector-short+} treats the
+  backslash as an escape character, which might change how it interprets the
+  value. To avoid this, you must escape the backslash by preceding it with
+  another backslash. Escape a collection named "\\collection" as follows:
+
+  .. code-block:: java
+
+     "\\collection"
+
+.. note::
+
+   When specifying the collection name as a string literal in Java, you must
+   further escape each backslash with another one. For example, escape a
+   collection named "\\collection" as follows:
+
+   .. code-block:: java
+
+      "\\\\collection"
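As a quick sanity check of the double-escaping rule in the note above, the following standalone snippet (illustrative only, not connector code) prints the character sequences that each Java string literal actually contains:

```java
public class EscapeDemo {
    public static void main(String[] args) {
        // The rst example "my\,collection" written as a Java string literal:
        String comma = "my\\,collection";    // characters: my\,collection
        // The rst example "\\collection" written as a Java string literal:
        String backslash = "\\\\collection"; // characters: \\collection
        System.out.println(comma);
        System.out.println(backslash);
    }
}
```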
+
+You can stream from all collections in the database by passing an
+asterisk (*) as a string for the collection name.
+
+Specify all collections as shown in the following example:
+
+.. code-block:: java
+
+   ...
+   .option("spark.mongodb.collection", "*")
+
+If you create a collection while streaming from all collections, the new
+collection is automatically included in the stream.
+
+You can drop collections at any time while streaming from multiple collections.
+
+.. important:: Inferring the Schema with Multiple Collections
+
+   If you set the ``change.stream.publish.full.document.only``
+   option to ``true``, the {+connector-short+} infers the schema of a
+   ``DataFrame`` by using the schema of the scanned documents.
+
+   Schema inference happens at the beginning of streaming, and does not take
+   into account collections that are created during streaming.
+
+   When streaming from multiple collections and inferring the schema, the
+   connector samples each collection sequentially. Streaming from a large
+   number of collections can cause the schema inference to have noticeably
+   slower performance. This performance impact occurs only while inferring
+   the schema.
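Putting the comma and escape rules together, the splitting behavior described above can be sketched as a small parser. This is a sketch under the assumption that the documented semantics (backslash escapes the next character, unescaped commas separate names) are exactly what the connector applies; the connector's actual parser is internal and may differ, and wildcard expansion of an unescaped "*" is out of scope here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the documented splitting rules for the
// ``collection`` option value; not the connector's real parser.
public final class CollectionOptionParser {
    public static List<String> parse(String option) {
        List<String> names = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean escaped = false;
        for (char c : option.toCharArray()) {
            if (escaped) {
                current.append(c); // an escaped character is taken literally
                escaped = false;
            } else if (c == '\\') {
                escaped = true;    // the next character is literal
            } else if (c == ',') {
                names.add(current.toString()); // unescaped comma ends a name
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        names.add(current.toString());
        return names;
    }

    public static void main(String[] args) {
        System.out.println(parse("collectionOne,collectionTwo")); // two names
        System.out.println(parse("my\\,collection")); // one name: my,collection
        System.out.println(parse("\\*")); // one literal name: *
    }
}
```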

source/streaming-mode/streaming-read.txt

Lines changed: 11 additions & 6 deletions
@@ -15,6 +15,13 @@ Read from MongoDB in Streaming Mode
    :depth: 1
    :class: singlecol
 
+.. facet::
+   :name: genre
+   :values: reference
+
+.. meta::
+   :keywords: change stream
+
 Overview
 --------
@@ -344,12 +351,10 @@ The following example shows how to stream data from MongoDB to your console.
 
 .. important:: Inferring the Schema of a Change Stream
 
-   When the {+connector-short+} infers the schema of a DataFrame
-   read from a change stream, by default,
-   it uses the schema of the underlying collection rather than that
-   of the change stream. If you set the ``change.stream.publish.full.document.only``
-   option to ``true``, the connector uses the schema of the
-   change stream instead.
+   If you set the ``change.stream.publish.full.document.only``
+   option to ``true``, the {+connector-short+} infers the schema of a
+   ``DataFrame`` by using the schema of the scanned documents. If you set the
+   option to ``false``, you must specify a schema.
 
    For more information about this setting, and to see a full list of change stream
    configuration options, see the
