@@ -8,7 +8,7 @@ How can I achieve data locality?
 --------------------------------
 
 For any MongoDB deployment, the Mongo Spark Connector sets the
-preferred location for an RDD to be where the data is:
+preferred location for a DataFrame or Dataset to be where the data is:
 
 - For a non sharded system, it sets the preferred location to be the
   hostname(s) of the standalone or the replica set.
@@ -30,89 +30,10 @@ To promote data locality,
   To partition the data by shard use the
   :ref:`conf-shardedpartitioner`.
 
-How do I interact with Spark Streams?
--------------------------------------
-
-Spark streams can be considered as a potentially infinite source of
-RDDs. Therefore, anything you can do with an RDD, you can do with the
-results of a Spark Stream.
-
-For an example, see :mongo-spark:`SparkStreams.scala
-</blob/master/examples/src/test/scala/tour/SparkStreams.scala>`
-
 How do I resolve ``Unrecognized pipeline stage name`` Error?
 ------------------------------------------------------------
 
 In MongoDB deployments with mixed versions of :binary:`~bin.mongod`, it is
 possible to get an ``Unrecognized pipeline stage name: '$sample'``
 error. To mitigate this situation, explicitly configure the partitioner
 to use and define the Schema when using DataFrames.
-
-How do I use MongoDB BSON types that are unsupported in Spark?
---------------------------------------------------------------
-
-Some custom MongoDB BSON types, such as ``ObjectId``, are unsupported
-in Spark.
-
-The MongoDB Spark Connector converts custom MongoDB data types to and
-from extended JSON-like representations of those data types that are
-compatible with Spark. See :ref:`<bson-spark-datatypes>` for a list of
-custom MongoDB types and their Spark counterparts.
-
-Spark Datasets
-~~~~~~~~~~~~~~
-
-To create a standard Dataset with custom MongoDB data types, use
-``fieldTypes`` helpers:
-
-.. code-block:: scala
-
-   import com.mongodb.spark.sql.fieldTypes
-
-   case class MyData(id: fieldTypes.ObjectId, a: Int)
-   val ds = spark.createDataset(Seq(MyData(fieldTypes.ObjectId(new ObjectId()), 99)))
-   ds.show()
-
-The preceding example creates a Dataset containing the following fields
-and data types:
-
-- The ``id`` field is a custom MongoDB BSON type, ``ObjectId``, defined
-  by ``fieldTypes.ObjectId``.
-
-- The ``a`` field is an ``Int``, a data type available in Spark.
-
-Spark DataFrames
-~~~~~~~~~~~~~~~~
-
-To create a DataFrame with custom MongoDB data types, you must supply
-those types when you create the RDD and schema:
-
-- Create RDDs using custom MongoDB BSON types
-  (e.g. ``ObjectId``). The Spark Connector handles converting
-  those custom types into Spark-compatible data types.
-
-- Declare schemas using the ``StructFields`` helpers for data types
-  that are not natively supported by Spark
-  (e.g. ``StructFields.objectId``). Refer to
-  :ref:`<bson-spark-datatypes>` for the mapping between BSON and custom
-  MongoDB Spark types.
-
-.. code-block:: scala
-
-   import org.apache.spark.sql.Row
-   import org.apache.spark.sql.types.{StructType, StructField, IntegerType}
-   import com.mongodb.spark.sql.helpers.StructFields
-
-   val data = Seq(Row(Row(new ObjectId().toHexString()), 99))
-   val rdd = spark.sparkContext.parallelize(data)
-   val schema = StructType(List(StructFields.objectId("id", true), StructField("a", IntegerType, true)))
-   val df = spark.createDataFrame(rdd, schema)
-   df.show()
-
-The preceding example creates a DataFrame containing the following
-fields and data types:
-
-- The ``id`` field is a custom MongoDB BSON type, ``ObjectId``, defined
-  by ``StructFields.objectId``.
-
-- The ``a`` field is an ``Int``, a data type available in Spark.
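The data-locality answer above ends by recommending the :ref:`conf-shardedpartitioner` for partitioning the data by shard. As a rough sketch only (it is not part of the change above), the following assumes the pre-10.x connector API, where the sharded partitioner is selected by the ``MongoShardedPartitioner`` name and a ``nearest`` read preference is set through the ``spark.mongodb.input.*`` configuration keys; the application name, host, and namespace are placeholders:

.. code-block:: scala

   import org.apache.spark.sql.SparkSession
   import com.mongodb.spark.MongoSpark

   val spark = SparkSession.builder()
     .appName("data-locality-example")                                     // placeholder
     .config("spark.mongodb.input.uri", "mongodb://mongos-host:27017/test.myCollection") // placeholder URI
     .config("spark.mongodb.input.readPreference.name", "nearest")         // read from the nearest member
     .config("spark.mongodb.input.partitioner", "MongoShardedPartitioner") // partition the data by shard
     .getOrCreate()

   // Each partition's preferred location is derived from where its data lives.
   val df = MongoSpark.load(spark)
   df.show()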
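For the ``Unrecognized pipeline stage name: '$sample'`` answer kept above, here is a minimal sketch of the suggested mitigation, again assuming the pre-10.x connector API, an existing ``SparkSession`` named ``spark``, and that the connection URI is already set via ``spark.mongodb.input.uri``. It selects a partitioner that does not rely on ``$sample`` and supplies an explicit schema so the connector does not sample the collection to infer one; the field names and the choice of ``MongoPaginateBySizePartitioner`` are illustrative assumptions:

.. code-block:: scala

   import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

   // Hypothetical document layout; adjust the fields to match your collection.
   val explicitSchema = StructType(List(
     StructField("name", StringType, nullable = true),
     StructField("age", IntegerType, nullable = true)
   ))

   val df = spark.read
     .format("com.mongodb.spark.sql.DefaultSource")
     .option("partitioner", "MongoPaginateBySizePartitioner") // avoid the $sample-based default partitioner
     .schema(explicitSchema)                                  // explicit schema, so no sampled schema inference
     .load()
   df.show()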