@@ -47,3 +47,72 @@ In MongoDB deployments with mixed versions of :binary:`~bin.mongod`, it is
possible to get an ``Unrecognized pipeline stage name: '$sample'``
error. To mitigate this situation, explicitly configure which
partitioner to use and define the schema when using DataFrames.
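+ The mitigation above might look like the following sketch. It assumes
+ the Spark Connector 2.x ``DefaultSource`` format string and the
+ ``MongoPaginateBySizePartitioner`` (which does not use ``$sample``);
+ the ``name`` and ``qty`` fields are hypothetical, and option names may
+ differ in other connector versions:
+
+ .. code-block:: scala
+
+    import org.apache.spark.sql.types.{StructType, StructField, IntegerType, StringType}
+
+    // An explicit schema also avoids schema inference, which samples the collection
+    val explicitSchema = StructType(List(
+      StructField("name", StringType, true),
+      StructField("qty", IntegerType, true)))
+
+    val df = spark.read
+      .format("com.mongodb.spark.sql.DefaultSource")
+      // A paginating partitioner avoids the $sample aggregation stage
+      .option("partitioner", "MongoPaginateBySizePartitioner")
+      .schema(explicitSchema)
+      .load()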
+
+ How do I use MongoDB BSON types that are unsupported in Spark?
+ --------------------------------------------------------------
+
+ Some custom MongoDB BSON types, such as ``ObjectId``, are unsupported
+ in Spark.
+
+ The MongoDB Spark Connector converts custom MongoDB data types to and
+ from extended JSON-like representations of those data types that are
+ compatible with Spark. See :ref:`bson-spark-datatypes` for a list of
+ custom MongoDB types and their Spark counterparts.
+
+ Spark Datasets
+ ~~~~~~~~~~~~~~
+
+ To create a standard Dataset with custom MongoDB data types, use the
+ ``fieldTypes`` helpers:
+
+ .. code-block:: scala
+
+    import org.bson.types.ObjectId
+    import com.mongodb.spark.sql.fieldTypes
+    // Provides the implicit Encoder required by createDataset
+    import spark.implicits._
+
+    case class MyData(id: fieldTypes.ObjectId, a: Int)
+    val ds = spark.createDataset(Seq(MyData(fieldTypes.ObjectId(new ObjectId()), 99)))
+    ds.show()
+
+ The preceding example creates a Dataset containing the following fields
+ and data types:
+
+ - The ``id`` field is a custom MongoDB BSON type, ``ObjectId``, defined
+   by ``fieldTypes.ObjectId``.
+
+ - The ``a`` field is an ``Int``, a data type available in Spark.
+
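+ Because ``MyData`` is an ordinary case class, the resulting Dataset
+ still supports typed operations. A small sketch, reusing ``MyData`` and
+ ``ds`` from the example above; that ``fieldTypes.ObjectId`` exposes its
+ hex string as an ``oid`` field is an assumption to verify against your
+ connector version:
+
+ .. code-block:: scala
+
+    // Typed filter and map work alongside the custom field type
+    val filtered = ds.filter(_.a > 50)
+    // oid is assumed to be the hex-string field of fieldTypes.ObjectId
+    filtered.map(_.id.oid).show()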
+ Spark DataFrames
+ ~~~~~~~~~~~~~~~~
+
+ To create a DataFrame with custom MongoDB data types, you must supply
+ those types when you create the RDD and schema:
+
+ - Create RDDs using custom MongoDB BSON types
+   (e.g. ``ObjectId``). The Spark Connector handles converting
+   those custom types into Spark-compatible data types.
+
+ - Declare schemas using the ``StructFields`` helpers for data types
+   that are not natively supported by Spark
+   (e.g. ``StructFields.objectId``). Refer to
+   :ref:`bson-spark-datatypes` for the mapping between custom
+   MongoDB BSON types and their Spark counterparts.
+
+ .. code-block:: scala
+
+    import org.bson.types.ObjectId
+    import org.apache.spark.sql.Row
+    import org.apache.spark.sql.types.{StructType, StructField, IntegerType}
+    import com.mongodb.spark.sql.helpers.StructFields
+
+    val data = Seq(Row(Row(new ObjectId().toHexString()), 99))
+    val rdd = spark.sparkContext.parallelize(data)
+    val schema = StructType(List(
+      StructFields.objectId("id", true),
+      StructField("a", IntegerType, true)))
+    val df = spark.createDataFrame(rdd, schema)
+    df.show()
+
+ The preceding example creates a DataFrame containing the following
+ fields and data types:
+
+ - The ``id`` field is a custom MongoDB BSON type, ``ObjectId``, defined
+   by ``StructFields.objectId``.
+
+ - The ``a`` field is an ``Int``, a data type available in Spark.
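+
+ Since the schema represents an ``ObjectId`` column as a struct wrapping
+ a hex string, you can reference the nested field directly in untyped
+ queries. A sketch reusing ``df`` from the example above; the nested
+ field name ``oid`` is an assumption to check against your connector
+ version, and the hex string is only illustrative:
+
+ .. code-block:: scala
+
+    // Select the hex string nested inside the ObjectId struct
+    df.select("id.oid").show()
+
+    // Filter rows whose ObjectId matches a known hex string
+    df.filter(df("id.oid") === "507f1f77bcf86cd799439011").show()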