Commit 77ace85

DOCSP-41049 schemaHint option (#205)
1 parent b37226e commit 77ace85

3 files changed (+85 −0 lines changed)


source/batch-mode/batch-read-config.txt

Lines changed: 7 additions & 0 deletions
@@ -135,6 +135,13 @@ You can configure the following properties when reading data from MongoDB in batch mode:
        |
        | **Default:** ``false``
 
+   * - ``schemaHint``
+     - | Specifies a partial schema of known field types to use when inferring
+         the schema for the collection. To learn more about the ``schemaHint``
+         option, see the :ref:`spark-schema-hint` section.
+       |
+       | **Default:** None
+
 .. _partitioner-conf:
 
 Partitioner Configurations

source/batch-mode/batch-read.txt

Lines changed: 71 additions & 0 deletions
@@ -57,6 +57,77 @@ Schema Inference
 
 .. include:: /scala/schema-inference.rst
 
+.. _spark-schema-hint:
+
+Specify Known Fields with Schema Hints
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+You can specify a schema of known field types to use during
+schema inference by setting the ``schemaHint`` configuration option. You can
+specify the ``schemaHint`` option in any of the following Spark formats:
+
69+
+.. list-table::
+   :header-rows: 1
+   :widths: 35 65
+
+   * - Type
+     - Format
+
+   * - DDL
+     - ``<field one name> <FIELD ONE TYPE>, <field two name> <FIELD TWO TYPE>``
+
+   * - SQL DDL
+     - ``STRUCT<<field one name>: <FIELD ONE TYPE>, <field two name>: <FIELD TWO TYPE>>``
+
+   * - JSON
+     - .. code-block:: json
+          :copyable: false
+
+          { "type": "struct", "fields": [
+            { "name": "<field name>", "type": "<field type>", "nullable": <true/false> },
+            { "name": "<field name>", "type": "<field type>", "nullable": <true/false> }]}
+
+The following example shows how to specify the ``schemaHint`` option in each
+format by using the Spark shell. The example specifies a string-valued field named
+``"value"`` and an integer-valued field named ``"count"``.
+
+.. code-block:: scala
+
+   import org.apache.spark.sql.types._
+
+   val mySchema = StructType(Seq(
+     StructField("value", StringType),
+     StructField("count", IntegerType)))
+
+   // Generate DDL format
+   mySchema.toDDL
+
+   // Generate SQL DDL format
+   mySchema.sql
+
+   // Generate Simple String DDL format
+   mySchema.simpleString
+
+   // Generate JSON format
+   mySchema.json
+
+You can also specify the ``schemaHint`` option in the Simple String DDL format
+or in the JSON format by using PySpark, as shown in the following example:
+
+.. code-block:: python
+
+   from pyspark.sql.types import StructType, StructField, StringType, IntegerType
+
+   mySchema = StructType([
+     StructField('value', StringType(), True),
+     StructField('count', IntegerType(), True)])
+
+   # Generate Simple String DDL format
+   mySchema.simpleString()
+
+   # Generate JSON format
+   mySchema.json()
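As a rough end-to-end sketch (not from the original change: the database and collection names are hypothetical placeholders, and the commented read would require a live Spark session plus the MongoDB Spark connector), the JSON-format hint can also be assembled with the Python standard library and passed as the ``schemaHint`` option:

```python
import json

# Build the JSON-format schema hint for the example schema above:
# a nullable string field "value" and a nullable integer field "count".
schema_hint = json.dumps({
    "type": "struct",
    "fields": [
        {"name": "value", "type": "string", "nullable": True},
        {"name": "count", "type": "integer", "nullable": True},
    ],
})

# With a running Spark session and the MongoDB Spark connector available,
# the hint could then be passed to a batch read (sketch only, not run here;
# "mydb" and "mycoll" are placeholder names):
# df = (spark.read.format("mongodb")
#       .option("database", "mydb")
#       .option("collection", "mycoll")
#       .option("schemaHint", schema_hint)
#       .load())
```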
 
 Filters
 -------

source/streaming-mode/streaming-read-config.txt

Lines changed: 7 additions & 0 deletions
@@ -107,6 +107,13 @@ You can configure the following properties when reading data from MongoDB in streaming mode:
        |
        | **Default:** ``false``
 
+   * - ``schemaHint``
+     - | Specifies a partial schema of known field types to use when inferring
+         the schema for the collection. To learn more about the ``schemaHint``
+         option, see the :ref:`spark-schema-hint` section.
+       |
+       | **Default:** None
+
 .. _change-stream-conf:
 
 Change Stream Configuration
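To connect the streaming entry above to usage, here is a hedged sketch (not from the original change; the database, collection, and field names are hypothetical, and the commented read needs a live Spark session with the MongoDB Spark connector) of supplying the hint to a streaming read:

```python
# DDL-format schema hint for two assumed fields; any of the formats shown
# in the batch-read section would work equally well here.
schema_hint = "value STRING, count INT"

# Sketch only (requires a running Spark session and the MongoDB Spark
# connector); "mydb" and "mycoll" are placeholder names:
# stream_df = (spark.readStream.format("mongodb")
#              .option("database", "mydb")
#              .option("collection", "mycoll")
#              .option("schemaHint", schema_hint)
#              .load())
```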
