Commit 77ace85

DOCSP-41049 schemaHint option (#205)
1 parent b37226e commit 77ace85

3 files changed (+85 −0 lines changed)


source/batch-mode/batch-read-config.txt

Lines changed: 7 additions & 0 deletions
@@ -135,6 +135,13 @@ You can configure the following properties when reading data from MongoDB in batch mode:
        |
        | **Default:** ``false``
 
+   * - ``schemaHint``
+     - | Specifies a partial schema of known field types to use when inferring
+         the schema for the collection. To learn more about the ``schemaHint``
+         option, see the :ref:`spark-schema-hint` section.
+       |
+       | **Default:** None
+
 .. _partitioner-conf:
 
 Partitioner Configurations

source/batch-mode/batch-read.txt

Lines changed: 71 additions & 0 deletions
@@ -57,6 +57,77 @@ Schema Inference
 
 .. include:: /scala/schema-inference.rst
 
+.. _spark-schema-hint:
+
+Specify Known Fields with Schema Hints
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+You can specify a schema of known field types to use during
+schema inference by setting the ``schemaHint`` configuration option. You can
+specify the ``schemaHint`` option in any of the following Spark formats:
+
69+
+.. list-table::
+   :header-rows: 1
+   :widths: 35 65
+
+   * - Type
+     - Format
+
+   * - DDL
+     - ``<field one name> <FIELD ONE TYPE>, <field two name> <FIELD TWO TYPE>``
+
+   * - SQL DDL
+     - ``STRUCT<<field one name>: <FIELD ONE TYPE>, <field two name>: <FIELD TWO TYPE>>``
+
+   * - JSON
+     - .. code-block:: json
+          :copyable: false
+
+          { "type": "struct", "fields": [
+            { "name": "<field name>", "type": "<field type>", "nullable": <true/false> },
+            { "name": "<field name>", "type": "<field type>", "nullable": <true/false> }]}
+
+The following example shows how to specify the ``schemaHint`` option in each
+format by using the Spark shell. The example specifies a string-valued field named
+``"value"`` and an integer-valued field named ``"count"``.
+
+.. code-block:: scala
+
+   import org.apache.spark.sql.types._
+
+   val mySchema = StructType(Seq(
+     StructField("value", StringType),
+     StructField("count", IntegerType)))
+
+   // Generate DDL format
+   mySchema.toDDL
+
+   // Generate SQL DDL format
+   mySchema.sql
+
+   // Generate Simple String DDL format
+   mySchema.simpleString
+
+   // Generate JSON format
+   mySchema.json
+
+You can also specify the ``schemaHint`` option in the Simple String DDL format
+or in the JSON format by using PySpark, as shown in the following example:
+
+.. code-block:: python
+
+   from pyspark.sql.types import StructType, StructField, StringType, IntegerType
+
+   mySchema = StructType([
+     StructField('value', StringType(), True),
+     StructField('count', IntegerType(), True)])
+
+   # Generate Simple String DDL format
+   mySchema.simpleString()
+
+   # Generate JSON format
+   mySchema.json()
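As a rough end-to-end sketch (not from the original change: the database and collection names are hypothetical placeholders, and the commented read would require a live Spark session plus the MongoDB Spark connector), the JSON-format hint can also be assembled with the Python standard library and passed as the ``schemaHint`` option:

```python
import json

# Build the JSON-format schema hint for the example schema above:
# a nullable string field "value" and a nullable integer field "count".
schema_hint = json.dumps({
    "type": "struct",
    "fields": [
        {"name": "value", "type": "string", "nullable": True},
        {"name": "count", "type": "integer", "nullable": True},
    ],
})

# With a running Spark session and the MongoDB Spark connector available,
# the hint could then be passed to a batch read (sketch only, not run here;
# "mydb" and "mycoll" are placeholder names):
# df = (spark.read.format("mongodb")
#       .option("database", "mydb")
#       .option("collection", "mycoll")
#       .option("schemaHint", schema_hint)
#       .load())
```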
 
 Filters
 -------

source/streaming-mode/streaming-read-config.txt

Lines changed: 7 additions & 0 deletions
@@ -107,6 +107,13 @@ You can configure the following properties when reading data from MongoDB in streaming mode:
        |
        | **Default:** ``false``
 
+   * - ``schemaHint``
+     - | Specifies a partial schema of known field types to use when inferring
+         the schema for the collection. To learn more about the ``schemaHint``
+         option, see the :ref:`spark-schema-hint` section.
+       |
+       | **Default:** None
+
 .. _change-stream-conf:
 
 Change Stream Configuration
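To connect the streaming entry above to usage, here is a hedged sketch (not from the original change; the database, collection, and field names are hypothetical, and the commented read needs a live Spark session with the MongoDB Spark connector) of supplying the hint to a streaming read:

```python
# DDL-format schema hint for two assumed fields; any of the formats shown
# in the batch-read section would work equally well here.
schema_hint = "value STRING, count INT"

# Sketch only (requires a running Spark session and the MongoDB Spark
# connector); "mydb" and "mycoll" are placeholder names:
# stream_df = (spark.readStream.format("mongodb")
#              .option("database", "mydb")
#              .option("collection", "mycoll")
#              .option("schemaHint", schema_hint)
#              .load())
```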
