Commit cc9bcef

(DOCSP-21746) Add Scala examples to Structured Streaming page (#122)
* (DOCSP-21746) Add Scala examples to Structured Streaming page
1 parent 24d7e0b commit cc9bcef

File tree

1 file changed: 218 additions, 3 deletions

source/structured-streaming.txt

Lines changed: 218 additions & 3 deletions
@@ -92,6 +92,59 @@ Configuring a Write Stream to MongoDB
      For a complete list of methods, see the
      `pyspark Structured Streaming reference <https://spark.apache.org/docs/latest/api/python/reference/pyspark.ss.html>`__.

+ - id: scala
+   content: |
+
+     Specify write stream configuration settings on your streaming
+     Dataset or DataFrame using the ``writeStream`` property. You
+     must specify the following configuration settings to write
+     to MongoDB:
+
+     .. list-table::
+        :header-rows: 1
+        :stub-columns: 1
+        :widths: 10 40
+
+        * - Setting
+          - Description
+
+        * - ``writeStream.format()``
+          - The format to use for write stream data. Use
+            ``mongodb``.
+
+        * - ``writeStream.option()``
+          - Use the ``option`` method to specify your MongoDB
+            deployment connection string with the
+            ``spark.mongodb.connection.uri`` option key.
+
+            You must specify a database and collection, either as
+            part of your connection string or with additional
+            ``option`` methods using the following keys:
+
+            - ``spark.mongodb.database``
+            - ``spark.mongodb.collection``
+
+        * - ``writeStream.outputMode()``
+          - The output mode to use. To view a list of all supported
+            output modes, see `the Scala outputMode documentation <https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/streaming/DataStreamWriter.html#outputMode(outputMode:String):org.apache.spark.sql.streaming.DataStreamWriter[T]>`__.
+
+     The following code snippet shows how to use the preceding
+     configuration settings to stream data to MongoDB:
+
+     .. code-block:: scala
+        :copyable: true
+        :emphasize-lines: 2-3, 6
+
+        <streaming Dataset/DataFrame>.writeStream
+          .format("mongodb")
+          .option("spark.mongodb.connection.uri", <mongodb-connection-string>)
+          .option("spark.mongodb.database", <database-name>)
+          .option("spark.mongodb.collection", <collection-name>)
+          .outputMode("append")
+
+     For a complete list of methods, see the
+     `Scala Structured Streaming reference <https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/streaming/index.html>`__.
+
 .. _read-structured-stream:
 .. _continuous-processing:

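The write-stream snippet above configures the stream but does not call ``start()``, so nothing is written until a streaming query is started. A minimal end-to-end sketch of the same settings, assuming a local ``SparkSession``, Spark's built-in ``rate`` source as a stand-in input, and the MongoDB Spark Connector jar on the classpath (the app name, checkpoint path, and placeholder values below are illustrative, not taken from this commit):

.. code-block:: scala

   import org.apache.spark.sql.SparkSession

   // assumes a reachable Spark master and the MongoDB Spark Connector on the classpath
   val spark = SparkSession.builder()
     .appName("writeStreamSketch")   // illustrative name
     .getOrCreate()

   // Spark's built-in "rate" source emits (timestamp, value) rows; it stands in
   // here for whatever streaming Dataset/DataFrame you actually want to write
   val streamingDF = spark.readStream
     .format("rate")
     .option("rowsPerSecond", "1")
     .load()

   val query = streamingDF.writeStream
     .format("mongodb")
     .option("checkpointLocation", "/tmp/checkpoint/")   // illustrative checkpoint path
     .option("spark.mongodb.connection.uri", "<mongodb-connection-string>")
     .option("spark.mongodb.database", "<database-name>")
     .option("spark.mongodb.collection", "<collection-name>")
     .outputMode("append")
     .start()                 // streaming begins only when start() is called

   query.awaitTermination()   // block until the stream stops or fails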
@@ -175,6 +228,72 @@ more about continuous processing, see the `Spark documentation <https://spark.ap
      For a complete list of methods, see the
      `pyspark Structured Streaming reference <https://spark.apache.org/docs/latest/api/python/reference/pyspark.ss.html>`__.

+ - id: scala
+   content: |
+
+     To use continuous processing with the MongoDB Spark Connector,
+     add the ``trigger()`` method to the ``writeStream`` property
+     of the streaming Dataset or DataFrame that you create from
+     your MongoDB read stream. In your ``trigger()``, specify the
+     ``continuous`` parameter.
+
+     .. note::
+
+        The connector populates its read stream from your MongoDB
+        deployment's change stream. To populate your change stream,
+        perform update operations on your database.
+
+        To learn more about change streams, see
+        :manual:`Change Streams </changeStreams>` in the MongoDB
+        manual.
+
+     Specify read stream configuration settings on your local
+     SparkSession ``readStream``. You must specify the following
+     configuration settings to read from MongoDB:
+
+     .. list-table::
+        :header-rows: 1
+        :stub-columns: 1
+        :widths: 10 40
+
+        * - Setting
+          - Description
+
+        * - ``readStream.format()``
+          - The format to use for read stream data. Use ``mongodb``.
+
+        * - ``writeStream.trigger()``
+          - Enables continuous processing for your read stream.
+            Import and use the ``Trigger.Continuous()`` property.
+
+     The following code snippet shows how to use the preceding
+     configuration settings to stream data from MongoDB:
+
+     .. code-block:: scala
+        :copyable: true
+        :emphasize-lines: 1, 4, 8
+
+        import org.apache.spark.sql.streaming.Trigger
+
+        val streamingDataFrame = <local SparkSession>.readStream
+          .format("mongodb")
+          .load()
+
+        val query = streamingDataFrame.writeStream
+          .trigger(Trigger.Continuous("1 second"))
+          .format("memory")
+          .outputMode("append")
+
+        query.start()
+
+     .. note::
+
+        Spark does not begin streaming until you call the
+        ``start()`` method on a streaming query.
+
+     For a complete list of methods, see the
+     `Scala Structured Streaming reference <https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/streaming/index.html>`__.
+
 Examples
 --------

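The note in this hunk points out that the connector's read stream is fed by your deployment's change stream, so it emits rows only when writes happen on the source collection. One way to generate such events from Scala, sketched here with the separate MongoDB Scala driver (``org.mongodb.scala``) rather than the Spark Connector, and assuming the placeholder connection string, database, and collection match the read stream you configured:

.. code-block:: scala

   import org.mongodb.scala._
   import scala.concurrent.Await
   import scala.concurrent.duration._

   // illustrative helper: insert a few documents so the connector's
   // change-stream-backed read stream has events to deliver
   val client = MongoClient("<mongodb-connection-string>")
   val collection = client
     .getDatabase("<database-name>")
     .getCollection("<collection-name>")

   val docs = (1 to 5).map(i => Document("value" -> i))
   // the Scala driver is asynchronous; toFuture() forces the insert to run
   Await.result(collection.insertMany(docs).toFuture(), 30.seconds)
   client.close()

Any insert, update, or delete on the source collection produces change-stream events; the loop above is only a convenient way to create a handful of them.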
@@ -227,14 +346,59 @@ Stream to MongoDB from a CSV File
             .option("checkpointLocation", "/tmp/pyspark/")
             .option("forceDeleteTempCheckpointLocation", "true")
             .option("spark.mongodb.connection.uri", <mongodb-connection-string>)
-            .option('spark.mongodb.database', <database-name>)
-            .option('spark.mongodb.collection', <collection-name>)
+            .option("spark.mongodb.database", <database-name>)
+            .option("spark.mongodb.collection", <collection-name>)
             .outputMode("append")
         )

         # run the query
         query.start()

+ - id: scala
+   content: |
+
+     To create a :ref:`write stream <write-structured-stream>` to
+     MongoDB from a ``.csv`` file, first create a `DataStreamReader <https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/streaming/DataStreamReader.html>`__
+     from the ``.csv`` file, then use that ``DataStreamReader`` to
+     create a `DataStreamWriter <https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/streaming/DataStreamWriter.html>`__
+     to MongoDB. Finally, use the ``start()`` method to begin the
+     stream.
+
+     As streaming data is read from the ``.csv`` file, it is added
+     to MongoDB in the `outputMode <https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/streaming/DataStreamWriter.html#outputMode(outputMode:String):org.apache.spark.sql.streaming.DataStreamWriter[T]>`__
+     you specify.
+
+     .. code-block:: scala
+        :copyable: true
+        :emphasize-lines: 10, 16
+
+        // create a local SparkSession
+        val spark = SparkSession
+          .builder
+          .appName("writeExample")
+          .master("spark://spark-master:<port>")
+          .config("spark.jars", "<mongo-spark-connector>.jar")
+          .getOrCreate()
+
+        // define a streaming query
+        val query = spark.readStream
+          .format("csv")
+          .option("header", "true")
+          .schema(<csv-schema>)
+          .load(<csv-file-name>)
+          // manipulate your streaming data
+          .writeStream
+          .format("mongodb")
+          .option("checkpointLocation", "/tmp/")
+          .option("forceDeleteTempCheckpointLocation", "true")
+          .option("spark.mongodb.connection.uri", <mongodb-connection-string>)
+          .option("spark.mongodb.database", <database-name>)
+          .option("spark.mongodb.collection", <collection-name>)
+          .outputMode("append")
+
+        // run the query
+        query.start()
+
 Stream to a CSV File from MongoDB
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

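In the Scala example above, ``<csv-schema>`` is left as a placeholder for a Spark schema object. A possible value for it, assuming a ``.csv`` file with the same stock-tick columns that the MongoDB-to-CSV example below declares (``company_symbol``, ``company_name``, ``price``, ``tx_time``):

.. code-block:: scala

   import org.apache.spark.sql.types.{DoubleType, StringType, StructType, TimestampType}

   // schema for the illustrative CSV columns; adjust names and types
   // to match your actual file
   val csvSchema = new StructType()
     .add("company_symbol", StringType)
     .add("company_name", StringType)
     .add("price", DoubleType)
     .add("tx_time", TimestampType)

You would then pass ``csvSchema`` to the ``.schema()`` call in place of the placeholder.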
@@ -294,4 +458,55 @@ Stream to a CSV File from MongoDB
         )

         # run the query
-        query.start()
+        query.start()
+
+ - id: scala
+   content: |
+
+     To create a :ref:`read stream <read-structured-stream>` to a
+     ``.csv`` file from MongoDB, first create a `DataStreamReader <https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/streaming/DataStreamReader.html>`__
+     from MongoDB, then use that ``DataStreamReader`` to
+     create a `DataStreamWriter <https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/streaming/DataStreamWriter.html>`__
+     to a new ``.csv`` file. Finally, use the ``start()`` method
+     to begin the stream.
+
+     As new data is inserted into MongoDB, MongoDB streams that
+     data out to a ``.csv`` file in the `outputMode <https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/streaming/DataStreamWriter.html#outputMode(outputMode:String):org.apache.spark.sql.streaming.DataStreamWriter[T]>`__
+     you specify.
+
+     .. code-block:: scala
+        :copyable: true
+        :emphasize-lines: 17, 25, 28
+
+        // create a local SparkSession
+        val spark = SparkSession
+          .builder
+          .appName("readExample")
+          .master("spark://spark-master:<port>")
+          .config("spark.jars", "<mongo-spark-connector>.jar")
+          .getOrCreate()
+
+        // define the schema of the source collection
+        val readSchema = new StructType()
+          .add("company_symbol", StringType)
+          .add("company_name", StringType)
+          .add("price", DoubleType)
+          .add("tx_time", TimestampType)
+
+        // define a streaming query
+        val query = spark.readStream
+          .format("mongodb")
+          .option("spark.mongodb.connection.uri", <mongodb-connection-string>)
+          .option("spark.mongodb.database", <database-name>)
+          .option("spark.mongodb.collection", <collection-name>)
+          .schema(readSchema)
+          .load()
+          // manipulate your streaming data
+          .writeStream
+          .format("csv")
+          .option("path", "/output/")
+          .trigger(Trigger.Continuous("1 second"))
+          .outputMode("append")
+
+        // run the query
+        query.start()

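The final Scala snippet uses ``SparkSession``, the ``StructType`` schema types, and ``Trigger`` without showing where they come from. Assuming stock Spark packages, the imports it relies on would be:

.. code-block:: scala

   // imports needed by the read-stream-to-CSV example above
   import org.apache.spark.sql.SparkSession
   import org.apache.spark.sql.streaming.Trigger
   import org.apache.spark.sql.types.{DoubleType, StringType, StructType, TimestampType}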