Quick Start
===========

.. facet::
   :name: genre
   :values: reference

.. meta::
   :keywords: tutorial, introduction, setup, begin

This tutorial is intended as an introduction to working with
**{+driver-short+}**. It assumes that you are familiar with basic
`PyMongo <https://pymongo.readthedocs.io/en/stable/tutorial.html>`__ and
`MongoDB <https://docs.mongodb.com>`__ concepts.

Prerequisites
-------------

Ensure that you have the {+driver-short+} distribution
:ref:`installed <pymongo-arrow-install>`. In the Python shell, the following should
run without raising an exception:

.. code-block:: python

   >>> import pymongoarrow as pma

This tutorial also assumes that a MongoDB instance is running on the
default host and port. After you have `downloaded and installed
<https://docs.mongodb.com/manual/installation/>`__ MongoDB, you can start
it as shown in the following code example:

.. code-block:: bash

   $ mongod

Extending PyMongo
~~~~~~~~~~~~~~~~~

The ``pymongoarrow.monkey`` module provides an interface to patch PyMongo
in place and add {+driver-short+} functionality directly to
``Collection`` instances:

.. code-block:: python

   from pymongoarrow.monkey import patch_all
   patch_all()

After you run the ``monkey.patch_all()`` method, new instances of
the ``Collection`` class will contain the {+driver-short+} APIs, such as the
``pymongoarrow.api.find_pandas_all()`` method.

.. note::

   You can also use any of the {+driver-short+} APIs
   by importing them from the ``pymongoarrow.api`` module. If you do,
   you must pass the ``Collection`` instance on which to run the
   operation as the first argument when you call the API method.
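
   For example, the following sketch calls the direct API form of the same
   operation. It assumes a local ``MongoClient`` and the ``db.data`` sample
   collection created in the following section:

   .. code-block:: python

      from pymongo import MongoClient
      from pymongoarrow.api import find_pandas_all

      client = MongoClient()
      # Pass the Collection as the first argument instead of
      # calling a patched Collection method
      df = find_pandas_all(client.db.data, {'amount': {'$gt': 0}})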

Test Data
~~~~~~~~~

The following code uses PyMongo to add sample data to your cluster:

.. code-block:: python

   from datetime import datetime
   from pymongo import MongoClient
   client = MongoClient()
   client.db.data.insert_many([
       {'_id': 1, 'amount': 21, 'last_updated': datetime(2020, 12, 10, 1, 3, 1), 'account': {'name': 'Customer1', 'account_number': 1}, 'txns': ['A']},
       {'_id': 2, 'amount': 16, 'last_updated': datetime(2020, 7, 23, 6, 7, 11), 'account': {'name': 'Customer2', 'account_number': 2}, 'txns': ['A', 'B']},
       {'_id': 3, 'amount': 3, 'last_updated': datetime(2021, 3, 10, 18, 43, 9), 'account': {'name': 'Customer3', 'account_number': 3}, 'txns': ['A', 'B', 'C']},
       {'_id': 4, 'amount': 0, 'last_updated': datetime(2021, 2, 25, 3, 50, 31), 'account': {'name': 'Customer4', 'account_number': 4}, 'txns': ['A', 'B', 'C', 'D']}])

Defining the Schema
-------------------

{+driver-short+} relies on a data schema to marshal
query result sets into tabular form. If you don't provide this schema, {+driver-short+}
infers one from the data. You can define the schema by
creating a ``Schema`` object and mapping the field names
to type-specifiers, as shown in the following example:

.. code-block:: python

   from datetime import datetime
   from pymongoarrow.api import Schema
   schema = Schema({'_id': int, 'amount': float, 'last_updated': datetime})

MongoDB uses embedded documents to represent nested data. {+driver-short+} offers
first-class support for these documents:

.. code-block:: python

   schema = Schema({'_id': int, 'amount': float, 'account': {'name': str, 'account_number': int}})

{+driver-short+} also supports lists and nested lists:

.. code-block:: python

   from pyarrow import list_, string
   schema = Schema({'txns': list_(string())})
   polars_df = client.db.data.find_polars_all({'amount': {'$gt': 0}}, schema=schema)

.. tip::

   {+driver-short+} includes multiple permissible type-identifiers for each supported BSON
   type. For a full list of these data types and their associated type-identifiers, see
   :ref:`<pymongo-arrow-data-types>`.
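
   For instance, the first schema in this section could also be written with
   explicit PyArrow type-identifiers. This is a sketch; see the data types
   page for the exact mapping:

   .. code-block:: python

      import pyarrow as pa

      # Assumed mapping: int corresponds to pa.int64(), float to
      # pa.float64(), and datetime to pa.timestamp('ms')
      schema = Schema({'_id': pa.int64(), 'amount': pa.float64(),
                       'last_updated': pa.timestamp('ms')})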

Find Operations
---------------

The following code example shows how to load all records that have a non-zero
value for the ``amount`` field as a ``pandas.DataFrame`` object:

.. code-block:: python

   df = client.db.data.find_pandas_all({'amount': {'$gt': 0}}, schema=schema)

You can also load the same result set as a ``pyarrow.Table`` instance:

.. code-block:: python

   arrow_table = client.db.data.find_arrow_all({'amount': {'$gt': 0}}, schema=schema)

Or as a ``polars.DataFrame`` instance:

.. code-block:: python

   df = client.db.data.find_polars_all({'amount': {'$gt': 0}}, schema=schema)

Or as a dictionary of NumPy ``ndarray`` instances:

.. code-block:: python

   ndarrays = client.db.data.find_numpy_all({'amount': {'$gt': 0}}, schema=schema)

When using NumPy, the return value is a dictionary where the keys are field
names and the values are the corresponding ``numpy.ndarray`` instances.
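
For example, the following sketch reads one field out of that dictionary; the
``amount`` key comes from the sample schema defined earlier:

.. code-block:: python

   # Each field in the result maps to its own ndarray
   amounts = ndarrays['amount']
   print(amounts.sum())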

.. note::

   In all of the preceding examples, you can omit the schema as shown in the
   following example:

   .. code-block:: python

      arrow_table = client.db.data.find_arrow_all({'amount': {'$gt': 0}})

   If you omit the schema, {+driver-short+} tries to automatically apply a schema based on
   the data contained in the first batch.

Aggregate Operations
--------------------

Running an aggregate operation is similar to running a find operation, but it
takes a sequence of pipeline stages to perform.

The following is a simple example of the ``aggregate_pandas_all()`` method that outputs a
new dataframe in which all ``_id`` values are grouped together and their ``amount`` values
summed:

.. code-block:: python

   df = client.db.data.aggregate_pandas_all([{'$group': {'_id': None, 'total_amount': {'$sum': '$amount'}}}])

You can also run aggregate operations on nested data.
The following example unwinds values in the nested ``txns`` field, counts the number of each
value, then returns the results as a dictionary of NumPy ``ndarray`` objects, sorted in
descending order by count:

.. code-block:: python

   pipeline = [{'$unwind': '$txns'}, {'$group': {'_id': '$txns', 'count': {'$sum': 1}}}, {'$sort': {'count': -1}}]
   ndarrays = client.db.data.aggregate_numpy_all(pipeline)

.. tip::

   For more information about aggregation pipelines, see the
   :manual:`MongoDB Server documentation </core/aggregation-pipeline/>`.

Writing to MongoDB
------------------

You can use the ``write()`` method to write objects of the following types to MongoDB:

- Arrow ``Table``
- Pandas ``DataFrame``
- NumPy ``ndarray``
- Polars ``DataFrame``

.. code-block:: python

   from pymongoarrow.api import write
   from pymongo import MongoClient
   coll = MongoClient().db.my_collection
   write(coll, df)
   write(coll, arrow_table)
   write(coll, ndarrays)

.. note::

   NumPy ``ndarray`` data is passed to ``write()`` as a ``dict[str, ndarray]``
   mapping of field names to arrays.
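
   For example, a hypothetical ``ndarray`` payload might look like the
   following:

   .. code-block:: python

      import numpy as np

      # Field names map to equal-length arrays
      ndarray_data = {'_id': np.array([5, 6]), 'amount': np.array([7.0, 8.0])}
      write(coll, ndarray_data)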

Writing to Other Formats
------------------------

After you load a result set, you can write it to any format that the
corresponding package supports.

For example, to write the table referenced by the variable ``arrow_table`` to a Parquet
file named ``example.parquet``, run the following code:

.. code-block:: python

   import pyarrow.parquet as pq
   pq.write_table(arrow_table, 'example.parquet')

Pandas also supports writing ``DataFrame`` instances to a variety
of formats, including CSV and HDF. To write the data frame
referenced by the variable ``df`` to a CSV file named ``out.csv``, run the following
code:

.. code-block:: python

   df.to_csv('out.csv', index=False)

The Polars API mixes the two preceding approaches, writing a ``DataFrame``
directly to Parquet:

.. code-block:: python

   import polars as pl
   df = pl.DataFrame({"foo": [1, 2, 3, 4, 5]})
   df.write_parquet('example.parquet')

.. note::

   Nested data is supported for Parquet read and write operations, but is not well
   supported by Arrow or Pandas for CSV read and write operations.