Commit 4c8a09e

DOCSP-37419 - Quick Start (#7)
Co-authored-by: Jordan Smith <[email protected]>
1 parent ce51e16 commit 4c8a09e


source/quick-start.txt

Lines changed: 163 additions & 102 deletions

Quick Start
===========

.. facet::
   :name: genre
   :values: reference

.. meta::
   :keywords: tutorial, introduction, setup, begin

This tutorial is intended as an introduction to working with
**{+driver-short+}**. The tutorial assumes the reader is familiar with basic
`PyMongo <https://pymongo.readthedocs.io/en/stable/tutorial.html>`__ and
`MongoDB <https://docs.mongodb.com>`__ concepts.

Prerequisites
-------------

Ensure that you have the {+driver-short+} distribution
:ref:`installed <pymongo-arrow-install>`. In the Python shell, the following should
run without raising an exception:

.. code-block:: python

   >>> import pymongoarrow as pma

This tutorial also assumes that a MongoDB instance is running on the
default host and port. After you have `downloaded and installed
<https://docs.mongodb.com/manual/installation/>`__ MongoDB, you can start
it as shown in the following code example:

.. code-block:: bash

   $ mongod

Extending PyMongo
~~~~~~~~~~~~~~~~~

The ``pymongoarrow.monkey`` module provides an interface to patch PyMongo
in place and add {+driver-short+} functionality directly to
``Collection`` instances:

.. code-block:: python

   from pymongoarrow.monkey import patch_all
   patch_all()

After you run the ``monkey.patch_all()`` method, new instances of
the ``Collection`` class contain the {+driver-short+} APIs, such as the
``pymongoarrow.api.find_pandas_all()`` method.

.. note::

   You can also use any of the {+driver-short+} APIs
   by importing them from the ``pymongoarrow.api`` module. If you do,
   you must pass the ``Collection`` instance on which to run the operation
   as the first argument when you call the API method.
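
   For example, a minimal sketch of calling ``find_pandas_all()`` directly,
   without patching (assuming the ``client`` and sample ``data`` collection
   created in the following Test Data section):

   .. code-block:: python

      from pymongo import MongoClient
      from pymongoarrow.api import find_pandas_all

      client = MongoClient()
      # Pass the Collection explicitly as the first argument
      df = find_pandas_all(client.db.data, {'amount': {'$gt': 0}})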

Test Data
~~~~~~~~~

The following code uses PyMongo to add sample data to your cluster:

.. code-block:: python

   from datetime import datetime
   from pymongo import MongoClient

   client = MongoClient()
   client.db.data.insert_many([
       {'_id': 1, 'amount': 21, 'last_updated': datetime(2020, 12, 10, 1, 3, 1), 'account': {'name': 'Customer1', 'account_number': 1}, 'txns': ['A']},
       {'_id': 2, 'amount': 16, 'last_updated': datetime(2020, 7, 23, 6, 7, 11), 'account': {'name': 'Customer2', 'account_number': 2}, 'txns': ['A', 'B']},
       {'_id': 3, 'amount': 3, 'last_updated': datetime(2021, 3, 10, 18, 43, 9), 'account': {'name': 'Customer3', 'account_number': 3}, 'txns': ['A', 'B', 'C']},
       {'_id': 4, 'amount': 0, 'last_updated': datetime(2021, 2, 25, 3, 50, 31), 'account': {'name': 'Customer4', 'account_number': 4}, 'txns': ['A', 'B', 'C', 'D']}])

Defining the Schema
-------------------

{+driver-short+} relies on a data schema to marshal
query result sets into tabular form. If you don't provide this schema, {+driver-short+}
infers one from the data. You can define the schema by
creating a ``Schema`` object that maps field names
to type-specifiers, as shown in the following example:

.. code-block:: python

   from datetime import datetime
   from pymongoarrow.api import Schema

   schema = Schema({'_id': int, 'amount': float, 'last_updated': datetime})

MongoDB uses embedded documents to represent nested data. {+driver-short+} offers
first-class support for these documents:

.. code-block:: python

   schema = Schema({'_id': int, 'amount': float, 'account': {'name': str, 'account_number': int}})

{+driver-short+} also supports lists and nested lists:

.. code-block:: python

   from pyarrow import list_, string

   schema = Schema({'txns': list_(string())})
   polars_df = client.db.data.find_polars_all({'amount': {'$gt': 0}}, schema=schema)

.. tip::

   {+driver-short+} includes multiple permissible type-identifiers for each supported BSON
   type. For a full list of these data types and their associated type-identifiers, see
   :ref:`pymongo-arrow-data-types`.
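
   For example, a sketch of the first schema in this section written with explicit
   PyArrow type-specifiers instead of Python types (assuming the standard
   {+driver-short+} mappings of ``int`` to ``int64``, ``float`` to ``float64``, and
   ``datetime`` to a millisecond timestamp):

   .. code-block:: python

      import pyarrow as pa
      from pymongoarrow.api import Schema

      # Equivalent to Schema({'_id': int, 'amount': float, 'last_updated': datetime})
      schema = Schema({'_id': pa.int64(), 'amount': pa.float64(), 'last_updated': pa.timestamp('ms')})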

Find Operations
---------------

The following code example shows how to load all records that have a non-zero
value for the ``amount`` field as a ``pandas.DataFrame`` object:

.. code-block:: python

   df = client.db.data.find_pandas_all({'amount': {'$gt': 0}}, schema=schema)

You can also load the same result set as a ``pyarrow.Table`` instance:

.. code-block:: python

   arrow_table = client.db.data.find_arrow_all({'amount': {'$gt': 0}}, schema=schema)

Or as a ``polars.DataFrame`` instance:

.. code-block:: python

   df = client.db.data.find_polars_all({'amount': {'$gt': 0}}, schema=schema)

Or as a dictionary of NumPy ``ndarray`` objects:

.. code-block:: python

   ndarrays = client.db.data.find_numpy_all({'amount': {'$gt': 0}}, schema=schema)

When using NumPy, the return value is a dictionary where the keys are field
names and the values are the corresponding ``numpy.ndarray`` instances.
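
For example, a brief sketch of reading columns from this dictionary (field names
follow the sample data and schema used earlier):

.. code-block:: python

   amounts = ndarrays['amount']  # numpy.ndarray of 'amount' values
   ids = ndarrays['_id']         # numpy.ndarray of '_id' values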

.. note::

   In all of the preceding examples, you can omit the schema, as shown in the
   following example:

   .. code-block:: python

      arrow_table = client.db.data.find_arrow_all({'amount': {'$gt': 0}})

   If you omit the schema, {+driver-short+} tries to automatically infer a schema
   based on the data contained in the first batch.

Aggregate Operations
--------------------

Running an aggregate operation is similar to running a find operation, but it takes a
sequence of operations, called a pipeline, to perform.

The following is a simple example of the ``aggregate_pandas_all()`` method that outputs a
new dataframe in which all ``_id`` values are grouped together and their ``amount`` values
summed:

.. code-block:: python

   df = client.db.data.aggregate_pandas_all([{'$group': {'_id': None, 'total_amount': {'$sum': '$amount'}}}])

You can also run aggregate operations on embedded documents.
The following example unwinds values in the nested ``txns`` field, counts the number of each
value, then returns the results as a dictionary of NumPy ``ndarray`` objects, sorted in
descending order of count:

.. code-block:: python

   pipeline = [{'$unwind': '$txns'}, {'$group': {'_id': '$txns', 'count': {'$sum': 1}}}, {'$sort': {'count': -1}}]
   ndarrays = client.db.data.aggregate_numpy_all(pipeline)

.. tip::

   For more information about aggregation pipelines, see the
   :manual:`MongoDB Server documentation </core/aggregation-pipeline/>`.

Writing to MongoDB
------------------

You can use the ``write()`` method to write objects of the following types to MongoDB:

- Arrow ``Table``
- Pandas ``DataFrame``
- NumPy ``ndarray``
- Polars ``DataFrame``

.. code-block:: python

   from pymongoarrow.api import write
   from pymongo import MongoClient

   coll = MongoClient().db.my_collection
   write(coll, df)
   write(coll, arrow_table)
   write(coll, ndarrays)

.. note::

   NumPy arrays are specified as ``dict[str, ndarray]``.
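
   For example, a minimal sketch of writing NumPy data in this form (the field
   names are illustrative):

   .. code-block:: python

      import numpy as np

      # Each key is a field name; each value is a column of values for that field
      write(coll, {'amount': np.array([7.5, 9.0]), 'account_number': np.array([5, 6])})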

Writing to Other Formats
------------------------

After you load a result set, you can write it to any format that the package
supports.

For example, to write the table referenced by the variable ``arrow_table`` to a Parquet
file named ``example.parquet``, run the following code:

.. code-block:: python

   import pyarrow.parquet as pq

   pq.write_table(arrow_table, 'example.parquet')

Pandas also supports writing ``DataFrame`` instances to a variety
of formats, including CSV and HDF. To write the data frame
referenced by the variable ``df`` to a CSV file named ``out.csv``, run the following
code:

.. code-block:: python

   df.to_csv('out.csv', index=False)

The Polars API is a mix of the two preceding examples:

.. code-block:: python

   import polars as pl

   df = pl.DataFrame({"foo": [1, 2, 3, 4, 5]})
   df.write_parquet('example.parquet')

.. note::

   Nested data is supported for Parquet read and write operations, but is not well
   supported by Arrow or Pandas for CSV read and write operations.