Commit bb9013c

p-mongop
andauthored
RUBY-2236 Explain Ruby vs MongoDB/PCRE regular expressions better (#222)
Co-authored-by: Oleg Pudeyev <[email protected]>
1 parent 24c6a9c commit bb9013c

docs/tutorials/bson-v4.txt

Lines changed: 177 additions & 17 deletions
@@ -637,19 +637,175 @@ of the day that the ``Date`` refers to in UTC.
637637
Regular Expressions
638638
===================
639639

640-
Ruby regular expressions always have BSON regular expressions' equivalent of
641-
'm' flag on. In order for behavior to be preserved between the two, the 'm'
642-
option is always added when a Ruby regular expression is serialized to BSON.
640+
Both MongoDB and Ruby provide facilities for working with regular expressions,
641+
but they use different regular expression engines. The following subsections detail the
642+
differences between Ruby regular expressions and MongoDB regular expressions
643+
and describe how to work with both.
644+
645+
Ruby vs MongoDB Regular Expressions
646+
-----------------------------------
647+
648+
MongoDB server uses `Perl-compatible regular expressions implemented using
649+
the PCRE library <http://pcre.org/>`_ and `Ruby regular expressions
650+
<http://ruby-doc.org/core/Regexp.html>`_ are implemented using the
651+
`Onigmo regular expression engine <https://github.com/k-takata/Onigmo>`_,
652+
which is a fork of `Oniguruma <https://github.com/kkos/oniguruma>`_.
653+
The two regular expression implementations generally provide equivalent
654+
functionality but have several important syntax differences, as described
655+
below.
656+
657+
Unfortunately, there is no simple way to programmatically convert a PCRE
658+
regular expression into the equivalent Ruby regular expression,
659+
and there are currently no Ruby bindings for PCRE.
660+
661+
Options / Flags / Modifiers
662+
```````````````````````````
663+
664+
Both Ruby and PCRE regular expressions support modifiers. These are
665+
also called "options" in Ruby parlance and "flags" in PCRE parlance.
666+
The meaning of ``s`` and ``m`` modifiers differs in Ruby and PCRE:
667+
668+
- Ruby does not have the ``s`` modifier; instead, the Ruby ``m`` modifier
669+
performs the same function as the PCRE ``s`` modifier which is to make the
670+
period (``.``) match any character including newlines. Confusingly, the
671+
Ruby documentation refers to the ``m`` modifier as "enabling multi-line mode".
672+
- Ruby always operates in the equivalent of PCRE's multi-line mode, enabled by
673+
the ``m`` modifier in PCRE regular expressions. In Ruby the ``^`` anchor
674+
always refers to the beginning of line and the ``$`` anchor always refers
675+
to the end of line.
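
The Ruby side of these two points can be checked with a minimal sketch. It uses only Ruby's built-in ``Regexp`` and does not exercise PCRE; the variable names are illustrative.

```ruby
text = "hello\nworld"

# Without the m option, the period does not match a newline.
dot_without_m = /hello.world/.match?(text)

# With the Ruby m option (the counterpart of the PCRE s modifier), it does.
dot_with_m = /hello.world/m.match?(text)

# ^ and $ act as line anchors in Ruby even when no options are given.
caret_at_line_start = /^world/.match?(text)
dollar_at_line_end = /hello$/.match?(text)

p [dot_without_m, dot_with_m, caret_at_line_start, dollar_at_line_end]
# prints [false, true, true, true]
```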
676+
677+
When writing regular expressions intended to be used in both Ruby and
678+
PCRE environments (including MongoDB server and most other MongoDB drivers),
679+
henceforth referred to as "portable regular expressions", avoid using
680+
the ``^`` and ``$`` anchors. The following sections provide workarounds and
681+
recommendations for authoring portable regular expressions.
682+
683+
``^`` Anchor
684+
````````````
685+
686+
In Ruby regular expressions, the ``^`` anchor always refers to the beginning
687+
of line. In PCRE regular expressions, the ``^`` anchor refers to the beginning
688+
of input by default and the ``m`` flag changes its meaning to the beginning
689+
of line.
690+
691+
Both Ruby and PCRE regular expressions support the ``\A`` anchor to refer to
692+
the beginning of input, regardless of modifiers.
693+
694+
When writing portable regular expressions:
695+
696+
- Use the ``\A`` anchor to refer to the beginning of input.
697+
- Use the ``^`` anchor to refer to the beginning of line (this requires
698+
setting the ``m`` flag in PCRE regular expressions). Alternatively use
699+
one of the following constructs which work regardless of modifiers:
700+
- ``(?:\A|(?<=\n))`` (handles LF and CR+LF line ends)
701+
- ``(?:\A|(?<=[\r\n]))`` (handles CR, LF and CR+LF line ends)
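
As a rough illustration of these recommendations on the Ruby side only (PCRE behavior is not exercised here, and the sample strings are made up for the example):

```ruby
text = "42\nhello world"

# \A matches only at the beginning of input, so "hello" is not found.
at_input_start = /\Ahello/ =~ text

# ^ matches at the beginning of any line in Ruby.
at_line_start = /^hello/ =~ text

# The lookbehind construct finds the same position without relying on ^.
portable_lf = /(?:\A|(?<=\n))hello/ =~ text

# The character-class variant also accepts a bare CR as a line end.
portable_cr = /(?:\A|(?<=[\r\n]))hello/ =~ "42\rhello world"

p [at_input_start, at_line_start, portable_lf, portable_cr]
# prints [nil, 3, 3, 3]
```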
702+
703+
``$`` Anchor
704+
````````````
705+
706+
In Ruby regular expressions, the ``$`` anchor always refers to the end
707+
of line. In PCRE regular expressions, the ``$`` anchor refers to the end
708+
of input by default and the ``m`` flag changes its meaning to the end
709+
of line.
710+
711+
Both Ruby and PCRE regular expressions support the ``\z`` anchor to refer to
712+
the end of input, regardless of modifiers.
713+
714+
When writing portable regular expressions:
715+
716+
- Use the ``\z`` anchor to refer to the end of input.
717+
- Use the ``$`` anchor to refer to the end of line (this requires
718+
setting the ``m`` flag in PCRE regular expressions). Alternatively use
719+
one of the following constructs which work regardless of modifiers:
720+
- ``(?:\z|(?=\n))`` (handles LF and CR+LF line ends)
721+
- ``(?:\z|(?=[\r\n]))`` (handles CR, LF and CR+LF line ends)
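
The end-of-input and end-of-line cases can be sketched in the same way. This again covers Ruby only, and the sample string is an assumption chosen for the example:

```ruby
text = "hello world\n42"

# \z matches only at the end of input, so "world" is not found.
at_input_end = /world\z/ =~ text

# $ matches at the end of any line in Ruby.
at_line_end = /world$/ =~ text

# The lookahead construct finds the same position without relying on $.
portable_lf = /world(?:\z|(?=\n))/ =~ text

p [at_input_end, at_line_end, portable_lf]
# prints [nil, 6, 6]
```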
643722

644-
There is a class provided by the bson gem, ``Regexp::Raw``, to allow Ruby users
645-
to get around this. You can simply create a regular expression like this:
723+
``BSON::Regexp::Raw`` Class
724+
---------------------------
725+
726+
Since there is no simple way to programmatically convert a PCRE
727+
regular expression into the equivalent Ruby regular expression,
728+
bson-ruby provides the ``BSON::Regexp::Raw`` class for holding MongoDB/PCRE
729+
regular expressions. Instances of this class are called "BSON regular
730+
expressions" in this documentation.
731+
732+
Instances of this class can be created using the regular expression text
733+
as a string and optional PCRE modifiers:
734+
735+
.. code-block:: ruby
736+
737+
BSON::Regexp::Raw.new("^b403158")
738+
# => #<BSON::Regexp::Raw:0x000055df63186d78 @pattern="^b403158", @options="">
739+
740+
BSON::Regexp::Raw.new("^Hello.world$", "s")
741+
# => #<BSON::Regexp::Raw:0x000055df6317f028 @pattern="^Hello.world$", @options="s">
742+
743+
The ``BSON::Regexp`` module is included in the Ruby ``Regexp`` class, such that
744+
the ``BSON::`` prefix may be omitted:
646745

647746
.. code-block:: ruby
648747

649748
Regexp::Raw.new("^b403158")
749+
# => #<BSON::Regexp::Raw:0x000055df63186d78 @pattern="^b403158", @options="">
750+
751+
Regexp::Raw.new("^Hello.world$", "s")
752+
# => #<BSON::Regexp::Raw:0x000055df6317f028 @pattern="^Hello.world$", @options="s">
753+
754+
Regular Expression Conversion
755+
-----------------------------
650756

651-
This code example illustrates the difference between serializing a core Ruby
652-
``Regexp`` versus a ``Regexp::Raw`` object:
757+
To convert a Ruby regular expression to a BSON regular expression,
758+
instantiate a ``BSON::Regexp::Raw`` object as follows:
759+
760+
.. code-block:: ruby
761+
762+
regexp = /^Hello.world/
763+
bson_regexp = BSON::Regexp::Raw.new(regexp.source, regexp.options)
764+
# => #<BSON::Regexp::Raw:0x000055df62e42d60 @pattern="^Hello.world", @options=0>
765+
766+
Note that the ``BSON::Regexp::Raw`` constructor accepts both the Ruby numeric
767+
options and the PCRE modifier strings.
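
For reference, the Ruby numeric options are a bit mask built from the ``Regexp`` constants. The sketch below uses only core Ruby to show what ``source`` and ``options`` return for a literal with two options set; it does not call into bson-ruby.

```ruby
regexp = /^Hello.world/im

source = regexp.source
options = regexp.options

# Each Ruby option corresponds to one bit of the integer.
ignore_case = (options & Regexp::IGNORECASE) != 0
dot_matches_newline = (options & Regexp::MULTILINE) != 0
extended = (options & Regexp::EXTENDED) != 0

p source
# prints "^Hello.world"
p options
# prints 5
p [ignore_case, dot_matches_newline, extended]
# prints [true, true, false]
```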
768+
769+
To convert a BSON regular expression to a Ruby regular expression, call the
770+
``compile`` method on the BSON regular expression:
771+
772+
.. code-block:: ruby
773+
774+
bson_regexp = BSON::Regexp::Raw.new("^hello.world", "s")
775+
bson_regexp.compile
776+
# => /^hello.world/m
777+
778+
bson_regexp = BSON::Regexp::Raw.new("^hello.world", "")
779+
bson_regexp.compile
780+
# => /^hello.world/
781+
782+
bson_regexp = BSON::Regexp::Raw.new("^hello.world", "m")
783+
bson_regexp.compile
784+
# => /^hello.world/
785+
786+
Note that the ``s`` PCRE modifier was converted to the ``m`` Ruby modifier
787+
in the first example, and the last two examples were converted to the same
788+
regular expression even though the original BSON regular expressions had
789+
different meanings.
790+
791+
When a BSON regular expression uses the non-portable ``^`` and ``$``
792+
anchors, its conversion to a Ruby regular expression can change its meaning:
793+
794+
.. code-block:: ruby
795+
796+
BSON::Regexp::Raw.new("^hello.world", "").compile =~ "42\nhello world"
797+
# => 3
798+
799+
When a Ruby regular expression is converted to a BSON regular expression
800+
(for example, to send to the server as part of a query), the BSON regular
801+
expression always has the ``m`` modifier set, reflecting the behavior of
802+
``^`` and ``$`` anchors in Ruby regular expressions.
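
One way to picture this mapping is the sketch below. ``pcre_flags_for`` is a hypothetical helper written for this illustration, not part of bson-ruby, and the library's actual implementation may differ. It assumes the modifier semantics described in this section.

```ruby
# Hypothetical helper (not a bson-ruby API): derive PCRE-style flags
# from a Ruby regular expression, following the rules described above.
def pcre_flags_for(regexp)
  flags = +""
  flags << "i" if (regexp.options & Regexp::IGNORECASE) != 0
  # Ruby's ^ and $ are always line anchors, so PCRE's m is always set.
  flags << "m"
  # Ruby's m option corresponds to PCRE's s modifier.
  flags << "s" if (regexp.options & Regexp::MULTILINE) != 0
  flags << "x" if (regexp.options & Regexp::EXTENDED) != 0
  flags
end

plain_flags = pcre_flags_for(/^hello/)
combined_flags = pcre_flags_for(/^hello.world/mi)

p plain_flags
# prints "m"
p combined_flags
# prints "ims"
```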
803+
804+
Reading and Writing
805+
-------------------
806+
807+
Both Ruby and BSON regular expressions implement the ``to_bson`` method
808+
for serialization to BSON:
653809

654810
.. code-block:: ruby
655811

@@ -659,27 +815,31 @@ This code example illustrates the difference between serializing a core Ruby
659815
# => #<BSON::ByteBuffer:0x007fcf20ab8028>
660816
_.to_s
661817
# => "^b403158\x00m\x00"
818+
662819
regexp_raw = Regexp::Raw.new("^b403158")
663820
# => #<BSON::Regexp::Raw:0x007fcf21808f98 @pattern="^b403158", @options="">
664821
regexp_raw.to_bson
665822
# => #<BSON::ByteBuffer:0x007fcf213622f0>
666823
_.to_s
667824
# => "^b403158\x00\x00"
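
The two strings shown above follow the BSON regular expression layout: the pattern and the options, each terminated by a NUL byte. As a rough sketch of how to read that output, the hypothetical helper ``split_bson_regex`` below (not part of bson-ruby) pulls the two fields apart using only core Ruby.

```ruby
# Hypothetical helper (not a bson-ruby API): split serialized regular
# expression bytes into the pattern and options strings.
def split_bson_regex(bytes)
  # A positive limit keeps empty fields, so empty options survive.
  pattern, options, _rest = bytes.split("\x00", 3)
  [pattern, options]
end

from_ruby_regexp = split_bson_regex("^b403158\x00m\x00")
from_raw_regexp = split_bson_regex("^b403158\x00\x00")

p from_ruby_regexp
# prints ["^b403158", "m"]
p from_raw_regexp
# prints ["^b403158", ""]
```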
668825

669-
670-
Please use the ``Regexp::Raw`` class to instantiate your BSON regular
671-
expressions to get the exact pattern and options you want.
672-
673-
When regular expressions are deserialized, they return a wrapper that holds the
674-
raw regex string, but do not compile it. In order to get the Ruby ``Regexp``
675-
object, one must call ``compile`` on the returned object.
826+
Both ``Regexp`` and ``BSON::Regexp::Raw`` classes implement the ``from_bson``
827+
class method that deserializes a regular expression from a BSON byte buffer.
828+
Methods of both classes return a ``BSON::Regexp::Raw`` instance that
829+
must be converted to a Ruby regular expression using the ``compile`` method
830+
as described above.
676831

677832
.. code-block:: ruby
678833

834+
byte_buffer = BSON::ByteBuffer.new("^b403158\x00\x00")
679835
regex = Regexp.from_bson(byte_buffer)
680-
regex.pattern #=> Returns the pattern as a string.
681-
regex.options #=> Returns the raw options as a String.
682-
regex.compile #=> Returns the compiled Ruby Regexp object.
836+
# => #<BSON::Regexp::Raw:0x000055df63100d40 @pattern="^b403158", @options="">
837+
regex.pattern
838+
# => "^b403158"
839+
regex.options
840+
# => ""
841+
regex.compile
842+
# => /^b403158/
683843

684844

685845
Key Order
