@@ -637,19 +637,175 @@ of the day that the ``Date`` refers to in UTC.
637
637
Regular Expressions
638
638
===================
639
639
640
- Ruby regular expressions always have BSON regular expressions' equivalent of
641
- 'm' flag on. In order for behavior to be preserved between the two, the 'm'
642
- option is always added when a Ruby regular expression is serialized to BSON.
640
+ Both MongoDB and Ruby provide facilities for working with regular expressions,
641
+ but they use different regular expression engines. The following subsections detail the
642
+ differences between Ruby regular expressions and MongoDB regular expressions
643
+ and describe how to work with both.
644
+
645
+ Ruby vs MongoDB Regular Expressions
646
+ -----------------------------------
647
+
648
+ MongoDB server uses `Perl-compatible regular expressions implemented using
649
+ the PCRE library <http://pcre.org/>`_ and `Ruby regular expressions
650
+ <http://ruby-doc.org/core/Regexp.html>`_ are implemented using the
651
+ `Onigmo regular expression engine <https://github.com/k-takata/Onigmo>`_,
652
+ which is a fork of `Oniguruma <https://github.com/kkos/oniguruma>`_.
653
+ The two regular expression implementations generally provide equivalent
654
+ functionality but have several important syntax differences, as described
655
+ below.
656
+
657
+ Unfortunately, there is no simple way to programmatically convert a PCRE
658
+ regular expression into the equivalent Ruby regular expression,
659
+ and there are currently no Ruby bindings for PCRE.
660
+
661
+ Options / Flags / Modifiers
662
+ ```````````````````````````
663
+
664
+ Both Ruby and PCRE regular expressions support modifiers. These are
665
+ also called "options" in Ruby parlance and "flags" in PCRE parlance.
666
+ The meaning of ``s`` and ``m`` modifiers differs in Ruby and PCRE:
667
+
668
+ - Ruby does not have the ``s`` modifier; instead, the Ruby ``m`` modifier
669
+ performs the same function as the PCRE ``s`` modifier which is to make the
670
+ period (``.``) match any character including newlines. Confusingly, the
671
+ Ruby documentation refers to the ``m`` modifier as "enabling multi-line mode".
672
+ - Ruby always operates in the equivalent of PCRE's multi-line mode, enabled by
673
+ the ``m`` modifier in PCRE regular expressions. In Ruby the ``^`` anchor
674
+ always refers to the beginning of line and the ``$`` anchor always refers
675
+ to the end of line.
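
The two differences listed above can be checked with core Ruby alone. The following is a minimal sketch; the sample text and variable names are illustrative assumptions and are not part of bson-ruby:

```ruby
# Illustrative values only; the variable names are not part of any API.
text = "hello\nworld"

# Without the m option the period does not match the newline.
dot_plain = /hello.world/.match?(text)    # => false

# With Ruby's m option (the counterpart of PCRE's s) the period matches it.
dot_with_m = /hello.world/m.match?(text)  # => true

# ^ and $ are line anchors in Ruby even when no option is set.
caret_index = /^world/ =~ text            # => 6
dollar_index = /hello$/ =~ text           # => 0
```

In PCRE, the last two patterns would match at those positions only with the ``m`` flag set.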
676
+
677
+ When writing regular expressions intended to be used in both Ruby and
678
+ PCRE environments (including MongoDB server and most other MongoDB drivers),
679
+ henceforth referred to as "portable regular expressions", avoid using
680
+ the ``^`` and ``$`` anchors. The following sections provide workarounds and
681
+ recommendations for authoring portable regular expressions.
682
+
683
+ ``^`` Anchor
684
+ ````````````
685
+
686
+ In Ruby regular expressions, the ``^`` anchor always refers to the beginning
687
+ of line. In PCRE regular expressions, the ``^`` anchor refers to the beginning
688
+ of input by default and the ``m`` flag changes its meaning to the beginning
689
+ of line.
690
+
691
+ Both Ruby and PCRE regular expressions support the ``\A`` anchor to refer to
692
+ the beginning of input, regardless of modifiers.
693
+
694
+ When writing portable regular expressions:
695
+
696
+ - Use the ``\A`` anchor to refer to the beginning of input.
697
+ - Use the ``^`` anchor to refer to the beginning of line (this requires
698
+ setting the ``m`` flag in PCRE regular expressions). Alternatively use
699
+ one of the following constructs which work regardless of modifiers:
700
+ - ``(?:\A|(?<=\n))`` (handles LF and CR+LF line ends)
701
+ - ``(?:\A|(?<=[\r\n]))`` (handles CR, LF and CR+LF line ends)
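
As a rough illustration of the recommendations above, the sketch below compares ``^``, ``\A`` and the modifier-independent construct using Ruby's own engine. The sample text and variable names are assumptions made for this example:

```ruby
# Illustrative input: "hello" starts the second line, not the input.
text = "42\nhello world"

# In Ruby, ^ matches at the beginning of any line.
line_start = /^hello/ =~ text                      # => 3

# \A matches only at the beginning of the input.
input_start = /\Ahello/ =~ text                    # => nil

# The portable construct matches at the start of input or right after LF.
portable_line = /(?:\A|(?<=\n))hello/ =~ text      # => 3
portable_input = /(?:\A|(?<=\n))42/ =~ text        # => 0
```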
702
+
703
+ ``$`` Anchor
704
+ ````````````
705
+
706
+ In Ruby regular expressions, the ``$`` anchor always refers to the end
707
+ of line. In PCRE regular expressions, the ``$`` anchor refers to the end
708
+ of input by default and the ``m`` flag changes its meaning to the end
709
+ of line.
710
+
711
+ Both Ruby and PCRE regular expressions support the ``\z`` anchor to refer to
712
+ the end of input, regardless of modifiers.
713
+
714
+ When writing portable regular expressions:
715
+
716
+ - Use the ``\z`` anchor to refer to the end of input.
717
+ - Use the ``$`` anchor to refer to the end of line (this requires
718
+ setting the ``m`` flag in PCRE regular expressions). Alternatively use
719
+ one of the following constructs which work regardless of modifiers:
720
+ - ``(?:\z|(?=\n))`` (handles LF and CR+LF line ends)
721
+ - ``(?:\z|(?=[\r\n]))`` (handles CR, LF and CR+LF line ends)
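
A matching sketch for the end anchors, again using only core Ruby. The sample strings and variable names are assumptions made for this example:

```ruby
# Illustrative input: "world" ends the first line, "42" ends the input.
text = "hello world\n42"

# In Ruby, $ matches at the end of any line.
line_end = /world$/ =~ text                        # => 6

# \z matches only at the very end of the input.
input_end = /world\z/ =~ text                      # => nil

# The portable construct matches at the end of input or right before LF.
portable_line = /world(?:\z|(?=\n))/ =~ text       # => 6
portable_input = /42(?:\z|(?=\n))/ =~ text         # => 12

# The character-class variant also stops before a CR.
crlf_text = "hello\r\nworld"
portable_crlf = /hello(?:\z|(?=[\r\n]))/ =~ crlf_text  # => 0
```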
643
722
644
- There is a class provided by the bson gem, ``Regexp::Raw``, to allow Ruby users
645
- to get around this. You can simply create a regular expression like this:
723
+ ``BSON::Regexp::Raw`` Class
724
+ ---------------------------
725
+
726
+ Since there is no simple way to programmatically convert a PCRE
727
+ regular expression into the equivalent Ruby regular expression,
728
+ bson-ruby provides the ``BSON::Regexp::Raw`` class for holding MongoDB/PCRE
729
+ regular expressions. Instances of this class are called "BSON regular
730
+ expressions" in this documentation.
731
+
732
+ Instances of this class can be created using the regular expression text
733
+ as a string and optional PCRE modifiers:
734
+
735
+ .. code-block:: ruby
736
+
737
+ BSON::Regexp::Raw.new("^b403158")
738
+ # => #<BSON::Regexp::Raw:0x000055df63186d78 @pattern="^b403158", @options="">
739
+
740
+ BSON::Regexp::Raw.new("^Hello.world$", "s")
741
+ # => #<BSON::Regexp::Raw:0x000055df6317f028 @pattern="^Hello.world$", @options="s">
742
+
743
+ The ``BSON::Regexp`` module is included in the Ruby ``Regexp`` class, such that
744
+ the ``BSON::`` prefix may be omitted:
646
745
647
746
.. code-block:: ruby
648
747
649
748
Regexp::Raw.new("^b403158")
749
+ # => #<BSON::Regexp::Raw:0x000055df63186d78 @pattern="^b403158", @options="">
750
+
751
+ Regexp::Raw.new("^Hello.world$", "s")
752
+ # => #<BSON::Regexp::Raw:0x000055df6317f028 @pattern="^Hello.world$", @options="s">
753
+
754
+ Regular Expression Conversion
755
+ -----------------------------
650
756
651
- This code example illustrates the difference between serializing a core Ruby
652
- ``Regexp`` versus a ``Regexp::Raw`` object:
757
+ To convert a Ruby regular expression to a BSON regular expression,
758
+ instantiate a ``BSON::Regexp::Raw`` object as follows:
759
+
760
+ .. code-block:: ruby
761
+
762
+ regexp = /^Hello.world/
763
+ bson_regexp = BSON::Regexp::Raw.new(regexp.source, regexp.options)
764
+ # => #<BSON::Regexp::Raw:0x000055df62e42d60 @pattern="^Hello.world", @options=0>
765
+
766
+ Note that the ``BSON::Regexp::Raw`` constructor accepts both the Ruby numeric
767
+ options and the PCRE modifier strings.
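
For reference, the numeric options that ``Regexp#options`` returns form a bit field built from constants defined on the core ``Regexp`` class. The sketch below uses only core Ruby and does not call bson-ruby; the variable names are illustrative:

```ruby
# A Ruby regular expression with the m and i options set.
regexp = /^Hello.world/mi

# The pattern text, without delimiters or options.
source = regexp.source                                    # => "^Hello.world"

# IGNORECASE (1) combined with MULTILINE (4).
options = regexp.options                                  # => 5

# Individual options can be tested with a bitwise AND.
multiline = (options & Regexp::MULTILINE) != 0            # => true
extended = (options & Regexp::EXTENDED) != 0              # => false
```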
768
+
769
+ To convert a BSON regular expression to a Ruby regular expression, call the
770
+ ``compile`` method on the BSON regular expression:
771
+
772
+ .. code-block:: ruby
773
+
774
+ bson_regexp = BSON::Regexp::Raw.new("^hello.world", "s")
775
+ bson_regexp.compile
776
+ # => /^hello.world/m
777
+
778
+ bson_regexp = BSON::Regexp::Raw.new("^hello.world", "")
779
+ bson_regexp.compile
780
+ # => /^hello.world/
781
+
782
+ bson_regexp = BSON::Regexp::Raw.new("^hello.world", "m")
783
+ bson_regexp.compile
784
+ # => /^hello.world/
785
+
786
+ Note that the ``s`` PCRE modifier was converted to the ``m`` Ruby modifier
787
+ in the first example, and the last two examples were converted to the same
788
+ regular expression even though the original BSON regular expressions had
789
+ different meanings.
790
+
791
+ When a BSON regular expression uses the non-portable ``^`` and ``$``
792
+ anchors, its conversion to a Ruby regular expression can change its meaning:
793
+
794
+ .. code-block:: ruby
795
+
796
+ BSON::Regexp::Raw.new("^hello.world", "").compile =~ "42\nhello world"
797
+ # => 3
798
+
799
+ When a Ruby regular expression is converted to a BSON regular expression
800
+ (for example, to send to the server as part of a query), the BSON regular
801
+ expression always has the ``m`` modifier set, reflecting the behavior of
802
+ ``^`` and ``$`` anchors in Ruby regular expressions.
803
+
804
+ Reading and Writing
805
+ -------------------
806
+
807
+ Both Ruby and BSON regular expressions implement the ``to_bson`` method
808
+ for serialization to BSON:
653
809
654
810
.. code-block:: ruby
655
811
@@ -659,27 +815,31 @@ This code example illustrates the difference between serializing a core Ruby
659
815
# => #<BSON::ByteBuffer:0x007fcf20ab8028>
660
816
_.to_s
661
817
# => "^b403158\x00m\x00"
818
+
662
819
regexp_raw = Regexp::Raw.new("^b403158")
663
820
# => #<BSON::Regexp::Raw:0x007fcf21808f98 @pattern="^b403158", @options="">
664
821
regexp_raw.to_bson
665
822
# => #<BSON::ByteBuffer:0x007fcf213622f0>
666
823
_.to_s
667
824
# => "^b403158\x00\x00"
668
825
669
-
670
- Please use the ``Regexp::Raw`` class to instantiate your BSON regular
671
- expressions to get the exact pattern and options you want.
672
-
673
- When regular expressions are deserialized, they return a wrapper that holds the
674
- raw regex string, but do not compile it. In order to get the Ruby ``Regexp``
675
- object, one must call ``compile`` on the returned object.
826
+ Both ``Regexp`` and ``BSON::Regexp::Raw`` classes implement the ``from_bson``
827
+ class method that deserializes a regular expression from a BSON byte buffer.
828
+ Both methods return a ``BSON::Regexp::Raw`` instance that
829
+ must be converted to a Ruby regular expression using the ``compile`` method
830
+ as described above.
676
831
677
832
.. code-block:: ruby
678
833
834
+ byte_buffer = BSON::ByteBuffer.new("^b403158\x00\x00")
679
835
regex = Regexp.from_bson(byte_buffer)
680
- regex.pattern #=> Returns the pattern as a string.
681
- regex.options #=> Returns the raw options as a String.
682
- regex.compile #=> Returns the compiled Ruby Regexp object.
836
+ # => #<BSON::Regexp::Raw:0x000055df63100d40 @pattern="^b403158", @options="">
837
+ regex.pattern
838
+ # => "^b403158"
839
+ regex.options
840
+ # => ""
841
+ regex.compile
842
+ # => /^b403158/
683
843
684
844
685
845
Key Order