Project Awesome project awesome

uclp

An experimental implementation of parsing expression grammars (PEGs, a la Janet) in Common Lisp. MIT.

Package 24 stars GitHub
  • Unnamed Common Lisp Peg

UCLP is an experimental implementation of [[https://en.wikipedia.org/wiki/Parsing_expression_grammar][PEG parsing]] in Common Lisp. A parsing expression grammar is a very elegant way of recognizing, parsing, and transforming text- much more powerful than regular expressions without the complexity of a custom-built parser. Note that while it is possible to parse PEGs in guaranteed linear time (at the cost of linear space) with a packrat parser, UCLP does not. Most patterns, unless very pathological, can be rewritten to run in linear time, and for ease of use some necessary departures from the strict definition are included.

UCLP patterns are just made of native data structures, so it's easy to compose patterns and interact with them in code. Unlike regular expressions, PEG syntax is pretty readable even if you aren't familiar with the specifics. The below example is largely copied from the [[https://janet-lang.org/docs/peg.html][Janet language documentation]]:

#+BEGIN_SRC lisp (defparameter ip-address '(grammar :dig (range "09") :0-4 (range "04") :0-5 (range "05") :byte (choice (sequence "25" :0-5) (sequence "2" :0-4 :dig) (sequence "1" :dig :dig) (between :dig 1 2)) :main (sequence :byte "." :byte "." :byte "." :byte)))

; uclp:match returns two values, a boolean indicating success/failure and a list of captures (uclp:match ip-address "0.0.0.0") ; -> t nil (uclp:match ip-address "elephant") ; -> nil (uclp:match ip-address "256.0.0.0") ; -> nil (uclp:match ip-address "0.0.0.0moretext") ; -> t nil #+END_SRC

If you follow the link you'll see that the example is almost exactly copied, with the only differences being those forced by the differences in language syntax. UCLP is a very close reproduction of the semantics of Janet's PEG module. After experiencing how pleasant text parsing is in Janet you'll also feel the urge to rewrite it for every language you use.

As of now UCLP is usable. Bugs are to be expected, but almost all of Janet's patterns are supported and there shouldn't be any significant footguns. Unfortunately it has only been tested on SBCL, but I plan on at least expanding that to some other implementations since there isn't much implementation-specific code. Except of course that for it to be performant it needs to be compiled quickly to a quick runtime, ideally machine code.

** Usage The peg syntax used by UCLP is largely a 1:1 reproduction of Janet's. Differences are noted below, but otherwise you can safely default to using [[https://janet-lang.org/docs/peg.html][Janet's documentation]]. Many tasks that might require a particular function when using regex can be accomplished with the correct pattern. However, for convenience, there are entrypoints for a few common use-cases.

*** (match rule str &optional start &rest args) This is the primary entrypoint to UCLP. Matches str against rule anchored to start. Returns two values on success, a boolean indicating success or failure and a list of captures, and nil otherwise. Additional arguments are accessible while matching using the argument pattern. Note that if using additional arguments you must specify a start.

*** (compile-peg rule &key (quiet? t) debug?) Compiles rule to source ahead of time. The resulting compiled peg can be used anywhere a pattern is taken, such as in match or any other entrypoint here. If there are any holes, instead compiles to a frame which can be filled with fill-holes.

*** (assemble-peg rule &key (quiet? t) debug?) Compiles rule to a chain of closures. Like compile-peg, if there are any holes, this will return a frame. For more information, see Assembling vs Compiling.

*** (fill-holes frame &rest names-and-values) Takes a frame and a &rest parameter consisting of alternating hole names and values. For more on frames and holes please see Holes below.

*** (captured rule str &optional start &rest args) Takes exactly the same arguments as match but returns in the opposite order- first the list of captures, and success as the second value. Useful to get at captures without multiple-value-bind.

*** (replace-all rule replace str &optional start &rest args) Returns the string resulting from replacing every instance of rule in str with replace, which may be a string to substitute or a function taking as an argument the substring matched by rule. Returns a boolean indicating if matches were found as a second value.

*** (replace-one rule replace str &optional start &rest args) Behaves like replace-all, but replaces only the first match.

*** (find-all rule str &optional start &rest args) Gives the position of each match in str as a list, such that #+BEGIN_SRC lisp (defparameter dig '(* "dig:" :d)) (defparameter str "dig:7, dig:8, dig:9")

(every (lambda (position) (uclp:match dig str position)) (uclp:find-all dig str)) ; => t #+END_SRC

*** (find-one rule str &optional start &rest args) Returns the position of the /first/ match, or nil if there are no matches.

** Differences From Janet Mostly UCLP adheres to the behavior of Janet pegs, enough so that Janet's documentation is better than anything I'll have put together for a little while. However, there are some differences which obviously need to be documented somewhere. Some are due to differences in host language, some are due to taste, and some are just features I haven't implemented yet.

Any difference in behavior from Janet not mentioned below is a bug, and either the behavior or the documentation will need to be changed.

*** Unimplemented Janet buffers and strings are simply byte strings, and Janet pegs work on arbitrary strings of bytes. UCLP expects to work with Common Lisp strings, which in general are /not/ byte strings but vectors of type character. As such the patterns intended to work on bytes are not implemented. These are: uint, uint-be, int, and int-be. Because using PEGs to parse binaries is so nice, I plan on at some point implementing some way of compiling PEGs intended to operate on raw bytes. However, specialization will be at compile-time and the above patterns will likely be available only in a byte PEG.

I'm not yet sure if UCLP will ever support a general number pattern. It's possible I'll bring in parse-float to make a float as well as a general number pattern.

*** Implemented

  • While UCLP does not have number, it does have integer, which takes identical arguments and parses an integer using parse-integer.
  • To create a template PEG with holes filled in later, UCLP introduces the (hole name) pattern. For more information on holes, see Holes below.

*** Changes

  • Anywhere a string literal can go, including those in range or set, a character or list of characters and strings can also go. This is because Common Lisp strings do not have escape codes like Janet strings. So (range ("a" #\Newline)) in UCLP is the same as (range "a\n") in Janet.
  • between, at-least, at-most, and look all have the pattern as the first argument, unlike in Janet where it is the last argument.
  • backmatch requires a tag argument, and will not look up captures on the capture stack
  • replace takes either a string which it captures literally, or a function which it calls. Taking other datatypes literally will probably be in the next version. But unlike Janet, it will never look the matches up.
  • Grammars, represented in Janet by tables or structs, are written in UCLP with the grammar rule, which is followed by alternating keywords naming rules and patterns implementing them.
  • Because the reader doesn't distinguish between :s and :S, the complement of a built-in pattern is prefixed with !. So :!s instead of :S.
  • The error pattern has significant differences from Janet's error. See Error below.
  • In addition to the + and * variants of built-in patterns, UCLP has a maybe variant marked by a ?. So :w? denotes (? :w).
  • UCLP includes any (*), some (+), and maybe (?) variants of complement patterns. So :!d+ is (some (if-not :d 1)). See Aliases below

** Compiling vs Assembling The initial concept for UCLP was that all patterns would be translated to Common Lisp source and compiled at runtime. This provides the fastest matching, but even with SBCL's lightning-fast compile times there is a significant amount of overhead. Therefore, patterns can either be assembled or compiled.

An assembled PEG can be used exactly like a compiled PEG. Both are simply Common Lisp functions wrapped in a struct, which can be transparently passed to match or composed with other patterns. However, unlike compiling a PEG, assembling is fast- fewer than a few milliseconds. The tradeoff is that they're up to 3 times slower at match time.

For one-off patterns created at runtime, it makes the most sense to assemble rather than compile. That's why PEGs passed as raw data are transparently assembled, not compiled. If you know a pattern in advance, you can compile it ahead of time with compile-peg.

To get as much compiled performance as possible without paying compile costs at runtime, see Holes below.

** Holes UCLP introduces the (hole name) pattern, which represents an unknown pattern which will be filled in later. Name is a keyword. If a PEG with a hole is compiled or assembled, the result will be a frame. A frame can be turned into a pattern by providing a value for each hole. Two holes with the same name will be filled with the same pattern.

#+BEGIN_SRC lisp (defparameter delimited-frame (compile-peg '(split (hole :div) '(any 1)))) (defparameter s "123,456|789")

(match (fill-holes delimited-frame :div ",") s) ; => t ("123" "456|789") (match (fill-holes delimited-frame :div "|") s) ; => t ("123,456" "789") #+END_SRC

Holes allow you to compile almost an entire PEG ahead of time, assembling just what's needed on demand to avoid paying the compilation overhead at runtime.

** Aliases UCLP offers aliases, keywords that stand in for larger patterns, similar to the Janet's built-in patterns. And like built-in patterns, aliases are user extensible. However, there are a number of differences which are important to be aware of. Aliases are not first class citizens of UCLP- rather than full mutually recursive subpatterns, they are simple find-and-replace macros, inserted literally. You can reference other aliases from inside one, but if you create a cycle it'll just blow out the stack. So just be cautious!

Aliases are stored in the alist aliases. You can manipulate aliases directly, or call the helper functions register-alias! and register-alias-suite!. Both take the name of the alias as a keyword, and the body as a peg expression, and push the new alias to aliases. However, register-alias-suite! will also add the complement, some, maybe, and any variants, like so:

#+BEGIN_SRC lisp (uclp:register-alias-suite! :v '(set "vV")) (uclp:match '(* :v (<- :!v+)) "v not a V") ; => t (" not a ") #+END_SRC

** Error Because Common Lisp conditions are so different from Janet signals, the error pattern has some subtleties in UCLP. It takes arguments of the form (error &optional pat condition). With 0 arguments, it will raise a peg-error. With one argument, it will raise an error only if pat matches. With two arguments, it will raise a condition so long as pat matches. To specify a particular condition a pattern must be given, but something like 0 will always match.

UCLP special cases conditions inheriting from peg-error, which is exported. A peg-error has slots pat, matched, and caps. If a pattern is given, these will automatically be filled by the pattern itself, the text that pattern matched, and the captures from the pattern respectively. If no pattern is supplied they will all be nil. These slots can be accessed with error-pat, error-matched, and error-caps. By default peg-error has a report function giving the pattern and, if nonempty, the matching substring. Depending on your choice of pattern, this can be tolerably readable.

LocalWords: UCLP alist LocalWords subpatterns structs datatypes PEGs packrat footguns

LocalWords: SBCL performant entrypoint entrypoints regex SBCL's struct

Back to Common Lisp