uclp
An experimental implementation of parsing expression grammars (PEGs, a la Janet) in Common Lisp. MIT.
- Unnamed Common Lisp Peg
UCLP is an experimental implementation of [[https://en.wikipedia.org/wiki/Parsing_expression_grammar][PEG parsing]] in Common Lisp. A parsing expression grammar is a very elegant way of recognizing, parsing, and transforming text- much more powerful than regular expressions without the complexity of a custom-built parser. Note that while it is possible to parse PEGs in guaranteed linear time (at the cost of linear space) with a packrat parser, UCLP does not. Most patterns, unless very pathological, can be rewritten to run in linear time, and for ease of use some necessary departures from the strict definition are included.
UCLP patterns are just made of native data structures, so it's easy to compose patterns and interact with them in code. Unlike regular expressions, PEG syntax is pretty readable even if you aren't familiar with the specifics. The below example is largely copied from the [[https://janet-lang.org/docs/peg.html][Janet language documentation]]:
#+BEGIN_SRC lisp (defparameter ip-address '(grammar :dig (range "09") :0-4 (range "04") :0-5 (range "05") :byte (choice (sequence "25" :0-5) (sequence "2" :0-4 :dig) (sequence "1" :dig :dig) (between :dig 1 2)) :main (sequence :byte "." :byte "." :byte "." :byte)))
; uclp:match returns two values, a boolean indicating success/failure and a list of captures (uclp:match ip-address "0.0.0.0") ; -> t nil (uclp:match ip-address "elephant") ; -> nil (uclp:match ip-address "256.0.0.0") ; -> nil (uclp:match ip-address "0.0.0.0moretext") ; -> t nil #+END_SRC
If you follow the link you'll see that the example is almost exactly copied, with the only differences being those forced by the differences in language syntax. UCLP is a very close reproduction of the semantics of Janet's PEG module. After experiencing how pleasant text parsing is in Janet you'll also feel the urge to rewrite it for every language you use.
As of now UCLP is usable. Bugs are to be expected, but almost all of Janet's patterns are supported and there shouldn't be any significant footguns. Unfortunately it has only been tested on SBCL, but I plan on at least expanding that to some other implementations since there isn't much implementation-specific code. Except of course that for it to be performant it needs to be compiled quickly to a quick runtime, ideally machine code.
** Usage The peg syntax used by UCLP is largely a 1:1 reproduction of Janet's. Differences are noted below, but otherwise you can safely default to using [[https://janet-lang.org/docs/peg.html][Janet's documentation]]. Many tasks that might require a particular function when using regex can be accomplished with the correct pattern. However, for convenience, there are entrypoints for a few common use-cases.
*** (match rule str &optional start &rest args)
This is the primary entrypoint to UCLP. Matches str against rule anchored to start.
Returns two values on success, a boolean indicating success or failure and a list of
captures, and nil otherwise. Additional arguments are accessible while matching using the
argument pattern. Note that if using additional arguments you must specify a start.
*** (compile-peg rule &key (quiet? t) debug?)
Compiles rule to source ahead of time. The resulting compiled peg can be used anywhere a
pattern is taken, such as in match or any other entrypoint here. If there are any holes,
instead compiles to a frame which can be filled with fill-holes.
*** (assemble-peg rule &key (quiet? t) debug?)
Compiles rule to a chain of closures. Like compile-peg, if there are any holes, this
will return a frame. For more information, see Assembling vs Compiling.
*** (fill-holes frame &rest names-and-values)
Takes a frame and a &rest parameter consisting of alternating hole names and values. For
more on frames and holes please see Holes below.
*** (captured rule str &optional start &rest args)
Takes exactly the same arguments as match but returns in the opposite order- first the
list of captures, and success as the second value. Useful to get at captures without
multiple-value-bind.
*** (replace-all rule replace str &optional start &rest args)
Returns the string resulting from replacing every instance of rule in str with
replace, which may be a string to substitute or a function taking as an argument the
substring matched by rule. Returns a boolean indicating if matches were found as a
second value.
*** (replace-one rule replace str &optional start &rest args)
Behaves like replace-all, but replaces only the first match.
*** (find-all rule str &optional start &rest args)
Gives the position of each match in str as a list, such that
#+BEGIN_SRC lisp
(defparameter dig '(* "dig:" :d))
(defparameter str "dig:7, dig:8, dig:9")
(every (lambda (position) (uclp:match dig str position)) (uclp:find-all dig str)) ; => t #+END_SRC
*** (find-one rule str &optional start &rest args)
Returns the position of the /first/ match, or nil if there are no matches.
** Differences From Janet Mostly UCLP adheres to the behavior of Janet pegs, enough so that Janet's documentation is better than anything I'll have put together for a little while. However, there are some differences which obviously need to be documented somewhere. Some are due to differences in host language, some are due to taste, and some are just features I haven't implemented yet.
Any difference in behavior from Janet not mentioned below is a bug, and either the behavior or the documentation will need to be changed.
*** Unimplemented
Janet buffers and strings are simply byte strings, and Janet pegs work on arbitrary
strings of bytes. UCLP expects to work with Common Lisp strings, which in general are
/not/ byte strings but vectors of type character. As such the patterns intended to work
on bytes are not implemented. These are: uint, uint-be, int, and int-be. Because
using PEGs to parse binaries is so nice, I plan on at some point implementing some way of
compiling PEGs intended to operate on raw bytes. However, specialization will be at
compile-time and the above patterns will likely be available only in a byte PEG.
I'm not yet sure if UCLP will ever support a general number pattern. It's possible
I'll bring in parse-float to make a float as well as a general number pattern.
*** Implemented
- While UCLP does not have
number, it does haveinteger, which takes identical arguments and parses an integer usingparse-integer. - To create a template PEG with holes filled in later, UCLP introduces the
(hole name)pattern. For more information on holes, see Holes below.
*** Changes
- Anywhere a string literal can go, including those in
rangeorset, a character or list of characters and strings can also go. This is because Common Lisp strings do not have escape codes like Janet strings. So(range ("a" #\Newline))in UCLP is the same as(range "a\n")in Janet. between,at-least,at-most, andlookall have the pattern as the first argument, unlike in Janet where it is the last argument.backmatchrequires a tag argument, and will not look up captures on the capture stackreplacetakes either a string which it captures literally, or a function which it calls. Taking other datatypes literally will probably be in the next version. But unlike Janet, it will never look the matches up.- Grammars, represented in Janet by tables or structs, are written in UCLP with the
grammarrule, which is followed by alternating keywords naming rules and patterns implementing them. - Because the reader doesn't distinguish between
:sand:S, the complement of a built-in pattern is prefixed with!. So:!sinstead of:S. - The
errorpattern has significant differences from Janet'serror. See Error below. - In addition to the
+and*variants of built-in patterns, UCLP has a maybe variant marked by a?. So:w?denotes(? :w). - UCLP includes any (
*), some (+), and maybe (?) variants of complement patterns. So:!d+is(some (if-not :d 1)). See Aliases below
** Compiling vs Assembling The initial concept for UCLP was that all patterns would be translated to Common Lisp source and compiled at runtime. This provides the fastest matching, but even with SBCL's lightning-fast compile times there is a significant amount of overhead. Therefore, patterns can either be assembled or compiled.
An assembled PEG can be used exactly like a compiled PEG. Both are simply Common Lisp
functions wrapped in a struct, which can be transparently passed to match or composed
with other patterns. However, unlike compiling a PEG, assembling is fast- fewer than a few
milliseconds. The tradeoff is that they're up to 3 times slower at match time.
For one-off patterns created at runtime, it makes the most sense to assemble rather than
compile. That's why PEGs passed as raw data are transparently assembled, not compiled. If
you know a pattern in advance, you can compile it ahead of time with compile-peg.
To get as much compiled performance as possible without paying compile costs at runtime, see Holes below.
** Holes
UCLP introduces the (hole name) pattern, which represents an unknown pattern which will be
filled in later. Name is a keyword. If a PEG with a hole is compiled or assembled, the
result will be a frame. A frame can be turned into a pattern by providing a value for each
hole. Two holes with the same name will be filled with the same pattern.
#+BEGIN_SRC lisp (defparameter delimited-frame (compile-peg '(split (hole :div) '(any 1)))) (defparameter s "123,456|789")
(match (fill-holes delimited-frame :div ",") s) ; => t ("123" "456|789") (match (fill-holes delimited-frame :div "|") s) ; => t ("123,456" "789") #+END_SRC
Holes allow you to compile almost an entire PEG ahead of time, assembling just what's needed on demand to avoid paying the compilation overhead at runtime.
** Aliases UCLP offers aliases, keywords that stand in for larger patterns, similar to the Janet's built-in patterns. And like built-in patterns, aliases are user extensible. However, there are a number of differences which are important to be aware of. Aliases are not first class citizens of UCLP- rather than full mutually recursive subpatterns, they are simple find-and-replace macros, inserted literally. You can reference other aliases from inside one, but if you create a cycle it'll just blow out the stack. So just be cautious!
Aliases are stored in the alist aliases. You can manipulate aliases directly, or
call the helper functions register-alias! and register-alias-suite!. Both take the
name of the alias as a keyword, and the body as a peg expression, and push the new alias
to aliases. However, register-alias-suite! will also add the complement, some,
maybe, and any variants, like so:
#+BEGIN_SRC lisp (uclp:register-alias-suite! :v '(set "vV")) (uclp:match '(* :v (<- :!v+)) "v not a V") ; => t (" not a ") #+END_SRC
** Error
Because Common Lisp conditions are so different from Janet signals, the error pattern
has some subtleties in UCLP. It takes arguments of the form
(error &optional pat condition).
With 0 arguments, it will raise a peg-error. With one argument, it will
raise an error only if pat matches. With two arguments, it will raise a condition so
long as pat matches. To specify a particular condition a pattern must be given, but
something like 0 will always match.
UCLP special cases conditions inheriting from peg-error, which is exported. A
peg-error has slots pat, matched, and caps. If a pattern is given, these will
automatically be filled by the pattern itself, the text that pattern matched, and the
captures from the pattern respectively. If no pattern is supplied they will all be
nil. These slots can be accessed with error-pat, error-matched, and error-caps. By
default peg-error has a report function giving the pattern and, if nonempty, the
matching substring. Depending on your choice of pattern, this can be tolerably readable.