tfw you mix up Glushkov's construction with Brzozowski's derivatives, amirite?

no? that's just me?

@jamey tfw Eastern European comp sci regex algorithms

@jamey hadn’t heard about Glushkov’s algorithm. it’s very cool!!

@judofyr honestly I don't really understand Glushkov's construction. I don't even quite understand how the paper I read that name in actually corresponds to Glushkov's construction, but it claims that it does, so I guess I have to believe it?

@jamey I grokked the logic behind the local language (L'), but I must admit I don't quite know why/how that automaton is easy to construct. or why you need to rename the alphabet first.

@jamey huh. interesting paper. not quuite sure how it corresponds to Glushkov's construction, and I agree that it looks oddly similar to the derivative.

@judofyr there's a sentence in the paper about going straight to determinizing the NFA so states should be thought of as sets of positions, and another sentence about implementing Glushkov's structurally, and I guess the answer is somewhere in that vague handwaving?

and thank you for agreeing that it feels more like the derivatives—I was honestly embarrassed about my confusion 😅

@jamey I think that L’-expression I took a picture of is not used in practice. the important part is the F-set which is all of the transitions that can happen. their algorithm seems related in the sense that the “mark” is the current state in a NFA. but instead of computing F, they do the transition on-the-fly with a Boolean variable. they use the idea of the Glushkov, not the actual construction algorithm.

@judofyr cool, that makes as much sense as anything I could come up with!

seems like the connection is kind of weak, to the point that I feel like this paper is really doing their own thing. that said, it's a very nice thing in some ways!

I also meant to point you at which I found while trying to figure out more about Glushkov's construction. if you enjoyed this stuff I thought you might enjoy that too 😄

@jamey yeah, the way they create a whole new regex for every char they parse is very derivative-like. but it doesn’t have the behavior where the sequence-operator increases the size, which is neat.

I found it, but didn’t really look at it. I’ll take another look later! I got more interested in “can we use the same approach for parsing CFG” to be honest

@judofyr oh, I know the answer to that question! yes!

the "Play on Regular Expressions" paper covers context-free in the last section, but also, I have an implementation in Rust, including a unit test that matches balanced-parentheses strings:

I also worked out how to use much less memory, and even avoid ever doing allocation when matching regular languages...

@jamey plain Glushkov should have better performance though since it only looks at all active states per char, while The Play needs to iterate over the whole regex.

cool project!!

once it’s CFG (ah, even Boolean), then the regex structure is growing in size each iteration, right? all of the lazy Haskell stuff makes my head hurt because I never understand what needs to be lazy or not

@judofyr performance here is confusing (at least to me): "Play" adds an "active" flag in the last section that lets them skip propagating zeroes through any subtree that has no marks in it. they report that some tests got faster and others got slower. 🤷

memory usage for a CFG/Boolean grammar should be linear in the depth of the parse tree, I think, but there's some stuff I don't get yet.

like, Kleene star runs in O(1) space, but if I implement it recursively it's proportional to # of matches?

@jamey interesting.

the recursive Kleene star keeps track of the stack at every level, while the native one “throws it away” I guess. kinda like a tail call optimization

@judofyr yeah, I've been thinking this is very much like tail-call optimization, but I haven't found a way to convince myself that it's always correct, or exactly how to expose that optimization to users of my library. basically I keep finding pieces of this quite difficult to reason about, but it seems to have some very nice properties once it's correct?

@jamey hmmm. how does the recursive Kleene star look in your library? can you show example code?

does it handle left-recursion btw?

S ::= “1” | S “+” S

is always my go-to, highly ambiguous (exponential amount of parse trees), stress test of parsing algorithms.

@judofyr I need to apply a couple ideas from "Parsing with Derivatives" (cited in my readme) to make left-recursion work, and that's tied up in this stuff I'm confused about 😅

recursive star over some grammar "g" would look basically like this, I think:

fn star() -> impl Regex<T, M> { empty() | (g + delay(star)) }

@jamey I think an API like this could work:

fn star() { tail_delay(|c| empty() | (g + c.jump(star)) }

but it’s ugly and I haven’t quite figured out the details..

I don’t remember how left-recursion is handled in Parsing with Derivatives, but it seems like this is very similar indeed

@judofyr I was thinking about something like:

recurse(|c| empty() | (g + c))

where recurse is a y-combinator-ish thing that passes a heap-allocated copy of the given grammar into itself. that lets recurse allocate storage that is _not_ copied for computing the least fixed point on whether the grammar can match the empty string. I don't have the details worked out though.

I need to go sleep now but thank you for talking this through with me! I'd love to chat more later.

@jamey I've been playing more directly with the Glushkov construction (e.g. actually computing the F-set and treating it as an NFD). seems to work pretty nicely, but I haven't quite figured out delayed actions and left-recursion yet.

Sign in to participate in the conversation

The social network of the future: No ads, no corporate surveillance, ethical design, and decentralization! Own your data with Mastodon!