
Neither the LALR nor the Earley parser produced by Lark picks out the longest match for input such as "a + b + c + ... + d". Bug (?) #1463

Open
FruitfulApproach opened this issue Sep 4, 2024 · 0 comments


Describe the bug

Normally, when you write a calculator, you work only with binary operations, and everything is parenthesized internally like so (unless you transform the tree afterwards, which is what I think I will do to work around this):
a + (b + c)

Lark seems to assume this is what we always want. The problem shows up when I try to match a chain of more than two addends:

from lark import Lark, Transformer

KATEX_SUBSET_GRAMMAR = r"""
start : (RAW_STR | katex_formula)*
katex_formula : DDOLR content DDOLR | DOLR content DOLR
content : addition | variable | int_const
addition : content ("+" content)+
variable : text_var | atomic_var
atomic_var : GREEK | LATIN
text_var : TEXT_CMD "{" var_name "}"
var_name : (NAME | "-")+
int_const : INT
LATIN : /[a-zA-Z]/
GREEK : /\\alpha|\\beta|\\gamma|\\delta/
RAW_STR : /[^\$]+/
TEXT_CMD : /\\text|\\textbf/
DDOLR : /\$\$/
DOLR : /\$/
%import common.INT
%import python.NAME
%import common.WS
%ignore WS
"""

katex_parser = Lark(grammar=KATEX_SUBSET_GRAMMAR, parser='earley')

TEST_ENUM = False

if not __debug__ or TEST_ENUM:
    Op = 'o'
    Data = '@'
    Var, Concat, KatexBlock, Katex, Add = range(5)
else:
    Op = 'Op'; Data = 'Data'
    Var = 'Var';  Concat = 'Concat'; Add = 'Add'
    KatexBlock = '$$'; Katex = '$'
    
wrapping_ops = { Katex, KatexBlock }
infix_ops = { Concat }
        
class KaTeXtoJson(Transformer):     
    def start(self, tree):
        if len(tree) > 1:
            return self._concat(tree)
        return tree[0]
    
    def _concat(self, tree):
        return [Concat, tree]
    
    def RAW_STR(self, tree):
        # Terminal callback: receives the Token itself.
        return tree.strip()
    
    def int_const(self, tree):
        # Rule callbacks receive the list of children, so index it directly.
        return int(tree[0])
    
    def atomic_var(self, tree):
        # Unwrap the single LATIN or GREEK token.
        return tree[0]
    
    def variable(self, tree):
        # Unwrap the single child (atomic_var or text_var).
        return tree[0]
    
    def content(self, tree):
        return tree[0]
    
    def katex_formula(self, tree):
        if str(tree[0]) == '$':
            op = Katex
        else:
            op = KatexBlock
        return [op, tree[1]]
    
    def addition(self, tree):
        return [Add, list(tree)]        
    
if __name__ == '__main__':    
    katex_to_json = KaTeXtoJson()
    import json
    
    while True:
        user_input = input("(╯‵□′)╯︵┻━┻ ... : ")
        parse_tree = katex_parser.parse(user_input)        
        print("Parse tree (Before Xforming): ", parse_tree)
        
        # JSON test & TODO: ultimately test JSON object against 
        # assignment / saving / retrieval using Neomodel's JSONProperty
        json_hopefully = katex_to_json.transform(parse_tree)
        
        print("Result after Xforming: ", json_hopefully)
        
        try:
            did_it_work = json.dumps(json_hopefully)            
            print("It works! 😎 (Valid JSON) : ", did_it_work)
            
        except Exception as e:
            print(f"That didn't work 😭 (Invalid JSON) : {e}")

I use:

addition : content ("+" content)+ 

That is, the most obvious and simple way to express the above. However, this is what I get:

(╯‵□′)╯︵┻━┻ ... : $a + b + c + d$
Parse tree (Before Xforming):  Tree(Token('RULE', 'start'), [Tree(Token('RULE', 'katex_formula'), [Token('DOLR', '$'), Tree(Token('RULE', 'content'), [Tree(Token('RULE', 'addition'), [Tree(Token('RULE', 'content'), [Tree(Token('RULE', 'addition'), [Tree(Token('RULE', 'content'), [Tree(Token('RULE', 'variable'), [Tree(Token('RULE', 'atomic_var'), [Token('LATIN', 'a')])])]), Tree(Token('RULE', 'content'), [Tree(Token('RULE', 'variable'), [Tree(Token('RULE', 'atomic_var'), [Token('LATIN', 'b')])])])])]), Tree(Token('RULE', 'content'), [Tree(Token('RULE', 'addition'), [Tree(Token('RULE', 'content'), [Tree(Token('RULE', 'variable'), [Tree(Token('RULE', 'atomic_var'), [Token('LATIN', 'c')])])]), Tree(Token('RULE', 'content'), [Tree(Token('RULE', 'variable'), [Tree(Token('RULE', 'atomic_var'), [Token('LATIN', 'd')])])])])])])]), Token('DOLR', '$')])])
Result after Xforming:  ['$', ['Add', [['Add', [Token('LATIN', 'a'), Token('LATIN', 'b')]], ['Add', [Token('LATIN', 'c'), Token('LATIN', 'd')]]]]]
It works! 😎 (Valid JSON) :  ["$", ["Add", [["Add", ["a", "b"]], ["Add", ["c", "d"]]]]]
(╯‵□′)╯︵┻━┻ ... : 

So, what I would expect to see is:

["Add", ["a", "b", "c", "d"]] as the formula's content, but instead Lark parses the input as (a + b) + (c + d).

I do have a user-code workaround: convert to the format I want in the addition() transformer method. It checks whether any operand is itself an addition and, if so, merges that operand's terms into the parent list.
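A minimal sketch of that workaround, assuming the [Add, [operands]] list shape shown in the output above. The helper names `is_addition` and `merge_additions` are hypothetical and not part of Lark or the code in this issue. Because a Transformer works bottom-up, nested additions have already been merged by the time the parent is visited, so one level of merging should be enough.

```python
Add = 'Add'  # same tag the issue's code uses in debug mode


def is_addition(node):
    """Return True if node looks like [Add, [operand, ...]]."""
    return (
        isinstance(node, list)
        and len(node) == 2
        and node[0] == Add
        and isinstance(node[1], list)
    )


def merge_additions(operands):
    """Build one flat [Add, [...]] node from possibly nested additions."""
    flat = []
    for operand in operands:
        if is_addition(operand):
            # Splice the nested addition's operands into the parent list.
            flat.extend(operand[1])
        else:
            flat.append(operand)
    return [Add, flat]


nested = [['Add', ['a', 'b']], ['Add', ['c', 'd']]]
print(merge_additions(nested))  # ['Add', ['a', 'b', 'c', 'd']]
```

Inside the transformer, addition() could then simply return merge_additions(tree).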

However, a Lark-side solution would be much cleaner code. Or am I doing something incorrect, and this is NOT a bug?
