Saturday, May 30, 2020

replacing symbols in a Sympy expression and generalizing the AST

Sympy's ability to convert a Latex string to a Sympy expression is useful, but the conversion does not link the variables in the Latex string to other information about them (such as their dimensions).

>>> import sympy
>>> from sympy import Equality, Add, Symbol, Mul, Pow, Integral, Tuple
>>> from sympy.parsing.latex import parse_latex

First, remove all presentation-related markup from the Latex string.
Then convert the cleaned Latex string to a Sympy expression using
>>> eq = parse_latex('a + b = c')
>>> eq
Eq(a + b, c)

In this post we will replace each variable with its reference ID while maintaining the structure of the expression.
The structure of the expression is
>>> sympy.srepr(eq)
"Equality(Add(Symbol('a'), Symbol('b')), Symbol('c'))"
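
The same structure can also be read directly from the expression, without going through a string. The following is a minimal sketch: to_nested is a hypothetical helper (not part of Sympy) that relies only on the func and args attributes every Sympy expression has, and the expression is built by hand so the snippet does not need the Latex parser.

```python
from sympy import Eq, Symbol

def to_nested(expr):
    """Return a nested (operator name, [children]) view of a Sympy expression."""
    if not expr.args:
        # Leaves such as Symbol('a') have no arguments.
        return str(expr)
    return (expr.func.__name__, [to_nested(arg) for arg in expr.args])

# Same structure as the result of parse_latex('a + b = c').
eq = Eq(Symbol('a') + Symbol('b'), Symbol('c'))
print(to_nested(eq))
```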

Since srepr returns a string, we can replace each variable name in it with a reference ID using ordinary string replacement.
The set of variables in the expression can be accessed using
>>> set_of_symbols_in_eq = eq.free_symbols
>>> set_of_symbols_in_eq
{a, c, b}

We can then replace each variable with an ID
>>> eq_str_with_id = sympy.srepr(eq).replace("'a'","'pdg4942'").replace("'b'","'pdg3291'").replace("'c'","'pdg0021'")
>>> eq_str_with_id
"Equality(Add(Symbol('pdg4942'), Symbol('pdg3291')), Symbol('pdg0021'))"

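Lastly, evaluate the string to get a Sympy expression. This works because the names that appear in the string (Equality, Add, Symbol) were imported above.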
>>> eq_with_id = eval(eq_str_with_id)
>>> eq_with_id
Eq(pdg3291 + pdg4942, pdg0021)
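
The round trip through a string and eval can also be avoided. As a minimal sketch (an alternative, not the approach used in this post), the same substitution can be done on the expression itself with xreplace. The name_to_id table is a hypothetical lookup, and the expression is built by hand so the snippet runs without the Latex parser.

```python
from sympy import Eq, Symbol

# Same structure as the result of parse_latex('a + b = c').
eq = Eq(Symbol('a') + Symbol('b'), Symbol('c'))

# Hypothetical lookup table from variable name to reference ID.
name_to_id = {'a': 'pdg4942', 'b': 'pdg3291', 'c': 'pdg0021'}

# Map each Symbol object to a new Symbol named after its ID.
# Note: free_symbols omits bound symbols such as integration variables;
# eq.atoms(Symbol) would include those as well.
symbol_map = {sym: Symbol(name_to_id[sym.name]) for sym in eq.free_symbols}

# xreplace swaps exact subexpressions; no string is evaluated.
eq_with_id = eq.xreplace(symbol_map)
print(eq_with_id)
```

Because the mapping is keyed on Symbol objects rather than on quoted substrings, a variable name that happens to be part of another name cannot be replaced by accident.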

This representation is useful because it separates presentation from semantic structure.

Getting the set of symbols is still easy:
>>> eq_with_id.free_symbols
{pdg3291, pdg4942, pdg0021}

Example

To show why separation matters, suppose we have the Latex string
f = \int_{x_{\rm bottom}}^{x_{\rm top}} g dg
That is a challenge for Sympy's parse_latex, even though Sympy can handle semantically equivalent structures like
>>> parse_latex(r'f = \int_a^b g dg')
Eq(f, Integral(g, (g, a, b)))

If we know that x_{\rm bottom} and x_{\rm top} are each a single variable, then we can simplify the presentation string to a temporary string using dummy variables. The strings are written as raw strings (r'...') so that Python does not interpret the \r in \rm as a carriage return.
>>> initial_latex_str = r'f = \int_{x_{\rm bottom}}^{x_{\rm top}} g dg'
>>> tmp_latex_str = initial_latex_str.replace(r'x_{\rm bottom}','p').replace(r'x_{\rm top}','q')
>>> tmp_latex_str
'f = \\int_{p}^{q} g dg'
Caveat: the dummy variables (here p and q) must not already be used as variables in initial_latex_str.
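
One way to honor this caveat is to pick the dummy letters automatically. The helper below is a hypothetical sketch (assign_dummies is not part of Sympy). It only offers letters that do not occur anywhere in the string, which is stricter than necessary but simple to check, and it assumes single lowercase letters are acceptable dummy names.

```python
import string

def assign_dummies(latex_str, known_variables, candidates=string.ascii_lowercase):
    """Replace each known variable with a letter that is absent from latex_str.

    Returns the rewritten string and a dict mapping dummy letter -> variable.
    """
    unused = [ch for ch in candidates if ch not in latex_str]
    variable_for_dummy = {}
    # Replace longer names first so one name cannot clobber part of another.
    for variable in sorted(known_variables, key=len, reverse=True):
        if variable not in latex_str:
            continue
        if not unused:
            raise ValueError('no unused dummy letters left')
        dummy = unused.pop(0)
        latex_str = latex_str.replace(variable, dummy)
        variable_for_dummy[dummy] = variable
    return latex_str, variable_for_dummy

initial_latex_str = r'f = \int_{x_{\rm bottom}}^{x_{\rm top}} g dg'
tmp_latex_str, variable_for_dummy = assign_dummies(
    initial_latex_str, [r'x_{\rm bottom}', r'x_{\rm top}'])
print(tmp_latex_str)
print(variable_for_dummy)
```

The returned dictionary records which dummy stands for which variable, so the dummies can later be mapped to the right IDs. The candidates argument can be narrowed to exclude any letters the parser treats specially, such as the d of a differential.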

Now we can act on tmp_latex_str as we did in the first example:
>>> eq = parse_latex(tmp_latex_str)
>>> eq_str_with_id = sympy.srepr(eq).replace("'p'","'pdg4942'").replace("'q'","'pdg3291'").replace("'g'","'pdg0021'").replace("'f'","'pdg2103'")
>>> eq_with_id = eval(eq_str_with_id)
>>> eq_with_id
Eq(pdg2103, Integral(pdg0021, (pdg0021, pdg4942, pdg3291)))


Algorithm for converting Latex to a semantically meaningful expression

  1. Get a Latex string.
  2. Clean the Latex string by removing presentation syntax.
  3. In the cleaned Latex string, identify known variables from the PDG that the Sympy parser does not handle, e.g., r_{\rm Earth}.
  4. In the cleaned Latex string, replace each known variable with a dummy variable, e.g., d = r_{\rm Earth}, where the dummy variable does not already appear in the Latex string.
  5. eq = parse_latex(cleaned Latex string with dummy variables)
  6. Replace each variable and dummy variable in eq with its PDG symbol ID.
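
Steps 4 through 6 can be sketched as a single function. This is an illustration under stated assumptions, not a definitive implementation: latex_to_expr_with_ids and its arguments are hypothetical names, the parser is passed in as an argument, and stand_in_parse is a placeholder that returns the structure shown earlier for an integral so the snippet runs without the Latex parser. The sketch swaps symbols with xreplace rather than the srepr and eval route used above.

```python
from sympy import Eq, Integral, Symbol

def latex_to_expr_with_ids(cleaned_latex, dummy_for_variable, id_for_variable, parse):
    """Substitute dummies (step 4), parse (step 5), swap in IDs (step 6).

    dummy_for_variable: Latex variable -> dummy letter
    id_for_variable: Latex variable or plain name -> symbol ID
    parse: callable that turns a Latex string into a Sympy expression
    """
    for variable, dummy in dummy_for_variable.items():
        cleaned_latex = cleaned_latex.replace(variable, dummy)
    expr = parse(cleaned_latex)
    variable_for_dummy = {d: v for v, d in dummy_for_variable.items()}
    symbol_map = {}
    # atoms(Symbol) also finds bound symbols such as the integration
    # variable, which free_symbols would leave out.
    for sym in expr.atoms(Symbol):
        variable = variable_for_dummy.get(sym.name, sym.name)
        symbol_map[sym] = Symbol(id_for_variable[variable])
    return expr.xreplace(symbol_map)

def stand_in_parse(latex_str):
    """Placeholder for parse_latex; handles only the one string used below."""
    assert latex_str == r'f = \int_{p}^{q} g dg'
    f, g, p, q = (Symbol(name) for name in 'fgpq')
    return Eq(f, Integral(g, (g, p, q)))

result = latex_to_expr_with_ids(
    r'f = \int_{x_{\rm bottom}}^{x_{\rm top}} g dg',
    {r'x_{\rm bottom}': 'p', r'x_{\rm top}': 'q'},
    {r'x_{\rm bottom}': 'pdg4942', r'x_{\rm top}': 'pdg3291',
     'g': 'pdg0021', 'f': 'pdg2103'},
    parse=stand_in_parse,
)
print(result)
```

In real use, parse_latex from sympy.parsing.latex would be passed as the parse argument in place of stand_in_parse.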

