Monday, February 24, 2020

ASTs for Integrals

I've understood ASTs for simple expressions that only involve binary operators. I don't understand how ASTs deal with operators that have more than two arguments.
https://reference.wolfram.com/language/ref/TreeForm.html

https://demonstrations.wolfram.com/ExpressionTreesForIntegrals/
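
One possible way to think about operators with more than two arguments is to let every AST node hold a list of children of any length, so a binary operator is just the special case of two children. The sketch below is an illustration under that assumption; the `Node` class, the operator name "integrate", and the argument order (integrand, variable, lower limit, upper limit) are all hypothetical and are not taken from the Wolfram pages linked above.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Node:
    """An AST node whose operator may take any number of arguments."""
    operator: str
    children: List[Union["Node", str, int, float]] = field(default_factory=list)

    def arity(self) -> int:
        return len(self.children)


def to_prefix(node) -> str:
    """Render a tree in prefix notation; leaves are printed as-is."""
    if not isinstance(node, Node):
        return str(node)
    args = ", ".join(to_prefix(child) for child in node.children)
    return f"{node.operator}({args})"


# Binary operator: a + b
binary = Node("+", ["a", "b"])

# Definite integral of sin(x) from 0 to pi: one node with four children
# (integrand, variable, lower limit, upper limit). Hypothetical ordering.
integral = Node("integrate", [Node("sin", ["x"]), "x", 0, "pi"])

print(to_prefix(binary))    # +(a, b)
print(to_prefix(integral))  # integrate(sin(x), x, 0, pi)
```

With this layout the tree-walking code does not need to know the arity of each operator in advance.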

Sunday, February 23, 2020

Integration path for contributions

So far I've been hesitant about collaborations involving software in the Physics Derivation Graph. I didn't have a good path for integration of contributions, especially of complex features. I think I can now provide both a more detailed explanation of what would be helpful and a clear integration path.

For example, in this post I provided a set of valid, invalid, and ambiguous Latex examples. I did not provide details on how I would integrate a suggested solution written by a contributor.

Here are three specific aspects I would need for integration of contributed code:

  1. I will write doctests in Python. That way I can express the function as it would be integrated in the PDG project code
  2. The contributed Python script should run inside a Docker image. That way the dependencies are made explicit
  3. The "docker build" step can assume Internet access, but the "docker run" process should assume no Internet connection

As an example from the above blog post, I can express the interface as a Python 3 function:
def is_expression_valid_latex(expr_latex: str) -> bool:
    r"""
    >>> is_expression_valid_latex("a = b")
    True
    >>> is_expression_valid_latex("a = b +")
    False
    >>> is_expression_valid_latex(r"\si a")
    False
    """

By reading the expression from sys.argv, we could expose that function to the container such that the following would be an acceptance test:
docker run -it --rm demo:latest python3 /opt/my_script.py "a = b"
True
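
To make the acceptance test concrete, here is a minimal sketch of what a contributed /opt/my_script.py could look like. The body of the function is a placeholder heuristic (a whitelist of commands plus two simple checks), not a real Latex parser; the names KNOWN_COMMANDS and TRAILING_OPERATORS are hypothetical and only cover the examples used on this blog.

```python
import re
import sys

# Hypothetical whitelist; a real solution would need a far larger vocabulary.
KNOWN_COMMANDS = {"sin", "in", "sum", "pi", "left", "right"}
# Tokens that need an argument after them (assumption for this sketch).
TRAILING_OPERATORS = ("+", "-", "=", "/", r"\sum", r"\in")


def is_expression_valid_latex(expr_latex: str) -> bool:
    """Heuristic placeholder: reject obvious problems, accept everything else."""
    stripped = expr_latex.strip()
    if not stripped:
        return False
    commands = re.findall(r"\\([A-Za-z]+)", stripped)
    if any(command not in KNOWN_COMMANDS for command in commands):
        return False  # unknown command such as \si
    if commands.count("left") != commands.count("right"):
        return False  # unpaired \left or \right
    if stripped.endswith(TRAILING_OPERATORS):
        return False  # operator with no input
    return True


def main(argv) -> None:
    """Print True or False for the expression given as the first argument."""
    if len(argv) != 2:
        print("usage: my_script.py EXPRESSION", file=sys.stderr)
        return
    print(is_expression_valid_latex(argv[1]))


if __name__ == "__main__":
    main(sys.argv)
```

Because the script only uses the standard library, it needs nothing from the network at "docker run" time.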

Sunday, February 9, 2020

todo list for February 2020 (completed!)

Current status: I have an interactive web interface using Docker and Flask that I'm reasonably happy with. In this post I outline tasks that need to be done prior to wider exposure.

Functionality
  • list all 
    • operators
      • in which derivation is each used?
      • popularity: how many references are there to this operator?
    • symbols
      • in which derivation is each used?
      • popularity: how many references are there to this symbol?
    • derivations
      • popularity: include stats -- number of steps, number of inf rules, number of expressions
    • expressions
      • popularity: list which derivations use which expressions
    • inference rules
      • include number of inputs, outputs
      • popularity: which derivations use each inference rule?
  • show a complete derivation
  • edit 
    • an inference rule
      • how to address all the places that inference rule gets used?
    • a derivation
      • edit a step
        • how to address dangling steps?
    • an expression
      • where else is that expression used?

functionality
  • Latex to AST
    • suggest related expressions
  • Web interface
    • download pkl file
    • upload pkl file
    • export derivation PNG
    • export derivation to PDF
  • CAS integration
    • validate a single step of a derivation
Rendering
  • use d3.js instead of graphviz

visualize trace of flow
convert trace of flow to Selenium script
generate PDG website

host on DigitalOcean droplet
account management

Previous task list:
https://physicsderivationgraph.blogspot.com/2018/07/snapshot-of-milestones-for-physics.html
see also
https://physicsderivationgraph.blogspot.com/2017/06/not-getting-caught-in-details.html

type hinting and linting in the Docker image

See also
https://physicsderivationgraph.blogspot.com/2018/08/cleaning-up-code-using-pylint-and.html

Usually I start my Docker container using

$ python create_tmp_pkl.py ; docker build -t flask_ub .; docker run -it --rm --publish 5000:5000 flask_ub

However, if I need the command line to run mypy or flake8, I'll start a shell using

$ python create_tmp_pkl.py ; docker build -t flask_ub .; docker run -it --rm --entrypoint='' --publish 5000:5000 flask_ub /bin/bash

Then, in the container, I can run commands like

$ mypy compute.py
Success: no issues found in 1 source file
$ mypy --ignore-missing-imports controller.py 
Success: no issues found in 1 source file
see https://mypy.readthedocs.io/en/latest/running_mypy.html#ignore-missing-imports

and linting with

$ flake8 compute.py
compute.py:4:80: E501 line too long (89 > 79 characters)

and check doctests using

$ python3 -m doctest -v compute.py

Code complexity measurement:
$ python3 -m mccabe compute.py

Monday, February 3, 2020

example derivation steps for a CAS or theorem prover to validate

In order of increasing complexity, here is a set of derivation steps for a CAS or theorem prover to validate

start with "a = b"
add "2" to both sides
end with "a + 2 = b + 2"


start with "\sin x = f(x)"
multiply both sides by "2"
end with "2 \sin x = 2 f(x)"


start with "\sin x = f(x)"
substitute "2 y" for "x"
end with "\sin (2 y) = f(2 y)"
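
The first step above can be checked structurally without a CAS. The sketch below is an illustration only: it assumes expressions are stored as nested tuples of the form (operator, argument, argument), which is a hypothetical encoding and not the PDG's own format. It does no algebraic simplification, so it only accepts an output that matches the rule's result exactly.

```python
# Expressions are nested tuples: (operator, left, right).
# Leaves are symbol names (str) or numbers. Hypothetical encoding.


def add_to_both_sides(equation, term):
    """Apply the inference rule 'add term to both sides' to an equation."""
    operator, lhs, rhs = equation
    if operator != "=":
        raise ValueError("expected an equation")
    return ("=", ("+", lhs, term), ("+", rhs, term))


def step_is_valid(input_equation, term, output_equation):
    """Structural check: does applying the rule reproduce the claimed output?"""
    return add_to_both_sides(input_equation, term) == output_equation


start = ("=", "a", "b")                     # a = b
end = ("=", ("+", "a", 2), ("+", "b", 2))   # a + 2 = b + 2

print(step_is_valid(start, 2, end))  # True
print(step_is_valid(start, 3, end))  # False
```

The later steps (multiplying and substituting) would need the same kind of rule function plus a way to compare expressions that are equal but written differently, which is where a CAS becomes useful.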

example Latex expressions to parse

valid math latex in order of increasing complexity

a = b

\sin x

\sin x \in f

f \in g

invalid math latex in order of increasing complexity

a = b +              operator with no input

\sin x \left(        unpaired "("

\sin x \sum       operator with no input

valid ambiguous latex in order of increasing complexity

1/2\pi = (1/2) \pi  OR 1/(2 \pi); source: https://www.ntg.nl/maps/26/16.pdf

\sin x / y = (\sin x)/y  OR \sin (x/y); source: https://www.ntg.nl/maps/26/16.pdf

\sin x + 2 = (\sin x) + 2  OR \sin (x + 2)

https://math.stackexchange.com/a/1025217
https://math.stackexchange.com/a/1026483

valid ambiguous latex in a step in which the ambiguity can be resolved

input expression: \sin x / y = g
inf rule: multiply both sides by y
output expression: \sin x = g y

Here the input expression is ambiguous -- it isn't clear whether "\sin x / y" = (\sin x)/y  OR \sin (x/y)
The output expression implies that (\sin x)/y was the user's intention.


input expression: \sin x + 2 = g
inf rule: subtract "2" from both sides
output expression: \sin x = g - 2

Here the input expression is ambiguous -- it isn't clear whether "\sin x + 2" = (\sin x) + 2  OR \sin (x + 2)
The output expression implies that (\sin x) + 2 was the user's intention.
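
The first of these two steps can be resolved mechanically by trying each reading against the output expression. The sketch below is one way to do that, assuming a numerical spot check at random points is acceptable in place of a symbolic proof; the function and variable names are hypothetical.

```python
import math
import random

# Two candidate readings of the ambiguous left-hand side "\sin x / y",
# written as plain Python functions of x and y.
CANDIDATES = {
    r"(\sin x)/y": lambda x, y: math.sin(x) / y,
    r"\sin (x/y)": lambda x, y: math.sin(x / y),
}


def consistent_readings(trials=20, seed=0):
    """Return the readings consistent with the step's output expression.

    Input: lhs = g. Rule: multiply both sides by y. Output: sin x = g y.
    Multiplying the input by y gives lhs * y = g y, so a reading is
    consistent only if lhs * y equals sin x at every sampled point.
    """
    rng = random.Random(seed)
    points = [(rng.uniform(0.5, 2.0), rng.uniform(0.5, 2.0)) for _ in range(trials)]
    return [
        name
        for name, lhs in CANDIDATES.items()
        if all(
            math.isclose(lhs(x, y) * y, math.sin(x), abs_tol=1e-9)
            for x, y in points
        )
    ]


for name in consistent_readings():
    print(name)  # (\sin x)/y
```

A numerical check can only rule readings out; agreement at sampled points is evidence for a reading, not a proof.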

valid ambiguous latex in a step in which the ambiguity cannot be resolved

a = b

from Latex to Abstract Syntax Tree

In the latest revision to the Physics Derivation Graph, the tuple (unique expression identifier, latex expression) has been replaced with (unique expression identifier, latex expression, abstract syntax tree). This is similar to the split between "presentation MathML" and "content MathML." This distinction requires a translation between a representation that is visually pleasing and easy to input and a representation that is mathematically meaningful.

Latex will be input by the user for the PDG; the user will not need to supply the AST as input. To validate a step, the AST is needed. This presents a few challenges:

  • Is the input valid tex?
  • Is the valid tex a mathematical expression?
  • Is the valid mathematical expression consistent with the step? 
A step in a derivation is defined as the application of a single inference rule with one or more expressions as input, feed, or output.

There are a few options for parsing mathematical tex: 
  • write a custom parser 
  • use an existing parser, e.g. MathJax

Sunday, February 2, 2020

significant changes to the Physics Derivation Graph

This weekend I initiated a significant rewrite of the Physics Derivation Graph.
  • I revised the data structures, the level of details present in the data structure, and how the data structure is accessed. 
  • I also better understand the model-view-controller paradigm; this led to a better workflow. 
  • I improved the logging used in the Python code.

Improved Data Structures

I've investigated many different file formats (XML, CSV, plain text, SQL), each of which imposes different constraints on the data structure, as well as imposing a translation between the file format and the representation internal to Python. I recently arrived at the insight that I could avoid both file format choices and the associated translation work by using Python's serialization -- the pickle module.

In addition to eliminating work associated with translation, it freed my cognitive focus. This second aspect was vital as it led to improved mental agility in analyzing other options. Once I didn't have to worry about choosing the best file format, I could identify what work would lead to rapid progress. 

The first big change was having a single data structure (the dictionary "dat") which had all the other data structures (expressions, inference rules, derivations) as keys. Each of those was initially a list of dictionaries, but this proved to be cumbersome in implementing data access. I realized I could leverage the unique identifiers present in the Physics Derivation Graph as keys. That led to a dictionary (top level "dat") of dictionaries (expressions, inference rules, derivations) of dictionaries (each expression, each inference rule, each derivation, respectively). While this may sound messy, accessing specific elements of the PDG is now much easier.
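
As an illustration of the dictionary-of-dictionaries layout and the pickle round trip, here is a minimal sketch. The identifiers, the inner field names, and the file name "data.pkl" are hypothetical placeholders and do not reflect the actual contents of the PDG.

```python
import os
import pickle
import tempfile

# Top-level dictionary "dat"; every inner entry is keyed by a unique identifier.
# All identifiers and field names below are made up for this sketch.
dat = {
    "expressions": {
        "0000000001": {"latex": "a = b", "AST": ("=", "a", "b")},
        "0000000002": {"latex": "a + 2 = b + 2", "AST": None},
    },
    "inference rules": {
        "add X to both sides": {"number of inputs": 1, "number of outputs": 1},
    },
    "derivations": {
        "example derivation": {"steps": ["0000000001", "0000000002"]},
    },
}

# Serialize to disk and read back; no file-format translation layer is needed.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "data.pkl")
    with open(path, "wb") as handle:
        pickle.dump(dat, handle)
    with open(path, "rb") as handle:
        restored = pickle.load(handle)

# Direct access by identifier instead of searching a list of dictionaries.
print(restored["expressions"]["0000000001"]["latex"])  # a = b
```

One caveat worth keeping in mind: unpickling executes code from the file, so a pkl file should only be loaded if it comes from a trusted source.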

Motivated by a conversation about how the PDG will integrate with a computer algebra system (CAS), I decided to include a few additional keys in the top level data structure. Enabling validation of steps requires supporting a CAS. To enable an arbitrary choice of CAS, I need to support abstract syntax trees (ASTs). To enable an AST, I need to define symbols and operators. To enable symbols, I need units and measures. To summarize, I now track the following:

  • derivations
  • expressions
    • latex
    • AST
  • inference rules
  • symbols
  • operators
  • units
  • measures

Improved understanding of the model-view-controller paradigm

Previously I had web form actions that led to a follow-on page. While technically possible, this turned out to be a bad decision. The problems are in tracking state (which variables get passed between pages) and poor visibility on the state changes. I updated the web forms to pass their action back to the "controller.py" which maintains both the variable passing and flow control (which page calls another page).

By adhering to the model-view-controller paradigm, troubleshooting and implementation were made much easier. This ease resulted in faster implementation of ideas. 

Improved logging in Python

I use print statements throughout my code to help in troubleshooting. There are different categories of print statements: trace, debug, error. These are now present in (almost) every print statement. I've also included the name of the file (either "compute" or "controller") in print statements, as well as the function the print statement is in. These changes help track the state of the application.
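
A small helper can fill in the file and function names automatically so they do not have to be typed into each print statement. The sketch below is one way to do it using the standard inspect module; the function name `log` and the exact output format are hypothetical, not the format used in the PDG code.

```python
import inspect
import os
import sys

CATEGORIES = ("trace", "debug", "error")


def log(category: str, message: str, stream=sys.stdout) -> str:
    """Print a message tagged with category, calling file, and calling function."""
    if category not in CATEGORIES:
        raise ValueError(f"unknown category: {category}")
    caller = inspect.stack()[1]  # frame of whoever called log()
    filename = os.path.splitext(os.path.basename(caller.filename))[0]
    line = f"[{category}] {filename} ({caller.function}): {message}"
    print(line, file=stream)
    return line


def create_step():
    log("trace", "start")
    log("debug", "number of expressions = 2")


create_step()
```

When called from a function named create_step in compute.py, the first line would read "[trace] compute (create_step): start". Python's built-in logging module offers similar fields (%(module)s and %(funcName)s) if print statements become hard to manage.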