Sunday, May 31, 2020

proposal for analysis of math used in a subdomain of Physics based on the arXiv corpus

What type of math is used in the High-Energy Physics (HEP) corpus available from arXiv?
Example categories include

  • geometry
  • trigonometry
  • integration
  • differential equations
  • sets
  • linear algebra
Motives for asking this question:
  • if I read a randomly selected paper in HEP, what math should I have fluency in?
  • are there papers in HEP that use math which is unusual compared to the majority? For example, is there only one paper that uses geometry?
  • what is the diversity of math per author? (For multi-author papers, tag each author with all the math categories.)
  • what is the diversity of math per paper? (How many types of math appear in a given paper?)
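
One way to start answering these questions is with keyword heuristics over the .tex source. A minimal sketch (the token-to-category mapping below is an illustrative guess, not a vetted taxonomy):

import re

# illustrative token-to-category mapping; a guess, not a vetted taxonomy
CATEGORY_TOKENS = {
    "geometry": [r"\\angle", r"\\triangle"],
    "trigonometry": [r"\\sin", r"\\cos", r"\\tan"],
    "integration": [r"\\int", r"\\oint", r"\\iint"],
    "differential equations": [r"\\partial", r"\\dot\{"],
    "sets": [r"\\cup", r"\\cap", r"\\subset"],
    "linear algebra": [r"\\det", r"\\begin\{pmatrix\}"],
}

def math_categories(tex_source):
    """Return the set of categories whose tokens appear in a .tex source string."""
    return {category
            for category, tokens in CATEGORY_TOKENS.items()
            if any(re.search(token, tex_source) for token in tokens)}

Tagging each paper (and, for multi-author papers, each author) with math_categories(tex) would then feed the per-paper and per-author diversity questions above.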

Saturday, May 30, 2020

literature review for using arXiv as a corpus for analysis

"Towards Machine-assisted Meta-Studies: The Hubble Constant"
https://arxiv.org/pdf/1902.00027.pdf
"an approach for automatic extraction of measured values from the astrophysical literature, using the Hubble constant for our pilot study. Our rules-based model – a classical technique in natural language processing – has successfully extracted 298 measurements of the Hubble constant, with uncertainties, from the 208,541 available arXiv astrophysics papers."


"Scienceography: the study of how science is written" (2013)
https://arxiv.org/abs/1202.2638
https://arxiv.org/pdf/1202.2638.pdf
https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.488.6970&rep=rep1&type=pdf
Focused on characterization
separates out packages, comments, authors, figures in the .tex source

"Transforming the arχiv to XML" (2008)
https://link.springer.com/chapter/10.1007%2F978-3-540-85110-3_46
Kohlhase

"An Architecture for Recovering Meaning in a LATEX to OMDoc Conversion" (2009)
https://pdfs.semanticscholar.org/6647/612d3b61102a589db63a7ad9ac243901a9d8.pdf
undergrad thesis; describes a processing pipeline from arXiv to OMDoc using LaTeXML
Kohlhase's student

"Delineating Fields Using Mathematical Jargon"
https://www.aclweb.org/anthology/W16-1508.pdf

"On the Use of ArXiv as a Dataset" (2019)
https://arxiv.org/abs/1905.00075
primarily characterization of arXiv

"Plagiarism Detection in arXiv"
https://arxiv.org/pdf/cs/0702012.pdf


characterizing Latex content in arXiv.org .tex files

  • how many total .tex files?
  • how many English words per file?
  • how many expressions total in the corpus?
  • distribution of (number of expressions) per file
  • what's the distribution of (ratio of words per file to expressions per file)?
  • how many known Latex symbols are present in all the expressions?
  • what is the distribution of (expression length in characters)?
  • what is the distribution of (known symbols per expression)?
  • are there character sequences that are extremely rare? e.g., binary files hidden in .tex and other anomalies
This characterization step will be useful when comparing domains. For example, if we sample another domain (e.g., quantum mechanics), are the distributions similar or not? If we see the same characterization, then we can expect that the techniques we develop are likely to apply to a novel corpus.

Establishing that the sample being used is generic means we can work with a smaller data set (rather than "all the .tex in arXiv"). Showing the distribution shape does not change as more .tex files are added means convergence is possible.

If we find a domain that doesn't have similar distributions, then we can investigate why it is anomalous.
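
As a starting point for the per-file distributions above, a sketch (the regexes undercount -- they ignore \[...\], $$...$$, and align environments -- and "corpus/*.tex" is a hypothetical path):

import glob
import re
import statistics

def count_expressions(path):
    """Rough count of math expressions in one .tex file: inline $...$ plus equation environments."""
    with open(path, errors="ignore") as f:
        tex = f.read()
    inline = re.findall(r"\$[^$]+\$", tex)
    display = re.findall(r"\\begin\{equation\}.*?\\end\{equation\}", tex, re.DOTALL)
    return len(inline) + len(display)

counts = [count_expressions(p) for p in glob.glob("corpus/*.tex")]
if counts:
    print("files:", len(counts))
    print("median expressions per file:", statistics.median(counts))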

replacing symbols in a Sympy expression and generalizing the AST

Sympy's ability to convert a Latex string to a Sympy expression is useful but does not relate information about the variables in the Latex string to other resources (like dimension).

>>> import sympy
>>> from sympy import Equality, Add, Symbol, Mul, Pow, Integral, Tuple
>>> from sympy.parsing.latex import parse_latex

First, remove all presentation-related markup from a Latex string.
Then convert a Latex string to a Sympy expression using
>>> eq = parse_latex('a + b = c')
>>> eq
Eq(a + b, c)

In this post we will replace the variables with the reference IDs for each variable while maintaining the structure of the expression. 
The structure of the expression is
>>> sympy.srepr(eq)
"Equality(Add(Symbol('a'), Symbol('b')), Symbol('c'))"

Since this is a string, we can replace each variable in the expression with a reference ID.
The set of variables in the expression can be accessed using
>>> set_of_symbols_in_eq = eq.free_symbols
>>> set_of_symbols_in_eq
{a, c, b}

We can then replace each variable with an ID
>>> eq_str_with_id = sympy.srepr(eq).replace("'a'","'pdg4942'").replace("'b'","'pdg3291'").replace("'c'","'pdg0021'")
>>> eq_str_with_id
"Equality(Add(Symbol('pdg4942'), Symbol('pdg3291')), Symbol('pdg0021'))"

Lastly, evaluate the string to get a Sympy expression
>>> eq_with_id = eval(eq_str_with_id)
>>> eq_with_id
Eq(pdg3291 + pdg4942, pdg0021)

The reason this representation is useful is because of the separation of presentation from semantic structure.

And getting the symbol list is easy:
>>> eq_with_id.free_symbols
 {pdg3291, pdg4942, pdg0021}
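
As an aside, Sympy's subs can perform the same substitution directly on the expression tree, which avoids both the string manipulation and the eval (same hypothetical PDG IDs):
>>> id_for_name = {'a': 'pdg4942', 'b': 'pdg3291', 'c': 'pdg0021'}
>>> eq.subs({symb: sympy.Symbol(id_for_name[str(symb)]) for symb in eq.free_symbols})
Eq(pdg3291 + pdg4942, pdg0021)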

Example

To show why separation matters, suppose we have the Latex string
f = \int_{x_{\rm bottom}}^{x_{\rm top}} g dg
That is a challenge for Sympy's parse_latex, even though Sympy can handle semantically equivalent structures like
>>> parse_latex('f = \int_a^b g dg')
Eq(f, Integral(g, (g, a, b)))

If we happen to know that x_{\rm bottom} and x_{\rm top} are variables, then we can simplify the presentation string to a temporary string using dummy variables. (Note the raw strings below: without the r prefix, Python would interpret the \r in \rm as a carriage return.)
>>> initial_latex_str = r'f = \int_{x_{\rm bottom}}^{x_{\rm top}} g dg'
>>> tmp_latex_str = initial_latex_str.replace(r'x_{\rm bottom}','p').replace(r'x_{\rm top}','q')
>>> tmp_latex_str
'f = \\int_{p}^{q} g dg'
Caveat: the dummy variables (here p and q) cannot exist in initial_latex_str

Now we can act on the tmp_latex_str as we did in the first example
>>> eq = parse_latex(tmp_latex_str)
>>> eq_str_with_id = sympy.srepr(eq).replace("'p'","'pdg4942'").replace("'q'","'pdg3291'").replace("'g'","'pdg0021'").replace("'f'","'pdg2103'")
>>> eq_with_id = eval(eq_str_with_id)
>>> eq_with_id
Eq(pdg2103, Integral(pdg0021, (pdg0021, pdg4942, pdg3291)))


Algorithm for Converting Latex to Semantically-meaningful expression

  1. get a Latex string
  2. clean the Latex by removing presentation syntax
  3. In the cleaned Latex string, identify known variables from the PDG that the Sympy parser does not handle, e.g., r_{\rm Earth}
  4. In the cleaned Latex string, replace each known variable with a dummy variable (e.g., replace r_{\rm Earth} with d), where the dummy variable does not appear in the Latex string.
  5. eq = parse_latex(cleaned latex string with dummy variables)
  6. replace variables and dummy variables in eq with PDG symbol ID
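
A sketch of steps 3 through 6 as a single function (steps 1 and 2 are assumed done; the dummy generation assumes there are fewer multi-character variables than unused Latin letters):

import string
import sympy
from sympy.parsing.latex import parse_latex

def latex_to_pdg_expression(latex_str, known_variables):
    # known_variables maps each Latex variable to its PDG symbol ID,
    # e.g., {r'x_{\rm bottom}': 'pdg4942'}.  Multi-character variables are
    # swapped for unused single letters so the parser treats each as one
    # symbol; single-letter variables are substituted after parsing.
    unused_letters = (c for c in string.ascii_letters if c not in latex_str)
    name_to_id = {}
    for latex_var, pdg_id in known_variables.items():
        if len(latex_var) == 1:
            name_to_id[latex_var] = pdg_id  # the parser handles single letters directly
        else:
            dummy = next(unused_letters)  # re-checked against the current latex_str, so no collision
            latex_str = latex_str.replace(latex_var, dummy)
            name_to_id[dummy] = pdg_id
    eq = parse_latex(latex_str)
    return eq.subs({sympy.Symbol(name): sympy.Symbol(pdg_id)
                    for name, pdg_id in name_to_id.items()})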


Wednesday, May 27, 2020

progression of the interface used in the Physics Derivation Graph

Initially, content in the Physics Derivation Graph was manually entered into text files (e.g., CSV, XML). Presentation was Graphviz only.

Then the process was automated through use of a command line interface. Instead of the user generating numeric indices, IDs were created by the application. The user was able to specify a local ID in order to avoid retyping Latex for existing expressions. Presentation was Graphviz only.

The latest interface is browser-based, running in a Docker container either locally or on a website (derivationmap.net). The user is presented with forms and types in Latex. Some symbols are recognized automatically by the application. Presentation of the graph uses both Graphviz and d3js. The d3js graph has hyperlinked nodes.


I see two potential ways to iterate:
  • interactive graph. More intuitive to type text and connect nodes; no dependence on the user interacting with numeric IDs
  • linear input based on article structure with text, pictures, and expressions. The "form" would be more dynamic, like a Mathematica notebook entry or like Overleaf + inference rules and validation. 
Both of these approaches are more intuitive for users.


I also considered writing a Latex package that specifies the inference rules as macros. However, that lacks the ability to check the math. 

working with Sympy symbols extracted from a Latex expression

I'm using SymPy to work with Latex expressions
>>> import sympy
>>> sympy.__version__
'1.5.1'

I can convert Latex to SymPy using
>>> from sympy.parsing.latex import parse_latex
>>> eq = parse_latex("F = m a")
>>> eq.rhs
a*m
However, to work with the symbols in SymPy I need to extract them from the expression
>>> set_of_symbols_in_eq = eq.free_symbols
>>> set_of_symbols_in_eq
{a, F, m}
The entries exist in the set
>>> type(list(set_of_symbols_in_eq)[0])
<class 'sympy.core.symbol.Symbol'>
but they are not defined as Python variables
>>> F
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'F' is not defined

To associate a Python variable with each symbol, I used
>>> for symb in set_of_symbols_in_eq:
...     exec(str(symb) + " = sympy.symbols('" + str(symb) + "')")

Then the Python variable "F" is associated with the Sympy Symbol "F"
>>> F
F
>>> type(F)
<class 'sympy.core.symbol.Symbol'>
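
An alternative that avoids exec is to keep the name-to-Symbol mapping in a dict rather than injecting names into the interpreter:
>>> namespace = {str(symb): symb for symb in set_of_symbols_in_eq}
>>> namespace['F']
F
>>> type(namespace['F'])
<class 'sympy.core.symbol.Symbol'>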


For the relevance of this thread, see
https://groups.google.com/d/msg/sympy/_RnbbOqhERM/YAoJAbyPAgAJ

checking dimensions using Sympy

Suppose we have the expression
F = m a
and we want to validate the consistency of dimensions.

import sympy.physics.units
import sympy.physics.units.systems.si
from sympy.parsing.latex import parse_latex

eq = parse_latex("F = m a")
lhs = eq.lhs
rhs = eq.rhs
set_of_symbols_in_eq = eq.free_symbols

  • for each recognized symbol, associate that symbol with the ID in the PDG
  • for each symbol ID in the PDG, determine the dimensions of that variable
  • for each symbol, create a new "_dim" variable for the dimensions based on the lookup table in the PDG

F_dim = sympy.physics.units.mass * sympy.physics.units.length / (sympy.physics.units.time**2)
m_dim = sympy.physics.units.mass
a_dim = sympy.physics.units.length / (sympy.physics.units.time**2)

I wanted to avoid manually re-entering the AST of the expression, as in
sympy.physics.units.systems.si.dimsys_SI.equivalent_dims(F_dim, m_dim * a_dim)

This conversation 
https://groups.google.com/d/msg/sympy/_RnbbOqhERM/dehog-xpAgAJ
led to

import sympy
from sympy.physics.units import mass, length, time
from sympy.physics.units.systems.si import dimsys_SI
from sympy.parsing.latex import parse_latex
# convert the Latex string into a SymPy expression
eq = parse_latex("F = m a")
# specify the dimension of each symbol
F = mass * length / time**2
m = mass
a = length / time**2


dimsys_SI.equivalent_dims( eval(str(eq.lhs)), eval(str(eq.rhs)) )
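Assuming SymPy 1.5 behavior, equivalent_dims returns a boolean: True here, since F = m a is dimensionally consistent.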

For more on this content, see
https://groups.google.com/d/msg/sympy/_RnbbOqhERM/YAoJAbyPAgAJ

Tuesday, May 26, 2020

verifying dimensions, checking unit conversions, and using constants

There are multiple checks that can be performed for a step.
  • apply inference rule to expressions, verify change to AST is correct
  • verify that the dimensionality of variables is consistent
  • if units are present, validate unit conversions and consistency
The existence of the AST depends on having all symbols in the expression accounted for.
The ability to verify dimensions relies on having the AST.
The ability to check units requires dimension validation.


My prioritization is to validate the scope of coverage. The claim of coverage means
  1. all domains (e.g., electrodynamics, classical mechanics, quantum mechanics, thermodynamics)
  2. all symbols 
  3. all expressions (e.g., E=mc^2, F=ma, Schrodinger, Maxwell, wave equation)
Even though my prioritization is scope, it would be foolish to tackle scope and then later realize the PDG infrastructure does not handle checking inference rules, checking dimensionality, and checking units. 

Therefore, I should pause adding derivations and verify that, for a given derivation, I can check dimensionality and units. 

plan for parsing math latex expressions from arxiv

The arxiv content is available through AWS S3: https://arxiv.org/help/bulk_data_s3
As an alternative to S3, arxiv points to a subset that's available without going through AWS: https://www.cs.cornell.edu/projects/kddcup/datasets.html

The value of having a large number of expressions in Latex is that we could use the expressions to predict what a user wants to enter, decreasing the amount of manual entry required. Also, if a derivation contains similar expressions to what exists in the arxiv content, we could investigate whether the derivation is related to the arxiv paper.

Steps for working with arxiv data

Download papers (in .tex format) for a given domain.

For each .tex file, separate the text content, the math, and the Latex commands.
Task: identify all latex commands.
Task: identify latex commands that alter the math latex content (e.g., \newcommand)

Before attempting to parse the math latex content, remove all presentation-related artifacts
  • replace '\left(' with '('
  • replace '\right)' with ')'
  • replace '\ ' with ' '
  • replace '\,' with ' '
  • replace '\quad' with ' '
  • replace '\qquad' with ' '
Task: identify all non-math commands used in math latex.
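
A minimal sketch of that cleanup step, applying the replacements above literally (a fuller version would need a much longer replacement list):

def remove_presentation_markup(math_latex):
    """Strip presentation-only Latex commands from a math string."""
    replacements = [
        (r'\left(', '('),
        (r'\right)', ')'),
        ('\\ ', ' '),  # backslash-space
        (r'\,', ' '),
        (r'\qquad', ' '),
        (r'\quad', ' '),
    ]
    for old, new in replacements:
        math_latex = math_latex.replace(old, new)
    return ' '.join(math_latex.split())  # collapse repeated whitespace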

Sources to help with parsing math latex:
  • within the math latex string to parse, what can be deduced about the expected context?
  • given other math expressions in the same paper, what would be consistent?
  • given the text in a paper surrounding the math expressions, what would be expected based on keywords?
  • given other papers in the same domain or based on citations, what would be likely?
  • what is statistically likely given the corpus of all articles?
    • Use the Trie data structure to determine what the valid characters in the grammar should be. (Probably some subset of ASCII plus some Unicode characters.)
    • What are the tokens/symbols of the language?
    • What are the common sequences of tokens?
    • What are the appropriate labels for the tokens?
    • Instead of listing 10 different relational operators each time, create a group of relational operators and reference the group.
    • What are some logical grouping of symbols?
Parsing a LaTeX expression should return candidate SymPy expressions, each with a probability. In the case of unambiguous matching, only one expression should match (p=1). In the case of ambiguous matching, two or more SymPy expressions each carry some probability (p_1 + p_2 + ... = 1).

That is, in some sense, the same process a human goes through to decode the intended meaning of any given math expression in a scientific paper. We are looking to encode that process as a Python program.
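
The interface for that probabilistic parse might look like the following sketch, where the prior for each candidate rewrite is a stand-in for the contextual signals listed above:

from sympy.parsing.latex import parse_latex

def candidate_parses(rewrite_variants):
    """rewrite_variants maps each candidate rewrite of an ambiguous Latex
    string to a prior probability; priors of the variants that parse are
    renormalized to sum to 1.  Returns (SymPy expression, probability) pairs."""
    parsed = []
    for variant, prior in rewrite_variants.items():
        try:
            parsed.append((parse_latex(variant), prior))
        except Exception:
            pass  # skip variants the grammar cannot handle
    total = sum(prior for _, prior in parsed)
    if total == 0:
        return []
    return [(expr, prior / total) for expr, prior in parsed]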

Monday, May 25, 2020

set theory depends on logic and axioms; logic depends on set theory

The inference rules used in the Physics Derivation Graph could probably be reduced to a more fundamental basis.

The problem is that set theory starts by assuming logical rules and axioms, while the rules of logic are based on set theory.

From the perspective of a Physicist, the only relevance is that the rules being used are self-consistent.

Friday, May 22, 2020

what does comprehensive mean in the domain of Physics?

To ensure coverage of Physics we need to enumerate what coverage would mean. Three paradigms are


If the Physics Derivation Graph can demonstrate utility in each category, that provides evidence towards the claim that the Graph can be comprehensive.

That is distinct from a second, separate question of determining connectivity. 

Thursday, May 21, 2020

d3js for hire on freelancer

I like d3js as a presentation method but don't have a lot of motive to learn javascript.
There's a backlog of d3js tasks that are well defined: https://github.com/allofphysicsgraph/proofofconcept/issues?q=is%3Aissue+is%3Aopen+label%3Ad3js

I'm thinking about paying a developer (or multiple developers) to implement very specific features. I would provide a working example as a JSfiddle and then I would ask for a specific feature (like "prevent images from overlapping" or "hyperlink images") with the solution being a modification to the JSfiddle with no dependencies other than d3js.

On the site Freelancer I posted $120 to my account (with a $3.06 charge). Then I had to verify the card.
https://www.freelancer.com/projects/javascript/javascript-visualization-graph-hyperlink/proposals
I quickly got a bid and awarded the $20 since the bidder provided a video showing they had a working solution. I inspected the result and it wasn't quite what I wanted -- they had hyperlinked the node circle rather than the node image. They made the change and I sent them the award.
Freelancer charged $3 for the transaction.

Balance: $123.06 - ($3.06 loading fee to freelancer) - ($20 to bidder) - ($3 fee to freelancer) = $97

prioritizing work for the Physics Derivation Graph

Now that the website works, there are three candidate categories of work:
  • Presentation -- how does the site look, e.g., tables, use of color, images
  • Functionality -- what features exist, e.g. step validation, manipulating the graphs, form entry
  • Content -- what derivations, symbols, expressions are present
Of these three, content is most important with functionality being second. Functionality enables the workflow.

Within the category of "content," there are three sub-categories:
  • Derivations
  • Abstract syntax trees -- how symbols and operators relate within an expression
  • Inference rules
Of those three, the derivations are the highest priority. Inference rules enable derivations. 


The parsing of expressions in support of step validation is nice to have but not vital.

Within the topic of derivations, there is a plethora of candidates to invest in. Priorities are
  • subject scope diversity -- span all topics in Physics
  • simplicity -- not too many steps, easy math
  • interconnectedness with other topics

Wednesday, May 20, 2020

latex is for rendering presentation and does not provide semantics

Latex is used in the Physics Derivation Graph to specify mathematical expressions. This is a poor choice in that Latex does not provide explicit semantics for the content. That association is typically left to the human reader.

In the Physics Derivation Graph we could manually assign semantic meaning to each expression. That's a tedious task.

Alternatively, we could

  1. parse the arxiv database and find the most common expressions
  2. assign semantic meaning to those. 
  3. when the expression arises in the Physics Derivation Graph, the most likely interpretation of that expression is what has previously been labeled.
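
Step 1 and the input to step 2 could start as simply as counting normalized display-math strings across the corpus (a sketch; "papers/*.tex" is a hypothetical path and the regex only catches equation environments):

import collections
import glob
import re

expression_counts = collections.Counter()
for path in glob.glob("papers/*.tex"):
    with open(path, errors="ignore") as f:
        tex = f.read()
    for expr in re.findall(r"\\begin\{equation\}(.*?)\\end\{equation\}", tex, re.DOTALL):
        expression_counts[" ".join(expr.split())] += 1  # normalize whitespace

for latex_str, count in expression_counts.most_common(20):
    print(count, latex_str)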

why web interfaces are limited in search capability

Webpages have content. For text content, a common task is to search for something.
Web browsers have a plain text search feature, usually mapped to ctrl+f.

Why don't websites feature richer search functionality? 
For example, sorting tables, regular expressions for text, SQL, Cypher/Gremlin, infinite scroll (rather than pagination).

Answers include
  • adding complex functionality takes effort
  • most users will be confused by advanced search capability
  • complex search features can be used maliciously/incorrectly if direct access to the backend database is allowed: 
    • a user could delete data via SQL
    • a user could edit data
  • if a user is given read-only access to the server, a user could write an infinite loop and saturate server resources
Shipping all the data to the front-end isn't reasonable when the data is large (e.g., more than 1 MB), so interaction with the back-end is necessary (e.g., for pagination).

Thursday, May 14, 2020

timing page rendering with and without Sympy

I noticed that page rendering for derivationmap.net is slow and wanted to time the rendering to see what timing "slow" actually meant.

To summarize the results below, the first load of SUVAT with Sympy turned on took almost 30 seconds. In contrast, with Sympy turned off, the second load of the same derivation took 8 seconds.
--> That's a big difference in user experience.
30 seconds is also the threshold where Nginx times out waiting for the backend, so getting the time lower is important (or else the page won't show up).

The more meaningful comparison is the second load of SUVAT with and without Sympy. That change is from 15 seconds to 8 seconds, a 2x improvement.
The same 2x improvement also occurs for the other derivation (6.4 to 3.2 seconds).
--> That indicates there is value in caching Sympy validation results (though we probably won't recover the full 2x).

The second takeaway is that 8 seconds for a page to load with Sympy turned off is still too long.
The Flask logs are not configured to enable profiling -- they merely indicate when an event started. Because I don't log when an event finished, I cannot create a flame graph.

derivation   load   render time (s)   with d3js   with graphviz PNG   with Sympy inf rule check
SUVAT        1st    19.6              Yes         Yes                 No
SUVAT        2nd     7.8              Yes         Yes                 No
SUVAT        3rd     7.4              Yes         Yes                 No
SUVAT        1st    28.4              Yes         Yes                 Yes
SUVAT        2nd    15.2              Yes         Yes                 Yes
SUVAT        3rd    14.8              Yes         Yes                 Yes
Maxwell Eq   1st     7.4              Yes         Yes                 No
Maxwell Eq   2nd     3.2              Yes         Yes                 No
Maxwell Eq   3rd     3.2              Yes         Yes                 No
Maxwell Eq   1st    13.6              Yes         Yes                 Yes
Maxwell Eq   2nd     6.4              Yes         Yes                 Yes
Maxwell Eq   3rd     6.7              Yes         Yes                 Yes
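
The caching idea from the Sympy takeaway above could be as simple as memoizing the expensive calls. A sketch using parse_latex as a stand-in for the full validation call (the real cache key would be the inference rule plus the step's Latex strings):

import functools
from sympy.parsing.latex import parse_latex

@functools.lru_cache(maxsize=None)
def cached_parse(latex_str):
    """Parse each distinct Latex string once; repeat page loads reuse the result."""
    return parse_latex(latex_str)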

Wednesday, May 13, 2020

selecting a target audience for the Physics Derivation Graph

I've recognized the Physics Derivation Graph has a few potential audiences: teachers, students, researchers, and publishers.

Of these, targeting students makes the most sense:

  • the math is easier
  • less legacy momentum (e.g., "this is how we've done it before")
  • educators value free (rather than paid subscription)

inspecting the list of users who have logged in

$ sqlite3 flask/users_sqlite.db 
SQLite version 3.22.0 2018-01-22 18:45:57
Enter ".help" for usage hints.
sqlite> select * from user;

use ctrl+d to terminate sqlite

https://sqlite.org/cli.html

Sunday, May 10, 2020

histogram of expression lengths in bash

Reading the JSON as text does not work since there are multiple entries that have the key "latex"
 cat data.json | grep "            \"latex\":"

So I decided to read JSON into Python on command line
https://www.cambus.net/parsing-json-from-command-line-using-python/

That worked, but I learned that handling for loops on the command line requires extra work
https://stackoverflow.com/questions/2043453/executing-multi-line-statements-in-the-one-line-command-line

Once I knew the length of the values, I padded them with leading zeros
https://stackoverflow.com/questions/21620602/add-leading-zero-python

Then I used cut to eliminate the last digit (so the histogram bin size is 10).


cat data.json |\
   python -c "exec(\"import sys, json; expr=json.load(sys.stdin)['expressions'];\nfor i,d in expr.items(): print(str(len(d['latex'])).zfill(3))\")" |\
   sort -n |\
   cut -c1-2 |\
   uniq -c
 127 00
  63 01
  75 02
  54 03
  34 04
  28 05
  17 06
  18 07
  14 08
  15 09
  11 10
  10 11
   6 12
   5 13
   2 14
   1 15
   1 16
   1 18
   1 20
   2 23
   1 27

The longest expressions
cat data.json |\
   python -c "exec(\"import sys, json; expr=json.load(sys.stdin)['expressions'];\nfor i,d in expr.items(): print(len(d['latex']))\")" |\
   sort -n |\
   tail -n 5
186
201
231
233
271

The shortest expressions
cat data.json |\
   python -c "exec(\"import sys, json; expr=json.load(sys.stdin)['expressions'];\nfor i,d in expr.items(): print(len(d['latex']))\")" |\
   sort -n |\
   head -n 5
1
1
1
1
1
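
The same histogram is less fragile as a short Python script than as a shell one-liner (bin width 10, matching the cut above):

import collections
import json

with open("data.json") as f:
    expressions = json.load(f)["expressions"]

histogram = collections.Counter(len(expr["latex"]) // 10 for expr in expressions.values())
for bin_start in sorted(histogram):
    print("%5d  %3d-%3d" % (histogram[bin_start], bin_start * 10, bin_start * 10 + 9))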


Similarly, we can get the popularity of inference rules
cat data.json |\
   grep "inf rule" |\
   sed 's/"inf rule": //' |\
   tr -s " " |\
   sort |\
   uniq -c |\
   sort -n |\
   tail -n 10
  11  "substitute X for Y",
  12  "declare identity",
  13  "subtract X from both sides",
  14  "declare variable replacement",
  20  "declare final expr",
  21  "divide both sides by",
  21  "substitute LHS of expr 1 into expr 2",
  31  "simplify",
  31  "substitute RHS of expr 1 into expr 2",
  54  "declare initial expr",

Saturday, May 9, 2020

operator syntax: macros instead of abstract syntax trees

Currently the JSON file has a set of named operators with the attributes "argument count", "latex", "scope". The "argument count" is a non-negative integer and the "scope" is a list with elements like real, complex, vector, matrix, integer.

What's missing is the AST structure that defines where the arguments go with respect to the operator. For example,
x + y
is valid while
x y +
is not.

Similarly,
cos x
is valid while
x cos
is not.

While I can state these concepts, I don't know how to formalize the notation.
For example, a definite integral takes 4 arguments, each in a specific location:
\int_x^y f(z) dz

I could express operators using a Latex macro


\documentclass[12pt]{article}
\usepackage{amsmath,amssymb,amsfonts}
\usepackage[dvipdfmx,colorlinks=true,pdfkeywords={physics derivation graph}]{hyperref}
\newcommand\addition[2]{ #1 + #2}
\newcommand\subtraction[2]{ #1 - #2}
\newcommand\divisionSameLine[2]{ #1 / #2 }
\newcommand\divisionFrac[2]{ \frac{ #1}{ #2} }
\newcommand\integralDefinite[4]{ \int_{ #1}^{ #2} #3 #4}
\newcommand\addXtobothsides[3]{Add $#1$ to both sides of Eq.~\ref{eq:#2}; yields Eq.~\ref{eq:#3}.}
\title{Lorentz transformation}
\date{\today}
\setlength{\topmargin}{-.5in}
\setlength{\textheight}{9in}
\setlength{\oddsidemargin}{0in}
\setlength{\textwidth}{6.5in}
\begin{document}
\maketitle
\begin{abstract}
This is the abstract
\end{abstract}

\begin{equation}
\addition{a}{b}
\end{equation}

\begin{equation}
\divisionFrac{a}{b}
\end{equation}

\begin{equation}
\integralDefinite{a}{b}{f(x)}{dx}
\end{equation}

\end{document}

Compile to PDF using

latex runthis.tex 
latex runthis.tex 
dvipdfmx runthis.dvi 

recurring tasks

There are a few recurring tasks: formatting code with black, type checking with mypy, and generating documentation with sphinx.
Each of these tasks takes time, so including all of them as git commit hooks induces undesirable latency. 

Currently .git/hooks/pre-commit contains

#!/bin/bash
cd v7_pickle_web_interface/flask
docker run --rm -v`pwd`:/scratch --entrypoint='' -w /scratch/ flask_ub make black

I don't want to run mypy and sphinx every time because that would take a lot of my time.
I could run these two every tenth commit.

To get a command to run every tenth commit, I could leverage the date and run the command only when the timestamp modulo 10 is zero. The relevant timestamp looks like

GIT_AUTHOR_DATE='@1589048959 -0400'

dynamically build latex parser grammar based on symbols used in the Physics Derivation Graph

I've been using the Sympy Latex parser. After encountering a wide variety of issues, I realized a new strategy is needed.

The previous mindset was "make modifications to the ANTLR grammar as we encounter novel issues in Latex." That approach would be a constant process of catching up with whatever is in the Physics Derivation Graph.

Here is a different method that takes advantage of the information available in the Physics Derivation Graph to inform the ANTLR grammar.

The Physics Derivation Graph has a list of symbols in its database. We could leverage that list to build an ANTLR grammar specification based on the symbols actually in use.

The process would be
  1. get list of symbols from Physics Derivation Graph
  2. add those symbols into the ANTLR grammar
  3. when Sympy parses Latex, use the modified grammar specification
  4. when new symbols are added to the Physics Derivation Graph, go to step 1
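
A sketch of step 2 (this assumes a hypothetical grammar template containing a PDG_SYMBOLS placeholder; a real implementation would need to edit Sympy's LaTeX.g4 in a grammar-aware way):

def rebuild_grammar(pdg_symbol_list, template_path, output_path):
    """Splice the Physics Derivation Graph symbol list into an ANTLR grammar template."""
    alternatives = " | ".join("'%s'" % latex for latex in pdg_symbol_list)
    rule = "pdg_symbol : %s ;" % alternatives
    with open(template_path) as f:
        grammar = f.read()
    with open(output_path, "w") as f:
        f.write(grammar.replace("PDG_SYMBOLS", rule))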

Friday, May 8, 2020

sticking with the basics and avoiding dependencies

There's a trade-off between "write all the code yourself" and "use external libraries."

  • The "use external libraries" approach allows for quicker time-to-market and leverages the expertise of other people. The downside is that if any of the providers change their code, you now have to keep up with those dependencies. This means refactoring as versions change. 
  • The "write all the code yourself" approach can take longer and incurs the burden of learning things outside the initial focus. The upside is that there's less risk associated with dependencies.

Based on my observations of other projects not being durable due to the necessary upkeep of code when the "use external libraries" choice is selected, I've opted to write most of the code for the Physics Derivation Graph and thus minimize my risk associated with dependencies.

An example of this is avoidance of jQuery and other external Javascript libraries.
I was happy to find the site
https://plainjs.com/

Thursday, May 7, 2020

live display of input

I realized it would be helpful to provide users a live preview of what the Latex renders as, so that mistaken input can be corrected before submitting a step. (Decreasing feedback latency is a recurring theme.)

There are Javascript-based approaches like
https://stackoverflow.com/questions/20876797/create-live-preview-of-form-inputs
(see demo here - https://jsfiddle.net/mYjrn/1/ )
and
https://demos.joypixels.com/latest/live-preview.html
but these both rely on JQuery.

JQuery is "a tool used to make writing common JavaScript tasks more concise." (source: https://www.digitalocean.com/community/tutorials/an-introduction-to-jquery )
I want to avoid unnecessary dependencies, so live preview of text that relies on JQuery is unattractive.

I was able to find a live preview example that is written in pure Javascript:
https://www.codespeedy.com/show-live-preview-of-html-textarea-with-javascript/
I can get that to work on my website, but wrapping the output in \( \) to get Mathjax to interpret the live input does not work.

I needed to find a "live preview of Mathjax" since the input is parsed as Latex.
This snippet from 2013 didn't work
https://github.com/mathjax/mathjax-docs/wiki/More-live-preview-examples
but similar code posted here
https://cdn.rawgit.com/mathjax/MathJax/2.7.1/test/sample-dynamic.html
does work for live preview of Mathjax!

That link is for 2.7 but I had been using Mathjax 3. I couldn't find a v3-based approach, so I posted a question
https://stackoverflow.com/questions/61658297/mathjax-live-preview-for-version-3
and was directed to
https://mathjax.github.io/MathJax-demos-web/input-tex2chtml.html
which is a suitable solution for v3.

Saturday, May 2, 2020

categories of features for the website

Although the Physics Derivation Graph is currently available as a website (http://derivationmap.net/), I am unwilling to share the URL with other people. The lack of features means they could get the wrong impression.

There are lots of missing features which result in an inadequate site. The categories of missing features are

  • essential functionality. Examples: 
    • entering symbols
    • entering units and dimensions for variables
    • unit checking
    • dimensional analysis
    • search
  • ease of workflow. Examples: 
  • professionalism. Examples:
    • Google Sign In
There are additional features that are not vital for showing the site to other users. For example, the "monitoring" page and tracing user behavior.

Friday, May 1, 2020

sympy to AST using latex

Given an expression in Latex, extract the symbols:
based on https://stackoverflow.com/a/59843709/1164295

>>> import sympy
>>> from sympy.parsing.latex import parse_latex
>>> symp_lat = parse_latex('x^2 + a x + b = 0')

>>> symp_lat.atoms(sympy.Symbol)
{x, b, a}

Given an expression in Latex, generate the graphviz of the AST:
from https://docs.sympy.org/latest/tutorial/manipulation.html
see https://docs.sympy.org/latest/modules/printing.html#sympy.printing.dot.dotprint

>>> graphviz_of_AST_for_expr = sympy.printing.dot.dotprint(symp_lat)

more checks to perform in the Physics Derivation Graph

Up until now I've only considered a single check throughout the Physics Derivation Graph: the validation of a derivation step. This validation is performed by verifying that changes to the per-expression ASTs are consistent with the inference rule and feeds.

There are other checks that can be performed.

  • Symbolic consistency within an expression. For a given latex symbol in an expression, there can be only one numeric ID. This could manifest within one side of an expression or across both sides. 
    • If there are multiple instances of "a" on the LHS, are they all the same numeric ID?
    • Is the symbol "a" used on the LHS the same "a" used on the RHS? For example, in the expression "a + b = a * 2", both "a" symbols must have the same numeric ID.
  • Symbolic consistency within a step. A step involving more than one expression must have consistent symbols. For example, given "a = b" and the inference rule "add X to both sides" where the feed is 2, the output is "a + 2 = b + 2". Is the "a" in "a = b" the same as the "a" in "a + 2 = b + 2"? Changes of variable are feasible but must be explicit.
  • Symbolic consistency within a derivation. If a given Latex symbol has multiple numeric IDs in a derivation, is the change documented in an explicit inference rule, or do the two cases not intersect? For example, the symbol "c" could appear as both a constant and a variable in the same derivation, but not in the same step.
  • Dimensional analysis. Variables can be dimensional. The dimensions must be consistent within an expression. As an example, adding acceleration and velocity is not allowed. See https://docs.sympy.org/latest/modules/physics/units/dimensions.html
  • Unit consistency. Constants have a unit and a value. The units must be consistent when applied in an expression. As an example, adding "meters per second" and "miles per hour" is not allowed even though both are speeds.  See https://docs.sympy.org/latest/modules/physics/units/unitsystem.html and https://docs.sympy.org/latest/modules/physics/units/examples.html#equation-with-quantities
Sympy has some of these features built in; see https://docs.sympy.org/latest/modules/physics/units/quantities.html
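
For the dimensional-analysis check, the dimsys_SI approach from the May 27 entry should already flag the acceleration-plus-velocity example (velocity is length/time, acceleration is length/time**2):
>>> from sympy.physics.units import length, time
>>> from sympy.physics.units.systems.si import dimsys_SI
>>> dimsys_SI.equivalent_dims(length / time, length / time**2)
False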

grepping nginx logs to observe user behavior

What IP addresses made page requests and how many pages did they request?

$ cat nginx_access.log | cut -d' ' -f1,7 | grep -v "\.xml\|\.js\|php\|cgi\|\.png\|\.txt\|/$\|400$" | cut -d' ' -f1 | sort | uniq -c | sort -nr
    431 71.244.214.232
    301 18.223.152.78
    131 66.249.79.109   - Googlebot
    106 96.245.195.226
     50 66.249.79.111   - Google crawler
     24 66.249.79.113   - Google crawler
     23 174.198.15.222
      9 35.197.133.35

That same list without the leading counts:
$ cat nginx_access.log | cut -d' ' -f1,7 | grep -v "\.xml\|\.js\|php\|cgi\|\.png\|\.txt\|/$\|400$" | cut -d' ' -f1 | sort | uniq -c | sort -nr | head -n 20 | tr -s " " | cut -d' ' -f3
which is handy for https://www.maxmind.com/en/geoip-demo


What were the page dwell times for a given IP address?

$ ip="18.223.152.78"
$ cat nginx_access.log | grep $ip | cut -d' ' -f4,7 | grep -v "\.png\|\.js"
[30/Apr/2020:19:19:29 /navigation
[30/Apr/2020:19:19:35 /list_all_expressions?referrer=navigation
[30/Apr/2020:19:19:42 /list_all_symbols?referrer=_table_of_expressions


What were the user agent strings for a given IP address?

$ cat nginx_access.log | grep $ip | cut -d' ' -f12- | sort | uniq -c
     60 "Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.96 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" "-"
      3 "Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.92 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" "-"
      8 "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" "-"