Friday, July 31, 2020

conferences on formalization and cataloging math and physics

https://imkt.org/activities/events/ - International Mathematical Knowledge Trust


https://cicm-conference.org/cicm.php - Conference on Intelligent Computer Mathematics

  • CICM 2021 will be held in Timisoara, Romania during the week of July 26-30, 2021
  • CICM 2022 will be held in Tbilisi, Georgia sometime in the second half of September 2022 as a component of the Computational Logic Autumn Summit (CLAS 2022) taking place in Tbilisi.

online digital math libraries and Physics libraries

Mathematics

There are multiple formal math databases. In addition to the formal math databases, there are digital math libraries; these digital math libraries are relevant for searching and parsing content.


"How to Build a Global Digital Mathematics Library" by S Watt
https://cs.uwaterloo.ca/~smwatt/pub/reprints/2016-synasc-gdml.pdf

Physics


Teaching resource: https://www.compadre.org/

Increasing levels of formality in a trivial physics statement

In this post I start with a trivial observation about Physics concepts and attempt to add layers of formalization through semantic enrichment. Inspired by sTeX, I use pseudo-Latex which has no supported implementation. RDF tags might be more appropriate.

The initial observation is
The relation between period, T, and linear frequency, f, is T = 1/f. Thus frequency as a function of period is f = 1/T.
The purpose of using this as a starting point is that it is simple and easily understood by a human.
The rest of this post explores what semantic additions would be needed for a computer to parse the text.

Each of these transitions can be identified manually. I don't have reliable ways of automating this process.

*********************
The first change is conversion from ASCII to Latex, a common presentation format in Math and Physics. This change only alters presentation, not content.

The relation between period, $T$, and linear frequency, $f$, is
\begin{equation}
T = 1/f.
\end{equation}
Thus frequency as a function of period is
\begin{equation}
f = 1/T.
\end{equation}

*********************

The next change is to mark the boundaries of the claims being made.

\claim{1}{The relation between period, $T$, and linear frequency, $f$, is
  \begin{equation}
    T = 1/f.
  \end{equation}
}
Thus \claim{2}{frequency as a function of period is
  \begin{equation}
    f = 1/T.
  \end{equation}
}

*********************

Replace "thus" with the relation between claims.

\claim{1}{The relation between period, $T$, and linear frequency, $f$, is
  \begin{equation}
    T = 1/f.
  \end{equation}
}
\relation_between_claims{claim 2 is derivable from claim 1}
\claim{2}{frequency as a function of period is
  \begin{equation}
    f = 1/T.
  \end{equation}
}

*********************

Within each claim there are variables and symbols. Identify those.
I've inserted parentheses to make the order of operations clear.

\claim{1}{The relation between \named_variable{period}, \symbol{T}, and \named_variable{linear frequency}, \symbol{f}, is
  \begin{equation}
    \symbol{T} \operator{=} (\integer{1}\operator{/}\symbol{f}).
  \end{equation}
}
\relation_between_claims{claim 2 is derivable from claim 1}
\claim{2}{The \named_variable{frequency} as a function of \named_variable{period} is
  \begin{equation}
    \symbol{f} \operator{=} (\integer{1}\operator{/}\symbol{T}).
  \end{equation}
}

Visually, this could be represented by color-coded boxing.
Here claims are red, connections are light green, expressions are dark green, named variables are pink, variables are blue, operators are light blue, and integers are purple.

This visualization seems to relate to the box notation used in discourse representation theory (DRT), though I don't know how to leverage it.


*********************
The "T" and "f" symbols are conventions that are used in other papers. We can indicate that by assigning universal identifiers.
Similarly, the operators "/" and "=" are defined in other papers. Those also get a universal ID.
The relation between named variables and symbols needs to be made explicit.
With those associations made, replace the symbols and operators in the expressions.

\def{symbol{T} = pdg2948}                   % the symbol "T" has a (universal) Godel number
\def{symbol{f} = pdg9215}                   % the symbol "f" has a (universal) Godel number
\def{operator{/} = pdg1340}                 % the operator "/" has a (universal) Godel number
\def{operator{=} = pdg8821}                 % the operator "=" has a (universal) Godel number
\def{named_variable{period} = symbol{T}}    % the named variable "period" is equivalent to the symbol "T"
\def{named_variable{frequency} = symbol{f}} % the named variable "frequency" is equivalent to the symbol "f"

\claim{1}{The relation between \named_variable{period}, pdg2948, and \named_variable{linear frequency}, pdg9215, is
  \begin{equation}
    pdg2948 pdg8821 (\integer{1} pdg1340 pdg9215).
  \end{equation}
}
\relation_between_claims{claim 2 is derivable from claim 1}
\claim{2}{The \named_variable{frequency} as a function of \named_variable{period} is
  \begin{equation}
    pdg9215 pdg8821 (\integer{1} pdg1340 pdg2948).
  \end{equation}
}

One of the purposes of associating a symbol "T" with a universal identifier is that the type can be specified. Here "f" and "T" are both real-valued.

At this point the relation between claims 1 and 2 could be converted to SymPy and verified; see
https://derivationmap.net/review_derivation/884319/
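
As a sketch, the check that "claim 2 is derivable from claim 1" can also be done directly in SymPy (assuming both symbols are real and nonzero; the variable names here are mine):

import sympy

# claim 1: T = 1/f;  claim 2: f = 1/T
T, f = sympy.symbols('T f', real=True, nonzero=True)
claim1 = sympy.Eq(T, 1 / f)
claim2 = sympy.Eq(f, 1 / T)

# solve claim 1 for f and compare with the right-hand side of claim 2
derived_f = sympy.solve(claim1, f)[0]
assert sympy.simplify(derived_f - claim2.rhs) == 0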

converting the Physics Derivation Graph backend from JSON to a property graph

I shared the JSON database with a few researchers and realized I am embarrassed by the pointer chasing needed to construct the graph. The complexity of numerical IDs used for every feature in the PDG is confusing.

I had come up with a property graph schema which greatly simplifies the concepts and reduces the pointer chasing present in the JSON. Implementing the backend of the PDG as a property graph instead of JSON is a reasonable and beneficial task, but I'm not comfortable with any of the available property graph databases, of which there are many.

My criteria:
  • open source, widely adopted, and stable
  • graph content stored as plain text; for Neo4j, this means exporting graph content to a Cypher representation
  • the ability to enforce constraints on the property graph to prevent undesired edges or properties; in Neo4j this is possible but not supported in the community edition
Staring at the current PDG JSON file, I find the structure reasonably intuitive and relatively close to the property graph schema I had come up with. The JSON structure is not a "nodes and edges" design; it is hierarchical (nested dictionaries), both to be more concise and to reduce the computational pointer chasing.
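
To make the contrast concrete, here is a toy fragment of the same step in both layouts; the field names and IDs below are illustrative, not the actual PDG schema:

# nested-dictionary (JSON-style) layout: relations are implicit in the nesting
nested = {
    "derivations": {
        "frequency period relation": {
            "steps": {"2000001": {"inf rule": "declare initial expr",
                                  "outputs": ["9428112"]}}}},
    "expressions": {"9428112": {"latex": "T = 1/f"}}}

# property-graph layout: nodes carry properties, edges are explicit triples
nodes = [("step_2000001", "step", {"inf_rule": "declare initial expr"}),
         ("expr_9428112", "expression", {"latex": "T = 1/f"})]
edges = [("step_2000001", "HAS_OUTPUT", "expr_9428112")]

In the property-graph layout an edge is traversed directly rather than by looking up an ID in a separate dictionary, which is the pointer chasing mentioned above.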

searching Math and Physics content

The workflow for naive web search is
  1. input query
  2. click on search
  3. get results
This is the dominant mindset and is based on use of Google for finding web content. 
Revising the search string is taken as a failure.
Looking at the second page of results is taken as a failure.

That doesn't need to be the process for search. Interactive refinement of queries is a significant improvement since the user might not realize what's available.


What would searching Math and Physics be like?

  • Leveraging Latex would be important since that is the style the bulk of content uses.
  • Tagging sections would be useful to differentiate claim/lemma/theorem. Documents with sTeX markup or OMdoc could provide better results.
  • Identifying symbol meanings (operators, constants, variables) in expressions. Using "x" for position and "d" for position should be treated as equivalent; see the sketch below.
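
A minimal sketch of that last equivalence check, assuming SymPy expressions and renaming symbols in the order they are first encountered (the function name canonical is mine):

import sympy

def canonical(expr):
    # rename symbols in traversal order so the choice of letter no longer matters
    mapping = {}
    for node in sympy.preorder_traversal(expr):
        if node.is_Symbol and node not in mapping:
            mapping[node] = sympy.Symbol(f'v{len(mapping)}')
    return expr.subs(mapping)

x, d, t = sympy.symbols('x d t')
# the derivative of position with respect to time, written with "x" vs with "d"
assert canonical(sympy.Derivative(x, t)) == canonical(sympy.Derivative(d, t))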

Thursday, July 30, 2020

degrees of formalization in math

Update 2021-07-30: this concept is the de Bruijn factor.

There are many different views of formalization for mathematics. They are not all competing in the same scope. There are different degrees of formalization. Three analogies for this are
  • spectrum of formalization, from informal (e.g., Latex) to increasing formality, like CAS and then to Proof assistants.
  • "levels of zoom" as in visualization, like Google Maps but for Math
  • compiling "high level concepts" (e.g., Latex) to controlled natural language to proofs
The idea of "zooming visualization" is relevant to the user experience -- a reader should be able to select the level of granularity.

The idea of "compiling" is about making implicit concepts explicit. Also, compiling to different architectures is similar to formalizing in different logics and axioms. Unlike the visualization concept, there is not one underlying basis to specify formalization in.


What are the degrees of formalization in math?
  1. verbal (hallway chats, post-conference dinners, coffee breaks)
  2. white board or chalk board
  3. physical paper + pen or pencil
  4. typesetting: Latex or equivalents like Markdown or pretextbook.org (XML based) or pandoc; Powerpoint; Rmd, reStructuredText, MyST, jupyterbook
  5. semantic enrichment/decoration: sTeX, OMDoc; PDG inference rules; pf2 package (sty)
  6. controlled natural languages (e.g., Naproche)
  7. CAS verification of inference rules (e.g. Sage, SymPy)
  8. formal verification in a selected target logic and axioms (e.g., Lean, Isabelle, Coq)


Wednesday, July 29, 2020

manual translation process from Latex to Sympy for Physics Derivation Graph

Suppose I have a Latex string like

1 = \int_0^W a \sin\left(\frac{n \pi}{W} x\right) \psi(x)^* dx

The first step is to eliminate notation that is presentation oriented

1 = \int_0^W a \sin(\frac{n \pi}{W} x) \psi(x)^* dx

Second, rewrite multi-character variables as single-character variables (here \psi becomes f):

>>> import sympy 
>>> sympy.__version__ 
'1.5.1' 
>>> from sympy.parsing.latex import parse_latex
>>> sympy.srepr(parse_latex('1 = \int_0^W a \sin(\\frac{n \pi}{W} x) f(x)^* dx'))
"Equality(Integer(1), Integral(Mul(Symbol('a'), Mul(Function('f')(Symbol('x')), sin(Mul(Symbol('x'), Mul(Pow(Symbol('W'), Integer(-1)), Mul(Symbol('n'), Symbol('pi'))))))), Tuple(Symbol('x'), Integer(0), Symbol('W'))))"

Note that the complex conjugate operation was dropped without warning. We'll need to manually insert that later.

The SymPy representation is challenging to review, so it can be transformed back to Latex:
>>> sympy.latex((parse_latex('1 = \int_0^W a \sin(\\frac{n \pi}{W} x) f(x)^* dx')))
'1 = \\int\\limits_{0}^{W} a f{\\left(x \\right)} \\sin{\\left(x \\frac{n \\pi}{W} \\right)}\\, dx'
which shows that the input Latex does not get transformed exactly, but close enough for human verification.

From the single-variable representation, replace variables with PDG symbol IDs.
W --> 2523
a --> ?
n --> 1592
pi --> 3141
x --> ?
f --> psi --> 9489
In the context of an isolated expression, I don't know which "a" and which "x" are intended.
To determine those IDs I would need to review the derivation.

which, replacing all instances, yields
Equality(Integer(1), Integral(Mul(Symbol('a'), Mul(Function('pdg9489')(Symbol('x')), sin(Mul(Symbol('x'), Mul(Pow(Symbol('pdg2523'), Integer(-1)), Mul(Symbol('pdg1592'), Symbol('pdg3141'))))))), Tuple(Symbol('x'), Integer(0), Symbol('pdg2523'))))

Finally, insert the "conjugate" for \psi(x)
Equality(Integer(1), Integral(Mul(Symbol('a'), Mul(conjugate(Function('pdg9489')(Symbol('x'))), sin(Mul(Symbol('x'), Mul(Pow(Symbol('pdg2523'), Integer(-1)), Mul(Symbol('pdg1592'), Symbol('pdg3141'))))))), Tuple(Symbol('x'), Integer(0), Symbol('pdg2523'))))
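
A sketch of automating the substitution step via text replacement on the srepr output (the ID assignments are the ones listed above; parse_latex requires the antlr4 Python runtime):

import sympy
from sympy.parsing.latex import parse_latex

expr = parse_latex(r'1 = \int_0^W a \sin(\frac{n \pi}{W} x) f(x)^* dx')
srepr_str = sympy.srepr(expr)
# replace symbols with PDG symbol IDs; "a" and "x" are left as-is
# because their IDs are not known for an isolated expression
for old, new in [("Symbol('W')", "Symbol('pdg2523')"),
                 ("Symbol('n')", "Symbol('pdg1592')"),
                 ("Symbol('pi')", "Symbol('pdg3141')"),
                 ("Function('f')", "Function('pdg9489')")]:
    srepr_str = srepr_str.replace(old, new)
print(srepr_str)

The dropped conjugate still has to be reinserted by hand, as noted above.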

assumptions about formalizing Math or Physics based on a corpus like arxiv

I'm still reeling from the disconnect between what was proposed in the QED manifesto in 1994 and the current state of formalization of Math (and Physics) research in 2020.

In this post I'll write down what assumptions and what truths I understand about the problem in the context of papers posted to arXiv. In no particular order,
  • assumption: There is an intended meaning for each symbol and each expression.
  • truth: The meaning of each symbol and expression is not stated explicitly in writing.
  • assumption: An article is self-consistent in terms of both jargon and math expressions.
  • truth: The syntax used for jargon and math is not consistent across different papers.
  • truth: There are conventions (often more than one), and which convention is being used is rarely stated explicitly.
  • assumption: The content of a given paper is contextualized by other references.
  • assumption: That context is only partially known to the author of a paper.

The consequence of inconsistent use of conventions and use of implicit associations is that developing a grammar for parsing article text and math will necessarily yield incomplete and incorrect results.

function Latex for Sympy

I am interested in parsing functions written in Latex into a Sympy representation. For simple examples, this currently works:

>>> import sympy 
>>> sympy.__version__ 
'1.5.1' 
>>> from sympy.parsing.latex import parse_latex
>>> sympy.srepr(parse_latex('f(x)'))
"Function('f')(Symbol('x'))"

That's good. Sympy also supports expressions like

>>> sympy.srepr(parse_latex('x=0'))
"Equality(Symbol('x'), Integer(0))"

However, a function with a constraint written in Latex is not parsed correctly:

>>> sympy.srepr(parse_latex('f(x = 0)'))
"Symbol('f')"

The opposite direction does work (from Sympy to Latex)

>>> sympy.latex(sympy.Function('f')(sympy.Equality(sympy.Symbol('x'), sympy.Integer(0))))
'f{\\left(x = 0 \\right)}'

Similarly, we can get more complicated expressions from Sympy to Latex

>>> sympy.latex(sympy.Equality(sympy.Function('f')(sympy.Equality(sympy.Symbol('x'), sympy.Integer(0))),sympy.Symbol('a')))
'f{\\left(x = 0 \\right)} = a'

but the opposite (Latex to Sympy) is not working:

>>> sympy.srepr(parse_latex('f{\\left(x = 0 \\right)} = a'))
"Symbol('f')"

quantum bra ket Dirac notation in Sympy

Quantum mechanics notation is supported in SymPy, but not for parsing Latex input.

>>> import sympy
>>> sympy.__version__
'1.5.1'
>>> from sympy.parsing.latex import parse_latex
>>> sympy.srepr(parse_latex("\\langle a |"))
"Mul(Symbol('a'), Symbol('langle'))"

That isn't what I intended.

>>> from sympy.physics.quantum.state import Ket, Bra
>>> sympy.latex(Bra('a'))
'{\\left\\langle a\\right|}'

SymPy can produce Latex; can it then parse that?

>>> sympy.srepr(parse_latex("{\\left\\langle a\\right|}"))
  ...
  File "/usr/local/lib/python3.6/dist-packages/sympy/parsing/latex/_parse_latex_antlr.py", line 57, in syntaxError
    raise LaTeXParsingError(err)
sympy.parsing.latex.errors.LaTeXParsingError: I don't understand this
{\left\langle a\right|}
~~~~~~~~~~~~~~~~~~~~~^

I've opened a ticket:
https://github.com/sympy/sympy/issues/19854


A good amount of Dirac notation is supported in Sympy:
>>> from sympy.physics.quantum import Operator, Dagger
>>> sympy.latex(Bra('a')*Dagger(Operator('A'))*Ket('b'))
'{\\left\\langle a\\right|} A^{\\dagger} {\\left|b\\right\\rangle }'
>>> sympy.latex(sympy.Symbol('alpha')*Bra('pdg1')*Ket('pdg2')) 
'\\alpha {\\left\\langle pdg_{1}\\right|} {\\left|pdg_{2}\\right\\rangle }'
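
Until parse_latex understands this notation, one stopgap might be to translate Dirac Latex into SymPy constructor calls before parsing; a rough sketch, assuming each bra and ket wraps a single name (the function name is mine):

import re

def dirac_from_latex(s):
    # \langle a |  -->  Bra('a')*
    s = re.sub(r'\\langle\s*(\w+)\s*\|', r"Bra('\1')*", s)
    # | b \rangle  -->  *Ket('b')
    s = re.sub(r'\|\s*(\w+)\s*\\rangle', r"*Ket('\1')", s)
    return s

print(dirac_from_latex(r'\langle a | A | b \rangle'))
# prints: Bra('a')* A *Ket('b')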

Tuesday, July 28, 2020

options if Latex is the basis

Latex is the representation of Math that everyone can agree is common across domains. There is also consensus that Latex is inadequate for formal math.

Brainstorming possible methods of leveraging Latex:

  • Having authors use semantic markup (e.g., sTeX) when creating documents.
  • After documents have been created, perform bulk analysis (arXiv corpus) to extract semantics. After separating text from math in a given document, tasks are to (identify article text as theorems or claims or lemma) and (identify parts of expressions in math Latex using either a grammar or machine learning).
  • After author has written document but before they share it, the author uses software to categorize sections of text and identify parts of expressions. (This requires an addition to existing workflows, creation of novel software, and reliance on novel grammars or ML algorithms.)
  • Instead of authoring Latex in a text editor or GUI or OverLeaf, author the content in a CAS (e.g., Sage) and generate the .tex from the CAS. Here the CAS is serving as a Latex editor, constraining what's allowed by or required from an author. While CAS software can generate Latex output, using a CAS as a Latex editor isn't standard. (See the sketch below.)
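
As a sketch of that last option, SymPy can already emit an equation environment from a CAS-side object (the file name and the claim are illustrative):

import sympy

T, f = sympy.symbols('T f')
claim = sympy.Eq(T, 1 / f)

# generate a .tex fragment from the CAS representation
with open('claim.tex', 'w') as file_handle:
    file_handle.write('\\begin{equation}\n')
    file_handle.write(sympy.latex(claim) + '\n')
    file_handle.write('\\end{equation}\n')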

3D visualization of graphs is beneficial

Currently the Physics Derivation Graph has 2D graphs.

3D visualization of graphs is useful:

  • 3D is important for larger graphs because of the extra spacing enabled compared to 2D
  • visualization helps identify mistakes in the data structure
  • statically placed 3D graphs are useful and can be navigated in the browser (VR is unnecessary)
  • interactive rotation of the perspective matters for making the spacing of nodes visually accessible for human pattern detection
  • coloring, weights, and filtering of edges, nodes, and types make the visualization less noisy

Notes from CICM 2020 - 13th Conference on Intelligent Computer Mathematics

https://cicm-conference.org/2020/cicm.php

The schedule is https://easychair.org/smart-program/CICM-13/

The list below is incomplete since I didn't attend every presentation.

Monday


Joe Wells: slides for this talk are on Ciarán's web page: https://www.macs.hw.ac.uk/~cmd1/assets/talks/cicm-2020-talk.pdf
There is also a nearly-identical preprint on arxiv: https://arxiv.org/pdf/2005.13954.pdf
The machine-checked proofs that the axioms of ZFP hold in the model we build for it can be found at: https://www.macs.hw.ac.uk/~cmd1/cicm2020/ZFPDoc/ZFP.html

Hanna Lachnitt: https://youtu.be/nbwZ-521sMQ

Tuesday

Survey of languages for formalizing math :: Cezary Kaliszyk


Observation: The language of formal systems is not the way mathematicians actually communicate (in person verbally, in writing).
Computer Algebra Systems (CAS) and Proof Assistants are viewed as "code" separate from Natural Language.

Natural Language Processing (NLP) of math does not recognize repeated use of the same notation, whereas humans recognize that as a convention.

Aspects that distinguish formal from informal communication: types, inheritance, equality (isomorphism). Each of those aspects is typically implicit for human consumers.
The goal of formal specification is proof, whereas in informal communication the goal is to convince the reader. Those are distinct.

Spectrum of math language formality:
natural language | presentation        | semantic | controlled natural languages | proof assistants
speaking         | Latex               | CAS      | ForTheL                      | Coq
written text     | presentation MathML | sTeX     | Mizar                        | Lean, Isabelle
<-- more informal        //       more formal -->

Observation: formal language can be more concise than Controlled Natural Language due to possibility of autocompletion in formal languages.

Cezary expects deep learning to be useful in translation from informal to formal languages, though the capability seems to vary.

Representing Structural Language features in formal meta languages :: Dennis Muller


MMT is defined in a 2013 paper by Rabe and Kohlhase, "A scalable module system".
MMT is for math theories.

Formalization of Elementary Number Theory in Mizar :: Adam Naumowicz

Mizar has 60,000 theorems and 12,000 definitions in 1300 files.
Mizar is based on set theory.

The focus of this talk is the first 10 problems from the 1970 textbook by Sierpinski.

Question from audience: when will Mizar be open source?
Adam has no problem with open sourcing Mizar.
The specific barrier is not clear to me.

Interpreting Math texts in Naproche-SAD :: Peter Koepke

Naproche = Natural Proof Checking
SAD = System of Automated Deduction

The origin of Mizar is in a 1985 paper "Computer Aided Reasoning" by Blair and Trybulec

Integrating ForTheL with Latex :: Adrian De Lon

The technique is to separate the Latex text from the math Latex and then develop separate grammars.
https://github.com/adelon/nave - written in Haskell


Guided tour of Formally verified constraint solvers :: Catherine Dubois

Bugs have been found in constraint-programming solvers.


barriers to implementing formalization in mathematics

Inspired by reading the QED manifesto and responses to it, I attended a conference on formalization of mathematics. The challenges to progress appear to include the following:
  • The diversity of topics (both depth and breadth) is huge. 
  • The variable level of formalization (from natural language to Latex to Constrained Natural Languages to Computer Algebra Systems to Proof assistants).
  • Addressing all the requirements (readability, conciseness, preciseness) is infeasible with current techniques, so selecting any potential solution is suboptimal. Investing resources (time, attention, compute, creativity) in a suboptimal solution is an opportunity cost.
  • There is more brain power than staffing. Good ideas don't get implemented because hiring software programmers isn't accessible.
  • The staffing that is available is graduate students, who are neither domain experts nor experienced programmers.
  • The degree of automation for projects varies. What is considered hard or scalable depends on who is doing the work. 
  • Design decisions are driven by the specific problems being solved. A "universal" and comprehensive approach incurs inefficiency in any given domain. 
  • The degree of software development best practices varies (linting, continuous integration/continuous deployment, use of version control)
  • The diversity of implementation choices (Python, Haskell, Coq, C, C++)
  • Inventing novel domain specific languages is easy for this community, further muddying the options
  • Standard (basic) math is not exciting to build a product around
  • Maintenance of software and databases is not exciting research
  • There are a variety of databases which hold proofs.
  • Anyone can post code (e.g., github); also there is no requirement to post source code
    • There is no standardization on what level of quality is acceptable
    • There is no review process
    • Linking of code to papers to presentations is unusual
  • Licensing of code and results varies -- open source, closed source, unspecified
  • Who the audience should be is unclear -- advancing research, or sharing with advanced peers, or on-boarding graduate students, or teaching undergraduates?
  • Specific goal (beyond QED manifesto) is not agreed upon; there are no milestones.
    • use cases are undefined
    • well-defined challenge thresholds are needed
  • A survey and detailed comparison of the various options is not readily available.
  • Translation between representations is usually lossy since scopes are not one-to-one
  • There is no authority, formal or informal. There is no designated leader
  • There is no governance structure. Decision making is ad hoc and social
  • Incentives for participants include
    • Publishing papers
    • Advancing knowledge
    • Building reputation
    • Forming alliances

As a consequence of the above issues, the selection function becomes who can produce the most creative idea and get it implemented in order to compete for attention.

Top-down direction of effort isn't required, but the consequence of bottom-up coordination is duplicated effort plus the investment required to translate between factions.



The above list gets oversimplified into observations like "no barrier exists other than someone sitting down and doing the work."
There are a couple implicit assumptions there:

  • the ill-defined approach in my mind is the best and right method
  • that someone is not me; probably a graduate student, since the work is trivial

Speed does not seem to be an issue. There is not a compute bottleneck. 

Monday, July 27, 2020

keywords for the project: mathematical physics, applied physics, computational science

Context
There are three subsets of Physics: computational, theoretical, and experimental.
In Math there are many subsets, one of which is Applied Math.
At the intersection of computational physics and computer science and Applied math is the use of computers for large-scale numerical analysis.
Each of these subspecializations has a different perspective and different jargon for the same task.


I've been using the phrase "documenting mathematical physics" to describe the intent and scope of the Physics Derivation Graph. In conversation with a mathematician, he suggested a few alternative key phrases that might help me find new audiences: mathematical physics, applied physics, and computational science.
This topic came up because I mentioned that my searches have been repeatedly ending up in pure math due to my focus on the semantics of expressions. I think of "pure math" as establishing foundations using axioms, theorems, and proofs.

I have an undergraduate degree in Applied Math, but I've never considered the Physics Derivation Graph as an "applied math" issue. In my experience, applied math is about solving PDEs for specific scenarios and applying combinatorics to narrow situations.
My PhD is in computational Physics, so my experience with "scientific computation" is using large computers to solve numerical math problems. 


I wouldn't consider the Physics Derivation Graph to be "theoretical Physics." However, both feature heavy reliance on math. 

Applied Math does have Physics as a domain of use, but there are other domains in which Applied Math is used: engineering, industrial planning, economics, biology.
See https://www.reddit.com/r/math/comments/ec6r0/whats_the_difference_between_applied_math_and/

Looking around the Internet, there are degrees in Applied Math and in Scientific Computation and Computation Science. Many of these focus on numerical computation and big data. I suspect these degree programs have been displaced by the hotter "data science" programs. 

LatexML for converting Latex math to Content MathML

Rather than creating a Dockerfile, here's a one-liner:

docker run -it --rm phusion/baseimage:0.11 /bin/bash -c "apt update && apt install -y latexml libtext-unidecode-perl && latexmlmath --cmml=- A \\cdot B"

The downside is that if you get the command wrong, the complete install process has to be re-run.


Here are two examples of LatexML converting Latex math into Content MathML:

First, the quadratic formula (which features some ambiguity)

latexmlmath --cmml=- \\frac{b\\pm\\sqrt{b^2-4ac}}{2a}
<?xml version="1.0" encoding="UTF-8"?>
<math xmlns="http://www.w3.org/1998/Math/MathML" alttext="\frac{b\pm\sqrt{b^{2}-4ac}}{2a}" display="block">
  <apply>
    <divide/>
    <apply>
      <csymbol cd="latexml">plus-or-minus</csymbol>
      <ci>𝑏</ci>
      <apply>
        <root/>
        <apply>
          <minus/>
          <apply>
            <csymbol cd="ambiguous">superscript</csymbol>
            <ci>𝑏</ci>
            <cn type="integer">2</cn>
          </apply>
          <apply>
            <times/>
            <cn type="integer">4</cn>
            <ci>𝑎</ci>
            <ci>𝑐</ci>
          </apply>
        </apply>
      </apply>
    </apply>
    <apply>
      <times/>
      <cn type="integer">2</cn>
      <ci>𝑎</ci>
    </apply>
  </apply>
</math>

The following Latex math is ambiguous, so LatexML is not able to do much.

latexmlmath --cmml=- A \\cdot B
<?xml version="1.0" encoding="UTF-8"?>
<math xmlns="http://www.w3.org/1998/Math/MathML" alttext="A\cdot B" display="block">
  <apply>
    <ci>⋅</ci>
    <ci>𝐴</ci>
    <ci>𝐵</ci>
  </apply>
</math>

my reaction to the QED manifesto

The QED manifesto states a moral vision for the field of mathematics. The vision is to move from human-based knowledge management towards machine-assisted knowledge management. The process of sharing insights initially relied on oral storytelling, then on paper, and now on computers; each transition has enabled progress.

Claim: the vision of QED can be realized in Physics now.
  • Now because technology has decreased the cost of implementation.
  • Physics because experiments are the arbiter of validity. 
What remains the same is individual and institutional incentives.


Physicists have historically gotten away with human-mediated storytelling of quantitative relations. What we currently use (.tex) is a machine-parsable version of oral transmission. Comprehension does not scale well given the complexity of the subject.

One outcome is that if physicists continue with the current approach of human-mediated storytelling of quantitative relations, we risk selecting smaller and smaller sets of individuals capable of advancing the field, as well as requiring narrower specialization. Another outcome is decreasing transparency for external consumers.

To make progress towards the vision described in the QED manifesto, break the objective into tangible milestones. Each technical change is also a cultural change that requires individual participants to evaluate the trade-off of value. Technology change also can manifest as change to an interface/workflow.  Change can be minor or revolutionary. The more revolutionary, the more value is required. Value can be a result of creativity (Tex, doi) or investment of resources (arxiv). 

Milestones enabling the QED vision:
  1. Symbol registry using Godel numbers. Improves the reader's ability to compare content. (See the sketch after this list.)
  2. Expression registry (presentation) using Godel numbers. Improves the reader's ability to compare content.
    These two registries form a Physics catalog. The catalog lacks value beyond lateral comparison. Both registries help the reader, not the author. The author has no incentive to invest extra work that does not benefit them. (This is the tragedy of the commons.)
  3. Connective explanation (story). An appendix for each derivation. The author shows their work to ease the reader's burden. Improves reproducibility.
  4. Inference rule registry (math)
  5. Formal expressions registry (math) per presentation
  6. Validation of steps 
Both the formal expressions and the validation require significant effort by the author or a machine, and both are specific to and limited by the selection of CAS. Again, the value is mostly low, though it does serve as a filter for crackpot content.
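
As a toy illustration of the first two milestones, a registry might map universal IDs to metadata; the entries below are hypothetical, reusing the illustrative IDs from the "Increasing levels of formality" post above:

# milestone 1: symbol registry keyed by universal ID
symbol_registry = {
    'pdg2948': {'latex': 'T', 'name': 'period', 'type': 'real', 'SI units': 's'},
    'pdg9215': {'latex': 'f', 'name': 'linear frequency', 'type': 'real', 'SI units': '1/s'}}

# milestone 2: expression registry (presentation) keyed by universal ID
expression_registry = {
    'pdg001': {'latex': 'T = 1/f', 'uses symbols': ['pdg2948', 'pdg9215']}}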

All of these milestones are present in the Physics Derivation Graph. This perspective of milestones is useful for segmenting the PDG barrier into constituent barriers. Each milestone has to have clear incentives for users to justify investment.

Registries are the (global) opposite of the local content dictionaries in OpenMath.
Models: DOI, ORCID.

Each registry requires perpetual hosting and public accessibility.
Model: arXiv.

Individual researchers would only participate as authors of content if they expect their investment to result in a durable product used by the rest of their community. Their individual work also needs to be tied to employment incentive structures or to reduce existing suffering. Reduction of individual suffering can come from increasing effectiveness or increasing efficiency.

parsing .tex on arxiv into MathML

https://sigmathling.kwarc.info/resources/arxmliv-dataset-082018/

https://kwarc.info/projects/arXMLiv/ - now rebranded as
https://github.com/dginev/CorTeX and in active development (written in Rust)

https://www.youtube.com/watch?v=cDzIpFPNPpI - Kohlhase (2012)

Physics Derivation Graph as Latex package: omdoc and sTeX

Since Physics content is authored as .tex files, it is worth considering how the Physics Derivation Graph could be made into a Latex package.

A literature search indicates sTeX is the closest implementation, featuring a domain specific language (the cmathml package) enabling concise specification of Content MathML and the ability to generate math Latex or Presentation MathML. While necessary, the Content MathML does not address other aspects of the Physics Derivation Graph like (Godel numbering for symbols) and (inference rules between steps).

There are efforts to semantically structure mathematical papers, for example OMDoc.
Both MathML and OpenMath address mathematical formulas in isolation, whereas OMDoc allows to express the structure of mathematical documents, for example the relation between definitions, theorems and proofs.
source: "Extracting Mathematical Semantics from LATEX Documents"

There is a Physics extension for OMDoc, PhysML
https://github.com/OMdoc/OMDoc/wiki/PhysML
https://github.com/OMDoc/PhysML

reconsidering the size of the leap; a literature search for related efforts

A pure graph-based approach is far from what is standard (sequential presentation in PDFs). The value isn't sufficient for content creators (Physics researchers) to cross the entry barrier. I needed to learn this lesson:
Improving on tradition is good, but ignoring tradition is stupid.
More specifically,
Focus on formalization of actual mathematical practice, and not also try to ‘improve’ on the way that people currently do mathematics.
From "The QED manifesto revisited" (2007)

If a graph-based semantically-encoded representation is too much, what are the more incremental options? Non-graph options for sequential content with semantic decoration:

  • start the workflow of document creation in a CAS (e.g., Sympy) and generate .tex from that. 
  • create documents in Latex, then check the .tex file using a CAS

We could ignore both graphs and validation and focus merely on semantically meaningful expressions. That alone would constitute an evolutionary step.


This semantic representation is a necessary dependency for validation and for a rigorous graph.
A mathematically-linked graph that has edges which are not rigorously checked is closer to Wikipedia's hyperlinked text pages.


Semantic text in Latex


"SALT – Semantically Annotated LATEX for Scientific Publications."

Groza, Tudor, Handschuh, Siegfried, Möller, Knud, Decker, Stefan
In Franconi, Enrico, Kifer, Michael, May, Wolfgang (eds.) ESWC 2007. LNCS vol. 4519, pp. 518–532. Springer, Heidelberg (2007)
--> focuses on decorators like "claim" and "explanation." No suggestion of handling mathematics.

QED v2: Mathropolis and mathematical content markup language

"The QED Manifesto after Two Decades -- Version 2.0"

MCML appears focused on the structure of proofs rather than expressions.
No source code available. 

sTeX: An Infrastructure for Semantic Preloading of LaTeX Documents

sTeX handles expressions and proofs.
Most relevant to the PDG is "cmathml" which provides a method for concise Content MathML that gets converted to Latex for display. 
https://github.com/slatex/sTeX - sTeX source code
https://ctan.org/pkg/stex?lang=en - sTeX package for Latex


https://github.com/slatex/sTeX/blob/master/sty/stex/stex.pdf "Semantic Markup in Tex/Latex" (2019)
This document advises that github is the best source since CTAN is tedious to update and thus not current.

What follows is an amalgamation of the above summary plus the PDFs for each package.

Sub-packages in sTeX

https://tools.ietf.org/doc/texlive-doc/latex/stex/cmathml/cmathml.pdf "cmathml.sty: A TEX/LATEX-based Syntax for Content MathML" (2012)
"This package provides a collection of semantic macros for content MathML and their LATEXML bindings. These macros form the basis of a naive translation from semantically preloaded LATEX formulae into the content MathML formulae via the LATEXML system."
content --> presentation (Latex or Presentation MathML)

"statements" in sTeX
https://ctan.math.illinois.edu/macros/latex/contrib/stex/sty/statements/statements.pdf "Semantic Markup for Mathematical Statements" (2019)
abstract: "This package provides semantic markup facilities for mathematical statements like Theorems, Lemmata, Axioms, Definitions, etc. in STEX files. This structure can be used by MKM systems for added-value services, either directly from the STEX sources, or after translation."

"omtext" in sTeX
http://mirrors.ibiblio.org/CTAN/macros/latex/contrib/stex/sty/omtext/omtext.pdf "omtext: Semantic Markup for Mathematical Text Fragments in LATEX" (2019)
abstract: "This package supplies an infrastructure for writing OMDoc text fragments in LATEX."




Latex packages which are not relevant

A package for common math commands: COntent Oriented LaTeX
https://www.ctan.org/pkg/cool
http://ftp.math.purdue.edu/mirrors/ctan.org/macros/latex/contrib/cool/Content_LaTeX_Package_Demo.pdf


There are tools for performing arithmetic calculations in Latex.

There's no math support on https://schema.org/ for webpages. 


Wednesday, July 15, 2020

troubleshooting errors with ASTs in the Physics Derivation Graph (SOLVED)

At the top of derivation pages I display errors associated with processing the page. Most of the errors are associated with the Abstract Syntax Trees (ASTs) stored for each expression in data.json.

For example, on the "curl curl identity" page https://derivationmap.net/review_derivation/000005/
in step 2339482: unable to eval AST for "sympy.Equality(cross(sympy.Symbol('E'), cross(Del, Del)), sympy.Function('nabla')(sympy.Add(sympy.Mul(sympy.Integer(-1), sympy.Symbol('E'), sympy.Pow(Del, sympy.Integer(2))), dot(sympy.Symbol('E'), Del))))"

I need to recreate the issue on my local instance of the site, so my first step is to launch a Docker image
cd ~/version_controlled/proofofconcept/v7_pickle_web_interface/flask/
make dockerlive

In the Docker container I have the same environment as the website, so I open Python and run Sympy to parse the expression
$ python 
Python 3.6.9 (default, Apr 18 2020, 01:56:04) 
[GCC 8.4.0] on linux
>>> import sympy
>>> sympy.__version__ 
'1.5.1'
>>> sympy.Equality(cross(sympy.Symbol('E'), cross(Del, Del)), sympy.Function('nabla')(sympy.Add(sympy.Mul(sympy.Integer(-1), sympy.Symbol('E'), sympy.Pow(Del, sympy.Integer(2))), dot(sympy.Symbol('E'), Del)))) 
Traceback (most recent call last): 
File "<stdin>", line 1, in <module> 
NameError: name 'cross' is not defined

Looking in the file validate_dimensions_sympy.py the relevant import statements exist:
from sympy.vector import cross, dot  # type: ignore 
from sympy.vector.deloperator import Del  # type: ignore

Next I looked in
logs/flask_critical_and_error_and_warning_and_info_and_debug.log
to see where the error is being observed.

The error occurs on line 2984 in controller.py.
Actually, that line is just a try/except around validate_dimensions from validate_dimensions_sympy.py.

As mentioned earlier, the "cross" and "dot" and "Del" are correctly imported. Going back to the REPL,
>>> from sympy.vector import cross, dot  # type: ignore 
>>> from sympy.vector.deloperator import Del  # type: ignore
>>> sympy.Equality(cross(sympy.Symbol('E'), cross(Del, Del)), sympy.Function('nabla')(sympy.Add(sympy.Mul(sympy.Integer(-1), sympy.Symbol('E'), sympy.Pow(Del, sympy.Integer(2))), dot(sympy.Symbol('E'), Del))))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.6/dist-packages/sympy/core/cache.py", line 94, in wrapper
    retval = cfunc(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/sympy/core/power.py", line 301, in __new__
    obj = b._eval_power(e)
AttributeError: type object 'Del' has no attribute '_eval_power'

The AttributeError hints at the cause: the AST references the Del class itself rather than a Del() instance, so Pow has nothing to call _eval_power on. The other information in the error message we can leverage is the step_ID. That step contains the expression_ID 7575859295, which has a complicated latex:
\\vec{ \\nabla} \\times \\vec{ \\nabla} \\times \\vec{E} = \\vec{ \\nabla}( \\vec{ \\nabla} \\cdot \\vec{E} - \\nabla^2 \\vec{E})

Eliminating the "\\vec{}" we get
\\nabla \\times \\nabla \\times E = \\nabla( \\nabla \\cdot E - \\nabla^2 E)
which can be parsed by Sympy:
>>> from sympy.parsing.latex import parse_latex
>>> parse_latex("\\nabla \\times \\nabla \\times E = \\nabla( \\nabla \\cdot E - \\nabla^2 E)")
Eq(E*(nabla*nabla), nabla(-E*nabla**2 + E*nabla))
>>> sympy.srepr(parse_latex("\\nabla \\times \\nabla \\times E = \\nabla( \\nabla \\cdot E - \\nabla^2 E)"))
"Equality(Mul(Symbol('E'), Mul(Symbol('nabla'), Symbol('nabla'))), Function('nabla')(Add(Mul(Integer(-1), Symbol('E'), Pow(Symbol('nabla'), Integer(2))), Mul(Symbol('E'), Symbol('nabla')))))"

The output string does not work as-is in the Python prompt
>>> Equality(Mul(Symbol('E'), Mul(Symbol('nabla'), Symbol('nabla'))), Function('nabla')(Add(Mul(Integer(-1), Symbol('E'), Pow(Symbol('nabla'), Integer(2))), Mul(Symbol('E'), Symbol('nabla'))))) 
Traceback (most recent call last): 
File "<stdin>", line 1, in <module> 
NameError: name 'Equality' is not defined

Changes required:
  • Mul --> cross
  • Symbol('nabla') --> Del
  • append "sympy." in front of commands
which gets us back to the original expression.

Staring at the expression, I see it contains
sympy.Pow(Del, sympy.Integer(2))
which is causing the error.

The solution is to replace the operation with
Mul(Del, Del)
in expression_ID 7575859295 in data.json.
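
For reference, applying that substitution (and assuming the same imports as in the REPL session above), the corrected AST for expression_ID 7575859295 reads:

sympy.Equality(cross(sympy.Symbol('E'), cross(Del, Del)), sympy.Function('nabla')(sympy.Add(sympy.Mul(sympy.Integer(-1), sympy.Symbol('E'), sympy.Mul(Del, Del)), dot(sympy.Symbol('E'), Del))))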