Saturday, August 26, 2023

HTML arxiv papers, semantic tagging, and formal verification

arxiv recently started experimenting with converting .tex files to HTML: https://info.arxiv.org/about/accessible_HTML.html

The first official announcement was in February 2022:
https://blog.arxiv.org/2022/02/21/arxiv-articles-as-responsive-web-pages/

The HTML from .tex effort is a collaboration between Deyan Ginev (a student of Dr. Kohlhase) and Bruce Miller (at NIST - https://www.nist.gov/people/bruce-r-miller). Kohlhase's group (https://kwarc.info/research/) focuses on semantic enrichment of Latex. Bruce provided the software to convert Latex.

The reason for this Latex to HTML conversion is because it's the first step for enabling semantic enrichment of content on the arxiv. There's immediate benefit for arxiv being able to support HTML, which I suspect is why arxiv cooperated with Kohlhase's group.

In the long term I see a need to connect semantic tagging (the focus of Kohlhase's group) with formal verification (e.g., derivations-as-proofs using Lean). The formally verified math expressions need to be tied to use in narrative text (e.g., arxiv papers). For example, if I'm referring to "x" in a publication, is that the same "x" specified in a Lean-based proof? One way of answering is to use tags like

<unique_variable_id=42>x</unique_variable_id>
 
in the narrative text document, and then have Lean reference the ID 42 in derivations. There are more conventional uses of tags like
<properties name_of_variable="distance" dimension_of_variable="length">x</properties>
but those tags don't address the question of whether two variables refer to the same concept.

Summary

I predict that the conversion of arxiv .tex files to HTML will enable semantic tagging. This will intersect with the challenge of "how do I relate the use of variables and expressions across multiple proofs in Lean?"

Saturday, August 12, 2023

a chain of custody for papers in arxiv to formal proofs in Lean

Papers published in arxiv using Latex contain narrative text and mathematical expressions. How could we have confidence that the content of a published paper is related to a formal proof?

The narrative text would refer to symbols, expressions, units, and quantities. The expressions and symbols could be embedded in the Latex source using a package that provides macros.

The author's .tex would include loading a package like

\usepackage{pdg}[server_address=http://serveraddressforpdg.net]

Then when the author references an expression they would use

\begin{equation}
\pdgexp{192839}
\end{equation}

When compiled to PDF, the macro \pdgexp calls to the server address http://serveraddressforpdg.net to return the Latex expression. The same could be done for an inline reference to a symbol, like \pdgsymbol{93831911}

In this approach there wouldn't need to be an appendix for the derivation because the derivation steps are hosted on the PDG server.

The chain of custody would be

arxiv PDF -- arxiv .tex -- .tex contains references to PDG expression and symbol IDs -- PDG server contains derivation steps -- steps are validated using Lean


The macro might need to cache the PDG database locally so the server doesn't get called every time the .tex is compiled PDF. 
Also, allow the author to force a refresh of the cache.