Saturday, May 28, 2022

Searchable LaTeX, semantic enrichment, and reconciling variables in arXiv

Inspired by searchonmath.com, I spent time attempting to recreate and then extend that effort. The steps, and the difficulties I ran into:
  1. Analyzing .tex files from arXiv is hard due to "minor" issues like encoding, misspellings, malformed LaTeX, and macro expansion. (LaTeXML may help with macro expansion?)
  2. Once the math (e.g. $x$) and expressions are separated from the text, I don't have a good way of separating variables within expressions. (I think this is where grammar explorations are helpful. LaTeXML may also be able to do this, but I haven't gotten it working yet.)
  3. Once variables are tokenized within expressions, identifying the concept each one represents (e.g., the name of a constant) is burdensome. (This needs manual annotation -- MioGatto -- or NLP or both.)
  4. Reconcile variables across different .tex files in arXiv.
  5. Create an interface that gives users semantically enriched arXiv content indexed for search queries. (Something like searchonmath.com but with semantic enrichment of variables.)
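
To make steps 1 and 2 concrete, here is a minimal Python sketch of pulling inline math out of a LaTeX string and naively guessing which tokens are variables. It is not what searchonmath.com or LaTeXML do. The function names, the regular expressions, and the NON_VARIABLE_COMMANDS set are all hypothetical choices made for illustration, and the sketch assumes only single-dollar inline math with no macro expansion.

```python
import re

# Inline math between single, unescaped dollar signs.
# Assumption: $$...$$, \( ... \), and equation environments are ignored.
INLINE_MATH = re.compile(r"(?<![\\$])\$([^$]+?)\$(?!\$)")

# A token is a LaTeX command (\alpha), a number, a single letter,
# or any other single non-space character.
TOKEN = re.compile(r"\\[A-Za-z]+|\d+(?:\.\d+)?|[A-Za-z]|\S")

# Hypothetical, incomplete list of commands that are operators or
# layout rather than variables.
NON_VARIABLE_COMMANDS = {
    "\\frac", "\\sqrt", "\\sin", "\\cos", "\\exp", "\\log",
    "\\left", "\\right", "\\cdot", "\\times",
}


def extract_inline_math(tex):
    """Return the contents of each $...$ span, in document order."""
    return [m.group(1).strip() for m in INLINE_MATH.finditer(tex)]


def tokenize(expr):
    """Split one math expression into a flat list of tokens."""
    return TOKEN.findall(expr)


def candidate_variables(expr):
    """Guess variables: single letters plus commands not known to be operators.

    Duplicates are dropped and first-seen order is kept.
    """
    seen = []
    for tok in tokenize(expr):
        is_letter = len(tok) == 1 and tok.isalpha()
        is_command = tok.startswith("\\") and tok not in NON_VARIABLE_COMMANDS
        if (is_letter or is_command) and tok not in seen:
            seen.append(tok)
    return seen


if __name__ == "__main__":
    sample = r"The energy $E = m c^2$ costs \$5, and $\alpha x$ too."
    for expr in extract_inline_math(sample):
        print(expr, "->", candidate_variables(expr))
```

The guess breaks down quickly, which is the point of step 2: "dx" splits into two letters, a multi-letter name typed without a backslash splits into its characters, and subscripts are not attached to their base symbol. A grammar, or LaTeXML's parse, would be needed to do better.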

Even reaching step 5 would still be a long way from the feature set I'm trying to demonstrate with the Physics Derivation Graph (PDG)! Specifically, filling in missing derivation steps, checking the consistency of expressions, and checking the correctness of derivation steps would all be additional work.

That leads me to the conclusion that I should focus my PDG efforts on building an exemplar destination rather than spend my time implementing steps 1-5. Both are relevant, and both take a lot of hard labor and creative work. My plan is to continue work on the Neo4j-based property graph.
