Saturday, May 28, 2022

Next steps once math expressions are tokenized

In my previous post I outlined a sequence of steps, framed negatively in terms of how difficult each step would be. A positive framing of the sequence is:

  1. Analyze .tex files from arXiv, accounting for issues like encoding, misspellings, malformed LaTeX, and macro expansion.
  2. Once the math (e.g., $x$) and expressions are separated from the surrounding text, tokenize the variables within each expression (see the sketch after this list).
  3. Once variables are tokenized within expressions, identify the concept each one refers to (e.g., the name of a constant) based on the text of the paper.
  4. Reconcile variables across different .tex files in arXiv.
  5. Create an interface that provides semantically enriched arXiv content, indexed for user search queries.
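
As an illustration of step 2, here is a minimal sketch in Python, assuming the math expressions have already been isolated as plain strings. The regex, the function name tokenize_expression, and the example expression are my own placeholders; a real pipeline would need a proper LaTeX parser rather than a regex.

    import re

    # Rough tokenizer for variables inside an already-isolated math expression.
    # It only recognizes single Latin letters and \command tokens; real LaTeX
    # needs a proper parser.
    TOKEN_RE = re.compile(r"\\[A-Za-z]+|[A-Za-z]")

    def tokenize_expression(expr):
        """Return candidate variable and command tokens from a math expression."""
        return TOKEN_RE.findall(expr)

    print(tokenize_expression(r"a = \frac{b}{c} + \alpha"))
    # ['a', '\\frac', 'b', 'c', '\\alpha']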

Suppose we are at step 2 and everything in a document is correctly tokenized (or even just a fraction of the content). The follow-on step (3) would be to detect definitions of the tokens from the surrounding text. For example, if the variable "a" shows up in an expression, $a$ shows up in the text, and the text reads something like

"where $a$ is the number of cats in the house"
then we can deduce that "a" is defined as "the number of cats in the house".
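
A minimal sketch of that deduction, assuming the surrounding text is available as a plain string. The pattern only covers the "where $a$ is ..." phrasing, and the names DEF_RE and extract_definitions are hypothetical:

    import re

    # Naive pattern for definitions phrased as "where $a$ is ...".
    # It captures the rest of the clause up to a period or comma.
    DEF_RE = re.compile(r"where\s+\$(?P<var>[A-Za-z])\$\s+is\s+(?P<definition>[^.,]+)")

    def extract_definitions(text):
        """Map each defined variable to the phrase that defines it."""
        return {m.group("var"): m.group("definition").strip()
                for m in DEF_RE.finditer(text)}

    print(extract_definitions("where $a$ is the number of cats in the house."))
    # {'a': 'the number of cats in the house'}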

Step 4 would be to figure out whether "a" is used similarly in other papers. That would indicate a relation between the papers based on the topic of their content. See, for example, https://arxiv.org/pdf/1902.00027.pdf
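
One crude way to compare usage across papers, assuming definitions have already been extracted as in step 3, is word overlap between the definition strings. The Jaccard score below is just an illustration, not the method of the linked paper:

    def definition_similarity(def_a, def_b):
        """Crude word-overlap (Jaccard) score between two definition strings."""
        words_a, words_b = set(def_a.lower().split()), set(def_b.lower().split())
        if not words_a or not words_b:
            return 0.0
        return len(words_a & words_b) / len(words_a | words_b)

    # Papers whose shared variables have high-scoring definitions would be
    # candidates for being related by topic.
    print(definition_similarity("number of cats in the house",
                                "number of cats per household"))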


Another use case for tokenized text (step 2) with some semantic meaning attached (step 3) would be to validate expressions. If the expression is "a = b" and the two variables have different units, the expression is wrong.
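
A toy sketch of that check, assuming units have been attached to variables during definition extraction. The unit strings and the check_equation helper are hypothetical, and a real checker would use a units library such as pint:

    # Toy dimensional check for an expression "lhs = rhs".  The unit strings
    # here are hypothetical placeholders that would be attached during
    # definition extraction.
    units = {"a": "meter", "b": "second"}

    def check_equation(lhs, rhs, units):
        """Flag an equation whose two sides carry different units."""
        if units.get(lhs) != units.get(rhs):
            print(f"{lhs} = {rhs} is suspect: {units.get(lhs)} vs {units.get(rhs)}")
        else:
            print(f"{lhs} = {rhs} is dimensionally consistent")

    check_equation("a", "b", units)  # prints that a = b is suspect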
