In my previous post I outlined a sequence of steps, framed negatively around how difficult each step would be. A positive framing of the sequence is:
- Analyze .tex sources from arXiv, accounting for issues like encoding, misspellings, malformed LaTeX, and macro expansion.
- Once the math (e.g., $x$) and expressions are separated from the text, tokenize the variables within expressions.
- Once variables are tokenized within expressions, identify the concept each one represents (e.g., the name of a constant) based on the text of the paper.
- Reconcile variables across different .tex files in arXiv.
- Create an interface that provides users with semantically enriched arXiv content, indexed for search queries.
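The first two steps can be sketched with a minimal (and deliberately naive) Python snippet. The regexes below are illustrative assumptions, not a real arXiv-grade parser: they only handle well-formed inline `$...$` math and treat single letters and Greek macros as variables.

```python
import re

def extract_inline_math(tex: str) -> list[str]:
    """Pull out $...$ spans from a LaTeX source string (naive: no nesting, no \\[...\\])."""
    return re.findall(r"\$([^$]+)\$", tex)

def tokenize_variables(expr: str) -> list[str]:
    """Very naive tokenizer: single letters and Greek macros (e.g. \\alpha) count as variables."""
    return re.findall(r"\\[a-zA-Z]+|[a-zA-Z]", expr)

sample = r"The model assumes $E = m c^2$ where $m$ is the rest mass."
exprs = extract_inline_math(sample)
# exprs -> ['E = m c^2', 'm']
tokens = tokenize_variables(exprs[0])
# tokens -> ['E', 'm', 'c']
```

A production pipeline would need a real LaTeX parser with macro expansion, which is exactly why step 1 is listed as its own hard problem.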
Suppose we have completed step 2 and everything in a document is correctly tokenized (or even just a fraction of the content). The follow-on step (3) would be to detect the definitions of the tokens from the surrounding text. For example, if the variable "a" shows up in an expression, $a$ also shows up in the text, and the text reads something like

"where $a$ is the number of cats in the house"

then we can deduce that "a" is defined as "number of cats in the house".
Step 4 would be to figure out whether "a" is used similarly in other papers. Shared usage would indicate that the papers are related by topic. See, for example, https://arxiv.org/pdf/1902.00027.pdf
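One simple way to sketch step 4 is to index papers by (symbol, definition) pairs and look up which papers share a pair. The paper IDs and definitions below are made up purely for illustration.

```python
from collections import defaultdict

# Toy corpus: paper id -> extracted (symbol, definition) pairs. All data is invented.
papers = {
    "paper1": [("a", "number of cats in the house")],
    "paper2": [("a", "number of cats in the house"), ("b", "number of dogs")],
    "paper3": [("a", "lattice constant")],
}

# Invert the corpus: each (symbol, definition) pair maps to the papers using it.
index = defaultdict(set)
for pid, defs in papers.items():
    for sym, meaning in defs:
        index[(sym, meaning)].add(pid)

related = index[("a", "number of cats in the house")]
# -> {'paper1', 'paper2'}
```

In practice the definitions would need normalization (synonyms, paraphrases) before exact-match grouping like this could work.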
Another use case for tokenized text (step 2) with some semantic meaning attached (step 3) would be validating expressions. If the expression is "a = b" and the two variables have different units, the expression is dimensionally inconsistent and therefore wrong.
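The dimensional check itself is straightforward once units are known. A minimal sketch, representing each unit as exponents of SI base units (the unit assignments here are illustrative, not extracted from any paper):

```python
# Represent a unit as a dict of base-unit exponents, e.g. speed = m^1 * s^-1.
METER = {"m": 1}
SECOND = {"s": 1}
SPEED = {"m": 1, "s": -1}

def same_units(lhs: dict, rhs: dict) -> bool:
    """An equation lhs = rhs is dimensionally consistent only if every exponent matches."""
    keys = set(lhs) | set(rhs)
    return all(lhs.get(k, 0) == rhs.get(k, 0) for k in keys)

same_units(METER, SECOND)  # False: "a = b" fails if a is a length and b is a time
same_units(SPEED, SPEED)   # True
```

The hard part, of course, is inferring the units of each variable from the paper's text in the first place, which is again step 3.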