As an alternative to S3, arxiv points to a subset that's available without going through AWS: https://www.cs.cornell.edu/projects/kddcup/datasets.html
The value of having a large number of expressions in Latex is that we could use them to predict what a user wants to enter, reducing the amount of manual entry required. Also, if a derivation contains expressions similar to those in the arxiv content, we could investigate whether the derivation is related to the arxiv paper that contains them.
Steps for working with arxiv data
Download papers (in .tex format) for a given domain. For each .tex file, separate the text content, the math, and the latex commands from one another.
Task: identify all latex commands.
Task: identify latex commands that alter how the math latex content should be read (e.g., \newcommand).
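A minimal sketch of the two tasks above, assuming regular expressions are good enough for a first pass. The function names, the sample string, and the list of "defining" commands are illustrative assumptions, not an exhaustive treatment of latex. Only inline math delimited by single dollar signs is handled; display math and environments such as equation or align would need their own patterns.

```python
import re

# A latex command is a backslash followed by letters (e.g. \frac),
# or a backslash followed by one non-letter character (e.g. \, or \%).
COMMAND_RE = re.compile(r"\\(?:[A-Za-z]+|[^A-Za-z])")

# Inline math between unescaped single dollar signs.
INLINE_MATH_RE = re.compile(r"(?<!\\)\$(.+?)(?<!\\)\$")

# Assumed (incomplete) set of commands that change how later math is read.
DEFINING_COMMANDS = {r"\newcommand", r"\renewcommand", r"\def", r"\DeclareMathOperator"}


def find_commands(tex):
    """Return every latex command in tex, in order of appearance."""
    return COMMAND_RE.findall(tex)


def find_defining_commands(tex):
    """Return only the commands that appear in DEFINING_COMMANDS."""
    return [cmd for cmd in find_commands(tex) if cmd in DEFINING_COMMANDS]


def extract_inline_math(tex):
    """Return the contents of each $...$ span, without the dollar signs."""
    return INLINE_MATH_RE.findall(tex)


sample = r"\newcommand{\R}{\mathbb{R}} Let $x \in \R$ and $\frac{a}{b} \leq 1$."
print(find_commands(sample))
print(find_defining_commands(sample))
print(extract_inline_math(sample))
```

Note that the sample defines \R before using it inside the math, which is exactly the case where the math cannot be interpreted without first reading the \newcommand.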
Before attempting to parse the math latex content, remove all presentation-related artifacts:
- replace '\left(' with '('
- replace '\right)' with ')'
- replace '\ ' with ' '
- replace '\,' with ' '
- replace '\quad' with ' '
- replace '\qquad' with ' '
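The replacements listed above could be sketched in Python as follows. The function name strip_presentation is hypothetical. Two details go slightly beyond the list and are assumptions on my part: the lookbehind that leaves the latex line break (a double backslash) untouched, and the final step that collapses runs of whitespace.

```python
import re

# (pattern, replacement) pairs mirroring the list of presentation artifacts.
PRESENTATION_PATTERNS = [
    (re.compile(r"\\left\("), "("),
    (re.compile(r"\\right\)"), ")"),
    # \quad and \qquad, but not a longer command that merely starts with them.
    (re.compile(r"\\q?quad(?![A-Za-z])"), " "),
    # Thin space and control space; skip them when preceded by a backslash,
    # so the line break "\\" followed by a space or comma is left alone.
    (re.compile(r"(?<!\\)\\,"), " "),
    (re.compile(r"(?<!\\)\\ "), " "),
]


def strip_presentation(math):
    """Remove spacing and sizing commands that carry no mathematical meaning."""
    for pattern, replacement in PRESENTATION_PATTERNS:
        math = pattern.sub(replacement, math)
    # Collapse the whitespace left behind by the substitutions.
    return re.sub(r"\s+", " ", math).strip()


print(strip_presentation(r"\left( a + b \right) \, c \quad d \qquad e\ f"))
```

The list only covers round brackets; \left[ and \right] or \left\{ and \right\} would need additional patterns.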
Sources to help with parsing math latex:
- within the math latex string to parse, what can be deduced about the expected context?
- given other math expressions in the same paper, what would be consistent?
- given the text in a paper surrounding the math expressions, what would be expected based on keywords?
- given other papers in the same domain or based on citations, what would be likely?
- what is statistically likely given the corpus of all articles?
- Use the Trie data structure to determine what the valid characters in the grammar should be. (Probably some subset of ASCII plus some Unicode characters.)
- What are the tokens/symbols of the language?
- What are the common sequences of tokens?
- What are the appropriate labels for the tokens?
- Instead of listing 10 different relational operators each time, create a group of relational operators and reference the group.
- What are some logical groupings of symbols?
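The questions about tokens, labels, and groups can be explored with a small tokenizer. This is a sketch under stated assumptions: the group names and their members are made up for illustration and are far from complete, and the tokenizer treats every single letter as its own identifier. It does not implement the trie mentioned above; it only shows how referencing a group replaces listing each relational operator separately, and how counting label pairs gives a first look at common token sequences.

```python
import re
from collections import Counter

# Hypothetical token groups; membership is illustrative, not exhaustive.
TOKEN_GROUPS = {
    "relational": {"=", "<", ">", r"\leq", r"\geq", r"\neq", r"\approx", r"\sim"},
    "binary_op": {"+", "-", "*", "/", r"\times", r"\cdot", r"\pm"},
    "grouping": {"(", ")", "{", "}", "[", "]"},
}

# A token is a latex command, a number, a single letter, or any other
# single non-whitespace character.
TOKEN_RE = re.compile(r"\\[A-Za-z]+|\d+(?:\.\d+)?|[A-Za-z]|\S")


def tokenize(math):
    """Split a math latex string into tokens, dropping whitespace."""
    return TOKEN_RE.findall(math)


def label(token):
    """Map a token to its group name, or to a generic fallback label."""
    for group, members in TOKEN_GROUPS.items():
        if token in members:
            return group
    if token.startswith("\\"):
        return "command"
    if token[0].isdigit():
        return "number"
    if token.isalpha():
        return "identifier"
    return "other"


def label_bigrams(expressions):
    """Count adjacent pairs of labels across a list of expressions."""
    counts = Counter()
    for expr in expressions:
        labels = [label(tok) for tok in tokenize(expr)]
        counts.update(zip(labels, labels[1:]))
    return counts


corpus = [r"x \leq 2 y", "a + b = c"]
for pair, count in label_bigrams(corpus).most_common():
    print(pair, count)
```

Run over a large set of arxiv expressions, counts like these would be one way to estimate what is statistically likely, though the labels would need far more care than this.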
That is, in some sense, the same process a human goes through to decode the intended meaning of any given math expression in a scientific paper. We are looking to encode that process as a Python program.