The first step would be to extract pages that contain derivations. Pages with a section whose title contains "derivation" and whose body holds at least three mathematical expressions would be a useful set. Suppose that yields a thousand derivation pages, each a mix of text and LaTeX.
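A rough filter over a dump of raw wikitext might look like the sketch below. This is a minimal sketch, assuming the input is wikitext with `== Title ==` headings and `<math>...</math>` tags; `has_derivation_section` is a name I made up, and a real pipeline would likely use a proper parser such as mwparserfromhell instead of regexes.

```python
import re

def has_derivation_section(wikitext, min_expressions=3):
    """Return True if the page has a section whose title contains
    "derivation" and whose body holds at least min_expressions
    <math>...</math> expressions. Deliberately naive wikitext handling."""
    # Split the page at section headings of the form == Title ==,
    # keeping the headings so they can be paired with their bodies.
    parts = re.split(r"(^=+[^=\n]+=+[ \t]*$)", wikitext, flags=re.MULTILINE)
    for heading, body in zip(parts[1::2], parts[2::2]):
        if "derivation" in heading.lower():
            if len(re.findall(r"<math", body, flags=re.IGNORECASE)) >= min_expressions:
                return True
    return False
```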
Side note: Wikipedia "text is available under the Creative Commons Attribution-ShareAlike License."
Given 1000 pages of text+LaTeX, there are two nested challenges:
- Between any two adjacent expressions in your data set, there are likely a bunch of missing steps.
- Suppose all the expressions were present. Even in that situation, the inference rules would still be missing. Filling these in is a big challenge.
To address these challenges, text analysis would be useful. Suppose the sequence is
- text1
- expression1
- text2
- expression2
- text3
- expression3
- text4
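One minimal way to recover this alternating sequence is to split a derivation section on its `<math>...</math>` blocks. This is a sketch under the same wikitext assumption as above, and `segment_section` is a hypothetical helper name.

```python
import re

def segment_section(section_text):
    """Split a derivation section into an alternating list
    [("text", text1), ("expression", expr1), ("text", text2), ...]."""
    pieces = re.split(r"(<math[^>]*>.*?</math>)", section_text, flags=re.DOTALL)
    segments = []
    for i, piece in enumerate(pieces):
        # Captured <math> blocks land at the odd indices of the split.
        kind = "expression" if i % 2 == 1 else "text"
        segments.append((kind, piece.strip()))
    return segments
```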
There are a few distinct categories of text to analyze:
- s1 = the last two sentences of "text1" preceding "expression1"
- s(i) = if text2 and text3 are short (i.e., a few sentences), then they are potential inference rules
- s(j) = if text2 and text3 are longer than a few sentences, then probably the two sentences following an expression and the two sentences preceding an expression are relevant
- sf = the first two sentences of "text4", which is the text after the last expression.
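Pulling the four categories out of the segmented sequence could look roughly like this. The sentence splitter is deliberately naive (a real pipeline would use NLTK or spaCy), and `short_threshold` is an arbitrary cutoff for "a few sentences."

```python
import re

def sentences(text):
    """Very naive sentence splitter: break on terminal punctuation."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def categorize(segments, short_threshold=3):
    """Collect s1, sf, and the between-expression candidates s(i) / s(j)
    from the (kind, text) segments produced by segment_section."""
    texts = [t for kind, t in segments if kind == "text"]
    s1 = sentences(texts[0])[-2:]    # last two sentences before the first expression
    sf = sentences(texts[-1])[:2]    # first two sentences after the last expression
    s_i, s_j = [], []
    for t in texts[1:-1]:            # text between adjacent expressions
        sents = sentences(t)
        if len(sents) <= short_threshold:
            s_i.append(sents)                    # short: potential inference rule
        else:
            s_j.append(sents[:2] + sents[-2:])   # long: keep sentences nearest the expressions
    return s1, sf, s_i, s_j
```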
We now have 1000 instances of "s1" sentences. In this "s1" data set, what's the most common word? What's the most common two-word phrase? What's the most common three-word phrase? If there are things that look like inference rules, that would be interesting. I doubt that "declare initial expression" will appear, but some consistency would be validating.
Similarly, run the same word and phrase frequency analysis for the 1000 "sf" sentences, and apply it to each of "s(i)" and "s(j)" as well.
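The word and phrase frequencies are then a straightforward n-gram count over each category. A sketch with naive lowercase whitespace tokenization; `all_s1` in the usage comment is an assumed name for the collection of s1 sentence lists, one per page.

```python
from collections import Counter

def ngram_counts(sentence_lists, n):
    """Count n-grams across a collection of sentence lists,
    e.g. the ~1000 "s1" instances gathered above."""
    counts = Counter()
    for sents in sentence_lists:
        for sent in sents:
            tokens = sent.lower().split()
            counts.update(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return counts

# Usage sketch: most common word, two-word phrase, and three-word phrase.
# for n in (1, 2, 3):
#     print(ngram_counts(all_s1, n).most_common(10))
```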