- how many total .tex files?
- how many english words per file?
- how many expressions total in the corpus?
- what is the distribution of (number of expressions) per file?
- what is the distribution of (ratio of words per file to expressions per file)?
- how many known LaTeX symbols appear across all the expressions?
- what is the distribution of (expression length in characters)?
- what is the distribution of (known symbols per expression)?
- are there character sequences that are extremely rare (for example, binary files hidden in .tex, or other anomalies)?
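Several of the per-file questions above can be answered with a short script. The following is a minimal sketch, not a robust parser: it assumes expressions are delimited only by `$...$`, `$$...$$`, `\(...\)`, `\[...\]`, or `equation`/`align` environments, it treats any run of ASCII letters outside math and commands as a "word", and `KNOWN_SYMBOLS` is a tiny hypothetical stand-in for a real symbol list. Escaped dollar signs, comments, and custom macros are not handled.

```python
import re

# Assumption: only these delimiters mark an expression.
MATH_PATTERN = re.compile(
    r"\$\$(.+?)\$\$"
    r"|\$(.+?)\$"
    r"|\\\((.+?)\\\)"
    r"|\\\[(.+?)\\\]"
    r"|\\begin\{(equation|align)\*?\}(.+?)\\end\{\5\*?\}",
    re.DOTALL,
)
COMMAND_PATTERN = re.compile(r"\\[A-Za-z]+")
WORD_PATTERN = re.compile(r"[A-Za-z]+")

# Hypothetical, deliberately small list of "known" symbols.
KNOWN_SYMBOLS = {r"\alpha", r"\beta", r"\gamma", r"\sum", r"\int", r"\frac"}


def extract_expressions(tex):
    """Return the body of every math span matched by MATH_PATTERN."""
    expressions = []
    for match in MATH_PATTERN.finditer(tex):
        d2, d1, paren, bracket, _env, env_body = match.groups()
        body = d2 or d1 or paren or bracket or env_body
        expressions.append(body.strip())
    return expressions


def file_stats(tex):
    """Compute the per-file counts asked about in the list above."""
    expressions = extract_expressions(tex)
    # Remove math and commands before counting words (a crude proxy
    # for "English words").
    prose = COMMAND_PATTERN.sub(" ", MATH_PATTERN.sub(" ", tex))
    words = WORD_PATTERN.findall(prose)
    known = [
        sum(1 for cmd in COMMAND_PATTERN.findall(e) if cmd in KNOWN_SYMBOLS)
        for e in expressions
    ]
    return {
        "num_words": len(words),
        "num_expressions": len(expressions),
        "words_per_expression": len(words) / len(expressions) if expressions else None,
        "expression_lengths": [len(e) for e in expressions],
        "known_symbols_per_expression": known,
    }


sample = r"Let $x$ be real. Then \[ \alpha + \beta = \frac{1}{2} \] holds."
stats = file_stats(sample)
```

Running `file_stats` over every .tex file and collecting the returned dictionaries gives the raw values for the histograms; corpus-wide totals are sums over the per-file entries.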
If we can establish that the sample being used is representative, we can work with a smaller data set rather than "all the .tex in arXiv". Showing that the shape of each distribution stops changing as more .tex files are added would indicate that the sample has converged.
If we find a domain whose distributions are not similar to the rest, we can investigate why it is anomalous.
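One way to make "the distribution shape does not change" concrete is to compare the empirical distribution over the first n files with the distribution over a slightly smaller prefix, and watch the difference shrink. The sketch below uses total variation distance as the comparison; that choice is an assumption here, and other measures would also work. It expects discrete values such as expressions per file; continuous quantities like ratios would need binning first. The function names are illustrative.

```python
from collections import Counter


def empirical_distribution(values):
    """Map each observed value to its relative frequency."""
    counts = Counter(values)
    total = len(values)
    return {value: count / total for value, count in counts.items()}


def total_variation(p, q):
    """Total variation distance between two discrete distributions (0 to 1)."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in support)


def convergence_trace(values, step):
    """Distance between successive prefixes of size n - step and n.

    Returns a list of (n, distance) pairs; distances that settle near
    zero suggest that adding more files no longer changes the shape.
    """
    trace = []
    previous = None
    for n in range(step, len(values) + 1, step):
        current = empirical_distribution(values[:n])
        if previous is not None:
            trace.append((n, total_variation(previous, current)))
        previous = current
    return trace


# Toy input: expressions-per-file counts for eight hypothetical files.
expressions_per_file = [1, 2, 1, 2, 1, 2, 1, 2]
trace = convergence_trace(expressions_per_file, step=2)
```

The same `total_variation` function can compare two domains directly: build one distribution per domain and flag pairs whose distance is much larger than the distance between two random halves of the same domain. The file order should be shuffled before computing the trace, since files sorted by date or subject would make the prefixes unrepresentative.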