Friday, July 20, 2018

analyzing the text of Wikipedia posts

In a previous post, an outline for analyzing Wikipedia content was described. In this post, I document a few initial observations about the data collected from Wikipedia.

Searching for "derivation" as a section marker means searching for headings like "=== Derivation ===". The word has other meanings, so the results sometimes include non-mathematical sections such as "=== Derivation and other names ===". To filter out irrelevant content, keep only the sections that contain mathematical expressions (i.e., lines with ":<math>").
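The filter described above can be sketched in Python. This is a minimal illustration, not the code used for the data collection: the function name derivation_sections is hypothetical, and it assumes the wikitext of a page is already available as a string, that a section ends at the next heading of any level, and that only level-3 headings ("=== ... ===") are of interest.

```python
import re

def derivation_sections(wikitext):
    """Return (title, body) pairs for level-3 sections whose title mentions
    'derivation' and whose body contains at least one ':<math>' marker."""
    sections = []
    title, body = None, []
    for line in wikitext.splitlines():
        stripped = line.strip()
        if stripped.startswith("==") and stripped.endswith("=="):
            # Assumption: any heading closes the section being collected.
            if title is not None:
                sections.append((title, "\n".join(body)))
            # Only level-3 headings open a new candidate section.
            m = re.fullmatch(r"===([^=].*?)===", stripped)
            title = m.group(1).strip() if m else None
            body = []
        elif title is not None:
            body.append(line)
    if title is not None:
        sections.append((title, "\n".join(body)))
    # Keep sections that are about a derivation AND contain math markup.
    return [(t, b) for t, b in sections
            if "derivation" in t.lower() and ":<math>" in b]
```

With this filter, a section titled "Derivation and other names" that contains no ":<math>" line is dropped, while a "Derivation" section with at least one expression is kept.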

In addition to the text, there are potentially relevant images like
https://en.wikipedia.org/wiki/File:Derivation_of_acoustic_wave_equation.png
which has dimensions 813 × 570 pixels. Pictures with "derivation" in the file name and dimensions greater than 300 × 300 pixels might be relevant.
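The image heuristic above amounts to a simple predicate. The sketch below assumes the file name and pixel dimensions have already been obtained by some other means; the function name is_candidate_image and the min_side parameter are hypothetical, and the 300-pixel threshold is the guess stated above rather than a tested value.

```python
def is_candidate_image(filename, width, height, min_side=300):
    """Return True if the file name mentions 'derivation' and both
    dimensions (in pixels) exceed min_side."""
    return (
        "derivation" in filename.lower()
        and width > min_side
        and height > min_side
    )
```

For the example file, is_candidate_image("Derivation_of_acoustic_wave_equation.png", 813, 570) evaluates to True, since both 813 and 570 exceed 300.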

In the "derivation" section, lines that start with ":<math>" are expressions. The closing tag "</math>" may occur on a following line.
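A minimal sketch of that extraction, including the case where the closing tag is on a later line, might look like the following. The function name extract_expressions is hypothetical. It assumes the opening marker is exactly ":<math>" at the start of a line (variants with attributes or extra indentation are not handled), and it silently drops an expression whose closing tag is never found.

```python
def extract_expressions(section_text):
    """Collect the contents of ':<math> ... </math>' blocks, where the
    closing tag may appear on a later line than the opening marker."""
    opening = ":<math>"
    closing = "</math>"
    expressions = []
    current = None  # lines of the expression being collected, or None
    for line in section_text.splitlines():
        if current is None:
            if not line.startswith(opening):
                continue  # ordinary prose line outside any expression
            current = []
            line = line[len(opening):]
        if closing in line:
            # Keep only the text before the closing tag, then finish.
            current.append(line.split(closing, 1)[0])
            expressions.append("\n".join(current).strip())
            current = None
        else:
            current.append(line)
    return expressions
```

Line breaks inside a multi-line expression are preserved, so the original layout of the markup can still be inspected later.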
