Saturday, May 28, 2022

Next steps once math expressions are tokenized

In my previous post I outlined a sequence of steps, with a negative framing of how difficult each step would be. A positive framing of the same sequence is

  1. Analyze .tex files from arxiv, accounting for issues like encoding, misspellings, malformed LaTeX, and macro expansion.
  2. Once the math (e.g., $x$) and expressions are separated from the text, tokenize variables within expressions.
  3. Once variables are tokenized within expressions, identify the concept (e.g., names of constants) each token refers to, based on the text in the paper.
  4. Reconcile variables across different .tex files in arxiv.
  5. Create an interface providing semantically enriched arxiv content that is indexed for user search queries.
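Step 2 above can be sketched in a few lines. This is a minimal, hedged illustration: the regex and the token categories are my own assumptions, and real arxiv content would need a proper grammar (or LaTeXML) rather than a regex.

```python
import re

# Tokenize a LaTeX math expression into commands, variables, numbers,
# and operators. The categories below are assumptions for illustration.
TOKEN_PATTERN = re.compile(
    r"(\\[A-Za-z]+)"      # LaTeX commands, e.g. \alpha, \frac
    r"|([A-Za-z])"        # single-letter variables
    r"|(\d+\.?\d*)"       # numbers
    r"|([=+\-*/^_{}()])"  # operators and grouping
)

def tokenize(expression: str) -> list[str]:
    """Return the non-whitespace tokens of a LaTeX expression."""
    return [m.group(0) for m in TOKEN_PATTERN.finditer(expression)]

print(tokenize(r"F = m a"))    # ['F', '=', 'm', 'a']
print(tokenize(r"E = m c^2"))  # ['E', '=', 'm', 'c', '^', '2']
```

Anything this pattern can't classify (e.g., multi-letter identifiers like "sin" written without a backslash) is silently dropped, which is exactly the kind of gap a real grammar would have to close.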

Suppose we are at step 2 and everything in a document is correctly tokenized (or even if just a fraction of the content is tokenized). The follow-on step (3) would be to detect the definitions of the tokens from the text. For example, if the variable "a" shows up in an expression, and $a$ shows up in the text, and the text is something like

"where $a$ is the number of cats in the house"

then we can deduce that "a" is defined as "number of cats in the house".
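A simple version of that deduction can be done with pattern matching. The "where $x$ is ..." pattern below is an assumption; real papers use many phrasings ("denotes", "represents", "stands for", ...), so this is a starting point, not a complete solution.

```python
import re

# Match sentences of the form: where $VAR$ is [the] DEFINITION
# The pattern is an illustrative assumption, not a general solution.
DEFINITION_PATTERN = re.compile(
    r"where \$(?P<var>[A-Za-z])\$ is (?:the )?(?P<definition>[^.,;]+)"
)

def extract_definition(sentence: str):
    """Return (variable, definition) if the sentence matches, else None."""
    match = DEFINITION_PATTERN.search(sentence)
    if match is None:
        return None
    return match.group("var"), match.group("definition").strip()

print(extract_definition("where $a$ is the number of cats in the house"))
# ('a', 'number of cats in the house')
```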

Step 4 would be to figure out if "a" is used similarly in other papers. That would indicate a relation of the papers based on the topic of the content. See for example https://arxiv.org/pdf/1902.00027.pdf


Another use case for tokenized expressions (step 2) with some semantic meaning (step 3) would be to validate the expressions. If the expression is "a = b" and the two variables have different units, then the expression is dimensionally inconsistent and therefore wrong.
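As a sketch of that validation, suppose step 3 produced a table of per-variable dimensions. The dimension vectors (length, mass, time) and the variable table below are invented for illustration.

```python
# Hypothetical per-variable dimensions, as (length, mass, time) exponents.
DIMENSIONS = {
    "a": (1, 0, 0),   # a has dimensions of length
    "b": (1, 0, 0),   # b also has dimensions of length
    "t": (0, 0, 1),   # t has dimensions of time
}

def equality_is_dimensionally_consistent(lhs: str, rhs: str) -> bool:
    """Both sides of an equality must have the same dimensions."""
    return DIMENSIONS[lhs] == DIMENSIONS[rhs]

print(equality_is_dimensionally_consistent("a", "b"))  # True
print(equality_is_dimensionally_consistent("a", "t"))  # False
```

A full checker would have to propagate dimensions through compound expressions (products multiply the exponent vectors, sums require matching vectors), but the single-comparison case already catches "a = t" style errors.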

Searchable Latex, semantic enrichment, and reconciling variables in arxiv

Inspired by searchonmath.com, I spent time attempting to recreate and then extend that effort.
  1. Analyzing .tex files from arxiv is hard due to "minor" issues like encoding, misspellings, malformed LaTeX, and macro expansion. (LaTeXML may help with macro expansion?)
  2. Once the math (e.g., $x$) and expressions are separated from the text, I don't have a good way of separating variables within expressions. (I think this is where grammar explorations are helpful. LaTeXML may also be able to do this, but I haven't gotten it working yet.)
  3. Once variables are tokenized within expressions, identifying the concepts (e.g., names of constants) is burdensome. (Needs manual annotation -- MioGatto -- or NLP or both.)
  4. Reconcile variables across different .tex files in arxiv.
  5. Create an interface providing semantically enriched arxiv content that is indexed for user search queries. (Something like searchonmath.com but with semantic enrichment of variables.)

Getting to 5 is still a long way from the feature set I'm trying to demonstrate with the Physics Derivation Graph! To be specific, filling in missing derivation steps and checking the consistency of expressions and the correctness of derivation steps would be additional work.

That leads me to the conclusion that I should focus my PDG efforts on building an exemplar destination rather than spend my time implementing steps 1-5. Both are relevant, and both require a lot of hard labor and creative work. My plan is to continue work on the Neo4j-based property graph.

Friday, May 27, 2022

PDG as dedicated website, and PDG-as-a-service, and PDG as an overlay for arxiv

Two ways to present the Physics Derivation Graph are as a website (currently https://derivationmap.net/ ) and as an API (see https://derivationmap.net/api/v1/resources/derivations/list as an example from https://derivationmap.net/api/v1/documentation ).

A third way to present the content would be as an overlay for existing content, e.g. https://arxiv.org/ . 

Related: Comments on papers

The content overlay concept has been explored primarily for comments. For example, active efforts include

For more related projects, see https://reimaginereview.asapbio.org/

Illustration of the comment section for research papers: https://phdcomics.com/comics.php?f=1178

Inactive efforts:


Another overlay: variable identification

  • Find the same variable referenced in multiple papers
  • Find the same expression referenced in multiple papers
Having all of the metadata for every arxiv paper would be a good starting point for cross-referencing the variables.
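Given per-paper variable definitions (the output of step 3 in the list above), finding the same variable across papers reduces to building an inverted index. The paper IDs and definitions below are made up for illustration.

```python
from collections import defaultdict

# Hypothetical step-3 output: per-paper variable definitions.
definitions_per_paper = {
    "arxiv:1111.0001": {"c": "speed of light", "m": "mass"},
    "arxiv:2222.0002": {"c": "speed of light", "x": "position"},
}

# Invert: (variable, meaning) -> set of papers using that definition.
index = defaultdict(set)
for paper_id, definitions in definitions_per_paper.items():
    for variable, meaning in definitions.items():
        index[(variable, meaning)].add(paper_id)

# Papers that define $c$ the same way are candidates for being related.
print(sorted(index[("c", "speed of light")]))
# ['arxiv:1111.0001', 'arxiv:2222.0002']
```

The hard part is not the index but normalizing the definitions, since "speed of light" and "light speed" should land in the same bucket.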

Example of active effort:
  • Subscription-based LaTeX search of equations: https://www.searchonmath.com/ ; $0.99 for the first month as of 2022-05-26, then $4.50/month after. Can search either web content (stackoverflow, wikipedia) or arxiv content, but not both at once.
Inactive effort:

Friday, May 20, 2022

what is using disk space on the web server?

https://stackoverflow.com/a/15142053/1164295

$ du -cBM --max-depth=1 2> >(grep -v 'Permission denied') | sort -n 
0M	./dev
0M	./proc
0M	./sys
1M	./lost+found
1M	./media
1M	./mnt
1M	./root
1M	./srv
2M	./run
7M	./etc
11M	./opt
74M	./tmp
151M	./boot
1329M	./home
1497M	./snap
2535M	./usr
3820M	./var
9421M	.
9421M	total
Confusingly, that total doesn't seem consistent with the output of df. (One likely cause: du was run without sudo, so it skips directories it can't read; df also counts space held by deleted-but-still-open files.)
$ df -h
Filesystem      Size  Used Avail Use% Mounted on
udev            474M     0  474M   0% /dev
tmpfs            99M  1.2M   97M   2% /run
/dev/vda1        25G   20G  4.4G  82% /
tmpfs           491M     0  491M   0% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
tmpfs           491M     0  491M   0% /sys/fs/cgroup
/dev/vda15      105M  9.2M   96M   9% /boot/efi
tmpfs            99M     0   99M   0% /run/user/0
tmpfs            99M     0   99M   0% /run/user/1000
/dev/loop4       68M   68M     0 100% /snap/lxd/22526
/dev/loop2       44M   44M     0 100% /snap/snapd/15177
/dev/loop3       56M   56M     0 100% /snap/core18/2344
/dev/loop5       68M   68M     0 100% /snap/lxd/22753
/dev/loop0       62M   62M     0 100% /snap/core20/1405
/dev/loop6       45M   45M     0 100% /snap/snapd/15534
/dev/loop7       62M   62M     0 100% /snap/core20/1434
overlay          25G   20G  4.4G  82% /var/lib/docker/overlay2/b1e93808993411941a56eeab3447a9620dabf64956633befd4f4997c00d3bfea/merged
shm              64M     0   64M   0% /var/lib/docker/containers/dd7ef352d6ba8fa022bde66cc083c81c868ecc492b41eb31725cbd3d44e41297/mounts/shm
overlay          25G   20G  4.4G  82% /var/lib/docker/overlay2/37c02acbf47a52998e26eb679988396a263c4b2bc723435a7e185d999adb3554/merged
shm              64M     0   64M   0% /var/lib/docker/containers/4ca979c1faea9fee6b29a1bcebbea5b1897aabcb8f5e6b4e3844b52a90f481e7/mounts/shm
/dev/loop8       56M   56M     0 100% /snap/core18/2409

Disk usage savings #1: shrink the systemd journal

https://askubuntu.com/a/1238221 and https://unix.stackexchange.com/a/130802/431711 and https://wiki.archlinux.org/title/Systemd/Journal
cd /var/log/journal
sudo journalctl --vacuum-time=10d

Disk usage savings #2: remove unused Docker images

docker images | tr -s " " | grep "<none> <none>" | cut -d' ' -f3 | xargs docker rmi

(The tr has to run before the grep, since docker images pads its columns with multiple spaces. The built-in docker image prune accomplishes the same removal of dangling images.)

Tuesday, May 3, 2022

observations on the conversion of the backend from JSON to property graph (Neo4j)

The JSON backend for the Physics Derivation Graph 

  • is concise -- only the fields necessary are present 
  • is easily readable -- plain text and not much nesting
  • requires significant investment to construct queries
  • is static in terms of dependencies; unlikely to degrade or require maintenance
The property graph (in Neo4j) backend
  • supports user-provided queries
  • adds maintenance risk of keeping up with changes to Cypher and Neo4j

Sunday, March 6, 2022

Changing from JSON to property graph + SQL backend

A few recent conversations with scientists about the Physics Derivation Graph have led me to think about different queries (247, 243, 241, 240, 239, 238) that could be of value and can be extracted from the current content.

In coming up with ways to query the graph, I realized a property graph is useful for supporting the queries. That is in contrast to writing a custom query capability against my existing JSON format. I'm already embarrassed by the JSON/SQL implementation, so having specific queries of interest provided me sufficient motivation to investigate implementing a Neo4j backend.

Transitioning to a property graph (specifically Neo4j) means losing fine-grained control over the mechanics of the graph. However, the trade-off is well worth the increased development speed. Having a Cypher query interface via the web GUI is very powerful.


With the property graph representation, supporting information is needed in tabular form:

  • all possible inference rules, along with the CAS implementation per rule
  • all variable definitions, along with dimensions, constant or variable, scope, and reference URLs
  • units, along with dimensions, and reference URLs

Those three tables could be stored in an SQL database. 
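As a sketch, the three tables could look like the following in SQLite. The table and column names are my own assumptions based on the list above, not an existing PDG schema.

```python
import sqlite3

# Three supporting tables, using SQLite as an illustrative SQL backend.
connection = sqlite3.connect(":memory:")
connection.executescript("""
CREATE TABLE inference_rule (
    name TEXT PRIMARY KEY,
    cas_implementation TEXT        -- e.g., the CAS call per rule
);
CREATE TABLE variable_definition (
    symbol TEXT,
    definition TEXT,
    dimensions TEXT,
    is_constant INTEGER,           -- 0 = variable, 1 = constant
    scope TEXT,
    reference_url TEXT
);
CREATE TABLE unit (
    name TEXT PRIMARY KEY,
    dimensions TEXT,
    reference_url TEXT
);
""")
connection.execute(
    "INSERT INTO unit VALUES (?, ?, ?)",
    ("meter", "length", "https://en.wikipedia.org/wiki/Metre"),
)
rows = connection.execute("SELECT name, dimensions FROM unit").fetchall()
print(rows)  # [('meter', 'length')]
```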

I'm replacing a single plaintext JSON file with two non-plaintext data formats -- SQL and Neo4j. 

Monday, February 21, 2022

Lots of tasks for 2022; what are the priorities

With the JSON/SQL implementation, I showed myself that what I was imagining (Latex entry, CAS integration, symbol tracking, Latex/PDF output) was in fact feasible. However, the JSON/SQL backend and the forms-based web front-end were sufficiently embarrassing that I wasn't interested in showing off the idea. 

Now, with the Neo4j/SQL backend, my goal is 1) to provide query capability and 2) to not be embarrassed.


High priority:

Low priority: 

  • analysis of server logs -- https://github.com/allofphysicsgraph/proofofconcept/issues/246