Sunday, December 18, 2022

MONTEREY PHOENIX grammar, keywords, and common patterns

MONTEREY PHOENIX grammar

In this table A is an event and B and C are events in A

description of pattern	grammar rule
Ordered sequence of events (B followed by C)	A: B C;
Alternative events (B or C)	A: ( B \| C );
Optional event (B or no event at all)	A: [ B ];
Ordered sequence of zero or more events B	A: (* B *);
Ordered sequence of one or more events B	A: (+ B +);
Unordered set of events B and C (B and C may happen concurrently)	A: { B, C };
Unordered set of zero or more events B	A: {* B *};
Unordered set of one or more events B	A: {+ B +};

description of pattern	grammar rule
name of model	SCHEMA
Identifies an actor	ROOT
event precedence across swimlanes	COORDINATE
if and only if	<->
	<!>
a condition that each valid trace should satisfy	ENSURE
	FOREACH
	BUILD
	DISJ
	MARK
	ONFAIL
	CHECK
	AFTER
	EXISTS
	MAP
	CLEAR
	SHOW
message box that prints text	SAY("hello world")

Common pattern: branching logic in swimlane

ROOT Driver: enters_car (starts_car | exits_car);
starts_car:  move_forward
             stops_car
             exits_car;

Common pattern: dependency across swimlanes

COORDINATE
$a: turn_steering_wheel_right FROM driver,
$b: right FROM front_tires
DO 
ADD $a PRECEDES $b; 
OD;

Common pattern: shared events

Driver, Car SHARE ALL move_forward, 
                      turn_left, 
                      turn_right;

MONTEREY PHOENIX model checker

In this blog post I discuss a model involving two actors: a car and a driver. Each actor has an associated set of actions they can take.

For a video walkthrough of this code, see https://youtu.be/r6aPKem6WLs

Using the web interface https://firebird.nps.edu/, run this model

SCHEMA using_a_car /* model title */

/* Actor Behaviors */

/* purple text is an MP keyword,
   actors are green text,
   atomic events are blue */

/* with space as the separator, 
   the events (blue) are an ordered sequence */

ROOT Driver: enters_car 
             starts_car
             move_forward turn_left turn_right 
             stops_car
             exits_car;

ROOT Car: off
          starting 
          running_idle
          move_forward
          turn_left
          turn_right
          shutting_down;

Introduce branching logic

SCHEMA using_a_car /* model title */

/* Actor Behaviors */

/* purple text is an MP keyword,
   actors are green text,
   atomic events are blue,
   composit events are orange */

/* 
      ( A | B ) are alternative events 
*/

ROOT Driver: enters_car (starts_car | exits_car);
starts_car:  move_forward turn_left turn_right 
             stops_car
             exits_car;

ROOT Car: off
          starting 
          running_idle
          move_forward
          turn_left
          turn_right
          shutting_down;

Events among actors are coordinated

SCHEMA using_a_car /* model title */

/* Actor Behaviors */

/* purple text is an MP keyword,
   actors are green text,
   atomic events are blue,
   composit events are orange */

ROOT Driver: enters_car (starts_car | exits_car);
starts_car:  move_forward turn_left turn_right 
             stops_car
             exits_car;

ROOT Car: off
          starting 
          running_idle
          move_forward
          turn_left
          turn_right
          shutting_down;

/* the two processes are related by specific events */

COORDINATE
$a: starts_car FROM Driver,
$b: starting FROM Car
DO
ADD $a PROCEEDS $b;
OD;

COORDINATE
$a: stops_car FROM Driver,
$b: shutting_down FROM Car
DO
ADD $a PROCEEDS $b;
OD;

Actors may take zero or more actions

SCHEMA using_a_car /* model title */

/* Actor Behaviors */

/* purple text is an MP keyword,
   actors are green text,
   atomic events are blue,
   composit events are orange */

/* 
    {* *} is an unordered set of zero or more events 
*/

ROOT Driver: enters_car (starts_car | exits_car);
starts_car:  {* ( move_forward | 
                  turn_left | 
                  turn_right ) *}
             stops_car
             exits_car;

ROOT Car: off
          starting 
          running_idle
          {* ( move_forward | 
               turn_left | 
               turn_right ) *}
          shutting_down;

COORDINATE
$a: starts_car FROM Driver,
$b: starting FROM Car
DO
ADD $a PROCEEDS $b;
OD;

COORDINATE
$a: stops_car FROM Driver,
$b: shutting_down FROM Car
DO
ADD $a PROCEEDS $b;
OD;

Events are shared among actors

SCHEMA using_a_car /* model title */

/* Actor Behaviors */

/* purple text is an MP keyword,
   actors are green text,
   atomic events are blue,
   composit events are orange */

ROOT Driver: enters_car (starts_car | exits_car);
starts_car:  {* ( move_forward | 
                  turn_left | 
                  turn_right ) *}
             stops_car
             exits_car;

ROOT Car: off
          starting 
          running_idle
          {* ( move_forward | 
               turn_left | 
               turn_right ) *}
          shutting_down;

Driver, Car SHARE ALL move_forward, 
                      turn_left, 
                      turn_right;

COORDINATE
$a: starts_car FROM Driver,
$b: starting FROM Car
DO
ADD $a PROCEEDS $b;
OD;

COORDINATE
$a: stops_car FROM Driver,
$b: shutting_down FROM Car
DO
ADD $a PROCEEDS $b;
OD;

Sunday, November 20, 2022

steps I took to add a picture to the website frontpage

The current iteration of https://derivationmap.net/ uses data from a JSON file to dynamically generate page content.

On the front page of https://derivationmap.net/ there's currently an SVG. I wanted to place a PNG next to the SVG. Before placing the PNG, I wanted to make the SVG narrower. Looking at the source code for the page index.html, I see

<svg id='viz'></svg>

Inspecting the source code of the rendered page, I see

var width = 600;

which comes from https://github.com/allofphysicsgraph/proofofconcept/blob/gh-pages/v7_pickle_web_interface/flask/templates/_d3_js.html#L29

To test out a change to a narrower width, I start "Docker desktop" on my Mac and, in the directory allofphysicsgraph/proofofconcept/v7_pickle_web_interface, run the command

docker-compose up --build --remove-orphans

In a web browser I was then able to access http://localhost:80 and got a warning from Chrome that the certs were invalid. Incognito mode seemed to be working better.

Wednesday, October 26, 2022

Levels of reproducibility and repeatability and replication

Knowledge progresses when one person can leverage the insights of another person. There are levels of reproducibility that require different levels investment on the part of the person looking to build on the initial knowledge.

The levels described below are ranked from "requires lots of work to build upon" to "very easy to leverage."

Level 0: discoverable claim

Undocumented claims that no one else is aware of are irrelevant to the advancement of science.

Level 1: claim without evidence

example: "My design for this car tire supports operational speeds of 50 miles per hour and will be usable for 50,000 miles."
No software or analytical calculations are provided. No explanation of how claim was arrived at.

Level 2: verbal hints of process used

software-based example: My design for the car tire is based on calculations from code written in Python.
analytical example: In my calculation of operational speed I used the Fourier transform.

What distinguishes 2 from 1: Advertising that code was written, or math was done.
Most peer-reviewed scientific papers are written at this level or, if you're lucky, level 2.
Most presentations of experiments (e.g., at conferences and lectures) also are made at this level or level 2.

Level 3: software without documentation or dependencies; or, for analytical, a few of the key equations

software-based example: Python script provided by the author to back up claim. No configuration file or random seed value. Library dependencies and versions need to be determined through trial and error (aka digital archeology).
analytical example: a few key equations (from a much more complex derivation) are mentioned, as are a few (not all) of the assumptions.

consequence: If you're smart and diligent, you may be able to recover statistically similar (though not exact) behavior for stochastic models, assuming neither you nor the original author had any bugs in the implementation.

Level 4:

software-based example: Python script with random seed value specified and configuration parameters documented. Documentation not included, nor are dependencies made explicit.
analytical example: complete derivation provided, with explanation of assumptions

Level 5: repeatable

software-based example: Python script containerized, software versions pinned. Build process is executable. No digital archeology needed.
analytical example: portions of the complete derivation are checked by a computer algebra system (CAS) for correctness.

Level 6:

software-based example: Python script containerized with documentation of assumptions and examples of how to use. Configuration file with parameters and random seed values provided.
analytical example: complete derivation checked by a computer algebra system (CAS) for correctness. Proofs provided. Citations where appropriate.

Caveat: the levels described above are not actually linear. There are a few meandering paths that get from 0 to 6.

References

"Reproducibility vs. Replicability: A Brief History of a Confused Terminology"
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5778115/

Sunday, October 9, 2022

TLA+ comparison with other formal verification software

The scope of this post is limited to formal verification of algorithms and protocols.

To save docker images to my repo,

docker pull official/imagename
docker image tag official/imagename:latest myusername/imagename:latest
docker push myusername/imagename:latest

TLA+

Check the correctness of a specification.
Not tied to a specific implementation or tests of an implementation.

Jepsen

https://jepsen.io/

https://github.com/jepsen-io/jepsen

"Jepsen is a Clojure library. A test is a Clojure program which uses the Jepsen library to set up a distributed system, run a bunch of operations against that system, and verify that the history of those operations makes sense."

For distributed systems, introduce {partitions, added resources, removed resources, delays, clock skew} and other noises, run operations, and then compare against expected results.

Stateright

https://github.com/stateright/stateright

https://docs.rs/stateright/latest/stateright/

"Stateright is a Rust actor library that aims to solve this problem by providing an embedded model checker, a UI for exploring system behavior (demo), and a lightweight actor runtime. It also features a linearizability tester that can be run within the model checker for more exhaustive test coverage than similar solutions such as Jepsen."

F*

https://www.fstar-lang.org/

https://hub.docker.com/r/fstarlang/fstar

https://github.com/FStarLang/FStar

Can generate C, as well as parser/generators.

Not explicitly temporal (whereas TLA+ is explicitly about temporal actions), but F* could be used for temporal models.

Why3

https://why3.lri.fr/

Z3

written by the same developer as SMV after the developer moved to Microsoft

https://github.com/Z3Prover/z3

https://theory.stanford.edu/~nikolaj/programmingz3.html

https://github.com/z3prover/z3/pkgs/container/z3

Coq

constructive dependent type theory

Lean

"Simple formally verified compiler in Lean" by Ericson (2021)

https://uu.diva-portal.org/smash/get/diva2:1613286/FULLTEXT01.pdf

SMV

CMU's SMV --> Cadence SMV --> nuSMV

https://swmath.org/software/7795

http://mcmil.net/smv.html

https://www.cs.cmu.edu/~modelcheck/smv.html

https://sws.cs.ru.nl/publications/papers/biniam/smv/

https://web.cs.wpi.edu/~kfisler/Courses/525V/S02/Readings/smv-cadence.pdf

https://nusmv.fbk.eu/

https://en.wikipedia.org/wiki/NuSMV

Isabelle/HOL

simple type theory

https://lawrencecpaulson.github.io/2022/10/12/verifying-distributed-systems-isabelle.html

https://news.ycombinator.com/item?id=33177980

Alloy

https://alloytools.org/

Cryptol

https://cryptol.net/index.html

https://github.com/GaloisInc/cryptol

Spin and Promela

https://spinroot.com/spin/whatispin.html

https://spinroot.com/spin/Man/index.html

https://spinroot.com/spin/Man/Quick.html

https://en.wikipedia.org/wiki/Promela

P

https://github.com/p-org/P

Event B

"a modelling method for formalising and developing systems whose components can be modelled as discrete transition systems. An evolution of the (classical) B-method, Event-B is now centred around the general notion of events, which are also found in other formal methods such as Action Systems, TLA and UNITY."

An Introduction to the Event-B Modelling Method

https://www.southampton.ac.uk/~tsh2n14/publications/chapters/eventb-dbook13.pdf

Translating B to TLA+ for Validation with TLC

https://www3.hhu.de/stups/downloads/pdf/HansenLeuschel_TLC4B_techreport.pdf

https://www.sciencedirect.com/science/article/pii/S0167642316300235

classical B for software development

https://discuss.tlapl.us/msg03080.html

Relevance of TLA+ to the Physics Derivation Graph

The Physics Derivation Graph is just a CRUD application, so I didn't expect TLA+ would be relevant. My initial view was that there isn't an algorithm or a protocol in the PDG. The PDG is just a webpage, some middleware (e.g., bash or Python) and a back-end database (CSV, SQL, JSON, Neo4j, etc). Standard MVC applies.

After spending 1 day reviewing TLA+ materials, I realized the PDG has a workflow specification that I hadn't previously considered in detail.

User has a derivation in mind and they want to integrate it with existing PDG content. User knows their derivation's expressions, steps, symbols (variables and constants), and operators.
User writes their derivation into the PDG referencing existing symbols, existing expressions, and existing inference rules.

If the relevant symbol isn't available, add it.
If an existing symbol needs to be corrected/revised/merged, modify it.
If the relevant inference rule isn't available, add it.
If an existing inference rule needs to be corrected/revised/merged, modify it.

Integrated derivation steps are checked for correctness using a Computer Algebra System (e.g., SymPy, Mathematica, etc)

That user story sounds reasonable until you consider two concurrent users editing the same steps or expressions or symbols or inference rules or operators. If each of the above steps are considered atomic, there are potential conflicts.

One user could be attempting to create a new derivation step while another modifies existing symbols or inference rules.
Two users could concurrently attempt to modify/merge/edit existing symbols or inference rules. When they commit their change, the thing they are changing may have disappeared or been altered.

One way to avoid the above potential for conflict would be to lock the database while edits are made (no concurrent users).
Another avoidance mechanism would be to queue changes and check consistency of every change sequentially (serialize concurrency).

Saturday, October 8, 2022

glossary for TLA+ and PlusCal jargon

TLA+

jargon	definition

Apalache	"symbolic model checker for TLA+." "Translates TLA+ into the logic supported by SMT solvers such as Microsoft Z3." Source: apalache github
behavior	aka execution sequence of states Source: Lecture 1 video
CONSTANT	Source: Specifying Systems page i25
CONSTANTS	See CONSTANT
execution	see behavior. Source: Lecture 1 video
deadlock	execution stopped when it wasn't supposed to Source: Lecture 3 video See also termination
EXTENDS	Source: Lecture 2 video
FALSE	See also TRUE Source: Lecture 2 video

IF THEN	Source: Lecture 2 video
IF THEN ELSE	Source: Lecture 2 video
liveness property	"you have to look at the entire execution, which may be infinite, to determine that it is not satisfied." Source: session 9 See also safety property
MODULE	Source: Lecture 2 video
PlusCal	User's manual
safety property	"if is not satisfied by an execution, then you can tell that it's not satisfied by looking at a finite portion of the execution—even if the execution is infinite." Source: session 9 See also liveness property
state	assignment of values to variables Source: Lecture 1 video and Source: TLA+ chapter in "Software Specification Methods"

state machine, specification of	all possible initial states variables initial value of each variable what state follows a given state relations values of variables in current state and possible values in the next state. Source: Lecture 1 video
state predicate	"A formula that is true or false of a state. A state predicate is just an ordinary Boolean-valued formula that can contain variables but no primed variables and no temporal operators like <>." Source: session 9

step	change from one state to the next state. Source: Lecture 1 video
strong fairness	"A step of a process is strongly fair in a behavior iff there is no point in the behavior after which the step is enabled in infinitely many states but no such step occurs in the behavior." Source: session 9 See also weak fairness
strongly fair	see strong fairness

termination	execution stopped when it was supposed to Source: Lecture 3 video See also deadlock
TLA	Temporal Logic of Actions a temporal logic developed for describing and reasoning about concurrent and distributed systems Source: Specifying and Verifying Systems With TLA+
TLAPS	TLA+ proof system. Source: Lecture 1 video and Source: TLA+ Proof System (TLAPS)
TLA+	language for high-level modeling digital systems. Here "high-level" refers to the design aspects, above the code. Examples of digital systems: algorithms, programs, and computer systems. Source: Lecture 1 video "a formal specification language based on (untyped) ZF set theory, first-order logic, and TLA (the Temporal Logic of Actions)" Source: Specifying and Verifying Systems With TLA+ TLA+ version 2 user guide (2018) TLA+ hyperbook
TLA+ toolbox	IDE for TLA+. Source: Lecture 1 video. Toolbox website
TLC	on-the-fly model checker for debugging TLA+ specifications Source: Specifying and Verifying Systems With TLA+
TRUE	See also FALSE Source: Lecture 2 video
VARIABLE	Source: Lecture 2 video
VARIABLES
weak fairness	the process must keep taking steps if it can Source: session 9 See also strong fairness
weakly fair	see weak fairness
/\	and

\/	or Source: Lecture 2 video
<>	aka eventually For any state predicate P, the formula <>P is true of a behavior iff P is true of at least one state of the behavior. Source: session 9
[]	aka always For any state predicate P, the formula <>P is true of a behavior iff P is true of at least one state of the behavior. Source: session 9
~>	aka leads to Source: session 9
'	aka prime next value

In module Sequences

See TLA+ summary page 8 of 9, and this summary.

jargon	definition
Append(seq,e)	the sequence obtained by appending the value e as the new last element of the sequence seq. For example, Append(<<1, 2>>, 42) equals <<1, 2, 42>>. Source: intermezzo 1
Head(seq)	the first element of a non-empty sequence seq. For example, Head(<<"a", "c", "b">>) equals "a". Source: intermezzo 1
\o	aka concatenation "seq1 \o seq2 is the sequence obtained by concatinating the sequences seq1 and seq2 in the obvious way. For example, seq \o <<e>> equals Append(seq,e) for any sequence seq and value e." Source: intermezzo 1

websites for learning TLA+

Wikipedia

https://en.wikipedia.org/wiki/TLA%2B
https://news.ycombinator.com/item?id=9601770

Lamport's homepage

https://lamport.azurewebsites.net/tla/tla.html
https://lamport.azurewebsites.net/tla/learning.html
https://lamport.azurewebsites.net/video/intro.html
https://news.ycombinator.com/item?id=13918648

video tutorial series:
https://www.youtube.com/watch?v=p54W-XOIEF8&list=PLWAv2Etpa7AOAwkreYImYt0gIpOdWQevD

TLA+ Github

https://github.com/tlaplus/awesome-tlaplus

https://github.com/tlaplus/DrTLAPlus
https://github.com/tlaplus/Examples - includes DieHard4, DiningPhilosophers, Paxos,

https://github.com/dmilstein/channels

https://github.com/tlaplus-workshops/ewd998

https://github.com/Cjen1/tla_increment

LearnTLA.com - Hillel Wayne

https://learntla.com/
https://news.ycombinator.com/item?id=19661329
https://news.ycombinator.com/item?id=33010645

https://www.hillelwayne.com/post/learntla/ -- announcement about revision to book; points to learntla.com
https://news.ycombinator.com/item?id=31952643

https://hillelwayne.com/talks/distributed-systems-tlaplus/

Hillel Wayne - Designing Distributed Systems with TLA+
https://www.youtube.com/watch?v=ATobswwFwQA

Tackling Concurrency Bugs with TLA+ by Hillel Wayne
https://www.youtube.com/watch?v=_9B__0S21y8

toy example of money transfer in TLA+
three real-world cases from Hillel's employer of using TLA+ to check design before implementation

@pressron - Ron Pressler

https://pron.github.io/posts/tlaplus_part1
https://pron.github.io/posts/tlaplus_part2
https://pron.github.io/posts/tlaplus_part3
https://pron.github.io/posts/tlaplus_part4

https://pron.github.io/posts/tlaplus-curryon-talk
Ron Pressler - The Practice and Theory of TLA+
https://www.youtube.com/watch?v=15uy9Ga-14I

TLA+ in Docker in VSCode

A gentle intro to TLA+
https://www.youtube.com/watch?v=D_sh1nnX3zY
Demos a three-state system using VSCode

VSCode plug-in using Docker for TLA+:
https://github.com/kevinsullivan/TLAPlusDocker

My video on how to install TLA+ inside Docker as a plug-in for VSCode:
https://www.youtube.com/watch?v=sLGY7_agg4E

Questions and Answers

https://groups.google.com/g/tlaplus

https://stackoverflow.com/questions/tagged/tla%2b

https://www.reddit.com/r/tlaplus

Blog posts

https://elliotswart.github.io/pragmaticformalmodeling/

http://muratbuffalo.blogspot.com/2022/10/checking-statistical-properties-of.html
https://news.ycombinator.com/item?id=33077898

https://www.mydistributed.systems/2022/06/some-practical-tips-on-using-tla-and-p.html

https://thenewstack.io/tla-the-best-debugger-optimizer-youve-never-heard-of/

https://roscidus.com/blog/blog/2019/01/01/using-tla-plus-to-understand-xen-vchan/
Deep dive into practical use of TLA+ for real software. Links to github repo with Dockerfile

Using TLA+ for fun and profit in the development of Elasticsearch - Yannick Welsch
https://www.youtube.com/watch?v=qYDcbcOVurc

TLA+ Toolbox for Beginners
https://www.youtube.com/watch?v=U2FAnyPygrA

Modeling Virtual Machines and Interrupts in TLA+ & PlusCal - Valentin Schneider
https://www.youtube.com/watch?v=hlLZi4wfBjs
use of TLA+ at ARM

TLA+ Tutorial 2021 at DISC 2021
https://www.youtube.com/watch?v=NXLJoUKJnDQ

Debugging software designs using testable pseudo-code (Jay Parlar)
https://www.youtube.com/watch?v=LAEXHua4MQQ
Alice and Bob money transfer model (from learntla.com)
utility: first motivating example for why TLA+ provides value to design process

A Beginner's Guide to TLA+ Exploring State Machines & Proving Correctness
https://www.youtube.com/watch?v=H7yBYoY7ILc

SREcon20 Americas - Weeks of Debugging Can Save You Hours of TLA+
https://www.youtube.com/watch?v=wjsI0lTSjIo
Uses an example written in C that has a rare deadlock.

Saturday, May 28, 2022

Next steps once math expressions are tokenized

In my previous post I outlined a sequence of steps with a negative framing about how difficult each step would be. A positive framing of the sequence is

Analyze .tex from arxiv and account for issues like encoding and misspelling and mal-formed latex and expansion of macros.
Once the math (e.g. $x$) and expressions are separated from the text, tokenize variables within expressions.
Once variables are tokenized within expressions, identifying the concept (e.g., name of constants) based on the text in the paper.
Reconcile variables across different .tex files in arxiv
Create an interface providing semantically-enriched arxiv content that is indexed for search queries to users.

Suppose we are at step 2 and everything in a document is correctly tokenized (or even if just a fraction of the content is tokenized). The follow-on step (3) would be to detect the definition of the tokens from the text. For example, if the variable "a" shows up in an expression, and $a$ shows up in the text, and the text is something like

"where $a$ is the number of cats in the house"

Then we can deduce that "a" is defined as "number of cats in the house".

Step 4 would be to figure out if "a" is used similarly in other papers. That would indicate a relation of the papers based on the topic of the content. See for example https://arxiv.org/pdf/1902.00027.pdf

Another use case for tokenized text (in step 2) with some semantic meaning (step 3) would be to validate the expressions. If the expression is "a = b" and the two variables have different units, that means the expression is wrong.

Searchable Latex, semantic enrichment, and reconciling variables in arxiv

Inspired by searchonmath.com I spent time attempting to recreate and then extend the effort.

Analyzing .tex from arxiv is hard due to "minor" issues like encoding and misspelling and mal-formed latex and expansion of macros. (LaTeXML may help with macro expansion?)
Once the math (e.g. $x$) and expressions are separated from the text, I don't have a good way of separating variables within expressions. (I think this is where grammar explorations are helpful. LaTeXML may also be able to do this, but I haven't gotten it working yet.)
Once variables are tokenized within expressions, identifying the concept (e.g., name of constants) is burdensome. (Need manual annotation -- MioGatto -- or NLP or both.)
Reconcile variables across different .tex files in arxiv
Create an interface providing semantically-enriched arxiv content that is indexed for search queries to users. (Something like searchonmath.com but with semantic enrichment of variables.)

Getting to 5 is still a long way from the feature set I'm trying to demonstrate with the Physics Derivation Graph! To be specific, filling in missing derivation steps and checking the consistency of expressions and the correctness of derivation steps would be additional work.

That leads me to the conclusion that I should focus my PDG efforts on building an exemplar destination rather than spend my time implementing steps 1-5. Both are relevant and take a lot of hard labor and creative work. My plans are to continue work on the Neo4j-based property graph.

Friday, May 27, 2022

PDG as dedicated website, and PDG-as-a-service, and PDG as an overlay for arxiv

Two ways to present the Physics Derivation Graph is as a website (currently https://derivationmap.net/ ) and as an API (see https://derivationmap.net/api/v1/resources/derivations/list as an example from the https://derivationmap.net/api/v1/documentation ).

A third way to present the content would be as an overlay for existing content, e.g. https://arxiv.org/ .

Related: Comments on papers

The content overlay concept has been explored primarily for comments. For example, active efforts include

https://fermatslibrary.com/, which also has a chrome extension, Librarian, which isn't functional. Example comments: https://fermatslibrary.com/arxiv_comments?url=https://arxiv.org/pdf/1602.03837.pdf
https://scirate.com/; as an example of comments on a paper see https://scirate.com/arxiv/2205.09767 ; https://reimaginereview.asapbio.org/listing/scirate/
https://scipost.org/?tab=commentaries ; http://researchpracticesandtools.blogspot.com/2016/07/scipost-right-tool-for-commenting-arxiv.html

For more related projects, see https://reimaginereview.asapbio.org/

Illustration of the comment section for research papers: https://phdcomics.com/comics.php?f=1178

Inactive efforts:

https://selectedpapers.net/ ; https://johncarlosbaez.wordpress.com/2013/06/14/the-selected-papers-network-part-2/

Another overlay: variable identification

Find the same variable referenced in multiple papers
Find the same expression reference in multiple papers

Having all of the metadata for every archive paper would be a good starting point to reference the variables with.

Example of active effort:

Subscription-based latex search of equations: https://www.searchonmath.com/ ; $0.99 for the first month as of 2022-05-26, then $4.50/month after. Can search either web (stackoverflow, wikipedia) xor arxiv content.

Inactive effort:

http://numericalatlas.cs.ucl.ac.uk/ which is described in https://arxiv.org/abs/2107.00665

Friday, May 20, 2022

what is using disk space on the web server?

https://stackoverflow.com/a/15142053/1164295

$ du -cBM --max-depth=1 2> >(grep -v 'Permission denied') | sort -n 
0M	./dev
0M	./proc
0M	./sys
1M	./lost+found
1M	./media
1M	./mnt
1M	./root
1M	./srv
2M	./run
7M	./etc
11M	./opt
74M	./tmp
151M	./boot
1329M	./home
1497M	./snap
2535M	./usr
3820M	./var
9421M	.
9421M	total

Confusingly, that doesn't seem consistent with

$ df -h
Filesystem      Size  Used Avail Use% Mounted on
udev            474M     0  474M   0% /dev
tmpfs            99M  1.2M   97M   2% /run
/dev/vda1        25G   20G  4.4G  82% /
tmpfs           491M     0  491M   0% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
tmpfs           491M     0  491M   0% /sys/fs/cgroup
/dev/vda15      105M  9.2M   96M   9% /boot/efi
tmpfs            99M     0   99M   0% /run/user/0
tmpfs            99M     0   99M   0% /run/user/1000
/dev/loop4       68M   68M     0 100% /snap/lxd/22526
/dev/loop2       44M   44M     0 100% /snap/snapd/15177
/dev/loop3       56M   56M     0 100% /snap/core18/2344
/dev/loop5       68M   68M     0 100% /snap/lxd/22753
/dev/loop0       62M   62M     0 100% /snap/core20/1405
/dev/loop6       45M   45M     0 100% /snap/snapd/15534
/dev/loop7       62M   62M     0 100% /snap/core20/1434
overlay          25G   20G  4.4G  82% /var/lib/docker/overlay2/b1e93808993411941a56eeab3447a9620dabf64956633befd4f4997c00d3bfea/merged
shm              64M     0   64M   0% /var/lib/docker/containers/dd7ef352d6ba8fa022bde66cc083c81c868ecc492b41eb31725cbd3d44e41297/mounts/shm
overlay          25G   20G  4.4G  82% /var/lib/docker/overlay2/37c02acbf47a52998e26eb679988396a263c4b2bc723435a7e185d999adb3554/merged
shm              64M     0   64M   0% /var/lib/docker/containers/4ca979c1faea9fee6b29a1bcebbea5b1897aabcb8f5e6b4e3844b52a90f481e7/mounts/shm
/dev/loop8       56M   56M     0 100% /snap/core18/2409

Disk usage savings #1: decrease Journal

https://askubuntu.com/a/1238221 and https://unix.stackexchange.com/a/130802/431711 and https://wiki.archlinux.org/title/Systemd/Journal

cd /var/log/journal

sudo journalctl --vacuum-time=10d

Disk usage savings #2: remove unused Docker images

docker images | grep "<none> <none>" | tr -s " " | cut -d' ' -f3 | xargs docker rmi

Tuesday, May 3, 2022

observations on the conversion of the backend from JSON to property graph (Neo4j)

The JSON backend for the Physics Derivation Graph

is concise -- only the fields necessary are present
is easily readable -- plain text and not much nesting
requires significant investment to construct queries
is static in terms of dependencies; unlikely to degrade or require maintenance

The property graph (in Neo4j) backend

supports user-provided queries
adds maintenance risk of keeping up with changes to Cypher and Neo4j

Sunday, March 6, 2022

Changing from JSON to property graph + SQL backend

A few recent conversations with scientists about the Physics Derivation Graph have led me think about different queries (247, 243, 241, 240, 239, 238) that could be of value and can be extracted from the current content.

In coming up with ways to query the graph, I realized a property graph is useful for supporting the queries. That is in contrast to writing a custom query capability against my existing JSON format. I'm already embarrassed by the JSON/SQL implementation, so having specific queries of interest provided me sufficient motivation to investigate implementing a Neo4j backend.

Transitioning to a property graph (specifically Neo4j) loses the fine grain control over the mechanics of the graph. However, the trade-off is well worth the increased development speed. Having a Cypher query interface via the web GUI very powerful.

With the property graph representation, augmenting information is needed in tabular format:

all possible inference rules, along with the CAS implementation per rule
all variable definitions, along with dimensions, constant or variable, scope, and reference URLs
units, along with dimensions, and reference URLs

Those three tables could be stored in an SQL database.

I'm replacing a single plaintext JSON file with two non-plaintext data formats -- SQL and Neo4j.

Monday, February 21, 2022

Lots of tasks for 2022; what are the priorities

With the JSON/SQL implementation, I showed myself that what I was imagining (Latex entry, CAS integration, symbol tracking, Latex/PDF output) was in fact feasible. However, the JSON/SQL backend and the forms-based web front-end were sufficiently embarrassing that I wasn't interested in showing off the idea.

Now my goal with the Neo4j/SQL backend my goal is 1) provide query capability and 2) to not be embarrassed.

High priority:

Property graph implementation from scratch; probably Neo4j -- https://github.com/allofphysicsgraph/proofofconcept/issues/171

web search -- https://github.com/allofphysicsgraph/proofofconcept/issues/238
integrating SQL for variables, inference rules, units

automation of inference rules
add derivation, add step, add expressions -- form-based or dynamic canvas?

Low priority:

analysis of server logs -- https://github.com/allofphysicsgraph/proofofconcept/issues/246

Tuesday, February 8, 2022

old undated project goals

This page content is from two migrations, most recently https://sites.google.com/site/physicsderivationgraph/goals

I don't know the original date of this post

Objectives

Create a framework capable of describing all mathematics needed for physics derivations. I use Latex as the syntax in the framework because Latex is how I think of equations. However, Latex is insufficient for processing by computer algebra systems. Status: proof of concept exists
Create machine-readable databases which use the above framework to capture the mathematical derivations in physics. To hold the content of the databases I'm using custom XML. Status: proof of concept exists
Create graphical representation of relations content in the databases. I'm using GraphViz to render the visualization. Status: proof of concept works
Use a computer algebra system to verify the relations in the databases. I'm using SymPy as the CAS (see also a list of candidates). Status: proof of concept works.
Create a web browser-based viewing of the generated graph. HTML5 seems capable. Status: investigated, not started

Create a web browser-based graph input tool. Status: not started

The above workflow applies to both CLI and web-based GUI. Implementing a web-based GUI is its own learning curve, so separate the above workflow diagram into two sets of tasks. The following applies to both web-based GUI and CLI.

Tasks

user adds content to internal data structure (user --> IDS)
write content to external database from internal data structure (IDS --> DB)
read content from external database into internal data structure (IDS <-- DB)
render internal data structure as visual graph (IDS --> graph)
check internal data structure content using CAS (IDS --> CAS)

Currently (20140527) what I'm actually doing is (user --> DB) and (DB --> graph)

Rant 1: historical progress

How we do math and physics has undergone some historical transitions in how the process is carried out.

Initially research was done by individuals by paper and pen, then communicated via letters and later journals. <claim>Mechanical computers were initially used for computation of known capability, rather than enabling novel research. </claim>

In the past 50 years, electronic computers have enabled numeric and symbolic computation. High performance computers at large scale allow for research at unprecedented pace. Results in article form are still communicated via printed journals, and more recently, electronically. Sharing data and algorithms in electronic format is the current revolution.

The tools we use have expanded from paper and pen to Computer Algebra Systems, databases, and programming languages.

old log entries from 2014

20140927 syntax := ABNF

An issue I've previously encountered is how to store the information associated with the graph. I started with plain text, and then moved to XML (the current state). The content stored in the XML is Latex. Latex renders well, but lacks context (meaning). I am now investigating Content MathML as a stricter (but less widely adopted) syntax. Although Content MathML is better, it doesn't entirely capture the use I intend for this project.

The core of the project centers on the following two conventions:

<statement> = <left hand side> <relation> <right hand side>
<statement> --> <inference rule> --> <statement>

The relation is typically equality or inequality, though it could include union and other operators.

The inference rule acts on one or more statements.

The statement is the relation between two mathematical expressions

This is outside the scope of Content MathML, XML, and Latex. I think the appropriate capture is to use something like Augmented Backus-Naur Form (ABNF). Here are a few possible rules:

derivation = 1*2statement inferenceRule statement; some derivations involve multiple input statements, here we allow 1 or 2

statement = mathExpression relation mathExpression

mathExpression = *ALPHA *DIGIT *SP *operator *"\" *"." *"," *"'" 

relation = ( "=" / "<=" / "<" / ">" / ">=" / "U" )

operator = ( "+" / "-" / "/" / "*" / "^" / "!" / ( "[" "]" ) / ( "(" ")" ) / ( "{" "}" ) / "|" )

Looking at the code, I've already adopted the following rules

infrule_name = 1*ALPHA

statement_punid = 10DIGIT

statement_tunid = 7DIGIT

symbol_punid = 15DIGIT

From my notes, it appears I first figured out the requirement for the statement label restrictions on 20110406.

20140527 commercial projects

In the past week I've been made aware of 3 commercial efforts similar to this project

SymboLab and Formula-Database are similar, in that both are about search of equations. In addition to search, SymboLab provides derivations and plot. It's not clear whether these are dynamic or hand-coded. SymboLab appears to be more mathematics oriented, whereas Formula-Database has descriptions of the physics and symbol definitions. EquationMap is about user-generated graphs of derivations. My project has a poorer user interface, and the objective is grander (capture all of physics derivations).

None of these three are direct competition with this open source project, though there is overlap with subtasks from each project. I don't see a clear use case for any of the three -- if it were that useful as a product, I would have expected someone else to already be working in this area.

I'm not clear on why there aren't other open source projects in this area. Proofwiki.org is not graph-based, and PlanetPhysics disappeared.

20140527 task prioritization

Task prioritization:

navigable interface. The current graph is large enough that intuitive navigation is an issue which needs to be addressed. The graph is currently rendered as a static PNG file. Rendering the graph in a web browser might be more accessible (allowing interactive navigation), and would generate a new set of issues to be addressed (i.e., reading XML databases into a new data structure). By "navigable interface" I am distinguishing from an interface for entering new data to be stored in the database
- http://mkolkowski.lqnstudio.mydevil.net/
- http://cytoscapeweb.cytoscape.org/demos. This is flash-based, but the pan-zoom feature is present in https://github.com/cytoscape/cytoscape.js-panzoom
ability to high-light subsections of the total graph related to a specific derivation
ability to high-light the symbol use graph within the statement graph
add content (E=mc^2, Maxwell's equations)
fully check graph using CAS (currently SymPy)
user input through CLI
user input through web browser GUI
- MathQuill

To create a navigable interface, my first guess is to start by finding an example of an HTML5/Javascript directed graph with png/svg files as nodes. This is because I think it's probably unrealistic to render the Latex statements in a browser with a dynamic graph

As far as intuitive interface, I like Google Maps and its ability to zoom in/out when visualizing large spatial data. EquationMap is a good start but isn't open source.

20140526 computer algebra system (CAS) inputs

In reviewing candidate Computer Algebra systems, I still think SymPy is the best option compared to the other free and open source CASs https://en.wikipedia.org/wiki/List_of_computer_algebra_systems

Currently I expect the user to provide two inputs for each statement: Latex for visual rendering, and the SymPy equivalent for checking with the CAS. This is redundant, since SymPy can render Latex.

The reason I let the user supply Latex is because as a physicist I think in terms of Latex (not SymPy). I treat the addition of the SymPy equivalent as a second step

I have a fundamental conflict with (1) wanting the input format to be easy (i.e., Latex) and (2) wanting the content to be checkable by a CAS (i.e., MathML). I don't think there is a an easy-to-input format which is also easily checkable by a CAS.

Does the database need to be checked by a CAS?

There are two reasons one builds this database:

as a notepad for current research, possibly to be used in a publication. Assumptions: dynamic; written to by only a few people; it may contain mistakes. Thus, it would be helpful to be checked by a CAS
to store relations between all accumulated knowledge. Assumptions: static database; written to by many people; read by many people. Thus, mistakes are likely to be found whether or not there is a CAS

20140510 graph vs relational database

I need to better understand the difference between Graph vs relational database vs object

https://en.wikipedia.org/wiki/Graph_database
https://en.wikipedia.org/wiki/Relational_databases

Wednesday, January 5, 2022

two hours to manually transcribe three typed pages of math into 150 lines of HTML+Latex

I have about 20 boxes of notes from roughly 10 years of undergrad and graduate classes in Math and Physics. I've kept the notes for the past 10 years with the intent of converting the notes into a structured and computer-readable format. The current site https://derivationmap.net/ shows the proof-of-concept that the technical capability is feasible.

I spent two hours manually transcribing three typed pages of math into https://derivationmap.net/class_notes/math402_mathematical_physics_hale. The resulting HTML with MathJax was about 150 lines. By the end my attention/focus was waning, so a break was necessary.

The typed notes are merely transcription into computer-readable format; the notes are not in the "equation graph" form necessary for the Physics Derivation Graph. That is a separate tedious process.

Observations:

An estimate of how long transcribing all the notes would take:

If I type HTML+MathJax for 4 hours a day, that seems reasonable
If a box of notes is 1 ream (500 pages), (500 pages/box)*(2 hours/3 pages)*(20 boxes)=6666 hours
6666 hours/(4 hours/day) = 1666.5 days, or 4.5 years

There are many diagrams that go with the text and equations. The three pages in this sample didn't have diagrams. Converting the diagrams into tikz would be time consuming.
The structure of the notes is not compatible with the constraints of the "equation graph" structure.
I expect that scanning + OCR would be of limited value for producing Latex. Also, much of the notation in the notes is sloppy and requires translation to more rigorous notation.
Even the more "rigorous notation" is merely an improvement; the notation is a long ways from the specificity needed for use in a computer algebra system. As an example,

the original format for an inner product is (x, y) -- here "x" and "y" are merely bolded
my more rigorous notation is $(\vec{x}, \vec{y})$
Multiplying the inner product by a complex value $\alpha(\vec{x}, \vec{y})$ is visually ambiguous -- it could be interpreted that $\vec{x}$ and $\vec{y}$ are arguments to the function $\alpha$. See also this question.

Notes for a course are either a single page (with thousands of equations, which render in MathJax slowly) or split among many pages (increasing the number of context switches)

If typing every page of notes into HTML+MathJax is not reasonable (or useful), then identifying what part of the notes are useful should be the objective.

not useful: Homework problems and exams are reliant on identifying and applying one or more Physics expressions and then numerically compute the result for a given scenario. Homework problems rarely demonstrate connectivity among Physics expressions.
not useful: identities that arise from definitions. Example: Cauchy Schwartz
marginally useful: Derivation from definitions
marginally useful: Derivation from experiments
potentially useful: Derivation from another expression

Simple harmonic oscillator - identify every domain it is used in