Sunday, June 18, 2023

software for numerical calculations of derived expressions versus implementation of an algorithm

Suppose I have a Python program that takes "angle" and "initial speed" as arguments and calculates where a ball shot from a cannon lands. The program could be described using the function

f(\theta, s) = d

where \theta is the angle, s is the speed, and d is the distance from the cannon where the ball lands. The function could be expressed using a variety of languages like Python, C, Lean, etc.
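
A minimal sketch of f in Python, assuming level ground, no air resistance, and the standard projectile-range formula d = s^2*sin(2*\theta)/g:

import math

def landing_distance(theta, s, g=9.8):
    """Distance d (meters) where the ball lands, for launch angle theta
    (radians) and initial speed s (m/s), ignoring air resistance."""
    return s**2 * math.sin(2 * theta) / g

print(landing_distance(math.radians(45), 100))  # ~1020 m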

If f(\theta, s) = d is derived from the equations of motion (like https://derivationmap.net/review_derivation/201726/) and each step is verified using Lean (as described in this post), then we can use Lean to calculate d from \theta and s. The difference is that there's more explanation of where the result came from. Rather than just "here's the equation to use," including each derivation step and its proof makes the assumptions explicit and the implementation correct.


All software programs can be described as functions that take arguments and return results. That doesn't mean that all functions are derived. As an example of software that is a function but isn't derived, consider the A* routing algorithm. An algorithm specifies a sequence of steps for computing a function.

While all programs could be written in Lean (because each program is a function), not all functions are derived mathematically. As a trivial example, printing "hello world" is not a derived function. In contrast, the first example in this post, the distance a projectile flies, is derived from the equations of motion. What distinguishes "hello world" and A* routing from the cannon ball trajectory?

The distinction is that the cannon ball example is a numerical simulation of a tangible scenario, whereas "hello world" is not describing physically-constrained reality. 

For numerical simulations, Lean can formally verify the derivation and the formula used for analytical calculation. Keeping both aspects in Lean results in a clear chain of custody. 

For algorithms (e.g., A* routing, "hello world"), the verification is to show that the implementation is consistent with the requirements. Here languages like Dafny and F* are relevant.

More on F*:

More on Dafny:


Floats in Lean

Float is not a ring: floating-point addition is not associative, so the ring axioms fail.
Nor is there an injection from Float into the reals; Inf is not a real number.
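
A minimal Lean 4 sketch of both points:

#eval (0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3)  -- false: Float addition is not associative,
                                              -- so Float cannot satisfy the ring axioms
#eval (1.0 : Float) / 0.0                     -- inf, which has no counterpart in the reals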

relation between Inference Rules in the Physics Derivation Graph and Proofs in Lean

In this post I explore the concept that each derivation can be phrased as a proof that the final result follows from the initial declaration. Also, each step in a derivation can be phrased as a proof.

Lean Proofs of Derivation Steps

As per https://stackoverflow.com/questions/41946310/how-to-prove-a-b-%E2%86%92-a-1-b-1-in-lean Lean can prove that

(a=b) -> (a+1=b+1)
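
For example, in Lean 4 syntax (the linked answer uses Lean 3), a version over the natural numbers that runs without Mathlib:

example (a b : Nat) (h : a = b) : a + 1 = b + 1 := by rw [h]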

In comparison, a PDG Inference Rule is more generic:

add __ to both sides of (LHS=RHS) to get (LHS+__=RHS+__)

The PDG Inference Rule is generic because the types of LHS and RHS are undefined. Are they Real? Complex? Matrices?

Types are relevant for inference rules because mixing types may not work. "Add 1 to both sides of (LHS=RHS)" won't work if LHS and RHS are 2x2 matrices.

Friday, June 16, 2023

Translating between Physics Derivation Graph concepts and features in the Lean Theorem prover

In this blog post I try to figure out how those concepts map to features in the Lean theorem prover.

Concepts and relations of concepts in the Physics Derivation Graph

In the Physics Derivation Graph a derivation is composed of steps. Each step has one inference rule. Inference rules can have

  • zero or more input expressions
  • zero or more output expressions
  • zero or more feed values
Each expression consists of an LHS, an RHS, and a relation. The relation can be =, >, <, <=, or >=.
The LHS, RHS, and feed values are composed of symbols (e.g., a, b, x, y) and operators (e.g., *, +, /).
Symbols are variables (e.g., a, b, x, y) or constants (e.g., \pi, e).
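
As a sketch, this data model could be expressed with Python dataclasses (the class and field names here are illustrative, not the PDG's actual schema):

from dataclasses import dataclass, field
from typing import List

@dataclass
class Expression:
    lhs: str        # e.g., "a + 2"
    relation: str   # one of =, >, <, <=, >=
    rhs: str        # e.g., "b + 2"

@dataclass
class Step:
    inference_rule: str                # e.g., "add __ to both sides"
    inputs: List[Expression] = field(default_factory=list)
    outputs: List[Expression] = field(default_factory=list)
    feeds: List[str] = field(default_factory=list)  # e.g., ["2"]

@dataclass
class Derivation:
    steps: List[Step] = field(default_factory=list)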

How Physics Derivation Graph concepts map to Lean


I think a PDG input expression is a proposition in Lean.
I think a PDG step is a theorem in Lean. 
Maybe the PDG output expression is a goal in Lean?
I think a PDG inference rule is a tactic in Lean. See https://leanprover.github.io/reference/tactics.html

"The type of a function that can inspect the proof state, modify it, and potentially return something of type A (or fail) is called tactic A.
source: https://leanprover-community.github.io/extras/tactic_writing.html


Equivalence of the PDG derivation step "add 2 to both sides of a=b to get a+2=b+2" with using Lean for the proof "(a=b) -> (a+2=b+2)"

In the Physics Derivation Graph the expressions "a=b" and "a+2=b+2" are related by the inference rule "add __ to both sides".

In Lean, "a=b" is a proposition. We have to specify that a is Real and b is Real. Then we can prove that
(a=b) -> (a+2=b+2)
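
A minimal sketch of that proof in Lean 4, assuming Mathlib is imported to provide the real numbers:

import Mathlib

example (a b : ℝ) (h : a = b) : a + 2 = b + 2 := by rw [h]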

Work in progress -- try something like
see also

Thursday, June 15, 2023

computer algebra system (CAS) is inadequate for the Physics Derivation Graph (PDG)

I'm beginning to appreciate that a computer algebra system (CAS) is not sufficient for the Physics Derivation Graph (PDG). The evidence of this is that the variables I'm defining can be real or complex, and that distinction is not captured by the computer algebra system.

The simple story of starting with

a = b
and then adding 2 to both sides to get
a+2 = b+2
is appropriate for a computer algebra system. But if "a" is a matrix then the operation is invalid. The only way to distinguish "scalar a" from "vector a" from "matrix a" is to specify the difference. [SymPy does have support for predicates -- https://docs.sympy.org/latest/guides/assumptions.html#predicates .]
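
For example, a minimal sketch with SymPy's assumptions system:

from sympy import symbols, Q, ask

a = symbols('a')             # no assumptions: real? complex? SymPy doesn't know
print(ask(Q.real(a)))        # None -- undecidable without assumptions
x = symbols('x', real=True)  # the difference has to be specified explicitly
print(ask(Q.real(x)))        # True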

In the Physics Derivation Graph I want to be more specific about the possible values of a and b. Even for non-physics derivations like the Euler equations, there are assumptions about the possible values of each variable.

I don't know Lean, and I also don't know the foundational concepts of theorems and proofs.
Is a step a theorem?
Is a derivation a theorem?
The role of inference rules in steps in derivations does not map to anything in a theorem.

Sunday, June 11, 2023

chain of custody for formally-verified software

Relating disparate sources of knowledge in science involves:

  • Formally-verified Mathematical Physics (e.g., in Lean). Purpose: verify the mathematical model is consistent with specified axioms.
  • Formally-verified software; see https://physicsderivationgraph.blogspot.com/2022/10/tla-comparison.html for a list.
  • Performant parallelized software (e.g., MPI + OpenMP + Fortran or C/C++) for High Performance Computing (HPC).
  • Peer-reviewed journal articles, e.g., arxiv; rendered using LaTeX. [Of the four, this has historically been closest to the focus of the Physics Derivation Graph.]

Each of these content sources is independent and expresses the mathematical model in a different way.


While kernels of mathematical models can be implemented in formally-verified software, a lot of scientific computing uses High Performance Computing (HPC). In the HPC realm the compilers and libraries and runtimes can all be sources of bugs. 

The software needs to be compiled or interpreted with a verified compiler or interpreter.


How can the consistency of content across sources be ensured? 

There are 6 bilateral transitions for the 4 sources. For example, how does the model transition between theorem prover (e.g., Lean) and formally-verified software implementation?


Another possible issue: provenance

May need to track provenance of the source code to binaries. Is the verified algorithm what was provided to the compiler? Is the binary I'm currently using related to the compiler output? Need to be able to distinguish "binary I don't know the origin of" from "binary that has a known chain of custody." 

This could be just a log of hashes coupled with metadata recording where each artifact came from. The provenance needs to include versions for compilers, libraries, and runtimes. The binary should be reproducible. See https://en.wikipedia.org/wiki/Reproducible_builds
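
A minimal sketch in Python of what one log entry could look like (the file name provenance.jsonl and the field names are illustrative):

import hashlib, json, time

def record_provenance(binary_path, source_commit, compiler_version,
                      log_path="provenance.jsonl"):
    """Append a hash-plus-metadata entry for a binary to a provenance log."""
    with open(binary_path, "rb") as fh:
        digest = hashlib.sha256(fh.read()).hexdigest()
    entry = {"binary_sha256": digest,
             "source_commit": source_commit,
             "compiler": compiler_version,
             "recorded": time.strftime("%Y-%m-%d %H:%M:%S")}
    with open(log_path, "a") as log:
        log.write(json.dumps(entry) + "\n")
    return entry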

Saturday, June 10, 2023

Finding Derivations of Newton's Law of Universal Gravitation online

Question: was Newton's law of universal gravitation found empirically (from measurements) or can it be derived mathematically? The short answer is yes -- both. The historical process is separate from what can be done now.

Empirical (historical) route

The second paragraph of the Wikipedia article on Newton's Law of Universal Gravitation says, "This is a general physical law derived from empirical observations by what Isaac Newton called inductive reasoning." There's a video on YouTube describing that Newton started with

  • acceleration of gravity on earth, g = 9.8 m/s^2
  • distance between Earth and Moon = 60*(radius of earth)
  • radius of Earth ≈ 6.4*10^6 m
  • orbital period of moon = 27.32 days

Newton figured out how fast the Moon is circling the Earth using

 velocity = distance/time
where, in the case of the moon circling the Earth,
 distance the moon travels around the Earth = circumference = 2*pi*r_{orbit of Moon}

Plugging in numbers, velocity of Moon around Earth = 1022 m/s. That can then be plugged into the centripetal acceleration,

a_{centripetal} = v^2/r
How does a_{centripetal} compare to the gravitational acceleration g?
g/a_{centripetal} \approx 60^2

Noticing that the factor of 60 is common to the acceleration ratio (60^2) and the Earth-Moon distance (60 Earth radii), Newton figured that gravitation follows an inverse-square law. Newton then checked this against data from observational studies of planets.
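
A quick numerical check of these figures in Python (assuming radius of Earth ≈ 6.4*10^6 m):

import math

r_earth = 6.4e6                  # m, approximate radius of Earth
r_orbit = 60 * r_earth           # Earth-Moon distance
T = 27.32 * 24 * 3600            # orbital period of the Moon in seconds
v = 2 * math.pi * r_orbit / T    # ~1022 m/s
a_centripetal = v**2 / r_orbit   # ~2.7e-3 m/s^2
print(v, 9.8 / a_centripetal)    # the ratio is ~3600 = 60^2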

That's a big leap to F = G*(M*m)/(r^2), so are there more mathematical routes?

Mathematical Derivations

From nasa.gov

I first found NASA's classroom materials. There are some leaps that I wasn't able to follow. The same content of that page is duplicated on a related NASA page. The derivation starts with F=ma and a_{centripetal} = v^2/r. The author mentions Kepler's Third Law but says T^2 \approx R^3 (which is dimensionally inconsistent) when they mean T^2 \propto R^3. The misuse of \approx and \propto continues throughout the rest of the derivation.

velocity = distance/time
and the distance here is the circumference, 2*pi*R, so
period T = (2*pi*R)/v

Drop the 2*pi to get

v \propto R/T
Square both sides and apply Kepler's Third law
T^2 \propto R^3
to get
v^2 \propto 1/R

The second source of my confusion is subscripting versus superscripting -- v_2 versus v^2.

F = (m*(v^2))/R
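
Putting those pieces together with consistent use of \propto:

v \propto R/T
v^2 \propto R^2/T^2 \propto R^2/R^3 = 1/R
F = (m*(v^2))/R \propto m/(R^2)

which is the inverse-square form of F = G*(M*m)/(r^2).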

I tried submitting a correction to NASA's feedback page but couldn't since the Captcha is missing. :(


From a tutorial

Next I found another student-oriented page that has a derivation which is even less helpful than NASA's. The derivation presented starts from angular speed and F=mr\omega^2.


YouTube tutorial

Happily I found this helpful video from Daniel M, "Deriving Newton's Law of Universal Gravitation".

automating entry of derivations into the Physics Derivation Graph website

What would it take to integrate support for symbol detection and conversion to SymPy for a single step in a derivation?
  1. user provides initial expression in LaTeX to web UI.
  2. computer parses symbols and operators from the LaTeX
  3. computer searches Physics Derivation Graph database of symbols and operators to find candidate symbols
  4. computer provides candidate symbols to user and prompts, "which of the following symbols were you referring to?"
  5. computer parses expression to SymPy, returns AST to user, and prompts, "is this the AST you meant?"
  6. if yes, continue; if no, go back to step 1 or provide corrections to the AST.
  7. user provides next expression in LaTeX
  8. computer parses symbols and operators from the LaTeX
  9. if symbols match symbols already used in this derivation, then associate with those; otherwise,
  10. computer searches Physics Derivation Graph database of symbols and operators to find candidate symbols
  11. if the computer had to search the PDG database, then computer provides candidate symbols to user and prompts, "which of the following symbols were you referring to?"
  12. computer parses expression from step 7 to SymPy, returns AST to user, and prompts, "is this the AST you meant?"
  13. computer uses brute force to check every inference rule using a CAS against the provided expressions to "guess" the inference rule (see the sketch after this list).
  14. if a valid inference rule is found, continue to next expression; if no valid inference rule is found, prompt user to provide the inference rule.
  15. Given the inference rule and associated expressions, use the CAS to verify the step.
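
A minimal sketch of the brute-force inference-rule guessing in step 13, using SymPy (the rule table and helper functions are illustrative):

import sympy

# candidate inference rules: each maps (input expression, feed value) to an output expression
def add_to_both_sides(expr, feed):
    return sympy.Eq(expr.lhs + feed, expr.rhs + feed)

def multiply_both_sides_by(expr, feed):
    return sympy.Eq(expr.lhs * feed, expr.rhs * feed)

CANDIDATE_RULES = {
    "add __ to both sides": add_to_both_sides,
    "multiply both sides by __": multiply_both_sides_by,
}

def guess_inference_rule(expr_in, expr_out, feed):
    """Return the name of the first candidate rule that maps expr_in to expr_out."""
    for name, rule in CANDIDATE_RULES.items():
        candidate = rule(expr_in, feed)
        if (sympy.simplify(candidate.lhs - expr_out.lhs) == 0 and
                sympy.simplify(candidate.rhs - expr_out.rhs) == 0):
            return name
    return None

a, b = sympy.symbols("a b")
print(guess_inference_rule(sympy.Eq(a, b), sympy.Eq(a + 2, b + 2), 2))
# -> "add __ to both sides"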

Tuesday, June 6, 2023

historical evolution of a git repo

JSON-like output
git log --date=format-local:'%Y-%m-%d %H:%M:%S' \
--pretty=format:'{%n  "commit": "%H",%n  "author": "%aN <%aE>",%n  "date": "%ad",%n  "message": "%f"%n},' > all_logs.dat

as per https://stackoverflow.com/a/34778736/1164295 and https://gist.github.com/textarcana/1306223

python3 -c "import json; 
with open('all_entries','r') as fh:
    content = json.load(fh)
print(content)"
Single line is better:
git log --date=format-local:'%Y-%m-%d %H:%M:%S' --pretty=format:"%H%x09%ae%x09%ad%x09%s" > all_hash

TODO:

  • how many commits per year?
  • sample the git repo at a given frequency and count number of files in the sample

general approach:

git clone [remote_address_here] my_repo
cd my_repo
git reset --hard [ENTER HERE THE COMMIT HASH YOU WANT]

as per https://stackoverflow.com/a/3555202/1164295

loop over relevant hashes:
git clone https://github.com/allofphysicsgraph/proofofconcept.git
cd proofofconcept
find . -type f | grep -v ".git" | wc -l
    3381
git reset --hard f12795798d2537d3fec80ba2b4d33396e52011bd
find . -type f | grep -v ".git" | wc -l
       2
number of commits in a year:
cat all_hash | grep 2014- | wc -l
      17
for year in {2014..2023}; do commits_per_year=`cat all_hash | grep ${year}- | wc -l`; echo $year $commits_per_year; done
2014 17
2015 234
2016 62
2017 41
2018 81
2019 30
2020 790
2021 67
2022 90
2023 5
for year in {2014..2023}; do this_hash=`cat all_hash | grep ${year}- | head -n 1 | cut -c-40`; git reset --hard $this_hash; file_count=`find . -type f | grep -v ".git" | wc -l`; echo $this_hash $year $file_count; done > counts_per_year.dat
cat counts_per_year.dat | grep -v HEAD
4289c2a3311d4e051bdab3b0d99f49b25dab6bc3 2014 1027
b81d6ddba5a2015d328975607318d7e7755b27aa 2015 3339
26b0d9fc8c49ede12c897b4bf4cd050765747a81 2016 2098
eec25f59649a4cc9e9e8b166355793b58b742672 2017 2194
201822fd2025349f8749b9433533d0d54c7363b3 2018 3007
918245c17bece668f868ce7201976e2d49dc1743 2019 3022
bd4fb0528c1a46ed2fac13aa16f77508aaa43e67 2020 3150
7dd27b734673e20e405cd26acbdf7d237cf73e33 2021 3343
ad8dfc5931922788f32a21f10906d97c50f7ca36 2022 3384
9df026b16827dfe97fc8a44c4063e493c21a49d4 2023 3384

Sunday, June 4, 2023

summarization, information retrieval, and creative synthesis

Large Language Models like ChatGPT are a hot topic due to the novelty of results in multiple application domains. Stepping back from the hype, the central capabilities seem to include summarization of content, information retrieval, and creative synthesis. Unfortunately those are not separate categories -- the summarization or information retrieval can contain hallucinations that get stated confidently.

Focusing on the topic of information retrieval and setting aside hallucinations, let's consider alternative mechanisms for search:  
  • plain text search, like what Google supports
  • boolean logic, i.e., AND/OR/NOT
  • use of special indicators like wild cards, quotes for exact search
  • regular expressions
  • graph queries for inference engines that support inductive, deductive, and abductive reasoning
Except for the last, those search mechanisms all return specific results from a previously collected set of sources. 

--> I expect conventional search to remain important. There are cases where I really am looking for a specific document and not a summarization.

--> Specialized search capabilities like regular expressions and wild cards will remain relevant for matching specific text strings. An LLM might provide suggestions on designing the regex?

--> Graph queries rely on bespoke databases that LLMs are not trained on currently. I'm not aware of any reason these can't be combined. 

The Physics Derivation Graph effectively provides a knowledge graph for mathematical Physics. Combining this with machine learning is feasible.

use of the Physics Derivation Graph is driven by incentives for individuals

Semantic tagging of documents has the potential to enrich the reader's experience because tagged content is easier to search. The burden of work is on the document author to provide the right tags. Worse, the document author has to find tags that are consistent with how they're used in other documents -- consistency of tags is necessary for search. The extra work of 1) tagging and 2) using consistent tags are two reasons semantic enrichment hasn't become mainstream.

The Physics Derivation Graph faces a similar challenge. If the Physics Derivation Graph relies on using appropriately annotated symbols (effectively equivalent to a subset of semantic tags), then the PDG has the same burdens of work on individual authors. 

The incentive for an individual researcher authoring a paper to use the Physics Derivation Graph comes when there's integration with a computer algebra system that can check the correctness of steps. Then the author benefits from immediate feedback before sharing with others for review.

Annotating symbols probably isn't sufficient to motivate the work, but integration with a computer algebra system could provide incentive. Currently, the use of a computer algebra system requires detailed steps to be specified by the author. 

There are ways to partially automate both symbol annotation and specifying steps. For symbol annotation, the computer could guess from context which symbols are being used. In a similar reliance on context, the user could provide leaps in specifying a derivation that the computer then tries to fill in with the detailed steps.