Saturday, June 10, 2023

Finding Derivations of Newton's Law of Universal Gravitation online

Question: was Newton's law of universal gravitation found empirically (from measurements) or can it be derived mathematically? The short answer is yes -- both. The historical process is separate from what can be done now.

Empirical (historical) route

The second paragraph of the Wikipedia article on Newton's Law of Universal Gravitation says, "This is a general physical law derived from empirical observations by what Isaac Newton called inductive reasoning." There's a video on YouTube describing how Newton started with

  • acceleration of gravity on Earth, g = 9.8 m/s^2
  • distance between Earth and Moon = 60*(radius of Earth)
  • radius of Earth = 6.4*10^6 m
  • orbital period of the Moon = 27.32 days

Newton figured out how fast the Moon is circling the Earth using

 velocity = distance/time
where, in the case of the Moon circling the Earth,
 distance the Moon travels around the Earth = circumference = 2*pi*r_{orbit of Moon}

Plugging in numbers, velocity of Moon around Earth = 1022 m/s. That can then be plugged into the centripetal acceleration,

a_{centripetal} = v^2/r
How does a_{centripetal} compare to the gravitational acceleration g?
g/a_{centripetal} \approx 60^2

Noticing that the factor of 60 appears both in the Earth-Moon distance and (squared) in the acceleration ratio, Newton figured that gravitation follows an inverse square law. Newton then checked this against data from observational studies of the planets.
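As a quick numerical sanity check of the argument above (the Earth radius value is my insertion; everything else is as listed):

# recompute the Moon's orbital speed and centripetal acceleration from the listed values
import math

g = 9.8                  # m/s^2, acceleration of gravity at Earth's surface
r_earth = 6.4e6          # m, radius of Earth
r_orbit = 60 * r_earth   # m, distance between Earth and Moon
T = 27.32 * 24 * 3600    # s, orbital period of the Moon

v = 2 * math.pi * r_orbit / T    # ~1022 m/s
a_centripetal = v**2 / r_orbit   # ~0.0027 m/s^2

print(round(v), "m/s")
print(round(g / a_centripetal))  # ~3600 = 60^2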

That's a big leap to F = G*(M*m)/(r^2), so are there more mathematical routes?

Mathematical Derivations

From nasa.gov

I first found NASA's classroom materials. There are some leaps that I wasn't able to follow. The content of that page is duplicated on a related NASA page. The derivation starts with F=ma and a_{centripetal} = v^2/r. The author cites Kepler's Third Law but writes T^2 \approx R^3 (which is dimensionally inconsistent) when they mean T^2 \propto R^3. The misuse of \approx versus \propto continues throughout the rest of the derivation.

velocity = distance/time
and the distance here is the circumference, 2*pi*R, so
period T = (2*pi*R)/v

Drop the 2*pi (a constant, so the proportionality is unaffected) to get

v \propto R/T
Square both sides to get v^2 \propto R^2/T^2, then apply Kepler's Third Law,
T^2 \propto R^3,
to get
v^2 \propto 1/R

The second source of my confusion is subscripting versus superscripting -- the page renders v_2 where v^2 is meant.

F = (m*(v^2))/R
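Combining F = (m*(v^2))/R with v^2 \propto 1/R gives F \propto m/R^2, which is the inverse-square form. A minimal SymPy check of that combination (the constant k stands in for the unspecified proportionality constant):

# substitute v^2 = k/R into F = m*v^2/R to expose the inverse-square dependence
import sympy as sp

m, R, v, k = sp.symbols('m R v k', positive=True)

F = m * v**2 / R              # centripetal force
F_sub = F.subs(v**2, k / R)   # apply v^2 proportional to 1/R

print(F_sub)  # k*m/R**2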

I tried submitting a correction to NASA's feedback page but couldn't since the Captcha is missing. :(


From a tutorial

Next I found another student-oriented page with a derivation that is even less helpful than NASA's. The derivation presented starts from angular speed, using F = m*r*\omega^2 (equivalent to F = m*v^2/r, since v = \omega*r).


YouTube tutorial

Happily, I found this helpful video from Daniel M, "Deriving Newton's Law of Universal Gravitation".

automating entry of derivations into the Physics Derivation Graph website

What would it take to integrate support for symbol detection and conversion to SymPy for a single step in a derivation? (A parsing sketch follows the list.)
  1. user provides initial expression in LaTeX to the web UI.
  2. computer parses symbols and operators from the LaTeX
  3. computer searches Physics Derivation Graph database of symbols and operators to find candidate symbols
  4. computer provides candidate symbols to user and prompts, "which of the following symbols were you referring to?"
  5. computer parses expression to SymPy, returns AST to user, and prompts, "is this the AST you meant?"
  6. if yes, continue; if no, go back to step 1 or provide corrections to AST.
  7. user provides next expression in LaTeX
  8. computer parses symbols and operators from the LaTeX
  9. if the symbols match symbols already used in this derivation, then associate with those; otherwise proceed to the database search in step 10
  10. computer searches Physics Derivation Graph database of symbols and operators to find candidate symbols
  11. if computer had to search PDG database, then computer provides candidate symbols to user and prompts, "which of the following symbols were you referring to?"
  12. computer parses expression from step 7 to SymPy, returns AST to user, and prompts, "is this the AST you meant?"
  13. computer uses brute force to check every inference rule using a CAS against the provided expressions to "guess" the inference rule. 
  14. if valid inference rule is found, continue to next expression; if no valid inference rule is found, prompt user to provide inference rule.
  15. Given the inference rule and associated expressions, use the CAS to verify the step.
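As a rough sketch of steps 1 through 5, here's what the parsing could look like using SymPy's LaTeX parser (sympy.parsing.latex.parse_latex, which requires the antlr4 runtime). The symbol database here is mocked as a plain dictionary; the real PDG schema would differ.

# sketch of steps 1-5: parse a LaTeX expression, look up candidate symbols, show the AST
from sympy import srepr
from sympy.parsing.latex import parse_latex  # needs antlr4-python3-runtime installed

# hypothetical stand-in for the PDG symbol database
pdg_symbols = {"v": "speed", "r": "radius", "m": "mass", "T": "period"}

expr = parse_latex(r"a = \frac{v^2}{r}")  # steps 1-2: user input, parsed

for sym in expr.free_symbols:             # steps 3-4: candidate symbols for the user
    print(sym, "->", pdg_symbols.get(str(sym), "no candidate found"))

print(srepr(expr))                        # step 5: the AST, for user confirmation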

Tuesday, June 6, 2023

historical evolution of a git repo

JSON-like output
git log --date=format-local:'%Y-%m-%d %H:%M:%S' \
--pretty=format:'{%n  "commit": "%H",%n  "author": "%aN <%aE>",%n  "date": "%ad",%n  "message": "%f"%n},' > all_logs.dat

as per https://stackoverflow.com/a/34778736/1164295 and https://gist.github.com/textarcana/1306223

python3 -c "import json; 
with open('all_entries','r') as fh:
    content = json.load(fh)
print(content)"
A single line per commit is better:
git log --date=format-local:'%Y-%m-%d %H:%M:%S' --pretty=format:"%H%x09%ae%x09%ad%x09%s" > all_hash

TODO:

  • how many commits per year?
  • sample the git repo at a given frequency and count number of files in the sample

general approach:

git clone [remote_address_here] my_repo
cd my_repo
git reset --hard [ENTER HERE THE COMMIT HASH YOU WANT]

as per https://stackoverflow.com/a/3555202/1164295

loop over relevant hashes:
git clone https://github.com/allofphysicsgraph/proofofconcept.git
cd proofofconcept
find . -type f | grep -v ".git" | wc -l
    3381
git reset --hard f12795798d2537d3fec80ba2b4d33396e52011bd
find . -type f | grep -v ".git" | wc -l
       2
number of commits in a year:
cat all_hash | grep 2014- | wc -l
      17
for year in {2014..2023}; do commits_per_year=`cat all_hash | grep ${year}- | wc -l`; echo $year $commits_per_year; done
2014 17
2015 234
2016 62
2017 41
2018 81
2019 30
2020 790
2021 67
2022 90
2023 5
for year in {2014..2023}; do this_hash=`cat all_hash | grep ${year}- | head -n 1 | cut -c-40`; git reset --hard $this_hash; file_count=`find . -type f | grep -v ".git" | wc -l`; echo $this_hash $year $file_count; done > counts_per_year.dat
cat counts_per_year.dat | grep -v HEAD
4289c2a3311d4e051bdab3b0d99f49b25dab6bc3 2014 1027
b81d6ddba5a2015d328975607318d7e7755b27aa 2015 3339
26b0d9fc8c49ede12c897b4bf4cd050765747a81 2016 2098
eec25f59649a4cc9e9e8b166355793b58b742672 2017 2194
201822fd2025349f8749b9433533d0d54c7363b3 2018 3007
918245c17bece668f868ce7201976e2d49dc1743 2019 3022
bd4fb0528c1a46ed2fac13aa16f77508aaa43e67 2020 3150
7dd27b734673e20e405cd26acbdf7d237cf73e33 2021 3343
ad8dfc5931922788f32a21f10906d97c50f7ca36 2022 3384
9df026b16827dfe97fc8a44c4063e493c21a49d4 2023 3384
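To see the repo growth at a glance, a quick matplotlib sketch that reads counts_per_year.dat as produced above (this is a suggestion, not part of the logged commands):

# plot number of files per year from counts_per_year.dat (hash, year, file_count)
import matplotlib.pyplot as plt

years, counts = [], []
with open('counts_per_year.dat') as fh:
    for line in fh:
        fields = line.split()
        # skip the "HEAD is now at ..." lines that git reset wrote to stdout
        if len(fields) == 3 and fields[1].isdigit():
            years.append(int(fields[1]))
            counts.append(int(fields[2]))

plt.plot(years, counts, marker='o')
plt.xlabel('year')
plt.ylabel('number of files in repo')
plt.savefig('files_per_year.png')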

Sunday, June 4, 2023

summarization, information retrieval, and creative synthesis

Large Language Models like ChatGPT are a hot topic due to the novelty of results in multiple application domains. Stepping back from the hype, the central capabilities seem to include summarization of content, information retrieval, and creative synthesis. Unfortunately those are not separate categories -- the summarization or information retrieval can contain hallucinations that get stated confidently.

Focusing on the topic of information retrieval and setting aside hallucinations, let's consider alternative mechanisms for search:  
  • plain text search, like what Google supports
  • boolean logic, i.e., AND/OR/NOT
  • use of special indicators like wild cards, quotes for exact search
  • regular expressions
  • graph queries for inference engines that support inductive, deductive, and abductive reasoning
Except for the last, those search mechanisms all return specific results from a previously collected set of sources. 

--> I expect conventional search to remain important. There are cases where I really am looking for a specific document and not a summarization.

--> Specialized search capabilities like regular expressions and wild cards will remain relevant for matching specific text strings. An LLM might provide suggestions on designing the regex?

--> Graph queries rely on bespoke databases that LLMs are not trained on currently. I'm not aware of any reason these can't be combined. 

The Physics Derivation Graph effectively provides a knowledge graph for mathematical Physics. Combining this with machine learning is feasible.

use of the Physics Derivation Graph is driven by incentives for individuals

Semantic tagging of documents has the potential to enrich the reader's experience because tagged content is easier to search. The burden of work is on the document author to provide the right tags. Worse, the document author has to find tags that are consistent with the tags used in other documents -- consistency of tags is necessary for search. This extra work of 1) tagging and 2) keeping tags consistent is why semantic enrichment hasn't become mainstream.

The Physics Derivation Graph faces a similar challenge. If the Physics Derivation Graph relies on using appropriately annotated symbols (effectively equivalent to a subset of semantic tags), then the PDG has the same burdens of work on individual authors. 

The incentive for an individual researcher authoring a paper to use the Physics Derivation Graph arises when the PDG is integrated with a computer algebra system that can check the correctness of steps. The author then benefits from immediate feedback before sharing the work with others for review.

Annotating symbols probably isn't sufficient to motivate the work, but integration with a computer algebra system could provide incentive. Currently, the use of a computer algebra system requires detailed steps to be specified by the author. 

There are ways to partially automate both symbol annotation and specifying steps. For symbol annotation, the computer could guess from context which symbols are being used. In a similar reliance on context, the user could provide leaps in specifying a derivation that the computer then tries to fill in with the detailed steps.
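To make "immediate feedback" concrete, here's a minimal SymPy sketch of checking a single step against an inference rule; the expressions and the "divide both sides by T" rule are illustrative, not the PDG schema:

# sketch: verify a "divide both sides by T" step with a CAS
import sympy as sp

v, r, T = sp.symbols('v r T', positive=True)

before = sp.Eq(v * T, 2 * sp.pi * r)   # premise provided by the author
after = sp.Eq(v, 2 * sp.pi * r / T)    # claimed result of the step

# apply the inference rule to the premise and compare with the claim
derived = sp.Eq(before.lhs / T, before.rhs / T)
step_is_valid = (sp.simplify(derived.lhs - after.lhs) == 0 and
                 sp.simplify(derived.rhs - after.rhs) == 0)
print(step_is_valid)  # True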

Wednesday, May 31, 2023

OpenAI's process supervision for math problems and relevance to the Physics Derivation Graph

OpenAI just announced (see https://openai.com/research/improving-mathematical-reasoning-with-process-supervision) progress on solving math problems using process supervision during training.

The data on https://github.com/openai/prm800k/tree/main comes from https://github.com/hendrycks/math (the dataset for https://arxiv.org/pdf/2103.03874.pdf), and some examples in that data come from https://artofproblemsolving.com/wiki/index.php/2015_AIME_II_Problems/Problem_6

AoPS describes itself as "Math texts, online classes, and more for students in grades 5-12."

The problems are constrained and feel very artificial. See for example https://artofproblemsolving.com/wiki/index.php/Mock_AIME_1_Pre_2005_Problems/Problem_4

The training data doesn't include inference rules, so the output from the LLM doesn't have inference rules either. As a consequence, the output of the LLM cannot be confirmed by a computer algebra system; the output text has to be validated by a human. Since LLMs hallucinate answers that sound reasonable, checking each step is still vital.

The ability to resolve distinct variables across all of Mathematical Physics is beyond the scope of the training data. 

On a positive note, if the Physics Derivation Graph content existed, I now think an LLM-based approach could be used to make progress in Mathematical Physics.

Saturday, May 27, 2023

tracing Python in a JSON-based workflow is untenable

I added

import sys

def mytrace(frame, event, arg):
    # print every function call with its local variables, and every return value
    if event == "call":
        print("call", frame.f_code.co_name, frame.f_locals)
    elif event == "return":
        print("return", frame.f_code.co_name, arg)
    return mytrace

sys.settrace(mytrace)

to https://github.com/allofphysicsgraph/proofofconcept/blob/gh-pages/v7_pickle_web_interface/flask/controller.py but the output wasn't that useful since I'm passing the entire database as JSON. The full JSON shows up in almost every argument and return value, making the output of the trace unreadable.
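One possible mitigation (an untested sketch): truncate each repr so the full JSON doesn't swamp every line of the trace.

# variant of mytrace that truncates long values to keep the trace readable
import sys

MAX_LEN = 80  # arbitrary cutoff for printed values

def short(obj):
    text = repr(obj)
    return text if len(text) <= MAX_LEN else text[:MAX_LEN] + "...<truncated>"

def mytrace(frame, event, arg):
    if event == "call":
        args = {name: short(value) for name, value in frame.f_locals.items()}
        print("call", frame.f_code.co_name, args)
    elif event == "return":
        print("return", frame.f_code.co_name, short(arg))
    return mytrace

sys.settrace(mytrace)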

When I switch to the Neo4j/Cypher-based approach, the trace might be more usable.