The LLM is not a database of facts. Historical events, dates, and places are not stored as exact references. LLMs generate their responses based on statistical probabilities derived from patterns in the training data.
The more widely documented something is, the better the LLM knows it
The LLM's knowledge is roughly proportional to how well the information is represented on the Internet. An LLM is more reliable and detailed when discussing common knowledge.
Precise questions using relevant jargon with context yield useful output
Poorly worded questions that do not use domain-specific terminology are less likely to produce clear answers.
Do not trust citations
The LLM does not have citations hard-coded into the network, so the citations it produces are likely to be hallucinated.
Decompose complex tasks and questions into a sequence of iterative prompts
There is a limited amount of "thinking" by the LLM per prompt, so simpler tasks are more likely to produce relevant answers.
Structure your question to produce a page or less of output
Producing a 200 page book from a single prompt devolves into hallucinations after a few pages. Shorter answers are more likely to remain lucid, so phrase your question in a way that can be answered with a small amount of text.
LLMs default to the average
While LLM output can be creative (in unexpected ways), seeking exceptional insight yields the mundane
Simplify your question to a one-shot prompt
Iterative questions are more likely to yield hallucinations
Delegation to an intern who doesn't learn
This can be confusing, as the LLM occasionally knows more than you do.
BLUF/tl;dr: four methods of requirements generation are described in this post: "one-shot think hard and brainstorm," learning from others, iterative adversarial feedback, and formal methods.
As a solo hobby developer I get to do what feels right with regard to my project. That autonomy and freedom applies to both prioritization and scope. Part of the joy of working is based on being able to follow my curiosity, and to do so at a time and rate of my choosing.
When I work with another person on a shared goal, then there is value in identifying how to divide tasks. For example, by skill, by interests, by availability.
A vision distills into use cases, and these use cases determine requirements which determine tasks. Then sequencing and parallelization of tasks can happen. Let's refer to this as the "think hard and brainstorm" waterfall method. The success of waterfall relies on the ability of planners to identify all contingencies before taking action. Use of an LLM for generating requirements fits in this category as an alternative to thinking hard.
If someone else has a similar situation, learning from their requirements is a valid way of making progress. Plagiarism is acceptable; no need for being original.
The optimistic waterfall method described above assumes that the incentives of the participants doing the tasks are aligned with the vision. If those participants are looking for the easiest way to satisfy a requirement, they may provide results that don't satisfy the vision.
If the folks satisfying a requirement may be adversarial, that can be accounted for in an iterative manner.
think hard and brainstorm to come up with an initial draft of requirements
provide the draft requirements to adversarial workers with the instruction, "provide a solution in a day." Leverage their creativity to produce results that technically meet the requirements but are insufficient.
Determine why the adversarial solutions (which do meet the requirements) don't satisfy the vision. Use that insight to develop better requirements.
Repeat the iterations until the requirements are "foolproof" for the sample pool of fools.
Another method of coming up with requirements is to use formal methods. For example,
"Program Derivation is the practice of beginning with a specification of a function, and by a series of mechanical steps, deriving an efficient implementation." (source: https://www.youtube.com/watch?v=JDqD6RZpnZA)
And here's an implementation in Python that I think provides that capability:
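A minimal sketch of such a script, reconstructed from the usage shown below; the int-before-float parsing order is an assumption made so that integer input prints without a trailing ".0".

```python
#!/usr/bin/env python3
# add_one.py -- add one to a user-provided value
import sys


def main() -> None:
    if len(sys.argv) != 2:
        print("Usage: python add_one.py <number>")
        sys.exit(1)
    raw = sys.argv[1]
    try:
        value = int(raw)  # try int first so "4" prints as "5", not "5.0"
    except ValueError:
        try:
            value = float(raw)
        except ValueError:
            print(f"Error: '{raw}' is not a valid float.")
            print("Usage: python add_one.py <number>")
            sys.exit(1)
    print(f"{raw} plus one is {value + 1}")


if __name__ == "__main__":
    main()
```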
When I run the above script I get
$ python3 add_one.py 4
4 plus one is 5
$ python3 add_one.py 4.2
4.2 plus one is 5.2
$ python3 add_one.py cat
Error: 'cat' is not a valid float.
Usage: python add_one.py <number>
Next I'm going to intentionally add a few bugs and then ask how to prove the implementation has no bugs:
I've added three bugs in v2: a deterministic bug, a random bug, and a bug that depends on the user's environment. A brute-force test would be expensive but could identify the first two bugs.
There are a couple of problems with v1 of the program requirement, "add one to a user-provided value."
The input range (unstated) is negative infinity to positive infinity.
Python does not have a built-in limit for the size of integers. The maximum integer value is restricted only by the available memory of the system.
Time-out conditions are unspecified. So if the program doesn't respond for 5 minutes, the requirements have nothing to say about that.
Rewrite the program requirements to be more specific:
If the "computer" doing this calculation has a 1Hz CPU clock frequency with 1byte of RAM, that might result in the Python program being "right" but the hardware being inadequate.
Also, let's make explicit the assumption that we are operating in base 10 numbers.
To be safe with the input string, let's bound that to be less than 1000 characters.
The revised implementation is
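A minimal sketch consistent with the revised requirements: the 1000-character bound and base-10 parsing come from the requirements above, while the function name, error messages, and structure are assumptions.

```python
#!/usr/bin/env python3
# add_one.py (revised) -- adds the explicit bounds from the revised requirements
import sys

MAX_INPUT_LENGTH = 1000  # revised requirement: input string under 1000 characters


def add_one(raw: str) -> str:
    """Parse a base-10 number from raw and return the report string."""
    if len(raw) >= MAX_INPUT_LENGTH:
        raise ValueError(f"input must be fewer than {MAX_INPUT_LENGTH} characters")
    try:
        value = int(raw, 10)  # base-10 integers (revised requirement)
    except ValueError:
        try:
            value = float(raw)  # base-10 floats
        except ValueError:
            raise ValueError(f"'{raw}' is not a valid float.") from None
    return f"{raw} plus one is {value + 1}"


def main() -> None:
    if len(sys.argv) != 2:
        print("Usage: python add_one.py <number>")
        sys.exit(1)
    try:
        print(add_one(sys.argv[1]))
    except ValueError as err:
        print(f"Error: {err}")
        print("Usage: python add_one.py <number>")
        sys.exit(1)


if __name__ == "__main__":
    main()
```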
Normal testing involves evaluating pre-determined cases, like "input 5, get 6" and "input 5.4, get 6.4" and "input 'cat', get error" and "input (nothing), get error."
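A minimal example-based test file for the sketch above (assuming it is saved as add_one.py); the expected strings follow the outputs shown earlier.

```python
# test_add_one_examples.py -- example-based tests with pre-determined cases
import pytest

from add_one import add_one  # assumes the sketch above is saved as add_one.py


def test_integer_input():
    assert add_one("4") == "4 plus one is 5"


def test_float_input():
    assert add_one("4.2") == "4.2 plus one is 5.2"


def test_non_numeric_input_is_rejected():
    with pytest.raises(ValueError):
        add_one("cat")
```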
Property-based testing (e.g., https://hypothesis.readthedocs.io/en/latest/) is where you "write tests which should pass for all inputs in whatever range you describe, and let Hypothesis randomly choose which of those inputs to check - including edge cases you might not have thought about."
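A property-based version of the same idea, again assuming the sketch above is importable as add_one; the integer bounds are chosen arbitrarily to stay well inside the 1000-character input limit.

```python
# test_add_one_properties.py -- property-based tests with Hypothesis
from hypothesis import given, strategies as st

from add_one import add_one  # assumes the sketch above is saved as add_one.py


@given(st.integers(min_value=-10**6, max_value=10**6))
def test_any_integer_is_incremented(n):
    # property: for every integer in range, the report states n + 1
    assert add_one(str(n)) == f"{n} plus one is {n + 1}"
```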
https://github.com/pschanely/CrossHair: a static verification tool for Python using symbolic execution. "Repeatedly calls your functions with symbolic inputs. It uses an SMT solver (a kind of theorem prover) to explore viable execution paths and find counterexamples for you."
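For a feel of what CrossHair checks, here is a minimal sketch using its docstring ("post:") contract syntax on a simplified integer-only function, checked with `crosshair check <file>`; the function and contract are illustrative, not taken from the add_one program.

```python
# crosshair_sketch.py -- a contract CrossHair can try to falsify symbolically
def add_one_int(x: int) -> int:
    """
    post: __return__ == x + 1
    """
    return x + 1
```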
https://github.com/formal-land/coq-of-python - Translate Python code to Coq code for formal verification. "formal-land" is a commercial company selling verification-as-a-service.
Design by Contract (https://en.wikipedia.org/wiki/Design_by_contract) approaches for Python include Dafny, Deal, and icontract. For Dafny you write the program in Dafny and compile to Python. For Deal you write Python and provide decorators.
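As a flavor of the decorator style, a minimal sketch using Deal's pre/post decorators; the toy function and conditions are illustrative assumptions, not taken from the post.

```python
import deal


@deal.pre(lambda x: x >= 0)             # precondition: caller supplies a non-negative int
@deal.post(lambda result: result >= 1)  # postcondition: the result is at least 1
def add_one_int(x: int) -> int:
    return x + 1
```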
"Dafny lifts the burden of writing bug-free code into that of writing bug-free annotations." Dafny was created by Rustan Leino at Microsoft Research. Dafny uses the Z3 automated theorem prover and Boogie.
Boogie is a simple programming language that is meant to be
an easy compile target (think "like JVM bytecode, but for proving code correct")
easy to analyze soundly
not actually intended to be executable
Instead of running Boogie programs, the Boogie compiler looks through the Boogie code to find assertions. For each assertion, the compiler generates a "verification condition", which is a formula based on a (symbolic) analysis of the program; the verification condition formula is constructed so that if the verification condition is true, the assertion holds.
It then hands those verification conditions, along with annotations in the program like assumptions, preconditions, postconditions, and loop invariants, to an SMT solver (Boogie uses Z3 by default). The SMT solver determines whether or not the assumptions definitely ensure the verification condition holds; Boogie complains about the assertions whose verification-conditions haven't been shown to hold.
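As a toy example: for a fragment that assumes x >= 0, assigns y := x + 1, and then asserts y > 0, the verification condition is roughly the implication (x >= 0) ==> (x + 1 > 0), which the SMT solver can confirm holds for all x. (This is an illustrative sketch of the idea, not Boogie's exact output.)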
Prototypes are impactful and efficient when they feature only the essential features. The consequence of that claim is that the prototype is janky (not easy to use), fragile (not robust), and shared prematurely (not "professional" looking). For software, a prototype might act on fake data and produce incorrect results.
After stakeholders provide feedback, the RoI has been confirmed and the direction for where to invest more effort is clarified -- what else is essential? Correctness is typically of interest, but that competes with ease-of-use, speed, and robustness.
In this post I add a third: contrasting variables.
The "contrasting variables" and approaches compare extremes like
fast (HEP) and slow (classical)
large (relativity) and small (quantum)
The question in this approach is whether there is a derivation that spans the two extremes of each pair.
A starting point for this approach would be to identify a specific variable (mass, velocity, temperature) and find derivations that involve the variable at each extreme.
Previously I considered the Physics Derivation Graph to be an unlikely dream, in that a complete representation of Physics is infeasible, and even a breadth-first approach is infeasible. My expectation of infeasibility was based on the constraint that data entry would not just require a significant number of person-hours; the data entry would also need to be done by highly qualified people with training tailored to the Physics Derivation Graph.
If the workflow can be segmented into small tasks, LLMs might provide relevant automation of complex tasks that require semantic context.
Overview of workflow
For example,
Given a corpus of papers on arxiv,
identify whether a paper has a derivation.
Given a paper with a derivation, can the steps be identified? (Associate specific inference rule)
Can the steps be verified (e.g., SymPy, Lean)?
For derivations with steps, what variables and constants are used in the paper?
Do these variables and constants appear in other derivations?
That's a combination of text search and semantic awareness of the text content. Not every step has to be "just a human" or "just an LLM" -- an augmented capability is a reasonable workflow.
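To make the verification step in the list above concrete, here is a minimal SymPy sketch; the equation, the inference rule, and the comparison logic are illustrative assumptions rather than the PDG's actual validation code.

```python
# sketch of checking one derivation step with SymPy:
# claimed step: "divide both sides of v = f*lambda by f" yields lambda = v/f
import sympy as sp

v, f, lam = sp.symbols("v f lambda", positive=True)

before = sp.Eq(v, f * lam)  # expression extracted from the paper
after = sp.Eq(lam, v / f)   # expression after the claimed step

# apply the inference rule "divide both sides by f" to `before`
derived = sp.Eq(before.lhs / f, before.rhs / f)

# the step is valid if `derived` and `after` express the same equality
print(sp.simplify(derived.lhs - derived.rhs) == sp.simplify(after.rhs - after.lhs))  # True
```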
Another enabling aspect is the LaTeX representation of arxiv papers. That makes parsing (by regex or by LLM) easier.
$ tar xzvf arXiv-2004.04818v2.tar.gz
x elemental.pdf
x fig3.pdf
x sound1.tex
x sound-speeds.pdf
The file sound1.tex is 37110 bytes (36kB), 302 lines, 37108 characters, and 5134 words.
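As a rough illustration of the regex route, a sketch that pulls displayed equations out of the extracted .tex file; the environment names are the common LaTeX ones and may not match what sound1.tex actually uses.

```python
# extract equation environments from the arxiv LaTeX source
import re

with open("sound1.tex", encoding="utf-8") as f:
    tex = f.read()

# match \begin{equation}...\end{equation}, \begin{align}...\end{align}, etc.
pattern = re.compile(r"\\begin\{(equation\*?|align\*?)\}(.*?)\\end\{\1\}", re.DOTALL)
equations = [body.strip() for _env, body in pattern.findall(tex)]
print(f"found {len(equations)} displayed equations")
```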
Loading the .tex into Gemini 2.0's prompt uses 10,518 tokens (with the max for Gemini 2.0 being 1,048,574 tokens).
Gemini 2.0 (on 2025-01-07) says
Yes, the article contains a derivation. It presents two approaches to derive the speed of sound, starting from system elasticity and vibrational properties, respectively. Both approaches lead to the same result, which is then used to find the upper bound for the speed of sound.
I ask a follow-up question,
Gemini's response:
I follow with another prompt,
Gemini's response:
Let's see if Gemini can extract symbol definitions:
Gemini's response:
The use of Unicode isn't desirable. Let's see if we can force the use of LaTeX using the prompt
Gemini's response:
Now let's see if Gemini can identify the steps of the derivation.
Gemini's response
Next step: extract the variables used in the above derivation to see if they are present in the Physics Derivation Graph by using the PDG API.
Gemini's response (2025-01-07):
TODO: the PDG API needs to be able to handle tasks associated with a workflow that uses an LLM.
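A sketch of the kind of lookup the workflow needs; the endpoint, parameters, and response handling below are hypothetical, since (per the TODO) the PDG API does not support this yet.

```python
# hypothetical sketch: look up whether LLM-extracted symbols exist in the PDG.
# the /api/v1/symbols endpoint and its parameters are invented for illustration.
import requests

symbols_from_llm = ["v", "f", "lambda"]  # symbols extracted by Gemini from the derivation

for symbol in symbols_from_llm:
    resp = requests.get(
        "https://allofphysics.com/api/v1/symbols",  # hypothetical endpoint
        params={"latex": symbol},
        timeout=10,
    )
    print(symbol, "found in PDG:", resp.ok and symbol in resp.text)
```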
tl;dr: I'm unsuccessful at getting Gemini 2.0 (on 2025-01-07) to relate Brewster's angle to the Law of Refraction. However, Gemini is able to provide the steps for simple (non-trivial) derivations.
Reformatting HTML table
The site https://aistudio.google.com/ provides Gemini 2.0 LLM for free.
To experiment with using it I wanted to provide the inference rules. There are 94 inference rules listed on https://allofphysics.com/list_all_inference_rules in an 8-column table.
After removing javascript, headers, and footers the HTML file has 3869 lines. That takes 32,164 tokens for the prompt
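The clean-up step might look something like the following sketch; the actual removal may have been done by hand, and the filename and selectors are assumptions.

```python
# strip javascript, styling, headers, and footers before pasting the table into the prompt
from bs4 import BeautifulSoup

with open("list_all_inference_rules.html", encoding="utf-8") as f:  # saved copy of the page
    soup = BeautifulSoup(f.read(), "html.parser")

for tag in soup(["script", "style", "header", "footer", "nav"]):
    tag.decompose()  # remove javascript, styling, headers, footers

table_html = str(soup.find("table"))  # assumes the inference rules are in the first table
print(f"{len(table_html.splitlines())} lines of table HTML remain")
```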
Gemini provided the correct 5-column HTML file. Huzzah!
Brewster's angle and Law of Refraction
Next step is to see if Gemini can use the inference rules for a derivation.
To which Gemini replied (on 2025-01-07),
"Unsure about answer"
That's reasonable, so I removed the inference rules and tried again
Again Gemini replied (on 2025-01-07),
"Unsure about answer"
I then relaxed the constraint about right answers only and encouraged step-by-step reasoning
Gemini's response (on 2025-01-07):
It's not possible to directly derive Brewster's angle from Snell's Law, or vice versa, using standard mathematical manipulations. Brewster's angle is a specific condition related to polarization, while Snell's Law describes refraction in general. Therefore, no sequence of inference rules can relate them directly. They are independent physical principles governing light behavior under different conditions.
Simple derivation
Gemini 2.0 on 2025-01-07 responds with
That's "good enough" in my view to be useful -- associating the relevant inference rule with specific steps in a multi-step derivation!
Running as root doesn't help
As per https://stackoverflow.com/a/71576242 I tried
As a consequence, while I can run the container,
I am unable to mount the local directory
separate the property graph backend from an API and have the web interface call the API
find a lighter weight alternative to neo4j backend
get a minimum viable product working using the existing property graph backend
separate static documentation pages from the dynamic paths
get docker working on my new Mac
That last one is the blocker for everything else, so getting Docker working on my new Mac is the priority. The good news is that the code does run on my old MacBook Air.
Next I'll work on getting a minimum viable product working using the existing property graph backend.