Sunday, November 2, 2025

finding an alternative to Neo4j; a schema for SQL

I had been using Neo4j's property graph for the Physics Derivation Graph for a few years and recently stopped because the constantly running server consumes 1GB of memory. (That 1GB of RAM is unrelated to the size of the underlying database, which in my case was a few hundred nodes.)

I realized that what I want is a SQLite-like "offline" database that is only noticeable in CPU and RAM use when there's a query. I'm OK with the database's per-query initialization cost resulting in slower queries.

Search for Alternatives

https://jgrapht.org/ and NetworkX are in-memory graph libraries rather than persistent databases, so they are irrelevant to the needs of the Physics Derivation Graph.

https://en.wikipedia.org/wiki/Blazegraph has been unsupported since 2018.

KuzuDB was exactly what I was interested in -- an embedded, offline property graph database compatible with Cypher. Coincidentally, https://kuzudb.github.io/ announced Oct 10 (3 weeks ago) that the project was being terminated immediately.
https://news.ycombinator.com/item?id=45560036
https://www.reddit.com/r/Database/comments/1o3w73z/status_of_kuzudb_from_kuzu_inc/
https://www.theregister.com/2025/10/14/kuzudb_abandoned/
A fork: https://github.com/Kineviz/bighorn

Possible contenders: I didn't find any other viable options.

However, the most reliable approach may be to switch to SQLite and use LLMs to craft SQL equivalents of Cypher queries. This is inelegant (I had been exposing read-only Cypher query capability on the web) and requires coming up with a relational schema (something I have no experience with).

Convert Property Graph to Relational Database for SQL

Here's the prompt I provided Gemini 2.5 Pro:
I have a complicated property graph in Neo4j and want to convert to SQLite. Design a schema for the relational database.

The nodes present in the property graph representation are below.


node type: 'derivation'
properties={'author_name_latex': str,
 'name_latex': str,
 'id': str,
 'abstract_latex': str,
 'reference_latex': str,
 'created_datetime': str}

node type: 'step'
properties={'author_name_latex': str,
 'note_before_step_latex': str,
 'note_after_step_latex': str,
 'id': str,
 'created_datetime': str}

node type: 'feed'
properties={'author_name_latex': str,
 'id': str,
 'sympy': str,
 'lean': str,
 'latex': str}

node type: 'expression'
properties={'author_name_latex': str,
 'name_latex': str,
 'description_latex': str,
 'reference_latex': str,
 'id': str,
 'latex': str,
 'lean': str,
 'sympy': str}

node type: 'operation'
properties={'author_name_latex': str,
 'name_latex': str,
 'argument_count': int,
 'description_latex': str,
 'reference_latex': str,
 'id': str,
 'latex': str}

node type: 'scalar'
properties={'author_name_latex': str,
 'name_latex': str,
 'description_latex': str,
 'id': str,
 'reference_latex': str,
 'latex': str,
 'scope': str,
 'variable_or_constant': str,
 'domain': str,
 'dimension_length': int,
 'dimension_time': int,
 'dimension_mass': int,
 'dimension_temperature': int,
 'dimension_electric_charge': int,
 'dimension_amount_of_substance': int,
 'dimension_luminous_intensity': int}

node type: 'value_and_units'
properties={'author_name_latex': str,
 'id': str,
 'dimension_length_unit': int,
 'dimension_time_unit': int,
 'dimension_mass_unit': int,
 'dimension_temperature_unit': int,
 'dimension_electric_charge_unit': int,
 'dimension_amount_of_substance_unit': int,
 'dimension_luminous_intensity_unit': int}

node type: 'vector'
properties={'author_name_latex': str,
 'name_latex': str,
 'description_latex': str,
 'reference_latex': str,
 'id': str,
 'latex': str}

node type: 'matrix'
properties={'author_name_latex': str,
 'name_latex': str,
 'description_latex': str,
 'id': str,
 'reference_latex': str,
 'latex': str}

node type: 'quantum_operator'
properties={'author_name_latex': str,
 'name_latex': str,
 'description_latex': str,
 'requires_arguments': bool,
 'reference_latex': str,
 'id': str,
 'latex': str}

node type: 'inference_rule'
properties={'author_name_latex': str,
 'number_of_inputs': int,
 'number_of_outputs': int,
 'number_of_feeds': int,
 'name_latex': str,
 'id': str,
 'latex': str}

The nodes and edges occur only in certain combinations described below.


edge 'HAS_STEP' properties={'sequence_index': int}
connects node type "derivation"
to node type "step"

edge 'HAS_INPUT' properties={'sequence_index': int}
connects node type "step"
to node type "expression"

edge 'HAS_FEED' properties={'sequence_index': int}
connects node type "step"
to node type "feed"

edge 'HAS_OUTPUT' properties={'sequence_index': int}
connects node type "step"
to node type "expression"

edge 'HAS_INFERENCE_RULE'
connects node type "step"
to node type "inference_rule"


edge 'HAS_SYMBOL'
connects node type "expression"
to node type "scalar"

edge 'HAS_SYMBOL'
connects node type "expression"
to node type "vector"

edge 'HAS_SYMBOL'
connects node type "expression"
to node type "matrix"

edge 'HAS_SYMBOL'
connects node type "expression"
to node type "operation"

edge 'HAS_SYMBOL'
connects node type "expression"
to node type "quantum_operator"

edge 'HAS_SYMBOL'
connects node type "feed"
to node type "scalar"

edge 'HAS_SYMBOL'
connects node type "feed"
to node type "vector"

edge 'HAS_SYMBOL'
connects node type "feed"
to node type "matrix"

edge 'HAS_SYMBOL'
connects node type "feed"
to node type "operation"

edge 'HAS_SYMBOL'
connects node type "feed"
to node type "quantum_operator"


edge 'HAS_VALUE'
connects node type "scalar"
to node type "value_and_units"


Response from Gemini:

Converting a property graph to a relational model involves translating nodes into tables and edges into relationships (either through foreign keys or dedicated join tables). Here is a comprehensive SQLite schema designed to represent your Neo4j graph structure, balancing relational purity with the flexibility of the original graph.

Design Principles

  • Nodes as Tables: Each distinct node type (`derivation`, `step`, `expression`, etc.) will become its own table.
  • Properties as Columns: The properties of each node will become columns in the corresponding table. The node's `id` will serve as the `PRIMARY KEY`.
  • Edges as Relationships:
    • One-to-Many / One-to-One Edges: These are modeled using a foreign key column in the "many" or "child" table. For example, since a `step` belongs to one `derivation`, the `steps` table will have a `derivation_id` column.
    • Many-to-Many Edges: These are modeled using a separate "join table" (or "link table") that contains foreign keys to the two connected tables. For example, a `step` can have many `expression` inputs, and an `expression` can be an input to many `steps`. This requires a `step_inputs` join table.
  • Polymorphic Edges: The `HAS_SYMBOL` edge is polymorphic because it can connect to multiple different types of symbol nodes (`scalar`, `vector`, etc.). To handle this cleanly without losing foreign key integrity, we will create a dedicated join table for each specific relationship (e.g., `expression_scalars`, `expression_vectors`). A more "graph-like" but less relationally pure alternative is discussed in the rationale section.

SQLite Schema Definition

Here are the `CREATE TABLE` statements for the SQLite database.

Node Tables

These tables store the data for each type of node.
-- The top-level container for a mathematical derivation.
CREATE TABLE derivations (
    id TEXT PRIMARY KEY,
    author_name_latex TEXT,
    name_latex TEXT,
    abstract_latex TEXT,
    reference_latex TEXT,
    created_datetime TEXT NOT NULL
);

-- Represents a single logical step in a derivation.
-- It connects to a derivation (one-to-many) and an inference_rule (many-to-one).
CREATE TABLE steps (
    id TEXT PRIMARY KEY,
    derivation_id TEXT NOT NULL,
    inference_rule_id TEXT NOT NULL,
    sequence_index INTEGER NOT NULL, -- From the 'HAS_STEP' edge property
    author_name_latex TEXT,
    note_before_step_latex TEXT,
    note_after_step_latex TEXT,
    created_datetime TEXT NOT NULL,
    FOREIGN KEY (derivation_id) REFERENCES derivations(id) ON DELETE CASCADE,
    FOREIGN KEY (inference_rule_id) REFERENCES inference_rules(id) ON DELETE RESTRICT,
    UNIQUE (derivation_id, sequence_index) -- A derivation cannot have two steps at the same index
);

-- Represents the mathematical content of an expression.
CREATE TABLE expressions (
    id TEXT PRIMARY KEY,
    author_name_latex TEXT,
    name_latex TEXT,
    description_latex TEXT,
    reference_latex TEXT,
    latex TEXT,
    lean TEXT,
    sympy TEXT
);

-- Represents feed-in values or assumptions for a step.
CREATE TABLE feeds (
    id TEXT PRIMARY KEY,
    author_name_latex TEXT,
    sympy TEXT,
    lean TEXT,
    latex TEXT
);

-- Represents a rule of logic or mathematics used in a step.
CREATE TABLE inference_rules (
    id TEXT PRIMARY KEY,
    author_name_latex TEXT,
    name_latex TEXT,
    latex TEXT,
    number_of_inputs INTEGER,
    number_of_outputs INTEGER,
    number_of_feeds INTEGER
);

-- Represents a mathematical operation like addition or differentiation.
CREATE TABLE operations (
    id TEXT PRIMARY KEY,
    author_name_latex TEXT,
    name_latex TEXT,
    description_latex TEXT,
    reference_latex TEXT,
    latex TEXT,
    argument_count INTEGER
);

-- Represents a scalar value (a single number).
CREATE TABLE scalars (
    id TEXT PRIMARY KEY,
    author_name_latex TEXT,
    name_latex TEXT,
    description_latex TEXT,
    reference_latex TEXT,
    latex TEXT,
    scope TEXT,
    variable_or_constant TEXT,
    domain TEXT,
    dimension_length INTEGER,
    dimension_time INTEGER,
    dimension_mass INTEGER,
    dimension_temperature INTEGER,
    dimension_electric_charge INTEGER,
    dimension_amount_of_substance INTEGER,
    dimension_luminous_intensity INTEGER
);

-- Represents a vector.
CREATE TABLE vectors (
    id TEXT PRIMARY KEY,
    author_name_latex TEXT,
    name_latex TEXT,
    description_latex TEXT,
    reference_latex TEXT,
    latex TEXT
);

-- Represents a matrix.
CREATE TABLE matrices (
    id TEXT PRIMARY KEY,
    author_name_latex TEXT,
    name_latex TEXT,
    description_latex TEXT,
    reference_latex TEXT,
    latex TEXT
);

-- Represents a quantum mechanical operator.
CREATE TABLE quantum_operators (
    id TEXT PRIMARY KEY,
    author_name_latex TEXT,
    name_latex TEXT,
    description_latex TEXT,
    reference_latex TEXT,
    latex TEXT,
    requires_arguments INTEGER -- Using INTEGER for BOOLEAN (0=false, 1=true)
);

-- Represents the numerical value and units for a scalar.
-- This has a one-to-one relationship with a scalar.
CREATE TABLE values_and_units (
    id TEXT PRIMARY KEY,
    scalar_id TEXT NOT NULL UNIQUE, -- The UNIQUE constraint enforces the 1-to-1 relationship
    author_name_latex TEXT,
    dimension_length_unit INTEGER,
    dimension_time_unit INTEGER,
    dimension_mass_unit INTEGER,
    dimension_temperature_unit INTEGER,
    dimension_electric_charge_unit INTEGER,
    dimension_amount_of_substance_unit INTEGER,
    dimension_luminous_intensity_unit INTEGER,
    FOREIGN KEY (scalar_id) REFERENCES scalars(id) ON DELETE CASCADE
);

Relationship (Join) Tables

These tables model the many-to-many relationships defined by the edges.
-- Links steps to their input expressions.
CREATE TABLE step_inputs (
    step_id TEXT NOT NULL,
    expression_id TEXT NOT NULL,
    sequence_index INTEGER NOT NULL,
    PRIMARY KEY (step_id, sequence_index),
    FOREIGN KEY (step_id) REFERENCES steps(id) ON DELETE CASCADE,
    FOREIGN KEY (expression_id) REFERENCES expressions(id) ON DELETE CASCADE
);

-- Links steps to their output expressions.
CREATE TABLE step_outputs (
    step_id TEXT NOT NULL,
    expression_id TEXT NOT NULL,
    sequence_index INTEGER NOT NULL,
    PRIMARY KEY (step_id, sequence_index),
    FOREIGN KEY (step_id) REFERENCES steps(id) ON DELETE CASCADE,
    FOREIGN KEY (expression_id) REFERENCES expressions(id) ON DELETE CASCADE
);

-- Links steps to their feed-in values.
CREATE TABLE step_feeds (
    step_id TEXT NOT NULL,
    feed_id TEXT NOT NULL,
    sequence_index INTEGER NOT NULL,
    PRIMARY KEY (step_id, sequence_index),
    FOREIGN KEY (step_id) REFERENCES steps(id) ON DELETE CASCADE,
    FOREIGN KEY (feed_id) REFERENCES feeds(id) ON DELETE CASCADE
);

-- Join tables for the polymorphic 'HAS_SYMBOL' relationships
-- These tables link expressions and feeds to the various symbol types they contain.

-- Expression-to-Symbol Links
CREATE TABLE expression_symbols (
    expression_id TEXT NOT NULL,
    symbol_id TEXT NOT NULL,
    symbol_type TEXT NOT NULL, -- e.g., 'scalar', 'vector', 'matrix', 'operation', 'quantum_operator'
    PRIMARY KEY (expression_id, symbol_id),
    FOREIGN KEY (expression_id) REFERENCES expressions(id) ON DELETE CASCADE
);

-- Feed-to-Symbol Links
CREATE TABLE feed_symbols (
    feed_id TEXT NOT NULL,
    symbol_id TEXT NOT NULL,
    symbol_type TEXT NOT NULL, -- e.g., 'scalar', 'vector', 'matrix', 'operation', 'quantum_operator'
    PRIMARY KEY (feed_id, symbol_id),
    FOREIGN KEY (feed_id) REFERENCES feeds(id) ON DELETE CASCADE
);

Rationale and Explanation

Direct Relationships (`derivation` -> `step`, `step` -> `inference_rule`): The `HAS_STEP` edge connects one derivation to many steps. This is a classic one-to-many relationship, best modeled by adding a `derivation_id` foreign key to the `steps` table. The `sequence_index` from the edge property is also placed here. The same logic applies to the `HAS_INFERENCE_RULE` edge.

Step Inputs/Outputs/Feeds: A `step` can have multiple ordered inputs, outputs, and feeds. These are many-to-many relationships modeled with the join tables `step_inputs`, `step_outputs`, and `step_feeds`. The `sequence_index` property from the edge is a crucial part of the primary key to maintain order and uniqueness.
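As a sanity check on how the `sequence_index` columns get used, here is what reading back a step's ordered inputs looks like (a sketch against the schema above; the `?` placeholder stands for a step id):

-- Reconstruct the ordered list of input expressions for one step.
-- ORDER BY sequence_index recovers the ordering that the
-- 'HAS_INPUT' edge property carried in the property graph.
SELECT si.sequence_index, e.latex
FROM step_inputs si
JOIN expressions e ON e.id = si.expression_id
WHERE si.step_id = ?
ORDER BY si.sequence_index;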

One-to-One Relationship (`scalar` -> `value_and_units`): The `HAS_VALUE` edge is modeled by adding a `scalar_id` foreign key to the `values_and_units` table. Adding a `UNIQUE` constraint on `scalar_id` ensures that each scalar can have at most one `value_and_units` entry, perfectly modeling the one-to-one relationship.

Handling Polymorphism (`HAS_SYMBOL`): This is the most complex relationship to model. An `expression` or a `feed` can contain many symbols of different types.

  • Chosen Approach: The schema uses two tables, `expression_symbols` and `feed_symbols`. Each table stores the ID of the source (`expression_id` or `feed_id`), the ID of the target symbol (`symbol_id`), and a `symbol_type` column to tell you which table to join with to get the symbol's details (e.g., `scalars`, `vectors`).
  • Advantage: This approach is pragmatic. It keeps the number of tables manageable.
  • Trade-off: It sacrifices full relational integrity because you cannot create a native `FOREIGN KEY` on `symbol_id` that points to five different tables. You would enforce this relationship at the application layer (e.g., when you retrieve an `expression`'s symbols, you read the `symbol_type` and then query the appropriate table).
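For example, under the chosen approach, collecting all of an expression's symbols takes one branch per symbol table (a sketch of the application-layer dispatch; the same pattern extends to the remaining symbol types):

-- Polymorphic HAS_SYMBOL lookup: read symbol_type, then join
-- against the matching symbol table.
SELECT es.symbol_id, es.symbol_type, s.latex
FROM expression_symbols es
JOIN scalars s ON s.id = es.symbol_id
WHERE es.expression_id = ? AND es.symbol_type = 'scalar'
UNION ALL
SELECT es.symbol_id, es.symbol_type, v.latex
FROM expression_symbols es
JOIN vectors v ON v.id = es.symbol_id
WHERE es.expression_id = ? AND es.symbol_type = 'vector'
-- (repeat the same pattern for matrices, operations, and quantum_operators)
;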

In my assessment, the SQL schema is inelegant compared to the property graph representation.
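To make the inelegance concrete: a one-line Cypher pattern match turns into a chain of joins. A sketch (my own example query, not from Gemini's response) that lists the LaTeX of every output expression in one derivation:

-- Rough Cypher equivalent:
--   MATCH (d:derivation {id: $id})-[:HAS_STEP]->(s:step)-[:HAS_OUTPUT]->(e:expression)
--   RETURN e.latex
SELECT e.latex
FROM derivations d
JOIN steps s ON s.derivation_id = d.id
JOIN step_outputs so ON so.step_id = s.id
JOIN expressions e ON e.id = so.expression_id
WHERE d.id = ?;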

The use of `ON DELETE CASCADE` is a guess by the LLM and is not always the desired behavior. (`ON DELETE CASCADE` means that if the parent row gets deleted, the linked child rows are automatically deleted too.)
The alternative is `ON DELETE RESTRICT` (or `NO ACTION`), which prevents deletion of the parent row as long as any child rows exist; the database returns an error instead. This is used when you want to protect the child data.
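Separately, a SQLite-specific caveat: none of these FOREIGN KEY clauses are enforced unless foreign key support is switched on for each database connection:

-- SQLite ships with foreign key enforcement disabled per connection;
-- without this pragma, the CASCADE/RESTRICT clauses above are ignored.
PRAGMA foreign_keys = ON;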


The above tailored-to-PDG schema is complicated. A simpler, more generic schema like the EAV one sketched below incurs a higher query cost.

From Gemini I learned that the EAV (Entity-Attribute-Value) model is a normalized and traditional approach that is portable across almost any relational database.
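Here is a minimal EAV-style sketch of that idea (my reconstruction, not Gemini's verbatim output; table and column names are illustrative): nodes, node properties, and edges each get one generic table, so the schema never changes when node types do.

-- Generic "graph in three tables" layout.
CREATE TABLE nodes (
    id TEXT PRIMARY KEY,
    node_type TEXT NOT NULL       -- 'derivation', 'step', 'expression', ...
);

CREATE TABLE node_properties (
    node_id TEXT NOT NULL,
    attribute TEXT NOT NULL,      -- e.g. 'latex', 'author_name_latex'
    value TEXT,                   -- everything stored as text
    PRIMARY KEY (node_id, attribute),
    FOREIGN KEY (node_id) REFERENCES nodes(id) ON DELETE CASCADE
);

CREATE TABLE edges (
    source_id TEXT NOT NULL,
    target_id TEXT NOT NULL,
    edge_type TEXT NOT NULL,      -- 'HAS_STEP', 'HAS_SYMBOL', ...
    sequence_index INTEGER,       -- NULL for unordered edges
    FOREIGN KEY (source_id) REFERENCES nodes(id) ON DELETE CASCADE,
    FOREIGN KEY (target_id) REFERENCES nodes(id) ON DELETE CASCADE
);

The trade-off is that every property read becomes an extra join against node_properties, and all type checking moves to the application layer.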

Saturday, November 1, 2025

links to papers on Hilbert's Sixth problem

https://en.wikipedia.org/wiki/Hilbert%27s_sixth_problem

"A Field Guide to Recent Work on the Foundations of Statistical Mechanics" by R. Frigg
arxiv.org/abs/0804.0399 and
https://romanfrigg.org/wp-content/uploads/writings/SM_Fieldguide.pdf

"Hilbert's Sixth Problem: the endless road to rigour" by A. N. Gorban
https://arxiv.org/abs/1803.03599


"Long time derivation of the Boltzmann equation from hard sphere dynamics" by Yu Deng, Zaher Hani, Xiao Ma
https://arxiv.org/abs/2408.07818

"Hilbert's sixth problem: derivation of fluid equations via Boltzmann's kinetic theory" by Yu Deng, Zaher Hani, Xiao Ma
https://arxiv.org/abs/2503.01800

Friday, October 17, 2025

I know more now than I did a few years ago

Why do I think I know more now than I did a few years ago?

  • Some of my assumptions have been validated or invalidated by experience of implementation
  • Some things I thought I knew have gained more nuance
  • I have gained insight on what needs to be learned - new domains, additional details
  • I better understand how previously disparate concepts are related
  • Implementation forces decisions about design that a mental model is not confronted with
  • Experience from implementing provides nuance and validation
  • Sometimes I have to undo previously made decisions due to new insights
  • I better understand which paths are dead ends and which are worth investment
Interestingly, even though I know that I need to learn more, I cannot know a priori which nuance is relevant or which path is a dead end.

Saturday, October 4, 2025

resizing to smaller VPS

I'm currently paying $12/month for my VPS (virtual private server) that has 2GB of RAM and 25GB of storage. I had been paying $6/month previously, but including Neo4j bumped the memory usage to nearly 100%. (CPU usage is averaging around 10% and spikes to 30%.)

I have a few iterations of the source code for the webserver so I'll need to figure out which is actually in use. After logging into the VPS, I see I have two folders:

allofphysics.com
ui_v7_website_flask_json

Running `docker images`, I see the latest image (from 7 months ago) is allofphysicscom-flask. That doesn't help me figure out which repo is in use.

allofphysics.com/docker-compose.yaml has neo4j, whereas ui_v7_website_flask_json/docker-compose.yaml does not. Therefore I'm currently operating out of allofphysics.com/docker-compose.yaml.

I have two options: either revert to the "v7" code or disable neo4j in "allofphysics.com".

Going with option 1,

~/allofphysics.com$ docker compose down
~/allofphysics.com$ cd ~/ui_v7_website_flask_json
~/ui_v7_website_flask_json$ git pull
~/ui_v7_website_flask_json$ make

Turns out the packages are pretty old. The opencv package wasn't compatible. After I reverted all pip packages to known-good version numbers, I found the crypto libraries weren't happy. Eventually I was able to get a docker image built.

Next the certs weren't present so I copied those from ~/allofphysics.com/certs/ and that worked.

To free up disk space I ran

docker images | grep none | tr -s ' ' | cut -d' ' -f3 | xargs docker rmi

I powered down the VPS, changed to a smaller size, powered on, and then ran make to get a running webserver.

Tuesday, June 17, 2025

best practices for use of LLMs

I've previously written about best practices for prompts. This post is more abstract.


Avoid asking factual questions

The LLM is not a database of facts. Historical events, dates, places are not stored as exact references. LLMs generate their response based on statistical probabilities derived from patterns.

The more widely documented something is, the better the LLM knows it

The LLM's knowledge of a topic is roughly proportional to how well that topic is represented on the Internet. An LLM is more reliable and detailed when discussing common knowledge.

Precise questions using relevant jargon with context yield useful output

Poorly worded questions that do not use domain-specific terminology are less likely to produce clear answers.

Do not trust citations

The LLM does not have citations hard-coded into the network. Citations are most likely to be hallucinations.

Decompose complex tasks and questions into a sequence of iterative prompts

There is a limited amount of "thinking" by the LLM per prompt, so simpler tasks are more likely to produce relevant answers.

Structure your question to produce a page or less of output

Producing a 200-page book from a single prompt devolves into hallucinations after a few pages. Shorter answers are more likely to remain lucid, so phrase your question in a way that can be answered with a small amount of text.

LLMs default to the average

While LLM output can be creative (in unexpected ways), seeking exceptional insight yields the mundane.

Simplify your question to a one-shot prompt

A long back-and-forth conversation is more likely to yield hallucinations; when decomposing a task (as advised above), phrase each prompt so it stands on its own.


Delegation to an intern who doesn't learn

This can be confusing, as the LLM occasionally knows more than you do.

Wednesday, May 7, 2025

arm64 and amd64 docker builds

Makefile

# Get the machine architecture.
# On arm64 (Apple Silicon M1/M2/etc.), `uname -m` outputs "arm64".
# On amd64 (Intel), `uname -m` outputs "x86_64".
ARCH := $(shell uname -m)

ifeq ($(ARCH), arm64)
        this_arch=arm64
else ifeq ($(ARCH), x86_64)
        this_arch=amd64
else
        @echo "Unknown architecture: $(ARCH). Cannot determine if Mac is new or old."
endif


docker push multiple platforms using buildx

I wasn't able to get docker manifest to amend both amd64 and arm64, so I used the buildx approach:

docker buildx build --push --platform linux/arm64,linux/amd64 --tag benislocated/allofphysicscom-flask:latest .


References

https://www.docker.com/blog/multi-arch-build-and-images-the-simple-way/

Sunday, May 4, 2025

requirements generation

BLUF/tl;dr: methods of requirements generation are described in this post: "one-shot think hard and brainstorm", learn from others, iterative adversarial feedback, and formal models.

As a solo hobby developer I get to do what feels right with regard to my project. That autonomy and freedom applies to both prioritization and scope. Part of the joy of working is based on being able to follow my curiosity, and to do so at a time and rate of my choosing.

When I work with another person on a shared goal, then there is value in identifying how to divide tasks. For example, by skill, by interests, by availability.

A vision distills into use cases, and these use cases determine requirements which determine tasks. Then sequencing and parallelization of tasks can happen. Let's refer to this as the "think hard and brainstorm" waterfall method. The success of waterfall relies on the ability of planners to identify all contingencies before taking action. Use of an LLM for generating requirements fits in this category as an alternative to thinking hard.

If someone else has a similar situation, learning from their requirements is a valid way of making progress. Plagiarism is acceptable; no need for being original.


The optimistic waterfall method described above assumes that the incentives of the participants doing the tasks are aligned. If the participants are looking for the easiest solution to each requirement, they may provide results that don't satisfy the vision.

If the folks satisfying a requirement may be adversarial, that can be accounted for in an iterative manner.

  1. think hard and brainstorm to come up with an initial draft of requirements
  2. provide the draft requirements to adversarial workers with the instruction, "provide a solution in a day." Leverage their creativity to produce an insufficient result.
  3. Determine why the adversarial solutions (which do meet the requirements) don't satisfy the vision. Use that insight to develop better requirements.

Repeat the iterations until requirements are "foolproof" for the sample pool of fools.


A fourth method of coming up with requirements is to use formal methods. For example,

"Program Derivation is the practice of beginning with a specification of a function, and by a series of mechanical steps, deriving an efficient implementation." (source: https://www.youtube.com/watch?v=JDqD6RZpnZA)
https://p-org.github.io/P/ and https://github.com/p-org/P
https://www.youtube.com/watch?v=FdXZXnkMDxs
https://www.youtube.com/watch?v=tZnX585r4ms