Sunday, December 27, 2020

ordered list representation in RDF

The Physics Derivation Graph depends on a data structure capable of representing ordered lists. RDF's support for ordered lists is slightly convoluted. The best visualization of ordered lists in RDF I've found is https://ontola.io/blog/ordered-data-in-rdf/

I tried sketching how the "linked recursive lists" approach looks for the Physics Derivation Graph: a derivation has a sequence of steps, and each step has ordered lists of inputs, feeds, and outputs. A small code sketch follows the figure below.



[Figure: sketch of the linked recursive lists structure for a derivation]
Credit: dreampuf.github.io
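
To make that concrete, here is a minimal sketch in Python using rdflib's Collection helper, which builds the rdf:first/rdf:rest linked list. The pdg: namespace and the derivation/step/expression names are made up for illustration; this is not the PDG's actual schema.

from rdflib import Graph, Namespace, BNode
from rdflib.collection import Collection

PDG = Namespace("https://derivationmap.net/ns#")  # hypothetical namespace
g = Graph()

# the ordered list of steps: pdg:derivation1 pdg:hasSteps ( pdg:step1 pdg:step2 )
steps_head = BNode()
Collection(g, steps_head, [PDG.step1, PDG.step2])
g.add((PDG.derivation1, PDG.hasSteps, steps_head))

# each step carries its own ordered lists, e.g. the inputs of step 1
inputs_head = BNode()
Collection(g, inputs_head, [PDG.expr1, PDG.expr2])
g.add((PDG.step1, PDG.hasInputs, inputs_head))

print(g.serialize(format="turtle"))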

Sunday, December 13, 2020

identifying classes in the Physics Derivation Graph for OWL (Web Ontology Language)

Classes and subclasses of entities in the Physics Derivation Graph (a toy Python rendering follows the list):

  • derivations = an ordered set of two or more steps
  • steps = a set of one or more statements related by an inference rule
  • inference rule = identifies the relation of a set of one or more statements
  • statement = two or more expressions (LHS and RHS) and a relational operator
    • expressions = an ordered set of symbols
    • symbols = a token
      • operator = applies to one or more values (aka operands). Property: number of expected values
      • value. Property: categorized as "variable" xor "constant"
        • integer = one or more digits. The set of digits depends on the base
        • float
        • complex
      • unit. Examples: "m" for meter, "kg" for kilogram
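
As a toy rendering, the subclass relations above could be written as Python classes (illustrative only; not part of the PDG codebase):

class Derivation: ...          # an ordered set of two or more steps
class Step: ...                # statements related by an inference rule
class InferenceRule: ...
class Statement: ...           # two or more expressions and a relational operator
class Expression: ...          # an ordered set of symbols
class Symbol: ...              # a token
class Operator(Symbol):
    arity: int                 # number of expected values (operands)
class Value(Symbol):
    kind: str                  # "variable" xor "constant"
class Integer(Value): ...
class Float(Value): ...
class Complex(Value): ...
class Unit(Symbol): ...        # e.g. "m" for meter, "kg" for kilogram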
Some aspects of expressions and derivations I don't have names for yet:
  • binary operators {"where", "for all", "when", "for"} used to relate two expressions: the "primary expression" on the left and one or more "scope"/"definition"/"constraint" statements (equations/inequalities) on the right

Some aspects of expressions and derivations I don't need to label in the PDG:
  • terms = parts of the expression that are connected with addition and subtraction
  • factors = parts of the expression that are connected by multiplication
  • coefficients = a number that is multiplied by a variable in a mathematical expression.
  • power, base, exponent
  • base (as in decimal vs hexadecimal, etc)
  • formula
  • function

An equation is two expressions linked with an equal sign. 
What is the superclass above "equation" and "inequality"?
So far I'm settling on "statement".

I am intentionally staying out of the realm of {proofs, theorems, axioms} both because that is outside the scope of the Physics Derivation Graph and because the topic is already addressed by OMDoc. 

Suppose we have a statement like
y = x^2 + b where x = {5, 3, 1}
In that statement, 
  • "y = x^2 + b" is an equation
  • "x^2 + b" is an expression and is related to the expression "y" by equality. 
  • "x^2" is a term in the RHS expression
  • "x = {5, 3, 1}" is an equation that provides scope for the primary equation. 
What is the "where" relation in the statement? The "where" is a binary operator that relates two equations. There are other "statement operators" to relate equations, like "for all"; see the statement
a + c = 2*g + k for all g \in \Re
In that statement, "g \in \Re" is (an equation?) serving as a scope for the primary equation. 

All statements have supplemental scope/definition equations that are usually left implicit. The reader is expected to deduce the scope of the statement from the surrounding context. 

The supplemental scope/definition equations describe both per-variable and inter-variable constraints. For example,
x*y + 3 = 94 where ((x \in \Re) AND (y \in \Re) AND (x<y))

More complicated statement:
f(x) = { 0 for x<0
       { 1 for 0<=x<=1
       { 0 for x>1
Here the LHS is a function and the RHS is an integer, but the value of the integer depends on x. 
Note that the "0<=x<=1" can be separated into "0<=x AND x<=1". Expanding this even more,
(f(x) = 0 for x<0) AND (f(x) = 1 for (0<=x AND x<=1)) AND (f(x) = 0 for x>1)
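
For what it's worth, that expanded form maps naturally onto SymPy's Piecewise, which pairs each value with its "for" condition. A sketch using the same f(x):

from sympy import Piecewise, And, Symbol

x = Symbol('x')
# each (value, condition) pair corresponds to one "for" clause above
f = Piecewise((0, x < 0), (1, And(0 <= x, x <= 1)), (0, x > 1))
print(f.subs(x, 0.5))  # 1
print(f.subs(x, 2))    # 0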

Saturday, December 12, 2020

an argument in support of RDF instead of property graphs

I've wrestled with whether to use property graphs to store and query the Physics Derivation Graph. I see potential value, but the licensing of Neo4j keeps me from committing. I'm aware of other implementations, but I don't have confidence in either their stability or their durability.

This post makes a convincing argument about both the shortcomings of a property-graph-based knowledge graph and the value of an RDF-based storage method. To summarize,

  • don't be distracted by visualization capabilities; inference is more important
  • property graph IDs are local, whereas identifiers in RDF are global
  • global IDs are vital for enabling federation, merging, and diffing

I know OWL (Web Ontology Language) is popular for knowledge representation, and this post was the first to provide a clear breakdown of the difference between property graphs, RDF, and OWL. OWL supports

  • the ability to infer that a node that is a member of a class is also a member of any of its superclasses (see the sketch after the links below)
  • properties can have superproperties
OWL overview:
  • https://www.cambridgesemantics.com/blog/semantic-university/learn-rdf/
  • https://www.cambridgesemantics.com/blog/semantic-university/learn-owl-rdfs/owl-101/
  • https://www.cambridgesemantics.com/blog/semantic-university/learn-owl-rdfs/
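
Here's a minimal sketch of that superclass inference using rdflib plus the owlrl reasoner (the class names are placeholders, not the PDG's actual ontology):

from rdflib import Graph, Namespace, RDF, RDFS
import owlrl

PDG = Namespace("https://derivationmap.net/ns#")  # hypothetical namespace
g = Graph()
g.add((PDG.operator, RDFS.subClassOf, PDG.symbol))
g.add((PDG.plus, RDF.type, PDG.operator))

# materialize the RDFS entailments; pdg:plus is then also a pdg:symbol
owlrl.DeductiveClosure(owlrl.RDFS_Semantics).expand(g)
print((PDG.plus, RDF.type, PDG.symbol) in g)  # True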

Saturday, November 21, 2020

log analysis of nginx access using Python Pandas

My first step is to review logins on the site,
https://physicsderivationgraph.blogspot.com/2020/05/inspecting-list-of-users-who-have.html

My previous post on reviewing logs
https://physicsderivationgraph.blogspot.com/2020/05/grepping-nginx-logs-to-observe-user.html
was written prior to the nginx log format I'm currently using.

I haven't gotten around to a deeper analysis like
https://physicsderivationgraph.blogspot.com/2020/04/analysis-of-web-logs-to-understand-how.html


First I had to install supporting software

  sudo apt install python3-pip
  pip3 install pandas

Inline Python in bash with Pandas is possible because every log line is formatted like a Python dictionary. Here I want to review which columns are present in the logs

cat nginx_access.log | python3 -c "import sys
import pandas
pandas.options.display.max_rows = 999 # https://pandas.pydata.org/pandas-docs/stable/user_guide/options.html
list_of_lines = []
for line in sys.stdin:
    list_of_lines.append(eval(line))
df = pandas.DataFrame(list_of_lines)
print(df.columns)
"
How many of each entry for a few columns?
cat nginx_access.log | python3 -c "import sys
import pandas
pandas.options.display.max_rows = 999 # https://pandas.pydata.org/pandas-docs/stable/user_guide/options.html
list_of_lines = []
for line in sys.stdin:
    list_of_lines.append(eval(line))
df = pandas.DataFrame(list_of_lines)
threshold = 20
print('user:')
vc = df['user'].value_counts()
print(vc[vc>threshold])
print('IP:')
vc = df['ip'].value_counts()
print(vc[vc>threshold])
print('req:')
vc = df['req'].value_counts()
print(vc[vc>threshold])
#print(df.head())
"
For IPs that have made many requests (e.g., more than 30), what pages have been accessed?
cat nginx_access.log | python3 -c "import sys
import pandas
pandas.options.display.max_rows = 999 # https://pandas.pydata.org/pandas-docs/stable/user_guide/options.html
list_of_lines = []
for line in sys.stdin:
    list_of_lines.append(eval(line))
df = pandas.DataFrame(list_of_lines)
threshold = 30
vc = df['ip'].value_counts()
for ip, number_of_requests in vc[vc>threshold].items():
    print('\nIP = ',ip, 'made',number_of_requests,'requests')
    df_this_ip = df[df['ip']==ip]
    #for request in df_this_ip['req'].values:
    #    print(request)
    print(df_this_ip['req'].value_counts())
"

Sunday, October 11, 2020

upgrading Ubuntu 18.04 to 20.04 on DigitalOcean VPS droplet

I've been running a DigitalOcean droplet for $5/month for the past 6 months. Because I was new and didn't know better, I selected the Ubuntu 18.04 droplet. 

Now I want to update to Ubuntu 20.04 LTS. 

The guide recommends starting with a fresh 20.04 image instead of upgrading. 

The following is a record of the steps I took in this process. 

Total duration: 2 hours. The process took longer than expected because I hadn't previously configured the website from a bare Ubuntu server. Also, I had made a few changes since the initial installation that weren't documented.

Step 1: collect all data prior to turning off the server

Used scp to copy data from the droplet to my Mac

scp user@IP:/home/pdg/arxiv_rss/rss_filter_email.py .
scp user@IP:/home/pdg/arxiv_rss/.env .
scp user@IP:/home/pdg/videos/* .
scp user@IP:/home/pdg/.bash_history .
scp user@IP:/home/pdg/.bashrc .
scp user@IP:/home/pdg/.python_history .
scp user@IP:/home/pdg/.sqlite_history .
cd proofofconcept/v7_pickle_web_interface/
scp user@IP:/home/pdg/proofofconcept/v7_pickle_web_interface/.env .
scp user@IP:/home/pdg/proofofconcept/v7_pickle_web_interface/certs/* .
scp user@IP:/home/pdg/proofofconcept/v7_pickle_web_interface/flask/logs/* .
scp user@IP:/home/pdg/.ssh/authorized_keys .

Grab the crontab entry

0 0 * * * /usr/bin/python3 /home/user/arxiv_rss/rss_filter_email.py >> /home/user/arxiv_rss/cron.log 2>&1

Step 2: power off the server and take a snapshot

https://www.digitalocean.com/docs/images/snapshots/how-to/snapshot-droplets/

Step 3: Start a new droplet

Selected Ubuntu 20.04

Step 4: configure accounts and access

adduser pdg
usermod -aG sudo pdg

ufw allow OpenSSH
ufw enable

Instead of creating new SSH key pairs, 
I imported my authorized_keys file to /home/pdg/.ssh/

To get the authorized_keys file I temporarily allowed password-based authentication for scp using
sudo vim /etc/ssh/sshd_config
change "PasswordAuthentication No" to "PasswordAuthentication Yes"
sudo service ssh restart
While I was there, I also changed "PermitRootLogin yes" to "PermitRootLogin no".
Once I had transferred the authorized_keys file, I reverted to "PasswordAuthentication No" and ran
sudo service ssh restart


sudo ufw allow 443
sudo ufw allow 80

Step 5: update OS


sudo apt-get update
sudo apt-get upgrade

Step 6: install metrics


sudo apt-get purge do-agent
curl -sSL https://repos.insights.digitalocean.com/install.sh -o /tmp/install.sh
sudo bash /tmp/install.sh
/opt/digitalocean/bin/do-agent --version

Step 7: install Docker and Docker-Compose
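
One way to do this on Ubuntu 20.04 (a sketch assuming the distribution packages rather than Docker's own apt repository):

sudo apt install docker.io docker-compose
sudo usermod -aG docker pdg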


Step 8: certs

sudo apt install certbot python3-certbot-nginx
sudo certbot certonly --webroot \
     -w /home/pdg/proofofconcept/v7_pickle_web_interface/certs \
     --server https://acme-v02.api.letsencrypt.org/directory \
     -d derivationmap.net -d www.derivationmap.net

Your certificate and chain have been saved at:
   /etc/letsencrypt/live/derivationmap.net/fullchain.pem
Your key file has been saved at:
   /etc/letsencrypt/live/derivationmap.net/privkey.pem
Your cert will expire on 2021-01-09.
https://security.stackexchange.com/questions/94390/whats-the-purpose-of-dh-parameters
cd /etc/ssl/certs
sudo openssl dhparam -out dhparam.pem 4096
cp dhparam.pem ~/proofofconcept/v7_pickle_web_interface/certs/

Step 9: restore data from backup

git clone https://github.com/allofphysicsgraph/proofofconcept.git
scp .env user@IP:/home/pdg/proofofconcept/v7_pickle_web_interface/
cd proofofconcept/v7_pickle_web_interface/flask
cp users_sqlite.db_TEMPLATE users_sqlite.db
cd ..
docker-compose up --build --remove-orphans --detach

Sunday, September 20, 2020

use the inputs and inference rule to generate the output

Instead of expecting the user to provide the inputs, the outputs, and the inference rule, supplying just the inputs and the inference rule is sufficient to generate the output. The generated output is necessarily consistent with the inputs and the inference rule.

>>> from sympy import *
>>> from sympy.parsing.latex import parse_latex

Define an inference rule

def mult_both_sides_by(expr, feed):
    return Equality(expr.lhs*feed, expr.rhs*feed, evaluate=False)
 
>>> expr = parse_latex('a = b')
>>> feed = parse_latex('f')
>>> mult_both_sides_by(expr, feed)
Eq(a*f, b*f)

This generalizes to include the relation

def mult_both_sides_by(expr, feed, relation):
    return relation(expr.lhs*feed, expr.rhs*feed, evaluate=False)
 
>>> mult_both_sides_by(expr, feed, Equality)
Eq(a*f, b*f)

Other relations are available; see https://docs.sympy.org/latest/modules/core.html
>>> mult_both_sides_by(expr, feed, Le)
a*f <= b*f

text to Latex to SymPy using frequency and period example

An illustration of the gradations from text to Latex to a CAS is provided below. In this derivation the CAS representation is 1-to-1 with the Latex.



statement

Frequency and period are inversely related.


statement with mathematical notation

Frequency and period are inversely related; thus T = 1/f and f = 1/T

statement with mathematical notation and explanation of derivation

Frequency and period are inversely related; thus T = 1/f
Multiply both sides by f, then divide by T to get f = 1/T.
statement with explanation of derivation, separating expressions from text

Frequency and period are inversely related; thus 
T = 1/f.
Multiply both sides by f to get
f T=1
then divide by T to get
f = 1/T.

statement with expressions separated from text and with bindings between math and text made explicit

Frequency and period are inversely related; thus 
expression 1: T = 1/f
Multiply both sides of expression 1 by f to get expression 2
expression 2: f T=1
then divide both sides of expression 2 by T to get expression 3
expression 3: f = 1/T.

statement with inference rules made explicit

claim: Frequency and period are inversely related; thus
inference rule: declare initial expression
expression 1: T = 1/f
inference rule: Multiply both sides of expression 1 by f to get expression 2
expression 2: f T=1
then 
inference rule: divide both sides of expression 2 by T to get expression 3
expression 3: f = 1/T.
inference rule: declare final expression


use of a computer algebra system to implement inference rules

The following expansion requires

  • conversion of Latex to SymPy
  • correctly implemented inference rules

>>> import sympy
>>> from sympy import *
>>> from sympy.parsing.latex import parse_latex

claim: Frequency and period are inversely related; thus
inference rule: declare initial expression
expression 1: T = 1/f

To confirm consistency of representations, the input Latex expression can be converted to SymPy and then back to Latex using

>>> latex(eval(sympy.srepr(parse_latex('T = 1/f'))))
'T = \\frac{1}{f}'

We'll work with the SymPy representation of expression 1,

>>> sympy.srepr(parse_latex('T = 1/f'))
"Equality(Symbol('T'), Pow(Symbol('f'), Integer(-1)))"

Rather than using the SymPy string representation, work with the parsed form of expression 1

>>> expr1 = parse_latex('T = 1/f')

inference rule: Multiply both sides of expression 1 by f to get expression 2
expression 2: f T=1

Although we can multiply a variable and an expression,

>>> expr1*Symbol('f')
f*(Eq(T, 1/f))

what actually needs to happen is to first split the expression into its two sides and then apply the multiplication to each side

>>> Equality(expr1.lhs*Symbol('f'), expr1.rhs*Symbol('f'))
Eq(T*f, 1)

Application of an inference rule (above) results in the desired result, so save that result as the second expression (below).

>>> expr2 = Equality(expr1.lhs*Symbol('f'), expr1.rhs*Symbol('f'))

inference rule: divide both sides of expression 2 by T to get expression 3
expression 3: f = 1/T.

>>> Equality(expr2.lhs/Symbol('T'), expr2.rhs/Symbol('T'))
Eq(f, 1/T)

Again, save that to a variable

>>> expr3 = Equality(expr2.lhs/Symbol('T'), expr2.rhs/Symbol('T'))

>>> latex(expr3)
'f = \\frac{1}{T}'

inference rule: declare final expression


statement with inference rules and numeric IDs for symbols

To relate the above derivation to any other content in the Physics Derivation Graph, replace T and f with numeric IDs unique to "period" and "frequency"

>>> import sympy
>>> from sympy import *
>>> from sympy.parsing.latex import parse_latex

claim: Frequency and period are inversely related; thus
inference rule: declare initial expression
expression 1: T = 1/f

>>> expr1 = parse_latex('T = 1/f')
>>> eval(srepr(expr1).replace('T','pdg9491').replace('f','pdg4201'))
Eq(pdg9491, 1/pdg4201)

Save the result as expression 1
>>> expr1 = eval(srepr(expr1).replace('T','pdg9491').replace('f','pdg4201'))

inference rule: Multiply both sides of expression 1 by f to get expression 2
expression 2: f T=1

>>> feed = Symbol('f')
>>> feed = eval(srepr(feed).replace('f','pdg4201'))
>>> Equality(expr1.lhs*feed, expr1.rhs*feed)
Eq(pdg4201*pdg9491, 1)
>>> expr2 = Equality(expr1.lhs*feed, expr1.rhs*feed)

inference rule: divide both sides of expression 2 by T to get expression 3
expression 3: f = 1/T.

>>> feed = Symbol('T')
>>> feed = eval(srepr(feed).replace('T','pdg9491'))
>>> Equality(expr2.lhs/feed, expr2.rhs/feed)
Eq(pdg4201, 1/pdg9491)
>>> expr3 = Equality(expr2.lhs/feed, expr2.rhs/feed)

Convert the numeric IDs back to Latex symbols in the Latex expression
>>> latex(eval(srepr(expr3).replace('pdg9491','T').replace('pdg4201','f')))
'f = \\frac{1}{T}'

inference rule: declare final expression

removal of text, pure Python

The above steps can be expressed as a Python script with two functions (one for each inference rule)

from sympy import *
from sympy.parsing.latex import parse_latex

# assumptions: the inference rules are correct, the conversion of symbols-to-IDs is correct, the Latex-to-SymPy parsing is correct

def mult_both_sides_by(expr, feed):
    return Equality(expr.lhs*feed, expr.rhs*feed)

def divide_both_sides_by(expr, feed):
    return Equality(expr.lhs/feed, expr.rhs/feed)

# inference rule: declare initial expression
expr1 = parse_latex('T = 1/f')
expr1 = eval(srepr(expr1).replace('T','pdg9491').replace('f','pdg4201'))

feed = Symbol('f')
feed = eval(srepr(feed).replace('f','pdg4201'))
expr2 = mult_both_sides_by(expr1, feed)

feed = Symbol('T')
feed = eval(srepr(feed).replace('T','pdg9491'))
expr3 = divide_both_sides_by(expr2, feed)

print(latex(eval(srepr(expr3).replace('pdg9491','T').replace('pdg4201','f'))))
# inference rule: declare final expression


How would the rigor of the above be increased?

To get beyond what a CAS can verify, a "proof" would relate each of the two functions to a set of axioms. Given the two arguments (an expression, a "feed" value), is the returned value always consistent with some set of axioms?
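
As a sketch of what that could look like, the multiplication inference rule can be stated as a checked theorem in a proof assistant such as Lean (one possible formalization, not something the PDG currently does):

import Mathlib

-- mult_both_sides_by as a theorem over the reals:
-- if x = y then x * f = y * f, proved by rewriting with the hypothesis
theorem mult_both_sides_by (x y f : ℝ) (h : x = y) :
    x * f = y * f := by
  rw [h]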

The set of axioms chosen matters. For example, we could start with Zermelo–Fraenkel set theory.

That would leave a significant gap between building up addition and subtraction and getting to calculus and differential equations. "Theorems of calculus derive from the axioms of the real, rational, integer, and natural number systems, as well as set theory." (source)