Saturday, August 11, 2018

connecting Jupyter and Neo4j

Jupyter

$ cd v5_property_graph
$ jupyter notebook

Web browser opens to the URL
http://localhost:8888/tree
Then open a new Python notebook.

Neo4j

I'm running Neo4j Community version 3.2.3 on a Mac. I start the client GUI and then open a browser window to
http://127.0.0.1:7474/browser/

Connect Jupyter to Neo4j

from py2neo import authenticate, Graph, Node, Relationship
authenticate("127.0.0.1:7474", "neo4j", "asdf")
graph = Graph("http://127.0.0.1:7474/db/data/")
graph.delete_all()

For more, see this notebook.

Saturday, August 4, 2018

Neo4j for the Physics Derivation Graph

I've been focusing my efforts on the interactive user prompt, a Python-based CLI for the Physics Derivation Graph. Effectively, I'm working through a finite state machine with associated actions for each option. (Tangential task: a pictorial representation of the state machine would be useful.)

I've used Neo4j for other tasks associated with knowledge representation, so I'm surprised I haven't considered property graphs for storing the PDG (there's no mention in my old notes or issues, nor anything meaningful besides a generic link on the wiki).

One of the potential benefits of using a property graph over a normal graph is the labeling of edges. Currently when there are multiple input expressions or feeds to an inference rule, it's not clear which input is referenced. For example, consider "IntOverFromTo" which has the LaTeX expansion, "Integrate Eq.~\ref{eq:#4} over $#1$ from lower limit $#2$ to upper limit $#3$." There are three feeds. Without labeling which feed is which, the substitution is undetermined.

With a property graph, the inference rule would have pre-defined labeled edges, eg "lower_limit" and "upper_limit" and "integrate_wrt."
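As a sketch of the idea (the names and identifiers here are hypothetical, not the actual PDG schema), a step could map edge labels to node identifiers so the substitution into the LaTeX expansion is unambiguous:

```python
# Sketch of a property-graph step with labeled feed edges (hypothetical
# names, not the actual PDG schema). Each edge label says which feed
# fills which slot of the inference rule's LaTeX expansion.
step = {
    "inference_rule": "IntOverFromTo",
    "feeds": {                          # labeled edges to feed nodes
        "integrate_wrt": "feed_0001",   # fills #1 in the expansion
        "lower_limit": "feed_0002",     # fills #2
        "upper_limit": "feed_0003",     # fills #3
    },
    "input_expression": "eq_0004",      # fills #4
}

def describe(step, latex_of):
    """Fill the IntOverFromTo template using the labeled edges."""
    f = step["feeds"]
    return ("Integrate Eq. %s over %s from lower limit %s to upper limit %s"
            % (latex_of[step["input_expression"]],
               latex_of[f["integrate_wrt"]],
               latex_of[f["lower_limit"]],
               latex_of[f["upper_limit"]]))

latex_of = {"feed_0001": "x", "feed_0002": "0", "feed_0003": "L",
            "eq_0004": r"\ref{eq:0004}"}
print(describe(step, latex_of))
```

With unlabeled edges the three feeds would be interchangeable; the labels are what make the substitution deterministic.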

Benefits to using the property graph include
  • visualization tools are more likely to exist, rather than me having to code up a d3js-based web display.
  • querying and editing the graph uses standard syntax, rather than relying on me creating a Python-based CLI with pre-set abilities. 
  • the current data structure is a list of dictionaries in memory and a set of CSV files in directories; using Neo4j I wouldn't need to manage the data structure and could still translate back to plain text
  • adding additional properties (ie LaTeX for expressions versus SymPy, comments, weblinks) would be more scalable than the current data structure and schema which is manually crafted.
  • cross-platform compatibility is not lost



Thursday, August 2, 2018

cleaning up the code using pylint and flake8 and bandit

I realized that with so much Python, there's a need to clean up the code.
https://www.youtube.com/watch?v=G1lDk_WKXvY
In this post I document a few software tools I used.

Pylint

$ pylint interactive_user_prompt.py --disable bad-whitespace,missing-docstring,superfluous-parens,bad-indentation,line-too-long,trailing-whitespace,len-as-condition,too-many-locals,invalid-name,too-many-branches,too-many-return-statements,too-many-statements --reports=n

and flake8

$ flake8 --ignore=E111,E225,E231,E501,E226,W291,E221,E115,E201,W293,E261,E302,E265 interactive_user_prompt.py

Not surprisingly, some of my functions are complicated (a score of greater than 10 is frowned upon)
$ python -m mccabe --min 9 interactive_user_prompt.py | wc -l
      15
$ python -m mccabe --min 15 interactive_user_prompt.py | wc -l
       4
So 15 functions scored 9 or greater; 4 functions were 15 or higher!

That's out of 50 functions and 1946 lines of Python (including comments and blank lines):
$ cat interactive_user_prompt.py | wc -l
    1946
$ cat interactive_user_prompt.py | grep "^def " | wc -l
      50
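The grep only catches top-level functions. A stdlib-only sketch (written for this post, not part of the project) using the ast module would also count nested functions and class methods:

```python
# Count function definitions with the stdlib ast module instead of
# grep "^def ", which misses nested functions and class methods.
import ast

def count_functions(source):
    tree = ast.parse(source)
    return sum(isinstance(node, ast.FunctionDef) for node in ast.walk(tree))

sample = """
def top():
    def nested():
        pass

class C:
    def method(self):
        pass
"""
print(count_functions(sample))  # counts top, nested, and method
```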

Although I'm not concerned about security of a locally run Python script, I also tried bandit:
$ bandit -r interactive_user_prompt.py
which complained about my use of the shell.

I'm aware of autopep8 but haven't used it yet.



Tuesday, July 31, 2018

mathjax for github.io site

For the project website hosted at http://allofphysicsgraph.github.io/proofofconcept/ I've been using static PNGs generated off-line to render the expressions in the graph using d3js.

In order to dynamically enter content on a webpage without resorting to off-line rendering, I used MathJax to display content.

The javascript for MathJax is at
https://github.com/mathjax/MathJax/blob/master/MathJax.js
with instructions for use here
https://docs.mathjax.org/en/latest/configuration.html

I was able to get a page that accepts LaTeX input and renders it:
http://allofphysicsgraph.github.io/proofofconcept/site/mjtest
Source code for the page is here:
https://github.com/allofphysicsgraph/proofofconcept/blob/gh-pages/site/mjtest.html

Next I ran scaling tests for latency as a function of the number of rendered expressions in Chrome.

25 expressions:
  • http://allofphysicsgraph.github.io/proofofconcept/site/mjtest_scaling_25
  • DOMContentLoaded: 203 ms; Load: 516 ms; Finish: 835 ms
50 expressions:
  • http://allofphysicsgraph.github.io/proofofconcept/site/mjtest_scaling_50
  • DOMContentLoaded: 202 ms; Load: 548 ms; Finish: 977 ms
100 expressions:
  • http://allofphysicsgraph.github.io/proofofconcept/site/mjtest_scaling_100
  • DOMContentLoaded: 220 ms; Load: 538 ms; Finish: 1140 ms

Sunday, July 22, 2018

Python: convert XML to dictionary

#https://stackoverflow.com/questions/13101653/python-convert-complex-dictionary-of-strings-from-unicode-to-ascii
# Recursively convert unicode strings to UTF-8 (Python 2 only: relies on
# dict.iteritems and the unicode type). The parameter is renamed so it
# doesn't shadow the "input" builtin.
def convert(data):
    if isinstance(data, dict):
        return {convert(key): convert(value) for key, value in data.iteritems()}
    elif isinstance(data, list):
        return [convert(element) for element in data]
    elif isinstance(data, unicode):
        return data.encode('utf-8')
    else:
        return data

#https://docs.python-guide.org/scenarios/xml/
import xmltodict
with open('sample.xml') as fd:
  doc = xmltodict.parse(fd.read())
#print(doc)

# doc is an ordered dict containing unicode. 

#https://stackoverflow.com/questions/3860813/recursively-traverse-multidimensional-dictionary-dimension-unknown
import pprint
#pprint.pprint(doc) # expects dict, not ordered dict

#https://stackoverflow.com/questions/20166749/how-to-convert-an-ordereddict-into-a-regular-dict-in-python3
import json
# a round trip through JSON converts the OrderedDict to a plain dict
output_dict = json.loads(json.dumps(doc))

# remove the unicode from keys and values
doc = convert(output_dict)

pprint.pprint(doc)
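If installing xmltodict isn't an option, here's a minimal stdlib-only sketch of the same idea (my own toy code, with a made-up sample document; it ignores attributes and collapses repeated sibling tags, so it only suits simple XML):

```python
# Minimal XML-to-dict conversion using only the standard library.
# Unlike xmltodict, this sketch ignores attributes and collapses
# repeated sibling tags, so it only suits simple documents.
import xml.etree.ElementTree as ET

def etree_to_dict(element):
    children = list(element)
    if not children:
        return element.text        # leaf node: return its text
    return {child.tag: etree_to_dict(child) for child in children}

sample = "<root><name>mass</name><latex>m</latex></root>"
doc = etree_to_dict(ET.fromstring(sample))
print(doc)  # {'name': 'mass', 'latex': 'm'}
```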

Friday, July 20, 2018

analyzing the text of Wikipedia posts

In a previous post, an outline for analyzing Wikipedia content was described. In this post, I document a few initial observations about the data collected from Wikipedia.

Searching for "derivation" as a section marker means searching for "=== Derivation ===". The word "derivation" has other meanings, so the results sometimes include non-mathematical content like "=== Derivation and other names ===". To filter out irrelevant content, keep only sections that contain mathematical expressions (ie ":<math>").

In addition to the text, there are potentially relevant images like
https://en.wikipedia.org/wiki/File:Derivation_of_acoustic_wave_equation.png
which has dimensions 813 × 570 pixels. Pictures with "derivation" in the name and dimensions greater than 300 x 300 might be relevant.

In the "derivation" section, lines that start with ":<math>" in the text are expressions. The closing bracket "</math>" may occur on a following line. 
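A sketch of that extraction (my own toy code against a made-up wikitext snippet, not a parser for real Wikipedia dumps) tracks the open math block with a flag so a "</math>" on a later line still closes it:

```python
# Sketch: extract ":<math>" expressions from wikitext, where the
# closing "</math>" may occur on a following line. The sample
# snippet below is made up for illustration.
def extract_math(wikitext):
    expressions = []
    in_math = False
    current = []
    for line in wikitext.splitlines():
        if line.startswith(":<math>"):
            in_math = True
            current = [line[len(":<math>"):]]
        elif in_math:
            current.append(line)
        if in_math and "</math>" in line:
            in_math = False
            body = "\n".join(current)
            expressions.append(body[:body.index("</math>")].strip())
    return expressions

sample = """:<math>E = m c^2</math>
some text
:<math>F = m a
</math>"""
print(extract_math(sample))  # ['E = m c^2', 'F = m a']
```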

a different approach to generating content for the Physics Derivation Graph

I've been focused on creating the interface for the Physics Derivation Graph to enable manual entry of content. An alternative method to create content would be parsing large databases like Wikipedia.

The first step would be to extract pages that contain derivations. Pages with a section title containing "derivation" and containing at least three mathematical expressions in that section would be a useful set. Suppose there are a thousand pages containing derivations which contain text+Latex.

Side note: Wikipedia "text is available under the Creative Commons Attribution-ShareAlike License."

Given 1000 pages of text+Latex there are two nested challenges:

  1. Between any two adjacent expressions in your data set, there are likely a bunch of missing steps.
  2. Suppose all the expressions were present. Even in that situation, the inference rules are missing. Filling in these is a big challenge.

To address these challenges, text analysis would be useful. Suppose the sequence is
  • text1
  • expression1
  • text2
  • expression2
  • text3
  • expression3
  • text4
There are a few distinct categories of text to analyze:
  • s1 = the last two sentences in "text1" preceding "expression1"
  • s(i) = if text2 and text3 are short (ie a few sentences), then they are potential inference rules
  • s(j) = if text2 and text3 are longer than a few sentences, then probably the two sentences following an expression and the two sentences preceding an expression are relevant
  • sf = the first two sentences of "text4", the text after the last expression.
We now have 1000 instances of "s1" sentences. In this "s1" data set, what's the most common word? What's the most common two word phrase? What's the most common three word phrase? If there are things that look like inference rules, that would be interesting. I doubt that "declare initial expression" will appear, but some consistency would be validating.

Similarly, run the same word and phrase frequency analysis for the 1000 "sf" sentences. Also apply to each of "s(i)" and "s(j)."
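The word and phrase frequency analysis can be sketched with collections.Counter (the sentences below are toy examples, not real Wikipedia data):

```python
# Sketch of word and n-word phrase frequency counting for the "s1"
# sentence set. The sentences here are toy examples, not real data.
from collections import Counter

def ngram_counts(sentences, n):
    counts = Counter()
    for sentence in sentences:
        words = sentence.lower().split()
        for i in range(len(words) - n + 1):
            counts[" ".join(words[i:i + n])] += 1
    return counts

s1 = ["substituting into the wave equation",
      "substituting into the momentum relation",
      "integrating both sides"]
print(ngram_counts(s1, 1).most_common(3))  # most common words
print(ngram_counts(s1, 3).most_common(1))  # most common three-word phrase
```

Running this over the real "s1", "sf", "s(i)", and "s(j)" sets would surface whatever consistent phrasing exists around expressions.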

Thursday, July 19, 2018

relevant posts on reddit

Open:

https://old.reddit.com/r/Physics/comments/8vurwq/derivation_for_dummies_a_quick_guide_to_one_of/

Closed:

https://old.reddit.com/r/Physics/comments/725dz1/keeping_track_of_derivations/

https://old.reddit.com/r/Physics/comments/1v0wap/derivations_of_equations/

https://old.reddit.com/r/Physics/comments/1c7uas/derivation_of_the_schrodinger_equation_in_under/

https://old.reddit.com/r/Physics/comments/7y9gkz/are_derivations_worth_it/

forums to contribute to

I don't expect people to serendipitously stumble upon this blog, the github page, or the project page. Therefore, part of my responsibility is to socialize the existence of my effort on forums that interested parties may already be a part of. Building a community, gaining a user base, and finding collaborators are potential outcomes.

I don't want to simply advertise on these channels. Instead, I intend to provide value to address challenges participants face. By demonstrating value, the community for the Physics Derivation Graph grows. If the PDG doesn't provide value, then I shouldn't expect a community to develop.

Brainstorming relevant channels,

Monday, July 16, 2018

Physics of Minecraft derivation - the graph is unwieldy

There's a video on the Physics of Minecraft which measures gravitational acceleration in Minecraft. I wanted to see how well the projectile motion is described by the Physics Derivation Graph.
Screenshot from 0:35 in the video. Useful commentary is on news.ycombinator. The post on Wired.com was basic.

On paper, the derivation was 4 expressions and two lines of text. The Physics Derivation Graph yields a cumbersome 7 expressions and a total of 25 nodes (feeds, expressions, and inference rules).

Current output from the Physics Derivation

The graph is large because the "subXforY" inference rule is used three times. Analysis of the midpoint really involves three concurrent substitutions: y=y_mid, v_horizontal=0, and t=t_midpoint. Concurrent substitutions are not supported, so three steps are required.

Also, the current implementation lacks support for comments.

Wednesday, July 11, 2018

The Physics Derivation Graph is for workflow management

There's no complicated math underlying the Physics Derivation Graph. The code base is primarily about tracking numeric indices and strings in Python dictionaries and lists read from plain text CSVs.

Similarly, there are no fancy algorithms in the software.

The lack of complicated math and fancy algorithms is because the Physics Derivation Graph is for workflow management of mathematical Physics. Encoding the logic and processes is merely management of simple data (numerical indices, strings to represent math).

Saturday, July 7, 2018

snapshot of milestones for the Physics Derivation Graph


Each node is a milestone/task. The task points towards a follow-on task.

There are two components: one focused on LaTeX-based input, the other on syntactically meaningful content.

Wednesday, July 4, 2018

static analysis of function dependency in Python

With 1400 lines of Python, I wanted to find a way to visualize the static dependencies of functions internal to the script
https://github.com/allofphysicsgraph/proofofconcept/blob/gh-pages/v4_file_per_expression/bin/interactive_user_prompt.py

I looked at PyCallGraph but it only supports dynamic call graphs. In addition to Pyan and Snakefood, I found a blog post that included an AST parser as a single file.
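A minimal sketch of that AST approach (written for this post; this is not the construct_call_graph.py script used below) collects, for each function, the names it calls:

```python
# Toy static call-graph extraction with the stdlib ast module: for each
# function definition, record the names of the functions it calls.
# Not the construct_call_graph.py script referenced in this post.
import ast

def call_graph(source):
    tree = ast.parse(source)
    graph = {}
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            calls = set()
            for inner in ast.walk(node):
                if isinstance(inner, ast.Call) and isinstance(inner.func, ast.Name):
                    calls.add(inner.func.id)
            graph[node.name] = sorted(calls)
    return graph

sample = """
def helper():
    pass

def main():
    helper()
    print("done")
"""
print(call_graph(sample))  # {'helper': [], 'main': ['helper', 'print']}
```

Emitting each key-value pair as a "caller -> callee;" line would produce a GraphViz dot file like the one below.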

python construct_call_graph.py -i ../../proofofconcept/v4_file_per_expression/bin/interactive_user_prompt.py > graph.dot

Then add "overlap=false;" to the graph.dot file and render it:

neato -Tpng graph.dot -o graph.png

which yields


Tuesday, June 19, 2018

why software used by the Physics Derivation Graph is open source

The Physics Derivation Graph content is available under the Creative Commons Attribution 4.0 International License, and I use open source software.

I avoid software that is not open source and not free.
My motivations for this include
  • wider accessibility of the results due to fewer constraints
  • enable other people to build on top of the results
  • contribute back to the community which has provided so much 

community growth and diverging interests

Recently the number of developers involved in the Physics Derivation Graph has started to unexpectedly increase. As the number of people involved grows, the objectives diversify. People with shared interest can progress faster by collaborating.

There are multiple channels for people engaging in the project:

  • the github "group" has multiple repos, each a distinct but related effort
  • the gitter group has multiple discussions and one-to-one threads
  • email threads
  • phone and Skype calls
The idea of splitting the team was suggested in order to address the complexity of interaction. Having multiple threads going concurrently can be distracting to members. Then the question is how focused team members can be in the presence of distractions.

I usually let things like this go until there's a clear need for change. The necessity of the change usually helps push which option is preferable. Thinking ahead about potential change is useful, but the action can be delayed until needed. 

Once the split occurs, what coordination among the groups is needed? How does it occur?

Monday, June 18, 2018

converting the old derivations into the new folder structure

Context: this post documents a one-time fix that converts old derivations to the new convention.

In the "version 4" implementation of the Physics Derivation Graph, there are two directories that contain Latex and PNG files of expressions and feeds:

  • in the directory of the derivation, ie "proofofconcept/v4_file_per_expression/derivations/frequency period relation"
  • in the directory containing all expressions, ie "proofofconcept/v4_file_per_expression/expressions"

One reason for this redundancy is to enable modification of the derivation content without disrupting the complete graph. Also, deconflicting expression index collisions doesn't need to be carried out until the derivation is verified. Lastly, I don't have an automatic method for deconflicting expression indices.

As of today (20180618), the derivations which were created manually don't follow the convention of having LaTeX and PNG in the folder of the specific derivation. These older derivations only have the LaTeX and PNG in the shared expressions directory.

In order to enable editing of existing derivations, I needed to copy expressions and feeds from the shared folder into each derivation. To do this, I started in a specific derivation folder and copied only the relevant LaTeX and PNG into the folder.

pwd
proofofconcept/v4_file_per_expression/derivations/derivation of Schrodinger Eq
while IFS='' read -r line || [[ -n "$line" ]]; do cp ../../feeds/${line}* .; done < feeds.csv
while IFS='' read -r line || [[ -n "$line" ]]; do cp ../../expressions/${line}* .; done < <(cat expression_identifiers.csv | cut -d',' -f2)

Saturday, June 2, 2018

building a docker image for the Physics Derivation Graph

Requirements

In order to build the Physics Derivation Graph in a Docker image, the minimum functionality needed is
  • Compile .tex files to .pdf and .png (LaTeX)
  • Compile .gv files to .png (GraphViz)
  • Run a webserver (eg flask or nginx or lighttpd)
  • Run .py scripts (Python 3)
  • Read SQLite
In this post I explore whether Alpine is a sufficient OS. If not, Ubuntu is a candidate OS which supports the needed functionality.

Alpine-based

The build will be executed from the "proofofconcept" folder because the contents of the Docker image depend on files in the v4_file_per_expression/ folder.

cd proofofconcept
mkdir sandbox/docker_images/python_alpine
cat > sandbox/docker_images/python_alpine/Dockerfile << EOF

FROM python:2.7-alpine

MAINTAINER My Name <my.email.address@gmail.com>

LABEL distro_style="apk" distro="alpine" arch="x86_64" operatingsystem="linux"

RUN apk add --update --no-cache graphviz
RUN apk add --update --no-cache texlive-full
RUN apk add --update --no-cache texlive

RUN pip install pyyaml
RUN pip install sympy

RUN mkdir /derivations
RUN mkdir /inference_rules

ADD ./v4_file_per_expression/bin/interactive_user_prompt.py interactive_user_prompt.py
ADD ./v4_file_per_expression/lib/lib_physics_graph.py /lib/lib_physics_graph.py
ADD ./v4_file_per_expression/inference_rules/* /inference_rules/

#WORKDIR /bin

CMD ["python", "interactive_user_prompt.py"]

EOF

Now that the Dockerfile exists, we can build the image:
docker build --tag python_alpine/example sandbox/docker_images/python_alpine/
and run it to get the interactive prompt:
docker run -ti python_alpine/example

The build fails, however, because neither "texlive" nor "texlive-full" exists as an Alpine package.


Ubuntu-based

cd proofofconcept
mkdir sandbox/docker_images/python_ubuntu

cat > sandbox/docker_images/python_ubuntu/Dockerfile << EOF

# 20180602

FROM ubuntu:18.04

MAINTAINER My Name <my.email.address@gmail.com>

RUN apt-get update \
    && apt-get upgrade -y \
    && apt-get install -y \
    python-pip \
    python2.7 \
    graphviz    

RUN pip install pyyaml
RUN pip install sympy

RUN mkdir /derivations
RUN mkdir /inference_rules

ADD ./v4_file_per_expression/bin/interactive_user_prompt.py interactive_user_prompt.py
ADD ./v4_file_per_expression/lib/lib_physics_graph.py /lib/lib_physics_graph.py
ADD ./v4_file_per_expression/inference_rules/* /inference_rules/

#WORKDIR /bin
#ENTRYPOINT ["/usr/bin/python2.7"]

CMD ["python", "interactive_user_prompt.py"]

EOF

docker build --tag python_ubuntu/example sandbox/docker_images/python_ubuntu/

Stop and remove all images

Not unexpectedly, I ran out of disk space.

docker stop $(docker ps -a -q)
docker rm $(docker ps -aq)
docker rmi -f $(docker images -q)

Thursday, May 3, 2018

lots of channels for tracking exploration

A private Trello board for task tracking:
https://trello.com/b/kSZvdVg5/physics-derivation-graph

Primary source code repo:
https://github.com/allofphysicsgraph/proofofconcept/blob/gh-pages/doc/physics_graph_notes.log

Demo of proof of concept:
http://allofphysicsgraph.github.io/proofofconcept/

A blog to track my status
https://physicsderivationgraph.blogspot.com/

An old set of descriptions
https://sites.google.com/site/physicsderivationgraph/
This site is deprecated.

limiting my effort to just Latex for expressions

The core of the Physics Derivation graph is the relation between expressions. I've considered how much additional knowledge could be captured by storing expressions as Abstract Syntax Trees. While this would add a lot of work and therefore take more time, there's potentially a lot of value in having a more robust representation. Additionally, there's the challenge that I don't know how to represent all expressions in Physics using Abstract Syntax Trees.

Rather than store each expression as an AST, I'm going to limit my effort to storing expressions as LaTeX. As a consequence, the validity of inference rules applied to expressions cannot be checked. If someone comes up with a representation more useful than LaTeX (ie ASTs, MathML, etc), then a conversion will need to be performed.

The second reason the choice of LaTeX is significant is that it limits how far down the hierarchy can be enumerated. Specifically, in the context of these layers

  • Physics derivation graph
  • Derivation 
  • Step 
  • Expressions, inference rules 
  • Symbol, operators

the Physics Derivation Graph with LaTeX will not be able to systematically explore the symbols and operators used.
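To illustrate the trade-off (a toy representation I made up for this post, not a PDG format): the same expression stored as an opaque LaTeX string versus as a small AST of nested tuples, where the tree form exposes the symbols and operators that the string form hides.

```python
# Toy illustration (not a PDG format): the same expression as a LaTeX
# string versus an abstract syntax tree of nested tuples. The string
# is opaque; the tree exposes its operators and symbols.
latex_form = r"E = m c^2"

# ("operator", operands...) -- leaf symbols are plain strings
ast_form = ("equals",
            "E",
            ("multiply", "m", ("power", "c", "2")))

def collect_symbols(node, symbols=None):
    """Walk the tuple AST and gather the leaf symbols."""
    if symbols is None:
        symbols = set()
    if isinstance(node, tuple):
        for operand in node[1:]:   # node[0] is the operator name
            collect_symbols(operand, symbols)
    else:
        symbols.add(node)
    return symbols

print(sorted(collect_symbols(ast_form)))  # ['2', 'E', 'c', 'm']
```

No such enumeration is possible from latex_form without parsing the LaTeX, which is exactly the capability being given up.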