Friday, March 20, 2020

data structure continues to evolve

I've experimented with seven different data structures for the Physics Derivation Graph:

  • v1_plain_text
  • v2_XML
  • v3_CSV
  • v4_file_per_expression
  • v5_property_graph
  • v6_sqlite
  • v7_pickle_web_interface
Each of these have required a rewrite of the code from scratch, as well as transfer code (to move from n to n+1). 

These changes progress concurrently with my knowledge of data structures. I didn't know about property graphs when I was implementing v1, v2, and v3. I wasn't comfortable with SQL when I implemented v4. I didn't know about Tidy data when I implemented v1 to v6.  The data structures used in the PDG slightly lag my understanding of data structures. 

Within a given implementation, there are design decisions with trade-offs to evaluate. I typically don't know all the options or consequences until I implement one of them and then determine what inefficiencies exist. Knowledge gained through evolutionary iteration is expensive and takes a lot of time. 

Here's an example of two "minor" tweaks that incur a rewrite of all the code. My current data structure in v7 is

dat['derivations'] = {
  'fun deriv': { # name of derivation
     '4928482': {    # key is "step ID"   
          'inf rule': 'declare initial expr',
          'inputs':  {},
          'feeds':   {},
          'outputs': {'9428': '4928923942'}, # key is "expr local ID", value is "expr global ID"
          'linear index': 1}, # linear index for PDF and for graph orientation
     '2948592': {
          'inf rule': 'add X to both sides',
          'inputs':  {'9428': '4928923942'},
          'feeds':   {'3190': '9494829190'},
          'outputs': {'3921': '9499959299'},
          'linear index': 2},

A better data structure would be

dat['derivations'] = {
  'fun deriv': { # name of derivation
     '4928482': {    # key is "step ID"   
          'inf rule': 'declare initial expr',
          'inputs':  {},
          'feeds':   {},
          'outputs': {1: '9428'}, # key is index, value is "expr local ID"
          'linear index': 1}, # linear index for PDF and for graph orientation
     '2948592': {
          'inf rule': 'add X to both sides',
          'inputs':  {1: '9428'},
          'feeds':   {1: '3190'},
          'outputs': {1: '3921'},
          'linear index': 2},

dat['expr local to global'] = {
        '9428': '4928923942',
        '3190': '9494829190',
        '3921': '9499959299',
        '9128': '1492842000'}

The reasons this second data structure is an improvement is
  1. the global expression ID does not appear in the 'derivations' dict
  2. the inputs, feeds, and outputs have an index. The index is relevant for both printing in a PDF and use in inference rules. 
I'm slowly evolving towards the likelihood that there will be a "v8" based on tables. The backend database would be something like SQLite3, and the internal representation in Python would be dataframes. 

I'm not going to switch to v8 yet; I'll continue to invest effort in v7 for a bit longer to explore a few challenges (like implementation of inference rules). 

No comments:

Post a Comment