Sunday, June 26, 2016

build a link graph


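The Physics Derivation Graph (PDG) site uses relative links. That complicates building a graph of the site's links with wget: the -k (--convert-links) option, which would rewrite those links, can be combined with -O only when the output is a regular file, so it is not available when streaming pages to stdout with "-O -". The commands below work around this by prepending the root page to each relative link by hand.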
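The link handling can be tried offline before fetching anything. This is a minimal sketch, not the exact pipeline used in this post: extract_hrefs and absolutize are hypothetical helper names, and the sample HTML and the example.org link are made up for illustration.

```shell
# Hypothetical helper: print every double-quoted href value found on stdin,
# one per line.
extract_hrefs() {
  grep -i -o 'href[ ]*=[ ]*"[^"]*"' | sed 's/^[^"]*"//; s/"$//'
}

# Hypothetical helper: prefix relative links with a base URL and pass
# absolute http/https links through unchanged.
absolutize() {
  base="$1"
  while IFS= read -r link; do
    case "$link" in
      http://*|https://*) printf '%s\n' "$link" ;;
      *) printf '%s%s\n' "$base" "$link" ;;
    esac
  done
}

# Made-up sample page with one relative and one absolute link.
printf '<a href="site/index.html">x</a> <a class="c" href="https://example.org/">y</a>\n' |
  extract_hrefs | absolutize "http://allofphysicsgraph.github.io/proofofconcept/"
```

The relative link comes out with the root page prepended, while the absolute link is left alone.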

$ root_page=allofphysicsgraph.github.io/proofofconcept/
# relative links on the root page (absolute http/https links are dropped)
$ wget "http://$root_page" -q -O - | grep -i -o '<a[^>]\+href[[:space:]]*=[[:space:]]*"[^"]\+"' | sed 's/.*"\([^"]*\)"$/\1/' | grep -v http | sort -u > list_of_pages
# relative links found on each of those pages
$ while read -r this_page; do wget "http://$root_page$this_page" -q -O - | grep -i -o '<a[^>]\+href[[:space:]]*=[[:space:]]*"[^"]\+"' | sed 's/.*"\([^"]*\)"$/\1/' | grep -v http | sort -u >> list_of_pages2; done < list_of_pages
# prefix the second-level links with site/ and merge both lists
$ sed 's|^|site/|' list_of_pages2 >> list_of_pages
$ sort -u list_of_pages > list_of_pages_master
$ rm list_of_pages list_of_pages2
# write one "page -> link;" edge per link on every page in the master list
$ while read -r this_page; do wget "http://$root_page$this_page" -q -O - | grep -i -o '<a[^>]\+href[[:space:]]*=[[:space:]]*"[^"]\+"' | sed 's/.*"\([^"]*\)"$/\1/' | awk -v thispage="$this_page" '{print thispage" ->", $0";"}' >> graph_level2.gv; done < list_of_pages_master
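The edge list in graph_level2.gv is not yet a complete Graphviz file: node names containing "/" or "." need quoting, and the edges need to sit inside a digraph block. A minimal sketch of that last step follows, assuming each line has the form "source -> target;" as produced above. The helper name edges_to_dot, the graph name site_links, and the sample page names are all hypothetical.

```shell
# Hypothetical helper: read lines of the form `source -> target;` from stdin
# and print a complete digraph with quoted node names and duplicate edges
# removed.
edges_to_dot() {
  echo 'digraph site_links {'
  sed -n 's/^\(.*\) -> \(.*\);$/  "\1" -> "\2";/p' | sort -u
  echo '}'
}

# Example with made-up page names; the repeated edge is emitted only once.
printf '%s\n' 'index.html -> site/a.html;' 'index.html -> site/a.html;' |
  edges_to_dot
```

With Graphviz installed, something like "edges_to_dot < graph_level2.gv > graph_full.gv" followed by "dot -Tsvg graph_full.gv -o graph_full.svg" should render the result. The graph_full file names are placeholders.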

