Description of library

This is a Python library/module written mainly for merging multiple bookmark files together. It works only with the bookmark.html files created by Firefox (although Netscape possibly uses the same format). Rather than simply removing all duplicate hyperlinks, it works with the folder structure of the bookmarks, merging two bookmark files in the same way that two directory trees on a filesystem would be combined recursively. Folders with the same name are merged, duplicate entries are removed, and entries with the same hyperlink but different strings are merged (see the function merge_entries).

The library consists of two parts. The first is a parsing grammar defined using the 'pyparsing' library. It turns a bookmark file into a recursive collection of strings, along with some named attributes holding the folder name and lists of the hyperlinks found. Consult the pyparsing documentation to understand the pyparsing.ParseResults class. Optionally, the original string can be recreated from the ParseResults instance (see the function serialize).

The second part is a set of functions for creating a recursive set of dictionaries and for merging two or more bookmark files together.

The module is expected to be used at the Python shell or in a script. The example script 'example.py' shows high-level use.

Main Functions & Objects

pyparsing part of module

bookmarkshtml
- a pyparsing grammar. Use via
'parseresult=bookmarkshtml.parseString(str)'
or 'parseresult=bookmarkshtml.parseFile(file_obj)'

'parseresult' is an instance of pyparsing.ParseResults, a class that can act both like a list and like a dictionary (see the pyparsing documentation). It is a grouping object that contains either strings or further instances of itself.

hyperlinks(parseresult)
- returns a list of all hyperlinks found in the file.
clean_tree(parseresult)
- returns nested lists and tuples containing the hyperlinks, preserving the original folder structure of the bookmark file.
serialize(parseresult)
- turns a parseresult instance back into a bookmark.html string. If used directly on the parseresult (i.e. before any editing), it will exactly recreate the original string.
duplicates_dict(parseresults)
- returns a dictionary with keys giving any duplicates found in the top level of the parseresults and values giving their indices.
top_folders_dict(parseresults)
- returns a dictionary with keys giving the names of any folders found in the top level of the parseresults and values giving their indices in the parseresults structure.
depersonalisefolders(parseresult)
- removes 'PERSONAL_TOOLBAR_FOLDER' tags from parseresult objects. Acts in place.
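
To illustrate the kind of output that hyperlinks and clean_tree describe, here is a minimal sketch that uses only the standard library's html.parser instead of the module's pyparsing grammar. The class BookmarkTreeSketch, the helper flatten_links and the sample data are hypothetical and not part of this module; the exact shapes returned by the real functions may differ.

```python
from html.parser import HTMLParser

# Tiny sample in the Firefox bookmark layout (made-up data).
SAMPLE = """<!DOCTYPE NETSCAPE-Bookmark-file-1>
<H1>Bookmarks</H1>
<DL><p>
    <DT><A HREF="https://example.org/" ADD_DATE="100">Example</A>
    <DT><H3 ADD_DATE="100">News</H3>
    <DL><p>
        <DT><A HREF="https://example.com/news" ADD_DATE="200">News site</A>
    </DL><p>
</DL><p>
"""


class BookmarkTreeSketch(HTMLParser):
    """Hypothetical helper: builds nested lists, with folders as (name, children) tuples."""

    def __init__(self):
        super().__init__()
        self.root = []
        self._stack = [self.root]   # lists currently being filled, innermost last
        self._in_h3 = False         # True while reading a folder title
        self._pending_name = None   # folder title waiting for its <DL>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")  # HTMLParser lowercases attribute names
            if href is not None:
                self._stack[-1].append(href)
        elif tag == "h3":
            self._in_h3 = True
            self._pending_name = ""
        elif tag == "dl":
            if self._pending_name is not None:
                children = []
                self._stack[-1].append((self._pending_name, children))
                self._stack.append(children)
                self._pending_name = None
            else:
                # A <DL> without a preceding title (the top level) keeps filling the same list.
                self._stack.append(self._stack[-1])

    def handle_data(self, data):
        if self._in_h3:
            self._pending_name += data

    def handle_endtag(self, tag):
        if tag == "h3":
            self._in_h3 = False
        elif tag == "dl" and len(self._stack) > 1:
            self._stack.pop()


def flatten_links(tree):
    """Return every hyperlink in the nested structure, depth first."""
    links = []
    for item in tree:
        if isinstance(item, tuple):
            links.extend(flatten_links(item[1]))
        else:
            links.append(item)
    return links


parser = BookmarkTreeSketch()
parser.feed(SAMPLE)
parser.close()
tree = parser.root
print(tree)
print(flatten_links(tree))
```

The nested structure keeps the folder hierarchy, while flatten_links discards it; that is the same distinction as between clean_tree and hyperlinks above.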

Nested dictionary part of module

bookmarkDict(parseresult)
- returns nested dictionaries using hyperlinks/folder names as keys and 'original entry strings'/sub-dictionaries as values. The original folder name string is stored under the key 'Folder' within each folder's dictionary. Since a dictionary must have unique keys, any duplicate hyperlinks or folders with the same name are merged by calling merge_entries or merge_bookmarkDict as appropriate. Useful information is printed to stdout.
merge_bookmarkDict(bookdict1,bookdict2)
- merges two bookmarkDict dictionaries. Duplicate entries are removed, and entries with identical hyperlinks but non-identical strings are merged using merge_entries. The function acts recursively if sub-folders with identical names are found. Useful information is printed to stdout.
serialize_bookmarkDict(bookdict)
- turns a bookmarkDict dictionary back into a bookmark.html-type string. Note that the ordering of the original file is lost.
hyperlinks_bookmarkDict(bookdict)
- returns a list of all hyperlinks found in a 'bookmarkDict' dictionary
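
The directory-style merge described for merge_bookmarkDict can be sketched with plain dictionaries. This is only an illustration under assumptions, not the module's code: merge_sketch, hyperlinks_sketch and the sample dictionaries are hypothetical, and the default policy for conflicting entries (keep the first) is a placeholder for what merge_entries does.

```python
def merge_sketch(book1, book2, merge_entry=None):
    """Merge two nested bookmark dictionaries the way two directory trees are combined."""
    if merge_entry is None:
        # Placeholder policy (an assumption): keep the entry from the first dictionary.
        merge_entry = lambda first, second: first
    merged = dict(book1)  # shallow copy; sub-folders that are not merged are shared, not copied
    for key, value in book2.items():
        if key not in merged:
            merged[key] = value  # present only in book2: take it as-is
        elif isinstance(merged[key], dict) and isinstance(value, dict):
            # Folders with the same name: merge their contents recursively.
            merged[key] = merge_sketch(merged[key], value, merge_entry)
        elif isinstance(merged[key], str) and isinstance(value, str):
            if merged[key] != value:
                # Same hyperlink, different entry strings: delegate to the policy.
                merged[key] = merge_entry(merged[key], value)
        # A folder and a link sharing one key is left as found in book1.
    return merged


def hyperlinks_sketch(book):
    """List every hyperlink key, skipping the 'Folder' key that holds the folder string."""
    links = []
    for key, value in book.items():
        if isinstance(value, dict):
            links.extend(hyperlinks_sketch(value))
        elif key != "Folder":
            links.append(key)
    return links


# Made-up sample data in the layout described above.
work = {
    "https://example.org/": '<A HREF="https://example.org/">Example</A>',
    "News": {
        "Folder": "<H3>News</H3>",
        "https://example.com/a": '<A HREF="https://example.com/a">A</A>',
    },
}
home = {
    "https://example.org/": '<A HREF="https://example.org/">Example</A>',
    "News": {
        "Folder": "<H3>News</H3>",
        "https://example.com/b": '<A HREF="https://example.com/b">B</A>',
    },
    "Recipes": {"Folder": "<H3>Recipes</H3>"},
}

merged = merge_sketch(work, home)
print(sorted(merged))
print(hyperlinks_sketch(merged))
```

In this sample the duplicate top-level link appears once in the result, the two 'News' folders are combined into one folder holding both links, and 'Recipes' is carried over unchanged.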

'private' functions for module

merge_entries(entry1,entry2)
- parses the two token strings it is given. The string with the most recent 'LAST_MODIFIED' or 'LAST_VISIT' tag is returned as the new string, except that the 'ADD_DATE' tag takes the earliest value found. It is meant to check that both entries have the same hyperlink, but this check currently doesn't work if an entry has no 'HREF' tag. Useful information is printed to stdout.
duplicates(seq)
- given any iterable sequence, returns the items that occur more than once.
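
The rules stated for merge_entries and duplicates can be sketched as follows. This is a rough illustration only: it uses a regular expression rather than the module's pyparsing grammar, the names attributes_of, recency, merge_entries_sketch and duplicates_sketch are hypothetical, and it assumes upper-case attribute names with numeric timestamp values. It does not attempt the hyperlink check mentioned above.

```python
import re
from collections import Counter

ATTR = re.compile(r'(\w+)="([^"]*)"')


def attributes_of(entry):
    """Collect NAME="value" pairs from one entry string."""
    return {name.upper(): value for name, value in ATTR.findall(entry)}


def recency(attrs):
    """Latest of LAST_MODIFIED / LAST_VISIT, or 0 when neither is a number."""
    stamps = [int(attrs[key]) for key in ("LAST_MODIFIED", "LAST_VISIT")
              if attrs.get(key, "").isdigit()]
    return max(stamps, default=0)


def merge_entries_sketch(entry1, entry2):
    """Keep the most recently touched entry, but with the earliest ADD_DATE."""
    attrs1, attrs2 = attributes_of(entry1), attributes_of(entry2)
    newest = entry1 if recency(attrs1) >= recency(attrs2) else entry2
    add_dates = [int(attrs["ADD_DATE"]) for attrs in (attrs1, attrs2)
                 if attrs.get("ADD_DATE", "").isdigit()]
    if add_dates and 'ADD_DATE="' in newest:
        newest = re.sub(r'ADD_DATE="[^"]*"',
                        'ADD_DATE="%d"' % min(add_dates), newest, count=1)
    return newest


def duplicates_sketch(seq):
    """Return the items that occur more than once (items must be hashable)."""
    counts = Counter(seq)
    return [item for item, count in counts.items() if count > 1]


# Made-up entries sharing one hyperlink.
old = '<DT><A HREF="https://example.org/" ADD_DATE="100" LAST_MODIFIED="150">Old title</A>'
new = '<DT><A HREF="https://example.org/" ADD_DATE="120" LAST_MODIFIED="300">New title</A>'

combined = merge_entries_sketch(old, new)
print(combined)
print(duplicates_sketch(["a", "b", "a", "c", "b", "a"]))
```

Here the entry modified at 300 wins, and its ADD_DATE is replaced by the earlier value 100 taken from the other entry.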