Data Structures

Karl Voit edited this page Jun 18, 2017 · 3 revisions

Here is a description of the most important and recurring data structures used in the Python source.

In the source code, the following routines might be of interest:

  • /lib/orgparser.py: parse_orgmode_file(…)
    • the main routine of the Org-mode parser
  • /lib/htmlizer.py: sanitize_and_htmlize_blog_content(…)
    • the main routine of the HTMLization process

Org-Mode Element Overview

Org elements: from ox-ascii.el (Org-mode)

Org Element [fn:earmarked] [fn:lowprio] implemented since [fn:internalrepresentation] HTML5
external hyperlinks <2014-01-30 Thu> a
internal links <2014-03-03 Mon> a
bold <2014-01-30 Thu> b
center-block x
clock x
code <2014-01-30 Thu> code
drawer x
dynamic-block x
entity
example-block <2014-01-30 Thu> ['example-block', 'name or None', [u'first line', u'second line']] FIXXME
example “colon-block” <2014-08-10 Sun> ['colon-block', False, [u'first line', u'second line']] pre
export-block x
export-snippet x
fixed-width x
footnote-definition x
footnote-reference x
headline <2014-01-30 Thu> ['heading', {'level': 3, 'title': u'my title'}] section+header+h1
horizontal-rule <2014-01-31 Fri> ['hr'] (ignored and only interpreted to mark end of standfirst)
inline-src-block x
inlinetask x
inner-template x
italic x
item
keyword x
latex-environment <2014-01-30 Thu> [fn:pypandoc] ['latex-block', 'name or None', [u'first line', u'second line']]
latex-fragment x
line-break x
link x
paragraph <2014-01-30 Thu> ['par', u'line1', u'line2'] p
plain-list x ['list-itemize', [u'first line', u'second line']] ul+li
plain-text <2014-01-30 Thu> see: paragraph
planning x
quote-block <2014-01-30 Thu> ['quote-block', 'name or None', [u'first line', u'second line']] blockquote
quote-section ?
radio-target x
section <2014-01-30 Thu> ['heading', {'title': u'Sub-heading foo', 'level': 3}] h2, h3, …
special-block x
src-block <2014-01-30 Thu> ['src-block', 'name or None', [u'first line', u'second line']] pre
statistics-cookie x
strike-through x
subscript x
superscript x
table x [fn:pypandoc]
table-cell x
table-row x
target
template x
timestamp x
underline x
verbatim x pre
verse-block <2014-01-30 Thu> ['verse-block', 'name or None', [u'first line', u'second line']] pre
html-block <2014-01-30 Thu> ['html-block', 'name or None', [u'first line', u'second line']] pre (if no #+NAME: then insert directly!)
tsfile-links <2017-06-17 Sat> ['cust_link_image', u'2017-03-11T18.29.20 Stars.jpg', {u'width': u'300', u'alt': u'Stars in a Tree', u'align': u'right'}] figure, img + attributes, figcaption
the rest [fn:pypandoc]

NOTE: OrgParser uses “par” for anything it cannot interpret as something else.

[fn:earmarked] Planned to be implemented soon (or at all :-)

[fn:lowprio] This feature is low on my personal development list (may take some time or might never get implemented)

[fn:pypandoc] This element gets converted using pypandoc (and additional sanitizing)

[fn:internalrepresentation] usually in list: blog_data['id-of-entry']['content']

  • Blocks: (beginning with BEGIN_)

Representation of blog data

For a complete list of content elements, please take a look at id:implemented-org-elements (above) FIXXME

blog_data is a Python list containing one dictionary entry per blog entry:

  • FIXXME: add examples of:
    • category
    • other additional data
blog_data = \
[ {'level': 2,                                                ## number of asterisks
   'title': u'This is a blog entry about foo',
   'usertags': [u'tag1', u'tag2'],
   'autotags': {'language': 'english'},
   'id': u'lazyblorg-example-entry',                          ## ID from PROPERTIES-drawer
   'finished-timestamp-history': [datetime1, datetime2, datetime3],
   'latestupdateTS': datetime,                                ## most current time-stamp that changed (or overwrote) heading to DONE
   'firstpublishTS': datetime,                                ## oldest time-stamp that changed heading to DONE
   'created': datetime,
   'content': [ ['par', u'This is the Org-mode content'],     ## 'par': paragraph; used for anything not recognized as another element (e.g., tables)
                '\n',    ## change of paragraph
                ['heading', {'level': 3, 'title': u'Another aspect'}],
                ['html-block', 'its name or None', [u'first line', u'second line', u'', u'last line']],
                ['list-itemize', [u'first line', u'second line']],
                ['cust_link_image', u'2017-03-11T18.29.20 Stars.jpg', {u'width': u'300', u'alt': u'Stars in a Tree', u'align': u'right'}]
              ]                                                    #FIXXME: further elements
} ]

Thus:

blog_data[0].keys()
## ... results in:
# ['title',
#  'latestupdateTS',
#  'firstpublishTS',
#  'created',
#  'usertags',
#  'content',
#  'finished-timestamp-history',
#  'level',
#  'id',
#  'autotags']

blog_data[0]['content']  ## -> list of elements of content
# [['par', u'This is the Org-mode content'],
#  ['heading', {'level': 3, 'title': u'Another aspect'}],
#  ['list-itemize', [u'first line', u'second line']],
#  ['table', u'FIXXME: followed by this table data'],
#  ['image', u'FIXXME: followed by this image']]
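
To illustrate how code might walk such a content list, here is a minimal sketch that dispatches on the first item of each element. The function name `render_content` is hypothetical, and using the heading level directly as the HTML heading level is an assumption. This is not the logic of sanitize_and_htmlize_blog_content(…), only a way to see the shape of the data.

```python
from html import escape  # Python 3: escapes <, > and & in text


def render_content(content):
    """Hypothetical helper: turn a list of content elements into HTML snippets."""
    html = []
    for element in content:
        if element == '\n':
            continue  # change of paragraph, nothing to render
        kind = element[0]
        if kind == 'par':
            # a paragraph may carry several lines after the type marker
            html.append('<p>' + escape(' '.join(element[1:])) + '</p>')
        elif kind == 'heading':
            level = element[1]['level']  # assumption: level maps 1:1 to <hN>
            html.append('<h%d>%s</h%d>' % (level, escape(element[1]['title']), level))
        elif kind == 'list-itemize':
            items = ''.join('<li>' + escape(line) + '</li>' for line in element[1])
            html.append('<ul>' + items + '</ul>')
        elif kind == 'hr':
            continue  # listed as ignored in the element overview
        else:
            html.append('<!-- unhandled element: ' + escape(kind) + ' -->')
    return html


example_content = [
    ['par', 'This is the Org-mode content'],
    '\n',
    ['heading', {'level': 3, 'title': 'Another aspect'}],
    ['list-itemize', ['first line', 'second line']],
]
rendered = render_content(example_content)
```

Every element except the bare '\n' separator is a list whose first item names the element type, so a single dispatch on `element[0]` is enough to cover the whole list.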

Internal format of meta-data

Example:

>>> metadata
{u'2013-08-22-testid': {'title': u"This is the title", 'latestupdateTS': datetime.datetime(2013, 8, 22, 21, 6), 'firstpublishTS': datetime.datetime(2013, 8, 22, 21, 6), 'checksum': 'b757f8478bffd6c70a474f213d6520de', 'created': datetime.datetime(2013, 8, 22, 21, 6)},
 u'2013-02-12-lazyblorg-example-entry': {'latestupdateTS': datetime.datetime(2013, 2, 14, 19, 2), 'checksum': '24af2246a5121e829a0dbbd6e2425c15', 'created': datetime.datetime(2013, 2, 12, 10, 58)}}

The keys of the dict are the IDs of the entries:

>>> metadata.keys()
[u'2013-08-22-testid', u'2013-02-12-lazyblorg-example-entry']

Each entry (key = ID) holds a dict with the following entries:

  • ‘title’: string containing the title of the blog entry
  • ‘latestupdateTS’: datetime.datetime(2013, 8, 22, 21, 6)
    • most recent time-stamp from the LOGBOOK drawer which marked going to a final state
  • ‘firstpublishTS’: datetime.datetime(2013, 8, 22, 21, 6)
    • oldest time-stamp from the LOGBOOK drawer which marked going to a final state
  • ‘checksum’: ‘b757f8478bffd6c70a474f213d6520de’
    • MD5 checksum of: [title, tags, finished_timestamp_history, content]
  • ‘created’: datetime.datetime(2013, 8, 22, 21, 6)
    • datetime object of the CREATED property from the PROPERTIES drawer
    • [ ] FIXXME: why not the first CLOSED time-stamp?
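
As a rough illustration of the checksum entry, the following sketch computes an MD5 hex digest over the four fields named above. The helper name `entry_checksum`, the use of 'usertags' for "tags", and the use of `repr()` to serialize the fields are all assumptions; the real code may serialize the data differently and therefore produce different digests.

```python
import hashlib
from datetime import datetime


def entry_checksum(entry):
    """Hypothetical helper: MD5 hex digest over the checksummed fields."""
    fields = [entry['title'],
              entry['usertags'],  # assumption: "tags" means the user tags
              entry['finished-timestamp-history'],
              entry['content']]
    # Assumption: repr() serves as a stable textual form of the fields
    return hashlib.md5(repr(fields).encode('utf-8')).hexdigest()


entry = {
    'title': 'This is the title',
    'usertags': ['tag1', 'tag2'],
    'finished-timestamp-history': [datetime(2013, 8, 22, 21, 6)],
    'content': [['par', 'This is the Org-mode content']],
}
checksum = entry_checksum(entry)

# Changing any of the four fields yields a different checksum
changed_entry = dict(entry, title='This is the new title')
changed_checksum = entry_checksum(changed_entry)
```

Comparing a stored checksum with a freshly computed one is a cheap way to tell whether any of the four fields changed.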

Time-stamps

Example:

CLOSED: [2014-01-31 Fri 14:02]
:LOGBOOK:
- State "DONE"       from "DONE"       [2014-02-01 Sat 18:42]
- State "DONE"       from ""           [2014-01-30 Thu 14:02]
:END:
:PROPERTIES:
:CREATED:  [2014-01-28 Tue 14:02]
:ID: 2014-01-27-lb-tests
:END:

What happens with the various time-stamps?

  • each LOGBOOK entry of setting to DONE:
    • is added to entry[‘finished-timestamp-history’] (which is a list)
    • overwrites entry[‘latestupdateTS’] if it is newer than the current value
      • so entry[‘latestupdateTS’] ends up as the most recent LOGBOOK entry of setting to DONE
    • overwrites entry[‘firstpublishTS’] if it is older than the current value
  • CREATED:
    • entry[‘created’]
  • CLOSED:
    • ignored
  • ID-timestamp:
    • ignored

After parsing the entry from above:

  • entry[‘created’] = [2014-01-28 Tue 14:02]
  • entry[‘latestupdateTS’] = [2014-02-01 Sat 18:42]
    • note that entry['timestamp'] was renamed to entry['latestupdateTS'] on 2017-02-12
  • entry[‘firstpublishTS’] = [2014-01-30 Thu 14:02]
    • Oldest entry of entry[‘finished-timestamp-history’] is the publication time-stamp!
  • entry[‘finished-timestamp-history’] = [2014-02-01 Sat 18:42] and [2014-01-30 Thu 14:02]
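
The rules above can be sketched as follows. The helper name `derive_timestamps` and the regular expression are illustrative assumptions, not the parser's actual implementation; the sketch only handles LOGBOOK lines in the form shown in the example.

```python
import re
from datetime import datetime

# Matches LOGBOOK lines such as:
# - State "DONE"       from ""           [2014-01-30 Thu 14:02]
LOGBOOK_DONE = re.compile(
    r'- State "DONE"\s+from "[^"]*"\s+\[(\d{4}-\d{2}-\d{2}) \w+ (\d{2}:\d{2})\]')


def derive_timestamps(logbook_lines):
    """Hypothetical helper: collect DONE time-stamps and derive both TS fields."""
    history = []
    for line in logbook_lines:
        match = LOGBOOK_DONE.search(line)
        if match:
            # join date and time, skipping the locale-dependent weekday name
            history.append(
                datetime.strptime(' '.join(match.groups()), '%Y-%m-%d %H:%M'))
    return {'finished-timestamp-history': history,
            'latestupdateTS': max(history) if history else None,
            'firstpublishTS': min(history) if history else None}


entry = derive_timestamps([
    '- State "DONE"       from "DONE"       [2014-02-01 Sat 18:42]',
    '- State "DONE"       from ""           [2014-01-30 Thu 14:02]',
])
```

CLOSED and the date inside the ID are never consulted here, which matches the "ignored" entries in the list above.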

entries_timeline_by_published

The dict format is:

  • dict with year (int) as key, value = list of 12 MONTH
  • MONTH: list of 28-31 DAY
  • DAY: list of 0 to many entry-IDs
for year in sorted(entries_timeline_by_published.keys()):
    # month_index and day_index are list positions, counted from 0
    for month_index, days in enumerate(entries_timeline_by_published[year]):
        # days = list of days of this month
        for day_index, entry_ids in enumerate(days):
            # entry_ids = list of IDs of this day
            for blogentry in entry_ids:
                print(str(year) + '-' + str(month_index) + '-' + str(day_index) + " has entry: " + str(blogentry))

See Utils.__add_entry_to_entries_timeline_by_published(…) for how it is populated.

See utils_test.py > test_entries_timeline_by_published_functions(…) for how it is tested.
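
For readers who want to experiment with this structure, here is a minimal sketch of how such a timeline could be built. The helper names `empty_year` and `add_entry` are hypothetical, and the choice that list position 0 stands for January and for the first day of a month is an assumption; the real Utils method may index differently.

```python
import calendar
from datetime import datetime


def empty_year(year):
    """Hypothetical helper: 12 month lists, each with one empty list per day."""
    return [[[] for _ in range(calendar.monthrange(year, month)[1])]
            for month in range(1, 13)]


def add_entry(timeline, published, entry_id):
    """Append entry_id to the day list of its publication date.

    Assumption: position 0 is January and the first day of the month.
    """
    if published.year not in timeline:
        timeline[published.year] = empty_year(published.year)
    timeline[published.year][published.month - 1][published.day - 1].append(entry_id)


timeline = {}
add_entry(timeline, datetime(2014, 1, 30, 14, 2), '2014-01-27-lb-tests')
```

Pre-allocating every day of a year keeps the lookup a plain chain of index operations, at the cost of storing many empty lists.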
