OGDL (Ordered Graph Data Language)

Revision 2012.3, 21 Mar 2012
Copyright (c) 2002-2012, Rolf Veen
This is an open standard. See the license.

1. Introduction

OGDL is a textual format that represents trees or graphs of data, where the nodes are strings and the edges are space or indentation. Its main objectives are simplicity and readability.

2. The graph model

The text format specified here represents a directed graph G=(N,E), where N is an ordered bag of nodes and E a relation NxN. Nodes are represented by strings. Each member of E is an arc (edge), and is represented by space between nodes.

3. The building blocks of OGDL

3.1. Strings and white space

Strings and space are the basics of OGDL (if a string contains spaces then it has to be quoted). These two elements form a tree, where each string is a child of the nearest preceding string that is less indented. The tree can be converted into a graph if necessary, by using special strings (see the level 2 grammar).

The characters that delimit strings are normally white space characters, but there are some exceptions: the comma and the parentheses, which permit an inline or compact form (see 3.2 and 3.3).

a
  b
  "string with spaces"

This first form of writing OGDL, without using commas and parentheses, is called the canonical form.
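
The indentation rule above can be sketched in Python as follows. This is a minimal illustration, not a reference parser: the Node type and the parse_canonical function are hypothetical names, and the sketch assumes space-only indentation, exactly one node per line, and no comments, blocks, commas or parentheses.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Node:
    """A tree node: a string plus its ordered children (hypothetical type)."""
    text: str
    children: list[Node] = field(default_factory=list)


def parse_canonical(source: str) -> Node:
    """Parse canonical-form OGDL (one node per line) into a tree.

    Assumption: indentation uses spaces only. The returned root is a
    synthetic node with empty text that holds the top-level nodes.
    """
    root = Node("")
    # Stack of (indentation, node); the root sits below any real indentation.
    stack = [(-1, root)]
    for line in source.splitlines():
        text = line.strip()
        if not text:
            continue
        indent = len(line) - len(line.lstrip(" "))
        # Pop every node that is not less indented than the current line.
        while stack[-1][0] >= indent:
            stack.pop()
        # Remove the surrounding double quotes of a quoted string.
        if len(text) >= 2 and text[0] == text[-1] == '"':
            text = text[1:-1]
        node = Node(text)
        stack[-1][1].children.append(node)
        stack.append((indent, node))
    return root


tree = parse_canonical('a\n  b\n  "string with spaces"\n')
```

Here tree.children holds the single node 'a', whose children are 'b' and 'string with spaces'.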

3.2. Comma

The comma has the effect of resetting the level of indentation to that of the beginning of the line. The example above can be rewritten:

a
  b, "string with spaces"

3.3. Parentheses

The same example, but using parentheses:

a ( b, "string with spaces" )

It could also have been written without the extra spaces:

a(b,"string with spaces")

This specification doesn't support nodes after a group (a group is a sequence of nodes surrounded by parentheses).
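
As an illustration of the compact form, the following Python sketch emits a node and its children on one line. The names quote and emit_inline and the (text, children) tuple layout are hypothetical. The sketch is only exercised on a single level of children; whether deeper nesting can be emitted this way depends on how the restriction on nodes after a group is read.

```python
def quote(text: str) -> str:
    """Return text as an OGDL scalar, quoting it only when necessary."""
    needs_quotes = (
        text == ""
        or text[0] in "#'\""                       # would not start a plain word
        or any(ch in " \t\r\n,()" for ch in text)  # contains a delimiter
    )
    if not needs_quotes:
        return text
    # Escape backslashes first, then the double quote used as delimiter.
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def emit_inline(node: tuple) -> str:
    """Emit a (text, children) pair in the compact form of section 3.3."""
    text, children = node
    if not children:
        return quote(text)
    inner = ",".join(emit_inline(child) for child in children)
    return f"{quote(text)}({inner})"


tree = ("a", [("b", []), ("string with spaces", [])])
compact = emit_inline(tree)
print(compact)
```

For the running example this prints the compact line shown above, without the extra spaces.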

3.4. Text block

A text block is a string that may contain newlines. A standalone '\' character at the end of a line introduces a text block, as in the following example.

text_block \
  This is a multiline
  description

Indentation spaces are deleted from the text. When, within a block, a line is less indented than the previous one, this indentation becomes the new level of indentation. If it is more indented, the extra spaces are considered part of the text and not deleted.

This example is equivalent to the previous one:

text_block \
   This is a multiline
  description
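
The indentation rule for text blocks can be sketched as below. The function name strip_block is hypothetical, and the sketch assumes that the lines of the block have already been collected, use spaces only, and contain no empty lines.

```python
def strip_block(lines: list[str]) -> str:
    """Remove indentation from the lines of a text block (section 3.4).

    The level starts at the indentation of the first line and is lowered
    whenever a less indented line appears. Extra indentation beyond the
    current level is kept as part of the text.
    """
    level = None
    stripped = []
    for line in lines:
        indent = len(line) - len(line.lstrip(" "))
        if level is None or indent < level:
            level = indent
        stripped.append(line[level:])
    return "\n".join(stripped)


first = strip_block(["  This is a multiline", "  description"])
second = strip_block(["   This is a multiline", "  description"])
deeper = strip_block(["  a", "    b"])
```

Both first and second yield the same two-line text, which is why the two examples above are equivalent. In deeper, the second line keeps its two extra spaces.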

3.5. Comments

The '#' character introduces a comment. Comments are treated as white space, and thus ignored.

# this is a comment
#this also
this#not

To be considered as a comment, a string must start with '#'. The same character in the middle or the end of a string is not considered as the start of a comment.
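
A possible way to apply this rule line by line is sketched below. The function strip_comment is a hypothetical helper. It assumes that a string starts at the beginning of a line or right after white space, a comma or a parenthesis, that '#' inside quotes is literal, and that the reserved combinations '#?' and '#{' are left untouched.

```python
DELIMITERS = " \t,()"  # characters assumed to end one string and start another


def strip_comment(line: str) -> str:
    """Return the line without its comment, following section 3.5."""
    quote = None           # the open quote character, if inside a quoted string
    at_token_start = True  # True when the next character would begin a string
    i = 0
    while i < len(line):
        ch = line[i]
        if quote is not None:
            if ch == "\\":
                i += 1  # skip the escaped character
            elif ch == quote:
                quote = None
        elif at_token_start and ch in "'\"":
            quote = ch
        elif at_token_start and ch == "#":
            # '#?' and '#{' are reserved combinations, not comments.
            if line[i + 1:i + 2] not in ("?", "{"):
                return line[:i].rstrip()
        at_token_start = quote is None and ch in DELIMITERS
        i += 1
    return line
```

With this sketch the first two example lines above become empty, while 'this#not' is returned unchanged.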

The special combination "#?" is reserved for an optional metadata block, which is also expressed in OGDL. This is explained later.

3.6. End of stream

Any character that is not a space, break or word character will end the OGDL stream. That means that the parser will exit when it finds such a character. Most characters below ASCII 32 (space) will end the current stream.

This mechanism may be used, for example, in log files, where many OGDL fragments can be concatenated in one file; pointing the parser to the start of any of them will return that fragment only.
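
The following sketch shows how a reader could extract one fragment from such a file. It uses the character classes of productions [1], [3] and [4] in section 5. The function name read_fragment and the choice of NUL (ASCII 0) as separator in the sample data are assumptions made for illustration only.

```python
def read_fragment(data: str, start: int = 0) -> str:
    """Return the OGDL fragment that begins at index start.

    Reading stops, without error, at the first character that is neither
    text (> 32), space (32, 9) nor break (13, 10), or at the end of data.
    """
    end = start
    while end < len(data):
        code = ord(data[end])
        is_text = code > 32
        is_space = code in (32, 9)
        is_break = code in (13, 10)
        if not (is_text or is_space or is_break):
            break
        end += 1
    return data[start:end]


# Two fragments concatenated in one stream, each followed by a NUL character.
log = "a\n  b\n\x00c\n  d\n\x00"
first = read_fragment(log)
second = read_fragment(log, 7)
```

Pointing the reader at index 0 returns only the first fragment; pointing it at index 7, the start of the second fragment, returns only the second.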

3.7. Cycles

a
  b
c
  #{2

In this example, node 'c' has an arc pointing to node 'b'. The number 2 points to the node 2 lines before the current node, in the canonical form of OGDL, where each node begins on a new line.
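
A sketch of how such references could be resolved is given below. The function resolve_references is a hypothetical helper that returns arcs as (from, to) pairs of node strings. It assumes canonical form with one node per non-empty line, space-only indentation, and that the source of the arc is the nearest preceding node that is less indented than the reference.

```python
from __future__ import annotations


def indent_of(line: str) -> int:
    """Number of leading spaces of a line."""
    return len(line) - len(line.lstrip(" "))


def resolve_references(source: str) -> list[tuple[str | None, str]]:
    """Return the arcs created by '#{N' references in canonical OGDL."""
    lines = [ln for ln in source.splitlines() if ln.strip()]
    arcs = []
    for index, line in enumerate(lines):
        text = line.strip()
        if not text.startswith("#{"):
            continue
        offset = int(text[2:].split()[0])
        if offset < 1 or offset > index:
            raise ValueError(f"reference {text!r} points outside the stream")
        # The target is the node 'offset' lines above the reference.
        target = lines[index - offset].strip()
        # The source is the nearest preceding, less indented node (if any).
        parent = next(
            (prev.strip() for prev in reversed(lines[:index])
             if indent_of(prev) < indent_of(line)),
            None,
        )
        arcs.append((parent, target))
    return arcs


arcs = resolve_references("a\n  b\nc\n  #{2\n")
```

For the example above, the single arc found goes from 'c' to 'b'.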

3.8. Summary of special characters

Character   ASCII decimal value   Combinations
(           40
)           41
,           44
#           35                    #? #{
"           34
'           39
\           92                    \' \" \\   EOL in quoted and block; escape char in quoted
CR          13
NL          10
SP          32
TAB         9

4. Layers

OGDL is specified as a series of layers:

  Level 1: tree grammar
  Level 2: graph grammar (adds cycles)

Tools are not required to comply with both levels. An implementation may support only level 1, since the inclusion of cycles complicates both the parser and the emitter; the choice depends on the field of application. Presenting tools or libraries as 'OGDL 1.0 level 1' compliant is correct.

5. Level 1: Tree grammar

The following grammar rules or productions are written in a simplified EBNF format similar to the one used in the XML specification (see http://www.w3.org/TR/2004/REC-xml-20040204/#sec-notation).

[1]  char_text  ::= integer > 32
[2]  char_word  ::= char_text - ',' - '(' - ')'
[3]  char_space ::= 32 | 9
[4]  char_break ::= 13 | 10
[5]  char_end   ::= integer - char_text - char_space - char_break

These productions use the integer as the base type for representing a character, and then only positive values. Any character that is not char_text, char_space or char_break is considered the end of the OGDL stream, and makes the parser stop and return, without signaling an error.

[6]  word     ::= ( char_word - '#' - '\'' - '"')+ char_word*
[7]  comment  ::= '#' (char_word | char_space)* break
[8]  break    ::= 10 | 13 | (13 10)
[9]  end      ::= char_end
[10] space    ::= char_space+
[11] space(n) ::= char_space*n ; where n is the equivalent number of spaces.
[12] quoted   ::= ('\''|'"') (char_word | char_space | break)* ('\''|'"') ; where starting
                    and ending character are the same
[13] block(n) ::= '\' space? break (space(>n) (char_word | char_space)* break)+

[11] is the indentation production. It corresponds to the equivalent number of spaces between the start of a line and the beginning of the first scalar node. For any two consecutive scalars preceded by indentation, the second is a child of the first if it is more indented. Intermixing of spaces and tabs is NOT allowed: either tabs or spaces should be used for indentation within a document.

A quote character that has to be included in a quoted string should be preceded by '\'. If the string contains line breaks, leading spaces on each new line are stripped off. The initial indentation is defined by the first line after a break. The indentation is decreased if word characters appear at a lower indentation, but it is never increased. Lines ending with '\' are concatenated. Escape sequences that are recognized are \", \' and \\. The character '\' should be considered literal in other cases.
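
The escape rules can be sketched as follows. The function unescape is a hypothetical helper that works on the body of a quoted string (the text between the quotes). It covers only the three escape sequences; line concatenation and indentation stripping are left out.

```python
def unescape(body: str) -> str:
    """Decode the three recognized escapes inside a quoted string body.

    A backslash followed by a double quote, a single quote or another
    backslash yields that character. Any other backslash is literal.
    """
    out = []
    i = 0
    while i < len(body):
        ch = body[i]
        if ch == "\\" and i + 1 < len(body) and body[i + 1] in "\"'\\":
            out.append(body[i + 1])
            i += 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)
```

For example, a backslash before 't' in a path such as c:\temp is kept as is, because it is not one of the recognized sequences.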

A block is a scalar leaf node, i.e., it cannot be the parent of other nodes. It is to be used for holding a block of literal text. The only transformation that it undergoes is leading space stripping, according to the indentation rules. A block is a child of the scalar that starts it.

[15] scalar   ::= (word | quoted)
[16] sequence ::= (scalar|group) ( (space? ',')? space? (scalar|group) )*
[17] group    ::= '(' space? sequence?  space? ')'
[18] line(n)  ::=
[19] graph    ::= line* end

6. Level 2: Graph grammar

[20] reference ::= '#' '{' number (space (char_word | char_space)* )? break

This production represents an arc to the node that is 'number' nodes above the current one in the canonical form of OGDL.

7. Character encoding

OGDL streams must parse well without explicit encoding information for all ASCII transparent encodings. Even if OGDL doesn't mandate the use of Unicode, it does encourage its use.

All special characters used in OGDL that define structure and delimit tokens are part of the US-ASCII (ISO646-US) set. This guarantees that tools that support only single byte streams will work on any 8-bit fixed or variable length encoded stream, particularly UTF-8 and ISO8859 variants. Since the conversion from bytes to characters and back is outside the scope of OGDL, it is up to the application to decide how to treat non-printable characters which are outside the ASCII space.

8. Meta-information

The '#?' character combination used as a top level node (not necessarily the first one) is reserved for communication between the OGDL stream and the parser. It is not mandatory and allows for future enhancements of the standard, if any. For example, some optional behavior could be switched on. Normally meta-information will not be part of the in-memory graph. Meta-information is written in OGDL, as can be seen in the following examples.

#? ogdl 1.0
#? ( ogdl 1.0, encoding iso-8859-1 )

The meta-information keys that are currently reserved are: ogdl, encoding and schema.

9. Round-tripping

OGDL streams are guaranteed to round-trip in the presence of a capable parser and emitter, while maintaining a simple in-memory structure of nested nodes. Depending on the precision of the parser-emitter chain, the resulting stream may or may not differ from the original in format. Comments are normally not preserved.



A. Changes to this document

20110920  Level 2 simplified (#{N)
          Deleted '--' as break.
          Production 1 simplified (assume valid chars)
          (short form for tables not included because
           it adds confusion in case of cycles).

20051220  Comments are thrown away. The tentative part
          (tables) is left out for this version.
          Nodes after groups not allowed.

20051215  Space after '#' not needed in comments. 
          Other small corrections.

20050403  Some descriptive text added.
          Defined a new EOS sequence consisting of two dashes.
          Make optional the spaces around '(', ')' and ','
          
20040614  Tabs and spaces can not be intermixed in indentation.

20040305  Comments, meta-information added. 
          Semicolons deleted, comma changes meaning.
          Round-tripping chapter added.
          Some productions commented.

20031117  Renamed to Version 1.0
          New cycle productions (were &{} and *{}). 
          Unicode BOM mandates Unicode stream.
          Implementor decides whether he/she needs level 2. 

20030902  Initial release