Version 1-prerelease

WORK IN PROGRESS

Current status: Second reading (Feb 5, 2023).

The second reading uncovered a number of issues, so there will be a third reading in two weeks.

If no major issues are found, I’ll release version 1.

Introduction

Syntactic metalanguages have made mainly haphazard gains over the past 60 years, and still only describe text-based formats. KBNF aims to be a modernized metalanguage with better expressivity and binary support.

Contents

Design Objectives

Human readability

The primary use case for KBNF is to describe text and binary grammars in a formalized way in documentation. Such a format must therefore be human-accessible, while also being concise and unambiguous.

Better expressivity

Binary formats tend to be structured in much more complicated ways than text formats in order to optimize for speed, throughput, or ease-of-processing. A metalanguage for describing such data will require much more expressivity than current metalanguages allow. Better expressivity reduces boilerplate and improves readability even in text format descriptions.

  • Repetition: Any expression can have repetition applied to it, for a specific number of occurrences or a range of occurrences.
  • Bindings: Some constructs (such as here documents or length delimited fields) require access to previously decoded values. KBNF supports assigning decoded values to variables.
  • Exclusion: Sometimes it’s easier to express something as “everything except for …”.
  • Grouping: Grouping expressions together is an obvious convenince that most other BNF offshoots have already adopted.
  • Prose: In many cases, the actual encoding of something is already well-known and specified elsewhere, or is too complex for KBNF to describe adequately. Prose offers a free-form way to describe part of a grammar.
  • Whitespace not significant: Many BNF notations (including the original BNF) assign meaning to whitespace (for example: whitespace as concatenation, or linefeeds to mark the end of a rule). This is bad from a UX perspective because it makes things harder for a human to parse in many circumstances, and reduces the ways in which a rule can be expressed over multiple lines.

Character set support

Metalanguages tend to support only ASCII, with Unicode (encoded as UTF-8) generally added as an afterthought. This restricts the usefulness of the metalanguage, as any other character sets (many of which are still in use) have no support at all.

KBNF can be used with any character set, and requires the character set to be specified as part of the grammar document header.

Codepoints as first-class citizens

  • Codepoints beyond the ASCII range must be directly inputtable into a grammar document.
  • Difficult codepoints must also be supported (for example via escape sequences).
  • Unicode categories must be supported.

Binary grammar support

Binary grammars have different needs from textual grammars, and require special support:

  • Bit arrays: Binary formats tend to work at bit-level granularity, and thus require support for arbitrarily sized bit arrays.
  • Variables, Macros & Functions: Binary formats often represent data in complex ways that can’t be parsed without passing some context around.
  • Conditionals & Logic: Binary formats often include or exclude portions based on encoded values elsewhere. Evaluating these requires the use of conditionals and logic operators.
  • Calculations: Many binary field sizes are determined by data stored elsewhere in the document, and often they require calculations of some sort to determine the final field size.
  • Transformations: Binary data often undergoes transformations that are too complex for normal BNF-style rules to express (for example LEB128).

Future proof

No specification is perfect, nor can it stand the test of time. Eventually an incompatible change will become necessary in order to stay relevant.

KBNF documents are versioned to a particular KBNF specification so that changes can be made to the specification without breaking existing tooling.

Forward Notes

About the Descriptions and Examples

Descriptions and examples will usually include some KBNF notation. When in doubt, please refer to the full KBNF grammar at the end of this document.

Bit Ordering

All sequences of bits (i.e. all expressions) are assumed to be in big endian bit order (higher bits come first), and if necessary can be swapped at any granularity using the swapped function.

For example:

  • uint(16,0xc01f) matches big endian 0xc01f (bit sequence 1,1,0,0,0,0,0,0,0,0,0,1,1,1,1,1).
  • swapped(8, uint(16,0xc01f)) matches little endian 0xc01f (bit sequence 0,0,0,1,1,1,1,1,1,1,0,0,0,0,0,0).
  • swapped(1, uint(16,0xc01f)) matches bit-swapped 0xc01f (bit sequence 1,1,1,1,1,0,0,0,0,0,0,0,0,0,1,1).

Codepoints follow the byte ordering of the character encoding scheme specified in the document header (although per-byte bit ordering remains nominally big endian). Character sets with ambiguous byte ordering (such as utf-16) should generally be avoided in favor of those with explicit byte ordering (utf-16be, utf-16le).

Non-Greedy

All expression matching is assumed to be non-greedy.

For example, given the following grammar:

document  = record+;
record    = letter+ & terminator;
letter    = 'a'~'z';
terminaor = "zzz";

The document azzzbzzzczzz contains 3 records (a, b, and c), not one record (azzzbzzzc).

Grammar Document

A KBNF grammar document begins with a header section, followed by a series of production rules.

document = document_header & MAYBE_WSLC & start_rule & (MAYBE_WSLC & rule)*;

Document Header

The document header identifies the file format as KBNF, and contains the following mandatory information:

  • The version of the KBNF specification that the document adheres to.
  • The case-insensitive name of the character encoding used for all codepoint related expressions (use the IANA preferred MIME name whenever possible).

Optionally, it may also include header lines. An empty line terminates the document header section.

kbnf_v1 utf-8
- identifier  = mygrammar_v1
- description = My first grammar, version 1
- kbnf_specification = https://github.com/kstenerud/kbnf/blob/master/kbnf_v1.md

document = "a"; # Yeah, this grammar doesn't do much...

Production Rules

Production rules are written in the form nonterminal = expression;, with optional whitespace (including newlines) between rule elements. The terminating semicolon makes it more clear where a rule ends, and also allows more freedom for visually laying out the elements of a rule.

rule          = symbol_rule | macro_rule | function_rule;
start_rule    = symbol_rule;
symbol_rule   = symbol & TOKEN_SEP & '=' & TOKEN_SEP & expression & TOKEN_SEP & ';';
macro_rule    = macro & TOKEN_SEP & '=' & TOKEN_SEP & expression & TOKEN_SEP & ';';
function_rule = function & TOKEN_SEP & '=' & TOKEN_SEP & prose & TOKEN_SEP & ';';

The left part of a rule can define a symbol, a macro, or a function. Their case-sensitive names share the same global namespace (i.e. they must be globally unique).

Note: Whitespace in a KBNF rule is only used to separate tokens and for visual layout purposes; it does not imply any semantic meaning.

Start Rule

The first rule listed in a KBNF document is the start rule. Only a symbol can be a start rule.

Symbols

A symbol acts as a placeholder for something to be substituted in another rule.

symbol_rule           = symbol & TOKEN_SEP & '=' & TOKEN_SEP & expression & TOKEN_SEP & ';';
symbol                = identifier_restricted;
identifier_restricted = identifier_any ! reserved_identifiers;
identifier_any        = name;
name                  = name_firstchar & name_nextchar*;
name_firstchar        = unicode(L,M);
name_nextchar         = name_firstchar | unicode(N) | '_';
reserved_identifiers  = "sized" | "aligned" | "swapped" | "when" | "bind"
                      | "uint" | "sint" | "float" | "inf" | "qnan" | "snan";

Note: Symbol names are not limited to ASCII.

Example: A record consists of a company name (which must not contain two full-width colons in a row), followed by two full-width colons, followed by an employee count in full-width characters (possibly approximated to the nearest 10,000), and is terminated by a linefeed.

記録		= 会社名 & "::" & 従業員数 & LF;
会社名		= unicode(L,M) & unicode(L,M,N,P,S,Zs)* ! "::";
従業員数		= '1'~'9' & '0'~'9'* & '万'?;
LF		= '[a]';

Or if you prefer, the same thing with English symbol names:

record         = company_name & "::" & employee_count & LF;
company_name   = unicode(L,M) & unicode(L,M,N,P,S,Zs)* ! "::";
employee_count = '1'~'9' & '0'~'9'* & '万'?;
LF             = '[a]';

Macros

A macro is essentially a symbol that accepts parameters, which are bound to local variables for use within the macro. The macro’s contents are written like regular rules, but also have access to the injected local variables.

macro_rule = macro & TOKEN_SEP & '=' & TOKEN_SEP & expression & TOKEN_SEP & ';';
macro      = identifier_restricted & PARENTHESIZED(param_name & (ARG_SEP & param_name)*);

When called, a macro substitutes the passed-in parameters and proceeds like a normal rule would. Parameter and return types are inferred based on how the parameters are used within the macro, and the type resulting from the macro’s expression. The grammar is malformed if a macro is called with incompatible types, or is used in a context that is incompatible with its return type.

call       = identifier_any & PARENTHESIZED(call_param & (ARG_SEP & call_param)*);
call_param = any_type;

Example: The main section consists of three records: A type 1 record and two type 2 records. A record begins with a type byte, followed by a length byte, followed by that many bytes of data.

main_section = record(1) & record(2){2};
record(type) = byte(type) byte(bind(length, ~)) & byte(~){length};
byte(v)      = uint(8,v);

In the above example, record must only be called with an unsigned integer, because the type field is passed to the byte macro, which calls the uint function, which expects an unsigned parameter.

Example: An IPV4 packet contains “header length” and “total length” fields, which together determine how big the “options” and “payload” sections are. “protocol” determines the protocol of the payload.

ip_packet                    = ...
                             & u4(bind(header_length, 5~)) # length is in 32-bit words
                               ...
                             & u16(bind(total_length, 20~)) # length is in bytes
                               ...
                             & u8(bind(protocol, registered_protocol))
                               ...
                             & options((header_length-5) * 32)
                             & payload(protocol, (total_length-(header_length*4)) * 8)
                             ;

options(bit_count)           = sized(bit_count, option*);
option                       = option_eool
                             | option_nop
                             | ...
                             ;

payload(protocol, bit_count) = sized(bit_count, payload_contents(protocol) & u1(0)*);
payload_contents(protocol)   = when(protocol = 0, protocol_hopopt)
                             | when(protocol = 1, protocol_icmp)
                             | ...
                             ;

Functions

Functions behave similarly to macros, except that they are opaque: whereas a macro is defined within the bounds of the grammatical notation, a function’s procedure is either one of the built-in functions, or is user-defined in prose (as a description, or as a URL pointing to a description).

Functions that take no parameters are defined and called without the trailing parentheses (similar to defining or calling a symbol).

Function Parameter and Return Types

Since functions are opaque, their parameter and return types cannot be automatically deduced like they can for macros. Functions therefore declare all parameter and return types. If a function is called with the wrong types or its return value is used in an incompatible context, the grammar is malformed.

The following standard types are recognized:

  • expression
  • condition
  • number
  • unsigned
  • signed
  • real

Custom types may be invented when the standard types are insufficient (such as in the unicode function), provided their textual representation doesn’t cause ambiguities with the KBNF document encoding.

Ranged Types

Numeric types used as parameters or return types from functions can be prepended with a tilde (~) to indicate that they accept ranges (such as in the uint, sint, and float functions).

Variadic Functions

The last parameter in a function can be made variadic by appending ... (such as in the unicode function).

function_rule      = function & TOKEN_SEP & '=' & TOKEN_SEP & prose & TOKEN_SEP & ';';
function           = function_no_args | function_with_args;
function_no_args   = identifier_restricted & TOKEN_SEP & type_specifier;
function_with_args = identifier_restricted
                   & PARENTHESIZED(function_param & (ARG_SEP & function_param)*)
                   & TOKEN_SEP & type_specifier
                   ;
function_param     = param_name & TOKEN_SEP & type_specifier;
type_specifier     = ':' & TOKEN_SEP & type_alternatives & (TOKEN_SEP & vararg)?;
type_alternatives  = maybe_ranged_type & (TOKEN_SEP & '|' & TOKEN_SEP & maybe_ranged_type)*;
vararg             = "...";
maybe_ranged_type  = '~'? & (basic_type_name | custom_type_name);
basic_type_name    = "expression"
                   | "condition"
                   | "number"
                   | "unsigned"
                   | "signed"
                   | "real"
                   ;
custom_type_name   = name;

Example: A function to convert an unsigned int to its unsigned little endian base 128 representation.

uleb128(v: unsigned): expression = """https://en.wikipedia.org/wiki/LEB128#Unsigned_LEB128""";

Example: A record contains a date followed by a colon, followed by a temperature reading.

record              = iso8601 & ':' & temperature;
iso8601: expression = """https://en.wikipedia.org/wiki/ISO_8601#Combined_date_and_time_representations""";
temperature         = digit+ & ('.' & digit+)?;
digit               = '0'~'9';

Variables

In some contexts, resolved data (data that has already been matched) or literal values can be bound to a variable for use elsewhere. Variables are bound either manually using the bind builtin function, or automatically when passing parameters to a macro. The variable’s type is inferred from its provenance and where it is ultimately used (a type mismatch indicates a malformed grammar).

Note: Variables cannot be re-bound.

When binding an expression that itself binds a variable, that expression’s bound variables are accessible from the outer scope using dot notation (this_exp_bound_value.sub_exp_bound_value).

Example: A document consists of a type 1 record, followed by any number of type 5, 6, and 7 records, and is terminated with a type 0 record of length 0. A record begins with a header consisting of an 8-bit type and a 16-bit big endian unsigned integer indicating how many bytes of record data follow.

document            = record(1) & (record(5) | record(6) | record(7))* & terminator_record;
record(type)        = bind(header, record_header(type)) & record_data(header.length);
record_header(type) = u8(type) & u16(bind(length, ~));
record_data(length) = u8(~){length};
terminator_record   = u8(0) u16(0);
u8(v)               = uint(8, v);
u16(v)              = uint(16, v);
  • The record rule (a macro because it takes parameters) binds the result of the record_header rule to a variable called header. This gives it access to the record_header length variable as header.length.
  • The record_header rule specifies an 8-bit type value (a variable passed in to the macro as a parameter), and binds a 16-bit integer value to a variable called length.
  • The record_data rule takes a length parameter and matches that many bytes using repetition.

Types

KBNF has four main types:

  • identifier
  • expression
  • condition
  • number, of which there are three subtypes:
    • unsigned: limited to positive integers and 0
    • signed: limited to positive and negative integers, and 0 (but excluding -0)
    • real: any value from the set of reals, including qnan and snan unless otherwise specified

Types become relevant when calling functions (and indirectly when calling macros), which have restrictions on what types they accept and return. Also, repetition amounts are restricted to unsigned integers.

Note: number “subtypes” (signed, unsigned, real) aren’t actual types per se, but rather restrictions on what values are allowed in a particular context. calculations, for example, are done as if all operands were reals regardless of their actual “subtype” (subtracting two unsigned integers can give a negative result, dividing integers can result in a fraction, etc).

Identifier

A unique identifier for symbols, macros, and functions (which are all scoped globally), or variables (which are scoped locally).

Identifiers are case sensitive, and must be unique to their scope. Locally scoped identifiers (i.e. variable names) must be unique to both the local and global scope (name shadowing is not allowed).

Identifiers start with a letter, and can contain letters, numbers and the underscore character. The builtin function names are reserved at the global scope.

The general convention is to use all uppercase identifiers for “background-y” things like whitespace and separators and other structural components, which makes them easier for a human to gloss over (see the KBNF grammar document as an example).

identifier           = (identifier_firstchar & identifier_nextchar*) ! reserved_identifiers;
identifier_firstchar = unicode(L,M);
identifier_nextchar  = identifier_firstchar | unicode(N) | '_';
reserved_identifiers = "sized" | "aligned" | "swapped" | "when" | "bind"
                     | "uint" | "sint" | "float" | "inf" | "qnan" | "snan"
                     ;

Expressions

An expression represents the set of possible bit sequences that can be produced. Expressions are non-greedy (the shortest possible interpretation of an expression will be matched first).

expression = symbol
           | call
           | string_literal
           | maybe_ranged(codepoint_literal)
           | combination
           | builtin_functions
           | variable
           | grouped(expression)
           ;

Numbers

Numbers are used in calculations, numeric ranges, and as parameters to functions.

Certain functions take numeric parameters but restrict the allowed values (e.g. integers only, min/max value, etc).

Numbers can be expressed as numeric literals, or derived from functions, macros, and calculations.

Literals

Numeric Literals

Numeric literals can be expressed in binary, octal, decimal, or hexadecimal notation for integers, and in decimal or hexadecimal notation for reals.

Note: Decimal real notation translates more cleanly to decimal float encodings such as ieee754 decimal, and hexadecimal real notation translates more cleanly to binary float encodings such as ieee754 binary.

Conversions from literal reals to floating point encodings that differ in base are assumed to follow the generally accepted algorithms for such conversions (e.g. Grisu, std::strtod).

number_literal       = int_literal_bin | int_literal_oct | int_real_literal_dec | int_real_literal_hex;
int_real_literal_dec = neg? digit_dec+
                     & ('.' & digit_dec+ & (('e' | 'E') ('+' | '-')? & digit_dec+)?)?
                     ;
int_real_literal_hex = neg? & '0' & ('x' | 'X') & digit_hex+
                     & ('.' & digit_hex+ & (('p' | 'P') & ('+' | '-')? & digit_dec+)?)?
                     ;
int_literal_bin      = neg? & '0' & ('b' | 'B') & digit_bin+;
int_literal_oct      = neg? & '0' & ('o' | 'O') & digit_oct+;
neg                  = '-';

Examples:

header_signature = uint(5, 0b10111);
ascii_char_8bit  = uint(8, 0x00~0x7f);
tolerance        = float(32, -1.5~1.5);

Codepoints

Codepoints can be represented as literals, ranges, and category sets. Codepoint literals are placed between single or double quotes.

Expressing codepoint literals as a range causes every codepoint in the range to be added as an alternative.

codepoint_literal = '"' & maybe_escaped(printable_ws ! '"'){1} & '"'
                  | "'" & maybe_escaped(printable_ws ! "'"){1} & "'"
                  ;

Examples:

letter_a     = 'a';     # or "a"
a_to_c       = 'a'~'c'; # or "a"~"c", or 'a' | 'b' | 'c', or "a" | "b" | "c"
alphanumeric = unicode(L,N);

Strings

A string is syntactic sugar for a series of specific codepoints concatenated together. String literals are placed between single or double quotes.

string_literal = '"' & maybe_escaped(printable_ws ! '"'){2~} & '"'
               | "'" & maybe_escaped(printable_ws ! "'"){2~} & "'"
               ;

Example: The following are all equivalent:

str_abc_1 = "abc";
str_abc_2 = 'abc';
str_abc_3 = "a" & "b" & "c";
str_abc_4 = 'a' & 'b' & 'c';

Escape Sequence

Codepoint literals, string literals, and prose may contain codepoint escape sequences to represent troublesome codepoints.

Escape sequences are initiated with the backslash () character. If the next character following is an open square brace ([), it begins a codepoint escape. Otherwise the sequence represents that literal character.

escape_sequence = '\' & (printable ! '[') | codepoint_escape);

Example: A string containing double quotes.

mystr = "This is a "string""; # or you could use single quotes: 'This is a "string"'

Codepoint Escape

A codepoint escape interprets the hex digits between the sequence [ and ] as the hexadecimal numeric value of the codepoint being referred to.

codepoint_escape = '[' & digit_hex+ & ']';

Example: Emoji

mystr = "This is a [1f415]"; # "This is a 🐕"

Prose

Prose is meant as a last resort in attempting to describe something. If it has already been described elsewhere, you could put a URL in here. Otherise you could put in a natural language description.

prose = '"""' & (maybe_escaped(printable_wsl)+ ! '"""') & '"""'
      | "'''" & (maybe_escaped(printable_wsl)+ ! "'''") & "'''"
      ;

Note: Prose can only be used to define a function because it is by nature opaque; the function definition will assign types.

Example: A record contains a date and temperature separated by :, followed by a newline, followed by a flowery description of any length in iambic pentameter (newlines allowed), terminated by ===== on its own line.

record              = date & ':' & temperature & LF & flowery_description & LF & '=====' & LF;
date                = """YYYY-MM-DD, per https://en.wikipedia.org/wiki/ISO_8601#Calendar_dates""";
temperature         = digit+ & ('.' & digit+)?;
digit               = '0'~'9';
flowery_description: expression = """
A poetic description of the weather, written in iambic pentameter. For example:

While barred clouds bloom the soft-dying day,
And touch the stubble-plains with rosy hue;
Then in a wailful choir the small gnats mourn
Among the river sallows, borne aloft
Or sinking as the light wind lives or dies.
""";

Combinations

Combinations combine expressions together into more powerful expressions.

Combination precedence (low to high):

Concatenation

The concatenation combination produces an expression consisting of the expression on the left, followed by the expression on the right (both must match in their proper order for the combined expression to match). The operator symbol is & (think of it as meaning “x and then y”).

concatenate = expression & TOKEN_SEP & '&' & TOKEN_SEP & expression;

Example: Assignment consists of an identifier, at least one space, an equals sign, at least one space, and then an integer value, followed by a linefeed.

assignment = "a"~"z"+ 
           & " "+
           & "="
           & " "+
           & "0"~"9"+
           & "[a]"
           ;

Alternative

The alternative combination produces an expression that can match either the expression on the left or the expression on the right.

Alternatives are separated by a pipe (|) character. Only one of the alternative branches will be taken.

alternate = expression & TOKEN_SEP & '|' & TOKEN_SEP & expression;

Example: Addition or subtraction consists of an identifier, at least one space, a plus or minus sign, at least one space, and then another identifier, followed by a linefeed.

caculation = "a"~"z"+
           & " "+
           & ("+" | "-")
           & " "+
           & "a"~"z"+
           & "[a]"
           ;

Exclusion

Exclusion removes an expression from the set of expression alternatives.

exclude = expression & TOKEN_SEP & '!' & TOKEN_SEP & expression;

Example: An identifier can be any lowercase ASCII string except “fred”.

identifier = "a"~"z"+ ! "fred";

Repetition

“Repetition” is a bit of a misnomer, because it actually defines how many times an expression occurs, not how many times it repeats. Repetition amounts can be defined as a range or as a discrete amount. Think of repetition as “this expression, concatenated together for this range of occurrences”.

The repetition amount is an unsigned integer, appended to an expression as a discrete amount or range between curly braces (e.g. {10} or {1~5}). There are also shorthand notations for common cases:

  • ?: Zero or one (equivalent to {0~1})
  • *: Zero or more (equivalent to {0~})
  • +: One or more (equivalent to {1~})

repetition          = repeat_range | repeat_zero_or_one | repeat_zero_or_more | repeat_one_or_more;
repeat_range        = expression & '{' & TOKEN_SEP & maybe_ranged(number) & TOKEN_SEP & '}';
repeat_zero_or_one  = expression & '?';
repeat_zero_or_more = expression & '*';
repeat_one_or_more  = expression & '+';

Example: An identifier is 5, 6, 7, or 8 characters long, and is made up of characters from ‘a’ to ‘z’.

identifier = 'a'~'z'{5~8};

Example: An identifier must start with at least one uppercase ASCII letter, optionally followed by any number of lowercase ASCII letters, and optionally suffixed with an underscore.

identifier = 'A'~'Z'+ & 'a'~'z'* & '_'?;

Grouping

Expressions, calculations and conditions can be grouped in order to override the default precedence, or as a visual aid to make things more readable. To group, place the items between parentheses.

grouped(item)       = PARENTHESIZED(item);
PARENTHESIZED(item) = '(' & TOKEN_SEP item TOKEN_SEP & ')';

Exmples:

my_rule = ('a' | 'b') & ('x' | 'y'); my_macro1(a) = uint(8, (a + 5) * 2); my_macro2(a, b) = when( (a < 10 | a > 20) & (b < 10 | b > 20), "abc" ) | "def" ;

Calculations

Calculations perform arithmetic operations on numbers, producing a new number. All operands are treated as reals for the purpose of the calculation.

The following operations can be used:

  • Add (+)
  • Subtract (-)
  • Multiply (*)
  • Divide (/)
  • Modulus (%)
  • Power (^, where x^y means x to the power of y)
  • Negation (‘-‘)

Note: Calculations can produce a quiet NaN value under certain conditions in accordiance with the IEEE 754 specification. If different processing is required (such as traps or exceptions), this must be documented in your specification.

Operator precedence (low to high):

  • add, subtract
  • multiply, divide, modulus
  • power
  • negation

number       = calc_add | calc_sub | calc_mul_div;
calc_mul_div = calc_mul | calc_div | calc_mod | calc_pow_neg;
calc_pow_neg = calc_pow | calc_neg_val;
calc_neg_val = calc_neg | calc_val;
calc_val     = number_literal | variable | maybe_grouped(number);
calc_add     = number & TOKEN_SEP & '+' & TOKEN_SEP & calc_mul_div;
calc_sub     = number & TOKEN_SEP & '-' & TOKEN_SEP & calc_mul_div;
calc_mul     = calc_mul_div & TOKEN_SEP & '*' & TOKEN_SEP & calc_pow_val;
calc_div     = calc_mul_div & TOKEN_SEP & '/' & TOKEN_SEP & calc_pow_val;
calc_mod     = calc_mul_div & TOKEN_SEP & '%' & TOKEN_SEP & calc_pow_val;
calc_pow     = calc_pow_val & TOKEN_SEP & '^' & TOKEN_SEP & calc_neg_val;
calc_neg     = '-' & calc_val;

Example: A record begins with a 4-bit length field (length is in 32-bit increments) and 4-bit flags field containing (…), followed by the contents of the record.

record = uint(4, bind(length, ~)) & flags & uint(8, ~){length*4};
flags  = ...

Conditions

Conditions are produced by comparing numbers, and by performing logical operations on those comparisons, resulting in either true or false. Conditions are used in when calls, and can be grouped.

Comparisons:

  • Less than (<)
  • Less than or equal to (<=)
  • Equal to (=)
  • Greater than or equal to (>=)
  • Greater than (>)

Logical operations:

  • And (&)
  • Or (|)
  • Not (!), which is a unary operator

Condition precedence (low to high):

  • comparisons
  • logical or
  • logical and
  • logical not
= | ">";
logical_or = condition & TOKEN_SEP & '|' & TOKEN_SEP & condition;
logical_and = condition & TOKEN_SEP & '&' & TOKEN_SEP & condition;
logical_not = '!' & TOKEN_SEP & condition;">

condition          = comparison | logical_op;
logical_op         = logical_or | logical_op_and_not;
logical_op_and_not = logical_and | logical_op_not;
logical_op_not     = logical_not | maybe_grouped(condition);
comparison         = number & TOKEN_SEP & comparator & TOKEN_SEP & number;
comparator         = "<" | "<=" | "=" | ">= | ">";
logical_or         = condition & TOKEN_SEP & '|' & TOKEN_SEP & condition;
logical_and        = condition & TOKEN_SEP & '&' & TOKEN_SEP & condition;
logical_not        = '!' & TOKEN_SEP & condition;

Example:

record       = uint(8, bind(type, 1~))
             & ( when(type = 1, type_1)
               | when(type = 2, type_2)
               | when(type > 2, type_default)
               )
             ;
type_1       = ...
type_2       = ...
type_default = ...

Ranges

A range consists of one of the following:

  • A low value and a high value separated by a tilde (low ~ high), indicating a (closed interval) low and high bound.
  • A low value followed by a tilde (low ~), indicating a low bound only.
  • A tilde followed by a high value (~ high), indicating a high bound only.
  • A tilde by itself (~), indicating no bound.
  • A value with no tilde, restricting the "range" to only that value.

A codepoint range represents the set of each codepoint in the range as alternatves.

A repetition range represents a range in the number of occurrences that will match the rule.

A number range will ultimately be passed to a context requiring a specific subtype (such as repetition, uint, sint, float), and will thus represent each value in the range (for all discrete values that are representable by the subtype) as alternatves.

Notes:

  • Quiet NaN and signaling NaN values are never included in the set of reals returned by a range (e.g. float(64,~), float(64,0~), float(64,~0) etc do not include float(64,qnan) or float(64,snan)).
  • The concept of negative zero (-0) is included in the set returned by any range that crosses out of negative values.

expression         = ...
                   | maybe_ranged(codepoint_literal)
                   | ...
                   ;
repeat_range       = expression & '{' & TOKEN_SEP & maybe_ranged(number) & TOKEN_SEP & '}';
function_uint      = fname_uint & PARENTHESIZED(bit_count & ARG_SEP & maybe_ranged(number));
function_sint      = fname_sint & PARENTHESIZED(bit_count & ARG_SEP & maybe_ranged(number));
function_float     = fname_float & PARENTHESIZED(bit_count & ARG_SEP & maybe_ranged(number));
ranged(item)       = (item & TOKEN_SEP)? & '~' & (TOKEN_SEP & item)?;
maybe_ranged(item) = item | ranged(item);

Example: Codepoint range.

hex_digit = ('0'~'9' | 'a'~'f');

Example: Repetition range: A name field contains between 1 and 100 characters.

name_field = unicode(L,M,N,P,S){1~100};

Example: Number range: The RPM value is an unsigned 16 bit big endian integer from 0 to 1000.

rpm = uint(16, ~1000); # It's a uint, so already limited to 0~

Comments

A comment begins with a hash char (#) and continues to the end of the current line. Comments can be placed after pretty much any token.

comment = '#' & (printable_ws ! LINE_END)* & LINE_END;

Example:

kbnf_v1 utf-8
- identifier = mygrammar_v1
- description = My first grammar

# This is the first place where a comment can exist.
myrule # comment
 = # comment
 myexpression # comment
 ; # comment
# comment

Builtin Functions

KBNF comes with some fundamental functions built-in:

sized Function

sized(bit_count: unsigned, expr: expression): expression =
    """
    Requires that `expr` produce exactly `bit_count` bits.
    Expressions containing repetition that would have matched on their own are
    no longer sufficient until the production fills exactly `bit_count` bits.
    """;

Example: A name field must contain exactly 200 bytes worth of character data, padded with spaces as needed.

name_field = sized(200*8, unicode(L,M,N,P,Zs)* & ' '*);

Technically, the & ' '* part is superfluous since Unicode category Zs already includes space, but it helps readability to highlight how to pad the field. One could even be more explicit:

name_field    = sized(200*8, unicode(L,M,N,P,Zs)* & space_padding);
space_padding = ' '*;

Example: The "records" section can contain any number of length-delimited records, but must be exactly 1024 bytes long. This section can be padded with 0 length records (which is a record with a length field of 0 and no payload - essentially a zero byte).

record_section     = sized(1024*8, record* & zero_length_record*);
record             = byte(bind(length,~)) & byte(~){length};
zero_length_record = byte(0);
byte(v)            = uint(8,v);

aligned Function

aligned(bit_count: unsigned, expr: expression, padding: expression): expression =
    """
    Requires that `expr` and `padding` together produce a multiple of `bit_count` bits.
    If `expr` doesn't produce a multiple of `bit_count` bits, the `padding` expression
    is used in the same manner as the `sized` function to produce the remaining bits.
    """;

Example: The "records" section can contain any number of length-delimited records, but must end on a 32-bit boundary. This section can be padded with 0 length records (which is a record with a length field of 0 and no payload - essentially a zero byte).

record_section     = aligned(32, record*, zero_length_record*);
record             = byte(bind(length,~)) & byte(~){length};
zero_length_record = byte(0);
byte(v)            = uint(8, v);

swapped Function

swapped(bit_granularity: unsigned, expr: expression): expression =
    """
    Swaps the order of `expr`'s bits with a granularity of `bit_granularity`.
    If `expr` doesn't resolve to a multiple of `bit_granularity` bits, the
    expression doesn't match.
    """;

Example: A document begins with a 32-bit little endian unsigned int version field, followed by the contents. Only version 5 documents are supported.

document  = version_5 & contents;
version_5 = swapped(8, uint(32, 5));
contents  = ...

Example: A header begins with a 16-bit unsigned int identifier that is actually bit-swapped, followed by contents based on the identifier.

header               = bitswapped_uint16(bind(identifier, ~)) & contents(identifier);
bitswapped_uint16(v) = swapped(1, uint(16, v));
contents(identifier) = when(identifier = 1, type_1)
                     | when(identifier = 2, type_2)
                     ;
type_1               = ...
type_2               = ...

when Function

when(cond: condition, expr: expression): expression =
    """
    Matches `expr` only when `cond` is true.
    """;

Example: The extensions section contains 32 extension slots. Each extension starts with a 1-byte type field, followed by a 24-bit big endian field containing the length of the payload. Valid payload types are 1, 2, or 3 (payload type 0 is a dummy type meaning "no extension", so the length field is ignored and there is no payload data). The same extension type can be used multiple times.

extensions          = extension{32};
extension           = uint(8,bind(type,0~3)) & uint(24,bind(length,~))
                    & ( when(type = 1, extension_1(length))
                      | when(type = 2, extension_2(length))
                      | when(type = 3, extension_3(length))
                      # When type is 0, no extension and length is ignored
                      )
                    ;
extension_1(length) = ...
extension_2(length) = ...
extension_3(length) = ...

bind Function

bind(variable_name: identifier, value: expression | ~number): expression | ~number =
    """
    Binds `value` to a local variable for subsequent re-use in the current rule.
    `bind` transparently passes through the type and value of `value`, meaning that
    the context around the `bind` call behaves as though only what the `bind`
    function surrounded is present. This allows a match as normal, while also
    allowing the resolved value to be used again later in the rule.
    """;

Example: Match "abc/abc", "fred/fred" etc.

sequence = bind(repeating_value,('a'~'z')+) & '/' & repeating_value;

Example: BASH "here" document: Bind the variable "terminator" to whatever follows the "<<" until the next linefeed. The here-document contents continue until the terminator value is encountered again.

here_document             = "<<" & bind(terminator, NOT_LF+) & LF & here_contents(terminator) & terminator;
here_contents(terminator) = ANY_CHAR* ! terminator;
ANY_CHAR                  = ~;
LF                        = '[a]';
NOT_LF                    = ANY_CHAR ! LF;

Example: Interpret the next 16 bits as a big endian unsigned int and bind the resolved number to "length". That many following bytes make up the record contents.

length_delimited_record = uint16(bind(length, ~)) & record_contents(length);
record_contents(length) = byte(~){length};
uint16(v)               = uint(16, v);
byte(v)                 = uint(8, v);

unicode Function

unicode(categories: unicode_category ...): expression =
    """
    Creates an expression containing the alternatives set of all Unicode
    codepoints that have any of the given Unicode categories.

    `categories` is a comma separated list of 1 letter major category or 2-letter minor
    category names, as listed in https://www.unicode.org/versions/Unicode15.0.0/ch04.pdf#G134153

    Example: all letters and space separators: unicode(L,Zs)
    """;

Example: Allow letter, numeral, and space characters.

letter_digit_space = unicode(N,L,Zs);

uint Function

uint(bit_count: unsigned, range: ~unsigned): expression =
    """
    Creates an expression that matches the given range of big endian unsigned
    integers with the given number of bits.
    """;

Example: The length field is a 16-bit unsigned integer value.

sint Function

sint(bit_count: unsigned, range: ~signed): expression =
    """
    Creates an expression that matches the given range of big endian signed
    integers with the given number of bits.
    """;

Example: The points field is a 16-bit signed integer value from -10000 to 10000.

points = sint(32, -10000~10000);

float Function

float(bit_count: unsigned, range: ~real): expression =
    """
    Creates an expression that matches the given range of big endian ieee754 binary
    floating point values. `bit_count` must be a valid size according to ieee754 binary.
    """;

Note: ranges passed to the float function will never include qnan or snan. These special values cannot be part of a range, and instead must be explicitly passed to the float function.

Example: The temperature field is a 32-bit float value from -1000 to 1000.

rpm = float(32, -1000~1000);

Example: Accept any real or any NaN, encoded as a float64.

value                 = float64_or_nan(~);
float64_or_nan(range) = float(64,range) | float(64,qnan) | float(64,snan);

inf Function

inf: real =
    """
    Returns a number representing the mathematical concept of infinity.
    The sign of the infinity can be reversed using negation.
    This representation is only for the concept itself; actual encodings in a
    document will depend on the encoding format used.
    """;

Example: Negative infinity used as a record terminator.

record     = reading* terminator;
reading    = float(32, ~) ! terminator;
terminator = float(32, -inf);

qnan Function

qnan: real =
    """
    returns a number representing the concept of "not-a-number" in its quiet form.
    This representation is only for the concept itself; actual encodings in a
    document will depend on the encoding format used.
    """;

Example: Quiet NaN used to mark invalid readings.

record  = reading{32};
reading = float(32, ~) | invalid;
invalid = float(32, qnan);

snan Function

snan: real =
    """
    returns a number representing the concept of "not-a-number" in its signaling form.
    This representation is only for the concept itself; actual encodings in a
    document will depend on the encoding format used.
    """;

Example: Signaling NaN used to mark invalid readings.

record  = reading{32};
reading = float(32, ~) | invalid;
invalid = float(32, snan);

Examples

A Complex Example

  • A document contains one or more sections, and terminates on EOF.
  • A section begins with a sentinel (a record with type between 0x80 and 0xfe, and a length of 0), followed by an arbitrary number of records, followed by the same sentinel value again to terminate the list of records in this section.
  • A record is comprised of an 8-bit record_type, a payload, and a possible suffix depending on the record_type.
  • The payload is comprised of a little endian 24-bit length field representing the number of bytes in the payload, followed by the payload bytes, followed by possible 0xff padding to bring it to a multiple of 32 bits.
  • Depending on the record_type, there may be a suffix.

document                = section+;
section                 = bind(sentinel,uint(8,0x80~0xfe)) & length_field(0) & record* & sentinel;
record                  = bind(record_type,type_field) & padded_payload & suffix(record_type.type);
type_field              = uint(8,bind(type,0~2));
padded_payload          = aligned(32, payload, uint(8,0xff)*);
payload                 = length_field(bind(byte_count,~)) & uint(8,~){byte_count};
length_field(contents)  = swapped(8, uint(24,contents));
suffix(type)            = when(type = 2, type2)
                        | when(type = 1, type1)
                        # type 0 means no suffix
                        ;
type1                   = ...
type2                   = ...

Example: Internet Protocol version 4

See accompanying document: ipv4.kbnf

The KBNF Grammar in KBNF

= | ">";
logical_or = condition & TOKEN_SEP & '|' & TOKEN_SEP & condition;
logical_and = condition & TOKEN_SEP & '&' & TOKEN_SEP & condition;
logical_not = '!' & TOKEN_SEP & condition;

number = calc_add | calc_sub | calc_mul_div;
calc_mul_div = calc_mul | calc_div | calc_mod | calc_pow_neg;
calc_pow_neg = calc_pow | calc_neg_val;
calc_neg_val = calc_neg | calc_val;
calc_val = number_literal | variable | maybe_grouped(number);
calc_add = number & TOKEN_SEP & '+' & TOKEN_SEP & calc_mul_div;
calc_sub = number & TOKEN_SEP & '-' & TOKEN_SEP & calc_mul_div;
calc_mul = calc_mul_div & TOKEN_SEP & '*' & TOKEN_SEP & calc_pow_val;
calc_div = calc_mul_div & TOKEN_SEP & '/' & TOKEN_SEP & calc_pow_val;
calc_mod = calc_mul_div & TOKEN_SEP & '%' & TOKEN_SEP & calc_pow_val;
calc_pow = calc_pow_val & TOKEN_SEP & '^' & TOKEN_SEP & calc_neg_val;
calc_neg = '-' & calc_val;

grouped(item) = PARENTHESIZED(item);
ranged(item) = (item & TOKEN_SEP)? & '~' & (TOKEN_SEP & item)?;
maybe_grouped(item) = item | grouped(item);
maybe_ranged(item) = item | ranged(item);

number_literal = int_literal_bin | int_literal_oct | int_real_literal_dec | int_real_literal_hex;
int_real_literal_dec = neg? digit_dec+
& ('.' & digit_dec+ & (('e' | 'E') ('+' | '-')? & digit_dec+)?)?
;
int_real_literal_hex = neg? & '0' & ('x' | 'X') & digit_hex+
& ('.' & digit_hex+ & (('p' | 'P') & ('+' | '-')? & digit_dec+)?)?
;
int_literal_bin = neg? & '0' & ('b' | 'B') & digit_bin+;
int_literal_oct = neg? & '0' & ('o' | 'O') & digit_oct+;
neg = '-';

identifier_any = name;
identifier_restricted = identifier_any ! reserved_identifiers;
reserved_identifiers = "sized"
| "aligned"
| "swapped"
| "when"
| "bind"
| "uint"
| "sint"
| "float"
| "inf"
| "qnan"
| "snan"
;

name = name_firstchar & name_nextchar*;
name_firstchar = unicode(L,M);
name_nextchar = name_firstchar | unicode(N) | '_';

printable = unicode(L,M,N,P,S);
printable_ws = printable | WS;
printable_wsl = printable | WSL;
digit_bin = '0'~'1';
digit_oct = '0'~'7';
digit_dec = '0'~'9';
digit_bin = ('0'~'9') | ('a'~'f') | ('A'~'F');

comment = '#' & printable_ws* & LINE_END;

PARENTHESIZED(item) = '(' & TOKEN_SEP & item & TOKEN_SEP & ')';
ARG_SEP = TOKEN_SEP & ',' & TOKEN_SEP;
TOKEN_SEP = MAYBE_WSLC;

# Whitespace
MAYBE_WS = WS*;
SOME_WS = WS & MAYBE_WS;
MAYBE_WSLC = (WSL | comment)*;
SOME_WSLC = WSL & MAYBE_WSLC;
WSL = WS | LINE_END;
WS = HT | SP;
LINE_END = CR? & LF;
HT = '[9]';
LF = '[a]';
CR = '[d]';
SP = '[20]';">

kbnf_v1 utf-8
- identifier  = kbnf_v1
- description = Karl's Backus-Naur Form, version 1

document               = document_header & MAYBE_WSLC & start_rule & (MAYBE_WSLC & rule)*;

kbnf_version           = '1';

document_header        = "kbnf_v" & kbnf_version & SOME_WS
                       & character_encoding & LINE_END
                       & header_line* & LINE_END
                       ;
character_encoding     = ('a'~'z' | 'A'~'Z' | '0'~'9' | '_' | '-' | '.' | ':' | '+' | '(' | ')'){1~40};
header_line            = '-' & SOME_WS
                       & header_name & MAYBE_WS
                       & '=' & MAYBE_WS
                       & header_value & LINE_END
                       ;
header_name            = printable+;
header_value           = printable_ws+;

rule                   = symbol_rule | macro_rule | function_rule;
start_rule             = symbol_rule;
symbol_rule            = symbol & TOKEN_SEP & '=' & TOKEN_SEP & expression & TOKEN_SEP & ';';
macro_rule             = macro & TOKEN_SEP & '=' & TOKEN_SEP & expression & TOKEN_SEP & ';';
function_rule          = function & TOKEN_SEP & '=' & TOKEN_SEP & prose & TOKEN_SEP & ';';

expression             = symbol
                       | call
                       | string_literal
                       | maybe_ranged(codepoint_literal)
                       | combination
                       | builtin_functions
                       | variable
                       | grouped(expression)
                       ;

symbol                 = identifier_restricted;
macro                  = identifier_restricted & PARENTHESIZED(param_name & (ARG_SEP & param_name)*);
param_name             = identifier_restricted;
function               = function_no_args | function_with_args;
function_no_args       = identifier_restricted & TOKEN_SEP & type_specifier;
function_with_args     = identifier_restricted
                       & PARENTHESIZED(function_param & (ARG_SEP & function_param)*)
                       & TOKEN_SEP & type_specifier
                       ;
function_param         = param_name & TOKEN_SEP & type_specifier;
type_specifier         = ':' & TOKEN_SEP & type_alternatives & (TOKEN_SEP & vararg)?;
type_alternatives      = type_name & (TOKEN_SEP & '|' & TOKEN_SEP & type_name)*;
vararg                 = "...";
type_name              = basic_type_name | custom_type_name;
basic_type_name        = "expression"
                       | "condition"
                       | "number"
                       | "unsigned"
                       | "signed"
                       | "real"
                       ;
custom_type_name       = name;

call                   = identifier_any & PARENTHESIZED(call_param & (ARG_SEP & call_param)*);
call_param             = any_type;

combination            = alternate | combination_w_exclude;
combination_w_exclude  = exclude | combination_w_concat;
combination_w_concat   = concatenate | combination_w_repeat;
combination_w_repeat   = repetition | combination;
alternate              = expression & TOKEN_SEP & '|' & TOKEN_SEP & expression;
concatenate            = expression & TOKEN_SEP & '&' & TOKEN_SEP & expression;
exclude                = expression & TOKEN_SEP & '!' & TOKEN_SEP & expression;
repetition             = repeat_range | repeat_zero_or_one | repeat_zero_or_more | repeat_one_or_more;
repeat_range           = expression & '{' & TOKEN_SEP & maybe_ranged(number) & TOKEN_SEP & '}';
repeat_zero_or_one     = expression & '?';
repeat_zero_or_more    = expression & '*';
repeat_one_or_more     = expression & '+';

prose                  = '"""' & maybe_escaped(printable_wsl)+ & '"""'
                       | "'''" & maybe_escaped(printable_wsl)+ & "'''"
                       ;
codepoint_literal      = '"' & maybe_escaped(printable_ws ! '"'){1} & '"'
                       | "'" & maybe_escaped(printable_ws ! "'"){1} & "'"
                       ;
string_literal         = '"' & maybe_escaped(printable_ws ! '"'){2~} & '"'
                       | "'" & maybe_escaped(printable_ws ! "'"){2~} & "'"
                       ;
maybe_escaped(charset) = (charset ! '\') | escape_sequence;
escape_sequence        = '\' & (printable ! '[') | codepoint_escape);
codepoint_escape       = '[' & digit_hex+ & ']';

builtin_functions      = sized
                       | aligned
                       | swapped
                       | when
                       | bind
                       | unicode
                       | uint
                       | sint
                       | float
                       | inf
                       | qnan
                       | snan
                       ;

sized(bit_count: unsigned, expr: expression): expression =
    """
    Requires that `expr` produce exactly `bit_count` bits.
    Expressions containing repetition that would have matched on their own are
    no longer sufficient until the production fills exactly `bit_count` bits.
    """;

aligned(bit_count: unsigned, expr: expression, padding: expression): expression =
    """
    Requires that `expr` and `padding` together produce a multiple of `bit_count` bits.
    If `expr` doesn't produce a multiple of `bit_count` bits, the `padding` expression
    is used in the same manner as the `sized` function to produce the remaining bits.
    """;

swapped(bit_granularity: unsigned, expr: expression): expression =
    """
    Swaps the order of `expr`'s bits with a granularity of `bit_granularity`.
    If `expr` doesn't resolve to a multiple of `bit_granularity` bits, the
    expression doesn't match.
    """;

when(cond: condition, expr: expression): expression =
    """
    Matches `expr` only when `cond` is true.
    """;

bind(variable_name: identifier, value: expression | ~number): expression | ~number =
    """
    Binds `value` to a local variable for subsequent re-use in the current rule.
    `bind` transparently passes through the type and value of `value`, meaning that
    the context around the `bind` call behaves as though only what the `bind`
    function surrounded is present. This allows a match as normal, while also
    allowing the resolved value to be used again later in the rule.
    """;

unicode(categories: unicode_category ...): expression =
    """
    Creates an expression containing the alternatives set of all Unicode
    codepoints that have any of the given Unicode categories.

    `categories` is a comma separated list of 1 letter major category or 2-letter minor
    category names, as listed in https://www.unicode.org/versions/Unicode15.0.0/ch04.pdf#G134153

    Example: all letters and space separators: unicode(L,Zs)
    """;

uint(bit_count: unsigned, range: ~unsigned): expression =
    """
    Creates an expression that matches the given range of big endian unsigned
    integers with the given number of bits.
    """;

sint(bit_count: unsigned, range: ~signed): expression =
    """
    Creates an expression that matches the given range of big endian signed
    integers with the given number of bits.
    """;

float(bit_count: unsigned, range: ~real): expression =
    """
    Creates an expression that matches the given range of big endian ieee754 binary
    floating point values. `bit_count` must be a valid size according to ieee754 binary.
    """;

inf: real =
    """
    Returns a number representing the mathematical concept of infinity.
    The sign of the infinity can be reversed using negation.
    This representation is only for the concept itself; actual encodings in a
    document will depend on the encoding format used.
    """;

qnan: real =
    """
    returns a number representing the concept of "not-a-number" in its quiet form.
    This representation is only for the concept itself; actual encodings in a
    document will depend on the encoding format used.
    """;

snan: real =
    """
    returns a number representing the concept of "not-a-number" in its signaling form.
    This representation is only for the concept itself; actual encodings in a
    document will depend on the encoding format used.
    """;

padding                = expression;
bit_count              = number;
bit_granularity        = number;
unicode_category       = ('A'~'Z') & ('a'~'z')?;
any_type               = condition | number | expression;

variable               = local_id | variable & '.' & local_id;
local_id               = identifier_restricted;

condition              = comparison | logical_ops;
logical_ops            = logical_or | logical_ops_and_not;
logical_ops_and_not    = logical_and | logical_op_not;
logical_op_not         = logical_not | maybe_grouped(condition);
comparison             = number & TOKEN_SEP & comparator & TOKEN_SEP & number;
comparator             = "<" | "<=" | "=" | ">= | ">";
logical_or             = condition & TOKEN_SEP & '|' & TOKEN_SEP & condition;
logical_and            = condition & TOKEN_SEP & '&' & TOKEN_SEP & condition;
logical_not            = '!' & TOKEN_SEP & condition;

number                 = calc_add | calc_sub | calc_mul_div;
calc_mul_div           = calc_mul | calc_div | calc_mod | calc_pow_neg;
calc_pow_neg           = calc_pow | calc_neg_val;
calc_neg_val           = calc_neg | calc_val;
calc_val               = number_literal | variable | maybe_grouped(number);
calc_add               = number & TOKEN_SEP & '+' & TOKEN_SEP & calc_mul_div;
calc_sub               = number & TOKEN_SEP & '-' & TOKEN_SEP & calc_mul_div;
calc_mul               = calc_mul_div & TOKEN_SEP & '*' & TOKEN_SEP & calc_pow_val;
calc_div               = calc_mul_div & TOKEN_SEP & '/' & TOKEN_SEP & calc_pow_val;
calc_mod               = calc_mul_div & TOKEN_SEP & '%' & TOKEN_SEP & calc_pow_val;
calc_pow               = calc_pow_val & TOKEN_SEP & '^' & TOKEN_SEP & calc_neg_val;
calc_neg               = '-' & calc_val;

grouped(item)          = PARENTHESIZED(item);
ranged(item)           = (item & TOKEN_SEP)? & '~' & (TOKEN_SEP & item)?;
maybe_grouped(item)    = item | grouped(item);
maybe_ranged(item)     = item | ranged(item);

number_literal         = int_literal_bin | int_literal_oct | int_real_literal_dec | int_real_literal_hex;
int_real_literal_dec   = neg? digit_dec+
                       & ('.' & digit_dec+ & (('e' | 'E') ('+' | '-')? & digit_dec+)?)?
                       ;
int_real_literal_hex   = neg? & '0' & ('x' | 'X') & digit_hex+
                       & ('.' & digit_hex+ & (('p' | 'P') & ('+' | '-')? & digit_dec+)?)?
                       ;
int_literal_bin        = neg? & '0' & ('b' | 'B') & digit_bin+;
int_literal_oct        = neg? & '0' & ('o' | 'O') & digit_oct+;
neg                    = '-';

identifier_any         = name;
identifier_restricted  = identifier_any ! reserved_identifiers;
reserved_identifiers   = "sized"
                       | "aligned"
                       | "swapped"
                       | "when"
                       | "bind"
                       | "uint"
                       | "sint"
                       | "float"
                       | "inf"
                       | "qnan"
                       | "snan"
                       ;

name                   = name_firstchar & name_nextchar*;
name_firstchar         = unicode(L,M);
name_nextchar          = name_firstchar | unicode(N) | '_';

printable              = unicode(L,M,N,P,S);
printable_ws           = printable | WS;
printable_wsl          = printable | WSL;
digit_bin              = '0'~'1';
digit_oct              = '0'~'7';
digit_dec              = '0'~'9';
digit_bin              = ('0'~'9') | ('a'~'f') | ('A'~'F');

comment                = '#' & printable_ws* & LINE_END;

PARENTHESIZED(item)    = '(' & TOKEN_SEP & item & TOKEN_SEP & ')';
ARG_SEP                = TOKEN_SEP & ',' & TOKEN_SEP;
TOKEN_SEP              = MAYBE_WSLC;

# Whitespace
MAYBE_WS               = WS*;
SOME_WS                = WS & MAYBE_WS;
MAYBE_WSLC             = (WSL | comment)*;
SOME_WSLC              = WSL & MAYBE_WSLC;
WSL                    = WS | LINE_END;
WS                     = HT | SP;
LINE_END               = CR? & LF;
HT                     = '[9]';
LF                     = '[a]';
CR                     = '[d]';
SP                     = '[20]';

Read More