LEXER

The lexer is the first stage in the compiler chain; the only exception is when a preprocessor runs first, as in C. The job of the lexer is to read in plain-text source code and convert it into a series of "tokens" such as symbols, keywords, identifiers, and value literals.

O's compiler is broken up into a series of completely separate parts, of which the lexer is the first. Each component in the compiler chain has a formalised input format and output format, allowing simple transfer between stages. This also has the added advantages of allowing incremental compilation and dramatically simplifying parallel compilation for much of the process.

Input format
Tokens
Encoding
Output format


Back to top

INPUT FORMAT

As described, the O lexer takes in plain-text source code. O source code is encoded in UTF-8 (or ASCII, which is UTF-8-compatible) and consists of various characters and symbols. Line endings may be a linefeed (U+000A) as on the UNIX family of operating systems, a carriage return (U+000D) as on classic MacOS, or a carriage return followed immediately by a linefeed (U+000D, U+000A) as on Windows. Whitespace in a file may consist of space (U+0020) and tab (U+0009) characters in any combination. The file may end arbitrarily, or optionally with either an "end of file" character (U+001A) or a null character (U+0000).

If an "end of file" or null character is present in the middle of the physical file, the O lexer will see this as the end of the file and exit early.

COMMENTS

There are 3 classes of comment in O, 2 of which are peripheral and ignored by the lexer. The first is the line comment. A line comment begins with // and continues until the end of the line. From the start of a line comment, any characters at all may be present, as the lexer does not validate any of its content; it simply scans forward to the first line break, ignoring everything else. The second type is the block comment. A block comment begins with /* and ends with */. Much like the line comment, content within the bounds of the comment may be completely arbitrary, as it is ignored entirely by the lexer.
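As an illustrative sketch (not part of the specification), the skipping behaviour for these two comment classes might look like the following. For brevity it treats only the linefeed as a line break, though the input format also permits CR and CRLF:

```python
def skip_comment(src: str, pos: int) -> int:
    """Return the index just past a peripheral comment starting at pos, or pos if none."""
    if src.startswith("//", pos) and not src.startswith("///", pos):
        # Line comment: ignore everything up to (but not including) the line break.
        end = src.find("\n", pos)
        return len(src) if end == -1 else end
    if src.startswith("/*", pos):
        # Block comment: ignore everything up to and including the closing */.
        end = src.find("*/", pos + 2)
        if end == -1:
            raise ValueError("unterminated block comment")
        return end + 2
    return pos  # not a peripheral comment (may be a /// documentation comment)
```

Note that /// is deliberately excluded here, as documentation comments are kept by the lexer rather than skipped.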

Comments may be used for various purposes, such as describing developer intent, highlighting noteworthy code, or marking work in progress and what remains to be done. Line and block comments should not be used for documentation. Their use for that purpose is not forbidden, but there are documentation comments designed specifically for documenting code, with features that make doing so simple and effective.


Back to top

TOKENS

Tokens are broken up into several groups. Each group has different properties. Some - keywords for example - are a collection of static sequences of characters, meaning their possibilities are very small and detecting them is very simple. Others, such as varstring literals, are much more complicated as they are broken up into several sections with other tokens in the middle.

Symbols
Keywords
Marked Keywords
Integer Literals
Float Literals
Decimal Literals
Byte Literals
Boolean Literals
Character Literals
String Literals
Hexstring Literals
Varstring Literals
Documentation
Identifiers


Back to tokens

SYMBOLS

A symbol is a small group of non-alphanumeric characters. These could represent an operator such as + or *~, or could be grouping characters such as ()[]{} among other things. There is a finite set of symbols, and a match to one comprises a complete token.

0: ( 1: ) 2: { 3: } 4: [ 5: ] 6: = 7: == 8: != 9: > 10: >= 11: <= 12: < 13: + 14: += 15: ++ 16: - 17: -= 18: -- 19: * 20: *= 21: / 22: /= 23: ~ 24: ~= 25: *~ 26: *~= 27: ^ 28: ^= 29: % 30: %= 31: | 32: |= 33: && 34: || 35: ! 36: >< 37: ?? 38: ## 39: #? 40: . 41: .. 42: , 43: ; 44: :

Each symbol has a corresponding number. This number is the index of that symbol and is what is used to represent that symbol in the output format of the lexer (explained later). This same numbering principle is used for things like keywords.

Symbols may be concatenated directly in source code, for instance, +- will be identified as two tokens. However, be aware that some concatenated symbols may actually be understood as different symbols. For instance, +== might be intended to be an addition operator followed by an equivalence operator, but the lexer will tokenise it as += and =.
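This longest-match behaviour can be sketched in Python using the symbol table above (the function name and structure are illustrative, not the compiler's actual code):

```python
# The 45 symbols, ordered so that each list position is the symbol's index.
SYMBOLS = ["(", ")", "{", "}", "[", "]", "=", "==", "!=", ">", ">=", "<=", "<",
           "+", "+=", "++", "-", "-=", "--", "*", "*=", "/", "/=", "~", "~=",
           "*~", "*~=", "^", "^=", "%", "%=", "|", "|=", "&&", "||", "!",
           "><", "??", "##", "#?", ".", "..", ",", ";", ":"]

def match_symbol(src: str, pos: int = 0):
    """Return (index, symbol) for the longest symbol at src[pos:], or None."""
    best = None
    for idx, sym in enumerate(SYMBOLS):
        if src.startswith(sym, pos) and (best is None or len(sym) > len(best[1])):
            best = (idx, sym)
    return best
```

Running this on +== yields += (index 14), matching the tokenisation described above.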


Back to tokens

KEYWORDS

Unlike many other languages, O divides its keywords into several groups. This is because O allows keywords, rather than commas, to be used to separate method parameters in method definitions. Some keywords - such as those for core types - should not be used for this purpose as it would cause serious confusion. Keywords are broken up into groups like so:

Core types: 0: bool 1: byte 2: char 3: decimal 4: double 5: float 6: int 7: long 8: string 9: uint 10: ulong 11: var 12: void

Keywords: 0: base 1: body 2: builder 3: class 4: entrypoint 5: enum 6: flat 7: interface 8: new 9: out 10: pipe 11: piped 12: private 13: public 14: ref 15: restricted 16: static 17: this

Statements and Methods: 0: assert 1: catch 2: continue 3: else 4: finally 5: for 6: foreach 7: forever 8: give 9: if 10: import 11: nameof 12: pass 13: return 14: strof 15: throw 16: try 17: typeof 18: where 19: while

Separators: 0: and 1: at 2: but 3: by 4: from 5: has 6: in 7: is 8: of 9: or 10: then 11: to


Back to tokens

MARKED KEYWORDS

When declaring methods that use keywords for separators, marked keywords are used to make sure the definition is clear. A marked keyword is simply a keyword with an underscore on either side, such as _from_. Such marked keywords may be any of the "separator" keywords.

0: _and_ 1: _at_ 2: _but_ 3: _by_ 4: _from_ 5: _has_ 6: _in_ 7: _is_ 8: _of_ 9: _or_ 10: _then_ 11: _to_


Back to tokens

INTEGER LITERALS

An integer literal represents a whole number in code. It may be given in base 2, 10, or 16. The digits of an integer literal may be interspersed with underscores to aid in readability. For instance, 123 thousand million could be written as 123_000_000_000. These underscores may be anywhere within the number and may be grouped together into long strings of underscores, but they may not be placed on the exterior of the number i.e. as the first or last digit. Denary integers are just the number by itself with no need for prefixes or suffixes at all. Binary integers are prefixed with a lowercase letter "b", and hexadecimal integers are prefixed with "0x". Here are some examples of integer literals:

12 12_34 1_____2 b101101 0x8aD5
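The underscore placement rules can be captured with a regular expression. The following Python sketch (names are illustrative, not part of the specification) recognises and evaluates the three integer forms:

```python
import re

INT_LITERAL = re.compile(
    r"0x[0-9A-Fa-f](?:[0-9A-Fa-f_]*[0-9A-Fa-f])?"  # hexadecimal, "0x" prefix
    r"|b[01](?:[01_]*[01])?"                        # binary, "b" prefix
    r"|[0-9](?:[0-9_]*[0-9])?"                      # denary, no prefix
)

def parse_int_literal(tok: str) -> int:
    """Evaluate a matched integer literal; underscores are purely cosmetic."""
    if tok.startswith("0x"):
        return int(tok[2:].replace("_", ""), 16)
    if tok.startswith("b"):
        return int(tok[1:].replace("_", ""), 2)
    return int(tok.replace("_", ""))
```

Each alternative requires the first and last characters to be digits, so exterior underscores can never match.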


Back to tokens

FLOAT LITERALS

A float literal represents a real number in code. Like integers, it may be given in base 2, 10, or 16 using the same prefixes as with integers. A float literal is only recognised as such by the presence of a decimal point. This decimal point may not be on the exterior of the number, i.e. there must be at least one digit on either side of it: .5 is not valid, whereas 0.5 is. Like integer literals, the digits may be interspersed with underscores, but underscores may not be on the exterior of the number, nor placed directly adjacent to the decimal point. Here are some examples of float literals:

12.0 12_3.4_5 1__2.3__4 b101.101 0x8a.D5
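The digit-on-each-side and underscore rules combine neatly in a pattern. This sketch covers only the denary form for brevity; the binary and hexadecimal forms follow the same shape with their prefixes and digit sets:

```python
import re

# A digit run may contain interior underscores but must start and end with a
# digit, which also guarantees no underscore sits directly beside the point.
DIGITS = r"[0-9](?:[0-9_]*[0-9])?"
FLOAT_LITERAL = re.compile(DIGITS + r"\." + DIGITS)
```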


Back to tokens

DECIMAL LITERALS

O supports a binary-coded decimal (BCD) type for places where staying accurate to arbitrarily large or small denary numbers is important. A decimal literal may be an integer or a real number. It takes a format identical to a denary integer or float literal, with the same rules for decimal points (if one is present) and underscores. A decimal literal is distinguished by a lowercase letter "d" prefix. Here are some examples of decimal literals:

d12 d12_34 d1_____2 d123.45 d12.0


Back to tokens

BYTE LITERALS

Recognising that in many cases working with raw data can be much simpler than working with specific data types, O has a plain "byte" type and provides byte literals. These may be in either base 2 or 16. A binary byte literal is an uppercase letter "B" followed immediately by 8 binary digits ("0" or "1"). A hexadecimal byte literal is an uppercase letter "X" followed immediately by 2 hexadecimal digits. As with any numeric literal, hexadecimal digits are case-insensitive, so a literal may contain lowercase letters "a" to "f", uppercase letters "A" to "F", or any combination of the two. Bytes are unsigned. Here are some examples of byte literals:

B10110100 X8a
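A minimal sketch of evaluating these two forms (the helper name is illustrative):

```python
def parse_byte_literal(tok: str) -> int:
    """Evaluate a byte literal: "B" + 8 binary digits, or "X" + 2 hex digits."""
    if tok.startswith("B") and len(tok) == 9:
        return int(tok[1:], 2)
    if tok.startswith("X") and len(tok) == 3:
        return int(tok[1:], 16)
    raise ValueError(f"not a byte literal: {tok}")
```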


Back to tokens

BOOLEAN LITERALS

Boolean literals are very simple and follow the same pattern as keywords in that there is a static and finite set of possible literals. The "yes" and "no" literals may be used completely interchangeably with the more common "true" and "false" as their meanings are identical. This second pair of options exists merely as an aid towards writing more easily understandable code where the given word makes more semantic sense when the code is worded in English.

0: false 1: no 2: true 3: yes


Back to tokens

CHARACTER LITERALS

Character literals represent a single Unicode character, encoded as UTF-8. The character may be provided directly, or represented using an escape sequence. Character literals begin and end with single quotes "'". The character within the quotes may be any single Unicode character, but may not be a control character (U+0000 through U+009F). To create character literals that store those characters, an escape sequence should be used.

\' : ' \" : " \{ : { \} : } \\ : \ \b : U+0008 \n : U+000A \r : U+000D \t : U+0009 \xnn : U+00nn \unnnn : U+nnnn \Unnnnnnnn : U+nnnnnnnn

Here are some examples of character literals:

'a' ' ' '\t' '\u1234' '\n'

To represent a single quote character in a literal, use its escape sequence. If the character is placed there directly then the token will not be understood as the lexer will see a character literal with no content (invalid and will throw an error) followed by a single quote.
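The escape table above translates mechanically into code. The following sketch decodes a single escape sequence; it is illustrative, not the lexer's actual implementation:

```python
_SIMPLE = {"'": "'", '"': '"', "{": "{", "}": "}", "\\": "\\",
           "b": "\b", "n": "\n", "r": "\r", "t": "\t"}
_HEX_WIDTH = {"x": 2, "u": 4, "U": 8}  # \xnn, \unnnn, \Unnnnnnnn

def decode_escape(seq: str) -> str:
    """Decode one escape sequence such as "\\n" or "\\u1234" into its character."""
    if len(seq) < 2 or seq[0] != "\\":
        raise ValueError(f"not an escape sequence: {seq}")
    tail = seq[1:]
    if len(tail) == 1 and tail in _SIMPLE:
        return _SIMPLE[tail]
    if tail[0] in _HEX_WIDTH and len(tail) == 1 + _HEX_WIDTH[tail[0]]:
        return chr(int(tail[1:], 16))  # fixed-width hex code point
    raise ValueError(f"bad escape sequence: {seq}")
```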


Back to tokens

STRING LITERALS

String literals represent a list of 0 or more UTF-8 characters. The characters in a string literal follow exactly the same rules as character literals in terms of representation, but a string literal begins and ends with a double quote character '"'. Similar to the character literal, a string may not contain a double quote character, as the lexer will interpret that as the end of the string literal. To represent the character, use its escape sequence. Here are some examples of string literals:

"Hello World!" "" "Text with a\nline break in the middle" "\u1234\u2345" "\x1B[31mred text\x1B[0m"


Back to tokens

HEXSTRING LITERALS

Hexstrings are used to conveniently represent larger quantities of raw binary data where providing an array literal of bytes is tedious or uses excessive amounts of space. A hexstring is started with x" and ended with ". It may contain any hexadecimal digit or space character (U+0020) and is case-insensitive. The one restriction is that the number of hexadecimal digits present in the string must be even for it to represent an exact number of bytes. As an example, the following hexstring and byte array literal produce identical data:

x"12ab 34CD 56ef" [ X12 , Xab , X34 , XCD , X56 , Xef ]

Note how much briefer and simpler to type the hexstring is. The spaces may be placed anywhere within the string to organise the data as the user sees fit. This data literal should be very convenient for users that work with hex-dumps as much of the visual output of a hex-dump can be copied verbatim into a hexstring literal for use in code.
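Decoding a hexstring is then straightforward; a sketch with a hypothetical helper name:

```python
def parse_hexstring(literal: str) -> bytes:
    """Convert an x"..." literal to the bytes it represents."""
    if not (literal.startswith('x"') and literal.endswith('"')):
        raise ValueError("not a hexstring literal")
    digits = literal[2:-1].replace(" ", "")  # spaces may appear anywhere
    if len(digits) % 2:
        raise ValueError("hexstring must contain an even number of hex digits")
    return bytes.fromhex(digits)
```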


Back to tokens

VARSTRING LITERALS

Varstrings are used to intersperse code into a string. This dramatically simplifies expressions where values stored in variables or returned from method calls are placed into a string. O already makes this somewhat simpler by providing plenty of overloads for operators such as the concatenation operator ~, but the varstring makes it more concise and more readable still. Take the following example, in which all 3 expressions evaluate to the same value. The first is needlessly verbose, calling the strof() method on every non-string value. The second recognises that those calls are unnecessary and concatenates the values directly. The third uses a varstring to make the whole expression brief and clear.

// person.name is a string
// person.pronoun is an enum
// person.age is a uint

"This is " ~ person.name ~ ", " ~ strof( person.pronoun ) ~ " is " ~ strof( person.age ) ~ " years old."

"This is " ~ person.name ~ ", " ~ person.pronoun ~ " is " ~ person.age ~ " years old."

v"This is {person.name}, {person.pronoun} is {person.age} years old."

Any expression may go between the braces in a varstring literal, provided that expression has a type other than void. When the code is compiled, the contents of the braces are evaluated and the results placed into the corresponding locations in the string.
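A simplified sketch of how a lexer might separate a varstring's text sections from its code sections. It assumes no nested braces and no escape sequences, both of which the real lexer must handle:

```python
import re

def split_varstring(literal: str):
    """Split a v"..." literal into (text_sections, code_sections)."""
    assert literal.startswith('v"') and literal.endswith('"')
    body = literal[2:-1]
    pieces = re.split(r"\{([^{}]*)\}", body)
    return pieces[0::2], pieces[1::2]  # text around braces, code inside braces
```

The code sections would then be fed back through the lexer as ordinary token streams.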


Back to tokens

DOCUMENTATION

As well as line and block comments, there are documentation comments. These comments start with /// and continue to the end of the line much like line comments. The difference though is that these comments are kept and stored by the lexer. These comments are used to document the item following them - hence the name. Like varstrings, documentation comments may contain code within braces in the middle of them. Where varstrings allow any expression in this place, documentation comments only allow namespace-and-member-qualified identifiers. This is because these parts of the documentation comment are used to draw attention to specific parts of code. Take the following examples that demonstrate this principle quite effectively:

/// Return whether {c} is in {s}.
bool find( char c _in_ piped string s ) ;

/// Summation of {n} from {a} to {b} for the function {f}.
long sigma( _from_ long a _to_ long b , body long f( long n ).action ) ;

When the compiler is working through the files, it can generate documentation from these comments, saving a human the need to enumerate every method, class, and variable and describe them individually. That job is done naturally during the programming process, meaning the only remaining documentation to write is broader usage guides or more detailed explanations where necessary. Documentation comments are not intended to completely remove the need for writing formal documentation, but rather to simplify the process somewhat. They also benefit language servers, which may use documentation comments to generate contextual tooltips on the fly, even substituting in values when hovering over an invocation of the method if desired.


Back to tokens

IDENTIFIERS

Identifiers make up all remaining code that is not a symbol, keyword, or value literal. They are used to name variables, methods, classes, and so on in a manner that the user can read and understand. Identifiers may consist of any of the "universal characters" for identifiers given in Annex D of ISO/IEC 9899:1999(E) for the C programming language. The only conditions are that an identifier may not start or end with an underscore "_" (U+005F) or an Arabic digit 0-9 (U+0030 - U+0039). The following lists all valid characters.

Basic: 0030 - 0039 0041 - 005A 005F 0061 - 007A

Latin: 00AA 00BA 00C0 - 00D6 00D8 - 00F6 00F8 - 01F5 01FA - 0217 0250 - 02A8 1E00 - 1E9B 1EA0 - 1EF9 207F

Greek: 0386 0388 - 038A 038C 038E - 03A1 03A3 - 03CE 03D0 - 03D6 03DA 03DC 03DE 03E0 03E2 - 03F3 1F00 - 1F15 1F18 - 1F1D 1F20 - 1F45 1F48 - 1F4D 1F50 - 1F57 1F59 1F5B 1F5D 1F5F - 1F7D 1F80 - 1FB4 1FB6 - 1FBC 1FC2 - 1FC4 1FC6 - 1FCC 1FD0 - 1FD3 1FD6 - 1FDB 1FE0 - 1FEC 1FF2 - 1FF4 1FF6 - 1FFC

Cyrillic: 0401 - 040C 040E - 044F 0451 - 045C 045E - 0481 0490 - 04C4 04C7 - 04C8 04CB - 04CC 04D0 - 04EB 04EE - 04F5 04F8 - 04F9

Armenian: 0531 - 0556 0561 - 0587

Hebrew: 05B0 - 05B9 05BB - 05BD 05BF 05C1 - 05C2 05D0 - 05EA 05F0 - 05F2

Arabic: 0621 - 063A 0640 - 0652 0670 - 06B7 06BA - 06BE 06C0 - 06CE 06D0 - 06DC 06E5 - 06E8 06EA - 06ED

Devanagari: 0901 - 0903 0905 - 0939 093E - 094D 0950 - 0952 0958 - 0963

Bengali: 0981 - 0983 0985 - 098C 098F - 0990 0993 - 09A8 09AA - 09B0 09B2 09B6 - 09B9 09BE - 09C4 09C7 - 09C8 09CB - 09CD 09DC - 09DD 09DF - 09E3 09F0 - 09F1

Gurmukhi: 0A02 0A05 - 0A0A 0A0F - 0A10 0A13 - 0A28 0A2A - 0A30 0A32 - 0A33 0A35 - 0A36 0A38 - 0A39 0A3E - 0A42 0A47 - 0A48 0A4B - 0A4D 0A59 - 0A5C 0A5E 0A74

Gujarati: 0A81 - 0A83 0A85 - 0A8B 0A8D 0A8F - 0A91 0A93 - 0AA8 0AAA - 0AB0 0AB2 - 0AB3 0AB5 - 0AB9 0ABD - 0AC5 0AC7 - 0AC9 0ACB - 0ACD 0AD0 0AE0

Oriya: 0B01 - 0B03 0B05 - 0B0C 0B0F - 0B10 0B13 - 0B28 0B2A - 0B30 0B32 - 0B33 0B36 - 0B39 0B3E - 0B43 0B47 - 0B48 0B4B - 0B4D 0B5C - 0B5D 0B5F - 0B61

Tamil: 0B82 - 0B83 0B85 - 0B8A 0B8E - 0B90 0B92 - 0B95 0B99 - 0B9A 0B9C 0B9E - 0B9F 0BA3 - 0BA4 0BA8 - 0BAA 0BAE - 0BB5 0BB7 - 0BB9 0BBE - 0BC2 0BC6 - 0BC8 0BCA - 0BCD

Telugu: 0C01 - 0C03 0C05 - 0C0C 0C0E - 0C10 0C12 - 0C28 0C2A - 0C33 0C35 - 0C39 0C3E - 0C44 0C46 - 0C48 0C4A - 0C4D 0C60 - 0C61

Kannada: 0C82 - 0C83 0C85 - 0C8C 0C8E - 0C90 0C92 - 0CA8 0CAA - 0CB3 0CB5 - 0CB9 0CBE - 0CC4 0CC6 - 0CC8 0CCA - 0CCD 0CDE 0CE0 - 0CE1

Malayalam: 0D02 - 0D03 0D05 - 0D0C 0D0E - 0D10 0D12 - 0D28 0D2A - 0D39 0D3E - 0D43 0D46 - 0D48 0D4A - 0D4D 0D60 - 0D61

Thai: 0E01 - 0E3A 0E40 - 0E5B

Lao: 0E81 - 0E82 0E84 0E87 - 0E88 0E8A 0E8D 0E94 - 0E97 0E99 - 0E9F 0EA1 - 0EA3 0EA5 0EA7 0EAA - 0EAB 0EAD - 0EAE 0EB0 - 0EB9 0EBB - 0EBD 0EC0 - 0EC4 0EC6 0EC8 - 0ECD 0EDC - 0EDD

Tibetan: 0F00 0F18 - 0F19 0F35 0F37 0F39 0F3E - 0F47 0F49 - 0F69 0F71 - 0F84 0F86 - 0F8B 0F90 - 0F95 0F97 0F99 - 0FAD 0FB1 - 0FB7 0FB9

Georgian: 10A0 - 10C5 10D0 - 10F6

Hiragana: 3041 - 3093 309B - 309C

Katakana: 30A1 - 30F6 30FB - 30FC

Bopomofo: 3105 - 312C

CJK Unified Ideographs: 4E00 - 9FA5

Hangul: AC00 - D7A3

Digits: 0660 - 0669 06F0 - 06F9 0966 - 096F 09E6 - 09EF 0A66 - 0A6F 0AE6 - 0AEF 0B66 - 0B6F 0BE7 - 0BEF 0C66 - 0C6F 0CE6 - 0CEF 0D66 - 0D6F 0E50 - 0E59 0ED0 - 0ED9 0F20 - 0F33

Special characters: 00B5 00B7 02B0 - 02B8 02BB 02BD - 02C1 02D0 - 02D1 02E0 - 02E4 037A 0559 093D 0B3D 1FBE 203F - 2040 2102 2107 210A - 2113 2115 2118 - 211D 2124 2126 2128 212A - 2131 2133 - 2138 2160 - 2182 3005 - 3007 3021 - 3029


Back to top

ENCODING

As the lexer moves through a file, it checks the start of the remaining file contents against every kind of token and notes the length of each match, if any. The longest valid token is chosen and added to the list of tokens along with its token type, the source characters that made it up, and its location in source for later use. Those characters are then removed from the front of the file buffer, leaving the remainder of the file. The cycle repeats until either no valid token can be found (in which case an error is reported to the user), the file buffer is completely emptied, or an end-of-file character is found.

This will then leave the lexer with an array of source-based tokens, i.e. tokens whose content is taken verbatim from the source code. Each token will have a type taken from an enum of token types (symbols, keywords, etc.). It will then have a position in source, being 2 unsigned integers denoting the line of code and the character index on that line of the start of the token. Finally, it will have a string containing the exact source code that makes up that token. This string content cannot be used directly as the token's content, though. For instance, the integer literal with string content "3" would be stored as a string, giving it the data 00110011. To the following stages of the compiler, this would appear to be an 8-bit integer representing the number 51, not the intended value 3. What we therefore need to do is convert the string representation of the token into the actual meaningful value it represents, e.g. turning our "3" into a ulong with value 3.

The lexer will therefore work through all the tokens and convert their string representation into their meaningful value. Each token type has its own specific format for data representation.

Symbols
Keywords
Marked Keywords
Integer Literals
Float Literals
Decimal Literals
Byte Literals
Boolean Literals
Character Literals
String Literals
Hexstring Literals
Varstring Literals
Documentation
Identifiers


Back to encoding

SYMBOLS

Symbols are represented simply. Each symbol has a number associated with it, which is encoded as an unsigned 8-bit integer. For instance, the symbol ++ has index 15, meaning it would be encoded as 00001111. Using an 8-bit integer means the token will align properly with whole bytes and also leaves plenty of room for more symbols should any be added in future versions of the language.

Each token "type" also has a number associated with it in much the same way. The type index for a symbol is 0.


Back to encoding

KEYWORDS

Keywords are done in much the same way as symbols. Each group of keywords has a different type index, and then the keyword in question has an index within its group. Core types have a type index of 1, Keywords 2, Statements and methods 3, and Separators 4. The index of the keyword is - like the symbols - stored as an 8-bit unsigned integer.


Back to encoding

MARKED KEYWORDS

Marked keywords follow an identical principle to symbols and keywords. They have a type index of 5 and the index of the marked keyword is, again, an 8-bit unsigned integer.


Back to encoding

INTEGER LITERALS

As sign is not interpreted at this point in the compiler chain, all numeric literals are read as positive. Integer literals are therefore stored as unsigned 64-bit integers (ulong) to cover the entire range of numbers they could represent. As on most modern processor architectures, all values excluding character and string values are stored in "little-endian" (LE) manner, meaning the least significant byte is placed at the lowest memory address ("first"). Integer literals have type index 6.


Back to encoding

FLOAT LITERALS

Float literals are stored as IEEE 754-2008-compliant 64-bit floating point numbers. As all number literals are treated as positive, the sign bit is always 0. The remaining 11-bit exponent and 52-bit mantissa are used as normal. Due to the properties of floating point numbers, a denary float literal will only be certainly correct up to 15 significant figures. However, this does not restrict what may be written in source code. Similar to integer literals, these are stored little-endian. Float literals have type index 7.


Back to encoding

DECIMAL LITERALS

Decimal literals are stored as binary-coded decimals (BCDs). This type is specifically intended for working with decimal numbers at arbitrary degrees of precision that a floating point type could not provide. BCDs are separated into nibbles (4-bit sections), where each nibble represents a digit, a decimal point, or a sign. The digits 0-9 are represented as their ordinary 4-bit unsigned binary values. The decimal point is represented as 1111. The last nibble is reserved for the sign, which is 1100 for positive and 1101 for negative.

Unlike other numeric types, the BCD is stored big-endian (BE) with the most significant digits in the lowest ("first") memory address. The sign nibble is placed at the highest memory address. Decimals that do not evenly fit a whole number of bytes are padded with 0s at the start. Take the following 2 examples:

-1,234,567
0001 0010 0011 0100 0101 0110 0111 1101
1    2    3    4    5    6    7    -

-123,456
0000 0001 0010 0011 0100 0101 0110 1101
0    1    2    3    4    5    6    -

Like strings, decimals may be of arbitrary length. Decimal literals have a type index of 8.
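The nibble layout can be sketched as follows. This is illustrative only; the negative flag exists purely to reproduce the signed examples above, since the lexer itself reads every literal as positive:

```python
def encode_decimal(literal: str, negative: bool = False) -> bytes:
    """Pack a decimal literal into big-endian BCD nibbles with a trailing sign."""
    body = literal.removeprefix("d").replace("_", "")
    nibbles = [0b1111 if ch == "." else int(ch) for ch in body]
    nibbles.append(0b1101 if negative else 0b1100)  # sign nibble goes last
    if len(nibbles) % 2:
        nibbles.insert(0, 0)  # pad with a leading 0 to fill whole bytes
    return bytes((nibbles[i] << 4) | nibbles[i + 1]
                 for i in range(0, len(nibbles), 2))
```

Encoding 1234567 with the negative flag reproduces the first worked example above.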


Back to encoding

BYTE LITERALS

Byte literals have a type index of 9. They are stored quite intuitively. As it is simply a literal of a single byte, one byte is used to contain the data the literal represents.


Back to encoding

BOOLEAN LITERALS

Boolean literals - like byte literals - are stored intuitively. As the variants of each literal mean the same thing, both are assigned the same value, i.e. typing "true" in code and typing "yes" in code does not change the value assigned, as the two are identical. A "true" value is represented by a byte of all 1s, and a "false" value by a byte of all 0s. Boolean literals have a type index of 10.


Back to encoding

CHARACTER LITERALS

Many programming languages such as C treat "char" and "byte" as essentially the same thing, restricting characters to 8 bits and locking away more than 99.9% of Unicode - which in many cases renders the data type useless. O instead stores characters as 1-4 bytes in UTF-8 encoding. Character literals have a type index of 11. During this conversion, escape sequences are converted into their corresponding characters.


Back to encoding

STRING LITERALS

String literals - with a type index of 12 - are stored in standard UTF-8 encoding. Like character literals, escape sequences are expanded at this point.


Back to encoding

HEXSTRING LITERALS

Hexstring literals represent a string of bytes rather than characters. The string is broken up into character pairs, where each pair of hexadecimal digits corresponds to a single byte. As the lexer will have refused to recognise a hexstring literal with an odd number of characters, it can be guaranteed that this literal represents a whole number of bytes. The hexstring is converted into those corresponding bytes in the order provided in the literal. Hexstrings have a type index of 13.


Back to encoding

VARSTRING LITERALS

Varstring literals are more complicated. Because they may be "interrupted" by arbitrary amounts of code, the lexer needs to look ahead to identify the remaining parts of the literal up to its end. Each of these sections is then stored separately as its own token. The start of a varstring has type index 14, any middle section surrounded on both sides by code blocks has type index 15, and the end of a varstring, terminated by a double quote character, has type index 16.


Back to encoding

DOCUMENTATION

Like varstring literals, documentation comments may be interrupted with code blocks. However, because the content of these blocks is restricted to namespace- and member-qualified identifiers, the lexer can be more efficient in how it identifies documentation comments. Like varstrings, documentation comments come in 3 possible sections - a start, middle, and end. A documentation comment start has type index 17, a middle has type index 18, and an end 19.


Back to encoding

IDENTIFIERS

Identifiers have type index 20. Identifiers are stored as strings representing their name in source code. Like character and string literals, this is in UTF-8 format.


Back to top

OUTPUT FORMAT

Once the lexer has identified every token in a file and produced each token's meaningful data representation, these tokens need to be arranged in a manner that is easy to organise and read by the next stage of the compiler. As different tokens take up different amounts of space, it is unrealistic to attempt to fit every token into its own evenly-sized space in the output file. The solution is a very simple binary format divided into a few small sections.

At the top level, the file represents a list of all the tokens in order. There are no separators or delimiters between tokens - they are all simply concatenated. Each of these token sections though follows a basic format that allows them to be identified and extracted from the file easily.

The first 64 bits of the token are an unsigned integer giving that token's size in bytes. This allows whatever program is parsing the token file to immediately know the size of the current token and extract that amount of data from the file. The next 8 bits store the type index of the token. The next 128 bits store the token's location in source as 2 unsigned 64-bit integers starting at 1 and increasing. The first integer represents the line in the file, and the second represents the character or column. All remaining data in the token is the meaningful content of the token.
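The layout can be sketched with Python's struct module. The document does not state the byte order of the header fields; this sketch assumes they are stored little-endian like the numeric payloads:

```python
import struct

def encode_token(type_index: int, line: int, column: int, data: bytes) -> bytes:
    """Pack one token: 64-bit size, 8-bit type, two 64-bit location integers, data."""
    size = 8 + 1 + 16 + len(data)  # header bytes plus the meaningful content
    return struct.pack("<QBQQ", size, type_index, line, column) + data
```

Packing the `int` keyword token from the example below produces a 26-byte record: a 25-byte header plus 1 byte of content.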

Take the following example file. We have 5 tokens of varying types. See how the lexer acts on the data through each stage of lexing:

int x =
3 ;

The file is first broken up into each token with a location, type, and string representation.

Token 0:
	Location: line 1 column 1
	Type: Core type keyword (1)
	String: int
Token 1:
	Location: line 1 column 5
	Type: Identifier (20)
	String: x
Token 2:
	Location: line 1 column 7
	Type: Symbol (0)
	String: =
Token 3:
	Location: line 2 column 1
	Type: Integer literal (6)
	String: 3
Token 4:
	Location: line 2 column 3
	Type: Symbol (0)
	String: ;

Next, the string representation is converted into a meaningful representation.

Token 0:
	Location: line 1 column 1
	Type: 1
	Data: 0x06
Token 1:
	Location: line 1 column 5
	Type: 20
	Data: 0x78
Token 2:
	Location: line 1 column 7
	Type: 0
	Data: 0x06
Token 3:
	Location: line 2 column 1
	Type: 6
	Data: 0x0000000000000003
Token 4:
	Location: line 2 column 3
	Type: 0
	Data: 0x2B

Next, each token is pieced together into its pure data form.

Token 0:
	000000000000001A
	01
	0000000000000001 0000000000000001
	06
Token 1:
	000000000000001A
	14
	0000000000000001 0000000000000005
	78
Token 2:
	000000000000001A
	00
	0000000000000001 0000000000000007
	06
Token 3:
	0000000000000021
	06
	0000000000000002 0000000000000001
	0000000000000003
Token 4:
	000000000000001A
	00
	0000000000000002 0000000000000003
	2B

... and are finally concatenated together to form the entire output file.

000000000000001A 01 0000000000000001 0000000000000001 06 000000000000001A 14 0000000000000001 0000000000000005 78 000000000000001A 00 0000000000000001 0000000000000007 06 0000000000000021 06 0000000000000002 0000000000000001 0000000000000003 000000000000001A 00 0000000000000002 0000000000000003 2B

It is at this point that the lexer has done all of its work. It then passes this raw binary output through stdout, where it may be read by the next stage in the compiler chain - or, alternatively, saved to a file and cached to speed up future compilations.