Lexer

The O compiler is broken up into a collection of library-style micro-programs that are orchestrated by a top-level application. This makes it easy to split compilation into multiple stages and to write the compiler in smaller pieces. The lexer is the first of these stages. The lexer reads in UTF-8 encoded plain text source code and breaks it up into a series of tokens such as symbols, comments, keywords, and identifiers. The source code representing each of these tokens is then converted into a raw data representation, e.g. the string "3" in source code is converted into the actual numeric value 3. These token objects are then passed on to the next stage of the compiler.

Each token stores a line, column, and length property denoting the location of the token in source code. It has a type index denoting what kind of token it is, and it has a data portion.
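
As a rough illustration, a token could be modelled like this. It is a Python sketch rather than the compiler's own code: the class name Token, the field names, and the choice to count lines and columns from 1 are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """Hypothetical token record mirroring the properties described above."""
    line: int        # line on which the token starts (1-based here, by assumption)
    column: int      # column at which the token starts (1-based here, by assumption)
    length: int      # length of the token in source code
    type_index: int  # what kind of token this is, e.g. 2 for an integer literal
    data: bytes      # raw data portion of the token


# The integer literal 3 sitting at the very start of a file.
three = Token(line=1, column=1, length=1, type_index=2, data=b"\x03")
```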

Tokens

EOF, EOL, and Whitespace

At the first occurrence of an EOF character, the lexer will finish and output results. Neither EOF, EOL, nor Whitespace characters are included in the output tokens.

EOF Characters: \u001A, \u0000
EOL Characters: \u000D\u000A, \u000A, \u000D
Whitespace Characters: \u0020, \u0009, \u000B, \u000C

Comments

Comments are non-tokens in that they are identified by the lexer, but are not passed all the way through to the output. The only exception is documentation comments, which are described later. The 2 main types of comments are line and block comments, which you will be familiar with from any C-like language. Line comments begin with // and continue until the first EOL; they may otherwise contain any data. Block comments begin with /*, may contain any data like line comments, and end at either the first */ or at the EOF. This means that you can comment out the end portion of a file with only the opening /*.
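
The comment rules can be sketched as a small skipping routine. This is an illustrative Python sketch, not the lexer's real implementation: the function name skip_comment is hypothetical, and treating the end of the input the same as an EOF character is an assumption.

```python
EOF_CHARS = "\u001A\u0000"
EOL_CHARS = "\u000D\u000A"


def skip_comment(src: str, pos: int) -> int:
    """Return the index just past the comment starting at pos, or pos if none starts there."""
    if src.startswith("///", pos):
        return pos  # documentation comments are kept, so they are not skipped here
    if src.startswith("//", pos):
        end = pos + 2
        # A line comment runs until the first EOL (or EOF) character.
        while end < len(src) and src[end] not in EOL_CHARS + EOF_CHARS:
            end += 1
        return end
    if src.startswith("/*", pos):
        end = pos + 2
        # A block comment runs until the first */ or until EOF.
        while end < len(src) and src[end] not in EOF_CHARS:
            if src.startswith("*/", end):
                return end + 2
            end += 1
        return end
    return pos
```

Note that an unterminated block comment simply consumes the rest of the input, which matches the behaviour described above.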

Symbols

Symbols are static strings in source code, i.e. each one matches an entry in a finite set of operators and punctuation such as + or !=, or keywords such as class. Symbols have a type index of 0. Each one has an associated number, and the lexer will store the data value not as a string of the characters, but as a ubyte indexing into the list of symbols.

Unlike a lot of other languages, O does not reserve common type names as keywords, electing instead to have later compiler stages catch out misuse based on the implicitly imported core namespace in every file that contains these basic type classes. Similarly, common statements are defined in this namespace rather than being reserved.

0: (
1: )
2: {
3: }
4: [
5: ]
6: =
7: ==
8: !=
9: >
10: >=
11: <=
12: <
13: +
14: +=
15: ++
16: -
17: -=
18: --
19: *
20: *=
21: /
22: /=
23: ~
24: ~=
25: *~
26: *~=
27: ^
28: ^=
29: %
30: %=
31: |
32: |=
33: &&
34: ||
35: !
36: ><
37: ??
38: ##
39: #?
40: .
41: ..
42: ...
43: ,
44: ;
45: :
46: as
47: body
48: class
49: dependency
50: entrypoint
51: expose
52: enum
53: flat
54: has
55: import
56: interface
57: is
58: new
59: piped
60: private
61: public
62: ref
63: restricted
64: static
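
To make the indexing concrete, here is a Python sketch of the symbol table and of how a symbol's data byte could be derived from it. The function names are hypothetical, and picking the longest operator that matches at a position is an assumption; the text above does not state how overlapping operators such as * and *~= are resolved.

```python
from typing import Optional

# Symbol table in index order, copied from the list above.
SYMBOLS = [
    "(", ")", "{", "}", "[", "]", "=", "==", "!=", ">", ">=", "<=", "<",
    "+", "+=", "++", "-", "-=", "--", "*", "*=", "/", "/=", "~", "~=",
    "*~", "*~=", "^", "^=", "%", "%=", "|", "|=", "&&", "||", "!", "><",
    "??", "##", "#?", ".", "..", "...", ",", ";", ":",
    "as", "body", "class", "dependency", "entrypoint", "expose", "enum",
    "flat", "has", "import", "interface", "is", "new", "piped", "private",
    "public", "ref", "restricted", "static",
]
OPERATOR_COUNT = 46  # indices 0 to 45 are punctuation, the rest are keywords


def symbol_data(text: str) -> bytes:
    """Data portion of a symbol token: a single unsigned byte holding its index."""
    return bytes([SYMBOLS.index(text)])


def longest_operator(src: str, pos: int) -> Optional[str]:
    """Longest punctuation symbol starting at pos (longest match is an assumption)."""
    candidates = [s for s in SYMBOLS[:OPERATOR_COUNT] if src.startswith(s, pos)]
    return max(candidates, key=len) if candidates else None
```

Keywords are left out of longest_operator on purpose, since matching them also requires checking that they are not part of a longer identifier.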

Separators

O has the ability to use various keywords as separators in method definitions and invocations as a means of making methods more readable and understandable. For instance, find( 'o' , "hello" ) may be understandable in that we're returning whether we can find the character in the string, but if these are expressions or variables instead then it may be more confusing and require reading additional code. However, find( 'o' in "hello" ) is immediately understandable, even if more complicated parameters are provided.

These separators are reserved keywords, and so may not be used as identifiers. Like symbols, these are stored as a ubyte indexing into the list of available separators and have a type index of 1.

0: and
1: at
2: but
3: by
4: from
5: in
6: of
7: or
8: then
9: to

Integer Literals

Integer literals represent any integer value. From the perspective of the lexer, these are only seen as positive, as interpretation as signed values is only done later on in the compiler chain. Integer literals may be prefixed with 0x and provided in hexadecimal, or prefixed with b and provided in binary. Underscores may be placed freely in the middle of the literal to break up larger numbers to be more understandable.

Integer literals are stored little-endian in as few bytes as are required to store the number and have a type index of 2.
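
A minimal Python sketch of this encoding follows. The function name is hypothetical, the literal is not validated beyond what int() rejects, and storing zero as a single 0x00 byte (rather than as no bytes at all) is an assumption.

```python
def encode_integer_literal(text: str) -> bytes:
    """Encode an integer literal as little-endian bytes, using as few bytes as needed."""
    digits = text.replace("_", "")  # underscores are purely visual
    if digits.startswith("0x"):
        value = int(digits[2:], 16)  # hexadecimal
    elif digits.startswith("b"):
        value = int(digits[1:], 2)   # binary
    else:
        value = int(digits, 10)      # decimal
    size = max(1, (value.bit_length() + 7) // 8)  # assumption: zero takes one byte
    return value.to_bytes(size, "little")
```

For example, 0x1F4 (500) needs 2 bytes and is stored as F4 01, with the least significant byte first.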

Float Literals

Float literals represent any positive real number. They may only be provided in base 10 and must contain a decimal point . to be identified as float literals rather than integer literals. There must be at least 1 digit on either side of the point. Like integer literals, underscores may be placed arbitrarily in the middle of the literal, but may not be adjacent to the decimal point.

Float literals have a type index of 3. They are stored little-endian as 64-bit doubles as in IEEE 754.
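
The float rules can be sketched in Python as a pattern check followed by packing. The names FLOAT_RE and encode_float_literal are hypothetical, and the pattern also rejects a leading or trailing underscore, which is one reading of "in the middle of the literal".

```python
import re
import struct

# At least one digit on each side of the point; underscores only between digits
# of the same part, so never next to the point.
FLOAT_RE = re.compile(r"[0-9](?:[0-9_]*[0-9])?\.[0-9](?:[0-9_]*[0-9])?")


def encode_float_literal(text: str) -> bytes:
    """Encode a float literal as a little-endian IEEE 754 64-bit double."""
    if not FLOAT_RE.fullmatch(text):
        raise ValueError(f"not a float literal: {text!r}")
    return struct.pack("<d", float(text.replace("_", "")))
```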

Byte Literals

Byte literals have 2 forms: binary and hex. A binary byte literal is formed of a capital B followed by 8 binary digits. A hex byte literal is formed of a capital X followed by 2 hex digits. Byte literals are stored as the single byte they represent and have a type index of 4.
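
Both forms reduce to a single byte, as this Python sketch shows. The function name is hypothetical, and accepting lower-case hex digits after the X is an assumption, as the text does not say either way.

```python
HEX_DIGITS = set("0123456789abcdefABCDEF")  # lower-case digits accepted by assumption


def encode_byte_literal(text: str) -> bytes:
    """Encode a B........ or X.. byte literal as the single byte it represents."""
    if len(text) == 9 and text[0] == "B" and set(text[1:]) <= set("01"):
        return bytes([int(text[1:], 2)])   # capital B plus exactly 8 binary digits
    if len(text) == 3 and text[0] == "X" and set(text[1:]) <= HEX_DIGITS:
        return bytes([int(text[1:], 16)])  # capital X plus exactly 2 hex digits
    raise ValueError(f"not a byte literal: {text!r}")
```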

Boolean Literals

Boolean literals are a single word in code and are stored as a single byte of value 0 for false and 1 for true. A false value may be represented with either of the words false and no and a true value may be represented with either true or yes. All 4 of these words are reserved and so may not be used as identifiers. Boolean literals have a type index of 5.

Character Literals

For storing a single byte of information, the byte type should be used. The char type is for real characters, which reach far beyond a single byte. A character literal contains any single UTF-8 character or escape sequence surrounded by single quotes '. An escape sequence is started with a \ character, followed by 1 or more characters to form the sequence. Below is a table of all valid escape sequences.

\'          A literal single quote.
\"          A literal double quote.
\{          A literal opening brace.
\}          A literal closing brace.
\\          A literal backslash.
\r          A carriage return character.
\n          A line-feed character.
\t          A tab character.
\xnn        The Unicode character U+00nn.
\unnnn      The Unicode character U+nnnn.
\Unnnnnnnn  The Unicode character U+nnnnnnnn.

Character literals have a type index of 6. They are stored in 1 to 4 bytes as a single valid UTF-8 character.
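
The following Python sketch decodes the text between the single quotes into its stored UTF-8 bytes. The function and table names are hypothetical, and the sketch assumes the surrounding quotes have already been removed.

```python
SIMPLE_ESCAPES = {
    "'": "'", '"': '"', "{": "{", "}": "}", "\\": "\\",
    "r": "\r", "n": "\n", "t": "\t",
}
HEX_ESCAPES = {"x": 2, "u": 4, "U": 8}  # escape letter -> number of hex digits
HEX_DIGITS = set("0123456789abcdefABCDEF")


def decode_char_body(body: str) -> bytes:
    """Turn the contents of a character literal into 1 to 4 bytes of UTF-8."""
    if body.startswith("\\"):
        kind, rest = body[1:2], body[2:]
        if kind in SIMPLE_ESCAPES and rest == "":
            char = SIMPLE_ESCAPES[kind]
        elif (kind in HEX_ESCAPES and len(rest) == HEX_ESCAPES[kind]
              and set(rest) <= HEX_DIGITS):
            char = chr(int(rest, 16))  # raises ValueError above U+10FFFF
        else:
            raise ValueError(f"invalid escape sequence: {body!r}")
    elif len(body) == 1:
        char = body
    else:
        raise ValueError(f"not a single character: {body!r}")
    return char.encode("utf-8")
```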

String Literals

String literals are an arbitrary number of UTF-8 characters and escape sequences surrounded by double quotes ". All the same escape sequences that apply in character literals apply in string literals. String literals have a type index of 7. In the event of an empty string literal (""), the data is of length 0 which is perfectly allowed. Strings are otherwise stored as plain UTF-8 data.

Hexstring Literals

Hexstrings are to bytes what strings are to characters. A hexstring literal is intended to provide a clean means of representing larger amounts of raw data in code. A hexstring starts with x" and may contain any combination of hex digits and whitespace (not including EOL characters). The only constraint is that the string must contain an even number of hex digits, as it must represent a whole number of bytes. Hexstrings have a type index of 8 and are stored as the bytes that they represent.
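
A Python sketch of hexstring decoding is shown here. It works on the contents after the opening x" and assumes the literal is closed by a matching double quote that has already been stripped; the function name is hypothetical. Allowing whitespace between the two digits of one byte is also an assumption, since only the total digit count is constrained above.

```python
HEX_DIGITS = set("0123456789abcdefABCDEF")
WHITESPACE = set("\u0020\u0009\u000B\u000C")  # the lexer's whitespace set, no EOL


def decode_hexstring_body(body: str) -> bytes:
    """Turn the contents of a hexstring literal into the bytes it represents."""
    digits = []
    for ch in body:
        if ch in HEX_DIGITS:
            digits.append(ch)
        elif ch not in WHITESPACE:
            raise ValueError(f"unexpected character in hexstring: {ch!r}")
    if len(digits) % 2 != 0:
        raise ValueError("hexstring must contain an even number of hex digits")
    return bytes.fromhex("".join(digits))
```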

Varstring Literals

Often, we may want a string literal that has within it the values of variables. It is simple to break the string up into sections and concatenate it to the variables we wish to include, but this takes space and isn't as immediately understood as having the variables or even whole expressions placed directly in the string. That is the purpose of the varstring. The varstring has 4 possible forms for each possible section: a full varstring that doesn't contain any expressions, starting with v" and ending with "; a varstring start that starts with v" and ends with {; a varstring middle that starts with } and ends with {; and a varstring end that starts with } and ends with ". These sections go together around the expressions between the braces to form the complete literal. Here is an example:

v"Hi there {name}! If you were born in {dob.year} then in 2000 you were {dob.year <= 2000 ?? "not yet born" ## dob.year - 2000}."

As each section of a varstring literal plays a different part, they each have a different type index. A full varstring has a type index of 9, then varstring starts, middles, and ends have type indices of 10, 11, and 12 respectively. They are stored in the same manner as regular strings, as plain UTF-8 data of the contents of the literal. Tokens between the braces are treated just like any other tokens.
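
To show how the sections line up with the type indices, here is a simplified Python sketch that splits a varstring into its string sections and the raw source of each expression. It is not how the lexer itself works: a real lexer would tokenise the expressions rather than copy them, and this sketch assumes that no expression contains a } of its own. The function name split_varstring is hypothetical, and escape sequences are left undecoded.

```python
FULL, START, MIDDLE, END = 9, 10, 11, 12


def split_varstring(literal: str):
    """Return ([(type_index, section_text), ...], [expression_source, ...])."""
    if not (len(literal) >= 3 and literal.startswith('v"') and literal.endswith('"')):
        raise ValueError("not a varstring literal")
    body = literal[2:-1]
    sections, expressions, current = [], [], []
    in_expr = False
    i = 0
    while i < len(body):
        ch = body[i]
        if not in_expr and ch == "\\":
            current.append(body[i:i + 2])  # keep escapes such as \{ intact
            i += 2
            continue
        if not in_expr and ch == "{":
            sections.append("".join(current))
            current, in_expr = [], True
        elif in_expr and ch == "}":
            expressions.append("".join(current))
            current, in_expr = [], False
        else:
            current.append(ch)
        i += 1
    if in_expr:
        raise ValueError("unclosed expression in varstring")
    sections.append("".join(current))
    if len(sections) == 1:
        kinds = [FULL]
    else:
        kinds = [START] + [MIDDLE] * (len(sections) - 2) + [END]
    return list(zip(kinds, sections)), expressions
```

Applied to the example above, this yields a varstring start, two middles, and an end, with the three expressions name, dob.year, and the conditional between them.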

Documentation

Documentation comments provide a means of easily documenting code. Unlike other comment types, documentation comments are retained after lexing and continue through to following compiler stages. The contents of these comments are intended for use by tools such as language servers and document generators for providing tooltips in editors and document outputs during compilation. Documentation comments resemble regular line comments, but start with 3 slashes (///) rather than 2.

Another feature of documentation comments is that parameter names may be placed between braces to form macros on the documentation, allowing a language server or document generator to provide much richer information, such as tooltips showing exact values being used.

Multiple documentation comments may be placed one after another and may describe different things, such as descriptions of individual parameters or additional notes on method inputs, outputs, and exceptions. Here is an extreme example where a method is fully documented with a general description as well as descriptions of both parameters, an error, an exception, and the output. The @ syntax specifies what type of thing is being documented.

/// Do the important thing based on {s} and {c}.
/// @p s The string to be used.
/// @p c The object being acted upon.
/// @e Will pass an {invalidargerr} if {s} is empty.
/// @x Will throw a {bigbadex} if the operation fails.
/// @o Returns an {int} count of how many important things were done.
int doimportantthing( string s , someclass c ) where ( s.len > 0 ) ;

As documentation comments are managed separately from the main compiler chain, they are lexed and parsed in a similar manner to line comments and are stored as string data after having whitespace trimmed from either end. The 3 slashes are not included in the string data. Documentation comments have a type index of 13.
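
A short Python sketch of how a documentation comment's data could be produced, along with one way a tool might pull out the brace macros. Both function names are hypothetical, storing the string as UTF-8 is an assumption based on how other strings are stored, and the macro extraction belongs to later tooling rather than to the lexer.

```python
import re

WHITESPACE = "\u0020\u0009\u000B\u000C"


def doc_comment_data(line: str) -> bytes:
    """Drop the leading /// and trim whitespace from either end."""
    if not line.startswith("///"):
        raise ValueError("not a documentation comment")
    return line[3:].strip(WHITESPACE).encode("utf-8")


def macro_names(text: str) -> list:
    """Names placed between braces, in order of appearance."""
    return re.findall(r"\{([^{}]+)\}", text)
```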

Identifiers

Identifiers represent variables, types, methods, statements, and any other meaningful token that doesn't fall into any other category. Identifiers may be comprised of any of the "Universal characters for identifiers" as detailed in Annex D of ISO/IEC 9899:1999(E) for the C programming language. The only restriction is that an identifier may not start or end with an Arabic digit 0-9 (U+0030 - U+0039). The following lists all valid characters:

Basic: U+0030 - U+0039, U+0041 - U+005A, U+005F, U+0061 - U+007A
Latin: U+00AA, U+00BA, U+00C0 - U+00D6, U+00D8 - U+00F6, U+00F8 - U+01F5, U+01FA - U+0217, U+0250 - U+02A8, U+1E00 - U+1E9B, U+1EA0 - U+1EF9, U+207F
Greek: U+0386, U+0388 - U+038A, U+038C, U+038E - U+03A1, U+03A3 - U+03CE, U+03D0 - U+03D6, U+03DA, U+03DC, U+03DE, U+03E0, U+03E2 - U+03F3, U+1F00 - U+1F15, U+1F18 - U+1F1D, U+1F20 - U+1F45, U+1F48 - U+1F4D, U+1F50 - U+1F57, U+1F59, U+1F5B, U+1F5D, U+1F5F - U+1F7D, U+1F80 - U+1FB4, U+1FB6 - U+1FBC, U+1FC2 - U+1FC4, U+1FC6 - U+1FCC, U+1FD0 - U+1FD3, U+1FD6 - U+1FDB, U+1FE0 - U+1FEC, U+1FF2 - U+1FF4, U+1FF6 - U+1FFC
Cyrillic: U+0401 - U+040C, U+040E - U+044F, U+0451 - U+045C, U+045E - U+0481, U+0490 - U+04C4, U+04C7 - U+04C8, U+04CB - U+04CC, U+04D0 - U+04EB, U+04EE - U+04F5, U+04F8 - U+04F9
Armenian: U+0531 - U+0556, U+0561 - U+0587
Hebrew: U+05B0 - U+05B9, U+05BB - U+05BD, U+05BF, U+05C1 - U+05C2, U+05D0 - U+05EA, U+05F0 - U+05F2
Arabic: U+0621 - U+063A, U+0640 - U+0652, U+0670 - U+06B7, U+06BA - U+06BE, U+06C0 - U+06CE, U+06D0 - U+06DC, U+06E5 - U+06E8, U+06EA - U+06ED
Devanagari: U+0901 - U+0903, U+0905 - U+0939, U+093E - U+094D, U+0950 - U+0952, U+0958 - U+0963
Bengali: U+0981 - U+0983, U+0985 - U+098C, U+098F - U+0990, U+0993 - U+09A8, U+09AA - U+09B0, U+09B2, U+09B6 - U+09B9, U+09BE - U+09C4, U+09C7 - U+09C8, U+09CB - U+09CD, U+09DC - U+09DD, U+09DF - U+09E3, U+09F0 - U+09F1
Gurmukhi: U+0A02, U+0A05 - U+0A0A, U+0A0F - U+0A10, U+0A13 - U+0A28, U+0A2A - U+0A30, U+0A32 - U+0A33, U+0A35 - U+0A36, U+0A38 - U+0A39, U+0A3E - U+0A42, U+0A47 - U+0A48, U+0A4B - U+0A4D, U+0A59 - U+0A5C, U+0A5E, U+0A74
Gujarati: U+0A81 - U+0A83, U+0A85 - U+0A8B, U+0A8D, U+0A8F - U+0A91, U+0A93 - U+0AA8, U+0AAA - U+0AB0, U+0AB2 - U+0AB3, U+0AB5 - U+0AB9, U+0ABD - U+0AC5, U+0AC7 - U+0AC9, U+0ACB - U+0ACD, U+0AD0, U+0AE0
Oriya: U+0B01 - U+0B03, U+0B05 - U+0B0C, U+0B0F - U+0B10, U+0B13 - U+0B28, U+0B2A - U+0B30, U+0B32 - U+0B33, U+0B36 - U+0B39, U+0B3E - U+0B43, U+0B47 - U+0B48, U+0B4B - U+0B4D, U+0B5C - U+0B5D, U+0B5F - U+0B61
Tamil: U+0B82 - U+0B83, U+0B85 - U+0B8A, U+0B8E - U+0B90, U+0B92 - U+0B95, U+0B99 - U+0B9A, U+0B9C, U+0B9E - U+0B9F, U+0BA3 - U+0BA4, U+0BA8 - U+0BAA, U+0BAE - U+0BB5, U+0BB7 - U+0BB9, U+0BBE - U+0BC2, U+0BC6 - U+0BC8, U+0BCA - U+0BCD
Telugu: U+0C01 - U+0C03, U+0C05 - U+0C0C, U+0C0E - U+0C10, U+0C12 - U+0C28, U+0C2A - U+0C33, U+0C35 - U+0C39, U+0C3E - U+0C44, U+0C46 - U+0C48, U+0C4A - U+0C4D, U+0C60 - U+0C61
Kannada: U+0C82 - U+0C83, U+0C85 - U+0C8C, U+0C8E - U+0C90, U+0C92 - U+0CA8, U+0CAA - U+0CB3, U+0CB5 - U+0CB9, U+0CBE - U+0CC4, U+0CC6 - U+0CC8, U+0CCA - U+0CCD, U+0CDE, U+0CE0 - U+0CE1
Malayalam: U+0D02 - U+0D03, U+0D05 - U+0D0C, U+0D0E - U+0D10, U+0D12 - U+0D28, U+0D2A - U+0D39, U+0D3E - U+0D43, U+0D46 - U+0D48, U+0D4A - U+0D4D, U+0D60 - U+0D61
Thai: U+0E01 - U+0E3A, U+0E40 - U+0E5B
Lao: U+0E81 - U+0E82, U+0E84, U+0E87 - U+0E88, U+0E8A, U+0E8D, U+0E94 - U+0E97, U+0E99 - U+0E9F, U+0EA1 - U+0EA3, U+0EA5, U+0EA7, U+0EAA - U+0EAB, U+0EAD - U+0EAE, U+0EB0 - U+0EB9, U+0EBB - U+0EBD, U+0EC0 - U+0EC4, U+0EC6, U+0EC8 - U+0ECD, U+0EDC - U+0EDD
Tibetan: U+0F00, U+0F18 - U+0F19, U+0F35, U+0F37, U+0F39, U+0F3E - U+0F47, U+0F49 - U+0F69, U+0F71 - U+0F84, U+0F86 - U+0F8B, U+0F90 - U+0F95, U+0F97, U+0F99 - U+0FAD, U+0FB1 - U+0FB7, U+0FB9
Georgian: U+10A0 - U+10C5, U+10D0 - U+10F6
Hiragana: U+3041 - U+3093, U+309B - U+309C
Katakana: U+30A1 - U+30F6, U+30FB - U+30FC
Bopomofo: U+3105 - U+312C
CJK Unified Ideographs: U+4E00 - U+9FA5
Hangul: U+AC00 - U+D7A3
Digits: U+0660 - U+0669, U+06F0 - U+06F9, U+0966 - U+096F, U+09E6 - U+09EF, U+0A66 - U+0A6F, U+0AE6 - U+0AEF, U+0B66 - U+0B6F, U+0BE7 - U+0BEF, U+0C66 - U+0C6F, U+0CE6 - U+0CEF, U+0D66 - U+0D6F, U+0E50 - U+0E59, U+0ED0 - U+0ED9, U+0F20 - U+0F33
Special characters: U+00B5, U+00B7, U+02B0 - U+02B8, U+02BB, U+02BD - U+02C1, U+02D0 - U+02D1, U+02E0 - U+02E4, U+037A, U+0559, U+093D, U+0B3D, U+1FBE, U+203F - U+2040, U+2102, U+2107, U+210A - U+2113, U+2115, U+2118 - U+211D, U+2124, U+2126, U+2128, U+212A - U+2131, U+2133 - U+2138, U+2160 - U+2182, U+3005 - U+3007, U+3021 - U+3029

Identifiers have a type index of 14 and are stored as string data of the identifier itself.
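
The identifier rule can be sketched in Python as a range lookup plus the digit check. To keep the example short, the table below holds only the Basic ranges and the first few Latin entries from the list above, so it is deliberately incomplete. The function names are hypothetical, and the sketch does not reject reserved words such as keywords, separators, or boolean literals.

```python
# Partial table: the Basic ranges plus the first few Latin entries only.
IDENTIFIER_RANGES = [
    (0x0030, 0x0039), (0x0041, 0x005A), (0x005F, 0x005F), (0x0061, 0x007A),
    (0x00AA, 0x00AA), (0x00BA, 0x00BA), (0x00C0, 0x00D6), (0x00D8, 0x00F6),
    (0x00F8, 0x01F5),
]


def is_identifier_char(ch: str) -> bool:
    """True if the character falls inside one of the listed ranges."""
    code_point = ord(ch)
    return any(low <= code_point <= high for low, high in IDENTIFIER_RANGES)


def is_arabic_digit(ch: str) -> bool:
    return "0" <= ch <= "9"  # U+0030 - U+0039


def is_identifier(text: str) -> bool:
    """Valid characters throughout, and no Arabic digit at the start or end."""
    if not text or not all(is_identifier_char(ch) for ch in text):
        return False
    return not (is_arabic_digit(text[0]) or is_arabic_digit(text[-1]))
```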