Lexical


At its simplest level, an O source code file is made up of a collection of symbols, keywords, value literals, identifiers, and comments.

Comments

There are three kinds of comments in O: line comments, block comments, and documentation comments. Line comments start with // and continue until the end of the line. Block comments start with /* and end with */. They may span multiple lines (as is their intention), and they nest. Here are some examples:

this is not in a comment
// this is a line comment
but it doesn't span lines
/* this is a block comment
and it spans lines /* and can be nested, so
even though one ends here, */ the outer one still continues until
here */ so this bit here isn't commented
/// this is a documentation comment, but they may only be used in certain places
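
The nesting rule for block comments can be sketched as a depth counter. The following is a minimal Python sketch, not the O compiler's implementation: the function name strip_comments is hypothetical, documentation comments are dropped like line comments, and comment markers inside string literals are not handled.

```python
def strip_comments(source: str) -> str:
    """Remove // line comments and nested /* */ block comments.

    Assumptions for this sketch: /// documentation comments are dropped
    like line comments, and string literals are not recognised.
    """
    out = []
    depth = 0  # current block-comment nesting depth
    i = 0
    while i < len(source):
        pair = source[i:i + 2]
        if pair == "/*":
            depth += 1  # a new (possibly nested) block comment opens
            i += 2
        elif pair == "*/" and depth > 0:
            depth -= 1  # the innermost open block comment closes
            i += 2
        elif depth > 0:
            i += 1  # inside a block comment: skip the character
        elif pair == "//":
            # line comment: skip up to, but not including, the newline
            while i < len(source) and source[i] != "\n":
                i += 1
        else:
            out.append(source[i])
            i += 1
    if depth > 0:
        raise ValueError("unterminated block comment")
    return "".join(out)
```

Because the depth only returns to zero when every opened /* has been matched by a */, the inner comment in the example above does not end the outer one.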

Documentation comments act in the same way as line comments, but start with /// instead. Both line and block comments are ignored by the compiler, but documentation comments are saved for use by other parts of the compiler, such as to be output for use by a language server or to generate documentation files. This does mean that documentation comments are limited in where they may be used, so placing documentation comments in arbitrary locations in a file may lead to parsing errors. Documentation comments may be used in groups like in the example below to add different bits of information to the same thing.

In the example below, notice the syntactic constructions within the comments. Several of the comments wrap identifiers in braces to refer to specific types and variables, and others use an @ syntax to document specific things: parameters with @p name, errors and exceptions with @e and @x, and outputs with @o.

Note that the syntax in documentation comments has no effect on the actual compilation of the program. Content of documentation comments is only interacted with when the compiler is being used to generate documentation or to output information for a language server to use.

/// Do the important thing based on {s} and {c}.
/// @p s The string to be used.
/// @p c The object being acted upon.
/// @e Will pass an {invalidargerr} if {s} is empty.
/// @x Will throw a {bigbadex} if the operation fails.
/// @o Returns an {int} count of how many important things were done.
int doimportantthing( string s , someclass c ) where ( s.len > 0 ) ;

Symbols and Keywords

Symbols are vital for specifying meaning in code as well as grouping things together as required. Symbols are also used for operators such as + and -. Keywords are special words that are restricted in use, meaning they can't be used as identifiers. Below is a list of all symbols, followed by all keywords:

( ) { } [ ] = == != > >= <= < + += ++ - -= -- * *= / /= ~ ~= *~ *~= ^ ^= % %= | |= && || ! >< ?? ## ? # #? -> . .. ... , ; : $

as body class dependency entrypoint expose enum flat has import interface is new piped private public ref restricted static

Unlike a lot of other languages, O does not reserve common type names as keywords, electing instead to have later compiler stages catch out misuse based on the implicitly imported core namespace in every file that contains these basic type classes. Similarly, common statements are defined in this namespace rather than being reserved.

Separators

O can use various keywords as separators in method definitions and invocations, as a means of making methods more readable and understandable. For instance, find( 'o' , "hello" ) may be understandable in that we're returning whether we can find the character in the string, but if the arguments are expressions or variables instead then it may be more confusing and require reading additional code. However, find( 'o' in "hello" ) is immediately understandable, even if more complicated parameters are provided.

These separators are reserved keywords, and so may not be used as identifiers.

and at but by from in of or then to

Integer Literals

Integer literals can be provided in binary, decimal, or hexadecimal. A decimal integer literal is simply a string of numeric digits 0-9. Underscores may be placed arbitrarily on the interior of the literal to aid readability. For instance, 123456789 takes a moment to parse mentally before it is clear that the first digit represents 100 million, but written as 123_456_789 it becomes much more readable.

To mark an integer literal as binary, prefix it with b. Binary integers may only contain the binary digits 0-1. Note that the b may not be immediately followed by an underscore: b1011 is a binary integer literal, but b_1011 will be seen as an identifier. The same principle applies to hexadecimal integer literals, which are prefixed with 0x. Hexadecimal integer literals are not case-sensitive, so both a-f and A-F may be used as hexadecimal digits interchangeably, even within the same literal.
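
The integer literal rules above can be expressed as three regular expressions. This is a minimal Python sketch under the rules as stated in this section (prefix b for binary, 0x for hexadecimal, underscores only on the interior of the digits); the name parse_int_literal is hypothetical and is not part of O.

```python
import re

# Digits must start and end the digit run; underscores only in between.
_DEC = re.compile(r"[0-9](?:[0-9_]*[0-9])?")
_BIN = re.compile(r"b([01](?:[01_]*[01])?)")
_HEX = re.compile(r"0x([0-9a-fA-F](?:[0-9a-fA-F_]*[0-9a-fA-F])?)")


def parse_int_literal(text: str):
    """Return the value of an integer literal, or None if it is not one."""
    m = _BIN.fullmatch(text)
    if m:
        return int(m.group(1).replace("_", ""), 2)
    m = _HEX.fullmatch(text)
    if m:
        return int(m.group(1).replace("_", ""), 16)
    if _DEC.fullmatch(text):
        return int(text.replace("_", ""), 10)
    return None  # e.g. b_1011, which would lex as an identifier instead
```

For example, parse_int_literal("123_456_789") and parse_int_literal("123456789") give the same value, while parse_int_literal("b_1011") gives None because the prefix is immediately followed by an underscore.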

Float Literals

Float literals, unlike integers, may only be provided in base 10. The distinguishing character in a float literal is a decimal point ., which must be present and must have digits on both sides of it. For instance, .3 is not valid, whereas 0.3 is. Underscores may be freely used within the literal, but may not be placed adjacent to the decimal point. As with integer literals, an underscore may not be the first or last character of the literal.
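
The float rules can be checked with a single pattern in which each side of the decimal point must begin and end with a digit. This is a minimal Python sketch; parse_float_literal is a hypothetical name, and conversion to a Python float is only for illustration.

```python
import re

# Each side of the point starts and ends with a digit, so an underscore
# can never be first, last, or adjacent to the decimal point.
_FLOAT = re.compile(r"[0-9](?:[0-9_]*[0-9])?\.[0-9](?:[0-9_]*[0-9])?")


def parse_float_literal(text: str):
    """Return the value of a float literal, or None if it is not one."""
    if _FLOAT.fullmatch(text) is None:
        return None
    return float(text.replace("_", ""))
```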

Byte Literals

A byte literal is used to represent a single byte of data. It doesn't have a numeric representation, as it is not intended to be used as a number but simply as raw data. A byte literal may be provided in one of two forms: binary and hexadecimal. A binary byte literal consists of a B followed by exactly 8 binary digits 0-1. A hexadecimal byte literal consists of an X followed by exactly 2 hexadecimal digits 0-9a-fA-F. Byte literals may not contain underscores.
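
Since both forms of byte literal have a fixed length, recognising them is straightforward. The sketch below is Python with a hypothetical function name, parse_byte_literal; it returns a one-byte bytes object to reflect that the value is raw data rather than a number.

```python
import re


def parse_byte_literal(text: str):
    """Return the single byte a byte literal denotes, or None."""
    if re.fullmatch(r"B[01]{8}", text):
        return bytes([int(text[1:], 2)])  # B + exactly 8 binary digits
    if re.fullmatch(r"X[0-9a-fA-F]{2}", text):
        return bytes([int(text[1:], 16)])  # X + exactly 2 hex digits
    return None  # wrong prefix, wrong length, or contains an underscore
```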

Boolean Literals

Boolean literals are just as you might expect, being either the word true or false. However, O also accepts the words yes and no. The variants can be used interchangeably, and it is left to the user to decide which reads more naturally in a given piece of code, as the choice has no effect on functionality.

Character Literals

For storing a single byte of information, the byte type should be used. The char type is for real characters, which reach far beyond a single byte. A character literal contains any single UTF-8 character or escape sequence surrounded by single quotes '. An escape sequence is started with a \ character, followed by 1 or more characters to form the sequence. Below is a table of all valid escape sequences.

\'            A literal single quote
\"            A literal double quote
\{            A literal opening brace
\}            A literal closing brace
\\            A literal backslash
\r            A carriage return character
\n            A line-feed character
\t            A tab character
\xnn          The Unicode character U+nn
\unnnn        The Unicode character U+nnnn
\Unnnnnnnn    The Unicode character U+nnnnnnnn

Note that these characters do not always have to be escaped. For instance, braces only need to be escaped in var-string literals, double-quotes don't need to be escaped in character literals, and single-quotes only need to be escaped in character literals.
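
Decoding the escape sequences in the table can be sketched as a single pass over the text between the quotes. The Python below is illustrative only: decode_escapes is a hypothetical name, each n in the table is assumed to be one hexadecimal digit, and any sequence not in the table is treated as an error.

```python
import string

_SIMPLE = {"'": "'", '"': '"', "{": "{", "}": "}", "\\": "\\",
           "r": "\r", "n": "\n", "t": "\t"}
_HEX_LEN = {"x": 2, "u": 4, "U": 8}  # number of hex digits after the letter


def decode_escapes(body: str) -> str:
    """Replace each escape sequence in body with the character it denotes."""
    out = []
    i = 0
    while i < len(body):
        ch = body[i]
        if ch != "\\":
            out.append(ch)
            i += 1
            continue
        if i + 1 >= len(body):
            raise ValueError("dangling backslash")
        kind = body[i + 1]
        if kind in _SIMPLE:
            out.append(_SIMPLE[kind])
            i += 2
        elif kind in _HEX_LEN:
            n = _HEX_LEN[kind]
            digits = body[i + 2:i + 2 + n]
            if len(digits) != n or any(c not in string.hexdigits for c in digits):
                raise ValueError("bad hex escape")
            out.append(chr(int(digits, 16)))
            i += 2 + n
        else:
            raise ValueError("unknown escape sequence")
    return "".join(out)
```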

String Literals

String literals are an arbitrary number of UTF-8 characters and escape sequences surrounded by double quotes ". All the same escape sequences that apply in character literals apply in string literals. In the event of an empty string literal "", the data is of length 0 which is perfectly allowed. Strings are otherwise stored as plain UTF-8 data.

Hex-string Literals

Hex-strings are to bytes what strings are to characters. A hex-string literal is intended to provide a clean means of representing larger amounts of raw data in code. It starts with x", ends with ", and may contain any combination of hex digits and whitespace (but may only span a single line). The only constraint is that the literal must contain an even number of hex digits, as it must represent a whole number of bytes.
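
As an illustration of the even-digit rule, here is a minimal Python sketch that converts a hex-string literal into raw bytes. The name parse_hex_string is hypothetical, and the sketch assumes the literal is closed by a double quote in the same way as other string literals.

```python
import string


def parse_hex_string(literal: str) -> bytes:
    """Convert a hex-string literal such as x"DE AD" into bytes."""
    if len(literal) < 3 or not literal.startswith('x"') or not literal.endswith('"'):
        raise ValueError("not a hex-string literal")
    body = literal[2:-1]
    if "\n" in body or "\r" in body:
        raise ValueError("hex-string may only span a single line")
    digits = "".join(body.split())  # drop all whitespace between digits
    if any(c not in string.hexdigits for c in digits):
        raise ValueError("only hex digits and whitespace are allowed")
    if len(digits) % 2 != 0:
        raise ValueError("odd number of hex digits")
    return bytes.fromhex(digits)
```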

Var-string Literals

Often, we may want a string literal that has within it the values of variables. It is simple to break the string up into sections and concatenate them with the variables we wish to include, but this takes space and isn't as immediately understood as having the variables, or even whole expressions, placed directly in the string. That is the purpose of the var-string. A var-string starts with v" and ends with ", but may contain any number of expressions within braces, which get interpolated into the string. Here is an example:

v"Hi there {name}! If you were born in {dob.year} then in 2000 you were {dob.year > 2000 ?? "not yet born" ## 2000 - dob.year}."
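
To show how the braces divide a var-string, the following Python sketch splits the text between v" and the closing " into literal text and expression parts. It is a sketch under assumptions, not O's actual lexer: split_var_string is a hypothetical name, expressions are assumed not to contain braces of their own outside of nested string literals, and escape sequences are left undecoded.

```python
def split_var_string(body: str):
    """Split a var-string body into ("text", s) and ("expr", s) parts."""
    parts = []
    text = []
    i = 0
    while i < len(body):
        ch = body[i]
        if ch == "\\" and i + 1 < len(body):
            text.append(body[i:i + 2])  # keep \{ and other escapes intact
            i += 2
        elif ch == "{":
            if text:
                parts.append(("text", "".join(text)))
                text = []
            j = i + 1
            in_string = False  # inside a nested "..." within the expression
            while j < len(body):
                c = body[j]
                if c == "\\":
                    j += 2  # skip the escaped character
                    continue
                if c == '"':
                    in_string = not in_string
                elif c == "}" and not in_string:
                    break
                j += 1
            if j >= len(body):
                raise ValueError("unterminated expression")
            parts.append(("expr", body[i + 1:j]))
            i = j + 1
        else:
            text.append(ch)
            i += 1
    if text:
        parts.append(("text", "".join(text)))
    return parts
```

A compiler would then parse each expression part as ordinary code and concatenate the results with the text parts.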

Identifiers

Identifiers represent variables, types, methods, statements, and any other meaningful token that doesn't fall into any other category. Identifiers may be composed of any of the "Universal characters for identifiers" as detailed in Annex D of ISO/IEC 9899:1999(E) for the C programming language. The only restriction is that an identifier may not start with an Arabic digit 0-9 (U+0030 - U+0039).

For most users, just using the characters a-zA-Z0-9_ with no 0-9 as the first character will cover almost all bases. For those whose native language doesn't use the Latin alphabet and who may want to write identifiers in other languages, this is a full table of all characters named in ISO/IEC 9899:1999(E):

Basic: U+0030 - U+0039 U+0041 - U+005A U+005F U+0061 - U+007A
Latin: U+00AA U+00BA U+00C0 - U+00D6 U+00D8 - U+00F6 U+00F8 - U+01F5 U+01FA - U+0217 U+0250 - U+02A8 U+1E00 - U+1E9B U+1EA0 - U+1EF9 U+207F
Greek: U+0386 U+0388 - U+038A U+038C U+038E - U+03A1 U+03A3 - U+03CE U+03D0 - U+03D6 U+03DA U+03DC U+03DE U+03E0 U+03E2 - U+03F3 U+1F00 - U+1F15 U+1F18 - U+1F1D U+1F20 - U+1F45 U+1F48 - U+1F4D U+1F50 - U+1F57 U+1F59 U+1F5B U+1F5D U+1F5F - U+1F7D U+1F80 - U+1FB4 U+1FB6 - U+1FBC U+1FC2 - U+1FC4 U+1FC6 - U+1FCC U+1FD0 - U+1FD3 U+1FD6 - U+1FDB U+1FE0 - U+1FEC U+1FF2 - U+1FF4 U+1FF6 - U+1FFC
Cyrillic: U+0401 - U+040C U+040E - U+044F U+0451 - U+045C U+045E - U+0481 U+0490 - U+04C4 U+04C7 - U+04C8 U+04CB - U+04CC U+04D0 - U+04EB U+04EE - U+04F5 U+04F8 - U+04F9
Armenian: U+0531 - U+0556 U+0561 - U+0587
Hebrew: U+05B0 - U+05B9 U+05BB - U+05BD U+05BF U+05C1 - U+05C2 U+05D0 - U+05EA U+05F0 - U+05F2
Arabic: U+0621 - U+063A U+0640 - U+0652 U+0670 - U+06B7 U+06BA - U+06BE U+06C0 - U+06CE U+06D0 - U+06DC U+06E5 - U+06E8 U+06EA - U+06ED
Devanagari: U+0901 - U+0903 U+0905 - U+0939 U+093E - U+094D U+0950 - U+0952 U+0958 - U+0963
Bengali: U+0981 - U+0983 U+0985 - U+098C U+098F - U+0990 U+0993 - U+09A8 U+09AA - U+09B0 U+09B2 U+09B6 - U+09B9 U+09BE - U+09C4 U+09C7 - U+09C8 U+09CB - U+09CD U+09DC - U+09DD U+09DF - U+09E3 U+09F0 - U+09F1
Gurmukhi: U+0A02 U+0A05 - U+0A0A U+0A0F - U+0A10 U+0A13 - U+0A28 U+0A2A - U+0A30 U+0A32 - U+0A33 U+0A35 - U+0A36 U+0A38 - U+0A39 U+0A3E - U+0A42 U+0A47 - U+0A48 U+0A4B - U+0A4D U+0A59 - U+0A5C U+0A5E U+0A74
Gujarati: U+0A81 - U+0A83 U+0A85 - U+0A8B U+0A8D U+0A8F - U+0A91 U+0A93 - U+0AA8 U+0AAA - U+0AB0 U+0AB2 - U+0AB3 U+0AB5 - U+0AB9 U+0ABD - U+0AC5 U+0AC7 - U+0AC9 U+0ACB - U+0ACD U+0AD0 U+0AE0
Oriya: U+0B01 - U+0B03 U+0B05 - U+0B0C U+0B0F - U+0B10 U+0B13 - U+0B28 U+0B2A - U+0B30 U+0B32 - U+0B33 U+0B36 - U+0B39 U+0B3E - U+0B43 U+0B47 - U+0B48 U+0B4B - U+0B4D U+0B5C - U+0B5D U+0B5F - U+0B61
Tamil: U+0B82 - U+0B83 U+0B85 - U+0B8A U+0B8E - U+0B90 U+0B92 - U+0B95 U+0B99 - U+0B9A U+0B9C U+0B9E - U+0B9F U+0BA3 - U+0BA4 U+0BA8 - U+0BAA U+0BAE - U+0BB5 U+0BB7 - U+0BB9 U+0BBE - U+0BC2 U+0BC6 - U+0BC8 U+0BCA - U+0BCD
Telugu: U+0C01 - U+0C03 U+0C05 - U+0C0C U+0C0E - U+0C10 U+0C12 - U+0C28 U+0C2A - U+0C33 U+0C35 - U+0C39 U+0C3E - U+0C44 U+0C46 - U+0C48 U+0C4A - U+0C4D U+0C60 - U+0C61
Kannada: U+0C82 - U+0C83 U+0C85 - U+0C8C U+0C8E - U+0C90 U+0C92 - U+0CA8 U+0CAA - U+0CB3 U+0CB5 - U+0CB9 U+0CBE - U+0CC4 U+0CC6 - U+0CC8 U+0CCA - U+0CCD U+0CDE U+0CE0 - U+0CE1
Malayalam: U+0D02 - U+0D03 U+0D05 - U+0D0C U+0D0E - U+0D10 U+0D12 - U+0D28 U+0D2A - U+0D39 U+0D3E - U+0D43 U+0D46 - U+0D48 U+0D4A - U+0D4D U+0D60 - U+0D61
Thai: U+0E01 - U+0E3A U+0E40 - U+0E5B
Lao: U+0E81 - U+0E82 U+0E84 U+0E87 - U+0E88 U+0E8A U+0E8D U+0E94 - U+0E97 U+0E99 - U+0E9F U+0EA1 - U+0EA3 U+0EA5 U+0EA7 U+0EAA - U+0EAB U+0EAD - U+0EAE U+0EB0 - U+0EB9 U+0EBB - U+0EBD U+0EC0 - U+0EC4 U+0EC6 U+0EC8 - U+0ECD U+0EDC - U+0EDD
Tibetan: U+0F00 U+0F18 - U+0F19 U+0F35 U+0F37 U+0F39 U+0F3E - U+0F47 U+0F49 - U+0F69 U+0F71 - U+0F84 U+0F86 - U+0F8B U+0F90 - U+0F95 U+0F97 U+0F99 - U+0FAD U+0FB1 - U+0FB7 U+0FB9
Georgian: U+10A0 - U+10C5 U+10D0 - U+10F6
Hiragana: U+3041 - U+3093 U+309B - U+309C
Katakana: U+30A1 - U+30F6 U+30FB - U+30FC
Bopomofo: U+3105 - U+312C
CJK Unified Ideographs: U+4E00 - U+9FA5
Hangul: U+AC00 - U+D7A3
Digits: U+0660 - U+0669 U+06F0 - U+06F9 U+0966 - U+096F U+09E6 - U+09EF U+0A66 - U+0A6F U+0AE6 - U+0AEF U+0B66 - U+0B6F U+0BE7 - U+0BEF U+0C66 - U+0C6F U+0CE6 - U+0CEF U+0D66 - U+0D6F U+0E50 - U+0E59 U+0ED0 - U+0ED9 U+0F20 - U+0F33
Special Characters: U+00B5 U+00B7 U+02B0 - U+02B8 U+02BB U+02BD - U+02C1 U+02D0 - U+02D1 U+02E0 - U+02E4 U+037A U+0559 U+093D U+0B3D U+1FBE U+203F - U+2040 U+2102 U+2107 U+210A - U+2113 U+2115 U+2118 - U+211D U+2124 U+2126 U+2128 U+212A - U+2131 U+2133 - U+2138 U+2160 - U+2182 U+3005 - U+3007 U+3021 - U+3029
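
A table like the one above lends itself to a simple range check. The Python sketch below is illustrative only: is_identifier and RANGES are hypothetical names, RANGES holds just a small excerpt of the table (a full check would list every range), and the sketch does not reject keywords or separators.

```python
# Excerpt of the table above as inclusive (first, last) code point ranges.
RANGES = [
    (0x0030, 0x0039), (0x0041, 0x005A), (0x005F, 0x005F), (0x0061, 0x007A),  # Basic
    (0x00C0, 0x00D6), (0x00D8, 0x00F6),  # part of Latin
    (0x03A3, 0x03CE),                    # part of Greek
    (0x4E00, 0x9FA5),                    # CJK Unified Ideographs
]


def is_identifier(text: str) -> bool:
    """Check text against the excerpted ranges and the leading-digit rule."""
    if not text:
        return False
    if "0" <= text[0] <= "9":  # may not start with U+0030 - U+0039
        return False
    return all(any(lo <= ord(c) <= hi for lo, hi in RANGES) for c in text)
```

Under this check, b_1011 is accepted as an identifier, which matches the earlier note that it is not a binary integer literal.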