Manual: tokenizer
Converts strings into Lever tokens
The tokenizer is the first stage a Lever program passes through when it is about to be run. It is invoked by the parser and the compiler.
To bootstrap Lever's compiler with Python, there is a ported implementation of this module in compiler/lever_parser/reader/.
Lever's tokenizer closely resembles Python's tokenizer. The major difference is that the keywords and operators the tokenizer collects are determined by the 'table' -argument passed in. This allows the parsing engine to have the last word on which keywords the tokenizer recognizes.
This tokenizer understands:
- Comments starting with '#' that continue to the end of the line.
- Python's raw string syntax, e.g. r"foo\bar", r'foo\bar'
- Symbols, e.g. identifiers, matching the regex [a-zA-Z_][a-zA-Z0-9_]*
- Hexadecimal numbers starting with 0x
- Whole numbers, matching the regex [0-9]+
- Numbers with a decimal point and an exponent, e.g. 1.23e-10, 1.2
- Strings with single or double quotes, e.g. 'hello' or "world"
- Custom keywords or operators.
The tokenizer treats whitespace as a separator between tokens and does not produce Literal -objects for it.
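As a rough illustration, the token kinds listed above can be sketched with Python regular expressions. This is a hypothetical sketch, not the actual implementation; the names TOKEN_PATTERNS and scan are made up for this example, and it ignores custom keywords and the finer points of string escapes:

```python
import re

# Hypothetical regex sketch of the token kinds listed above.
# Order matters: hex before int, float before int, string before symbol.
TOKEN_PATTERNS = [
    ("comment", r"#[^\n]*"),
    ("string",  r"[rR]?\"(?:\\.|[^\"\\])*\"|[rR]?'(?:\\.|[^'\\])*'"),
    ("hex",     r"0x[0-9a-fA-F]+"),
    ("float",   r"[0-9]+\.[0-9]+(?:[eE][+-]?[0-9]+)?"),
    ("int",     r"[0-9]+"),
    ("symbol",  r"[a-zA-Z_][a-zA-Z0-9_]*"),
]

def scan(text):
    # Yields (name, string) pairs; whitespace separates tokens and
    # produces nothing, mirroring the behavior described above.
    pos = 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        for name, pattern in TOKEN_PATTERNS:
            m = re.match(pattern, text[pos:])
            if m:
                yield name, m.group(0)
                pos += m.end()
                break
        else:
            raise ValueError("unrecognized character: " + text[pos])
```
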
01. Keyword smearing
The keyword table given to the tokenizer may contain symbols such as '!='. For these to be correctly recognized, the non-alphabetic keywords have to be smeared: every prefix of such a keyword must also be present in the table, so that the tokenizer can extend a match one character at a time.
Here's a smearing function you can use for that purpose:
default_smear = (keyword):
    for ch in keyword
        if ch.is_alpha()
            return [keyword]
    result = []
    prefix = []
    for ch in keyword
        prefix.append(ch)
        result.append("".join(prefix))
    return result
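The same smearing function can be written in Python, which may be easier to experiment with. This is an illustrative translation, not part of the Lever runtime:

```python
def default_smear(keyword):
    # Keywords containing an alphabetic character ('if', 'return', ...)
    # need no smearing and are returned as-is.
    if any(ch.isalpha() for ch in keyword):
        return [keyword]
    # Non-alphabetic operators: return every prefix,
    # so '!=' yields ['!', '!='].
    return [keyword[:i] for i in range(1, len(keyword) + 1)]
```
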
02. API
path | The path to the file to be tokenized. |
table | Keyword table. |
symtab | Symbol table. |
returns | List of Literal -objects. |
string | The string to be tokenized. |
table | Keyword table. |
symtab | Symbol table. |
returns | List of Literal -objects. |
To create your own symbol table, create an object with attributes: 'string', 'symbol', 'hex', 'int', 'float'.
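For instance, such an object could be sketched in Python like this. Only the attribute names are required by the tokenizer; the values shown here, which name each token kind, are illustrative:

```python
from types import SimpleNamespace

# Hypothetical symbol table sketch: each attribute supplies the
# 'name' that tokens of that kind will carry.
default_symtab = SimpleNamespace(
    string="string",
    symbol="symbol",
    hex="hex",
    int="int",
    float="float",
)
```
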
03. Internal details
The rest of this file isn't well-documented.
start | start source location {col, lno} |
stop | stop source location {col, lno} |
name | 'name' of the token, retrieved from the symtab -object. |
string | The string captured by this token. |
This object is likely unnecessary, and may be replaced by something with .col and .lno -attributes in the future.
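For illustration, the Literal token described by the table above could be mirrored in Python like this. This is a hypothetical sketch, not the real object defined by the tokenizer:

```python
# Hypothetical Python mirror of the Literal token object described above.
class Literal:
    def __init__(self, start, stop, name, string):
        self.start = start    # source location {col, lno} where the token begins
        self.stop = stop      # source location {col, lno} where the token ends
        self.name = name      # 'name' of the token, retrieved from the symtab
        self.string = string  # the string captured by this token
```
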
Represents a character stream used by the parser. This is purely an internal detail.
stream | not documented |
table | not documented |
symtab | not documented |
returns | not documented |