Manual: tokenizer

Converts strings into Lever tokens

The tokenizer is the first stage a Lever program passes through when it is about to be run. It is invoked by the parser and the compiler.

To bootstrap Lever's compiler with Python, there is a ported implementation of this module in compiler/lever_parser/reader/.

Table of contents
01. Keyword smearing
02. API
03. Internal details

Lever's tokenizer resembles Python's tokenizer a lot. The major difference is that the keywords and operators the tokenizer collects are determined by the 'table' -argument passed to it. This lets the parsing engine have the final word on which keywords the tokenizer recognizes.

This tokenizer understands:

  1. Comments starting with '#' that continue to the end of the line.
  2. Python's raw string syntax, e.g. r"foo\bar", r'foo\bar'.
  3. Symbols such as identifiers, matching the regex [a-zA-Z_][a-zA-Z0-9_]*.
  4. Hexadecimal numbers starting with 0x...
  5. Whole numbers, matching the regex [0-9]+.
  6. Numbers with a decimal point and an exponent, e.g. 1.23e-10, 1.2.
  7. Strings with single or double quotes, e.g. 'hello' or "world".
  8. Custom keywords and operators supplied by the table.

The tokenizer treats space as a separator between tokens and does not produce Literal -objects for it.
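
For example, here is a minimal sketch of tokenizing a string with a custom keyword table. It assumes the module is imported under its documented name 'tokenizer' and that the table is a plain list of smeared keyword strings; sections 01 and 02 below cover the details:

import tokenizer

main = (args):
    table = ["if", "!", "!="]    # '!=' smeared by hand, see section 01
    for tok in tokenizer.read_string("if x != 10", table)
        print(tok.name, tok.string)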

01. Keyword smearing

The keyword table given to the tokenizer may contain operators such as '!='. For these to be recognized correctly, the non-alphabetic keywords have to be smeared: every prefix of the operator must also appear in the table, so the tokenizer can extend a match one character at a time.

Here's a smearing function you can use for that purpose:

default_smear = (keyword):
    # Alphabetic keywords need no smearing; return them as they are.
    for ch in keyword
        if ch.is_alpha()
            return [keyword]
    # Otherwise return every prefix of the keyword,
    # e.g. "!=" becomes ["!", "!="].
    result = []
    prefix = []
    for ch in keyword
        prefix.append(ch)
        result.append("".join(prefix))
    return result
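
For instance, the function above smears '!=' into its prefixes while letting alphabetic keywords pass through untouched:

default_smear("!=")    # returns ["!", "!="]
default_smear("if")    # returns ["if"]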

02. API

read_file(path, table = null, symtab = null)
path The path to the file to be tokenized.
table Keyword table.
symtab Symbol table.
returns List of Literal -objects.
Tokenize a file.
volatile
read_string(string, table = null, symtab = null)
string The string to be tokenized.
table Keyword table.
symtab Symbol table.
returns List of Literal -objects.
Tokenize a string.
volatile
default_symtab : object
The default symbol table that is used if you do not pass a third argument to read_file or read_string.

To create your own symbol table, create an object with attributes: 'string', 'symbol', 'hex', 'int', 'float'.

volatile
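
Here is a sketch of building such an object. It assumes exnihilo(), the blank-object constructor, and that the attribute values are the name strings that end up in the .name field of each Literal:

my_symtab = exnihilo()
my_symtab.string = "string"    # name given to quoted strings
my_symtab.symbol = "symbol"    # name given to identifiers
my_symtab.hex = "hex"          # name given to 0x... numbers
my_symtab.int = "int"          # name given to whole numbers
my_symtab.float = "float"      # name given to decimal numbers

tokens = read_string(source, table, my_symtab)    # 'source' and 'table' are placeholders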

03. Internal details

The rest of this file isn't well-documented.

class Literal extends object
The representation of a token.
volatile
+init(self, start, stop, name, string)
start The start source location {col, lno}.
stop The stop source location {col, lno}.
name The name of the token, retrieved from the symtab -object.
string The string captured by this token.
volatile
+repr(self)
volatile
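
For instance, the fields can be read straight off the tokens. A small sketch, assuming the default keyword table and symbol table suffice for plain symbols, numbers and strings:

for tok in tokenizer.read_string("foo 42 'hello'")
    print(tok.start.lno, tok.start.col, tok.name, tok.string)
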
class Position extends object
Represents a source location.

This object is likely unnecessary, and may be replaced by something with .col and .lno -attributes in the future.

volatile
+init(self, col, lno)
col Column, starts from 0.
lno Line number, starts from 1.
volatile
+repr(self)
volatile
class TextStream extends object
internal

Represents a character stream used by the parser. This is purely an internal detail.

+init(self, source, index = null, col = null, lno = null)
volatile
current : property
Not documented; presumably the current character in the stream.
volatile
filled : property
Not documented; presumably true while characters remain in the stream.
volatile
get_digit(self, base = null)
self not documented
base not documented
returns not documented
Not documented; presumably reads the current character as a digit of the given base.
volatile
is_digit(self, base = null)
self not documented
base not documented
returns not documented
Not documented; presumably tells whether the current character is a digit in the given base.
volatile
is_space(self)
self not documented
returns not documented
Not documented; presumably tells whether the current character is whitespace.
volatile
is_sym(self)
self not documented
returns not documented
Not documented; presumably tells whether the current character can appear in a symbol.
volatile
pair_ahead(self, table)
self not documented
table not documented
returns not documented
not documented
volatile
position : property
Not documented; presumably the current source location as a Position -object.
volatile
dir : path
Not documented; presumably the directory this module was loaded from.
volatile
escape_sequence(stream)
stream not documented
returns not documented
Not documented; presumably decodes a string escape sequence from the stream.
volatile
escape_sequences : dict
Not documented; presumably maps escape characters to the characters they stand for.
volatile
fs : Module
Not documented; presumably the filesystem module used by read_file.
volatile
import : Import
Not documented; presumably the import hook of this module.
volatile
name = "tokenizer"
The name of this module.
volatile
next_token(stream, table, symtab = null)
stream The TextStream to read from.
table Keyword table.
symtab Symbol table.
returns Presumably a Literal -object.
Not documented; presumably reads the next token from the stream.
volatile
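
Presumably next_token reads a single token at a time, which would make it the workhorse behind read_file and read_string. The following is a speculative sketch; the loop condition stream.filled and the end-of-input behavior are assumptions, not documented facts:

stream = tokenizer.TextStream("foo 42")
table = []    # no custom keywords
while stream.filled
    tok = tokenizer.next_token(stream, table)
    print(tok.name, tok.string)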