This mini-language has a C/Java-like program structure. It supports:
int, float, and bool types
Arithmetic operators (+ - / * %)
Boolean operators (&& || !)
Comparison operators (< > >= <= == !=)
While loops
For loops
If/else statements
continue and break statements
In order to actually run, there must be a main() function defined.
Functions must be defined before they can be used.
A predefined function print(x) is supplied that can print a float, int, or bool.
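To make this concrete, here is a small example program in the mini-language. Treat it as a sketch rather than canonical syntax: the description above doesn't pin down details such as comment syntax, the return type of main(), or whether a declaration can carry an initialiser, so those are assumptions here.

// Computes 5! and prints it (assuming C-style "//" comments exist).
int factorial(int n) {
    int result = 1;
    for (int i = 1; i <= n; i = i + 1) {
        result = result * i;
    }
    return result;
}

void main() {
    int x = factorial(5);
    print(x);          // 120
    print(3.14);       // floats work too
    print(x > 100);    // and so do bools
}

Note that factorial is defined before main, since functions must be defined before they can be used.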
This first phase turns the input stream of characters into a stream of tokens by chopping the input into small pieces and ignoring unnecessary details (like whitespace).
Though keywords in code may be distinguishable to us, internally the source is stored as one long string, like "int x = 5;", where spaces, newlines, and tabs are stored as ordinary characters (" ", "\n", "\t") alongside everything else.
A computer isn't able to differentiate between characters with meaning and those without.
Imagine you're reading a book. The lexer's (also called a scanner or tokenizer) job would be breaking the sentences and words down into understandable parts. The lexer gives meaning to groups of characters.
The lexer reads the code you've written, character by character, and it will group characters together into meaningful chunks, discarding unnecessary details like whitespace.
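To see what that looks like in practice, here is a toy scanning loop written in C (the article doesn't prescribe an implementation language, so C is simply an assumption here). It walks the string one character at a time, throws whitespace away, and groups what's left into chunks:

#include <ctype.h>
#include <stdio.h>

/* A toy scanning loop: read the input character by character, skip
   whitespace, and group the remaining characters into chunks.
   (A real lexer would also tag each chunk with its token type.) */
int main(void) {
    const char *src = "int x = 5;";
    const char *p = src;

    while (*p) {
        if (isspace((unsigned char)*p)) { p++; continue; }   /* whitespace: just skip it */

        const char *start = p;
        if (isalpha((unsigned char)*p) || *p == '_') {        /* a word: keyword or identifier */
            while (isalnum((unsigned char)*p) || *p == '_') p++;
        } else if (isdigit((unsigned char)*p)) {               /* a run of digits: a number */
            while (isdigit((unsigned char)*p)) p++;
        } else {                                                /* a single-character symbol */
            p++;
        }
        printf("chunk: \"%.*s\"\n", (int)(p - start), start);
    }
    return 0;
}

Run on the string above, it prints the chunks "int", "x", "=", "5" and ";", with the whitespace simply gone. (Operators that span two characters, like == or >=, would need one extra look-ahead step, which this toy glosses over.)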
First, we have to make a list of all the special keywords (tokens) for the language.
Here are all the possible tokens for our mini language:
INT
FLOAT
BOOL
VOID
INT_LITERAL
FLOAT_LITERAL
BOOL_LITERAL
IDENTIFIER
IF
ELSE
WHILE
FOR
RETURN
BREAK
CONTINUE
PLUS
MINUS
MULT
DIV
MOD
NOT
ASSIGN
AND
OR
EQUAL
NOT_EQUAL
GREATER_THAN
GREATER_THAN_OR_EQUAL
LESS_THAN
LESS_THAN_OR_EQUAL
LEFT_PAREN
RIGHT_PAREN
LEFT_BRACE
RIGHT_BRACE
SEMICOLON
COMMA
EOF
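Before moving on, it is worth sketching how these token kinds might be represented in code. This is just one possible C representation (the article doesn't prescribe one); note that tokens like IDENTIFIER, INT_LITERAL, and FLOAT_LITERAL also need to remember the text they came from, not just their kind.

#include <stdio.h>

/* One possible in-memory representation of the token kinds above. */
typedef enum {
    TOKEN_INT, TOKEN_FLOAT, TOKEN_BOOL, TOKEN_VOID,
    TOKEN_INT_LITERAL, TOKEN_FLOAT_LITERAL, TOKEN_BOOL_LITERAL, TOKEN_IDENTIFIER,
    TOKEN_IF, TOKEN_ELSE, TOKEN_WHILE, TOKEN_FOR,
    TOKEN_RETURN, TOKEN_BREAK, TOKEN_CONTINUE,
    TOKEN_PLUS, TOKEN_MINUS, TOKEN_MULT, TOKEN_DIV, TOKEN_MOD,
    TOKEN_NOT, TOKEN_ASSIGN, TOKEN_AND, TOKEN_OR,
    TOKEN_EQUAL, TOKEN_NOT_EQUAL,
    TOKEN_GREATER_THAN, TOKEN_GREATER_THAN_OR_EQUAL,
    TOKEN_LESS_THAN, TOKEN_LESS_THAN_OR_EQUAL,
    TOKEN_LEFT_PAREN, TOKEN_RIGHT_PAREN, TOKEN_LEFT_BRACE, TOKEN_RIGHT_BRACE,
    TOKEN_SEMICOLON, TOKEN_COMMA, TOKEN_EOF
} TokenType;

typedef struct {
    TokenType type;     /* which kind of token this is         */
    const char *start;  /* where its text begins in the source */
    int length;         /* how many characters it spans        */
    int line;           /* handy later for error messages      */
} Token;

int main(void) {
    const char *src = "count";
    Token t = { TOKEN_IDENTIFIER, src, 5, 1 };
    printf("token type %d, lexeme \"%.*s\"\n", t.type, t.length, t.start);  /* prints the kind and its text */
    return 0;
}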
Our mini-language ignores all whitespace, so we don't need tokens for those characters: they are irrelevant to the structure of our program, and our lexer will simply skip them.
Similarly, our lexer can discard all comments, as they aren't actual program code.
This doesn't have to be the case. Python, for example, has tokens for indentation, as it is a whitespace-sensitive language.
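As a sketch, skipping the ignored characters can be as simple as a loop that runs before each token is scanned. The comment syntax isn't specified here, so C-style // line comments are assumed purely for illustration.

#include <stdio.h>

/* Sketch: advance past whitespace and line comments before the next token. */
static const char *skip_ignored(const char *p) {
    for (;;) {
        if (*p == ' ' || *p == '\t' || *p == '\n' || *p == '\r') {
            p++;                                   /* whitespace: no token, just move on */
        } else if (p[0] == '/' && p[1] == '/') {
            while (*p != '\0' && *p != '\n') p++;  /* comment: drop the rest of the line */
        } else {
            return p;                              /* next character that matters */
        }
    }
}

int main(void) {
    const char *src = "   // set up the counter\n  int x";
    printf("next meaningful text: \"%s\"\n", skip_ignored(src));  /* "int x" */
    return 0;
}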
Then, for each token, we define a regex that matches the pattern of that token.
The regex for simple keyword tokens is just the string of characters:
int → INT
continue → CONTINUE
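Because a keyword's pattern is just its literal spelling, matching it in an implementation boils down to a string comparison. Here is a small C sketch (token kinds shown as plain strings, to keep it self-contained) that checks a word against a keyword table and falls back to IDENTIFIER:

#include <stdio.h>
#include <string.h>

/* Keyword lexemes and the token names they map to. Matching a keyword's
   "regex" is just an exact string comparison. */
static const struct { const char *lexeme; const char *token; } keywords[] = {
    {"int", "INT"}, {"float", "FLOAT"}, {"bool", "BOOL"}, {"void", "VOID"},
    {"if", "IF"}, {"else", "ELSE"}, {"while", "WHILE"}, {"for", "FOR"},
    {"return", "RETURN"}, {"break", "BREAK"}, {"continue", "CONTINUE"},
};

/* Returns the token name if `word` is a keyword, otherwise "IDENTIFIER". */
const char *classify_word(const char *word) {
    for (size_t i = 0; i < sizeof keywords / sizeof keywords[0]; i++)
        if (strcmp(word, keywords[i].lexeme) == 0)
            return keywords[i].token;
    return "IDENTIFIER";
}

int main(void) {
    printf("%s\n", classify_word("continue")); /* CONTINUE   */
    printf("%s\n", classify_word("counter"));  /* IDENTIFIER */
    return 0;
}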
Identifiers (variable and function names) and literals (values like 3.14, 9) are more complicated as we have to match a general pattern for that type of token.
[a-zA-Z_][a-zA-Z0-9_]* → IDENTIFIER
[0-9]+\.[0-9]+ → FLOAT_LITERAL
We can read the identifier regex as one letter or underscore, followed by any number of letters, digits, or underscores.
We can read the float literal regex as one or more digits, followed by a ".", followed by one or more digits.
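If you want to experiment with these patterns, the sketch below uses the POSIX regex library (available on Unix-like systems; this is just a convenient stand-in for exploration, not how our final lexer will work) to test a few strings against the identifier and float-literal regexes:

#include <regex.h>
#include <stdio.h>

/* Returns 1 if `text` matches `pattern` in its entirety, 0 otherwise. */
static int matches(const char *pattern, const char *text) {
    regex_t re;
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0) return 0;
    int ok = (regexec(&re, text, 0, NULL, 0) == 0);
    regfree(&re);
    return ok;
}

int main(void) {
    /* Anchor with ^ and $ so the whole string has to match the pattern. */
    const char *ident = "^[a-zA-Z_][a-zA-Z0-9_]*$";
    const char *flt   = "^[0-9]+\\.[0-9]+$";

    printf("%d\n", matches(ident, "my_var2"));  /* 1: valid identifier             */
    printf("%d\n", matches(ident, "9lives"));   /* 0: can't start with a digit     */
    printf("%d\n", matches(flt,   "3.14"));     /* 1: valid float literal          */
    printf("%d\n", matches(flt,   "3."));       /* 0: needs digits after the dot   */
    return 0;
}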
Then, we construct an NFA for each regex.
(NFA diagram for the keyword int)
(NFA diagram for the float literal [0-9]+\.[0-9]+)
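To get a feel for what such a machine does, here is the float-literal automaton written out by hand in C. This particular pattern happens to give a machine with at most one move per character, so the sketch is deterministic; it is meant only to show how states and transitions recognise a string, not to stand in for the general NFA construction.

#include <ctype.h>
#include <stdio.h>

/* States of the float-literal automaton [0-9]+\.[0-9]+, written out by hand. */
enum State { START, INT_PART, DOT, FRAC_PART, REJECT };

/* Returns 1 if `s` is accepted as a float literal, 0 otherwise. */
int accepts_float(const char *s) {
    enum State state = START;
    for (; *s; s++) {
        char c = *s;
        switch (state) {
        case START:
            state = isdigit((unsigned char)c) ? INT_PART : REJECT;
            break;
        case INT_PART:
            state = isdigit((unsigned char)c) ? INT_PART : (c == '.') ? DOT : REJECT;
            break;
        case DOT:
            state = isdigit((unsigned char)c) ? FRAC_PART : REJECT;
            break;
        case FRAC_PART:
            state = isdigit((unsigned char)c) ? FRAC_PART : REJECT;
            break;
        case REJECT:
            return 0;                      /* dead end: no way to recover */
        }
    }
    return state == FRAC_PART;             /* the only accepting state */
}

int main(void) {
    printf("%d %d %d\n", accepts_float("3.14"), accepts_float("42"), accepts_float(".5"));
    /* prints: 1 0 0 */
    return 0;
}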
We want our lexer to be fast.
The computers we use are deterministic machines, whilst NFAs are non-deterministic: from one state, the same input can lead to several different states.
If we're trying to match a token and we hit a dead end in the NFA, we have to backtrack and try another path, and this can be expensive.
But finite automata are simply recognisers that accept or reject a string based on its pattern: how do we get back a list of tokens from a long string of characters?
And how do we decide when to end a token? How should we lex an input like "fort"?
Should we get the keyword FOR followed by an IDENTIFIER "t", or a single IDENTIFIER "fort"?