Compiler Visualiser

Visually learn the theory behind how a compiler lexes and parses high-level code.

Input Code

Use the code editor to explore how your input code is compiled.

Don't have any ideas? Use any of the examples from the dropdown.

Guide

This mini-language has a C/Java-like program structure.

Valid data types:

  • int

  • float

  • bool

Valid operators:

  • Arithmetic (+ - / * %)

  • Boolean (&& || !)

  • Comparison (< > >= <= == !=)

Control flow structures:

  • While loops

  • For loops

  • If then else (...)

  • continue & break statements

Language semantics

  • In order to actually run, there must be a main() function defined.

  • Functions must be defined before they can be used.

  • A predefined function print(x) is supplied that can print a float, int or bool.
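
Putting these rules together, a small program in this mini-language might look like the sketch below. The function square, the variable names and the loop bounds are made up for illustration, and the declaration syntax is assumed to follow the C/Java style described above; the dropdown examples show programs that are known to run.

int square(int x) {
    return x * x;
}

void main() {
    int i;
    for (i = 0; i < 5; i = i + 1) {
        if (i % 2 == 0) {
            continue;
        }
        print(square(i));
    }
}

Note that square is defined before main, main itself is present, and output goes through the predefined print function, matching the semantics above.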

The Lexer

This first phase turns the input stream of characters into a stream of tokens by chopping the input into small pieces and ignoring unnecessary details (like whitespace).

Input code is just a long sequence of (meaningless) characters.

Though keywords in code may be distinguishable to us, internally, they're stored as a long string like this:

"

int x = 5;

"

where spaces, newlines and tabs are stored as ordinary characters: " ", "\n" and "\t".

A computer can't tell the difference between characters that carry meaning and characters that don't.
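
To make this concrete, here is a tiny Python sketch (Python is used here only as an illustration language, not necessarily what the visualiser is built with) showing that, before lexing, the source really is just a flat sequence of characters, whitespace included:

source = "int x = 5;"
# To the machine this is nothing more than a list of characters.
print(list(source))
# ['i', 'n', 't', ' ', 'x', ' ', '=', ' ', '5', ';']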

Imagine you're reading a book. The lexer (also called a scanner or tokenizer) does the job of breaking the sentences down into individual words and punctuation marks: understandable parts. The lexer gives meaning to groups of characters.

Tokenizing

The lexer reads the code you've written, character by character, and it will group characters together into meaningful chunks, discarding unnecessary details like whitespace.

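For example, a lexer for this mini-language could turn the string "int x = 5;" into a token stream along these lines. The exact representation of a token is an implementation choice; this Python-style sketch pairs each token type with the lexeme it keeps, if any:

tokens = [
    ("INT", None),
    ("IDENTIFIER", "x"),
    ("ASSIGN", None),
    ("INT_LITERAL", "5"),
    ("SEMICOLON", None),
    ("EOF", None),
]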

Step one - defining our tokens

First, we have to decide on the full list of token types for the language: its keywords, operators, literals and so on.

Here are all the possible tokens for our mini language:

  • Type declarations: INT, FLOAT, BOOL, VOID

  • Literals: INT_LITERAL, FLOAT_LITERAL, BOOL_LITERAL

  • Identifiers: IDENTIFIER

  • Keywords: IF, ELSE, WHILE, FOR, RETURN, BREAK, CONTINUE

  • Operators: PLUS, MINUS, MULT, DIV, MOD, NOT, ASSIGN, AND, OR

  • Comparison: EQUAL, NOT_EQUAL, GREATER_THAN, GREATER_THAN_OR_EQUAL, LESS_THAN, LESS_THAN_OR_EQUAL

  • Delimiters: LEFT_PAREN, RIGHT_PAREN, LEFT_BRACE, RIGHT_BRACE, SEMICOLON, COMMA

  • End of file: EOF

Our mini-language ignores all whitespace, so we don't need tokens for those characters: they are irrelevant to the structure of the program, and the lexer simply skips them.

Similarly, the lexer can discard comments, as they aren't actual program code.

This doesn't have to be the case. Python, for example, is whitespace-sensitive, so its lexer emits tokens for indentation.
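
One straightforward way to represent this fixed set of token types in code is an enumeration. The sketch below uses Python purely as an illustration language (it is not necessarily how the visualiser itself is implemented) and lists only a sample of the tokens above:

from enum import Enum, auto

class TokenType(Enum):
    # Type declarations
    INT = auto()
    FLOAT = auto()
    BOOL = auto()
    VOID = auto()
    # Literals and identifiers
    INT_LITERAL = auto()
    FLOAT_LITERAL = auto()
    BOOL_LITERAL = auto()
    IDENTIFIER = auto()
    # A few of the keywords, operators and delimiters
    IF = auto()
    WHILE = auto()
    PLUS = auto()
    ASSIGN = auto()
    SEMICOLON = auto()
    EOF = auto()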

Step two - Regex

(What is regex?)

Then, for each token, we define a regex that matches the pattern of that token.

The regex for simple keyword tokens is just the string of characters:

  • int → INT

  • continue → CONTINUE

Identifiers (variable and function names) and literals (values like 3.14, 9) are more complicated as we have to match a general pattern for that type of token.

  • [a-zA-Z_][a-zA-Z0-9_]* → IDENTIFIER

  • [0-9]+\.[0-9]+ → FLOAT_LITERAL

We can read the identifier regex as one letter or underscore, followed by any number of letters, digits or underscores.

We can read the float literal regex as one or more digits, followed by a ".", followed by one or more digits.
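
These patterns can be tried out directly with Python's re module. This is just a sketch of what the two regexes accept and reject, not the visualiser's own matching code:

import re

IDENTIFIER    = re.compile(r"[a-zA-Z_][a-zA-Z0-9_]*")
FLOAT_LITERAL = re.compile(r"[0-9]+\.[0-9]+")

print(IDENTIFIER.fullmatch("counter") is not None)   # True
print(IDENTIFIER.fullmatch("9lives") is not None)    # False: cannot start with a digit
print(FLOAT_LITERAL.fullmatch("3.14") is not None)   # True
print(FLOAT_LITERAL.fullmatch("314") is not None)    # False: no decimal point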

Step three - Conversion to NFA

(What's that?)

Then, for each regex, we construct an NFA (a non-deterministic finite automaton) that recognises the same pattern.

int

[NFA diagram: q₀ →i→ q₁ →n→ q₂ →t→ q₃, with q₃ accepting]

Float literal: [0-9]+\.[0-9]+

[NFA diagram: states q₀ to q₃, with [0-9] self-loops before and after a single "." transition and q₃ accepting]
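
To see what such a machine looks like in code, here is a hand-written Python sketch of the float-literal NFA above (the state names follow the diagram; the nfa_accepts helper is made up for illustration). An NFA may be in several states at once, so the simulation tracks a set of current states:

# state -> {symbol: set of next states}; "DIGIT" stands for any character 0-9.
NFA = {
    "q0": {"DIGIT": {"q1"}},
    "q1": {"DIGIT": {"q1"}, ".": {"q2"}},
    "q2": {"DIGIT": {"q3"}},
    "q3": {"DIGIT": {"q3"}},
}
ACCEPTING = {"q3"}

def nfa_accepts(text):
    states = {"q0"}
    for ch in text:
        symbol = "DIGIT" if ch.isdigit() else ch
        states = {nxt for s in states for nxt in NFA.get(s, {}).get(symbol, set())}
    return bool(states & ACCEPTING)

print(nfa_accepts("3.14"))   # True
print(nfa_accepts("314"))    # False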

Step four - Conversion to DFA

We want our lexer to be fast.

The computers we use are deterministic machines, whilst NFAs are non-deterministic because we can travel to multiple states at once.

If we're trying to match a token and we hit a dead end in the NFA, we have to backtrack and try another path, which can be expensive. So we convert each NFA into an equivalent DFA (for example, via the subset construction): in a DFA every state has at most one transition per input character, so the lexer never has to guess or backtrack.
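
A minimal sketch of the resulting deterministic machine for the float-literal pattern is shown below (the state names d0 to d3 are arbitrary; because this particular NFA already had no choice points, its DFA has the same shape, but in general the subset construction can merge sets of NFA states into single DFA states):

# (state, symbol) -> next state; each input leads to at most one state.
DFA = {
    ("d0", "DIGIT"): "d1",
    ("d1", "DIGIT"): "d1",
    ("d1", "."): "d2",
    ("d2", "DIGIT"): "d3",
    ("d3", "DIGIT"): "d3",
}
ACCEPTING = {"d3"}

def dfa_accepts(text):
    state = "d0"
    for ch in text:
        symbol = "DIGIT" if ch.isdigit() else ch
        state = DFA.get((state, symbol))
        if state is None:          # no transition: reject immediately
            return False
    return state in ACCEPTING

print(dfa_accepts("3.14"))   # True
print(dfa_accepts("3.1.4"))  # False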

Step five - traversing the DFA

But finite automata are simply recognisers that accept or reject a string based on its pattern: how do we get back a list of tokens from a long string of characters?

And how do we decide when to end? How should we lex an input like "fort"?
Should we get <FOR> followed by <IDENTIFIER, t>, or a single <IDENTIFIER, fort>?
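
Most lexers resolve this with the longest-match rule, often called maximal munch: keep reading characters while some token could still match, and only emit a token once no longer match is possible, preferring keywords over identifiers when the lengths tie. Under that rule, "fort" lexes as the single token <IDENTIFIER, fort>. Here is a rough Python sketch of the idea (the pattern table and the longest_match helper are illustrative, not the visualiser's actual code):

import re

# Keyword patterns are listed before IDENTIFIER so a keyword wins on equal length.
TOKEN_PATTERNS = [
    ("FOR", re.compile(r"for")),
    ("IDENTIFIER", re.compile(r"[a-zA-Z_][a-zA-Z0-9_]*")),
]

def longest_match(text, pos):
    best = None
    for name, pattern in TOKEN_PATTERNS:
        m = pattern.match(text, pos)
        if m and (best is None or m.end() > best[1]):
            best = (name, m.end())
    return best

print(longest_match("fort", 0))   # ('IDENTIFIER', 4) -> <IDENTIFIER, fort>
print(longest_match("for ", 0))   # ('FOR', 3) -> <FOR>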

Ready to lex