Regular expressions, context-free grammars, and scoping

Today we’re finishing up syntax. Repeating some things from recitation. Then moving on to some semantics.

Left off last time talking about regexes.

Regular expressions

Regex        Set of strings it defines
a            The set {a}: just the one-character string a.
rs           All strings formed by concatenating a string from r with a string from s. If r defines the set {a, b}, and s defines the set {c, d}, then rs defines {ac, ad, bc, bd}.
r|s          The union of the sets defined by r and s. Same example as above: r|s defines {a, b, c, d}.
r*           All strings formed by 0 or more concatenations of strings in r. If r defines {a}, then r* defines {ɛ, a, aa, aaa, ...}.
[a-b c-d]    Shorthand for a single character whose ASCII value is between a and b, or between c and d. [0-9] denotes the digits, and [A-Za-z] denotes all uppercase and lowercase Latin letters.

There are many extensions to regular expressions, but the prof wants us to stick to these simple rules.

In programming languages, identifiers are often specified with [A-Za-z][A-Za-z0-9]*, meaning that identifiers can contain letters and digits and can be of any length, but cannot start with a digit.

Other examples are integer literals [0-9][0-9]*, floating-point literals [0-9][0-9]*.[0-9]*, or scientific notation (ɛ|+|-)[0-9][0-9]*.[0-9]*E(+|-)[0-9][0-9]*. Remember that ɛ denotes the empty string.
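These translate almost directly into Python's re syntax (my own sketch, not from the lecture; Python writes the (ɛ|+|-) alternation as [+-]? and needs the dot escaped):

import re

# re.fullmatch succeeds only if the whole string matches the regex.
ident = re.compile(r"[A-Za-z][A-Za-z0-9]*")
sci   = re.compile(r"[+-]?[0-9][0-9]*\.[0-9]*E[+-][0-9][0-9]*")

assert ident.fullmatch("x1")
assert not ident.fullmatch("1x")   # identifiers can't start with a digit
assert sci.fullmatch("6.02E+23")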

Another shorthand:

Regex        Set of strings it defines
r+           Strings formed by 1 or more concatenations of strings in r; equivalent to rr*. See the number literals above.

We can use that too.

But remember that regexes can't do nesting; ex. they can't check whether parentheses are balanced. For that we need a context-free grammar.

Context-free grammar

Consists of two kinds of symbols: non-terminals (usually written in uppercase letters) and terminals (usually in lowercase).

Terminals are the things you actually see in the program, specific identifiers or operators. Non-terminals are categories of things, ex. in English they’d be things like noun, verb, and so on.

One of your non-terminals is special, called the start non-terminal. And then you have a set of rules for converting non-terminals into a string of terminals and/or non-terminals.

These rules are also called productions.

Example

Non-terminals: {Sentence, Subject, Verb, Object, Noun}

Terminals: {john, sally, saw, met}

Rules:

Sentence -> Subject Verb Object
Verb -> saw
Verb -> met
Object -> Noun
Subject -> Noun
Noun -> john
Noun -> sally

Sentence is the start non-terminal.

To make a derivation

Start with the start non-terminal, then repeatedly apply the rules until the string contains only terminals.

Example derivation

Sentence => Subject Verb Object
         => Subject met Object
         => Noun met Object
         => Noun met Noun
         => john met Noun
         => john met sally
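To make the "repeatedly apply the rules" procedure concrete, here's a small Python sketch (mine, not the lecture's) that performs random derivations from this grammar:

import random

# Each non-terminal maps to the right-hand sides of its productions.
RULES = {
    "Sentence": [["Subject", "Verb", "Object"]],
    "Subject":  [["Noun"]],
    "Object":   [["Noun"]],
    "Verb":     [["saw"], ["met"]],
    "Noun":     [["john"], ["sally"]],
}

def derive(symbols):
    if all(s not in RULES for s in symbols):      # only terminals left: done
        return symbols
    i = next(i for i, s in enumerate(symbols) if s in RULES)
    expansion = random.choice(RULES[symbols[i]])  # apply one production
    return derive(symbols[:i] + expansion + symbols[i + 1:])

print(" ".join(derive(["Sentence"])))             # ex. "john met sally"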

A context-free grammar is called that because you don't have to know the context around something like Verb in order to know what you can replace it with. CFGs aren't sufficient for English grammar (there are more powerful grammars that can handle things like that), but what they are really good at is nesting.

CFGs and nesting example

Let’s make a CFG for the set of all strings containing balanced parentheses. Some examples from the set: {ɛ, (), (()), ()(), (()(())), ...}

This’ll be simple. Our symbols and rules:

S -> ɛ
S -> (S)
S -> SS

That’s it. That’s the whole grammar.

It’s kind of a pain to write three separate lines though. So here’s a shorthand to write it all in one line:

S -> ɛ | (S) | SS
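As an aside (mine, not the lecture's): checking membership in this language only takes a depth counter, but the depth is unbounded, which is exactly what a regex can't keep track of:

def balanced(s):
    depth = 0
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:       # a ')' with no matching '('
                return False
        else:
            return False        # only parentheses allowed
    return depth == 0           # every '(' was closed

assert balanced("(()(()))") and not balanced("())(")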

Typical example of a question he might ask us: he gives us a string, and asks us to show a derivation of that string using a given grammar.

For example: what derivation gives ((()()))?

S => (S) => ((S)) => ((SS)) => (((S)S)) => ((()S)) => ((()(S))) => ((()()))

That is a derivation: a sequence of steps that starts at the start non-terminal, and ends when the string contains no more non-terminals.

Parse tree

A parse tree is a data structure that is a graphical representation of a derivation. The root is the start non-terminal, an internal node is a non-terminal, and a leaf is a terminal.

      S
     /|\
    ( S )
     /|\
    ( S )
     / \
   S     S
  /|\   /|\
 ( S ) ( S )
   |     |
   ɛ     ɛ

These are important because this is the data structure that the compiler generates when parsing your program. The parser does the opposite of a derivation: it starts from the string of terminals and works its way up the tree.
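Here's a sketch of such a parser for the parentheses grammar (my own illustration; to parse deterministically it uses S -> (S) S | ɛ, an equivalent but unambiguous form of the grammar). It builds the tree as nested tuples:

def parse(s):
    tree, i = parse_S(s, 0)
    if i != len(s):
        raise ValueError("unbalanced input")
    return tree

def parse_S(s, i):
    # S -> (S) S | ɛ  (equivalent to ɛ | (S) | SS, but unambiguous)
    children = []
    while i < len(s) and s[i] == "(":
        inner, i = parse_S(s, i + 1)          # the nested S
        if i == len(s) or s[i] != ")":
            raise ValueError("missing ')'")
        children.append(("(", inner, ")"))    # one (S) node
        i += 1                                # consume ')'
    return children, i

print(parse("(()())"))  # [('(', [('(', [], ')'), ('(', [], ')')], ')')]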

Example: arithmetic expressions

More rules:

Exp -> Exp + Exp | Exp * Exp | Var | Num
where Var = [A-Za-z][A-Za-z0-9]*
      Num = [0-9][0-9]*

Exp is the start symbol. An example derivation:

Exp => Exp + Exp
    => Exp + Exp * Exp
    => x + Exp * Exp
    => x + y * Exp
    => x + y * z

But what does the parse tree look like for this?

          Exp
        /  |  \
      Exp  +  Exp
       |    /  |  \
       x   y   *   z

There’s a problem with this grammar. This parse tree is equally valid:

          Exp
        /  |  \
      Exp  *  Exp
    /  |  \    |
   x   +   y   z

So there’s ambiguity, to the point where you’d get two different answers from these two different interpretations of the expression.

Any grammar where you can derive two distinct parse trees from the same string is an ambiguous grammar.

What are the ways to handle these ambiguities? One way is to introduce rules outside the grammar, by defining precedence, i.e. defining which rule should be applied first. For example, in most programming languages multiplication has higher precedence than addition.

The other option is to change the grammar so it’s no longer ambiguous. For example, an unambiguous version of that grammar could be:

Exp -> Term + Term | Factor
Term -> Factor * Factor | Var | Num
Factor -> ??????

Prof says this might be slightly wrong; he's doing this from memory. He says Paul will give the corrected one.

But this would give the following derivation:

Exp => Term + Term
    => Var + Factor * Factor
    => (... skipping a few steps ...)
    => Var + Var * Var
    => x + y * z

But this way there’s only one parse tree that comes out of this:

           Exp
         /  |  \
      Term  +  Term
       |      /  |  \
       x  Factor * Factor
             |        |
             y        z

Or something along those lines.
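For reference, the usual textbook version of an unambiguous grammar for this language (which may or may not be exactly what Paul gives) is:

Exp -> Exp + Term | Term
Term -> Term * Factor | Factor
Factor -> Var | Num

The idea is the same: + can only appear at the Exp level and * only at the Term level, so * always ends up deeper in the tree.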

But this tends to make your grammar more complicated, so most programming language implementations just use precedence rules instead.
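As a sketch of how the precedence approach looks in code (my illustration, not the course's parser), a "precedence-climbing" parser keeps the ambiguous grammar and consults a precedence table instead; it assumes well-formed input:

import re

PREC = {"+": 1, "*": 2}                       # higher binds tighter

def tokenize(src):
    return re.findall(r"[A-Za-z][A-Za-z0-9]*|[0-9]+|[+*]", src)

def parse_exp(tokens, min_prec=1):
    left = tokens.pop(0)                      # Var or Num
    while tokens and PREC.get(tokens[0], 0) >= min_prec:
        op = tokens.pop(0)
        # parse the right side at a higher minimum precedence,
        # so "*" groups before "+" does
        right = parse_exp(tokens, PREC[op] + 1)
        left = (op, left, right)
    return left

print(parse_exp(tokenize("x + y * z")))       # ('+', 'x', ('*', 'y', 'z'))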

And that’s all we’re going to talk about in terms of syntax.

Syntax: summary

Use regular expressions for the words of the program. Use context-free grammars to define the hierarchy of constructs in the language.

Semantics

But syntax is not enough; you also need to talk about what a program actually means.

There are two kinds of semantics:

  • Static semantics: rules that govern the use of types in a program.

  • Dynamic semantics: defines the behaviour of valid strings in the language.

Recap: phases of a compiler

  1. Lexical analysis, or “lexing”, done by the “lexer”. Converts characters into words. (I.e. tokens, converting “for” into the token for, etc.) Runs on regexes, effectively. See also: lex or flex. (There’s a toy version in Python after this list.)

  2. Parsing, or “syntactic analysis”. Done by the “parser”. Groups words into the “phrases” of your program, where “phrases” are your statements, procedures, etc. Is based on the CFG you use to define your language. See also: yacc or bison.

  3. Type checking. Making sure that you’re using types correctly, and you’re not trying to, ex. multiply two hashmaps together or whatever. Done by the type checker.

  4. Code generation, done by the code generator. Takes the internal representation of the program and generates the target language (machine code, JavaScript, etc.)

  5. Optimization, often done before, after, or mixed in with code generation. Creates improvements on the generated code. Ex. removes code that can never be executed, replaces sections of code with code that’s equivalent but more efficient, etc.
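Here's the toy lexer promised above (my sketch; real lexers also handle keywords, comments, and errors). Each token kind is a regex, glued together with named groups:

import re

TOKEN_SPEC = [
    ("IDENT", r"[A-Za-z][A-Za-z0-9]*"),  # identifiers (keywords included)
    ("NUM",   r"[0-9][0-9]*"),           # integer literals
    ("OP",    r"[+*=();{}]"),            # one-character operators/punctuation
    ("SKIP",  r"\s+"),                   # whitespace: matched but not emitted
]
MASTER = re.compile("|".join("(?P<%s>%s)" % pair for pair in TOKEN_SPEC))

def lex(src):
    for m in MASTER.finditer(src):
        if m.lastgroup != "SKIP":
            yield (m.lastgroup, m.group())

print(list(lex("x1 + 42 * y")))
# [('IDENT', 'x1'), ('OP', '+'), ('NUM', '42'), ('OP', '*'), ('IDENT', 'y')]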

In a compiler, the type checking is done during compilation. In an interpreter, it’s done during execution. This is the difference between static and dynamic type checking: static is during compilation, dynamic is during execution.

So some languages, like C, have static type checking. Others, like Scheme or Python, have dynamic type checking.
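A tiny Python sketch of what "during execution" means (mine, not the lecture's): nothing complains about the bad call below until the line actually runs.

def double(x):
    return x * 2

print(double(21))     # 42, checked (and fine) at run time
print(double("ha"))   # "haha": * happens to mean repetition for strings
print(double(None))   # TypeError raised here, during execution,
                      # not when the file was compiled/loaded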

Scoping

Scoping is the process of associating the names in your program with the things those names represent.

The “names” are things like x and foo and cos and Integer.

The “things” can be memory locations, values, types, procedures, etc.

Example, in C:

int x;
float y;
f(x, y); // Scoping rules will say that the x and y here
         // are the same as the ones above.

Some definitions:

Scope: the portion of the program in which a name is visible.

Block: the syntax used in a language to define a scope. Ex. in C:

{
	int z;
}

This is the block that defines the scope of z.

In many languages a block can be defined by a procedure. Ex. in Ada:

procedure f is
	x: integer;
	begin
		...code...
	end

Where x here is visible only between the begin and the end.

Block-structured language: a language where the blocks (including procedures) can be nested arbitrarily. (Note that because this includes procedures, and C doesn't allow procedure definitions to be nested, C doesn't count as a “true” block-structured language.)

Let’s show an example with made-up syntax:

procedure f is
	x: integer;
	y: integer;
	procedure g is
		y: integer; // This shadows the variable y above
		begin
			y := x + 3 // Note that x is still visible,
			// because g is inside f's scope. However, the
			// y here is in g's scope.
		end
	begin
		x := 4
		y := 8 // This, however, refers to f's y, because
		// it's outside g's scope.
	end

The rule for resolving variables is to look at the inner scope first. If you don’t find a definition, move one scope up, and repeat.

In short: the declaration matching the use of a variable is the declaration of the innermost surrounding scope (in a block-structured language).
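The made-up example above translates almost directly into Python, which is block-structured in this sense (my sketch, not from the lecture):

def f():
    x = 4
    y = 8               # f's y
    def g():
        y = x + 3       # a new y, local to g: it shadows f's y.
                        # x is still visible, because g is nested in f.
        return y
    assert g() == 7     # g's y was x + 3
    assert y == 8       # f's y is untouched

f()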

However, sometimes it’s not so simple. But that’s for next lecture.
