COS 126 Lecture 16: Formal Languages

NOTE: See also notes written by Kevin Wayne

An alphabet is a finite, non-empty set of symbols. We've been looking at the alphabet consisting of the two symbols, 0 and 1.

Strings are an ordered, finite set of elements from the alphabet (including the empty set.)

A language is a subset of all the possible strings over an alphabet. For the binary alphabet, every binary number is a string over the alphabet. (We say 'over' the alphabet to mean that the elements of the string consist only of alphabet symbols.) A language is a particular subset of these binary strings, such as all 4-bit strings, or all strings with an even number of 0's, or all strings with more 1's than 0's.

This slide summarizes the concepts introduced in lecture 15.


slide 2 - Grammars

A particular grammar is defined by its nonterminals, terminals, start symbol, and productions.

Think of nonterminals as a set of variables. In the lecture notes, they will be indicated by < and > brackets. Any word within < and > is considered to be a nonterminal. In exercises and precept, you will often see nonterminals specified as capital letters.

The terminals of a grammar are the set of characters that make up the alphabet of the corresponding language. Often, we will look at languages with limited alphabets, like binary numbers. In that case, there will be exactly two terminals: 0 and 1.

The start symbol will be one particular non-terminal. Whenever you generate a string in the language using a grammar, you will start with this start symbol. If none is explicitly indicated, you can usually assume that the letter 'S' indicates the start symbol.

Productions are the 'rules' of the grammar. You will usually see notation with arrows: -> or colon-equals: :=. On the left-hand side of each production, there should be at least one non-terminal. Sometimes, you will see multiple productions separated by '|'. This is just shorthand:

A := a | aA | aB

is equivalent to the three production rules:

A := a
A := aA
A := aB

Basically, what you do to generate strings in the language using a grammar is begin with the start symbol. Next, use one of the productions to replace the start symbol with something else (find a rule that has the start symbol on the left hand side. Replace the start symbol with whatever is on the right hand side of this rule.) You repeatedly replace non-terminals (and sometimes terminals) of the string you are working on with the productions available to you, until your string has only terminals. Proceeding in this manner, any strings of terminals you can generate with the grammar are said to be 'in the language' defined by the grammar. If this seems kind of vague now, don't worry too much. This is a topic that's probably best learned through examples and practice.