# Chomsky Normal Form

The Chomsky Normal Form (CNF), proposed by the linguist Noam Chomsky, is a normalized version of the CFG with a standard set of rules defining how production rule must be written.

A grammar where every production is either of the form A → BC or A → c (where A, B, C are arbitrary variables and can arbitrary symbol).

Example:

S → AS | a
A → SA | b

(If a language contains ε, then we allow S → ε where S is a start symbol, and forbid S on RHS.)

Why Chomsky Normal Form?

The key advantage is that in Chomsky Normal Form, every derivation of a string of n letters has exactly 2n − 1 steps.

Thus: one can determine if a string is in the language by exhaustive search of all derivations.

Conversion to CNF.

The conversion to Chomsky Normal Form has four main steps:

1. Get rid of all ε productions.
2. Get rid of all productions where RHS is one variable.
3. Replace every production that is too long by shorter productions.
4. Move all terminals to productions where RHS is one terminal.

Eliminate ε Productions.

Determine the nullable variables (those that generate ε) (algorithm given earlier).
Go through all productions, and for each, omit every possible subset of nullable variables.
For example, if P → AxB with both A and B nullable, add productions P → xB | Ax | x.
After this, delete all productions with empty RHS.

Eliminate Variable Unit Productions

Unit production is where RHS has only one symbol.
Consider production A → B. Then for every production B → α, add the production A → α. Repeat until done (but don’t re-create a unit production already deleted).

Replace Long Productions by Shorter Ones.

For example, if have production A → BCD, then replace it with A → BE and E → CD.
(In theory, this introduces many new variables, but one can re-use variables if careful.)

Move Terminals to Unit Productions.

For every terminal on the right of a non-unit production, add a substitute variable.
For example, replace production A → bC with productions A → BC and B → b.

Example.

Consider the CFG:
S → aXbX
X → aY | bY | ε
Y → X | c
The variable X is nullable, and so, therefore, is Y.
After elimination of ε, we obtain:
S → aXbX | abX | aXb | ab
X → aY | bY | a | b
Y → X | c

Example: Step 2.

After elimination of the unit production Y → X,
we obtain:
S → aXbX | abX | aXb | ab
X → aY | bY | a | b
Y → aY | bY | a | b | c

Example: Steps 3 & 4

Now, break up the RHSs of S; and replace a by A,
b by B and c by C wherever not units:
S → EF | AF | EB | AB
X → AY | BY | a | b
Y → AY | BY | a | b | c
E → AX
F → BX
A → a
B → b
C → c

Summary

There are special forms for CFGs such as Chomsky Normal Form, where every production has the form A → BC or A → c. The algorithm to convert to this form involves (1) determining all nullable variables and getting rid of all ε-productions, (2) getting rid of all variable unit productions, (3) breaking up long productions, and (4) moving terminals to unit productions.   