Grammars in Compiler Design



When we write code, the compiler must determine whether it is valid or not. To do this, it depends on syntax analysis, which in turn relies on grammars.

Grammars, or formal grammars, define the rules that specify the syntax of a programming language. They dictate how individual elements, such as keywords and operators, combine to form valid statements and expressions. In this article, we will explore the concept of grammars, their structure, and their classification.

What are Grammars?

At a high level, a grammar is a formal system of rules used to describe the structure of a language. These rules define which strings of symbols (or tokens) are valid within that language.

From our knowledge of formal languages, we know that grammars are critical for programming languages because they let the compiler verify that written code follows a specific syntax.

Formally, we can say a grammar consists of four components.

  • Terminal Symbols − The basic symbols that form the strings of the language, such as letters, digits, and operators (for example, {0, 1} or {a, b}).
  • Nonterminal Symbols − Symbols that represent groups of terminals or other non-terminals; they give the rules their structure.
  • Productions − Rules that describe how a nonterminal can be replaced by a string of terminals and/or non-terminals.
  • Starting Nonterminal − A designated nonterminal symbol from which every derivation starts.

For example, in a programming language, terminal symbols could be any characters like +, -, a, b, or digits. Non-terminals might represent expressions, terms, or statements.
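These four components can be modeled directly as data. The following is a minimal Python sketch (the dictionary layout and names are illustrative, not from any standard library), using a small grammar over {0, 1}:

```python
# A minimal sketch of the four components as plain Python data.
# The dictionary layout is illustrative, not from any standard library.
grammar = {
    "terminals": {"0", "1"},
    "nonterminals": {"S"},
    "productions": {
        "S": ["0S0", "1S1", "0", "1"],  # S -> 0S0 | 1S1 | 0 | 1
    },
    "start": "S",
}

# Sanity check: every right-hand-side symbol is a terminal or a nonterminal.
symbols = grammar["terminals"] | grammar["nonterminals"]
for alternatives in grammar["productions"].values():
    for rhs in alternatives:
        assert all(ch in symbols for ch in rhs)
```

Parser generators accept essentially this information, written in their own notation, and build a syntax analyzer from it.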

How Do Grammars Work?

Unlike the descriptive grammars of natural languages, a formal grammar generates the strings of its language: it starts with the starting nonterminal and repeatedly applies production rules until only terminal symbols remain. This sequence of steps is called a derivation.

Consider the following example grammar, G1 −

G1:
	S → 0S0  
	S → 1S1  
	S → 0  
	S → 1  

This grammar defines the odd-length palindromes over the alphabet {0, 1} − strings that read the same backward and forward.

Example of Derivation

Let us derive the string 0010100 using G1

Start with S (the starting nonterminal).

  • Apply rule 1: S → 0S0
  • Apply rule 1 again: 0S0 → 00S00
  • Apply rule 2: 00S00 → 001S100
  • Apply rule 3: 001S100 → 0010100

Thus, 0010100 is part of the language generated by G1.
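This derivation can be replayed mechanically by textual substitution. The sketch below numbers the rules as in G1 and replaces the leftmost S at each step (the function name is illustrative):

```python
# Replaying the derivation of 0010100 by textual substitution.
# Rules are numbered as in G1: 1: S->0S0, 2: S->1S1, 3: S->0, 4: S->1.
RULES = {1: "0S0", 2: "1S1", 3: "0", 4: "1"}

def derive(rule_sequence, start="S"):
    """Apply each rule in turn, replacing the leftmost S."""
    sentential = start
    for r in rule_sequence:
        sentential = sentential.replace("S", RULES[r], 1)
    return sentential

print(derive([1, 1, 2, 3]))  # 0010100
```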


Classification of Grammars

In 1959, Noam Chomsky introduced a classification system for grammars based on their complexity and expressive power, now known as the Chomsky hierarchy. It consists of four types of grammars. We will study this classification in more detail later; here is a basic overview.

Type 0: Unrestricted Grammars

Rules can have any string of terminals and non-terminals on either side of the production arrow, provided the left-hand side contains at least one nonterminal. For example,

Rule: SaB → cS 

These grammars can describe exactly the languages that a Turing machine can recognize (the recursively enumerable languages).

Type 1: Context-Sensitive Grammars

The rules are of the form αAγ → αβγ, where A is a nonterminal and α, β, γ are strings of terminals and non-terminals (with β non-empty). Here, the nonterminal A can only be replaced when it appears in the context of α and γ. For example,

Rule: SaB → caB

Type 2: Context-Free Grammars (CFGs)

Here the rules are of the form A → α, where A is a single nonterminal and α is any string of terminals and non-terminals. CFGs are widely used to define the syntax of programming languages. For example,

Rule: S → aSb 

Type 3: Right-Linear Grammars

Here the rules are of the form A → aB or A → a, where A and B are non-terminals and a is a terminal. These grammars generate the regular languages and are used in lexical analysis to define tokens such as identifiers and constants.
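As an illustration, a hypothetical right-linear grammar for identifiers, such as I → letter R with R → letter R | digit R | ε, can be simulated with a simple scan. This is a sketch, not a production lexer:

```python
# A hypothetical right-linear grammar for identifiers, simulated directly:
#   I -> letter R        (an identifier starts with a letter)
#   R -> letter R | digit R | ε
def is_identifier(s):
    if not s or not s[0].isalpha():
        return False  # the first symbol must be a letter (rule I -> letter R)
    # R loops over letters and digits, then finishes with R -> ε
    return all(ch.isalnum() for ch in s[1:])

print(is_identifier("x1"), is_identifier("1x"))  # True False
```

Because right-linear rules consume one terminal at a time and remember only the current nonterminal, they correspond directly to finite automata, which is why they fit lexical analysis so well.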

Example of Grammars

Let us look at two grammars from the text −

Grammar G1: Palindromes

This grammar generates palindromes over {0, 1} −

G1:
	S → 0S0  
	S → 1S1  
	S → 0  
	S → 1  

Language: {0, 1, 000, 010, 101, 111, 00000, …} − all odd-length palindromes over {0, 1}

Grammar G2: Balanced Strings of a and b

This grammar generates strings in which the number of a's equals the number of b's, and all a's come before the b's −

G2:
	S → ASB  
	S → ε  
	A → a  
	B → b  

Language: {ε, ab, aabb, aaabbb, …} = {aⁿbⁿ | n ≥ 0}

Derivation Example for aabb

  • Start with S
  • Apply rule 1: S → ASB
  • Apply rule 1 again: ASB → AASBB
  • Apply rule 2 (S → ε): AASBB → AABB
  • Apply rules 3 and 4 repeatedly: AABB → aABB → aaBB → aabB → aabb
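Membership in the language of G2 can also be checked programmatically. The sketch below mirrors the rule S → ASB (with A → a and B → b) by peeling one a from the front and one b from the back:

```python
# A recursive check that mirrors G2: S -> ASB (A -> a, B -> b) or S -> ε.
def in_G2(s):
    if s == "":
        return True  # rule S -> ε
    if s.startswith("a") and s.endswith("b"):
        return in_G2(s[1:-1])  # peel one a and one b, recurse on the inner S
    return False

print([w for w in ["", "ab", "aabb", "abab", "aab"] if in_G2(w)])
```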

Context-Free Grammars in Programming Languages

In compiler design and syntax analysis, we generally use context-free grammars (CFGs). Most programming languages are defined using CFGs because they provide a balance between simplicity and expressive power.

CFGs are ideal for describing −

  • Arithmetic expressions (E → E + T | T)
  • Control structures (S → if E then S else S)
  • Function definitions (F → def id(params) { S })
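To sketch how such a CFG drives parsing, the left-recursive rule E → E + T | T can be rewritten as E → T ("+" T)* so a top-down (recursive-descent) parser can handle it. The function and token names below are illustrative:

```python
# Sketch: the left-recursive rule E -> E + T | T rewritten as E -> T ("+" T)*
# so a recursive-descent parser can handle it. Names are illustrative.
def parse_term(tokens):
    # T here is just a number; a fuller grammar would recurse into factors.
    return int(tokens[0]), tokens[1:]

def parse_expr(tokens):
    value, rest = parse_term(tokens)
    while rest and rest[0] == "+":  # the ("+" T)* loop
        rhs, rest = parse_term(rest[1:])
        value += rhs
    return value, rest

print(parse_expr(["1", "+", "2", "+", "3"]))  # (6, [])
```

The rewrite matters because a naive top-down parser applied to E → E + T would call itself on E forever without consuming a token.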

The rules of a CFG can also be written in a notation called Backus-Naur Form (BNF), which is widely used in language design.
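For example, the arithmetic-expression rule shown above could be written in BNF as follows (the extra term and factor levels are illustrative; they are the conventional way to encode operator precedence):

```
<expr>   ::= <expr> "+" <term> | <term>
<term>   ::= <term> "*" <factor> | <factor>
<factor> ::= "(" <expr> ")" | NUMBER
```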

Advantages of Using Grammars

Following are the advantages of using grammars −

  • Clarity − Grammars provide an easier and formal way to specify the syntax of a language.
  • Error Detection − They help compilers detect syntax errors during parsing.
  • Automation − Tools like parser generators use grammars to automatically create syntax analyzers.

Challenges with Grammars

Alongside these advantages, grammars also have some limitations.

  • Ambiguity − Some grammars can produce multiple derivation trees for the same string, leading to confusion.
  • Complexity − Context-sensitive grammars are powerful but impractical for most compiler designs due to their complexity.
  • Limitations − Certain aspects of programming languages, like semantic rules, cannot be captured by grammars alone.

Conclusion

In this article, we explored the concept of grammars and their significance in compiler design. We began by understanding what grammars are and how they generate strings in a language.

Through examples like G1 and G2, we saw how grammars define palindromes and balanced strings of "a" and "b". We also gained an overview of Chomsky's classification of grammars and saw why syntax analysis focuses on context-free grammars. Ultimately, grammars form the foundation of how programming languages are defined and interpreted.
