
DEFINING LANGUAGES BY ANOTHER NEW METHOD

We wish now to be very careful about the phrases we use to define languages. We defined L1 in Chapter 2 by the symbols:

L1 = { x^n for n = 1 2 3 . . . }

and we presumed that we all understood exactly which values n could take. We might even have defined the language L2 by the symbols:

L2 = { x^n for n = 1 3 5 7 . . . }

and again we could presume that we all agree on what words are in this language.

We might define a language by the symbols:

L5 = { x^n for n = 1 4 9 16 . . . }

but now the symbols are becoming more of an IQ test than a clear definition.

What words are in the language

L6 = { x^n for n = 3 4 8 22 . . . } ?

Perhaps these are the ages of the sisters of Louis XIV when he assumed the throne of France. More precision and less guesswork are required, especially where computers are concerned. In this chapter, we shall develop some new language-defining symbolism that will be much more precise than the ellipsis.

Let us reconsider the language L4 of Chapter 2:

L4 = { Λ x xx xxx xxxx . . . }

In that chapter, we presented one method for indicating this set as the closure of a smaller set.

Let S = {x}. Then L4 = S*.

As shorthand for this, we could have written

L4 = {x}*

We now introduce the use of the Kleene star applied not to a set, but directly to the letter x and written as a superscript as if it were an exponent:

x*

The simple expression x* will be used to indicate some sequence of x's (maybe none at all). This x is intentionally written in boldface type to distinguish it from an alphabet character.

We can think of the star as an unknown power or undetermined power. That is, x* stands for a string of x's, but we do not specify how many. It stands for any string of x's in the language L4.

The star operator applied to a letter is analogous to the star operator applied to a set. It represents an arbitrary concatenation of copies of that letter (maybe none at all). This notation can be used to help us define languages by writing

L4 = language(x*)

Since x* is any string of x's, L4 is then the set of all possible strings of x's of any length (including Λ).

We should not confuse x*, which is a language-defining symbol, with L4, which is the name we have given to a certain language. This is why we use the word "language" in the equation. We shall soon give a name to the world in which this symbol x* lives, but not quite yet.

Suppose that we wished to describe the language L over the alphabet Σ = {a b}, where

L = { a ab abb abbb abbbb . . . }

We could summarize this language by the English phrase "all words of the form one a followed by some number of b's (maybe no b's at all)."

Using our star notation and boldface letters, we may write

L = language(a b*)

or without the space

L = language(ab*)

The meaning is clear: This is a language in which the words are the concatenation of an initial a with some or no b's (i.e., b*).
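As a rough illustration, the expression ab* can be tried out with Python's `re` module, where the star has the same meaning. The helper name `in_language` and the sample words are assumptions made for this sketch; `re.fullmatch` is used so that the whole word, not just a prefix, must fit the expression.

```python
import re

# Cohen's ab* happens to be written the same way in Python's regex syntax.
AB_STAR = re.compile(r"ab*")

def in_language(word: str) -> bool:
    """True when the whole word is one a followed by zero or more b's."""
    return AB_STAR.fullmatch(word) is not None

# Hypothetical sample words; "" plays the role of the null string.
samples = ["a", "ab", "abbb", "", "ba", "aab"]
members = [w for w in samples if in_language(w)]
print(members)  # ['a', 'ab', 'abbb']
```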

Whether we put a space inside ab* or not is only for the clarity of reading: it does not change the set of strings this represents. No string can contain a blank unless a blank is a character in the alphabet Σ. If we want blanks to be in the alphabet, we normally introduce some special symbol to stand for them, as blanks themselves are invisible to the naked eye.

The reason for putting a blank between a and b* in the product above is to emphasize the point that the star operator is applied to the b only. We have now used a boldface letter without a star as well as with a star.

We can apply the Kleene star to the whole string ab if we want, as follows:

(ab)* = Λ or ab or abab or ababab . . .

Parentheses are not letters in the alphabet of this language, so they can be used to indicate factoring without accidentally changing the words. Since the star represents some kind of exponentiation, we use it as powers are used in algebra, where by universal understanding the expression xy^2 means x(y^2), not (xy)^2.


If we want to define the language L1 this way, we may write

L1 = language(xx*)

This means that we start each word of L1 by writing down an x and then we follow it with some string of x's (which may be no more x's at all). Or we may use the + notation from Chapter 2 and write

L1 = language(x^+)

meaning all words of the form x to some positive power (i.e., not x^0 = Λ). The + notation is a convenience, but is not essential since we can say the same thing with *'s alone.

EXAMPLE

The language L1 can be defined by any of the expressions below:

xx*    x^+    xx*x*    x*xx*    x^+x*    x*x^+    x*x*x*xx*

Remember, x* can always be Λ.
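One way to gain confidence that these expressions all define L1 is a bounded brute-force check. The sketch below assumes Python's `re` syntax, in which the superscript plus is written as a postfix `+`; the helper `language_up_to` and the bound of 8 are choices made for this illustration. Agreement up to a bound is evidence, not a proof.

```python
import re

# The seven expressions of the example, in Python's regex syntax.
EXPRESSIONS = ["xx*", "x+", "xx*x*", "x*xx*", "x+x*", "x*x+", "x*x*x*xx*"]

def language_up_to(pattern: str, max_len: int) -> set:
    """All strings of x's of length <= max_len that fully match the pattern."""
    compiled = re.compile(pattern)
    return {"x" * n for n in range(max_len + 1) if compiled.fullmatch("x" * n)}

# L1 restricted to words of length at most 8: x, xx, ..., xxxxxxxx.
expected = {"x" * n for n in range(1, 9)}
all_agree = all(language_up_to(p, 8) == expected for p in EXPRESSIONS)
print(all_agree)  # True
```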

EXAMPLE

The language defined by the expression

ab*a

is the set of all strings of a's and b's that have at least two letters, that begin and end with a's, and that have nothing but b's inside (if anything at all).

language(ab*a) = { aa aba abba abbba abbbba . . . }

It would be a subtle mistake to say only that this language is the set of all words that begin and end with an a and have only b's in between, because this description may also apply to the word a, depending on how it is interpreted. Our symbolism eliminates this ambiguity.

EXAMPLE

The language of the expression

a*b*

contains all the strings of a's and b's in which all the a's (if any) come before all the b's (if any).

language(a*b*) = { Λ a b aa ab bb aaa aab abb bbb aaaa . . . }

Notice that ba and aba are not in this language. Notice also that there need not be the same number of a's and b's.

Here we should again be very careful to observe that

a*b* ≠ (ab)*

since the language defined by the expression on the right contains the word abab, whereas the language defined by the expression on the left does not. This cautions us against thinking of the * as a normal algebraic exponent.

The language defined by the expression a*b*a* contains the word baa since it starts with zero a's followed by one b followed by two a's.
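The three membership claims above can be checked mechanically. This is a small sketch using Python's `re.fullmatch`; the helper name `in_language` is an assumption of the example.

```python
import re

def in_language(pattern: str, word: str) -> bool:
    """True when the entire word matches the regular expression."""
    return re.fullmatch(pattern, word) is not None

print(in_language(r"(ab)*", "abab"))   # True: abab is (ab)(ab)
print(in_language(r"a*b*", "abab"))    # False: an a follows a b
print(in_language(r"a*b*a*", "baa"))   # True: zero a's, one b, two a's
```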

EXAMPLE

The following expressions both define the language L2 = { x^odd }:

x(xx)* or (xx)*x

but the expression

x*xx*

does not since it includes the word (xx) x (x).
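A quick bounded check of this example, written as a sketch in Python (the helper `accepts` and the bound of 12 are assumptions of the illustration):

```python
import re

def accepts(pattern: str, n: int) -> bool:
    """True when the string of n x's fully matches the pattern."""
    return re.fullmatch(pattern, "x" * n) is not None

# Both expressions should accept exactly the odd lengths.
odd_only = all(
    accepts(p, n) == (n % 2 == 1)
    for p in (r"x(xx)*", r"(xx)*x")
    for n in range(12)
)
print(odd_only)               # True
print(accepts(r"x*xx*", 4))   # True: x*xx* also admits the even word xxxx
```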

We now introduce another use for the plus sign. By the expression x + y where x and y are strings of characters from an alphabet, we mean "either x or y." This means that x + y offers a choice, much the same way that x* does. Care should be taken so as not to confuse this with + as an exponent.

EXAMPLE

Consider the language T defined over the alphabet Σ = {a b c}:

T = { a c ab cb abb cbb abbb cbbb abbbb cbbbb . . . }

All the words in T begin with an a or a c and then are followed by some number of b's. Symbolically, we may write this as

T = language((a + c)b*)

= language(either a or c then some b's)

We should, of course, have said "some or no b's." We often drop the zero option because it is tiresome. We let the word "some" always mean "some or no," and when we mean "some positive number of," we say that.

We say that the expression (a + c)b* defines a language in the following sense. For each * or +, used as a superscript, we must select some number of factors for which it stands. For each other +, we must decide whether to choose the right-side expression or the left-side expression. For every set of choices, we have generated a particular string. The set of all strings that can be produced by this method is the language of the expression. In the example

(a + c)b*

we must choose either a or c for the first letter and then we choose how many b's the b* stands for. Each set of choices is a word. If from (a + c) we choose c and we choose b* to mean bbb, we have the word cbbb.
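The "set of choices" reading can be made concrete by enumerating the choices directly. The sketch below is only an illustration for the single expression (a + c)b*; the function name and the cap `max_bs` on how many b's the star may stand for are assumptions introduced to keep the output finite.

```python
from itertools import product

def words_of_a_plus_c_b_star(max_bs: int) -> list:
    """One word per pair of choices: a letter from (a + c), and a count for b*."""
    return [
        first + "b" * count
        for count, first in product(range(max_bs + 1), "ac")
    ]

print(words_of_a_plus_c_b_star(2))  # ['a', 'c', 'ab', 'cb', 'abb', 'cbb']
```

Choosing c and letting b* stand for three b's gives cbbb, which appears once `max_bs` is at least 3.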

EXAMPLE

Now let us consider a finite language L that contains all the strings of a's and b's of length three exactly:

L = { aaa aab aba abb baa bab bba bbb }

. . . null string. This is a very important expression and we shall use it often.

Again, this expression represents a language. If we choose that * stands for 5, then (a + b)* gives

(a + b)^5 = (a + b)(a + b)(a + b)(a + b)(a + b)

We now have to make five more choices: either a or b for the first letter, either a or b for the second letter, and so on.

This is a very powerful notation. We can describe all words that begin with the letter a simply as

a(a + b)*

that is, first an a, then anything (as many choices as we want of either letter a or b).

All words that begin with an a and end with a b can be defined by the expression

a(a + b)*b = a(arbitrary string)b
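As a sanity check on this reading, the sketch below compares a(a + b)*b with the plain-English description on every word up to a small length. It assumes Python's `re` syntax, where the choice written here as + becomes `|`; the helper `words` and the bound of 6 are assumptions of the example.

```python
import re
from itertools import product

# a(a + b)*b, with + (choice) written as | in Python's regex syntax.
PATTERN = re.compile(r"a(a|b)*b")

def words(max_len: int):
    """Yield every string over {a, b} of length 0..max_len."""
    for n in range(max_len + 1):
        for letters in product("ab", repeat=n):
            yield "".join(letters)

agrees = all(
    (PATTERN.fullmatch(w) is not None) == (w.startswith("a") and w.endswith("b"))
    for w in words(6)
)
print(agrees)  # True
```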

FORMAL DEFINITION OF REGULAR EXPRESSIONS

After all the introduction we have endured of the slow evolution of these language-defining expressions, it is time for us to identify them with their proper name and give them a mathematical definition. As is no surprise to those who have read the title of this chapter, these are called regular expressions. Similarly, the corresponding languages that they define are referred to as regular languages. We shall soon see that this language-defining tool is of limited capacity in that there are many interesting languages that cannot be defined by regular expressions, which is why this volume has more than 100 pages. A regular language is one that can be defined by a regular expression even though it may also have many other fine definitions. A regular expression, on the other hand, must take a very rigorous form as defined below recursively.

DEFINITION

The symbols that appear in regular expressions are the letters of the alphabet Σ, the symbol for the null string Λ, parentheses, the star operator, and the plus sign.

The set of regular expressions is defined by the following rules:

Rule 1 Every letter of Σ can be made into a regular expression by writing it in boldface; Λ itself is a regular expression.

Rule 2 If r1 and r2 are regular expressions, then so are:

(i) (r1)
(ii) r1r2
(iii) r1 + r2
(iv) r1*

We could have included the plus sign as a superscript in r1^+ as part of the definition, but since we know that r1^+ = r1r1*, this would add nothing valuable.
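A recursive definition of this kind maps naturally onto a small recursive data type. The following Python sketch is one possible encoding, not the book's: it assumes that Rule 2 builds expressions by concatenation, plus (choice), and star, and every class and function name in it (`Letter`, `Null`, `Concat`, `Plus`, `Star`, `language`) is hypothetical. Because most such languages are infinite, `language` only lists the words up to a chosen length.

```python
from dataclasses import dataclass
from typing import Union

@dataclass(frozen=True)
class Letter:          # Rule 1: a letter of the alphabet, made into an expression
    char: str

@dataclass(frozen=True)
class Null:            # Rule 1: the expression for the null string
    pass

@dataclass(frozen=True)
class Concat:          # Rule 2 (assumed form): r1 r2
    left: "Regex"
    right: "Regex"

@dataclass(frozen=True)
class Plus:            # Rule 2 (assumed form): r1 + r2, a choice
    left: "Regex"
    right: "Regex"

@dataclass(frozen=True)
class Star:            # Rule 2 (assumed form): r1*
    inner: "Regex"

Regex = Union[Letter, Null, Concat, Plus, Star]

def language(r: Regex, max_len: int) -> set:
    """Words of length <= max_len in the language defined by r."""
    if isinstance(r, Null):
        return {""}
    if isinstance(r, Letter):
        return {r.char} if max_len >= 1 else set()
    if isinstance(r, Plus):
        return language(r.left, max_len) | language(r.right, max_len)
    if isinstance(r, Concat):
        return {
            u + v
            for u in language(r.left, max_len)
            for v in language(r.right, max_len)
            if len(u + v) <= max_len
        }
    if isinstance(r, Star):
        base = language(r.inner, max_len)
        words, frontier = {""}, {""}
        while frontier:  # keep appending nonempty factors until nothing new fits
            frontier = {
                u + v for u in frontier for v in base
                if v and len(u + v) <= max_len
            } - words
            words |= frontier
        return words
    raise TypeError(f"not a regular expression: {r!r}")

# a(a + b)*: all words that begin with the letter a.
a_or_b = Plus(Letter("a"), Letter("b"))
begins_with_a = Concat(Letter("a"), Star(a_or_b))
print(sorted(language(begins_with_a, 2)))  # ['a', 'aa', 'ab']
```

Parentheses do not need a constructor of their own in this encoding, since the nesting of the objects already records the grouping.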

This is a language of language-definers. It is analogous to a book that lists all the books in print. Every word in such a book is a book-definer. The same confusion occurs in everyday speech. The string "French" is both a word (an adjective) and a language-defining name (a noun). However difficult computer theory may seem, common English usage is much harder.

Because of Rule 1, we may have trouble in distinguishing when we write an a whether we mean a, the letter in Σ; a, the word in Σ*; {a}, the one-word language; or a, the regular expression for that language. Context and typography will guide us.

As with the recursive definition of arithmetic expressions, we have included the use of parentheses as an option, not a requirement. Let us emphasize again the implicit parentheses in r1*. If r1 = aa + b, then the expression r1* technically refers to the expression

aa + b*

whereas what we usually intend is

(aa + b)*

Both are regular expressions, but their languages are quite different. Care should always be taken to produce the expression we actually want, but this much care is too much to ask of mortals, and when we write r1* in the rest of the book, we really mean (r1)*.

The definition we have given for regular expressions contains one subtle but important omission: the language φ. This language is not the same as the one represented by the regular expression Λ, or by any other regular expression that comes from our definition. We already have a symbol for the word with no letters and a symbol for the language with no words. Do we really need to invent yet another symbol for the regular expression that defines the language with no words? Would it simply be the regular expression with no characters, analogous to the word lambda (Λ) in the language of regular expressions? To the purely logical Vulcan mind, that would be the only answer, but since we have already employed the boldface lambda (Λ) to mean the regular expression defining the word lambda, we take the liberty of using the boldface phi (φ) to be the regular expression for the null language. We have already wasted enough thought on the various degrees of nothingness to qualify as medieval ecclesiastics; the desire for more precision would require psycho-active medication.

For any r, we have

r + φ = r

and

φr = φ

but what is far less clear is exactly what φ* should mean. We shall avoid this philosophical crisis by never using this symbolism and avoiding those who do.

EXAMPLE

Let us consider the language defined by the expression

(a + b)* a (a + b)*

At the beginning, we have (a + b)*, which stands for anything, that is, any string of a's and b's, then comes an a, then another anything. All told, the language is the set of all words over the alphabet Σ = {a b} that have an a in them somewhere. The only words left out are those that have only b's and the word Λ.

For example, the word abbaab can be considered to be derived from this expression by three different sets of choices:

(Λ)a(bbaab) or (abb)a(ab) or (abba)a(b)
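Each derivation corresponds to picking which a in the word plays the role of the middle a. A short Python sketch (the function name `derivations` is an assumption of the example) lists the resulting splits, with the empty string standing for Λ:

```python
def derivations(word: str) -> list:
    """Every way to write word as (anything) a (anything)."""
    return [(word[:i], word[i + 1:]) for i, ch in enumerate(word) if ch == "a"]

print(derivations("abbaab"))
# [('', 'bbaab'), ('abb', 'ab'), ('abba', 'b')]
```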

If the only words left out of the language defined by the expression above are the words without a's (Λ and strings of b's), then these omitted words are exactly the language defined by the expression b*. If we combine these two, we should produce the language of all strings. In other words, since

all strings = (all strings with an a) + (all strings without an a)

it should make sense to write

(a + b)* = (a + b)*a(a + b)* + b*

Here, we have added two language-defining expressions to produce an expression that defines the union of the two languages defined by the individual expressions. We have done this with languages as sets before, but now we are doing it with these emerging language-defining expressions.
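The union can be seen literally by building the two languages as Python sets, restricted to short words. This is a sketch under the assumption that words of length at most 6 are representative; `words` and the variable names are introduced only for the example.

```python
import re
from itertools import product

def words(max_len: int):
    """Yield every string over {a, b} of length 0..max_len."""
    for n in range(max_len + 1):
        for letters in product("ab", repeat=n):
            yield "".join(letters)

everything = set(words(6))
with_an_a = {w for w in everything if re.fullmatch(r"(a|b)*a(a|b)*", w)}
without_an_a = {w for w in everything if re.fullmatch(r"b*", w)}

print(with_an_a | without_an_a == everything)  # True: the union is all strings
print(with_an_a & without_an_a == set())       # True: the two parts do not overlap
```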

We should note that this use of the plus sign is consistent with the principle that in these expressions plus means choice. When we add sets to form a union, we are saying first choose the left set or the right set and then find a word in that set. In the expression above, first choose (a + b)*a(a + b)* or b* and then make further choices for the pluses and stars and finally arrive at a word that is included in the total language defined by the expression.

In this way, we see that the use of plus for union is actually a natural equivalence of the use of plus for choice.

Notice that this use of the plus sign is far from the normal meaning of addition in the algebraic sense, as we can see from

a* = a* + a*

a* = a* + a* + a*

a* = a* + aaa

For plus as union or plus as choice, these all make sense; for plus as algebra, they lead to presumptions of subtractions that are misguided.

EXAMPLE

The language of all words that have at least two a's can be described by the expression

(a + b)*a(a + b)*a(a + b)*

= (some beginning)(the first important a)(some middle)(the second important a)(some end)

where the arbitrary parts can have as many a's (or b's) as they want.

EXAMPLE

Another expression that denotes all the words with at least two a's is

b*ab*a(a + b)*

We scan through some jungle of b's (or no b's) until we find the first a, then more b's (or no b's), then the second a, then we finish up with anything. In this set are abbbabb and aaaaa.

We can write

(a + b)*a(a + b)*a(a + b)* = b*ab*a(a + b)*

where by the equal sign we do not mean that these expressions are equal algebraically in the same way as

x + x = 2x

but that they are equal because they describe the same item, as with

16th President = Abraham Lincoln

We could write

language((a + b)*a(a + b)*a(a + b)*)

= language(b*ab*a(a + b)*)

= all words with at least two a's

To be careful about this point, we say that two expressions are equivalent if they describe the same language.
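Equivalence in this sense can be tested, though not proved, by comparing the two expressions word by word against the English description. The sketch below assumes Python's `re` syntax (with + as choice written `|`) and an arbitrary length bound of 7; `words` and `equivalent_up_to` are names made up for the example.

```python
import re
from itertools import product

FIRST = re.compile(r"(a|b)*a(a|b)*a(a|b)*")
SECOND = re.compile(r"b*ab*a(a|b)*")

def words(max_len: int):
    """Yield every string over {a, b} of length 0..max_len."""
    for n in range(max_len + 1):
        for letters in product("ab", repeat=n):
            yield "".join(letters)

def equivalent_up_to(max_len: int) -> bool:
    """Both expressions accept exactly the words with at least two a's."""
    for w in words(max_len):
        expected = w.count("a") >= 2
        if (FIRST.fullmatch(w) is not None) != expected:
            return False
        if (SECOND.fullmatch(w) is not None) != expected:
            return False
    return True

print(equivalent_up_to(7))  # True
```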

The expressions below also describe the language of words with at least two a's:

(a + b)*ab*ab*

and

b*a(a + b)*ab*

If we wanted all the words with exactly two a's, we could use the expression

b*ab*ab*

which describes such words as aab, baba, and bbbabbbab. To make the word aab, we let the first and second b* become Λ and the last becomes b.

EXAMPLE

The language of all words that have at least one a and at least one b is somewhat trickier. If we write

(a + b)*a(a + b)*b(a + b)*

= (arbitrary) a (arbitrary) b (arbitrary)

we are then requiring that an a precede a b in the word. Such words as ba and bbaaaa are not included in this set. Since, however, we know that either the a comes before the b or the b comes before the a, we could define this set by the expression

(a + b)*a(a + b)*b(a + b)* + (a + b)*b(a + b)*a(a + b)*

Here, we are still using the plus sign in the general sense of disjunction (or). We are taking the union of two sets, but it is more correct to think of this + as offering alternatives in forming words.

There is a simpler expression that defines the same language. If we are confident that the only words that are omitted by the first term

(a + b)*a(a + b)*b(a + b)*

are the words of the form some b's followed by some a's, then it would be sufficient to add these specific exceptions into the set. These exceptions are all defined by the regular expression

bb*aa*

The language of all words over the alphabet Σ = {a b} that contain both an a and a b is therefore also defined by the expression

(a + b)*a(a + b)*b(a + b)* + bb*aa*

Notice that it is necessary to write bb*aa* because b*a* will admit words we do not want, such as aaa.

The only words that do not contain both an a and a b are the words of all a's, all b's, or Λ. When these are included, we get everything. Therefore, the regular expression

(a + b)*a(a + b)*b(a + b)* + bb*aa* + a* + b*

defines all possible strings of a's and b's. The word Λ is included in both a* and b*.

We can then write

(a + b)* = (a + b)*a(a + b)*b(a + b)* + bb*aa* + a* + b*

which is not a very obvious equivalence at all.
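Since the equivalence is not obvious, it may help to see which term of the right-hand side accounts for each word. In this Python sketch the four terms are kept separate and given made-up labels (`a_then_b`, `bs_then_as`, `only_as`, `only_bs`); the check over words of length at most 6 is a bounded test, not a proof.

```python
import re
from itertools import product

# The four terms of the right-hand side, with + (choice) written as |.
TERMS = {
    "a_then_b": r"(a|b)*a(a|b)*b(a|b)*",
    "bs_then_as": r"bb*aa*",
    "only_as": r"a*",
    "only_bs": r"b*",
}

def words(max_len: int):
    """Yield every string over {a, b} of length 0..max_len."""
    for n in range(max_len + 1):
        for letters in product("ab", repeat=n):
            yield "".join(letters)

def terms_matching(word: str) -> list:
    """Labels of the terms whose language contains the word."""
    return [name for name, pattern in TERMS.items() if re.fullmatch(pattern, word)]

covered = all(terms_matching(w) for w in words(6))
print(covered)                 # True: every short word falls under some term
print(terms_matching(""))      # ['only_as', 'only_bs']
print(terms_matching("bba"))   # ['bs_then_as']
```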

We must not misinterpret the fact that every regular expression defines some language to mean that the associated language has a simple English description, such as in the preceding examples. It may very well be that the regular expression itself is the simplest description of the particular language. For example,

(Λ + ba*)(ab*a + ba*)*b(a* + b*a)bab*

probably has no cute concise alternate characterization. And even if it does reduce to something simple, there is no way of knowing this. That is, there is no algorithm to discover hidden meaning.

EXAMPLE

All temptation to treat these language-defining expressions as if they were algebraic polynomials should be dispelled by these equivalences:

(a + b)* = (a + b)* + (a + b)*

(a + b)* = (a + b)*ab(a + b)* + b*a*

The second of these requires some explanation. The words that contain the substring ab are accounted for in the first term. The words that do not contain the substring ab are all a's, all b's, Λ, or some b's followed by some a's. All four missing types are covered by b*a*.

Usually, when we employ the star operator, we are defining an infinite language. We can represent a finite language by using the plus sign (union sign) alone. If the language L over the alphabet Σ = {a b} contains only the finite list of words

L = { abba baaa bbbb }

then we can represent L by the symbolic expression

L = language(abba + baaa + bbbb)

Every word in L is some choice of options of this expression.

If L is a finite language that includes the null word Λ, then the expression that defines L must also employ the symbol Λ.

For example, if

L = { Λ a aa bbb }

then the symbolic expression for L must be

L = language(Λ + a + aa + bbb)

The symbol Λ is a very useful addition to our system of language-defining symbolic expressions.
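In Python's `re` syntax there is no separate symbol for Λ; an empty alternative plays that role. The sketch below, whose names (`FINITE_LANGUAGE`, `PATTERN`, `accepted`) are assumptions of the example, builds the expression Λ + a + aa + bbb from the word list and confirms that it accepts exactly those four words among all words of length at most 4.

```python
import re
from itertools import product

# The finite language { Λ a aa bbb }; "" stands for the null word Λ.
FINITE_LANGUAGE = ["", "a", "aa", "bbb"]

# Joining with | gives "|a|aa|bbb"; the empty first alternative is Λ.
PATTERN = re.compile("|".join(re.escape(w) for w in FINITE_LANGUAGE))

def words(max_len: int):
    """Yield every string over {a, b} of length 0..max_len."""
    for n in range(max_len + 1):
        for letters in product("ab", repeat=n):
            yield "".join(letters)

accepted = {w for w in words(4) if PATTERN.fullmatch(w)}
print(accepted == set(FINITE_LANGUAGE))  # True
```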

EXAMPLE

Let V be the language of all strings of a's and b's in which either the strings are all b's or else there is an a followed by some b's. Let V also contain the word Λ:
