Structured Learning with Perceptron - Machine Learning for NLP

2.2 Machine Learning for NLP

2.2.1 Structured Learning with Perceptron

The Perceptron algorithm is one of the earliest machine learning methods that tried to mimic the behavior of the human brain. It was originally proposed by Rosenblatt [1958] as an intuitive way to learn from data with binary labels. Formal proofs [Novikoff, 1963; Freund and Schapire, 1999] show that, on linearly separable data, the algorithm makes only a finite number of mistakes before it reaches a weight vector that perfectly discriminates between the examples labeled zero and those labeled one. In the traditional sense, the Perceptron is a binary discriminative classifier that learns a linear decision boundary. The algorithm is as follows: given a dataset $D = \{(x_1, y_1), \cdots, (x_n, y_n)\}$ such that each data point $x_i$ is a $d$-dimensional real vector and $y_i \in \{0, 1\}$, the Perceptron learns a weight vector $\omega \in \mathbb{R}^d$ and a bias term $b$ such that the following prediction agrees with $y_i$:

$$z_i = I(\omega^\top x_i + b \geq 0)$$

where $I$ is the indicator function. It starts with a zero weight vector and a zero bias term, and visits the data points one at a time. Whenever the prediction is wrong, i.e. $z_i \neq y_i$, it updates the weights and the bias according to the following equations:

$$\omega = \omega + (y_i - z_i)\, x_i$$

and

$$b = b + (y_i - z_i)$$

The algorithm converges when the classifier makes no mistake on the training data.
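The procedure above can be sketched in a few lines of Python. This is a minimal illustration, assuming NumPy; the names `train_perceptron` and `predict` are hypothetical, the bias update is assumed to mirror the weight update, and the `max_epochs` cap is an added safeguard for data that may not be linearly separable rather than part of the original algorithm.

```python
import numpy as np

def train_perceptron(X, y, max_epochs=100):
    """Binary Perceptron with labels in {0, 1}.

    X: (n, d) array of real-valued inputs; y: (n,) array of 0/1 labels.
    max_epochs is an assumed safety cap, not part of the original algorithm.
    """
    n, d = X.shape
    w = np.zeros(d)  # zero weight vector
    b = 0.0          # zero bias term
    for _ in range(max_epochs):
        mistakes = 0
        for i in range(n):
            z = 1 if w @ X[i] + b >= 0 else 0  # z_i = I(w^T x_i + b >= 0)
            if z != y[i]:
                w += (y[i] - z) * X[i]  # weight update
                b += (y[i] - z)         # bias update (assumed analogous)
                mistakes += 1
        if mistakes == 0:  # a full pass without mistakes: converged
            break
    return w, b

def predict(w, b, X):
    """Apply the learned linear classifier to every row of X."""
    return (X @ w + b >= 0).astype(int)

# Toy linearly separable data: the logical AND of two binary inputs.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([0, 0, 0, 1])
w, b = train_perceptron(X, y)
```

Because the toy data is linearly separable, the loop stops after a finite number of updates and `predict(w, b, X)` reproduces `y`.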

Learning Structures with Perceptron

The Perceptron algorithm is a binary classifier, while many problems in natural language processing are multi-label structured prediction problems. Moreover, the features in natural language problems are categorical: they are words, part-of-speech tags, and other types of string-based features. A standard way to use any classifier, including the Perceptron, with such features is to convert each feature to a binary indicator feature. For example, if the presence of a word is important, we can dedicate a separate binary feature to that word. The following binary indicator feature, defined for input $x$ and output label $y$ (among all possible labels in $Y = \{y_1, \cdots, y_l\}$), is non-zero only if the input $x$ is the word “going” and the output label is “subject”.

$$f_k(x, y) = \begin{cases} 1 & \text{if } x = \text{going and } y = \text{subject} \\ 0 & \text{otherwise} \end{cases}$$

where $k$ is the feature index in the binary feature vector. It is worth noting that one can extend this trick to other ways of representing features in data, e.g. the joint existence of two words, a word and tag pair, or even the count of occurrences of a word, at the expense of making the features more complicated and the final feature vector sparser. Thus, each entry of the sparse feature vector can be defined by arbitrary, even overlapping, features.
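One possible way to realize such indicator features in code is to store only the non-zero entries of the feature vector, keyed by a descriptive string. The sketch below is illustrative only: the template strings, the `prev_word` argument, and the `FeatureIndex` class are assumptions introduced for this example.

```python
def indicator_features(word, label, prev_word=None):
    """Return the non-zero entries of the binary feature vector f(x, y).

    The feature templates (and their string names) are illustrative assumptions.
    """
    feats = {f"word={word}|label={label}": 1}
    if prev_word is not None:
        # Joint existence of two words, combined with the label.
        feats[f"prev={prev_word}|word={word}|label={label}"] = 1
    return feats

class FeatureIndex:
    """Map feature strings to integer positions k in the feature vector."""

    def __init__(self):
        self._ids = {}

    def index_of(self, name):
        # Assign the next free index the first time a feature is seen.
        return self._ids.setdefault(name, len(self._ids))

index = FeatureIndex()
k = index.index_of("word=going|label=subject")

def f_k(word, label):
    """The indicator from the text: 1 only for x = going and y = subject."""
    return indicator_features(word, label).get("word=going|label=subject", 0)
```

Storing only the active features keeps the representation compact even when the full vector has a very large number of dimensions.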

Structured prediction with Perceptron: in the case of structures, such as dependency trees, the feature vector of a tree can be defined as the sum of the feature vectors of its substructures. For example, in graph-based parsing, the substructures are the arcs of the tree. In other words, a structure $y$ is decomposed into substructures or arcs, where each arc $r = (\text{head}(i) \rightarrow i)$ gets its own feature vector $f(x, r) \in \mathbb{R}^D$. The feature vector of the whole structure is then the sum of the arc feature vectors:

$$f(x, y) = \sum_{r \in y} f(x, r) \in \mathbb{R}^D$$

Thus the score of a structure $y$ for input $x$ can be defined as the inner product of its feature vector with the weight vector $\omega \in \mathbb{R}^D$:

$$\text{score}(y \mid x; \omega) = \omega^\top f(x, y) = \omega^\top \sum_{r \in y} f(x, r) = \sum_{r \in y} \omega^\top f(x, r)$$
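The decomposition of the score over arcs can be checked on a small example. The following sketch assumes a hypothetical set of arc feature templates and hand-picked weights; it only illustrates that scoring the summed feature vector gives the same value as summing the per-arc scores.

```python
from collections import Counter

def arc_features(words, head, child):
    """Sparse f(x, r) for the arc head -> child (templates are assumptions)."""
    h, c = words[head], words[child]
    return Counter({f"hw={h}|cw={c}": 1, f"hw={h}": 1, f"dist={head - child}": 1})

def tree_features(words, heads):
    """f(x, y): the sum of the arc feature vectors.

    heads[i] is the head of token i; index 0 is an artificial ROOT token.
    """
    total = Counter()
    for child in range(1, len(words)):
        total.update(arc_features(words, heads[child], child))
    return total

def dot(weights, feats):
    """Inner product of a sparse weight dict with a sparse feature vector."""
    return sum(weights.get(name, 0.0) * value for name, value in feats.items())

words = ["<ROOT>", "she", "is", "going"]
heads = [None, 3, 3, 0]  # going -> she, going -> is, ROOT -> going
weights = {"hw=going|cw=she": 1.5, "hw=going": 0.5, "dist=1": -0.25}

# Score of the whole tree from its summed feature vector ...
score_whole = dot(weights, tree_features(words, heads))
# ... and the same score obtained arc by arc.
score_arcs = sum(
    dot(weights, arc_features(words, heads[c], c)) for c in range(1, len(words))
)
```

Both `score_whole` and `score_arcs` evaluate to the same number, which is what allows a decoder to work with arc scores alone.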

Therefore, given an input $x$, the best structure can be defined as the highest-scoring structure among the (usually exponentially many) possible structures:

$$y^* = \arg\max_{y \in Y(x)} \text{score}(y \mid x; \omega)$$

where $Y(x)$ is the set of all possible structures for $x$. Dynamic programming or beam search is usually used to solve this argmax.
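To make the argmax concrete, the sketch below enumerates every candidate dependency tree for a very short sentence and returns the highest-scoring one. Exhaustive enumeration stands in here for dynamic programming or beam search and is only feasible for tiny inputs. The single word-pair feature per arc, the function names, and the choice to allow several tokens to attach to ROOT are all assumptions of this example.

```python
from itertools import product

def is_tree(heads):
    """True if every token reaches ROOT (index 0) without entering a cycle.

    heads[i] is the head of token i for i >= 1; heads[0] is unused.
    """
    for start in range(1, len(heads)):
        seen = set()
        node = start
        while node != 0:
            if node in seen:
                return False
            seen.add(node)
            node = heads[node]
    return True

def candidate_trees(n):
    """Enumerate Y(x) for n tokens: all head assignments forming a tree."""
    for assignment in product(range(n + 1), repeat=n):
        heads = (None,) + assignment
        if is_tree(heads):
            yield heads

def score(weights, words, heads):
    """Arc-factored score with one assumed indicator feature per arc."""
    return sum(
        weights.get((words[heads[c]], words[c]), 0.0) for c in range(1, len(words))
    )

def argmax_tree(weights, words):
    """Brute-force argmax over all candidate trees."""
    n = len(words) - 1
    return max(candidate_trees(n), key=lambda heads: score(weights, words, heads))

words = ["<ROOT>", "she", "is", "going"]
weights = {("<ROOT>", "going"): 2.0, ("going", "she"): 1.0, ("going", "is"): 1.0}
best = argmax_tree(weights, words)
```

Even for three tokens there are 16 candidate trees, and the count grows exponentially with sentence length, which is why practical parsers avoid enumeration.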

Collins [2002] found that averaging all parameters over all iterations [Freund and Schapire, 1999] gives more reliable parameters in different natural language processing tasks:

$$a\omega = \frac{\sum_{i=1}^{T \times n} \omega^{(i)}}{T \times n}$$

The above averaging technique is inherently expensive for sparse feature vectors. Daumé III [2006] proposes a simple trick to get the averaged values with less complexity. Figure 2.8 depicts the algorithm with the averaging trick from Daumé III [2006] for structured prediction.

Inputs: 1) Training data $D = \{(x_1, y_1), \cdots, (x_n, y_n)\}$; 2) Feature function $f(x, y)$ that maps an $(x, y)$ pair to a $D$-dimensional vector; 3) $T$: number of training epochs.

Initialization: Set $a\omega_j = 0$, $\omega_j = 0$ $\forall j \in \{1, \cdots, D\}$.

Algorithm:

    c = 0
    for t = 1 to T do
        for i = 1 to n do
            c = c + 1
            zi = arg max y∈Y(xi) ω⊤f(xi, y)    ▷ Use dynamic programming or beam search.
            if zi ≠ yi then    ▷ Apply sparse updates.
                for each j where fj(xi, yi) ≠ 0 do
                    Set ωj = ωj + fj(xi, yi)
                    Set aωj = aωj + c · fj(xi, yi)
                for each j where fj(xi, zi) ≠ 0 do
                    Set ωj = ωj − fj(xi, zi)
                    Set aωj = aωj − c · fj(xi, zi)
    aω = ω − aω / c    ▷ Calculate averaged weights.

Output: $a\omega$.

Figure 2.8: Pseudo-code for the averaged structured Perceptron algorithm. A naive implementation of averaging is not efficient; Daumé III [2006] used the trick in this pseudo-code to make averaging efficient.
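The pseudo-code of Figure 2.8 can be sketched in Python as follows. This is an illustration under stated assumptions rather than a reference implementation: the toy tagging task, the feature templates in `features`, and the exhaustive `candidates` search (standing in for dynamic programming or beam search) are all hypothetical. With the counter convention of the figure, the returned average equals the mean of the weight vectors in effect at each of the `c` prediction steps.

```python
from collections import defaultdict
from itertools import product

def features(x, y):
    """Sparse f(x, y) for a tag sequence y over words x (assumed templates)."""
    feats = defaultdict(float)
    prev = "<s>"
    for word, tag in zip(x, y):
        feats[("word-tag", word, tag)] += 1.0
        feats[("tag-tag", prev, tag)] += 1.0
        prev = tag
    return feats

def candidates(x, tags):
    """Y(x): every tag sequence of the same length as x (exhaustive search)."""
    return product(tags, repeat=len(x))

def dot(w, feats):
    """Inner product of sparse weights with a sparse feature vector."""
    return sum(w.get(j, 0.0) * v for j, v in feats.items())

def train_averaged_perceptron(data, tags, epochs):
    """Return (final weights, averaged weights) following Figure 2.8."""
    w = defaultdict(float)   # omega
    aw = defaultdict(float)  # a-omega: updates weighted by the counter c
    c = 0
    for _ in range(epochs):
        for x, y in data:
            c += 1
            z = max(candidates(x, tags), key=lambda cand: dot(w, features(x, cand)))
            if z != y:  # sparse updates touch only the active features
                for j, v in features(x, y).items():
                    w[j] += v
                    aw[j] += c * v
                for j, v in features(x, z).items():
                    w[j] -= v
                    aw[j] -= c * v
    # Averaged weights: omega - a-omega / c (empty if no example was seen).
    averaged = {j: w[j] - aw[j] / c for j in w} if c else {}
    return dict(w), averaged

data = [
    (("she", "runs"), ("PRON", "VERB")),
    (("dogs", "run"), ("NOUN", "VERB")),
]
tags = ("NOUN", "PRON", "VERB")
final_w, avg_w = train_averaged_perceptron(data, tags, epochs=3)
```

The point of the trick is that each mistake touches only the features that are active in the gold or predicted structure, so no full weight vector has to be accumulated at every step.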

In document Cross-Lingual Transfer of Natural Language Processing Systems (Page 48-51)