Perceptron ตอนที่ 1 : โครงสร้างและการเรียนรู้

Perceptron ในข้อเขียนนี้หมายถึง Learning Algorithm ใช้เพื่อการทำ binary classification [5] ถูกนำเสนอโดย McCulloch & Pitts [2][3] พัฒนาและปรับปรุงโดย Minski & Papert [3][7] เป็นพื้นฐานสำคัญเรื่องหนึ่งของการทำความเข้าใจเรื่อง Machine Learning / Deep Learning

โครงสร้างของ Perceptron

McCulloch & Pitts (MP model) ได้นำเสนอตัวแบบแรกของ perceptron ที่มีฟังก์ชั่นสองฟังก์ชั่นคือ function G (ดูรูปที่ 1) ใช้คำนวณหา weighted sum [12] ของ input ทั้งหมดเข้าด้วยกันให้เป็นตัวเลขค่าเดียว ส่งต่อไปให้ function F (ดูรูปที่ 1) เพื่อสร้าง output \( y \in \{1,0\}\) โดยเทียบกับค่า threshold (\( \theta\)) ที่ถูกกำหนดไว้ก่อน

รูปที่ 1 Perceptron ของ McCulloch & Pitts

function ที่ทำงานร่วมกันสองส่วน :

\[ \begin{align*} g(x) &= \sum_{i=1}^{n}{w_i \cdot x_i} +b \quad \tag{1} \\ \\ \hat{y}=sign(g(x)) &= \begin{cases} 1 & \quad \text{if }g(x) \geq \theta \\ 0 & \quad \text{otherwise } \end{cases} \quad \tag{2} \\ \\ \hat{y}=sign(g(x)) &= \begin{cases} 1 & \quad \text{if }g(x) \geq \theta \\ -1 & \quad \text{otherwise } \end{cases} \quad \tag{3} \end{align*} \]

w คือ weight โดยที่ \( w_i \in \Re^d \)

x คือ input โดยที่ \( x_i \in \Re^d \)

b คือ bias

y คือ output จะเลือกใช้แบบ (2) หรือ (3) ขึ้นกับ activation function หรือ threshold function [9] ที่เลือกใช้

ตัวอย่างที่มักหยิบขึ้นมาอธิบายขั้นตอนการทำงานการใช้ perceptron หาผลลัพธ์ของ OR Gate หรือ AND Gate

OR Gate, \(\theta = 1, b = 0, w_1=1,w_2=1 \)
\(X_1 \)	\(X_2 \)	\(g(x) \)	\(\hat{y} \)	y (\( X_1 \vee X_2 \))
1	1	2	1	1
1	0	1	1	1
0	1	1	1	1
0	0	0	0	0

AND Gate, \(\theta = 2,b=0,w_1=1,w_2=1 \)
\(X_1 \)	\(X_2 \)	\(g(x) \)	\( \hat{y} \)	y (\( X_1 \wedge X_2 \))
1	1	2	1	1
1	0	1	0	0
0	1	1	0	0
0	0	0	0	0

Rosenblatt [6] ได้เสนอตัวแบบที่เหมือนกับของ McCulloch & Pitts แต่มีเพิ่ม learning algorithm เข้าไปเพื่อให้ตัวแบบมีความเป็น adaptive สามารถปรับปรุงค่าของ parameters ได้เอง (รูปที่ 2) โดยอาศัยค่าจากตัวแปรใหม่เรียกว่า error โดยที่จะเกิดขึ้นเมื่อ \( \hat{y} \neq y \) ( \( \hat{y} \) คือ output ของตัวแบบ และ y คือค่าสังเกตุ )

รูปที่ 2 Perceptron ของ Rosenblatt

Learning Algorithm ของ Perceptron

เป้าหมายของการเรียนรู้ของ perceptron คือการหาค่าประมาณของ parameters \( \theta \) ซึ่งหมายถึงค่าของ weight และ bias มีการสร้างตัวแปรเพิ่มขึ้นมาคือความต่างระหว่าง output กับค่าสังเกตุ (\( y_i \neq \hat{y_i}\)) จะเรียกว่าค่า error ซึ่งหาได้โดยการวัดความต่างระหว่างข้อมูล [11] ในระหว่างการเรียนรู้

หากมี error เกิดขึ้น ค่านี้ก็จะถูกนำไปใช้ในการปรับปรุงค่าของ parameters กระทำวนกันไปจนกว่าจะได้ผลลัพธ์ออกมาตรงตามเป้าหมาย เขียนเป็น pseudocode ได้ดังนี้

Assign: \( sign(y) = \begin{cases} +1 & \quad \text{if }y \geq 0 \\ -1 & \quad \text{otherwise } \end{cases} \)

Input: \( (x_i,y_i) \text{ when } x \in \Re^d \text{ and } y \in \{-1,+1 \}\)

Algorithm :

\( \text{initial }\theta \text{ randomly or } \theta = 0 \\ \text{While not convergence do } \\ \quad\quad\text{pick }x_i \text{ randomly}\\ \quad\quad \hat{y} = sign(\theta \cdot x_i ) \\ \quad\quad \text{if } \hat{y} \neq y_i \text{ then} \\ \quad\quad\quad\quad \theta = \theta + y_i \cdot x_i \\ \quad\quad\text{else }\\ \quad\quad\quad\quad\text{unchange } \theta\\ \)

Output :\( \theta \)

การปรับปรุงค่า parameters

เพื่อให้เห็นภาพง่ายขึ้น จะขออ้างอิงจากเรื่อง vector [13] และ linear classifier [8][10] มาใช้ประกอบการอธิบาย เริ่มต้นด้วยการแปลงสมการ (1), (2) หรือ (3) ให้ในรูปของสมการ vector ก่อน

\[ \begin{align*} g(x) &= \vec{W} \cdot \vec{X} \longrightarrow (4) \\\\ \vec{W} \cdot \vec{X} &= cos(\theta)\big|\big|\vec{W}\big|\big| \cdot \big|\big|\vec{X}\big|\big|\longrightarrow (5) \\ \\ cos(\theta) &= \frac{ \vec{W} \cdot \vec{X} } { \big|\big|\vec{W}\big|\big| \cdot \big|\big|\vec{X}\big|\big|} \longrightarrow (6) \end{align*} \]

\(\theta \) คือมุมที่ทำกันระหว่างเวคเตอร์ \(\vec{X} \) และ \( \vec{W} \)

เราทราบว่า \( \big|\big|\vec{W} \big|\big| \) และ \( \big|\big|\vec{X} \big|\big| \) ต้องมากกว่า 0 (มีทิศทางเป็นบวกเสมอ) ดังนั้นทิศทางของ \( cos(\theta) \) จึงแปรผันตรงกับ \( \vec{W} \cdot \vec{X} \)

ความรู้จากเรื่อง Linear classification นำไปสู่ข้อสรุปว่า decision line ที่เหมาะสมในการใช้ classify เมื่อ

\[ \vec{W} \cdot \vec{X} = 0 \longrightarrow (7)\]

\( \theta \) หรือมุมระหว่าง \( \vec{W } \)และ \(\vec{X} \) ต้องทำให้ \( cos(\theta) = 0 \) หรือ \( vec{W}\) ต้องตั้งฉากกับ \( \vec{X}\)

กล่าวโดยสรุปว่า ความพยายามในการหา parameters ( \( \vec{W}\)) ที่เหมาะสมคือการพยายามปรับค่า elementใน \( \vec{W}\) จนกว่าจะตั้งฉากกับ \( \vec{X}\) เหมือนกับการพยายามปรับเข็มยาวบนหน้าปัดนาฬิกาให้ตั้งฉากกับเข็มสั้น

กรณีที่ 1 เมื่อ \( \vec{W} \cdot \vec{X} > 0 \) หรือ \( cos(\theta) > 0 \therefore \theta < 90 \text{degrees}\) :

รูปที่ 3

การ update ต้องทำให้ \( \theta \) เพิ่มขึ้นโดย \[ \vec{W_{new}} = \vec{W_{old}} - \vec{X} \longrightarrow (8)\]

กรณีที่ 2 : เมื่อ \( \vec{W} \cdot \vec{X} < 0 \) หรือ \( cos(\theta) < 0 \therefore \theta > 90 \text{degrees}\) :

รูปที่ 4

การ update ต้องทำให้ \( \theta \) ลดลงโดย \[ \vec{W_{new}} = \vec{W_{old}} + \vec{X} \longrightarrow (9)\]

นิยามตัวแปรใหม่

\[ \begin{align*} error &= y - \hat{y} \\ error &= y - sign(\vec{W} \cdot \vec{X}) \end{align*} \]

กรณีที่ 1 \( \vec{W} \cdot \vec{X} > 0 \)

\[ \begin{align*} \vec{W} \cdot \vec{X} &> 0 \\ sign(\vec{W} \cdot \vec{X}) &= 1 \\ \therefore error &= -1 \\ \end{align*} \]

กรณีที่ 2 \( \vec{W} \cdot \vec{X} < 0 \)

\[ \begin{align*} \vec{W} \cdot \vec{X} & < 0 \\ sign(\vec{W} \cdot \vec{X}) & = -1 \\ \therefore error &= 1 \\ \end{align*} \]

นำมาสรุปรวมกันจะได้ว่า \[ \boxed{\therefore error = -sign(\vec{W} \cdot \vec{X}) \longrightarrow (10) } \]

ที่กล่าวมานำไปสู่ข้อสรุป สมการสำหรับปรับปรุง parameters คือ

\[ \begin{equation} \boxed{ \vec{W_{new}} = \vec{W_{old}} + error \cdot \vec{X} \longrightarrow (11) } \end{equation} \]

Python code :

กำหนด Input และ target ในตัวอย่างจะใช้ AND Gate

   
import numpy as np
   
#AND Gate
# input X 
X = np.array([[1,0,0],[1,0,1],[1,1,0],[1,1,1]])
label = np.array([0,0,0,1])

# weighting W + bias
W = np.random.rand(3)

ตัวแปร X คือ Array ของ List ที่ประกอบด้วยสมาชิก 3 ตัว ซึ่งเป็นตัวแทนของ bias, \( x_1,x_2 \) ตามลำดับ (ดูสมการที่ (1) และ (4) ประกอบ) ตัวแปร label แทนค่าของ output ที่ควรจะเป็นเมื่อ \(x_1 \wedge x_2 \)

  
summer_func = lambda w,x : np.dot(w.T,x)
threshold_func = lambda x: 0 if x < 0 else 1    
predict_func = lambda w,x : threshold_func(summer_func(w,x) )

กำหนด function ใช้งานคือ

summer_func หาค่าผลรวมของผลคูณระหว่าง \(w_i,x_i \) ตามสมการที่ (4)
threshold_func หาทิศทางหรือเครื่องหมาย ตามสมการที่ (2)
predict_func รวมเอา summer_func และ threshold_func เข้าด้วยกัน

 
#training

epochs = 10
for e in range(epochs):
    i = np.random.choice(len(X))
    x = X[i]
    a = predict_func(W,x)    
    e = label[i] - a
    W = W + e * x
print(W)

[-2.06467451  1.94237225  0.71971274]

ใช้ learning algorithm ของ perceptron เพื่อปรับค่าของ W โดยกำหนดให้จำนวนรอบไว้ 10 รอบ ค่า W ที่แสดงออกมาคือเป้าหมายที่ต้องการ เป็นชุดตัวเลขที่ทำให้ parameter vector \( \vec{W}\) ตั้งฉากกับ input vector \( \vec{X}\) ตีความได้เป็น bias = -2.06467451 ,\( w_1\) = 1.94237225 ,\( w_2\) = 0.71971274

นำ parameter W ไปทดสอบ

AND Gate
\(X_1 \)	\(X_2 \)	\(z = \vec{W} \cdot \vec{X} \)	\( sign(z) \)	\( X_1 \wedge X_2 \)
1	1	\(-2.06467451 + 1 \times 1.94237225 + 1 \times 0.71971274 = 0.597 \)	1	1
1	0	\(-2.06467451 + 1 \times 1.94237225 + 0 \times 0.71971274 = -1.223 \)	0	0
0	1	\(-2.06467451 + 0 \times 1.94237225 + 1 \times 0.71971274 = -1.345 \)	0	0
0	0	\(-2.06467451 + 0 \times 1.94237225 + 0 \times 0.71971274 = -2.064 \)	0	0

จะนำเอา \( \vec{W} = \text{[-2.06467451 1.94237225 0.71971274]} \) ไป plot กราฟดูเทียบกับตำแหน่งของค่า \(X_1,X_2 \) เพื่อให้เห็นภาพชัดชึ้นว่า parameter นี้สามารถใช้แบ่งกลุ่มข้อมูลได้อย่างไร

รูปที่ 5

ในรูปที่ 5 เส้นสีขาวคือเส้นตรงที่เป็นตัวแทนของ \(\vec{W} \) ที่สามารถคั่นระหว่างจุด \(X_1,X_2 =(1,1) \) ออกจากจุดอื่นได้ สอดคล้องกับผลของ And Gate

สรุป

perceptron เป็นตัวแบบหรือ algorithm (แล้วแต่มุมมอง) ที่มีองค์ประกอบสำคัญคือ

Input : \( x_i \in \Re^d\)
Weight : \( w_i \in \Re^d\)
bias : \( b \in \Re\)
Aggregation function : \( g(x) = \sum_{i=1}^{n}w_i \cdot x_i + b\)
Threshold function : \( t(z) = \begin{cases} 1 \quad & \text{if }z \geq \theta \\ -1 \quad & \text{otherwise } \end{cases} \)
Learning algorithm สำหรับการปรับแต่งค่า parameters (Weight และ bias) จาก error (ความต่างระหว่างค่าสังเกตุกับ output)

ข้อจำกัด

ตัวแบบนี้ทำงานได้ดีในรูปแบบปัญหา linear separator [10] หรือ binary classification แต่ยังไม่สามารถนำมาใช้กับปัญหาที่เป็น nonlinear classification อย่าง XOR Gate ได้ จึงมีการพัฒนาตัวแบบนี้ไปสู่ multiplayer perceptron [14] ซึ่งจะได้กล่าวถึงต่อไป