
Pattern Recognition ∎ (∎∎∎∎) ∎∎∎–∎∎∎

Contents lists available at ScienceDirect

Pattern Recognition

journal homepage: www.elsevier.com/locate/pr

Multivariate alternating decision trees

Hong Kuan Sok a,n, Melanie Po-Leen Ooi a,b, Ye Chow Kuang a, Serge Demidenko a,c
a Advanced Engineering Platform & Electrical and Computer Systems Engineering, School of Engineering, Monash University, 47500 Bandar Sunway, Malaysia
b School of Engineering & Physical Sciences, Heriot-Watt University, 62200 Putrajaya, Malaysia
c School of Engineering & Advanced Technology, Massey University, Private Bag 102904, Auckland 0745, New Zealand

Article info
Article history: Received 5 May 2015; Received in revised form 7 July 2015; Accepted 17 August 2015
Keywords: Alternating decision tree; Boosting; Multivariate decision tree; Lasso; LARS

http://dx.doi.org/10.1016/j.patcog.2015.08.014
0031-3203/© 2015 Elsevier Ltd. All rights reserved.
abstract

Decision trees are comprehensible, but at the cost of a relatively lower prediction accuracy compared to other powerful black-box classifiers such as SVMs. Boosting has become a popular strategy to create an ensemble of decision trees to improve their classification performance, however at the expense of the comprehensibility advantage. To this end, the alternating decision tree (ADTree) was proposed to allow boosting within a single decision tree so as to retain comprehensibility. However, existing ADTrees are univariate, which limits their applicability. This research proposes a new algorithm: the multivariate ADTree. It presents and discusses its different variations (Fisher's ADTree, Sparse ADTree, and regularized Logistic ADTree) along with their empirical validation on a set of publicly available datasets. It is shown that the multivariate ADTree achieves a high prediction accuracy comparable to that of ensembles of decision trees, while retaining good comprehensibility close to that of individual univariate decision trees.

© 2015 Elsevier Ltd. All rights reserved.
1. Introduction

Decision trees are among the most powerful and popular classifiers available. They are directed acyclic graphical models that solve classification problems using a symbolic representation, i.e., a graph of decision nodes connected through edges (Fig. 1(a)). As a result, they follow flowchart-like human logic and reasoning, making them highly comprehensible. A decision tree models the problem domain as a set of decision rules. Such a model is transparent and understandable to specialists in the relevant application field [1]. For example, in [2] medical experts used quantitative information obtained from an alternating decision tree model to gain a better understanding of the relation between disease phenotypes and affection status. This comprehensibility property makes decision trees highly accessible to users beyond just the machine learning community, and they can therefore be found in a wide range of applications such as business [3], manufacturing [4], computational biology [5], bioinformatics [6], etc.
It is often possible to further improve the classification accuracy of individual decision trees by combining a number of decision trees to make a majority-voted decision [7]. There are two popular strategies to achieve this: bagging [8] and boosting [9]. Unfortunately, an ensemble of decision trees produces many variations of the symbolic representation, which causes the overall classifier to become large, complex, and difficult to interpret. This negates the comprehensibility advantage of being a decision tree [10]. The problem of large and incomprehensible boosted decision trees led to the invention of the alternating decision tree (ADTree), which was designed to retain interpretability within the boosting paradigm [10]. Rather than building a decision tree in every boosting cycle, simpler decision stumps are created.
Fig. 1(b) shows a graphical illustration of the ADTree. Similar to the decision tree in Fig. 1(a), the ADTree is also a directed acyclic graphical model. However, the symbolic meaning of each node and the way in which the nodes are connected differ. It uses neither leaf nodes as terminal nodes, nor decision nodes as internal nodes. Instead, many decision stumps (or one-level decision trees) are combined to obtain a special representation in which each of the stumps consists of a decision node and two prediction nodes.

The ADTree can be viewed as a loose generalization of standard decision trees, boosted decision trees, and boosted decision stumps [10] for the following reasons. First, an ADTree can be used as an alternative representation of any standard decision tree model with the same functionality. In addition, the ADTree allows multiple decision stumps under the same prediction node to obtain a majority-voted decision. Boosting can thus be implemented directly within the same tree, as opposed to the conventional way of creating boosted decision trees or boosted decision stumps. There are a number of extensions of the ADTree such as the multi-label ADTree [11], the multi-class ADTree [12], and the complex-feature ADTree [13]. The ADTree has been successfully applied in various applications such as genetic disorders [14], corporate performance prediction [3], and bioinformatics [15].
Unfortunately, there are two major drawbacks of using univariate decision nodes in the ADTree. First, as with other univariate decision trees, a split based on a single feature is an axis-parallel partitioning of the input space. This leads to a high bias and produces large decision trees in classification problems that have co-dependent features.

n Corresponding author. Tel.: +60 3 5514 6238; fax: +60 3 5514 6207.
E-mail address: sok.hong.kuan@monash.edu (H.K. Sok).
Fig. 1. Decision trees. (a) A classic decision tree consisting of decision nodes as internal nodes and leaf nodes as terminal nodes; (b) an alternating decision tree that can be used to represent the standard decision tree shown in part (a) to make the same predictions; and (c) accommodation of boosting in the ADTree, whereby further decision stumps can be added to any existing prediction node (highlighted in circles) to obtain a majority-voted decision.
The resulting large and complex decision trees complicate the interpretation process. Second, ADTree induction is based on the probably approximately correct (PAC) learning framework, which requires the weak learner to achieve an error rate ε slightly better than random guessing for a binary-class problem; formally, ε ≤ 0.5 − Ψ for a small constant Ψ (known as the edge). Unfortunately, simple univariate decision stumps sometimes fail to satisfy this weak learning condition. This causes the boosting procedure to fail in producing a functional ADTree model [16].
The aim of this paper is to present a novel multivariate alternating decision tree learning algorithm with boosting capability that offers the improved classification performance of boosted decision trees while remaining comprehensible. The objectives are to:

1. Outperform the existing univariate ADTree and (unboosted) multivariate decision trees in terms of prediction accuracy while offering good comprehensibility;
2. Match the performance of univariate decision trees on univariate problems while beating them on multivariate datasets;
3. Provide superior comprehensibility compared to ensemble-based decision trees.
There are several distinct building blocks in the existing ADTree algorithm that can be restructured to induce a multivariate ADTree. In this paper, three possible variations are explored, namely Fisher's ADTree, Sparse ADTree [17], and regularized Logistic ADTree. The Sparse ADTree presented in the earlier paper was the first attempt to induce a multivariate ADTree. The current paper presents significantly new and further developed results that include two additional multivariate alternating decision tree (ADTree) designs. This broadens the coverage of the material and its understanding, as well as its applicability by practitioners and researchers in the field. In addition, the experiments are significantly more extensive, with an extended discussion on the validity, use, and applicability of multivariate alternating decision trees. All three multivariate ADTree variants are tested on a set of real-world datasets against a number of established decision tree learning algorithms: the original univariate ADTree [10]; the univariate decision trees C4.5 [18] and CART [19]; a multivariate decision tree, Fisher's decision tree [20]; and ensembles of decision trees, Boosted C4.5 and oblique Random Forest [21]. Note that there are other decision tree variants presented in the literature (e.g., [22,23]). However, the benchmarking algorithms were chosen based on the availability of source code, and they are used as representatives of the different decision tree families. This is done to compare the overall prediction accuracy, induction time, tree size, and complexity/comprehensibility against the different families of decision trees. For statistical verification and comparison, a standard 10×10-fold stratified cross-validation is performed on all datasets to generate the performance estimates.

The rest of the paper is organized as follows: Section 2 provides a brief literature review on the learning method, boosting, and the ADTree. The proposed multivariate ADTree algorithms are presented in Section 3. The experimental setup and the obtained results are given in Section 4 together with a detailed discussion. Section 5 presents conclusions and outlines future work.
2. Background on alternating decision trees

2.1. Supervised learning framework

For better readability, the notation used in this paper is first described. Vectors are typed in bold lowercase (e.g., x) and are all column vectors unless specified otherwise. Scalars are typed in regular font (e.g., λ). Matrices are given in bold capitals (e.g., X). A particular entry in a vector is indexed with a scalar; for example, the i-th entry of a column vector x is denoted as x_i. For matrices, the entry in the i-th row and j-th column of a matrix X is denoted as X_ij. The entire i-th row of the matrix X is denoted as X_i, and the entire j-th column as X_j.

Under supervised learning, the training dataset [X; y] consists of a set of n labeled samples, where each sample x ∈ R^p is a real-valued column vector of p features, and its corresponding label y ∈ {+1, −1} assumes either the positive or the negative class for a binary classification problem. The dimension of the design matrix X is n × p, and the column vector y is of length n. The i-th row of the design matrix X, or X_i, refers to the i-th sample as a transposed vector, i.e., x^T. The goal of a decision tree learning algorithm is to learn a single classification model. For ensemble learning, a weak learner is repeatedly invoked to learn multiple models.
2.2. Boosting

Boosting is an important development in the field of machine learning. It allows for any choice of the underlying learning algorithm as long as the weak learning condition ε ≤ 0.5 − Ψ is satisfied for a binary-class problem. The paper [24] shows that decision trees are a popular choice as weak learners due to their inherent instability to small variations in the training dataset. Boosting creates such variations through a weight distribution over the training samples by sequential reweighting. This paper implements two different boosting algorithms to induce the multivariate ADTree, namely AdaBoost and LogitBoost (see Table 1).

AdaBoost initializes the weight distribution w as a uniform one with an initial weight value of 1/n. The weight of sample i in the t-th boosting procedure is denoted as w_i^(t). AdaBoost then repeats for T boosting procedures: it obtains a weak model f_t(x) from the weak learner, determines the linear coefficient α_t of the weak model based on its error ε_t, and updates the weight distribution before the next boosting procedure begins. The indicator function I(·) returns 1 if the Boolean expression inside it evaluates to True. The output is obtained through a linear combination of the weak models.

Table 1
AdaBoost and LogitBoost algorithms.

AdaBoost
Input: training dataset [X; y]
1. Initialize: w_i = 1/n ∀i
2. For t = 1, ..., T
   2.1. Obtain f_t(x) from a weak learner
   2.2. Determine α_t = (1/2) log((1 − ε_t)/ε_t) with ε_t = Σ_{i=1}^{n} w_i^(t) I[f_t(X_i) ≠ y_i]
   2.3. Update the weight distribution w_i^(t+1) = w_i^(t) exp(−y_i α_t f_t(X_i))
Output: F(x) = Σ_{t=1}^{T} α_t f_t(x)

LogitBoost
Input: training dataset [X; y* = (y + 1)/2]
1. Initialize: w_i = 1/n, G(X_i) = 0, and p(X_i) = 0.5 ∀i
2. For t = 1, ..., T
   2.1. Compute the working response z_i = (y*_i − p(X_i)) / (p(X_i)(1 − p(X_i))) and the weights w_i = p(X_i)(1 − p(X_i))
   2.2. Fit g_t(x) by a weighted least-squares regression of z_i onto X_i
   2.3. Update G(X_i) ← G(X_i) + (1/2) g_t(X_i) and p(X_i) ← exp(G(X_i)) / (exp(G(X_i)) + exp(−G(X_i)))
Output: F(x) = Σ_{t=1}^{T} g_t(x)

For LogitBoost, the uniform weight distribution is initialized in the same way as in AdaBoost. In addition, it also keeps track of the positive-class probability estimate p(x) and the regression value G(x) for all n samples. It then repeats for T boosting procedures. The working response (or pseudo-label z) and the weight distribution are updated at the beginning of each boosting procedure. A regression function g(x) is fitted by solving a weighted least-squares regression problem. The regression values of all training samples are updated to compute the new probability estimate of each sample at the end of each boosting procedure. The output is a regression function, whereby classification is achieved by taking the sign of the summation.
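As a concrete illustration of the AdaBoost column of Table 1, the following Python sketch fits T weak models and combines them with the coefficients α_t. It is a minimal illustration only, not the implementation used in this paper: the weak_learner callable and the small numerical safeguards are assumptions introduced for this example.

```python
import numpy as np

def adaboost(X, y, weak_learner, T=10):
    """Minimal AdaBoost sketch following Table 1 (binary labels y in {-1, +1}).

    weak_learner(X, y, w) is assumed to return a function f(X) -> array of {-1, +1}.
    """
    n = len(y)
    w = np.full(n, 1.0 / n)               # 1. uniform initial weight distribution
    models, alphas = [], []
    for t in range(T):
        f = weak_learner(X, y, w)          # 2.1 weak model f_t from the weighted data
        miss = (f(X) != y)
        eps = np.sum(w[miss])              # 2.2 weighted error eps_t
        if eps >= 0.5:                     # weak-learning condition violated: stop boosting
            break
        alpha = 0.5 * np.log((1.0 - eps) / max(eps, 1e-12))
        w = w * np.exp(-alpha * y * f(X))  # 2.3 reweight samples
        w = w / w.sum()
        models.append(f)
        alphas.append(alpha)
    # Output: F(x) = sum_t alpha_t f_t(x); classify by its sign
    return lambda Xq: np.sign(sum(a * f(Xq) for a, f in zip(alphas, models)))
```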
2.3. Alternating decision tree

The ADTree uniquely bridges the gap between boosting and decision tree algorithms. Instead of the conventional approach of building a forest of decision trees, the boosting procedure is incorporated within a single decision tree to facilitate comprehensibility. The ADTree consists of alternating layers of decision nodes and prediction nodes, starting with a root prediction node. Mathematically, an ADTree can be described as a set of decision rules as in (1). Each decision rule returns one of the following: a positive prediction score α_+, a negative prediction score α_−, or a zero score, according to a nested if-statement:

r_t(x): if (precondition) then [if (condition) then α_+ else α_−] else 0

A precondition is a conjunction of conditions, while a condition itself is a Boolean predicate embedded in a decision node.

ADTree model = { r_t(x) : t = 0, ..., T }    (1)

To perform classification, an input sample is sorted top-down from the root prediction node. Instead of following a single path from the root decision node to one of the leaf nodes as in a standard decision tree, one or more paths can be traversed in an ADTree, because there may be multiple decision stumps under the same prediction node. The prediction scores of all traversed prediction nodes are summed to make a prediction of the class label. The sign of the summation indicates either the positive or the negative class label. The magnitude of the summation is a good indication of the classification confidence.

In terms of learning, the ADTree model is grown through a boosting algorithm. AdaBoost was implemented in the seminal work on the ADTree [10]. In later years, some research works on the ADTree used different boosting algorithms, such as AdaBoost.MH to induce a slightly different ADTree model to handle multi-label problems [11], while another used LogitBoost to address multi-class problems [12].

Fig. 2. Illustration of ADTree induction, where a new decision stump is added to one of the existing prediction nodes after each boosting procedure. The weak learner generates a set of base conditions C that become potential candidates to form the next decision node. The best combination of base condition and prediction node is selected to form the new decision stump.

With reference to Fig. 2, the root prediction node is first constructed given the original dataset. The rest of the induction is repeated for a specified number of boosting iterations. Each boosting cycle adds a new decision stump to one of the "best" prediction nodes to optimally extend the ADTree. The precondition refers to the choice of prediction node selected for insertion into the ADTree. The condition refers to the decision node of the decision stump. The two prediction values refer to the prediction nodes of the decision stump. The weight distribution over the training dataset is then updated based on the newly added decision rule. This helps to guide the next weak learner when generating a new set of base conditions.

The weak learner shown in Fig. 2 is independent of the core ADTree induction. The existing univariate ADTree uses an exhaustive approach to generate a set of univariate base conditions, each based on a different feature, given the weight distribution. The next part of the paper proposes different methods to replace this weak learner in order to introduce multivariate decision nodes. This allows the induction of multivariate base conditions to build a multivariate ADTree.
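To make the set-of-rules view in (1) and the scoring procedure concrete, the sketch below represents an ADTree as a list of decision rules and sums the prediction scores of all traversed prediction nodes; the sign gives the class label and the magnitude the confidence. The Rule data structure and the toy rules are illustrative assumptions rather than the authors' representation.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Rule:
    """One decision rule r_t(x): if precondition then (alpha_plus if condition else alpha_minus) else 0."""
    precondition: Callable[[list], bool]   # conjunction of conditions along the path
    condition: Callable[[list], bool]      # Boolean predicate of the decision node
    alpha_plus: float
    alpha_minus: float

    def score(self, x: list) -> float:
        if not self.precondition(x):
            return 0.0
        return self.alpha_plus if self.condition(x) else self.alpha_minus

def adtree_predict(rules: List[Rule], x: list) -> int:
    """Sum prediction scores over all traversed paths; the sign gives the class label."""
    total = sum(r.score(x) for r in rules)
    return 1 if total >= 0 else -1

# Example: a small ADTree with a univariate and a multivariate decision node.
rules = [
    Rule(lambda x: True, lambda x: True, 0.4, 0.4),                        # root prediction node
    Rule(lambda x: True, lambda x: x[0] <= 2.5, 0.7, -0.6),                # univariate stump
    Rule(lambda x: x[0] <= 2.5, lambda x: 0.3*x[1] - 0.2*x[2] <= 1.0, 0.5, -0.4),
]
print(adtree_predict(rules, [1.0, 2.0, 0.5]))
```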
3. Proposed multivariate ADTree

As discussed above in Section 2, the ADTree requires a set of base conditions at the beginning of each boosting procedure (Fig. 2). These are the potential decision node candidates that dictate the characteristics of the ADTree model. The weak learner is responsible for generating the set of base conditions using the weighted training dataset.
As discussed previously, the existing ADTree implements a univariate weak learner (Fig. 3(a)), in which a set of univariate base conditions, each thresholding a single feature at some value θ, is used to evaluate the split. The selected j-th feature x_j forms the univariate split. Fig. 3(b)-(d) show the three proposed approaches to replace the univariate weak learner in order to induce a multivariate ADTree.

In the first approach (Fig. 3(b)), the reweighted training dataset is provided at the beginning of each boosting procedure. The aim is to use Fisher's discriminant analysis to obtain a vector β that forms an artificial feature x^Tβ with a higher discriminative power compared to that of the individual features. This results in a multivariate base condition. This tree is referred to as Fisher's ADTree (Section 3.1).
Fig. 3(c) shows the sparse ADTree, in which feature selection is incorporated to prune irrelevant features out of the multivariate base condition. Sparse linear discriminant analysis [25] is used to zero out redundant features in the discriminative vector β. As a result, a sparse multivariate base condition is obtained (Section 3.2). Fisher's ADTree and the sparse ADTree are based on AdaBoost, where the weight distribution has to be explicitly passed to the weak learner. An attractive property of LogitBoost is that the weights are intrinsically part of a linear regression problem. Therefore, using LogitBoost allows us to take advantage of well-developed regularized regression techniques [26] to perform feature selection. In this paper, the additive logistic regression interpretation of LogitBoost is used to form a multivariate base condition g(x) as a regression function rather than a Boolean function f(x) (see Fig. 3(d)). This is the regularized LADTree (Section 3.3).
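For all three proposed variants, once a projection vector β is available, a multivariate base condition of the form x^Tβ ≤ θ still has to be formed. The following sketch shows one plausible way of picking the threshold θ on the artificial feature by minimizing the weighted error; this threshold-selection criterion is an assumption introduced for illustration and may differ from the stump evaluation actually used in the ADTree induction.

```python
import numpy as np

def multivariate_base_condition(X, y, w, beta):
    """Form a multivariate base condition x^T beta <= theta from a projection vector beta.

    X: (n, p) design matrix, y: labels in {-1, +1}, w: boosting weights, beta: (p,) vector.
    Returns (theta, condition) where condition(Xq) evaluates the Boolean predicate.
    """
    z = X @ beta                                  # artificial feature values
    best_theta, best_err = None, np.inf
    for theta in np.unique(z):
        pred = np.where(z <= theta, 1, -1)        # tentative labelling of the two sides
        # allow either side of the threshold to be the positive one
        err = min(np.sum(w[pred != y]), np.sum(w[pred == y]))
        if err < best_err:
            best_theta, best_err = theta, err
    return best_theta, (lambda Xq: Xq @ beta <= best_theta)
```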
3.1. Fisher's ADTree

Fisher's discriminant [27] is a well-established supervised technique that finds a subspace onto which the projected samples are well separated between classes according to Fisher's ratio J(β) in (2), i.e., the ratio of the projected between-class covariance β^T Σ_b β to the projected within-class covariance β^T Σ_w β. The aim is to maximize this ratio with respect to β by solving a generalized eigenvalue problem. The optimized parameter β is then used in the proposed Fisher's ADTree to form the artificial feature x^Tβ, which is a linear combination of all original features. This results in a multivariate decision node, since it uses all features rather than just the individual feature j used in the univariate variant. The number of dimensions of the subspace is determined by the total number of classes K: there can be at most K − 1 discriminative projections. For a binary-class problem, this yields a single discriminative vector β.

J(β) = (β^T Σ_b β) / (β^T Σ_w β)    (2)

where Σ_b and Σ_w are respectively the between-class and within-class covariances of the original dataset. They are estimated from the training dataset using (3) and (4). The mean vector of the entire training dataset is denoted as μ, while the mean vector of class k is denoted as μ_k.

Σ_b = Σ_{k=1}^{K} (μ_k − μ)(μ_k − μ)^T    (3)

Σ_w = Σ_{k=1}^{K} Σ_{i ∈ k-th class} (X_i − μ_k)(X_i − μ_k)^T    (4)

Fig. 3. Weak learners for: (a) the univariate ADTree, where a set of univariate base conditions is obtained through an exhaustive approach, one for each feature indexed by j; (b) Fisher's ADTree, which produces a single multivariate base condition since all features are used to form the artificial feature x^Tβ rather than the j-th feature x_j; (c) the sparse ADTree, where the vector β can be sparse with many zero elements to facilitate feature selection; and (d) the regularized LADTree, where the base condition g(x) is a regression function rather than a Boolean function f(x).
In the original Fisher's discriminant estimation, the weight distribution is not incorporated as part of the optimization formulation. To use Fisher's discriminant under the boosting framework, the learning of β needs to be adaptive to the weight distribution. Otherwise, every boosting procedure would produce an identical β, which defeats the purpose of boosting.

To achieve weight adaptation, weighted versions are used as in (5) and (6) to find the discriminative vector β, where the weighted mean vector of the k-th class and the weighted overall mean vector are shown in (7) and (8) respectively. The class labels y are required in this computation to obtain the discriminative vector β. The details of the weighted Fisher's discriminant are given in Algorithm 1, which forms the weak learner for Fisher's ADTree. Given β, the artificial feature is generated through the linear projection x^Tβ and is used to form the multivariate base condition (see Fig. 3(b)).

Σ_b = Σ_{k=1}^{K} (μ_k − μ)(μ_k − μ)^T    (5)

Σ_w = Σ_{k=1}^{K} Σ_{i ∈ k-th class} w_i (X_i − μ_k)(X_i − μ_k)^T    (6)

μ_k = ( Σ_{i ∈ k-th class} w_i X_i ) / ( Σ_{i ∈ k-th class} w_i )    (7)

μ = (1/K) Σ_{k=1}^{K} μ_k    (8)

A similar approach is implemented in Fisher's decision tree [20], which is a multivariate extension of C4.5. It must be emphasized that the proposed Fisher's ADTree differs from the existing Fisher's decision tree in its capability to "boost" multiple decision stumps under the same prediction node to improve the final prediction.
Algorithm 1. Weighted Fisher's discriminant

Input: training dataset [X; y] and weight distribution w
The statistical procedure to extract the discriminative information from [X; y] comprises:
1. Compute the weighted means of the positive and negative classes, μ_1 and μ_2 respectively, using (7);
2. Compute the weighted between-class covariance matrix Σ_b using (5);
3. Compute the weighted within-class covariance matrix Σ_w using (6);
4. Maximize Fisher's ratio (2) by solving the generalized eigenvalue problem Σ_b β = λ Σ_w β, where λ and β are also referred to as the eigenvalue and eigenvector respectively.
Output: β
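A minimal sketch of Algorithm 1 is given below, assuming binary labels and NumPy/SciPy; the small ridge term added to Σ_w for numerical stability is an assumption of this example and is not part of the algorithm as stated.

```python
import numpy as np
from scipy.linalg import eigh

def weighted_fisher_discriminant(X, y, w):
    """Weighted Fisher's discriminant (Algorithm 1 sketch) for binary labels y in {-1, +1}.

    Returns the discriminative vector beta maximizing the weighted Fisher ratio (2),
    obtained from the generalized eigenproblem Sigma_b beta = lambda Sigma_w beta.
    """
    classes = np.unique(y)
    p = X.shape[1]
    mus = []
    for k in classes:
        idx = (y == k)
        wk = w[idx]
        mus.append((wk[:, None] * X[idx]).sum(axis=0) / wk.sum())   # weighted class means (7)
    mu = np.mean(mus, axis=0)                                       # overall mean (8)
    Sb = np.zeros((p, p))
    Sw = np.zeros((p, p))
    for k, mu_k in zip(classes, mus):
        idx = (y == k)
        d = (mu_k - mu)[:, None]
        Sb += d @ d.T                                               # between-class covariance (5)
        Xc = X[idx] - mu_k
        Sw += (w[idx][:, None] * Xc).T @ Xc                         # weighted within-class covariance (6)
    Sw += 1e-6 * np.eye(p)                                          # ridge for numerical stability (assumption)
    evals, evecs = eigh(Sb, Sw)                                     # generalized eigenvalue problem
    return evecs[:, np.argmax(evals)]                               # eigenvector of the largest eigenvalue
```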
3.2. Sparse ADTree

This section describes the concept of the sparse ADTree; a detailed implementation can be found in the earlier work [17]. The sparse ADTree design uses Sparse Linear Discriminant Analysis (SLDA) [25] in place of Fisher's discriminant analysis. This allows redundant features to be zeroed out and removed from the classifier, thereby achieving a more optimal subset of features. However, SLDA was not originally designed for boosting. Thus the underlying solver of SLDA is modified to take the weight distribution into account. The background on SLDA (presented below in Sections 3.2.1 and 3.2.2) explains how it can be adapted for boosting the ADTree.

3.2.1. Sparse Linear Discriminant Analysis

Sparse Linear Discriminant Analysis (SLDA) allows some elements of β to be exactly zero, and hence effectively eliminates the corresponding features from the discriminant analysis. To achieve this, SLDA uses Optimal Scoring [28] to incorporate a sparsity-inducing penalty for the purpose of feature selection. Optimal Scoring is simply a different formulation of Fisher's discriminant in that they both produce discriminative β vectors that are equivalent up to a factor [28].

Optimal Scoring transforms the categorical labels y of the training dataset into real-valued labels Yθ, where Y is the indicator matrix of y with a one-hot representation, i.e., Y_ik = 1 if the i-th training sample is assigned to class k, and 0 elsewhere in the i-th row. The vector θ consists of real-valued entries that assign a real value to each class indexed by k. The Optimal Scoring formulation is shown in (9), with the constraint θ^T D_π θ = 1 imposed to avoid the null solution.

min_{θ,β} n^{-1} ‖Yθ − Xβ‖²₂  subject to θ^T D_π θ = 1,  where D_π = n^{-1} Y^T Y    (9)

Optimal Scoring in SLDA is further constrained by the Elastic Net, one of the well-examined and well-documented Lasso-type penalization techniques [29], resulting in (10). The Elastic Net is a convex combination of both the Ridge ‖β‖²₂ [30] and Lasso ‖β‖₁ [31] penalizations. With the l₁-norm penalty on β, the Lasso forces some elements of β to be exactly zero, while the Ridge penalty stabilizes β to ensure that a unique solution is obtained and encourages grouping of correlated features (similar β magnitudes for correlated features). The regularization parameters for the Lasso and Ridge penalties are λ₁ and λ₂ respectively.

min_{θ,β} n^{-1} ‖Yθ − Xβ‖²₂ + λ₂ ‖β‖²₂ + λ₁ ‖β‖₁  subject to θ^T D_π θ = 1    (10)

The SLDA authors proposed a simple iterative algorithm to solve (10) for the two parameters: the optimal score vector θ and the discriminative vector β. First, θ is held fixed while optimizing β, and then β is held fixed while solving for θ. This process is repeated until convergence takes place. The detailed implementation is shown in Algorithm 2.

Algorithm 2. SLDA

Input: training dataset, i.e., X and Y
1. Initialize the trivial optimal score vector θ_0, which consists of all 1s. For initialization, set θ = (I − θ_0 θ_0^T D_π) θ*, where θ* is a random vector, and normalize θ so that θ^T D_π θ = 1;
2. Repeat until convergence:
   2.1. For fixed θ, solve (11) to obtain β using the LARSEN [29] algorithm;
   2.2. For fixed β, compute θ = D_π^{-1} Y^T Xβ; the optimal score vector θ is then ortho-normalized to be orthogonal to θ_0 and to satisfy θ^T D_π θ = 1.
Output: β and θ

3.2.2. Adaptation for the sparse ADTree

As shown in Fig. 3(c), a β solution is required to form the multivariate base condition. In order to adapt SLDA for boosting, a simple modification is made to the LARSEN algorithm (step 2.1 of Algorithm 2), as proposed in [17], to guide the learning of β based on the weight distribution. For a fixed θ, the optimization problem in (10) can be solved easily using the LARSEN algorithm by reformulating (10) into a Lasso regression problem (11) on the augmented training dataset (12).

min_β n^{-1} ‖Y*θ − X*β‖²₂ + λ₁ ‖β‖₁    (11)

X*_{(n+p)×p} = (1 + λ₂)^{-1/2} [ X ; sqrt(λ₂) I ],   Y*_{(n+p)} = [ Y ; 0 ]    (12)

For the Lasso regression problem (11), the β solution is no longer unique. Instead, β becomes a piecewise-linear function of the penalty and is sparse for large values of the penalty. The whole family of β solutions as a function of λ₁ is referred to as the regularization path. The regularization path starts from the null (all-zero) β solution and ends with the full β solution. LARSEN finds the analytical breakpoints of this path, which yields a series of β^(k) solutions indexed by k. However, it does not accommodate the weight distribution in solving for each possible β^(k) solution. In order to implement boosting, it is necessary to incorporate the weight distribution when finding the linear breakpoints. The detailed implementation can be found in [17].

Given the series of β^(k) solutions (the regularization path) from Algorithm 2, only one β solution is needed to form the multivariate base condition. Model selection techniques can be applied to select the optimal β solution. In the sparse ADTree, generalized cross-validation (GCV) as in (13) is implemented for this purpose. It yields a measure of how well the estimated model (β solution) fits the output, taking into account the training dataset size n and the complexity of β in terms of its degrees of freedom d [32]. The most optimal β solution is the one with the lowest GCV measure. It should be noted that other model selection techniques can be implemented, such as the Akaike Information Criterion [33] or the Bayesian Information Criterion [34]. Depending on the application, the adopted choice may have some effect on the decision node complexity and the tree size.

GCV = ‖Y*θ − X*β‖²₂ / (n − d)²    (13)

One additional feature of this sparse ADTree is that it allows users to preselect the number of features for a decision node. For example, the sparse ADTree can be used to generate a univariate ADTree by selecting k = 1 to obtain a β solution with one active feature.
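The elastic-net-to-Lasso reduction in (11)-(12) and the GCV-based model selection in (13) can be sketched as follows. scikit-learn's lasso_path is used here purely as a stand-in for the LARSEN solver, and the boosting-weight modification of [17] is not reproduced, so this is only an unweighted illustration of the mechanics.

```python
import numpy as np
from sklearn.linear_model import lasso_path

def sparse_discriminant_direction(X, Y_theta, lam2=1e-3):
    """Sketch of (10)-(13): augment the data as in (12), trace a Lasso path for (11),
    and pick the beta solution with the lowest GCV (13).

    X: (n, p) design matrix; Y_theta: (n,) real-valued response Y @ theta for a fixed theta.
    """
    n, p = X.shape
    # Augmented dataset (12): X* = (1 + lam2)^(-1/2) [X; sqrt(lam2) I], Y*theta = [Y theta; 0]
    X_aug = np.vstack([X, np.sqrt(lam2) * np.eye(p)]) / np.sqrt(1.0 + lam2)
    y_aug = np.concatenate([Y_theta, np.zeros(p)])
    alphas, coefs, _ = lasso_path(X_aug, y_aug)        # coefs has shape (p, n_alphas)
    best_beta, best_gcv = None, np.inf
    for j in range(coefs.shape[1]):
        beta = coefs[:, j]
        d = np.count_nonzero(beta)                     # degrees of freedom ~ number of active features
        if d == 0 or d >= n:
            continue
        rss = np.sum((y_aug - X_aug @ beta) ** 2)
        gcv = rss / (n - d) ** 2                       # GCV measure as in (13)
        if gcv < best_gcv:
            best_beta, best_gcv = beta, gcv
    if best_beta is None:                              # fall back to the densest solution
        best_beta = coefs[:, -1]
    return best_beta
```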
3.3. Regularized logistic ADTree

The multivariate ADTree can also be induced based on a different boosting technique. In this research, the use of LogitBoost [35] is specifically investigated because of its unique structure. LogitBoost is an additive logistic regression interpretation of AdaBoost [35]. It links AdaBoost to classical logistic regression, which is a probabilistic discriminative model for classification tasks. Logistic regression models the log-odds ratio between the positive and negative class posteriors Pr(y = +1|x) and Pr(y = −1|x) with a regression model G(x) as follows:

log( Pr(y = +1|x) / Pr(y = −1|x) ) = G(x)    (14)

LogitBoost is a nonparametric extension of (14). It uses a linear combination of regression models (15) to estimate the log-odds ratio instead of the fixed parametric form for G(x). The total number of regression models is M.

G(x) = Σ_{m=1}^{M} g_m(x)    (15)

In each boosting procedure of LogitBoost, the aim is to solve a weighted least-squares regression problem (the details were presented earlier in Section 2.2). Hence, LogitBoost can be viewed as an Iteratively Reweighted Least Squares (IRLS) regression formulation. Any regression model g_m(x) can be implemented to induce G(x). To form a multivariate ADTree, the regression model g_m(x) is restricted to be of a linear type g_m(x): x^Tβ, which results in solving (16). The matrix W is diagonal of dimension n × n. Each diagonal entry indicates the weight value of one training sample. The vector z is the updated pseudo-label of length n. The only parameter to optimize is β. The weights are incorporated directly such that the output and the design matrix are W^{1/2}z and W^{1/2}X respectively. Here the optimization process is of a standard linear regression type.

min_β n^{-1} ‖W^{1/2}z − W^{1/2}Xβ‖²₂    (16)

By expressing the problem in the form of (16), it becomes possible to take advantage of the vast regularized linear regression literature to accommodate the boosting weight distribution. Note that for the sparse ADTree learning algorithm, the LARSEN algorithm had to be modified to accommodate the weight distribution. For the regularized LADTree, the weight distribution is assimilated as a part of the linear regression problem in minimizing the residual between W^{1/2}z and W^{1/2}Xβ. This also alleviates the need to convert the categorical responses to real-valued ones through optimal scoring.

Unfortunately, the use of (16) alone is still insufficient, since a constraint or penalization function must be placed on β in order to provide the desired decision node characteristics (e.g., the feature selection capability) and thereby shape the characteristics of the ADTree decision nodes. Therefore, a penalization function J(β) is added to (16) to obtain the constrained regression solution shown in (17). From a Bayesian perspective, this is effectively equivalent to placing a prior on the β solution and maximizing the posterior likelihood.

min_β n^{-1} ‖W^{1/2}z − W^{1/2}Xβ‖²₂ + J(β)    (17)

There is a wide range of penalization techniques that can be used for J(β) in (17). Classical ones include the Ridge (‖β‖²₂), the Lasso (‖β‖₁), and the Elastic Net, as presented previously in Section 3.2. Their solvers can be implemented in their classical form in each boosting procedure to produce multivariate base conditions for the regularized LADTree induction (see Fig. 3(d)). In this paper, two different variants of the regularized LADTree, using the Lasso and the Elastic Net respectively, are presented.

The proposed regularized LADTree has a modular design that can seamlessly incorporate different types of linear regularization techniques. This essentially gives the ADTree the ability to change its inherent model selection (or feature selection) approach without affecting the learning algorithm itself. The use of different regularization techniques also allows users to preselect the number of features for their given application. For example, selecting k = 1 in the original LARS (the solver for Lasso regression) [36] or in the LARSEN algorithm will generate a univariate tree. The regularized LADTree's modularity and flexibility are the greatest advantage of this approach over all other ADTree designs. LADTree users can apply any of the newer classical penalization techniques [26] and select any number of features that they wish to incorporate in order to customize the tree for their specific applications.

4. Comparative experimental analysis

4.1. Experimental design and validation

The proposed new multivariate ADTree designs discussed in Section 3 above are Fisher's ADTree, the Sparse ADTree, the regularized LADTree using the Lasso, and the regularized LADTree using the Elastic Net. In order to gauge their performance against other types of decision trees, several well-known algorithms that are well represented in the literature are chosen so as to include each general type of decision tree (Table 2). The discriminant analysis classifier is also included, since this technique has been implemented in two of the multivariate ADTree designs. The chosen learning algorithms are listed below:

1. Univariate decision trees: C4.5 and CART [37];
2. Multivariate decision tree: Fisher's decision tree [20];
3. Ensemble of univariate decision trees: Boosted C4.5 [37];
4. Ensemble of multivariate decision trees: Oblique Random Forest [21];
5. Univariate boosted decision tree: ADTree [10];
6. Sparse discriminant analysis [38].

Table 2
Abbreviated algorithm names.

Abbreviation   Description
ADT            Alternating Decision Tree
C4.5           C4.5
CART           CART
FADT           Fisher's Alternating Decision Tree
FDT            Fisher's Decision Tree
oRF            Oblique Random Forest
rLADT_EN       Elastic Net Regularized Logistic Alternating Decision Tree
rLADT_L        Lasso Regularized Logistic Alternating Decision Tree
SADT_EN        Elastic Net Regularized Sparse Alternating Decision Tree
SLDA           Sparse Linear Discriminant Analysis

The datasets used in this study are given in Table 3. The datasets are shortlisted such that each of them consists of only real-valued feature measurements. Datasets with categorical features are excluded, since multivariate trees must convert categorical features to real-valued features, and such conversions could bias the performance comparisons.

The University of California, Irvine (UCI) datasets [39] are associated with a wide range of real-world problems. This allows comparing the performance of the trees across datasets of varying characteristics (i.e., feature measurements of different nature representing particular domain problems). Three additional spectral datasets from the University of Eastern Finland (UEF) [40] are included because they are known to have highly correlated features. This allows comparisons between the decision trees on multivariate correlated features. All datasets are preprocessed to center each feature to zero mean and unit standard deviation. All experiments were conducted on a PC with an Intel® Core™ i5 3.2 GHz CPU and 4 GB RAM.

A standard 10-times 10-fold stratified cross-validation was performed on each dataset for each learning algorithm to generate the performance estimation data. The employed performance metrics were: prediction accuracy, induction time, decision tree size, and decision node complexity. Comprehensibility can be viewed as a tradeoff between the decision tree size and the decision node complexity.
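The evaluation protocol described above can be sketched as follows; the scikit-learn utilities and the stand-in classifier are assumptions of this example, not the MATLAB/JAVA/R implementations actually benchmarked in the paper.

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.tree import DecisionTreeClassifier   # placeholder for any of the compared learners

def cross_validated_accuracy(X, y, make_classifier, n_splits=10, n_repeats=10, seed=0):
    """Sketch of the 10-times 10-fold stratified cross-validation used to produce the
    mean +/- standard deviation entries of the performance tables."""
    rskf = RepeatedStratifiedKFold(n_splits=n_splits, n_repeats=n_repeats, random_state=seed)
    scores = []
    for train_idx, test_idx in rskf.split(X, y):
        clf = make_classifier()
        clf.fit(X[train_idx], y[train_idx])
        scores.append(clf.score(X[test_idx], y[test_idx]))
    return np.mean(scores), np.std(scores)

# Example usage with a stand-in univariate decision tree:
# acc_mean, acc_std = cross_validated_accuracy(X, y, lambda: DecisionTreeClassifier())
```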
Table 3
Summary of the UCI and UEF datasets.

Dataset ID   Dataset             Number of samples   Number of features
1            Breast cancer       569                 30
2            Blood transfusion   748                 4
3            Liver disorder      345                 6
4            Vertebral           310                 6
5            Pimaindian          768                 8
6            Heart               267                 44
7            MAGIC gamma         19,020              10
8            Parkinson           195                 22
9            Haberman            306                 3
10           ILPD                579                 10
11           Ionosphere          351                 33
12           Spambase            4601                57
13           Wilt                4839                5
14           QSAR                1055                41
15           Climate             540                 18
16           Banknote            1372                4
17           Woodchip (UEF)      10,000              26
18           Forest (UEF)        707                 93
19           Paper (UEF)         180                 31

For each dataset, the best performing algorithm was given the rank value of 1, the second best was given the rank value of 2, and so on. The ranks were averaged if the performances were tied. The average rank of each learning algorithm was calculated over the multiple datasets as shown in (18), where r_il represents the rank of the l-th algorithm on the i-th dataset. The total number of datasets is M, while the total number of algorithms is L.

R_l = (1/M) Σ_i r_il    (18)

The statistical comparison was conducted along the lines suggested in [41]. The null hypothesis was that all algorithms perform similarly. The nonparametric Friedman's test (19) was used for hypothesis testing. It is based on the average ranks of the classification models and detects whether there are statistically significant differences among the classifiers. In case of rejection of the null hypothesis, Nemenyi's test [41] was applied to determine which pairs of algorithms were statistically different. That was done based on the critical difference (20) in terms of the rank value, where q_α is the critical value for Nemenyi's test.

χ²_F = (12M / (L(L+1))) [ Σ_l R_l² − L(L+1)²/4 ]    (19)

CD = q_α sqrt( L(L+1) / (6M) )    (20)

4.2. Experimental results and discussions

The raw results covering prediction accuracy, induction time, decision tree size, and split complexity are summarized in the tables in Appendices A–D. Each value in these tables represents the average ± standard deviation of the 10-time 10-fold stratified cross-validation for a particular pair of a learning algorithm and a dataset. Induction time is excluded from the statistical comparisons, and only the average computation time across all datasets is reported due to the varying execution speed of the different platform implementations such as MATLAB, JAVA, and R. In addition, there are some timing overheads since both the JAVA and R codes are called within the MATLAB platform when conducting the experiments. There are reported differences in the execution times of the same code under different programming languages, such as in [42], where it is shown that the JAVA, MATLAB, and R languages are slower than C++ (approximately 2.2–2.69, 9–11, and 475–491 times respectively). Nevertheless, the raw induction time is included to show that the fast induction property of the decision tree is not lost in the proposed multivariate ADTree variants in comparison to the other decision tree families.

The statistical comparison results are available in Fig. 4. All statistically significant differences were detected at a 0.01 significance level. Friedman's and post-hoc Nemenyi's tests were performed based on the average rank values of every learning algorithm. The average ranks are shown in brackets next to their corresponding learning algorithms. A lower average rank value indicates a better performance, and vice versa. Groups of algorithms that are not statistically significantly different are indicated using a bold line.

When examining the classifiers separately across all datasets, almost all of them have similar prediction accuracy without a statistically significant difference (SSD) (see Fig. 4). Only oRF is significantly more accurate than ADT. Each classifier is superior in some cases and inferior in others, as predicted by the "No-Free-Lunch" theorem [43]. Therefore, the average ranks of the algorithms' performance were also derived, because they give a measure of how well a particular classifier performs across a variety of datasets.

When only the statistical testing is considered, it can be seen that the classical decision trees (C4.5 and CART) are consistently within the top group of algorithms for all the performance indicators. These results agree well with the literature findings and can be considered as validating them. These algorithms are fast to build and they are comprehensible. The above is perhaps the main reason why they remain relevant despite more powerful methods being introduced over the years to further improve classification performance.

Despite being within the first group of algorithms without SSD in terms of prediction accuracy, C4.5 and CART are ranked in the bottom half out of the 11 classifiers. This shows that their performance could indeed be improved with different tree induction strategies. The ensemble-based decision trees (oRF and boosted C4.5) have been designed aiming for this. They are consistently ranked in the top tier of the classification performance (see Fig. 4). Nonetheless, based on the performed experimental analysis, the accuracy improvement is not statistically significant when compared to C4.5 and CART across the range of different datasets. Yet the tradeoff in terms of the decision tree size and split complexity is statistically significantly worse compared to some other classifiers. They also have the worst average rank in terms of the induction time. For example, when comparing oRF to CART (see Fig. 4), it can be seen that the former is statistically larger with more complex nodes, while at the same time it offers no statistically significant improvement in accuracy. This clearly negates some of the good qualities of being a decision tree.

The proposed multivariate ADTree variants offer a flexible nonparametric approach that adapts to the different characteristics of datasets. There is a built-in opportunity to decide whether to build a full decision tree or a simple decision stump (a linear decision boundary like SLDA) based on user-supplied stopping criteria. The multivariate ADTree modifications are also able to decide whether to boost multiple splits on the same input space sub-region or to use a standard decision tree partitioning (like FDT). SADT and rLADT allow the use of univariate or multivariate decision nodes (or even both types) within the same tree. The benefits of these properties are further elaborated below in relation to the other decision tree families and SLDA.
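As an illustration of the rank-based comparison in (18)-(20), the sketch below computes the average ranks, the Friedman statistic, and the Nemenyi critical difference from a matrix of per-dataset performance values; the critical value q_α shown is only a placeholder and must be taken from a Nemenyi table for the chosen significance level and number of algorithms.

```python
import numpy as np
from scipy.stats import rankdata

def friedman_nemenyi(perf, q_alpha=3.219):
    """Rank-based comparison sketch following (18)-(20).

    perf: (M, L) array of a performance measure for M datasets and L algorithms,
    where larger is better (e.g., prediction accuracy).
    """
    M, L = perf.shape
    # Rank 1 = best on each dataset; ties share the average rank.
    ranks = np.vstack([rankdata(-perf[i]) for i in range(M)])
    avg_ranks = ranks.mean(axis=0)                                    # R_l as in (18)
    chi2_f = (12.0 * M / (L * (L + 1))) * (np.sum(avg_ranks ** 2)     # Friedman statistic (19)
                                           - L * (L + 1) ** 2 / 4.0)
    cd = q_alpha * np.sqrt(L * (L + 1) / (6.0 * M))                   # Nemenyi critical difference (20)
    return avg_ranks, chi2_f, cd
```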
Fig. 4. Average ranks of the algorithms (rank values are shown in brackets next to the algorithms) and the corresponding statistical comparisons for: (a) prediction accuracy, (b) decision tree size, and (c) split complexity (total number of nonzero feature coefficients). Groups of algorithms that are not statistically significantly different are connected with a bold line. CD refers to the critical difference in terms of the rank value. The proposed multivariate ADTree variants are shown in bold for readability.

Fig. 5. Frequency of training samples versus feature range (first decision stump) for: (a) the univariate ADTree, and (b) Fisher's ADTree on the Forest dataset.

4.2.1. Generalizing the alternating decision tree

The ADTree is a tradeoff between the classical decision trees and ensembles of decision trees. It retains the comprehensibility of a decision tree despite going through boosting cycles. The ADT algorithm has a significantly worse prediction accuracy compared to the oblique random forest. Besides, its induction time is on average longer than that of most of the other analyzed decision trees. These limitations are due to the univariate base conditions, as validated in the experiments.
All three proposed multivariate ADTree variants overcome these limitations. The Forest dataset can be used as an example to illustrate this.

ADT selected the 90th feature of the Forest dataset for splitting in the first decision stump. However, the histogram in Fig. 5(a) shows an obvious distribution overlap between the positive and negative training samples over the selected feature range. This violates the weak learning condition of achieving at least 50% accuracy, which is the requirement for boosting to work. In contrast to that, the proposed FADT algorithm uses Fisher's discriminant analysis. It synthesizes a feature (through a linear projection) that is more discriminative, so that the positive and negative training samples are well separated over the feature range (see Fig. 5(b)).

The univariate ADTree is a subclass of the SADT and rLADT algorithms. It can be generated by choosing to use only one active feature when computing the regularization path. A separate analysis was performed in this research to compare ADT with SADT_UNI and rLADT_UNI. The raw performances of the three algorithms are tabulated in Appendix E. Both SADT_UNI and rLADT_UNI do not suffer from ADT's exhaustive approach that is used to generate the set of univariate base conditions. Therefore, it can be observed that both trees are consistently faster to induce compared to ADT on all datasets. Since they are all univariate ADTrees, the split complexity is completely dependent on the tree size. Both rLADT_UNI and SADT_UNI are statistically smaller than ADT, thus leading to the conclusion that they are generally more comprehensible. However, ADT is statistically significantly better in its prediction accuracy in comparison to SADT_UNI and rLADT_UNI. Thus, even though SADT and rLADT are able to induce a purely univariate ADTree, in cases where prediction accuracy is prized over induction time and tree size, it is more beneficial to induce multivariate trees.
Table 4
Performance comparison between FDT and FADT for some cases where FADT predicts better than FDT. The best performing value is highlighted in bold.

Dataset          Prediction accuracy          Induction time (s)        Tree size
                 FDT          FADT            FDT         FADT          FDT            FADT
Liver disorder   50.67±4.61   63.47±1.54      0.00±0.00   0.19±0.10     7.24±1.32      43.42±15.18
Heart            73.62±1.95   76.34±2.22      0.04±0.00   0.09±0.09     11.58±0.60     17.95±13.21
Pimaindians      64.96±4.86   75.16±1.04      0.01±0.00   0.04±0.02     14.50±1.55     26.38±11.20
MAGIC gamma      74.69±4.12   82.52±0.12      1.10±0.03   8.55±1.29     266.46±53.76   109.69±8.04
ILPD             59.03±3.52   71.39±0.26      0.01±0.00   0.03±0.05     10.84±2.83     8.65±5.57
QSAR             81.18±0.90   85.67±0.57      0.13±0.01   0.06±0.04     44.32±1.54     14.59±6.62

Table 5
Prediction accuracy of SADT and SLDA for cases where SADT generates a single SLDA model in its sole decision node (i.e., it behaves like an SLDA). The best performing value is highlighted in bold.

Dataset             SADT           SLDA
Blood transfusion   76.21±0.00     66.00±0.42
Banknote            98.21±0.05     97.49±0.09
Woodchip (UEF)      99.58±0.01     99.61±0.01
Paper (UEF)         100.00±0.00    100.00±0.00

Fig. 6. SADT models based on: (a) the Woodchip and (b) the Heart datasets. SADT is capable of generating either an SLDA-like model as in (a), or of further extending it into a tree model as in (b).
4.2.2. Fisher's ADTree as an extension to Fisher's decision tree

The proposed FADT is a direct extension of the existing FDT algorithm. The difference comes from the accommodation of boosting in FADT, which allows a majority-voted decision from multiple multivariate decision nodes on the same input space sub-region.

The performed experiments did not show any statistically significant differences in the performance metrics between FADT and FDT. However, the average prediction accuracy rank of FADT was better than that of FDT, while the average induction time, decision tree size, and split complexity ranks of FDT were better than those of FADT. The incorporation of boosting by FADT improved the prediction accuracies for 11 out of 19 datasets. Both trees had similar prediction accuracies for 6 datasets, while FDT was more accurate on 2 datasets. Table 4 shows some examples where FADT predicted better than FDT.

Using the Liver Disorder dataset as an example (Table 4), it can be seen that FADT improves the classification accuracy on Liver Disorder by about 13% over FDT. At the same time, FADT's tree size is around 6 times larger than that of FDT. This example may lead to the wrong conclusion that FADT improves the classification performance at the cost of a larger decision tree. While this may be true for ensemble-based multivariate decision trees such as the oblique random forest, it is not the case for FADT. In actuality, FADT built a smaller tree on 7 out of the 19 experimented datasets. Some examples of that are the MAGIC gamma, ILPD, and QSAR datasets (Table 4). Most interestingly, FADT improved the MAGIC gamma prediction by 8% while building a smaller tree (some 2.5 times smaller) than that of FDT. This can be explained by the boosting providing significantly better discrimination on the already partitioned regions rather than going down in depth for further splitting. In the two cases where FDT gave a better prediction compared to FADT (i.e., the Ionosphere and Woodchip datasets), it is likely that FADT suffered from the over-fitting phenomenon. It can be observed, using the Woodchip dataset as an example, that FADT generated a significantly larger decision tree size of 141.04±5.00 compared to 18.22±0.36 of FDT. This is likely due to over-fitting. In general, FADT does improve the prediction accuracy of FDT through boosting. Furthermore, in many cases it achieves this without necessarily sacrificing the tree size.
4.2.3. Sparse ADTree – a nonparametric extension of SLDA

The proposed SADT is a direct nonparametric extension of SLDA, which is itself a powerful discriminant analysis method. It is important to note that the parametric form of Sparse Linear Discriminant Analysis makes a linear assumption on the underlying data. However, there are cases where a linear classifier is insufficient. Such cases can be accommodated by SADT by inducing a suitable number of decision boundaries to better discriminate the input space. The ability of SADT to extend SLDA into a tree representation increases the prediction accuracy across multiple datasets. In fact, it showed improvements on 12 out of the 19 datasets employed in the reported experimental research. Besides, SADT performed comparably with SLDA on the other 7 datasets.

Table 5 compares the classification performance on the datasets where SADT generates a single SLDA classifier in its sole decision stump. For example, Woodchip is a linearly separable dataset for which SADT generated a single decision boundary. This is essentially similar to SLDA (Fig. 6(a)). There is no surprise therefore that both SADT and SLDA achieved close experimental results in terms of their accuracy performance: 99.58±0.01% and 99.61±0.01% respectively. Close performances can also be noticed in their induction time (0.39±0.05 s for SADT and 0.58±0.03 s for SLDA) as well as in the split complexity (both were 26.00±0.00).
Table 6
Prediction accuracies of C4.5, CART and rLADT for medical datasets with highly discriminative features. The best performing value is highlighted in bold.

Dataset           Prediction accuracy
                  C4.5          CART          rLADT_L       rLADT_EN
Breast cancer     93.52±0.78    93.04±0.73    96.40±0.26    96.52±0.23
Liver disorders   65.76±2.14    66.38±2.45    66.99±2.22    66.85±1.54
Vertebral         81.23±1.01    80.81±1.25    82.87±1.03    82.68±1.43
Table 7
Prediction accuracies and tree sizes of C4.5, CART and rLADT for spectral datasets with highly correlated features. The best performing value is highlighted in bold.

Dataset     Prediction accuracy                                       Decision tree size
            C4.5         CART         rLADT_L      rLADT_EN          C4.5           CART           rLADT_L       rLADT_EN
Wood-chip   91.94±0.18   91.64±0.26   99.61±0.02   99.45±0.01        516.54±3.97    313.08±25.40   4.00±0.00     4.00±0.00
Forest      88.61±0.70   88.30±0.52   95.94±0.35   93.19±0.32        44.64±1.09     27.52±3.96     16.45±13.78   4.00±0.00
Paper       95.32±1.39   96.99±1.33   98.36±1.07   96.18±1.67        10.48±0.30     10.02±0.75     37.93±9.81    58.12±9.01
Fig. 7. rLADT_L model (a) and stem plot (b) of the β feature coefficients of each spectral measurement for the Woodchip dataset, taken from the decision node in (a). The stems are colored according to the visible light colors depending on the wavelength.

Fig. 8. rLADT_L model on the Vertebral dataset, whereby it is possible to boost multiple decision stumps on the same input space sub-region, and there are both univariate (white) and multivariate (gray) decision nodes.
In most other cases in the reported experimental research, more than a single decision boundary was required. It is illustrated
here with the Heart example as shown in Fig. 6(b). The classification performance was increased from 68.0471.30% (SLDA) to
76.3671.77% (SADT) by building a tree rather than a single decision boundary.
In short, SADT behaves as SLDA for datasets that are linearly separable, and behaves as a tree for cases that are not. This
alleviates the need for practitioners to select a right parametric form to achieve better prediction. However the improved
prediction comes at the cost of a longer induction time and higher split complexity measure. For example, Heart dataset required
an induction time of 2.8870.68 s compared to SLDA's 0.1470.01s along with a split complexity of 263.097251.29 compared to
SLDA's 8.2072.13.
Although SADT was larger than SLDA, it in fact built the smallest tree on average among all the decision trees in the reported
experimental analysis. Furthermore, its spilt complexity
Please cite this article as: HK Sok, et al., Multivariate alternating decision trees, Pattern Recognition (2015),
http://dx.doi.org/10.1016/j. patcog.2015.08.014i
was ranked just below that of the univariate trees. SADT achieved better classification on 9 out of the 19 datasets compared to C4.5 and CART, while inducing a smaller decision tree on 12 out of the 19 datasets. In short, it can be concluded that SADT is a successful nonparametric extension to SLDA. It creates a parsimonious multivariate decision tree that is the smallest on average even when compared against univariate decision trees, while incurring only a slightly higher split complexity.
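Since the sparse nodes are fitted along a regularization path, stopping that path early is what keeps each stump parsimonious. The fragment below is only a sketch of that idea using scikit-learn's lasso_path on synthetic data (it does not reproduce the SLDA/elastic-net solver used in the paper): the path is walked from the strongest penalty downwards and truncated as soon as a preset number of features becomes active.

    import numpy as np
    from sklearn.linear_model import lasso_path

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 20))
    y = X[:, 0] - 2.0 * X[:, 3] + 0.1 * rng.normal(size=100)  # two informative features

    # Full Lasso regularization path: alphas are returned from strong to weak penalty.
    alphas, coefs, _ = lasso_path(X, y)

    # "Early stopping": truncate the path as soon as k features become active,
    # which fixes the complexity of one sparse decision node.
    k = 2
    for alpha, beta in zip(alphas, coefs.T):
        if np.count_nonzero(beta) >= k:
            break
    print("chosen alpha:", alpha, "active features:", np.flatnonzero(beta))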
4.2.4. Regularized LADTree
The most notable extension is the proposed rLADTree. Despite being a boosted and multivariate version, it shows no statistically significant difference in tree size and node complexity when compared to univariate, unboosted decision trees such as C4.5 and CART. First, the performance of rLADTree is examined and compared with C4.5 and CART on datasets with highly discriminative features. In the performed experiments these were represented by medical datasets, whose feature measurements are good indicators of the capability to discriminate between different types of medical conditions (see Table 6).
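As background for the comparisons that follow, the LADTree family grows its stumps with LogitBoost-style updates. The snippet below is a generic, minimal illustration of a single LogitBoost iteration (binary labels in {0, 1}; an off-the-shelf ridge regression stands in for whichever penalized solver is used at a node) rather than the authors' exact procedure.

    import numpy as np
    from sklearn.linear_model import Ridge

    def logitboost_step(X, y01, F):
        """One LogitBoost iteration: fit a penalized regression to the working response."""
        p = 1.0 / (1.0 + np.exp(-2.0 * F))          # current probability of class 1
        w = np.clip(p * (1.0 - p), 1e-6, None)      # observation weights
        z = (y01 - p) / w                           # working response
        node = Ridge(alpha=1.0).fit(X, z, sample_weight=w)  # penalized linear node
        return F + 0.5 * node.predict(X), node      # updated additive score

    # Example: two boosting iterations from a zero model on toy data.
    X = np.random.default_rng(0).normal(size=(50, 3))
    y01 = (X[:, 0] > 0).astype(float)
    F = np.zeros(len(y01))
    for _ in range(2):
        F, node = logitboost_step(X, y01, F)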
It is generally known that univariate decision trees perform well on this kind of dataset [44]. Breast cancer, for example, is a classification problem of discriminating between malignant and benign breast cancer diagnoses, with features extracted from digitized images of fine-needle aspirates of breast masses. C4.5 and CART achieved average accuracies of 93.52±0.78% and 93.04±0.73% respectively. For the same dataset, rLADT_L and rLADT_EN achieved similar accuracies of 96.42±0.26% and 96.52±0.23%.
On the contrary, univariate decision nodes do not handle datasets with complex feature interactions well [44]. For datasets with highly correlated multivariate features, such as spectral datasets (see Table 7), multivariate decision trees are preferable. For example, the Woodchip dataset consists of spectral reflectance measurements for two different types of woodchips: birch and Scots pine. The induced rLADT_L model, comprising only a single decision stump, achieved an accuracy of 99.61±0.02%. The magnitudes of the feature coefficients (together with their signs) of this decision stump are shown in Fig. 7, where colors represent wavelengths in the visible light range. From the rLADT_L model it is easy to comprehend the importance, at each wavelength, of the spectral measurements that best discriminate between the birch and Scots pine species. In contrast, C4.5 and CART induced large decision trees of over 500 and 300 nodes respectively (refer to Appendix C). These large decision trees were far less comprehensible than the one induced using the rLADT_L model while, at the same time, achieving lower classification accuracies.
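A Fig. 7(b)-style view of a fitted node is straightforward to reproduce. The sketch below uses placeholder data (the wavelength grid and the coefficient vector beta are made up, not taken from the paper) and simply stems the per-wavelength coefficients so that their signs and magnitudes can be read off directly.

    import numpy as np
    import matplotlib.pyplot as plt

    # Placeholder inputs: a visible-range wavelength grid and the coefficient
    # vector of one multivariate decision node (both made up for illustration).
    wavelengths = np.linspace(400, 700, 31)
    beta = np.random.default_rng(1).normal(scale=0.3, size=wavelengths.size)

    fig, ax = plt.subplots(figsize=(6, 3))
    ax.stem(wavelengths, beta)
    ax.axhline(0.0, color="gray", linewidth=0.8)
    ax.set_xlabel("Wavelength (nm)")
    ax.set_ylabel("Coefficient value")
    ax.set_title("Decision-node coefficients per spectral band")
    plt.tight_layout()
    plt.show()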
Regularized LADTree allows flexible hierarchical modeling by building a suitable regression model for estimating the class posterior distribution used in classification. Incorporation of the regularization terms enables the decision-node complexity to be optimally selected. Thus, it is possible to have multivariate decision nodes with varying sparsity, univariate decision nodes, or even both within the same regression model.
The versatility of the regularized LADTree can be illustrated using the Vertebral dataset. The rLADT_L model consists of decision nodes that are both univariate and multivariate within the same tree (Fig. 8). It achieved an accuracy of 82.87±1.03% compared to the best-performing 84.94±1.08% of oRF. Yet its tree size was significantly smaller (36.19±6.21 compared to 4614.32±75.26), and it offered a lower node complexity (44.45±34.29 compared to 4514.32±75.26).
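The claim that a single penalized node can range from fully multivariate down to effectively univariate is easy to demonstrate with any L1-penalized linear model. The sketch below (generic scikit-learn code on synthetic data, not the paper's LogitBoost-embedded solver) sweeps the penalty strength and reports how many features each resulting decision node would actually use.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 10))
    y = (X[:, 0] + 0.5 * X[:, 1] - X[:, 2] > 0).astype(int)

    # Stronger penalty (smaller C) -> sparser node; weak penalty -> dense multivariate node.
    for C in (0.01, 0.1, 1.0, 10.0):
        node = LogisticRegression(penalty="l1", solver="liblinear", C=C).fit(X, y)
        print(f"C={C:<5} active features={np.count_nonzero(node.coef_)}")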
In short, the regularized LADTree is able to optimally induce a range of possible regression models, such as: (1) a single linear regression model (similar to SLDA); (2) an additive regression model; and (3) a hierarchical regression model (a decision tree). Its modularity and ability to better match different complexities of data enable it to retain the good qualities of a decision tree, such as a short induction time and good comprehensibility, while enjoying the advantages of boosting and the feature selection techniques of the well-established regularization approach.

4.2.5. Comparisons between multivariate ADTree variants

Each proposed multivariate ADTree algorithm has a distinctive set of characteristics that distinguishes it from the other variants. From the experiments, it was observed that Fisher's ADTree is the fastest to induce, the sparse ADTree is the most comprehensible, and the regularized LADTree is the most accurate. Table 8 highlights the main characteristics of the algorithms and their differences.
Table 8
Comparisons between characteristics of multivariate ADTrees.

Boosting
  Fisher's ADTree: AdaBoost
  Sparse ADTree: AdaBoost
  Regularized LADTree: LogitBoost

Penalization terms
  Fisher's ADTree: None
  Sparse ADTree: Restricted to Elastic Net
  Regularized LADTree: May use any regularization technique

Prediction accuracy
  Fisher's ADTree: Lower than Sparse and Regularized ADTree
  Sparse ADTree: Better than the Fisher's ADTree
  Regularized LADTree: Best among the multivariate ADTree variants

Induction time
  Fisher's ADTree: Fastest among the multivariate ADTrees due to the use of a single analytical solution
  Sparse ADTree: Slowest among the multivariate ADTrees due to a series of optimizations and the use of additional parameters (i.e., the optimal score vector)
  Regularized LADTree: Faster than Sparse ADTree

Decision tree size
  Fisher's ADTree: Approximately the same as Regularized LADTree
  Sparse ADTree: Smallest among the multivariate ADTrees
  Regularized LADTree: Approximately the same as Fisher's ADTree

Multivariate split complexity
  Fisher's ADTree: Most complex of the three variants due to the implemented Fisher's discriminant analysis
  Sparse ADTree: Approximately the same as Regularized LADTree
  Regularized LADTree: Approximately the same as Sparse ADTree

Advantages
  Fisher's ADTree: Embedded feature extraction for better discrimination and for satisfying the weak learning condition
  Sparse ADTree: (1) Feature selection mechanisms to select an optimal feature set for each decision node; (2) regularization path with "early stopping" that allows decision-node complexity tuning
  Regularized LADTree: (1) Flexible hierarchical additive modeling; (2) a framework that allows any linear-type regularization (maximum a posteriori) without modification to the solver; (3) probabilistic decisions due to the additive logistic regression interpretation

Disadvantages
  Fisher's ADTree: No feature selection mechanism
  Sparse ADTree: Restricted to Elastic Net
  Regularized LADTree: Higher model complexity compared to Sparse ADTree
5. Conclusion

In this paper, three different methods are presented to induce a multivariate ADTree. The aim is to equip the decision tree with the boosting capability while ensuring that it remains comprehensible. Although no single algorithm can outperform all others on all possible datasets, as suggested by the No-Free-Lunch theorem, it is clear from the performed experimental analysis that the most optimal tree can be built if the dataset characteristics can be matched by the right selection of a specific decision tree algorithm. For example, if the domain problem has a few highly discriminative features, the C4.5 and CART algorithms are capable of generating an optimal decision tree, as shown in the literature and confirmed by the performed experiments. However, they induce large and incomprehensible decision trees with lower prediction accuracies when complex interactions exist among the features (i.e., spectral datasets). Ensemble-based forests are good at giving high classification performance across most data types, but at the expense of other factors, such as induction time and comprehensibility. In many cases, the characteristics of the datasets are unknown a priori. Therefore, an optimal classifier with the right induction bias that best captures the underlying characteristics is also unknown. More often than not, practitioners are required to experiment across different types of classifiers to determine the suitable one for their given domain problem.

The proposed multivariate ADTree variants (in particular, the sparse ADTree and regularized LADTree) are non-parametric decision trees that are equipped with additional boosting and regularization techniques to better match the complexity of given datasets. They are therefore able to optimally represent datasets with a few highly discriminative features (as C4.5 and CART do), datasets with correlated features such as spectral datasets (as multivariate decision trees do), and datasets that require multiple models (as ensemble-based forests do).

The proposed Fisher's ADTree is a boosted alternative to multivariate decision trees such as Fisher's decision tree. The proposed sparse ADTree incorporates a sparseness criterion into the multivariate ADTree to allow for better comprehension through feature selection. It is a nonparametric extension to SLDA: it performs the same partitioning as SLDA for datasets that satisfy the linear assumption, while also overcoming the limitations of SLDA by automatically fitting multiple decision boundaries to improve the classification accuracy for datasets that cannot be classified with a single linear decision boundary.

The most distinctive variant is the regularized LADTree, which performs without statistically significant difference from the state-of-the-art C4.5 and CART algorithms in terms of tree size and node complexity for most datasets, despite being a boosted multivariate tree. Most significantly, the regularized LADTree had better classification performance across all datasets; it is ranked directly in the second tier after the decision tree ensemble algorithms while remaining comprehensible. For example, on applications that contain features with complex interactions, the regularized LADTree builds a more accurate and much smaller tree with its multivariate nodes compared to C4.5 and CART. At the same time, its node complexity remains small due to the use of regularization techniques. It is important to note that the greatest advantage lies in the regularized LADTree's modularity, which allows a wide range of established linear regularization techniques to be applied. This bridges the decision tree and the powerful regularization research fields.

For future research, it would be important to investigate how the ADTree can be designed based on different boosting algorithms to handle a wide range of domain problems. This would lead to an advantage over classical decision trees, which often require a new learning mechanism to achieve certain properties.

Acknowledgments

This work was supported by Monash University Malaysia through a Higher Degree Research Scholarship and by the Ministry of Higher Education, Malaysia, under Fundamental Research Grant Scheme FRGS/2/2014/TK03/MUSM/02/1.

Appendix A

See Table A1.

Appendix B

See Table B1.

Appendix C

See Table C1.

Appendix D

See Table D1.

Appendix E

See Table E1.
Table A1
Prediction accuracy (average ± standard deviation over 10-time 10-fold stratified cross validation), in percent.

ID | C4.5 | CART | FDT | ADT | FADT | SADT | rLADT_L | rLADT_EN | boosted C4.5 | oRF | SLDA
1 93.5270.78 93.0470.73 94.9770.46 94.6870.80 96.1270.26 96.8970.37 96.4070.26 96.5270.23 97.1670.26 97.2170.27
96.3670.25 2 77.9270.62 78.2870.37 78.5270.57 77.3870.41 76.3870.41 76.2170.00 76.2170.00 76.2170.00 77.5470.42
78.5071.01 66.0070.42 3 65.7672.14 66.3872.45 50.6774.61 62.3471.37 63.4771.54 62.8170.81 66.9972.22 66.8571.54
68.9071.17 72.5971.24 62.5570.38 4 81.2371.01 80.8171.25 83.0671.24 82.7171.24 82.8771.13 83.3570.57 82.8771.03
82.6871.43 83.1071.51 84.9471.08 79.6570.87 5 74.5770.90 74.1470.37 64.9674.86 72.5470.91 75.1671.04 76.0170.41
74.6870.87 74.3570.85 73.7471.08 76.0970.54 76.1470.15 6 75.5071.88 78.3171.12 73.6271.95 78.5771.66 76.3472.22
76.3671.77 79.4370.00 79.4370.00 80.8171.12 81.1871.21 68.0471.30 7 85.1270.15 85.3770.12 74.6974.12 78.5970.10
82.5270.12 78.9070.03 81.4070.09 81.4370.11 88.0070.64 87.7370.07 79.4570.02 8 83.8372.01 86.8871.63 84.6371.57
88.8671.57 81.8272.28 81.6771.01 8 1.2872.27 77.9971.93 92.7471.13 91.9971.44 82.0271.34 9 70.5471.08 72.2171.23
71.6071.00 71.9171.59 72.5670.71 71.2870.81 72.9870.83 73.0771.03 71.2771.29 69.3671.28 74.0470.63 10 68.3571.75
71.1170.65 59.0373.52 71.2370.52 71.4170.63 71.3970.26 71.5170.00 71.5170.00 71.5871.34 72.1370.85 63.4070.60 11
90.2971.29 88.7071.02 87.9771.47 84.1871.39 80.1571.57 84.7670.92 87.8171.15 87.4771.11 94.0270.42 94.3070.42 86.4870.80
12 92.7970.40 92.4170.23 90.8370.35 93.6370.16 91.0370.06 90.9170.08 91.6470.14 91.6070.20 95.2370.28 86.7270.24
90.6270.06 13 98.1570.11 98.2370.07 97.9370.17 96.4670.06 96.1470.05 94.5970.06 94.6170.00 94.6170.00 98.5370.08
98.2570.11 91.2870.06 14 83.5171.08 82.8870.57 81.1870.90 84.0470.92 85.6770.57 85.1770.36 85.1370.79 85.2370.67
86.9170.65 87.3670.48 84.9570.29 15 89.8470.71 91.2770.37 90.0670.98 91.0770.44 91.4970.00 91.4970.00 91.4970.00
91.4970.00 92.5170.51 91.5570.25 78.6770.83 16 98.5170.25 98.2670.32 99.7470.13 88.8070.23 99.0670.12 98.2170.05
96.6670.06 96.6270.04 99.8070.10 99.8270.05 97.4970.09 17 91.9470.18 91.6470.26 99.4770.05 67.8270.16 91.3571.95
99.5870.01 99.6170.02 99.4570.01 98.6670.08 99.3170.04 99.6170.01 18 88.6170.70 88.3070.52 95.2570.46 83.7170.38
95.8470.31 96.4270.36 95.9470.35 93.1970.32 92.4670.46 96.4670.34 96.2970.26 19 95.3271.39 96.9971.33 100.0070.00
96.5171.67 100.0070.00 100.0070.00 98.3671.07 96.1871.67 96.4171.38 98.5170.69 100.0070.00
Table B1
Induction time (average ± standard deviation over 10-time 10-fold stratified cross validation), in seconds.

ID | C4.5 | CART | FDT | ADT | FADT | SADT | rLADT_L | rLADT_EN | boosted C4.5 | oRF | SLDA
1 0.0270.00 0.1670.01 0.0270.00 1.4370.39 0.0270.04 0.4470.22 1.0270.47 2.9170.71 1.1070.12 3.9670.30 0.1070.03 2
0.0070.00 0.0570.00 0.0070.00 0.0770.03 0.1370.08 0.0170.00 0.0270.03 0.9370.38 0.0270.00 14.3670.54 0.0170.00 3 0.0070.00
0.0370.00 0.0070.00 0.1170.05 0.1970.10 0.0170.01 0.0870.03 0.0170.02 0.0670.01 6.2670.12 0.0170.00 4 0.0070.00 0.0370.00
0.0070.00 0.1970.09 0.1070.08 0.0270.01 0.0970.04 0.1070.05 0.1670.02 3.4070.07 0.0170.00 5 0.0170.00 0.0970.00 0.0170.00
0.1270.07 0.0770.06 0.0470.02 0.0970.06 0.0870.04 0.5870.10 12.9070.29 0.0170.00 6 0.0270.00 0.0770.00 0.0470.00 0.3270.14
0.0970.09 2.8870.68 0.2770.25 0.1070.07 0.5770.10 2.5970.07 0.1470.01 7 1.3570.04 8.5870.13 1.1070.03 16.2570.79 8.5571.29
1.2670.77 14.0771.79 0.3270.25 263.29782.54 362.6073.17 0.1970.02 8 0.0170.00 0.0470.00 0.0170.00 1.2570.16 0.1170.07
0.2570.15 0.5470.23 15.6972.07 0.2470.03 1.6770.03 0.0470.01 9 0.0070.00 0.0270.01 0. 0070.00 0.0570.03 0.1370.05
0.0370.01 0.0670.03 0.7870.24 0.0170.00 5.9470.13 0.0070.00 10 0.0170.00 0.0870.00 0.0170.00 0.0370.03 0.0370.05 0.0870.05
0.0170.02 0.0670.02 0.5470.10 8.7570.14 0.0170.00 11 0.0270.00 0.1070.01 0.0270.00 1.9670.82 0.0670.03 0.4470.25 1.0570.46
0.0170.02 0.8570.08 3.5470.05 0.1070.11 12 0.9270.05 2.8370.07 0.7770.06 41.2372.72 0.7470.36 6.8272.75 14.9073.09
0.8370.26 11.3872.01 45.1770.43 1.7570.11 13 0.0570.00 0.4770.01 0.0670.00 0.5770.31 0.1670.07 0.0370.01 0.0170.00
0.0370.00 4.1670.58 36.4270.57 0.0270.00 14 0.1070.00 0.3770.02 0.1370.01 6.6871.42 0.0670.04 2.8571.13 1.8970.93
0.0170.00 4.6270.79 13.2370.27 0.2370.02 15 0.0170.00 0.1170.00 0.0170.00 0.7770.45 0.0270.01 0.0570.04 0.0170.00
0.0170.00 0.7870.07 4.6770.08 0.0370.01 16 0.0170.00 0.1270.01 0.0170.00 0.3770.09 0.0470.02 0.0170.01 0.1870.15 0.0170.01
0.4670.06 7.9570.16 0.0170.00 17 1.0770.12 7.3070.52 0.3070.02 1.2570.02 7.9870.67 0.3970.05 0.0770.00 0.3070.03
76.26710.19 109.9271.25 0.5870.0 3 18 0.1170.00 0.9170.05 0.1870.01 6.2371.10 0.0270.02 11.1775.35 3.1874.07 0.4670.05
5.6370.78 5.3670.08 0.9770.04 19 0.0070.00 0.0470.00 0.0070.00 1.3470.26 0.0070.00 0.1170.02 0.6670.24 0.8370.24 0.0170.00
1.2370.02 0.0870.00
Table C1
Decision tree size (average ± standard deviation over 10-time 10-fold stratified cross validation), in terms of the total number of nodes. SLDA is not a decision tree and hence is not included.

ID | C4.5 | CART | FDT | ADT | FADT | SADT | rLADT_L | rLADT_EN | boosted C4.5 | oRF
1 21.4470.94 13.0871.62 11.0070.78 71.65712.25 6.4074.79 11.5974.84 45.61713.24 42.16712.06 1091.14738.20 1253.86752.21
2 10.8271.06 13.6272.26 6.1270.66 31.9377.48 33.88712.39 4.0070.00 6.8274.76 5.1773.70 62.80724.90 5992.067121.13 3
49.1872.52 24.8074.97 7.2471.32 33.2876.64 43.42715.18 5.6272.64 36.1976.21 36.94712.65 691.087495.77 4614.32775.26 4
19.9270.89 13.9272.61 14.5071.55 51.91714.21 26.38711.20 7.6372.33 34.06710.07 33.70712.11 1522.357266.26
2649.22779.57 5 39.0274.72 18.6076.76 9.3472.64 25.7577.54 19.09710.22 13.2474.32 23.20710.41 21.6777.74
4086.2971509.32 7703.047141.01 6 36.9070.84 3.8871.71 11.5870.60 23.0875.33 17.95713.21 50.56712.02 16.96711.54
11.8677.54 1333.74731.50 1710.66740.41 7 726.48720.51 209.62718.26 266.46753.76 137.6573.80 109.6978.04 15.9478.25
100.6378.35 101.8678.90 71846.90718979.61 131636.687653.16 8 19.0471.11 11.4071.96 9.4670.60 90.2578.48 27.70711.51
16.3077.90 57.85713.21 77.2 9715.88 726.317127.97 1191.58738.67 9 4.8871.04 4.4272.26 1.9070.51 29.98710.19 36.25711.53
18.2275.43 42.82710.52 42.1977.75 28.50722.10 3522.84786.06 10 55.6478.40 2.0071.61 10.8472.83 8.0574.32 8.6575.57
15.8878.10 4.8772.75 4.9372.94 3750.717926.73 6550.84782.18 11 26.5471.14 9.8472.38 11.3670.77 64.06715.41 29.41713.17
14.0274.67 53.32713.64 41.8078.75 1075.88728.76 1729.80754.26 12 207.4274.27 116.68715.31 91.9472.41 139.8773.49
25.0378.23 11.8074.52 70.1877.62 70.8779.99 2335.49795.68 8218.667230.63 13 51.4071.50 40.8073.13 25.5073.78
36.88711.14 40.7878.62 5.0271.27 4.0070.00 4.0070.00 4895.307180.73 7115.567205.23 14 110.9671.69 43.3076.18 44.3271.54
104.56711.39 14.5976.62 33.31711.11 48.2278.46 51.43710.28 5463.94795.72 6193.347112.46 15 28.0671.05 3.3271.19
14.8671.60 37.00717.02 10.0676.13 5.6873.55 4.0070.00 4.0070.00 1524.22745.23 2052.64752.95 16 29.0671.05 33.1871.01
11.3270.67 64.1878.55 28.3077.21 4.3370.85 34.45716.43 5.2972.72 1205.94795.27 1868.34767.28 17 516.547 3.97
313.08725.40 18.2270.36 30.5570.32 141.0475.00 4.0070.00 4.0070.00 4.0070.00 17562.827141.26 13836.327222.18 18
44.6471.09 27.5273.96 8.6270.55 86.6278.22 6.1373.49 10.3074.03 16.45713.78 4.0070.00 2224.28761.67 1741.60761.64 19
10.4870.30 10.0270.75 3.0070.00 76.1277.46 4.0070.00 4.0070.00 37.9379.81 58.1279.01 18.50730.03 745.02726.68
Table D1
Complexity of the multivariate split (average ± standard deviation), in terms of the total number of nonzero coefficients.

ID | C4.5 | CART | FDT | ADT | FADT | SADT | rLADT_L | rLADT_EN | boosted C4.5 | oRF | SLDA
1 10.2271.96 6.0472.24 150.00723.74 23.5574.08 54.007132.60 90.867166.57 202.377200.66 138.197107.47 520.57719.10
2884.657130.52 22.1971.87 2 4.9172.00 6.3173.58 10.2475.52 10.3172.49 43.40744.17 4.0070.00 3.82710.48 3.0374.39
28.95712.23 5892.067121.13 4.0070.00 3 24.0974.98 11.90710.25 18.72719.00 10.7672.21 84.82787.82 9.20713.04 44.45734.29
41.80731.76 336.277242.46 4514.32775.26 5.9770.17 4 9.4672.62 6.4674.45 40.50717.40 16.9774.74 50.76757.16 12.98712.43
35.82739.89 30.67726.79 737.007129.39 2549.22779.57 4.9270.31 5 19.0176.48 8.8078.18 33.36731.31 8.2572.51 48.17773.28
29.83732.14 37.85743.21 35.56741.20 2021.297747.64 7603.047141.01 6.8870.38 6 17.9571.60 1.4472.78 232.76756.39
7.3671.78 248.607480.87 263.097251.29 83.077127.79 51.54769.33 641.87715.75 4831.987121.24 8.2072.13 7 362.74727.11
104.31728.88 1327.307967.66 45.5571.27 362.30773.92 46.54780.58 138.84757.95 130.21750.22 35900.4579483.78
197305.027 979.74 9.8670.40 8 9.0271.62 5.2071.79 93.06718.19 29.7572.83 195.807294.95 100.577135.98 208.737138.44
244.287127.61 338.96759.91 2183.16777.35 17.2072.19 9 1.9471.70 1.7172.90 1.3572.43 9.6673.40 35.16742.59 13.44718.77
23.17722.82 23.89724.34 12.45710.55 3422.84786.06 1.9570.33 10 27.32712.83 0.5072.71 49.20749.05 2.3571.44 25.13763.21
40.83768.14 8.80721.31 9.13724.40 1851.687458.88 9676.267123.26 7.5570.72 11 12.7772.14 4.4273.32 170.94746.05
21.0275.14 312.517406.59 128.697173.76 256.277206.14 175.067160.33 512.94714.38 4074.507135.66 26.2073.41 12
103.2179.23 57.84719.64 2591.797227.20 46.2971.16 456.577629.77 200.707193.33 825.207452.10 686.867470.06
1150.72744.27 28415.317807.22 54.9271.35 13 6.94711.70 0.0570.50 61.25727.60 8.8672.96 66.30753.11 6.5677.09 4.9670.20
4.9570.22 2422.65790.36 7015.567205.23 4.9570.22 14 54.9877.28 21.1579.93 888.06798.76 34.5273.80 185.737354.97
422.987452.77 295.287181.99 291.917174.98 2706.97747.86 18280.027337.37 39.9870.92 15 13.5372.52 1.167 2.42
124.74744.41 12.0075.67 54.367110.28 17.59753.92 6.8275.12 6.8275.12 737.11722.61 3905.287105.89 8.0473.13 16
14.0371.77 16.0972.14 20.6474.21 21.0672.85 36.40730.41 3.7773.69 21.68723.53 3.7174.28 578.15745.95 1768.34767.28
3.1870.39 17 257.7779.31 156.04738.09 223.86739.27 9.8570.11 1213.687137.87 26.0070.00 25.9570.22 25.9570.22
8756.41770.63 34340.807555.46 26.0070.00 18 21.8272.35 13.2675.55 354.33788.37 28.5472.74 159.037382.03 274.067515.69
293.937572.49 86.5277.59 1087.14730.84 7387.207277.38 67.1176.18 19 4.7470.60 4.5170.88 31.0070.00 25.0472.49
31.0070.00 28.8371.58 273.287238.99 163.97774.44 8.28713.27 1612.55766.70 30.0171.31
References
[1] P. Geurts, A. Irrthum, L. Wehenkel, Supervised learning with decision tree- based methods in computational and systems
biology, Mol. Biosyst. 5 (12) (2009) 1593–1605. [2] K.-YK Liu, J. Lin, X. Zhou, STCS Wong, Boosting alternating decision
trees
modeling of disease trait information, BMC Genet. 6 (Suppl. 1) (2005) S132. [3] G. Creamer, Y. Freund, Using boosting for
financial analysis and performance prediction: application to S&P 500 companies, Latin American ADRs and banks, Comput.
Econ. 36 (2) (2010) 133–151. [4] MP-L. Ooi, HK Sok, YC Kuang, S. Demidenko, C. Chan, Defect cluster recognition system for
fabricated semiconductor wafers, Eng. Appl. Artif. Intell. 26 (3) (2013) 1029–1043. [5] C. Kingsford, SL Salzberg, What are
decision trees? Nat. Biotechnol. 26 (9)
(2008) 1011–1013. [6] J. He, H. Hu, R. Harrison, P. Tai, Y. Pan, Transmembrane segments prediction and understanding
using support vector machine and decision tree, Expert Syst. Appl. 30 (2006) 64–72. [7] J. Quinlan, Bagging, boosting, and C4.5,
in: Proceedings of the 13th National
Conference on Artificial Intelligence, 1996, pp. 725–730. [8] L. Breiman, Bagging predictors, Mach. Learn. 24 (2) (1996)
123–140. [9] Y. Freund, R. Schapire, A decision-theoretic generalization of on-line learning
and an application to boosting, J. Comput. Syst. Sci. 55 (1997) 119–139. [10] Y. Freund, L. Mason, The alternating decision
tree learning algorithm, in: Proceedings of the 16th International Conference on Machine Learning, 1999, pp. 124–133. [11] F.
De Comité, R. Gilleron, M. Tommasi, Learning multi-label alternating decision trees from texts and data, in: Proceedings of the
3rd international conference on Machine Learning and Data Mining in Pattern Recognition, 2003, pp. 35–49. [12] G. Holmes, B.
Pfahringer, R. Kirkby, Multiclass alternating decision trees, in: Proceedings of the 13th European Conference on Machine
Learning, 2002, pp. 161–172. [13] YC Kuang, MPL Ooi, Complex feature alternating decision tree,, Int. J. Intell.
Syst. Technol. Appl. 9 (3) (2010) 335–353. [14] R. Guy, P. Santago, C. Langefeld, Bootstrap aggregating of alternating
decision trees to detect sets of SNPs that associate with disease, Genet. Epidemiol. 36 (2012) 99–106. [15] G. Stiglic, M. Bajgot,
P. Kokol, Gene set enrichment meta-learning analysis: next-generation sequencing versus microarrays, BMC Bioinform. 11
(2010) (article 176). [16] M. Drauschke, Multi-class ADTboost. Technical Report No. 6, Department of Photogrammetry
Institute of Geodesy and Geoinformation University of Bonn, 2008. [17] HK Sok, MP-L. Ooi, YC Kuang, Sparse alternating
decision tree, Pattern
Recognit. Lett. 60–61 (2015) 57–64. [18] JR Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann
Publishers, San Francisco, 1993. [19] L. Breiman, JH Friedman, RA Olshen, Classification and Regression Trees,
Wadsworth International Group, Belmont, CA, 1984. [20] A. López-Chau, J. Cervantes, L. López-García, FG Lamont, Fisher's decision
Fisher's decision
tree, Expert Syst. Appl. 40 (16) (2013) 6283–6291. [21] B. Menze, B. Kelm, D. Splitthoff, On oblique random forests, in:
Proceedings of the European Conference on Machine Learning (ECML/PKDD), 2011, pp. 453– 469. [22] A. Franco-arcega,
Splitting attribute subsets for large datasets, in: Proceedings of the 23rd Canadian Conference on Artificial Intelligence, 2010, pp.
370–373. [23] S. Schulter, P. Wohlhart, C. Leistner, A. Saffari, PM Roth, H. Bischof, Alternating decision forests, in:
Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2013, pp. 508–515. [24] J.
Kozak, U. Boryczka, Multiple boosting in the ant colony decision forest meta-
classifier, Knowl.-Based Syst. 75 (2015) 141–151. [25] L. Clemmensen, T. Hastie, D. Witten, B. Ersbøll, Sparse
discriminant analysis,
Technometrics 53 (4) (2011) 406–413. [26] T. Hesterberg, NH Choi, L. Meier, C. Fraley, Least angle and l1 penalized
regression: a review, Stat. Surv. 2 (2008) 61–93. [27] R. Fisher, The use of multiple measurements in taxonomic problems,
Ann.
Eugen. 7 (1936) 179–188. [28] T. Hastie, A. Buja, R. Tibshirani, Penalized discriminant analysis, Ann. Stat. 23
(1995) 73–102. [29] H. Zou, T. Hastie, Regularization and variable selection via the elastic net, J. R.
Stat. Soc. Ser. B (Stat. Methodol.) 67 (2) (2005) 301–320. [30] A. Hoerl, R. Kennard, Ridge regression: biased estimation for nonorthogonal
estimation for nonorthogonal
problems, Technometrics 12 (1) (1970) 55–67. [31] R. Tibshirani, Regression shrinkage and selection via the lasso, J. R. Stat. Soc.
Ser. B (Methodol.) 58 (1996) 267–288. [32] Y. Chen, P. Du, Y. Wang, Variable selection in linear models, Wiley
Interdiscipl.
Rev.: Comput. Stat. 6 (1) (2014) 1–9. [33] H. Akaike, Information theory and an extension of the maximum likelihood
principle, in: Proceedings of the 2nd International Symposium on Information Theory, Tsahkadsor, Armenia, USSR, 1971. [34]
G. Schwarz, Estimating the dimension of a model, Ann. Stat. 6 (1978) 461–464. [35] J. Friedman, T. Hastie, R. Tibshirani, Additive logistic regression: a statistical view
logistic regression: a statistical view
of boosting, Ann. Stat. 28 (2000) 337–407. [36] B. Efron, T. Hastie, I. Johnstone, R. Tibshirani, Least angle regression, Ann. Stat. 32 (2) (2004) 407–499.
[37] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, IH Witten, The WEKA data mining software: an update, SIGKDD Explor. 11 (1) (2009).
[38] K. Sjöstrand, L. Clemmensen, SpaSM: a Matlab toolbox for sparse statistical modeling, 2012 [Online]. Available: 〈http://www2.imm.dtu.dk/projects/spasm〉 (accessed 21.08.14).
[39] A. Frank, A. Asuncion, UCI machine learning repository [Online]. Available: 〈http://archive.ics.uci.edu/ml〉.
[40] University of Eastern Finland, Spectral Color Research Group [Online]. Available: 〈https://www.uef.fi/spectral/spectral-database〉.
[41] J. Demšar, Statistical comparisons of classifiers over multiple data sets, J. Mach. Learn. Res. 7 (2006) 1–30.
[42] S. Aruoba, J. Fernández-Villaverde, A Comparison of Programming Languages in Economics, Working Paper No. 20263, National Bureau of Economic Research, 2014.
[43] DH Wolpert, WG Macready, No free lunch theorems for optimization, IEEE Trans. Evol. Comput. 1 (1) (1997) 67–82.
[44] L. Rokach, O. Maimon, Top-down induction of decision trees classifiers—a survey, IEEE Trans. Syst. Man Cybern. Part C (Appl. Rev.) 35 (4) (2005) 476–487.
Hong Kuan Sok received the Bachelor of Engineering (Honours) in Electrical and Computer Systems Engineering degree from
Monash University, Malaysia in 2010. He is currently a Ph.D. student with particular interests in machine learning and pattern
recognition.
Melanie Ooi Po-Leen received the Ph.D. degree from Monash University, Malaysia, in 2011. She is currently a Senior Lecturer
with the Engineering Faculty, Monash University. Her research interests include machine learning, computer vision, biomedical
imaging and electronic design and test.
Ye Chow Kuang received the Bachelor of Engineering (Honours) degree in electromechanical engineering, and the Ph.D. degree
from University of Southampton. He joined Monash University, Malaysia, where he is involved in the field of machine
intelligence and statistical modelling.
Serge Demidenko received the ME degree from the Belarusian State University of Informatics and Radio Electronics, and the
Ph.D. degree from the Institute of Engineering Cybernetics, Belarusian Academy of Sciences. He is currently a Professor and the
Associate Head of School of Engineering and Advanced Technology, and a Cluster Leader with Massey University, New
Zealand. His research interests include electronic design and test, instrumentation and measurements, and signal processing.