Multi-Teacher Knowledge Distillation Using Teacher's Domain Expertise
Date
2022
Publisher
North Dakota State University
Abstract
Large BERT models cannot be deployed on devices with low computing power and storage capacity. Knowledge Distillation addresses this problem by distilling knowledge into a smaller BERT model while retaining much of the teacher's accuracy in the student. For each class, the student should prefer the teacher that is most expert at predicting that class; we used the teachers' domain expertise in this way to train the student. We calculated per-class accuracy for the student and the teacher and recorded the difference between the student and the teacher for all k classes. From these k differences, we computed the median to quantify the student's overall deviation from the teacher across all k classes. The student trained with our approach eventually outperformed all of its teachers on the MIND dataset, where it was 1.3% more accurate than its teacher BERT-base-uncased and 2.6% more accurate than its teacher RoBERTa at predicting the k classes.
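The following is a minimal sketch (not the authors' released code) of the evaluation metric described in the abstract: per-class accuracy for the student and a teacher, per-class differences, and the median of those differences as the student's overall deviation from the teacher. Function names, array shapes, and the example labels are assumptions for illustration only.

```python
import numpy as np

def per_class_accuracy(y_true, y_pred, num_classes):
    """Accuracy computed separately for each of the k classes."""
    acc = np.zeros(num_classes)
    for c in range(num_classes):
        mask = (y_true == c)
        # Guard against classes absent from the evaluation split.
        acc[c] = (y_pred[mask] == c).mean() if mask.any() else 0.0
    return acc

def median_deviation(student_acc, teacher_acc):
    """Median of the per-class (student - teacher) accuracy differences."""
    return float(np.median(student_acc - teacher_acc))

# Hypothetical usage with k = 4 classes:
y_true       = np.array([0, 1, 2, 3, 0, 1, 2, 3])
student_pred = np.array([0, 1, 2, 3, 0, 1, 1, 3])
teacher_pred = np.array([0, 1, 2, 2, 0, 0, 2, 3])

s_acc = per_class_accuracy(y_true, student_pred, num_classes=4)
t_acc = per_class_accuracy(y_true, teacher_pred, num_classes=4)
print(median_deviation(s_acc, t_acc))  # > 0 means the student leads the teacher on the median class
```

Under this sketch, a per-class comparison of the same kind could also be used to pick, for each class, the teacher with the higher per-class accuracy as the distillation target for that class.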
Keywords
BERT, Encoder, Knowledge Distillation, Multi Teacher Knowledge Distillation, Multi-class Text Classification