Multi-Teacher Knowledge Distillation Using Teacher's Domain Expertise
Abstract
Large BERT models are difficult to deploy under low computing power and storage capacity. Knowledge Distillation addresses this problem by distilling knowledge into a smaller BERT model while retaining much of the teacher's accuracy in the student. For any given class, the student should prefer the teacher that is most expert at predicting that class; we used this notion of a teacher's domain expertise to train the student. We calculated per-class accuracy for the student and the teacher and recorded the difference between the student and the teacher for each of the k classes. We then took the median of these k differences to quantify the student's overall deviation from the teacher across all k classes. A student trained with our approach eventually outperformed all of its teachers on the MIND dataset, where it was 1.3% more accurate than its teacher BERT-base-uncased and 2.6% more accurate than its teacher RoBERTa in predicting the k classes.
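
The deviation metric described in the abstract can be sketched as follows. This is a minimal illustration assuming predictions and labels are integer class IDs in NumPy arrays; the function names and array layout are assumptions for illustration, not the paper's implementation.

import numpy as np

def per_class_accuracy(preds: np.ndarray, labels: np.ndarray, num_classes: int) -> np.ndarray:
    """Accuracy of `preds` against `labels`, computed separately for each class."""
    accs = np.zeros(num_classes)
    for c in range(num_classes):
        mask = labels == c
        # Fraction of examples of class c that the model predicted correctly.
        accs[c] = (preds[mask] == c).mean() if mask.any() else 0.0
    return accs

def median_deviation(student_preds: np.ndarray, teacher_preds: np.ndarray,
                     labels: np.ndarray, num_classes: int) -> float:
    """Median of the k per-class accuracy differences (student minus teacher)."""
    student_acc = per_class_accuracy(student_preds, labels, num_classes)
    teacher_acc = per_class_accuracy(teacher_preds, labels, num_classes)
    diffs = student_acc - teacher_acc   # one difference per class, k in total
    return float(np.median(diffs))      # overall deviation of the student from the teacher

Taking the median rather than the mean makes the summary robust to a few classes where the student deviates sharply from the teacher.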