Multi-Teacher Knowledge Distillation Using Teacher's Domain Expertise

Date

2022

Publisher

North Dakota State University

Abstract

Large BERT models cannot be deployed on devices with low computing power and storage capacity. Knowledge Distillation addresses this problem by distilling knowledge into a smaller BERT model while retaining much of the teacher's accuracy in the student. For any given class, the student should favor the teacher that is most expert at predicting that class; we used this notion of the teacher's domain expertise to train the student. We calculated per-class accuracy for the student and the teacher and recorded the difference between the student and the teacher for each of the k classes. From these k differences, we computed the median to quantify the student's overall deviation from the teacher across all k classes. The student trained using our approach eventually outperformed all of its teachers on the MIND dataset, where it was 1.3% more accurate than its teacher BERT-base-uncased and 2.6% more accurate than its teacher RoBERTa in predicting the k classes.
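
The sketch below is a minimal illustration, not the thesis implementation, of the deviation metric described above: per-class accuracy is computed for a student and a teacher, the k per-class differences are taken, and their median summarizes the student's overall deviation. The function names, NumPy usage, and toy labels are assumptions for illustration only.

```python
import numpy as np

def per_class_accuracy(y_true, y_pred, num_classes):
    """Accuracy on the subset of examples belonging to each class."""
    accs = np.zeros(num_classes)
    for c in range(num_classes):
        mask = y_true == c
        # Fraction of class-c examples that were predicted as class c
        accs[c] = np.mean(y_pred[mask] == c) if mask.any() else 0.0
    return accs

def median_deviation(y_true, student_pred, teacher_pred, num_classes):
    """Median of the k per-class accuracy differences (student minus teacher)."""
    student_acc = per_class_accuracy(y_true, student_pred, num_classes)
    teacher_acc = per_class_accuracy(y_true, teacher_pred, num_classes)
    diffs = student_acc - teacher_acc   # one difference per class (k values)
    return np.median(diffs)             # overall deviation across all k classes

# Toy usage with k = 3 classes (labels are hypothetical, not from MIND)
if __name__ == "__main__":
    y_true = np.array([0, 0, 1, 1, 2, 2])
    student = np.array([0, 0, 1, 2, 2, 2])
    teacher = np.array([0, 1, 1, 1, 2, 0])
    print(median_deviation(y_true, student, teacher, num_classes=3))
```

A positive median under this sketch would indicate the student matching or exceeding the teacher on at least half of the k classes.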

Keywords

BERT, Encoder, Knowledge Distillation, Multi-Teacher Knowledge Distillation, Multi-class Text Classification