Author: Hossain, Arafat Bin
Date accessioned: 2023-12-18
Date available: 2023-12-18
Date issued: 2022
URI: https://hdl.handle.net/10365/33328
Abstract: Large BERT models cannot be used on hardware with low computing power and storage capacity. Knowledge Distillation addresses this problem by distilling knowledge into a smaller BERT model while retaining much of the teacher's accuracy in the student. For any given class, the student should prefer the teacher that is most expert at predicting that class; we used the teachers' domain expertise in this way to train the student. We calculated per-class accuracy for the student and the teacher and recorded the difference between the student and the teacher for all k classes. From these k differences, we took the median to quantify the student's overall deviation from the teacher across all k classes. The student trained with our approach eventually outperformed all of its teachers on the MIND dataset, where it was 1.3% more accurate than its teacher BERT-base-uncased and 2.6% more accurate than its teacher RoBERTa in predicting the k classes.
Rights: NDSU policy 190.6.2 (https://www.ndsu.edu/fileadmin/policy/190.pdf)
Subjects: BERT; Encoder; Knowledge Distillation; Multi-Teacher Knowledge Distillation; Multi-class Text Classification
Title: Multi-Teacher Knowledge Distillation Using Teacher's Domain Expertise
Type: Thesis
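
Note: The deviation measure described in the abstract can be illustrated with a minimal sketch, assuming per-class accuracies are available as simple arrays; the function name and the accuracy values below are hypothetical and not taken from the thesis. It computes the student-minus-teacher accuracy difference for each of the k classes and reports the median of those differences.

import numpy as np

def deviation_from_teacher(student_acc, teacher_acc):
    """Median of the per-class (student - teacher) accuracy differences
    over all k classes, as described in the abstract."""
    diffs = np.asarray(student_acc, dtype=float) - np.asarray(teacher_acc, dtype=float)
    return float(np.median(diffs))

# Hypothetical per-class accuracies for k = 4 classes
student = [0.82, 0.78, 0.90, 0.71]
teacher = [0.80, 0.81, 0.88, 0.70]
print(deviation_from_teacher(student, teacher))  # 0.015: student slightly ahead overall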