Di Wu




“I would rather have questions that can’t be answered than answers that can’t be questioned.”

Richard Feynman

1. About Shared Vocabulary

For NLP systems, tokenization is usually the first step. To overcome the so-called "OOV" problem, BPE was proposed. The original idea behind BPE is simply data compression, but its benefit goes beyond handling OOV (after all, nothing beats character-level tokenization for OOV). However, in some multilingual scenarios, such as multilingual machine translation, things may change. Suppose a system is trained on extremely imbalanced multilingual datasets: a language-agnostic tokenizer like BPE will allocate most of the vocabulary space to the high-resource data and split the low-resource data into sequences of characters. This is still workable if temperature-sampling-like methods are applied (see the sketch below), but is that really an elegant solution? I have been thinking about this issue from the angle of information transport, and trying to answer: what is a good way to measure the quality of a vocabulary in multilingual scenarios, and how do we build one?
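To make the temperature-sampling remedy concrete, here is a minimal sketch of how per-language sampling probabilities are typically flattened before tokenizer/model training. The corpus sizes, language codes, and temperature values are purely illustrative, not a description of any particular system.

```python
# Hypothetical sketch: temperature sampling over imbalanced multilingual corpora.
# Corpus sizes and temperature values below are made up for illustration.

def temperature_sampling_probs(corpus_sizes, temperature=5.0):
    """Return per-language sampling probabilities p_i ∝ (n_i / N) ** (1 / T).

    With T = 1 this reduces to proportional sampling (high-resource languages
    dominate); larger T flattens the distribution toward uniform, so
    low-resource languages contribute more examples to vocabulary training.
    """
    total = sum(corpus_sizes.values())
    weights = {lang: (n / total) ** (1.0 / temperature)
               for lang, n in corpus_sizes.items()}
    z = sum(weights.values())
    return {lang: w / z for lang, w in weights.items()}

# Extremely imbalanced toy corpus (sentence counts); numbers are illustrative only.
sizes = {"en": 10_000_000, "de": 1_000_000, "ne": 10_000}

print(temperature_sampling_probs(sizes, temperature=1.0))  # ~proportional: en dominates
print(temperature_sampling_probs(sizes, temperature=5.0))  # flattened: more mass on 'ne'
```

The point of the sketch is only that temperature sampling reshapes the data distribution rather than the vocabulary itself, which is exactly why it feels like a workaround rather than an answer to the vocabulary-quality question above.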

2. Quantify Knowledge Transfer

For multilingual/multi-task systems, we generally rely on specific design choices to encourage knowledge transfer among languages or tasks, such as a shared vocabulary, a shared backbone, or other priors. We hope that transfer emerges naturally from this intuition-driven modeling (and in practice it does). I am seeking to quantify the degree of transfer or interplay, and, further, to measure or encourage such interactions in an explicit way.
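As a rough illustration of what "quantifying interplay" could mean (not a finalized method), one commonly used proxy is the cosine similarity between per-language gradients on shared parameters: positive similarity hints at transfer, negative similarity at interference. The model, data, and loss below are hypothetical toy stand-ins.

```python
# Minimal sketch: gradient cosine similarity between two "languages" on a
# shared module. Everything here (the linear layer, random batches, MSE loss)
# is a toy stand-in for illustration only.
import torch
import torch.nn as nn

torch.manual_seed(0)

shared = nn.Linear(16, 16)   # stand-in for a shared backbone
criterion = nn.MSELoss()

def language_gradient(batch_x, batch_y):
    """Flattened gradient of the loss on one language's batch w.r.t. shared params."""
    shared.zero_grad()
    loss = criterion(shared(batch_x), batch_y)
    loss.backward()
    return torch.cat([p.grad.flatten() for p in shared.parameters()])

# Toy "batches" for two languages (random tensors standing in for real data).
g_hi = language_gradient(torch.randn(8, 16), torch.randn(8, 16))  # high-resource
g_lo = language_gradient(torch.randn(8, 16), torch.randn(8, 16))  # low-resource

transfer_score = torch.nn.functional.cosine_similarity(g_hi, g_lo, dim=0)
print(f"gradient cosine similarity: {transfer_score.item():.3f}")
```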