Here you can find language resources, corpora, and pretrained models from my research that you can use in your projects. Please be sure to cite accordingly if you find my work useful!
For published code: we maintain a central repository for all our resources, including experiment code.
Sentence Entailment Dataset in Filipino
The first benchmark dataset for sentence entailment in the low-resource Filipino language, constructed by exploiting the structure of news articles. Contains 600,000 premise-hypothesis pairs in a 70-15-15 split for training, validation, and testing. Originally published in Cruz et al. (2020).
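A minimal sketch of reading the entailment pairs, assuming the release ships as a CSV with premise, hypothesis, and label columns; the file path and column names are assumptions, so check the actual distribution.

```python
# Sketch: loading the entailment pairs with pandas.
# File name and column names are assumptions about the release format.
import pandas as pd

train = pd.read_csv("newsph-nli/train.csv")      # hypothetical path
print(train.columns.tolist())                    # expected: premise, hypothesis, label
print(train.shape[0], "training pairs")          # roughly 70% of the 600,000 pairs
```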
Large Scale Unlabeled Corpora in Filipino
A large-scale, unlabeled text dataset with 39 million tokens in the training set, inspired by the original WikiText Long Term Dependency dataset (Merity et al., 2016). TL means “Tagalog.” Published in Cruz & Cheng (2019).
Fake News Filipino Dataset
Low-Resource Fake News Detection Corpora in Filipino
The first fake news detection dataset in Filipino. Contains 3,206 expertly labeled news samples, half of which are real and half of which are fake. Published in Cruz et al. (2020).
Hate Speech Dataset
Text Classification Dataset in Filipino
Contains 10,000 training tweets labeled as hate speech or non-hate speech, released with 4,232 validation and 4,232 testing samples. Collected during the 2016 Philippine Presidential Elections and originally used in Cabasag et al. (2019). Published in Cruz & Cheng (2020).
Low-Resource Multiclass Text Classification Dataset in Filipino
Benchmark dataset for low-resource multiclass classification, with 4,015 training, 500 testing, and 500 validation examples labeled across five classes; each sample may belong to more than one class. Collected from tweets and originally used in Livelo & Cheng (2018). Published in Cruz & Cheng (2020).
Pretrained ELECTRA Models
We release new Tagalog ELECTRA models in small and base configurations, with both discriminator and generator checkpoints available. All models follow the same setups and were trained with the same hyperparameters as the English ELECTRA models. Our models are available on HuggingFace Transformers and can be used with both PyTorch and TensorFlow. These models were released as part of Cruz et al. (2020).
- ELECTRA Base Cased Discriminator
- ELECTRA Base Uncased Discriminator
- ELECTRA Small Cased Discriminator
- ELECTRA Small Uncased Discriminator
- ELECTRA Base Cased Generator
- ELECTRA Base Uncased Generator
- ELECTRA Small Cased Generator
- ELECTRA Small Uncased Generator
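A minimal loading sketch for one of the discriminators via HuggingFace Transformers; the hub identifier below is an assumption, so substitute the exact model name from the list above.

```python
# Sketch: loading a Tagalog ELECTRA discriminator with HuggingFace Transformers.
# The model identifier is an assumption; use the actual hub name for the checkpoint you want.
from transformers import AutoTokenizer, AutoModelForPreTraining

model_name = "jcblaise/electra-tagalog-base-cased-discriminator"  # hypothetical identifier
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForPreTraining.from_pretrained(model_name)

inputs = tokenizer("Magandang umaga sa inyong lahat!", return_tensors="pt")
outputs = model(**inputs)  # per-token logits from the replaced-token detection head
```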
Pretrained BERT Models
We release four Tagalog BERT Base models and one Tagalog DistilBERT Base model, published in Cruz & Cheng (2020). All the models use the same configurations as the original English BERT models. Our models are available on HuggingFace Transformers and can be used with both PyTorch and TensorFlow.
- BERT Base Cased
- BERT Base Uncased
- BERT Base Cased WWM
- BERT Base Uncased WWM
- DistilBERT Base Cased
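A minimal fine-tuning setup sketch for one of the BERT checkpoints; the hub identifier is an assumption and the classification head is freshly initialized, so it needs fine-tuning before use.

```python
# Sketch: setting up a Tagalog BERT model for sequence classification.
# The model identifier is an assumption; check the hub for the exact name.
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "jcblaise/bert-tagalog-base-cased"  # hypothetical identifier
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

inputs = tokenizer("Halimbawang pangungusap.", return_tensors="pt")
logits = model(**inputs).logits  # untrained classification head; fine-tune before relying on outputs
```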
Pretrained GPT-2 Models
Warning! Bias has not been thoroughly studied in this prototype model. Use with caution.
We release a prototype GPT-2 model for testing, trained on a scaled-down corpus composed of WikiText-TL-39 and the raw NewsPH corpus. This model is part of ongoing research and may change over time. As this is a work in progress, there is currently no paper to cite for using the model directly; please cite the papers that introduce the datasets used instead. Our model is available on HuggingFace Transformers and can be used with both PyTorch and TensorFlow.
- GPT-2 Tagalog (Small, 117M parameters)
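A minimal generation sketch for the prototype model; the hub identifier and sampling settings are assumptions for illustration only.

```python
# Sketch: text generation with the prototype Tagalog GPT-2 model.
# The model identifier is an assumption; replace it with the actual hub name.
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "jcblaise/gpt2-tagalog"  # hypothetical identifier
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

inputs = tokenizer("Ang Pilipinas ay", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=30, do_sample=True, top_p=0.95)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```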
Other Pretrained Models
Here’s a collection of other pretrained models that we have used in our research.