Resources

Here you can find language resources, corpora, and pretrained models from my research that you can use in your projects. Please be sure to cite accordingly if you find my work useful!

For published code: we maintain a central repository that collects all our resources, including experiment code.

Datasets

Pretrained ELECTRA Models

We release new Tagalog ELECTRA models in small and base configurations, with both the discriminators and generators available. All models follow the same setups and were trained with the same hyperparameters as the original English ELECTRA models. Our models are available on HuggingFace Transformers and can be used with both PyTorch and TensorFlow. These models were released as part of (Cruz et al., 2020).
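As a rough sketch of how a discriminator checkpoint can be loaded through the Transformers library: the model identifier below is only an assumption for illustration, so substitute the actual name from the model lists that follow.

```python
# Minimal sketch: loading a Tagalog ELECTRA discriminator from HuggingFace Transformers.
# The model identifier is an assumption; replace it with the released model name.
import torch
from transformers import AutoTokenizer, ElectraForPreTraining

model_name = "jcblaise/electra-tagalog-small-cased-discriminator"  # assumed identifier
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = ElectraForPreTraining.from_pretrained(model_name)

# The discriminator scores each token as original vs. replaced.
inputs = tokenizer("Magandang araw sa inyong lahat!", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(torch.round(torch.sigmoid(logits)))  # 1 = token predicted as replaced
```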

Discriminator Models

Generator Models

Pretrained BERT Models

We release four Tagalog BERT Base models and one Tagalog DistilBERT Base model, published in (Cruz & Cheng, 2020). All models use the same configurations as the original English BERT models. Our models are available on HuggingFace Transformers and can be used with both PyTorch and TensorFlow.
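As a quick sketch, masked-token prediction with one of the BERT models might look like the following; the model identifier is an assumption, so use the actual released name.

```python
# Minimal sketch: masked-token prediction with a Tagalog BERT model.
# The model identifier is an assumption; replace it with the released model name.
from transformers import pipeline

fill_mask = pipeline(
    "fill-mask",
    model="jcblaise/bert-tagalog-base-cased",  # assumed identifier
)

# Predict the most likely tokens for the [MASK] position.
for prediction in fill_mask("Ako ay pumunta sa [MASK] kahapon."):
    print(prediction["token_str"], prediction["score"])
```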

GPT-2 Models

Warning! Bias has not been thoroughly studied in this prototype model. Use with caution.

We release a prototype GPT-2 model for testing, trained on a scaled-down corpus composed of WikiText-TL-39 and the Raw NewsPH corpus. This model is part of ongoing research and may change over time. As it is a work in progress, there is currently no paper to cite for using the model directly; please cite the papers that introduce the datasets used instead. Our model is available on HuggingFace Transformers and can be used with both PyTorch and TensorFlow.
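A minimal sketch of text generation with the prototype model is shown below; the model identifier is an assumption, so check the model hub for the released name.

```python
# Minimal sketch: text generation with the prototype Tagalog GPT-2 model.
# The model identifier is an assumption; replace it with the released model name.
from transformers import pipeline

generator = pipeline("text-generation", model="jcblaise/gpt2-tagalog")  # assumed identifier

# Sample a short continuation from a Tagalog prompt.
outputs = generator(
    "Noong isang araw,",
    max_length=50,
    num_return_sequences=1,
    do_sample=True,
)
print(outputs[0]["generated_text"])
```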

Other Pretrained Models

Here’s a collection of other pretrained models that we have used in our research.