Here you can find language resources, corpora, and pretrained models from my research that you can use in your projects. Please be sure to cite accordingly if you find my work useful!

Datasets

  • WikiText-TL-39 download bibtex
    Large Scale Unlabeled Corpora in Filipino
    A large-scale, unlabeled text dataset with 39 million tokens in the training set, inspired by the original WikiText Long Term Dependency Language Modeling Dataset (Merity et al., 2016). "TL" stands for Tagalog. Published in Cruz & Cheng (2019).

  • Fake News Filipino Dataset download bibtex
    Low-Resource Fake News Detection Corpora in Filipino
    The first dataset of its kind for Filipino. Contains 3,206 expertly labeled news samples, half of which are real and half fake. Published in Cruz et al. (2020).

  • Hate Speech Dataset download bibtex
    Text Classification Dataset in Filipino
    Contains 10,000 training tweets labeled as hate speech or non-hate speech, released with 4,232 validation and 4,232 testing samples. Collected during the 2016 Philippine presidential elections and originally used in Cabasag et al. (2019). Published in Cruz & Cheng (2020).

  • Dengue Dataset download bibtex
    Low-Resource Multiclass Text Classification Dataset in Filipino
    Benchmark dataset for low-resource multiclass classification, with 4,015 training, 500 validation, and 500 testing examples labeled across five classes. Each sample may belong to more than one class. Collected from tweets and originally used in Livelo & Cheng (2018). Published in Cruz & Cheng (2020). A loading sketch for the datasets follows this list.
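
As a quick start, the datasets above can typically be loaded with pandas once downloaded. A minimal sketch is below; the file name and column names are assumptions for illustration only, so check the README bundled with each download for the actual format.

import pandas as pd

# Hypothetical file and column names -- verify against the README in each download
train_df = pd.read_csv('hatespeech/train.csv')
print(train_df.shape)

# Typical usage: parallel lists of texts and labels for a classifier
texts = train_df['text'].tolist()
labels = train_df['label'].tolist()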

Pretrained ELECTRA Models

We release new Tagalog ELECTRA models in small and base configurations, with both the discriminator and generator checkpoints available. All models follow the same configurations and were trained with the same hyperparameters as the original English ELECTRA models. Our models are available on HuggingFace Transformers and can be used with both PyTorch and TensorFlow. (Paper coming soon!)

Discriminator Models

Generator Models

The models can be loaded using the code below:

from transformers import TFAutoModel, AutoModel, AutoTokenizer

# TensorFlow
model = TFAutoModel.from_pretrained('jcblaise/electra-tagalog-small-cased-generator', from_pt=True)
tokenizer = AutoTokenizer.from_pretrained('jcblaise/electra-tagalog-small-cased-generator', do_lower_case=False)

# PyTorch
model = AutoModel.from_pretrained('jcblaise/electra-tagalog-small-cased-generator')
tokenizer = AutoTokenizer.from_pretrained('jcblaise/electra-tagalog-small-cased-generator', do_lower_case=False)
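
For classification tasks, it is the discriminator checkpoints that are typically fine-tuned. A minimal sketch is below, assuming a discriminator checkpoint named 'jcblaise/electra-tagalog-small-cased-discriminator' (mirroring the generator name above); treat the checkpoint name and the two-label setup as placeholders rather than a tuned recipe.

from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Assumed checkpoint name, mirroring the generator naming convention above
model_name = 'jcblaise/electra-tagalog-small-cased-discriminator'

tokenizer = AutoTokenizer.from_pretrained(model_name, do_lower_case=False)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# A single forward pass; actual fine-tuning would use Trainer or a manual training loop
inputs = tokenizer(["Magandang araw sa inyong lahat!"], return_tensors='pt', padding=True)
outputs = model(**inputs)
print(outputs.logits.shape)  # (batch_size, num_labels)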

Pretrained BERT Models

We release four Tagalog BERT Base models and one Tagalog DistilBERT Base model, published in Cruz & Cheng (2020). All models use the same configurations as the original English BERT models. Our models are available on HuggingFace Transformers and can be used with both PyTorch and TensorFlow. bibtex

The models can be loaded using the code below:

from transformers import TFAutoModel, AutoModel, AutoTokenizer

# TensorFlow
model = TFAutoModel.from_pretrained('jcblaise/bert-tagalog-base-cased', from_pt=True)
tokenizer = AutoTokenizer.from_pretrained('jcblaise/bert-tagalog-base-cased', do_lower_case=False)

# PyTorch
model = AutoModel.from_pretrained('jcblaise/bert-tagalog-base-cased')
tokenizer = AutoTokenizer.from_pretrained('jcblaise/bert-tagalog-base-cased', do_lower_case=False)
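
A common follow-up is extracting contextual embeddings for downstream models. A short sketch using the PyTorch model and tokenizer loaded above (the sentence is just an example, and a recent version of Transformers is assumed):

import torch

# Encode a sample sentence and run a forward pass without gradients
inputs = tokenizer("Kumusta ka na?", return_tensors='pt')
with torch.no_grad():
    outputs = model(**inputs)

# last_hidden_state has shape (batch_size, sequence_length, hidden_size)
embeddings = outputs.last_hidden_state
print(embeddings.shape)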

Other Pretrained Models

  • ULMFiT-Tagalog download bibtex
    Pretrained AWD-LSTM language model for Tagalog, compatible with the FastAI library. Published in Cruz & Cheng (2019).
  • GloVe Embeddings (30k Vocab) download
    GloVe embeddings trained on WikiText-TL-39, with the vocabulary limited to 30k tokens.
  • GloVe Embeddings (100k Vocab) download
    GloVe embeddings trained on WikiText-TL-39, with the vocabulary limited to 100k tokens. A loading sketch follows this list.
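
The GloVe files can be read with a few lines of Python, assuming they follow the standard GloVe text format of one token per line followed by its vector values. The file name below is a placeholder; substitute the actual file from the download.

import numpy as np

# Placeholder file name -- use the actual file from the download
embeddings = {}
with open('glove-tagalog-30k.txt', encoding='utf-8') as f:
    for line in f:
        parts = line.rstrip().split(' ')
        embeddings[parts[0]] = np.asarray(parts[1:], dtype='float32')

print(len(embeddings), 'tokens loaded')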