💫 Industrial-strength Natural Language Processing (NLP) in Python
Updated Jul 12, 2024 · Python
Easy token price estimates for 400+ LLMs. TokenOps.
👑 spaCy building blocks and visualizers for Streamlit apps
All the slides, accompanying code, and exercises are stored in this repo. 🎈
Trankit is a Light-Weight Transformer-based Python Toolkit for Multilingual Natural Language Processing
Ekphrasis is a text processing tool geared towards text from social networks, such as Twitter or Facebook. Ekphrasis performs tokenization, word normalization, word segmentation (for splitting hashtags), and spell correction, using word statistics from two large corpora (English Wikipedia and Twitter, 330 million English tweets).
Rule-based token and sentence segmentation for the Russian language
[Paper][Preprint 2024] MyGO: Discrete Modality Information as Fine-Grained Tokens for Multi-modal Knowledge Graph Completion
OmniTokenizer: one model and one weight for image-video joint tokenization.
A unified tokenization tool for Images, Chinese and English.
Simple multilingual lemmatizer for Python, especially useful for speed and efficiency
Fast bare-bones BPE for modern tokenizer training
Implementation of the GBST block from the Charformer paper, in PyTorch
Implementation of the paper "LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens"
Code for Zero-Shot Tokenizer Transfer
[LREC 2022] An off-the-shelf pre-trained Tweet NLP Toolkit (NER, tokenization, lemmatization, POS tagging, dependency parsing) + Tweebank-NER dataset
Train a Chinese vocabulary with BPE in sentencepiece and use it in transformers.
FPE - Format Preserving Encryption with FF3 in Python
A Fast and Accurate Neural Thai Word Segmenter
NLP Cloud serves high-performance pre-trained or custom models for NER, sentiment analysis, classification, summarization, paraphrasing, intent classification, product description and ad generation, chatbot, grammar and spelling correction, keyword and keyphrase extraction, text generation, image generation, code generation, and more.