
What is LimpiaR?

LimpiaR is an R library of functions for cleaning and pre-processing text data. The name comes from ‘limpiar’, the Spanish verb meaning ‘to clean’. When calling a LimpiaR function, you can generally read the limpiar_ prefix as ‘clean…’.

LimpiaR is primarily used for cleaning unstructured text data, such as social media posts or reviews. Its initial release focuses on the Spanish language, but some of its functions are language-agnostic.

Installation

You can install the development version of LimpiaR from GitHub with:

# install.packages("devtools")
devtools::install_github("jpcompartir/LimpiaR")

LimpiaR provides a comprehensive suite of text cleaning and processing functions, primarily focused on preparing text data for machine learning and analytics tasks. Below you’ll find the functions organised by their primary purpose.

Functions for editing the text variable in place.

| Function | Description | Language Support | Primary Use Case | Notes |
|---|---|---|---|---|
| limpiar_accents | Removes accented characters | Language-agnostic | Text normalisation | Useful for reducing token complexity |
| limpiar_spaces | Removes redundant spaces | Language-agnostic | Text cleaning | Also standardises punctuation spacing |
| limpiar_url | Removes URLs from text | Language-agnostic | Text cleaning | Handles various URL formats |
| limpiar_repeat_chars | Normalises repeated characters | Spanish-focused | Text normalisation | Handles laugh patterns (jajaja) |
| limpiar_shorthands | Expands common abbreviations | Spanish-focused | Text normalisation | e.g., “porq” → “porque” |
| limpiar_tags | Normalises social media tags | Language-agnostic | Social media prep | Handles @mentions and #hashtags |
| limpiar_stopwords | Removes common stopwords | Spanish-focused | Text analysis | Offers “sentiment” and “topics” modes |
| limpiar_slang | Normalises dialectal variations | Spanish-focused | Text normalisation | Handles multiple Spanish dialects |
| limpiar_emojis_es | Converts emojis to Spanish text | Spanish | Text normalisation | Spanish-specific emoji descriptions |
| limpiar_recode_emojis | Recodes emojis to text | Language-agnostic | Text normalisation | General emoji handling |
| limpiar_remove_emojis | Removes emojis completely | Language-agnostic | Text cleaning | Complete emoji removal |
| limpiar_pp_products | Replaces product mentions | English/Spanish | Entity normalisation | For product analysis |
| limpiar_pp_companies | Replaces company mentions | English/Spanish | Entity normalisation | For company analysis |
| limpiar_non_ascii | Removes non-ASCII characters | Language-agnostic | Text cleaning | Less aggressive than alphanumeric |
| limpiar_alphanumeric | Keeps only letters/numbers | Language-agnostic | Text cleaning | Most aggressive cleaning |
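
To make the idea of in-place cleaning concrete, here is a minimal base-R sketch of three of the steps listed above: URL removal, collapsing laugh patterns, and squeezing redundant spaces. The helper names (strip_urls, squeeze_laughs, squeeze_spaces) and the regular expressions are assumptions made for illustration; they are not LimpiaR functions and may not match how the package implements these steps.

```r
# Hypothetical helpers for illustration only; these are NOT part of LimpiaR.

strip_urls <- function(x) {
  # Drop anything starting with http://, https:// or www. up to the next space
  gsub("(https?://|www\\.)\\S+", "", x, perl = TRUE)
}

squeeze_laughs <- function(x) {
  # Collapse two or more consecutive "ja" into a single "jaja"
  gsub("(ja){2,}", "jaja", x, perl = TRUE)
}

squeeze_spaces <- function(x) {
  # Collapse runs of whitespace into one space, then trim both ends
  trimws(gsub("[[:space:]]+", " ", x))
}

posts <- c(
  "Mira   esto https://example.com/a?b=1 jajajajaja",
  "  Sin   enlaces  "
)

# Apply the steps in sequence; spaces are squeezed last so that gaps
# left behind by removed URLs are cleaned up too.
clean <- squeeze_spaces(squeeze_laughs(strip_urls(posts)))
print(clean)
```

Ordering matters in a chain like this: squeezing spaces after removing URLs avoids leaving double spaces where a link used to be.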

Removing Posts

Functions for removing unwanted posts entirely (rather than cleaning).

| Function | Description | Language Support | Primary Use Case | Notes |
|---|---|---|---|---|
| limpiar_duplicates | Removes duplicate content | Language-agnostic | Data cleaning | Also removes protected content |
| limpiar_retweets | Removes retweet content | Language-agnostic | Social media cleaning | Identifies RT patterns |
| limpiar_spam_grams | Removes spam-like patterns | Language-agnostic | Content filtering | Uses n-gram analysis |
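
The difference between cleaning a post and removing it can be sketched in base R as a row filter. The example below assumes a data frame with a text column and treats a leading "RT @user" as the retweet pattern; both the column names and the pattern are assumptions for illustration, and LimpiaR's own functions may use different rules.

```r
# Illustrative data; the column names (id, text) are assumptions.
posts <- data.frame(
  id = 1:5,
  text = c(
    "Gran producto",
    "RT @marca: Gran producto",
    "Gran producto",
    "No me gusta",
    "rt @otro: No me gusta"
  ),
  stringsAsFactors = FALSE
)

# Flag retweets by the conventional "RT @user" prefix, ignoring case
is_retweet <- grepl("^\\s*RT\\s+@", posts$text, ignore.case = TRUE, perl = TRUE)

# Flag posts whose text exactly repeats an earlier row
is_duplicate <- duplicated(posts$text)

# Keep only rows that are neither retweets nor duplicates
kept <- posts[!is_retweet & !is_duplicate, ]
print(kept)
```

Unlike the cleaning functions, this changes the number of rows, so it is worth recording the row count before and after when reporting on a dataset.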

Utility

Miscellaneous functions designed to speed up aspects of cleaning text.

| Function | Description | Language Support | Primary Use Case | Notes |
|---|---|---|---|---|
| limpiar_inspect | Viewable pane for pattern matches | Language-agnostic | Data exploration | Interactive viewing |
| limpiar_na_cols | Removes NA-heavy columns | Language-agnostic | Data cleaning | Configurable threshold |
| limpiar_link_click | Makes URLs clickable and short | Language-agnostic | UI enhancement | For Shiny/DataTable |
| limpiar_ex_subreddits | Extracts subreddit names | Language-agnostic | Reddit analysis | URL parsing |
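
Two of these utilities are easy to picture with a small base-R sketch: dropping columns whose share of missing values exceeds a threshold, and pulling a subreddit name out of a URL. The helpers drop_na_cols and extract_subreddit are hypothetical names, and the threshold semantics (drop when the NA share is above the threshold) and the "/r/name" pattern are assumptions rather than LimpiaR's documented behaviour.

```r
# Hypothetical helpers for illustration only; these are NOT part of LimpiaR.

drop_na_cols <- function(df, threshold = 0.5) {
  # Share of missing values per column
  na_share <- colMeans(is.na(df))
  # Keep columns whose NA share does not exceed the threshold
  df[, na_share <= threshold, drop = FALSE]
}

extract_subreddit <- function(url) {
  # Capture the name that follows "/r/"; NA when there is no such segment
  hits <- regmatches(url, regexec("/r/([A-Za-z0-9_]+)", url))
  vapply(
    hits,
    function(m) if (length(m) >= 2) m[2] else NA_character_,
    character(1)
  )
}

df <- data.frame(
  a = c(1, NA, 3, 4),      # 25% missing
  b = c(NA, NA, NA, 4),    # 75% missing
  c = c("x", "y", "z", "w"),
  stringsAsFactors = FALSE
)
kept_cols <- drop_na_cols(df, threshold = 0.5)
print(names(kept_cols))

subs <- extract_subreddit(c(
  "https://www.reddit.com/r/rstats/comments/abc123/",
  "https://example.com/"
))
print(subs)
```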

Parts of Speech Processing

A collection of functions that collectively make up a Parts of Speech (POS) analysis workflow.

| Function | Description | Language Support | Primary Use Case | Notes |
|---|---|---|---|---|
| limpiar_pos_import_model | Imports and caches Parts of Speech models | 65+ languages | POS analysis prep | Uses UDPipe models |
| limpiar_pos_annotate | Performs POS analysis | 65+ languages | Text analysis | Includes dependency parsing |