LimpiaR

What is LimpiaR?

LimpiaR is an R package of functions for cleaning and pre-processing text data. The name comes from ‘limpiar’, the Spanish verb ‘to clean’. When calling a LimpiaR function, you can generally read the limpiar_ prefix as ‘clean…’.

LimpiaR is primarily used for cleaning unstructured text data, such as social media posts or reviews. Its initial release focuses on the Spanish language, but some of its functions are language-agnostic.

Installation

You can install the development version of LimpiaR from GitHub with:

# install.packages("devtools")
devtools::install_github("jpcompartir/LimpiaR")

LimpiaR Functions Overview

LimpiaR provides a suite of text cleaning and processing functions, primarily focused on preparing text data for machine learning and analytics tasks. Below you’ll find the functions organised by their primary purpose.

Functions for editing the text variable in place.

Function	Description	Language Support	Primary Use Case	Notes
limpiar_accents	Replaces accented characters with unaccented equivalents	Language-agnostic	Text normalisation	Useful for reducing token complexity
limpiar_spaces	Removes redundant spaces	Language-agnostic	Text cleaning	Also standardises punctuation spacing
limpiar_url	Removes URLs from text	Language-agnostic	Text cleaning	Handles various URL formats
limpiar_repeat_chars	Normalises repeated characters	Spanish-focused	Text normalisation	Handles laugh patterns (jajaja)
limpiar_shorthands	Expands common abbreviations	Spanish-focused	Text normalisation	e.g., “porq” → “porque”
limpiar_tags	Normalises social media tags	Language-agnostic	Social media prep	Handles @mentions and #hashtags
limpiar_stopwords	Removes common stopwords	Spanish-focused	Text analysis	Offers “sentiment” and “topics” modes
limpiar_slang	Normalises dialectal variations	Spanish-focused	Text normalisation	Handles multiple Spanish dialects
limpiar_emojis_es	Converts emojis to Spanish text	Spanish	Text normalisation	Spanish-specific emoji descriptions
limpiar_recode_emojis	Recodes emojis to text	Language-agnostic	Text normalisation	General emoji handling
limpiar_remove_emojis	Removes emojis completely	Language-agnostic	Text cleaning	Complete emoji removal
limpiar_pp_products	Replaces product mentions	English/Spanish	Entity normalisation	For product analysis
limpiar_pp_companies	Replaces company mentions	English/Spanish	Entity normalisation	For company analysis
limpiar_non_ascii	Removes non-ASCII characters	Language-agnostic	Text cleaning	Less aggressive than alphanumeric
limpiar_alphanumeric	Keeps only letters/numbers	Language-agnostic	Text cleaning	Most aggressive cleaning
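
To make a few of the rows above concrete, here is a minimal base-R sketch of three of the language-agnostic steps (URL removal, accent replacement, and whitespace squeezing). It is not LimpiaR's implementation and does not call the package: the helper name clean_text_sketch, the regular expressions, and the small accent map are all assumptions chosen for illustration.

```r
# Base-R sketch only; NOT LimpiaR's implementation.
# `clean_text_sketch` is a hypothetical helper name.
clean_text_sketch <- function(x) {
  # Drop anything that looks like a URL (http://, https://, or www.)
  x <- gsub("(https?://|www\\.)\\S+", "", x, perl = TRUE)

  # Replace lower-case accented vowels with unaccented equivalents.
  # Unicode escapes keep the script independent of the file encoding.
  accented <- c("\u00e1", "\u00e9", "\u00ed", "\u00f3", "\u00fa")
  plain    <- c("a", "e", "i", "o", "u")
  for (i in seq_along(accented)) {
    x <- gsub(accented[i], plain[i], x, fixed = TRUE)
  }

  # Collapse runs of whitespace into one space, then trim both ends
  x <- gsub("\\s+", " ", x, perl = TRUE)
  trimws(x)
}

# Example posts (made up for this sketch)
posts <- c(
  "Qu\u00e9   buena canci\u00f3n  https://example.com/abc",
  "  m\u00e1s   info en www.example.org "
)
cleaned <- clean_text_sketch(posts)
print(cleaned)
```

The order matters in this sketch: URLs are removed first so that the gaps they leave behind are collapsed by the whitespace step.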

Removing Posts

Functions for removing unwanted posts entirely (rather than cleaning).

Function	Description	Language Support	Primary Use Case	Notes
limpiar_duplicates	Removes duplicate content	Language-agnostic	Data cleaning	Also removes protected content
limpiar_retweets	Removes retweet content	Language-agnostic	Social media cleaning	Identifies RT patterns
limpiar_spam_grams	Removes spam-like patterns	Language-agnostic	Content filtering	Uses n-gram analysis
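
The difference between cleaning a post and removing it can be sketched in base R as follows. This is an illustration under stated assumptions rather than the package's own logic: the column name mention_content, the "RT " prefix rule, and the example rows are hypothetical.

```r
# Base-R sketch only; NOT LimpiaR's implementation.
# `mention_content` is an assumed name for the text column.
posts <- data.frame(
  id = 1:4,
  mention_content = c(
    "RT @user: gran oferta",
    "gran oferta",
    "gran oferta",
    "me encanta"
  ),
  stringsAsFactors = FALSE
)

# Drop rows whose text starts with the conventional "RT " retweet prefix
is_retweet <- grepl("^RT ", posts$mention_content)
no_rts <- posts[!is_retweet, , drop = FALSE]

# Keep only the first occurrence of each distinct text
deduped <- no_rts[!duplicated(no_rts$mention_content), , drop = FALSE]

print(deduped$id)
```

Rows are dropped whole here, so the remaining posts keep their original text untouched.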

Utility

Miscellaneous functions designed to speed up aspects of cleaning text.

Function	Description	Language Support	Primary Use Case	Notes
limpiar_inspect	Viewable pane for pattern matches	Language-agnostic	Data exploration	Interactive viewing
limpiar_na_cols	Removes NA-heavy columns	Language-agnostic	Data cleaning	Configurable threshold
limpiar_link_click	Makes URLs clickable and short	Language-agnostic	UI enhancement	For Shiny/DataTable
limpiar_ex_subreddits	Extracts subreddit names	Language-agnostic	Reddit analysis	URL parsing
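
Two of the utilities above, dropping NA-heavy columns and extracting subreddit names from URLs, can be approximated in base R as a rough sketch. The threshold semantics (share of NA values per column), the column names, the URLs, and the regular expression are assumptions for illustration, not the package's actual behaviour.

```r
# Base-R sketch only; NOT LimpiaR's implementation.
# Column names and URLs are made up for this example.
df <- data.frame(
  url = c(
    "https://www.reddit.com/r/rstats/comments/abc123/some_title/",
    "https://www.reddit.com/r/datascience/comments/def456/other/"
  ),
  mostly_empty = c(NA, NA),
  stringsAsFactors = FALSE
)

# Drop columns whose share of NA values exceeds a threshold (assumed value)
na_threshold <- 0.5
na_share <- colMeans(is.na(df))
df <- df[, na_share <= na_threshold, drop = FALSE]

# Capture the path segment that follows "/r/" in each URL.
# Note: sub() returns the input unchanged when the pattern does not match.
df$subreddit <- sub(".*reddit\\.com/r/([^/]+).*", "\\1", df$url)

print(df$subreddit)
```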

Parts of Speech Processing

Functions that together make up a Parts of Speech (POS) analysis workflow.

Function	Description	Language Support	Primary Use Case	Notes
limpiar_pos_import_model	Imports and caches Parts of Speech models	65+ languages	POS analysis prep	Uses UDPipe models
limpiar_pos_annotate	Performs POS analysis	65+ languages	Text analysis	Includes dependency parsing
