Remove everything except letters, numbers, and spaces
Source: R/limpiar_alphanumeric.R
A simple regex that retains only a-z, A-Z, and 0-9, plus whitespace characters (including newlines). This function removes accented characters, any other non-English characters, punctuation, and so on, so it is a heavy-duty approach to cleaning and should be used prudently. If you know that you need to keep accents, try limpiar_non_ascii first, before avoiding these functions altogether.
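To make the retention rule concrete, here is a minimal base R sketch of the same idea. The helper name keep_alphanumeric and the exact pattern are assumptions made for illustration; they are not taken from the package source, which may be implemented differently.

```r
# Hypothetical helper (not the package's own code): drop every character
# that is not an ASCII letter, an ASCII digit, or whitespace.
keep_alphanumeric <- function(x) {
  # perl = TRUE so that \s is understood inside the bracket expression
  gsub("[^a-zA-Z0-9\\s]", "", x, perl = TRUE)
}

keep_alphanumeric("Hello! How are you?")
#> [1] "Hello How are you"
keep_alphanumeric("café München niño")
#> [1] "caf Mnchen nio"
```

Because accented letters fall outside a-z and A-Z, they are deleted rather than transliterated, which is why words such as "café" come out truncated.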
Examples
test_df <- data.frame(
  text = c(
    "Simple text 123",                  # Basic ASCII only
    "Hello! How are you? 😊 🌟",        # ASCII + punctuation + emojis
    "café München niño",                # Latin-1 accented characters
    "#special@chars&(~)|[$]",           # Special characters and symbols
    "混合汉字と日本語 → ⌘ £€¥"           # CJK characters + symbols + arrows
  )
)

limpiar_alphanumeric(test_df, text)
#> text
#> 1 Simple text 123
#> 2 Hello How are you
#> 3 caf Mnchen nio
#> 4 specialchars
#> 5