LimpiaR Overview

LimpiaR is a package built to expedite the pre-processing and cleaning of text data, with handy functions aimed especially at Spanish-language text. To get started, we’ll load a few helpful libraries.
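
A minimal setup might look like this (assuming LimpiaR and dplyr are already installed):

library(LimpiaR)
library(dplyr) # for the pipe (%>%), mutate(), rename(), etc.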

Walkthrough

LimpiaR’s functions begin with limpiar_. Once the library has been loaded, typing limpiar_ in an RStudio script or R Markdown code block will produce a drop-down menu of all LimpiaR functions, which should help you find the name of the function you’re looking for - you can then use tab to autocomplete it. Once inside the function, RStudio should show a popover with the arguments the function expects. You can also press ‘control + space’ when your cursor is inside the function’s brackets to bring up extra help.

data
#> # A tibble: 10 × 2
#>    Mention.Content                                                   Mention.Url
#>    <chr>                                                             <chr>      
#>  1 "holaaaaaa! cóómo    estás @magdalena   ?!"                       www.twitte…
#>  2 "  han visto este articulo!? Que horror! https://guardian.com/em… www.twitte…
#>  3 "ayyyyyy a mi me   gustaria ir a londres yaaa #llevame #porfavor" www.facebo…
#>  4 "jajajajaja eres un wn!"                                          www.facebo…
#>  5 "RT dale un click a ver una mujer baila con su perro"             www.twitte…
#>  6 "grax ntonces q?"                                                 www.youtub…
#>  7 "yo soy el mejor 😂😂😂, no eres nada!! 🤣🤣 "                    www.instag…
#>  8 "grax ntonces q?"                                                 www.youtub…
#>  9 "grax ntonces q?"                                                 www.youtub…
#> 10 "grax ntonces q?"                                                 www.youtub…

Column Names

We created a data frame of posts and URLs. After loading libraries and data, the first step of any workflow should be to clean the column names; this makes tab completion and accessing column names much faster (which, in the long run, adds up to big productivity gains). For this, we’ll use the janitor package. You can uncomment the code below to install janitor if it is not already installed on your machine.

# ifelse(!"janitor" %in% installed.packages(),
#    install.packages("janitor"), library(janitor))

(data <- data %>% 
   janitor::clean_names())
#> # A tibble: 10 × 2
#>    mention_content                                                   mention_url
#>    <chr>                                                             <chr>      
#>  1 "holaaaaaa! cóómo    estás @magdalena   ?!"                       www.twitte…
#>  2 "  han visto este articulo!? Que horror! https://guardian.com/em… www.twitte…
#>  3 "ayyyyyy a mi me   gustaria ir a londres yaaa #llevame #porfavor" www.facebo…
#>  4 "jajajajaja eres un wn!"                                          www.facebo…
#>  5 "RT dale un click a ver una mujer baila con su perro"             www.twitte…
#>  6 "grax ntonces q?"                                                 www.youtub…
#>  7 "yo soy el mejor 😂😂😂, no eres nada!! 🤣🤣 "                    www.instag…
#>  8 "grax ntonces q?"                                                 www.youtub…
#>  9 "grax ntonces q?"                                                 www.youtub…
#> 10 "grax ntonces q?"                                                 www.youtub…

Lower Case Text Variable

For most workflows, the next step will be to make the text variable lower case, so that tokens like ‘AMAZING’ and ‘Amazing’ are both treated as ‘amazing’. You do not need a LimpiaR function for this, as the base R function tolower() works just fine.

(data <- data %>%
  mutate(mention_content = tolower(mention_content)))
#> # A tibble: 10 × 2
#>    mention_content                                                   mention_url
#>    <chr>                                                             <chr>      
#>  1 "holaaaaaa! cóómo    estás @magdalena   ?!"                       www.twitte…
#>  2 "  han visto este articulo!? que horror! https://guardian.com/em… www.twitte…
#>  3 "ayyyyyy a mi me   gustaria ir a londres yaaa #llevame #porfavor" www.facebo…
#>  4 "jajajajaja eres un wn!"                                          www.facebo…
#>  5 "rt dale un click a ver una mujer baila con su perro"             www.twitte…
#>  6 "grax ntonces q?"                                                 www.youtub…
#>  7 "yo soy el mejor 😂😂😂, no eres nada!! 🤣🤣 "                    www.instag…
#>  8 "grax ntonces q?"                                                 www.youtub…
#>  9 "grax ntonces q?"                                                 www.youtub…
#> 10 "grax ntonces q?"                                                 www.youtub…

Limpiar Functions

limpiar_accents

Now we’re going to look at LimpiaR’s functions individually. The first is limpiar_accents, which replaces the accented characters most common in Spanish with their unaccented Latin-alphabet equivalents, e.g. ‘é’ -> ‘e’. We will use the assignment operator to make sure these changes are saved.

Tip: you can type ?limpiar_accents to access the documentation, and see which arguments you need to fill in.

(data <- data %>%
  limpiar_accents(text_var = mention_content))
#> # A tibble: 10 × 2
#>    mention_content                                                   mention_url
#>    <chr>                                                             <chr>      
#>  1 "holaaaaaa! coomo    estas @magdalena   ?!"                       www.twitte…
#>  2 "  han visto este articulo!? que horror! https://guardian.com/em… www.twitte…
#>  3 "ayyyyyy a mi me   gustaria ir a londres yaaa #llevame #porfavor" www.facebo…
#>  4 "jajajajaja eres un wn!"                                          www.facebo…
#>  5 "rt dale un click a ver una mujer baila con su perro"             www.twitte…
#>  6 "grax ntonces q?"                                                 www.youtub…
#>  7 "yo soy el mejor 😂😂😂, no eres nada!! 🤣🤣 "                    www.instag…
#>  8 "grax ntonces q?"                                                 www.youtub…
#>  9 "grax ntonces q?"                                                 www.youtub…
#> 10 "grax ntonces q?"                                                 www.youtub…

limpiar_duplicates

Now we’ll remove duplicate posts. Notice that we don’t actually need to type text_var = mention_content, because the default argument for text_var is already mention_content.

(data <- data %>%
  limpiar_duplicates())
#> # A tibble: 7 × 2
#>   mention_content                                                    mention_url
#>   <chr>                                                              <chr>      
#> 1 "holaaaaaa! coomo    estas @magdalena   ?!"                        www.twitte…
#> 2 "  han visto este articulo!? que horror! https://guardian.com/emo… www.twitte…
#> 3 "ayyyyyy a mi me   gustaria ir a londres yaaa #llevame #porfavor"  www.facebo…
#> 4 "jajajajaja eres un wn!"                                           www.facebo…
#> 5 "rt dale un click a ver una mujer baila con su perro"              www.twitte…
#> 6 "grax ntonces q?"                                                  www.youtub…
#> 7 "yo soy el mejor 😂😂😂, no eres nada!! 🤣🤣 "                     www.instag…

Note: if the text column in our data frame were called ‘text’, as in the example below, we would have to specify text_var = text, or rename the column with dplyr’s rename():

data %>% rename(text = mention_content)
#> # A tibble: 7 × 2
#>   text                                                               mention_url
#>   <chr>                                                              <chr>      
#> 1 "holaaaaaa! coomo    estas @magdalena   ?!"                        www.twitte…
#> 2 "  han visto este articulo!? que horror! https://guardian.com/emo… www.twitte…
#> 3 "ayyyyyy a mi me   gustaria ir a londres yaaa #llevame #porfavor"  www.facebo…
#> 4 "jajajajaja eres un wn!"                                           www.facebo…
#> 5 "rt dale un click a ver una mujer baila con su perro"              www.twitte…
#> 6 "grax ntonces q?"                                                  www.youtub…
#> 7 "yo soy el mejor 😂😂😂, no eres nada!! 🤣🤣 "                     www.instag…

limpiar_retweets

If you need to remove retweets, for example to create a bigram network, LimpiaR has a function just for that.

(data <- data %>% 
   limpiar_retweets())
#> # A tibble: 6 × 2
#>   mention_content                                                    mention_url
#>   <chr>                                                              <chr>      
#> 1 "holaaaaaa! coomo    estas @magdalena   ?!"                        www.twitte…
#> 2 "  han visto este articulo!? que horror! https://guardian.com/emo… www.twitte…
#> 3 "ayyyyyy a mi me   gustaria ir a londres yaaa #llevame #porfavor"  www.facebo…
#> 4 "jajajajaja eres un wn!"                                           www.facebo…
#> 5 "grax ntonces q?"                                                  www.youtub…
#> 6 "yo soy el mejor 😂😂😂, no eres nada!! 🤣🤣 "                     www.instag…

limpiar_url

We generally don’t want URLs appearing in our charts or analyses, so we can remove them with the limpiar_url function.

(data <- data %>%
   limpiar_url())
#> # A tibble: 6 × 2
#>   mention_content                                                   mention_url 
#>   <chr>                                                             <chr>       
#> 1 "holaaaaaa! coomo    estas @magdalena   ?!"                       www.twitter…
#> 2 "  han visto este articulo!? que horror!  no se puede!!"          www.twitter…
#> 3 "ayyyyyy a mi me   gustaria ir a londres yaaa #llevame #porfavor" www.faceboo…
#> 4 "jajajajaja eres un wn!"                                          www.faceboo…
#> 5 "grax ntonces q?"                                                 www.youtube…
#> 6 "yo soy el mejor 😂😂😂, no eres nada!! 🤣🤣 "                    www.instagr…

limpiar_spaces

Next we’ll look at how to use LimpiaR to remove unwanted white space - spaces at the start of a sentence, spaces before punctuation, and repeated spaces for no reason - all of which are common in the messy data we encounter.

(data <- data %>%
  limpiar_spaces())
#> # A tibble: 6 × 2
#>   mention_content                                               mention_url     
#>   <chr>                                                         <chr>           
#> 1 holaaaaaa! coomo estas @magdalena?!                           www.twitter.com…
#> 2 han visto este articulo!? que horror! no se puede!!           www.twitter.com…
#> 3 ayyyyyy a mi me gustaria ir a londres yaaa #llevame #porfavor www.facebook.co…
#> 4 jajajajaja eres un wn!                                        www.facebook.co…
#> 5 grax ntonces q?                                               www.youtube.com…
#> 6 yo soy el mejor 😂😂😂, no eres nada!! 🤣🤣                   www.instagram.c…

limpiar_tags

We can also remove user handles (e.g. @magdalena) and hashtags with the limpiar_tags function. Remember, you can type ?limpiar_tags to access documentation.

Replace only hashtags:

data %>%
  limpiar_tags(user = FALSE, hashtag = TRUE)
#> # A tibble: 6 × 2
#>   mention_content                                            mention_url        
#>   <chr>                                                      <chr>              
#> 1 holaaaaaa! coomo estas @magdalena?!                        www.twitter.com/po…
#> 2 han visto este articulo!? que horror! no se puede!!        www.twitter.com/po…
#> 3 ayyyyyy a mi me gustaria ir a londres yaaa hashtag hashtag www.facebook.com/p…
#> 4 jajajajaja eres un wn!                                     www.facebook.com/p…
#> 5 grax ntonces q?                                            www.youtube.com/po…
#> 6 yo soy el mejor 😂😂😂, no eres nada!! 🤣🤣                www.instagram.com/…

Replace only user tags:

data %>%
  limpiar_tags(user = TRUE, hashtag = FALSE)
#> # A tibble: 6 × 2
#>   mention_content                                               mention_url     
#>   <chr>                                                         <chr>           
#> 1 holaaaaaa! coomo estas @user?!                                www.twitter.com…
#> 2 han visto este articulo!? que horror! no se puede!!           www.twitter.com…
#> 3 ayyyyyy a mi me gustaria ir a londres yaaa #llevame #porfavor www.facebook.co…
#> 4 jajajajaja eres un wn!                                        www.facebook.co…
#> 5 grax ntonces q?                                               www.youtube.com…
#> 6 yo soy el mejor 😂😂😂, no eres nada!! 🤣🤣                   www.instagram.c…

Replace both hashtags and user handles:

data %>%
  limpiar_tags()
#> # A tibble: 6 × 2
#>   mention_content                                            mention_url        
#>   <chr>                                                      <chr>              
#> 1 holaaaaaa! coomo estas @user?!                             www.twitter.com/po…
#> 2 han visto este articulo!? que horror! no se puede!!        www.twitter.com/po…
#> 3 ayyyyyy a mi me gustaria ir a londres yaaa hashtag hashtag www.facebook.com/p…
#> 4 jajajajaja eres un wn!                                     www.facebook.com/p…
#> 5 grax ntonces q?                                            www.youtube.com/po…
#> 6 yo soy el mejor 😂😂😂, no eres nada!! 🤣🤣                www.instagram.com/…

Quick recap - we’ve looked at:

  • cleaning column names with janitor::clean_names()
  • making the text variable lower case with mutate() & tolower()
  • cleaning accents with limpiar_accents()
  • cleaning duplicate posts with limpiar_duplicates()
  • cleaning retweets with limpiar_retweets()
  • cleaning urls with limpiar_url()
  • cleaning spaces with limpiar_spaces()
  • cleaning user handles and hashtags with limpiar_tags()

limpiar_shorthands

One of the biggest problems with the messy data we encounter is shorthand. Algorithms are generally trained on clean, standard language, so they will not have encountered shorthands and abbreviations. Shorthands also change all the time, making it impractical to continuously retrain algorithms as new ones arise. This function attempts to bridge that gap by normalising the most common shorthands.

(data <- data %>%
   limpiar_shorthands())
#> # A tibble: 6 × 2
#>   mention_content                                               mention_url     
#>   <chr>                                                         <chr>           
#> 1 holaaaaaa! coomo estas @magdalena?!                           www.twitter.com…
#> 2 han visto este articulo!? que horror! no se puede!!           www.twitter.com…
#> 3 ayyyyyy a mi me gustaria ir a londres yaaa #llevame #porfavor www.facebook.co…
#> 4 jajajajaja eres un wuevon!                                    www.facebook.co…
#> 5 gracias entonces que?                                         www.youtube.com…
#> 6 yo soy el mejor 😂😂😂, no eres nada!! 🤣🤣                   www.instagram.c…
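
Conceptually, this normalisation is a dictionary lookup over word boundaries. A minimal sketch of the idea with stringr - the replacement pairs here are illustrative, not LimpiaR’s actual dictionary:

library(stringr)

# a named vector: names are regex patterns, values are replacements
toy_dict <- c("\\bgrax\\b" = "gracias",
              "\\bntonces\\b" = "entonces",
              "\\bq\\b" = "que")

str_replace_all("grax ntonces q?", toy_dict)
#> [1] "gracias entonces que?"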

limpiar_repeat_chars

We don’t want our algorithm to have to learn the difference between ‘ajajaj’ and ‘jaja’, or ‘ay’ and ‘ayyyy’, because practically speaking there is none. We also don’t want to introduce unnecessary tokens, so we normalise the most common occurrences of repeated or additional characters.

(data <- data %>%
   limpiar_repeat_chars())
#> # A tibble: 6 × 2
#>   mention_content                                        mention_url            
#>   <chr>                                                  <chr>                  
#> 1 hola! coomo estas @magdalena?!                         www.twitter.com/post1  
#> 2 han visto este articulo!? que horror! no se puede!!    www.twitter.com/post2  
#> 3 ay a mi me gustaria ir a londres ya #llevame #porfavor www.facebook.com/post1 
#> 4 jaja eres un wuevon!                                   www.facebook.com/post2 
#> 5 gracias entonces que?                                  www.youtube.com/post1  
#> 6 yo soy el mejor 😂😂😂, no eres nada!! 🤣🤣            www.instagram.com/post1
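
For intuition, the heart of this kind of clean-up is a back-referenced regex. A minimal sketch (an illustrative pattern only - LimpiaR’s real rules are more nuanced, e.g. a single back-reference won’t catch alternating patterns like ‘jajajajaja’):

library(stringr)

# collapse any character repeated three or more times to a single instance
str_replace_all(c("holaaaaaa", "ayyyyyy", "yaaa"), "(.)\\1{2,}", "\\1")
#> [1] "hola" "ay"   "ya"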

Generally, the steps we’ve taken so far will be used in every analysis or project to help clean the data. We will now look at some of the more circumstantial functions, i.e. those that will not be used in every analysis.

Emojis

Emojis are a type of non-ASCII Unicode character, which means that removing all non-ASCII characters will necessarily remove emojis too. It also means that functions designed to target certain Unicode characters by pattern may inadvertently remove other special characters - as well as emojis, or instead of them!

limpiar_recode_emojis()
Why not just use a regular expression? We scraped some lists of emojis, which lets the function target each emoji directly for replacement. This approach is more computationally expensive than filtering via RegEx, but it is often more precise.

We don’t need to use limpiar_recode_emojis for every analysis, as many ParseR & SegmentR functions ignore emojis implicitly. However, if we know that we need to replace them with their text descriptions, we can use limpiar_recode_emojis(). One caveat - and the reason this is for special cases only - is that the emojis’ text descriptions are in English. We may, at some point, translate them to Spanish, but it seems unlikely.

data %>%
  limpiar_recode_emojis(text_var = mention_content, with_emoji_tag = FALSE)
#> # A tibble: 6 × 2
#>   mention_content                                                    mention_url
#>   <chr>                                                              <chr>      
#> 1 hola! coomo estas @magdalena?!                                     www.twitte…
#> 2 han visto este articulo!? que horror! no se puede!!                www.twitte…
#> 3 ay a mi me gustaria ir a londres ya #llevame #porfavor             www.facebo…
#> 4 jaja eres un wuevon!                                               www.facebo…
#> 5 gracias entonces que?                                              www.youtub…
#> 6 yo soy el mejor face with tears of joy face with tears of joy fac… www.instag…

Or, if we set with_emoji_tag to TRUE, our emojis are pasted together with ‘_’ and given an ‘_emoji’ suffix.

data %>%
  limpiar_recode_emojis(mention_content, with_emoji_tag = TRUE)
#> # A tibble: 6 × 2
#>   mention_content                                                    mention_url
#>   <chr>                                                              <chr>      
#> 1 hola! coomo estas @magdalena?!                                     www.twitte…
#> 2 han visto este articulo!? que horror! no se puede!!                www.twitte…
#> 3 ay a mi me gustaria ir a londres ya #llevame #porfavor             www.facebo…
#> 4 jaja eres un wuevon!                                               www.facebo…
#> 5 gracias entonces que?                                              www.youtub…
#> 6 yo soy el mejor face_with_tears_of_joy_emoji face_with_tears_of_j… www.instag…

Warning: the limpiar_recode_emojis() function is quite slow and scales poorly with the size of its inputs, so if you use it on a large dataset with many long documents, expect it to take a while to run.

limpiar_remove_emojis()

What about situations where we don’t want to replace emojis with their text descriptions, where we don’t mind risking the loss of some other non-ASCII characters, or where we just want something that runs fast?

Instead of limpiar_recode_emojis we can use limpiar_remove_emojis! This function uses a fairly simple RegEx pattern, meaning it runs a lot more efficiently than its recode counterpart.

data %>%
  limpiar_remove_emojis(mention_content)
#> # A tibble: 6 × 2
#>   mention_content                                          mention_url          
#>   <chr>                                                    <chr>                
#> 1 "hola! coomo estas @magdalena?!"                         www.twitter.com/post1
#> 2 "han visto este articulo!? que horror! no se puede!!"    www.twitter.com/post2
#> 3 "ay a mi me gustaria ir a londres ya #llevame #porfavor" www.facebook.com/pos…
#> 4 "jaja eres un wuevon!"                                   www.facebook.com/pos…
#> 5 "gracias entonces que?"                                  www.youtube.com/post1
#> 6 "yo soy el mejor , no eres nada!! "                      www.instagram.com/po…

Non-ASCII Characters

ASCII (American Standard Code for Information Interchange) is a character-encoding standard for representing numbers and text. There are 128 ASCII characters, including the letters a-z in upper and lower case, the numbers 0-9, common punctuation marks, and a few additional characters with specific computing uses. Everything else is non-ASCII.
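
You can check whether a string is pure ASCII yourself, for example with stringi:

stringi::stri_enc_isascii(c("hola", "cóómo estás", "😂"))
#> [1]  TRUE FALSE FALSE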

All 128 ASCII characters
Dec Hex Char Description
000 00 NUL Null
001 01 SOH Start of Heading
002 02 STX Start of Text
003 03 ETX End of Text
004 04 EOT End of Transmission
005 05 ENQ Enquiry
006 06 ACK Acknowledge
007 07 BEL Bell
008 08 BS Backspace
009 09 HT Horizontal Tab
010 0A LF Line Feed
011 0B VT Vertical Tab
012 0C FF Form Feed
013 0D CR Carriage Return
014 0E SO Shift Out
015 0F SI Shift In
016 10 DLE Data Link Escape
017 11 DC1 Device Control 1 (XON)
018 12 DC2 Device Control 2
019 13 DC3 Device Control 3 (XOFF)
020 14 DC4 Device Control 4
021 15 NAK Negative Acknowledge
022 16 SYN Synchronous Idle
023 17 ETB End of Transmission Block
024 18 CAN Cancel
025 19 EM End of Medium
026 1A SUB Substitute
027 1B ESC Escape
028 1C FS File Separator
029 1D GS Group Separator
030 1E RS Record Separator
031 1F US Unit Separator
032 20 SPACE Space
033 21 ! Exclamation Mark
034 22 " Double Quote
035 23 # Number Sign
036 24 $ Dollar Sign
037 25 % Percent
038 26 & Ampersand
039 27 ' Single Quote
040 28 ( Left Parenthesis
041 29 ) Right Parenthesis
042 2A * Asterisk
043 2B + Plus
044 2C , Comma
045 2D - Hyphen
046 2E . Period
047 2F / Forward Slash
048 30 0 Zero
049 31 1 One
050 32 2 Two
051 33 3 Three
052 34 4 Four
053 35 5 Five
054 36 6 Six
055 37 7 Seven
056 38 8 Eight
057 39 9 Nine
058 3A : Colon
059 3B ; Semicolon
060 3C < Less Than
061 3D = Equals
062 3E > Greater Than
063 3F ? Question Mark
064 40 @ At Sign
065 41 A Uppercase A
066 42 B Uppercase B
067 43 C Uppercase C
068 44 D Uppercase D
069 45 E Uppercase E
070 46 F Uppercase F
071 47 G Uppercase G
072 48 H Uppercase H
073 49 I Uppercase I
074 4A J Uppercase J
075 4B K Uppercase K
076 4C L Uppercase L
077 4D M Uppercase M
078 4E N Uppercase N
079 4F O Uppercase O
080 50 P Uppercase P
081 51 Q Uppercase Q
082 52 R Uppercase R
083 53 S Uppercase S
084 54 T Uppercase T
085 55 U Uppercase U
086 56 V Uppercase V
087 57 W Uppercase W
088 58 X Uppercase X
089 59 Y Uppercase Y
090 5A Z Uppercase Z
091 5B [ Left Bracket
092 5C \ Backslash
093 5D ] Right Bracket
094 5E ^ Caret
095 5F _ Underscore
096 60 ` Backtick
097 61 a Lowercase a
098 62 b Lowercase b
099 63 c Lowercase c
100 64 d Lowercase d
101 65 e Lowercase e
102 66 f Lowercase f
103 67 g Lowercase g
104 68 h Lowercase h
105 69 i Lowercase i
106 6A j Lowercase j
107 6B k Lowercase k
108 6C l Lowercase l
109 6D m Lowercase m
110 6E n Lowercase n
111 6F o Lowercase o
112 70 p Lowercase p
113 71 q Lowercase q
114 72 r Lowercase r
115 73 s Lowercase s
116 74 t Lowercase t
117 75 u Lowercase u
118 76 v Lowercase v
119 77 w Lowercase w
120 78 x Lowercase x
121 79 y Lowercase y
122 7A z Lowercase z
123 7B { Left Brace
124 7C | Vertical Bar
125 7D } Right Brace
126 7E ~ Tilde
127 7F DEL Delete

For our purposes we have extended the ASCII characters to include Latin accents (é, í, etc.). Let’s get our original data frame back to demonstrate - we should remove things like emojis, but keep our punctuation and accented characters:

#> # A tibble: 10 × 2
#>    mention_content                                                   mention_url
#>    <chr>                                                             <chr>      
#>  1 "holaaaaaa! cóómo    estás @magdalena   ?!"                       www.twitte…
#>  2 "  han visto este articulo!? Que horror! https://guardian.com/em… www.twitte…
#>  3 "ayyyyyy a mi me   gustaria ir a londres yaaa #llevame #porfavor" www.facebo…
#>  4 "jajajajaja eres un wn!"                                          www.facebo…
#>  5 "RT dale un click a ver una mujer baila con su perro"             www.twitte…
#>  6 "grax ntonces q?"                                                 www.youtub…
#>  7 "yo soy el mejor 😂😂😂, no eres nada!! 🤣🤣 "                    www.instag…
#>  8 "grax ntonces q?"                                                 www.youtub…
#>  9 "grax ntonces q?"                                                 www.youtub…
#> 10 "grax ntonces q?"                                                 www.youtub…
data %>%
  limpiar_non_ascii(mention_content)
#> # A tibble: 10 × 2
#>    mention_content                                                   mention_url
#>    <chr>                                                             <chr>      
#>  1 "holaaaaaa! cóómo    estás @magdalena   ?!"                       www.twitte…
#>  2 "  han visto este articulo!? Que horror! https://guardian.com/em… www.twitte…
#>  3 "ayyyyyy a mi me   gustaria ir a londres yaaa #llevame #porfavor" www.facebo…
#>  4 "jajajajaja eres un wn!"                                          www.facebo…
#>  5 "RT dale un click a ver una mujer baila con su perro"             www.twitte…
#>  6 "grax ntonces q?"                                                 www.youtub…
#>  7 "yo soy el mejor , no eres nada!!  "                              www.instag…
#>  8 "grax ntonces q?"                                                 www.youtub…
#>  9 "grax ntonces q?"                                                 www.youtub…
#> 10 "grax ntonces q?"                                                 www.youtub…

limpiar_alphanumeric

Similar to non-ASCII characters, we can retain only the alphanumeric characters (a-zA-Z0-9 + spaces). This is a heavy-duty option which will remove all accented characters.

data %>%
  limpiar_alphanumeric(mention_content)
#> # A tibble: 10 × 2
#>    mention_content                                                   mention_url
#>    <chr>                                                             <chr>      
#>  1 "holaaaaaa cmo    ests magdalena   "                              www.twitte…
#>  2 "  han visto este articulo Que horror httpsguardiancomemojisbann… www.twitte…
#>  3 "ayyyyyy a mi me   gustaria ir a londres yaaa llevame porfavor"   www.facebo…
#>  4 "jajajajaja eres un wn"                                           www.facebo…
#>  5 "RT dale un click a ver una mujer baila con su perro"             www.twitte…
#>  6 "grax ntonces q"                                                  www.youtub…
#>  7 "yo soy el mejor  no eres nada  "                                 www.instag…
#>  8 "grax ntonces q"                                                  www.youtub…
#>  9 "grax ntonces q"                                                  www.youtub…
#> 10 "grax ntonces q"                                                  www.youtub…

If we want to use limpiar_alphanumeric and retain our accented characters, then we should recode the accents first with limpiar_accents:

data %>%
  limpiar_accents(mention_content) %>%
  limpiar_alphanumeric(mention_content)
#> # A tibble: 10 × 2
#>    mention_content                                                   mention_url
#>    <chr>                                                             <chr>      
#>  1 "holaaaaaa coomo    estas magdalena   "                           www.twitte…
#>  2 "  han visto este articulo Que horror httpsguardiancomemojisbann… www.twitte…
#>  3 "ayyyyyy a mi me   gustaria ir a londres yaaa llevame porfavor"   www.facebo…
#>  4 "jajajajaja eres un wn"                                           www.facebo…
#>  5 "RT dale un click a ver una mujer baila con su perro"             www.twitte…
#>  6 "grax ntonces q"                                                  www.youtub…
#>  7 "yo soy el mejor  no eres nada  "                                 www.instag…
#>  8 "grax ntonces q"                                                  www.youtub…
#>  9 "grax ntonces q"                                                  www.youtub…
#> 10 "grax ntonces q"                                                  www.youtub…

You’ll need to make an informed choice between limpiar_alphanumeric, limpiar_non_ascii, and other functions like limpiar_accents and limpiar_*_emojis.

limpiar_stopwords

Stop words are common words that do not provide us with much information about an utterance’s meaning. For example, take the sentence ‘the man is in prison for theft’: if we knew only one word from this sentence, and that word was ‘is’, ‘in’, ‘the’, or ‘for’, then we wouldn’t have much idea what the sentence is about. However, ‘prison’ or ‘theft’ would give us a lot more information.

For many analyses, we remove stop words to surface the ‘highest-information’ words and get a high-level understanding of large bodies of text (such as in topic modelling and bigram networks). For virtually all scenarios, you will want to use limpiar_stopwords() with the argument stop_words = “topics”, like so:

data %>%
  limpiar_stopwords(stop_words = "topics") %>%
  limpiar_spaces() # to clear the spaces left by removed words
#> # A tibble: 10 × 2
#>    mention_content                                                   mention_url
#>    <chr>                                                             <chr>      
#>  1 holaaaaaa! cóómo estás @magdalena?!                               www.twitte…
#>  2 visto articulo!? Que horror! https://guardian.com/emojisbanned N… www.twitte…
#>  3 ayyyyyy gustaria londres yaaa #llevame #porfavor                  www.facebo…
#>  4 jajajajaja wn!                                                    www.facebo…
#>  5 RT dale click mujer baila perro                                   www.twitte…
#>  6 grax ntonces q?                                                   www.youtub…
#>  7 😂😂😂,!! 🤣🤣                                                    www.instag…
#>  8 grax ntonces q?                                                   www.youtub…
#>  9 grax ntonces q?                                                   www.youtub…
#> 10 grax ntonces q?                                                   www.youtub…

However, sometimes we want to keep words that would usually be treated as stop words. For example, when we’re analysing sentiment, negations can invert the sentiment of a text - ‘no me gusta’ vs ‘me gusta’. If we remove all instances of ‘no’ from our data, we will do a worse job of analysing sentiment. For Spanish we therefore have a slightly shorter stop word list for sentiment than for topics, from which we have removed a few choice terms.

data %>%
  limpiar_stopwords(stop_words = "sentiment") %>%
  limpiar_spaces() 
#> # A tibble: 10 × 2
#>    mention_content                                                   mention_url
#>    <chr>                                                             <chr>      
#>  1 holaaaaaa! cóómo estás @magdalena?!                               www.twitte…
#>  2 visto articulo!? Que horror! https://guardian.com/emojisbanned N… www.twitte…
#>  3 ayyyyyy gustaria londres yaaa #llevame #porfavor                  www.facebo…
#>  4 jajajajaja wn!                                                    www.facebo…
#>  5 RT dale click mujer baila perro                                   www.twitte…
#>  6 grax ntonces q?                                                   www.youtub…
#>  7 mejor 😂😂😂, no nada!! 🤣🤣                                      www.instag…
#>  8 grax ntonces q?                                                   www.youtub…
#>  9 grax ntonces q?                                                   www.youtub…
#> 10 grax ntonces q?                                                   www.youtub…

Warning - sentences can look quite strange without stop words, and a lot of social posts become virtually meaningless altogether!

It’s also worth pointing out that a lot of information can be lost when removing stop words. Many phrases in English and Spanish have very different meanings when a stop word is removed, and some stop word lists contain negations, which can drastically change the meaning of a sentence! So use with care.
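
Mechanically, stop word removal boils down to a word-boundary find-and-remove. A minimal sketch with stringr and a toy Spanish stop word list (illustrative only - not LimpiaR’s actual lists):

library(stringr)

toy_stops <- c("el", "la", "en", "por")
pattern <- str_c("\\b(", str_c(toy_stops, collapse = "|"), ")\\b")

str_squish(str_remove_all("el hombre está en la prisión por robo", pattern))
#> [1] "hombre está prisión robo"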

Utility Functions

We are nearly at the end of this introduction to LimpiaR, but before we finish, let’s look at two utility functions which may be useful. We’ve conjured up a new data frame called df, which we will use to show the last two functions and how to chain everything together.

df
#> # A tibble: 10 × 3
#>    mention_content                                            mention_url na_col
#>    <chr>                                                      <chr>       <chr> 
#>  1 "holaaaaaa! cóómo    estás @magdalena   ?!"                www.twitte… NA    
#>  2 "  han visto este articulo!? Que horror! https://guardian… www.twitte… NA    
#>  3 "ayyyyyy a mi me   gustaria ir a londres yaaa #llevame #p… www.facebo… NA    
#>  4 "jajajajaja eres un wn!"                                   www.facebo… NA    
#>  5 "RT dale un click a ver una mujer baila con su perro"      www.twitte… NA    
#>  6 "grax ntonces q?"                                          www.youtub… NA    
#>  7 "yo soy el mejor 😂😂😂, no eres nada!! 🤣🤣 "             www.instag… NA    
#>  8 "grax ntonces q?"                                          www.youtub… NA    
#>  9 "grax ntonces q?"                                          www.youtub… tadaa 
#> 10 "grax ntonces q?"                                          www.youtub… NA

limpiar_inspect

So, imagine that we see a strange pattern and want to check what’s going on with it. We can use limpiar_inspect to view all posts which contain that pattern in an interactive frame!

limpiar_inspect(df, 
                pattern = "ntonces", 
                text_var = mention_content,
                url_var = mention_url,
                title = "ntonces")
mention_content   mention_url
grax ntonces q?   www.youtube.com/post1
grax ntonces q?   www.youtube.com/post2
grax ntonces q?   www.youtube.com/post3
grax ntonces q?   www.youtube.com/post4

Whilst it’s pretty obvious that all of the ‘grax ntonces q?’ posts are exactly the same, in the real world we’re going to have 10,000 times as many posts, and searching for suspicious patterns may take up a lot of our time.
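
A quick complementary check is to count repeated posts with plain dplyr (assuming it’s loaded) before inspecting them:

df %>%
  count(mention_content, sort = TRUE) %>%
  filter(n > 1) # here, "grax ntonces q?" shows up with n = 4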

limpiar_na_cols

This final function is useful when we want to remove ‘mostly NA’ columns from a data frame. We may want to do this to save memory, for example if we have 400,000 posts and 80 columns. In this case we’ll get rid of all columns for which 25% or more of the values are NA.

limpiar_na_cols(df, threshold = 0.25)
#> # A tibble: 10 × 2
#>    mention_content                                                   mention_url
#>    <chr>                                                             <chr>      
#>  1 "holaaaaaa! cóómo    estás @magdalena   ?!"                       www.twitte…
#>  2 "  han visto este articulo!? Que horror! https://guardian.com/em… www.twitte…
#>  3 "ayyyyyy a mi me   gustaria ir a londres yaaa #llevame #porfavor" www.facebo…
#>  4 "jajajajaja eres un wn!"                                          www.facebo…
#>  5 "RT dale un click a ver una mujer baila con su perro"             www.twitte…
#>  6 "grax ntonces q?"                                                 www.youtub…
#>  7 "yo soy el mejor 😂😂😂, no eres nada!! 🤣🤣 "                    www.instag…
#>  8 "grax ntonces q?"                                                 www.youtub…
#>  9 "grax ntonces q?"                                                 www.youtub…
#> 10 "grax ntonces q?"                                                 www.youtub…

Putting It All Together

To speed things up, we could call the functions together in one long pipe.

df %>%
  limpiar_na_cols(threshold = 0.25) %>%
  limpiar_accents() %>%
  limpiar_retweets() %>%
  limpiar_shorthands() %>%
  limpiar_repeat_chars() %>%
  limpiar_url() %>%
  limpiar_remove_emojis() %>%
  limpiar_spaces() %>%
  limpiar_duplicates()
#> # A tibble: 6 × 2
#>   mention_content                                        mention_url            
#>   <chr>                                                  <chr>                  
#> 1 hola! coomo estas @magdalena?!                         www.twitter.com/post1  
#> 2 han visto este articulo!? Que horror! NO SE PUEDE!!    www.twitter.com/post2  
#> 3 ay a mi me gustaria ir a londres ya #llevame #porfavor www.facebook.com/post1 
#> 4 jaja eres un wuevon!                                   www.facebook.com/post2 
#> 5 gracias entonces que?                                  www.youtube.com/post1  
#> 6 yo soy el mejor, no eres nada!!                        www.instagram.com/post1

However, we generally want to check the effect of each transformation on our data, so performing this many operations without any intermediate checks is risky.