LimpiaR - Parts of Speech Workflow
We want to take a document (or later a set of documents) like:
Who was the first man on the moon?
And extract, for each token (where in this case token = word) in the document, its part of speech and its lemma, so that for each document we have a table like:
token | lemma | pos_tag |
---|---|---|
Who | who | PRON |
was | be | AUX |
the | the | DET |
first | first | ADJ |
man | man | NOUN |
on | on | ADP |
the | the | DET |
moon | moon | NOUN |
? | ? | PUNCT |
In this vignette we’ll look at how to use LimpiaR, via {udpipe}, to achieve this. For a primer on parts of speech, see the Parts of Speech page on Wikipedia.
For all UPOS Tags use the following table as a reference:
Part of Speech | UPOS Tag | Definition |
---|---|---|
Adjective | ADJ | Usually describes or modifies a noun. It provides information about an object’s size, color, shape, etc. Example: “big”, “blue”. |
Adposition | ADP | Class of words that includes prepositions and postpositions, used to express spatial or temporal relations or mark various semantic roles. Example: “in”, “on”, “before”, “after”. |
Adverb | ADV | Modifies a verb, an adjective, another adverb, or even an entire sentence. It provides information about manner, time, place, frequency, degree, etc. Example: “quickly”, “very”. |
Auxiliary | AUX | Special verb used to add functional or grammatical meaning to the clause in which it appears. It accompanies the main verb and forms different aspects, voices, or moods. Example: “is”, “have”, “will”. |
Coordinating Conjunction | CCONJ | Joins words, phrases, or clauses of similar grammatical status and syntactic importance. Example: “and”, “but”, “or”. |
Determiner | DET | Introduces a noun and provides context in terms of definiteness, where it belongs in a sequence, quantity, or ownership. Example: “a”, “the”, “this”, “those”. |
Interjection | INTJ | An abrupt remark, made apart from the main sentence structure, often used to express strong emotion or surprise. Example: “Oh!”, “Wow!”, “Ugh!”. |
Noun | NOUN | Represents a person, place, thing, or idea. Nouns can be subjects, objects, or complements in a sentence. Example: “dog”, “city”, “happiness”. |
Numeral | NUM | A word, symbol, or group of words representing a number. Example: “one”, “first”, “100”. |
Particle | PART | A function word that does not fit into the other categories but is used to express grammatical relationships with other words or to specify the attitude of the speaker. Example: “up” in “stand up”. |
Pronoun | PRON | Replaces a noun, often used to avoid repetition. Pronouns can do most things that nouns can do. Example: “he”, “they”, “who”. |
Proper Noun | PROPN | Names a specific person, place, thing, or idea and is typically capitalized. Example: “Elizabeth”, “London”, “Microsoft”. |
Punctuation | PUNCT | Symbols that organize writing into clauses, phrases, and sentences, clarify meaning, and indicate pauses. Example: “.”, “,”, “!”. |
Subordinating Conjunction | SCONJ | Introduces a subordinate clause and indicates the relationship between the subordinate clause and the rest of the sentence. Example: “because”, “although”, “when”. |
Symbol | SYM | A mark or character used as a conventional representation of an object, function, or process. Example: “$”, “%”, “&”. |
Verb | VERB | Expresses an action, occurrence, or state of being. Verbs are central to a clause. Example: “run”, “be”, “have”. |
Other | X | A category used for words or tokens that do not fit into the above categories. This is less common and often used in tagging to mark anomalies or unclassifiable items. |
Workflow
The main motivation for processing and extracting parts of speech is to home in on the particular aspects of language which are most informative for answering a given research question. For example, if you want to know what people think about something, you might focus on adjectives and nouns because adjectives describe things, and nouns are things. If you want to know how people are using something, you might focus more on verbs/phrasal verbs as they are ‘doing words’.
Getting started
LimpiaR’s PoS workflow leans on pre-existing functionality of the {udpipe} package to allow users to import a pre-trained model for parts of speech analysis whilst enabling advanced users to perform ‘dependency parsing’.
What is dependency parsing?
“a technique which provides to each word in a sentence the link to another word in the sentence, which is called its syntactical head. This link between each 2 words furthermore has a certain type of relationship giving you further details about it”
See here for more info.
Importing a udpipe model
First, we import the model we want to use. UDPipe pre-trained models are built upon Universal Dependencies treebanks and are made available for more than 65 languages based on 101 treebanks. For demonstrative purposes we’ll use `language = "english"`.
# load LimpiaR and dplyr for the workflow below
library(LimpiaR)
library(dplyr)

model <- limpiar_pos_import_model(language = "english")
Example data
Now that we have our model imported into our session, let’s get some data to tag, tokenise, lemmatise, and extract parts of speech. We need a data frame with both a text variable and an ID column which uniquely identifies each text.
(
  data <- tibble(
    text = tolower(stringr::sentences[1:100]),
    universal_message_id = paste0("TWITTER", 1:100)
  )
)
## # A tibble: 100 × 2
## text universal_message_id
## <chr> <chr>
## 1 the birch canoe slid on the smooth planks. TWITTER1
## 2 glue the sheet to the dark blue background. TWITTER2
## 3 it's easy to tell the depth of a well. TWITTER3
## 4 these days a chicken leg is a rare dish. TWITTER4
## 5 rice is often served in round bowls. TWITTER5
## 6 the juice of lemons makes fine punch. TWITTER6
## 7 the box was thrown beside the parked truck. TWITTER7
## 8 the hogs were fed chopped corn and garbage. TWITTER8
## 9 four hours of steady work faced us. TWITTER9
## 10 a large size in stockings is hard to sell. TWITTER10
## # ℹ 90 more rows
Now that we have some data and our desired model loaded into the session, the function we need next is `limpiar_pos_annotate()`.
Before we begin, it’s important to note that the output will be sensitive to earlier cleaning steps. For example, if punctuation has been removed, PoS tagging may under-perform; nouns (NOUN) and proper nouns (PROPN) also become harder to differentiate when punctuation is removed and all text is lowercased.
So be careful when pre-processing your text variable if you intend to follow the PoS workflow.
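To illustrate, here is a hypothetical sketch (the example sentence and variable names are our own, and the exact tags returned will depend on the model) comparing a cased and a lowercased version of the same text:
# hypothetical illustration: casing helps the model separate PROPN from NOUN
cased <- tibble(text = "Elizabeth lives in London.", id = "doc1")
lowercased <- mutate(cased, text = tolower(text))

limpiar_pos_annotate(cased, text, id, model)      # "Elizabeth", "London" -> likely PROPN
limpiar_pos_annotate(lowercased, text, id, model) # lowercased names may fall back to NOUN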
Extracting parts of speech
The `limpiar_pos_annotate()` function converts the text of each document into its tokens, lemmas, PoS tags, and dependency relationships*. The input data frame should have one document per row; the returned data frame will have one row per token. This means the returned data frame will have many more rows than the input data frame!
dependency_parse default value
*The `dependency_parse` argument is set to `FALSE` by default, as this step can be costly in time and compute if performed on large data sets. Most users will get by just fine without parsing dependencies, but if you are sure you need them, set this argument to `TRUE`.
`limpiar_pos_annotate()` can be sped up with `in_parallel = TRUE`, which makes the function process multiple rows of the data frame in parallel. Note that running the function in parallel prevents the progress from being updated.
By default the function prints its progress every 100 rows, so you know that it’s still working and roughly how long it should take to finish running. If you don’t want progress updates printed to your console, set `update_progress = 0`.
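For a larger data set, a parallel call might look like the following sketch (it reuses the arguments introduced above; `annotations_fast` is a hypothetical name, and remember that progress messages are unavailable in parallel mode):
# hypothetical parallel call; dependency parsing skipped to save compute
annotations_fast <- limpiar_pos_annotate(
  data = data,
  text_var = text,
  id_var = universal_message_id,
  pos_model = model,
  dependency_parse = FALSE,
  in_parallel = TRUE
)
For this vignette, though, we’ll run sequentially with dependency parsing enabled and a progress update every 25 rows: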
# annotate texts and perform dependency parsing
annotations <- limpiar_pos_annotate(
  data = data,
  text_var = text,
  id_var = universal_message_id,
  pos_model = model,
  dependency_parse = TRUE,
  in_parallel = FALSE,
  update_progress = 25
)
## 2024-12-09 11:20:05.235727 Annotating text fragment 1/100
## 2024-12-09 11:20:05.401948 Annotating text fragment 26/100
## 2024-12-09 11:20:05.542495 Annotating text fragment 51/100
## 2024-12-09 11:20:05.700728 Annotating text fragment 76/100
Now that our texts are tokenised, dependencies parsed, and PoS annotations complete, let’s take a look at the output. After annotating we have 882 rows, which is an 8.82x increase!
## # A tibble: 882 × 9
## universal_message_id paragraph_id sentence_id token_id token lemma pos_tag
## <chr> <int> <int> <chr> <chr> <chr> <chr>
## 1 TWITTER1 1 1 1 the the DET
## 2 TWITTER1 1 1 2 birch birch NOUN
## 3 TWITTER1 1 1 3 canoe canoe NOUN
## 4 TWITTER1 1 1 4 slid slid NOUN
## 5 TWITTER1 1 1 5 on on ADP
## 6 TWITTER1 1 1 6 the the DET
## 7 TWITTER1 1 1 7 smooth smooth ADJ
## 8 TWITTER1 1 1 8 planks plank NOUN
## 9 TWITTER1 1 1 9 . . PUNCT
## 10 TWITTER2 1 1 1 glue glue VERB
## # ℹ 872 more rows
## # ℹ 2 more variables: head_token_id <chr>, dependency_tag <chr>
The output has some additional columns; we will pay closest attention to `paragraph_id`, `sentence_id`, `token`, `lemma`, `pos_tag`, and `dependency_tag`.
What are these new columns?
`paragraph_id` and `sentence_id` tell us which of the document’s paragraphs, and which sentence within it, the token is in. Because our data set contains only single sentences, all of these values are 1.
`token` is the word that the rest of the columns refer to.
`lemma` is the lemmatised version of the token; an example of lemmatisation is ‘served’ being converted to ‘serve’.
`pos_tag` displays PoS labels such as Noun (NOUN), Proper Noun (PROPN), Pronoun (PRON), Verb (VERB), and Adjective (ADJ). For more on the exact tags and what they mean, see PoS Tags or the UPOS tags table at the top of this document.
`dependency_tag` displays the dependency relation for each token. Dependency relations are significantly more complicated than parts of speech; see Dependency Relations for more details.
`head_token_id`, `dependency_tag`, and `feats` will only be present if we set `dependency_parse = TRUE`. They tell us the ID of the token (within the document, paragraph, and sentence) that the row’s token has a dependency relation to, the type of that relation, and any additional features of the relationship.
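For example, a quick tally of how often each dependency relation occurs in our annotated data (a minimal sketch using {dplyr}):
# count the dependency relations produced by the parse
annotations %>%
  count(dependency_tag, sort = TRUE)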
Finally, there is the column you supplied as `id_var`. This variable will help you join the results back to your original data frame.
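A token-level join back onto the original texts might look like this (a sketch; `annotations_with_text` is a hypothetical name):
# re-attach the original text to every token row via the shared id
annotations_with_text <- annotations %>%
  left_join(data, by = "universal_message_id")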
Manipulating the output
Now that we have adequate context, let’s look at what we can do with our output.
We can combine commonly-used {dplyr} functions for data wrangling and summarisation to filter for certain parts of speech and count their occurrences.
Given the size of our data set there are limited inferences we can make from the counts, but if we just want to know which adjectives are seen most frequently:
# Count adjectives
annotations %>%
  filter(pos_tag %in% c("ADJ")) %>%
  count(token, pos_tag, sort = TRUE)
## # A tibble: 65 × 3
## token pos_tag n
## <chr> <chr> <int>
## 1 blue ADJ 3
## 2 clear ADJ 3
## 3 third ADJ 3
## 4 large ADJ 2
## 5 more ADJ 2
## 6 sharp ADJ 2
## 7 small ADJ 2
## 8 wide ADJ 2
## 9 young ADJ 2
## 10 bent ADJ 1
## # ℹ 55 more rows
Alternatively, we could do the same thing for nouns, but this time counting the lemma (to account for plurals).
# Count the lemma of each noun
annotations %>%
  filter(pos_tag %in% c("NOUN")) %>% # this time selecting nouns
  count(lemma, sort = TRUE)
## # A tibble: 214 × 2
## lemma n
## <chr> <int>
## 1 fence 3
## 2 week 3
## 3 air 2
## 4 boy 2
## 5 cat 2
## 6 coat 2
## 7 day 2
## 8 fish 2
## 9 girl 2
## 10 grass 2
## # ℹ 204 more rows
Adjective - Noun collocations
Are there any adjective -> noun pairs seen more than once? This task gets slightly more involved, as we need to compare the value of each token with the value in the next row. We’ll need `dplyr::filter()` together with `dplyr::lag()` and `dplyr::lead()`, grouping by `universal_message_id`, `paragraph_id`, and `sentence_id`.
adj_to_noun <- annotations %>%
  filter(
    (pos_tag == "ADJ" & lead(pos_tag) == "NOUN") |
      (pos_tag == "NOUN" & lag(pos_tag) == "ADJ"),
    .by = c(universal_message_id, paragraph_id, sentence_id)
  )
Then we’ll need to perform some more {dplyr} magic to count the ADJ-NOUN collocations. The data is now set up so that rows 1+2, 3+4, 5+6 and so on are ADJ-NOUN collocations. We can use this information to extract the pairs with a combination of modular arithmetic (`row_number() %% 2`), exploiting how `cumsum()` coerces logicals in R, and a ‘grouped, summarise paste with a flourish of collapse’ trick.
adj_to_noun %>%
  mutate(collocation_id = row_number() %% 2, .before = 1) %>%
  mutate(collocation_id = cumsum(collocation_id == 1)) %>%
  summarise(collocation = paste0(token, collapse = " "), .by = collocation_id) %>%
  count(collocation, sort = TRUE)
## # A tibble: 56 × 2
## collocation n
## <chr> <int>
## 1 third week 2
## 2 bent hook 1
## 3 big task 1
## 4 blue air 1
## 5 blue background 1
## 6 blue fish 1
## 7 clear response 1
## 8 clear spring 1
## 9 clear water 1
## 10 cloud hung 1
## # ℹ 46 more rows
The only ADJ-NOUN collocation seen more than once was ‘third week’. In a larger data set we would expect to find more and higher-frequency ADJ-NOUN collocations.
However, when manipulating the data this way, the onus is on the user to know and understand which transformations are appropriate.
Repairing the sentence
For this next example we’re going to convert every token with `pos_tag == "ADJ"` or `pos_tag == "NOUN"` to its lemma, and stitch the sentences back together so that they’re ready for visualisations. We’ll make use of `case_when()` and the now-familiar grouped summarise-paste trick. If we group by `universal_message_id` we’ll get back a 100-row data frame which we can join back to our original data frame.
annotations <- annotations %>%
  mutate(token = case_when(
    pos_tag == "ADJ" ~ lemma,
    pos_tag == "NOUN" ~ lemma,
    TRUE ~ token # leave all other tokens unchanged
  )) %>%
  select(token, universal_message_id) %>%
  summarise(
    sentence = paste0(token, collapse = " "),
    .by = universal_message_id
  )
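The join back onto the original data frame is then a single `left_join()` (a sketch; `data_repaired` is a hypothetical name):
# attach the repaired sentences to the original 100-row data frame
data_repaired <- data %>%
  left_join(annotations, by = "universal_message_id")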
Once we have this information about the co-occurrences of nouns and adjectives within our documents, we can evidence which entities appear throughout our data, as well as the type of language used in conjunction with and toward them. We can later build on this by visualising the frequency of terms appearing within the same document, as well as those used one directly after another, as in the collocation example above, where ‘third week’ appeared most frequently.
Other languages - Spanish
The workflow for Spanish is very similar to English. We’ll select `language = "spanish"` and then take a look at the output for a Spanish document. We’ll take a paragraph from El País and store it in a data frame named `spanish`:
“Hace muchos años, cuando llegó a La Habana, encontró el mejor acto de amor —el estudio de la escritura, que es aún mejor que el estudio de la vida— por parte de la joven Yalenis Velazco, quien dedicó un intenso ensayo a su obra.”
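The excerpt needs to go into a one-row data frame first. A minimal construction might be (the `id` column name is our assumption, chosen to match the grouping code below):
# store the El País excerpt as a single-document data frame
spanish <- tibble(
  text = "Hace muchos años, cuando llegó a La Habana, encontró el mejor acto de amor —el estudio de la escritura, que es aún mejor que el estudio de la vida— por parte de la joven Yalenis Velazco, quien dedicó un intenso ensayo a su obra.",
  id = 1
)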
spanish_model <- limpiar_pos_import_model("spanish") # load the Spanish model

spanish <- spanish %>%
  limpiar_pos_annotate(text, id, spanish_model, update_progress = 0) # extract PoS tags
## # A tibble: 49 × 3
## token lemma pos_tag
## <chr> <chr> <chr>
## 1 Hace hacer VERB
## 2 muchos mucho DET
## 3 años año NOUN
## 4 , , PUNCT
## 5 cuando cuando SCONJ
## 6 llegó llegar VERB
## 7 a a ADP
## 8 La el DET
## 9 Habana habana PROPN
## 10 , , PUNCT
## # ℹ 39 more rows
We see immediately that ‘hace’, a conjugation of the verb ‘hacer’, is converted to ‘hacer’; likewise muchos -> mucho and años -> año. These changes are generally desirable: they help reduce the overall number of distinct words in our data set, so our text-based visualisations should be cleaner. However, it is always a good idea to carefully consider the effects of any transformation you perform on your data, rather than applying it blindly.
We can convert all verbs to their lemmas and stitch the document back together using the same tricks as we did for English:
spanish %>%
  mutate(token = ifelse(pos_tag == "VERB", lemma, token)) %>%
  summarise(
    text = paste0(token, collapse = " "),
    .by = id
  )
## # A tibble: 1 × 2
## id text
## <dbl> <chr>
## 1 1 hacer muchos años , cuando llegar a La Habana , encontrar el mejor acto…
Clearly the sentence is now ungrammatical, but for a topic modelling workflow or some n-gram networks, the data will be cleaner and the counts will better reflect the real frequencies for each verb.
No model is 100% accurate, so we should expect the occasional incorrect tag or lemma. This is particularly true for languages other than English, which tend to have less training data available and can be more grammatically complex.