When mining opinion we often want to separate what people are saying about something directly from when they’re mentioning it indirectly.
Compare:
- Company are really impressing me, I’ll be purchasing the pro subscription later this month.
- Enterprise’s subscription is miles better than Company’s, I’ve never used anything better.
We’ll look at how to do just that in this vignette, using dependency parsing courtesy of {udpipe}.
library(tidyverse) # Load the libraries we'll need
library(LimpiaR)

model <- limpiar_pos_import_model("english")
dummy <- tibble(
  text = c(
    "Company are really impressing me, I'll be purchasing the pro subscription later this month.",
    "Enterprise's subscription is miles better than Company's, I've never used anything better.",
    "Company suck."
  ),
  doc_id = as.character(c(1, 2, 3))
)
dependencies <- dummy %>%
  limpiar_pos_annotate(
    text_var = text,
    id_var = doc_id,
    pos_model = model,
    dependency_parse = TRUE
  )
2024-02-28 11:30:40.633358 Annotating text fragment 1/3
dependencies %>%
  select(-c(doc_id, sentence, xpos, token_id, feats, paragraph_id, sentence_id))
# A tibble: 37 × 5
token lemma pos_tag head_token_id dependency_tag
<chr> <chr> <chr> <chr> <chr>
1 Company company NOUN 4 nsubj
2 are be AUX 4 aux
3 really really ADV 4 advmod
4 impressing impress VERB 0 root
5 me I PRON 4 obj
6 , , PUNCT 4 punct
7 I I PRON 10 nsubj
8 'll will AUX 10 aux
9 be be AUX 10 aux
10 purchasing purchase VERB 4 parataxis
# ℹ 27 more rows
Our data has been transformed: where before we had 2 columns and 3 rows, we now have 12 columns and 37 rows. Each row now contains a single token (a word or punctuation mark) from our input texts.
The important column we’ll look at is ‘dependency_tag’. In the first text we find that “Company” is labelled ‘nsubj’, which stands for nominal subject. Grammatically, the nominal subject is the part of the clause that performs the action of the verb. So in this case the action is impressing, and it’s Company that’s doing the impressing. Roughly speaking, we can say that when Company’s dependency is nsubj, the surrounding clause will relate closely to Company.
dependencies %>%
  filter(token == "Company") %>%
  relocate(token, dependency_tag, .after = "token_id") %>%
  select(-c(doc_id, sentence, xpos, token_id, feats, paragraph_id, sentence_id))
# A tibble: 3 × 5
token dependency_tag lemma pos_tag head_token_id
<chr> <chr> <chr> <chr> <chr>
1 Company nsubj company NOUN 4
2 Company obl company PROPN 6
3 Company det company DET 2
In the second text, Company is tagged as ‘obl’, or oblique. This means it is mentioned in relation to something else, and here that something is Enterprise’s subscription.
dependencies %>%
  filter(doc_id == 2) %>%
  select(-c(doc_id, paragraph_id, sentence_id, sentence, feats, xpos)) %>%
  relocate(dependency_tag, .after = token)
# A tibble: 17 × 6
token_id token dependency_tag lemma pos_tag head_token_id
<chr> <chr> <chr> <chr> <chr> <chr>
1 1 Enterprise nmod:poss Enterprise PROPN 3
2 2 's case 's PART 1
3 3 subscription nsubj subscription NOUN 6
4 4 is cop be AUX 6
5 5 miles obl:npmod miles NOUN 6
6 6 better root better ADJ 0
7 7 than case than ADP 8
8 8 Company obl company PROPN 6
9 9 's case 's PART 8
10 10 , punct , PUNCT 6
11 11 I nsubj I PRON 14
12 12 've aux have AUX 14
13 13 never advmod never ADV 14
14 14 used parataxis use VERB 6
15 15 anything obj anything PRON 14
16 16 better amod better ADJ 15
17 17 . punct . PUNCT 6
If we were tasked with identifying what people really think about Company, we could start by finding the sentences, paragraphs, or documents where Company’s dependency_tag is one of “nsubj”, “obj”, or “nsubj:pass”. This would likely reduce our data set significantly and return a set of high-quality documents.
However, we would certainly be dispensing with a lot of useful data: a stricter filter can only ever match a subset of a looser one, so looking for just “nsubj” finds at most as many documents as looking for [“nsubj” or “obj”], which in turn finds at most as many as [“nsubj” or “obj” or “nsubj:pass”].
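As a quick sanity check, here’s a sketch using the `dependencies` tibble from above; the multi-tag filter can only ever match at least as many documents as the single-tag one:

# Documents where Company is the nominal subject
docs_nsubj <- dependencies %>%
  filter(token == "Company", dependency_tag == "nsubj") %>%
  distinct(doc_id)

# Documents where Company carries any of the three tags of interest
docs_wider <- dependencies %>%
  filter(token == "Company",
         dependency_tag %in% c("nsubj", "obj", "nsubj:pass")) %>%
  distinct(doc_id)

nrow(docs_wider) >= nrow(docs_nsubj) # always TRUE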
Let’s up the ante and read in some data from a Peaks & Pits project which has been classified by our Peaks & Pits SetFit model.
data <- read_csv("~/Downloads/peaks_pits_sample_v2.csv") %>%
  mutate(doc_id = as.character(row_number()))

model <- limpiar_pos_import_model("english")
Then we perform the annotation, lower-case the tokens, and extract every row where the token is “excel”.
data_annotate <- data %>%
  limpiar_pos_annotate(message_og, doc_id, model, dependency_parse = TRUE, in_parallel = TRUE, update_progress = 200)

data_annotate <- data_annotate %>%
  mutate(token = tolower(token))

excel <- filter(data_annotate, token == "excel")
The simple counts of the dependencies suggest that Excel is often referred to as a compound, so we would expect to find something else next to it, like Microsoft or MS.
excel %>%
  count(dependency_tag, sort = TRUE)
# A tibble: 22 × 2
   dependency_tag     n
   <chr>          <int>
 1 compound        2533
 2 nmod             764
 3 obl              701
 4 nsubj            669
 5 obj              543
 6 conj             460
 7 flat             228
 8 root             146
 9 nmod:poss         52
10 appos             49
11 amod              27
12 nsubj:pass        26
13 parataxis         15
14 advcl             12
15 xcomp             10
16 ccomp              9
17 obl:npmod          7
18 acl:relcl          5
19 acl                3
20 iobj               2
21 list               1
22 vocative           1
Recall that nouns are things, and noun-phrases are multiple words which refer to a single thing. It takes a bit of training to spot noun-phrases, but take the following sentence:

“The man with sunglasses sipped his orange juice”
There are two noun-phrases - “The man with sunglasses” forms one noun-phrase, and “orange juice” forms the other.
When Excel is tagged as a compound, it tends to sit inside a noun-phrase, like ‘Excel spreadsheet’, ‘Excel files’, ‘Excel functions’. That’s not to say these aren’t useful, but we’d want to do some more processing to combine the compounds and find out what the dependency relation of the entire compound is.
Combining the compounds is not as trivial as it may seem - we don’t have a nice function like spacy_extract_nounphrases() from {spacyr}, and the function I made is presumably somewhere in the middle of Bangkok because I cannot find the bugger. A rough sketch follows the examples below.
excel %>%
  filter(dependency_tag == "compound") %>%
  slice(1:10) %>%
  pull(sentence)
 [1] "They would then be recorded on an excel spreadsheet against the product details."
[2] "Could not save jason file in excel utility of ITR of updated return .system showing run time error."
[3] "Whether you're a beginner or a seasoned Excel user, Lookup functions offer a simple and efficient way to handle complex data tasks."
[4] "bruh chatgpt can open excel and word files, read them and then complete what i need in same format."
[5] "#Excel #MissExcel #MissExcelWorld"
[6] "Saw you mentioned it's fed by an excel file."
[7] "You can pull in Excel files, transform them using a view commands in Python, and output the result back to excel."
[8] "At work we have a massive (150mb) excel file with database connections, pasted raw data, charts etc."
[9] "Learn 6 of the new Excel functions in 40 seconds!"
[10] "I started using Python in other areas of my life such as when I would received an excel file from uni I would try get creative and manipulate the sheet."
However, if we want to find sentences and potential peak/pit moments which are specifically about Excel, I think the most important dependencies to analyse are nsubj, obj, and nsubj:pass (nominal subject in the passive voice).
The difference between nsubj and obj is mostly whether the thing is doing something or having something done to it, just like subject vs object: in “The dog chases the cat”, the dog is the nsubj and the cat is the obj.
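If you want to sanity-check that intuition, here’s a minimal sketch reusing the model we loaded earlier:

tibble(text = "The dog chases the cat.", doc_id = "1") %>%
  limpiar_pos_annotate(text, doc_id, model, dependency_parse = TRUE) %>%
  select(token, pos_tag, dependency_tag)
# we'd expect "dog" to come out as nsubj and "cat" as obj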
This paper is useful for looking at some pre-defined combinations of dependencies to mine.
Let’s take a look at some examples where Excel is either an nsubj or a passive nsubj (the str_detect below catches both). Clearly everything is quite specific and pointed towards Excel. It’s worth keeping in mind that SetFit has already classified this data, and some cleaning will have been done to the initial data set, but it’s good nonetheless.
excel %>%
  filter(str_detect(dependency_tag, "nsubj")) %>%
  slice(1:10) %>%
  pull(sentence)
 [1] "every day i realise excel has a function that could have saved me literal hours of work"
[2] "Failed to save uploaded excel file due to an error:"
[3] "Microsoft Excel be trolling and gaslighting tf out of me."
[4] "And I'm wondering if Excel has a tool for finding these characters or a way to internally convert the text to a format that doesn't mess with the data merge."
[5] "The world would be better off if @Microsoft #Excel had one button that froze frames and added a filter at the same time."
[6] "Excel saves you ALOT of time manually sifting through TXids"
[7] "That time of the year where I need to create my excel file…"
[8] "Excel is amazing."
[9] "This is like saying Microsoft Excel is the worst because it’s not Google Sheets."
[10] "MS Excel is the greatest application ever made bar none HAHAHAHA."
When we look at Excel as the obj of the sentence, its head token should be a verb, so we’ll find people doing specific things with (or to) Excel.
excel %>%
  filter(dependency_tag == "obj") %>%
  slice(1:10) %>%
  pull(sentence)
 [1] "@gabbyjaaay__ chegg was my bestfriend when i took stats.😭Microsoft excel also helps a lot too."
[2] "Nope there is absolutely nothing you can do to prevent excel from overwriting a csv file with its opinionated data formatting."
[3] "I started learning Microsoft Excel and I'm giddy at how interesting I'm finding it."
[4] "Hi all, I can open excel fine and work on files but when I try to open files from within Excel it constantly crashes (from file>open)"
[5] "I would love to simplify that and I cannot imagine this to be hard (although you never know, knowing that MS also created Excel and a lot of things in Excel that should be easy are really hard...)."
[6] "While I have merged Excel CVS files in the past, I'm really a graphic designer, and only use Excel as a tool to create the occasional CVS."
[7] "But also a client using excel to track their workflow is going to look very different than a ready-to-analyze dataset."
[8] "The fact that macos Excel on a mb pro i9, 32gbs of ram, couldnt handle a complex spreadsheet (5 dynamic tables, 12000 rows)."
[9] "a vmware running w10 Excel can handle it... virtualized in the VERY SAME MACHINE."
[10] "You can use excel to make techniques from the first 2 threads EASIER •"
We’re currently retrieving individual sentences rather than whole documents; we have a choice whether to retrieve every sentence of a matching document, or just the sentence in which Excel carries one of the dependency tags we’re after.
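A sketch of both options, using the objects we already have:

# rows where Excel carries one of the tags we care about
target <- excel %>%
  filter(dependency_tag %in% c("nsubj", "obj", "nsubj:pass"))

# Option 1: just the matching sentences
target %>%
  distinct(doc_id, sentence_id, sentence)

# Option 2: every sentence from any document with a match
data_annotate %>%
  semi_join(distinct(target, doc_id), by = "doc_id") %>%
  distinct(doc_id, sentence_id, sentence)

Some next steps: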
[ ] shrinking the data set to a manageable size for any project which uses one of the larger LLMs
[ ] explore how this dovetails with semantic search stuff and whether there’s stuff to build together
[ ] explanatory / introductory slides
[ ] which dep. relations are most useful for what research questions
[ ] how to manage multiple tokens of interest
[ ] consolidate compounds
We’re just scratching the surface here, but I think there’s a lot of potential.
What’s lacking? A data set to figure out what kind of recall/precision we get here. I’d guess precision would be quite high and recall quite low using any single one of these methods, but in combination we might be able to make recall quite high too.
How would it dovetail with BERTopic/Landscape workflows? We could see ourselves filtering for the appropriate dependency tags for products before running embeddings/topic modelling, which should give a cleaner output overall.