Dependency Parsing & Phrase Mining with udpipe

Motivation

When mining opinions, we often want to separate what people say directly about something from passing or indirect mentions of it.

Can I just use GPT for this? You can, but it’ll cost you a lot of money and a lot of time. Filtering on syntax first is a good way of reducing the size of the data set and finding the needle in the haystack before anything expensive is run.

Compare

  1. Company are really impressing me, I’ll be purchasing the pro subscription later this month.
  2. Enterprise’s subscription is miles better than Company’s, I’ve never used anything better.

Sentence 1 is entirely about Company, whereas in sentence 2 Company is used in reference to something else. Wouldn’t it be great if we could use some syntactic clues to help us differentiate sentences of these two types, at scale?

We’ll look at how to do just that in this vignette, using dependency parsing courtesy of {udpipe}.

Code
library(tidyverse) # Load the libraries we'll need
library(LimpiaR)
Code
model <- limpiar_pos_import_model("english")

dummy <- tibble(
  text = c("Company are really impressing me, I'll be purchasing the pro subscription later this month.", 
           "Enterprise's subscription is miles better than Company's, I've never used anything better.", "Company suck."),
  doc_id = as.character(c(1, 2, 3))
)

dependencies <- dummy %>%
  limpiar_pos_annotate(
    text_var = text, 
    id_var = doc_id, 
    pos_model = model,
    dependency_parse = TRUE
  )
2024-02-28 11:30:40.633358 Annotating text fragment 1/3
Code
dependencies %>%
  select(-c(doc_id, sentence, xpos, token_id, feats, paragraph_id, sentence_id))
# A tibble: 37 × 5
   token      lemma    pos_tag head_token_id dependency_tag
   <chr>      <chr>    <chr>   <chr>         <chr>         
 1 Company    company  NOUN    4             nsubj         
 2 are        be       AUX     4             aux           
 3 really     really   ADV     4             advmod        
 4 impressing impress  VERB    0             root          
 5 me         I        PRON    4             obj           
 6 ,          ,        PUNCT   4             punct         
 7 I          I        PRON    10            nsubj         
 8 'll        will     AUX     10            aux           
 9 be         be       AUX     10            aux           
10 purchasing purchase VERB    4             parataxis     
# ℹ 27 more rows

Our data has been transformed: we now have 12 columns and 37 rows, where before we had 2 columns and 3 rows. Each row now contains a single token (a word or punctuation mark) from our input texts.

The important column that we’ll look at is ‘dependency_tag’. In the first text we find that “Company” is labelled as ‘nsubj’, which stands for nominal subject. Grammatically, the nominal subject is the part of the clause that is doing the thing being done (for want of better language). In this case the thing being done is impressing, and it’s Company that’s doing the impressing. Roughly speaking, when Company’s dependency tag is nsubj, the clause it sits in will relate closely to Company.

Code
dependencies %>%
  filter(token == "Company") %>%
  relocate(token, dependency_tag, .after = "token_id") %>%
  select(-c(doc_id, sentence, xpos, token_id, feats, paragraph_id, sentence_id))
# A tibble: 3 × 5
  token   dependency_tag lemma   pos_tag head_token_id
  <chr>   <chr>          <chr>   <chr>   <chr>        
1 Company nsubj          company NOUN    4            
2 Company obl            company PROPN   6            
3 Company det            company DET     2            

Sentence 2 is quite a lot more complicated than sentence 1. The nominal subject is Enterprise’s subscription. Enterprise itself is labelled as nmod:poss, which means that it’s possessive over another noun, and that noun is the subscription. The subscription is the nominal subject in this case, so the clause is primarily about Enterprise’s subscription.

Company is tagged as ‘obl’ or oblique - this means that it is mentioned in relation to something else, and that thing is Enterprise’s subscription.

Code
dependencies %>%
  filter(doc_id == 2) %>%
  select(-c(doc_id, paragraph_id, sentence_id, sentence, feats, xpos)) %>%
  relocate(dependency_tag, .after = token)
# A tibble: 17 × 6
   token_id token        dependency_tag lemma        pos_tag head_token_id
   <chr>    <chr>        <chr>          <chr>        <chr>   <chr>        
 1 1        Enterprise   nmod:poss      Enterprise   PROPN   3            
 2 2        's           case           's           PART    1            
 3 3        subscription nsubj          subscription NOUN    6            
 4 4        is           cop            be           AUX     6            
 5 5        miles        obl:npmod      miles        NOUN    6            
 6 6        better       root           better       ADJ     0            
 7 7        than         case           than         ADP     8            
 8 8        Company      obl            company      PROPN   6            
 9 9        's           case           's           PART    8            
10 10       ,            punct          ,            PUNCT   6            
11 11       I            nsubj          I            PRON    14           
12 12       've          aux            have         AUX     14           
13 13       never        advmod         never        ADV     14           
14 14       used         parataxis      use          VERB    6            
15 15       anything     obj            anything     PRON    14           
16 16       better       amod           better       ADJ     15           
17 17       .            punct          .            PUNCT   6            

How can we use this information?

If we were tasked with identifying what people really think about Company, we could start by finding the sentences, paragraphs, or documents where Company’s dependency_tag is one of “nsubj”, “obj”, “nsubj:pass”. This would likely reduce our data set significantly and return a set of high-quality documents.

However, we would certainly be dispensing with a lot of useful data, and there is a trade-off in how strict we are. Filtering on just “nsubj” can never return more documents than filtering on [“nsubj” or “obj”], which in turn can never return more than [“nsubj” or “obj” or “nsubj:pass”], because each tag set contains the previous one.

nsubj:pass example: “Man United were defeated by Liverpool.” Man United is the nsubj:pass, because the sentence is in the passive voice and Man United is on the receiving end of the action.
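
The filtering idea above can be sketched roughly as follows. This is a minimal illustration on a hand-built tibble, not real parser output: `toy_deps` and the helper `docs_with_tag()` are hypothetical names introduced here, and in practice the input would be the annotated tibble created earlier.

```r
library(dplyr)

# Hand-built stand-in for the annotated output (an assumption for illustration;
# in practice this would be the `dependencies` tibble created above).
toy_deps <- tibble(
  doc_id         = c("1", "2", "3", "4"),
  token          = c("Company", "Company", "Company", "Company"),
  dependency_tag = c("nsubj", "obl", "obj", "nsubj:pass")
)

# Hypothetical helper: doc_ids in which `target` carries one of `tags`.
docs_with_tag <- function(deps, target, tags) {
  deps %>%
    filter(tolower(token) == tolower(target), dependency_tag %in% tags) %>%
    distinct(doc_id) %>%
    pull(doc_id)
}

strict  <- docs_with_tag(toy_deps, "company", "nsubj")
relaxed <- docs_with_tag(toy_deps, "company", c("nsubj", "obj", "nsubj:pass"))

# Widening the tag set can only keep or grow the result, never shrink it.
all(strict %in% relaxed)
length(strict) <= length(relaxed)
```

Starting strict and relaxing the tag set only if too few documents come back is probably the easiest way to explore the trade-off.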

Real world data

Let’s up the ante and read in some data from a Peaks & Pits project which has been classified by our Peaks & Pits SetFit model.

Code
data <- read_csv("~/Downloads/peaks_pits_sample_v2.csv") %>%
  mutate(doc_id = as.character(row_number()))

model <- limpiar_pos_import_model("english")

Then we perform the annotation, and we extract all of the sentences that have Excel in them.

Code
data_annotate <- data %>%
  limpiar_pos_annotate(message_og, doc_id, model, dependency_parse = TRUE, in_parallel = TRUE, update_progress = 200)

data_annotate <- data_annotate %>%
  mutate(token = tolower(token)) 

excel <- filter(data_annotate, token == "excel")

Dependencies

The simple counts of the dependencies suggest that Excel is often referred to as a compound, so we would expect to find something else next to it, like Microsoft or MS.

Code
excel %>%
  count(dependency_tag, sort = TRUE)
# A tibble: 22 × 2
   dependency_tag     n
   <chr>          <int>
 1 compound        2533
 2 nmod             764
 3 obl              701
 4 nsubj            669
 5 obj              543
 6 conj             460
 7 flat             228
 8 root             146
 9 nmod:poss         52
10 appos             49
# ℹ 12 more rows

Dependency tags count (for use while developing)

   dependency_tag     n
 1 compound        2533
 2 nmod             764
 3 obl              701
 4 nsubj            669
 5 obj              543
 6 conj             460
 7 flat             228
 8 root             146
 9 nmod:poss         52
10 appos             49
11 amod              27
12 nsubj:pass        26
13 parataxis         15
14 advcl             12
15 xcomp             10
16 ccomp              9
17 obl:npmod          7
18 acl:relcl          5
19 acl                3
20 iobj               2
21 list               1
22 vocative           1

Compound

Tip

Recall that nouns are things, and noun-phrases are multiple words which refer to a single thing. It takes a bit of training to spot noun-phrases, but take the following sentence:

“The man with sunglasses sipped his orange juice”

There are two noun-phrases - “The man with sunglasses” forms one noun-phrase, and “orange juice” forms the other.

Compound tags tend to mark Excel as part of a noun-phrase, like ‘Excel spreadsheet’, ‘Excel files’, ‘Excel functions’. That’s not to say these are not useful, but we’d want to do some more processing to combine the compounds and find out what the dependency relation of the entire compound is.

It’s not as trivial as it may seem to combine the compounds - we don’t have a nice function like spacy_extract_nounphrases() from {spacyr}, and the function I made is presumably somewhere in the middle of Bangkok because I cannot find the bugger.

Code
excel %>%
  filter(dependency_tag == "compound") %>%
  slice(1:10) %>%
  pull(sentence)
 [1] "They would then be recorded on an excel spreadsheet against the product details."                                                                        
 [2] "Could not save jason file in excel utility of ITR of updated return .system showing run time error."                                                     
 [3] "Whether you're a beginner or a seasoned Excel user, Lookup functions offer a simple and efficient way to handle complex data tasks."                     
 [4] "bruh chatgpt can open excel and word files, read them and then complete what i need in same format."                                                     
 [5] "#Excel #MissExcel #MissExcelWorld"                                                                                                                       
 [6] "Saw you mentioned it's fed by an excel file."                                                                                                            
 [7] "You can pull in Excel files, transform them using a view commands in Python, and output the result back to excel."                                       
 [8] "At work we have a massive (150mb) excel file with database connections, pasted raw data, charts etc."                                                    
 [9] "Learn 6 of the new Excel functions in 40 seconds!"                                                                                                       
[10] "I started using Python in other areas of my life such as when I would received an excel file from uni I would try get creative and manipulate the sheet."
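
One possible way to start on the compound problem is to join each compound token to its head via head_token_id, and read the dependency relation off the head. The sketch below uses a hand-written toy parse (`toy_parse`, not real model output) and assumes token_id restarts within each sentence, as in CoNLL-U, so the join is done per document and sentence. It only handles two-word compounds and naively assumes the modifier precedes its head.

```r
library(dplyr)

# Hand-written toy parse of "I opened an Excel file" (illustrative, not real
# model output). Columns mirror those shown in the annotation output above.
toy_parse <- tibble(
  doc_id         = "1",
  sentence_id    = 1L,
  token_id       = c("1", "2", "3", "4", "5"),
  token          = c("i", "opened", "an", "excel", "file"),
  head_token_id  = c("2", "0", "5", "5", "2"),
  dependency_tag = c("nsubj", "root", "det", "compound", "obj")
)

# Lookup table of candidate heads, keyed so it can be joined on head_token_id.
heads <- toy_parse %>%
  select(doc_id, sentence_id,
         head_token_id       = token_id,
         head_token          = token,
         head_dependency_tag = dependency_tag)

# For each compound "excel", attach its head and the head's relation.
compounds <- toy_parse %>%
  filter(token == "excel", dependency_tag == "compound") %>%
  left_join(heads, by = c("doc_id", "sentence_id", "head_token_id")) %>%
  mutate(phrase = paste(token, head_token)) %>%
  select(doc_id, phrase, head_dependency_tag)

compounds
```

In this toy case the phrase "excel file" inherits the obj relation from "file", which is the relation we would then filter on.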

However, I think the most important dependencies we should be looking to analyse if we want to find sentences and potential peak/pit moments which are specifically about Excel are nsubj, obj, nsubj:pass (nominal subject in the passive voice).

Nsubj vs obj is mostly about whether the thing is doing something or is having something done to it, just like subject-object, e.g. in “The dog chases the cat”, the dog is the nsubj and the cat is the obj.

This paper is useful for looking at some pre-defined combinations of dependencies to mine.

Nominal Subject - nsubj

Let’s take a look at some examples where Excel is either an nsubj or an nsubj:pass. Most of the results are quite specific and pointed towards Excel. It’s worth keeping in mind that SetFit has already classified this data, and some cleaning will have been done to the initial data set, but it’s good nonetheless.

Code
excel %>% 
  filter(str_detect(dependency_tag, "nsubj")) %>%
  slice(1:10) %>%
  pull(sentence)
 [1] "every day i realise excel has a function that could have saved me literal hours of work"                                                                      
 [2] "Failed to save uploaded excel file due to an error:"                                                                                                          
 [3] "Microsoft Excel be trolling and gaslighting tf out of me."                                                                                                    
 [4] "And I'm wondering if Excel has a tool for finding these characters or a way to internally convert the text to a format that doesn't mess with the data merge."
 [5] "The world would be better off if @Microsoft #Excel had one button that froze frames and added a filter at the same time."                                     
 [6] "Excel saves you ALOT of time manually sifting through TXids"                                                                                                  
 [7] "That time of the year where I need to create my excel file…"                                                                                                  
 [8] "Excel is amazing."                                                                                                                                            
 [9] "This is like saying Microsoft Excel is the worst because it’s not Google Sheets."                                                                             
[10] "MS Excel is the greatest application ever made bar none HAHAHAHA."                                                                                            

Object - obj

When we look at Excel as the obj of the sentence, we should find that it’s preceded by verbs, so we’ll find people doing specific things with Excel.

Code
excel %>%
  filter(dependency_tag == "obj") %>%
  slice(1:10) %>%
  pull(sentence)
 [1] "@gabbyjaaay__ chegg was my bestfriend when i took stats.😭Microsoft excel also helps a lot too."                                                                                                      
 [2] "Nope there is absolutely nothing you can do to prevent excel from overwriting a csv file with its opinionated data formatting."                                                                       
 [3] "I started learning Microsoft Excel and I'm giddy at how interesting I'm finding it."                                                                                                                  
 [4] "Hi all, I can open excel fine and work on files but when I try to open files from within Excel it constantly crashes (from file>open)"                                                                
 [5] "I would love to simplify that and I cannot imagine this to be hard (although you never know, knowing that MS also created Excel and a lot of things in Excel that should be easy are really hard...)."
 [6] "While I have merged Excel CVS files in the past, I'm really a graphic designer, and only use Excel as a tool to create the occasional CVS."                                                           
 [7] "But also a client using excel to track their workflow is going to look very different than a ready-to-analyze dataset."                                                                               
 [8] "The fact that macos Excel on a mb pro i9, 32gbs of ram, couldnt handle a complex spreadsheet (5 dynamic tables, 12000 rows)."                                                                         
 [9] "a vmware running w10 Excel can handle it... virtualized in the VERY SAME MACHINE."                                                                                                                    
[10] "You can use excel to make techniques from the first 2 threads EASIER •"                                                                                                                               

Note

We’re currently retrieving individual sentences rather than whole documents. We have a choice: retrieve every sentence of a matching document, or just the sentences in which Excel carries the dependency tag (or one of the tags) we’re after.
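
The two retrieval options could look something like this. It is a sketch on a hand-built tibble (`toy_annotated` stands in for the annotated data, and `target_tags` is a name introduced here), using semi_join to keep either the matching sentences or every sentence of a matching document.

```r
library(dplyr)

# Toy stand-in for the annotated data (hand-built; one row per token).
toy_annotated <- tibble(
  doc_id         = c("1", "1", "1", "2", "2"),
  sentence_id    = c(1L, 1L, 2L, 1L, 1L),
  sentence       = c("excel is amazing.", "excel is amazing.",
                     "i use it daily.", "saved it to excel.",
                     "saved it to excel."),
  token          = c("excel", "amazing", "use", "saved", "excel"),
  dependency_tag = c("nsubj", "root", "root", "root", "obl")
)

target_tags <- c("nsubj", "obj", "nsubj:pass")

# Sentences in which Excel carries one of the target tags.
hits <- toy_annotated %>%
  filter(token == "excel", dependency_tag %in% target_tags) %>%
  distinct(doc_id, sentence_id)

# Option 1: only the sentences in which Excel carries a target tag.
matching_sentences <- toy_annotated %>%
  semi_join(hits, by = c("doc_id", "sentence_id")) %>%
  distinct(doc_id, sentence_id, sentence)

# Option 2: every sentence of any document with at least one such sentence.
matching_documents <- toy_annotated %>%
  semi_join(hits, by = "doc_id") %>%
  distinct(doc_id, sentence_id, sentence)
```

Option 1 is tighter, while option 2 keeps the surrounding context, which may matter if the result is being handed to a classifier.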

Multiple Excel Tags

When a document comprises a number of sentences, and several of those sentences contain Excel, it’s more likely the document really is about Excel than if Excel is mentioned just once. This is particularly the case if Excel is the nominal subject in those sentences. These documents should be ideal candidate documents to send to an expensive classifier like GPT.

Code
excel %>%
  filter(str_detect(dependency_tag, "nsubj")) %>%
  count(doc_id, sort = TRUE)
# A tibble: 649 × 2
   doc_id     n
    <dbl> <int>
 1   2421     5
 2   2674     4
 3    513     3
 4   2700     3
 5   3835     3
 6   4522     3
 7    117     2
 8    142     2
 9    496     2
10    765     2
# ℹ 639 more rows

For example, doc_ids 2421 and 2674 each have multiple sentences containing Excel, several of which have Excel as the ‘nsubj’ (nominal subject). This suggests there will be multiple things said about Excel, and it’s likely any sentiment/classification we attach to these posts will be to do with Excel.

Code
data_annotate %>%
  filter(doc_id == 2421) %>%
  pull(sentence) %>%
  unique()
 [1] "Excel is love."                                                                                                                                                                                                        
 [2] "Excel is life."                                                                                                                                                                                                        
 [3] "No joke, Excel is incredibly powerful and can do just about anything - simple count of two numbers, a detailed CRM, or [play Doom](https://www.youtube.com/watch?v=J2qU7t6Jmfw)."                                      
 [4] "Whenever you come across a wide and deep powerful tool like this, the best way to get a handle is to look at your workflow, map it out, and try to match functions or similar functions to what Excel can do or offer."
 [5] "E.g. since you are an HR director (or working for one to help them out), map out their top workflows and duties step by step."                                                                                         
 [6] "See pitfalls, bottlenecks, where things are easy and things are hard."                                                                                                                                                 
 [7] "1. Need to budget your department?"                                                                                                                                                                                    
 [8] "Plenty of simple budgeting templates on the interwebs."                                                                                                                                                                
 [9] "Get started with a simple one, make your own, customize, adjust, assess and add more or use a more complex one as your needs arise."                                                                                   
[10] "2. Need a way to keep track of your tasks?"                                                                                                                                                                            
[11] "You can build a simple task sheet, and put in a calendar too."                                                                                                                                                         
[12] "3. Need to import in a bunch of payroll data and apply a bunch of routine functions like Cost of Living raises etc.?"                                                                                                  
[13] "You can build macros for Excel to clean up data, prettify it, and apply functions to it so you don't have to spend 10-15 minutes doing mundane work."                                                                  
[14] "The more you can abstract, specify and analyze your workflow, the more we can help you out."                                                                                                                           
[15] ">"                                                                                                                                                                                                                     
[16] "any suggestions on what other functions in excel will be useful to me in my work?"                                                                                                                                     
[17] "This sounds too generic for us to give good specific advice, especially since duties between HR directors can radically differ based on the company."                                                                  
[18] "Remember that it is entirely possible to abstract out a workflow without needing to divulge sensitive information."                                                                                                    
[19] "And this practice also helps you search online for your specific needs."                                                                                                                                               
Code
data_annotate %>% 
  filter(doc_id == 2674) %>% 
  pull(sentence) %>%
  unique()
 [1] "As an experienced Excel user, I used to believe that Excel and VBA were all I needed, until I discovered dimensional modelling."                     
 [2] "Since then, I hardly use Excel, except for simple calculations or report output."                                                                    
 [3] "The reason for this is that I don't trust data that is manually entered into a spreadsheet."                                                         
 [4] "Instead, I rely on data extracted from existing models."                                                                                             
 [5] "Excel is a great tool, and I used it to build dashboards before dashboards were a thing."                                                            
 [6] "However, with more experience, I realized that while Excel does the job, it doesn't promote good data practices."                                    
 [7] "It's overused due to its ease of access."                                                                                                            
 [8] "I still use Excel for prototyping models, but then I request IT to build the logic in the ETL and incorporate it into an existing or new data model."
 [9] "Recently, I found Jupyter Notebooks, which is perfect for replacing what I used to do with VBA."                                                     
[10] "While Excel may be better for one-off tasks, in my experience, those one-offs always end up being required again."                                   
[11] "In my opinion, if the definition of BI includes tools and processes, using Excel, while a great tool, promotes weaker processes."                    
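
To turn the multiple-mention heuristic into a candidate list, one option is a simple threshold on the per-document count of nsubj-type mentions. The sketch below runs on a hand-built tibble (`toy_excel`), and the cut-off `min_mentions` is a hypothetical value that would need tuning per project.

```r
library(dplyr)

# Toy stand-in for the `excel` tibble (hand-built; one row per Excel token).
toy_excel <- tibble(
  doc_id         = c("10", "10", "10", "11", "12", "12"),
  sentence_id    = c(1L, 2L, 3L, 1L, 1L, 2L),
  dependency_tag = c("nsubj", "nsubj", "obl", "nsubj", "compound", "nsubj:pass")
)

min_mentions <- 2  # hypothetical threshold, to be tuned per project

# Count nsubj / nsubj:pass mentions per document and keep the frequent ones.
candidates <- toy_excel %>%
  filter(grepl("nsubj", dependency_tag, fixed = TRUE)) %>%
  count(doc_id, name = "n_nsubj", sort = TRUE) %>%
  filter(n_nsubj >= min_mentions)

candidates
```

The surviving doc_ids could then be joined back to the original data to pull the full text for the expensive classifier.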

Nominal Modifier - nmod

The nmod relation is used when a noun phrase (often introduced by a preposition) modifies another noun and specifies something about it, like location, possession, or purpose, e.g. “[something] on MS Excel”, “[something else] in excel”.

Code
excel %>%
  filter(dependency_tag == "nmod") %>%
  slice(1:10) %>%
  pull(sentence)
 [1] "But I used to test accessibility on MS Excel."                                                                                                                           
 [2] "Xlookup function in excel has made a life so better"                                                                                                                     
 [3] "Give Lookup functions a try today and experience the power of Excel at your fingertips!"                                                                                 
 [4] "Me: I really like that the CONCAT function in Excel has superseded the deprecated function CONCATENATE and now uses common syntax like that found in PowerShell or Bash."
 [5] "Doing it natively in excel would take tons of obnoxious work."                                                                                                           
 [6] "I would LOVE a feature that allowed a user to export data into something like Microsoft Excel!!!"                                                                        
 [7] "DAX, which stands for Data Analysis eXpressions, is the programming language of Microsoft Power BI, Microsoft Analysis Services, and Microsoft Power Pivot for Excel."   
 [8] "Hi all, I can open excel fine and work on files but when I try to open files from within Excel it constantly crashes (from file>open)"                                   
 [9] "Just learned a new trick in Excel - using the 'Text to Columns' function to quickly split cell contents into separate columns."                                          
[10] "If you already know some programming from other languages, you can quickly learn to do a TON of the advanced functions in Excel - you just need to learn the syntax."    

It often seems that when Excel is tagged “compound”, the other part of the compound is something like Microsoft, and before that we get a preposition, which would make it an nmod -> compound type of situation. Verify that…

Oblique - obl

Oblique references to Excel are those where Excel is not the nominal subject (nsubj), the object (obj) or the indirect object (iobj). These are sentences where people are mentioning Excel but not directly, which can be very interesting to know about.

For example, in the 10th sentence below we can see “Copilot in Excel”. We could look at what people are saying about Copilot + x this way - I’d guess (but need to check) that Copilot is the nsubj.

Note

The obl relation seems quite similar to nmod (obl attaches to a verb, adjective or adverb, whereas nmod attaches to a noun), so the two might be able to be treated together.

Code
excel %>% 
  filter(dependency_tag == "obl") %>%
  slice(11:20) %>%
  pull(sentence)
 [1] "Probably no other tool is as malleable as Excel is."                                                                                                           
 [2] "We're using an MRP software from Microsoft that integrates extremely well with Excel, so I'm trying to teach myself more and more."                            
 [3] "Writing plain text in Excel without any formatting except what's built in has been surprisingly satisfying."                                                   
 [4] "When I am pushing the enter button on Microsoft Excel to enter numerical values plus 2 when I can just drag the lower right on the first cell."                
 [5] "Nope, until recently I didn't even have a computer capable of running anything more demanding than Microsoft excel."                                           
 [6] "I've tried downloading to Excel but any changes can't be uploaded, can only paste 500 rows, if someone else is using the sheet my Excel file is outdated, etc."
 [7] "And fail because they didn’t realize their target audience can already do what their product offers in Microsoft Excel."                                       
 [8] "The worst thing it has done is going into Microsoft Excel, which is just astounding."                                                                          
 [9] "I recently started using the XLSTAT add-in in Microsoft Excel, and it has been a nice experience."                                                             
[10] "Copilot in Excel makes creating data visuals & pivot tables super easy!"                                                                                       

Further research / development

[ ] shrinking the data set to a manageable size for any project which uses one of the larger LLMs

[ ] explore how this dovetails with semantic search stuff and whether there’s stuff to build together

[ ] explanatory / introductory slides

[ ] which dep. relations are most useful for what research questions

[ ] how to manage multiple tokens of interest

[ ] consolidate compounds

We’re just scratching the surface here, but I think there’s a lot of potential.

What’s lacking? A labelled data set to figure out what kind of recall/precision we get here. I’d guess precision would be quite high and recall quite low if using any single one of these methods, but in combination we might be able to make recall quite high too.
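
Once a hand-labelled sample exists, scoring a dependency filter is straightforward. The sketch below uses an invented evaluation set (`toy_eval`, with made-up labels rather than real results), where `about_excel` is a human judgement and `flagged` records whether the filter kept the document.

```r
library(dplyr)

# Hypothetical hand-labelled evaluation set (values invented for illustration).
toy_eval <- tibble(
  doc_id      = as.character(1:8),
  about_excel = c(TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE),
  flagged     = c(TRUE, TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE)
)

scores <- toy_eval %>%
  summarise(
    true_pos  = sum(flagged & about_excel),
    false_pos = sum(flagged & !about_excel),
    false_neg = sum(!flagged & about_excel),
    precision = true_pos / (true_pos + false_pos),  # share of flagged docs that are relevant
    recall    = true_pos / (true_pos + false_neg)   # share of relevant docs that were flagged
  )

scores
```

Running the same calculation for each tag set (nsubj only, nsubj plus obj, and so on) would show how much recall is bought by each relaxation and at what cost to precision.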

How will/would it dovetail with BERTopic/Landscape workflows? Could see us filtering for the appropriate dependency tags for products before running embeddings/topic modelling. Should be a cleaner output overall.