Lecture 5, 2018-10-22

Regular expressions

Regular expressions (abbreviated as “regex”, and sometimes loosely called “grep” after the Unix search tool) are a mini-language for pattern matching used by many programming languages (R, Python) and programs (OpenOffice/LibreOffice, the RStudio “find” function).

We’ll use regular expressions through the tidyverse/stringr functions str_subset (returns the matching strings) and str_detect (returns TRUE/FALSE for each string). Other stringr functions take a pattern in the same way:

  • str_which: returns the indexes of matching elements
  • str_count: returns how many times the pattern matches in each string
  • str_locate, str_locate_all: return the start and end positions of the first match, or of all matches
  • str_extract: returns the first matching part of each string
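
As a minimal sketch of the four functions listed above, here they are applied to a small made-up vector. The vector `fruit` and the patterns are assumptions chosen for illustration, not part of the babynames data.

```r
library(stringr)

# Hypothetical toy vector, chosen only for illustration
fruit <- c("apple", "banana", "cherry", "kiwi")

# Indexes of the elements that contain "an"
which_idx <- str_which(fruit, "an")      # 2

# Number of matches of "a" in each element
a_counts <- str_count(fruit, "a")        # 1 3 0 0

# Start and end position of the first match of "an"
# (a two-column matrix, with NA where there is no match)
first_pos <- str_locate(fruit, "an")

# The first matching part of each string (NA where there is no match)
first_match <- str_extract(fruit, "a.")  # "ap" "an" NA NA
```

Note that every function returns one result per input string, so the results line up with the original vector.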

str_subset (get matching strings)

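library(tidyverse)  # provides stringr, dplyr and the %>% pipe
library(babynames)  # provides the babynames tibble
all_names <- babynames %>% pull(name) %>% unique()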
all_names %>% str_subset("Rob.n")
##  [1] "Robin"     "Robina"    "Robena"    "Robinson"  "Robinette"
##  [6] "Robyn"     "Robenia"   "Robinetta" "Robyne"    "Robynn"   
## [11] "Robynne"   "Roben"     "Robinn"    "Robine"    "Robinann" 
## [16] "Robann"    "Robins"    "Robinique" "Robenson"  "Robinho"

str_detect (test each string for a match)

all_names %>% str_subset("Rob.n") %>% str_detect("nn")
##  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE
## [12] FALSE  TRUE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE
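
Because str_detect returns one TRUE/FALSE per string, it can be used wherever a logical vector is expected, for example inside dplyr's filter. The sketch below assumes a small made-up table called `toy` standing in for a real tibble such as babynames.

```r
library(dplyr)
library(stringr)

# Hypothetical toy table, standing in for a tibble such as babynames
toy <- tibble(name = c("Robin", "Robynn", "Robert", "Robinann"))

# Keep only the rows whose name contains "nn"
double_n <- toy %>% filter(str_detect(name, "nn"))
double_n$name                                  # "Robynn" "Robinann"

# TRUE counts as 1, so sum() gives the number of matching names
n_double <- sum(str_detect(toy$name, "nn"))    # 2
```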

Special characters . and ^ and $

  • . any character
all_names %>% str_subset("x.x")
## [1] "Alexix"
  • ^ beginning of a string
# make a lowercase version of this list (regex matching is case-sensitive,
# and in this data all the word-beginnings are uppercase)
all_names_lower <- all_names %>% str_to_lower()
all_names_lower %>% head(5)
## [1] "mary"      "anna"      "emma"      "elizabeth" "minnie"

all_names_lower %>% str_subset("^x") %>% head(5)
## [1] "xavier"   "xenia"    "xenophon" "xandra"   "xiomara"
  • $ end of a string
all_names %>% str_subset("xx$") 
##  [1] "Maxx"     "Alexx"    "Alixx"    "Foxx"     "Daxx"     "Jaxx"    
##  [7] "Lexx"     "Maddoxx"  "Knoxx"    "Madoxx"   "Lenoxx"   "Luxx"    
## [13] "Phoenixx" "Rexx"     "Braxx"    "Dexx"
  • all together
all_names_lower %>% str_subset("^d.d.$")
## [1] "dude" "dede" "dodd" "dody" "didi" "dodi" "dade" "dedi" "deda"

Repeats

How many x?

  • x? means 0 or 1
  • x* means 0 or more
  • x+ means 1 or more
  • x{3} means exactly 3
  • x{2,4} means 2 to 4
  • x{0,4} means 0 to 4 (write the lower bound explicitly; the shorthand x{,4} is not accepted by every regex engine)
  • x{2,} means 2 or more
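
A minimal sketch of the repeat operators above, using a made-up vector `xs` (an assumption for illustration) whose elements contain zero to five x's between an "a" and a "b". The anchors ^ and $ make the count apply to the whole string; without them, x{3} would also match inside a longer run of x's.

```r
library(stringr)

# Hypothetical test strings: "a", then zero to five "x", then "b"
xs <- c("ab", "axb", "axxb", "axxxb", "axxxxb", "axxxxxb")

zero_or_one <- str_subset(xs, "^ax?b$")      # "ab" "axb"
zero_plus   <- str_subset(xs, "^ax*b$")      # all six strings
one_plus    <- str_subset(xs, "^ax+b$")      # everything except "ab"
exactly_3   <- str_subset(xs, "^ax{3}b$")    # "axxxb"
two_to_four <- str_subset(xs, "^ax{2,4}b$")  # "axxb" "axxxb" "axxxxb"
up_to_four  <- str_subset(xs, "^ax{0,4}b$")  # everything except "axxxxxb"
two_plus    <- str_subset(xs, "^ax{2,}b$")   # "axxb" "axxxb" "axxxxb" "axxxxxb"
```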

Character classes

Match a set of characters with [ ]

all_names %>% str_subset("xe[aeiou]$")
## [1] "Alexei" "Alexea" "Lexee"  "Alexee" "Dixee"

Match a range of characters with -

all_names %>% str_subset("[A-C][a-c]") %>% head(24)
##  [1] "Carrie"    "Catherine" "Caroline"  "Callie"    "Barbara"  
##  [6] "Carolyn"   "Abbie"     "Cassie"    "Catharine" "Carolina" 
## [11] "Cathrine"  "Abigail"   "Camille"   "Carol"     "Carra"    
## [16] "Abby"      "Bama"      "Calla"     "Camilla"   "Carey"    
## [21] "Carlotta"  "Caddie"    "Carl"      "Calvin"

Character classes with counts

all_names %>% str_subset("[aeiouAEIOU]{4}")
##  [1] "Louie"       "Louiese"     "Louia"       "Louies"      "Gioia"      
##  [6] "Loueen"      "Louaine"     "Daaiel"      "Sequoia"     "Laquoia"    
## [11] "Shaquoia"    "Seqouia"     "Sequioa"     "Jauier"      "Tequoia"    
## [16] "Keiaira"     "Saquoia"     "Taquoia"     "Joshuaaaron" "Breeaunna"  
## [21] "Breeauna"    "Reaiah"      "Keiauna"     "Keeaira"     "Zoeie"      
## [26] "Keiairra"    "Zoiee"       "Kiaeem"      "Jaquoia"     "Kauai"      
## [31] "Kaiea"       "Ismaaeel"    "Leeaira"     "Douaa"       "Ieuan"      
## [36] "Zoeii"       "Beauen"      "Naieem"      "Zoiie"       "Alaiia"

Negating a character class with [^ ]

# names with 5 consecutive non-vowels (y is counted as a vowel here)
all_names %>% str_subset("[^aeiouyAEIOUY]{5}")
## [1] "Armstrong"       "Chrstine"        "Chrstopher"      "Chrstina"       
## [5] "Johnchristopher" "Markchristopher" "Johnchristian"

Match a - inside a character class

usernames <- c("user01", "3l33t", "men@work", "no-name")
# n.b. negated search; any symbol not a letter, number or -
usernames %>% str_subset("[^a-zA-Z0-9-]")
## [1] "men@work"

Or

all_names %>% str_subset("a{3}|e{3}|i{3}|o{3}|u{3}")
## [1] "Kathleeen"   "Joshuaaaron"

Do grouping with (a|b)

# any vowel, then three of the same vowel
all_names %>% str_subset("[aeiou](a{3}|e{3}|i{3}|o{3}|u{3})")
## [1] "Joshuaaaron"

Escape character \

The backslash \ means that the following special character is used literally, rather than in its special regex meaning:

\[, \(, \| 

etc.

Also, don’t forget that a literal backslash is itself written with a backslash:

\\

Tab and newline characters can be written with:

\t, \n
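
One practical detail when typing these in R: the pattern is written as an R string, and the backslash is also special inside R strings, so each backslash of the regex is typed twice. The sketch below uses a made-up vector `items` (an assumption for illustration) to show the escapes from this section.

```r
library(stringr)

# Hypothetical strings containing characters that are special in a regex.
# "C:\\temp" is the text C:\temp, and "tab\there" contains a real tab character.
items <- c("a.b", "axb", "f(x)", "a|b", "C:\\temp", "tab\there")

any_char    <- str_subset(items, "a.b")    # unescaped . : "a.b" "axb" "a|b"
literal_dot <- str_subset(items, "a\\.b")  # regex a\.b  : "a.b"
paren       <- str_subset(items, "\\(")    # regex \(    : "f(x)"
pipe_char   <- str_subset(items, "\\|")    # regex \|    : "a|b"
backslash   <- str_subset(items, "\\\\")   # regex \\    : the string C:\temp
tab_char    <- str_subset(items, "\\t")    # regex \t    : the string with a tab
```

So the regex \\ for a literal backslash ends up as four backslashes in R source code.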

Try this

  • Are there any four-letter names starting with “Eri” apart from “Erik” and “Eric”?
  • How many spelling variants of Mary/Maria can you find in a single search?
  • Can you think of a way (using what we’ve learned) to get the second character of a string? (for example, mutate the babynames tibble to make a column with “second letter”)