Create character vector
Read raw text into R
String printing
- Encoding strings with format()
C-style string formatting with sprintf()
String manipulations
String manipulations with {stringr}
Regular expressions
- Regex basics

So far, we have covered the common data structures in R, how to subset them, and how to write control-flow statements to work with data iteratively. In all of those lectures, we worked with data that were already clean. Real-world data, however, are often messy, and character data especially so. In this lecture, we will talk about how to work with character data (i.e., strings). In R, strings are stored as vectors of class character.

In R, character strings can be expressed with single or double quotes.

  'a character string with single quote'
  "a character string with double quote"

We can insert single quotes in a string with double quotes, and vice versa.

 "The 'R' project"
 'The "R" project'

But we cannot directly insert single quotes in a string delimited by single quotes, or double quotes in a string delimited by double quotes.
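Quotes of the same kind can still be included if each one is escaped with a backslash. A minimal sketch (the variable names are made up for illustration):

```r
# Escape a quote with a backslash to use it inside quotes of the same kind
single_in_single <- 'The \'R\' project'
double_in_double <- "The \"R\" project"

# print() shows the escapes; writeLines() shows the string as it is stored
print(double_in_double)
writeLines(double_in_double)

# Both spellings give the same string as mixing the two quote types
identical(single_in_single, "The 'R' project")
identical(double_in_double, 'The "R" project')
```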

In base R, we have functions to deal with character data, including nchar(), strsplit(), substr(), paste(), paste0(). We also have other packages such as {stringr} and {glue} to deal with strings.

Create character vector

# empty string
(empty_str <- "")

## [1] ""

class(empty_str)

## [1] "character"

# empty character vector
(empty_chr <- character(length = 0))

## character(0)

class(empty_chr)

## [1] "character"

# they are different
length(empty_str)

## [1] 1

length(empty_chr)

## [1] 0

empty_chr[1] <- "first"
empty_chr

## [1] "first"

empty_chr[4] <- "fourth"
empty_chr

## [1] "first"  NA       NA       "fourth"

# is.character, as.character; we already covered them in previous lectures

Now let’s revisit paste().

paste("The life of", pi, sep = " ")

## [1] "The life of 3.14159265358979"

paste("The life of", pi, sep = "-")

## [1] "The life of-3.14159265358979"

paste0("The life of-", pi)

## [1] "The life of-3.14159265358979"

paste("I", "love", "R", sep = "-")

## [1] "I-love-R"

paste("X", 1:5, sep = ".") # recycle

## [1] "X.1" "X.2" "X.3" "X.4" "X.5"

paste("X", 1:5, sep = ".", collapse = "-")

## [1] "X.1-X.2-X.3-X.4-X.5"

paste("NA will be coerced", NA)

## [1] "NA will be coerced NA"

Read raw text into R

We can use readLines() to read text into R as is.

top105 = readLines("http://www.textfiles.com/music/ktop100.txt")
head(top105, n = 15)

##  [1] "From: ed@wente.llnl.gov (Ed Suranyi)"                                          
##  [2] "Date: 12 Jan 92 21:23:55 GMT"                                                  
##  [3] "Newsgroups: rec.music.misc"                                                    
##  [4] "Subject: KITS' year end countdown"                                             
##  [5] ""                                                                              
##  [6] ""                                                                              
##  [7] "On Jan. 1, 1992, the \"Modern Rock\" station KITS San Francisco (\"Live-105\")"
##  [8] "broadcast its list of the \"Top 105.3 of 1991.\"  Here is the countdown"       
##  [9] "list:"                                                                         
## [10] ""                                                                              
## [11] "1. NIRVANA                      SMELLS LIKE TEEN SPIRIT"                       
## [12] "2. EMF                          UNBELIEVABLE"                                  
## [13] "3. R.E.M.                       LOSING MY RELIGION"                            
## [14] "4. SIOUXSIE & THE BANSHEES      KISS THEM FOR ME"                              
## [15] "5. B.A.D. II                    RUSH"

tail(top105, n = 10)

##  [1] "101. SMASHING PUMPKINS          SIVA"                       
##  [2] "102. ELVIS COSTELLO             OTHER SIDE OF ..."          
##  [3] "103. SEERS                      PSYCHE OUT"                 
##  [4] "104. THRILL KILL CULT           SEX ON WHEELZ"              
##  [5] "105. MATTHEW SWEET              I'VE BEEN WAITING"          
##  [6] "105.3  LATOUR                   PEOPLE ARE STILL HAVING SEX"
##  [7] ""                                                           
##  [8] "Ed"                                                         
##  [9] "ed@wente.llnl.gov"                                          
## [10] ""
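Because a remote file can move or become unavailable, it may help to practice on a local file first. A minimal sketch, assuming a small made-up file written to a temporary path (the file contents and variable names are hypothetical):

```r
# Write a few lines to a temporary file, then read them back unchanged
tmp <- tempfile(fileext = ".txt")
writeLines(c("1. NIRVANA    SMELLS LIKE TEEN SPIRIT",
             "",
             "2. EMF        UNBELIEVABLE"), tmp)

song_lines <- readLines(tmp)
length(song_lines)                  # one element per line, empty lines included

# nzchar() is TRUE for non-empty strings, so this drops the blank line
non_empty <- song_lines[nzchar(song_lines)]
non_empty

unlink(tmp)                         # remove the temporary file
```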

String printing

Function	Description
`print()`	generic printing
`noquote()`	print with no quotes
`cat()`	concatenation
`format()`	special formats
`toString()`	convert to string
`sprintf()`	printing

The choice of function will depend on what we want to print, how we want to print it, and where we want to print it.

print(top105[1])

## [1] "From: ed@wente.llnl.gov (Ed Suranyi)"

print(top105[1], quote = FALSE)

## [1] From: ed@wente.llnl.gov (Ed Suranyi)

noquote(top105[1])

## [1] From: ed@wente.llnl.gov (Ed Suranyi)

cat(top105[1]) # similar to print

## From: ed@wente.llnl.gov (Ed Suranyi)

cat(top105[1:2])

## From: ed@wente.llnl.gov (Ed Suranyi) Date: 12 Jan 92 21:23:55 GMT

cat(top105[1:2], sep = " + ")

## From: ed@wente.llnl.gov (Ed Suranyi) + Date: 12 Jan 92 21:23:55 GMT

cat(month.name[1:4], sep = " --> ")

## January --> February --> March --> April

cat(top105[1:2], sep = " + ", fill = 30) # break long strings

## From: ed@wente.llnl.gov (Ed Suranyi) + 
## Date: 12 Jan 92 21:23:55 GMT

# save as a text file
# cat(top105[1:2], sep = " + ", file = "output.txt")
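toString() from the table collapses a vector into a single comma-separated string, which is handy inside messages. A short sketch using built-in data only:

```r
# toString() joins the elements with ", " and returns one string
nums <- toString(1:5)
nums
## [1] "1, 2, 3, 4, 5"

first_months <- toString(month.name[1:3])
first_months
## [1] "January, February, March"

length(first_months)   # a single string, not a vector of three
## [1] 1
```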

Encoding strings with `format()`

format(11.7)

## [1] "11.7"

format(11.7, nsmall = 3)

## [1] "11.700"

format(c(5, 10.2), digits = 2)

## [1] " 5" "10"

format(c(5, 10.2), digits = 2, nsmall = 2)

## [1] " 5.00" "10.20"

format(pi)

## [1] "3.141593"

format(1234567890, big.mark = ",")

## [1] "1,234,567,890"

format(c("A", "BB", "CCC"), width = 5, justify = "none")

## [1] "A"   "BB"  "CCC"

format(c("A", "BB", "CCC"), width = 5, justify = "left")

## [1] "A    " "BB   " "CCC  "

format(c("A", "BB", "CCC"), width = 5, justify = "centre")

## [1] "  A  " " BB  " " CCC "

format(c("A", "BB", "CCC"), width = 5, justify = "right")

## [1] "    A" "   BB" "  CCC"

C-style string formatting with `sprintf()`

sprintf(fmt, ...)

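The argument fmt is a format string. Each conversion specification in it starts with % and ends with a conversion letter such as f, d, or s, with optional flags, width, and precision in between. See ?sprintf for more examples.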

# fixed point
sprintf("%f", pi)

## [1] "3.141593"

# decimal notation with 3 decimal digits
sprintf("%.3f", pi)

## [1] "3.142"
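A few more conversion specifications, as a sketch of how width, precision, and zero padding combine (the values and variable names are chosen only for illustration):

```r
# %d: integer, %s: string, %%: a literal percent sign
msg <- sprintf("%s scored %.1f%% on %d items", "Ann", 92.456, 3L)
msg
## [1] "Ann scored 92.5% on 3 items"

# width 8, 2 decimals: padded with spaces on the left
padded <- sprintf("%8.2f", pi)
padded
## [1] "    3.14"

# zero padding, vectorised over the input
ids <- sprintf("ID_%02d", 1:3)
ids
## [1] "ID_01" "ID_02" "ID_03"

# scientific notation
sci <- sprintf("%e", 12345.678)
sci
## [1] "1.234568e+04"
```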

String manipulations

Basic string manipulation functions

Function	Description
`nchar()`	number of characters
`tolower()`	convert to lower case
`toupper()`	convert to upper case
`casefold()`	case folding
`chartr()`	character translation
`abbreviate()`	abbreviation
`substring()`	substrings of a character vector
`substr()`	substrings of a character vector

nchar(c("How", "many", "characters?"))

## [1]  3  4 11

length(c("How", "many", "characters?")) # vs

## [1] 3

nchar("How many characters?") # space counts

## [1] 20

tolower(c("alL CAses", "BBBBD"))

## [1] "all cases" "bbbbd"

toupper(c("alL CAses", "BBBBD"))

## [1] "ALL CASES" "BBBBD"

casefold(c("alL CAses", "BBBBD"))

## [1] "all cases" "bbbbd"

casefold(c("alL CAses", "BBBBD"), upper = TRUE)

## [1] "ALL CASES" "BBBBD"

unname(abbreviate(top105[1:15]))

##  [1] "Fe(S"           "D1J92G"         "N:r."           "SKyec"         
##  [5] ""               ""               "OJ11t\"RsKSF("  "bilot\"1o1Hitc"
##  [9] "lst:"           ""               "1NSLTS"         "2.EU"          
## [13] "3RLMR"          "4S&TBKTFM"      "5BIR"

y = c("may", "the", "force", "be", "with", "you")
substr(y, 2, 3)

## [1] "ay" "he" "or" "e"  "it" "ou"

substr(y, 2, 3) <- ":)"
y

## [1] "m:)"   "t:)"   "f:)ce" "b:"    "w:)h"  "y:)"
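chartr() and substring() appear in the table but not in the examples, so here is a brief sketch of both (the input strings are made up):

```r
# chartr(old, new, x) replaces each character in `old` with the
# character at the same position in `new`
translated <- chartr("abc", "xyz", "aabbcc cab")
translated
## [1] "xxyyzz zxy"

# substring() is vectorised over start and end positions
pieces <- substring("abcdef", 1:3, 3:5)
pieces
## [1] "abc" "bcd" "cde"

# with only a start position, substring() runs to the end of the string
tail_part <- substring("abcdef", 4)
tail_part
## [1] "def"
```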

String manipulations with `{stringr}`

Although the base functions are enough to get the job done, their argument names and behaviors are inconsistent, and some have surprising edge cases. For example:

paste("one", "word", "here", NULL, character(0))

## [1] "one word here  "

In the above example, NULL and character(0) have zero length and probably should be removed. But they were converted to the empty string "".

In R, the {stringr} package is very useful for working with strings. According to the package description (see https://cran.r-project.org/web/packages/stringr/index.html), stringr is:

A consistent, simple and easy to use set of wrappers around the fantastic ‘stringi’ package. All function and argument names (and positions) are consistent, all functions deal with “NA”’s and zero length vectors in the same way, and the output from one function is easy to feed into the input of another.

# install.packages("stringr")
# load 'stringr'
library(stringr)

All functions in stringr start with “str_”, followed by a term describing the task they perform. The following table contains the stringr functions for basic string operations:

Function	Description	Similar to
`str_c()`	string concatenation	`paste()`
`str_length()`	number of characters	`nchar()`
`str_sub()`	extracts substrings	`substring()`
`str_dup()`	duplicates characters	none
`str_trim()`	removes leading and trailing whitespace	none
`str_pad()`	pads a string	none
`str_wrap()`	wraps a string paragraph	`strwrap()`

str_c("one", "word", "here", NULL, character(0), sep = " ")

## character(0)

some_factor = factor(c(1, 1, 1, 2, 2, 2), labels = c("good", "bad"))
some_factor

## [1] good good good bad  bad  bad 
## Levels: good bad

# nchar(some_factor)
# Error in nchar(some_factor) : 'nchar()' requires a character vector
str_length(some_factor)

## [1] 4 4 4 3 3 3

str_sub("adios", 1:3)

## [1] "adios" "dios"  "ios"

str_sub("adios", start = 1, end = -2)

## [1] "adio"

str_dup("hola", times = 3)

## [1] "holaholahola"

str_dup(c("hola", "adios"), times = 3)

## [1] "holaholahola"    "adiosadiosadios"

str_dup(c("hola", "adios"), times = c(2, 5))

## [1] "holahola"                  "adiosadiosadiosadiosadios"

str_pad("hola", width = 7, side = "left")

## [1] "   hola"

str_pad("hola", width = 7, side = "both")

## [1] " hola  "

str_pad("hola", width = 7, side = "right")

## [1] "hola   "

str_pad("hola", width = 7, side = "right", pad = "&")

## [1] "hola&&&"

stringr_desc = "A consistent, simple and easy to use set of wrappers around the fantastic 'stringi' package. All function and argument names (and positions) are consistent, all functions deal with NA's and zero length vectors in the same way, and the output from one function is easy to feed into the input of another."
stringr_desc

## [1] "A consistent, simple and easy to use set of wrappers around the fantastic 'stringi' package. All function and argument names (and positions) are consistent, all functions deal with NA's and zero length vectors in the same way, and the output from one function is easy to feed into the input of another."

cat(str_wrap(stringr_desc, width = 80, indent = 3))

##    A consistent, simple and easy to use set of wrappers around the fantastic
## 'stringi' package. All function and argument names (and positions) are
## consistent, all functions deal with NA's and zero length vectors in the same
## way, and the output from one function is easy to feed into the input of another.

cat(str_wrap(stringr_desc, width = 80, exdent = 3))

## A consistent, simple and easy to use set of wrappers around the fantastic
##    'stringi' package. All function and argument names (and positions) are
##    consistent, all functions deal with NA's and zero length vectors in the
##    same way, and the output from one function is easy to feed into the input of
##    another.

# text with whitespaces
btext = c("This example ", "has several   ", "whitespaces like this ")
str_trim(btext)

## [1] "This example"          "has several"           "whitespaces like this"

word(btext, 1)

## [1] "This"        "has"         "whitespaces"

word(btext, 2)

## [1] "example" "several" "like"

Regular expressions

The above functions already allow us to handle some basic tasks with strings. However, if we really want to unleash the power of string manipulation, we need to learn regular expressions (aka regex).

A regular expression is a pattern that describes a set of strings.

Because it is impossible to cover most of the topic in one lecture, I list some resources here so that you can read more later.

Regex wikipedia
Regular-expression.info
An RStudio addin slash regex utility belt
help(regex) in R

Regex basics

The main purpose of working with regular expressions is to describe patterns that are used to match against text strings. The simplest version of pattern matching is to search for some specific characters in a string.

metacharacters

There are some special characters that have a reserved status and they are known as metacharacters. The metacharacters in Extended Regular Expressions (EREs) are:

. \ | ( ) [ { ^ $ * + ?

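For example, the pattern “money$” does not match “money$”, because $ stands for the end of the string. Likewise, the pattern “what?” does not match the question mark in “what?”: it matches “wha” followed by an optional “t”. Except for a few cases, metacharacters have a special meaning and purpose when working with regular expressions. To match one literally in R, we escape it with a double backslash \\: the R string parser consumes the first backslash, so the regex engine receives a single backslash followed by the metacharacter.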

Metacharacter	Escape in R
.	`\\.`
$	`\\$`
*	`\\*`
+	`\\+`
?	`\\?`
|	`\\|`
\	`\\\\`
^	`\\^`
[	`\\[`
]	`\\]`
{	`\\{`
}	`\\}`
(	`\\(`
)	`\\)`

money = "$money"
# the naive but wrong way
sub(pattern = "$", replacement = "", x = money)

## [1] "$money"

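# the usual (in other languages) yet wrong way in R:
# "\$" is not a valid escape in an R string, so this line fails to parse
# sub(pattern = "\$", replacement = "", x = money)
# Error: '\$' is an unrecognized escape in character string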

# escape in R
sub(pattern = "\\$", replacement = "", x = money)

## [1] "money"

sub("\\|", "", "Peace|Love")

## [1] "PeaceLove"

sub("\\^", "", "Peace^Love")

## [1] "PeaceLove"

sub("\\[", "", "Peace[Love]")

## [1] "PeaceLove]"

sub("\\]", "", "Peace[Love]")

## [1] "Peace[Love"
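Escaping is not the only way to match a metacharacter literally. As a sketch of two alternatives available in base R: the fixed = TRUE argument treats the whole pattern as plain text, and most metacharacters lose their special meaning inside a bracket expression.

```r
money <- "$money"

# fixed = TRUE: the pattern is a literal string, not a regex
no_dollar_fixed <- sub("$", "", money, fixed = TRUE)
no_dollar_fixed
## [1] "money"

# inside [ ], the dollar sign is an ordinary character
no_dollar_class <- sub("[$]", "", money)
no_dollar_class
## [1] "money"

# the dot matches any character as a regex, but only "." when fixed
dot_regex <- grepl(".", "abc")
dot_fixed <- grepl(".", "abc", fixed = TRUE)
c(dot_regex, dot_fixed)
## [1]  TRUE FALSE
```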

Sequences

Sequences are shorthand escapes that match a whole class of characters (or a position, in the case of \\b and \\B). R provides the following shorthand for commonly used sequences:

Sequence in R	Description
`\\d`	match a digit character
`\\D`	match a non-digit character
`\\s`	match a space character
`\\S`	match a non-space character
`\\w`	match a word character
`\\W`	match a non-word character
`\\b`	match a word boundary
`\\B`	match a non-(word boundary)
`\\h`	match a horizontal space
`\\H`	match a non-horizontal space
`\\v`	match a vertical space
`\\V`	match a non-vertical space

sub("\\d", "_", "Covid 19")

## [1] "Covid _9"

gsub("\\d", "_", "Covid 19")

## [1] "Covid __"

sub("\\D", "_", "Covid 19")

## [1] "_ovid 19"

gsub("\\D", "_", "Covid 19")

## [1] "______19"

sub("\\s", "_", "Covid 19")

## [1] "Covid_19"

sub("\\S", "_", "Covid 19")

## [1] "_ovid 19"

gsub("\\S", "_", "Covid 19")

## [1] "_____ __"

sub("\\w", "_", "Covid 19")

## [1] "_ovid 19"

gsub("\\w", "_", "Covid 19")

## [1] "_____ __"

sub("\\W", "_", "Covid 19")

## [1] "Covid_19"

gsub("\\W", "_", "Covid 19")

## [1] "Covid_19"
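Besides replacing, the same sequences can be used to test for or extract matches. A minimal sketch with grepl() and regmatches() (the input vector is made up for illustration):

```r
x <- c("Covid 19", "R 4.1.1", "no digits")

# which elements contain at least one digit?
has_digit <- grepl("\\d", x)
has_digit
## [1]  TRUE  TRUE FALSE

# extract every run of digits; the result is a list with one entry per element
digit_runs <- regmatches(x, gregexpr("\\d+", x))
digit_runs[[1]]
## [1] "19"
digit_runs[[2]]
## [1] "4" "1" "1"
length(digit_runs[[3]])   # nothing matched in "no digits"
## [1] 0
```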

character class

A character class or character set is a list of characters enclosed by square brackets [ ]. Character sets are used to match only one of several characters. For instance, the regex character class [aA] matches any lower case letter a or any upper case letter A.

Character class	Description
`[aeiou]`	match any one lower case vowel
`[AEIOU]`	match any one upper case vowel
`[0123456789]`	match any digit
`[0-9]`	match any digit (same as previous class)
`[a-z]`	match any lower case ASCII letter
`[A-Z]`	match any upper case ASCII letter
`[a-zA-Z0-9]`	match any of the above classes
`[^aeiou]`	match anything other than a lowercase vowel
`[^0-9]`	match anything other than a digit

d = c("car", "bike", "plane", "boat", "Oct 07", "I-II-III", "R 4.1.1")
# look for 'e' or 'i'
grep(pattern = "[ei]", x = d, value = TRUE)

## [1] "bike"  "plane"

grep(pattern = "[01]", x = d, value = TRUE)

## [1] "Oct 07"  "R 4.1.1"

POSIX Character Classes

Closely related to the regex character classes we have what is known as POSIX character classes. In R, POSIX character classes are represented with expressions inside double brackets [[ ]].

Class	Description
`[[:lower:]]`	Lower-case letters
`[[:upper:]]`	Upper-case letters
`[[:alpha:]]`	Alphabetic characters ([[:lower:]] and [[:upper:]])
`[[:digit:]]`	Digits: 0,1,2,3,4,5,6,7,8,9
`[[:alnum:]]`	Alphanumeric characters ([[:alpha:]] and [[:digit:]])
`[[:blank:]]`	Blank characters: space and tab
`[[:cntrl:]]`	Control characters
`[[:punct:]]`	Punctuation characters: ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~
`[[:space:]]`	Space characters: tab, newline, vertical tab, form feed, carriage return, and space
`[[:xdigit:]]`	Hexadecimal digits: 0-9 A B C D E F a b c d e f
`[[:print:]]`	Printable characters ([[:alnum:]], [[:punct:]] and space)
`[[:graph:]]`	Graphical characters ([[:alnum:]] and [[:punct:]])

gsub(pattern = "[[:blank:]]", replacement = "", x = d)

## [1] "car"      "bike"     "plane"    "boat"     "Oct07"    "I-II-III" "R4.1.1"

gsub(pattern = "[[:lower:]]", replacement = "", x = d)

## [1] ""         ""         ""         ""         "O 07"     "I-II-III" "R 4.1.1"

gsub(pattern = "[[:alnum:]]", replacement = "", x = d)

## [1] ""    ""    ""    ""    " "   "--"  " .."
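POSIX classes can also be negated or quantified like any other bracket expression. A short sketch on a made-up sentence:

```r
x <- "Hello, World! It's 10:30."

# strip punctuation
no_punct <- gsub("[[:punct:]]", "", x)
no_punct
## [1] "Hello World Its 1030"

# collapse any run of whitespace (spaces, tabs, newlines) into one space
squeezed <- gsub("[[:space:]]+", " ", "a \t b\n c")
squeezed
## [1] "a b c"

# ^ inside the outer brackets negates the class: keep letters only
letters_only <- gsub("[^[:alpha:]]", "", x)
letters_only
## [1] "HelloWorldIts"
```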

Quantifiers

Quantifiers are used when we want to match a certain number of characters that meet certain criteria. Quantifiers specify how many instances of a character, group, or character class must be present in the input for a match to be found.

Quantifier	Description
`?`	The preceding item is optional and will be matched at most once
`*`	The preceding item will be matched zero or more times
`+`	The preceding item will be matched one or more times
`{n}`	The preceding item is matched exactly n times
`{n,}`	The preceding item is matched n or more times
`{n,m}`	The preceding item is matched at least n times, but not more than m times

sts = row.names(USArrests)
sts

##  [1] "Alabama"        "Alaska"         "Arizona"        "Arkansas"      
##  [5] "California"     "Colorado"       "Connecticut"    "Delaware"      
##  [9] "Florida"        "Georgia"        "Hawaii"         "Idaho"         
## [13] "Illinois"       "Indiana"        "Iowa"           "Kansas"        
## [17] "Kentucky"       "Louisiana"      "Maine"          "Maryland"      
## [21] "Massachusetts"  "Michigan"       "Minnesota"      "Mississippi"   
## [25] "Missouri"       "Montana"        "Nebraska"       "Nevada"        
## [29] "New Hampshire"  "New Jersey"     "New Mexico"     "New York"      
## [33] "North Carolina" "North Dakota"   "Ohio"           "Oklahoma"      
## [37] "Oregon"         "Pennsylvania"   "Rhode Island"   "South Carolina"
## [41] "South Dakota"   "Tennessee"      "Texas"          "Utah"          
## [45] "Vermont"        "Virginia"       "Washington"     "West Virginia" 
## [49] "Wisconsin"      "Wyoming"

# 'm?' matches 'm' at most once (0 or 1 times); zero occurrences also count, so every name is returned
grep(pattern = "m?", sts, value = TRUE)

##  [1] "Alabama"        "Alaska"         "Arizona"        "Arkansas"      
##  [5] "California"     "Colorado"       "Connecticut"    "Delaware"      
##  [9] "Florida"        "Georgia"        "Hawaii"         "Idaho"         
## [13] "Illinois"       "Indiana"        "Iowa"           "Kansas"        
## [17] "Kentucky"       "Louisiana"      "Maine"          "Maryland"      
## [21] "Massachusetts"  "Michigan"       "Minnesota"      "Mississippi"   
## [25] "Missouri"       "Montana"        "Nebraska"       "Nevada"        
## [29] "New Hampshire"  "New Jersey"     "New Mexico"     "New York"      
## [33] "North Carolina" "North Dakota"   "Ohio"           "Oklahoma"      
## [37] "Oregon"         "Pennsylvania"   "Rhode Island"   "South Carolina"
## [41] "South Dakota"   "Tennessee"      "Texas"          "Utah"          
## [45] "Vermont"        "Virginia"       "Washington"     "West Virginia" 
## [49] "Wisconsin"      "Wyoming"

grep(pattern = "m{1}", sts, value = TRUE) # 'm{1}' is the same as 'm': any name containing a lowercase 'm'

## [1] "Alabama"       "New Hampshire" "Oklahoma"      "Vermont"      
## [5] "Wyoming"

grep(pattern = "l+", sts, value = TRUE) #  once or more

##  [1] "Alabama"        "Alaska"         "California"     "Colorado"      
##  [5] "Delaware"       "Florida"        "Illinois"       "Maryland"      
##  [9] "North Carolina" "Oklahoma"       "Pennsylvania"   "Rhode Island"  
## [13] "South Carolina"

grep(pattern = "l{2,3}", sts, value = TRUE) # 2 or 3 times

## [1] "Illinois"
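Because grep() reports a match as soon as the pattern is found anywhere in the string, the effect of a quantifier is easier to see when the pattern has to cover the whole string. A sketch with a made-up vector, using ^ and $ to pin the pattern to the start and end of each string:

```r
x <- c("color", "colour", "colouur")

# ?  : zero or one 'u'
q_optional <- grepl("^colou?r$", x)
q_optional
## [1]  TRUE  TRUE FALSE

# *  : zero or more 'u'
q_star <- grepl("^colou*r$", x)
q_star
## [1] TRUE TRUE TRUE

# +  : one or more 'u'
q_plus <- grepl("^colou+r$", x)
q_plus
## [1] FALSE  TRUE  TRUE

# {2}: exactly two 'u'
q_exact <- grepl("^colou{2}r$", x)
q_exact
## [1] FALSE FALSE  TRUE
```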

Position of pattern within a string

^: matches the start of the string
$: matches the end of the string
\b: matches the empty string at either edge of a word. Don’t confuse it with ^ and $, which mark the edges of a string.
\B: matches the empty string provided it is not at an edge of a word.

strings <- c("abcd", "cdab", "cabd", "c abd")
grep("ab", strings, value = TRUE)

## [1] "abcd"  "cdab"  "cabd"  "c abd"

grep("^ab", strings, value = TRUE)

## [1] "abcd"

grep("ab$", strings, value = TRUE)

## [1] "cdab"

grep("\\bab", strings, value = TRUE)

## [1] "abcd"  "c abd"

grep("\\Bab", strings, value = TRUE)

## [1] "cdab" "cabd"
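Two common uses of these position markers are whole-word replacement and validating an entire string. A minimal sketch with made-up inputs; perl = TRUE (the PCRE engine) is used here only as a cautious choice for word boundaries inside gsub(), not because the source requires it:

```r
# replace 'cat' only when it is a whole word; 'concat' is left alone
whole_word <- gsub("\\bcat\\b", "dog", "cat concat cat.", perl = TRUE)
whole_word
## [1] "dog concat dog."

# ^ and $ together require the pattern to cover the entire string
all_digits <- grepl("^\\d+$", c("123", "12a", ""))
all_digits
## [1]  TRUE FALSE FALSE
```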

Operators

.: matches any single character
[...]: matches any one of the characters inside the brackets
[^...]: matches any other characters except those inside the brackets
|: or, matches either side of the vertical bar
(...): grouping in regex. This allows you to retrieve the bits that matched various parts of your regular expression so you can alter them or use them to build a new string. Each group can then be referred to with \\N, where N is the position of the group counting from the left. This is called a backreference.
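The operators listed above can be sketched one by one as follows (the inputs are made up for illustration):

```r
# . matches any single character, but there must be one
dot_match <- grepl("c.t", c("cat", "cot", "ct"))
dot_match
## [1]  TRUE  TRUE FALSE

# | inside a group: either alternative
alt_match <- grepl("gr(a|e)y", c("gray", "grey", "griy"))
alt_match
## [1]  TRUE  TRUE FALSE

# [^...] matches anything except the listed characters
digits_only <- gsub("[^0-9]", "", "Tel: 555-1234")
digits_only
## [1] "5551234"

# backreferences: \\1 and \\2 refer to the first and second group
swapped <- sub("(\\w+) (\\w+)", "\\2 \\1", "hello world")
swapped
## [1] "world hello"
```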

Main Base R functions for Regex

Function	Purpose	Characteristic
`grep()`	finding regex matches	which elements are matched (index or value)
`grepl()`	finding regex matches	which elements are matched (TRUE & FALSE)
`regexpr()`	finding regex matches	positions of the first match
`gregexpr()`	finding regex matches	positions of all matches
`regexec()`	finding regex matches	hybrid of `regexpr()` and `gregexpr()`
`sub()`	replacing regex matches	only first match is replaced
`gsub()`	replacing regex matches	all matches are replaced
`strsplit()`	splitting regex matches	split vector according to matches

All regex functions require two main arguments: a pattern (i.e. regular expression), and a text to match.
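To compare what the functions in the table return, here is a sketch that applies several of them to one made-up string:

```r
x <- "2023-10-03"

# grepl(): does the string start with four digits?
starts_with_year <- grepl("^\\d{4}", x)
starts_with_year
## [1] TRUE

# regexpr(): position of the first match; gregexpr(): positions of all matches
first_dash <- as.integer(regexpr("-", x))
all_dashes <- as.integer(gregexpr("-", x)[[1]])
first_dash
## [1] 5
all_dashes
## [1] 5 8

# regexec() + regmatches(): the full match followed by each captured group
parts <- regmatches(x, regexec("(\\d{4})-(\\d{2})-(\\d{2})", x))[[1]]
parts
## [1] "2023-10-03" "2023"       "10"         "03"

# strsplit(): split at every match and return a list
fields <- strsplit(x, "-")[[1]]
fields
## [1] "2023" "10"   "03"
```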

Regex functions in `{stringr}`

They all follow the same general form:

str_function(string, pattern)
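As a sketch of that form, a few commonly used {stringr} regex functions applied to a made-up vector (this assumes the stringr package is installed):

```r
library(stringr)

x <- c("apple pie", "banana split", "cherry 42")

# str_detect(): like grepl(), TRUE where the pattern is found
detected <- str_detect(x, "\\d")
detected
## [1] FALSE FALSE  TRUE

# str_extract(): the first match in each string
first_word <- str_extract(x, "[a-z]+")
first_word
## [1] "apple"  "banana" "cherry"

# str_replace_all(): like gsub(), replaces every match
underscored <- str_replace_all(x, "\\s", "_")
underscored
## [1] "apple_pie"    "banana_split" "cherry_42"

# str_count(): number of matches per string
vowel_count <- str_count(x, "[aeiou]")
vowel_count
## [1] 4 4 1
```

Note that the string comes first and the pattern second, the reverse of grepl() and gsub(), which makes these functions convenient to chain.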

Strings and Regular Expression

Daijiang Li

10/03/2023
