So far, we have covered common data structures in R, how to subset them, and how to write control-flow statements to work with data iteratively. In all of those lectures, we worked with data that were already clean and complete. Real-world data, however, are often messy, and character data especially so. In this lecture, we will talk about how to work with character data (i.e., strings). In R, strings are stored in vectors of class character.

In R, character strings can be expressed with single or double quotes.

  'a character string with single quote'
  "a character string with double quote"

We can insert single quotes in a string with double quotes, and vice versa.

 "The 'R' project"
 'The "R" project'

But we cannot insert single quotes in a string delimited by single quotes, or double quotes in a string delimited by double quotes, unless we escape them with a backslash.
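
A minimal sketch of escaping a quote with a backslash so that it can appear inside a string delimited by the same kind of quote. The variable names single and double are only for illustration.

```r
# escape a single quote inside a single-quoted string
single <- 'The \'R\' project'
# escape a double quote inside a double-quoted string
double <- "The \"R\" project"

# print() shows the escape; cat() shows the raw text
print(double)
cat(double, "\n")

# the backslash is not stored as part of the string
nchar(double)
## [1] 15
```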

In base R, we have functions to deal with character data, including nchar(), strsplit(), substr(), paste(), paste0(). We also have other packages such as {stringr} and {glue} to deal with strings.

Create character vector

# empty string
(empty_str <- "")
## [1] ""
class(empty_str)
## [1] "character"
# empty character vector
(empty_chr <- character(length = 0))
## character(0)
class(empty_chr)
## [1] "character"
# they are different
length(empty_str)
## [1] 1
length(empty_chr)
## [1] 0
empty_chr[1] <- "first"
empty_chr
## [1] "first"
empty_chr[4] <- "fourth"
empty_chr
## [1] "first"  NA       NA       "fourth"
# is.character, as.character; we already covered them in previous lectures

Now let’s revisit paste().

paste("The life of", pi, sep = " ")
## [1] "The life of 3.14159265358979"
paste("The life of", pi, sep = "-")
## [1] "The life of-3.14159265358979"
paste0("The life of-", pi)
## [1] "The life of-3.14159265358979"
paste("I", "love", "R", sep = "-")
## [1] "I-love-R"
paste("X", 1:5, sep = ".") # recycle
## [1] "X.1" "X.2" "X.3" "X.4" "X.5"
paste("X", 1:5, sep = ".", collapse = "-") 
## [1] "X.1-X.2-X.3-X.4-X.5"
paste("NA will be coerced", NA)
## [1] "NA will be coerced NA"

Read raw text into R

We can use readLines() to read text into R as is.

top105 = readLines("http://www.textfiles.com/music/ktop100.txt")
head(top105, n = 15)
##  [1] "From: ed@wente.llnl.gov (Ed Suranyi)"                                          
##  [2] "Date: 12 Jan 92 21:23:55 GMT"                                                  
##  [3] "Newsgroups: rec.music.misc"                                                    
##  [4] "Subject: KITS' year end countdown"                                             
##  [5] ""                                                                              
##  [6] ""                                                                              
##  [7] "On Jan. 1, 1992, the \"Modern Rock\" station KITS San Francisco (\"Live-105\")"
##  [8] "broadcast its list of the \"Top 105.3 of 1991.\"  Here is the countdown"       
##  [9] "list:"                                                                         
## [10] ""                                                                              
## [11] "1. NIRVANA                      SMELLS LIKE TEEN SPIRIT"                       
## [12] "2. EMF                          UNBELIEVABLE"                                  
## [13] "3. R.E.M.                       LOSING MY RELIGION"                            
## [14] "4. SIOUXSIE & THE BANSHEES      KISS THEM FOR ME"                              
## [15] "5. B.A.D. II                    RUSH"
tail(top105, n = 10)
##  [1] "101. SMASHING PUMPKINS          SIVA"                       
##  [2] "102. ELVIS COSTELLO             OTHER SIDE OF ..."          
##  [3] "103. SEERS                      PSYCHE OUT"                 
##  [4] "104. THRILL KILL CULT           SEX ON WHEELZ"              
##  [5] "105. MATTHEW SWEET              I'VE BEEN WAITING"          
##  [6] "105.3  LATOUR                   PEOPLE ARE STILL HAVING SEX"
##  [7] ""                                                           
##  [8] "Ed"                                                         
##  [9] "ed@wente.llnl.gov"                                          
## [10] ""
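
In case the URL above is not reachable, the same behavior can be tried on a local file. This is a small sketch that writes three made-up lines to a temporary file (a hypothetical stand-in for the real list) and reads them back. It shows that readLines() returns one element per line, blank lines included.

```r
# write a few lines to a temporary file (stand-in for the URL)
tmp <- tempfile(fileext = ".txt")
writeLines(c("1. NIRVANA   SMELLS LIKE TEEN SPIRIT",
             "",
             "2. EMF       UNBELIEVABLE"), con = tmp)

# one character element per line, blank lines included
lines <- readLines(tmp)
length(lines)
## [1] 3

# keep only the non-empty lines
songs <- lines[nzchar(lines)]
length(songs)
## [1] 2

unlink(tmp)  # clean up the temporary file
```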

String printing

Function    Description
print()     generic printing
noquote()   print with no quotes
cat()       concatenate and print
format()    format values as strings
toString()  convert to a single string
sprintf()   C-style string formatting

The choice of function will depend on what we want to print, how we want to print it, and where we want to print it.

print(top105[1])
## [1] "From: ed@wente.llnl.gov (Ed Suranyi)"
print(top105[1], quote = FALSE)
## [1] From: ed@wente.llnl.gov (Ed Suranyi)
noquote(top105[1])
## [1] From: ed@wente.llnl.gov (Ed Suranyi)
cat(top105[1]) # similar to print
## From: ed@wente.llnl.gov (Ed Suranyi)
cat(top105[1:2])
## From: ed@wente.llnl.gov (Ed Suranyi) Date: 12 Jan 92 21:23:55 GMT
cat(top105[1:2], sep = " + ")
## From: ed@wente.llnl.gov (Ed Suranyi) + Date: 12 Jan 92 21:23:55 GMT
cat(month.name[1:4], sep = " --> ")
## January --> February --> March --> April
cat(top105[1:2], sep = " + ", fill = 30) # break long strings
## From: ed@wente.llnl.gov (Ed Suranyi) + 
## Date: 12 Jan 92 21:23:55 GMT
# save as a text file
# cat(top105[1:2], sep = " + ", file = "output.txt")

Formatting values as strings with format()

format(11.7)
## [1] "11.7"
format(11.7, nsmall = 3)
## [1] "11.700"
format(c(5, 10.2), digits = 2)
## [1] " 5" "10"
format(c(5, 10.2), digits = 2, nsmall = 2)
## [1] " 5.00" "10.20"
format(pi)
## [1] "3.141593"
format(1234567890, big.mark = ",")
## [1] "1,234,567,890"
format(c("A", "BB", "CCC"), width = 5, justify = "none")
## [1] "A"   "BB"  "CCC"
format(c("A", "BB", "CCC"), width = 5, justify = "left")
## [1] "A    " "BB   " "CCC  "
format(c("A", "BB", "CCC"), width = 5, justify = "centre")
## [1] "  A  " " BB  " " CCC "
format(c("A", "BB", "CCC"), width = 5, justify = "right")
## [1] "    A" "   BB" "  CCC"

C-style string formatting with sprintf()

sprintf(fmt, ...)

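The argument fmt is a format string. Each conversion specification in it starts with % and ends with a conversion letter such as d (integer), f (fixed-point number), or s (string), optionally with flags, a width, and a precision in between. See ?sprintf for more examples.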

# fixed point
sprintf("%f", pi)
## [1] "3.141593"
# decimal notation with 3 decimal digits
sprintf("%.3f", pi)
## [1] "3.142"
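
A few more conversion specifications, as a sketch of the common cases (the values are made up for illustration):

```r
sprintf("%d songs", 105L)                  # integer
## [1] "105 songs"
sprintf("%s is number %d", "NIRVANA", 1L)  # string and integer
## [1] "NIRVANA is number 1"
sprintf("%8.2f", pi)                       # width 8, 2 decimals, right-aligned
## [1] "    3.14"
sprintf("%-8s|", "left")                   # width 8, left-aligned
## [1] "left    |"
sprintf("%05d", 42L)                       # zero-padded to width 5
## [1] "00042"
sprintf("%.1f%%", 91.256)                  # %% gives a literal percent sign
## [1] "91.3%"
sprintf("Track %02d", 1:3)                 # vectorized over its arguments
## [1] "Track 01" "Track 02" "Track 03"
```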

String manipulations

Basic string manipulation functions

Function      Description
nchar()       number of characters
tolower()     convert to lower case
toupper()     convert to upper case
casefold()    case folding
chartr()      character translation
abbreviate()  abbreviation
substring()   substrings of a character vector
substr()      substrings of a character vector

nchar(c("How", "many", "characters?"))
## [1]  3  4 11
length(c("How", "many", "characters?")) # vs
## [1] 3
nchar("How many characters?") # space counts
## [1] 20
tolower(c("alL CAses", "BBBBD"))
## [1] "all cases" "bbbbd"
toupper(c("alL CAses", "BBBBD"))
## [1] "ALL CASES" "BBBBD"
casefold(c("alL CAses", "BBBBD"))
## [1] "all cases" "bbbbd"
casefold(c("alL CAses", "BBBBD"), upper = TRUE)
## [1] "ALL CASES" "BBBBD"
unname(abbreviate(top105[1:15]))
##  [1] "Fe(S"           "D1J92G"         "N:r."           "SKyec"         
##  [5] ""               ""               "OJ11t\"RsKSF("  "bilot\"1o1Hitc"
##  [9] "lst:"           ""               "1NSLTS"         "2.EU"          
## [13] "3RLMR"          "4S&TBKTFM"      "5BIR"
y = c("may", "the", "force", "be", "with", "you")
substr(y, 2, 3)
## [1] "ay" "he" "or" "e"  "it" "ou"
substr(y, 2, 3) <- ":)"
y
## [1] "m:)"   "t:)"   "f:)ce" "b:"    "w:)h"  "y:)"
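
chartr() and substring() appear in the table but not in the examples above. A short sketch of both, with made-up inputs:

```r
# chartr(old, new, x) replaces each character in 'old' with the
# character at the same position in 'new'
chartr("abc", "xyz", "aabbcc")
## [1] "xxyyzz"
# substring() recycles its start and end positions
substring("abcdef", 1:3, 3:5)
## [1] "abc" "bcd" "cde"
# leaving out the end position extracts to the end of the string
substring("abcdef", 4)
## [1] "def"
```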

String manipulations with {stringr}

Even though the base functions are enough to get the job done, their argument names, argument order, and behavior are not consistent with each other, and some have surprising edge cases. For example:

paste("one", "word", "here", NULL, character(0))
## [1] "one word here  "

In the above example, NULL and character(0) have zero length and probably should be dropped. Instead, each was converted to an empty string "", which leaves two trailing spaces in the result.
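
The trailing spaces are easy to miss on screen, so it can help to count characters. The sketch below also shows one possible base R workaround, which is to drop zero-length inputs before pasting. The names parts and keep are only for illustration.

```r
out <- paste("one", "word", "here", NULL, character(0))
# 13 characters of text plus 2 trailing separators
nchar(out)
## [1] 15

# drop zero-length inputs, then paste what is left
parts <- list("one", "word", "here", NULL, character(0))
keep <- parts[lengths(parts) > 0]
do.call(paste, keep)
## [1] "one word here"
```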

In R, the {stringr} package is very useful for working with strings. According to the description of the package (see https://cran.r-project.org/web/packages/stringr/index.html), stringr is:

A consistent, simple and easy to use set of wrappers around the fantastic ‘stringi’ package. All function and argument names (and positions) are consistent, all functions deal with “NA”’s and zero length vectors in the same way, and the output from one function is easy to feed into the input of another.

# install.packages("stringr")
# load 'stringr'
library(stringr)

All functions in stringr start with “str_”, followed by a term associated with the task they perform. The following table contains the stringr functions for basic string operations:

Function      Description                              Similar to
str_c()       string concatenation                     paste()
str_length()  number of characters                     nchar()
str_sub()     extracts substrings                      substring()
str_dup()     duplicates characters                    none
str_trim()    removes leading and trailing whitespace  none
str_pad()     pads a string                            none
str_wrap()    wraps a string paragraph                 strwrap()

str_c("one", "word", "here", NULL, character(0), sep = " ")
## character(0)
some_factor = factor(c(1, 1, 1, 2, 2, 2), labels = c("good", "bad"))
some_factor
## [1] good good good bad  bad  bad 
## Levels: good bad
# nchar(some_factor)
# Error in nchar(some_factor) : 'nchar()' requires a character vector
str_length(some_factor)
## [1] 4 4 4 3 3 3
str_sub("adios", 1:3)
## [1] "adios" "dios"  "ios"
str_sub("adios", start = 1, end = -2)
## [1] "adio"
str_dup("hola", times = 3)
## [1] "holaholahola"
str_dup(c("hola", "adios"), times = 3)
## [1] "holaholahola"    "adiosadiosadios"
str_dup(c("hola", "adios"), times = c(2, 5))
## [1] "holahola"                  "adiosadiosadiosadiosadios"
str_pad("hola", width = 7, side = "left")
## [1] "   hola"
str_pad("hola", width = 7, side = "both")
## [1] " hola  "
str_pad("hola", width = 7, side = "right")
## [1] "hola   "
str_pad("hola", width = 7, side = "right", pad = "&")
## [1] "hola&&&"
stringr_desc = "A consistent, simple and easy to use set of wrappers around the fantastic 'stringi' package. All function and argument names (and positions) are consistent, all functions deal with NA's and zero length vectors in the same way, and the output from one function is easy to feed into the input of another."
stringr_desc
## [1] "A consistent, simple and easy to use set of wrappers around the fantastic 'stringi' package. All function and argument names (and positions) are consistent, all functions deal with NA's and zero length vectors in the same way, and the output from one function is easy to feed into the input of another."
cat(str_wrap(stringr_desc, width = 80, indent = 3))
##    A consistent, simple and easy to use set of wrappers around the fantastic
## 'stringi' package. All function and argument names (and positions) are
## consistent, all functions deal with NA's and zero length vectors in the same
## way, and the output from one function is easy to feed into the input of another.
cat(str_wrap(stringr_desc, width = 80, exdent = 3))
## A consistent, simple and easy to use set of wrappers around the fantastic
##    'stringi' package. All function and argument names (and positions) are
##    consistent, all functions deal with NA's and zero length vectors in the
##    same way, and the output from one function is easy to feed into the input of
##    another.
# text with whitespaces
btext = c("This example ", "has several   ", "whitespaces like this ")
str_trim(btext)
## [1] "This example"          "has several"           "whitespaces like this"
word(btext, 1)
## [1] "This"        "has"         "whitespaces"
word(btext, 2)
## [1] "example" "several" "like"

Regular expressions

The above functions already allow us to handle some basic tasks with strings. However, if we really want to unleash the power of string manipulation, we need to learn regular expressions (aka regex).

A regular expression is a pattern that describes a set of strings.

Because it is impossible to cover most of it in one lecture, I put some resources here so that you can read more later.

Regex basics

The main purpose of working with regular expressions is to describe patterns that are used to match against text strings. The simplest version of pattern matching is to search for some specific characters in a string.

Metacharacters

There are some special characters that have a reserved status and they are known as metacharacters. The metacharacters in Extended Regular Expressions (EREs) are:

. \ | ( ) [ { ^ $ * + ?

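For example, the pattern “money$” does not match the string “money$”, because $ anchors the match to the end of the string instead of matching a literal dollar sign. Likewise, in the pattern “what?” the ? makes the preceding t optional instead of matching a literal question mark. Except for a few cases, metacharacters have a special meaning and purpose when working with regular expressions. To match one literally, we need to escape it. In R this takes a double backslash \\, because R first reads \\ in a string literal as a single backslash and only then passes the result to the regex engine.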

Metacharacter Escape in R
. \\.
$ \\$
* \\*
+ \\+
? \\?
| \\|
\ \\\\
^ \\^
[ \\[
] \\]
{ \\{
} \\}
( \\(
) \\)
money = "$money"
# the naive but wrong way
sub(pattern = "$", replacement = "", x = money)
## [1] "$money"
# the usual way in other languages, which is a syntax error in R
# sub(pattern = "\$", replacement = "", x = money)
# Error: '\$' is an unrecognized escape in character string
# escape in R
sub(pattern = "\\$", replacement = "", x = money)
## [1] "money"
sub("\\|", "", "Peace|Love")
## [1] "PeaceLove"
sub("\\^", "", "Peace^Love")
## [1] "PeaceLove"
sub("\\[", "", "Peace[Love]")
## [1] "PeaceLove]"
sub("\\]", "", "Peace[Love]")
## [1] "Peace[Love"
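
The effect of escaping can be checked directly with grepl(). The sketch below also shows fixed = TRUE, an argument of the base R matching functions that turns off regex interpretation altogether, which is often simpler when the pattern is a literal string.

```r
x <- "money$"
grepl("money$", x)                # "$" is an end-of-string anchor
## [1] FALSE
grepl("money\\$", x)              # escaped: a literal dollar sign
## [1] TRUE
grepl("money$", x, fixed = TRUE)  # no regex interpretation at all
## [1] TRUE
# once R has parsed the string, "\\$" is two characters: \ and $
nchar("\\$")
## [1] 2
```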

Sequences

Sequences are short-hand escapes that each match one character from a predefined class, or a position such as a word boundary. The commonly used sequences in R are:

Sequence in R Description
\\d match a digit character
\\D match a non-digit character
\\s match a space character
\\S match a non-space character
\\w match a word character
\\W match a non-word character
\\b match a word boundary
\\B match a non-(word boundary)
\\h match a horizontal space
\\H match a non-horizontal space
\\v match a vertical space
\\V match a non-vertical space
sub("\\d", "_", "Covid 19")
## [1] "Covid _9"
gsub("\\d", "_", "Covid 19")
## [1] "Covid __"
sub("\\D", "_", "Covid 19")
## [1] "_ovid 19"
gsub("\\D", "_", "Covid 19")
## [1] "______19"
sub("\\s", "_", "Covid 19")
## [1] "Covid_19"
sub("\\S", "_", "Covid 19")
## [1] "_ovid 19"
gsub("\\S", "_", "Covid 19")
## [1] "_____ __"
sub("\\w", "_", "Covid 19")
## [1] "_ovid 19"
gsub("\\w", "_", "Covid 19")
## [1] "_____ __"
sub("\\W", "_", "Covid 19")
## [1] "Covid_19"
gsub("\\W", "_", "Covid 19")
## [1] "Covid_19"
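
A few more uses of these sequences, as a sketch with made-up inputs. The + after \\s means "one or more" and is covered under Quantifiers.

```r
# collapse runs of whitespace into a single space
gsub("\\s+", " ", "too    many   spaces")
## [1] "too many spaces"
# \\b keeps "cat" from matching inside "concat"
grepl("\\bcat\\b", c("cat", "concat", "a cat nap"))
## [1]  TRUE FALSE  TRUE
# drop everything that is not a digit
gsub("\\D", "", "Tel: 555-0199")
## [1] "5550199"
```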

Character classes

A character class or character set is a list of characters enclosed by square brackets [ ]. Character sets are used to match only one of several characters. For instance, the regex character class [aA] matches any lower case letter a or any upper case letter A.

Character class  Description
[aeiou]          match any one lower case vowel
[AEIOU]          match any one upper case vowel
[0123456789]     match any digit
[0-9]            match any digit (same as previous class)
[a-z]            match any lower case ASCII letter
[A-Z]            match any upper case ASCII letter
[a-zA-Z0-9]      match any ASCII letter or digit
[^aeiou]         match any character other than a lower case vowel
[^0-9]           match any character other than a digit

d = c("car", "bike", "plane", "boat", "Oct 07", "I-II-III", "R 4.1.1")
# look for 'e' or 'i'
grep(pattern = "[ei]", x = d, value = TRUE)
## [1] "bike"  "plane"
grep(pattern = "[01]", x = d, value = TRUE)
## [1] "Oct 07"  "R 4.1.1"
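
Character classes work the same way in replacement functions. A short sketch, including a negated class and a class that contains a literal dot and hyphen (inside brackets the dot is not a metacharacter, and a hyphen placed last is literal):

```r
gsub("[aeiou]", "", "plane")    # drop lower case vowels
## [1] "pln"
gsub("[^0-9]", "", "R 4.1.1")   # keep digits only
## [1] "411"
grepl("[.-]", c("I-II-III", "R 4.1.1", "boat"))
## [1]  TRUE  TRUE FALSE
```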

POSIX Character Classes

Closely related to the regex character classes we have what is known as POSIX character classes. In R, POSIX character classes are represented with expressions inside double brackets [[ ]].

Class         Description
[[:lower:]]   Lower-case letters
[[:upper:]]   Upper-case letters
[[:alpha:]]   Alphabetic characters ([[:lower:]] and [[:upper:]])
[[:digit:]]   Digits: 0,1,2,3,4,5,6,7,8,9
[[:alnum:]]   Alphanumeric characters ([[:alpha:]] and [[:digit:]])
[[:blank:]]   Blank characters: space and tab
[[:cntrl:]]   Control characters
[[:punct:]]   Punctuation characters: ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~
[[:space:]]   Space characters: tab, newline, vertical tab, form feed, carriage return, and space
[[:xdigit:]]  Hexadecimal digits: 0-9 A B C D E F a b c d e f
[[:print:]]   Printable characters ([[:alnum:]], [[:punct:]] and space)
[[:graph:]]   Graphical characters ([[:alnum:]] and [[:punct:]])

gsub(pattern = "[[:blank:]]", replacement = "", x = d)
## [1] "car"      "bike"     "plane"    "boat"     "Oct07"    "I-II-III" "R4.1.1"
gsub(pattern = "[[:lower:]]", replacement = "", x = d)
## [1] ""         ""         ""         ""         "O 07"     "I-II-III" "R 4.1.1"
gsub(pattern = "[[:alnum:]]", replacement = "", x = d)
## [1] ""    ""    ""    ""    " "   "--"  " .."
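
POSIX classes can also be negated and combined with quantifiers, like any other character class. A small sketch with a made-up string msg:

```r
msg <- "Hello, world! (R 4.1.1)"
# strip punctuation
gsub("[[:punct:]]", "", msg)
## [1] "Hello world R 411"
# keep letters only: ^ negates the class inside the outer brackets
gsub("[^[:alpha:]]", "", msg)
## [1] "HelloworldR"
# replace any run of spaces, tabs, or newlines with an underscore
gsub("[[:space:]]+", "_", "a \t b\nc")
## [1] "a_b_c"
```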

Quantifiers

Quantifiers are used when we want to match a certain number of characters that meet certain criteria. Quantifiers specify how many instances of a character, group, or character class must be present in the input for a match to be found.

Quantifier Description
? The preceding item is optional and will be matched at most once
* The preceding item will be matched zero or more times
+ The preceding item will be matched one or more times
{n} The preceding item is matched exactly n times
{n,} The preceding item is matched n or more times
{n,m} The preceding item is matched at least n times, but not more than m times
sts = row.names(USArrests)
sts
##  [1] "Alabama"        "Alaska"         "Arizona"        "Arkansas"      
##  [5] "California"     "Colorado"       "Connecticut"    "Delaware"      
##  [9] "Florida"        "Georgia"        "Hawaii"         "Idaho"         
## [13] "Illinois"       "Indiana"        "Iowa"           "Kansas"        
## [17] "Kentucky"       "Louisiana"      "Maine"          "Maryland"      
## [21] "Massachusetts"  "Michigan"       "Minnesota"      "Mississippi"   
## [25] "Missouri"       "Montana"        "Nebraska"       "Nevada"        
## [29] "New Hampshire"  "New Jersey"     "New Mexico"     "New York"      
## [33] "North Carolina" "North Dakota"   "Ohio"           "Oklahoma"      
## [37] "Oregon"         "Pennsylvania"   "Rhode Island"   "South Carolina"
## [41] "South Dakota"   "Tennessee"      "Texas"          "Utah"          
## [45] "Vermont"        "Virginia"       "Washington"     "West Virginia" 
## [49] "Wisconsin"      "Wyoming"
# 'm' zero or one time; zero occurrences is allowed, so every string matches
grep(pattern = "m?", sts, value = TRUE)
##  [1] "Alabama"        "Alaska"         "Arizona"        "Arkansas"      
##  [5] "California"     "Colorado"       "Connecticut"    "Delaware"      
##  [9] "Florida"        "Georgia"        "Hawaii"         "Idaho"         
## [13] "Illinois"       "Indiana"        "Iowa"           "Kansas"        
## [17] "Kentucky"       "Louisiana"      "Maine"          "Maryland"      
## [21] "Massachusetts"  "Michigan"       "Minnesota"      "Mississippi"   
## [25] "Missouri"       "Montana"        "Nebraska"       "Nevada"        
## [29] "New Hampshire"  "New Jersey"     "New Mexico"     "New York"      
## [33] "North Carolina" "North Dakota"   "Ohio"           "Oklahoma"      
## [37] "Oregon"         "Pennsylvania"   "Rhode Island"   "South Carolina"
## [41] "South Dakota"   "Tennessee"      "Texas"          "Utah"          
## [45] "Vermont"        "Virginia"       "Washington"     "West Virginia" 
## [49] "Wisconsin"      "Wyoming"
grep(pattern = "m{1}", sts, value = TRUE) # one 'm' in the match, so the same as "m"
## [1] "Alabama"       "New Hampshire" "Oklahoma"      "Vermont"      
## [5] "Wyoming"
grep(pattern = "l+", sts, value = TRUE) #  once or more
##  [1] "Alabama"        "Alaska"         "California"     "Colorado"      
##  [5] "Delaware"       "Florida"        "Illinois"       "Maryland"      
##  [9] "North Carolina" "Oklahoma"       "Pennsylvania"   "Rhode Island"  
## [13] "South Carolina"
grep(pattern = "l{2,3}", sts, value = TRUE) # 2 or 3 consecutive 'l'
## [1] "Illinois"
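
Because grep() looks for a match anywhere in the string, the effect of a quantifier is easier to see when the quantified character sits between other required characters. A sketch with a made-up vector x:

```r
x <- c("color", "colour", "colouur")
grepl("colou?r", x)    # zero or one 'u'
## [1]  TRUE  TRUE FALSE
grepl("colou*r", x)    # zero or more 'u'
## [1] TRUE TRUE TRUE
grepl("colou+r", x)    # one or more 'u'
## [1] FALSE  TRUE  TRUE
grepl("colou{2}r", x)  # exactly two 'u'
## [1] FALSE FALSE  TRUE
```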

Position of pattern within a string

  • ^: matches the start of the string
  • $: matches the end of the string
  • \b: matches the empty string at either edge of a word. Don’t confuse it with ^ and $, which mark the edges of the whole string.
  • \B: matches the empty string provided it is not at an edge of a word.
strings <- c("abcd", "cdab", "cabd", "c abd")
grep("ab", strings, value = TRUE)
## [1] "abcd"  "cdab"  "cabd"  "c abd"
grep("^ab", strings, value = TRUE)
## [1] "abcd"
grep("ab$", strings, value = TRUE)
## [1] "cdab"
grep("\\bab", strings, value = TRUE)
## [1] "abcd"  "c abd"
grep("\\Bab", strings, value = TRUE)
## [1] "cdab" "cabd"

Operators

  • .: matches any single character
  • [...]: matches any one of the characters inside the brackets
  • [^...]: matches any other characters except those inside the brackets
  • |: or, matches either side of the vertical bar
  • (...): grouping in regex. This allows you to retrieve the parts of the text that matched each parenthesized part of the regular expression, so you can alter them or use them to build a new string. Each group can then be referred to with \\N, where N is the position of the group, counting opening parentheses from left to right. This is called a backreference.
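
The operators above can be sketched with a few small examples (inputs are made up for illustration):

```r
# . matches any single character, so "cb" has nothing for it to match
grepl("c.b", c("cab", "cb"))
## [1]  TRUE FALSE
# | inside a group: the string starts with either "ab" or "cd"
grepl("^(ab|cd)", c("abcd", "cdab", "cabd"))
## [1]  TRUE  TRUE FALSE
# backreferences in the replacement: swap two words
sub("(\\w+) (\\w+)", "\\2 \\1", "hello world")
## [1] "world hello"
# backreference in the pattern: any character followed by itself
grepl("(.)\\1", c("hello", "world"))
## [1]  TRUE FALSE
```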

Main Base R functions for Regex

Function    Purpose                     Characteristic
grep()      finding regex matches       which elements are matched (index or value)
grepl()     finding regex matches       which elements are matched (TRUE & FALSE)
regexpr()   finding regex matches       positions of the first match
gregexpr()  finding regex matches       positions of all matches
regexec()   finding regex matches       positions of the first match and of its parenthesized groups
sub()       replacing regex matches     only first match is replaced
gsub()      replacing regex matches     all matches are replaced
strsplit()  splitting at regex matches  split vector according to matches

All regex functions require two main arguments: a pattern (i.e. regular expression), and a text to match.
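
A sketch that runs several of these functions on the same made-up vector x, so their return values can be compared. regmatches() is a base R helper that extracts the matched text from the positions returned by regexpr() and gregexpr().

```r
x <- c("apple 12", "banana", "cherry 7 and 42")

grep("[0-9]+", x)    # indices of the elements that match
## [1] 1 3
grepl("[0-9]+", x)   # one TRUE/FALSE per element
## [1]  TRUE FALSE  TRUE

# first match per element; elements without a match are dropped
first_num <- regmatches(x, regexpr("[0-9]+", x))
first_num
## [1] "12" "7"

# all matches per element, returned as a list with one entry per element
all_nums <- regmatches(x, gregexpr("[0-9]+", x))
lengths(all_nums)
## [1] 1 0 2

# split on a comma followed by optional whitespace
strsplit("a, b,c", ",\\s*")[[1]]
## [1] "a" "b" "c"
```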

Regex functions in {stringr}

They all follow the same general form:

str_function(string, pattern)
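
As a sketch of this form, here are three stringr functions applied to a made-up vector x. The code is wrapped in a requireNamespace() check so that it is simply skipped when stringr is not installed; this guard is only a convenience for the example, not something stringr requires.

```r
if (requireNamespace("stringr", quietly = TRUE)) {
  x <- c("apple 12", "banana", "cherry 7 and 42")

  # string first, pattern second, in every function
  print(stringr::str_detect(x, "[0-9]+"))            # similar to grepl()
  print(stringr::str_extract(x, "[0-9]+"))           # first match, NA if none
  print(stringr::str_replace_all(x, "[0-9]+", "#"))  # similar to gsub()
}
```

Unlike regmatches() with regexpr(), str_extract() keeps one result per input element and uses NA where there is no match, so the output always has the same length as the input.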