So far, we have covered common data structures in R, how to subset
them, and how to write control-flow statements to work with data
iteratively. In all of those lectures, we worked with data that were
already clean and complete. Real-world data, however, are usually full
of problems, and character data especially so. In this lecture, we will
talk about how to work with character data (i.e., strings). In R, they
are represented as vectors of class character.
In R, character strings can be expressed with single or double quotes.
'a character string with single quote'
"a character string with double quote"
We can insert single quotes in a string with double quotes, and vice
versa.
"The 'R' project"
'The "R" project'
But we cannot insert single quotes in a string delimited by single
quotes, or double quotes in a string delimited by double quotes, unless
we escape them with a backslash.
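The restriction above applies to unescaped quotes. As a minimal sketch (the variable names are only illustrative), a backslash lets a quote appear inside a string that is delimited by the same kind of quote:

```r
# A backslash escapes a quote that matches the string delimiter
single_in_single <- 'The \'R\' project'
double_in_double <- "The \"R\" project"

print(double_in_double)       # print() displays the escape sequences
writeLines(double_in_double)  # writeLines() shows the actual characters

# The backslashes are not stored, so both strings have 15 characters
nchar(single_in_single)
nchar(double_in_double)
```

The backslash only tells the parser how to read the literal; it is not part of the resulting string.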
In base R, we have functions to deal with character data, including
nchar(), strsplit(), substr(), paste(), and paste0(). There are also
packages such as {stringr} and {glue} to deal with strings.
Create character vector
# empty string
(empty_str <- "")
## [1] ""
class(empty_str)
## [1] "character"
# empty character vector
(empty_chr <- character(length = 0))
## character(0)
class(empty_chr)
## [1] "character"
# they are different
length(empty_str)
## [1] 1
length(empty_chr)
## [1] 0
empty_chr[1] <- "first"
empty_chr
## [1] "first"
empty_chr[4] <- "fourth"
empty_chr
## [1] "first" NA NA "fourth"
# is.character, as.character; we already covered them in previous lectures
Now let's revisit paste().
paste("The life of", pi, sep = " ")
## [1] "The life of 3.14159265358979"
paste("The life of", pi, sep = "-")
## [1] "The life of-3.14159265358979"
paste0("The life of-", pi)
## [1] "The life of-3.14159265358979"
paste("I", "love", "R", sep = "-")
## [1] "I-love-R"
paste("X", 1:5, sep = ".") # recycle
## [1] "X.1" "X.2" "X.3" "X.4" "X.5"
paste("X", 1:5, sep = ".", collapse = "-")
## [1] "X.1-X.2-X.3-X.4-X.5"
paste("NA will be coerced", NA)
## [1] "NA will be coerced NA"
Read raw text into R
We can use readLines() to read text into R as is.
top105 = readLines("http://www.textfiles.com/music/ktop100.txt")
head(top105, n = 15)
## [1] "From: ed@wente.llnl.gov (Ed Suranyi)"
## [2] "Date: 12 Jan 92 21:23:55 GMT"
## [3] "Newsgroups: rec.music.misc"
## [4] "Subject: KITS' year end countdown"
## [5] ""
## [6] ""
## [7] "On Jan. 1, 1992, the \"Modern Rock\" station KITS San Francisco (\"Live-105\")"
## [8] "broadcast its list of the \"Top 105.3 of 1991.\" Here is the countdown"
## [9] "list:"
## [10] ""
## [11] "1. NIRVANA SMELLS LIKE TEEN SPIRIT"
## [12] "2. EMF UNBELIEVABLE"
## [13] "3. R.E.M. LOSING MY RELIGION"
## [14] "4. SIOUXSIE & THE BANSHEES KISS THEM FOR ME"
## [15] "5. B.A.D. II RUSH"
tail(top105, n = 10)
## [1] "101. SMASHING PUMPKINS SIVA"
## [2] "102. ELVIS COSTELLO OTHER SIDE OF ..."
## [3] "103. SEERS PSYCHE OUT"
## [4] "104. THRILL KILL CULT SEX ON WHEELZ"
## [5] "105. MATTHEW SWEET I'VE BEEN WAITING"
## [6] "105.3 LATOUR PEOPLE ARE STILL HAVING SEX"
## [7] ""
## [8] "Ed"
## [9] "ed@wente.llnl.gov"
## [10] ""
String printing
| Function | Description |
|---|---|
| print() | generic printing |
| noquote() | print with no quotes |
| cat() | concatenation |
| format() | special formats |
| toString() | convert to string |
| sprintf() | printing |
The choice of function will depend on what we want to print, how we
want to print it, and where we want to print it.
print(top105[1])
## [1] "From: ed@wente.llnl.gov (Ed Suranyi)"
print(top105[1], quote = FALSE)
## [1] From: ed@wente.llnl.gov (Ed Suranyi)
noquote(top105[1])
## [1] From: ed@wente.llnl.gov (Ed Suranyi)
cat(top105[1]) # similar to print
## From: ed@wente.llnl.gov (Ed Suranyi)
cat(top105[1:2]) # elements are separated by a space by default
## From: ed@wente.llnl.gov (Ed Suranyi) Date: 12 Jan 92 21:23:55 GMT
cat(top105[1:2], sep = " + ")
## From: ed@wente.llnl.gov (Ed Suranyi) + Date: 12 Jan 92 21:23:55 GMT
cat(month.name[1:4], sep = " --> ")
## January --> February --> March --> April
cat(top105[1:2], sep = " + ", fill = 30) # break long strings
## From: ed@wente.llnl.gov (Ed Suranyi) +
## Date: 12 Jan 92 21:23:55 GMT
# save as a text file
# cat(top105[1:2], sep = " + ", file = "output.tex")
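The examples above cover print(), noquote(), and cat(). As a rough sketch of the remaining functions in the table, the following uses made-up values (not taken from the top105 data) to show format(), sprintf(), and toString():

```r
# format(): control significant digits, decimals, and field width
f_digits <- format(pi, digits = 3)    # "3.14"
f_nsmall <- format(3.5, nsmall = 2)   # "3.50"
f_width  <- format("R", width = 5)    # "R    " (padded on the right)

# sprintf(): C-style templates; %s string, %d integer, %f floating point
s_rank  <- sprintf("%s is ranked no. %d", "NIRVANA", 1L)
s_round <- sprintf("%.2f", pi)        # "3.14"
s_zero  <- sprintf("%03d", 7L)        # "007"

# toString(): collapse a vector into one comma-separated string
t_all <- toString(1:5)                # "1, 2, 3, 4, 5"

print(c(f_digits, f_nsmall, f_width, s_round, s_zero))
cat(s_rank, t_all, sep = "\n")
```

All three return character vectors rather than printing, so their results can be passed on to paste() or cat().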
String manipulations
Basic string manipulation functions
| Function | Description |
|---|---|
| nchar() | number of characters |
| tolower() | convert to lower case |
| toupper() | convert to upper case |
| casefold() | case folding |
| chartr() | character translation |
| abbreviate() | abbreviation |
| substring() | substrings of a character vector |
| substr() | substrings of a character vector |
nchar(c("How", "many", "characters?"))
## [1] 3 4 11
length(c("How", "many", "characters?")) # vs
## [1] 3
nchar("How many characters?") # space counts
## [1] 20
tolower(c("alL CAses", "BBBBD"))
## [1] "all cases" "bbbbd"
toupper(c("alL CAses", "BBBBD"))
## [1] "ALL CASES" "BBBBD"
casefold(c("alL CAses", "BBBBD"))
## [1] "all cases" "bbbbd"
casefold(c("alL CAses", "BBBBD"), upper = TRUE)
## [1] "ALL CASES" "BBBBD"
unname(abbreviate(top105[1:15]))
## [1] "Fe(S" "D1J92G" "N:r." "SKyec"
## [5] "" "" "OJ11t\"RsKSF(" "bilot\"1o1Hitc"
## [9] "lst:" "" "1NSLTS" "2.EU"
## [13] "3RLMR" "4S&TBKTFM" "5BIR"
y = c("may", "the", "force", "be", "with", "you")
substr(y, 2, 3)
## [1] "ay" "he" "or" "e" "it" "ou"
substr(y, 2, 3) <- ":)"
y
## [1] "m:)" "t:)" "f:)ce" "b:" "w:)h" "y:)"
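The table also lists chartr() and substring(), which are not shown above. A small sketch with an illustrative sentence of our own:

```r
x <- "may the force be with you"

# chartr(old, new, x): each character in `old` is replaced by the
# character at the same position in `new`
leet <- chartr("aeo", "430", x)
leet

# substring() is vectorised over start and end positions
pieces <- substring("force", 1:3, 3:5)
pieces

# Unlike substr(), substring() does not need an end position
tail_part <- substring("force", 3)
tail_part
```

Here substring() recycles the string over the position vectors, which makes it convenient for extracting several overlapping pieces in one call.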
String manipulations with {stringr}
Even though the base functions are enough to get the job done, they are
inconsistent in both their arguments and their behavior, and they have
drawbacks. For example:
paste("one", "word", "here", NULL, character(0))
## [1] "one word here  "
In the above example, NULL and character(0) have zero length and
probably should be removed. But they were converted to the empty
string "".
In R, the {stringr} package is very useful for working with strings.
According to the description of the package (see
https://cran.r-project.org/web/packages/stringr/index.html):
stringr: A consistent, simple and easy to use set of wrappers around the
fantastic 'stringi' package. All function and argument names (and
positions) are consistent, all functions deal with "NA"'s and zero
length vectors in the same way, and the output from one function is easy
to feed into the input of another.
# install.packages("stringr")
# load 'stringr'
library(stringr)
All functions in stringr start with "str_" followed by a term
associated with the task they perform. The following table lists the
stringr functions for basic string operations, together with the closest
base R function:
| Function | Description | Similar to |
|---|---|---|
| str_c() | string concatenation | paste() |
| str_length() | number of characters | nchar() |
| str_sub() | extracts substrings | substring() |
| str_dup() | duplicates characters | none |
| str_trim() | removes leading and trailing whitespace | none |
| str_pad() | pads a string | none |
| str_wrap() | wraps a string paragraph | strwrap() |
str_c("one", "word", "here", NULL, character(0), sep = " ")
## character(0)
some_factor = factor(c(1, 1, 1, 2, 2, 2), labels = c("good", "bad"))
some_factor
## [1] good good good bad bad bad
## Levels: good bad
# nchar(some_factor)
# Error in nchar(some_factor) : 'nchar()' requires a character vector
str_length(some_factor)
## [1] 4 4 4 3 3 3
str_sub("adios", 1:3)
## [1] "adios" "dios" "ios"
str_sub("adios", start = 1, end = -2)
## [1] "adio"
str_dup("hola", times = 3)
## [1] "holaholahola"
str_dup(c("hola", "adios"), times = 3)
## [1] "holaholahola" "adiosadiosadios"
str_dup(c("hola", "adios"), times = c(2, 5))
## [1] "holahola" "adiosadiosadiosadiosadios"
str_pad("hola", width = 7, side = "left")
## [1] "   hola"
str_pad("hola", width = 7, side = "both")
## [1] " hola  "
str_pad("hola", width = 7, side = "right")
## [1] "hola   "
str_pad("hola", width = 7, side = "right", pad = "&")
## [1] "hola&&&"
stringr_desc = "A consistent, simple and easy to use set of wrappers around the fantastic 'stringi' package. All function and argument names (and positions) are consistent, all functions deal with NA's and zero length vectors in the same way, and the output from one function is easy to feed into the input of another."
stringr_desc
## [1] "A consistent, simple and easy to use set of wrappers around the fantastic 'stringi' package. All function and argument names (and positions) are consistent, all functions deal with NA's and zero length vectors in the same way, and the output from one function is easy to feed into the input of another."
cat(str_wrap(stringr_desc, width = 80, indent = 3))
##    A consistent, simple and easy to use set of wrappers around the fantastic
## 'stringi' package. All function and argument names (and positions) are
## consistent, all functions deal with NA's and zero length vectors in the same
## way, and the output from one function is easy to feed into the input of another.
cat(str_wrap(stringr_desc, width = 80, exdent = 3))
## A consistent, simple and easy to use set of wrappers around the fantastic
##    'stringi' package. All function and argument names (and positions) are
##    consistent, all functions deal with NA's and zero length vectors in the
##    same way, and the output from one function is easy to feed into the input of
##    another.
# text with whitespaces
btext = c("This example ", "has several ", "whitespaces like this ")
str_trim(btext)
## [1] "This example" "has several" "whitespaces like this"
word(str_trim(btext), 1) # first word
## [1] "This" "has" "whitespaces"
word(str_trim(btext), 2) # second word
## [1] "example" "several" "like"
Regular expressions
The above functions already allow us to handle some basic tasks with
strings. However, if we really want to unleash the power of string
manipulation, we need to learn regular expressions (aka regex).
A regular expression is a pattern that describes a set of strings.
Because it is impossible to cover most of the topic in one lecture, I
put some resources here so that you can read more later.
Regex basics
The main purpose of working with regular expressions is to describe
patterns that are used to match against text strings. The simplest
version of pattern matching is to search for some specific characters in
a string.
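As a minimal sketch of this simplest case, the following searches for literal characters with grepl(). The vectors are illustrative values of our own, and fixed = TRUE and ignore.case = TRUE are standard arguments of grepl():

```r
songs <- c("SMELLS LIKE TEEN SPIRIT", "UNBELIEVABLE", "LOSING MY RELIGION")

# Does each element contain the characters "IN"?
has_in <- grepl("IN", songs)
has_in

# Matching is case sensitive unless we ask otherwise
has_in_lower <- grepl("in", songs)
has_in_nocase <- grepl("in", songs, ignore.case = TRUE)

# Some characters have a special meaning in a regex: "." matches any
# character, so "4.1" also matches "4x1". fixed = TRUE turns this off.
versions <- c("R 4.1.1", "R 4x1y1")
as_regex <- grepl("4.1", versions)
as_fixed <- grepl("4.1", versions, fixed = TRUE)
rbind(as_regex, as_fixed)
```

When the pattern is meant literally, fixed = TRUE avoids surprises from characters that the regex engine would otherwise interpret.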
Sequences
Sequences define the characters that can match. We have shorthand
versions (or anchors) for commonly used sequences in R. Because the
backslash must itself be escaped inside an R string, they are written
with a double backslash:
| Sequence | Description |
|---|---|
| \\d | match a digit character |
| \\D | match a non-digit character |
| \\s | match a space character |
| \\S | match a non-space character |
| \\w | match a word character |
| \\W | match a non-word character |
| \\b | match a word boundary |
| \\B | match a non-(word boundary) |
| \\h | match a horizontal space |
| \\H | match a non-horizontal space |
| \\v | match a vertical space |
| \\V | match a non-vertical space |
sub("\\d", "_", "Covid 19")
## [1] "Covid _9"
gsub("\\d", "_", "Covid 19")
## [1] "Covid __"
sub("\\D", "_", "Covid 19")
## [1] "_ovid 19"
gsub("\\D", "_", "Covid 19")
## [1] "______19"
sub("\\s", "_", "Covid 19")
## [1] "Covid_19"
sub("\\S", "_", "Covid 19")
## [1] "_ovid 19"
gsub("\\S", "_", "Covid 19")
## [1] "_____ __"
sub("\\w", "_", "Covid 19")
## [1] "_ovid 19"
gsub("\\w", "_", "Covid 19")
## [1] "_____ __"
sub("\\W", "_", "Covid 19")
## [1] "Covid_19"
gsub("\\W", "_", "Covid 19")
## [1] "Covid_19"
Character class
A character class or character set is a list of characters enclosed by
square brackets [ ]. Character sets are used to match only one of
several characters. For instance, the regex character class [aA] matches
either a lower case a or an upper case A.
| Class | Description |
|---|---|
| [aeiou] | match any one lower case vowel |
| [AEIOU] | match any one upper case vowel |
| [0123456789] | match any digit |
| [0-9] | match any digit (same as previous class) |
| [a-z] | match any lower case ASCII letter |
| [A-Z] | match any upper case ASCII letter |
| [a-zA-Z0-9] | match any of the above classes |
| [^aeiou] | match anything other than a lowercase vowel |
| [^0-9] | match anything other than a digit |
d = c("car", "bike", "plane", "boat", "Oct 07", "I-II-III", "R 4.1.1")
# look for 'e' or 'i'
grep(pattern = "[ei]", x = d, value = TRUE)
## [1] "bike" "plane"
grep(pattern = "[01]", x = d, value = TRUE)
## [1] "Oct 07" "R 4.1.1"
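The negated classes in the table are not demonstrated above. As a short sketch that reuses the vector d (redefined here so the block is self-contained):

```r
d <- c("car", "bike", "plane", "boat", "Oct 07", "I-II-III", "R 4.1.1")

# [^0-9] matches anything that is not a digit, so deleting those
# matches keeps only the digits of each element
digits_only <- gsub("[^0-9]", "", d)
digits_only

# Inside brackets "." is a literal dot, and a leading "-" is a literal hyphen
with_dash_or_dot <- grep("[-.]", d, value = TRUE)
with_dash_or_dot
```

Note that most regex metacharacters lose their special meaning inside square brackets, which is why the dot does not need escaping here.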
POSIX Character Classes
Closely related to the regex character classes are the so-called POSIX
character classes. In R, POSIX character classes are represented with
expressions inside double brackets [[ ]].
| Class | Description |
|---|---|
| [[:lower:]] | Lower-case letters |
| [[:upper:]] | Upper-case letters |
| [[:alpha:]] | Alphabetic characters ([[:lower:]] and [[:upper:]]) |
| [[:digit:]] | Digits: 0,1,2,3,4,5,6,7,8,9 |
| [[:alnum:]] | Alphanumeric characters ([[:alpha:]] and [[:digit:]]) |
| [[:blank:]] | Blank characters: space and tab |
| [[:cntrl:]] | Control characters |
| [[:punct:]] | Punctuation characters: ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { } ~ and the vertical bar |
| [[:space:]] | Space characters: tab, newline, vertical tab, form feed, carriage return, and space |
| [[:xdigit:]] | Hexadecimal digits: 0-9 A B C D E F a b c d e f |
| [[:print:]] | Printable characters ([[:alnum:]], [[:punct:]] and space) |
| [[:graph:]] | Graphical characters ([[:alnum:]] and [[:punct:]]) |
gsub(pattern = "[[:blank:]]", replacement = "", x = d)
## [1] "car" "bike" "plane" "boat" "Oct07" "I-II-III" "R4.1.1"
gsub(pattern = "[[:lower:]]", replacement = "", x = d)
## [1] "" "" "" "" "O 07" "I-II-III" "R 4.1.1"
gsub(pattern = "[[:alnum:]]", replacement = "", x = d)
## [1] "" "" "" "" " " "--" " .."
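A few more POSIX classes in action, as a sketch on a made-up vector (messy is not part of the lecture data). The last call shows that several POSIX classes can be combined, and negated, inside one bracket expression:

```r
messy <- c("Oct. 07,\t2021", "R (4.1.1)!")

# Remove punctuation
no_punct <- gsub("[[:punct:]]", "", messy)
no_punct

# Replace every run of whitespace (here a space or a tab) with one space
one_space <- gsub("[[:space:]]+", " ", messy)
one_space

# Keep only alphanumeric and whitespace characters
keep_alnum_space <- gsub("[^[:alnum:][:space:]]", "", messy)
keep_alnum_space
```

The outer brackets delimit the character class and the inner [:name:] selects the POSIX set, which is why the double brackets are needed.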
Quantifiers
Quantifiers are used when we want to match a certain number of
characters that meet certain criteria. Quantifiers specify how many
instances of a character, group, or character class must be present in
the input for a match to be found.
| Quantifier | Description |
|---|---|
| ? | The preceding item is optional and will be matched at most once |
| * | The preceding item will be matched zero or more times |
| + | The preceding item will be matched one or more times |
| {n} | The preceding item is matched exactly n times |
| {n,} | The preceding item is matched n or more times |
| {n,m} | The preceding item is matched at least n times, but not more than m times |
sts = row.names(USArrests)
sts
## [1] "Alabama" "Alaska" "Arizona" "Arkansas"
## [5] "California" "Colorado" "Connecticut" "Delaware"
## [9] "Florida" "Georgia" "Hawaii" "Idaho"
## [13] "Illinois" "Indiana" "Iowa" "Kansas"
## [17] "Kentucky" "Louisiana" "Maine" "Maryland"
## [21] "Massachusetts" "Michigan" "Minnesota" "Mississippi"
## [25] "Missouri" "Montana" "Nebraska" "Nevada"
## [29] "New Hampshire" "New Jersey" "New Mexico" "New York"
## [33] "North Carolina" "North Dakota" "Ohio" "Oklahoma"
## [37] "Oregon" "Pennsylvania" "Rhode Island" "South Carolina"
## [41] "South Dakota" "Tennessee" "Texas" "Utah"
## [45] "Vermont" "Virginia" "Washington" "West Virginia"
## [49] "Wisconsin" "Wyoming"
# match 'm' at most once (0 or 1); zero occurrences are allowed, so every element matches
grep(pattern = "m?", sts, value = TRUE)
## [1] "Alabama" "Alaska" "Arizona" "Arkansas"
## [5] "California" "Colorado" "Connecticut" "Delaware"
## [9] "Florida" "Georgia" "Hawaii" "Idaho"
## [13] "Illinois" "Indiana" "Iowa" "Kansas"
## [17] "Kentucky" "Louisiana" "Maine" "Maryland"
## [21] "Massachusetts" "Michigan" "Minnesota" "Mississippi"
## [25] "Missouri" "Montana" "Nebraska" "Nevada"
## [29] "New Hampshire" "New Jersey" "New Mexico" "New York"
## [33] "North Carolina" "North Dakota" "Ohio" "Oklahoma"
## [37] "Oregon" "Pennsylvania" "Rhode Island" "South Carolina"
## [41] "South Dakota" "Tennessee" "Texas" "Utah"
## [45] "Vermont" "Virginia" "Washington" "West Virginia"
## [49] "Wisconsin" "Wyoming"
grep(pattern = "m{1}", sts, value = TRUE) # one 'm'; unanchored, so any string containing a lower case 'm' matches
## [1] "Alabama" "New Hampshire" "Oklahoma" "Vermont"
## [5] "Wyoming"
grep(pattern = "l+", sts, value = TRUE) # once or more
## [1] "Alabama" "Alaska" "California" "Colorado"
## [5] "Delaware" "Florida" "Illinois" "Maryland"
## [9] "North Carolina" "Oklahoma" "Pennsylvania" "Rhode Island"
## [13] "South Carolina"
grep(pattern = "l{2,3}", sts, value = TRUE) # 2 or 3 times
## [1] "Illinois"
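Because these patterns are not anchored, a quantifier only constrains the matched piece, not the whole string. The following sketch uses a small illustrative vector (not the full sts) and the anchors ^ and $ described in the next section:

```r
x <- c("Alabama", "Illinois", "summer", "Ohio")

# "l+" matches a whole run of l's, so each run is replaced only once
runs_replaced <- gsub("l+", "L", x)
runs_replaced

# "m{1}" is satisfied by any string that contains at least one "m"
has_m <- grepl("m{1}", x)
has_m

# To require exactly one "m" in the whole string, anchor the pattern
# and allow only non-"m" characters around it
exactly_one_m <- grepl("^[^m]*m[^m]*$", x)
exactly_one_m
```

In this sketch "summer" passes the unanchored test but fails the anchored one, since it contains two m's.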
Position of pattern within a string
^: matches the start of the string
$: matches the end of the string
\b: matches the empty string at either edge of a word. Don't confuse it with ^ and $, which mark the edges of a string.
\B: matches the empty string provided it is not at an edge of a word.
strings <- c("abcd", "cdab", "cabd", "c abd")
grep("ab", strings, value = TRUE)
## [1] "abcd" "cdab" "cabd" "c abd"
grep("^ab", strings, value = TRUE)
## [1] "abcd"
grep("ab$", strings, value = TRUE)
## [1] "cdab"
grep("\\bab", strings, value = TRUE)
## [1] "abcd" "c abd"
grep("\\Bab", strings, value = TRUE)
## [1] "cdab" "cabd"
Operators
.: matches any single character
[...]: matches any one of the characters inside the brackets
[^...]: matches any one character except those inside the brackets
|: or, matches either side of the vertical bar
(...): grouping in regex. This allows you to retrieve the bits that matched various parts of your regular expression so you can alter them or use them for building up a new string. Each group can then be referred to using \\N, with N being the number of the (...) group, counted from left to right. This is called a backreference.
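The operators above can be sketched with a few one-liners. The input strings are illustrative only:

```r
# "." matches any single character (but there must be one)
dot_match <- grepl("c.t", c("cat", "cut", "ct"))
dot_match

# "|" matches either alternative
either <- grep("car|boat", c("car", "bike", "plane", "boat"), value = TRUE)
either

# Groups are numbered from left to right; \\1 and \\2 reuse the
# matched text in the replacement
swapped <- sub("(\\w+) (\\w+)", "\\2 \\1", "hello world")
swapped

# A backreference inside the pattern matches the same text again,
# so "(.)\\1" finds a doubled character
doubled <- grepl("(.)\\1", c("Illinois", "Texas"))
doubled
```

Backreferences are what make sub() and gsub() useful for rearranging text rather than only deleting or replacing it.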
Main Base R functions for Regex
| Function | Purpose | Characteristic |
|---|---|---|
| grep() | finding regex matches | which elements are matched (index or value) |
| grepl() | finding regex matches | which elements are matched (TRUE & FALSE) |
| regexpr() | finding regex matches | positions of the first match |
| gregexpr() | finding regex matches | positions of all matches |
| regexec() | finding regex matches | hybrid of regexpr() and gregexpr() |
| sub() | replacing regex matches | only first match is replaced |
| gsub() | replacing regex matches | all matches are replaced |
| strsplit() | splitting regex matches | split vector according to matches |
All regex functions require two main arguments: a pattern
(i.e. regular expression), and a text to match.
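To compare what these functions return, here is a small sketch on an illustrative vector. It also uses regmatches(), a base R helper that is not listed in the table, to turn match positions into the matched text:

```r
x <- c("Covid 19", "R 4.1.1", "no digits")

idx  <- grep("[0-9]", x)     # indices of matching elements
flag <- grepl("[0-9]", x)    # one TRUE/FALSE per element

# regexpr(): position of the first match in each element (-1 if none)
first_pos <- regexpr("[0-9]+", x)
as.integer(first_pos)
first_txt <- regmatches(x, first_pos)   # elements without a match are dropped

# gregexpr(): all matches, returned as one list entry per element
all_txt <- regmatches(x, gregexpr("[0-9]+", x))
all_txt

# strsplit(): split at every match (here a space or a dot)
parts <- strsplit("R 4.1.1", "[ .]")
parts
```

Note that grep(), grepl(), regexpr(), and gregexpr() take the pattern first and the text second, while strsplit() takes the text first.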
Regex functions in {stringr}
They all follow the same general form:
str_function(string, pattern)