gregexpr() – The “gregexpr()” function is similar to the regexpr() function, but it returns the starting position of every match found.
str_locate_all() {stringr} – The “str_locate_all()” function in the {stringr} package also locates every match, returning the start and end position of each.
Important: You will need to load the {stringr} package to use the “str_locate()” and “str_locate_all()” functions.
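As a minimal sketch of the difference between the two base R functions, the following compares what regexpr() and gregexpr() report for the same pattern. The sample string and variable names are hypothetical and chosen only for illustration.

```r
# Hypothetical sample text, chosen only for illustration
text <- "banana"

# regexpr() reports the starting position of the first match only
first_match <- as.integer(regexpr("an", text))

# gregexpr() returns a list with one element per input string;
# each element holds the starting positions of every match
all_matches <- as.integer(gregexpr("an", text)[[1]])

print(first_match)  # 2
print(all_matches)  # 2 4
```

Because gregexpr() returns a list, you need the “[[1]]” index to reach the positions for a single input string.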
Substring Extraction
The “substr()” and “str_sub() {stringr}” functions are used to extract a substring from a string. You can extract a fixed width substring using one of these functions.
substr() – The “substr()” function extracts a substring from a string, given a start and a stop position.
str_sub() {stringr} – The “str_sub()” function in the {stringr} package works the same way as the “substr()” function.
> str_sub("some text", 1, 2)
[1] "so"
Important: You will need to load the {stringr} package to use the “str_sub()” function.
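For comparison, here is a small sketch of the same kind of fixed-width extraction with the base R substr() function, which needs no extra package. The variable names are hypothetical.

```r
x <- "some text"

# Characters 1 through 4
first_four <- substr(x, 1, 4)

# The last four characters, using nchar() to find the end of the string
last_four <- substr(x, nchar(x) - 3, nchar(x))

print(first_four)  # "some"
print(last_four)   # "text"
```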
Word Extraction
The “first.word()” function is used to extract the first word in a string.
first.word() {Hmisc} – The “first.word()” function in the {Hmisc} package extracts the first word or expression.
grep() – The “grep()” function searches a character vector for a regular expression. It returns the matching elements if “value = TRUE” and the positions of the matching elements if “value = FALSE”.
The following examples show how to implement the grep() function to return the value or the position of an expression:
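A short sketch of both modes of grep() is given here. The vector of fruit names is a made-up example, not data from the text.

```r
# Hypothetical character vector used only for illustration
fruits <- c("apple", "banana", "cherry")

# value = FALSE (the default) returns the positions of matching elements
positions <- grep("an", fruits)

# value = TRUE returns the matching elements themselves
values <- grep("an", fruits, value = TRUE)

print(positions)  # 2
print(values)     # "banana"
```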
str_extract(), str_extract_all(), str_match(), str_match_all() {stringr} and m() {caroline} – These functions are similar to the grep() function, but they return the matched text itself.
str_extract() and str_extract_all() – The “str_extract()” function returns a character vector holding the first match in each string, and the “str_extract_all()” function returns a list holding every match.
str_match() and str_match_all() – The “str_match()” function returns a character matrix, the “str_match_all()” function returns a list of matrices, and the “m()” function returns a data frame.
Important: You will need to load the {caroline} and {stringr} packages to use the str_extract(), str_extract_all(), str_match(), str_match_all(), and m() functions.
The following examples will show you how to use each of these functions:
// A string containing two dates (day, month, and year) is assigned.
> library("stringr")
> regexp <- "([[:digit:]]{2}) ([[:alpha:]]+) ([[:digit:]]{4})"
> string1 <- "07 November 1973 bday 22 November 2015"
// The str_extract() function is implemented to extract the first match.
> str_extract(string1, regexp)
[1] "07 November 1973"
// The str_extract_all() function is implemented to extract every match.
> str_extract_all(string1, regexp)
[[1]]
[1] "07 November 1973" "22 November 2015"
// The str_match() function is implemented to return the first match and its captured groups.
> str_match(string1, regexp)
     [,1]               [,2] [,3]       [,4]
[1,] "07 November 1973" "07" "November" "1973"
// The str_match_all() function is implemented to return every match and its captured groups.
> str_match_all(string1, regexp)
[[1]]
     [,1]               [,2] [,3]       [,4]
[1,] "07 November 1973" "07" "November" "1973"
[2,] "22 November 2015" "22" "November" "2015"
// The m() function is implemented to match the day, month, and year.
> library("caroline")
> m(pattern = regexp, vect = string1, names = c("day","month","year"), types = rep("character",3))
  day    month year
1  22 November 2015
String Substitution
R allows you to substitute one piece of text for another within a string. The following functions are used to make substitutions.
sub() – The “sub()” function is the standard function for making a string substitution within a string.
gsub() – The “gsub()” function works the same way as the sub() function. The only difference is that gsub() replaces all occurrences of the pattern, whereas sub() replaces only the first occurrence.
str_replace() {stringr} – The “str_replace()” function in the {stringr} package provides the same functionality as sub(), and its companion str_replace_all() corresponds to gsub().
Important: You will need to load the {stringr} package to use the “str_replace()” function.
In the following example, a British-style date follows the pattern of a 2-digit day, a blank space, letters, a blank space, and a 4-digit year. The 2-digit day is detected with the “[[:digit:]]{2}” expression, the letters are detected with the “[[:alpha:]]+” expression, and the 4-digit year is detected with the “[[:digit:]]{4}” expression. Each of the three expressions is enclosed in a set of parentheses, so the first substring is saved in “\\1”, the second substring is saved in “\\2”, and the third substring is saved in “\\3”.
// The first backreference returns the first part of the regular expression.
> string <- "07 November 1973"
> regexp <- "([[:digit:]]{2}) ([[:alpha:]]+) ([[:digit:]]{4})"
> sub(pattern = regexp, replacement = "\\1", x = string)
[1] "07"
// The second backreference returns the second part of the regular expression.
> sub(pattern = regexp, replacement = "\\2", x = string)
[1] "November"
// The third backreference returns the third part of the regular expression.
> sub(pattern = regexp, replacement = "\\3", x = string)
[1] "1973"
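Backreferences can also be combined in a single replacement. The following sketch, which reuses the same pattern, reorders the three captured parts of the date. The output format and the variable name “reordered” are arbitrary choices made for illustration.

```r
string <- "07 November 1973"
regexp <- "([[:digit:]]{2}) ([[:alpha:]]+) ([[:digit:]]{4})"

# Put the year first, then the month, then the day, separated by hyphens
reordered <- sub(pattern = regexp, replacement = "\\3-\\2-\\1", x = string)

print(reordered)  # "1973-November-07"
```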
In the following examples, the “sub()” and “gsub()” functions are used to replace strings. The first example uses the sub() function to remove the first space, and the second example uses the gsub() function to remove all the spaces in the string.
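A minimal sketch of the two cases just described, using a made-up input string:

```r
# Hypothetical input with several spaces
text <- "a b c d"

# sub() replaces only the first space
first_removed <- sub(" ", "", text)

# gsub() replaces every space
all_removed <- gsub(" ", "", text)

print(first_removed)  # "ab c d"
print(all_removed)    # "abcd"
```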
The “chartr()” function allows you to substitute characters in an expression or statement. The name of the function stands for “character translation”. You can also use the following functions to perform the same task as the “chartr()” function.
replacechar() {cwhmisc} – The “replacechar()” function is in the {cwhmisc} package. It is also used to substitute characters in an expression.
str_replace_all() {stringr} – The “str_replace_all()” function is in the {stringr} package. It performs the same task as the “chartr()” and “replacechar()” functions.
The following examples will show you how to substitute characters with the chartr(), replacechar(), and str_replace_all() functions:
Important: Remember to load the {cwhmisc} and {stringr} packages to use the replacechar() and str_replace_all() functions.
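As a sketch that runs without extra packages, the following uses chartr() and, for comparison, the base R gsub() function as a stand-in for the package functions named above. The sample string is hypothetical.

```r
x <- "banana bread"

# chartr() translates character by character: every "a" becomes "A"
# and every "b" becomes "B"
translated <- chartr(old = "ab", new = "AB", x = x)

# gsub() replaces a single pattern, so only the "a" characters change here
replaced <- gsub("a", "A", x)

print(translated)  # "BAnAnA BreAd"
print(replaced)    # "bAnAnA breAd"
```

Note that chartr() requires the “old” and “new” arguments to describe a one-to-one mapping, so “new” must be at least as long as “old”.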
Convert Letters
R allows you to convert letters in various ways. You can use one of the following functions to perform the appropriate letter conversion:
tolower() - The “tolower()” function converts uppercase to lowercase letters.
toupper() – The “toupper()” function converts lowercase to uppercase letters.
capitalize() {Hmisc} – The “capitalize()” function in the {Hmisc} package capitalizes the first letter of a string.
cap() {cwhmisc} – The “cap()” function performs the same task as the toupper() function, by capitalizing letters.
capitalize() {cwhmisc} – The “capitalize()” function performs the same task as the cap() and toupper() functions.
lower() {cwhmisc} – The “lower()” function performs the same task as the tolower() function, by converting uppercase to lowercase letters.
lowerize() {cwhmisc} – The “lowerize()” function performs the same task as the tolower() and lower() functions, by converting uppercase to lowercase letters.
CapLeading() {cwhmisc} – The “CapLeading()” function capitalizes the first character in a string.
> capitalize("florida")
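The base R conversions can be sketched as follows. The last step is only a hand-written base R stand-in for capitalizing the first letter, not the implementation used by the package functions listed above.

```r
x <- "florida"

# Convert the whole string to uppercase and back to lowercase
upper <- toupper(x)
lower <- tolower(upper)

# Base R stand-in for first-letter capitalization:
# uppercase the first character and paste the rest back on
cap_first <- paste0(toupper(substring(x, 1, 1)), substring(x, 2))

print(upper)      # "FLORIDA"
print(lower)      # "florida"
print(cap_first)  # "Florida"
```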
The “str_pad()” function is also used to pad a string with fill characters. The following examples apply the padding() and str_pad() functions to fill characters in a string.
The “padding()” function, on the other hand, handles character vectors in a better way, but the best approach is to use the “sapply()” and “padding()” functions together. The following examples show how you can implement the str_pad(), padding(), and sapply() functions.
Important: Remember to load the {stringr} and {cwhmisc} packages to use the applicable functions.
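If neither package is available, padding can be approximated in base R. The following sketch swaps in the base sprintf() function in place of str_pad() and padding(); it is vectorized, so no sapply() call is needed. The sample values are hypothetical.

```r
# Hypothetical values to pad
codes <- c("a", "bb", "ccc")

# Pad with spaces on the left to a total width of 5
left_padded <- sprintf("%5s", codes)

# Pad with spaces on the right to a total width of 5
right_padded <- sprintf("%-5s", codes)

# Pad integers with leading zeros to a total width of 3
zero_padded <- sprintf("%03d", c(7L, 42L))

print(left_padded)   # "    a" "   bb" "  ccc"
print(right_padded)  # "a    " "bb   " "ccc  "
print(zero_padded)   # "007" "042"
```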
Important: Remember to load the {memisc}, {gdata}, and {stringr} packages to use the functions for removing leading and trailing white spaces.
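Recent versions of R also ship a base function, trimws(), for the same job. The sketch below uses it as a stand-in for the package functions, on a made-up string.

```r
x <- "  some text  "

# Remove white space from both ends, or from one side only
both <- trimws(x)
left_only <- trimws(x, which = "left")
right_only <- trimws(x, which = "right")

print(both)        # "some text"
print(left_only)   # "some text  "
print(right_only)  # "  some text"
```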
Compare and Compute Strings
R provides specific operators and functions for assessing and comparing strings. The following examples will show you how each of these functions and operators is applied.
== – The “==” operator is used to determine whether two strings are the same. It returns TRUE if the strings are identical and FALSE if they are not.
// The expression returns “FALSE”.
> "xyz"=="zyz"
[1] FALSE
// The expression returns “TRUE”.
> "xyz"=="xyz"
[1] TRUE
Note: The functions that are used to compare strings and compute the difference between them rely on the Levenshtein distance. This is a string metric that measures the distance between two strings as the number of single-character insertions, deletions, and substitutions needed to turn one string into the other.
adist() {utils} – The “adist()” function of the {utils} package is used to calculate the approximate string distance between character vectors. It returns a matrix of distances.
> adist("match","matching")
     [,1]
[1,]    3
stringMatch() {MiscPsycho} – The “stringMatch()” function in the {MiscPsycho} package is used to compare the similarity of two strings. If “normalize = "YES"” is set in the function, then the Levenshtein distance is divided by the maximum length of the two strings.
// The stringMatch() function returns the number of characters that do not match.
> library("MiscPsycho")
> stringMatch("match","matching",normalize="NO",penalty=1,case.sensitive = TRUE)
[1] 3
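The normalization described above can be sketched with the adist() function alone. This only illustrates the arithmetic of dividing the distance by the longer string's length. It is not the stringMatch() implementation, and the variable names are hypothetical.

```r
a <- "match"
b <- "matching"

# Levenshtein distance: three insertions ("i", "n", "g") turn "match" into "matching"
distance <- as.integer(adist(a, b))

# Divide by the length of the longer string: 3 / 8
normalized <- distance / max(nchar(a), nchar(b))

print(distance)    # 3
print(normalized)  # 0.375
```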
stringdist() {stringdist} – The “stringdist()” function in the {stringdist} package performs approximate string matching and computes the distance between strings.
// The stringdist() function returns the number of characters that do not match.
> library(stringdist)
> stringdist("live", "lively", method = "dl")
[1] 2
levenshteinDist() {RecordLinkage} – The “levenshteinDist()” function in the {RecordLinkage} package computes the Levenshtein distance between two strings.
// The levenshteinDist() function compares two strings.
> library("RecordLinkage")
> levenshteinDist("records","recrd")
[1] 2
The “agrep()” function may also be used to find approximate matches based on the Levenshtein distance. The following arguments are used within the function to control what it returns.
“value = TRUE” – The “value = TRUE” argument returns the value of each matching string.
“value = FALSE” – The “value = FALSE” argument returns the position of each matching string.
max.distance – The “max.distance” argument, which can be abbreviated to “max”, sets the maximal Levenshtein distance allowed for a match.
The following examples implement the “agrep()” function to return the value of a string:
> agrep(pattern = "lively", x = c("1 live", "1", "1 LAZY"), max = 2, value = TRUE)
[1] "1 live"
> agrep("lively", c("1 live", "1", "1 LIVE"), max = 3, value = TRUE)
[1] "1 live"
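The sketch below contrasts the three settings on one input, following the pattern of the examples in the R help page for agrep(). The “ignore.case” argument is not discussed above and is included only to show how case affects the matches.

```r
candidates <- c("1 lazy", "1", "1 LAZY")

# value = FALSE (the default) returns the positions of the approximate matches
positions <- agrep("laysy", candidates, max.distance = 2)

# value = TRUE returns the matching strings themselves
values <- agrep("laysy", candidates, max.distance = 2, value = TRUE)

# ignore.case = TRUE lets the uppercase element match as well
positions_any_case <- agrep("laysy", candidates, max.distance = 2,
                            ignore.case = TRUE)

print(positions)           # 1
print(values)              # "1 lazy"
print(positions_any_case)  # 1 3
```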
There are also some miscellaneous functions that are used to manipulate and evaluate string expressions.
deparse() – The “deparse()” function converts unevaluated expressions into character strings.
char.expand() {base} – The “char.expand()” function expands a string based on its target.
pmatch() {base} and charmatch() {base} – The “pmatch()” and “charmatch()” functions are used to search for partial matches of the elements of the first argument among the elements of the second.
The following example implements the “pmatch()” function with the “nomatch” argument:
// The pmatch() function returns “0” if there is no match.
> pmatch(c("w","x","y","z"), table = c("x","z"), nomatch = 0)
[1] 0 1 0 2
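To see how the miscellaneous functions listed above behave with partial and ambiguous input, consider this short sketch. The target vector is borrowed from the style of the R help pages, and the variable names are hypothetical.

```r
targets <- c("mean", "median", "mode")

# pmatch() returns the position of a unique partial match,
# and NA when the partial match is ambiguous
pm_unique <- pmatch("mo", targets)
pm_ambiguous <- pmatch("me", targets)

# charmatch() returns 0 for an ambiguous match instead of NA
cm_ambiguous <- charmatch("me", targets)
cm_unique <- charmatch("med", targets)

# char.expand() returns the full target that the abbreviation expands to
expanded <- char.expand("mo", targets)

# deparse() turns an unevaluated expression into a character string
expr_text <- deparse(quote(x + y))

print(pm_unique)     # 3
print(pm_ambiguous)  # NA
print(cm_ambiguous)  # 0
print(cm_unique)     # 2
print(expanded)      # "mode"
print(expr_text)     # "x + y"
```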
make.unique() – The “make.unique()” function is used to make the elements of a character vector unique. This will help you to turn a string into an identifier.
The following example applies the “make.unique()” function to make a string into an identifier:
// The make.unique() function makes each element unique.
> make.unique(c("x", "x", "x"))
[1] "x"   "x.1" "x.2"
Note: The char.expand(), pmatch(), and charmatch() functions belong to the {base} package, which R loads automatically, so you do not need to include it yourself.