Count each next occurrence of string in substring
Today I ran into a problem that I couldn't solve on my own, despite searching for solutions. Either my approach is wrong or no one has asked a similar question before.
I'm playing around with Markov attribution, so I've got columns with strings that look like this:
A > B > B > C > B > A > C > B > A
etc.
...it is built with the PostgreSQL function 'string_agg'.
What I need is to number each element by how many times it has appeared so far within the string. To make it clear, the result should look like this:
A1 > B1 > B2 > C1 > B3 > A2 > C2 > B4 > A3
There are three main challenges:
- there are around 100 different types of elements to be counted, and they may change over time, which makes hardcoding impractical,
- the dataset is around 200k rows,
- strings may be up to a few hundred characters long.
The only thing that came to my mind is to write some sort of loop, but it feels like it would take ages to finish.
I also thought about solving it at the PostgreSQL level, but couldn't find an efficient and easy solution there either.
r statistics
Reproducible data would help us help you.
– snoram
Nov 13 at 11:33
Are the > part of the string?
– Rui Barradas
Nov 13 at 11:37
Unfortunately I cannot share my company's data, but it basically looks like the example provided. If there is a public dataset I can play around with, I would be glad to provide some examples. And yes, " > " is part of the string, but it can be changed to any other character, e.g. a space.
– Marcin
Nov 13 at 11:39
edited Nov 13 at 11:38
asked Nov 13 at 11:32
Marcin
3 Answers
How to do this is described in the gsubfn vignette. Using the code there, we first define a proto object pwords with methods pre and fun. pre initializes the word list (which stores the current count for each word encountered); fun updates it each time a word is encountered and returns the word suffixed with its count.
Having defined the foregoing, run gsubfn using pwords. For each component of the input, gsubfn first runs pre; then, for each match of the regular expression \w+, it passes the match to fun and replaces the match with the output of fun.
We have assumed that the words to be suffixed with a count are matched by \w+, which is the case for the example in the question; if your actual data is different, you may need to change the pattern.
library(gsubfn)
s <- rep("A > B > B > C > B > A > C > B > A", 3) # sample input
pwords <- proto(
  # pre runs once per input string and resets the per-word counters
  pre = function(this) { this$words <- list() },
  # fun runs once per match and returns the match suffixed with its running count
  fun = function(this, x) {
    if (is.null(words[[x]])) this$words[[x]] <- 0
    this$words[[x]] <- this$words[[x]] + 1
    paste0(x, words[[x]])
  }
)
gsubfn("\\w+", pwords, s)
giving:
[1] "A1 > B1 > B2 > C1 > B3 > A2 > C2 > B4 > A3"
[2] "A1 > B1 > B2 > C1 > B3 > A2 > C2 > B4 > A3"
[3] "A1 > B1 > B2 > C1 > B3 > A2 > C2 > B4 > A3"
Thank you for your answer. Among the given solutions, this one solved my problem best and gave me a tool for my task, along with a great explanation. Now I'm only struggling to write a regexp that would ignore punctuation marks: my words sometimes contain dots and hyphens, and then the repetitions are counted incorrectly. I didn't take that into consideration in my simplified example.
– Marcin
Nov 13 at 14:12
This will match anything other than space or >: [^ >]+
– G. Grothendieck
Nov 13 at 14:13
Worked like a charm, I owe you one, thank you!
– Marcin
Nov 13 at 14:19
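For elements that contain dots or hyphens, the same idea can also be sketched in base R, without gsubfn. This is only an illustration and not the method from the answer above: count_tokens is a hypothetical helper name, and the default pattern [^ >]+ assumes that elements never contain a space or a > character.

```r
# Hypothetical helper: suffix every token with its running count,
# leaving the separators between tokens untouched.
# Assumption: a token is any run of characters other than space or ">".
count_tokens <- function(s, pattern = "[^ >]+") {
  m <- gregexpr(pattern, s)      # positions of all tokens in each string
  tokens <- regmatches(s, m)     # list with one character vector per string
  # ave() numbers the repeats of each distinct token in order of appearance
  regmatches(s, m) <- lapply(tokens, function(x) {
    paste0(x, ave(x, x, FUN = seq_along))
  })
  s
}

count_tokens("a.b > c-d > a.b")
# [1] "a.b1 > c-d1 > a.b2"
```

Because the replacement is written back in place, whatever separator the input uses is preserved as-is.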
Here is a rough example using data.table:
library(data.table)
# Example data:
data <- data.table(
  s = c("A > B > B > C > B > A > C > B > A",
        "A > B > B > C > B > A > C > B > C > D")
)
# Processing steps (can probably be shortened)
n <- strsplit(data[["s"]], " > ")
# Long format: one row per element, L1 = index of the source string
# (built directly, so melt() on a plain list is not needed)
datal <- data.table(L1 = rep(seq_along(n), lengths(n)), value = unlist(n))
datal[, original_order := seq_len(.N)
      ][, temp := paste0(value, seq_len(.N)), by = .(L1, value)
        ][order(original_order), paste(temp, collapse = " > "), by = L1]
# Output:
   L1                                              V1
1:  1      A1 > B1 > B2 > C1 > B3 > A2 > C2 > B4 > A3
2:  2 A1 > B1 > B2 > C1 > B3 > A2 > C2 > B4 > C3 > D1
Here is a function that uses base R only.
Note that if you need a different set of regex metacharacters, it should be easy to add a function argument metachar, defaulting to the one in the function body.
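count_seq <- function(x, sep = ">"){
  # Regex metacharacters that must be escaped before being passed to strsplit()
  metachar <- '. | ( ) [ { ^ $ * + ?'
  # fixed = TRUE looks sep up literally instead of treating it as a regex
  sep2 <- if(grepl(sep, metachar, fixed = TRUE)) paste0("\\", sep) else sep
  y <- unlist(strsplit(x, sep2))
  y <- trimws(y)
  # Running count of each element, in order of appearance
  z <- ave(y, y, FUN = seq_along)
  paste(paste0(y, z), collapse = sep)
}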
x <- "A > B > B > C > B > A > C > B > A"
count_seq(x)
#[1] "A1>B1>B2>C1>B3>A2>C2>B4>A3"
count_seq(x, sep = " > ")
#[1] "A1 > B1 > B2 > C1 > B3 > A2 > C2 > B4 > A3"
y <- "A | B | B | C | B | A | C | B | A"
count_seq(y, sep = "|")
#[1] "A1|B1|B2|C1|B3|A2|C2|B4|A3"
Nice. Finally you would do something like: sapply(column, count_seq, " > "), right?
– snoram
Nov 13 at 11:55
@snoram Yes, that's the idea. I thought of looping (*apply) in the function, but it might be better to leave it like this, more general, and let the user decide.
– Rui Barradas
Nov 13 at 11:59
In a similar vein using @snoram's 's': lapply(strsplit(s, " > "), function(x) paste0(x, ave(x, x, FUN = seq_along), collapse = " > "))
– Henrik
Nov 13 at 12:11
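The suggestions in these comments can be combined into a single helper that processes a whole column at once. The following is a minimal sketch under stated assumptions: count_paths, df, path and path_counted are hypothetical names, and the separator is taken to be the literal string " > ".

```r
# Hypothetical helper: apply the split / count / paste idea to every
# string in a character vector (for example, a data frame column).
count_paths <- function(x, sep = " > ") {
  parts <- strsplit(x, sep, fixed = TRUE)  # fixed = TRUE: no regex escaping needed
  vapply(parts, function(p) {
    # ave() gives each element its running count within this one string
    paste0(p, ave(p, p, FUN = seq_along), collapse = sep)
  }, character(1))
}

# Made-up example data standing in for the real column
df <- data.frame(
  path = c("A > B > B > C > B > A > C > B > A",
           "A > B > B > C > B > A > C > B > C > D"),
  stringsAsFactors = FALSE
)
df$path_counted <- count_paths(df$path)
df$path_counted
# [1] "A1 > B1 > B2 > C1 > B3 > A2 > C2 > B4 > A3"
# [2] "A1 > B1 > B2 > C1 > B3 > A2 > C2 > B4 > C3 > D1"
```

vapply is used instead of sapply so that the result is always a character vector with one entry per input row.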
gsubfn answer: answered Nov 13 at 12:02 by G. Grothendieck, edited Nov 13 at 14:09
data.table answer: answered Nov 13 at 11:46 by snoram
Base R answer: answered Nov 13 at 11:49 by Rui Barradas