Count each next occurrence of string in substring

Today I ran into a problem that I couldn't solve on my own despite searching for solutions. Either my approach is wrong or no one has asked a similar question before.

I'm playing around with Markov attribution, so I've got columns with strings that look like this:

A > B > B > C > B > A > C > B > A

etc.

Each string is built with the PostgreSQL function string_agg.

What I need is to suffix each element with the number of times it has appeared so far in the string. To make it clear, the result should look like this:

A1 > B1 > B2 > C1 > B3 > A2 > C2 > B4 > A3

There are three main challenges:

- there are around 100 different types of elements to count, and they may change over time, which makes hardcoding them impractical,
- the dataset is around 200k rows,
- strings may be up to a few hundred characters long.

The only thing that came to my mind is to write some sort of loop, but it feels like it would take ages to finish.

I also thought about solving it at the PostgreSQL level, but couldn't find an efficient and easy solution there either.

edited Nov 13 at 11:38

asked Nov 13 at 11:32

Marcin

356

Reproducible data would help us help you.
– snoram
Nov 13 at 11:33

Are the > part of the string?
– Rui Barradas
Nov 13 at 11:37

Unfortunately I cannot share my company's data, but it basically looks like the example provided. If there is a public dataset I can play around with, I would be glad to provide some examples. And yes, " > " is part of the string, but it can be changed to any other character, e.g. a space.
– Marcin
Nov 13 at 11:39

Tags: r, statistics

3 Answers

How to do this is described in the gsubfn vignette. Using the code there, we first define a proto object pwords with methods pre and fun. pre initializes the word list (which stores the current count for each word encountered). fun updates it each time a word is encountered and suffixes the word with its count, returning the suffixed word.

Having defined the foregoing, run gsubfn using pwords. For each component of the input, gsubfn first runs pre; then, for each match of the regular expression \w+, it passes the match to fun and replaces the match with the output of fun.

We have assumed that the words to be suffixed with a count are matched by \w+, which is the case for the example in the question; if your actual data is different you may need to change the pattern.

library(gsubfn)

s <- rep("A > B > B > C > B > A > C > B > A", 3) # sample input

pwords <- proto(
  pre = function(this) { this$words <- list() },   # reset the counts for each input string
  fun = function(this, x) {
    if (is.null(words[[x]])) this$words[[x]] <- 0  # first time this word is seen
    this$words[[x]] <- this$words[[x]] + 1
    paste0(x, words[[x]])                          # suffix the word with its running count
  }
)

gsubfn("\\w+", pwords, s)  # the backslash must be doubled inside an R string

giving:

[1] "A1 > B1 > B2 > C1 > B3 > A2 > C2 > B4 > A3"
[2] "A1 > B1 > B2 > C1 > B3 > A2 > C2 > B4 > A3"
[3] "A1 > B1 > B2 > C1 > B3 > A2 > C2 > B4 > A3"

edited Nov 13 at 14:09

answered Nov 13 at 12:02

G. Grothendieck

145k9126231

Thank you for your answer. Among the given solutions this one was the best at solving my problem; it gave me a tool that helped with my task, along with a great explanation. Now I'm only struggling to write a regexp that would ignore punctuation marks: my words sometimes contain dots and hyphens, and then the repetitions are counted incorrectly. I didn't take that into consideration when providing my poor example.
– Marcin
Nov 13 at 14:12

This will match anything other than space or >: [^ >]+ .
– G. Grothendieck
Nov 13 at 14:13

Worked like a charm, I owe you one, thank you!
– Marcin
Nov 13 at 14:19

Here is a rough example using data.table:

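library(data.table)

# Example data:
data <- data.table(
  s = c("A > B > B > C > B > A > C > B > A",
        "A > B > B > C > B > A > C > B > C > D")
)

# Processing steps (can probably be shortened)
n <- strsplit(data[["s"]], " > ")
# Note: melt() on a plain list is the reshape2 method (it creates the L1 column).
# Depending on the data.table version this call may warn or fail;
# in that case call reshape2::melt(n) directly.
datal <- melt(n)
setDT(datal)
datal[, original_order := 1:.N                            # remember the element order
      ][, temp := paste0(value, 1:.N), by = .(L1, value)  # running count per string and element
        ][order(original_order), paste(temp, collapse = " > "), by = L1]

# Output:
   L1                                              V1
1:  1      A1 > B1 > B2 > C1 > B3 > A2 > C2 > B4 > A3
2:  2 A1 > B1 > B2 > C1 > B3 > A2 > C2 > B4 > C3 > D1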
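
The same long-format steps (split, number within each string and element, paste back together) can also be sketched in base R. This is an illustrative rewrite under assumptions, not the answer's code: the names `tokens`, `long` and `numbered` are made up for the example, and the separator is assumed to be exactly " > ".

```r
s <- c("A > B > B > C > B > A > C > B > A",
       "A > B > B > C > B > A > C > B > C > D")

# One row per element, with the index of the string it came from.
tokens <- strsplit(s, " > ", fixed = TRUE)
long <- data.frame(
  id    = rep(seq_along(tokens), lengths(tokens)),
  value = unlist(tokens),
  stringsAsFactors = FALSE
)

# Running count of each element within its own string.
long$n <- ave(seq_len(nrow(long)), long$id, long$value, FUN = seq_along)

# Paste the numbered elements back together, one result per input string.
numbered <- unname(vapply(
  split(paste0(long$value, long$n), factor(long$id, levels = seq_along(tokens))),
  paste, character(1), collapse = " > "
))
numbered
```

Rows are never reordered here, so no helper column for the original order is needed.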

answered Nov 13 at 11:46

snoram

6,447830

Here is a function that uses base R only.

Note that if you are using a different set of regex metacharacters, it should be easy to add a function argument metachar, defaulting to the one in the function body.

count_seq <- function(x, sep = ">"){
  # Regex metacharacters that must be escaped before splitting
  metachar <- '. \\ | ( ) [ { ^ $ * + ?'
  # fixed = TRUE looks for sep literally, so a separator such as "(" is not parsed as a regex
  sep2 <- if(grepl(sep, metachar, fixed = TRUE)) paste0("\\", sep) else sep
  y <- unlist(strsplit(x, sep2))
  y <- trimws(y)
  z <- ave(y, y, FUN = seq_along)      # running count of each element
  paste(paste0(y, z), collapse = sep)
}

x <- "A > B > B > C > B > A > C > B > A"

count_seq(x)
#[1] "A1>B1>B2>C1>B3>A2>C2>B4>A3"
count_seq(x, sep = " > ")
#[1] "A1 > B1 > B2 > C1 > B3 > A2 > C2 > B4 > A3"

y <- "A | B | B | C | B | A | C | B | A"
count_seq(y, sep = "|")
#[1] "A1|B1|B2|C1|B3|A2|C2|B4|A3"

answered Nov 13 at 11:49

Rui Barradas

15.9k41730

Nice. Finally you would do something like: sapply(column, count_seq, " > "), right?
– snoram
Nov 13 at 11:55

@snoram Yes, that's the idea. I thought of looping (*apply) inside the function, but it might be better to leave it like this, more general, and let the user decide.
– Rui Barradas
Nov 13 at 11:59

In a similar vein using @snoram's 's': lapply(strsplit(s, " > "), function(x) paste0(x, ave(x, x, FUN = seq_along), collapse = " > "))
– Henrik
Nov 13 at 12:11
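
Applied to a whole column, the approach from these comments might look like the sketch below. The data frame `paths`, its columns and the helper `number_tokens` are hypothetical names chosen for illustration. `vapply` is used instead of `sapply` so that each result is checked to be a single string.

```r
# Hypothetical helper: suffix each element with its running count.
number_tokens <- function(tokens, sep = " > ") {
  paste0(tokens, ave(tokens, tokens, FUN = seq_along), collapse = sep)
}

# Hypothetical data frame standing in for the real 200k-row table.
paths <- data.frame(
  path = c("A > B > B > C > B > A > C > B > A",
           "A > B > B > C > B > A > C > B > C > D"),
  stringsAsFactors = FALSE
)

# Split every string once, then number the elements of each string.
paths$numbered <- vapply(
  strsplit(paths$path, " > ", fixed = TRUE),
  number_tokens,
  character(1)
)
paths$numbered
```

Splitting with `fixed = TRUE` treats the separator literally, so no regex escaping is needed for separators such as "|".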


