Count each next occurence of string in substring












2














today I have faced a problem that I couldn't solve on my own, despite searching for solutions - it appeared to me, that either my approach is wrong or noone before asked similar question.



I'm playing around with Markov attribution, so I've got columns with strings that look like that:



A > B > B > C > B > A > C > B > A


etc.



...it is created on base of postgresql function 'string_agg'.



What I think would be important for me is assigning a number of for which time each string appears in entire string. To make it clear, at the end of the day, it would look like this:



A1 > B1 > B2 > C1 > B3 > A2 > C2 > B4 > A3


There are three main challenges:




  • there are around 100 different types of elements to be counted that may change it time, so it makes it hard to hardcode it,

  • the dataset is around 200k rows,

  • strings may be up to few hundred characters long


The only thing that came up to my mind is to write some sort of loop, but it feels like it would take up ages until it finishes.



I also thought about solving it on postgresql level, but couldn't find efficient and easy solution to it neither.










share|improve this question
























  • Reproducible data would help us help you.
    – snoram
    Nov 13 at 11:33










  • Are the > part of the string?
    – Rui Barradas
    Nov 13 at 11:37










  • Unfortunately I cannot share company's data, but it basically looks like the one provided in example - if there is some public dataset that I can play around with I would be glad to provide you with some examples. And yes, " > " is part of string, however it can be changed into any character, i.e. spacebar.
    – Marcin
    Nov 13 at 11:39
















2














today I have faced a problem that I couldn't solve on my own, despite searching for solutions - it appeared to me, that either my approach is wrong or noone before asked similar question.



I'm playing around with Markov attribution, so I've got columns with strings that look like that:



A > B > B > C > B > A > C > B > A


etc.



...it is created on base of postgresql function 'string_agg'.



What I think would be important for me is assigning a number of for which time each string appears in entire string. To make it clear, at the end of the day, it would look like this:



A1 > B1 > B2 > C1 > B3 > A2 > C2 > B4 > A3


There are three main challenges:




  • there are around 100 different types of elements to be counted that may change it time, so it makes it hard to hardcode it,

  • the dataset is around 200k rows,

  • strings may be up to few hundred characters long


The only thing that came up to my mind is to write some sort of loop, but it feels like it would take up ages until it finishes.



I also thought about solving it on postgresql level, but couldn't find efficient and easy solution to it neither.










share|improve this question
























  • Reproducible data would help us help you.
    – snoram
    Nov 13 at 11:33










  • Are the > part of the string?
    – Rui Barradas
    Nov 13 at 11:37










  • Unfortunately I cannot share company's data, but it basically looks like the one provided in example - if there is some public dataset that I can play around with I would be glad to provide you with some examples. And yes, " > " is part of string, however it can be changed into any character, i.e. spacebar.
    – Marcin
    Nov 13 at 11:39














2












2








2







today I have faced a problem that I couldn't solve on my own, despite searching for solutions - it appeared to me, that either my approach is wrong or noone before asked similar question.



I'm playing around with Markov attribution, so I've got columns with strings that look like that:



A > B > B > C > B > A > C > B > A


etc.



...it is created on base of postgresql function 'string_agg'.



What I think would be important for me is assigning a number of for which time each string appears in entire string. To make it clear, at the end of the day, it would look like this:



A1 > B1 > B2 > C1 > B3 > A2 > C2 > B4 > A3


There are three main challenges:




  • there are around 100 different types of elements to be counted that may change it time, so it makes it hard to hardcode it,

  • the dataset is around 200k rows,

  • strings may be up to few hundred characters long


The only thing that came up to my mind is to write some sort of loop, but it feels like it would take up ages until it finishes.



I also thought about solving it on postgresql level, but couldn't find efficient and easy solution to it neither.










share|improve this question















today I have faced a problem that I couldn't solve on my own, despite searching for solutions - it appeared to me, that either my approach is wrong or noone before asked similar question.



I'm playing around with Markov attribution, so I've got columns with strings that look like that:



A > B > B > C > B > A > C > B > A


etc.



...it is created on base of postgresql function 'string_agg'.



What I think would be important for me is assigning a number of for which time each string appears in entire string. To make it clear, at the end of the day, it would look like this:



A1 > B1 > B2 > C1 > B3 > A2 > C2 > B4 > A3


There are three main challenges:




  • there are around 100 different types of elements to be counted that may change it time, so it makes it hard to hardcode it,

  • the dataset is around 200k rows,

  • strings may be up to few hundred characters long


The only thing that came up to my mind is to write some sort of loop, but it feels like it would take up ages until it finishes.



I also thought about solving it on postgresql level, but couldn't find efficient and easy solution to it neither.







r statistics






share|improve this question















share|improve this question













share|improve this question




share|improve this question








edited Nov 13 at 11:38

























asked Nov 13 at 11:32









Marcin

356




356












  • Reproducible data would help us help you.
    – snoram
    Nov 13 at 11:33










  • Are the > part of the string?
    – Rui Barradas
    Nov 13 at 11:37










  • Unfortunately I cannot share company's data, but it basically looks like the one provided in example - if there is some public dataset that I can play around with I would be glad to provide you with some examples. And yes, " > " is part of string, however it can be changed into any character, i.e. spacebar.
    – Marcin
    Nov 13 at 11:39


















  • Reproducible data would help us help you.
    – snoram
    Nov 13 at 11:33










  • Are the > part of the string?
    – Rui Barradas
    Nov 13 at 11:37










  • Unfortunately I cannot share company's data, but it basically looks like the one provided in example - if there is some public dataset that I can play around with I would be glad to provide you with some examples. And yes, " > " is part of string, however it can be changed into any character, i.e. spacebar.
    – Marcin
    Nov 13 at 11:39
















Reproducible data would help us help you.
– snoram
Nov 13 at 11:33




Reproducible data would help us help you.
– snoram
Nov 13 at 11:33












Are the > part of the string?
– Rui Barradas
Nov 13 at 11:37




Are the > part of the string?
– Rui Barradas
Nov 13 at 11:37












Unfortunately I cannot share company's data, but it basically looks like the one provided in example - if there is some public dataset that I can play around with I would be glad to provide you with some examples. And yes, " > " is part of string, however it can be changed into any character, i.e. spacebar.
– Marcin
Nov 13 at 11:39




Unfortunately I cannot share company's data, but it basically looks like the one provided in example - if there is some public dataset that I can play around with I would be glad to provide you with some examples. And yes, " > " is part of string, however it can be changed into any character, i.e. spacebar.
– Marcin
Nov 13 at 11:39












3 Answers
3






active

oldest

votes


















1














How to do this is described in the gsubfn vignette. Using the code there first we define a proto object pword with methods pre and fun. pre initializes the word list (which stores the current count for each word encountered) and fun updates it each time a new word is encountered and also suffixes the word with the count returning the suffixed word.



Having defined the foregoing, run gsubfn using pword. For each component of the input gsubfn will first run pre and then for each match of the regular expression \w+ gsubfn will input the match to fun, run fun and replace the match with the output of fun.



We have assumed that the words to be suffixed with a count are matched by w+ which is the case for the example in the question but if your actual data is different you may need to change the pattern.



library(gsubfn)
s <- rep("A > B > B > C > B > A > C > B > A", 3) # sample input

pwords <- proto(
pre = function(this) { this$words <- list() },
fun = function(this, x) {
if (is.null(words[[x]])) this$words[[x]] <- 0
this$words[[x]] <- this$words[[x]] + 1
paste0(x, words[[x]])
}
)

gsubfn("\w+", pwords, s)


giving:



[1] "A1 > B1 > B2 > C1 > B3 > A2 > C2 > B4 > A3"
[2] "A1 > B1 > B2 > C1 > B3 > A2 > C2 > B4 > A3"
[3] "A1 > B1 > B2 > C1 > B3 > A2 > C2 > B4 > A3"





share|improve this answer























  • Thank you for your answer, among given solutions this one was best at solving my problem and provided me a tool that helped me in dealing with my task, in pair with great explanation. Now I'm only fighting with writing a regexp that would ignore punctuation marks, as sometimes my words contain dots and hyphens and then the repetition is counted in wrong manner, I didn't take that under consideration when providing you with my poor example.
    – Marcin
    Nov 13 at 14:12






  • 1




    This will match anything other than space or >: [^ >]+ .
    – G. Grothendieck
    Nov 13 at 14:13










  • Worked like a charm, I owe you one, thank you!
    – Marcin
    Nov 13 at 14:19



















2














Here is a rough example using data.table:



library(data.table)

# Example data:
data <- data.table(
s = c("A > B > B > C > B > A > C > B > A",
"A > B > B > C > B > A > C > B > C > D")
)

# Processing steps (can probably be shortened)
n <- strsplit(data[["s"]], " > ")
datal <- melt(n)
setDT(datal)
datal[, original_order := 1:.N
][, temp := paste0(value, 1:.N), by = .(L1, value)
][order(original_order), paste(temp, collapse = " > "), by = L1]


# Output:
L1 V1
1: 1 A1 > B1 > B2 > C1 > B3 > A2 > C2 > B4 > A3
2: 2 A1 > B1 > B2 > C1 > B3 > A2 > C2 > B4 > C3 > D1





share|improve this answer





























    2














    Here is a function that uses base R only.

    Note that if you are using a diferent set of regex metacharacters, it should be easy to have a function argument metachar, defaulting to the one in the function body.



    count_seq <- function(x, sep = ">"){
    metachar <- '. | ( ) [ { ^ $ * + ?'
    sep2 <- if(grepl(sep, metachar)) paste0("\", sep) else sep
    y <- unlist(strsplit(x, sep2))
    y <- trimws(y)
    z <- ave(y, y, FUN = seq_along)
    paste(paste0(y, z), collapse = sep)
    }

    x <- "A > B > B > C > B > A > C > B > A"

    count_seq(x)
    #[1] "A1>B1>B2>C1>B3>A2>C2>B4>A3"
    count_seq(x, sep = " > ")
    #[1] "A1 > B1 > B2 > C1 > B3 > A2 > C2 > B4 > A3"

    y <- "A | B | B | C | B | A | C | B | A"
    count_seq(y, sep = "|")
    #[1] "A1|B1|B2|C1|B3|A2|C2|B4|A3"





    share|improve this answer





















    • Nice. Finally you would do something like: sapply(column, count_seq, " > "), right?
      – snoram
      Nov 13 at 11:55










    • @snoram Yes, that's the idea. I thought of looping (*apply) in the function but I it might be better to leave it like this, more general, and let the user decide.
      – Rui Barradas
      Nov 13 at 11:59






    • 2




      In a similar vein using @snoram's 's': lapply(strsplit(s, " > "), function(x) paste0(x, ave(x, x, FUN = seq_along), collapse = " > "))
      – Henrik
      Nov 13 at 12:11











    Your Answer






    StackExchange.ifUsing("editor", function () {
    StackExchange.using("externalEditor", function () {
    StackExchange.using("snippets", function () {
    StackExchange.snippets.init();
    });
    });
    }, "code-snippets");

    StackExchange.ready(function() {
    var channelOptions = {
    tags: "".split(" "),
    id: "1"
    };
    initTagRenderer("".split(" "), "".split(" "), channelOptions);

    StackExchange.using("externalEditor", function() {
    // Have to fire editor after snippets, if snippets enabled
    if (StackExchange.settings.snippets.snippetsEnabled) {
    StackExchange.using("snippets", function() {
    createEditor();
    });
    }
    else {
    createEditor();
    }
    });

    function createEditor() {
    StackExchange.prepareEditor({
    heartbeatType: 'answer',
    autoActivateHeartbeat: false,
    convertImagesToLinks: true,
    noModals: true,
    showLowRepImageUploadWarning: true,
    reputationToPostImages: 10,
    bindNavPrevention: true,
    postfix: "",
    imageUploader: {
    brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
    contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
    allowUrls: true
    },
    onDemand: true,
    discardSelector: ".discard-answer"
    ,immediatelyShowMarkdownHelp:true
    });


    }
    });














    draft saved

    draft discarded


















    StackExchange.ready(
    function () {
    StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53280116%2fcount-each-next-occurence-of-string-in-substring%23new-answer', 'question_page');
    }
    );

    Post as a guest















    Required, but never shown

























    3 Answers
    3






    active

    oldest

    votes








    3 Answers
    3






    active

    oldest

    votes









    active

    oldest

    votes






    active

    oldest

    votes









    1














    How to do this is described in the gsubfn vignette. Using the code there first we define a proto object pword with methods pre and fun. pre initializes the word list (which stores the current count for each word encountered) and fun updates it each time a new word is encountered and also suffixes the word with the count returning the suffixed word.



    Having defined the foregoing, run gsubfn using pword. For each component of the input gsubfn will first run pre and then for each match of the regular expression \w+ gsubfn will input the match to fun, run fun and replace the match with the output of fun.



    We have assumed that the words to be suffixed with a count are matched by w+ which is the case for the example in the question but if your actual data is different you may need to change the pattern.



    library(gsubfn)
    s <- rep("A > B > B > C > B > A > C > B > A", 3) # sample input

    pwords <- proto(
    pre = function(this) { this$words <- list() },
    fun = function(this, x) {
    if (is.null(words[[x]])) this$words[[x]] <- 0
    this$words[[x]] <- this$words[[x]] + 1
    paste0(x, words[[x]])
    }
    )

    gsubfn("\w+", pwords, s)


    giving:



    [1] "A1 > B1 > B2 > C1 > B3 > A2 > C2 > B4 > A3"
    [2] "A1 > B1 > B2 > C1 > B3 > A2 > C2 > B4 > A3"
    [3] "A1 > B1 > B2 > C1 > B3 > A2 > C2 > B4 > A3"





    share|improve this answer























    • Thank you for your answer, among given solutions this one was best at solving my problem and provided me a tool that helped me in dealing with my task, in pair with great explanation. Now I'm only fighting with writing a regexp that would ignore punctuation marks, as sometimes my words contain dots and hyphens and then the repetition is counted in wrong manner, I didn't take that under consideration when providing you with my poor example.
      – Marcin
      Nov 13 at 14:12






    • 1




      This will match anything other than space or >: [^ >]+ .
      – G. Grothendieck
      Nov 13 at 14:13










    • Worked like a charm, I owe you one, thank you!
      – Marcin
      Nov 13 at 14:19
















    1














    How to do this is described in the gsubfn vignette. Using the code there first we define a proto object pword with methods pre and fun. pre initializes the word list (which stores the current count for each word encountered) and fun updates it each time a new word is encountered and also suffixes the word with the count returning the suffixed word.



    Having defined the foregoing, run gsubfn using pword. For each component of the input gsubfn will first run pre and then for each match of the regular expression \w+ gsubfn will input the match to fun, run fun and replace the match with the output of fun.



    We have assumed that the words to be suffixed with a count are matched by w+ which is the case for the example in the question but if your actual data is different you may need to change the pattern.



    library(gsubfn)
    s <- rep("A > B > B > C > B > A > C > B > A", 3) # sample input

    pwords <- proto(
    pre = function(this) { this$words <- list() },
    fun = function(this, x) {
    if (is.null(words[[x]])) this$words[[x]] <- 0
    this$words[[x]] <- this$words[[x]] + 1
    paste0(x, words[[x]])
    }
    )

    gsubfn("\w+", pwords, s)


    giving:



    [1] "A1 > B1 > B2 > C1 > B3 > A2 > C2 > B4 > A3"
    [2] "A1 > B1 > B2 > C1 > B3 > A2 > C2 > B4 > A3"
    [3] "A1 > B1 > B2 > C1 > B3 > A2 > C2 > B4 > A3"





    share|improve this answer























    • Thank you for your answer, among given solutions this one was best at solving my problem and provided me a tool that helped me in dealing with my task, in pair with great explanation. Now I'm only fighting with writing a regexp that would ignore punctuation marks, as sometimes my words contain dots and hyphens and then the repetition is counted in wrong manner, I didn't take that under consideration when providing you with my poor example.
      – Marcin
      Nov 13 at 14:12






    • 1




      This will match anything other than space or >: [^ >]+ .
      – G. Grothendieck
      Nov 13 at 14:13










    • Worked like a charm, I owe you one, thank you!
      – Marcin
      Nov 13 at 14:19














    1












    1








    1






    How to do this is described in the gsubfn vignette. Using the code there first we define a proto object pword with methods pre and fun. pre initializes the word list (which stores the current count for each word encountered) and fun updates it each time a new word is encountered and also suffixes the word with the count returning the suffixed word.



    Having defined the foregoing, run gsubfn using pword. For each component of the input gsubfn will first run pre and then for each match of the regular expression \w+ gsubfn will input the match to fun, run fun and replace the match with the output of fun.



    We have assumed that the words to be suffixed with a count are matched by w+ which is the case for the example in the question but if your actual data is different you may need to change the pattern.



    library(gsubfn)
    s <- rep("A > B > B > C > B > A > C > B > A", 3) # sample input

    pwords <- proto(
    pre = function(this) { this$words <- list() },
    fun = function(this, x) {
    if (is.null(words[[x]])) this$words[[x]] <- 0
    this$words[[x]] <- this$words[[x]] + 1
    paste0(x, words[[x]])
    }
    )

    gsubfn("\w+", pwords, s)


    giving:



    [1] "A1 > B1 > B2 > C1 > B3 > A2 > C2 > B4 > A3"
    [2] "A1 > B1 > B2 > C1 > B3 > A2 > C2 > B4 > A3"
    [3] "A1 > B1 > B2 > C1 > B3 > A2 > C2 > B4 > A3"





    share|improve this answer














    How to do this is described in the gsubfn vignette. Using the code there first we define a proto object pword with methods pre and fun. pre initializes the word list (which stores the current count for each word encountered) and fun updates it each time a new word is encountered and also suffixes the word with the count returning the suffixed word.



    Having defined the foregoing, run gsubfn using pword. For each component of the input gsubfn will first run pre and then for each match of the regular expression \w+ gsubfn will input the match to fun, run fun and replace the match with the output of fun.



    We have assumed that the words to be suffixed with a count are matched by w+ which is the case for the example in the question but if your actual data is different you may need to change the pattern.



    library(gsubfn)
    s <- rep("A > B > B > C > B > A > C > B > A", 3) # sample input

    pwords <- proto(
    pre = function(this) { this$words <- list() },
    fun = function(this, x) {
    if (is.null(words[[x]])) this$words[[x]] <- 0
    this$words[[x]] <- this$words[[x]] + 1
    paste0(x, words[[x]])
    }
    )

    gsubfn("\w+", pwords, s)


    giving:



    [1] "A1 > B1 > B2 > C1 > B3 > A2 > C2 > B4 > A3"
    [2] "A1 > B1 > B2 > C1 > B3 > A2 > C2 > B4 > A3"
    [3] "A1 > B1 > B2 > C1 > B3 > A2 > C2 > B4 > A3"






    share|improve this answer














    share|improve this answer



    share|improve this answer








    edited Nov 13 at 14:09

























    answered Nov 13 at 12:02









    G. Grothendieck

    145k9126231




    145k9126231












    • Thank you for your answer, among given solutions this one was best at solving my problem and provided me a tool that helped me in dealing with my task, in pair with great explanation. Now I'm only fighting with writing a regexp that would ignore punctuation marks, as sometimes my words contain dots and hyphens and then the repetition is counted in wrong manner, I didn't take that under consideration when providing you with my poor example.
      – Marcin
      Nov 13 at 14:12






    • 1




      This will match anything other than space or >: [^ >]+ .
      – G. Grothendieck
      Nov 13 at 14:13










    • Worked like a charm, I owe you one, thank you!
      – Marcin
      Nov 13 at 14:19


















    • Thank you for your answer, among given solutions this one was best at solving my problem and provided me a tool that helped me in dealing with my task, in pair with great explanation. Now I'm only fighting with writing a regexp that would ignore punctuation marks, as sometimes my words contain dots and hyphens and then the repetition is counted in wrong manner, I didn't take that under consideration when providing you with my poor example.
      – Marcin
      Nov 13 at 14:12






    • 1




      This will match anything other than space or >: [^ >]+ .
      – G. Grothendieck
      Nov 13 at 14:13










    • Worked like a charm, I owe you one, thank you!
      – Marcin
      Nov 13 at 14:19
















    Thank you for your answer, among given solutions this one was best at solving my problem and provided me a tool that helped me in dealing with my task, in pair with great explanation. Now I'm only fighting with writing a regexp that would ignore punctuation marks, as sometimes my words contain dots and hyphens and then the repetition is counted in wrong manner, I didn't take that under consideration when providing you with my poor example.
    – Marcin
    Nov 13 at 14:12




    Thank you for your answer, among given solutions this one was best at solving my problem and provided me a tool that helped me in dealing with my task, in pair with great explanation. Now I'm only fighting with writing a regexp that would ignore punctuation marks, as sometimes my words contain dots and hyphens and then the repetition is counted in wrong manner, I didn't take that under consideration when providing you with my poor example.
    – Marcin
    Nov 13 at 14:12




    1




    1




    This will match anything other than space or >: [^ >]+ .
    – G. Grothendieck
    Nov 13 at 14:13




    This will match anything other than space or >: [^ >]+ .
    – G. Grothendieck
    Nov 13 at 14:13












    Worked like a charm, I owe you one, thank you!
    – Marcin
    Nov 13 at 14:19




    Worked like a charm, I owe you one, thank you!
    – Marcin
    Nov 13 at 14:19













    2














    Here is a rough example using data.table:



    library(data.table)

    # Example data:
    data <- data.table(
    s = c("A > B > B > C > B > A > C > B > A",
    "A > B > B > C > B > A > C > B > C > D")
    )

    # Processing steps (can probably be shortened)
    n <- strsplit(data[["s"]], " > ")
    datal <- melt(n)
    setDT(datal)
    datal[, original_order := 1:.N
    ][, temp := paste0(value, 1:.N), by = .(L1, value)
    ][order(original_order), paste(temp, collapse = " > "), by = L1]


    # Output:
    L1 V1
    1: 1 A1 > B1 > B2 > C1 > B3 > A2 > C2 > B4 > A3
    2: 2 A1 > B1 > B2 > C1 > B3 > A2 > C2 > B4 > C3 > D1





    share|improve this answer


























      2














      Here is a rough example using data.table:



      library(data.table)

      # Example data:
      data <- data.table(
      s = c("A > B > B > C > B > A > C > B > A",
      "A > B > B > C > B > A > C > B > C > D")
      )

      # Processing steps (can probably be shortened)
      n <- strsplit(data[["s"]], " > ")
      datal <- melt(n)
      setDT(datal)
      datal[, original_order := 1:.N
      ][, temp := paste0(value, 1:.N), by = .(L1, value)
      ][order(original_order), paste(temp, collapse = " > "), by = L1]


      # Output:
      L1 V1
      1: 1 A1 > B1 > B2 > C1 > B3 > A2 > C2 > B4 > A3
      2: 2 A1 > B1 > B2 > C1 > B3 > A2 > C2 > B4 > C3 > D1





      share|improve this answer
























        2












        2








        2






        Here is a rough example using data.table:



        library(data.table)

        # Example data:
        data <- data.table(
        s = c("A > B > B > C > B > A > C > B > A",
        "A > B > B > C > B > A > C > B > C > D")
        )

        # Processing steps (can probably be shortened)
        n <- strsplit(data[["s"]], " > ")
        datal <- melt(n)
        setDT(datal)
        datal[, original_order := 1:.N
        ][, temp := paste0(value, 1:.N), by = .(L1, value)
        ][order(original_order), paste(temp, collapse = " > "), by = L1]


        # Output:
        L1 V1
        1: 1 A1 > B1 > B2 > C1 > B3 > A2 > C2 > B4 > A3
        2: 2 A1 > B1 > B2 > C1 > B3 > A2 > C2 > B4 > C3 > D1





        share|improve this answer












        Here is a rough example using data.table:



        library(data.table)

        # Example data:
        data <- data.table(
        s = c("A > B > B > C > B > A > C > B > A",
        "A > B > B > C > B > A > C > B > C > D")
        )

        # Processing steps (can probably be shortened)
        n <- strsplit(data[["s"]], " > ")
        datal <- melt(n)
        setDT(datal)
        datal[, original_order := 1:.N
        ][, temp := paste0(value, 1:.N), by = .(L1, value)
        ][order(original_order), paste(temp, collapse = " > "), by = L1]


        # Output:
        L1 V1
        1: 1 A1 > B1 > B2 > C1 > B3 > A2 > C2 > B4 > A3
        2: 2 A1 > B1 > B2 > C1 > B3 > A2 > C2 > B4 > C3 > D1






        share|improve this answer












        share|improve this answer



        share|improve this answer










        answered Nov 13 at 11:46









        snoram

        6,447830




        6,447830























            2














            Here is a function that uses base R only.

            Note that if you are using a diferent set of regex metacharacters, it should be easy to have a function argument metachar, defaulting to the one in the function body.



            count_seq <- function(x, sep = ">"){
            metachar <- '. | ( ) [ { ^ $ * + ?'
            sep2 <- if(grepl(sep, metachar)) paste0("\", sep) else sep
            y <- unlist(strsplit(x, sep2))
            y <- trimws(y)
            z <- ave(y, y, FUN = seq_along)
            paste(paste0(y, z), collapse = sep)
            }

            x <- "A > B > B > C > B > A > C > B > A"

            count_seq(x)
            #[1] "A1>B1>B2>C1>B3>A2>C2>B4>A3"
            count_seq(x, sep = " > ")
            #[1] "A1 > B1 > B2 > C1 > B3 > A2 > C2 > B4 > A3"

            y <- "A | B | B | C | B | A | C | B | A"
            count_seq(y, sep = "|")
            #[1] "A1|B1|B2|C1|B3|A2|C2|B4|A3"





            share|improve this answer





















            • Nice. Finally you would do something like: sapply(column, count_seq, " > "), right?
              – snoram
              Nov 13 at 11:55










            • @snoram Yes, that's the idea. I thought of looping (*apply) in the function but I it might be better to leave it like this, more general, and let the user decide.
              – Rui Barradas
              Nov 13 at 11:59






            • 2




              In a similar vein using @snoram's 's': lapply(strsplit(s, " > "), function(x) paste0(x, ave(x, x, FUN = seq_along), collapse = " > "))
              – Henrik
              Nov 13 at 12:11
















            2














            Here is a function that uses base R only.

            Note that if you are using a diferent set of regex metacharacters, it should be easy to have a function argument metachar, defaulting to the one in the function body.



            count_seq <- function(x, sep = ">"){
            metachar <- '. | ( ) [ { ^ $ * + ?'
            sep2 <- if(grepl(sep, metachar)) paste0("\", sep) else sep
            y <- unlist(strsplit(x, sep2))
            y <- trimws(y)
            z <- ave(y, y, FUN = seq_along)
            paste(paste0(y, z), collapse = sep)
            }

            x <- "A > B > B > C > B > A > C > B > A"

            count_seq(x)
            #[1] "A1>B1>B2>C1>B3>A2>C2>B4>A3"
            count_seq(x, sep = " > ")
            #[1] "A1 > B1 > B2 > C1 > B3 > A2 > C2 > B4 > A3"

            y <- "A | B | B | C | B | A | C | B | A"
            count_seq(y, sep = "|")
            #[1] "A1|B1|B2|C1|B3|A2|C2|B4|A3"





            share|improve this answer





















            • Nice. Finally you would do something like: sapply(column, count_seq, " > "), right?
              – snoram
              Nov 13 at 11:55










            • @snoram Yes, that's the idea. I thought of looping (*apply) in the function but I it might be better to leave it like this, more general, and let the user decide.
              – Rui Barradas
              Nov 13 at 11:59






            • 2




              In a similar vein using @snoram's 's': lapply(strsplit(s, " > "), function(x) paste0(x, ave(x, x, FUN = seq_along), collapse = " > "))
              – Henrik
              Nov 13 at 12:11














            2












            2








            2






            Here is a function that uses base R only.

            Note that if you are using a diferent set of regex metacharacters, it should be easy to have a function argument metachar, defaulting to the one in the function body.



            count_seq <- function(x, sep = ">"){
            metachar <- '. | ( ) [ { ^ $ * + ?'
            sep2 <- if(grepl(sep, metachar)) paste0("\", sep) else sep
            y <- unlist(strsplit(x, sep2))
            y <- trimws(y)
            z <- ave(y, y, FUN = seq_along)
            paste(paste0(y, z), collapse = sep)
            }

            x <- "A > B > B > C > B > A > C > B > A"

            count_seq(x)
            #[1] "A1>B1>B2>C1>B3>A2>C2>B4>A3"
            count_seq(x, sep = " > ")
            #[1] "A1 > B1 > B2 > C1 > B3 > A2 > C2 > B4 > A3"

            y <- "A | B | B | C | B | A | C | B | A"
            count_seq(y, sep = "|")
            #[1] "A1|B1|B2|C1|B3|A2|C2|B4|A3"





            share|improve this answer












            Here is a function that uses base R only.

            Note that if you are using a diferent set of regex metacharacters, it should be easy to have a function argument metachar, defaulting to the one in the function body.



            count_seq <- function(x, sep = ">"){
            metachar <- '. | ( ) [ { ^ $ * + ?'
            sep2 <- if(grepl(sep, metachar)) paste0("\", sep) else sep
            y <- unlist(strsplit(x, sep2))
            y <- trimws(y)
            z <- ave(y, y, FUN = seq_along)
            paste(paste0(y, z), collapse = sep)
            }

            x <- "A > B > B > C > B > A > C > B > A"

            count_seq(x)
            #[1] "A1>B1>B2>C1>B3>A2>C2>B4>A3"
            count_seq(x, sep = " > ")
            #[1] "A1 > B1 > B2 > C1 > B3 > A2 > C2 > B4 > A3"

            y <- "A | B | B | C | B | A | C | B | A"
            count_seq(y, sep = "|")
            #[1] "A1|B1|B2|C1|B3|A2|C2|B4|A3"






            share|improve this answer












            share|improve this answer



            share|improve this answer










            answered Nov 13 at 11:49









            Rui Barradas

            15.9k41730




            15.9k41730












            • Nice. Finally you would do something like: sapply(column, count_seq, " > "), right?
              – snoram
              Nov 13 at 11:55










            • @snoram Yes, that's the idea. I thought of looping (*apply) in the function but I it might be better to leave it like this, more general, and let the user decide.
              – Rui Barradas
              Nov 13 at 11:59






            • 2




              In a similar vein using @snoram's 's': lapply(strsplit(s, " > "), function(x) paste0(x, ave(x, x, FUN = seq_along), collapse = " > "))
              – Henrik
              Nov 13 at 12:11


















            • Nice. Finally you would do something like: sapply(column, count_seq, " > "), right?
              – snoram
              Nov 13 at 11:55










            • @snoram Yes, that's the idea. I thought of looping (*apply) in the function but I it might be better to leave it like this, more general, and let the user decide.
              – Rui Barradas
              Nov 13 at 11:59






            • 2




              In a similar vein using @snoram's 's': lapply(strsplit(s, " > "), function(x) paste0(x, ave(x, x, FUN = seq_along), collapse = " > "))
              – Henrik
              Nov 13 at 12:11
















            Nice. Finally you would do something like: sapply(column, count_seq, " > "), right?
            – snoram
            Nov 13 at 11:55




            Nice. Finally you would do something like: sapply(column, count_seq, " > "), right?
            – snoram
            Nov 13 at 11:55












            @snoram Yes, that's the idea. I thought of looping (*apply) in the function but I it might be better to leave it like this, more general, and let the user decide.
            – Rui Barradas
            Nov 13 at 11:59




            @snoram Yes, that's the idea. I thought of looping (*apply) in the function but I it might be better to leave it like this, more general, and let the user decide.
            – Rui Barradas
            Nov 13 at 11:59




            2




            2




            In a similar vein using @snoram's 's': lapply(strsplit(s, " > "), function(x) paste0(x, ave(x, x, FUN = seq_along), collapse = " > "))
            – Henrik
            Nov 13 at 12:11




            In a similar vein using @snoram's 's': lapply(strsplit(s, " > "), function(x) paste0(x, ave(x, x, FUN = seq_along), collapse = " > "))
            – Henrik
            Nov 13 at 12:11


















            draft saved

            draft discarded




















































            Thanks for contributing an answer to Stack Overflow!


            • Please be sure to answer the question. Provide details and share your research!

            But avoid



            • Asking for help, clarification, or responding to other answers.

            • Making statements based on opinion; back them up with references or personal experience.


            To learn more, see our tips on writing great answers.





            Some of your past answers have not been well-received, and you're in danger of being blocked from answering.


            Please pay close attention to the following guidance:


            • Please be sure to answer the question. Provide details and share your research!

            But avoid



            • Asking for help, clarification, or responding to other answers.

            • Making statements based on opinion; back them up with references or personal experience.


            To learn more, see our tips on writing great answers.




            draft saved


            draft discarded














            StackExchange.ready(
            function () {
            StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53280116%2fcount-each-next-occurence-of-string-in-substring%23new-answer', 'question_page');
            }
            );

            Post as a guest















            Required, but never shown





















































            Required, but never shown














            Required, but never shown












            Required, but never shown







            Required, but never shown

































            Required, but never shown














            Required, but never shown












            Required, but never shown







            Required, but never shown







            Popular posts from this blog

            How to pass form data using jquery Ajax to insert data in database?

            National Museum of Racing and Hall of Fame

            Guess what letter conforming each word