Quick algorithm to create union of multiple intervals data












0















I have a really simple problem and data structure but the number is so large that I need to find an efficient way.



Suppose I have an object that has an attribute which is an interval.
For example:



        `start      stop`
obj1 5 10
obj2 8 12
obj3 11 14
obj4 13 20
obj5 22 25
obj6 24 30
obj7 33 37
obj8 36 40


I want to merge it so that overlapping interval will become one object. So, the result of the example will become



         start        stop
objA 5 20
objB 22 30
objC 33 40


I am using python for this. Please notice that I have thousand of this type of data.










share|improve this question


















  • 1





    What have you tried so far? Have you written any code?

    – Yakov Dan
    Nov 19 '18 at 9:35











  • Please specify whether your ranges are always sorted (as in your example) by starts and stops.

    – Aristide
    Nov 19 '18 at 9:49






  • 2





    Duplicate of python union of multiple ranges. Please look at the answers of that question, they are exactly what you need, you just have to pass it to pandas.

    – silgon
    Nov 19 '18 at 9:56


















0















I have a really simple problem and data structure but the number is so large that I need to find an efficient way.



Suppose I have an object that has an attribute which is an interval.
For example:



        `start      stop`
obj1 5 10
obj2 8 12
obj3 11 14
obj4 13 20
obj5 22 25
obj6 24 30
obj7 33 37
obj8 36 40


I want to merge it so that overlapping interval will become one object. So, the result of the example will become



         start        stop
objA 5 20
objB 22 30
objC 33 40


I am using python for this. Please notice that I have thousand of this type of data.










share|improve this question


















  • 1





    What have you tried so far? Have you written any code?

    – Yakov Dan
    Nov 19 '18 at 9:35











  • Please specify whether your ranges are always sorted (as in your example) by starts and stops.

    – Aristide
    Nov 19 '18 at 9:49






  • 2





    Duplicate of python union of multiple ranges. Please look at the answers of that question, they are exactly what you need, you just have to pass it to pandas.

    – silgon
    Nov 19 '18 at 9:56
















0












0








0








I have a really simple problem and data structure but the number is so large that I need to find an efficient way.



Suppose I have an object that has an attribute which is an interval.
For example:



        `start      stop`
obj1 5 10
obj2 8 12
obj3 11 14
obj4 13 20
obj5 22 25
obj6 24 30
obj7 33 37
obj8 36 40


I want to merge it so that overlapping interval will become one object. So, the result of the example will become



         start        stop
objA 5 20
objB 22 30
objC 33 40


I am using python for this. Please notice that I have thousand of this type of data.










share|improve this question














I have a really simple problem and data structure but the number is so large that I need to find an efficient way.



Suppose I have an object that has an attribute which is an interval.
For example:



        `start      stop`
obj1 5 10
obj2 8 12
obj3 11 14
obj4 13 20
obj5 22 25
obj6 24 30
obj7 33 37
obj8 36 40


I want to merge it so that overlapping interval will become one object. So, the result of the example will become



         start        stop
objA 5 20
objB 22 30
objC 33 40


I am using python for this. Please notice that I have thousand of this type of data.







python intervals






share|improve this question













share|improve this question











share|improve this question




share|improve this question










asked Nov 19 '18 at 9:33









BharataBharata

1781316




1781316








  • 1





    What have you tried so far? Have you written any code?

    – Yakov Dan
    Nov 19 '18 at 9:35











  • Please specify whether your ranges are always sorted (as in your example) by starts and stops.

    – Aristide
    Nov 19 '18 at 9:49






  • 2





    Duplicate of python union of multiple ranges. Please look at the answers of that question, they are exactly what you need, you just have to pass it to pandas.

    – silgon
    Nov 19 '18 at 9:56
















  • 1





    What have you tried so far? Have you written any code?

    – Yakov Dan
    Nov 19 '18 at 9:35











  • Please specify whether your ranges are always sorted (as in your example) by starts and stops.

    – Aristide
    Nov 19 '18 at 9:49






  • 2





    Duplicate of python union of multiple ranges. Please look at the answers of that question, they are exactly what you need, you just have to pass it to pandas.

    – silgon
    Nov 19 '18 at 9:56










1




1





What have you tried so far? Have you written any code?

– Yakov Dan
Nov 19 '18 at 9:35





What have you tried so far? Have you written any code?

– Yakov Dan
Nov 19 '18 at 9:35













Please specify whether your ranges are always sorted (as in your example) by starts and stops.

– Aristide
Nov 19 '18 at 9:49





Please specify whether your ranges are always sorted (as in your example) by starts and stops.

– Aristide
Nov 19 '18 at 9:49




2




2





Duplicate of python union of multiple ranges. Please look at the answers of that question, they are exactly what you need, you just have to pass it to pandas.

– silgon
Nov 19 '18 at 9:56







Duplicate of python union of multiple ranges. Please look at the answers of that question, they are exactly what you need, you just have to pass it to pandas.

– silgon
Nov 19 '18 at 9:56














3 Answers
3






active

oldest

votes


















1














df['Startpoint'] = df['stop`'].shift() < df['`start'] # Begin of interval
df['Endpoint'] = df['Startpoint'].shift(-1) # End of interval
df.loc['obj1', 'Startpoint'] = True # First line is startpoint
df['Endpoint'].fillna(True, inplace=True) # Last line is endpoint

df2 = df[df[['Startpoint', 'Endpoint']].any(axis=1)]
df2['`start'] = df2['`start'].shift()
df2.loc[df2['Endpoint'], ['`start', 'stop`']]

# `start stop`
# obj4 5.0 20
# obj6 22.0 30
# obj8 33.0 40


Find all begins and ends of interval, keep only those rows, and then shift the start value one row so that the values per interval are in the same row.



This is all pandas, so I believe it should be reasonably quick.






share|improve this answer































    0














    When the intervals are sorted by starts, this simple function should work in linear time:



    def merge_intervals(intervals):
    result =
    (start_candidate, stop_candidate) = intervals[0]
    for (start, stop) in intervals[1:]:
    if start <= stop_candidate:
    stop_candidate = max(stop, stop_candidate)
    else:
    result.append((start_candidate, stop_candidate))
    (start_candidate, stop_candidate) = (start, stop)
    result.append((start_candidate, stop_candidate))
    return result

    intervals = [
    ( 5, 10),
    ( 8, 12),
    (11, 14),
    (13, 20),
    (22, 25),
    (24, 30),
    (33, 37),
    (36, 40),
    ]

    assert merge_intervals(intervals) == [(5, 20), (22, 30), (33, 40)]





    share|improve this answer































      0














      The quickest way to deal with this kind of data is to use Union find data structure or disjoint data structure which tracks a set of elements partitioned into a number of disjoint subsets.
      I am leaving the implementation and design of the data structure on you as there are efficient ways of implementing Disjoint data structures which operates in linear time.






      share|improve this answer























        Your Answer






        StackExchange.ifUsing("editor", function () {
        StackExchange.using("externalEditor", function () {
        StackExchange.using("snippets", function () {
        StackExchange.snippets.init();
        });
        });
        }, "code-snippets");

        StackExchange.ready(function() {
        var channelOptions = {
        tags: "".split(" "),
        id: "1"
        };
        initTagRenderer("".split(" "), "".split(" "), channelOptions);

        StackExchange.using("externalEditor", function() {
        // Have to fire editor after snippets, if snippets enabled
        if (StackExchange.settings.snippets.snippetsEnabled) {
        StackExchange.using("snippets", function() {
        createEditor();
        });
        }
        else {
        createEditor();
        }
        });

        function createEditor() {
        StackExchange.prepareEditor({
        heartbeatType: 'answer',
        autoActivateHeartbeat: false,
        convertImagesToLinks: true,
        noModals: true,
        showLowRepImageUploadWarning: true,
        reputationToPostImages: 10,
        bindNavPrevention: true,
        postfix: "",
        imageUploader: {
        brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
        contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
        allowUrls: true
        },
        onDemand: true,
        discardSelector: ".discard-answer"
        ,immediatelyShowMarkdownHelp:true
        });


        }
        });














        draft saved

        draft discarded


















        StackExchange.ready(
        function () {
        StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53371736%2fquick-algorithm-to-create-union-of-multiple-intervals-data%23new-answer', 'question_page');
        }
        );

        Post as a guest















        Required, but never shown

























        3 Answers
        3






        active

        oldest

        votes








        3 Answers
        3






        active

        oldest

        votes









        active

        oldest

        votes






        active

        oldest

        votes









        1














        df['Startpoint'] = df['stop`'].shift() < df['`start'] # Begin of interval
        df['Endpoint'] = df['Startpoint'].shift(-1) # End of interval
        df.loc['obj1', 'Startpoint'] = True # First line is startpoint
        df['Endpoint'].fillna(True, inplace=True) # Last line is endpoint

        df2 = df[df[['Startpoint', 'Endpoint']].any(axis=1)]
        df2['`start'] = df2['`start'].shift()
        df2.loc[df2['Endpoint'], ['`start', 'stop`']]

        # `start stop`
        # obj4 5.0 20
        # obj6 22.0 30
        # obj8 33.0 40


        Find all begins and ends of interval, keep only those rows, and then shift the start value one row so that the values per interval are in the same row.



        This is all pandas, so I believe it should be reasonably quick.






        share|improve this answer




























          1














          df['Startpoint'] = df['stop`'].shift() < df['`start'] # Begin of interval
          df['Endpoint'] = df['Startpoint'].shift(-1) # End of interval
          df.loc['obj1', 'Startpoint'] = True # First line is startpoint
          df['Endpoint'].fillna(True, inplace=True) # Last line is endpoint

          df2 = df[df[['Startpoint', 'Endpoint']].any(axis=1)]
          df2['`start'] = df2['`start'].shift()
          df2.loc[df2['Endpoint'], ['`start', 'stop`']]

          # `start stop`
          # obj4 5.0 20
          # obj6 22.0 30
          # obj8 33.0 40


          Find all begins and ends of interval, keep only those rows, and then shift the start value one row so that the values per interval are in the same row.



          This is all pandas, so I believe it should be reasonably quick.






          share|improve this answer


























            1












            1








            1







            df['Startpoint'] = df['stop`'].shift() < df['`start'] # Begin of interval
            df['Endpoint'] = df['Startpoint'].shift(-1) # End of interval
            df.loc['obj1', 'Startpoint'] = True # First line is startpoint
            df['Endpoint'].fillna(True, inplace=True) # Last line is endpoint

            df2 = df[df[['Startpoint', 'Endpoint']].any(axis=1)]
            df2['`start'] = df2['`start'].shift()
            df2.loc[df2['Endpoint'], ['`start', 'stop`']]

            # `start stop`
            # obj4 5.0 20
            # obj6 22.0 30
            # obj8 33.0 40


            Find all begins and ends of interval, keep only those rows, and then shift the start value one row so that the values per interval are in the same row.



            This is all pandas, so I believe it should be reasonably quick.






            share|improve this answer













            df['Startpoint'] = df['stop`'].shift() < df['`start'] # Begin of interval
            df['Endpoint'] = df['Startpoint'].shift(-1) # End of interval
            df.loc['obj1', 'Startpoint'] = True # First line is startpoint
            df['Endpoint'].fillna(True, inplace=True) # Last line is endpoint

            df2 = df[df[['Startpoint', 'Endpoint']].any(axis=1)]
            df2['`start'] = df2['`start'].shift()
            df2.loc[df2['Endpoint'], ['`start', 'stop`']]

            # `start stop`
            # obj4 5.0 20
            # obj6 22.0 30
            # obj8 33.0 40


            Find all begins and ends of interval, keep only those rows, and then shift the start value one row so that the values per interval are in the same row.



            This is all pandas, so I believe it should be reasonably quick.







            share|improve this answer












            share|improve this answer



            share|improve this answer










            answered Nov 19 '18 at 10:10









            JondiedoopJondiedoop

            1,729214




            1,729214

























                0














                When the intervals are sorted by starts, this simple function should work in linear time:



                def merge_intervals(intervals):
                result =
                (start_candidate, stop_candidate) = intervals[0]
                for (start, stop) in intervals[1:]:
                if start <= stop_candidate:
                stop_candidate = max(stop, stop_candidate)
                else:
                result.append((start_candidate, stop_candidate))
                (start_candidate, stop_candidate) = (start, stop)
                result.append((start_candidate, stop_candidate))
                return result

                intervals = [
                ( 5, 10),
                ( 8, 12),
                (11, 14),
                (13, 20),
                (22, 25),
                (24, 30),
                (33, 37),
                (36, 40),
                ]

                assert merge_intervals(intervals) == [(5, 20), (22, 30), (33, 40)]





                share|improve this answer




























                  0














                  When the intervals are sorted by starts, this simple function should work in linear time:



                  def merge_intervals(intervals):
                  result =
                  (start_candidate, stop_candidate) = intervals[0]
                  for (start, stop) in intervals[1:]:
                  if start <= stop_candidate:
                  stop_candidate = max(stop, stop_candidate)
                  else:
                  result.append((start_candidate, stop_candidate))
                  (start_candidate, stop_candidate) = (start, stop)
                  result.append((start_candidate, stop_candidate))
                  return result

                  intervals = [
                  ( 5, 10),
                  ( 8, 12),
                  (11, 14),
                  (13, 20),
                  (22, 25),
                  (24, 30),
                  (33, 37),
                  (36, 40),
                  ]

                  assert merge_intervals(intervals) == [(5, 20), (22, 30), (33, 40)]





                  share|improve this answer


























                    0












                    0








                    0







                    When the intervals are sorted by starts, this simple function should work in linear time:



                    def merge_intervals(intervals):
                    result =
                    (start_candidate, stop_candidate) = intervals[0]
                    for (start, stop) in intervals[1:]:
                    if start <= stop_candidate:
                    stop_candidate = max(stop, stop_candidate)
                    else:
                    result.append((start_candidate, stop_candidate))
                    (start_candidate, stop_candidate) = (start, stop)
                    result.append((start_candidate, stop_candidate))
                    return result

                    intervals = [
                    ( 5, 10),
                    ( 8, 12),
                    (11, 14),
                    (13, 20),
                    (22, 25),
                    (24, 30),
                    (33, 37),
                    (36, 40),
                    ]

                    assert merge_intervals(intervals) == [(5, 20), (22, 30), (33, 40)]





                    share|improve this answer













                    When the intervals are sorted by starts, this simple function should work in linear time:



                    def merge_intervals(intervals):
                    result =
                    (start_candidate, stop_candidate) = intervals[0]
                    for (start, stop) in intervals[1:]:
                    if start <= stop_candidate:
                    stop_candidate = max(stop, stop_candidate)
                    else:
                    result.append((start_candidate, stop_candidate))
                    (start_candidate, stop_candidate) = (start, stop)
                    result.append((start_candidate, stop_candidate))
                    return result

                    intervals = [
                    ( 5, 10),
                    ( 8, 12),
                    (11, 14),
                    (13, 20),
                    (22, 25),
                    (24, 30),
                    (33, 37),
                    (36, 40),
                    ]

                    assert merge_intervals(intervals) == [(5, 20), (22, 30), (33, 40)]






                    share|improve this answer












                    share|improve this answer



                    share|improve this answer










                    answered Nov 19 '18 at 10:04









                    AristideAristide

                    1,43811430




                    1,43811430























                        0














                        The quickest way to deal with this kind of data is to use Union find data structure or disjoint data structure which tracks a set of elements partitioned into a number of disjoint subsets.
                        I am leaving the implementation and design of the data structure on you as there are efficient ways of implementing Disjoint data structures which operates in linear time.






                        share|improve this answer




























                          0














                          The quickest way to deal with this kind of data is to use Union find data structure or disjoint data structure which tracks a set of elements partitioned into a number of disjoint subsets.
                          I am leaving the implementation and design of the data structure on you as there are efficient ways of implementing Disjoint data structures which operates in linear time.






                          share|improve this answer


























                            0












                            0








                            0







                            The quickest way to deal with this kind of data is to use Union find data structure or disjoint data structure which tracks a set of elements partitioned into a number of disjoint subsets.
                            I am leaving the implementation and design of the data structure on you as there are efficient ways of implementing Disjoint data structures which operates in linear time.






                            share|improve this answer













                            The quickest way to deal with this kind of data is to use Union find data structure or disjoint data structure which tracks a set of elements partitioned into a number of disjoint subsets.
                            I am leaving the implementation and design of the data structure on you as there are efficient ways of implementing Disjoint data structures which operates in linear time.







                            share|improve this answer












                            share|improve this answer



                            share|improve this answer










                            answered Nov 19 '18 at 10:36









                            mohor chattmohor chatt

                            925




                            925






























                                draft saved

                                draft discarded




















































                                Thanks for contributing an answer to Stack Overflow!


                                • Please be sure to answer the question. Provide details and share your research!

                                But avoid



                                • Asking for help, clarification, or responding to other answers.

                                • Making statements based on opinion; back them up with references or personal experience.


                                To learn more, see our tips on writing great answers.




                                draft saved


                                draft discarded














                                StackExchange.ready(
                                function () {
                                StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53371736%2fquick-algorithm-to-create-union-of-multiple-intervals-data%23new-answer', 'question_page');
                                }
                                );

                                Post as a guest















                                Required, but never shown





















































                                Required, but never shown














                                Required, but never shown












                                Required, but never shown







                                Required, but never shown

































                                Required, but never shown














                                Required, but never shown












                                Required, but never shown







                                Required, but never shown







                                Popular posts from this blog

                                Guess what letter conforming each word

                                Run scheduled task as local user group (not BUILTIN)

                                Port of Spain