Quick algorithm to create union of multiple intervals data

I have a really simple problem and data structure but the number is so large that I need to find an efficient way.

Suppose I have an object that has an attribute which is an interval.
For example:

        `start      stop`

obj1      5          10

obj2      8          12

obj3      11         14

obj4      13         20

obj5      22         25

obj6      24         30

obj7      33         37

obj8      36         40

I want to merge it so that overlapping interval will become one object. So, the result of the example will become

         start        stop

objA        5          20

objB       22          30

objC       33          40

I am using python for this. Please notice that I have thousand of this type of data.

asked Nov 19 '18 at 9:33

Bharata

1781316

1

What have you tried so far? Have you written any code?

– Yakov Dan
Nov 19 '18 at 9:35

Please specify whether your ranges are always sorted (as in your example) by starts and stops.

– Aristide
Nov 19 '18 at 9:49

2

Duplicate of python union of multiple ranges. Please look at the answers of that question, they are exactly what you need, you just have to pass it to pandas.

– silgon
Nov 19 '18 at 9:56

add a comment |

I have a really simple problem and data structure but the number is so large that I need to find an efficient way.

Suppose I have an object that has an attribute which is an interval.
For example:

        `start      stop`

obj1      5          10

obj2      8          12

obj3      11         14

obj4      13         20

obj5      22         25

obj6      24         30

obj7      33         37

obj8      36         40

I want to merge it so that overlapping interval will become one object. So, the result of the example will become

         start        stop

objA        5          20

objB       22          30

objC       33          40

I am using python for this. Please notice that I have thousand of this type of data.

asked Nov 19 '18 at 9:33

Bharata

1781316

1

What have you tried so far? Have you written any code?

– Yakov Dan
Nov 19 '18 at 9:35

Please specify whether your ranges are always sorted (as in your example) by starts and stops.

– Aristide
Nov 19 '18 at 9:49

2

Duplicate of python union of multiple ranges. Please look at the answers of that question, they are exactly what you need, you just have to pass it to pandas.

– silgon
Nov 19 '18 at 9:56

add a comment |

I have a really simple problem and data structure but the number is so large that I need to find an efficient way.

Suppose I have an object that has an attribute which is an interval.
For example:

        `start      stop`

obj1      5          10

obj2      8          12

obj3      11         14

obj4      13         20

obj5      22         25

obj6      24         30

obj7      33         37

obj8      36         40

I want to merge it so that overlapping interval will become one object. So, the result of the example will become

         start        stop

objA        5          20

objB       22          30

objC       33          40

I am using python for this. Please notice that I have thousand of this type of data.

asked Nov 19 '18 at 9:33

Bharata

1781316

I have a really simple problem and data structure but the number is so large that I need to find an efficient way.

Suppose I have an object that has an attribute which is an interval.
For example:

        `start      stop`

obj1      5          10

obj2      8          12

obj3      11         14

obj4      13         20

obj5      22         25

obj6      24         30

obj7      33         37

obj8      36         40

I want to merge it so that overlapping interval will become one object. So, the result of the example will become

         start        stop

objA        5          20

objB       22          30

objC       33          40

I am using python for this. Please notice that I have thousand of this type of data.

python intervals

asked Nov 19 '18 at 9:33

Bharata

1781316

asked Nov 19 '18 at 9:33

Bharata

1781316

asked Nov 19 '18 at 9:33

Bharata

1781316

asked Nov 19 '18 at 9:33

Bharata

1781316

asked Nov 19 '18 at 9:33

Bharata

1781316

1

What have you tried so far? Have you written any code?

– Yakov Dan
Nov 19 '18 at 9:35

Please specify whether your ranges are always sorted (as in your example) by starts and stops.

– Aristide
Nov 19 '18 at 9:49

2

Duplicate of python union of multiple ranges. Please look at the answers of that question, they are exactly what you need, you just have to pass it to pandas.

– silgon
Nov 19 '18 at 9:56

add a comment |

1

What have you tried so far? Have you written any code?

– Yakov Dan
Nov 19 '18 at 9:35

Please specify whether your ranges are always sorted (as in your example) by starts and stops.

– Aristide
Nov 19 '18 at 9:49

2

Duplicate of python union of multiple ranges. Please look at the answers of that question, they are exactly what you need, you just have to pass it to pandas.

– silgon
Nov 19 '18 at 9:56

What have you tried so far? Have you written any code?

– Yakov Dan
Nov 19 '18 at 9:35

Please specify whether your ranges are always sorted (as in your example) by starts and stops.

– Aristide
Nov 19 '18 at 9:49

Duplicate of python union of multiple ranges. Please look at the answers of that question, they are exactly what you need, you just have to pass it to pandas.

– silgon
Nov 19 '18 at 9:56

add a comment |

3 Answers
3

active

oldest

votes

df['Startpoint'] = df['stop`'].shift() < df['`start'] # Begin of interval

df['Endpoint'] = df['Startpoint'].shift(-1) # End of interval

df.loc['obj1', 'Startpoint'] = True # First line is startpoint

df['Endpoint'].fillna(True, inplace=True) # Last line is endpoint



df2 = df[df[['Startpoint', 'Endpoint']].any(axis=1)]

df2['`start'] = df2['`start'].shift() 

df2.loc[df2['Endpoint'], ['`start', 'stop`']]



  #            `start  stop`

  #  obj4     5.0     20

  #  obj6    22.0     30

  #  obj8    33.0     40

Find all begins and ends of interval, keep only those rows, and then shift the start value one row so that the values per interval are in the same row.

This is all pandas, so I believe it should be reasonably quick.

answered Nov 19 '18 at 10:10

Jondiedoop

1,729214

add a comment |

When the intervals are sorted by starts, this simple function should work in linear time:

def merge_intervals(intervals):

    result = 

    (start_candidate, stop_candidate) = intervals[0]

    for (start, stop) in intervals[1:]:

        if start <= stop_candidate:

            stop_candidate = max(stop, stop_candidate)

        else:

            result.append((start_candidate, stop_candidate))

            (start_candidate, stop_candidate) = (start, stop)

    result.append((start_candidate, stop_candidate))

    return result



intervals = [

    ( 5, 10),

    ( 8, 12),

    (11, 14),

    (13, 20),

    (22, 25),

    (24, 30),

    (33, 37),

    (36, 40),

]



assert merge_intervals(intervals) == [(5, 20), (22, 30), (33, 40)]

answered Nov 19 '18 at 10:04

Aristide

1,43811430

add a comment |

The quickest way to deal with this kind of data is to use Union find data structure or disjoint data structure which tracks a set of elements partitioned into a number of disjoint subsets.
I am leaving the implementation and design of the data structure on you as there are efficient ways of implementing Disjoint data structures which operates in linear time.

answered Nov 19 '18 at 10:36

mohor chatt

925

add a comment |

Your Answer

StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});

}
});

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53371736%2fquick-algorithm-to-create-union-of-multiple-intervals-data%23new-answer', 'question_page');
}
);

Post as a guest

Name

Required, but never shown

3 Answers
3

active

oldest

votes

3 Answers
3

active

oldest

votes

df['Startpoint'] = df['stop`'].shift() < df['`start'] # Begin of interval

df['Endpoint'] = df['Startpoint'].shift(-1) # End of interval

df.loc['obj1', 'Startpoint'] = True # First line is startpoint

df['Endpoint'].fillna(True, inplace=True) # Last line is endpoint



df2 = df[df[['Startpoint', 'Endpoint']].any(axis=1)]

df2['`start'] = df2['`start'].shift() 

df2.loc[df2['Endpoint'], ['`start', 'stop`']]



  #            `start  stop`

  #  obj4     5.0     20

  #  obj6    22.0     30

  #  obj8    33.0     40

Find all begins and ends of interval, keep only those rows, and then shift the start value one row so that the values per interval are in the same row.

This is all pandas, so I believe it should be reasonably quick.

answered Nov 19 '18 at 10:10

Jondiedoop

1,729214

add a comment |

df['Startpoint'] = df['stop`'].shift() < df['`start'] # Begin of interval

df['Endpoint'] = df['Startpoint'].shift(-1) # End of interval

df.loc['obj1', 'Startpoint'] = True # First line is startpoint

df['Endpoint'].fillna(True, inplace=True) # Last line is endpoint



df2 = df[df[['Startpoint', 'Endpoint']].any(axis=1)]

df2['`start'] = df2['`start'].shift() 

df2.loc[df2['Endpoint'], ['`start', 'stop`']]



  #            `start  stop`

  #  obj4     5.0     20

  #  obj6    22.0     30

  #  obj8    33.0     40

Find all begins and ends of interval, keep only those rows, and then shift the start value one row so that the values per interval are in the same row.

This is all pandas, so I believe it should be reasonably quick.

answered Nov 19 '18 at 10:10

Jondiedoop

1,729214

add a comment |

df['Startpoint'] = df['stop`'].shift() < df['`start'] # Begin of interval

df['Endpoint'] = df['Startpoint'].shift(-1) # End of interval

df.loc['obj1', 'Startpoint'] = True # First line is startpoint

df['Endpoint'].fillna(True, inplace=True) # Last line is endpoint



df2 = df[df[['Startpoint', 'Endpoint']].any(axis=1)]

df2['`start'] = df2['`start'].shift() 

df2.loc[df2['Endpoint'], ['`start', 'stop`']]



  #            `start  stop`

  #  obj4     5.0     20

  #  obj6    22.0     30

  #  obj8    33.0     40

Find all begins and ends of interval, keep only those rows, and then shift the start value one row so that the values per interval are in the same row.

This is all pandas, so I believe it should be reasonably quick.

answered Nov 19 '18 at 10:10

Jondiedoop

1,729214

df['Startpoint'] = df['stop`'].shift() < df['`start'] # Begin of interval

df['Endpoint'] = df['Startpoint'].shift(-1) # End of interval

df.loc['obj1', 'Startpoint'] = True # First line is startpoint

df['Endpoint'].fillna(True, inplace=True) # Last line is endpoint



df2 = df[df[['Startpoint', 'Endpoint']].any(axis=1)]

df2['`start'] = df2['`start'].shift() 

df2.loc[df2['Endpoint'], ['`start', 'stop`']]



  #            `start  stop`

  #  obj4     5.0     20

  #  obj6    22.0     30

  #  obj8    33.0     40

Find all begins and ends of interval, keep only those rows, and then shift the start value one row so that the values per interval are in the same row.

This is all pandas, so I believe it should be reasonably quick.

answered Nov 19 '18 at 10:10

Jondiedoop

1,729214

answered Nov 19 '18 at 10:10

Jondiedoop

1,729214

answered Nov 19 '18 at 10:10

Jondiedoop

1,729214

answered Nov 19 '18 at 10:10

Jondiedoop

1,729214

add a comment |

When the intervals are sorted by starts, this simple function should work in linear time:

def merge_intervals(intervals):

    result = 

    (start_candidate, stop_candidate) = intervals[0]

    for (start, stop) in intervals[1:]:

        if start <= stop_candidate:

            stop_candidate = max(stop, stop_candidate)

        else:

            result.append((start_candidate, stop_candidate))

            (start_candidate, stop_candidate) = (start, stop)

    result.append((start_candidate, stop_candidate))

    return result



intervals = [

    ( 5, 10),

    ( 8, 12),

    (11, 14),

    (13, 20),

    (22, 25),

    (24, 30),

    (33, 37),

    (36, 40),

]



assert merge_intervals(intervals) == [(5, 20), (22, 30), (33, 40)]

answered Nov 19 '18 at 10:04

Aristide

1,43811430

add a comment |

When the intervals are sorted by starts, this simple function should work in linear time:

def merge_intervals(intervals):

    result = 

    (start_candidate, stop_candidate) = intervals[0]

    for (start, stop) in intervals[1:]:

        if start <= stop_candidate:

            stop_candidate = max(stop, stop_candidate)

        else:

            result.append((start_candidate, stop_candidate))

            (start_candidate, stop_candidate) = (start, stop)

    result.append((start_candidate, stop_candidate))

    return result



intervals = [

    ( 5, 10),

    ( 8, 12),

    (11, 14),

    (13, 20),

    (22, 25),

    (24, 30),

    (33, 37),

    (36, 40),

]



assert merge_intervals(intervals) == [(5, 20), (22, 30), (33, 40)]

answered Nov 19 '18 at 10:04

Aristide

1,43811430

add a comment |

When the intervals are sorted by starts, this simple function should work in linear time:

def merge_intervals(intervals):

    result = 

    (start_candidate, stop_candidate) = intervals[0]

    for (start, stop) in intervals[1:]:

        if start <= stop_candidate:

            stop_candidate = max(stop, stop_candidate)

        else:

            result.append((start_candidate, stop_candidate))

            (start_candidate, stop_candidate) = (start, stop)

    result.append((start_candidate, stop_candidate))

    return result



intervals = [

    ( 5, 10),

    ( 8, 12),

    (11, 14),

    (13, 20),

    (22, 25),

    (24, 30),

    (33, 37),

    (36, 40),

]



assert merge_intervals(intervals) == [(5, 20), (22, 30), (33, 40)]

answered Nov 19 '18 at 10:04

Aristide

1,43811430

When the intervals are sorted by starts, this simple function should work in linear time:

def merge_intervals(intervals):

    result = 

    (start_candidate, stop_candidate) = intervals[0]

    for (start, stop) in intervals[1:]:

        if start <= stop_candidate:

            stop_candidate = max(stop, stop_candidate)

        else:

            result.append((start_candidate, stop_candidate))

            (start_candidate, stop_candidate) = (start, stop)

    result.append((start_candidate, stop_candidate))

    return result



intervals = [

    ( 5, 10),

    ( 8, 12),

    (11, 14),

    (13, 20),

    (22, 25),

    (24, 30),

    (33, 37),

    (36, 40),

]



assert merge_intervals(intervals) == [(5, 20), (22, 30), (33, 40)]

answered Nov 19 '18 at 10:04

Aristide

1,43811430

answered Nov 19 '18 at 10:04

Aristide

1,43811430

answered Nov 19 '18 at 10:04

Aristide

1,43811430

answered Nov 19 '18 at 10:04

Aristide

1,43811430

add a comment |

answered Nov 19 '18 at 10:36

mohor chatt

925

add a comment |

answered Nov 19 '18 at 10:36

mohor chatt

925

add a comment |

answered Nov 19 '18 at 10:36

mohor chatt

925

answered Nov 19 '18 at 10:36

mohor chatt

925

answered Nov 19 '18 at 10:36

mohor chatt

925

answered Nov 19 '18 at 10:36

mohor chatt

925

answered Nov 19 '18 at 10:36

mohor chatt

925

add a comment |

draft saved

draft discarded

Thanks for contributing an answer to Stack Overflow!

Please be sure to answer the question. Provide details and share your research!

But avoid …

Asking for help, clarification, or responding to other answers.

Making statements based on opinion; back them up with references or personal experience.

To learn more, see our tips on writing great answers.

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Name

Required, but never shown

Name

Required, but never shown

This page is only for reference, If you need detailed information, please check here

搜尋此網誌

Agfdhyk