timestamp difference between rows for each user

timestamp difference between rows for each user - Pyspark Dataframe

-2

I have a CSV file with following structure

USER_ID       location          timestamp          

 1             1001         19:11:39  5-2-2010

 1             6022         17:51:19  6-6-2010

 1             1041         11:11:39  5-2-2010

 2             9483         10:51:23  3-2-2012

 2             4532         11:11:11  4-5-2012

 3             4374         03:21:23  6-9-2013

 3             4334         04:53:13  4-5-2013

Basically what I would like to do using pyspark or only python is calculates the timestamp difference for different location with the same user_id number. An example from expected result would be:

USER_ID       location          timestamp difference         

 1            1001-1041         08:00:00

any idea how to reach the solution

asked Nov 19 '18 at 17:11

imed eddines

add a comment |

-2

I have a CSV file with following structure

USER_ID       location          timestamp          

 1             1001         19:11:39  5-2-2010

 1             6022         17:51:19  6-6-2010

 1             1041         11:11:39  5-2-2010

 2             9483         10:51:23  3-2-2012

 2             4532         11:11:11  4-5-2012

 3             4374         03:21:23  6-9-2013

 3             4334         04:53:13  4-5-2013

Basically what I would like to do using pyspark or only python is calculates the timestamp difference for different location with the same user_id number. An example from expected result would be:

USER_ID       location          timestamp difference         

 1            1001-1041         08:00:00

any idea how to reach the solution

asked Nov 19 '18 at 17:11

imed eddines

add a comment |

-2

I have a CSV file with following structure

USER_ID       location          timestamp          

 1             1001         19:11:39  5-2-2010

 1             6022         17:51:19  6-6-2010

 1             1041         11:11:39  5-2-2010

 2             9483         10:51:23  3-2-2012

 2             4532         11:11:11  4-5-2012

 3             4374         03:21:23  6-9-2013

 3             4334         04:53:13  4-5-2013

Basically what I would like to do using pyspark or only python is calculates the timestamp difference for different location with the same user_id number. An example from expected result would be:

USER_ID       location          timestamp difference         

 1            1001-1041         08:00:00

any idea how to reach the solution

asked Nov 19 '18 at 17:11

imed eddines

I have a CSV file with following structure

USER_ID       location          timestamp          

 1             1001         19:11:39  5-2-2010

 1             6022         17:51:19  6-6-2010

 1             1041         11:11:39  5-2-2010

 2             9483         10:51:23  3-2-2012

 2             4532         11:11:11  4-5-2012

 3             4374         03:21:23  6-9-2013

 3             4334         04:53:13  4-5-2013

Basically what I would like to do using pyspark or only python is calculates the timestamp difference for different location with the same user_id number. An example from expected result would be:

USER_ID       location          timestamp difference         

 1            1001-1041         08:00:00

any idea how to reach the solution

python apache-spark pyspark pyspark-sql

asked Nov 19 '18 at 17:11

imed eddines

asked Nov 19 '18 at 17:11

imed eddines

asked Nov 19 '18 at 17:11

imed eddines

asked Nov 19 '18 at 17:11

imed eddines

asked Nov 19 '18 at 17:11

imed eddines

add a comment |

1 Answer
1

active

oldest

votes

Assuming you want every possible combination of locations for a user, you just need to do a join on USER_ID, then subtract the date columns. The one trick here is to use unix_timestamp to parse your datetime data to an integer that supports the subtraction operation.

Example Code:

from pyspark.sql.functions import unix_timestamp, col, datediff



data = [

    (1, 1001, '19:11:39 5-2-2010'),

    (1, 6022, '17:51:19 6-6-2010'),

    (1, 1041, '11:11:39 5-2-2010'),

    (2, 9483, '10:51:23 3-2-2012'),

    (2, 4532, '11:11:11 4-5-2012'),

    (3, 4374, '03:21:23 6-9-2013'),

    (3, 4334, '04:53:13 4-5-2013')

]



df = spark.createDataFrame(data, ['USER_ID', 'location', 'timestamp'])

df = df.withColumn('timestamp', unix_timestamp('timestamp', 'HH:mm:ss dd-MM-yyyy'))



# Renaming columns to avoid conflicts after join

df2 = df.selectExpr('USER_ID as USER_ID2', 'location as location2', 'timestamp as timestamp2')

cartesian = df.join(df2, col("USER_ID") == col("USER_ID2"), "inner")



# Filter to get rid of reversed duplicates, and rows where location is same on both sides

pairs = cartesian.filter("location < location2") 

                 .drop("USER_ID2") 

                 .withColumn("diff", col("timestamp2") - col("timestamp"))

pairs.show()

answered Nov 19 '18 at 17:45

Ryan Widmaier

2,99411218

add a comment |

Your Answer

StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});

}
});

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53379599%2ftimestamp-difference-between-rows-for-each-user-pyspark-dataframe%23new-answer', 'question_page');
}
);

Post as a guest

Name

Required, but never shown

1 Answer
1

active

oldest

votes

1 Answer
1

active

oldest

votes

Example Code:

from pyspark.sql.functions import unix_timestamp, col, datediff



data = [

    (1, 1001, '19:11:39 5-2-2010'),

    (1, 6022, '17:51:19 6-6-2010'),

    (1, 1041, '11:11:39 5-2-2010'),

    (2, 9483, '10:51:23 3-2-2012'),

    (2, 4532, '11:11:11 4-5-2012'),

    (3, 4374, '03:21:23 6-9-2013'),

    (3, 4334, '04:53:13 4-5-2013')

]



df = spark.createDataFrame(data, ['USER_ID', 'location', 'timestamp'])

df = df.withColumn('timestamp', unix_timestamp('timestamp', 'HH:mm:ss dd-MM-yyyy'))



# Renaming columns to avoid conflicts after join

df2 = df.selectExpr('USER_ID as USER_ID2', 'location as location2', 'timestamp as timestamp2')

cartesian = df.join(df2, col("USER_ID") == col("USER_ID2"), "inner")



# Filter to get rid of reversed duplicates, and rows where location is same on both sides

pairs = cartesian.filter("location < location2") 

                 .drop("USER_ID2") 

                 .withColumn("diff", col("timestamp2") - col("timestamp"))

pairs.show()

answered Nov 19 '18 at 17:45

Ryan Widmaier

2,99411218

add a comment |

Example Code:

from pyspark.sql.functions import unix_timestamp, col, datediff



data = [

    (1, 1001, '19:11:39 5-2-2010'),

    (1, 6022, '17:51:19 6-6-2010'),

    (1, 1041, '11:11:39 5-2-2010'),

    (2, 9483, '10:51:23 3-2-2012'),

    (2, 4532, '11:11:11 4-5-2012'),

    (3, 4374, '03:21:23 6-9-2013'),

    (3, 4334, '04:53:13 4-5-2013')

]



df = spark.createDataFrame(data, ['USER_ID', 'location', 'timestamp'])

df = df.withColumn('timestamp', unix_timestamp('timestamp', 'HH:mm:ss dd-MM-yyyy'))



# Renaming columns to avoid conflicts after join

df2 = df.selectExpr('USER_ID as USER_ID2', 'location as location2', 'timestamp as timestamp2')

cartesian = df.join(df2, col("USER_ID") == col("USER_ID2"), "inner")



# Filter to get rid of reversed duplicates, and rows where location is same on both sides

pairs = cartesian.filter("location < location2") 

                 .drop("USER_ID2") 

                 .withColumn("diff", col("timestamp2") - col("timestamp"))

pairs.show()

answered Nov 19 '18 at 17:45

Ryan Widmaier

2,99411218

add a comment |

Example Code:

from pyspark.sql.functions import unix_timestamp, col, datediff



data = [

    (1, 1001, '19:11:39 5-2-2010'),

    (1, 6022, '17:51:19 6-6-2010'),

    (1, 1041, '11:11:39 5-2-2010'),

    (2, 9483, '10:51:23 3-2-2012'),

    (2, 4532, '11:11:11 4-5-2012'),

    (3, 4374, '03:21:23 6-9-2013'),

    (3, 4334, '04:53:13 4-5-2013')

]



df = spark.createDataFrame(data, ['USER_ID', 'location', 'timestamp'])

df = df.withColumn('timestamp', unix_timestamp('timestamp', 'HH:mm:ss dd-MM-yyyy'))



# Renaming columns to avoid conflicts after join

df2 = df.selectExpr('USER_ID as USER_ID2', 'location as location2', 'timestamp as timestamp2')

cartesian = df.join(df2, col("USER_ID") == col("USER_ID2"), "inner")



# Filter to get rid of reversed duplicates, and rows where location is same on both sides

pairs = cartesian.filter("location < location2") 

                 .drop("USER_ID2") 

                 .withColumn("diff", col("timestamp2") - col("timestamp"))

pairs.show()

answered Nov 19 '18 at 17:45

Ryan Widmaier

2,99411218

Example Code:

from pyspark.sql.functions import unix_timestamp, col, datediff



data = [

    (1, 1001, '19:11:39 5-2-2010'),

    (1, 6022, '17:51:19 6-6-2010'),

    (1, 1041, '11:11:39 5-2-2010'),

    (2, 9483, '10:51:23 3-2-2012'),

    (2, 4532, '11:11:11 4-5-2012'),

    (3, 4374, '03:21:23 6-9-2013'),

    (3, 4334, '04:53:13 4-5-2013')

]



df = spark.createDataFrame(data, ['USER_ID', 'location', 'timestamp'])

df = df.withColumn('timestamp', unix_timestamp('timestamp', 'HH:mm:ss dd-MM-yyyy'))



# Renaming columns to avoid conflicts after join

df2 = df.selectExpr('USER_ID as USER_ID2', 'location as location2', 'timestamp as timestamp2')

cartesian = df.join(df2, col("USER_ID") == col("USER_ID2"), "inner")



# Filter to get rid of reversed duplicates, and rows where location is same on both sides

pairs = cartesian.filter("location < location2") 

                 .drop("USER_ID2") 

                 .withColumn("diff", col("timestamp2") - col("timestamp"))

pairs.show()

answered Nov 19 '18 at 17:45

Ryan Widmaier

2,99411218

answered Nov 19 '18 at 17:45

Ryan Widmaier

2,99411218

answered Nov 19 '18 at 17:45

Ryan Widmaier

2,99411218

answered Nov 19 '18 at 17:45

Ryan Widmaier

2,99411218

add a comment |

draft saved

draft discarded

Thanks for contributing an answer to Stack Overflow!

Please be sure to answer the question. Provide details and share your research!

But avoid …

Asking for help, clarification, or responding to other answers.

Making statements based on opinion; back them up with references or personal experience.

To learn more, see our tips on writing great answers.

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Name

Required, but never shown

Name

Required, but never shown

This page is only for reference, If you need detailed information, please check here

搜尋此網誌

Agfdhyk