timestamp difference between rows for each user - Pyspark Dataframe
I have a CSV file with following structure
USER_ID location timestamp
1 1001 19:11:39 5-2-2010
1 6022 17:51:19 6-6-2010
1 1041 11:11:39 5-2-2010
2 9483 10:51:23 3-2-2012
2 4532 11:11:11 4-5-2012
3 4374 03:21:23 6-9-2013
3 4334 04:53:13 4-5-2013
Basically what I would like to do using pyspark or only python is calculates the timestamp difference for different location with the same user_id number. An example from expected result would be:
USER_ID location timestamp difference
1 1001-1041 08:00:00
any idea how to reach the solution
python apache-spark pyspark pyspark-sql
add a comment |
I have a CSV file with following structure
USER_ID location timestamp
1 1001 19:11:39 5-2-2010
1 6022 17:51:19 6-6-2010
1 1041 11:11:39 5-2-2010
2 9483 10:51:23 3-2-2012
2 4532 11:11:11 4-5-2012
3 4374 03:21:23 6-9-2013
3 4334 04:53:13 4-5-2013
Basically what I would like to do using pyspark or only python is calculates the timestamp difference for different location with the same user_id number. An example from expected result would be:
USER_ID location timestamp difference
1 1001-1041 08:00:00
any idea how to reach the solution
python apache-spark pyspark pyspark-sql
add a comment |
I have a CSV file with following structure
USER_ID location timestamp
1 1001 19:11:39 5-2-2010
1 6022 17:51:19 6-6-2010
1 1041 11:11:39 5-2-2010
2 9483 10:51:23 3-2-2012
2 4532 11:11:11 4-5-2012
3 4374 03:21:23 6-9-2013
3 4334 04:53:13 4-5-2013
Basically what I would like to do using pyspark or only python is calculates the timestamp difference for different location with the same user_id number. An example from expected result would be:
USER_ID location timestamp difference
1 1001-1041 08:00:00
any idea how to reach the solution
python apache-spark pyspark pyspark-sql
I have a CSV file with following structure
USER_ID location timestamp
1 1001 19:11:39 5-2-2010
1 6022 17:51:19 6-6-2010
1 1041 11:11:39 5-2-2010
2 9483 10:51:23 3-2-2012
2 4532 11:11:11 4-5-2012
3 4374 03:21:23 6-9-2013
3 4334 04:53:13 4-5-2013
Basically what I would like to do using pyspark or only python is calculates the timestamp difference for different location with the same user_id number. An example from expected result would be:
USER_ID location timestamp difference
1 1001-1041 08:00:00
any idea how to reach the solution
python apache-spark pyspark pyspark-sql
python apache-spark pyspark pyspark-sql
asked Nov 19 '18 at 17:11
imed eddinesimed eddines
13
13
add a comment |
add a comment |
1 Answer
1
active
oldest
votes
Assuming you want every possible combination of locations for a user, you just need to do a join on USER_ID, then subtract the date columns. The one trick here is to use unix_timestamp to parse your datetime data to an integer that supports the subtraction operation.
Example Code:
from pyspark.sql.functions import unix_timestamp, col, datediff
data = [
(1, 1001, '19:11:39 5-2-2010'),
(1, 6022, '17:51:19 6-6-2010'),
(1, 1041, '11:11:39 5-2-2010'),
(2, 9483, '10:51:23 3-2-2012'),
(2, 4532, '11:11:11 4-5-2012'),
(3, 4374, '03:21:23 6-9-2013'),
(3, 4334, '04:53:13 4-5-2013')
]
df = spark.createDataFrame(data, ['USER_ID', 'location', 'timestamp'])
df = df.withColumn('timestamp', unix_timestamp('timestamp', 'HH:mm:ss dd-MM-yyyy'))
# Renaming columns to avoid conflicts after join
df2 = df.selectExpr('USER_ID as USER_ID2', 'location as location2', 'timestamp as timestamp2')
cartesian = df.join(df2, col("USER_ID") == col("USER_ID2"), "inner")
# Filter to get rid of reversed duplicates, and rows where location is same on both sides
pairs = cartesian.filter("location < location2")
.drop("USER_ID2")
.withColumn("diff", col("timestamp2") - col("timestamp"))
pairs.show()
add a comment |
Your Answer
StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");
StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});
function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});
}
});
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53379599%2ftimestamp-difference-between-rows-for-each-user-pyspark-dataframe%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
1 Answer
1
active
oldest
votes
1 Answer
1
active
oldest
votes
active
oldest
votes
active
oldest
votes
Assuming you want every possible combination of locations for a user, you just need to do a join on USER_ID, then subtract the date columns. The one trick here is to use unix_timestamp to parse your datetime data to an integer that supports the subtraction operation.
Example Code:
from pyspark.sql.functions import unix_timestamp, col, datediff
data = [
(1, 1001, '19:11:39 5-2-2010'),
(1, 6022, '17:51:19 6-6-2010'),
(1, 1041, '11:11:39 5-2-2010'),
(2, 9483, '10:51:23 3-2-2012'),
(2, 4532, '11:11:11 4-5-2012'),
(3, 4374, '03:21:23 6-9-2013'),
(3, 4334, '04:53:13 4-5-2013')
]
df = spark.createDataFrame(data, ['USER_ID', 'location', 'timestamp'])
df = df.withColumn('timestamp', unix_timestamp('timestamp', 'HH:mm:ss dd-MM-yyyy'))
# Renaming columns to avoid conflicts after join
df2 = df.selectExpr('USER_ID as USER_ID2', 'location as location2', 'timestamp as timestamp2')
cartesian = df.join(df2, col("USER_ID") == col("USER_ID2"), "inner")
# Filter to get rid of reversed duplicates, and rows where location is same on both sides
pairs = cartesian.filter("location < location2")
.drop("USER_ID2")
.withColumn("diff", col("timestamp2") - col("timestamp"))
pairs.show()
add a comment |
Assuming you want every possible combination of locations for a user, you just need to do a join on USER_ID, then subtract the date columns. The one trick here is to use unix_timestamp to parse your datetime data to an integer that supports the subtraction operation.
Example Code:
from pyspark.sql.functions import unix_timestamp, col, datediff
data = [
(1, 1001, '19:11:39 5-2-2010'),
(1, 6022, '17:51:19 6-6-2010'),
(1, 1041, '11:11:39 5-2-2010'),
(2, 9483, '10:51:23 3-2-2012'),
(2, 4532, '11:11:11 4-5-2012'),
(3, 4374, '03:21:23 6-9-2013'),
(3, 4334, '04:53:13 4-5-2013')
]
df = spark.createDataFrame(data, ['USER_ID', 'location', 'timestamp'])
df = df.withColumn('timestamp', unix_timestamp('timestamp', 'HH:mm:ss dd-MM-yyyy'))
# Renaming columns to avoid conflicts after join
df2 = df.selectExpr('USER_ID as USER_ID2', 'location as location2', 'timestamp as timestamp2')
cartesian = df.join(df2, col("USER_ID") == col("USER_ID2"), "inner")
# Filter to get rid of reversed duplicates, and rows where location is same on both sides
pairs = cartesian.filter("location < location2")
.drop("USER_ID2")
.withColumn("diff", col("timestamp2") - col("timestamp"))
pairs.show()
add a comment |
Assuming you want every possible combination of locations for a user, you just need to do a join on USER_ID, then subtract the date columns. The one trick here is to use unix_timestamp to parse your datetime data to an integer that supports the subtraction operation.
Example Code:
from pyspark.sql.functions import unix_timestamp, col, datediff
data = [
(1, 1001, '19:11:39 5-2-2010'),
(1, 6022, '17:51:19 6-6-2010'),
(1, 1041, '11:11:39 5-2-2010'),
(2, 9483, '10:51:23 3-2-2012'),
(2, 4532, '11:11:11 4-5-2012'),
(3, 4374, '03:21:23 6-9-2013'),
(3, 4334, '04:53:13 4-5-2013')
]
df = spark.createDataFrame(data, ['USER_ID', 'location', 'timestamp'])
df = df.withColumn('timestamp', unix_timestamp('timestamp', 'HH:mm:ss dd-MM-yyyy'))
# Renaming columns to avoid conflicts after join
df2 = df.selectExpr('USER_ID as USER_ID2', 'location as location2', 'timestamp as timestamp2')
cartesian = df.join(df2, col("USER_ID") == col("USER_ID2"), "inner")
# Filter to get rid of reversed duplicates, and rows where location is same on both sides
pairs = cartesian.filter("location < location2")
.drop("USER_ID2")
.withColumn("diff", col("timestamp2") - col("timestamp"))
pairs.show()
Assuming you want every possible combination of locations for a user, you just need to do a join on USER_ID, then subtract the date columns. The one trick here is to use unix_timestamp to parse your datetime data to an integer that supports the subtraction operation.
Example Code:
from pyspark.sql.functions import unix_timestamp, col, datediff
data = [
(1, 1001, '19:11:39 5-2-2010'),
(1, 6022, '17:51:19 6-6-2010'),
(1, 1041, '11:11:39 5-2-2010'),
(2, 9483, '10:51:23 3-2-2012'),
(2, 4532, '11:11:11 4-5-2012'),
(3, 4374, '03:21:23 6-9-2013'),
(3, 4334, '04:53:13 4-5-2013')
]
df = spark.createDataFrame(data, ['USER_ID', 'location', 'timestamp'])
df = df.withColumn('timestamp', unix_timestamp('timestamp', 'HH:mm:ss dd-MM-yyyy'))
# Renaming columns to avoid conflicts after join
df2 = df.selectExpr('USER_ID as USER_ID2', 'location as location2', 'timestamp as timestamp2')
cartesian = df.join(df2, col("USER_ID") == col("USER_ID2"), "inner")
# Filter to get rid of reversed duplicates, and rows where location is same on both sides
pairs = cartesian.filter("location < location2")
.drop("USER_ID2")
.withColumn("diff", col("timestamp2") - col("timestamp"))
pairs.show()
answered Nov 19 '18 at 17:45
Ryan WidmaierRyan Widmaier
2,99411218
2,99411218
add a comment |
add a comment |
Thanks for contributing an answer to Stack Overflow!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53379599%2ftimestamp-difference-between-rows-for-each-user-pyspark-dataframe%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown