timestamp difference between rows for each user - Pyspark Dataframe












-2















I have a CSV file with following structure



USER_ID       location          timestamp          
1 1001 19:11:39 5-2-2010
1 6022 17:51:19 6-6-2010
1 1041 11:11:39 5-2-2010
2 9483 10:51:23 3-2-2012
2 4532 11:11:11 4-5-2012
3 4374 03:21:23 6-9-2013
3 4334 04:53:13 4-5-2013


Basically what I would like to do using pyspark or only python is calculates the timestamp difference for different location with the same user_id number. An example from expected result would be:



USER_ID       location          timestamp difference         
1 1001-1041 08:00:00


any idea how to reach the solution










share|improve this question



























    -2















    I have a CSV file with following structure



    USER_ID       location          timestamp          
    1 1001 19:11:39 5-2-2010
    1 6022 17:51:19 6-6-2010
    1 1041 11:11:39 5-2-2010
    2 9483 10:51:23 3-2-2012
    2 4532 11:11:11 4-5-2012
    3 4374 03:21:23 6-9-2013
    3 4334 04:53:13 4-5-2013


    Basically what I would like to do using pyspark or only python is calculates the timestamp difference for different location with the same user_id number. An example from expected result would be:



    USER_ID       location          timestamp difference         
    1 1001-1041 08:00:00


    any idea how to reach the solution










    share|improve this question

























      -2












      -2








      -2








      I have a CSV file with following structure



      USER_ID       location          timestamp          
      1 1001 19:11:39 5-2-2010
      1 6022 17:51:19 6-6-2010
      1 1041 11:11:39 5-2-2010
      2 9483 10:51:23 3-2-2012
      2 4532 11:11:11 4-5-2012
      3 4374 03:21:23 6-9-2013
      3 4334 04:53:13 4-5-2013


      Basically what I would like to do using pyspark or only python is calculates the timestamp difference for different location with the same user_id number. An example from expected result would be:



      USER_ID       location          timestamp difference         
      1 1001-1041 08:00:00


      any idea how to reach the solution










      share|improve this question














      I have a CSV file with following structure



      USER_ID       location          timestamp          
      1 1001 19:11:39 5-2-2010
      1 6022 17:51:19 6-6-2010
      1 1041 11:11:39 5-2-2010
      2 9483 10:51:23 3-2-2012
      2 4532 11:11:11 4-5-2012
      3 4374 03:21:23 6-9-2013
      3 4334 04:53:13 4-5-2013


      Basically what I would like to do using pyspark or only python is calculates the timestamp difference for different location with the same user_id number. An example from expected result would be:



      USER_ID       location          timestamp difference         
      1 1001-1041 08:00:00


      any idea how to reach the solution







      python apache-spark pyspark pyspark-sql






      share|improve this question













      share|improve this question











      share|improve this question




      share|improve this question










      asked Nov 19 '18 at 17:11









      imed eddinesimed eddines

      13




      13
























          1 Answer
          1






          active

          oldest

          votes


















          1














          Assuming you want every possible combination of locations for a user, you just need to do a join on USER_ID, then subtract the date columns. The one trick here is to use unix_timestamp to parse your datetime data to an integer that supports the subtraction operation.



          Example Code:



          from pyspark.sql.functions import unix_timestamp, col, datediff

          data = [
          (1, 1001, '19:11:39 5-2-2010'),
          (1, 6022, '17:51:19 6-6-2010'),
          (1, 1041, '11:11:39 5-2-2010'),
          (2, 9483, '10:51:23 3-2-2012'),
          (2, 4532, '11:11:11 4-5-2012'),
          (3, 4374, '03:21:23 6-9-2013'),
          (3, 4334, '04:53:13 4-5-2013')
          ]

          df = spark.createDataFrame(data, ['USER_ID', 'location', 'timestamp'])
          df = df.withColumn('timestamp', unix_timestamp('timestamp', 'HH:mm:ss dd-MM-yyyy'))

          # Renaming columns to avoid conflicts after join
          df2 = df.selectExpr('USER_ID as USER_ID2', 'location as location2', 'timestamp as timestamp2')
          cartesian = df.join(df2, col("USER_ID") == col("USER_ID2"), "inner")

          # Filter to get rid of reversed duplicates, and rows where location is same on both sides
          pairs = cartesian.filter("location < location2")
          .drop("USER_ID2")
          .withColumn("diff", col("timestamp2") - col("timestamp"))
          pairs.show()





          share|improve this answer























            Your Answer






            StackExchange.ifUsing("editor", function () {
            StackExchange.using("externalEditor", function () {
            StackExchange.using("snippets", function () {
            StackExchange.snippets.init();
            });
            });
            }, "code-snippets");

            StackExchange.ready(function() {
            var channelOptions = {
            tags: "".split(" "),
            id: "1"
            };
            initTagRenderer("".split(" "), "".split(" "), channelOptions);

            StackExchange.using("externalEditor", function() {
            // Have to fire editor after snippets, if snippets enabled
            if (StackExchange.settings.snippets.snippetsEnabled) {
            StackExchange.using("snippets", function() {
            createEditor();
            });
            }
            else {
            createEditor();
            }
            });

            function createEditor() {
            StackExchange.prepareEditor({
            heartbeatType: 'answer',
            autoActivateHeartbeat: false,
            convertImagesToLinks: true,
            noModals: true,
            showLowRepImageUploadWarning: true,
            reputationToPostImages: 10,
            bindNavPrevention: true,
            postfix: "",
            imageUploader: {
            brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
            contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
            allowUrls: true
            },
            onDemand: true,
            discardSelector: ".discard-answer"
            ,immediatelyShowMarkdownHelp:true
            });


            }
            });














            draft saved

            draft discarded


















            StackExchange.ready(
            function () {
            StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53379599%2ftimestamp-difference-between-rows-for-each-user-pyspark-dataframe%23new-answer', 'question_page');
            }
            );

            Post as a guest















            Required, but never shown

























            1 Answer
            1






            active

            oldest

            votes








            1 Answer
            1






            active

            oldest

            votes









            active

            oldest

            votes






            active

            oldest

            votes









            1














            Assuming you want every possible combination of locations for a user, you just need to do a join on USER_ID, then subtract the date columns. The one trick here is to use unix_timestamp to parse your datetime data to an integer that supports the subtraction operation.



            Example Code:



            from pyspark.sql.functions import unix_timestamp, col, datediff

            data = [
            (1, 1001, '19:11:39 5-2-2010'),
            (1, 6022, '17:51:19 6-6-2010'),
            (1, 1041, '11:11:39 5-2-2010'),
            (2, 9483, '10:51:23 3-2-2012'),
            (2, 4532, '11:11:11 4-5-2012'),
            (3, 4374, '03:21:23 6-9-2013'),
            (3, 4334, '04:53:13 4-5-2013')
            ]

            df = spark.createDataFrame(data, ['USER_ID', 'location', 'timestamp'])
            df = df.withColumn('timestamp', unix_timestamp('timestamp', 'HH:mm:ss dd-MM-yyyy'))

            # Renaming columns to avoid conflicts after join
            df2 = df.selectExpr('USER_ID as USER_ID2', 'location as location2', 'timestamp as timestamp2')
            cartesian = df.join(df2, col("USER_ID") == col("USER_ID2"), "inner")

            # Filter to get rid of reversed duplicates, and rows where location is same on both sides
            pairs = cartesian.filter("location < location2")
            .drop("USER_ID2")
            .withColumn("diff", col("timestamp2") - col("timestamp"))
            pairs.show()





            share|improve this answer




























              1














              Assuming you want every possible combination of locations for a user, you just need to do a join on USER_ID, then subtract the date columns. The one trick here is to use unix_timestamp to parse your datetime data to an integer that supports the subtraction operation.



              Example Code:



              from pyspark.sql.functions import unix_timestamp, col, datediff

              data = [
              (1, 1001, '19:11:39 5-2-2010'),
              (1, 6022, '17:51:19 6-6-2010'),
              (1, 1041, '11:11:39 5-2-2010'),
              (2, 9483, '10:51:23 3-2-2012'),
              (2, 4532, '11:11:11 4-5-2012'),
              (3, 4374, '03:21:23 6-9-2013'),
              (3, 4334, '04:53:13 4-5-2013')
              ]

              df = spark.createDataFrame(data, ['USER_ID', 'location', 'timestamp'])
              df = df.withColumn('timestamp', unix_timestamp('timestamp', 'HH:mm:ss dd-MM-yyyy'))

              # Renaming columns to avoid conflicts after join
              df2 = df.selectExpr('USER_ID as USER_ID2', 'location as location2', 'timestamp as timestamp2')
              cartesian = df.join(df2, col("USER_ID") == col("USER_ID2"), "inner")

              # Filter to get rid of reversed duplicates, and rows where location is same on both sides
              pairs = cartesian.filter("location < location2")
              .drop("USER_ID2")
              .withColumn("diff", col("timestamp2") - col("timestamp"))
              pairs.show()





              share|improve this answer


























                1












                1








                1







                Assuming you want every possible combination of locations for a user, you just need to do a join on USER_ID, then subtract the date columns. The one trick here is to use unix_timestamp to parse your datetime data to an integer that supports the subtraction operation.



                Example Code:



                from pyspark.sql.functions import unix_timestamp, col, datediff

                data = [
                (1, 1001, '19:11:39 5-2-2010'),
                (1, 6022, '17:51:19 6-6-2010'),
                (1, 1041, '11:11:39 5-2-2010'),
                (2, 9483, '10:51:23 3-2-2012'),
                (2, 4532, '11:11:11 4-5-2012'),
                (3, 4374, '03:21:23 6-9-2013'),
                (3, 4334, '04:53:13 4-5-2013')
                ]

                df = spark.createDataFrame(data, ['USER_ID', 'location', 'timestamp'])
                df = df.withColumn('timestamp', unix_timestamp('timestamp', 'HH:mm:ss dd-MM-yyyy'))

                # Renaming columns to avoid conflicts after join
                df2 = df.selectExpr('USER_ID as USER_ID2', 'location as location2', 'timestamp as timestamp2')
                cartesian = df.join(df2, col("USER_ID") == col("USER_ID2"), "inner")

                # Filter to get rid of reversed duplicates, and rows where location is same on both sides
                pairs = cartesian.filter("location < location2")
                .drop("USER_ID2")
                .withColumn("diff", col("timestamp2") - col("timestamp"))
                pairs.show()





                share|improve this answer













                Assuming you want every possible combination of locations for a user, you just need to do a join on USER_ID, then subtract the date columns. The one trick here is to use unix_timestamp to parse your datetime data to an integer that supports the subtraction operation.



                Example Code:



                from pyspark.sql.functions import unix_timestamp, col, datediff

                data = [
                (1, 1001, '19:11:39 5-2-2010'),
                (1, 6022, '17:51:19 6-6-2010'),
                (1, 1041, '11:11:39 5-2-2010'),
                (2, 9483, '10:51:23 3-2-2012'),
                (2, 4532, '11:11:11 4-5-2012'),
                (3, 4374, '03:21:23 6-9-2013'),
                (3, 4334, '04:53:13 4-5-2013')
                ]

                df = spark.createDataFrame(data, ['USER_ID', 'location', 'timestamp'])
                df = df.withColumn('timestamp', unix_timestamp('timestamp', 'HH:mm:ss dd-MM-yyyy'))

                # Renaming columns to avoid conflicts after join
                df2 = df.selectExpr('USER_ID as USER_ID2', 'location as location2', 'timestamp as timestamp2')
                cartesian = df.join(df2, col("USER_ID") == col("USER_ID2"), "inner")

                # Filter to get rid of reversed duplicates, and rows where location is same on both sides
                pairs = cartesian.filter("location < location2")
                .drop("USER_ID2")
                .withColumn("diff", col("timestamp2") - col("timestamp"))
                pairs.show()






                share|improve this answer












                share|improve this answer



                share|improve this answer










                answered Nov 19 '18 at 17:45









                Ryan WidmaierRyan Widmaier

                2,99411218




                2,99411218
































                    draft saved

                    draft discarded




















































                    Thanks for contributing an answer to Stack Overflow!


                    • Please be sure to answer the question. Provide details and share your research!

                    But avoid



                    • Asking for help, clarification, or responding to other answers.

                    • Making statements based on opinion; back them up with references or personal experience.


                    To learn more, see our tips on writing great answers.




                    draft saved


                    draft discarded














                    StackExchange.ready(
                    function () {
                    StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53379599%2ftimestamp-difference-between-rows-for-each-user-pyspark-dataframe%23new-answer', 'question_page');
                    }
                    );

                    Post as a guest















                    Required, but never shown





















































                    Required, but never shown














                    Required, but never shown












                    Required, but never shown







                    Required, but never shown

































                    Required, but never shown














                    Required, but never shown












                    Required, but never shown







                    Required, but never shown







                    Popular posts from this blog

                    How to pass form data using jquery Ajax to insert data in database?

                    National Museum of Racing and Hall of Fame

                    Guess what letter conforming each word