How to join datasets with values between other values?












0















I have a use case where I need to join 2 data-frames.



ID view



ID  BookTime
1 2
1 5
2 8
2 3
3 4


FareRule view



Start End Fare
1 3 10
3 6 20
6 10 25


Output is a result of join by checking the BookTime from the ID table. The Fare is computed based on the window that is between Start and End from FareRule.



ID  FareDue
1 10
1 20
2 25
2 20
3 20


I am creating a view out of these data-frames and using CROSS JOIN to join them. But as we know, CROSS join is expensive so is there a better way to join them?



SELECT 
ID,
Fare AS FareDue
FROM
ID
CROSS JOIN
FareRule
WHERE
BookTime >=Start
AND
BookTime< End









share|improve this question

























  • Possible duplicate of How to improve broadcast Join speed in Spark

    – user10465355
    Nov 20 '18 at 19:55











  • Why are you using cross join? From the looks of it you sgould be getting the same result with a regular join.

    – Vitaliy
    Nov 21 '18 at 6:34











  • Like other SQL query engines Spark has a query optimizer & there is no difference between CROSS JOIN (or comma) with WHERE vs INNER JOIN ON. Read about the general idea of relational query optimization/implementation anywhere. (Including dozens of published academic textbooks free online in pdf.) Read about the optimizer Catalyst & writing optimal Spark SQL queries anywhere. (Many implementers who authored the ACM paper "Spark SQL: Relational Data Processing in Spark" work at Databricks.) PS What did you learn from googling re spark sql query optimization before you considered asking?

    – philipxy
    Nov 28 '18 at 21:13


















0















I have a use case where I need to join 2 data-frames.



ID view



ID  BookTime
1 2
1 5
2 8
2 3
3 4


FareRule view



Start End Fare
1 3 10
3 6 20
6 10 25


Output is a result of join by checking the BookTime from the ID table. The Fare is computed based on the window that is between Start and End from FareRule.



ID  FareDue
1 10
1 20
2 25
2 20
3 20


I am creating a view out of these data-frames and using CROSS JOIN to join them. But as we know, CROSS join is expensive so is there a better way to join them?



SELECT 
ID,
Fare AS FareDue
FROM
ID
CROSS JOIN
FareRule
WHERE
BookTime >=Start
AND
BookTime< End









share|improve this question

























  • Possible duplicate of How to improve broadcast Join speed in Spark

    – user10465355
    Nov 20 '18 at 19:55











  • Why are you using cross join? From the looks of it you sgould be getting the same result with a regular join.

    – Vitaliy
    Nov 21 '18 at 6:34











  • Like other SQL query engines Spark has a query optimizer & there is no difference between CROSS JOIN (or comma) with WHERE vs INNER JOIN ON. Read about the general idea of relational query optimization/implementation anywhere. (Including dozens of published academic textbooks free online in pdf.) Read about the optimizer Catalyst & writing optimal Spark SQL queries anywhere. (Many implementers who authored the ACM paper "Spark SQL: Relational Data Processing in Spark" work at Databricks.) PS What did you learn from googling re spark sql query optimization before you considered asking?

    – philipxy
    Nov 28 '18 at 21:13
















0












0








0


1






I have a use case where I need to join 2 data-frames.



ID view



ID  BookTime
1 2
1 5
2 8
2 3
3 4


FareRule view



Start End Fare
1 3 10
3 6 20
6 10 25


Output is a result of join by checking the BookTime from the ID table. The Fare is computed based on the window that is between Start and End from FareRule.



ID  FareDue
1 10
1 20
2 25
2 20
3 20


I am creating a view out of these data-frames and using CROSS JOIN to join them. But as we know, CROSS join is expensive so is there a better way to join them?



SELECT 
ID,
Fare AS FareDue
FROM
ID
CROSS JOIN
FareRule
WHERE
BookTime >=Start
AND
BookTime< End









share|improve this question
















I have a use case where I need to join 2 data-frames.



ID view



ID  BookTime
1 2
1 5
2 8
2 3
3 4


FareRule view



Start End Fare
1 3 10
3 6 20
6 10 25


Output is a result of join by checking the BookTime from the ID table. The Fare is computed based on the window that is between Start and End from FareRule.



ID  FareDue
1 10
1 20
2 25
2 20
3 20


I am creating a view out of these data-frames and using CROSS JOIN to join them. But as we know, CROSS join is expensive so is there a better way to join them?



SELECT 
ID,
Fare AS FareDue
FROM
ID
CROSS JOIN
FareRule
WHERE
BookTime >=Start
AND
BookTime< End






apache-spark apache-spark-sql






share|improve this question















share|improve this question













share|improve this question




share|improve this question








edited Nov 21 '18 at 11:01









Jacek Laskowski

45.5k18134274




45.5k18134274










asked Nov 20 '18 at 19:49









Amaan KhanAmaan Khan

467




467













  • Possible duplicate of How to improve broadcast Join speed in Spark

    – user10465355
    Nov 20 '18 at 19:55











  • Why are you using cross join? From the looks of it you sgould be getting the same result with a regular join.

    – Vitaliy
    Nov 21 '18 at 6:34











  • Like other SQL query engines Spark has a query optimizer & there is no difference between CROSS JOIN (or comma) with WHERE vs INNER JOIN ON. Read about the general idea of relational query optimization/implementation anywhere. (Including dozens of published academic textbooks free online in pdf.) Read about the optimizer Catalyst & writing optimal Spark SQL queries anywhere. (Many implementers who authored the ACM paper "Spark SQL: Relational Data Processing in Spark" work at Databricks.) PS What did you learn from googling re spark sql query optimization before you considered asking?

    – philipxy
    Nov 28 '18 at 21:13





















  • Possible duplicate of How to improve broadcast Join speed in Spark

    – user10465355
    Nov 20 '18 at 19:55











  • Why are you using cross join? From the looks of it you sgould be getting the same result with a regular join.

    – Vitaliy
    Nov 21 '18 at 6:34











  • Like other SQL query engines Spark has a query optimizer & there is no difference between CROSS JOIN (or comma) with WHERE vs INNER JOIN ON. Read about the general idea of relational query optimization/implementation anywhere. (Including dozens of published academic textbooks free online in pdf.) Read about the optimizer Catalyst & writing optimal Spark SQL queries anywhere. (Many implementers who authored the ACM paper "Spark SQL: Relational Data Processing in Spark" work at Databricks.) PS What did you learn from googling re spark sql query optimization before you considered asking?

    – philipxy
    Nov 28 '18 at 21:13



















Possible duplicate of How to improve broadcast Join speed in Spark

– user10465355
Nov 20 '18 at 19:55





Possible duplicate of How to improve broadcast Join speed in Spark

– user10465355
Nov 20 '18 at 19:55













Why are you using cross join? From the looks of it you sgould be getting the same result with a regular join.

– Vitaliy
Nov 21 '18 at 6:34





Why are you using cross join? From the looks of it you sgould be getting the same result with a regular join.

– Vitaliy
Nov 21 '18 at 6:34













Like other SQL query engines Spark has a query optimizer & there is no difference between CROSS JOIN (or comma) with WHERE vs INNER JOIN ON. Read about the general idea of relational query optimization/implementation anywhere. (Including dozens of published academic textbooks free online in pdf.) Read about the optimizer Catalyst & writing optimal Spark SQL queries anywhere. (Many implementers who authored the ACM paper "Spark SQL: Relational Data Processing in Spark" work at Databricks.) PS What did you learn from googling re spark sql query optimization before you considered asking?

– philipxy
Nov 28 '18 at 21:13







Like other SQL query engines Spark has a query optimizer & there is no difference between CROSS JOIN (or comma) with WHERE vs INNER JOIN ON. Read about the general idea of relational query optimization/implementation anywhere. (Including dozens of published academic textbooks free online in pdf.) Read about the optimizer Catalyst & writing optimal Spark SQL queries anywhere. (Many implementers who authored the ACM paper "Spark SQL: Relational Data Processing in Spark" work at Databricks.) PS What did you learn from googling re spark sql query optimization before you considered asking?

– philipxy
Nov 28 '18 at 21:13














1 Answer
1






active

oldest

votes


















-1














Given the following datasets:



val id = Seq((1, 2), (1, 5), (2, 8), (2, 3), (3, 4)).toDF("ID", "BookTime")
scala> id.show
+---+--------+
| ID|BookTime|
+---+--------+
| 1| 2|
| 1| 5|
| 2| 8|
| 2| 3|
| 3| 4|
+---+--------+

val fareRule = Seq((1,3,10), (3,6,20), (6,10,25)).toDF("start", "end", "fare")
scala> fareRule.show
+-----+---+----+
|start|end|fare|
+-----+---+----+
| 1| 3| 10|
| 3| 6| 20|
| 6| 10| 25|
+-----+---+----+


You simply join them together using between expression.



val q = id.join(fareRule).where('BookTime between('start, 'end)).select('id, 'fare)
scala> q.show
+---+----+
| id|fare|
+---+----+
| 1| 10|
| 1| 20|
| 2| 25|
| 2| 10|
| 2| 20|
| 3| 20|
+---+----+


You may want to adjust between so the boundaries are exclusive on one side. between by default uses the lower bound and upper bound, inclusive.






share|improve this answer























    Your Answer






    StackExchange.ifUsing("editor", function () {
    StackExchange.using("externalEditor", function () {
    StackExchange.using("snippets", function () {
    StackExchange.snippets.init();
    });
    });
    }, "code-snippets");

    StackExchange.ready(function() {
    var channelOptions = {
    tags: "".split(" "),
    id: "1"
    };
    initTagRenderer("".split(" "), "".split(" "), channelOptions);

    StackExchange.using("externalEditor", function() {
    // Have to fire editor after snippets, if snippets enabled
    if (StackExchange.settings.snippets.snippetsEnabled) {
    StackExchange.using("snippets", function() {
    createEditor();
    });
    }
    else {
    createEditor();
    }
    });

    function createEditor() {
    StackExchange.prepareEditor({
    heartbeatType: 'answer',
    autoActivateHeartbeat: false,
    convertImagesToLinks: true,
    noModals: true,
    showLowRepImageUploadWarning: true,
    reputationToPostImages: 10,
    bindNavPrevention: true,
    postfix: "",
    imageUploader: {
    brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
    contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
    allowUrls: true
    },
    onDemand: true,
    discardSelector: ".discard-answer"
    ,immediatelyShowMarkdownHelp:true
    });


    }
    });














    draft saved

    draft discarded


















    StackExchange.ready(
    function () {
    StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53400521%2fhow-to-join-datasets-with-values-between-other-values%23new-answer', 'question_page');
    }
    );

    Post as a guest















    Required, but never shown

























    1 Answer
    1






    active

    oldest

    votes








    1 Answer
    1






    active

    oldest

    votes









    active

    oldest

    votes






    active

    oldest

    votes









    -1














    Given the following datasets:



    val id = Seq((1, 2), (1, 5), (2, 8), (2, 3), (3, 4)).toDF("ID", "BookTime")
    scala> id.show
    +---+--------+
    | ID|BookTime|
    +---+--------+
    | 1| 2|
    | 1| 5|
    | 2| 8|
    | 2| 3|
    | 3| 4|
    +---+--------+

    val fareRule = Seq((1,3,10), (3,6,20), (6,10,25)).toDF("start", "end", "fare")
    scala> fareRule.show
    +-----+---+----+
    |start|end|fare|
    +-----+---+----+
    | 1| 3| 10|
    | 3| 6| 20|
    | 6| 10| 25|
    +-----+---+----+


    You simply join them together using between expression.



    val q = id.join(fareRule).where('BookTime between('start, 'end)).select('id, 'fare)
    scala> q.show
    +---+----+
    | id|fare|
    +---+----+
    | 1| 10|
    | 1| 20|
    | 2| 25|
    | 2| 10|
    | 2| 20|
    | 3| 20|
    +---+----+


    You may want to adjust between so the boundaries are exclusive on one side. between by default uses the lower bound and upper bound, inclusive.






    share|improve this answer




























      -1














      Given the following datasets:



      val id = Seq((1, 2), (1, 5), (2, 8), (2, 3), (3, 4)).toDF("ID", "BookTime")
      scala> id.show
      +---+--------+
      | ID|BookTime|
      +---+--------+
      | 1| 2|
      | 1| 5|
      | 2| 8|
      | 2| 3|
      | 3| 4|
      +---+--------+

      val fareRule = Seq((1,3,10), (3,6,20), (6,10,25)).toDF("start", "end", "fare")
      scala> fareRule.show
      +-----+---+----+
      |start|end|fare|
      +-----+---+----+
      | 1| 3| 10|
      | 3| 6| 20|
      | 6| 10| 25|
      +-----+---+----+


      You simply join them together using between expression.



      val q = id.join(fareRule).where('BookTime between('start, 'end)).select('id, 'fare)
      scala> q.show
      +---+----+
      | id|fare|
      +---+----+
      | 1| 10|
      | 1| 20|
      | 2| 25|
      | 2| 10|
      | 2| 20|
      | 3| 20|
      +---+----+


      You may want to adjust between so the boundaries are exclusive on one side. between by default uses the lower bound and upper bound, inclusive.






      share|improve this answer


























        -1












        -1








        -1







        Given the following datasets:



        val id = Seq((1, 2), (1, 5), (2, 8), (2, 3), (3, 4)).toDF("ID", "BookTime")
        scala> id.show
        +---+--------+
        | ID|BookTime|
        +---+--------+
        | 1| 2|
        | 1| 5|
        | 2| 8|
        | 2| 3|
        | 3| 4|
        +---+--------+

        val fareRule = Seq((1,3,10), (3,6,20), (6,10,25)).toDF("start", "end", "fare")
        scala> fareRule.show
        +-----+---+----+
        |start|end|fare|
        +-----+---+----+
        | 1| 3| 10|
        | 3| 6| 20|
        | 6| 10| 25|
        +-----+---+----+


        You simply join them together using between expression.



        val q = id.join(fareRule).where('BookTime between('start, 'end)).select('id, 'fare)
        scala> q.show
        +---+----+
        | id|fare|
        +---+----+
        | 1| 10|
        | 1| 20|
        | 2| 25|
        | 2| 10|
        | 2| 20|
        | 3| 20|
        +---+----+


        You may want to adjust between so the boundaries are exclusive on one side. between by default uses the lower bound and upper bound, inclusive.






        share|improve this answer













        Given the following datasets:



        val id = Seq((1, 2), (1, 5), (2, 8), (2, 3), (3, 4)).toDF("ID", "BookTime")
        scala> id.show
        +---+--------+
        | ID|BookTime|
        +---+--------+
        | 1| 2|
        | 1| 5|
        | 2| 8|
        | 2| 3|
        | 3| 4|
        +---+--------+

        val fareRule = Seq((1,3,10), (3,6,20), (6,10,25)).toDF("start", "end", "fare")
        scala> fareRule.show
        +-----+---+----+
        |start|end|fare|
        +-----+---+----+
        | 1| 3| 10|
        | 3| 6| 20|
        | 6| 10| 25|
        +-----+---+----+


        You simply join them together using between expression.



        val q = id.join(fareRule).where('BookTime between('start, 'end)).select('id, 'fare)
        scala> q.show
        +---+----+
        | id|fare|
        +---+----+
        | 1| 10|
        | 1| 20|
        | 2| 25|
        | 2| 10|
        | 2| 20|
        | 3| 20|
        +---+----+


        You may want to adjust between so the boundaries are exclusive on one side. between by default uses the lower bound and upper bound, inclusive.







        share|improve this answer












        share|improve this answer



        share|improve this answer










        answered Nov 21 '18 at 11:00









        Jacek LaskowskiJacek Laskowski

        45.5k18134274




        45.5k18134274
































            draft saved

            draft discarded




















































            Thanks for contributing an answer to Stack Overflow!


            • Please be sure to answer the question. Provide details and share your research!

            But avoid



            • Asking for help, clarification, or responding to other answers.

            • Making statements based on opinion; back them up with references or personal experience.


            To learn more, see our tips on writing great answers.




            draft saved


            draft discarded














            StackExchange.ready(
            function () {
            StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53400521%2fhow-to-join-datasets-with-values-between-other-values%23new-answer', 'question_page');
            }
            );

            Post as a guest















            Required, but never shown





















































            Required, but never shown














            Required, but never shown












            Required, but never shown







            Required, but never shown

































            Required, but never shown














            Required, but never shown












            Required, but never shown







            Required, but never shown







            Popular posts from this blog

            Guess what letter conforming each word

            Run scheduled task as local user group (not BUILTIN)

            Port of Spain