Efficient way to cluster a timeline OR reconstruct a batch number












2















I'm working on a large dataset (150k / day) of a tester database. Each row contains data about a specific test of the product. Each tester inserts the results of his test.



I want to do some measurements like pass-fail-rate over a shift per product and tester. The problem is there are no batch numbers assigned so I can't select this easy.



Considering the given subselect of the whole table:



 id   tBegin                orderId   
------------------------------------
1 2018-10-20 00:00:05 1
2 2018-10-20 00:05:15 1
3 2018-10-20 01:00:05 1
10 2018-10-20 10:03:05 3
12 2018-10-20 11:04:05 8
20 2018-10-20 14:15:05 3
37 2018-10-20 18:12:05 1


My goal is it to cluster the data to the following



 id   tBegin                orderId   pCount 
--------------------------------------------
1 2018-10-20 00:00:05 1 3
10 2018-10-20 10:03:05 3 1
12 2018-10-20 11:04:05 8 1
20 2018-10-20 14:15:05 3 1
37 2018-10-20 18:12:05 1 1


A simple GROUP BY orderID won't do the trick, so I came upwith the following



SELECT 
MIN(c.id) AS id,
MIN(c.tBegin) AS tBegin,
c.orderId,
COUNT(*) AS pCount
FROM (
SELECT t2.id, t2.tBegin, t2.orderId,
( SELECT TOP 1 t.id
FROM history t
WHERE t.tBegin > t2.tBegin
AND t.orderID <> t2.orderID
AND <restrict date here further>
ORDER BY t.tBegin
) AS nextId
FROM history t2
) AS c
WHERE <restrict date here>
GROUP BY c.orderID, c.nextId


I left out the WHEREs that select the correct date and tester.
This works, but it seams very inefficient. I have worked with small databases, but I'm new to SQL Server 2017.



I appreciate your help very much!










share|improve this question





























    2















    I'm working on a large dataset (150k / day) of a tester database. Each row contains data about a specific test of the product. Each tester inserts the results of his test.



    I want to do some measurements like pass-fail-rate over a shift per product and tester. The problem is there are no batch numbers assigned so I can't select this easy.



    Considering the given subselect of the whole table:



     id   tBegin                orderId   
    ------------------------------------
    1 2018-10-20 00:00:05 1
    2 2018-10-20 00:05:15 1
    3 2018-10-20 01:00:05 1
    10 2018-10-20 10:03:05 3
    12 2018-10-20 11:04:05 8
    20 2018-10-20 14:15:05 3
    37 2018-10-20 18:12:05 1


    My goal is it to cluster the data to the following



     id   tBegin                orderId   pCount 
    --------------------------------------------
    1 2018-10-20 00:00:05 1 3
    10 2018-10-20 10:03:05 3 1
    12 2018-10-20 11:04:05 8 1
    20 2018-10-20 14:15:05 3 1
    37 2018-10-20 18:12:05 1 1


    A simple GROUP BY orderID won't do the trick, so I came upwith the following



    SELECT 
    MIN(c.id) AS id,
    MIN(c.tBegin) AS tBegin,
    c.orderId,
    COUNT(*) AS pCount
    FROM (
    SELECT t2.id, t2.tBegin, t2.orderId,
    ( SELECT TOP 1 t.id
    FROM history t
    WHERE t.tBegin > t2.tBegin
    AND t.orderID <> t2.orderID
    AND <restrict date here further>
    ORDER BY t.tBegin
    ) AS nextId
    FROM history t2
    ) AS c
    WHERE <restrict date here>
    GROUP BY c.orderID, c.nextId


    I left out the WHEREs that select the correct date and tester.
    This works, but it seams very inefficient. I have worked with small databases, but I'm new to SQL Server 2017.



    I appreciate your help very much!










    share|improve this question



























      2












      2








      2








      I'm working on a large dataset (150k / day) of a tester database. Each row contains data about a specific test of the product. Each tester inserts the results of his test.



      I want to do some measurements like pass-fail-rate over a shift per product and tester. The problem is there are no batch numbers assigned so I can't select this easy.



      Considering the given subselect of the whole table:



       id   tBegin                orderId   
      ------------------------------------
      1 2018-10-20 00:00:05 1
      2 2018-10-20 00:05:15 1
      3 2018-10-20 01:00:05 1
      10 2018-10-20 10:03:05 3
      12 2018-10-20 11:04:05 8
      20 2018-10-20 14:15:05 3
      37 2018-10-20 18:12:05 1


      My goal is it to cluster the data to the following



       id   tBegin                orderId   pCount 
      --------------------------------------------
      1 2018-10-20 00:00:05 1 3
      10 2018-10-20 10:03:05 3 1
      12 2018-10-20 11:04:05 8 1
      20 2018-10-20 14:15:05 3 1
      37 2018-10-20 18:12:05 1 1


      A simple GROUP BY orderID won't do the trick, so I came upwith the following



      SELECT 
      MIN(c.id) AS id,
      MIN(c.tBegin) AS tBegin,
      c.orderId,
      COUNT(*) AS pCount
      FROM (
      SELECT t2.id, t2.tBegin, t2.orderId,
      ( SELECT TOP 1 t.id
      FROM history t
      WHERE t.tBegin > t2.tBegin
      AND t.orderID <> t2.orderID
      AND <restrict date here further>
      ORDER BY t.tBegin
      ) AS nextId
      FROM history t2
      ) AS c
      WHERE <restrict date here>
      GROUP BY c.orderID, c.nextId


      I left out the WHEREs that select the correct date and tester.
      This works, but it seams very inefficient. I have worked with small databases, but I'm new to SQL Server 2017.



      I appreciate your help very much!










      share|improve this question
















      I'm working on a large dataset (150k / day) of a tester database. Each row contains data about a specific test of the product. Each tester inserts the results of his test.



      I want to do some measurements like pass-fail-rate over a shift per product and tester. The problem is there are no batch numbers assigned so I can't select this easy.



      Considering the given subselect of the whole table:



       id   tBegin                orderId   
      ------------------------------------
      1 2018-10-20 00:00:05 1
      2 2018-10-20 00:05:15 1
      3 2018-10-20 01:00:05 1
      10 2018-10-20 10:03:05 3
      12 2018-10-20 11:04:05 8
      20 2018-10-20 14:15:05 3
      37 2018-10-20 18:12:05 1


      My goal is it to cluster the data to the following



       id   tBegin                orderId   pCount 
      --------------------------------------------
      1 2018-10-20 00:00:05 1 3
      10 2018-10-20 10:03:05 3 1
      12 2018-10-20 11:04:05 8 1
      20 2018-10-20 14:15:05 3 1
      37 2018-10-20 18:12:05 1 1


      A simple GROUP BY orderID won't do the trick, so I came upwith the following



      SELECT 
      MIN(c.id) AS id,
      MIN(c.tBegin) AS tBegin,
      c.orderId,
      COUNT(*) AS pCount
      FROM (
      SELECT t2.id, t2.tBegin, t2.orderId,
      ( SELECT TOP 1 t.id
      FROM history t
      WHERE t.tBegin > t2.tBegin
      AND t.orderID <> t2.orderID
      AND <restrict date here further>
      ORDER BY t.tBegin
      ) AS nextId
      FROM history t2
      ) AS c
      WHERE <restrict date here>
      GROUP BY c.orderID, c.nextId


      I left out the WHEREs that select the correct date and tester.
      This works, but it seams very inefficient. I have worked with small databases, but I'm new to SQL Server 2017.



      I appreciate your help very much!







      sql sql-server datetime group-by sql-server-2017






      share|improve this question















      share|improve this question













      share|improve this question




      share|improve this question








      edited Dec 4 '18 at 10:25









      Salman A

      178k66338430




      178k66338430










      asked Nov 19 '18 at 10:25









      John BartJohn Bart

      155




      155
























          2 Answers
          2






          active

          oldest

          votes


















          1














          You can use window functions for this:



          DECLARE @t TABLE (id INT, tBegin DATETIME, orderId INT);
          INSERT INTO @t VALUES
          (1 , '2018-10-20 00:00:05', 1),
          (2 , '2018-10-20 00:05:15', 1),
          (3 , '2018-10-20 01:00:05', 1),
          (10, '2018-10-20 10:03:05', 3),
          (12, '2018-10-20 11:04:05', 8),
          (20, '2018-10-20 14:15:05', 3),
          (37, '2018-10-20 18:12:05', 1);

          WITH cte1 AS (
          SELECT *, CASE WHEN orderId = LAG(orderId) OVER (ORDER BY tBegin) THEN 0 ELSE 1 END AS chg
          FROM @t
          ), cte2 AS (
          SELECT *, SUM(chg) OVER(ORDER BY tBegin) AS grp
          FROM cte1
          ), cte3 AS (
          SELECT *, ROW_NUMBER() OVER (PARTITION BY grp ORDER BY tBegin) AS rn
          FROM cte2
          )
          SELECT *
          FROM cte3
          WHERE rn = 1



          • The first cte assigns a "change flag" to each row where the value changed

          • The second cte uses a running sum to convert 1s and 0s to a number which can be used to group rows

          • Finally you number rows within each group and select first row per group


          Demo on DB Fiddle






          share|improve this answer

































            0














            You can use cumulative approach :



            select min(id) as id, max(tBegin), orderid, count(*) 
            from (select h.*,
            row_number() over (order by id) as seq1,
            row_number() over (partition by orderid order by id) as seq2
            from history h
            ) h
            group by orderid, (seq1 - seq2)
            order by id;





            share|improve this answer























              Your Answer






              StackExchange.ifUsing("editor", function () {
              StackExchange.using("externalEditor", function () {
              StackExchange.using("snippets", function () {
              StackExchange.snippets.init();
              });
              });
              }, "code-snippets");

              StackExchange.ready(function() {
              var channelOptions = {
              tags: "".split(" "),
              id: "1"
              };
              initTagRenderer("".split(" "), "".split(" "), channelOptions);

              StackExchange.using("externalEditor", function() {
              // Have to fire editor after snippets, if snippets enabled
              if (StackExchange.settings.snippets.snippetsEnabled) {
              StackExchange.using("snippets", function() {
              createEditor();
              });
              }
              else {
              createEditor();
              }
              });

              function createEditor() {
              StackExchange.prepareEditor({
              heartbeatType: 'answer',
              autoActivateHeartbeat: false,
              convertImagesToLinks: true,
              noModals: true,
              showLowRepImageUploadWarning: true,
              reputationToPostImages: 10,
              bindNavPrevention: true,
              postfix: "",
              imageUploader: {
              brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
              contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
              allowUrls: true
              },
              onDemand: true,
              discardSelector: ".discard-answer"
              ,immediatelyShowMarkdownHelp:true
              });


              }
              });














              draft saved

              draft discarded


















              StackExchange.ready(
              function () {
              StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53372583%2fefficient-way-to-cluster-a-timeline-or-reconstruct-a-batch-number%23new-answer', 'question_page');
              }
              );

              Post as a guest















              Required, but never shown

























              2 Answers
              2






              active

              oldest

              votes








              2 Answers
              2






              active

              oldest

              votes









              active

              oldest

              votes






              active

              oldest

              votes









              1














              You can use window functions for this:



              DECLARE @t TABLE (id INT, tBegin DATETIME, orderId INT);
              INSERT INTO @t VALUES
              (1 , '2018-10-20 00:00:05', 1),
              (2 , '2018-10-20 00:05:15', 1),
              (3 , '2018-10-20 01:00:05', 1),
              (10, '2018-10-20 10:03:05', 3),
              (12, '2018-10-20 11:04:05', 8),
              (20, '2018-10-20 14:15:05', 3),
              (37, '2018-10-20 18:12:05', 1);

              WITH cte1 AS (
              SELECT *, CASE WHEN orderId = LAG(orderId) OVER (ORDER BY tBegin) THEN 0 ELSE 1 END AS chg
              FROM @t
              ), cte2 AS (
              SELECT *, SUM(chg) OVER(ORDER BY tBegin) AS grp
              FROM cte1
              ), cte3 AS (
              SELECT *, ROW_NUMBER() OVER (PARTITION BY grp ORDER BY tBegin) AS rn
              FROM cte2
              )
              SELECT *
              FROM cte3
              WHERE rn = 1



              • The first cte assigns a "change flag" to each row where the value changed

              • The second cte uses a running sum to convert 1s and 0s to a number which can be used to group rows

              • Finally you number rows within each group and select first row per group


              Demo on DB Fiddle






              share|improve this answer






























                1














                You can use window functions for this:



                DECLARE @t TABLE (id INT, tBegin DATETIME, orderId INT);
                INSERT INTO @t VALUES
                (1 , '2018-10-20 00:00:05', 1),
                (2 , '2018-10-20 00:05:15', 1),
                (3 , '2018-10-20 01:00:05', 1),
                (10, '2018-10-20 10:03:05', 3),
                (12, '2018-10-20 11:04:05', 8),
                (20, '2018-10-20 14:15:05', 3),
                (37, '2018-10-20 18:12:05', 1);

                WITH cte1 AS (
                SELECT *, CASE WHEN orderId = LAG(orderId) OVER (ORDER BY tBegin) THEN 0 ELSE 1 END AS chg
                FROM @t
                ), cte2 AS (
                SELECT *, SUM(chg) OVER(ORDER BY tBegin) AS grp
                FROM cte1
                ), cte3 AS (
                SELECT *, ROW_NUMBER() OVER (PARTITION BY grp ORDER BY tBegin) AS rn
                FROM cte2
                )
                SELECT *
                FROM cte3
                WHERE rn = 1



                • The first cte assigns a "change flag" to each row where the value changed

                • The second cte uses a running sum to convert 1s and 0s to a number which can be used to group rows

                • Finally you number rows within each group and select first row per group


                Demo on DB Fiddle






                share|improve this answer




























                  1












                  1








                  1







                  You can use window functions for this:



                  DECLARE @t TABLE (id INT, tBegin DATETIME, orderId INT);
                  INSERT INTO @t VALUES
                  (1 , '2018-10-20 00:00:05', 1),
                  (2 , '2018-10-20 00:05:15', 1),
                  (3 , '2018-10-20 01:00:05', 1),
                  (10, '2018-10-20 10:03:05', 3),
                  (12, '2018-10-20 11:04:05', 8),
                  (20, '2018-10-20 14:15:05', 3),
                  (37, '2018-10-20 18:12:05', 1);

                  WITH cte1 AS (
                  SELECT *, CASE WHEN orderId = LAG(orderId) OVER (ORDER BY tBegin) THEN 0 ELSE 1 END AS chg
                  FROM @t
                  ), cte2 AS (
                  SELECT *, SUM(chg) OVER(ORDER BY tBegin) AS grp
                  FROM cte1
                  ), cte3 AS (
                  SELECT *, ROW_NUMBER() OVER (PARTITION BY grp ORDER BY tBegin) AS rn
                  FROM cte2
                  )
                  SELECT *
                  FROM cte3
                  WHERE rn = 1



                  • The first cte assigns a "change flag" to each row where the value changed

                  • The second cte uses a running sum to convert 1s and 0s to a number which can be used to group rows

                  • Finally you number rows within each group and select first row per group


                  Demo on DB Fiddle






                  share|improve this answer















                  You can use window functions for this:



                  DECLARE @t TABLE (id INT, tBegin DATETIME, orderId INT);
                  INSERT INTO @t VALUES
                  (1 , '2018-10-20 00:00:05', 1),
                  (2 , '2018-10-20 00:05:15', 1),
                  (3 , '2018-10-20 01:00:05', 1),
                  (10, '2018-10-20 10:03:05', 3),
                  (12, '2018-10-20 11:04:05', 8),
                  (20, '2018-10-20 14:15:05', 3),
                  (37, '2018-10-20 18:12:05', 1);

                  WITH cte1 AS (
                  SELECT *, CASE WHEN orderId = LAG(orderId) OVER (ORDER BY tBegin) THEN 0 ELSE 1 END AS chg
                  FROM @t
                  ), cte2 AS (
                  SELECT *, SUM(chg) OVER(ORDER BY tBegin) AS grp
                  FROM cte1
                  ), cte3 AS (
                  SELECT *, ROW_NUMBER() OVER (PARTITION BY grp ORDER BY tBegin) AS rn
                  FROM cte2
                  )
                  SELECT *
                  FROM cte3
                  WHERE rn = 1



                  • The first cte assigns a "change flag" to each row where the value changed

                  • The second cte uses a running sum to convert 1s and 0s to a number which can be used to group rows

                  • Finally you number rows within each group and select first row per group


                  Demo on DB Fiddle







                  share|improve this answer














                  share|improve this answer



                  share|improve this answer








                  edited Nov 19 '18 at 10:40

























                  answered Nov 19 '18 at 10:35









                  Salman ASalman A

                  178k66338430




                  178k66338430

























                      0














                      You can use cumulative approach :



                      select min(id) as id, max(tBegin), orderid, count(*) 
                      from (select h.*,
                      row_number() over (order by id) as seq1,
                      row_number() over (partition by orderid order by id) as seq2
                      from history h
                      ) h
                      group by orderid, (seq1 - seq2)
                      order by id;





                      share|improve this answer




























                        0














                        You can use cumulative approach :



                        select min(id) as id, max(tBegin), orderid, count(*) 
                        from (select h.*,
                        row_number() over (order by id) as seq1,
                        row_number() over (partition by orderid order by id) as seq2
                        from history h
                        ) h
                        group by orderid, (seq1 - seq2)
                        order by id;





                        share|improve this answer


























                          0












                          0








                          0







                          You can use cumulative approach :



                          select min(id) as id, max(tBegin), orderid, count(*) 
                          from (select h.*,
                          row_number() over (order by id) as seq1,
                          row_number() over (partition by orderid order by id) as seq2
                          from history h
                          ) h
                          group by orderid, (seq1 - seq2)
                          order by id;





                          share|improve this answer













                          You can use cumulative approach :



                          select min(id) as id, max(tBegin), orderid, count(*) 
                          from (select h.*,
                          row_number() over (order by id) as seq1,
                          row_number() over (partition by orderid order by id) as seq2
                          from history h
                          ) h
                          group by orderid, (seq1 - seq2)
                          order by id;






                          share|improve this answer












                          share|improve this answer



                          share|improve this answer










                          answered Nov 19 '18 at 10:46









                          Yogesh SharmaYogesh Sharma

                          30.1k51436




                          30.1k51436






























                              draft saved

                              draft discarded




















































                              Thanks for contributing an answer to Stack Overflow!


                              • Please be sure to answer the question. Provide details and share your research!

                              But avoid



                              • Asking for help, clarification, or responding to other answers.

                              • Making statements based on opinion; back them up with references or personal experience.


                              To learn more, see our tips on writing great answers.




                              draft saved


                              draft discarded














                              StackExchange.ready(
                              function () {
                              StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53372583%2fefficient-way-to-cluster-a-timeline-or-reconstruct-a-batch-number%23new-answer', 'question_page');
                              }
                              );

                              Post as a guest















                              Required, but never shown





















































                              Required, but never shown














                              Required, but never shown












                              Required, but never shown







                              Required, but never shown

































                              Required, but never shown














                              Required, but never shown












                              Required, but never shown







                              Required, but never shown







                              Popular posts from this blog

                              Guess what letter conforming each word

                              Port of Spain

                              Run scheduled task as local user group (not BUILTIN)