Spark UDF with varargs












16















Is it an only option to list all the arguments up to 22 as shown in documentation?



https://spark.apache.org/docs/1.5.0/api/scala/index.html#org.apache.spark.sql.UDFRegistration



Anyone figured out how to do something similar to this?



sc.udf.register("func", (s: String*) => s......


(writing custom concat function that skips nulls, had to 2 arguments at the time)



Thanks










share|improve this question





























    16















    Is it an only option to list all the arguments up to 22 as shown in documentation?



    https://spark.apache.org/docs/1.5.0/api/scala/index.html#org.apache.spark.sql.UDFRegistration



    Anyone figured out how to do something similar to this?



    sc.udf.register("func", (s: String*) => s......


    (writing custom concat function that skips nulls, had to 2 arguments at the time)



    Thanks










    share|improve this question



























      16












      16








      16


      9






      Is it an only option to list all the arguments up to 22 as shown in documentation?



      https://spark.apache.org/docs/1.5.0/api/scala/index.html#org.apache.spark.sql.UDFRegistration



      Anyone figured out how to do something similar to this?



      sc.udf.register("func", (s: String*) => s......


      (writing custom concat function that skips nulls, had to 2 arguments at the time)



      Thanks










      share|improve this question
















      Is it an only option to list all the arguments up to 22 as shown in documentation?



      https://spark.apache.org/docs/1.5.0/api/scala/index.html#org.apache.spark.sql.UDFRegistration



      Anyone figured out how to do something similar to this?



      sc.udf.register("func", (s: String*) => s......


      (writing custom concat function that skips nulls, had to 2 arguments at the time)



      Thanks







      scala apache-spark apache-spark-sql user-defined-functions






      share|improve this question















      share|improve this question













      share|improve this question




      share|improve this question








      edited Jan 14 at 9:33









      Community

      11




      11










      asked Oct 15 '15 at 14:56









      devopslifedevopslife

      3361414




      3361414
























          1 Answer
          1






          active

          oldest

          votes


















          36














          UDFs don't support varargs* but you can pass an arbitrary number of columns wrapped using an array function:



          import org.apache.spark.sql.functions.{udf, array, lit}

          val myConcatFunc = (xs: Seq[Any], sep: String) =>
          xs.filter(_ != null).mkString(sep)

          val myConcat = udf(myConcatFunc)


          An example usage:



          val  df = sc.parallelize(Seq(
          (null, "a", "b", "c"), ("d", null, null, "e")
          )).toDF("x1", "x2", "x3", "x4")

          val cols = array($"x1", $"x2", $"x3", $"x4")
          val sep = lit("-")

          df.select(myConcat(cols, sep).alias("concatenated")).show

          // +------------+
          // |concatenated|
          // +------------+
          // | a-b-c|
          // | d-e|
          // +------------+


          With raw SQL:



          df.registerTempTable("df")
          sqlContext.udf.register("myConcat", myConcatFunc)

          sqlContext.sql(
          "SELECT myConcat(array(x1, x2, x4), '.') AS concatenated FROM df"
          ).show

          // +------------+
          // |concatenated|
          // +------------+
          // | a.c|
          // | d.e|
          // +------------+


          A slightly more complicated approach is not use UDF at all and compose SQL expressions with something roughly like this:



          import org.apache.spark.sql.functions._
          import org.apache.spark.sql.Column

          def myConcatExpr(sep: String, cols: Column*) = regexp_replace(concat(
          cols.foldLeft(lit(""))(
          (acc, c) => when(c.isNotNull, concat(acc, c, lit(sep))).otherwise(acc)
          )
          ), s"($sep)?$$", "")

          df.select(
          myConcatExpr("-", $"x1", $"x2", $"x3", $"x4").alias("concatenated")
          ).show
          // +------------+
          // |concatenated|
          // +------------+
          // | a-b-c|
          // | d-e|
          // +------------+


          but I doubt it is worth the effort unless you work with PySpark.





          * If you pass a function using varargs it will be stripped from all the syntactic sugar and resulting UDF will expect an ArrayType. For example:



          def f(s: String*) = s.mkString
          udf(f _)


          will be of type:



          UserDefinedFunction(<function1>,StringType,List(ArrayType(StringType,true)))





          share|improve this answer


























          • Hi, Is there any way to get column name while concatenating...

            – Kal
            Aug 23 '16 at 12:27











          • No, unless you pass column names explicitly as literals.

            – zero323
            Aug 23 '16 at 13:31











          • Hey thanks, can you please share the syntax for the same

            – Kal
            Aug 23 '16 at 14:23






          • 2





            @Kalpesh array(df.columns.map(c => struct(lit(c), col(c)): _*) -> udf(xs: Seq[Row] => ???).

            – zero323
            Aug 23 '16 at 14:32













          • Pay attention to write array and not Array when calling the function

            – Ameba Spugnosa
            Nov 2 '16 at 9:16











          Your Answer






          StackExchange.ifUsing("editor", function () {
          StackExchange.using("externalEditor", function () {
          StackExchange.using("snippets", function () {
          StackExchange.snippets.init();
          });
          });
          }, "code-snippets");

          StackExchange.ready(function() {
          var channelOptions = {
          tags: "".split(" "),
          id: "1"
          };
          initTagRenderer("".split(" "), "".split(" "), channelOptions);

          StackExchange.using("externalEditor", function() {
          // Have to fire editor after snippets, if snippets enabled
          if (StackExchange.settings.snippets.snippetsEnabled) {
          StackExchange.using("snippets", function() {
          createEditor();
          });
          }
          else {
          createEditor();
          }
          });

          function createEditor() {
          StackExchange.prepareEditor({
          heartbeatType: 'answer',
          autoActivateHeartbeat: false,
          convertImagesToLinks: true,
          noModals: true,
          showLowRepImageUploadWarning: true,
          reputationToPostImages: 10,
          bindNavPrevention: true,
          postfix: "",
          imageUploader: {
          brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
          contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
          allowUrls: true
          },
          onDemand: true,
          discardSelector: ".discard-answer"
          ,immediatelyShowMarkdownHelp:true
          });


          }
          });














          draft saved

          draft discarded


















          StackExchange.ready(
          function () {
          StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f33151866%2fspark-udf-with-varargs%23new-answer', 'question_page');
          }
          );

          Post as a guest















          Required, but never shown

























          1 Answer
          1






          active

          oldest

          votes








          1 Answer
          1






          active

          oldest

          votes









          active

          oldest

          votes






          active

          oldest

          votes









          36














          UDFs don't support varargs* but you can pass an arbitrary number of columns wrapped using an array function:



          import org.apache.spark.sql.functions.{udf, array, lit}

          val myConcatFunc = (xs: Seq[Any], sep: String) =>
          xs.filter(_ != null).mkString(sep)

          val myConcat = udf(myConcatFunc)


          An example usage:



          val  df = sc.parallelize(Seq(
          (null, "a", "b", "c"), ("d", null, null, "e")
          )).toDF("x1", "x2", "x3", "x4")

          val cols = array($"x1", $"x2", $"x3", $"x4")
          val sep = lit("-")

          df.select(myConcat(cols, sep).alias("concatenated")).show

          // +------------+
          // |concatenated|
          // +------------+
          // | a-b-c|
          // | d-e|
          // +------------+


          With raw SQL:



          df.registerTempTable("df")
          sqlContext.udf.register("myConcat", myConcatFunc)

          sqlContext.sql(
          "SELECT myConcat(array(x1, x2, x4), '.') AS concatenated FROM df"
          ).show

          // +------------+
          // |concatenated|
          // +------------+
          // | a.c|
          // | d.e|
          // +------------+


          A slightly more complicated approach is not use UDF at all and compose SQL expressions with something roughly like this:



          import org.apache.spark.sql.functions._
          import org.apache.spark.sql.Column

          def myConcatExpr(sep: String, cols: Column*) = regexp_replace(concat(
          cols.foldLeft(lit(""))(
          (acc, c) => when(c.isNotNull, concat(acc, c, lit(sep))).otherwise(acc)
          )
          ), s"($sep)?$$", "")

          df.select(
          myConcatExpr("-", $"x1", $"x2", $"x3", $"x4").alias("concatenated")
          ).show
          // +------------+
          // |concatenated|
          // +------------+
          // | a-b-c|
          // | d-e|
          // +------------+


          but I doubt it is worth the effort unless you work with PySpark.





          * If you pass a function using varargs it will be stripped from all the syntactic sugar and resulting UDF will expect an ArrayType. For example:



          def f(s: String*) = s.mkString
          udf(f _)


          will be of type:



          UserDefinedFunction(<function1>,StringType,List(ArrayType(StringType,true)))





          share|improve this answer


























          • Hi, Is there any way to get column name while concatenating...

            – Kal
            Aug 23 '16 at 12:27











          • No, unless you pass column names explicitly as literals.

            – zero323
            Aug 23 '16 at 13:31











          • Hey thanks, can you please share the syntax for the same

            – Kal
            Aug 23 '16 at 14:23






          • 2





            @Kalpesh array(df.columns.map(c => struct(lit(c), col(c)): _*) -> udf(xs: Seq[Row] => ???).

            – zero323
            Aug 23 '16 at 14:32













          • Pay attention to write array and not Array when calling the function

            – Ameba Spugnosa
            Nov 2 '16 at 9:16
















          36














          UDFs don't support varargs* but you can pass an arbitrary number of columns wrapped using an array function:



          import org.apache.spark.sql.functions.{udf, array, lit}

          val myConcatFunc = (xs: Seq[Any], sep: String) =>
          xs.filter(_ != null).mkString(sep)

          val myConcat = udf(myConcatFunc)


          An example usage:



          val  df = sc.parallelize(Seq(
          (null, "a", "b", "c"), ("d", null, null, "e")
          )).toDF("x1", "x2", "x3", "x4")

          val cols = array($"x1", $"x2", $"x3", $"x4")
          val sep = lit("-")

          df.select(myConcat(cols, sep).alias("concatenated")).show

          // +------------+
          // |concatenated|
          // +------------+
          // | a-b-c|
          // | d-e|
          // +------------+


          With raw SQL:



          df.registerTempTable("df")
          sqlContext.udf.register("myConcat", myConcatFunc)

          sqlContext.sql(
          "SELECT myConcat(array(x1, x2, x4), '.') AS concatenated FROM df"
          ).show

          // +------------+
          // |concatenated|
          // +------------+
          // | a.c|
          // | d.e|
          // +------------+


          A slightly more complicated approach is not use UDF at all and compose SQL expressions with something roughly like this:



          import org.apache.spark.sql.functions._
          import org.apache.spark.sql.Column

          def myConcatExpr(sep: String, cols: Column*) = regexp_replace(concat(
          cols.foldLeft(lit(""))(
          (acc, c) => when(c.isNotNull, concat(acc, c, lit(sep))).otherwise(acc)
          )
          ), s"($sep)?$$", "")

          df.select(
          myConcatExpr("-", $"x1", $"x2", $"x3", $"x4").alias("concatenated")
          ).show
          // +------------+
          // |concatenated|
          // +------------+
          // | a-b-c|
          // | d-e|
          // +------------+


          but I doubt it is worth the effort unless you work with PySpark.





          * If you pass a function using varargs it will be stripped from all the syntactic sugar and resulting UDF will expect an ArrayType. For example:



          def f(s: String*) = s.mkString
          udf(f _)


          will be of type:



          UserDefinedFunction(<function1>,StringType,List(ArrayType(StringType,true)))





          share|improve this answer


























          • Hi, Is there any way to get column name while concatenating...

            – Kal
            Aug 23 '16 at 12:27











          • No, unless you pass column names explicitly as literals.

            – zero323
            Aug 23 '16 at 13:31











          • Hey thanks, can you please share the syntax for the same

            – Kal
            Aug 23 '16 at 14:23






          • 2





            @Kalpesh array(df.columns.map(c => struct(lit(c), col(c)): _*) -> udf(xs: Seq[Row] => ???).

            – zero323
            Aug 23 '16 at 14:32













          • Pay attention to write array and not Array when calling the function

            – Ameba Spugnosa
            Nov 2 '16 at 9:16














          36












          36








          36







          UDFs don't support varargs* but you can pass an arbitrary number of columns wrapped using an array function:



          import org.apache.spark.sql.functions.{udf, array, lit}

          val myConcatFunc = (xs: Seq[Any], sep: String) =>
          xs.filter(_ != null).mkString(sep)

          val myConcat = udf(myConcatFunc)


          An example usage:



          val  df = sc.parallelize(Seq(
          (null, "a", "b", "c"), ("d", null, null, "e")
          )).toDF("x1", "x2", "x3", "x4")

          val cols = array($"x1", $"x2", $"x3", $"x4")
          val sep = lit("-")

          df.select(myConcat(cols, sep).alias("concatenated")).show

          // +------------+
          // |concatenated|
          // +------------+
          // | a-b-c|
          // | d-e|
          // +------------+


          With raw SQL:



          df.registerTempTable("df")
          sqlContext.udf.register("myConcat", myConcatFunc)

          sqlContext.sql(
          "SELECT myConcat(array(x1, x2, x4), '.') AS concatenated FROM df"
          ).show

          // +------------+
          // |concatenated|
          // +------------+
          // | a.c|
          // | d.e|
          // +------------+


          A slightly more complicated approach is not use UDF at all and compose SQL expressions with something roughly like this:



          import org.apache.spark.sql.functions._
          import org.apache.spark.sql.Column

          def myConcatExpr(sep: String, cols: Column*) = regexp_replace(concat(
          cols.foldLeft(lit(""))(
          (acc, c) => when(c.isNotNull, concat(acc, c, lit(sep))).otherwise(acc)
          )
          ), s"($sep)?$$", "")

          df.select(
          myConcatExpr("-", $"x1", $"x2", $"x3", $"x4").alias("concatenated")
          ).show
          // +------------+
          // |concatenated|
          // +------------+
          // | a-b-c|
          // | d-e|
          // +------------+


          but I doubt it is worth the effort unless you work with PySpark.





          * If you pass a function using varargs it will be stripped from all the syntactic sugar and resulting UDF will expect an ArrayType. For example:



          def f(s: String*) = s.mkString
          udf(f _)


          will be of type:



          UserDefinedFunction(<function1>,StringType,List(ArrayType(StringType,true)))





          share|improve this answer















          UDFs don't support varargs* but you can pass an arbitrary number of columns wrapped using an array function:



          import org.apache.spark.sql.functions.{udf, array, lit}

          val myConcatFunc = (xs: Seq[Any], sep: String) =>
          xs.filter(_ != null).mkString(sep)

          val myConcat = udf(myConcatFunc)


          An example usage:



          val  df = sc.parallelize(Seq(
          (null, "a", "b", "c"), ("d", null, null, "e")
          )).toDF("x1", "x2", "x3", "x4")

          val cols = array($"x1", $"x2", $"x3", $"x4")
          val sep = lit("-")

          df.select(myConcat(cols, sep).alias("concatenated")).show

          // +------------+
          // |concatenated|
          // +------------+
          // | a-b-c|
          // | d-e|
          // +------------+


          With raw SQL:



          df.registerTempTable("df")
          sqlContext.udf.register("myConcat", myConcatFunc)

          sqlContext.sql(
          "SELECT myConcat(array(x1, x2, x4), '.') AS concatenated FROM df"
          ).show

          // +------------+
          // |concatenated|
          // +------------+
          // | a.c|
          // | d.e|
          // +------------+


          A slightly more complicated approach is not use UDF at all and compose SQL expressions with something roughly like this:



          import org.apache.spark.sql.functions._
          import org.apache.spark.sql.Column

          def myConcatExpr(sep: String, cols: Column*) = regexp_replace(concat(
          cols.foldLeft(lit(""))(
          (acc, c) => when(c.isNotNull, concat(acc, c, lit(sep))).otherwise(acc)
          )
          ), s"($sep)?$$", "")

          df.select(
          myConcatExpr("-", $"x1", $"x2", $"x3", $"x4").alias("concatenated")
          ).show
          // +------------+
          // |concatenated|
          // +------------+
          // | a-b-c|
          // | d-e|
          // +------------+


          but I doubt it is worth the effort unless you work with PySpark.





          * If you pass a function using varargs it will be stripped from all the syntactic sugar and resulting UDF will expect an ArrayType. For example:



          def f(s: String*) = s.mkString
          udf(f _)


          will be of type:



          UserDefinedFunction(<function1>,StringType,List(ArrayType(StringType,true)))






          share|improve this answer














          share|improve this answer



          share|improve this answer








          edited Apr 29 '16 at 19:38

























          answered Oct 15 '15 at 15:07









          zero323zero323

          166k40484576




          166k40484576













          • Hi, Is there any way to get column name while concatenating...

            – Kal
            Aug 23 '16 at 12:27











          • No, unless you pass column names explicitly as literals.

            – zero323
            Aug 23 '16 at 13:31











          • Hey thanks, can you please share the syntax for the same

            – Kal
            Aug 23 '16 at 14:23






          • 2





            @Kalpesh array(df.columns.map(c => struct(lit(c), col(c)): _*) -> udf(xs: Seq[Row] => ???).

            – zero323
            Aug 23 '16 at 14:32













          • Pay attention to write array and not Array when calling the function

            – Ameba Spugnosa
            Nov 2 '16 at 9:16



















          • Hi, Is there any way to get column name while concatenating...

            – Kal
            Aug 23 '16 at 12:27











          • No, unless you pass column names explicitly as literals.

            – zero323
            Aug 23 '16 at 13:31











          • Hey thanks, can you please share the syntax for the same

            – Kal
            Aug 23 '16 at 14:23






          • 2





            @Kalpesh array(df.columns.map(c => struct(lit(c), col(c)): _*) -> udf(xs: Seq[Row] => ???).

            – zero323
            Aug 23 '16 at 14:32













          • Pay attention to write array and not Array when calling the function

            – Ameba Spugnosa
            Nov 2 '16 at 9:16

















          Hi, Is there any way to get column name while concatenating...

          – Kal
          Aug 23 '16 at 12:27





          Hi, Is there any way to get column name while concatenating...

          – Kal
          Aug 23 '16 at 12:27













          No, unless you pass column names explicitly as literals.

          – zero323
          Aug 23 '16 at 13:31





          No, unless you pass column names explicitly as literals.

          – zero323
          Aug 23 '16 at 13:31













          Hey thanks, can you please share the syntax for the same

          – Kal
          Aug 23 '16 at 14:23





          Hey thanks, can you please share the syntax for the same

          – Kal
          Aug 23 '16 at 14:23




          2




          2





          @Kalpesh array(df.columns.map(c => struct(lit(c), col(c)): _*) -> udf(xs: Seq[Row] => ???).

          – zero323
          Aug 23 '16 at 14:32







          @Kalpesh array(df.columns.map(c => struct(lit(c), col(c)): _*) -> udf(xs: Seq[Row] => ???).

          – zero323
          Aug 23 '16 at 14:32















          Pay attention to write array and not Array when calling the function

          – Ameba Spugnosa
          Nov 2 '16 at 9:16





          Pay attention to write array and not Array when calling the function

          – Ameba Spugnosa
          Nov 2 '16 at 9:16


















          draft saved

          draft discarded




















































          Thanks for contributing an answer to Stack Overflow!


          • Please be sure to answer the question. Provide details and share your research!

          But avoid



          • Asking for help, clarification, or responding to other answers.

          • Making statements based on opinion; back them up with references or personal experience.


          To learn more, see our tips on writing great answers.




          draft saved


          draft discarded














          StackExchange.ready(
          function () {
          StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f33151866%2fspark-udf-with-varargs%23new-answer', 'question_page');
          }
          );

          Post as a guest















          Required, but never shown





















































          Required, but never shown














          Required, but never shown












          Required, but never shown







          Required, but never shown

































          Required, but never shown














          Required, but never shown












          Required, but never shown







          Required, but never shown







          Popular posts from this blog

          Guess what letter conforming each word

          Port of Spain

          Run scheduled task as local user group (not BUILTIN)