Spark UDF with varargs

Is it an only option to list all the arguments up to 22 as shown in documentation?

https://spark.apache.org/docs/1.5.0/api/scala/index.html#org.apache.spark.sql.UDFRegistration

Anyone figured out how to do something similar to this?

sc.udf.register("func", (s: String*) => s......

(writing custom concat function that skips nulls, had to 2 arguments at the time)

Thanks

edited Jan 14 at 9:33

Community♦

asked Oct 15 '15 at 14:56

devopslife

3361414

add a comment |

Is it an only option to list all the arguments up to 22 as shown in documentation?

https://spark.apache.org/docs/1.5.0/api/scala/index.html#org.apache.spark.sql.UDFRegistration

Anyone figured out how to do something similar to this?

sc.udf.register("func", (s: String*) => s......

(writing custom concat function that skips nulls, had to 2 arguments at the time)

Thanks

edited Jan 14 at 9:33

Community♦

asked Oct 15 '15 at 14:56

devopslife

3361414

add a comment |

Is it an only option to list all the arguments up to 22 as shown in documentation?

https://spark.apache.org/docs/1.5.0/api/scala/index.html#org.apache.spark.sql.UDFRegistration

Anyone figured out how to do something similar to this?

sc.udf.register("func", (s: String*) => s......

(writing custom concat function that skips nulls, had to 2 arguments at the time)

Thanks

edited Jan 14 at 9:33

Community♦

asked Oct 15 '15 at 14:56

devopslife

3361414

Is it an only option to list all the arguments up to 22 as shown in documentation?

https://spark.apache.org/docs/1.5.0/api/scala/index.html#org.apache.spark.sql.UDFRegistration

Anyone figured out how to do something similar to this?

sc.udf.register("func", (s: String*) => s......

(writing custom concat function that skips nulls, had to 2 arguments at the time)

Thanks

scala apache-spark apache-spark-sql user-defined-functions

edited Jan 14 at 9:33

Community♦

asked Oct 15 '15 at 14:56

devopslife

3361414

edited Jan 14 at 9:33

Community♦

asked Oct 15 '15 at 14:56

devopslife

3361414

edited Jan 14 at 9:33

Community♦

edited Jan 14 at 9:33

Community♦

edited Jan 14 at 9:33

Community♦

asked Oct 15 '15 at 14:56

devopslife

3361414

asked Oct 15 '15 at 14:56

devopslife

3361414

asked Oct 15 '15 at 14:56

devopslife

3361414

add a comment |

1 Answer
1

active

oldest

votes

UDFs don't support varargs* but you can pass an arbitrary number of columns wrapped using an array function:

import org.apache.spark.sql.functions.{udf, array, lit}



val myConcatFunc = (xs: Seq[Any], sep: String) => 

  xs.filter(_ != null).mkString(sep)



val myConcat = udf(myConcatFunc)

An example usage:

val  df = sc.parallelize(Seq(

  (null, "a", "b", "c"), ("d", null, null, "e")

)).toDF("x1", "x2", "x3", "x4")



val cols = array($"x1", $"x2", $"x3", $"x4")

val sep = lit("-")



df.select(myConcat(cols, sep).alias("concatenated")).show



// +------------+

// |concatenated|

// +------------+

// |       a-b-c|

// |         d-e|

// +------------+

With raw SQL:

df.registerTempTable("df")

sqlContext.udf.register("myConcat", myConcatFunc)



sqlContext.sql(

    "SELECT myConcat(array(x1, x2, x4), '.') AS concatenated FROM df"

).show



// +------------+

// |concatenated|

// +------------+

// |         a.c|

// |         d.e|

// +------------+

A slightly more complicated approach is not use UDF at all and compose SQL expressions with something roughly like this:

import org.apache.spark.sql.functions._

import org.apache.spark.sql.Column



def myConcatExpr(sep: String, cols: Column*) = regexp_replace(concat(

  cols.foldLeft(lit(""))(

    (acc, c) => when(c.isNotNull, concat(acc, c, lit(sep))).otherwise(acc)

  )

), s"($sep)?$$", "") 



df.select(

  myConcatExpr("-", $"x1", $"x2", $"x3", $"x4").alias("concatenated")

).show

// +------------+

// |concatenated|

// +------------+

// |       a-b-c|

// |         d-e|

// +------------+

but I doubt it is worth the effort unless you work with PySpark.

* If you pass a function using varargs it will be stripped from all the syntactic sugar and resulting UDF will expect an ArrayType. For example:

def f(s: String*) = s.mkString

udf(f _)

will be of type:

UserDefinedFunction(<function1>,StringType,List(ArrayType(StringType,true)))

edited Apr 29 '16 at 19:38

answered Oct 15 '15 at 15:07

zero323

166k40484576

Hi, Is there any way to get column name while concatenating...

– Kal
Aug 23 '16 at 12:27

No, unless you pass column names explicitly as literals.

– zero323
Aug 23 '16 at 13:31

Hey thanks, can you please share the syntax for the same

– Kal
Aug 23 '16 at 14:23

2

@Kalpesh array(df.columns.map(c => struct(lit(c), col(c)): _*) -> udf(xs: Seq[Row] => ???).

– zero323
Aug 23 '16 at 14:32

Pay attention to write array and not Array when calling the function

– Ameba Spugnosa
Nov 2 '16 at 9:16

add a comment |

Your Answer

StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});

}
});

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f33151866%2fspark-udf-with-varargs%23new-answer', 'question_page');
}
);

Post as a guest

Name

Required, but never shown

1 Answer
1

active

oldest

votes

1 Answer
1

active

oldest

votes

UDFs don't support varargs* but you can pass an arbitrary number of columns wrapped using an array function:

import org.apache.spark.sql.functions.{udf, array, lit}



val myConcatFunc = (xs: Seq[Any], sep: String) => 

  xs.filter(_ != null).mkString(sep)



val myConcat = udf(myConcatFunc)

An example usage:

val  df = sc.parallelize(Seq(

  (null, "a", "b", "c"), ("d", null, null, "e")

)).toDF("x1", "x2", "x3", "x4")



val cols = array($"x1", $"x2", $"x3", $"x4")

val sep = lit("-")



df.select(myConcat(cols, sep).alias("concatenated")).show



// +------------+

// |concatenated|

// +------------+

// |       a-b-c|

// |         d-e|

// +------------+

With raw SQL:

df.registerTempTable("df")

sqlContext.udf.register("myConcat", myConcatFunc)



sqlContext.sql(

    "SELECT myConcat(array(x1, x2, x4), '.') AS concatenated FROM df"

).show



// +------------+

// |concatenated|

// +------------+

// |         a.c|

// |         d.e|

// +------------+

A slightly more complicated approach is not use UDF at all and compose SQL expressions with something roughly like this:

import org.apache.spark.sql.functions._

import org.apache.spark.sql.Column



def myConcatExpr(sep: String, cols: Column*) = regexp_replace(concat(

  cols.foldLeft(lit(""))(

    (acc, c) => when(c.isNotNull, concat(acc, c, lit(sep))).otherwise(acc)

  )

), s"($sep)?$$", "") 



df.select(

  myConcatExpr("-", $"x1", $"x2", $"x3", $"x4").alias("concatenated")

).show

// +------------+

// |concatenated|

// +------------+

// |       a-b-c|

// |         d-e|

// +------------+

but I doubt it is worth the effort unless you work with PySpark.

* If you pass a function using varargs it will be stripped from all the syntactic sugar and resulting UDF will expect an ArrayType. For example:

def f(s: String*) = s.mkString

udf(f _)

will be of type:

UserDefinedFunction(<function1>,StringType,List(ArrayType(StringType,true)))

edited Apr 29 '16 at 19:38

answered Oct 15 '15 at 15:07

zero323

166k40484576

Hi, Is there any way to get column name while concatenating...

– Kal
Aug 23 '16 at 12:27

No, unless you pass column names explicitly as literals.

– zero323
Aug 23 '16 at 13:31

Hey thanks, can you please share the syntax for the same

– Kal
Aug 23 '16 at 14:23

2

@Kalpesh array(df.columns.map(c => struct(lit(c), col(c)): _*) -> udf(xs: Seq[Row] => ???).

– zero323
Aug 23 '16 at 14:32

Pay attention to write array and not Array when calling the function

– Ameba Spugnosa
Nov 2 '16 at 9:16

add a comment |

UDFs don't support varargs* but you can pass an arbitrary number of columns wrapped using an array function:

import org.apache.spark.sql.functions.{udf, array, lit}



val myConcatFunc = (xs: Seq[Any], sep: String) => 

  xs.filter(_ != null).mkString(sep)



val myConcat = udf(myConcatFunc)

An example usage:

val  df = sc.parallelize(Seq(

  (null, "a", "b", "c"), ("d", null, null, "e")

)).toDF("x1", "x2", "x3", "x4")



val cols = array($"x1", $"x2", $"x3", $"x4")

val sep = lit("-")



df.select(myConcat(cols, sep).alias("concatenated")).show



// +------------+

// |concatenated|

// +------------+

// |       a-b-c|

// |         d-e|

// +------------+

With raw SQL:

df.registerTempTable("df")

sqlContext.udf.register("myConcat", myConcatFunc)



sqlContext.sql(

    "SELECT myConcat(array(x1, x2, x4), '.') AS concatenated FROM df"

).show



// +------------+

// |concatenated|

// +------------+

// |         a.c|

// |         d.e|

// +------------+

A slightly more complicated approach is not use UDF at all and compose SQL expressions with something roughly like this:

import org.apache.spark.sql.functions._

import org.apache.spark.sql.Column



def myConcatExpr(sep: String, cols: Column*) = regexp_replace(concat(

  cols.foldLeft(lit(""))(

    (acc, c) => when(c.isNotNull, concat(acc, c, lit(sep))).otherwise(acc)

  )

), s"($sep)?$$", "") 



df.select(

  myConcatExpr("-", $"x1", $"x2", $"x3", $"x4").alias("concatenated")

).show

// +------------+

// |concatenated|

// +------------+

// |       a-b-c|

// |         d-e|

// +------------+

but I doubt it is worth the effort unless you work with PySpark.

* If you pass a function using varargs it will be stripped from all the syntactic sugar and resulting UDF will expect an ArrayType. For example:

def f(s: String*) = s.mkString

udf(f _)

will be of type:

UserDefinedFunction(<function1>,StringType,List(ArrayType(StringType,true)))

edited Apr 29 '16 at 19:38

answered Oct 15 '15 at 15:07

zero323

166k40484576

Hi, Is there any way to get column name while concatenating...

– Kal
Aug 23 '16 at 12:27

No, unless you pass column names explicitly as literals.

– zero323
Aug 23 '16 at 13:31

Hey thanks, can you please share the syntax for the same

– Kal
Aug 23 '16 at 14:23

2

@Kalpesh array(df.columns.map(c => struct(lit(c), col(c)): _*) -> udf(xs: Seq[Row] => ???).

– zero323
Aug 23 '16 at 14:32

Pay attention to write array and not Array when calling the function

– Ameba Spugnosa
Nov 2 '16 at 9:16

add a comment |

UDFs don't support varargs* but you can pass an arbitrary number of columns wrapped using an array function:

import org.apache.spark.sql.functions.{udf, array, lit}



val myConcatFunc = (xs: Seq[Any], sep: String) => 

  xs.filter(_ != null).mkString(sep)



val myConcat = udf(myConcatFunc)

An example usage:

val  df = sc.parallelize(Seq(

  (null, "a", "b", "c"), ("d", null, null, "e")

)).toDF("x1", "x2", "x3", "x4")



val cols = array($"x1", $"x2", $"x3", $"x4")

val sep = lit("-")



df.select(myConcat(cols, sep).alias("concatenated")).show



// +------------+

// |concatenated|

// +------------+

// |       a-b-c|

// |         d-e|

// +------------+

With raw SQL:

df.registerTempTable("df")

sqlContext.udf.register("myConcat", myConcatFunc)



sqlContext.sql(

    "SELECT myConcat(array(x1, x2, x4), '.') AS concatenated FROM df"

).show



// +------------+

// |concatenated|

// +------------+

// |         a.c|

// |         d.e|

// +------------+

A slightly more complicated approach is not use UDF at all and compose SQL expressions with something roughly like this:

import org.apache.spark.sql.functions._

import org.apache.spark.sql.Column



def myConcatExpr(sep: String, cols: Column*) = regexp_replace(concat(

  cols.foldLeft(lit(""))(

    (acc, c) => when(c.isNotNull, concat(acc, c, lit(sep))).otherwise(acc)

  )

), s"($sep)?$$", "") 



df.select(

  myConcatExpr("-", $"x1", $"x2", $"x3", $"x4").alias("concatenated")

).show

// +------------+

// |concatenated|

// +------------+

// |       a-b-c|

// |         d-e|

// +------------+

but I doubt it is worth the effort unless you work with PySpark.

* If you pass a function using varargs it will be stripped from all the syntactic sugar and resulting UDF will expect an ArrayType. For example:

def f(s: String*) = s.mkString

udf(f _)

will be of type:

UserDefinedFunction(<function1>,StringType,List(ArrayType(StringType,true)))

edited Apr 29 '16 at 19:38

answered Oct 15 '15 at 15:07

zero323

166k40484576

UDFs don't support varargs* but you can pass an arbitrary number of columns wrapped using an array function:

import org.apache.spark.sql.functions.{udf, array, lit}



val myConcatFunc = (xs: Seq[Any], sep: String) => 

  xs.filter(_ != null).mkString(sep)



val myConcat = udf(myConcatFunc)

An example usage:

val  df = sc.parallelize(Seq(

  (null, "a", "b", "c"), ("d", null, null, "e")

)).toDF("x1", "x2", "x3", "x4")



val cols = array($"x1", $"x2", $"x3", $"x4")

val sep = lit("-")



df.select(myConcat(cols, sep).alias("concatenated")).show



// +------------+

// |concatenated|

// +------------+

// |       a-b-c|

// |         d-e|

// +------------+

With raw SQL:

df.registerTempTable("df")

sqlContext.udf.register("myConcat", myConcatFunc)



sqlContext.sql(

    "SELECT myConcat(array(x1, x2, x4), '.') AS concatenated FROM df"

).show



// +------------+

// |concatenated|

// +------------+

// |         a.c|

// |         d.e|

// +------------+

A slightly more complicated approach is not use UDF at all and compose SQL expressions with something roughly like this:

import org.apache.spark.sql.functions._

import org.apache.spark.sql.Column



def myConcatExpr(sep: String, cols: Column*) = regexp_replace(concat(

  cols.foldLeft(lit(""))(

    (acc, c) => when(c.isNotNull, concat(acc, c, lit(sep))).otherwise(acc)

  )

), s"($sep)?$$", "") 



df.select(

  myConcatExpr("-", $"x1", $"x2", $"x3", $"x4").alias("concatenated")

).show

// +------------+

// |concatenated|

// +------------+

// |       a-b-c|

// |         d-e|

// +------------+

but I doubt it is worth the effort unless you work with PySpark.

* If you pass a function using varargs it will be stripped from all the syntactic sugar and resulting UDF will expect an ArrayType. For example:

def f(s: String*) = s.mkString

udf(f _)

will be of type:

UserDefinedFunction(<function1>,StringType,List(ArrayType(StringType,true)))

edited Apr 29 '16 at 19:38

answered Oct 15 '15 at 15:07

zero323

166k40484576

edited Apr 29 '16 at 19:38

answered Oct 15 '15 at 15:07

zero323

166k40484576

answered Oct 15 '15 at 15:07

zero323

166k40484576

answered Oct 15 '15 at 15:07

zero323

166k40484576

Hi, Is there any way to get column name while concatenating...

– Kal
Aug 23 '16 at 12:27

No, unless you pass column names explicitly as literals.

– zero323
Aug 23 '16 at 13:31

Hey thanks, can you please share the syntax for the same

– Kal
Aug 23 '16 at 14:23

2

@Kalpesh array(df.columns.map(c => struct(lit(c), col(c)): _*) -> udf(xs: Seq[Row] => ???).

– zero323
Aug 23 '16 at 14:32

Pay attention to write array and not Array when calling the function

– Ameba Spugnosa
Nov 2 '16 at 9:16

add a comment |

Hi, Is there any way to get column name while concatenating...

– Kal
Aug 23 '16 at 12:27

No, unless you pass column names explicitly as literals.

– zero323
Aug 23 '16 at 13:31

Hey thanks, can you please share the syntax for the same

– Kal
Aug 23 '16 at 14:23

2

@Kalpesh array(df.columns.map(c => struct(lit(c), col(c)): _*) -> udf(xs: Seq[Row] => ???).

– zero323
Aug 23 '16 at 14:32

Pay attention to write array and not Array when calling the function

– Ameba Spugnosa
Nov 2 '16 at 9:16

Hi, Is there any way to get column name while concatenating...

– Kal
Aug 23 '16 at 12:27

No, unless you pass column names explicitly as literals.

– zero323
Aug 23 '16 at 13:31

Hey thanks, can you please share the syntax for the same

– Kal
Aug 23 '16 at 14:23

@Kalpesh array(df.columns.map(c => struct(lit(c), col(c)): _*) -> udf(xs: Seq[Row] => ???).

– zero323
Aug 23 '16 at 14:32

Pay attention to write array and not Array when calling the function

– Ameba Spugnosa
Nov 2 '16 at 9:16

add a comment |

draft saved

draft discarded

Thanks for contributing an answer to Stack Overflow!

Please be sure to answer the question. Provide details and share your research!

But avoid …

Asking for help, clarification, or responding to other answers.

Making statements based on opinion; back them up with references or personal experience.

To learn more, see our tips on writing great answers.

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Name

Required, but never shown

Name

Required, but never shown

This page is only for reference, If you need detailed information, please check here

搜尋此網誌

Agfdhyk