Spark UDF with varargs
Is the only option to list all the arguments up to 22, as shown in the documentation?
https://spark.apache.org/docs/1.5.0/api/scala/index.html#org.apache.spark.sql.UDFRegistration
Has anyone figured out how to do something similar to this?

sqlContext.udf.register("func", (s: String*) => s......

(I am writing a custom concat function that skips nulls, and so far have had to handle 2 arguments at a time.)
Thanks
scala apache-spark apache-spark-sql user-defined-functions
asked Oct 15 '15 at 14:56 – devopslife
1 Answer
UDFs don't support varargs*, but you can pass an arbitrary number of columns wrapped with the array function:
import org.apache.spark.sql.functions.{udf, array, lit}

val myConcatFunc = (xs: Seq[Any], sep: String) =>
  xs.filter(_ != null).mkString(sep)

val myConcat = udf(myConcatFunc)
An example usage:
val df = sc.parallelize(Seq(
  (null, "a", "b", "c"), ("d", null, null, "e")
)).toDF("x1", "x2", "x3", "x4")

val cols = array($"x1", $"x2", $"x3", $"x4")
val sep = lit("-")

df.select(myConcat(cols, sep).alias("concatenated")).show
// +------------+
// |concatenated|
// +------------+
// | a-b-c|
// | d-e|
// +------------+
With raw SQL:
df.registerTempTable("df")
sqlContext.udf.register("myConcat", myConcatFunc)
sqlContext.sql(
"SELECT myConcat(array(x1, x2, x4), '.') AS concatenated FROM df"
).show
// +------------+
// |concatenated|
// +------------+
// | a.c|
// | d.e|
// +------------+
A slightly more complicated approach is to not use a UDF at all and to compose SQL expressions, with something roughly like this:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.Column
def myConcatExpr(sep: String, cols: Column*) = regexp_replace(concat(
  cols.foldLeft(lit(""))(
    (acc, c) => when(c.isNotNull, concat(acc, c, lit(sep))).otherwise(acc)
  )
), s"($sep)?$$", "")

df.select(
  myConcatExpr("-", $"x1", $"x2", $"x3", $"x4").alias("concatenated")
).show
// +------------+
// |concatenated|
// +------------+
// | a-b-c|
// | d-e|
// +------------+
The trailing regexp_replace strips the separator that the fold leaves at the end (note that this assumes sep contains no regex metacharacters). But I doubt it is worth the effort unless you work with PySpark, where composing Column expressions avoids the overhead of Python UDFs.
* If you pass a function using varargs, it will be stripped of all the syntactic sugar and the resulting UDF will expect an ArrayType. For example:
def f(s: String*) = s.mkString
udf(f _)
will be of type:
UserDefinedFunction(<function1>,StringType,List(ArrayType(StringType,true)))
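As a minimal sketch (the fUdf name and the "joined" alias are mine, and the x1/x2 columns are the ones from the example df above), such a UDF can still be applied by wrapping the columns with array:

import org.apache.spark.sql.functions.{udf, array}

def f(s: String*) = s.mkString
// f _ eta-expands to Seq[String] => String, so the UDF takes a single ArrayType argument
val fUdf = udf(f _)

// note: unlike myConcat above, null elements are not skipped here
df.select(fUdf(array($"x1", $"x2")).alias("joined")).show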
answered Oct 15 '15 at 15:07, edited Apr 29 '16 at 19:38 – zero323

Hi, is there any way to get the column name while concatenating? – Kal, Aug 23 '16 at 12:27

No, unless you pass column names explicitly as literals. – zero323, Aug 23 '16 at 13:31

Hey thanks, can you please share the syntax for the same? – Kal, Aug 23 '16 at 14:23

@Kalpesh array(df.columns.map(c => struct(lit(c), col(c))): _*) -> udf((xs: Seq[Row]) => ???) (see the sketch after these comments). – zero323, Aug 23 '16 at 14:32

Pay attention to write array and not Array when calling the function. – Ameba Spugnosa, Nov 2 '16 at 9:16
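Expanding that comment into a runnable sketch (the pairs and withNames names are mine, and I assume all columns are strings): wrap each column in a struct that carries its own name as a literal, then pattern-match on Seq[Row] inside the UDF:

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{udf, array, struct, lit, col}

// Each array element is a (columnName, value) pair packed into a struct.
val pairs = array(df.columns.map(c => struct(lit(c), col(c))): _*)

// Null values fail the `value: String` pattern, so collect skips them.
val withNames = udf { (xs: Seq[Row]) =>
  xs.collect { case Row(name: String, value: String) => s"$name=$value" }
    .mkString("-")
}

df.select(withNames(pairs).alias("concatenated")).show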