Spark SQL - createDataFrame wrong struct schema












0















When trying to create a DataFrame with Spark SQL by passing it a list of Rows like so:



some_data = [{'some-column': [{'timestamp': 1353534535353, 'strVal': 'some-string'}]},
{'some-column': [{'timestamp': 1353534535354, 'strVal': 'another-string'}]}]
spark.createDataFrame([Row(**d) for d in some_data]).printSchema()


The resulting DataFrame's schema is:



root
|-- some-column: array (nullable = true)
| |-- element: map (containsNull = true)
| | |-- key: string
| | |-- value: long (valueContainsNull = true)


This schema is wrong, as strVal column is of string type (and indeed collecting on this DataFrame would result in nulls on this column).



I'd expect for the schema to be an Array of appropriate Structs - inferred with a bit of Python reflection on the types of values.
Why is this not the case?
Is there anything I can do besides providing the schema explicitly in this case?










share|improve this question





























    0















    When trying to create a DataFrame with Spark SQL by passing it a list of Rows like so:



    some_data = [{'some-column': [{'timestamp': 1353534535353, 'strVal': 'some-string'}]},
    {'some-column': [{'timestamp': 1353534535354, 'strVal': 'another-string'}]}]
    spark.createDataFrame([Row(**d) for d in some_data]).printSchema()


    The resulting DataFrame's schema is:



    root
    |-- some-column: array (nullable = true)
    | |-- element: map (containsNull = true)
    | | |-- key: string
    | | |-- value: long (valueContainsNull = true)


    This schema is wrong, as strVal column is of string type (and indeed collecting on this DataFrame would result in nulls on this column).



    I'd expect for the schema to be an Array of appropriate Structs - inferred with a bit of Python reflection on the types of values.
    Why is this not the case?
    Is there anything I can do besides providing the schema explicitly in this case?










    share|improve this question



























      0












      0








      0








      When trying to create a DataFrame with Spark SQL by passing it a list of Rows like so:



      some_data = [{'some-column': [{'timestamp': 1353534535353, 'strVal': 'some-string'}]},
      {'some-column': [{'timestamp': 1353534535354, 'strVal': 'another-string'}]}]
      spark.createDataFrame([Row(**d) for d in some_data]).printSchema()


      The resulting DataFrame's schema is:



      root
      |-- some-column: array (nullable = true)
      | |-- element: map (containsNull = true)
      | | |-- key: string
      | | |-- value: long (valueContainsNull = true)


      This schema is wrong, as strVal column is of string type (and indeed collecting on this DataFrame would result in nulls on this column).



      I'd expect for the schema to be an Array of appropriate Structs - inferred with a bit of Python reflection on the types of values.
      Why is this not the case?
      Is there anything I can do besides providing the schema explicitly in this case?










      share|improve this question
















      When trying to create a DataFrame with Spark SQL by passing it a list of Rows like so:



      some_data = [{'some-column': [{'timestamp': 1353534535353, 'strVal': 'some-string'}]},
      {'some-column': [{'timestamp': 1353534535354, 'strVal': 'another-string'}]}]
      spark.createDataFrame([Row(**d) for d in some_data]).printSchema()


      The resulting DataFrame's schema is:



      root
      |-- some-column: array (nullable = true)
      | |-- element: map (containsNull = true)
      | | |-- key: string
      | | |-- value: long (valueContainsNull = true)


      This schema is wrong, as strVal column is of string type (and indeed collecting on this DataFrame would result in nulls on this column).



      I'd expect for the schema to be an Array of appropriate Structs - inferred with a bit of Python reflection on the types of values.
      Why is this not the case?
      Is there anything I can do besides providing the schema explicitly in this case?







      apache-spark dataframe pyspark apache-spark-sql schema






      share|improve this question















      share|improve this question













      share|improve this question




      share|improve this question








      edited Nov 20 '18 at 9:15









      user10465355

      1,9082416




      1,9082416










      asked Nov 19 '18 at 23:22









      user976850user976850

      4411616




      4411616
























          1 Answer
          1






          active

          oldest

          votes


















          1














          This happens because the structure doesn't encode what you mean. As explained in the SQL guide Python dict is mapped to MapType.



          To work with structures you should use nested Rows (namedtuples are preferred in general, but require valid name identifiers):



          from pyspark.sql import Row

          Outer = Row("some-column")
          Inner = Row("timestamp", "strVal")

          spark.createDataFrame([
          Outer([Inner(1353534535353, 'some-string')]),
          Outer([Inner(1353534535354, 'another-string')])
          ]).printSchema()


          root
          |-- some-column: array (nullable = true)
          | |-- element: struct (containsNull = true)
          | | |-- timestamp: long (nullable = true)
          | | |-- strVal: string (nullable = true)


          With the structure you have at the moment, the scheme outcome could be achieved with intermediate JSON:



          import json

          spark.read.json(sc.parallelize(some_data).map(json.dumps)).printSchema()


          root
          |-- some-column: array (nullable = true)
          | |-- element: struct (containsNull = true)
          | | |-- strVal: string (nullable = true)
          | | |-- timestamp: long (nullable = true)


          or explicit schema:



          from pyspark.sql.types import *

          schema = StructType([StructField(
          "some-column", ArrayType(StructType([
          StructField("timestamp", LongType()),
          StructField("strVal", StringType())])
          ))])

          spark.createDataFrame(some_data, schema)


          although the last method might not be fully robust.






          share|improve this answer


























          • Thanks, the JSON trick is exactly what i was hoping for. Why isn't this the default behavior? e.g. for spark.createDataFrame(rdd, schema) - cant the schema be inferred by doing this very same trick?

            – user976850
            Nov 20 '18 at 9:05











          • Schema is inferred according to the linked specification so it couldn't be, without making a whole process full of special cases.

            – user10465355
            Nov 20 '18 at 11:24











          Your Answer






          StackExchange.ifUsing("editor", function () {
          StackExchange.using("externalEditor", function () {
          StackExchange.using("snippets", function () {
          StackExchange.snippets.init();
          });
          });
          }, "code-snippets");

          StackExchange.ready(function() {
          var channelOptions = {
          tags: "".split(" "),
          id: "1"
          };
          initTagRenderer("".split(" "), "".split(" "), channelOptions);

          StackExchange.using("externalEditor", function() {
          // Have to fire editor after snippets, if snippets enabled
          if (StackExchange.settings.snippets.snippetsEnabled) {
          StackExchange.using("snippets", function() {
          createEditor();
          });
          }
          else {
          createEditor();
          }
          });

          function createEditor() {
          StackExchange.prepareEditor({
          heartbeatType: 'answer',
          autoActivateHeartbeat: false,
          convertImagesToLinks: true,
          noModals: true,
          showLowRepImageUploadWarning: true,
          reputationToPostImages: 10,
          bindNavPrevention: true,
          postfix: "",
          imageUploader: {
          brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
          contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
          allowUrls: true
          },
          onDemand: true,
          discardSelector: ".discard-answer"
          ,immediatelyShowMarkdownHelp:true
          });


          }
          });














          draft saved

          draft discarded


















          StackExchange.ready(
          function () {
          StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53384072%2fspark-sql-createdataframe-wrong-struct-schema%23new-answer', 'question_page');
          }
          );

          Post as a guest















          Required, but never shown

























          1 Answer
          1






          active

          oldest

          votes








          1 Answer
          1






          active

          oldest

          votes









          active

          oldest

          votes






          active

          oldest

          votes









          1














          This happens because the structure doesn't encode what you mean. As explained in the SQL guide Python dict is mapped to MapType.



          To work with structures you should use nested Rows (namedtuples are preferred in general, but require valid name identifiers):



          from pyspark.sql import Row

          Outer = Row("some-column")
          Inner = Row("timestamp", "strVal")

          spark.createDataFrame([
          Outer([Inner(1353534535353, 'some-string')]),
          Outer([Inner(1353534535354, 'another-string')])
          ]).printSchema()


          root
          |-- some-column: array (nullable = true)
          | |-- element: struct (containsNull = true)
          | | |-- timestamp: long (nullable = true)
          | | |-- strVal: string (nullable = true)


          With the structure you have at the moment, the scheme outcome could be achieved with intermediate JSON:



          import json

          spark.read.json(sc.parallelize(some_data).map(json.dumps)).printSchema()


          root
          |-- some-column: array (nullable = true)
          | |-- element: struct (containsNull = true)
          | | |-- strVal: string (nullable = true)
          | | |-- timestamp: long (nullable = true)


          or explicit schema:



          from pyspark.sql.types import *

          schema = StructType([StructField(
          "some-column", ArrayType(StructType([
          StructField("timestamp", LongType()),
          StructField("strVal", StringType())])
          ))])

          spark.createDataFrame(some_data, schema)


          although the last method might not be fully robust.






          share|improve this answer


























          • Thanks, the JSON trick is exactly what i was hoping for. Why isn't this the default behavior? e.g. for spark.createDataFrame(rdd, schema) - cant the schema be inferred by doing this very same trick?

            – user976850
            Nov 20 '18 at 9:05











          • Schema is inferred according to the linked specification so it couldn't be, without making a whole process full of special cases.

            – user10465355
            Nov 20 '18 at 11:24
















          1














          This happens because the structure doesn't encode what you mean. As explained in the SQL guide Python dict is mapped to MapType.



          To work with structures you should use nested Rows (namedtuples are preferred in general, but require valid name identifiers):



          from pyspark.sql import Row

          Outer = Row("some-column")
          Inner = Row("timestamp", "strVal")

          spark.createDataFrame([
          Outer([Inner(1353534535353, 'some-string')]),
          Outer([Inner(1353534535354, 'another-string')])
          ]).printSchema()


          root
          |-- some-column: array (nullable = true)
          | |-- element: struct (containsNull = true)
          | | |-- timestamp: long (nullable = true)
          | | |-- strVal: string (nullable = true)


          With the structure you have at the moment, the scheme outcome could be achieved with intermediate JSON:



          import json

          spark.read.json(sc.parallelize(some_data).map(json.dumps)).printSchema()


          root
          |-- some-column: array (nullable = true)
          | |-- element: struct (containsNull = true)
          | | |-- strVal: string (nullable = true)
          | | |-- timestamp: long (nullable = true)


          or explicit schema:



          from pyspark.sql.types import *

          schema = StructType([StructField(
          "some-column", ArrayType(StructType([
          StructField("timestamp", LongType()),
          StructField("strVal", StringType())])
          ))])

          spark.createDataFrame(some_data, schema)


          although the last method might not be fully robust.






          share|improve this answer


























          • Thanks, the JSON trick is exactly what i was hoping for. Why isn't this the default behavior? e.g. for spark.createDataFrame(rdd, schema) - cant the schema be inferred by doing this very same trick?

            – user976850
            Nov 20 '18 at 9:05











          • Schema is inferred according to the linked specification so it couldn't be, without making a whole process full of special cases.

            – user10465355
            Nov 20 '18 at 11:24














          1












          1








          1







          This happens because the structure doesn't encode what you mean. As explained in the SQL guide Python dict is mapped to MapType.



          To work with structures you should use nested Rows (namedtuples are preferred in general, but require valid name identifiers):



          from pyspark.sql import Row

          Outer = Row("some-column")
          Inner = Row("timestamp", "strVal")

          spark.createDataFrame([
          Outer([Inner(1353534535353, 'some-string')]),
          Outer([Inner(1353534535354, 'another-string')])
          ]).printSchema()


          root
          |-- some-column: array (nullable = true)
          | |-- element: struct (containsNull = true)
          | | |-- timestamp: long (nullable = true)
          | | |-- strVal: string (nullable = true)


          With the structure you have at the moment, the scheme outcome could be achieved with intermediate JSON:



          import json

          spark.read.json(sc.parallelize(some_data).map(json.dumps)).printSchema()


          root
          |-- some-column: array (nullable = true)
          | |-- element: struct (containsNull = true)
          | | |-- strVal: string (nullable = true)
          | | |-- timestamp: long (nullable = true)


          or explicit schema:



          from pyspark.sql.types import *

          schema = StructType([StructField(
          "some-column", ArrayType(StructType([
          StructField("timestamp", LongType()),
          StructField("strVal", StringType())])
          ))])

          spark.createDataFrame(some_data, schema)


          although the last method might not be fully robust.






          share|improve this answer















          This happens because the structure doesn't encode what you mean. As explained in the SQL guide Python dict is mapped to MapType.



          To work with structures you should use nested Rows (namedtuples are preferred in general, but require valid name identifiers):



          from pyspark.sql import Row

          Outer = Row("some-column")
          Inner = Row("timestamp", "strVal")

          spark.createDataFrame([
          Outer([Inner(1353534535353, 'some-string')]),
          Outer([Inner(1353534535354, 'another-string')])
          ]).printSchema()


          root
          |-- some-column: array (nullable = true)
          | |-- element: struct (containsNull = true)
          | | |-- timestamp: long (nullable = true)
          | | |-- strVal: string (nullable = true)


          With the structure you have at the moment, the scheme outcome could be achieved with intermediate JSON:



          import json

          spark.read.json(sc.parallelize(some_data).map(json.dumps)).printSchema()


          root
          |-- some-column: array (nullable = true)
          | |-- element: struct (containsNull = true)
          | | |-- strVal: string (nullable = true)
          | | |-- timestamp: long (nullable = true)


          or explicit schema:



          from pyspark.sql.types import *

          schema = StructType([StructField(
          "some-column", ArrayType(StructType([
          StructField("timestamp", LongType()),
          StructField("strVal", StringType())])
          ))])

          spark.createDataFrame(some_data, schema)


          although the last method might not be fully robust.







          share|improve this answer














          share|improve this answer



          share|improve this answer








          edited Nov 20 '18 at 0:58

























          answered Nov 20 '18 at 0:48









          user10465355user10465355

          1,9082416




          1,9082416













          • Thanks, the JSON trick is exactly what i was hoping for. Why isn't this the default behavior? e.g. for spark.createDataFrame(rdd, schema) - cant the schema be inferred by doing this very same trick?

            – user976850
            Nov 20 '18 at 9:05











          • Schema is inferred according to the linked specification so it couldn't be, without making a whole process full of special cases.

            – user10465355
            Nov 20 '18 at 11:24



















          • Thanks, the JSON trick is exactly what i was hoping for. Why isn't this the default behavior? e.g. for spark.createDataFrame(rdd, schema) - cant the schema be inferred by doing this very same trick?

            – user976850
            Nov 20 '18 at 9:05











          • Schema is inferred according to the linked specification so it couldn't be, without making a whole process full of special cases.

            – user10465355
            Nov 20 '18 at 11:24

















          Thanks, the JSON trick is exactly what i was hoping for. Why isn't this the default behavior? e.g. for spark.createDataFrame(rdd, schema) - cant the schema be inferred by doing this very same trick?

          – user976850
          Nov 20 '18 at 9:05





          Thanks, the JSON trick is exactly what i was hoping for. Why isn't this the default behavior? e.g. for spark.createDataFrame(rdd, schema) - cant the schema be inferred by doing this very same trick?

          – user976850
          Nov 20 '18 at 9:05













          Schema is inferred according to the linked specification so it couldn't be, without making a whole process full of special cases.

          – user10465355
          Nov 20 '18 at 11:24





          Schema is inferred according to the linked specification so it couldn't be, without making a whole process full of special cases.

          – user10465355
          Nov 20 '18 at 11:24




















          draft saved

          draft discarded




















































          Thanks for contributing an answer to Stack Overflow!


          • Please be sure to answer the question. Provide details and share your research!

          But avoid



          • Asking for help, clarification, or responding to other answers.

          • Making statements based on opinion; back them up with references or personal experience.


          To learn more, see our tips on writing great answers.




          draft saved


          draft discarded














          StackExchange.ready(
          function () {
          StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53384072%2fspark-sql-createdataframe-wrong-struct-schema%23new-answer', 'question_page');
          }
          );

          Post as a guest















          Required, but never shown





















































          Required, but never shown














          Required, but never shown












          Required, but never shown







          Required, but never shown

































          Required, but never shown














          Required, but never shown












          Required, but never shown







          Required, but never shown







          Popular posts from this blog

          Guess what letter conforming each word

          Port of Spain

          Run scheduled task as local user group (not BUILTIN)