Select columns which contain a string in PySpark

I have a PySpark DataFrame with a lot of columns, and I want to select the ones that contain a certain string, plus a few others. For example:

df.columns = ['hello_world','hello_country','hello_everyone','byebye','ciao','index']

I want to select the columns that contain 'hello' and also the column named 'index', so the result will be:

['hello_world','hello_country','hello_everyone','index']

I want something like df.select('hello*', 'index').

Thanks in advance :)

EDIT:

I found a quick way to solve it, so I answered it myself, Q&A style. If someone sees my solution and can provide a better one, I will appreciate it.
Tags: python, pyspark, pyspark-sql

asked Nov 21 '18 at 9:23 by Manrique
edited Nov 21 '18 at 11:07 by Manrique

3 Answers

I've found a quick and elegant way:

selected = [s for s in df.columns if 'hello' in s] + ['index']
df.select(selected)

With this solution I can add more columns I want without editing the for loop that Ali AzG suggested.

answered Nov 21 '18 at 9:49 by Manrique

• Great solution. And you don't need the * before selected? – Ali AzG, Nov 21 '18 at 9:52

• Thanks! No, I don't :) – Manrique, Nov 21 '18 at 9:58
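
A minimal sketch of why the * is optional here, assuming df is the example DataFrame from the question: DataFrame.select accepts either a list of column names or the names unpacked as separate arguments, so both call styles below give the same result.

# Both forms select the same columns: select() accepts a list of
# column name strings, or the same names unpacked with *.
selected = [s for s in df.columns if 'hello' in s] + ['index']
df.select(selected).show()
df.select(*selected).show()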



















You can also use the colRegex function introduced in Spark 2.3, which lets you specify the column names as a regular expression.

Hope it helps.

answered Nov 21 '18 at 13:59 by neeraj bhadani
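
A minimal sketch of what that might look like for the columns in the question, assuming Spark 2.3 or later (colRegex takes a backtick-quoted regular expression that is matched against column names):

# Select every column containing 'hello', plus 'index', via a column-name regex.
df.select(df.colRegex("`(hello.*|index)`")).show()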






This sample code does what you want:

hello_cols = []

for col in df.columns:
    if ('index' in col) or ('hello' in col):
        hello_cols.append(col)

df.select(*hello_cols)

answered Nov 21 '18 at 9:39 by Ali AzG, edited Nov 21 '18 at 9:44 by Manrique

• Thanks, I fixed an error in your code and it worked. – Manrique, Nov 21 '18 at 9:45

• @Antonio Manrique You're right. I wrote my code in Python 2.7. Please accept my answer if it was helpful. – Ali AzG, Nov 21 '18 at 9:46

• I will give it an upvote! But I've found a better option for what I am doing; I'll post it as an answer and accept it. Thank you so much! – Manrique, Nov 21 '18 at 9:47