Is there a standard conforming way to write a portable ls utility in C++?

Let's consider the following code listing the directory contents of the path given as the first argument to the program:



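    #include <filesystem>
    #include <iostream>

    int main(int argc, char **argv)
    {
        if (argc != 2) {
            std::cerr << "Please specify a directory.\n";
            return 1;  // without this, argv[1] would be used even when it is missing
        }

        // Print every entry of the directory named by the first argument.
        for (auto& p : std::filesystem::directory_iterator(argv[1]))
            std::cout << p << '\n';
    }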


At first sight this seems very lean, portable and conforming to the C++ standard (please ignore that it does not catch the exception thrown if the directory does not exist).



However, there seem to be a few pitfalls. In particular, the C++ standard does not seem to mandate that the encoding of argv[1] matches that accepted by std::filesystem::path constructors nor does it seem to mandate that the encoding returned by std::filesystem::path::string() matches that accepted by std::cout.



Quite the opposite, the standard seems to introduce the new term "native encoding" which may be different from the execution character set encoding and is defined as:




The native encoding of a narrow character string is the operating
system dependent current encoding for pathnames ([fs.class.path]).




From my reading of the standard no conversion between encodings takes place if std::filesystem::path::value_type matches the char type of argv[1] (which is true on any POSIX system).
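Whether a given implementation falls into this "no conversion" case can be checked by inspecting path::value_type. The following is a minimal sketch, not part of the original question; the names native_is_char and stored_verbatim are hypothetical. It only shows that the chars are stored unchanged and says nothing about whether they are in the encoding the operating system expects.

```cpp
#include <algorithm>
#include <filesystem>
#include <string>
#include <type_traits>

namespace fs = std::filesystem;

// True when the native path character type is plain char, as on POSIX systems.
constexpr bool native_is_char = std::is_same_v<fs::path::value_type, char>;

// Hypothetical helper: returns true if the path stores exactly the chars
// it was constructed from, i.e. no encoding conversion took place.
bool stored_verbatim(const std::string& arg)
{
    const fs::path p(arg);
    const fs::path::string_type& native = p.native();
    return native_is_char
        && std::equal(native.begin(), native.end(), arg.begin(), arg.end(),
                      [](fs::path::value_type a, char b) { return a == b; });
}
```

Calling stored_verbatim(argv[1]) would then tell you that the bytes went through untouched, which is exactly the situation the question worries about: nothing was converted, and nothing guarantees that the two encodings agree.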



This seems to allow, for example, a conforming implementation in which the execution character set encoding (and hence the encoding of argv[1] and that accepted by std::cout) is EBCDIC, but the encoding of strings accepted and provided by the filesystem library is ISO 8859-1, with no conversion performed between the two, making the filesystem library essentially useless. Worse yet, there is no way to figure out if the two encodings are the same or not.



This can even get dangerous if you write utilities that delete files: the file named by argv[1] may match a completely different file once the name is interpreted in the native encoding of the filesystem library.



Note that I'm not concerned about filesystems using different encodings than those used by programs. My concern is that the standard does not seem to mandate any conversion of those encodings.



The u8path() and u8string() functions are of no use here either because the standard also provides no way to convert between UTF-8 and the execution character set encoding (used by argv[1] and std::cout).



Is there any portable, encoding-agnostic and standard-compliant way to do this?

  • Speaking of standards, ls will show you the current working directory if given no arguments; it won't give you flak for not specifying one. Also, if you're working with EBCDIC and C++ together, I'm impressed.
    – tadman
    Nov 15 '18 at 16:40

  • Yes, there is no portable way to write an ls application in C++. Moreover, my experience tells me that there is no portable way to write any complex application in C++ - you will always have to rely on things which are not specified by the C++ standard, either directly or hidden inside third-party libraries like Boost. In my opinion, this greatly contrasts C++ with languages like Java.
    – SergeyA
    Nov 15 '18 at 16:41

  • @SergeyA Yeah, every operating system is free to make up its own rules, and they often do for reasons we'll never be able to properly explain.
    – tadman
    Nov 15 '18 at 16:43

  • The root problem is that WG21 doesn't want to rely on POSIX here. Without that, the whole notion of a file name becomes non-portable. Now this can be reasonable; on tiny embedded systems files might be identified by merely a number.
    – MSalters
    Nov 15 '18 at 17:09

  • @MSalters I understand the reason, but these systems could still exist if the standard provided a way to reliably set and get that number in the execution character set encoding.
    – Contter
    Nov 15 '18 at 17:20

c++ character-encoding filesystems c++17 c++-standard-library

asked Nov 15 '18 at 16:38
Contter

1 Answer

No, and this is not just theoretical.



On Windows systems, paths are UTF-16 and path::value_type is wchar_t, not the char you get from char** argv. This isn't a problem by itself, since a path can be created from a char*. However, not every Windows file name can be expressed as a char* string. Hence the program is unable to list the contents of directories whose names cannot be expressed that way.



Now you'd think that Linux would be better. That's actually not entirely the case - the bytes you get for a filename can depend on whether you entered them on a keyboard or via TAB completion!
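The answer does not say why the bytes differ. One plausible cause, assumed here purely for illustration, is Unicode normalization: the same visible character can be encoded as different byte sequences. A minimal sketch with hypothetical names, with the UTF-8 bytes written out explicitly:

```cpp
#include <filesystem>
#include <string>

namespace fs = std::filesystem;

// "café" with é as the single code point U+00E9, encoded as UTF-8.
const std::string precomposed = "caf\xC3\xA9";
// "café" with é as "e" followed by U+0301 (combining acute accent), as UTF-8.
const std::string decomposed = "cafe\xCC\x81";

// Hypothetical helper: std::filesystem::path compares the stored
// characters element by element and performs no Unicode normalization.
bool same_name(const std::string& a, const std::string& b)
{
    return fs::path(a) == fs::path(b);
}
```

Both strings look identical when displayed, yet same_name(precomposed, decomposed) is false on a system that stores the bytes as given, so an ls or rm written this way may or may not find the file depending on how the name was typed.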






  • Point taken, but Windows and Linux are non-conforming in this respect anyway. ;-)
    – Contter
    Nov 15 '18 at 17:56











answered Nov 15 '18 at 17:20
MSalters