Unable to read files from a Google Cloud Storage bucket from a Jupyter notebook running on a Dataproc cluster
I am working on a Dataproc Spark cluster with an initialization action that installs Jupyter notebook. I am unable to read the CSV files stored in the Google Cloud Storage bucket, although I am able to read the same files from the Spark shell.
Below is the code and the error I am getting:
import pandas as pd
import numpy as np
data = pd.read_csv("gs://dataproc-78r5fe64b-a56d-4f5f4-bcf9-e1b7t6fb9d8f-au-southeast1/notebooks/datafile.csv")
FileNotFoundError Traceback (most recent call last)
<ipython-input-20-2457012764fa> in <module>
----> 1 data = pd.read_csv("gs://dataproc-78r5fe64b-a56d-4f5f4-bcf9-e1b7t6fb9d8f-au-southeast1/notebooks/datafile.csv")
/opt/conda/lib/python3.6/site-packages/pandas/io/parsers.py in parser_f(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, escapechar, comment, encoding, dialect, tupleize_cols, error_bad_lines, warn_bad_lines, skipfooter, doublequote, delim_whitespace, low_memory, memory_map, float_precision)
676 skip_blank_lines=skip_blank_lines)
677
--> 678 return _read(filepath_or_buffer, kwds)
679
680 parser_f.__name__ = name
/opt/conda/lib/python3.6/site-packages/pandas/io/parsers.py in _read(filepath_or_buffer, kwds)
438
439 # Create the parser.
--> 440 parser = TextFileReader(filepath_or_buffer, **kwds)
441
442 if chunksize or iterator:
/opt/conda/lib/python3.6/site-packages/pandas/io/parsers.py in __init__(self, f, engine, **kwds)
785 self.options['has_index_names'] = kwds['has_index_names']
786
--> 787 self._make_engine(self.engine)
788
789 def close(self):
/opt/conda/lib/python3.6/site-packages/pandas/io/parsers.py in _make_engine(self, engine)
1012 def _make_engine(self, engine='c'):
1013 if engine == 'c':
-> 1014 self._engine = CParserWrapper(self.f, **self.options)
1015 else:
1016 if engine == 'python':
/opt/conda/lib/python3.6/site-packages/pandas/io/parsers.py in __init__(self, src, **kwds)
1706 kwds['usecols'] = self.usecols
1707
-> 1708 self._reader = parsers.TextReader(src, **kwds)
1709
1710 passed_names = self.names is None
pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader.__cinit__()
pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._setup_parser_source()
FileNotFoundError: File b'gs://dataproc-78r5fe64b-a56d-4f5f4-bcf9-e1b7t6fb9d8f-au-southeast1/notebooks/datafile.csv' does not exist
Location path for the CSV file:
gs://dataproc-78r5fe64b-a56d-4f5f4-bcf9-e1b7t6fb9d8f-au-southeast1/notebooks/datafile.csv
I have also made sure that the CSV file is stored in the same storage bucket that is attached to the Dataproc cluster, and that the file is in UTF-8-encoded CSV format.
Can anyone please help me read files stored in a Google Cloud Storage bucket from a Jupyter notebook running on a Dataproc cluster?
Kindly let me know if more information is required.
Thanks in advance!
python jupyter-notebook google-cloud-storage google-cloud-dataproc
edited Nov 14 '18 at 1:20
asked Nov 14 '18 at 1:14
Tushar Mehta
What kernel are you using in Jupyter? Python or Spark?
– tix
Nov 14 '18 at 2:29
I have actually tried using both kernels, Python and PySpark.
– Tushar Mehta
Nov 14 '18 at 2:59
1 Answer
The reason Spark can read from GCS is that we configure it to use the GCS connector for paths that start with gs://. You probably want to use spark.read.csv("gs://path/to/files/") to read the CSV file(s) into a Spark dataframe.
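For example, a minimal sketch in the PySpark kernel, assuming the kernel's built-in spark session object; the bucket path below is a placeholder for your own:

# Read the CSV from GCS into a Spark dataframe via the GCS connector
df = spark.read.csv("gs://your-bucket/notebooks/datafile.csv",
                    header=True, inferSchema=True)
df.show(5)
# If the data fits in driver memory, you can convert to pandas from here
pandas_df = df.toPandas()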
You can also read and write to GCS using pandas directly, but it's a bit more complicated. This Stack Overflow post lists some options.
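One such option, sketched below, assumes the google-cloud-storage client library is installed and the cluster's service account can read the bucket; the bucket and object names are placeholders:

import io
import pandas as pd
from google.cloud import storage

# Download the object into memory and hand the bytes to pandas
client = storage.Client()
bucket = client.bucket("your-bucket")            # placeholder bucket name
blob = bucket.blob("notebooks/datafile.csv")     # placeholder object path
data = pd.read_csv(io.BytesIO(blob.download_as_string()))

Alternatively, with the gcsfs package installed, recent pandas versions can read gs:// paths passed to pd.read_csv directly.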
Side note: if you're using pandas, you should use a single-node cluster, since pandas code will not be distributed across the cluster.
answered Nov 16 '18 at 22:35
Karthik Palaniappan
Thanks Karthik, I have been trying to find a solution for quite some time, and I understand that pandas code would be distributed across the cluster.
– Tushar Mehta
Nov 18 '18 at 21:22
Pandas is not distributed across the cluster, unless you're using something like Dask.
– Karthik Palaniappan
Nov 20 '18 at 17:02
Sorry, there was a typo; I meant that I understand that pandas would not be distributed across the cluster. And many thanks for helping out with the issue.
– Tushar Mehta
Nov 21 '18 at 20:37