Scraping over multiple pages using BeautifulSoup and dataframe iterrows

I'm using BeautifulSoup to scrape from multiple URLs. The URL iterates by appending a variable I have saved in a dataframe (postcode_URL).

The code breaks on line: table_rows = table.find_all('tr'), throwing the error: 'NoneType' object has no attribute 'find_all'

Interestingly, the code works perfectly if I remove the iteration and manually enter a single postcode in the URL, so I believe it must be something to do with the iteration loop.

Below is the code that I have used. Any ideas?

scraped_data = []

for x, row in postcodes_for_urls.iterrows():
    page = requests.get("http://myurl"+(row['postcode_URL']))
    soup = BeautifulSoup(page.content, 'html.parser')
    table = soup.find('table')
    table_rows = table.find_all('tr')
    for tr in table_rows:
        td = tr.find_all('td')
        row = [tr.text for tr in td]
        scraped_data.append(row)

pd.DataFrame(scraped_data, columns=["A", "B", "C"])

edited Nov 11 at 18:05 by Akash Ranjan

asked Nov 11 at 16:39 by B Winter

I will suggest using scrapy (scrapy.org) instead.
– Akash Ranjan
Nov 11 at 17:17


python dataframe web-scraping beautifulsoup


2 Answers

accepted

I investigated the problem and tried a few snippets on my laptop.

The problem is not with the DataFrame, since you read one row at a time in the loop. The problem is with the URL: your program correctly scrapes a page that has a table in it and throws the error for a postcode URL whose page doesn't have a table element.

Consider the first test:

I created an HTML page without a table in it:

<html>
  <head>
     <title>Demo page</title>
  </head>

  <body>
     <h2>Demo without table</h2>
  </body>
</html>

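And then I executed the Python code below:

from bs4 import BeautifulSoup

# read the demo page saved as table.html
with open('table.html') as f:
    data = f.read()

parser = BeautifulSoup(data, 'html.parser')
table = parser.find('table')   # no <table> in the page, so this is None
rows = table.find_all('tr')    # fails here: None has no find_all()
print(rows)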

The code above stopped with the NoneType error because parser.find() returns None when no table element is found in the HTML. None has no find_all() method, so the call raises the error.

So I changed my HTML as below:

<html>
<head>
    <title>Demo page</title>
</head>

<body>
    <h2>Demo with table</h2>
    <table>
        <tr>Demo</tr>
    </table>
</body>
</html>

Now the Python code runs without any exception, because the table element is present.
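The two tests can also be reproduced without any files. This is only a minimal sketch: the two HTML strings are hypothetical stand-ins for the demo pages, and the second one adds a td cell that the demo page does not have.

```python
from bs4 import BeautifulSoup

# In-memory stand-ins for the two demo pages (assumed content, for illustration).
page_without_table = "<html><body><h2>Demo without table</h2></body></html>"
page_with_table = "<html><body><table><tr><td>Demo</td></tr></table></body></html>"

no_table = BeautifulSoup(page_without_table, "html.parser").find("table")
has_table = BeautifulSoup(page_with_table, "html.parser").find("table")

# find() returns None when nothing matches, so there is nothing to call find_all() on.
print(no_table is None)

# When the table exists, find() returns a Tag and find_all() works as expected.
print(len(has_table.find_all("tr")))
```

Checking the result of find() against None before using it is what separates the two cases.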

So the conclusion:

The exception occurs because one of the postcodes in the DataFrame leads to a URL whose page does not contain a table. So I recommend making a small change to your code:

scraped_data = []

for x, row in postcodes_for_urls.iterrows():
  page = requests.get("http://myurl"+(row['postcode_URL']))
  soup = BeautifulSoup(page.content, 'html.parser')
  table = soup.find('table')

  #add this : skip any page that has no table
  if table is None:
      continue

  table_rows = table.find_all('tr')
  for tr in table_rows:
    td = tr.find_all('td')
    cells = [cell.text for cell in td]
    scraped_data.append(cells)

pd.DataFrame(scraped_data, columns=["A", "B", "C"])
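If it helps to know which postcodes were skipped, the same None check can be moved into a small helper. The following is a rough sketch rather than part of the original answer: rows_from_html, pages and skipped are hypothetical names, and the pages dict stands in for the HTML that requests.get(...).content would return in the real loop.

```python
from bs4 import BeautifulSoup


def rows_from_html(html):
    """Return the cell texts of every row in the first table, or None if no table exists."""
    table = BeautifulSoup(html, "html.parser").find("table")
    if table is None:
        return None
    return [[td.get_text(strip=True) for td in tr.find_all("td")]
            for tr in table.find_all("tr")]


# Hypothetical downloaded pages keyed by postcode (assumed data for illustration).
pages = {
    "AB1": "<table><tr><td>1</td><td>2</td><td>3</td></tr></table>",
    "ZZ9": "<h2>No results</h2>",
}

scraped_data, skipped = [], []
for postcode, html in pages.items():
    rows = rows_from_html(html)
    if rows is None:
        skipped.append(postcode)  # remember which postcode had no table
        continue
    scraped_data.extend(rows)

print(scraped_data)
print(skipped)
```

Keeping the list of skipped postcodes makes it easy to open those URLs by hand and see what the site actually returned.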

answered Nov 11 at 17:13 by Narasimha Prasanna HN

Amazing, this works perfectly now - thank you for the detailed explanation.
– B Winter
Nov 11 at 20:23


All you need is a check that the table is not None and that it has at least one row; try the code below. You can also wrap the parsing in an exception handler (a try/except statement) as a best practice. It usually breaks because of either an empty row or an unusual pattern on the real page you are scraping.

scraped_data = []

for x, row in postcodes_for_urls.iterrows():
    page = requests.get("http://myurl"+(row['postcode_URL']))
    soup = BeautifulSoup(page.content, 'html.parser')
    table = soup.find('table')

    if table is not None and len(table.find_all('tr'))>0:
        table_rows = table.find_all('tr')
        for tr in table_rows:
            td = tr.find_all('td')
            cells = [cell.text for cell in td]
            scraped_data.append(cells)
    else:
        # placeholder row, padded to match the three columns below
        scraped_data.append(['EMPTY', None, None])

pd.DataFrame(scraped_data, columns=["A", "B", "C"])
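The try/except variant mentioned above might look roughly like this. It is a sketch under assumptions: extract_rows is a hypothetical helper name, and it only handles the missing-table case by catching the AttributeError raised when find() returns None.

```python
from bs4 import BeautifulSoup


def extract_rows(html):
    """Return the cell texts of each table row, or an empty list if the page has no table."""
    try:
        table = BeautifulSoup(html, "html.parser").find("table")
        return [[td.get_text(strip=True) for td in tr.find_all("td")]
                for tr in table.find_all("tr")]
    except AttributeError:
        # table was None, so table.find_all raised AttributeError
        return []


print(extract_rows("<table><tr><td>a</td><td>b</td></tr></table>"))
print(extract_rows("<p>no table on this page</p>"))
```

An explicit check for None is usually easier to read, since catching AttributeError can also hide unrelated mistakes inside the try block.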

edited Nov 11 at 17:28

answered Nov 11 at 17:20 by Harry_pb

Yep, got it. Makes sense now. Thank you.
– B Winter
Nov 11 at 20:24


