Scraping over multiple pages using BeautifulSoup and dataframe iterrows























I'm using BeautifulSoup to scrape from multiple URLs. Each URL is built by appending a value from a DataFrame column (postcode_URL) to a base URL.



The code breaks on line: table_rows = table.find_all('tr'), throwing the error: 'NoneType' object has no attribute 'find_all'



Interestingly, the code works perfectly if I remove the iteration and manually enter a single postcode in the URL, so I believe it must be something to do with the iteration loop.



Below is the code that I have used. Any ideas?



scraped_data = []

for x, row in postcodes_for_urls.iterrows():
    page = requests.get("http://myurl"+(row['postcode_URL']))
    soup = BeautifulSoup(page.content, 'html.parser')
    table = soup.find('table')
    table_rows = table.find_all('tr')
    for tr in table_rows:
        td = tr.find_all('td')
        row = [tr.text for tr in td]
        scraped_data.append(row)

pd.DataFrame(scraped_data, columns=["A", "B", "C"])

































  • I would suggest using Scrapy (scrapy.org) instead.
    – Akash Ranjan
    Nov 11 at 17:17















python dataframe web-scraping beautifulsoup














asked Nov 11 at 16:39 by B Winter, edited Nov 11 at 18:05 by Akash Ranjan
























2 Answers




















accepted










I investigated the problem and tried a few snippets on my laptop.

The problem is not with the DataFrame, since you read one row at a time in the loop. The problem is with the URL: your program correctly scrapes pages that contain a table and throws an error for a postcode URL whose page has no table element.

Consider the first test:

I created an HTML page without a table in it:



 <html>
<head>
<title>Demo page</title>
</head>

<body>
<h2>Demo without table</h2>
</body>
</html>


And then I executed the Python code below:



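from bs4 import BeautifulSoup

# read the saved test page
with open('table.html') as f:
    data = f.read()
parser = BeautifulSoup(data, 'html.parser')
table = parser.find('table')   # None if the page has no <table>
rows = table.find_all('tr')    # raises AttributeError when table is None
print(rows)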


The code above stopped with an exception because parser.find() returns None when no table element is found in the HTML. None has no find_all() method, so the call fails with: 'NoneType' object has no attribute 'find_all'.
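
The same behaviour can be reproduced without any files. The following is a minimal sketch, assuming two made-up inline HTML strings (PAGE_WITHOUT_TABLE and PAGE_WITH_TABLE) and a hypothetical helper count_rows; none of these names come from the original code.

```python
from bs4 import BeautifulSoup

# Two tiny hypothetical pages: one without a table, one with a single row.
PAGE_WITHOUT_TABLE = "<html><body><h2>Demo without table</h2></body></html>"
PAGE_WITH_TABLE = "<html><body><table><tr><td>Demo</td></tr></table></body></html>"


def count_rows(html):
    """Return the number of <tr> rows in the first table, or None if there is no table."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")  # find() returns None when nothing matches
    if table is None:
        return None
    return len(table.find_all("tr"))


print(count_rows(PAGE_WITHOUT_TABLE))  # None
print(count_rows(PAGE_WITH_TABLE))     # 1
```

Checking the result of find() before calling find_all() on it is what avoids the AttributeError.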



So I changed my HTML as below:



<html>
<head>
<title>Demo page</title>
</head>

<body>
<h2>Demo with table</h2>
<table>
<tr><td>Demo</td></tr>
</table>
</body>
</html>


Now the Python code runs without any exception, because the table element is present.

So the conclusion:

The exception occurs because at least one of the postcodes in the DataFrame leads to a URL whose page does not contain a table. So I recommend making a small change to your code:



scraped_data = []

for x, row in postcodes_for_urls.iterrows():
    page = requests.get("http://myurl" + row['postcode_URL'])
    soup = BeautifulSoup(page.content, 'html.parser')
    table = soup.find('table')
    # add this: skip pages that have no table
    if table is None:
        continue

    table_rows = table.find_all('tr')
    for tr in table_rows:
        td = tr.find_all('td')
        cells = [cell.text for cell in td]
        scraped_data.append(cells)

pd.DataFrame(scraped_data, columns=["A", "B", "C"])
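
It can also help to know which postcodes were skipped. Here is a rough sketch of that idea, assuming a hypothetical helper extract_rows and a made-up pages dictionary that stands in for the downloaded content (the postcodes "AB1" and "ZZ9" are invented for illustration):

```python
from bs4 import BeautifulSoup


def extract_rows(html):
    """Return a list of cell-text lists for the first table, or [] if there is none."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    if table is None:
        return []
    rows = []
    for tr in table.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if cells:  # header rows built from <th> have no <td> cells
            rows.append(cells)
    return rows


# Hypothetical page content keyed by postcode, standing in for requests.get(...).content
pages = {
    "AB1": "<table><tr><th>A</th><th>B</th><th>C</th></tr>"
           "<tr><td>1</td><td>2</td><td>3</td></tr></table>",
    "ZZ9": "<h2>No results</h2>",
}

scraped_data = []
skipped = []
for postcode, html in pages.items():
    rows = extract_rows(html)
    if rows:
        scraped_data.extend(rows)
    else:
        skipped.append(postcode)  # remember which postcodes had no usable table

print(scraped_data)  # [['1', '2', '3']]
print(skipped)       # ['ZZ9']
```

Keeping the parsing in its own function means it can be tried on a saved page before running the full loop over every postcode.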



























  • Amazing, this works perfectly now - thank you for the detailed explanation.
    – B Winter
    Nov 11 at 20:23































All you need is a check that table is not None and that it contains at least one row; try the code below. You can also wrap the scraping in an exception handler (a try/except statement) as good practice. Code like this typically breaks because of either an empty row or an unusual pattern on the real page you are scraping.



scraped_data = []

for x, row in postcodes_for_urls.iterrows():
    page = requests.get("http://myurl" + row['postcode_URL'])
    soup = BeautifulSoup(page.content, 'html.parser')
    table = soup.find('table')

    if table is not None and len(table.find_all('tr')) > 0:
        table_rows = table.find_all('tr')
        for tr in table_rows:
            td = tr.find_all('td')
            cells = [cell.text for cell in td]
            scraped_data.append(cells)
    else:
        # placeholder row with one value per column, so the DataFrame still builds
        scraped_data.append(['EMPTY', 'EMPTY', 'EMPTY'])

pd.DataFrame(scraped_data, columns=["A", "B", "C"])
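
The try/except idea mentioned above could look roughly like this. It is only a sketch under some assumptions: BASE_URL is the placeholder from the question, fetch_page and scrape are hypothetical names, and the 10-second timeout is an arbitrary choice.

```python
import requests
from bs4 import BeautifulSoup

BASE_URL = "http://myurl"  # placeholder base URL taken from the question


def fetch_page(postcode):
    """Download one page; raises requests.RequestException on network or HTTP errors."""
    response = requests.get(BASE_URL + postcode, timeout=10)
    response.raise_for_status()  # turn 4xx/5xx responses into exceptions
    return response.content


def scrape(postcodes, fetch=fetch_page):
    """Return (rows, failures); failures maps each skipped postcode to the reason."""
    rows, failures = [], {}
    for postcode in postcodes:
        try:
            soup = BeautifulSoup(fetch(postcode), "html.parser")
            table = soup.find("table")
            if table is None:
                raise ValueError("no <table> on page")
            for tr in table.find_all("tr"):
                cells = [td.get_text(strip=True) for td in tr.find_all("td")]
                if cells:  # ignore rows without <td> cells
                    rows.append(cells)
        except (requests.RequestException, ValueError) as exc:
            failures[postcode] = str(exc)  # record the problem and keep going
    return rows, failures
```

With the DataFrame from the question, this might be called as rows, failures = scrape(postcodes_for_urls['postcode_URL']), after which failures shows which postcodes need a closer look. Passing fetch as a parameter lets the loop be tested with canned HTML instead of live requests.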




























  • Yep, got it. Makes sense now. Thank you.
    – B Winter
    Nov 11 at 20:24











Answer 1 (accepted): answered Nov 11 at 17:13 by Narasimha Prasanna HN












Answer 2: answered Nov 11 at 17:20 by Harry_pb, edited Nov 11 at 17:28











