Scraping over multiple pages using BeautifulSoup and dataframe iterrows
I'm using BeautifulSoup to scrape from multiple URLs. Each URL is built by appending a value stored in a dataframe column (postcode_URL).
The code breaks on the line table_rows = table.find_all('tr'), throwing the error: 'NoneType' object has no attribute 'find_all'.
Interestingly, the code works perfectly if I remove the iteration and manually enter a single postcode in the URL, so I believe it must be something to do with the iteration loop.
Below is the code I have used. Any ideas?
import requests
import pandas as pd
from bs4 import BeautifulSoup

scraped_data = []
for x, row in postcodes_for_urls.iterrows():
    page = requests.get("http://myurl" + row['postcode_URL'])
    soup = BeautifulSoup(page.content, 'html.parser')
    table = soup.find('table')
    table_rows = table.find_all('tr')
    for tr in table_rows:
        td = tr.find_all('td')
        row = [tr.text for tr in td]
        scraped_data.append(row)

pd.DataFrame(scraped_data, columns=["A", "B", "C"])
python dataframe web-scraping beautifulsoup
asked Nov 11 at 16:39 by B Winter · edited Nov 11 at 18:05 by Akash Ranjan
I would suggest using Scrapy (scrapy.org) instead. – Akash Ranjan, Nov 11 at 17:17
2 Answers
Accepted answer:
I investigated the problem and tried a few snippets on my laptop.
The problem is not with the DataFrame, since you are reading one row at a time in the loop; the problem is with the URL. Your program correctly scrapes each page that contains a table, and it throws an error for any postcode URL whose page has no <table> element.
Consider the first test: I created an HTML page without a table in it:
<html>
<head>
<title>Demo page</title>
</head>
<body>
<h2>Demo without table</h2>
</body>
</html>
Then I executed the Python code below:
from bs4 import BeautifulSoup
data = open('table.html').read()
parser = BeautifulSoup(data, 'html.parser')
table = parser.find('table')
rows = table.find_all('tr')
print(rows)
The code above stops with a NoneType exception: parser.find() returns None when no table element is found in the HTML, and since None has no find_all() method, the call raises an AttributeError.
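A minimal way to reproduce this behavior without a file (a sketch, assuming only that bs4 is installed):

from bs4 import BeautifulSoup

soup = BeautifulSoup("<html><body><p>no table here</p></body></html>", "html.parser")
table = soup.find("table")
print(table)  # prints None, because the document contains no <table>
# table.find_all("tr")  # would raise: 'NoneType' object has no attribute 'find_all'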
So I changed the HTML as below:
<html>
<head>
<title>Demo page</title>
</head>
<body>
<h2>Demo with table</h2>
<table>
<tr><td>Demo</td></tr>
</table>
</body>
</html>
Now the Python code runs without any exception, because the table element is present.
So the conclusion: the exception occurs because one of the postcodes in the DataFrame leads to a URL whose page does not contain a table. I recommend you make a small change to your code:
scraped_data = []
for x, row in postcodes_for_urls.iterrows():
    page = requests.get("http://myurl" + row['postcode_URL'])
    soup = BeautifulSoup(page.content, 'html.parser')
    table = soup.find('table')
    # Add this: skip any page that has no table
    if table is None:
        continue
    table_rows = table.find_all('tr')
    for tr in table_rows:
        td = tr.find_all('td')
        row = [tr.text for tr in td]
        scraped_data.append(row)

pd.DataFrame(scraped_data, columns=["A", "B", "C"])
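If you also want to know which postcodes were silently skipped, a small variation on the loop (a sketch; skipped is a hypothetical name) could collect them for inspection:

skipped = []
for x, row in postcodes_for_urls.iterrows():
    page = requests.get("http://myurl" + row['postcode_URL'])
    soup = BeautifulSoup(page.content, 'html.parser')
    table = soup.find('table')
    if table is None:
        # Remember which postcode produced a page without a table
        skipped.append(row['postcode_URL'])
        continue
    # ... same row extraction as above ...

print(skipped)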
answered Nov 11 at 17:13 by Narasimha Prasanna HN
Amazing, this works perfectly now - thank you for the detailed explanation. – B Winter, Nov 11 at 20:23
All you need is a check that the table exists and is not empty before you read its rows; try the code below. For best practice you could also wrap the request in a try/except block. Scrapers like this usually break because of an empty page or an unusual pattern on the real page you are scraping.
scraped_data = []
for x, row in postcodes_for_urls.iterrows():
    page = requests.get("http://myurl" + row['postcode_URL'])
    soup = BeautifulSoup(page.content, 'html.parser')
    table = soup.find('table')
    if table is not None and len(table.find_all('tr')) > 0:
        table_rows = table.find_all('tr')
        for tr in table_rows:
            td = tr.find_all('td')
            row = [tr.text for tr in td]
            scraped_data.append(row)
    else:
        # Append a full-width placeholder so every row still has 3 values
        scraped_data.append(['EMPTY'] * 3)

pd.DataFrame(scraped_data, columns=["A", "B", "C"])
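For the exception handler this answer mentions, here is a minimal sketch of wrapping the request inside the loop (the 10-second timeout is an assumption; requests.exceptions.RequestException is the base class for errors raised by requests):

for x, row in postcodes_for_urls.iterrows():
    try:
        page = requests.get("http://myurl" + row['postcode_URL'], timeout=10)
        page.raise_for_status()  # treat 4xx/5xx responses as failures
    except requests.exceptions.RequestException as exc:
        print("Skipping", row['postcode_URL'], "because:", exc)
        continue
    # ... parse page.content with BeautifulSoup as above ...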
answered Nov 11 at 17:20, edited Nov 11 at 17:28 by Harry_pb
Yep, got it. Makes sense now. Thank you. – B Winter, Nov 11 at 20:24