How to correctly scrape a JavaScript-based site?
I'm testing the code below.
from bs4 import BeautifulSoup
import requests
import time
from selenium import webdriver

profile = webdriver.FirefoxProfile()
profile.accept_untrusted_certs = True

browser = webdriver.Firefox(executable_path="C:/Utility/geckodriver.exe")
wd = webdriver.Firefox(executable_path="C:/Utility/geckodriver.exe", firefox_profile=profile)

url = "https://corp_intranet"
wd.get(url)

# set username
time.sleep(2)
username = wd.find_element_by_id("id_email")
username.send_keys("my_email@corp.com")

# set password
password = wd.find_element_by_id("id_password")
password.send_keys("my_password")

url = "https://corp_intranet"
r = requests.get(url)
content = r.content.decode('utf-8')
print(BeautifulSoup(content, 'html.parser'))
This logs into my corporate intranet fine, but it only prints very basic information. Hitting F12 shows me that a lot of the data on the page is rendered with JavaScript. I did a little research and tried to find a way to grab what I actually see on the screen, rather than a heavily diluted version of it. Is there some way to do a big data dump of all the data displayed on the page? Thanks.
python python-3.x selenium geckodriver
asked Nov 20 '18 at 21:47 by ryguy72, edited Nov 20 '18 at 21:55
This sounds like an X-Y problem. Instead of asking for help with your solution to the problem, edit your question and ask about the actual problem. What are you trying to do?
– DebanjanB
Nov 21 '18 at 6:10
2 Answers
You open two browsers; delete this line:

browser = webdriver.Firefox(executable_path="C:/Utility/geckodriver.exe")

The problem is that you're logged in in Selenium but not in requests, because each uses a different session. Grab the rendered page source from the driver instead:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

.....
.....
# missing click button? add "\n" to submit, or click the button
password.send_keys("my_password\n")
# wait max 10 seconds until "theID" is visible on the logged-in page
WebDriverWait(wd, 10).until(EC.presence_of_element_located((By.ID, "theID")))
content = wd.page_source
print(BeautifulSoup(content, 'html.parser'))
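If you do want to keep using requests after logging in, one common workaround is to copy the cookies from the Selenium session into a requests.Session — a minimal sketch, assuming the intranet uses cookie-based sessions (it may not accept cookies from a different client):

import requests

session = requests.Session()
# wd is the logged-in Selenium driver from above
for cookie in wd.get_cookies():
    session.cookies.set(cookie['name'], cookie['value'])

r = session.get(url)  # sends the authenticated cookies with the request
print(BeautifulSoup(r.text, 'html.parser'))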
answered Nov 20 '18 at 23:22 by ewwink
Yeah, that works. Thanks!!
– ryguy72
Nov 21 '18 at 17:06
You need to have Selenium wait for the webpage to load the additional content, via either implicit or explicit waits.
An Implicit Wait lets you choose a specific amount of time to wait before scraping.
An Explicit Wait lets you choose an event to wait for, such as a particular element becoming visible or clickable.
This answer goes into detail on this concept.
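For example, a minimal sketch of both kinds of wait (the element id "content" is a hypothetical placeholder for an element on your page):

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Implicit wait: every element lookup polls for up to 10 seconds
# before raising NoSuchElementException.
wd.implicitly_wait(10)

# Explicit wait: block until a specific condition holds (here, visibility
# of an element with the hypothetical id "content"), up to 10 seconds.
WebDriverWait(wd, 10).until(
    EC.visibility_of_element_located((By.ID, "content"))
)

html = wd.page_source  # the HTML after the JavaScript has rendered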
answered Nov 20 '18 at 22:05 by Ryan
Yes, yes, the page has to be fully loaded, of course, or there is nothing to parse. The problem is, my code is not picking up much from the URL. Also, it seems like the code is opening 2 browsers, instead of just 1 browser. That's weird!
– ryguy72
Nov 20 '18 at 22:31
I suggested this because the code you posted is just grabbing a URL with requests.get, which will only ever get the initial HTML of the response. It won't run any JavaScript to load the rest of the page. It sounds like you need to wait for the JavaScript to run, and then use something like Selenium's webdriver.page_source to get the HTML from the web browser that Selenium opened.
– Ryan
Nov 20 '18 at 22:52