How to correctly scrape a JavaScript-based site?
I'm testing the code below.



from bs4 import BeautifulSoup
import requests
import time
from selenium import webdriver

profile = webdriver.FirefoxProfile()
profile.accept_untrusted_certs = True

browser = webdriver.Firefox(executable_path="C:/Utility/geckodriver.exe")
wd = webdriver.Firefox(executable_path="C:/Utility/geckodriver.exe", firefox_profile=profile)
url = "https://corp_intranet"
wd.get(url)

# set username
time.sleep(2)
username = wd.find_element_by_id("id_email")
username.send_keys("my_email@corp.com")

# set password
password = wd.find_element_by_id("id_password")
password.send_keys("my_password")


url = "https://corp_intranet"
r = requests.get(url)
content = r.content.decode('utf-8')
print(BeautifulSoup(content, 'html.parser'))


This logs into my corporate intranet fine, but it only prints very basic information. Hitting F12 shows me that a lot of the data on the page is rendered with JavaScript. I did a little research and tried to find a way to grab what I actually see on the screen, rather than a heavily diluted version of it. Is there some way to do a full dump of all the data that is displayed on the page? Thanks.
  • This sounds like an X-Y problem. Instead of asking for help with your solution to the problem, edit your question and ask about the actual problem. What are you trying to do?

    – DebanjanB
    Nov 21 '18 at 6:10
python python-3.x selenium geckodriver
edited Nov 20 '18 at 21:55
asked Nov 20 '18 at 21:47 by ryguy72
2 Answers
You open 2 browsers; delete this line:



browser = webdriver.Firefox(executable_path="C:/Utility/geckodriver.exe")


The problem is that in Selenium you are logged in, but not in requests, because requests uses a different session.



from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# ... (login code from the question above) ...
# missing click on the login button? add "\n" to submit, or click the button explicitly
password.send_keys("my_password\n")

# wait at most 10 seconds until the element with id "theID" is present on the logged-in page
WebDriverWait(wd, 10).until(EC.presence_of_element_located((By.ID, "theID")))

content = wd.page_source
print(BeautifulSoup(content, 'html.parser'))
answered Nov 20 '18 at 23:22 by ewwink
  • Yeah, that works. Thanks!!

    – ryguy72
    Nov 21 '18 at 17:06
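As a side note on the session mismatch described above: if you still want to use requests after logging in with Selenium, you can copy the driver's cookies into a `requests.Session` instead of issuing a fresh, unauthenticated `requests.get`. A minimal sketch (the helper name is mine, not from either answer):

```python
import requests

def session_from_driver(driver):
    """Copy Selenium's cookies into a requests.Session so that
    subsequent HTTP requests reuse the logged-in browser session."""
    s = requests.Session()
    for cookie in driver.get_cookies():
        # Selenium returns each cookie as a dict with at least name/value
        s.cookies.set(cookie["name"], cookie["value"],
                      domain=cookie.get("domain"),
                      path=cookie.get("path", "/"))
    return s

# usage (after logging in with wd):
#   s = session_from_driver(wd)
#   r = s.get("https://corp_intranet")
#   print(r.text)
```

Note this only helps for server-rendered pages; JavaScript-rendered content still requires reading `wd.page_source` from the browser itself, as in the accepted answer.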
You need to have Selenium wait for the webpage to load the additional content, via either implicit or explicit waits.



An implicit wait lets you choose a fixed amount of time to wait before scraping.



An explicit wait lets you choose an event to wait for, such as a particular element being visible or clickable.



This answer goes into detail on this concept.
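Conceptually, an explicit wait is just a polling loop: re-evaluate a condition until it returns something truthy or a timeout expires. A simplified sketch of what Selenium's `WebDriverWait.until` does under the hood (the real implementation also handles a list of ignored exceptions):

```python
import time

def wait_until(condition, timeout=10.0, poll=0.5):
    """Poll `condition` until it returns a truthy value or `timeout`
    seconds elapse. Raises TimeoutError on expiry (WebDriverWait
    raises selenium's TimeoutException instead)."""
    end = time.monotonic() + timeout
    while True:
        value = condition()
        if value:
            return value
        if time.monotonic() > end:
            raise TimeoutError("condition not met within %.1f s" % timeout)
        time.sleep(poll)

# The Selenium equivalent of this loop is:
#   WebDriverWait(wd, 10).until(
#       EC.presence_of_element_located((By.ID, "theID")))
```

This is why an explicit wait is usually preferable to `time.sleep(2)`: it returns as soon as the condition holds rather than always paying the full delay.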
answered Nov 20 '18 at 22:05 by Ryan
  • Yes, yes, the page has to be fully loaded, of course, or there is nothing to parse. The problem is, my code is not picking up much from the URL. Also, it seems like the code is opening 2 browsers, instead of just 1 browser. That's weird!

    – ryguy72
    Nov 20 '18 at 22:31











  • I suggested this because the code you posted is just grabbing a url with requests.get, which will only ever get the initial HTML of the response. It won't run any javascript to load the rest of the page. It sounds like you need to wait for the javascript to run, and then use something like Selenium's webdriver.page_source to get the HTML from the web browser that Selenium opened.

    – Ryan
    Nov 20 '18 at 22:52