Text from webpage using BeautifulSoup
I'm trying to extract some data from https://markets.cboe.com/europe/equities/market_share/index/all/ using Python.
Specifically, I want the figure for "Market Non-displayed volume total". I've tried several approaches with BeautifulSoup, but none of them gets me there.
Any ideas?
python http beautifulsoup
please share what you have done so far and point out where it failed.
– Alexis
Nov 20 '18 at 17:25
I can't find anything on that page that says "Market Non-displayed volume total"
– NotAnAmbiTurner
Nov 20 '18 at 17:32
asked Nov 20 '18 at 17:24
L.1995
2 Answers
I would suggest giving the pandas html reader a shot:
import pandas as pd
# Read in all tables at this address as pandas dataframes
results = pd.read_html('https://markets.cboe.com/europe/equities/market_share/index/all')
# Grab the second table found
df = results[1]
# Set the first column as the index
df = df.set_index(0)
# Switch columns and indexes
df = df.T
# Drop any columns that have no data in them
df = df.dropna(how='all', axis=1)
# Set the column under "Displayed Price Venues" as the index
df = df.set_index('Displayed Price Venues')
# Switch columns and indexes again
df = df.T
# Aesthetic: I don't like having an index name, so clear it
df.index.name = None
# Separate the three subtables from each other!
displayed = df.iloc[0:18]
non_displayed = df.iloc[18:-1]
total = df.iloc[-1]
You can also do this in a more aggressively compact way (same code but without breaking the steps down):
import pandas as pd
# Read in all tables at this address as pandas dataframes
results = pd.read_html('https://markets.cboe.com/europe/equities/market_share/index/all')
# Do all the stuff above in one go
df = results[1].set_index(0).T.dropna(how='all', axis=1).set_index('Displayed Price Venues').T
# Aesthetic: I don't like having an index name, so clear it
df.index.name = None
# Separate the three subtables from each other!
displayed = df.iloc[0:18]
non_displayed = df.iloc[18:-1]
total = df.iloc[-1]
answered Nov 20 '18 at 17:54
jfbeltran
Thanks so much ! worked like a charm
– L.1995
Nov 21 '18 at 16:06
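A note on the hard-coded slices in the pandas answer above: positions such as 0:18 and 18:-1 depend on how many rows the table happens to have. Below is a minimal sketch of picking rows by label text instead, assuming the row labels sit in the DataFrame index as they do after the transposes above. The helper name rows_matching and the toy labels and numbers are made up for illustration; the labels on the real page may differ.

```python
import pandas as pd


def rows_matching(df: pd.DataFrame, pattern: str) -> pd.DataFrame:
    """Return the rows whose index label contains `pattern` (case-insensitive)."""
    # Convert labels to strings so missing or numeric labels do not break matching
    labels = df.index.astype(str)
    mask = labels.str.contains(pattern, case=False, regex=False)
    return df[mask]


# Toy stand-in for the cleaned-up table; labels and numbers are hypothetical
toy = pd.DataFrame(
    {"Notional": [100, 250, 40, 390]},
    index=["Venue A", "Venue B", "Non-displayed total", "Market total"],
)

print(rows_matching(toy, "non-displayed"))
```

Matching on the label keeps working if venues are added or removed, as long as the wording of the label itself stays the same.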
The problem is that the id keeps changing dynamically; otherwise, I would have just used it. Assuming that the value shown under Output is what you're looking for, this should work, as long as the content doesn't change or get shifted around.
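from bs4 import BeautifulSoup as bs
import requests
url = 'https://markets.cboe.com/europe/equities/market_share/index/all/'
# Download the page and parse it with the lxml parser
page = requests.get(url)
html = bs(page.text, 'lxml')
# Collect every table cell that has the class "idx_val"
total_volume = html.find_all('td', class_='idx_val')
# The figure sat at position 645 when this was written; the index shifts if the page layout changes
print(total_volume[645].text)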
Output:
€4,378,517,621
answered Nov 20 '18 at 17:57
Kamikaze_goldfish
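Since a fixed position such as 645 breaks as soon as cells are added or removed, one possible alternative is to find the row by its label text and then read the value cell from that row. This is only a sketch under stated assumptions: the markup below is made up, the class "idx_name" and the label "Example total" are hypothetical, and only the class "idx_val" comes from the answer above. The helper names value_for_label and euros_to_int are illustrative too.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration; the real page's structure may differ
SAMPLE_HTML = """
<table>
  <tr><td class="idx_name">Venue A</td><td class="idx_val">€1,200</td></tr>
  <tr><td class="idx_name">Example total</td><td class="idx_val">€4,378,517,621</td></tr>
</table>
"""


def value_for_label(html, label):
    """Return the 'idx_val' text from the first row that mentions `label`, or None."""
    soup = BeautifulSoup(html, "html.parser")
    for row in soup.find_all("tr"):
        cells = row.find_all("td")
        # Case-insensitive substring match against every cell in the row
        if any(label.lower() in cell.get_text().lower() for cell in cells):
            value_cell = row.find("td", class_="idx_val")
            if value_cell is not None:
                return value_cell.get_text(strip=True)
    return None


def euros_to_int(text):
    """Convert a string such as '€4,378,517,621' to an integer."""
    return int(text.replace("€", "").replace(",", "").strip())


raw = value_for_label(SAMPLE_HTML, "example total")
print(raw, euros_to_int(raw))
```

To try it against the live page, pass page.text from requests.get(url) instead of SAMPLE_HTML and use whatever label the page actually shows for the figure.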