How to scrape item position number in scrapy
How can I scrape the position number of each item on this site?
Website: http://books.toscrape.com/
Please see this screenshot: https://prnt.sc/lim3zl
# -*- coding: utf-8 -*-
import scrapy


class ToscrapeSpider(scrapy.Spider):
    name = 'toscrape'
    allowed_domains = ['books.toscrape.com']
    start_urls = ['http://books.toscrape.com/']

    def parse(self, response):
        lists = response.css('li.col-xs-6')
        for lis in lists:
            title = lis.xpath('.//h3//@title').extract_first()
            price = lis.xpath('.//*[@class="price_color"]//text()').extract_first()
            # I need to know how to scrape their position
            position = ''
            yield {
                'Title': title,
                'Price': price,
                'Position': position
            }
        # next=response.xpath('//*[@class="next"]//@href').extract_first()
        # next=response.urljoin(next)
        # if next:
        #     yield scrapy.Request(next)
scrapy position screen-scraping
What problem are you encountering? Are you getting any errors?
– Tsahi Asher
Nov 15 '18 at 15:49
Actually, there are no errors. I just need to know how I can scrape the position number of each item.
– Mohammad Palash Babu
Nov 15 '18 at 16:30
The output should look like: Title: A Light in the, Price: £51.77, Position: 1; Title: Tipping the Velvet, Price: £53.74, Position: 2
– Mohammad Palash Babu
Nov 15 '18 at 16:32
asked Nov 15 '18 at 15:43 by Mohammad Palash Babu
4 Answers
Try using enumerate in the loop; that should solve the problem. As I remember, something like this:
for i, lis in enumerate(lists):
    position = i + 1
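A slightly fuller sketch of the same idea, using a plain Python list in place of Scrapy selectors (the `books` list and the `with_positions` helper are made up for illustration). `enumerate` yields `(index, item)` tuples, so both names have to be unpacked in the `for` statement; looping with a single variable leaves you holding the tuple itself, which is one way to run into an `AttributeError` about a `'tuple' object`. Passing `start=1` makes the `i + 1` unnecessary.

```python
# Stand-in data; in the spider these would be the selectors in `lists`.
books = ["A Light in the Attic", "Tipping the Velvet", "Soumission"]


def with_positions(items):
    """Pair each item with its 1-based position on the page."""
    # enumerate yields (index, item) tuples; unpack both names in the loop.
    return [{"Title": title, "Position": position}
            for position, title in enumerate(items, start=1)]


rows = with_positions(books)
for row in rows:
    print(row)

# Without unpacking, the loop variable is the whole tuple, not the item.
first = next(iter(enumerate(books)))
print(type(first).__name__)  # tuple
```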
Not solved. I get AttributeError: 'tuple' object has no attribute 'xpath'
– Mohammad Palash Babu
Nov 15 '18 at 17:55
Can you show your code? I don't understand where you got this error.
– vezunchik
Nov 15 '18 at 20:40
It's working now, but when I crawl the next page the positions come out as 1-20 again. I need them to continue 21, 22, 23, and so on up to 1000.
– Mohammad Palash Babu
Nov 16 '18 at 0:58
Store the last position in meta when issuing the next request, then add it to the position inside the loop. Alternatively, use the page number to calculate the position.
– vezunchik
Nov 16 '18 at 9:51
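The second suggestion, deriving the position from the page number, might look like the sketch below. It assumes listing URLs of the form `.../catalogue/page-N.html` and a fixed 20 items per page; both are assumptions hard-coded here, and `page_number` and `global_position` are hypothetical helper names.

```python
import re

ITEMS_PER_PAGE = 20  # assumption: every full page lists 20 books


def page_number(url):
    """Return the page number encoded in the URL, or 1 if there is none."""
    match = re.search(r"page-(\d+)\.html", url)
    return int(match.group(1)) if match else 1


def global_position(url, index_on_page):
    """Convert a 0-based index on one page into a 1-based overall position."""
    return (page_number(url) - 1) * ITEMS_PER_PAGE + index_on_page + 1


print(global_position("http://books.toscrape.com/", 0))                        # 1
print(global_position("http://books.toscrape.com/catalogue/page-2.html", 0))   # 21
print(global_position("http://books.toscrape.com/catalogue/page-50.html", 19)) # 1000
```

Inside the spider this would be called as `global_position(response.url, i)` within `for i, lis in enumerate(lists)`. Unlike a running counter, it gives the same answer regardless of the order in which pages are fetched.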
import scrapy


class ToscrapeSpider(scrapy.Spider):
    name = 'toscrape'
    allowed_domains = ['books.toscrape.com']
    start_urls = ['http://books.toscrape.com/']

    def parse(self, response):
        # Number of products seen on earlier pages (0 on the first page).
        products_count = response.meta.get('products_count', 0)

        products = response.xpath('//article[@class="product_pod"]')
        for idx, product in enumerate(products):
            _image_container = product.xpath('.//div[@class="image_container"]')
            detail_page_url = _image_container.xpath('.//a/@href').extract_first()
            image = _image_container.xpath('.//img/@src').extract_first()

            name = product.xpath('.//h3/a/@title').extract_first()

            ratings = product.xpath('.//p[contains(@class, "star-rating")]/@class').extract_first()
            ratings = ratings.replace('star-rating', '').strip() if ratings else ratings

            price = product.xpath('.//p[@class="price_color"]/text()').extract_first()

            # Strip newlines, tabs and surrounding whitespace, then drop empty strings.
            availability = product.xpath('.//p[@class="instock availability"]//text()').extract()
            availability = list(map(lambda x: x.replace('\n', '').replace('\t', '').strip(), availability))
            availability = list(filter(lambda x: x, availability))
            availability = availability[0] if availability else availability

            yield dict(
                position=products_count + idx + 1,
                name=name,
                availability=availability,
                price=price,
                ratings=ratings,
                image=image,
                pdp_url=detail_page_url,
            )

        next_page = response.xpath('//li[@class="next"]/a/@href').extract_first()
        if next_page:
            # Pass the running total along so the next page continues the numbering.
            yield response.follow(next_page, meta=dict(products_count=products_count + len(products)))
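To see why carrying `products_count` in `meta` gives continuous numbering, here is a Scrapy-free sketch of the same bookkeeping. The `PAGES` data and the `crawl` function are made up for illustration: each pass of the outer loop stands in for one `parse` call, and `offset` plays the role of `response.meta['products_count']`.

```python
# Hypothetical stand-in for three listing pages of a site.
PAGES = [
    ["book-a", "book-b", "book-c"],
    ["book-d", "book-e", "book-f"],
    ["book-g"],
]


def crawl(pages):
    """Yield (position, name) pairs numbered continuously across pages."""
    offset = 0  # same default as response.meta.get('products_count', 0)
    for products in pages:
        for idx, name in enumerate(products):
            yield offset + idx + 1, name
        # Equivalent to passing products_count + len(products) in meta.
        offset += len(products)


results = list(crawl(PAGES))
for position, name in results:
    print(position, name)
```

Because each request carries its own count, this approach does not depend on any state shared across the spider.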
Can you check this, please? stackoverflow.com/a/53335666/10636764
– Mohammad Palash Babu
Nov 16 '18 at 10:14
@MohammadPalashBabu Are you sure you want to use Selenium? Please explain why you want to use it.
– Yash Pokar
Nov 16 '18 at 12:47
@MohammadPalashBabu You don't need Selenium for this project. It will slow your crawler down 20 times, maybe more.
– Yash Pokar
Nov 16 '18 at 12:49
I know Selenium isn't needed here, but I need a Selenium solution because I have a similar project.
– Mohammad Palash Babu
Nov 16 '18 at 15:11
If you could solve that, it would be very helpful for me.
– Mohammad Palash Babu
Nov 16 '18 at 15:12
You can simply use a class variable to track the position, like this:
import scrapy


class ToscrapeSpider(scrapy.Spider):
    name = 'toscrape'
    allowed_domains = ['books.toscrape.com']
    start_urls = ['http://books.toscrape.com/']
    position = 0  # running count of items across all pages

    def parse(self, response):
        lists = response.css('li.col-xs-6')
        for lis in lists:
            title = lis.xpath('.//h3//@title').extract_first()
            price = lis.xpath('.//p[@class="price_color"]//text()').extract_first()
            self.position += 1
            yield {
                'Title': title,
                'Price': price,
                'Position': self.position,
            }
        next_page = response.xpath('//li[@class="next"]/a/@href').extract_first()
        # Check for a link before joining; urljoin would otherwise return the current URL.
        if next_page:
            yield scrapy.Request(response.urljoin(next_page))
Then:
scrapy runspider myspider.py -o out.json
The out.json file contains:
[
{"Title": "A Light in the Attic", "Price": "\u00a351.77", "Position": 1},
{"Title": "Tipping the Velvet", "Price": "\u00a353.74", "Position": 2},
{"Title": "Soumission", "Price": "\u00a350.10", "Position": 3},
{"Title": "Sharp Objects", "Price": "\u00a347.82", "Position": 4},
{"Title": "Sapiens: A Brief History of Humankind", "Price": "\u00a354.23", "Position": 5},
{"Title": "The Requiem Red", "Price": "\u00a322.65", "Position": 6},
{"Title": "The Dirty Little Secrets of Getting Your Dream Job", "Price": "\u00a333.34", "Position": 7},
{"Title": "The Coming Woman: A Novel Based on the Life of the Infamous Feminist, Victoria Woodhull", "Price": "\u00a317.93", "Position": 8},
{"Title": "The Boys in the Boat: Nine Americans and Their Epic Quest for Gold at the 1936 Berlin Olympics", "Price": "\u00a322.60", "Position": 9},
{"Title": "The Black Maria", "Price": "\u00a352.15", "Position": 10},
{"Title": "Starving Hearts (Triangular Trade Trilogy, #1)", "Price": "\u00a313.99", "Position": 11},
{"Title": "Shakespeare's Sonnets", "Price": "\u00a320.66", "Position": 12},
{"Title": "Set Me Free", "Price": "\u00a317.46", "Position": 13},
{"Title": "Scott Pilgrim's Precious Little Life (Scott Pilgrim #1)", "Price": "\u00a352.29", "Position": 14},
{"Title": "Rip it Up and Start Again", "Price": "\u00a335.02", "Position": 15},
{"Title": "Our Band Could Be Your Life: Scenes from the American Indie Underground, 1981-1991", "Price": "\u00a357.25", "Position": 16},
{"Title": "Olio", "Price": "\u00a323.88", "Position": 17},
{"Title": "Mesaerion: The Best Science Fiction Stories 1800-1849", "Price": "\u00a337.59", "Position": 18},
{"Title": "Libertarianism for Beginners", "Price": "\u00a351.33", "Position": 19},
{"Title": "It's Only the Himalayas", "Price": "\u00a345.17", "Position": 20}
]
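The `00a3` code in front of each price is the Unicode code point of the pound sign (£) written as a JSON escape, not a scraping error. A quick check with the standard `json` module shows that the escaped and unescaped forms decode to the same string (the `item` dict is a sample taken from the output above):

```python
import json

item = {"Title": "A Light in the Attic", "Price": "\u00a351.77"}

escaped = json.dumps(item)                       # default: ASCII-only output
readable = json.dumps(item, ensure_ascii=False)  # keeps the pound sign as-is

print(escaped)
print(readable)

# Both forms decode to the same Python string.
assert json.loads(escaped) == json.loads(readable) == item
```

In Scrapy itself, setting `FEED_EXPORT_ENCODING = 'utf-8'` in the project settings should make the JSON feed write `£` directly.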
It's working, but when I crawl the next page the positions come out as 1-20 again. I need them to continue 21, 22, 23, and so on up to 1000. How can I get 21-1000?
– Mohammad Palash Babu
Nov 16 '18 at 0:50
next=response.xpath('//*[@class="next"]//@href').extract_first() next=response.urljoin(next) if next: yield scrapy.Request(next)
– Mohammad Palash Babu
Nov 16 '18 at 0:53
I have edited my solution to track the position in a class variable
– Guillaume
Nov 16 '18 at 3:46
@Yash Pokar, can you check this code, please?
How can I apply your method to this Selenium + Scrapy code?
# -*- coding: utf-8 -*-
from time import sleep

from scrapy import Spider
from selenium import webdriver
from scrapy.selector import Selector
from scrapy.http import Request
from selenium.common.exceptions import NoSuchElementException


class ToscrapeSpider(Spider):
    name = 'toscrape'
    allowed_domains = ['books.toscrape.com']
    # start_urls = ['http://books.toscrape.com/']

    def start_requests(self):
        self.driver = webdriver.Chrome()
        self.driver.get('http://books.toscrape.com/')
        sel = Selector(text=self.driver.page_source)
        lists = sel.css('li.col-xs-6')
        for i, lis in enumerate(lists):
            position = i + 1
            links = lis.xpath('.//h3//a//@href').extract_first()
            links = "http://books.toscrape.com/catalogue/" + links
            yield Request(links, meta={'position': position}, callback=self.parse_page)
        while True:
            try:
                next_page = self.driver.find_element_by_xpath('//*[@class="next"]//a')
                self.logger.info('Sleeping for 10 seconds.')
                next_page.click()
                sel = Selector(text=self.driver.page_source)
                lists = sel.css('li.col-xs-6')
                for i, lis in enumerate(lists):
                    position = i + 1
                    links = lis.xpath('.//h3//a//@href').extract_first()
                    links = "http://books.toscrape.com/catalogue/" + links
                    yield Request(links, meta={'position': position}, callback=self.parse_page)
            except NoSuchElementException:
                self.logger.info('No more pages to load.')
                self.driver.quit()
                break

    def parse_page(self, response):
        title = response.xpath('//h1//text()').extract_first()
        positions = response.meta['position']
        yield {
            'Title': title,
            'Position': positions
        }
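In the Selenium version the position restarts at 1 on every page because `enumerate` is called afresh for each page source. One way around that, sketched below without Selenium or Scrapy so it can be run on its own, is to keep a single counter that lives outside the per-page loop. The sketch also builds links with `urljoin` against the current page URL rather than prepending a fixed prefix, since relative links can differ between the first page and later pages. The `pages` data (including the sample hrefs) and the `build_requests` helper are illustrative assumptions.

```python
from itertools import count
from urllib.parse import urljoin

# Hypothetical (page_url, hrefs) pairs standing in for driver.current_url
# and the hrefs extracted from driver.page_source on each page.
pages = [
    ("http://books.toscrape.com/",
     ["catalogue/a-light-in-the-attic_1000/index.html"]),
    ("http://books.toscrape.com/catalogue/page-2.html",
     ["in-her-wake_980/index.html", "how-music-works_979/index.html"]),
]


def build_requests(pages):
    """Return (position, absolute_url) pairs numbered across all pages."""
    position = count(1)  # one counter shared by every page
    requests = []
    for page_url, hrefs in pages:
        for href in hrefs:
            # Resolve the relative link against the page it was found on.
            requests.append((next(position), urljoin(page_url, href)))
    return requests


requests = build_requests(pages)
for pos, url in requests:
    print(pos, url)
```

In the spider, each pair would become `Request(url, meta={'position': pos}, callback=self.parse_page)`, with `self.driver.current_url` supplying `page_url`.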
answered Nov 15 '18 at 17:08 by vezunchik
answered Nov 16 '18 at 3:16 by Yash Pokar
can check it please stackoverflow.com/a/53335666/10636764
– Mohammad Palash Babu
Nov 16 '18 at 10:14
@MohammadPalashBabu are you sure you want to use selenium? answer the question why you want to use selenium.
– Yash Pokar
Nov 16 '18 at 12:47
@MohammadPalashBabu you don't need selenium in this project. It will slow down your crawler 20 times or may be more than that.
– Yash Pokar
Nov 16 '18 at 12:49
I know that don't needed selenium but i need a solution for selenium. cause i have a similar project
– Mohammad Palash Babu
Nov 16 '18 at 15:11
If you solve that's would be very helpful for me
– Mohammad Palash Babu
Nov 16 '18 at 15:12
add a comment |
can check it please stackoverflow.com/a/53335666/10636764
– Mohammad Palash Babu
Nov 16 '18 at 10:14
@MohammadPalashBabu are you sure you want to use selenium? answer the question why you want to use selenium.
– Yash Pokar
Nov 16 '18 at 12:47
@MohammadPalashBabu you don't need selenium in this project. It will slow down your crawler 20 times or may be more than that.
– Yash Pokar
Nov 16 '18 at 12:49
I know that don't needed selenium but i need a solution for selenium. cause i have a similar project
– Mohammad Palash Babu
Nov 16 '18 at 15:11
If you solve that's would be very helpful for me
– Mohammad Palash Babu
Nov 16 '18 at 15:12
can check it please stackoverflow.com/a/53335666/10636764
– Mohammad Palash Babu
Nov 16 '18 at 10:14
can check it please stackoverflow.com/a/53335666/10636764
– Mohammad Palash Babu
Nov 16 '18 at 10:14
@MohammadPalashBabu are you sure you want to use selenium? answer the question why you want to use selenium.
– Yash Pokar
Nov 16 '18 at 12:47
@MohammadPalashBabu are you sure you want to use selenium? answer the question why you want to use selenium.
– Yash Pokar
Nov 16 '18 at 12:47
@MohammadPalashBabu you don't need selenium in this project. It will slow down your crawler 20 times or may be more than that.
– Yash Pokar
Nov 16 '18 at 12:49
@MohammadPalashBabu you don't need selenium in this project. It will slow down your crawler 20 times or may be more than that.
– Yash Pokar
Nov 16 '18 at 12:49
I know that don't needed selenium but i need a solution for selenium. cause i have a similar project
– Mohammad Palash Babu
Nov 16 '18 at 15:11
I know that don't needed selenium but i need a solution for selenium. cause i have a similar project
– Mohammad Palash Babu
Nov 16 '18 at 15:11
If you solve that's would be very helpful for me
– Mohammad Palash Babu
Nov 16 '18 at 15:12
If you solve that's would be very helpful for me
– Mohammad Palash Babu
Nov 16 '18 at 15:12
add a comment |
You can simply use a class variable to track the position, like this:
import scrapy
class ToscrapeSpider(scrapy.Spider):
name = 'toscrape'
allowed_domains = ['books.toscrape.com']
start_urls = ['http://books.toscrape.com/']
position = 0
def parse(self, response):
lists = response.css('li.col-xs-6')
for lis in lists:
title = lis.xpath('.//h3//@title').extract_first()
price = lis.xpath('.//p[@class="price_color"]//text()').extract_first()
self.position += 1
yield {
'Title': title,
'Price': price,
'Position': self.position,
}
next = response.xpath('//li[@class="next"]/a/@href').extract_first()
next = response.urljoin(next)
if next:
yield scrapy.Request(next)
Then:
scrapy runspider myspider.py -o out.json
The out.json file contains:
[
{"Title": "A Light in the Attic", "Price": "u00a351.77", "Position": 1},
{"Title": "Tipping the Velvet", "Price": "u00a353.74", "Position": 2},
{"Title": "Soumission", "Price": "u00a350.10", "Position": 3},
{"Title": "Sharp Objects", "Price": "u00a347.82", "Position": 4},
{"Title": "Sapiens: A Brief History of Humankind", "Price": "u00a354.23", "Position": 5},
{"Title": "The Requiem Red", "Price": "u00a322.65", "Position": 6},
{"Title": "The Dirty Little Secrets of Getting Your Dream Job", "Price": "u00a333.34", "Position": 7},
{"Title": "The Coming Woman: A Novel Based on the Life of the Infamous Feminist, Victoria Woodhull", "Price": "u00a317.93", "Position": 8},
{"Title": "The Boys in the Boat: Nine Americans and Their Epic Quest for Gold at the 1936 Berlin Olympics", "Price": "u00a322.60", "Position": 9},
{"Title": "The Black Maria", "Price": "u00a352.15", "Position": 10},
{"Title": "Starving Hearts (Triangular Trade Trilogy, #1)", "Price": "u00a313.99", "Position": 11},
{"Title": "Shakespeare's Sonnets", "Price": "u00a320.66", "Position": 12},
{"Title": "Set Me Free", "Price": "u00a317.46", "Position": 13},
{"Title": "Scott Pilgrim's Precious Little Life (Scott Pilgrim #1)", "Price": "u00a352.29", "Position": 14},
{"Title": "Rip it Up and Start Again", "Price": "u00a335.02", "Position": 15},
{"Title": "Our Band Could Be Your Life: Scenes from the American Indie Underground, 1981-1991", "Price": "u00a357.25", "Position": 16},
{"Title": "Olio", "Price": "u00a323.88", "Position": 17},
{"Title": "Mesaerion: The Best Science Fiction Stories 1800-1849", "Price": "u00a337.59", "Position": 18},
{"Title": "Libertarianism for Beginners", "Price": "u00a351.33", "Position": 19},
{"Title": "It's Only the Himalayas", "Price": "u00a345.17", "Position": 20}
]
it's working but When i crawl the next page then output results came 1-20 repeatedly but i need to get the results position 21,22,23 to last 1000 How to i can get that 21-1000 ?
– Mohammad Palash Babu
Nov 16 '18 at 0:50
next=response.xpath('//*[@class="next"]//@href').extract_first() next=response.urljoin(next) if next: yield scrapy.Request(next)
– Mohammad Palash Babu
Nov 16 '18 at 0:53
I have edited my solution to track the position in a class variable
– Guillaume
Nov 16 '18 at 3:46
edited Nov 16 '18 at 3:45
answered Nov 15 '18 at 18:59
Guillaume
Yash Pokar, can you check this code please?
How can I apply your method in this Selenium + Scrapy code?

    # -*- coding: utf-8 -*-
    from time import sleep
    from scrapy import Spider
    from selenium import webdriver
    from scrapy.selector import Selector
    from scrapy.http import Request
    from selenium.common.exceptions import NoSuchElementException

    class ToscrapeSpider(Spider):
        name = 'toscrape'
        allowed_domains = ['books.toscrape.com']
        # start_urls = ['http://books.toscrape.com/']

        def start_requests(self):
            self.driver = webdriver.Chrome()
            self.driver.get('http://books.toscrape.com/')
            sel = Selector(text=self.driver.page_source)
            lists = sel.css('li.col-xs-6')
            # enumerate() restarts at 0 on every page, so position is per-page
            for i, lis in enumerate(lists):
                position = i + 1
                links = lis.xpath('.//h3//a//@href').extract_first()
                links = "http://books.toscrape.com/catalogue/" + links
                yield Request(links, meta={'position': position}, callback=self.parse_page)
            while True:
                try:
                    next_page = self.driver.find_element_by_xpath('//*[@class="next"]//a')
                    self.logger.info('Sleeping for 10 seconds.')
                    next_page.click()
                    sel = Selector(text=self.driver.page_source)
                    lists = sel.css('li.col-xs-6')
                    for i, lis in enumerate(lists):
                        position = i + 1
                        links = lis.xpath('.//h3//a//@href').extract_first()
                        links = "http://books.toscrape.com/catalogue/" + links
                        yield Request(links, meta={'position': position}, callback=self.parse_page)
                except NoSuchElementException:
                    self.logger.info('No more pages to load.')
                    self.driver.quit()
                    break

        def parse_page(self, response):
            title = response.xpath('//h1//text()').extract_first()
            positions = response.meta['position']
            yield {
                'Title': title,
                'Position': positions
            }
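One possible way to carry the counter idea over to the Selenium loop is to keep a single running count across all pages instead of restarting `enumerate` on each page. The following is a rough sketch in plain Python, with Selenium and Scrapy left out so that it runs on its own. `number_links` is a hypothetical helper, and the page lists are made-up stand-ins for the hrefs extracted from each `page_source`.

```python
from itertools import count


def number_links(pages):
    """Yield (position, link) pairs, numbering links continuously across pages.

    `pages` is any iterable of per-page link lists (a hypothetical stand-in
    for the hrefs pulled out of each Selenium page_source).
    """
    counter = count(1)  # one counter shared by every page, starting at 1
    for links in pages:
        for link in links:
            yield next(counter), link


# Made-up hrefs for two short pages, just to show the numbering.
pages = [['a/index.html', 'b/index.html'], ['c/index.html']]
for position, link in number_links(pages):
    print(position, link)
```

In the spider itself, the same effect could come from a `self.position` attribute that is incremented before each `Request`, with the value passed through `meta={'position': self.position}` in the same way the per-page index is passed now.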
answered Nov 16 '18 at 10:13
Mohammad Palash Babu