How to scrape item position number in scrapy












How can I scrape each item's position number from this site?

Website:
http://books.toscrape.com/

Please see this screenshot:

https://prnt.sc/lim3zl



# -*- coding: utf-8 -*-

import scrapy


class ToscrapeSpider(scrapy.Spider):
    name = 'toscrape'
    allowed_domains = ['books.toscrape.com']
    start_urls = ['http://books.toscrape.com/']

    def parse(self, response):
        lists = response.css('li.col-xs-6')
        for lis in lists:
            title = lis.xpath('.//h3//@title').extract_first()
            price = lis.xpath('.//*[@class="price_color"]//text()').extract_first()
            # I need to know how to scrape each item's position
            position = ''

            yield {
                'Title': title,
                'Price': price,
                'Position': position,
            }
        # next = response.xpath('//*[@class="next"]//@href').extract_first()
        # next = response.urljoin(next)
        # if next:
        #     yield scrapy.Request(next)






scrapy position screen-scraping
















asked Nov 15 '18 at 15:43









Mohammad Palash Babu












  • What problem are you encountering? Are you getting any errors?
    – Tsahi Asher
    Nov 15 '18 at 15:49










  • Actually, there are no errors. I just need to know how to scrape each item's position number.
    – Mohammad Palash Babu
    Nov 15 '18 at 16:30










  • Expected output, for example: Title: A Light in the ... Price: £51.77 Position: 1, then Title: Tipping the Velvet Price: £53.74 Position: 2
    – Mohammad Palash Babu
    Nov 15 '18 at 16:32


















4 Answers














You can simply use a class variable to track the position, like this:



import scrapy


class ToscrapeSpider(scrapy.Spider):

    name = 'toscrape'
    allowed_domains = ['books.toscrape.com']
    start_urls = ['http://books.toscrape.com/']

    # Running counter shared by every call to parse()
    position = 0

    def parse(self, response):

        lists = response.css('li.col-xs-6')

        for lis in lists:

            title = lis.xpath('.//h3//@title').extract_first()
            price = lis.xpath('.//p[@class="price_color"]//text()').extract_first()

            self.position += 1

            yield {
                'Title': title,
                'Price': price,
                'Position': self.position,
            }

        # Follow the "next" link only when one exists; urljoin(None) would
        # return the current URL, which is always truthy.
        next_page = response.xpath('//li[@class="next"]/a/@href').extract_first()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page))


Then:



scrapy runspider myspider.py -o out.json



The out.json file contains:



[
{"Title": "A Light in the Attic", "Price": "\u00a351.77", "Position": 1},
{"Title": "Tipping the Velvet", "Price": "\u00a353.74", "Position": 2},
{"Title": "Soumission", "Price": "\u00a350.10", "Position": 3},
{"Title": "Sharp Objects", "Price": "\u00a347.82", "Position": 4},
{"Title": "Sapiens: A Brief History of Humankind", "Price": "\u00a354.23", "Position": 5},
{"Title": "The Requiem Red", "Price": "\u00a322.65", "Position": 6},
{"Title": "The Dirty Little Secrets of Getting Your Dream Job", "Price": "\u00a333.34", "Position": 7},
{"Title": "The Coming Woman: A Novel Based on the Life of the Infamous Feminist, Victoria Woodhull", "Price": "\u00a317.93", "Position": 8},
{"Title": "The Boys in the Boat: Nine Americans and Their Epic Quest for Gold at the 1936 Berlin Olympics", "Price": "\u00a322.60", "Position": 9},
{"Title": "The Black Maria", "Price": "\u00a352.15", "Position": 10},
{"Title": "Starving Hearts (Triangular Trade Trilogy, #1)", "Price": "\u00a313.99", "Position": 11},
{"Title": "Shakespeare's Sonnets", "Price": "\u00a320.66", "Position": 12},
{"Title": "Set Me Free", "Price": "\u00a317.46", "Position": 13},
{"Title": "Scott Pilgrim's Precious Little Life (Scott Pilgrim #1)", "Price": "\u00a352.29", "Position": 14},
{"Title": "Rip it Up and Start Again", "Price": "\u00a335.02", "Position": 15},
{"Title": "Our Band Could Be Your Life: Scenes from the American Indie Underground, 1981-1991", "Price": "\u00a357.25", "Position": 16},
{"Title": "Olio", "Price": "\u00a323.88", "Position": 17},
{"Title": "Mesaerion: The Best Science Fiction Stories 1800-1849", "Price": "\u00a337.59", "Position": 18},
{"Title": "Libertarianism for Beginners", "Price": "\u00a351.33", "Position": 19},
{"Title": "It's Only the Himalayas", "Price": "\u00a345.17", "Position": 20}
]
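A note on the "\u00a3" in the prices: it is the JSON escape for the pound sign, so the data itself is intact. The sketch below uses only the standard library to show that one line of the output decodes back to "£51.77". The `line` string is copied from the output above. As far as I know, setting Scrapy's `FEED_EXPORT_ENCODING` to `'utf-8'` makes the exporter write the literal character instead, but check that against the Scrapy version you use.

```python
import json

# One record from out.json, kept as raw text so the escape stays visible.
line = r'{"Title": "A Light in the Attic", "Price": "\u00a351.77", "Position": 1}'

# Decoding turns the \u00a3 escape back into the pound sign.
item = json.loads(line)
print(item['Price'])

# ensure_ascii=True (the default) escapes non-ASCII characters again;
# ensure_ascii=False keeps them as literal characters.
escaped = json.dumps(item)
literal = json.dumps(item, ensure_ascii=False)
```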




























  • It's working, but when I crawl the next page the positions come out as 1-20 again. I need them to continue as 21, 22, 23 and so on, up to the last item at 1000. How can I get 21-1000?
    – Mohammad Palash Babu
    Nov 16 '18 at 0:50












  • next=response.xpath('//*[@class="next"]//@href').extract_first() next=response.urljoin(next) if next: yield scrapy.Request(next)
    – Mohammad Palash Babu
    Nov 16 '18 at 0:53










  • I have edited my solution to track the position in a class variable
    – Guillaume
    Nov 16 '18 at 3:46

































@Yash Pokar, can you check this code please?

How can I apply your method in this Selenium + Scrapy code?

# -*- coding: utf-8 -*-



from time import sleep
from scrapy import Spider
from selenium import webdriver
from scrapy.selector import Selector
from scrapy.http import Request
from selenium.common.exceptions import NoSuchElementException


class ToscrapeSpider(Spider):
    name = 'toscrape'
    allowed_domains = ['books.toscrape.com']
    # start_urls = ['http://books.toscrape.com/']

    def start_requests(self):
        self.driver = webdriver.Chrome()
        self.driver.get('http://books.toscrape.com/')
        sel = Selector(text=self.driver.page_source)
        lists = sel.css('li.col-xs-6')
        for i, lis in enumerate(lists):
            position = i + 1
            links = lis.xpath('.//h3//a//@href').extract_first()
            links = "http://books.toscrape.com/catalogue/" + links
            yield Request(links, meta={'position': position}, callback=self.parse_page)

        while True:
            try:
                next_page = self.driver.find_element_by_xpath('//*[@class="next"]//a')
                self.logger.info('Sleeping for 10 seconds.')
                next_page.click()
                sel = Selector(text=self.driver.page_source)
                lists = sel.css('li.col-xs-6')
                for i, lis in enumerate(lists):
                    position = i + 1
                    links = lis.xpath('.//h3//a//@href').extract_first()
                    links = "http://books.toscrape.com/catalogue/" + links
                    yield Request(links, meta={'position': position}, callback=self.parse_page)

            except NoSuchElementException:
                self.logger.info('No more pages to load.')
                self.driver.quit()
                break

    def parse_page(self, response):
        title = response.xpath('//h1//text()').extract_first()
        positions = response.meta['position']

        yield {
            'Title': title,
            'Position': positions,
        }
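One possible way to keep the positions running across pages in a loop like the one above is to create a single counter outside the per-page loop and take the next value for every link, instead of restarting `enumerate` on each page. This is only a sketch of that idea: the `pages` list is a hypothetical stand-in for the links read from each rendered page, and no browser is involved.

```python
from itertools import count


def number_links(pages):
    """Yield (position, link) pairs numbered continuously across pages."""
    position = count(1)  # created once, outside the per-page loop
    for links in pages:
        for link in links:
            # next() advances the shared counter, so page 2 continues
            # where page 1 stopped instead of restarting at 1.
            yield next(position), link


# Hypothetical stand-in for the links collected from each rendered page.
pages = [['a.html', 'b.html'], ['c.html']]
numbered = list(number_links(pages))
print(numbered)
```

In the spider, the same counter could be stored on `self` before the first page is read, and `next(self.counter)` used wherever `i + 1` appears.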





Use enumerate in the loop; that should solve the problem. As far as I remember, something like this:

for i, lis in enumerate(lists):
    position = i + 1
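A small self-contained illustration of the same idea, with plain strings standing in for the scraped selectors (the `sample_titles` list and the `number_items` helper are made up for this sketch). `enumerate` also accepts `start=1`, which removes the need for `i + 1`. Note that each value produced by `enumerate` is a tuple, so both names have to be unpacked in the `for` statement.

```python
def number_items(titles):
    """Yield one dict per title with a 1-based Position field."""
    # enumerate(..., start=1) yields (position, item) pairs; unpack both.
    for position, title in enumerate(titles, start=1):
        yield {'Title': title, 'Position': position}


# Hypothetical sample data standing in for the scraped <li> selectors.
sample_titles = ['A Light in the Attic', 'Tipping the Velvet']
numbered = list(number_items(sample_titles))
print(numbered)

# Without unpacking, the loop variable is the whole (index, item) tuple,
# which has none of the item's own methods.
first_pair = next(iter(enumerate(sample_titles)))
print(type(first_pair).__name__)
```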
















answered Nov 15 '18 at 17:08

vezunchik












  • Not solved. I get AttributeError: 'tuple' object has no attribute 'xpath'.
    – Mohammad Palash Babu
    Nov 15 '18 at 17:55

  • Can you show your code? I don't understand where you got this error.
    – vezunchik
    Nov 15 '18 at 20:40

  • It's working now, but when I crawl the next page the positions come out as 1-20 again. I need them to continue as 21, 22, 23 and so on, up to the last item at 1000.
    – Mohammad Palash Babu
    Nov 16 '18 at 0:58

  • Store the last position in meta when issuing the next request, and then add it to the position in the loop. Or use the page number to calculate the position.
    – vezunchik
    Nov 16 '18 at 9:51
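The second suggestion (calculating the position from the page number) could look roughly like the sketch below. It rests on two assumptions that are not stated in the thread: that listing pages after the first have URLs ending in `page-N.html`, and that every full page holds 20 items. The helper names are made up for the example.

```python
import re

PAGE_SIZE = 20  # assumption: every full listing page shows 20 books


def page_number(url):
    """Return the page number encoded in a listing URL, defaulting to 1."""
    match = re.search(r'page-(\d+)\.html', url)
    return int(match.group(1)) if match else 1


def absolute_position(url, index_on_page, page_size=PAGE_SIZE):
    """Convert a 0-based index on one page into a 1-based overall position."""
    return (page_number(url) - 1) * page_size + index_on_page + 1


print(absolute_position('http://books.toscrape.com/', 0))
print(absolute_position('http://books.toscrape.com/catalogue/page-2.html', 0))
```

Inside `parse`, `response.url` and the index from `enumerate` would be the two arguments. The result does not depend on the order in which pages are processed, which is an advantage over a shared counter.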






























import scrapy


class ToscrapeSpider(scrapy.Spider):
    name = 'toscrape'
    allowed_domains = ['books.toscrape.com']
    start_urls = ['http://books.toscrape.com/']

    def parse(self, response):
        # Number of products already seen on earlier pages (0 on the first page)
        products_count = response.meta.get('products_count', 0)

        products = response.xpath('//article[@class="product_pod"]')

        for idx, product in enumerate(products):
            _image_container = product.xpath('.//div[@class="image_container"]')

            detail_page_url = _image_container.xpath('.//a/@href').extract_first()
            image = _image_container.xpath('.//img/@src').extract_first()

            name = product.xpath('.//h3/a/@title').extract_first()

            ratings = product.xpath('.//p[contains(@class, "star-rating")]/@class').extract_first()
            ratings = ratings.replace('star-rating', '').strip() if ratings else ratings

            price = product.xpath('.//p[@class="price_color"]/text()').extract_first()
            availability = product.xpath('.//p[@class="instock availability"]//text()').extract()
            # Strip newlines, tabs and surrounding spaces, then drop empty strings
            availability = list(map(lambda x: x.replace('\n', '').replace('\t', '').strip(), availability))
            availability = list(filter(lambda x: x, availability))

            availability = availability[0] if availability else availability

            yield dict(
                position=products_count + idx + 1,
                name=name,
                availability=availability,
                price=price,
                ratings=ratings,
                image=image,
                pdp_url=detail_page_url,
            )

        next_page = response.xpath('//li[@class="next"]/a/@href').extract_first()

        if next_page:
            # Pass the running total on to the next page's parse() call
            yield response.follow(next_page, meta=dict(products_count=products_count + len(products)))
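The offset-carrying logic in this spider can be checked without a network. The sketch below is not Scrapy code: it assumes each page is just a list of names, and the hypothetical `parse_page` helper returns the numbered items together with the count to hand to the next page, which is the role `meta` plays in the spider.

```python
def parse_page(products, products_count=0):
    """Number the products on one page, continuing from products_count."""
    items = [
        {'position': products_count + idx + 1, 'name': name}
        for idx, name in enumerate(products)
    ]
    # Second value: the running total to pass on to the next page.
    return items, products_count + len(products)


# Hypothetical pages: three products on the first, two on the second.
pages = [['a', 'b', 'c'], ['d', 'e']]
offset = 0
all_items = []
for page in pages:
    items, offset = parse_page(page, offset)
    all_items.extend(items)

print([item['position'] for item in all_items])
```

Because the count travels with each request rather than living on the spider, the numbering stays correct as long as every page is reached through its predecessor's "next" link.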
















answered Nov 16 '18 at 3:16

Yash Pokar












  • Can you please check this: stackoverflow.com/a/53335666/10636764
    – Mohammad Palash Babu
    Nov 16 '18 at 10:14

  • @MohammadPalashBabu Are you sure you want to use Selenium? Please explain why you want to use it.
    – Yash Pokar
    Nov 16 '18 at 12:47

  • @MohammadPalashBabu You don't need Selenium in this project. It will slow down your crawler 20 times, or maybe more.
    – Yash Pokar
    Nov 16 '18 at 12:49

  • I know Selenium isn't needed here, but I need a solution that works with Selenium because I have a similar project.
    – Mohammad Palash Babu
    Nov 16 '18 at 15:11

  • If you can solve that, it would be very helpful for me.
    – Mohammad Palash Babu
    Nov 16 '18 at 15:12
































    You can simply use a class variable to track the position, like this:



    import scrapy


    class ToscrapeSpider(scrapy.Spider):

        name = 'toscrape'
        allowed_domains = ['books.toscrape.com']
        start_urls = ['http://books.toscrape.com/']

        # Running counter shared by every page the spider parses
        position = 0

        def parse(self, response):

            lists = response.css('li.col-xs-6')

            for lis in lists:

                title = lis.xpath('.//h3//@title').extract_first()
                price = lis.xpath('.//p[@class="price_color"]//text()').extract_first()

                self.position += 1

                yield {
                    'Title': title,
                    'Price': price,
                    'Position': self.position,
                }

            # Follow the "next" link only if the page actually has one
            next_page = response.xpath('//li[@class="next"]/a/@href').extract_first()
            if next_page:
                yield scrapy.Request(response.urljoin(next_page))


    Then:



    scrapy runspider myspider.py -o out.json



    The out.json file contains:



    [
    {"Title": "A Light in the Attic", "Price": "\u00a351.77", "Position": 1},
    {"Title": "Tipping the Velvet", "Price": "\u00a353.74", "Position": 2},
    {"Title": "Soumission", "Price": "\u00a350.10", "Position": 3},
    {"Title": "Sharp Objects", "Price": "\u00a347.82", "Position": 4},
    {"Title": "Sapiens: A Brief History of Humankind", "Price": "\u00a354.23", "Position": 5},
    {"Title": "The Requiem Red", "Price": "\u00a322.65", "Position": 6},
    {"Title": "The Dirty Little Secrets of Getting Your Dream Job", "Price": "\u00a333.34", "Position": 7},
    {"Title": "The Coming Woman: A Novel Based on the Life of the Infamous Feminist, Victoria Woodhull", "Price": "\u00a317.93", "Position": 8},
    {"Title": "The Boys in the Boat: Nine Americans and Their Epic Quest for Gold at the 1936 Berlin Olympics", "Price": "\u00a322.60", "Position": 9},
    {"Title": "The Black Maria", "Price": "\u00a352.15", "Position": 10},
    {"Title": "Starving Hearts (Triangular Trade Trilogy, #1)", "Price": "\u00a313.99", "Position": 11},
    {"Title": "Shakespeare's Sonnets", "Price": "\u00a320.66", "Position": 12},
    {"Title": "Set Me Free", "Price": "\u00a317.46", "Position": 13},
    {"Title": "Scott Pilgrim's Precious Little Life (Scott Pilgrim #1)", "Price": "\u00a352.29", "Position": 14},
    {"Title": "Rip it Up and Start Again", "Price": "\u00a335.02", "Position": 15},
    {"Title": "Our Band Could Be Your Life: Scenes from the American Indie Underground, 1981-1991", "Price": "\u00a357.25", "Position": 16},
    {"Title": "Olio", "Price": "\u00a323.88", "Position": 17},
    {"Title": "Mesaerion: The Best Science Fiction Stories 1800-1849", "Price": "\u00a337.59", "Position": 18},
    {"Title": "Libertarianism for Beginners", "Price": "\u00a351.33", "Position": 19},
    {"Title": "It's Only the Himalayas", "Price": "\u00a345.17", "Position": 20}
    ]
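A quick way to confirm that the counter did not restart on a later page is to check that the exported positions run 1, 2, 3, ... without gaps. The helper below is only a sketch: `positions_are_consecutive` is a hypothetical name, and the sample is abbreviated from the output above. For a real run, the list would come from calling `json.load` on the out.json file.

```python
import json


def positions_are_consecutive(items):
    """Return True if the 'Position' values are exactly 1..len(items), in order."""
    positions = [item["Position"] for item in items]
    return positions == list(range(1, len(items) + 1))


# Abbreviated sample in the same shape as out.json.
sample = json.loads(
    '[{"Title": "A Light in the Attic", "Price": "\\u00a351.77", "Position": 1},'
    ' {"Title": "Tipping the Velvet", "Price": "\\u00a353.74", "Position": 2}]'
)

print(positions_are_consecutive(sample))
```

If the positions restarted at 1 on the second page, the check would return False.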














    edited Nov 16 '18 at 3:45

























    answered Nov 15 '18 at 18:59









    Guillaume












    • It works, but when I crawl the next page the positions repeat as 1-20. I need the positions to continue with 21, 22, 23 and so on up to the last one, 1000. How can I get 21-1000?
      – Mohammad Palash Babu
      Nov 16 '18 at 0:50

    • next=response.xpath('//*[@class="next"]//@href').extract_first() next=response.urljoin(next) if next: yield scrapy.Request(next)
      – Mohammad Palash Babu
      Nov 16 '18 at 0:53

    • I have edited my solution to track the position in a class variable.
      – Guillaume
      Nov 16 '18 at 3:46


















    Yash Pokar, can you please check this code?

    How can I apply your method in this Selenium + Scrapy code?

    # -*- coding: utf-8 -*-

    from time import sleep
    from urllib.parse import urljoin

    from scrapy import Spider
    from selenium import webdriver
    from scrapy.selector import Selector
    from scrapy.http import Request
    from selenium.common.exceptions import NoSuchElementException
    from selenium.webdriver.common.by import By


    class ToscrapeSpider(Spider):
        name = 'toscrape'
        allowed_domains = ['books.toscrape.com']
        # start_urls = ['http://books.toscrape.com/']

        def start_requests(self):
            self.driver = webdriver.Chrome()
            self.driver.get('http://books.toscrape.com/')
            sel = Selector(text=self.driver.page_source)
            lists = sel.css('li.col-xs-6')
            for i, lis in enumerate(lists):
                # Position restarts at 1 on every page (the problem being asked about)
                position = i + 1
                links = lis.xpath('.//h3//a//@href').extract_first()
                # hrefs are relative to the current page, so resolve them against it
                links = urljoin(self.driver.current_url, links)
                yield Request(links, meta={'position': position}, callback=self.parse_page)

            while True:
                try:
                    next_page = self.driver.find_element(By.XPATH, '//*[@class="next"]//a')
                    self.logger.info('Sleeping for 10 seconds.')
                    next_page.click()
                    sel = Selector(text=self.driver.page_source)
                    lists = sel.css('li.col-xs-6')
                    for i, lis in enumerate(lists):
                        position = i + 1
                        links = lis.xpath('.//h3//a//@href').extract_first()
                        links = urljoin(self.driver.current_url, links)
                        yield Request(links, meta={'position': position}, callback=self.parse_page)

                except NoSuchElementException:
                    self.logger.info('No more pages to load.')
                    self.driver.quit()
                    break

        def parse_page(self, response):
            title = response.xpath('//h1//text()').extract_first()
            positions = response.meta['position']

            yield {
                'Title': title,
                'Position': positions,
            }
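One way to carry the counter idea over to the Selenium loop is to keep a single running counter outside the per-page `enumerate`, so the numbering continues when the next page is loaded. The sketch below simulates this with plain lists instead of a browser; `number_links`, the example hrefs, and the page URLs are assumptions for illustration only. In the spider, each list of hrefs would be extracted from `self.driver.page_source`, and each page URL would be `self.driver.current_url`.

```python
from itertools import count
from urllib.parse import urljoin


def number_links(pages):
    """Yield (position, absolute_url) across all pages with one shared counter.

    `pages` is an iterable of (page_url, hrefs) pairs.
    """
    position = count(1)  # keeps counting across pages instead of restarting
    for page_url, hrefs in pages:
        for href in hrefs:
            yield next(position), urljoin(page_url, href)


# Hypothetical hrefs standing in for two listing pages.
pages = [
    ("http://books.toscrape.com/",
     ["catalogue/book-a_1/index.html", "catalogue/book-b_2/index.html"]),
    ("http://books.toscrape.com/catalogue/page-2.html",
     ["book-c_3/index.html"]),
]

for position, url in number_links(pages):
    print(position, url)
```

In `start_requests`, the value taken from such a counter would go into `meta={'position': position}` in place of `i + 1`, so it travels with each request regardless of the order in which the detail pages come back.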
















    answered Nov 16 '18 at 10:13

    Mohammad Palash Babu





























