How to scrape item position number in scrapy

How can I scrape the position number of each item on this site?

Website:
http://books.toscrape.com/

Please see this screenshot:

https://prnt.sc/lim3zl

# -*- coding: utf-8 -*-
import scrapy


class ToscrapeSpider(scrapy.Spider):
    name = 'toscrape'
    allowed_domains = ['books.toscrape.com']
    start_urls = ['http://books.toscrape.com/']

    def parse(self, response):
        lists = response.css('li.col-xs-6')
        for lis in lists:
            title = lis.xpath('.//h3//@title').extract_first()
            price = lis.xpath('.//*[@class="price_color"]//text()').extract_first()
            # I need to know how to scrape their position
            position = ''

            yield {
                'Title': title,
                'Price': price,
                'Position': position,
            }
        # next = response.xpath('//*[@class="next"]//@href').extract_first()
        # next = response.urljoin(next)
        # if next:
        #     yield scrapy.Request(next)

asked Nov 15 '18 at 15:43

Mohammad Palash Babu

134

What problem are you encountering? Are you getting any errors?
– Tsahi Asher
Nov 15 '18 at 15:49

Actually, there are no errors. I just need to know how I can scrape each item's position number.
– Mohammad Palash Babu
Nov 15 '18 at 16:30

The output should look like: Title: A Light in the, Price: £51.77, Position: 1; Title: Tipping the Velvet, Price: £53.74, Position: 2
– Mohammad Palash Babu
Nov 15 '18 at 16:32


scrapy position screen-scraping

4 Answers

Try using enumerate in the loop; that will solve the problem. As far as I remember, something like this:

for i, lis in enumerate(lists):
    position = i + 1
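
As a minimal sketch of the same idea, `enumerate` also accepts a `start` argument, so the `+ 1` can be dropped. The list of strings below is a hypothetical stand-in for the selector list returned by `response.css(...)`. Each element produced by `enumerate` is an `(index, item)` tuple, so it has to be unpacked into two loop variables before calling `.xpath()` on the item.

```python
# Stand-in for the selectors returned by response.css('li.col-xs-6');
# plain strings are used here so the sketch runs without Scrapy.
lists = ['A Light in the Attic', 'Tipping the Velvet', 'Soumission']

# enumerate yields (index, item) tuples; start=1 makes positions 1-based.
first_pair = next(iter(enumerate(lists, start=1)))

items = []
for position, lis in enumerate(lists, start=1):
    # Unpacking gives the position and the item separately.
    items.append({'Title': lis, 'Position': position})

print(first_pair)
print(items)
```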

answered Nov 15 '18 at 17:08

vezunchik

41937

Not solved: AttributeError: 'tuple' object has no attribute 'xpath'
– Mohammad Palash Babu
Nov 15 '18 at 17:55

Can you show your code? I don't understand where you got this error.
– vezunchik
Nov 15 '18 at 20:40

It's working now, but when I crawl the next page the positions come out as 1-20 again. I need positions 21, 22, 23 and so on, up to the last one, 1000.
– Mohammad Palash Babu
Nov 16 '18 at 0:58

Store the last position in meta when making the next request, then add it to the position in the loop. Or use the page number to calculate the position.
– vezunchik
Nov 16 '18 at 9:51
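
The second suggestion (calculating the position from the page number) could look roughly like the sketch below. It rests on two assumptions that are not confirmed in this thread: that listing URLs contain `page-N.html`, and that every full page holds 20 items (the thread only mentions positions 1-20 repeating). `page_number` and `absolute_position` are hypothetical helper names.

```python
import re

ITEMS_PER_PAGE = 20  # assumption: every full listing page shows 20 items


def page_number(url):
    """Return the page number encoded in a listing URL, defaulting to 1."""
    match = re.search(r'page-(\d+)\.html', url)  # assumed URL pattern
    return int(match.group(1)) if match else 1


def absolute_position(url, index_on_page):
    """Combine the page number with a 0-based index on that page."""
    return (page_number(url) - 1) * ITEMS_PER_PAGE + index_on_page + 1


print(absolute_position('http://books.toscrape.com/', 0))
print(absolute_position('http://books.toscrape.com/catalogue/page-2.html', 0))
print(absolute_position('http://books.toscrape.com/catalogue/page-50.html', 19))
```

Inside a spider this would be called as `absolute_position(response.url, i)` within the `enumerate` loop. Unlike a running counter, it does not depend on the order in which pages are downloaded.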


import scrapy


class ToscrapeSpider(scrapy.Spider):
    name = 'toscrape'
    allowed_domains = ['books.toscrape.com']
    start_urls = ['http://books.toscrape.com/']

    def parse(self, response):
        # Number of products seen on previous pages (0 on the first page).
        products_count = response.meta.get('products_count', 0)

        products = response.xpath('//article[@class="product_pod"]')

        for idx, product in enumerate(products):
            _image_container = product.xpath('.//div[@class="image_container"]')

            detail_page_url = _image_container.xpath('.//a/@href').extract_first()
            image = _image_container.xpath('.//img/@src').extract_first()

            name = product.xpath('.//h3/a/@title').extract_first()

            ratings = product.xpath('.//p[contains(@class, "star-rating")]/@class').extract_first()
            ratings = ratings.replace('star-rating', '').strip() if ratings else ratings

            price = product.xpath('.//p[@class="price_color"]/text()').extract_first()
            availability = product.xpath('.//p[@class="instock availability"]//text()').extract()
            # Strip newlines, tabs and surrounding whitespace, then drop empty strings.
            availability = list(map(lambda x: x.replace('\n', '').replace('\t', '').strip(), availability))
            availability = list(filter(lambda x: x, availability))

            availability = availability[0] if availability else availability

            yield dict(
                position=products_count + idx + 1,
                name=name,
                availability=availability,
                price=price,
                ratings=ratings,
                image=image,
                pdp_url=detail_page_url,
            )

        next_page = response.xpath('//li[@class="next"]/a/@href').extract_first()

        if next_page:
            # Pass the running total to the next page through meta.
            yield response.follow(next_page, meta=dict(products_count=products_count + len(products)))
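
The key step in the spider above is carrying the running total from one page to the next. As a rough offline sketch of that bookkeeping (no Scrapy involved; `pages` and `number_items` are hypothetical names, and each inner list stands in for the products found on one page):

```python
def number_items(pages):
    """Yield (position, title) pairs with positions continuing across pages."""
    products_count = 0  # plays the role of response.meta['products_count']
    for products in pages:
        for idx, title in enumerate(products):
            yield products_count + idx + 1, title
        # Same update as the meta value handed to the next request.
        products_count += len(products)


pages = [['Book A', 'Book B', 'Book C'], ['Book D', 'Book E']]
numbered = list(number_items(pages))
print(numbered)
```

Because the offset travels with each request rather than living on the spider, the numbering stays tied to the chain of pages that produced it.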

answered Nov 16 '18 at 3:16

Yash Pokar

346311

Can you check this, please: stackoverflow.com/a/53335666/10636764
– Mohammad Palash Babu
Nov 16 '18 at 10:14

@MohammadPalashBabu Are you sure you want to use Selenium? Please explain why you want to use it.
– Yash Pokar
Nov 16 '18 at 12:47

@MohammadPalashBabu You don't need Selenium in this project. It will slow down your crawler 20 times, or maybe more.
– Yash Pokar
Nov 16 '18 at 12:49

I know Selenium isn't needed here, but I need a solution for Selenium because I have a similar project.
– Mohammad Palash Babu
Nov 16 '18 at 15:11

If you could solve that, it would be very helpful for me.
– Mohammad Palash Babu
Nov 16 '18 at 15:12


You can simply use a class variable to track the position, like this:

import scrapy


class ToscrapeSpider(scrapy.Spider):
    name = 'toscrape'
    allowed_domains = ['books.toscrape.com']
    start_urls = ['http://books.toscrape.com/']

    # Running counter shared by every call to parse().
    position = 0

    def parse(self, response):
        lists = response.css('li.col-xs-6')

        for lis in lists:
            title = lis.xpath('.//h3//@title').extract_first()
            price = lis.xpath('.//p[@class="price_color"]//text()').extract_first()

            self.position += 1

            yield {
                'Title': title,
                'Price': price,
                'Position': self.position,
            }

        next_page = response.xpath('//li[@class="next"]/a/@href').extract_first()
        # Check before joining: urljoin(None) returns the current URL, which is always truthy.
        if next_page:
            yield scrapy.Request(response.urljoin(next_page))

Then:

scrapy runspider myspider.py -o out.json

The out.json file contains:

[
{"Title": "A Light in the Attic", "Price": "\u00a351.77", "Position": 1},
{"Title": "Tipping the Velvet", "Price": "\u00a353.74", "Position": 2},
{"Title": "Soumission", "Price": "\u00a350.10", "Position": 3},
{"Title": "Sharp Objects", "Price": "\u00a347.82", "Position": 4},
{"Title": "Sapiens: A Brief History of Humankind", "Price": "\u00a354.23", "Position": 5},
{"Title": "The Requiem Red", "Price": "\u00a322.65", "Position": 6},
{"Title": "The Dirty Little Secrets of Getting Your Dream Job", "Price": "\u00a333.34", "Position": 7},
{"Title": "The Coming Woman: A Novel Based on the Life of the Infamous Feminist, Victoria Woodhull", "Price": "\u00a317.93", "Position": 8},
{"Title": "The Boys in the Boat: Nine Americans and Their Epic Quest for Gold at the 1936 Berlin Olympics", "Price": "\u00a322.60", "Position": 9},
{"Title": "The Black Maria", "Price": "\u00a352.15", "Position": 10},
{"Title": "Starving Hearts (Triangular Trade Trilogy, #1)", "Price": "\u00a313.99", "Position": 11},
{"Title": "Shakespeare's Sonnets", "Price": "\u00a320.66", "Position": 12},
{"Title": "Set Me Free", "Price": "\u00a317.46", "Position": 13},
{"Title": "Scott Pilgrim's Precious Little Life (Scott Pilgrim #1)", "Price": "\u00a352.29", "Position": 14},
{"Title": "Rip it Up and Start Again", "Price": "\u00a335.02", "Position": 15},
{"Title": "Our Band Could Be Your Life: Scenes from the American Indie Underground, 1981-1991", "Price": "\u00a357.25", "Position": 16},
{"Title": "Olio", "Price": "\u00a323.88", "Position": 17},
{"Title": "Mesaerion: The Best Science Fiction Stories 1800-1849", "Price": "\u00a337.59", "Position": 18},
{"Title": "Libertarianism for Beginners", "Price": "\u00a351.33", "Position": 19},
{"Title": "It's Only the Himalayas", "Price": "\u00a345.17", "Position": 20}
]
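
The `\u00a3` in the prices is just the JSON escape for the pound sign, not a scraping error. A small sketch with the standard `json` module shows the two ways the same item can be written (the item dict is copied from the output above):

```python
import json

item = {'Title': 'A Light in the Attic', 'Price': '\u00a351.77', 'Position': 1}

# Default: non-ASCII characters are written as \uXXXX escapes.
escaped = json.dumps(item)

# ensure_ascii=False keeps the pound sign as a literal character.
readable = json.dumps(item, ensure_ascii=False)

print(escaped)
print(readable)
```

Both strings decode to the same data. In Scrapy, setting `FEED_EXPORT_ENCODING = 'utf-8'` in the project settings should produce the readable form in the exported file.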

edited Nov 16 '18 at 3:45

answered Nov 15 '18 at 18:59

Guillaume

1,1281724

It's working, but when I crawl the next page the positions come out as 1-20 again. I need positions 21, 22, 23 and so on, up to 1000. How can I get 21-1000?
– Mohammad Palash Babu
Nov 16 '18 at 0:50

next=response.xpath('//*[@class="next"]//@href').extract_first() next=response.urljoin(next) if next: yield scrapy.Request(next)
– Mohammad Palash Babu
Nov 16 '18 at 0:53

I have edited my solution to track the position in a class variable
– Guillaume
Nov 16 '18 at 3:46


@Yash Pokar, can you check this code, please?

How can I apply your method in this Selenium + Scrapy code?

# -*- coding: utf-8 -*-
from time import sleep
from scrapy import Spider
from selenium import webdriver
from scrapy.selector import Selector
from scrapy.http import Request
from selenium.common.exceptions import NoSuchElementException


class ToscrapeSpider(Spider):
    name = 'toscrape'
    allowed_domains = ['books.toscrape.com']
    # start_urls = ['http://books.toscrape.com/']

    def start_requests(self):
        self.driver = webdriver.Chrome()
        self.driver.get('http://books.toscrape.com/')
        sel = Selector(text=self.driver.page_source)
        lists = sel.css('li.col-xs-6')
        for i, lis in enumerate(lists):
            position = i + 1
            links = lis.xpath('.//h3//a//@href').extract_first()
            links = "http://books.toscrape.com/catalogue/" + links
            yield Request(links, meta={'position': position}, callback=self.parse_page)

        while True:
            try:
                next_page = self.driver.find_element_by_xpath('//*[@class="next"]//a')
                self.logger.info('Sleeping for 10 seconds.')
                next_page.click()
                sel = Selector(text=self.driver.page_source)
                lists = sel.css('li.col-xs-6')
                for i, lis in enumerate(lists):
                    position = i + 1
                    links = lis.xpath('.//h3//a//@href').extract_first()
                    links = "http://books.toscrape.com/catalogue/" + links
                    yield Request(links, meta={'position': position}, callback=self.parse_page)

            except NoSuchElementException:
                self.logger.info('No more pages to load.')
                self.driver.quit()
                break

    def parse_page(self, response):
        title = response.xpath('//h1//text()').extract_first()
        positions = response.meta['position']

        yield {
            'Title': title,
            'Position': positions,
        }
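
One way the running position from the other answers might carry over to this Selenium loop is to keep a single counter alive across pages instead of restarting `enumerate` on each one. The sketch below is offline and only illustrative: `numbered_requests` is a hypothetical helper, the href lists stand in for what `lis.xpath('.//h3//a//@href')` would return on each page, and the page URLs are assumptions.

```python
from itertools import count
from urllib.parse import urljoin


def numbered_requests(page_url, hrefs, counter):
    """Yield (position, absolute_url) pairs for one page of links.

    `counter` is shared by all pages, so positions never restart.
    """
    for href in hrefs:
        yield next(counter), urljoin(page_url, href)


counter = count(1)  # one counter for the whole crawl

page1 = list(numbered_requests(
    'http://books.toscrape.com/catalogue/page-1.html',
    ['book-a_1/index.html', 'book-b_2/index.html'],
    counter,
))
page2 = list(numbered_requests(
    'http://books.toscrape.com/catalogue/page-2.html',
    ['book-c_3/index.html'],
    counter,
))

print(page1)
print(page2)
```

In the spider, each pair would become `Request(url, meta={'position': position}, callback=self.parse_page)`, with the helper called once for the first page and once after every click, passing `self.driver.current_url` as `page_url`. Using one helper for both cases would also remove the duplicated loop.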

answered Nov 16 '18 at 10:13

Mohammad Palash Babu

134

add a comment |

Your Answer

StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});

}
});

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53322999%2fhow-to-scrape-item-position-number-in-scrapy%23new-answer', 'question_page');
}
);

Post as a guest

Name

Required, but never shown

4 Answers
4

active

oldest

votes

4 Answers
4

active

oldest

votes

Try to use enumerate in cycle, this will solve the problem. As I remember, something like this:

for i, lis in enumerate(lists):

        position = i + 1

answered Nov 15 '18 at 17:08

vezunchik

41937

not solved AttributeError: 'tuple' object has no attribute 'xpath'
– Mohammad Palash Babu
Nov 15 '18 at 17:55

Can you show your code? I don't understand where you got this error.
– vezunchik
Nov 15 '18 at 20:40

It's working now but When i crawl the next page then output results came 1-20 repeatedly but i need to get the results position 21,22,23 to last 1000.
– Mohammad Palash Babu
Nov 16 '18 at 0:58

Store last position to meta while calling next request, and then add it to position in the cycle. Or use number of page to calculate position.
– vezunchik
Nov 16 '18 at 9:51

add a comment |

Try to use enumerate in cycle, this will solve the problem. As I remember, something like this:

for i, lis in enumerate(lists):

        position = i + 1

answered Nov 15 '18 at 17:08

vezunchik

41937

not solved AttributeError: 'tuple' object has no attribute 'xpath'
– Mohammad Palash Babu
Nov 15 '18 at 17:55

Can you show your code? I don't understand where you got this error.
– vezunchik
Nov 15 '18 at 20:40

It's working now but When i crawl the next page then output results came 1-20 repeatedly but i need to get the results position 21,22,23 to last 1000.
– Mohammad Palash Babu
Nov 16 '18 at 0:58

Store last position to meta while calling next request, and then add it to position in the cycle. Or use number of page to calculate position.
– vezunchik
Nov 16 '18 at 9:51

add a comment |

Try to use enumerate in cycle, this will solve the problem. As I remember, something like this:

for i, lis in enumerate(lists):

        position = i + 1

answered Nov 15 '18 at 17:08

vezunchik

41937

Try to use enumerate in cycle, this will solve the problem. As I remember, something like this:

for i, lis in enumerate(lists):

        position = i + 1

answered Nov 15 '18 at 17:08

vezunchik

41937

answered Nov 15 '18 at 17:08

vezunchik

41937

answered Nov 15 '18 at 17:08

vezunchik

41937

answered Nov 15 '18 at 17:08

vezunchik

41937

not solved AttributeError: 'tuple' object has no attribute 'xpath'
– Mohammad Palash Babu
Nov 15 '18 at 17:55

Can you show your code? I don't understand where you got this error.
– vezunchik
Nov 15 '18 at 20:40

It's working now but When i crawl the next page then output results came 1-20 repeatedly but i need to get the results position 21,22,23 to last 1000.
– Mohammad Palash Babu
Nov 16 '18 at 0:58

Store last position to meta while calling next request, and then add it to position in the cycle. Or use number of page to calculate position.
– vezunchik
Nov 16 '18 at 9:51

add a comment |

not solved AttributeError: 'tuple' object has no attribute 'xpath'
– Mohammad Palash Babu
Nov 15 '18 at 17:55

Can you show your code? I don't understand where you got this error.
– vezunchik
Nov 15 '18 at 20:40

It's working now but When i crawl the next page then output results came 1-20 repeatedly but i need to get the results position 21,22,23 to last 1000.
– Mohammad Palash Babu
Nov 16 '18 at 0:58

Store last position to meta while calling next request, and then add it to position in the cycle. Or use number of page to calculate position.
– vezunchik
Nov 16 '18 at 9:51

not solved AttributeError: 'tuple' object has no attribute 'xpath'
– Mohammad Palash Babu
Nov 15 '18 at 17:55

Can you show your code? I don't understand where you got this error.
– vezunchik
Nov 15 '18 at 20:40

It's working now but When i crawl the next page then output results came 1-20 repeatedly but i need to get the results position 21,22,23 to last 1000.
– Mohammad Palash Babu
Nov 16 '18 at 0:58

Store last position to meta while calling next request, and then add it to position in the cycle. Or use number of page to calculate position.
– vezunchik
Nov 16 '18 at 9:51

add a comment |

import scrapy





class ToscrapeSpider(scrapy.Spider):

    name = 'toscrape'

    allowed_domains = ['books.toscrape.com']

    start_urls = ['http://books.toscrape.com/']



    def parse(self, response):

        products_count = response.meta.get('products_count', 0)



        products = response.xpath('//article[@class="product_pod"]')



        for idx, product in enumerate(products):

            _image_container = product.xpath('.//div[@class="image_container"]')



            detail_page_url = _image_container.xpath('.//a/@href').extract_first()

            image = _image_container.xpath('.//img/@src').extract_first()



            name = product.xpath('.//h3/a/@title').extract_first()



            ratings = product.xpath('.//p[contains(@class, "star-rating")]/@class').extract_first()

            ratings = ratings.replace('star-rating', '').strip() if ratings else ratings



            price = product.xpath('.//p[@class="price_color"]/text()').extract_first()

            availability = product.xpath('.//p[@class="instock availability"]//text()').extract()

            availability = list(map(lambda x: x.replace('n', '').replace('t', '').strip(), availability))

            availability = list(filter(lambda x: x, availability))



            availability = availability[0] if availability else availability



            yield dict(

                position=products_count + idx + 1,

                name=name,

                availability=availability,

                price=price,

                ratings=ratings,

                image=image,

                pdp_url=detail_page_url,

            )



        next_page = response.xpath('//li[@class="next"]/a/@href').extract_first()



        if next_page:

            yield response.follow(next_page, meta=dict(products_count=products_count + len(products)))

answered Nov 16 '18 at 3:16

Yash Pokar

346311

can check it please stackoverflow.com/a/53335666/10636764
– Mohammad Palash Babu
Nov 16 '18 at 10:14

@MohammadPalashBabu are you sure you want to use selenium? answer the question why you want to use selenium.
– Yash Pokar
Nov 16 '18 at 12:47

@MohammadPalashBabu you don't need selenium in this project. It will slow down your crawler 20 times or may be more than that.
– Yash Pokar
Nov 16 '18 at 12:49

I know that don't needed selenium but i need a solution for selenium. cause i have a similar project
– Mohammad Palash Babu
Nov 16 '18 at 15:11

If you solve that's would be very helpful for me
– Mohammad Palash Babu
Nov 16 '18 at 15:12

add a comment |

import scrapy





class ToscrapeSpider(scrapy.Spider):

    name = 'toscrape'

    allowed_domains = ['books.toscrape.com']

    start_urls = ['http://books.toscrape.com/']



    def parse(self, response):

        products_count = response.meta.get('products_count', 0)



        products = response.xpath('//article[@class="product_pod"]')



        for idx, product in enumerate(products):

            _image_container = product.xpath('.//div[@class="image_container"]')



            detail_page_url = _image_container.xpath('.//a/@href').extract_first()

            image = _image_container.xpath('.//img/@src').extract_first()



            name = product.xpath('.//h3/a/@title').extract_first()



            ratings = product.xpath('.//p[contains(@class, "star-rating")]/@class').extract_first()

            ratings = ratings.replace('star-rating', '').strip() if ratings else ratings



            price = product.xpath('.//p[@class="price_color"]/text()').extract_first()

            availability = product.xpath('.//p[@class="instock availability"]//text()').extract()

            availability = list(map(lambda x: x.replace('n', '').replace('t', '').strip(), availability))

            availability = list(filter(lambda x: x, availability))



            availability = availability[0] if availability else availability



            yield dict(

                position=products_count + idx + 1,

                name=name,

                availability=availability,

                price=price,

                ratings=ratings,

                image=image,

                pdp_url=detail_page_url,

            )



        next_page = response.xpath('//li[@class="next"]/a/@href').extract_first()



        if next_page:

            yield response.follow(next_page, meta=dict(products_count=products_count + len(products)))

answered Nov 16 '18 at 3:16

Yash Pokar

346311

can check it please stackoverflow.com/a/53335666/10636764
– Mohammad Palash Babu
Nov 16 '18 at 10:14

@MohammadPalashBabu are you sure you want to use selenium? answer the question why you want to use selenium.
– Yash Pokar
Nov 16 '18 at 12:47

@MohammadPalashBabu you don't need selenium in this project. It will slow down your crawler 20 times or may be more than that.
– Yash Pokar
Nov 16 '18 at 12:49

I know that don't needed selenium but i need a solution for selenium. cause i have a similar project
– Mohammad Palash Babu
Nov 16 '18 at 15:11

If you solve that's would be very helpful for me
– Mohammad Palash Babu
Nov 16 '18 at 15:12

add a comment |

import scrapy





class ToscrapeSpider(scrapy.Spider):

    name = 'toscrape'

    allowed_domains = ['books.toscrape.com']

    start_urls = ['http://books.toscrape.com/']



    def parse(self, response):

        products_count = response.meta.get('products_count', 0)



        products = response.xpath('//article[@class="product_pod"]')



        for idx, product in enumerate(products):

            _image_container = product.xpath('.//div[@class="image_container"]')



            detail_page_url = _image_container.xpath('.//a/@href').extract_first()

            image = _image_container.xpath('.//img/@src').extract_first()



            name = product.xpath('.//h3/a/@title').extract_first()



            ratings = product.xpath('.//p[contains(@class, "star-rating")]/@class').extract_first()

            ratings = ratings.replace('star-rating', '').strip() if ratings else ratings



            price = product.xpath('.//p[@class="price_color"]/text()').extract_first()

            availability = product.xpath('.//p[@class="instock availability"]//text()').extract()

            availability = list(map(lambda x: x.replace('n', '').replace('t', '').strip(), availability))

            availability = list(filter(lambda x: x, availability))



            availability = availability[0] if availability else availability



            yield dict(

                position=products_count + idx + 1,

                name=name,

                availability=availability,

                price=price,

                ratings=ratings,

                image=image,

                pdp_url=detail_page_url,

            )



        next_page = response.xpath('//li[@class="next"]/a/@href').extract_first()



        if next_page:

            yield response.follow(next_page, meta=dict(products_count=products_count + len(products)))

answered Nov 16 '18 at 3:16

Yash Pokar

346311

import scrapy





class ToscrapeSpider(scrapy.Spider):

    name = 'toscrape'

    allowed_domains = ['books.toscrape.com']

    start_urls = ['http://books.toscrape.com/']



    def parse(self, response):

        products_count = response.meta.get('products_count', 0)



        products = response.xpath('//article[@class="product_pod"]')



        for idx, product in enumerate(products):

            _image_container = product.xpath('.//div[@class="image_container"]')



            detail_page_url = _image_container.xpath('.//a/@href').extract_first()

            image = _image_container.xpath('.//img/@src').extract_first()



            name = product.xpath('.//h3/a/@title').extract_first()



            ratings = product.xpath('.//p[contains(@class, "star-rating")]/@class').extract_first()

            ratings = ratings.replace('star-rating', '').strip() if ratings else ratings



            price = product.xpath('.//p[@class="price_color"]/text()').extract_first()

            availability = product.xpath('.//p[@class="instock availability"]//text()').extract()

            availability = list(map(lambda x: x.replace('n', '').replace('t', '').strip(), availability))

            availability = list(filter(lambda x: x, availability))



            availability = availability[0] if availability else availability



            yield dict(

                position=products_count + idx + 1,

                name=name,

                availability=availability,

                price=price,

                ratings=ratings,

                image=image,

                pdp_url=detail_page_url,

            )



        next_page = response.xpath('//li[@class="next"]/a/@href').extract_first()



        if next_page:

            yield response.follow(next_page, meta=dict(products_count=products_count + len(products)))

answered Nov 16 '18 at 3:16

Yash Pokar

346311

answered Nov 16 '18 at 3:16

Yash Pokar

346311

answered Nov 16 '18 at 3:16

Yash Pokar

346311

answered Nov 16 '18 at 3:16

Yash Pokar

346311

can check it please stackoverflow.com/a/53335666/10636764
– Mohammad Palash Babu
Nov 16 '18 at 10:14

@MohammadPalashBabu are you sure you want to use selenium? answer the question why you want to use selenium.
– Yash Pokar
Nov 16 '18 at 12:47

@MohammadPalashBabu you don't need selenium in this project. It will slow down your crawler 20 times or may be more than that.
– Yash Pokar
Nov 16 '18 at 12:49

I know that don't needed selenium but i need a solution for selenium. cause i have a similar project
– Mohammad Palash Babu
Nov 16 '18 at 15:11

If you solve that's would be very helpful for me
– Mohammad Palash Babu
Nov 16 '18 at 15:12

add a comment |


You can simply use a class variable to track the position, like this:

import scrapy


class ToscrapeSpider(scrapy.Spider):
    name = 'toscrape'
    allowed_domains = ['books.toscrape.com']
    start_urls = ['http://books.toscrape.com/']

    # Running counter; it keeps increasing across all pages of the crawl
    position = 0

    def parse(self, response):
        lists = response.css('li.col-xs-6')

        for lis in lists:
            title = lis.xpath('.//h3//@title').extract_first()
            price = lis.xpath('.//p[@class="price_color"]//text()').extract_first()

            self.position += 1

            yield {
                'Title': title,
                'Price': price,
                'Position': self.position,
            }

        # Check for the link before joining: urljoin() on a missing link
        # returns the current page URL, which would always be truthy.
        next_page = response.xpath('//li[@class="next"]/a/@href').extract_first()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page))

Then:

scrapy runspider myspider.py -o out.json

The out.json file contains:

[
{"Title": "A Light in the Attic", "Price": "\u00a351.77", "Position": 1},
{"Title": "Tipping the Velvet", "Price": "\u00a353.74", "Position": 2},
{"Title": "Soumission", "Price": "\u00a350.10", "Position": 3},
{"Title": "Sharp Objects", "Price": "\u00a347.82", "Position": 4},
{"Title": "Sapiens: A Brief History of Humankind", "Price": "\u00a354.23", "Position": 5},
{"Title": "The Requiem Red", "Price": "\u00a322.65", "Position": 6},
{"Title": "The Dirty Little Secrets of Getting Your Dream Job", "Price": "\u00a333.34", "Position": 7},
{"Title": "The Coming Woman: A Novel Based on the Life of the Infamous Feminist, Victoria Woodhull", "Price": "\u00a317.93", "Position": 8},
{"Title": "The Boys in the Boat: Nine Americans and Their Epic Quest for Gold at the 1936 Berlin Olympics", "Price": "\u00a322.60", "Position": 9},
{"Title": "The Black Maria", "Price": "\u00a352.15", "Position": 10},
{"Title": "Starving Hearts (Triangular Trade Trilogy, #1)", "Price": "\u00a313.99", "Position": 11},
{"Title": "Shakespeare's Sonnets", "Price": "\u00a320.66", "Position": 12},
{"Title": "Set Me Free", "Price": "\u00a317.46", "Position": 13},
{"Title": "Scott Pilgrim's Precious Little Life (Scott Pilgrim #1)", "Price": "\u00a352.29", "Position": 14},
{"Title": "Rip it Up and Start Again", "Price": "\u00a335.02", "Position": 15},
{"Title": "Our Band Could Be Your Life: Scenes from the American Indie Underground, 1981-1991", "Price": "\u00a357.25", "Position": 16},
{"Title": "Olio", "Price": "\u00a323.88", "Position": 17},
{"Title": "Mesaerion: The Best Science Fiction Stories 1800-1849", "Price": "\u00a337.59", "Position": 18},
{"Title": "Libertarianism for Beginners", "Price": "\u00a351.33", "Position": 19},
{"Title": "It's Only the Himalayas", "Price": "\u00a345.17", "Position": 20}
]

edited Nov 16 '18 at 3:45

answered Nov 15 '18 at 18:59

Guillaume

1,1281724
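
To confirm that numbering continues across pages instead of restarting at 1, the exported positions can be checked for gaps or repeats. Below is a small sketch, assuming the feed is a list of objects with a `Position` key as in the output above. `positions_are_consecutive` is a hypothetical helper, not part of Scrapy.

```python
import json


def positions_are_consecutive(items):
    """Return True if the items are numbered 1, 2, ..., len(items) in order."""
    positions = [item['Position'] for item in items]
    return positions == list(range(1, len(items) + 1))


# Inline sample standing in for the contents of out.json.
sample = json.loads(
    '[{"Title": "A Light in the Attic", "Position": 1},'
    ' {"Title": "Tipping the Velvet", "Position": 2}]'
)
# The same sample followed by an item whose numbering restarted at 1.
restarted = sample + [{"Title": "Soumission", "Position": 1}]

print(positions_are_consecutive(sample))
print(positions_are_consecutive(restarted))
```

For a real crawl, the list loaded from `out.json` with `json.load` would be passed in place of the inline sample.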

It works, but when I crawl the next pages the positions come out as 1-20 repeatedly. I need the positions to continue with 21, 22, 23 and so on up to 1000. How can I get 21-1000?
– Mohammad Palash Babu
Nov 16 '18 at 0:50

next=response.xpath('//*[@class="next"]//@href').extract_first() next=response.urljoin(next) if next: yield scrapy.Request(next)
– Mohammad Palash Babu
Nov 16 '18 at 0:53

I have edited my solution to track the position in a class variable.
– Guillaume
Nov 16 '18 at 3:46


@Yash Pokar, can you please check this code?

How can I apply your method to this Selenium + Scrapy code?

# -*- coding: utf-8 -*-
from time import sleep

from scrapy import Spider
from selenium import webdriver
from scrapy.selector import Selector
from scrapy.http import Request
from selenium.common.exceptions import NoSuchElementException


class ToscrapeSpider(Spider):
    name = 'toscrape'
    allowed_domains = ['books.toscrape.com']
    # start_urls = ['http://books.toscrape.com/']

    def start_requests(self):
        self.driver = webdriver.Chrome()
        self.driver.get('http://books.toscrape.com/')
        sel = Selector(text=self.driver.page_source)
        lists = sel.css('li.col-xs-6')
        for i, lis in enumerate(lists):
            position = i + 1
            links = lis.xpath('.//h3//a//@href').extract_first()
            links = "http://books.toscrape.com/catalogue/" + links
            yield Request(links, meta={'position': position}, callback=self.parse_page)

        while True:
            try:
                next_page = self.driver.find_element_by_xpath('//*[@class="next"]//a')
                self.logger.info('Sleeping for 10 seconds.')
                next_page.click()
                sel = Selector(text=self.driver.page_source)
                lists = sel.css('li.col-xs-6')
                for i, lis in enumerate(lists):
                    position = i + 1
                    links = lis.xpath('.//h3//a//@href').extract_first()
                    links = "http://books.toscrape.com/catalogue/" + links
                    yield Request(links, meta={'position': position}, callback=self.parse_page)

            except NoSuchElementException:
                self.logger.info('No more pages to load.')
                self.driver.quit()
                break

    def parse_page(self, response):
        title = response.xpath('//h1//text()').extract_first()
        positions = response.meta['position']

        yield {
            'Title': title,
            'Position': positions,
        }

answered Nov 16 '18 at 10:13

Mohammad Palash Babu

134
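
In the code above, `enumerate(lists)` restarts on every page, so the positions repeat 1-20. One way to keep a single running number is to share one counter across all pages. Below is a minimal sketch in plain Python, without Selenium or Scrapy. `numbered_links`, the page URLs, and the hrefs are made-up stand-ins, and `urljoin` replaces the hard-coded prefix on the assumption that each href is relative to the page it appears on.

```python
from itertools import count
from urllib.parse import urljoin


def numbered_links(pages):
    """Yield (position, absolute_url) for every link on every page.

    `pages` is an iterable of (page_url, hrefs) pairs in crawl order.
    A single counter is shared by all pages, so numbering never restarts.
    """
    position = count(1)
    for page_url, hrefs in pages:
        for href in hrefs:
            yield next(position), urljoin(page_url, href)


# Hypothetical data standing in for what the browser would return per page.
pages = [
    ('http://books.toscrape.com/', ['catalogue/book-a_1/index.html',
                                    'catalogue/book-b_2/index.html']),
    ('http://books.toscrape.com/catalogue/page-2.html', ['book-c_3/index.html']),
]

results = list(numbered_links(pages))
for position, url in results:
    print(position, url)
```

In the spider, each position would still be handed to its `Request` through `meta`, as the code above already does.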

