Scrapy Unit Testing

itboxs 2021. 1. 7. 07:47

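I'd like to implement some unit tests in a Scrapy project (screen scraper / web crawler). Since a project is run through the "scrapy crawl" command, I can run it through something like nose. Since Scrapy is built on top of Twisted, can I use its unit testing framework, Trial? If so, how? Otherwise I'd like to get nose working.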

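Update:

I've been talking on Scrapy-Users, and I gather I'm supposed to "build the Response in the test code, and then call the method with the response and assert that [I] get the expected items/requests in the output". I can't seem to get this to work, though.

I can build a unittest test class and, in a test:

create a response object
try to call the parse method of my spider with the response object

However, it ends up generating this traceback. Any insight as to why?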

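The way I've done it is to create fake responses. This lets you test the parse function offline, while still exercising it against real HTML.

A problem with this approach is that your local HTML file may not reflect the latest state online. If the HTML changes online you may have a big bug, yet your test cases will still pass. So this may not be the best way to test.

My current workflow is: whenever there is an error, an email is sent to the admin with the URL. Then, for that specific error, I create an HTML file with the content that causes the error. Then I create a unit test for it.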

This is the code I use to create sample Scrapy HTTP responses for testing from a local HTML file:

# scrapyproject/tests/responses/__init__.py

import os

from scrapy.http import TextResponse, Request

def fake_response_from_file(file_name, url=None):
    """
    Create a Scrapy fake HTTP response from a HTML file
    @param file_name: The relative filename from the responses directory,
                      but absolute paths are also accepted.
    @param url: The URL of the response.
    returns: A scrapy HTTP response which can be used for unittesting.
    """
    if not url:
        url = 'http://www.example.com'

    request = Request(url=url)
    if not os.path.isabs(file_name):
        # Resolve relative names against the directory holding this module.
        responses_dir = os.path.dirname(os.path.realpath(__file__))
        file_path = os.path.join(responses_dir, file_name)
    else:
        file_path = file_name

    # Close the file promptly instead of leaking the handle.
    with open(file_path, 'r', encoding='utf-8') as f:
        file_content = f.read()

    # TextResponse instead of the base Response that this helper originally
    # used: the base class fails with "Response content isn't text" as soon
    # as a spider reads the text or runs a selector on it.
    response = TextResponse(url=url,
        request=request,
        body=file_content,
        encoding='utf-8')
    return response

The sample HTML file is located at scrapyproject/tests/responses/osdir/sample.html.

The test case could then look as follows. Its location is scrapyproject/tests/test_osdir.py.

import unittest
from scrapyproject.spiders import osdir_spider
from responses import fake_response_from_file

class OsdirSpiderTest(unittest.TestCase):

    def setUp(self):
        self.spider = osdir_spider.DirectorySpider()

    def _test_item_results(self, results, expected_length):
        count = 0
        for item in results:
            # Count every item so the length check below is meaningful.
            count += 1
            self.assertIsNotNone(item['content'])
            self.assertIsNotNone(item['title'])
        self.assertEqual(count, expected_length)

    def test_parse(self):
        results = self.spider.parse(fake_response_from_file('osdir/sample.html'))
        self._test_item_results(results, 10)

That's basically how I test my parsing methods, but the approach is not limited to parsing methods. If it gets more complex, I suggest looking at Mox.

The newly added Spider Contracts are worth trying. They give you a simple way to add tests without requiring a lot of code.

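I use Betamax to run the tests against the real site the first time and keep the HTTP responses locally, so that subsequent test runs are very fast.

Betamax intercepts every request you make and attempts to find a matching request that has already been intercepted and recorded.

When you need the latest version of the site, just remove what Betamax has recorded and re-run the tests.

Example: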

from scrapy import Spider, Request
from scrapy.http import HtmlResponse


class Example(Spider):
    name = 'example'

    url = 'http://doc.scrapy.org/en/latest/_static/selectors-sample1.html'

    def start_requests(self):
        yield Request(self.url, self.parse)

    def parse(self, response):
        for href in response.xpath('//a/@href').extract():
            yield {'image_href': href}


# Test part
from betamax import Betamax
from betamax.fixtures.unittest import BetamaxTestCase


with Betamax.configure() as config:
    # where betamax will store cassettes (http responses):
    config.cassette_library_dir = 'cassettes'
    config.preserve_exact_body_bytes = True


class TestExample(BetamaxTestCase):  # superclass provides self.session

    def test_parse(self):
        example = Example()

        # http response is recorded in a betamax cassette:
        response = self.session.get(example.url)

        # forge a scrapy response to test
        scrapy_response = HtmlResponse(body=response.content, url=example.url)

        result = example.parse(scrapy_response)

        # next(result) works on Python 2.6+ and Python 3; result.next() is Python 2 only
        self.assertEqual({'image_href': 'image1.html'}, next(result))
        self.assertEqual({'image_href': 'image2.html'}, next(result))
        self.assertEqual({'image_href': 'image3.html'}, next(result))
        self.assertEqual({'image_href': 'image4.html'}, next(result))
        self.assertEqual({'image_href': 'image5.html'}, next(result))

        with self.assertRaises(StopIteration):
            next(result)

FYI, I discovered Betamax at PyCon 2015 thanks to Ian Cordasco's talk.

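This is a very late answer, but I've been annoyed with Scrapy testing, so I wrote scrapy-test, a framework for testing Scrapy crawlers against defined specifications.

It works by defining test specifications rather than static output. For example, if we are crawling this sort of item: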

{
    "name": "Alex",
    "age": 21,
    "gender": "Female",
}

We can define a scrapy-test ItemSpec:

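# Type was used below without an import; it is assumed here to live in
# scrapytest.tests alongside the other checks.
from scrapytest.tests import Match, MoreThan, LessThan, Type
from scrapytest.spec import ItemSpec

class MySpec(ItemSpec):
    # '.{3,}' rather than '{3,}', which is not a valid regular expression
    name_test = Match('.{3,}')  # name should be at least 3 characters long
    age_test = Type(int), MoreThan(18), LessThan(99)
    gender_test = Match('Female|Male')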

The same idea applies to Scrapy stats, using StatsSpec:

from scrapytest.spec import StatsSpec
from scrapytest.tests import MoreThan

class MyStatsSpec(StatsSpec):
    validate = {
        "item_scraped_count": MoreThan(0),
    }

Afterwards it can be run against live or cached results:

$ scrapy-test 
# or
$ scrapy-test --cache

I've been using cached runs for development changes and daily cron jobs for detecting website changes.

I'm using Scrapy 1.3.0, and fake_response_from_file raises an error with this line:

response = Response(url=url, request=request, body=file_content)

I get:

raise AttributeError("Response content isn't text")

The solution is to use TextResponse instead, which works fine. For example:

response = TextResponse(url=url, request=request, body=file_content)

Thanks a lot.

Slightly simpler, dropping the fake_response_from_file helper from the chosen answer:

import unittest
from spiders.my_spider import MySpider
from scrapy.selector import Selector


class TestParsers(unittest.TestCase):

    def setUp(self):
        self.spider = MySpider(limit=1)
        # Read the saved page and close the file handle straight away.
        with open("some.htm", 'r') as f:
            self.html = Selector(text=f.read())

    def test_some_parse(self):
        expected = "some-text"
        result = self.spider.some_parse(self.html)
        self.assertEqual(result, expected)


if __name__ == '__main__':
    unittest.main()

You can follow this snippet from the scrapy site to run it from a script. Then you can make any kind of asserts you'd like on the returned items.

I'm using Twisted's trial to run tests, similar to Scrapy's own tests. It already starts a reactor, so I make use of the CrawlerRunner without worrying about starting and stopping one in the tests.

Stealing some ideas from the check and parse Scrapy commands I ended up with the following base TestCase class to run assertions against live sites:

from twisted.internet import defer  # used by the inlineCallbacks tests below
from twisted.trial import unittest

from scrapy.crawler import CrawlerRunner
from scrapy.http import Request
from scrapy.item import Item
from scrapy.utils.spider import iterate_spider_output

class SpiderTestCase(unittest.TestCase):
    def setUp(self):
        self.runner = CrawlerRunner()

    def make_test_class(self, cls, url):
        """
        Make a class that proxies to the original class,
        sets up a URL to be called, and gathers the items
        and requests returned by the parse function.
        """
        class TestSpider(cls):
            # This is a once used class, so writing into
            # the class variables is fine. The framework
            # will instantiate it, not us.
            items = []
            requests = []

            def start_requests(self):
                # Build the request directly: make_requests_from_url, used
                # originally, is deprecated and absent from recent Scrapy.
                req = Request(url, dont_filter=True)
                req.meta["_callback"] = req.callback or self.parse
                req.callback = self.collect_output
                yield req

            def collect_output(self, response):
                try:
                    cb = response.request.meta["_callback"]
                    for x in iterate_spider_output(cb(response)):
                        # Item replaces the deprecated BaseItem.
                        if isinstance(x, (Item, dict)):
                            self.items.append(x)
                        elif isinstance(x, Request):
                            self.requests.append(x)
                except Exception as ex:
                    print("ERROR", "Could not execute callback: ", ex)
                    raise

                # Returning any requests here would make the crawler follow them.
                return None

        return TestSpider

Example:

@defer.inlineCallbacks
def test_foo(self):
    tester = self.make_test_class(FooSpider, 'https://foo.com')
    yield self.runner.crawl(tester)
    self.assertEqual(len(tester.items), 1)
    self.assertEqual(len(tester.requests), 2)

Or perform one request in setUp and run multiple tests against the results:

@defer.inlineCallbacks
def setUp(self):
    super(FooTestCase, self).setUp()
    if FooTestCase.tester is None:
        FooTestCase.tester = self.make_test_class(FooSpider, 'https://foo.com')
        yield self.runner.crawl(self.tester)

def test_foo(self):
    self.assertEqual(len(self.tester.items), 1)

Reference URL: https://stackoverflow.com/questions/6456304/scrapy-unit-testing
