Scrapinghub in GSoC 2019
Scrapinghub is applying!
Scrapinghub is a company focused on information retrieval and its later manipulation, deeply involved in developing and contributing to Open Source projects for web crawling and data processing.
This year, we are applying with our flagship project, Scrapy, our headless-browsing framework Splash, our machine-learning debugging framework ELI5, and our new crawler quality-assurance library, Spidermon. You can learn more about these projects in their respective repositories on GitHub: scrapy/scrapy, scrapinghub/splash, TeamHG-Memex/eli5, scrapinghub/spidermon.
ELI5 is a Python package that helps debug machine learning classifiers and explain their predictions. It supports scikit-learn, xgboost, LightGBM, lightning, and sklearn-crfsuite out of the box, and it can also explain classifiers from outside this set by treating them as black boxes.
Check ELI5 ideas
Splash is a headless-browser framework for web crawling and scraping, specifically designed to act as an accessory for Scrapy crawlers (though it can also be used as a stand-alone tool). It is one of very few headless browsers designed for web scraping, and it offers many conveniences and powerful APIs for data extraction.
Check Splash ideas
Spidermon is a freshly open-sourced library that has been developed internally at Scrapinghub for years as a Quality-Assurance framework for Scrapy spiders. Spidermon lets spider developers define and enforce rules for data schema and field coverage, and is extensible towards broader crawl-verification and data-validation needs.
Check Spidermon ideas
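
To give a rough idea of what a field-coverage rule means in practice, here is a minimal sketch in plain Python. It does not use Spidermon's API: the function names `field_coverage` and `check_coverage`, the sample items, and the thresholds are all hypothetical and chosen only for illustration.

```python
from collections import Counter

# Values treated as "missing" for coverage purposes (an assumption of this sketch).
EMPTY_VALUES = (None, "", [], {})


def field_coverage(items, fields):
    """Return, for each field, the fraction of items with a non-empty value."""
    if not items:
        return {field: 0.0 for field in fields}
    counts = Counter()
    for item in items:
        for field in fields:
            if item.get(field) not in EMPTY_VALUES:
                counts[field] += 1
    return {field: counts[field] / len(items) for field in fields}


def check_coverage(items, thresholds):
    """Return the fields whose coverage is below the required minimum."""
    coverage = field_coverage(items, list(thresholds))
    return {f: c for f, c in coverage.items() if c < thresholds[f]}


# Hypothetical scraped items: one empty title, two missing prices.
items = [
    {"title": "A", "price": 10.0},
    {"title": "B", "price": None},
    {"title": "C"},
    {"title": "", "price": 5.0},
]

# Require 90% of items to have a title and 50% to have a price.
failures = check_coverage(items, {"title": 0.9, "price": 0.5})
print(failures)
```

With these sample items, `title` is filled in 3 out of 4 items and falls short of its 90% threshold, while `price` meets its 50% threshold exactly and passes. A real quality-assurance setup would run checks like this automatically at the end of a crawl and report the failures.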