Scrapinghub in GSoC 2019

Scrapinghub is applying!

Scrapinghub is a company focused on information retrieval and subsequent data processing. We are deeply involved in developing and contributing to open source projects for web crawling and data processing.

This year, we are applying with our flagship project, Scrapy; our headless-browsing framework, Splash; our machine-learning debugging framework, ELI5; and our new crawler quality-assurance library, Spidermon. You can learn more about these projects in their respective GitHub repositories: scrapy/scrapy, scrapinghub/splash, TeamHG-Memex/eli5, scrapinghub/spidermon.

Scrapy

Scrapy is a very popular web crawling and scraping framework for Python (10th among GitHub's most trending Python projects), used to write spiders that crawl websites and extract data from them.

Check Scrapy ideas

ELI5

ELI5 is a Python package which helps to debug machine learning classifiers and explain their predictions. It supports scikit-learn, xgboost, LightGBM, lightning, and sklearn-crfsuite out of the box, and it also supports black-box operation for explaining classifiers from outside this set.

Check ELI5 ideas

Splash

Splash is a headless-browser framework for web crawling and scraping, designed specifically as a companion to Scrapy crawlers, though it can also be used as a stand-alone tool. It is one of the very few headless browsers designed for web scraping, and it offers many conveniences and powerful APIs for data extraction.

Check Splash ideas

Spidermon

Spidermon is a newly open-sourced library that Scrapinghub has developed internally for years as a quality-assurance framework for Scrapy spiders. Spidermon lets spider developers define and enforce rules for data schema and field coverage, and it can be extended to cover broader crawl-verification and data-validation needs.
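
To give a rough idea of what a field-coverage rule means, here is a minimal sketch in plain Python. It does not use Spidermon's API: the function names `field_coverage` and `check_coverage`, the sample items, and the thresholds are all hypothetical and exist only for illustration.

```python
from typing import Any, Dict, Iterable, List

# Values treated as "missing" in this sketch (an assumption, not a Spidermon rule).
EMPTY_VALUES = (None, "", [], {})


def field_coverage(items: Iterable[Dict[str, Any]], fields: List[str]) -> Dict[str, float]:
    """Return, for each field, the fraction of items with a non-empty value."""
    items = list(items)
    if not items:
        return {field: 0.0 for field in fields}
    coverage = {}
    for field in fields:
        filled = sum(1 for item in items if item.get(field) not in EMPTY_VALUES)
        coverage[field] = filled / len(items)
    return coverage


def check_coverage(coverage: Dict[str, float], thresholds: Dict[str, float]) -> List[str]:
    """Return one message per field whose coverage is below its minimum."""
    return [
        f"{field}: {coverage.get(field, 0.0):.0%} < {minimum:.0%}"
        for field, minimum in thresholds.items()
        if coverage.get(field, 0.0) < minimum
    ]


# Hypothetical scraped items: some are missing a title or a price.
SAMPLE_ITEMS = [
    {"title": "A", "price": 10.0},
    {"title": "B", "price": None},
    {"title": "", "price": 5.0},
    {"title": "D"},
]

coverage = field_coverage(SAMPLE_ITEMS, ["title", "price"])
failures = check_coverage(coverage, {"title": 0.7, "price": 0.9})
```

With these sample items, `title` is filled in three of four items and `price` in two of four, so only the `price` rule fails. A real monitoring setup would run such checks at the end of a crawl and report the failures.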

Check Spidermon ideas