
Scrapinghub and GSoC 2019

At Scrapinghub, we love open source and we know the community can build amazing things.

If you haven’t heard about it already, Google Summer of Code is a global program that offers students stipends to write code for open source projects. Scrapinghub is applying to GSoC for the 5th time, and has participated in GSoC 2014, 2015, 2016, 2017, & 2018. Julia Medina, our student in 2014, did amazing work on Scrapy’s API and settings. Jakob de Maeyer, our student in 2015, did a great job getting Scrapy Addons off the ground.

If you’re interested in participating in GSoC 2019 as a student, take a look at the curated list of ideas below. You can also propose your own ideas if you have any that are not listed. Check the corresponding “Information for Students” section and get in touch with the mentors. Don’t be afraid, we’re nice people :)

We would be thrilled to see any of the ideas below happen, but these are just our ideas. You are free to come up with a new subject, preferably around information retrieval :)

Let’s make it a great Google Summer of Code!

Scrapy Ideas for GSoC 2019

Scrapy and Google Summer of Code

Scrapy is a very popular web crawling and scraping framework for Python (15th among GitHub’s most trending Python projects), used to write spiders for crawling and extracting data from websites. Scrapy has a healthy and active community, and it’s applying for Google Summer of Code in 2019.

Information for Students

If you’re interested in participating in GSoC 2019 as a student and contributing to the Scrapy ideas, you should introduce yourself at the development repository on GitHub. The best way is to find an issue relating to your idea, or to a suggested idea from below, and make your interest known there. Don’t forget to mention the GitHub username(s) of any relevant mentors so they see your message! You can also join the #scrapy IRC channel on Freenode to chat with other Scrapy users & developers.

Ideas

Support for Different robots.txt Parsers

Intermediate
Brief explanation

Scrapy’s existing robots.txt parser is not fully compliant, but the more compliant alternatives are difficult to package and use within Scrapy’s pure-Python development tree. This project would introduce a new interface for robots.txt parsers, so that a better parser can be substituted as required or desired.

Expected Results

An interface for robots.txt parsers that abstracts the existing parser and permits a user to substitute a different parser. The solution ought to pass the existing test suite with the existing parser at least, and ideally it would also pass the tests with another, fully compliant parser. This will involve adding new tests and validating existing tests for accuracy.
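To make the idea of a substitutable parser more concrete, here is a minimal sketch of what such an interface could look like. The names `RobotsParser`, `StdlibRobotsParser`, `from_text` and `allowed` are hypothetical and chosen only for illustration; they are not an existing Scrapy API. The only real library call used is the standard library’s `urllib.robotparser.RobotFileParser`.

```python
from abc import ABC, abstractmethod
from urllib.robotparser import RobotFileParser


class RobotsParser(ABC):
    """Hypothetical parser interface (an assumption, not part of Scrapy)."""

    @classmethod
    @abstractmethod
    def from_text(cls, robots_txt: str) -> "RobotsParser":
        """Build a parser from the body of a robots.txt file."""

    @abstractmethod
    def allowed(self, url: str, user_agent: str) -> bool:
        """Return True if user_agent may fetch url."""


class StdlibRobotsParser(RobotsParser):
    """Adapter that wraps the standard library parser behind the interface."""

    def __init__(self, parser: RobotFileParser) -> None:
        self._parser = parser

    @classmethod
    def from_text(cls, robots_txt: str) -> "StdlibRobotsParser":
        parser = RobotFileParser()
        parser.parse(robots_txt.splitlines())
        return cls(parser)

    def allowed(self, url: str, user_agent: str) -> bool:
        # Note the argument order of can_fetch: user agent first, then URL.
        return self._parser.can_fetch(user_agent, url)


ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = StdlibRobotsParser.from_text(ROBOTS_TXT)
public_ok = parser.allowed("https://example.com/public", "mybot")
private_ok = parser.allowed("https://example.com/private/data", "mybot")
```

A different parser could then be plugged in by writing another small adapter class, and the same tests could be run against each adapter.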

Stretch Goals

A standalone pure-Python robots.txt parser. If you choose to improve CPython’s robotparser, you can contribute your improvements upstream.

Required skills: Python, Abstract Programming Techniques, API Design
Difficulty level: Intermediate
Mentor(s): Nikita, Konstantin

Customisable Request Fingerprints

Intermediate
Brief explanation

Scrapy uses a Request fingerprinting scheme for de-duplicating requests and for caching. Currently, the fingerprinting algorithm cannot be modified by Scrapy users. This project would involve introducing a way for the fingerprinting algorithm or function to be modified, or to accept and incorporate new information, or to otherwise empower users to affect fingerprinting and change caching and de-duplication behaviour.

Expected Results

An API and supporting code to enable users to make use of custom fingerprinting behaviour. Implementing this properly may involve negotiation with Scrapy maintainers in addition to the mentors of this project.
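As a rough illustration of what user-controlled fingerprinting could enable, the sketch below hashes the method, a canonicalised URL and the body of a request, and lets the caller exclude chosen query parameters. Everything here is an assumption made for the example: `Request` is a stand-in tuple rather than Scrapy’s class, and `make_fingerprinter` and `DupeFilter` are hypothetical names, not a proposed or existing Scrapy API.

```python
import hashlib
from typing import Callable, Iterable, NamedTuple
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


class Request(NamedTuple):
    """Stand-in for a crawler request (not Scrapy's Request class)."""
    url: str
    method: str = "GET"
    body: bytes = b""


def make_fingerprinter(ignore_params: Iterable[str] = ()) -> Callable[[Request], str]:
    """Return a fingerprint function that skips the given query parameters."""
    ignored = frozenset(ignore_params)

    def fingerprint(request: Request) -> str:
        parts = urlsplit(request.url)
        # Sort query parameters so that their order does not matter.
        query = sorted(
            (key, value)
            for key, value in parse_qsl(parts.query, keep_blank_values=True)
            if key not in ignored
        )
        canonical = urlunsplit(
            (parts.scheme, parts.netloc, parts.path, urlencode(query), "")
        )
        digest = hashlib.sha1()
        digest.update(request.method.encode())
        digest.update(canonical.encode())
        digest.update(request.body)
        return digest.hexdigest()

    return fingerprint


class DupeFilter:
    """Hypothetical duplicate filter that accepts any fingerprint function."""

    def __init__(self, fingerprint: Callable[[Request], str]) -> None:
        self._fingerprint = fingerprint
        self._seen = set()

    def seen_before(self, request: Request) -> bool:
        fp = self._fingerprint(request)
        if fp in self._seen:
            return True
        self._seen.add(fp)
        return False


first = Request("https://example.com/item?id=1&session=abc")
second = Request("https://example.com/item?session=xyz&id=1")

default_fp = make_fingerprinter()
no_session_fp = make_fingerprinter(ignore_params=["session"])

dupefilter = DupeFilter(no_session_fp)
results = [dupefilter.seen_before(first), dupefilter.seen_before(second)]
```

With the default function the two requests are treated as different pages, while the custom function treats them as the same page because the `session` parameter is ignored.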

Required skills: Python, API Design, Communication
Difficulty level: Intermediate
Mentor(s): Adrian

ELI5 Ideas for GSoC 2019

ELI5 and Google Summer of Code

ELI5 is a Python package which helps to debug machine learning classifiers and explain their predictions. It supports scikit-learn, xgboost, LightGBM, lightning, and sklearn-crfsuite out of the box, and it also supports black-box operation for explaining classifiers from outside this set.

Information for Students

If you’re interested in participating in GSoC 2019 as a student and contributing to the ELI5 ideas, you should introduce yourself at the development repository on GitHub. The best way is to find an issue relating to your idea, or to a suggested idea from below, and make your interest known there. Don’t forget to mention the GitHub username(s) of any relevant mentors so they see your message!

Ideas

Add SHAP Support

Intermediate
Brief explanation

For tree ensembles and other ML models, SHAP feature importances have emerged as a popular explanation and debugging method. We should expose SHAP explanations in eli5 in a unified interface, either by wrapping a third-party library or by having our own implementation.

Expected Results

SHAP explanations presented with a consistent API in a unified interface.
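For readers new to the idea, SHAP explanations are built on Shapley values, which split a prediction among the input features. The sketch below computes exact Shapley values for a tiny hand-written model by brute force over all feature orderings. It is an illustration of the underlying concept only: it replaces "missing" features with a single baseline value (a simplifying assumption), it is exponential in the number of features, and `shapley_values` and `toy_model` are hypothetical names, not part of eli5 or any SHAP library.

```python
from itertools import permutations
from math import factorial
from typing import Callable, List, Sequence


def shapley_values(
    model: Callable[[Sequence[float]], float],
    x: Sequence[float],
    baseline: Sequence[float],
) -> List[float]:
    """Average each feature's marginal contribution over all orderings."""
    n = len(x)
    totals = [0.0] * n
    for order in permutations(range(n)):
        current = list(baseline)      # start with every feature "missing"
        previous = model(current)
        for i in order:
            current[i] = x[i]         # reveal feature i
            value = model(current)
            totals[i] += value - previous
            previous = value
    return [total / factorial(n) for total in totals]


def toy_model(features: Sequence[float]) -> float:
    """Toy model with one additive term and one interaction term."""
    a, b, c = features
    return 2.0 * a + b * c


x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)
```

In this toy case the additive feature receives its full effect and the interaction term is shared equally between the two features involved, and the values sum to the difference between the prediction for `x` and the prediction for the baseline.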

Required skills: Python, Machine Learning
Difficulty level: Intermediate
Mentor(s): Mikhail

Splash Ideas for GSoC 2019

Splash and Google Summer of Code

Splash is a headless-browser framework for web crawling and scraping, specifically designed to act as an accessory for Scrapy crawlers (though it can also be used as a stand-alone tool). It is one of very few headless browsers designed for web scraping, and it sports many conveniences and powerful APIs for data extraction.

Information for Students

If you’re interested in participating in GSoC 2019 as a student and contributing to the Splash ideas, you should introduce yourself at the development repository on GitHub. The best way is to find an issue relating to your idea, or to a suggested idea from below, and make your interest known there. Don’t forget to mention the GitHub username(s) of any relevant mentors so they see your message! You can also join the #scrapy IRC channel on Freenode to chat with other Scrapy/Splash users & developers.

Ideas

Add QWebEngine Support

Advanced
Brief explanation

Splash is a headless browser with an HTTP API. Currently it relies on QWebKit, which is deprecated in Qt and is maintained as an external fork. As a replacement, Qt introduced QWebEngine, which provides an interface to Chromium. The idea is to add QWebEngine (Chromium) support to Splash and allow users to choose which engine to use at run time, keeping the same API when possible. There is an initial implementation available for the render.html endpoint (https://github.com/scrapinghub/splash/pull/867).

Expected Results

The goal of this project is to provide a parallel implementation for as many features as possible.
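One way to picture run-time engine selection behind a shared API is a small registry of interchangeable engine classes. The sketch below is a design illustration only: `BrowserEngine`, `WebKitEngine`, `ChromiumEngine` and `get_engine` are hypothetical names, the `render_html` methods are stubs that return a placeholder string instead of driving a browser, and none of this reflects Splash’s actual code.

```python
from abc import ABC, abstractmethod


class BrowserEngine(ABC):
    """Hypothetical common interface that every engine would implement."""

    name = "abstract"

    @abstractmethod
    def render_html(self, url: str) -> str:
        """Return the rendered HTML for url."""


class WebKitEngine(BrowserEngine):
    name = "webkit"

    def render_html(self, url: str) -> str:
        # Stub: a real engine would drive QWebKit here.
        return f"<!-- rendered {url} with {self.name} -->"


class ChromiumEngine(BrowserEngine):
    name = "chromium"

    def render_html(self, url: str) -> str:
        # Stub: a real engine would drive QWebEngine here.
        return f"<!-- rendered {url} with {self.name} -->"


ENGINES = {cls.name: cls for cls in (WebKitEngine, ChromiumEngine)}


def get_engine(name: str = "webkit") -> BrowserEngine:
    """Pick an engine by name at run time."""
    try:
        return ENGINES[name]()
    except KeyError:
        raise ValueError(f"unknown engine: {name!r}") from None


html = get_engine("chromium").render_html("http://example.com")
```

Callers depend only on the shared interface, so features can be ported to the new engine one at a time while the old engine stays available.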

Required skills: Python, Lua, Qt5, Twisted
Difficulty level: Advanced
Mentor(s): Mikhail

Spidermon Ideas for GSoC 2019

Information for Students

If you’re interested in participating in GSoC 2019 as a student and contributing to the Spidermon ideas, or to your own idea, you should get involved in the development repository over at the Spidermon GitHub. There, you can make a suggestion to the maintainers and see if there is scope for you to contribute as part of GSoC.

Ideas