31: Crawling the Web using Elixir with Oleg Tarasenko and Tze Yiing
Episode Synopsis
We talk with Oleg Tarasenko and Tze Yiing about crawling the web using Elixir. Oleg created the Crawly project to help solve this problem, and Tze Yiing joined him as a contributor and maintainer. We cover how Elixir is well suited to orchestrating crawls, how to deal with login pages, the legal concerns around scraping, building a codeless scraper, and much more!
Show Notes online - http://podcast.thinkingelixir.com/31 (http://podcast.thinkingelixir.com/31)
Elixir Community News
- https://dashbit.co/blog/ten-years-ish-of-elixir (https://dashbit.co/blog/ten-years-ish-of-elixir) – January 9th marked the 10th year since the first commit to the Elixir repository
- https://github.com/elixir-lang/elixir/commit/337c3f2d569a42ebd5fcab6fef18c5e012f9be5b (https://github.com/elixir-lang/elixir/commit/337c3f2d569a42ebd5fcab6fef18c5e012f9be5b) – First commit on the repository
- https://twitter.com/josevalim/status/1349010127270129670 (https://twitter.com/josevalim/status/1349010127270129670) – Jose Valim reveals that his secret project is called 'Nx'
- https://remote.com/blog/welcoming-elixir-creator-jose-valim (https://remote.com/blog/welcoming-elixir-creator-jose-valim) – Jose Valim joins Remote as a Technical Advisor
- https://twitter.com/josevalim/status/1347858475267854336 (https://twitter.com/josevalim/status/1347858475267854336) – ExUnit will catch the SIGQUIT signal from CTRL+\ and show the tests that were running
- https://github.com/elixir-lang/elixir/blob/master/lib/mix/lib/mix/tasks/test.ex#L34 (https://github.com/elixir-lang/elixir/blob/master/lib/mix/lib/mix/tasks/test.ex#L34) – ExUnit will print how much time the test suite spent on async tests vs sync tests
- https://twitter.com/fhunleth/status/1348092050487570433 (https://twitter.com/fhunleth/status/1348092050487570433) – Nerves support on the M1 is looking good
- https://www.youtube.com/playlist?list=PLqj39LCvnOWZl_Pb0Y7wGWijKbTvL4gJg (https://www.youtube.com/playlist?list=PLqj39LCvnOWZl_Pb0Y7wGWijKbTvL4gJg) – ElixirConf 2020 videos have all been publicly released!
Do you have some Elixir news to share? Tell us at @ThinkingElixir (https://twitter.com/ThinkingElixir) or email at [email protected] (mailto:[email protected])
Discussion Resources
- https://oltarasenko.medium.com/web-scraping-with-elixir-and-crawly-extracting-data-behind-authentication-a52584e9cf13 (https://oltarasenko.medium.com/web-scraping-with-elixir-and-crawly-extracting-data-behind-authentication-a52584e9cf13)
- https://oltarasenko.medium.com/using-elixir-and-crawly-for-price-monitoring-7364d345fc64 (https://oltarasenko.medium.com/using-elixir-and-crawly-for-price-monitoring-7364d345fc64) – Using Elixir for price monitoring
- https://hex.pm/packages/crawly (https://hex.pm/packages/crawly)
- https://github.com/oltarasenko/crawly (https://github.com/oltarasenko/crawly)
- https://www.erlang-solutions.com/blog/web-scraping-with-elixir.html (https://www.erlang-solutions.com/blog/web-scraping-with-elixir.html) – Oleg's older web scraping with Elixir article
- https://www.erlang-solutions.com/blog/how-to-build-a-machine-learning-project-in-elixir.html (https://www.erlang-solutions.com/blog/how-to-build-a-machine-learning-project-in-elixir.html) – Building a machine learning project with Elixir, TensorFlow and Crawly
- https://oltarasenko.medium.com/what-is-web-scraping-and-why-you-might-want-to-use-it-a0e4b621f6d0 (https://oltarasenko.medium.com/what-is-web-scraping-and-why-you-might-want-to-use-it-a0e4b621f6d0) – What is web scraping, and why you might want to use it?
- https://www.pillowskin.com (https://www.pillowskin.com) – Ziinc's project using scraping and aggregation
- https://www.tensorflow.org/ (https://www.tensorflow.org/)
- https://oltarasenko.medium.com/the-unofficial-guide-to-extracting-google-search-results-in-2021-with-elixir-7a6ef80d0f5b (https://oltarasenko.medium.com/the-unofficial-guide-to-extracting-google-search-results-in-2021-with-elixir-7a6ef80d0f5b)
- https://scrapy.org/ (https://scrapy.org/)
- https://github.com/fredwu/crawler (https://github.com/fredwu/crawler)
- https://www.eff.org/deeplinks/2019/09/victory-ruling-hiq-v-linkedin-protects-scraping-public-data (https://www.eff.org/deeplinks/2019/09/victory-ruling-hiq-v-linkedin-protects-scraping-public-data) – EFF legal interpretation of LinkedIn vs HiQ scraping case
- https://github.com/scrapinghub/splash/ (https://github.com/scrapinghub/splash/)
- https://www.joinhoney.com/ (https://www.joinhoney.com/)
- https://hexdocs.pm/crawly/readme.html#quickstart (https://hexdocs.pm/crawly/readme.html#quickstart) – Crawly quickstart guide
- https://hexdocs.pm/crawly/tutorial.html (https://hexdocs.pm/crawly/tutorial.html) – Crawly tutorial
- https://github.com/oltarasenko/crawly_ui (https://github.com/oltarasenko/crawly_ui) – Crawly UI project
- http://crawlyui.com/ (http://crawlyui.com/) – Crawly UI project page
- Data is the new gold
- https://t.me/elixir_crawly (https://t.me/elixir_crawly) – Crawly Telegram group
Guest Information
- https://github.com/oltarasenko (https://github.com/oltarasenko) – Oleg on GitHub
- https://oltarasenko.medium.com/ (https://oltarasenko.medium.com/) – Oleg's Blog
- https://twitter.com/tzeyiing (https://twitter.com/tzeyiing) – Lee TzeYiing on Twitter
- https://github.com/Ziinc (https://github.com/Ziinc) – Lee TzeYiing on GitHub
- https://www.tzeyiing.com (https://www.tzeyiing.com) – Lee TzeYiing's Blog
Find us online
- Message the show - @ThinkingElixir (https://twitter.com/ThinkingElixir)
- Email the show - [email protected] (mailto:[email protected])
- Mark Ericksen - @brainlid (https://twitter.com/brainlid)
- David Bernheisel - @bernheisel (https://twitter.com/bernheisel)
- Cade Ward - @cadebward (https://twitter.com/cadebward)