Introduction to Web Scraping | Data Science Initiative

Event Date:

Thursday, May 12, 2022 - 4:00pm to 7:00pm

Friday, May 13, 2022 - 4:00pm to 7:00pm

Event Location:

UCSB Library

Data Carpentry develops and teaches workshops on the fundamental data skills needed to conduct research. Its target audience is researchers who have little to no prior computational experience, and its lessons are domain-specific, building on learners' existing knowledge so that they can quickly apply the skills they learn to their own research. Participants will be encouraged to help one another and to apply what they have learned to their own research problems.

Web scraping is the process of extracting data from websites. Some data available on the web is presented in a format that makes it easy to collect and use, for example as downloadable comma-separated values (CSV) datasets that can be imported into a spreadsheet or loaded into a data analysis script. Often, however, data that is publicly available is not readily available for reuse. For example, it can be contained in a PDF, in a table on a website, or spread across multiple web pages.

There are a variety of ways to scrape a website to extract information for reuse. In its simplest form, this can be achieved by copying and pasting snippets from a web page, but this can be impractical if there is a large amount of data to extract, or if it is spread over a large number of pages. Instead, specialized tools and techniques can be used to automate this process by defining what sites to visit, what information to look for, and whether data extraction should stop once the end of a page has been reached or follow hyperlinks and repeat the process recursively. Automating web scraping also makes it possible to run the process at regular intervals and capture changes in the data.
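As a rough illustration of the idea described above, the following Python sketch pulls the rows out of an HTML table, collects the hyperlinks a scraper could follow next, and writes the rows out as CSV. It is not taken from the workshop materials: the sample page, the class name `TableAndLinkParser`, and the helper functions `scrape` and `rows_to_csv` are all made up for this example, and it uses only Python's standard library rather than any particular scraping tool. It assumes a single, well-formed table and does not download anything from the network.

```python
import csv
import io
from html.parser import HTMLParser


class TableAndLinkParser(HTMLParser):
    """Collect table rows and hyperlink targets from an HTML document."""

    def __init__(self):
        super().__init__()
        self.rows = []     # completed table rows, each a list of cell strings
        self.links = []    # href values of <a> tags, candidates to visit next
        self._row = None   # row currently being built, or None outside a <tr>
        self._cell = None  # text fragments of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def scrape(html):
    """Return (table rows, hyperlinks) found in an HTML string."""
    parser = TableAndLinkParser()
    parser.feed(html)
    parser.close()
    return parser.rows, parser.links


def rows_to_csv(rows):
    """Serialize rows as CSV text, ready to load into a spreadsheet."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


# Hypothetical page content, invented for this example.
SAMPLE_PAGE = """
<html><body>
<table>
  <tr><th>Name</th><th>Department</th></tr>
  <tr><td>Ada</td><td>Mathematics</td></tr>
  <tr><td>Grace</td><td>Computer Science</td></tr>
</table>
<a href="/people?page=2">Next page</a>
</body></html>
"""

rows, links = scrape(SAMPLE_PAGE)
csv_text = rows_to_csv(rows)
print(csv_text)
print(links)
```

In a real scraper, the collected links would decide whether to stop at the end of the page or to fetch the next page and repeat the same extraction. Always check a site's terms of use before scraping it.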

Find more information about this particular Carpentry course here, including a list of instructors and daily schedule.

May 18, 2022 - 11:46am