Scraping Job Requirements from LinkedIn
- Arjun Singh
- May 23, 2020
- 4 min read

Director William Hefley, from the University of Texas at Dallas, showed me one of his papers earlier this year. It contained detailed information quantifying the critical skills needed by students entering the job market as Data Analysts and Data Scientists. As a mini project, I decided to scrape job requirement data for data scientists from LinkedIn.
There are two main tools in this project: Selenium and BeautifulSoup. Selenium lets a Python program drive a browser the way a human would. This is useful for our purposes because LinkedIn is very particular about what gets rendered to the screen at any given time; for instance, new jobs only load after the user has scrolled to the bottom of the page. BeautifulSoup is valuable for its HTML parsing, letting developers quickly find the elements they're looking for in a vast amount of markup. This solves the problem of isolating the very specific text we want, such as a job ID, from a list of hundreds of elements.
The first step is to set up a WebDriver for Selenium, which is what lets Selenium manipulate the browser. I used the Chrome WebDriver by updating Google Chrome and downloading the corresponding version from Chrome's WebDriver site. Next, we set the implicit wait time, which is the amount of time our program will wait for a web response before it throws an error. If a web page isn't loading, we don't want our program to sit on a bad or stale request forever.
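A minimal setup sketch looks something like this; the chromedriver path is just a placeholder for wherever you saved the executable, and 10 seconds is an arbitrary choice:

```python
from selenium import webdriver

# Path to the chromedriver executable you downloaded (placeholder path).
driver = webdriver.Chrome("/path/to/chromedriver")

# Wait up to 10 seconds for elements to appear before raising an error.
driver.implicitly_wait(10)
```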
Now it's time to use our WebDriver. We issue a GET request by passing the driver a URL containing our search query. I used "https://www.linkedin.com/jobs/search/?keywords=data%20scientist&location=United%20States", which goes to LinkedIn Jobs and searches for Data Scientist positions available in the United States. From this page alone, there are all sorts of things we can scrape.
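Pointing the driver at that query is a one-liner:

```python
search_url = (
    "https://www.linkedin.com/jobs/search/"
    "?keywords=data%20scientist&location=United%20States"
)
driver.get(search_url)  # loads the Data Scientist search results page
```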
LinkedIn always shows a job post's details when you run a job search, so you would think that the HTML for what you are looking at is immediately available. How could you possibly see the content if it hadn't been programmatically written to the page? That's only partially true. Some web pages load in batches because there is a lot of content to collect from the web host, which means the target element you want may take some time to come through. Basically, we need to add a WebDriverWait that listens for when your target content is available.
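Here's a sketch of that wait, assuming the job description lives in an element with a class name like "description__text" (that class name is an assumption about LinkedIn's markup, so inspect the page to confirm it):

```python
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Block for up to 10 seconds until the (assumed) job description container
# is present in the DOM; raises a TimeoutException otherwise.
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CLASS_NAME, "description__text"))
)
```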
So the target GET request has been made and the target content has loaded... now what? We grab the markup from driver.page_source and pass it to BeautifulSoup. The job requirements and company name are easily identified using a combination of soup.find and soup.find_all. We now have what we're looking for from a single job. Half the battle is over; we just need to grab data from multiple jobs.
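A sketch of that parsing step follows. The class names are assumptions about LinkedIn's markup at the time of writing, so inspect the page and substitute whatever you actually find:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(driver.page_source, "html.parser")

# Assumed class names -- verify these against the live page.
company_tag = soup.find("a", class_="topcard__org-name-link")
description_tag = soup.find("div", class_="description__text")

company = company_tag.get_text(strip=True) if company_tag else None
requirements = description_tag.get_text(" ", strip=True) if description_tag else None
```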
It's important to realize that LinkedIn Jobs shows a list of job "cards" to the left of the job description. Clicking a job doesn't take the user to a new page; it just writes the new job description over the last one. We can use this to our advantage. Upon further inspection, you will notice each job card is associated with a job ID, which we can collect to verify that we aren't duplicating jobs while we scrape. The other things to notice are that the jobs are rendered as a list, and the current job (the one whose description is visible on the right) has a unique class name containing the word "active". We can write three functions that help us load new jobs, sketched below. The first identifies the current job by looking for the "active" keyword and finds the next job in the list through BeautifulSoup's find_next_sibling method. The second "clicks" a specified element via Selenium, given a path. The third combines these two by extracting the path of the next element from function one and passing it to function two.
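A rough sketch of those three functions is below. The "active" check mirrors the description above, but the li tag, the data-id attribute, and the XPath built from it are assumptions about LinkedIn's markup that you'd need to verify yourself:

```python
from selenium.webdriver.common.by import By

def get_next_job_card(soup):
    # Find the card whose class contains "active", then take the card after it.
    current = soup.find("li", class_=lambda c: c and "active" in c)
    return current.find_next_sibling("li") if current else None

def click_element(driver, xpath):
    # "Click" a specified element via Selenium, given a path to it.
    driver.find_element(By.XPATH, xpath).click()

def load_next_job(driver, soup):
    # Combine the two: locate the next card, build a path from its (assumed)
    # data-id attribute, and click it so its description loads on the right.
    next_card = get_next_job_card(soup)
    if next_card is None:
        return None
    job_id = next_card.get("data-id")
    click_element(driver, f'//li[@data-id="{job_id}"]')
    return job_id
```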
This is the end game. The obvious next step is to wrap the functions we have built in a class. The constructor kicks off the loop that continuously scrapes jobs. Because the jobs don't need the frame to be reloaded, we can call the scraping functions independent of the GET request, allowing us to scrape the new jobs dynamically. It's important to be conscientious of pagination and the time it takes to load new assets. We can add waits to the loop to make sure we are giving the assets enough time to load and be scraped. We can also add a Selenium call that scrolls to the bottom of the screen after every 49 jobs to load new job cards. The final step is to convert the scraped data to JSON, and we're pretty much done.
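Put together, the class might look roughly like this. It's a sketch under the assumptions above (the helper functions, the class names, and the 49-job scroll interval), not a drop-in implementation:

```python
import json
import time
from bs4 import BeautifulSoup

class LinkedInJobScraper:
    def __init__(self, driver, max_jobs=200):
        self.driver = driver
        self.max_jobs = max_jobs
        self.jobs = {}   # job_id -> description text; keys prevent duplicates
        self.scrape()    # the constructor kicks off the scraping loop

    def scrape(self):
        while len(self.jobs) < self.max_jobs:
            soup = BeautifulSoup(self.driver.page_source, "html.parser")
            job_id = load_next_job(self.driver, soup)  # helper from above
            if job_id is None or job_id in self.jobs:
                break
            time.sleep(2)  # give the new description time to render
            detail = BeautifulSoup(self.driver.page_source, "html.parser")
            desc = detail.find("div", class_="description__text")  # assumed class
            self.jobs[job_id] = desc.get_text(" ", strip=True) if desc else ""
            if len(self.jobs) % 49 == 0:
                # Scroll to the bottom so LinkedIn loads the next batch of cards.
                self.driver.execute_script(
                    "window.scrollTo(0, document.body.scrollHeight);")

    def to_json(self, path="data_scientist_jobs.json"):
        with open(path, "w") as f:
            json.dump(self.jobs, f, indent=2)
```

Usage would just be scraper = LinkedInJobScraper(driver) followed by scraper.to_json().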
All in all, a pretty fun project. There is a reason for all of this, by the way. I'm just not going to say what it is until my next post :)
Sources:
https://selenium-python.readthedocs.io/index.html
https://www.crummy.com/software/BeautifulSoup/bs4/doc/#find-next-siblings-and-find-next-sibling
https://medium.com/federicohaag/linkedin-scraping-with-python-d8d14519602d
https://www.freecodecamp.org/news/better-web-scraping-in-python-with-selenium-beautiful-soup-and-pandas-d6390592e251/
https://blog.hartleybrody.com/web-scraping-cheat-sheet/#useful-libraries