title
Data Science Project from Scratch - Part 2 (Data Collection)
description
This is part two of the new Data Science Project from Scratch series. In this video, I go through how to set up a GitHub repo and collect data for your own data science project.
GitHub repo for this project: https://github.com/PlayingNumbers/ds_salary_proj
How to set up data science environment: https://www.youtube.com/watch?v=C4OPn58BLaU
Chrome Driver Link: https://chromedriver.chromium.org/
Data collection can be a tedious and frustrating process, but you don't necessarily have to start from scratch. Search GitHub to see if someone has already built a web scraper for the website you're targeting, and check whether the site offers an open API.
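If you want to script that first check, GitHub's public search API is enough. A minimal sketch (the query string is just an example; swap in whatever site you're scraping):

import requests

# Look for existing scrapers before writing one from scratch
resp = requests.get(
    "https://api.github.com/search/repositories",
    params={"q": "glassdoor scraper selenium", "sort": "stars"},
)
for repo in resp.json()["items"][:5]:
    print(repo["full_name"], "-", repo["html_url"])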
For this project, I found someone who had written a Glassdoor web scraper, and I was able to adapt it for our purposes. The code and the article that I used are linked here:
Code: https://github.com/arapfaik/scraping-glassdoor-selenium
Article: https://towardsdatascience.com/selenium-tutorial-scraping-glassdoor-com-in-10-minutes-3d0915c6d905
This web scraper is written in Python using the Selenium package.
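If you haven't used Selenium before: it drives a real Chrome instance through the Chrome Driver linked above, so JavaScript-rendered pages load just as they would for a human. A minimal sketch of the startup, in current Selenium 4 syntax (the driver path and the CSS selector are placeholders, not the exact values from the repo):

import time
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

CHROMEDRIVER_PATH = "C:/Users/you/chromedriver.exe"  # placeholder; must match your Chrome version

driver = webdriver.Chrome(service=Service(CHROMEDRIVER_PATH))
driver.get("https://www.glassdoor.com/Job/jobs.htm?sc.keyword=data+scientist")
time.sleep(15)  # JavaScript-heavy page; give the listings time to render

# Placeholder selector; inspect the live page to find the real one
listings = driver.find_elements(By.CSS_SELECTOR, "li.jobListing")
print(f"found {len(listings)} job cards on this page")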
Scraping is an iterative process, so in this video you will also see how I go about debugging my code and getting it to work.
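Much of that debugging comes down to wrapping flaky page interactions defensively. Glassdoor's sign-up pop-up, for example, only appears some of the time, so closing it has to be optional. A sketch of the pattern (the '[alt="Close"]' selector is an assumption; check the scraper code for the real one):

from selenium.common.exceptions import ElementClickInterceptedException, NoSuchElementException
from selenium.webdriver.common.by import By

def dismiss_popup(driver):
    # The sign-up modal shows up intermittently; a missing close button
    # shouldn't crash the whole scraping run.
    try:
        driver.find_element(By.CSS_SELECTOR, '[alt="Close"]').click()
    except (NoSuchElementException, ElementClickInterceptedException):
        pass  # no pop-up this round; keep scraping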
Stay tuned for part 3, where I clean up the data we collected so it's usable for our EDA and model building.
After we scrape the data, I push the code up to GitHub.
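For reference, the final collection script is only a few lines. This sketch assumes the argument order of get_jobs from the linked repo (keyword, number of jobs, verbose flag, driver path, sleep time) and a placeholder driver path:

import glassdoor_scraper as gs

CHROMEDRIVER_PATH = "C:/Users/you/chromedriver.exe"  # placeholder

# Roughly a thousand records gives us enough rows for cleaning, EDA, and modeling
df = gs.get_jobs("data scientist", 1000, False, CHROMEDRIVER_PATH, 15)
df.to_csv("glassdoor_jobs.csv", index=False)

From there, the usual git add, git commit, and git push from Git Bash gets everything onto the repo.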
Project from scratch playlist: https://www.youtube.com/watch?v=MpF9HENQjDo&list=PL2zq7klxX5ASFejJj80ob9ZAnBHdz5O1t
My other project playlist: https://www.youtube.com/playlist?list=PL2zq7klxX5AReJn7nZfqOKLZ3IpKj7fwc
#DataScience #KenJee #DataScienceProject
⭕ Subscribe: https://www.youtube.com/c/kenjee1?sub_confirmation=1
🎙 Listen to My Podcast: https://www.youtube.com/c/KensNearestNeighborsPodcast
🕸 Check out My Website - https://kennethjee.com/
✍️ Sign up for My Newsletter - https://www.kennethjee.com/newsletter
📚 Books and Products I use - https://www.amazon.com/shop/kenjee (affiliate link)
Partners & Affiliates
🌟 365 Data Science - Courses (57% Annual Discount): https://365datascience.pxf.io/P0jbBY
🌟 Interview Query - https://www.interviewquery.com/?ref=kenjee
MORE DATA SCIENCE CONTENT HERE:
🐤 My Twitter - https://twitter.com/KenJee_DS
👔 LinkedIn - https://www.linkedin.com/in/kenjee/
📈 Kaggle - https://www.kaggle.com/kenjee
📑 Medium Articles - https://medium.com/@kenneth.b.jee
💻 GitHub - https://github.com/PlayingNumbers
🏀 My Sports Blog - https://www.playingnumbers.com
Check These Videos Out Next!
My Leaderboard Project: https://www.youtube.com/watch?v=myhoWUrSP7o&ab_channel=KenJee
66 Days of Data: https://www.youtube.com/watch?v=qV_AlRwhI3I&ab_channel=KenJee
How I Would Learn Data Science in 2021: https://www.youtube.com/watch?v=41Clrh6nv1s&ab_channel=KenJee
My Playlists
Data Science Beginners: https://www.youtube.com/playlist?list=PL2zq7klxX5ATMsmyRazei7ZXkP1GHt-vs
Project From Scratch: https://www.youtube.com/watch?v=MpF9HENQjDo&list=PL2zq7klxX5ASFejJj80ob9ZAnBHdz5O1t&ab_channel=KenJee
Kaggle Projects: https://www.youtube.com/playlist?list=PL2zq7klxX5AQXzNSLtc_LEKFPh2mAvHIO
detail
{'title': 'Data Science Project from Scratch - Part 2 (Data Collection)', 'heatmap': [{'end': 466.399, 'start': 431.266, 'weight': 0.766}, {'end': 710.321, 'start': 635.171, 'weight': 0.712}, {'end': 793.141, 'start': 755.147, 'weight': 0.865}], 'summary': 'Covers data collection, github setup, and tools like spyder ide, colab, and jupyter notebooks for predicting job salaries using data from linkedin and glassdoor.com. it discusses challenges in scraping job description data, building a prediction engine, web data scraping using beautiful soup and selenium, troubleshooting code, and the process of scraping data from glassdoor to collect around a thousand records for analysis.', 'chapters': [{'end': 76.237, 'segs': [{'end': 28.126, 'src': 'embed', 'start': 0.25, 'weight': 0, 'content': [{'end': 4.953, 'text': 'Hello everyone, Ken here back with part two of the data science project from scratch series.', 'start': 0.25, 'duration': 4.703}, {'end': 12.34, 'text': "In this video, I'll show you how to go about finding your own data, and I'll also show you how to set up a GitHub repo to version your code.", 'start': 5.713, 'duration': 6.627}, {'end': 18.063, 'text': 'Data collection can be extremely different for every project, and it can also be pretty frustrating sometimes.', 'start': 12.781, 'duration': 5.282}, {'end': 22.204, 'text': 'So have fun watching me kind of go ahead and struggle through this.', 'start': 18.803, 'duration': 3.401}, {'end': 28.126, 'text': "For most of my data science work, I use the Spyder IDE, and I think that's a fine place to start.", 'start': 22.724, 'duration': 5.402}], 'summary': 'Ken shares data science project tips and tools, including github for versioning.', 'duration': 27.876, 'max_score': 0.25, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/GmW4F6MHqqs/pics/GmW4F6MHqqs250.jpg'}, {'end': 60.322, 'src': 'embed', 'start': 36.873, 'weight': 1, 'content': [{'end': 43.899, 'text': "I recommend watching the video that I've linked above and below about the data science environment setup with Anaconda.", 'start': 36.873, 'duration': 7.026}, {'end': 49.444, 'text': 'If you recall from my last video, I decided to do a project related to the data science field.', 'start': 44.46, 'duration': 4.984}, {'end': 55.75, 'text': 'I want to try and predict the salary of a position based on some of the factors associated with the job.', 'start': 50.245, 'duration': 5.505}, {'end': 60.322, 'text': "I'll be looking for this data on LinkedIn and glassdoor.com.", 'start': 56.618, 'duration': 3.704}], 'summary': 'Recommend watching video on data science setup with anaconda for predicting job salary.', 'duration': 23.449, 'max_score': 36.873, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/GmW4F6MHqqs/pics/GmW4F6MHqqs36873.jpg'}], 'start': 0.25, 'title': 'Data science project: data collection and github setup', 'summary': 'Covers data collection process, github repository setup, and use of tools like spyder ide, colab, and jupyter notebooks. it involves predicting job salaries with data from linkedin and glassdoor.com.', 'chapters': [{'end': 76.237, 'start': 0.25, 'title': 'Data science project: data collection and github setup', 'summary': 'Covers data collection process for a data science project, setting up a github repository for versioning, and tools like spyder ide, colab, and jupyter notebooks. 
the project involves predicting job salaries based on factors, with data being sourced from linkedin and glassdoor.com.', 'duration': 75.987, 'highlights': ['The data science project involves predicting job salaries based on factors associated with the job, with data being sourced from LinkedIn and glassdoor.com.', 'The chapter covers the data collection process and setting up a GitHub repository for versioning, along with the use of tools like Spyder IDE, Colab, and Jupyter Notebooks.', 'Recommendation to watch a video about the data science environment setup with Anaconda for more information on getting the computer set up for data science projects.', 'Encouragement to like, subscribe, and turn on notifications for future weekly videos and continuation of the series.']}], 'duration': 75.987, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/GmW4F6MHqqs/pics/GmW4F6MHqqs250.jpg', 'highlights': ['The chapter covers the data collection process and setting up a GitHub repository for versioning, along with the use of tools like Spyder IDE, Colab, and Jupyter Notebooks.', 'The data science project involves predicting job salaries based on factors associated with the job, with data being sourced from LinkedIn and glassdoor.com.', 'Recommendation to watch a video about the data science environment setup with Anaconda for more information on getting the computer set up for data science projects.']}, {'end': 270.502, 'segs': [{'end': 216.909, 'src': 'embed', 'start': 76.854, 'weight': 0, 'content': [{'end': 85.577, 'text': "So I've pulled up some job descriptions for data science roles, as you can see up here on LinkedIn and on Glassdoor.", 'start': 76.854, 'duration': 8.723}, {'end': 89.879, 'text': 'So on LinkedIn, it looks like some of this information is pretty readily accessible.', 'start': 85.737, 'duration': 4.142}, {'end': 96.502, 'text': "If we want to see where it's located for some scraping tools, we can go in and see that it's located in the span.", 'start': 90.499, 'duration': 6.003}, {'end': 105.595, 'text': "It looks like usually when there are lists like this on the side and it's all embedded, that this is probably a JavaScript component.", 'start': 97.582, 'duration': 8.013}, {'end': 107.536, 'text': 'So that might be a little bit harder to scrape.', 'start': 105.675, 'duration': 1.861}, {'end': 109.217, 'text': "Let's look at Glassdoor.", 'start': 108.297, 'duration': 0.92}, {'end': 117.741, 'text': "It looks like they use a similar web architecture, but they do have some salary data, which is what we're going for.", 'start': 109.237, 'duration': 8.504}, {'end': 129.827, 'text': "It looks like this is a Glassdoor estimates and it's generally not a great practice to do predictions or use a dependent variable that is already an estimate.", 'start': 118.522, 'duration': 11.305}, {'end': 132.849, 'text': 'But I think this is probably going to be the best we can get here.', 'start': 130.187, 'duration': 2.662}, {'end': 140.963, 'text': 'And we might, if we were to go back in the future and also append different data, be able to improve on this Glassdoor model.', 'start': 133.659, 'duration': 7.304}, {'end': 146.846, 'text': "So right now let's try and basically make our own Glassdoor prediction engine using some of the information that we have here.", 'start': 141.423, 'duration': 5.423}, {'end': 154.07, 'text': "So we've decided on actually going about and using and scraping Glassdoor because they have this price info.", 'start': 147.667, 'duration': 6.403}, 
{'end': 159.773, 'text': 'I think they also have the same company info as well that we would want to see.', 'start': 154.53, 'duration': 5.243}, {'end': 167.82, 'text': "So, in order to write a web scraper in order, whenever we're doing one of these projects, we want to put it on github.", 'start': 160.634, 'duration': 7.186}, {'end': 171.241, 'text': "so let's quickly make a github repo.", 'start': 167.82, 'duration': 3.421}, {'end': 174.702, 'text': "so i'm going to open the git bash, which is the command line, and we're going to do.", 'start': 171.241, 'duration': 3.461}, {'end': 177.863, 'text': "we're going to change into my documents folder.", 'start': 174.702, 'duration': 3.161}, {'end': 188.926, 'text': "then we're going to make a ds salary prop proj folder.", 'start': 177.863, 'duration': 11.063}, {'end': 194.899, 'text': "so that just makes a folder And then we're going to change into that that folder.", 'start': 188.926, 'duration': 5.973}, {'end': 201.102, 'text': "And then we're going to initialize this on Git.", 'start': 194.919, 'duration': 6.183}, {'end': 203.203, 'text': "Well, we're going to initialize this repo.", 'start': 201.362, 'duration': 1.841}, {'end': 209.186, 'text': "So it's using the Git architecture, and then we're going to link it at the end of the video to our GitHub account.", 'start': 203.223, 'duration': 5.963}, {'end': 210.987, 'text': "So we're going to go on GitHub.", 'start': 210.046, 'duration': 0.941}, {'end': 216.909, 'text': "We're going to make a new repo and we're going to call this DS salary.", 'start': 211.007, 'duration': 5.902}], 'summary': 'Analyzing data science job descriptions on linkedin and glassdoor to build a prediction model using web scraping and github.', 'duration': 140.055, 'max_score': 76.854, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/GmW4F6MHqqs/pics/GmW4F6MHqqs76854.jpg'}], 'start': 76.854, 'title': 'Data science scraper and prediction engine', 'summary': 'Discusses challenges in scraping job description data from linkedin and glassdoor, availability of salary data, building a prediction engine using glassdoor estimates, challenges, potential improvements, decision to scrape glassdoor for price and company information, and creating a github repository for a web scraper project, highlighting the use of git bash and linking the repository to the github account.', 'chapters': [{'end': 117.741, 'start': 76.854, 'title': 'Data science job descriptions', 'summary': 'Discusses the challenges of scraping job description data from linkedin and glassdoor, highlighting the availability of salary data on glassdoor.', 'duration': 40.887, 'highlights': ['Glassdoor provides salary data for data science roles, making it a valuable source for scraping tools.', 'LinkedIn job descriptions are readily accessible but may pose challenges for scraping due to embedded JavaScript components.']}, {'end': 159.773, 'start': 118.522, 'title': 'Building glassdoor prediction engine', 'summary': 'Discusses the process of building a prediction engine using glassdoor estimates, highlighting the challenges and potential improvements, as well as the decision to scrape glassdoor for price and company information.', 'duration': 41.251, 'highlights': ['Building a prediction engine using Glassdoor estimates and considering potential future improvements.', 'Decision to scrape Glassdoor for price and company information.']}, {'end': 270.502, 'start': 160.634, 'title': 'Creating github repository for web scraper', 'summary': 
'Discusses the process of creating a github repository and initializing it for a web scraper project, highlighting the use of git bash and linking the repository to the github account, aiming to push the project files to the repository.', 'duration': 109.868, 'highlights': ['The process involves opening Git bash, navigating to the desired folder, creating a new folder for the project, initializing the repository using Git, and linking it to the GitHub account.', 'The created GitHub repository is intended for the web scraper project, with a focus on pushing the project files to the repository at the end of the video.', 'The speaker emphasizes the importance of putting the web scraper project on GitHub and mentions the use of Git bash for the initialization process.']}], 'duration': 193.648, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/GmW4F6MHqqs/pics/GmW4F6MHqqs76854.jpg', 'highlights': ['Glassdoor provides salary data for data science roles, making it a valuable source for scraping tools.', 'LinkedIn job descriptions are readily accessible but may pose challenges for scraping due to embedded JavaScript components.', 'Building a prediction engine using Glassdoor estimates and considering potential future improvements.', 'Decision to scrape Glassdoor for price and company information.', 'The process involves opening Git bash, navigating to the desired folder, creating a new folder for the project, initializing the repository using Git, and linking it to the GitHub account.', 'The created GitHub repository is intended for the web scraper project, with a focus on pushing the project files to the repository at the end of the video.', 'The speaker emphasizes the importance of putting the web scraper project on GitHub and mentions the use of Git bash for the initialization process.']}, {'end': 819.209, 'segs': [{'end': 353.912, 'src': 'embed', 'start': 284.742, 'weight': 0, 'content': [{'end': 292.006, 'text': 'Or you can use Selenium, which is basically a bot that goes through and clicks on elements of the page and copies them into a data frame.', 'start': 284.742, 'duration': 7.264}, {'end': 298.609, 'text': 'So, because you know some of these websites use it looks like some JavaScript.', 'start': 292.546, 'duration': 6.063}, {'end': 305.265, 'text': 'we probably want to go with the Selenium approach and You know you can go and learn all about these packages,', 'start': 298.609, 'duration': 6.656}, {'end': 309.247, 'text': "but I also think it's a lot faster and easier to see if someone has actually done this before.", 'start': 305.265, 'duration': 3.982}, {'end': 316.052, 'text': "So we're going to Google Glassdoor Scraper Selenium.", 'start': 309.808, 'duration': 6.244}, {'end': 318.373, 'text': 'I think I spelled that wrong.', 'start': 316.072, 'duration': 2.301}, {'end': 329.04, 'text': 'But, okay, so the first one is a Glassdoor Scraper in Selenium that you can do in 10 minutes by Homer Sakuria.', 'start': 319.074, 'duration': 9.966}, {'end': 334.164, 'text': 'And this looks like a lot of the data that we would really want right here.', 'start': 329.82, 'duration': 4.344}, {'end': 338.549, 'text': "So let's try, you know, there's no problem in data science.", 'start': 334.324, 'duration': 4.225}, {'end': 342.454, 'text': "When you use someone else's code, you just want to make sure that they're credited for their work.", 'start': 338.589, 'duration': 3.865}, {'end': 348.449, 'text': 'Usually you can fork their code on GitHub and you know, work off of it 
that way.', 'start': 342.994, 'duration': 5.455}, {'end': 350.671, 'text': "That way, you know, you're giving recognition.", 'start': 348.47, 'duration': 2.201}, {'end': 353.912, 'text': "If the code isn't something that makes sense to fork,", 'start': 351.231, 'duration': 2.681}], 'summary': 'Using selenium for web scraping data, finding glassdoor scraper in selenium by homer sakuria for required data.', 'duration': 69.17, 'max_score': 284.742, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/GmW4F6MHqqs/pics/GmW4F6MHqqs284742.jpg'}, {'end': 466.399, 'src': 'heatmap', 'start': 431.266, 'weight': 5, 'content': [{'end': 437.109, 'text': "But again, since he's using an iPython notebook and I will be using a normal Python file,", 'start': 431.266, 'duration': 5.843}, {'end': 442.412, 'text': 'I think it makes a bit more sense for me to just copy it and then credit him after.', 'start': 437.109, 'duration': 5.303}, {'end': 445.834, 'text': "So we're going to open a spider notebook.", 'start': 442.652, 'duration': 3.182}, {'end': 452.398, 'text': "We're going to navigate to our file that we created.", 'start': 447.095, 'duration': 5.303}, {'end': 466.399, 'text': "And then we are going to Let's copy this in there and see if we can actually get it to run.", 'start': 452.418, 'duration': 13.981}], 'summary': 'Copying code from ipython to python file for execution.', 'duration': 35.133, 'max_score': 431.266, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/GmW4F6MHqqs/pics/GmW4F6MHqqs431266.jpg'}, {'end': 710.321, 'src': 'heatmap', 'start': 635.171, 'weight': 0.712, 'content': [{'end': 639.513, 'text': "Okay, now that we've got this copied in, let's see if we can get it to work.", 'start': 635.171, 'duration': 4.342}, {'end': 647.734, 'text': "So there's one thing that we need to fix here and we have to go to the actual path of our Chrome driver, which I've downloaded.", 'start': 639.873, 'duration': 7.861}, {'end': 656.498, 'text': 'You want to check what Google Chrome version you have and get the correct driver for that.', 'start': 648.434, 'duration': 8.064}, {'end': 657.659, 'text': "So we're going to do path here.", 'start': 656.518, 'duration': 1.141}, {'end': 660.661, 'text': "And we're also going to put a path up here.", 'start': 658.479, 'duration': 2.182}, {'end': 664.402, 'text': 'So that way, when we just write this function, we can put our own path in.', 'start': 661.041, 'duration': 3.361}, {'end': 668.524, 'text': 'It looks like this time might be relevant to our own browser.', 'start': 665.683, 'duration': 2.841}, {'end': 673.047, 'text': "So let's just do change this to sleep time.", 'start': 668.805, 'duration': 4.242}, {'end': 681.897, 'text': 'and add that in up here as well.', 'start': 674.855, 'duration': 7.042}, {'end': 685.037, 'text': "so, with that being said, let's try and run this really quickly.", 'start': 681.897, 'duration': 3.14}, {'end': 691.679, 'text': "so we're going to import glassdoor scraper as gs.", 'start': 685.037, 'duration': 6.642}, {'end': 701.201, 'text': "we're also going to import pandas as pd and then we're going to make our data frame equal to gs dot.", 'start': 691.679, 'duration': 9.522}, {'end': 704.376, 'text': 'get jobs.', 'start': 703.055, 'duration': 1.321}, {'end': 710.321, 'text': 'we know that it takes a keyword data scientist, number of jobs.', 'start': 704.376, 'duration': 5.945}], 'summary': 'Configuring chrome driver path and running job scraping function for data scientist 
role.', 'duration': 75.15, 'max_score': 635.171, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/GmW4F6MHqqs/pics/GmW4F6MHqqs635171.jpg'}, {'end': 793.141, 'src': 'heatmap', 'start': 755.147, 'weight': 0.865, 'content': [{'end': 759.528, 'text': 'Okay, so we have our path 9.', 'start': 755.147, 'duration': 4.381}, {'end': 761.929, 'text': 'We want to add our path here.', 'start': 759.528, 'duration': 2.401}, {'end': 766.227, 'text': 'And then we also want the wait time.', 'start': 764.406, 'duration': 1.821}, {'end': 769.59, 'text': "I think let's do a little bit longer wait than what they had.", 'start': 766.368, 'duration': 3.222}, {'end': 771.651, 'text': "I think my internet's acting up a little bit.", 'start': 769.65, 'duration': 2.001}, {'end': 773.333, 'text': "So let's see if this actually works.", 'start': 771.671, 'duration': 1.662}, {'end': 774.333, 'text': "So that's good.", 'start': 773.853, 'duration': 0.48}, {'end': 777.776, 'text': 'The browser is popping out.', 'start': 775.094, 'duration': 2.682}, {'end': 778.777, 'text': "That's part of Selenium.", 'start': 777.876, 'duration': 0.901}, {'end': 785.342, 'text': 'So remember, Selenium takes a browser and it acts just like it were a human going through it.', 'start': 778.977, 'duration': 6.365}, {'end': 793.141, 'text': "So it's part of that kind of 15 second wait that we had.", 'start': 787.703, 'duration': 5.438}], 'summary': 'Using selenium, a 15-second wait time is added to the path 9.', 'duration': 37.994, 'max_score': 755.147, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/GmW4F6MHqqs/pics/GmW4F6MHqqs755147.jpg'}], 'start': 271.163, 'title': 'Web data scraping and code adaptation', 'summary': 'Covers web data scraping using beautiful soup and selenium, recommending the use of selenium due to potential javascript usage. it also emphasizes the importance of crediting and adapting external code in data science, highlighting the use of github and selenium for web scraping.', 'chapters': [{'end': 334.164, 'start': 271.163, 'title': 'Web data scraping methods', 'summary': "Discusses two approaches to scraping data online: beautiful soup, which organizes html and allows pulling elements, and selenium, a bot that clicks on page elements and copies them into a data frame. the recommendation is to use selenium due to potential javascript usage on websites. 
it's also suggested to search for existing solutions, such as a glassdoor scraper in selenium by homer sakuria, which promises to deliver desired data quickly.", 'duration': 63.001, 'highlights': ['The chapter discusses two approaches to scraping data: Beautiful Soup and Selenium, with a recommendation to use Selenium due to potential JavaScript usage on websites.', 'Selenium is described as a bot that clicks on elements of the page and copies them into a data frame.', "It's suggested to search for existing solutions, such as a Glassdoor Scraper in Selenium by Homer Sakuria, which promises to deliver desired data quickly.", 'The Glassdoor Scraper in Selenium by Homer Sakuria is mentioned as a potential solution that can be implemented in 10 minutes.']}, {'end': 819.209, 'start': 334.324, 'title': 'Credit and adaptation of external code', 'summary': 'Discusses the importance of crediting and adapting external code in data science, emphasizing the use of github for forking, adapting code from ipython to a normal python file, and utilizing selenium for web scraping.', 'duration': 484.885, 'highlights': ['The significance of forking code on GitHub to give recognition and adapt external code.', 'Adapting code from iPython to a normal Python file while crediting the original author.', 'Utilizing Selenium for web scraping and addressing issues with pop-up windows during the process.']}], 'duration': 548.046, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/GmW4F6MHqqs/pics/GmW4F6MHqqs271163.jpg', 'highlights': ['Selenium is recommended due to potential JavaScript usage on websites.', 'Selenium is described as a bot that clicks on elements of the page and copies them into a data frame.', "It's suggested to search for existing solutions, such as a Glassdoor Scraper in Selenium by Homer Sakuria, which promises to deliver desired data quickly.", 'The Glassdoor Scraper in Selenium by Homer Sakuria is mentioned as a potential solution that can be implemented in 10 minutes.', 'The significance of forking code on GitHub to give recognition and adapt external code.', 'Adapting code from iPython to a normal Python file while crediting the original author.', 'Utilizing Selenium for web scraping and addressing issues with pop-up windows during the process.']}, {'end': 1167.966, 'segs': [{'end': 898.023, 'src': 'embed', 'start': 859.407, 'weight': 0, 'content': [{'end': 865.252, 'text': "Again, this is a pretty iterative process, but it's still way faster than writing this code base from scratch.", 'start': 859.407, 'duration': 5.845}, {'end': 872.848, 'text': "And you know, in the scheme of things, for a data scientist, the portion that you're actually doing where it's data collection,", 'start': 866.039, 'duration': 6.809}, {'end': 874.45, 'text': 'is a lot smaller than the actual analysis.', 'start': 872.848, 'duration': 1.602}, {'end': 881.86, 'text': 'So people are going to care a little bit less about how you actually collect the data than if you were a software engineer or something like that.', 'start': 875.031, 'duration': 6.829}, {'end': 886.598, 'text': 'again the page loaded, which is what we want.', 'start': 883.797, 'duration': 2.801}, {'end': 892.181, 'text': 'now we have to see if it is going to actually start scraping for us.', 'start': 886.598, 'duration': 5.583}, {'end': 896.182, 'text': "so we got this, and let's go up and see the error message here.", 'start': 892.181, 'duration': 4.001}, {'end': 898.023, 'text': 'so the x out failed.', 'start': 896.182, 
'duration': 1.841}], 'summary': 'Iterative process is faster than coding from scratch. data collection is smaller part for data scientists.', 'duration': 38.616, 'max_score': 859.407, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/GmW4F6MHqqs/pics/GmW4F6MHqqs859407.jpg'}, {'end': 1100.277, 'src': 'embed', 'start': 1078.648, 'weight': 3, 'content': [{'end': 1087.13, 'text': "So let's just go through and, you know, make it so that it's not just San Francisco tech companies, um or mostly California tech companies.", 'start': 1078.648, 'duration': 8.482}, {'end': 1088.21, 'text': "let's do across the U S.", 'start': 1087.13, 'duration': 1.08}, {'end': 1094.512, 'text': "And let's also, you know, make sure that we're getting the correct salary estimate in.", 'start': 1089.168, 'duration': 5.344}, {'end': 1100.277, 'text': "Okay, so let's go to kind of an old page I have where we were looking, inspecting some elements.", 'start': 1095.273, 'duration': 5.004}], 'summary': 'Expanding tech company data to include us-wide and accurate salary estimates.', 'duration': 21.629, 'max_score': 1078.648, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/GmW4F6MHqqs/pics/GmW4F6MHqqs1078648.jpg'}], 'start': 819.889, 'title': 'Troubleshooting code and data analysis', 'summary': 'Discusses troubleshooting code, emphasizing the iterative process and the efficiency compared to writing code from scratch. it also highlights the comparatively smaller portion of time spent on data collection for a data scientist.', 'chapters': [{'end': 874.45, 'start': 819.889, 'title': 'Troubleshooting code and data analysis', 'summary': 'Discusses troubleshooting code, emphasizing the iterative process and the efficiency compared to writing code from scratch. it also highlights the comparatively smaller portion of time spent on data collection for a data scientist.', 'duration': 54.561, 'highlights': ['The iterative process of troubleshooting code is emphasized as being faster than writing code from scratch.', 'Data scientists spend a smaller portion of time on data collection compared to the actual analysis.']}, {'end': 1167.966, 'start': 875.031, 'title': 'Troubleshooting web scraping code', 'summary': 'Explains the process of troubleshooting web scraping code, including issues with element selection and data scraping, as well as the need to expand the search beyond san francisco and fix salary estimate extraction.', 'duration': 292.935, 'highlights': ['The process of troubleshooting web scraping code, including issues with element selection and data scraping. The speaker discusses encountering trouble with element selection and the slow start of the scraping process, emphasizing the need to refine the code for faster scraping and error-free iteration.', 'The need to expand the search beyond San Francisco and fix salary estimate extraction. 
The speaker realizes the need to expand the search beyond San Francisco and ensure accurate extraction of salary estimates, emphasizing the importance of salary data for their use case and the relevance of location and company size.']}], 'duration': 348.077, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/GmW4F6MHqqs/pics/GmW4F6MHqqs819889.jpg', 'highlights': ['The iterative process of troubleshooting code is emphasized as being faster than writing code from scratch.', 'Data scientists spend a smaller portion of time on data collection compared to the actual analysis.', 'The process of troubleshooting web scraping code, including issues with element selection and data scraping.', 'The need to expand the search beyond San Francisco and fix salary estimate extraction.']}, {'end': 1658.025, 'segs': [{'end': 1469.833, 'src': 'embed', 'start': 1441.321, 'weight': 0, 'content': [{'end': 1446.302, 'text': "I'm not going to put you guys through just watching me load this data in.", 'start': 1441.321, 'duration': 4.981}, {'end': 1449.962, 'text': "And I'm going to try and get around a thousand records for us to clean up and analyze.", 'start': 1446.382, 'duration': 3.58}, {'end': 1454.443, 'text': "So after this is done, I'll actually, again, show you what that data looks like.", 'start': 1450.603, 'duration': 3.84}, {'end': 1458.524, 'text': 'We want to double check that we have that salary and locational data.', 'start': 1454.863, 'duration': 3.661}, {'end': 1460.725, 'text': "And let's see how we do.", 'start': 1459.424, 'duration': 1.301}, {'end': 1462.425, 'text': 'So we fix the salary stuff.', 'start': 1461.045, 'duration': 1.38}, {'end': 1469.833, 'text': 'And we also have companies from what it looks like is all over the place, which is really good.', 'start': 1464.469, 'duration': 5.364}], 'summary': 'Loading around a thousand records for data cleanup and analysis, checking salary and location data, and fixing salary issues.', 'duration': 28.512, 'max_score': 1441.321, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/GmW4F6MHqqs/pics/GmW4F6MHqqs1441321.jpg'}, {'end': 1522.53, 'src': 'embed', 'start': 1488.507, 'weight': 2, 'content': [{'end': 1492.851, 'text': 'And we can explore across these variables and eventually start building this model.', 'start': 1488.507, 'duration': 4.344}, {'end': 1494.65, 'text': 'So I hope that was informative.', 'start': 1493.629, 'duration': 1.021}, {'end': 1500.073, 'text': 'You obviously saw how I went about going through and getting this data and it is a pretty messy process.', 'start': 1495.05, 'duration': 5.023}, {'end': 1501.654, 'text': "So that's something you should prepare for.", 'start': 1500.273, 'duration': 1.381}, {'end': 1506.236, 'text': 'And the last thing we want to do is to upload this to our GitHub repo.', 'start': 1502.194, 'duration': 4.042}, {'end': 1512.48, 'text': "So we want to open the get bash and then we're going to, oops, that's not what we want.", 'start': 1507.857, 'duration': 4.623}, {'end': 1522.53, 'text': 'this, and so this shows you how to actually get these things onto your repos.', 'start': 1515.728, 'duration': 6.802}], 'summary': 'Exploring variables, building model, messy data prep, and github upload discussed.', 'duration': 34.023, 'max_score': 1488.507, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/GmW4F6MHqqs/pics/GmW4F6MHqqs1488507.jpg'}, {'end': 1653.741, 'src': 'embed', 'start': 1626.066, 'weight': 1, 'content': [{'end': 
1632.652, 'text': "So let's just go back to this and now you can see our code.", 'start': 1626.066, 'duration': 6.586}, {'end': 1637.957, 'text': 'So I have the Chrome driver that I use for mine, the data collection, the Glassdoor scraper.', 'start': 1632.672, 'duration': 5.285}, {'end': 1642.432, 'text': "And next time you run this, we'll pull that code and we'll start working on it again.", 'start': 1638.969, 'duration': 3.463}, {'end': 1648.777, 'text': 'Please stay tuned for the next step where we go through and we actually clean up the data.', 'start': 1642.732, 'duration': 6.045}, {'end': 1652.34, 'text': 'We clean up how all of that works.', 'start': 1648.837, 'duration': 3.503}, {'end': 1653.741, 'text': 'We make it numeric, et cetera.', 'start': 1652.36, 'duration': 1.381}], 'summary': 'Developing code to collect and clean data for glassdoor scraper.', 'duration': 27.675, 'max_score': 1626.066, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/GmW4F6MHqqs/pics/GmW4F6MHqqs1626066.jpg'}], 'start': 1168.886, 'title': 'Glassdoor data scraping process', 'summary': 'Details the process of scraping data from glassdoor, including modifying the url, troubleshooting errors, and preparing to upload the code to github, aiming to collect around a thousand records for analysis.', 'chapters': [{'end': 1658.025, 'start': 1168.886, 'title': 'Glassdoor data scraping process', 'summary': 'Details the process of scraping data from glassdoor, including modifying the url, troubleshooting errors, and preparing to upload the code to github, aiming to collect around a thousand records for analysis.', 'duration': 489.139, 'highlights': ['The process involves modifying the URL, adding and adjusting the keyword, and iterating through the data to ensure collection of salary and locational data, aiming to collect around a thousand records for analysis.', 'The individual encounters several errors and troubleshooting steps, including clearing the cache and testing different methods, to ensure successful data scraping and minimize errors.', 'The final steps involve preparing to upload the code to GitHub, including adding, committing, and pushing the code to the repository, with an emphasis on the importance of preparing and cleaning up the data for analysis.']}], 'duration': 489.139, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/GmW4F6MHqqs/pics/GmW4F6MHqqs1168886.jpg', 'highlights': ['The process involves modifying the URL, adding and adjusting the keyword, and iterating through the data to ensure collection of salary and locational data, aiming to collect around a thousand records for analysis.', 'The individual encounters several errors and troubleshooting steps, including clearing the cache and testing different methods, to ensure successful data scraping and minimize errors.', 'The final steps involve preparing to upload the code to GitHub, including adding, committing, and pushing the code to the repository, with an emphasis on the importance of preparing and cleaning up the data for analysis.']}], 'highlights': ['The process involves modifying the URL, adding and adjusting the keyword, and iterating through the data to ensure collection of salary and locational data, aiming to collect around a thousand records for analysis.', 'The individual encounters several errors and troubleshooting steps, including clearing the cache and testing different methods, to ensure successful data scraping and minimize errors.', 'The final steps involve preparing to 
upload the code to GitHub, including adding, committing, and pushing the code to the repository, with an emphasis on the importance of preparing and cleaning up the data for analysis.']}], 'duration': 489.139, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/GmW4F6MHqqs/pics/GmW4F6MHqqs1168886.jpg', 'highlights': ['The process involves modifying the URL, adding and adjusting the keyword, and iterating through the data to ensure collection of salary and locational data, aiming to collect around a thousand records for analysis.', 'The individual encounters several errors and troubleshooting steps, including clearing the cache and testing different methods, to ensure successful data scraping and minimize errors.', 'The final steps involve preparing to upload the code to GitHub, including adding, committing, and pushing the code to the repository, with an emphasis on the importance of preparing and cleaning up the data for analysis.']}]}