Web Scraping Linkedin Python



'Come on, I worked so hard on this project! And this is publicly accessible data! There's certainly a way around this, right? Or else, I did all of this for nothing... Sigh...'

Yep - this is what I said to myself, just after realizing that my ambitious data analysis project could get me into hot water. I intended to deploy a large-scale web crawler to collect data from multiple high profile websites. And then I was planning to publish the results of my analysis for the benefit of everybody. Pretty noble, right? Yes, but also pretty risky.

Interestingly, I've been seeing more and more projects like mine lately. And even more tutorials encouraging some form of web scraping or crawling. But what troubles me is the appalling, widespread ignorance of its legal aspects.

So this is what this post is all about - understanding the possible consequences of web scraping and crawling. Hopefully, this will help you to avoid any potential problem.

Disclaimer: I'm not a lawyer. I'm simply a programmer who happens to be interested in this topic. You should seek out appropriate professional advice regarding your specific situation.

What are web scraping and crawling?

Let's first define these terms to make sure that we're on the same page.

  1. Web scraping: the act of automatically downloading a web page's data and extracting very specific information from it. The extracted information can be stored pretty much anywhere (database, file, etc.).
  2. Web crawling: the act of automatically downloading a web page's data, extracting the hyperlinks it contains and following them. The downloaded data is generally stored in an index or a database to make it easily searchable.

For example, you may use a web scraper to extract weather forecast data from the National Weather Service. This would allow you to further analyze it.

In contrast, you may use a web crawler to download data from a broad range of websites and build a search engine. Maybe you've already heard of Googlebot, Google's own web crawler.

So web scrapers and crawlers are generally used for entirely different purposes.
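To make the difference concrete, here's a minimal sketch using only Python's standard library. It's an illustration, not a production tool: the sample page, the class names and the helper functions are all made up for this example, and the HTML is a hard-coded string, so nothing is downloaded. The scraper pulls out one specific piece of information (the page title), while the crawler-style collector gathers the hyperlinks that a crawler would follow next.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class TitleScraper(HTMLParser):
    """Scraping: extract one specific piece of information (the <title>)."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


class LinkCollector(HTMLParser):
    """Crawling: collect the hyperlinks of a page so they can be followed."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, href))


# Hypothetical page content; a real program would download it first.
SAMPLE_HTML = """
<html><head><title>Forecast for Today</title></head>
<body><a href="/tomorrow">Tomorrow</a> <a href="https://example.org/about">About</a></body>
</html>
"""


def scrape_title(html):
    parser = TitleScraper()
    parser.feed(html)
    return parser.title.strip()


def collect_links(html, base_url):
    parser = LinkCollector(base_url)
    parser.feed(html)
    return parser.links


if __name__ == "__main__":
    print(scrape_title(SAMPLE_HTML))
    print(collect_links(SAMPLE_HTML, "https://example.com/today"))
```

A real crawler would repeat the second step on every collected link, which is exactly why it can put so much more load on a website than a scraper that targets a single page.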

Why is web scraping often seen negatively?

The reputation of web scraping has gotten a lot worse in the past few years, and for good reasons:

  1. It's increasingly being used for business purposes to gain a competitive advantage. So there's often a financial motive behind it.
  2. It's often done in complete disregard of copyright laws and of Terms of Service (ToS).
  3. It's often done in abusive ways. For example, web scrapers might send many more requests per second than a human would, thus causing an unexpected load on websites. They might also choose to stay anonymous and not identify themselves. Finally, they might also perform prohibited operations on websites, like circumventing security measures in order to automatically download data that would otherwise be inaccessible.

Tons of individuals and companies are running their own web scrapers right now. So much so that this has been causing headaches for companies whose websites are scraped, like social networks (e.g. Facebook, LinkedIn, etc.) and online stores (e.g. Amazon). This is probably why Facebook has separate terms for automated data collection.

In contrast, web crawling has historically been used by the well-known search engines (e.g. Google, Bing, etc.) to download and index the web. These companies have built a good reputation over the years, because they've built indispensable tools that add value to the websites they crawl. So web crawling is generally seen more favorably, although it may sometimes be used in abusive ways as well.

So is it legal or illegal?

Web scraping and crawling aren't illegal by themselves. After all, you could scrape or crawl your own website, without a hitch.

The problem arises when you scrape or crawl the website of somebody else, without obtaining their prior written permission, or in disregard of their Terms of Service (ToS). You're essentially putting yourself in a vulnerable position.

Just think about it; you're using the bandwidth of somebody else, and you're freely retrieving and using their data. It's reasonable to think that they might not like it, because what you're doing might hurt them in some way. So depending on many factors (and what mood they're in), they're perfectly free to pursue legal action against you.

I know what you may be thinking. 'Come on! This is ridiculous! Why would they sue me?'. Sure, they might just ignore you. Or they might simply use technical measures to block you. Or they might just send you a cease and desist letter. But technically, there's nothing that prevents them from suing you. This is the real problem.

Need proof? In LinkedIn v. Doe Defendants, LinkedIn is suing between 1 and 100 people who anonymously scraped their website. And for what reasons are they suing those people? Let's see:

  1. Violation of the Computer Fraud and Abuse Act (CFAA).
  2. Violation of California Penal Code.
  3. Violation of the Digital Millennium Copyright Act (DMCA).
  4. Breach of contract.
  5. Trespass.
  6. Misappropriation.

That lawsuit is pretty concerning, because it's really not clear what will happen to those 'anonymous' people.

Consider that if you ever get sued, you can't simply dismiss it. You need to defend yourself, and prove that you did nothing wrong. This has nothing to do with whether or not it's fair, or whether or not what you did is really illegal.

Another problem is that law isn't like anything you're probably used to. Where you use logic, common sense and your technical expertise, they'll use legal jargon and some grey areas of law to prove that you did something wrong. This isn't a level playing field. And it certainly isn't a good situation to be in. So you'll need to get a lawyer, and this might cost you a lot of money.

Besides, based on the above lawsuit by LinkedIn, you can see that cases can undoubtedly become quite complex and very broad in scope, even though you 'just scraped a website'.

The typical counterarguments brought by people

I found that people generally try to defend their web scraping or crawling activities by downplaying their importance. And they do so typically by using the same arguments over and over again.

So let's review the most common ones:

  1. 'I can do whatever I want with publicly accessible data.'

    False. The problem is that the 'creative arrangement' of data can be copyrighted, as described on cendi.gov:

    Facts cannot be copyrighted. However, the creative selection, coordination and arrangement of information and materials forming a database or compilation may be protected by copyright. Note, however, that the copyright protection only extends to the creative aspect, not to the facts contained in the database or compilation.

    So a website - including its pages, design, layout and database - can be copyrighted, because it's considered a creative work. And if you scrape that website to extract data from it, the simple fact of copying a web page in memory with your web scraper might be considered a copyright violation.

    In the United States, copyrighted work is protected by copyright law, including the Digital Millennium Copyright Act (DMCA).

  2. 'This is fair use!'

    This is a grey area:

    • In Kelly v. Arriba Soft Corp., the court found that the image search engine Ditto.com made fair use of a professional photographer's pictures by displaying thumbnails of them.
    • In Associated Press v. Meltwater U.S. Holdings, Inc., the court found that Meltwater's news aggregator service didn't make fair use of Associated Press' articles, even though scraped articles were only displayed as excerpts of the originals.
  3. 'It's the same as what my browser already does! Scraping a site is not technically different from using a web browser. I could gather data manually, anyway!'

    False. Terms of Service (ToS) often contain clauses that prohibit crawling/scraping/harvesting and automated uses of their associated services. You're legally bound by those terms; it doesn't matter that you could get that data manually.

  4. 'The worst that might happen if I break their Terms of Service is that I might get banned or blocked.'

    This is a grey area:

    • In Facebook v. Pete Warden, Facebook's attorney threatened to sue Mr. Warden if he published his dataset of hundreds of millions of scraped Facebook profiles.
    • In LinkedIn Corporation v. Michael George Keating, LinkedIn blocked Mr. Keating from accessing LinkedIn because he had created a tool that they thought was made to scrape their website. They were wrong. Yet he has never been able to restore his account. Fortunately, this case didn't go further.
    • In LinkedIn Corporation v. Robocog Inc, Robocog Inc. (a.k.a. HiringSolved) was ordered to pay $40,000 to LinkedIn for their unauthorized scraping of the site.
  5. 'This is completely unfair! Google has been crawling/scraping the whole web since forever!'

    True. But law apparently has nothing to do with fairness. It's based on rules, interpreted by people.

  6. 'If I ever get sued, I'll Good-Will-Hunting my way into defending myself.'

    Good luck with that, unless you know the law and legal jargon extensively. Personally, I don't.

  7. 'But I used an automated script, so I didn't enter into any contract with the website.'

    This is a grey area:

    • In Internet Archive v. Suzanne Shell, the court let Mrs. Shell's breach of contract claim against Internet Archive go forward after its web crawlers copied and archived pages from her website. On her website, Mrs. Shell displays a warning stating that as soon as you copy content from her website, you enter into a contract, and you owe her US$5,000 per page copied (!!!). The two parties apparently reached an amicable resolution.
    • In Southwest Airlines Co. v. BoardFirst, LLC, BoardFirst was found to have violated a browsewrap contract displayed on Southwest Airlines' website. BoardFirst had created a tool that automatically downloaded the boarding passes of Southwest's customers to offer them better seats.
  8. 'Terms of Service (ToS) are not enforceable anyway. They have no legal value.'

    False. The Bingham McCutchen LLP law firm published a pretty extensive article on this matter and they state that:

    As is the general rule with any contract, a website's terms of use will generally be deemed enforceable if mutually agreed to by the parties. [...] Regardless of whether a website's terms of use are clickwrap or browsewrap, the defendant's failure to read those terms is generally found irrelevant to the enforceability of its terms. One court disregarded arguments that awareness of a website's terms of use could not be imputed to a party who accessed that website using a web crawling or scraping tool that is unable to detect, let alone agree, to such terms. Similarly, one court imputed knowledge of a website's terms of use to a defendant who had repeatedly accessed that website using such tools. Nevertheless, these cases are, again, intensely factually driven, and courts have also declined to enforce terms of use where a plaintiff has failed to sufficiently establish that the defendant knew or should have known of those terms (e.g., because the terms are inconspicuous), even where the defendant repeatedly accessed a website using web crawling and scraping tools.

    In other words, Terms of Service (ToS) may be legally enforced, depending on the court and on whether there's sufficient proof that you were aware of them.

  9. 'I respected their robots.txt and I crawled at a reasonable speed, so I can't possibly get into trouble, right?'

    This is a grey area.

    robots.txt is recognized as a 'technological tool to deter unwanted crawling or scraping'. But whether or not you respect it, you're still bound by the Terms of Service (ToS).

  10. 'Okay, but this is for personal use. For my personal research only. I won't re-publish it, or publish any derivative dataset, or even sell it. So I'm good to go, right?'

    This is a grey area. Terms of Service (ToS) often prohibit automatic data collection, for any purpose.

    According to the Bingham McCutchen LLP law firm:

    The terms of use for websites frequently include clauses prohibiting access or use of the website by web crawlers, scrapers or other robots, including for purposes of data collection. Courts have recognized causes of action for breaches of contract based on the use of web crawling or scraping tools in violation of such provisions.

  11. 'But the website has no robots.txt. So I can do what I want, right?'

    False. You're still bound by the Terms of Service (ToS), and the content is copyrighted.

General advice for your scraping or crawling projects

Based on the above, you can certainly guess that you should be extra cautious with web scraping and crawling.

Here are a few pieces of advice:

  1. Use an API if one is provided, instead of scraping data.
  2. Respect the Terms of Service (ToS).
  3. Respect the rules of robots.txt.
  4. Use a reasonable crawl rate, i.e. don't bombard the site with requests. Respect the crawl-delay setting provided in robots.txt; if there's none, use a conservative crawl rate (e.g. 1 request per 10-15 seconds).
  5. Identify your web scraper or crawler with a legitimate user agent string. Create a page that explains what you're doing and why, and link back to the page in your user agent string (e.g. 'MY-BOT (+https://yoursite.com/mybot.html)').
  6. If ToS or robots.txt prevent you from crawling or scraping, ask the owner of the site for written permission, prior to doing anything else.
  7. Don't republish your crawled or scraped data or any derivative dataset without verifying the license of the data, or without obtaining written permission from the copyright holder.
  8. If you doubt the legality of what you're doing, don't do it. Or seek the advice of a lawyer.
  9. Don't base your whole business on data scraping. The website(s) that you scrape may eventually block you, just like what happened in Craigslist Inc. v. 3Taps Inc.
  10. Finally, you should be suspicious of any advice that you find on the internet (including mine), so please consult a lawyer.
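The technical parts of this advice (robots.txt, crawl rate and user agent) can be sketched with Python's standard library. This is only a minimal sketch under a few assumptions: the function names are hypothetical, the user agent string is the placeholder from the list above, and the 10-second default delay is simply the low end of the conservative range suggested there. Keep in mind that it does nothing about the Terms of Service; that part still requires reading them.

```python
import time
import urllib.request
import urllib.robotparser

# Placeholder user agent from the advice above; replace it with your own page.
USER_AGENT = "MY-BOT (+https://yoursite.com/mybot.html)"

# Assumed conservative default when robots.txt has no crawl-delay setting.
DEFAULT_DELAY = 10.0


def build_rules(robots_txt):
    """Parse the text of a robots.txt file into a rule set."""
    rules = urllib.robotparser.RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules


def crawl_delay_for(rules, user_agent=USER_AGENT):
    """Return the crawl-delay from robots.txt, or the conservative default."""
    delay = rules.crawl_delay(user_agent)
    return float(delay) if delay is not None else DEFAULT_DELAY


def polite_fetch(url, rules, user_agent=USER_AGENT):
    """Fetch a URL only if robots.txt allows it, then wait before returning."""
    if not rules.can_fetch(user_agent, url):
        return None  # Disallowed: don't even send the request.
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request, timeout=30) as response:
        body = response.read()
    # Sleep after each request so that consecutive calls are spaced out.
    time.sleep(crawl_delay_for(rules, user_agent))
    return body
```

For example, with a robots.txt containing `User-agent: *`, `Crawl-delay: 5` and `Disallow: /private/`, `polite_fetch` would refuse any URL under `/private/` and would wait 5 seconds after each allowed request.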

Remember that companies and individuals are perfectly free to sue you, for whatever reasons they want. This is most likely not the first step that they'll take. But if you scrape/crawl their website without permission and you do something that they don't like, you definitely put yourself in a vulnerable position.

Conclusion

As we've seen in this post, web scraping and crawling aren't illegal by themselves. They might become problematic when you play on somebody else's turf, on your own terms, without obtaining their prior permission. The same is true in real life as well, when you think about it.

There are a lot of grey areas in law around this topic, so the outcome is pretty unpredictable. Before getting into trouble, make sure that what you're doing respects the rules.

And finally, the relevant question isn't 'Is this legal?'. Instead, you should ask yourself 'Am I doing something that might upset someone? And am I willing to take the (financial) risk of their response?'.

So I hope that you appreciated my post! Feel free to leave a comment in the comment section below!

Update (24/04/2017): this post was featured on Reddit and Lobsters. It was also featured in the Programming Digest newsletter. If you get a chance to subscribe to it, you won't be disappointed! Thanks to everyone for your support and your great feedback!

Disclaimer:
This article does not, by any means, constitute legal advice. Views expressed are purely opinion and are not legally binding. Consult with your local legal counsel for advice specific to your project, country or application.

In our previous article, Is LinkedIn Scraping Legal?, we discussed two influential cases of legal action taken against web scrapers: LinkedIn v HiQ and Facebook v Power Ventures. Both involved tech giants taking small companies to court for scraping data from their websites, claiming violations of the Computer Fraud and Abuse Act (CFAA), a law designed to prevent hackers from accessing a computer system without “authorisation”.

The outcomes, however, were starkly different: the district court found Power liable on all claims and ordered it to pay almost $3 million in damages, whereas HiQ was granted a preliminary injunction that allowed the analytics company to continue to collect LinkedIn’s public data.

For those unaware, web scraping is the act of extracting desired information and data from websites in order to make data mining more efficient and systematic. For small companies like Power Ventures and HiQ, web scraping is integral to their survival; their business model revolves around taking data and processing it for their consumers’ needs. But some website owners find web scraping ultimately harmful to their Internet presence. It could be because scrapers are infringing on copyrights and trademarks, or because they slow down the servers, hence negatively impacting their revenue streams.

Whatever the reason, it is inevitable that this issue will be brought to court. The two cases mentioned highlight the legal repercussions of bulk scraping and how it is seen in the eyes of the law. But LinkedIn seems to be unhappy with the outcome, so they have decided to escalate this matter further.

What has happened since then?

In March 2020, LinkedIn filed a petition for a writ of certiorari with the Supreme Court to challenge the Ninth Circuit's decision, which held that HiQ's scraping of LinkedIn member profiles without LinkedIn’s permission did not violate federal hacking laws.

LinkedIn put forth several arguments as to why the Court should grant its petition. LinkedIn argues that there is a circuit split regarding the interpretation of the CFAA’s “unauthorised access” provision with respect to scraping. Although the Ninth Circuit has stated that using automated means to scrape a public site does not violate the CFAA, the First Circuit may rule that acts of scraping can be seen as a CFAA violation if they are in direct breach of the website’s terms of use against such acts. LinkedIn stresses that the Internet recognises no borders and that inconsistency in how the CFAA is applied is unjustified.

LinkedIn also raises some concerns about data privacy. Users might be willing to share their personal information and have some control over how much of their data is shared on a site such as LinkedIn, but they wouldn’t expect such data to be used by third parties without their knowledge. It posits that under the Ninth Circuit’s ruling, third-party scrapers can use users’ data in any way they like, against the wishes of the users.

In June 2020, HiQ filed a brief urging the Supreme Court to deny LinkedIn’s petition. It opposes many of the arguments given by LinkedIn. Regarding the circuit split, HiQ states that the First Circuit decision was made in an earlier stage of the Internet and did not interpret the same issue of unauthorised access as LinkedIn suggested. It also argues that the privacy concern is unfounded: HiQ claims that LinkedIn previously allowed it to scrape data and prohibited the practice only after HiQ did some of its own processing on that data.

What does this all mean for web scraping?

Having a higher court weigh in on the issue will certainly be huge. It will potentially provide some level of certainty regarding the relationship between web scraping and the CFAA. With the legal implications being murky at the moment, it is still too early to say that web scrapers have it safe. Generally, it is important to be cognisant of how and where the data is being scraped. Whether that data is considered “public data” is for the court to decide.