Summary of Troy Hunt: Hackers, Scrapers & Fakers: What's Really Inside the Latest LinkedIn Dataset


One Line

The flagged LinkedIn dataset consisted of public profiles and fake email addresses, but blaming LinkedIn is not the appropriate response.

Slides

Slide Presentation (8 slides)


Hackers, Scrapers & Fakers: What's Really Inside the Latest LinkedIn Dataset

Source: www.troyhunt.com - html - 2,127 words

Introduction

• Investigating data breaches as a search for truth

• Dataset titled "Linkedin Database 2023 2.5 Millions"

• Combination of publicly available LinkedIn profile data and fabricated email addresses

Dataset Overview

• Over 5.8 million email addresses in the dataset

• Email addresses constructed from first and last names

• Mix of legitimate information and fabricated data

Fabricated Email Addresses

• Pattern of using alias "[first name].[last name]@" on unrelated domains

• Common format, but not universal

• Example of real company email addresses with different formats

Legitimate Data in the Dataset

• Significant component of legitimate data

• Real people, companies, and domains

• Over 1.8k HIBP subscribers in the dataset

Sources of the Data

• Dataset appears to be an aggregation of multiple sources

• Column headings suggest data from LinkedIn, Salesforce, Spendesk, and Hubspot

• Hope to identify the source through public recognition

Action Taken with the Dataset

• Loaded into HIBP as a spam list

• Flagged as a spam list, so its addresses don't count towards the size of paid domain-monitoring subscriptions

• Majority of people want to know about incidents and make their own decisions

Key Takeaways

• Dataset is a combination of public LinkedIn profiles and fabricated email addresses

• Contains over 5.8 million email addresses, including both legitimate and fake ones

• Significant amount of legitimate data, including real people, companies, and domains

• Important to stay informed and make individual decisions regarding data breaches

Key Points

The dataset titled "Linkedin Database 2023 2.5 Millions" is a combination of publicly available LinkedIn profile data and fabricated email addresses.
The dataset contains over 5.8 million email addresses, most of which are constructed from a combination of first and last names.
The data appears to be a mix of legitimate information sourced from public LinkedIn profiles, fabricated email addresses, and potentially other sources.
The fabricated email addresses follow a pattern of using the alias "[first name].[last name]@" on unrelated domains.
Despite the presence of fabricated email addresses, there is a significant component of legitimate data in the dataset, including real people, companies, and domains.

Summaries

19 word summary

LinkedIn dataset flagged as potential breach contained publicly available profiles and fake email addresses, but blaming LinkedIn is misguided.

70 word summary

The latest LinkedIn dataset flagged as a potential data breach contained a mix of publicly available profile data and fake email addresses. Investigation showed a pattern of fabricated email addresses with the same alias on unrelated domains. The dataset was included in Have I Been Pwned to inform individuals, but the fabricated email addresses were flagged as spam. Blaming LinkedIn is misguided; evidence-based analysis is crucial to avoid spreading disinformation.

146 word summary

The latest LinkedIn dataset flagged as a potential data breach was found to contain a mix of publicly available profile data and fake email addresses. Investigation revealed that the dataset included a pattern of email addresses with the same alias on unrelated domains, indicating that much of the data was fabricated. The dataset included information from public LinkedIn profiles, fabricated email addresses, and potentially other sources. It was loaded into Have I Been Pwned (HIBP), but the fabricated email addresses were flagged as spam to protect paid subscriptions. The decision to include the dataset in HIBP was made to inform individuals about potential breaches. The investigation concluded that while there is legitimate data in the dataset, there are also a significant number of fabricated email addresses. Blaming LinkedIn for the incident is misguided. It's important to approach data breaches with evidence-based analysis and avoid spreading disinformation.

410 word summary

The latest LinkedIn dataset that many people flagged as a potential data breach turned out to be a combination of publicly available LinkedIn profile data and fabricated email addresses. While LinkedIn has been breached in the past and had data scraped, this dataset seemed suspicious. Upon investigation, it was discovered that the dataset contained a pattern of email addresses with the same alias on unrelated domains, indicating that much of the data was fake.

The dataset contained information sourced from public LinkedIn profiles, fabricated email addresses, and potentially other sources. The email addresses were constructed by taking the domain of a company where the individual worked and creating an alias from their name. The companies and domains are legitimate, and the email addresses themselves are often real. The dataset was loaded into Have I Been Pwned (HIBP), but the fabricated email addresses were flagged as a spam list to prevent them from impacting paid subscriptions.

The decision to include the dataset in HIBP was made because people want to know about potential breaches and make their own decisions about what actions to take. Even if an email address on a domain doesn't actually exist, it's important for individuals to know that their personal data has been dumped in this corpus. Disinformation and misinformation should be avoided, as they can lead to false accusations and blame being placed on the wrong entities.

Overall, the investigation concluded that there is a significant component of legitimate data in the dataset. However, there are also a significant number of fabricated email addresses. The inclusion of the dataset in HIBP allows individuals to be aware of potential risks while taking into account the presence of fabricated data. It's important to note that blaming LinkedIn for this incident is misguided, as the evidence points to a different conclusion.

The author's dedication to uncovering the truth and combating disinformation led to this investigation. The goal was to provide accurate information and prevent false statements from circulating. It's crucial to approach data breaches with evidence-based analysis and avoid jumping to conclusions based on inaccurate information.

In conclusion, the latest LinkedIn dataset turned out to be a combination of publicly available data, fabricated email addresses, and potentially data from other sources. The presence of legitimate data in the dataset warranted its inclusion in HIBP, with the fabricated email addresses being flagged as a spam list. It's important to approach data breaches with caution, verify information, and avoid spreading disinformation.

Raw indexed text (14,829 chars / 2,127 words / 140 lines)

Troy Hunt: Hackers, Scrapers & Fakers: What's Really Inside the Latest LinkedIn Dataset


Hackers, Scrapers & Fakers: What's Really Inside the Latest LinkedIn Dataset

07 November 2023

I like to think of investigating data breaches as a sort of scientific search for truth. You start out with a theory (a set of data coming from an alleged source), but you don't have a vested interest in whether the claim is true or not; rather, you follow the evidence and see where it leads. Verification that supports the alleged source is usually quite straightforward, but disproving a claim can be a rather time-consuming exercise, especially when a dataset contains fragments of truth mixed in with data that is anything but. Which is what we have here today.

To lead with the conclusion and save you reading all the details if you're not inclined, the dataset so many people flagged to me this week titled "Linkedin Database 2023 2.5 Millions" turned out to be a combination of publicly available LinkedIn profile data and 5.8M email addresses mostly fabricated from a combination of first and last name. It all began with this tweet:

All good lies are believable at face value; is it feasible a massive corpus of LinkedIn data is floating around? Well, they were proper breached in 2012 to the tune of 164M records (by which I mean that incident was genuinely internal data such as email addresses and passwords extracted out by a vulnerability), then they were massively scraped in 2021 with another 126M records going into Have I Been Pwned (HIBP). So, when you see a claim like the one above, it seems highly feasible at face value, which is what many people take it at. But I'm a bit more suspicious than most people.

First, the claim:

This one is similar to my twitter data scrapped [sic] but for linkedin plus 2023

Now, there's a whole debate about whether scraped data is breached data and indeed whether the definition of it even matters. With the rising prevalence of scraped data, this topic came up enough that I wrote a dedicated blog post about it a couple of years ago and concluded the following in terms of how we should define the term "breach":

A data breach occurs when information is obtained by an unauthorised party in a fashion in which it was not intended to be made available

Which makes scrapes like this alleged one a breach. If indeed it was accurate, LinkedIn data had been taken and redistributed in a way it was never intended to be by either the service itself or the individuals whose data was in this corpus. So, it's something to take seriously, and that warranted further investigation.

I scrolled through the 10M+ rows of data (many records spanned multiple rows due to line returns), and my eyes fell on a fellow Aussie who for the purposes of this exercise we'll call "EM", being the initials of her first and last name. Whilst the data I'm going to refer to is either public by design or fabricated, I don't want to use a real person as an example without their consent so let's just play it safe. Here's a fragment of EM's record:

There are 5 noteworthy parts of this that immediately caught my attention:

There are 5 different email addresses here with the alias for each one represented in "[first name].[last name]@" form. These exist in a column titled "PROFILE_USERNAMES". (Incidentally, this is why the headline of 2.5M accounts expands out to 5.8M email addresses as there are often multiple addresses per account.)

There's a LinkedIn profile ID in the form of "[first name]-[last name]-[random hexadecimal chars]" under a column titled "PROFILE_LINKEDIN_ID". That successfully loaded EM's legitimate profile at https://www.linkedin.com/in/[id]/

The numeric value in the "PROFILE_LINKEDIN_MEMBER_ID" column matched with the value on EM's profile from the previous point.

The 2 dates starting with "2020-" are in columns titled "PROFILE_FETCHED_AT" and "PROFILE_LINKEDIN_FETCHED_AT". I assume these are self-explanatory.

EM's first and last name, precisely as it appears in each of her 5 email addresses.

On its own, this record would be unremarkable. It'd be entirely feasible - this could very well be legit - except when you keep looking through the remainder of the data. A pattern quickly emerged and I'm going to bold it here because it's the smoking gun that ultimately indicates that a bunch of this data is fake:

Every single record with multiple email addresses had exactly the same alias on completely unrelated domains and it was almost always in the form of "[first name].[last name]@".
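The pattern described above can be expressed as a simple check. This is a minimal sketch only, not the method actually used in the investigation: the function names, the sample record and the `example.*` domains are all hypothetical, and it assumes the record's addresses have already been parsed out of the "PROFILE_USERNAMES" column.

```python
def split_address(address: str) -> tuple[str, str]:
    """Split an email address into (alias, domain), lower-cased."""
    alias, _, domain = address.strip().lower().rpartition("@")
    return alias, domain


def looks_fabricated(first_name: str, last_name: str, addresses: list[str]) -> bool:
    """Heuristic: flag a record whose addresses all share the single alias
    first.last across two or more different domains."""
    parts = [split_address(a) for a in addresses]
    aliases = {alias for alias, _ in parts}
    domains = {domain for _, domain in parts}
    expected = f"{first_name}.{last_name}".lower()
    return len(domains) > 1 and aliases == {expected}


# Hypothetical record; the name and domains are made up for illustration.
record = ["jane.doe@example.com", "jane.doe@example.org", "jane.doe@example.net"]
print(looks_fabricated("Jane", "Doe", record))
```

A single record matching this heuristic proves nothing on its own, since plenty of real people do have first.last addresses; the signal is that virtually every multi-address record matches.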

Representing email addresses in this fashion is certainly common, but it's far from ubiquitous, and that's easy to demonstrate. For example, I have tons of emails from Pluralsight so I dug one out from my friend "CU":

There's no dot, rather a dash. Every single real Pluralsight email address I looked at used a dash rather than a dot, yet when I delved into the alleged LinkedIn data and dug out another sample Pluralsight address, here's what I found:

That's not LM's real address because it has a dot instead of a dash. Every. Single. One. Is. Fake.

Let's try this the other way around and load up the existing breached accounts in HIBP for the domain of one of EM's alleged email addresses and see how they're formed:

That's definitely not the same format as EM's address, not by a long shot. And time and time again, the same pattern of addresses in the corpus of data in the original tweet emerged, drawing me to what seems to be a pretty logical conclusion:
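The comparison against known addresses on the same domain can be sketched as follows. Everything here is an assumption for illustration: `alias_shape` and `matches_known_format` are hypothetical helpers, the sample addresses are invented, and reducing an alias to a coarse "shape" is just one possible way to compare formats, not necessarily how the check was done by eye.

```python
import re
from typing import Optional


def alias_shape(alias: str) -> str:
    """Reduce an alias to a coarse shape: runs of letters become 'a',
    runs of digits become '9', and separators are kept as-is."""
    shape = re.sub(r"[a-z]+", "a", alias.lower())
    return re.sub(r"[0-9]+", "9", shape)


def matches_known_format(candidate: str, known_addresses: list[str]) -> Optional[bool]:
    """Compare the candidate's alias shape with shapes already seen on the
    same domain. Returns None when there is nothing to compare against."""
    alias, _, domain = candidate.lower().rpartition("@")
    known_shapes = set()
    for address in known_addresses:
        known_alias, _, known_domain = address.lower().rpartition("@")
        if known_domain == domain:
            known_shapes.add(alias_shape(known_alias))
    if not known_shapes:
        return None  # inconclusive: no known addresses on this domain
    return alias_shape(alias) in known_shapes


# Hypothetical known-good addresses that use a dash, as in the Pluralsight example.
known = ["john-smith@example.com", "mary-jones@example.com"]
print(matches_known_format("jane.doe@example.com", known))
print(matches_known_format("jane-doe@example.com", known))
```

A mismatch does not prove an address is fake (companies change formats over time), but a consistent mismatch across many domains points the same way as the evidence above.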

Each email address was fabricated by taking the actual domain of a company the individual legitimately worked at and then constructing the alias from their name.

And these are legitimate companies too because every single LinkedIn profile I checked had all the cues of accurate information and each domain I checked in the corpus of data was indeed the correct one for the company they worked at. I imagine someone has effectively worked through the following logic:

Get a list of LinkedIn profiles whether that be by ID or username or simply parsing them out of crawler results

Scrape the profiles and pull down legitimate information about each individual, including their employment history

Resolve the domain for each company they worked at and construct the email addresses

Profit?
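To show why that logic produces the same alias on every domain, here is a rough sketch of the address-construction step. It is a guess at what such a step might look like, not the actual tooling behind the dataset; `construct_addresses`, the name and the domains are all hypothetical.

```python
def construct_addresses(first_name: str, last_name: str, company_domains: list[str]) -> list[str]:
    """Build one first.last address per employer domain, the way the
    fabricated addresses in the dataset appear to have been built."""
    alias = f"{first_name}.{last_name}".lower().replace(" ", "")
    return [f"{alias}@{domain.lower()}" for domain in company_domains]


# Hypothetical profile with two employers; the domains are placeholders.
print(construct_addresses("Jane", "Doe", ["example.com", "example.org"]))
# ['jane.doe@example.com', 'jane.doe@example.org']
```

Because the alias is derived only from the person's name, it is identical across unrelated domains regardless of each company's real naming convention, which is exactly the pattern that gives the fabrication away.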

On that final point, what is the point? The data wasn't being sold in that original tweet, rather it was freely downloadable. But per the date on EM's profile, the data could have been obtained much earlier and previously monetised. And on that, the date wasn't constant across records, rather there was a broad range of them as recent as July last year and as old as... well, I stopped when the records got older than me.

What is this?!

I suspect the answer may partly lie in the column headings which I've pasted here in their entirety:

"PROFILE_KEY", "PROFILE_USERNAMES", "PROFILE_SPENDESK_IDS", "PROFILE_LINKEDIN_PUBLIC_IDENTIFIER", "PROFILE_LINKEDIN_ID", "PROFILE_SALES_NAVIGATOR_ID", "PROFILE_LINKEDIN_MEMBER_ID", "PROFILE_SALESFORCE_IDS", "PROFILE_AUTOPILOT_IDS", "PROFILE_PIPL_IDS", "PROFILE_HUBSPOT_IDS", "PROFILE_HAS_LINKEDIN_SOURCE", "PROFILE_HAS_SALES_NAVIGATOR_SOURCE", "PROFILE_HAS_SALESFORCE_SOURCE", "PROFILE_HAS_SPENDESK_SOURCE", "PROFILE_HAS_ASGARD_SOURCE", "PROFILE_HAS_AUTOPILOT_SOURCE", "PROFILE_HAS_PIPL_SOURCE", "PROFILE_HAS_HUBSPOT_SOURCE", "PROFILE_FETCHED_AT", "PROFILE_LINKEDIN_FETCHED_AT", "PROFILE_SALES_NAVIGATOR_FETCHED_AT", "PROFILE_SALESFORCE_FETCHED_AT", "PROFILE_SPENDESK_FETCHED_AT", "PROFILE_ASGARD_FETCHED_AT", "PROFILE_AUTOPILOT_FETCHED_AT", "PROFILE_PIPL_FETCHED_AT", "PROFILE_HUBSPOT_FETCHED_AT", "PROFILE_LINKEDIN_IS_NOT_FOUND", "PROFILE_SALES_NAVIGATOR_IS_NOT_FOUND", "PROFILE_EMAILS", "PROFILE_PERSONAL_EMAILS", "PROFILE_PHONES", "PROFILE_FIRST_NAME", "PROFILE_LAST_NAME", "PROFILE_TEAM", "PROFILE_HIERARCHY", "PROFILE_PERSONA", "PROFILE_GENDER", "PROFILE_COUNTRY_CODE", "PROFILE_SUMMARY", "PROFILE_INDUSTRY_NAME", "PROFILE_BIRTH_YEAR", "PROFILE_MARVIN_SEARCHES", "PROFILE_POSITION_STARTED_AT", "PROFILE_POSITION_TITLE", "PROFILE_POSITION_LOCATION", "PROFILE_POSITION_DESCRIPTION", "PROFILE_COMPANY_NAME", "PROFILE_COMPANY_LINKEDIN_ID", "PROFILE_COMPANY_LINKEDIN_UNIVERSAL_NAME", "PROFILE_COMPANY_SALESFORCE_ID", "PROFILE_COMPANY_SPENDESK_ID", "PROFILE_COMPANY_HUBSPOT_ID", "PROFILE_SKILLS", "PROFILE_LANGUAGES", "PROFILE_SCHOOLS", "PROFILE_EXTERNAL_SEARCHES", "PROFILE_LINKEDIN_HEADLINE", "PROFILE_LINKEDIN_LOCATION", "PROFILE_SALESFORCE_CREATED_AT", "PROFILE_SALESFORCE_STATUS", "PROFILE_SALESFORCE_LAST_ACTIVITY_AT", "PROFILE_SALESFORCE_OWNER_CONTACT_ID", "PROFILE_SALESFORCE_OWNER_CONTACT_NAME", "PROFILE_SPENDESK_SIGNUP_AT", "PROFILE_SPENDESK_DELETED_AT", "PROFILE_SPENDESK_ROLES", "PROFILE_SPENDESK_AVERAGE_NPS_SCORE", "PROFILE_SPENDESK_NPS_SCORES_COUNT", 
"PROFILE_SPENDESK_FIRST_NPS_SCORE", "PROFILE_SPENDESK_LAST_NPS_SCORE", "PROFILE_SPENDESK_LAST_NPS_SCORE_SENT_AT", "PROFILE_SPENDESK_PAYMENTS_COUNT", "PROFILE_SPENDESK_TOTAL_EUR_SPENT", "PROFILE_SPENDESK_ACTIVE_SUBSCRIPTIONS_COUNT", "PROFILE_SPENDESK_LAST_ACTIVITY_AT", "PROFILE_AUTOPILOT_MAIL_CLICKED_COUNT", "PROFILE_AUTOPILOT_LAST_MAIL_CLICKED_AT", "PROFILE_AUTOPILOT_MAIL_OPENED_COUNT", "PROFILE_AUTOPILOT_LAST_MAIL_OPENED_AT", "PROFILE_AUTOPILOT_MAIL_RECEIVED_COUNT", "PROFILE_AUTOPILOT_LAST_MAIL_RECEIVED_AT", "PROFILE_AUTOPILOT_MAIL_UNSUBSCRIBED_AT", "PROFILE_AUTOPILOT_MAIL_REPLIED_AT", "PROFILE_AUTOPILOT_LISTS", "PROFILE_AUTOPILOT_SEGMENTS", "PROFILE_HUBSPOT_CFO_CONNECT_SLACK_MEMBER_STATUS", "PROFILE_HUBSPOT_IS_CFO_CONNECT_MEETUPS_MEMBER", "PROFILE_HUBSPOT_CFO_CONNECT_AREAS_OF_EXPERTISE", "PROFILE_HUBSPOT_CORPORATE_FINANCE_EXPERIENCE_YEARS_RANGE"

Check out some of those names: LinkedIn is obviously there, but so are Salesforce and Spendesk and Hubspot, among others. This reads more like an aggregation of multiple sources than it does data solely scraped from LinkedIn. My hope is that in posting this someone might pop up and say "I recognise those column headings, they're from..." Who knows.
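The source names can be pulled out of the headings mechanically. The sketch below uses a subset of the column headings quoted above and assumes, purely from the naming, that each "PROFILE_HAS_..._SOURCE" column flags one upstream source; `source_names` is a hypothetical helper.

```python
import re

# A subset of the column headings quoted above.
columns = [
    "PROFILE_KEY",
    "PROFILE_HAS_LINKEDIN_SOURCE",
    "PROFILE_HAS_SALES_NAVIGATOR_SOURCE",
    "PROFILE_HAS_SALESFORCE_SOURCE",
    "PROFILE_HAS_SPENDESK_SOURCE",
    "PROFILE_HAS_ASGARD_SOURCE",
    "PROFILE_HAS_AUTOPILOT_SOURCE",
    "PROFILE_HAS_PIPL_SOURCE",
    "PROFILE_HAS_HUBSPOT_SOURCE",
    "PROFILE_EMAILS",
]

# Assumption: the text between "PROFILE_HAS_" and "_SOURCE" names a source.
SOURCE_FLAG = re.compile(r"^PROFILE_HAS_(.+)_SOURCE$")


def source_names(headings: list[str]) -> list[str]:
    """Return the source name embedded in each PROFILE_HAS_*_SOURCE heading."""
    return [m.group(1) for h in headings if (m := SOURCE_FLAG.match(h))]


print(source_names(columns))
# ['LINKEDIN', 'SALES_NAVIGATOR', 'SALESFORCE', 'SPENDESK', 'ASGARD', 'AUTOPILOT', 'PIPL', 'HUBSPOT']
```

Eight distinct source flags in one schema is consistent with the reading that this is an aggregation rather than a single LinkedIn scrape.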

So, here's where that leaves us: this data is a combination of information sourced from public LinkedIn profiles, fabricated email addresses and in part (anecdotally, based on simply eyeballing the data, this is a small part), the other sources in the column headings above. But the people are real, the companies are real, the domains are real and in many cases, the email addresses themselves are real. There are over 1.8k HIBP subscribers in the data set and these are folks that have double opted-in so they've successfully received an email to that address in the past. Further, when the data was loaded into HIBP there were nearly a million email addresses that were already in the system so evidently, they were addresses that had previously been in use. Which stands to reason because even if every address was constructed by an algorithm, the pattern is common enough that there'll be a bunch of hits.

Because the conclusion is that there's a significant component of legitimate data in this corpus, I've loaded it into HIBP. But because there are also a significant number of fabricated email addresses in there, I've flagged it as a spam list which means the addresses won't impact the scale of anyone's paid subscription if they're monitoring domains. And whilst I know some people will suggest it shouldn't go in at all, time and time again when I've polled the public about similar incidents the overwhelming majority of people have said "we want to know about it then we'll make up our own minds what action needs to be taken". And in this case, even if you find an email address on your domain that doesn't actually exist, that person who either currently works at your company or previously did has still had their personal data dumped in this corpus. That's something most people will still want to know.

Lastly, one of the main reasons I decided to invest hours into this today is that I loathe disinformation and I hate people using that to then make statements that are completely off base. I'm looking at my Twitter feed now and see people angry at LinkedIn for this, blaming an insider due to recent layoffs there, accusing them of mishandling our data and so on and so forth. No, not this time, the evidence has led us somewhere completely different.


Troy Hunt

Hi, I'm Troy Hunt, I write this blog, create courses for Pluralsight and am a Microsoft Regional Director and MVP who travels the world speaking at events and training technology professionals


This work is licensed under a Creative Commons Attribution 4.0 International License. In other words, share generously but provide attribution.
