Everyone assumes that the only way your data can be captured and misused is through hacking a web site or cracking into a server farm. But there is another, far easier way to capture personal information. It is called ‘Data Scraping’.
Data scraping is a technique whereby a computer program captures and harvests human-readable outputs from another program.
Typically, data is exchanged when it is moved from one program to another, from one routine to another, or when it is at rest in a database. In these scenarios, data is structured in a format that is not normally meant for humans to read; it is structured so that computers can interact with each other. When data is in a machine format, computers access it through a parsing process. Parsing, also called syntax analysis or syntactic analysis, is the process of analysing a string of symbols, whether in a natural language, a computer language, or a data structure, according to the rules of a formal grammar. The term parsing comes from the Latin pars (orationis), meaning part (of speech). So, data in this internal computer format is not human friendly.
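A tiny illustration of the difference between a machine format and parsed, usable data. This is a sketch only; the field names and values are invented, and JSON stands in for any machine-oriented format:

```python
import json

# The raw string is structured for computers to exchange, not for people
# to read. Parsing it -- applying JSON's formal grammar -- turns it into
# a data structure a program can work with.
raw = '{"user": "alice", "logins": 42}'   # machine-readable wire format
record = json.loads(raw)                   # the parsing step
print(record["user"], record["logins"])
```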
Data scraping is most often done either to interface to a legacy system that has no other mechanism compatible with current hardware, or to interface to a third-party system that does not provide a more convenient API. In the second case, the operator of the third-party system will often see screen scraping as unwanted, due to reasons such as increased system load, the loss of advertisement revenue, or the loss of control of the information content.
Data scraping is generally considered an ad hoc, inelegant technique, often used only as a “last resort” when no other mechanism for data interchange is available. Aside from the higher programming and processing overhead, output displays intended for human consumption often change structure frequently. Humans can cope with this easily, but a computer program may fail or report nonsense: it has been told to read data in a particular format or from a particular place, and has no knowledge of how to check its results for validity.
Screen scraping is the practice of reading the information displayed on a computer screen, collecting that data from the display, and converting it into a machine format.
Web scraping is used with pages built using text-based mark-up languages (HTML and XHTML), which frequently contain a wealth of useful data in text form. However, most web pages are designed for human end users, not for ease of automated use. Because of this, tool kits that scrape web content were created. A web scraper is an API or tool that extracts data from a web site. Companies like Amazon AWS and Google provide web scraping tools, services, and public data free of charge to end users. Newer forms of web scraping involve listening to data feeds from web servers.
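A minimal sketch of how a web scraper pulls structured data out of mark-up meant for humans, using only Python’s standard library. The sample page, tag names, and email addresses are invented; a real scraper would fetch live pages over HTTP rather than read a string:

```python
from html.parser import HTMLParser

class ProfileScraper(HTMLParser):
    """Toy scraper: collects the text inside <span class="email"> tags."""
    def __init__(self):
        super().__init__()
        self.capturing = False
        self.emails = []

    def handle_starttag(self, tag, attrs):
        if tag == "span" and ("class", "email") in attrs:
            self.capturing = True

    def handle_data(self, data):
        if self.capturing:
            self.emails.append(data.strip())

    def handle_endtag(self, tag):
        if tag == "span":
            self.capturing = False

# Invented sample page, standing in for markup fetched from a web server.
page = """
<html><body>
  <div class="profile">Alice <span class="email">alice@example.com</span></div>
  <div class="profile">Bob <span class="email">bob@example.com</span></div>
</body></html>
"""

scraper = ProfileScraper()
scraper.feed(page)
print(scraper.emails)
```

The point of the example is how brittle this is: the scraper depends entirely on the page keeping its current structure, which is exactly the “last resort” weakness described above.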
Report mining is the extraction of data from human readable computer reports. Conventional data extraction requires a connection to a working source system, suitable connectivity standards or an API, and usually complex querying.
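Report mining can be sketched in a few lines: scan a human-readable report line by line and keep only the rows that look like data. The report layout, names, and figures below are invented for illustration:

```python
# A printed-style report: title and column headers mixed in with data rows.
report = """\
ACCOUNT SUMMARY                       PAGE 1
NAME            ACCOUNT     BALANCE
Alice Smith     10-4432     1,250.00
Bob Jones       10-9921       310.75
"""

records = []
for line in report.splitlines():
    parts = line.split()
    # Data rows end in a decimal balance; titles and headers do not.
    if len(parts) >= 3 and "." in parts[-1] and \
            parts[-1].replace(",", "").replace(".", "").isdigit():
        records.append({
            "name": " ".join(parts[:-2]),
            "account": parts[-2],
            "balance": float(parts[-1].replace(",", "")),
        })

print(records)
```

Note that no connection to the source system, API, or query language is needed, which is precisely what makes report mining so cheap compared with conventional extraction.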
So, there are several simple and cost-effective ways to harvest your data and capture it into a new format for future manipulation and processing.
When people post comments, create profiles, interact on social media sites, respond to others, or place any content anywhere on the internet, it can be scraped and prepared for use. Often this data is used in ways that the person never intended or expected. So bad actors harvest your personal data, load it into a database, and then build profiles on you. They can very quickly build gigabytes of data on you and come to know you better than you know yourself.
It is very scary.
Although scraping is ubiquitous, it’s not clearly legal. A variety of laws may apply to unauthorized scraping, including contract, copyright, and trespass to chattels laws.
Some see scraping as clumsy and inefficient. But data can be collected at a rate of about 7 milliseconds per page, so with all the advancements in compute power, database design, and storage, these tools can scrape a massive volume of data every second, minute, hour, and day.
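To put that rate in perspective, the arithmetic below works out what a single scraper running at the article’s 7-milliseconds-per-page figure could collect; everything else follows from that one number:

```python
# Throughput implied by a 7 ms per-page scraping rate (single thread).
ms_per_page = 7
pages_per_second = 1000 / ms_per_page      # ~143 pages every second
pages_per_hour = pages_per_second * 3600   # ~514,000 pages every hour
pages_per_day = pages_per_hour * 24        # ~12.3 million pages every day

print(f"{pages_per_second:.0f} pages/second")
print(f"{pages_per_day:,.0f} pages/day")
```

And that is one process; a bad actor running scrapers in parallel multiplies these figures accordingly.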
Major real estate companies use it to build profiles on both buyers and sellers. They can learn exactly how you buy and what motivates you to sell. They can apply this knowledge in the negotiation, so they can extract the greatest profit margin from you.
Research companies use it to learn preferences related to specific products or services. They can tell if some product will gain traction or not. By scraping the public, they can alter their selling strategy to better align with the needs and wants of the consumer. In this case, they scrape for a consensus rather than to understand just one person. But, this field is growing rapidly, and individual preferences are now being used to zero in on specific demographic markets.
Privacy advocates are outraged after CEO Mark Zuckerberg said this week a Facebook reverse search tool may have compromised the data of the social network’s two billion users.
The feature in question was designed to let users enter a Facebook user’s phone number or email address into the social network’s Search tool to find friends. But new revelations from Facebook indicate the feature was also used by malicious actors to scrape the data of millions of Facebook users. The company has since disabled the feature, said Zuckerberg on Wednesday, speaking at a press conference about the company’s data privacy policies.
“Many Facebook users are naturally upset about this situation, but in the end, the moral of the story here is that people need to be more considerate about what data they are sharing and with whom,” said Craig Young, computer security researcher for Tripwire’s Vulnerability and Exposure Research Team. “This is one of those situations that should be a revelation to people on the importance of reading before clicking ‘OK.’”
So, data scraping is a notorious means of collecting data on individuals and then using that data in ways that negatively impact them. Every word you type is likely being harvested to get to know you better and to extract money from your wallet. So, beware.
O’Donnell, L. (2018). Privacy advocates blast Facebook after data scraping scandal. Threatpost. Retrieved April 7, 2018 from https://threatpost.com/privacy-advocates-blast-facebook-after-data-scraping-scandal/131000/
Upwork Global Inc. (2018). What is Web Scraping and How Can You Use It? Retrieved April 7, 2018 from https://www.upwork.com/hiring/for-clients/web-scraping-tutorial/
Wikipedia. (2018). Data Scraping. Retrieved April 7, 2018 from https://en.wikipedia.org/wiki/Data_scraping
About the Author:
Michael Martin has more than 35 years of experience in systems design for broadband networks, optical fibre, wireless and digital communications technologies.
He is a Senior Executive with IBM Canada’s GTS Network Services Group. Over the past 13 years with IBM, he has worked in the GBS Global Center of Competency for Energy and Utilities and the GTS Global Center of Excellence for Energy and Utilities. He was previously a founding partner and President of MICAN Communications and before that was President of Comlink Systems Limited and Ensat Broadcast Services, Inc., both divisions of Cygnal Technologies Corporation (CYN: TSX).
Martin currently serves on the Board of Directors for TeraGo Inc (TGO: TSX) and previously served on the Board of Directors for Avante Logixx Inc. (XX: TSX.V).
He serves as a Member, SCC ISO-IEC JTC 1/SC-41 – Internet of Things and related technologies, ISO – International Organization for Standardization, and as a member of the NIST SP 500-325 Fog Computing Conceptual Model, National Institute of Standards and Technology.
He served on the Board of Governors of the University of Ontario Institute of Technology (UOIT) and on the Board of Advisers of five different Colleges in Ontario. For 16 years he served on the Board of the Society of Motion Picture and Television Engineers (SMPTE), Toronto Section.
He holds three master’s degrees, in business (MBA), communication (MA), and education (MEd). As well, he has diplomas and certifications in business, computer programming, internetworking, project management, media, photography, and communication technology.