Where are our data?


Published: 05.03.2021, 15:47; updated: 22.03.2021, 11:47

Editors: OO

Tags: computer science

Agnieszka Niewdana

The first two decades of the 21st century have been a time of significant advances in IT. Computers have become faster, and their ability to store and process a wide variety of information is huge. But where is this information? We rarely ask ourselves this question. Every day we take photos, talk via various messaging apps, send documents, and use credit cards to pay for our purchases. We do not wonder where these data are collected or how they can be used.

Computers are ubiquitous in our lives. They perform the tasks we need and record a lot of data, and these data are extremely easy to obtain, process and store. Our devices save information on local data carriers (disks, flash memory) and in the so-called clouds. Logging in to the appropriate accounts and services makes personalised actions possible. Consequently, we more and more often choose to save our data “somewhere on the Internet”.

However, personalising cloud-based resources and services means that information about us, our resources and our activities is meticulously saved by the software that manages access to clouds and other Internet services.

“Each of our activities leaves a digital footprint, for example, after using social networks, online stores or financial services. It is worth remembering that such a footprint is also left when we buy in a physical store by means of payment cards or loyalty cards. Actually, the only time we do not leave it is when we pay in cash,” says Dr Eng. Roman Simiński from the Institute of Computer Science of the University of Silesia.

Regardless of whether the digital footprint allows us to be identified or we remain anonymous, it contains information about our activity: when, how much, and what we bought; what financial transactions we carried out; what we searched for on the Internet; where and how we moved; and what photos we took (and when and where we took them). The footprints of our activities recorded in computer systems raise a number of important problems concerning, for example, the confidentiality of information about us. Its unauthorised use, e.g. for criminal purposes, is a genuine threat. Nevertheless, the information recorded by IT systems can also be used in many ways that are beneficial to us.

The information recorded by the banking system over a certain time is not only a direct record of individual events; it also implicitly reflects the processes affecting the funds in our account. As Dr Eng. Simiński states, a detailed analysis of a larger number of events from a certain period may allow a lot of information, often surprising, to be deduced about real events in our lives. For example, the end of regular monthly income may indicate a potential job loss, while the arrival of regular, higher monthly income may suggest that we have changed jobs for a better one.
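The kind of deduction described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `income_signal`, the three-month window and the 20% threshold are hypothetical choices made for this example, not the method used by any bank or described by Dr Eng. Simiński.

```python
from statistics import mean

def income_signal(monthly_income, recent=3, change=0.2):
    """Compare the most recent months of income with the earlier ones.

    monthly_income: list of total incoming amounts, one per month, oldest first.
    recent and change are illustrative assumptions, not established thresholds.
    """
    earlier = monthly_income[:-recent]
    latest = monthly_income[-recent:]
    if not earlier or not latest:
        raise ValueError("need more months of history than the recent window")

    before, after = mean(earlier), mean(latest)
    if before > 0 and after == 0:
        return "income stopped"        # may indicate a potential job loss
    if after > before * (1 + change):
        return "income increased"      # may suggest a change to a better job
    return "no clear change"

# Hypothetical account histories invented for this example.
stopped = income_signal([3000, 3000, 3000, 0, 0, 0])
raised = income_signal([3000, 3000, 3000, 4000, 4000, 4000])
steady = income_signal([3000] * 6)
```

Even such a crude comparison shows why a sequence of events says more than any single transaction: the signal only appears when several months are looked at together.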

The analysis of anonymous shopping in a self-service shop can reveal a lot of information relevant to the owner. The content of baskets makes it possible, for example, to identify the groups of goods that are most often purchased together. This can then be used to arrange and display goods so that those most frequently bought together are placed close to each other.
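Finding goods that are often purchased together can be sketched by counting how many baskets contain each pair of products. The baskets, the function name `frequent_pairs` and the `min_count` threshold below are assumptions made for this illustration; real shops work with far larger data sets and more refined methods.

```python
from collections import Counter
from itertools import combinations

def frequent_pairs(baskets, min_count=2):
    """Return pairs of goods that occur together in at least min_count baskets."""
    pair_counts = Counter()
    for basket in baskets:
        # sorted(set(...)) drops duplicates and gives every pair a stable order
        for pair in combinations(sorted(set(basket)), 2):
            pair_counts[pair] += 1
    return {pair: n for pair, n in pair_counts.items() if n >= min_count}

# Hypothetical, anonymous baskets invented for this example.
baskets = [
    ["bread", "butter", "milk"],
    ["bread", "butter"],
    ["milk", "coffee"],
    ["bread", "butter", "coffee"],
]
pairs = frequent_pairs(baskets)
```

In this toy data set only bread and butter appear together in more than one basket, so they are the pair a shop owner might consider placing side by side.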

The examples above illustrate simple uses of data science, a field of artificial intelligence that has become crucial today. Extracting knowledge from data is not a new idea; it derives from machine learning. The first well-known and successfully applied methods were developed in the second half of the last century. The best-known algorithms are ID3, C4.5 and C5.0 by Quinlan, and AQ by Ryszard Michalski, a Pole who lived and conducted research in the USA. Machine learning algorithms work on examples from which they are supposed to learn something automatically. To learn means to create a description containing previously unknown knowledge about the regularities, relationships and tendencies occurring in the learning examples. The idea of machine learning was to “teach the machine” how to solve a problem in a way other than algorithmic. The outcomes of the algorithms may differ, but most often they are decision trees or decision rules. In both cases, they make it possible to attempt to classify new cases.
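To give a feel for how a decision tree is learned from examples, the sketch below shows the step at the heart of ID3: choosing the attribute whose values best separate the learning examples, measured by information gain. It is a simplified illustration of that single step, not a full implementation of ID3; the attribute names and the examples are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(examples, attribute, target):
    """Reduction in entropy of the target after splitting on the attribute."""
    base = entropy([e[target] for e in examples])
    groups = {}
    for e in examples:
        groups.setdefault(e[attribute], []).append(e[target])
    remainder = sum(len(g) / len(examples) * entropy(g)
                    for g in groups.values())
    return base - remainder

def best_attribute(examples, attributes, target):
    """Pick the attribute with the highest information gain."""
    return max(attributes, key=lambda a: information_gain(examples, a, target))

# Hypothetical learning examples invented for this illustration.
examples = [
    {"loyalty_card": "yes", "weekend": "yes", "buys": "yes"},
    {"loyalty_card": "yes", "weekend": "no",  "buys": "yes"},
    {"loyalty_card": "no",  "weekend": "yes", "buys": "no"},
    {"loyalty_card": "no",  "weekend": "no",  "buys": "no"},
]
root = best_attribute(examples, ["loyalty_card", "weekend"], "buys")
```

Here the hypothetical `loyalty_card` attribute separates the two classes perfectly, so it would become the root of the tree. A complete algorithm would repeat the same choice on each resulting subset until the examples in every branch share one class.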

In fact, machine learning allows us to discover knowledge about the problem that is being solved. By developing and generalising this concept, we arrive at another one, called data mining. The purpose of data mining is to discover previously unknown but useful knowledge that is implicitly stored in the data. Since obtaining data for exploration might require additional activities (e.g. cleaning and preparation), and the results of exploration require assessment and verification, the whole process is broader and is called knowledge discovery in data. As mentioned earlier, obtaining data that may contain hidden and essential knowledge, even in large volumes, is now relatively easy and common.

The application of the concept of knowledge discovery to large collections of data from real system databases is called big data. The source of data for the analyses referred to as big data is information stored in computing clouds. Using the cloud is usually easy: we rely on convenient mechanisms that often operate automatically. We take a picture with a phone camera, look at it, and a moment later it is transferred to our “piece” of the cloud.

“Usually, we do not think about where it is actually saved, but it goes to the data centre of the given cloud owner,” says Dr Eng. Roman Simiński. “Intuition suggests that information is stored there on some disk, but in fact it is either stored on disk arrays replicating our data, or directed to distributed data storage systems, where it is duplicated, so that the failure of a device or an entire segment of the storage system would not result in data loss.”
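The idea of duplicating data so that a single failure does not cause data loss can be sketched as follows. This is a toy model under stated assumptions: the class `ReplicatedStore`, its method names and the rule for choosing storage nodes are hypothetical and far simpler than what real cloud providers use.

```python
import zlib

class ReplicatedStore:
    """Toy storage system that keeps every object on several nodes."""

    def __init__(self, node_count=3, replicas=2):
        if not 1 <= replicas <= node_count:
            raise ValueError("replicas must be between 1 and node_count")
        self.nodes = [dict() for _ in range(node_count)]
        self.alive = [True] * node_count
        self.replicas = replicas

    def _targets(self, key):
        # Deterministic choice of nodes: start from a checksum of the key
        # and take the next nodes in a ring (an assumption of this sketch).
        start = zlib.crc32(key.encode("utf-8")) % len(self.nodes)
        return [(start + i) % len(self.nodes) for i in range(self.replicas)]

    def put(self, key, value):
        for index in self._targets(key):
            self.nodes[index][key] = value

    def fail(self, index):
        """Simulate the failure of one storage node."""
        self.alive[index] = False

    def get(self, key):
        # Read from the first replica that is still available.
        for index in self._targets(key):
            if self.alive[index] and key in self.nodes[index]:
                return self.nodes[index][key]
        raise KeyError(key)

store = ReplicatedStore(node_count=3, replicas=2)
store.put("photo.jpg", b"pixels")
store.fail(store._targets("photo.jpg")[0])   # one copy becomes unreachable
photo = store.get("photo.jpg")               # served from the second copy
```

With two copies, the photo survives the loss of one node; it would be lost only if every node holding a copy failed at the same time.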

All of this is located in properly secured centres. Security measures cover both the physical infrastructure (personnel access control, temperature, emergency power supply, monitoring) and the system and network infrastructure (firewalls, intrusion detection systems, network segmentation, data isolation). Major cloud service providers also take care of geographic distribution so that, e.g., a natural disaster like an earthquake would not destroy all the physical resources in a given location. It therefore seems that our data are well protected against loss. However, are they safe from theft, especially when they are duplicated?

“In this case, the answer is not clear,” explains the computer scientist. “Cloud service providers are a constant target of attacks by cybercriminals. From time to time, these attacks succeed and may cause a data leak or make services unavailable. The unavailability of cloud services can be a big problem even if it was not caused by a cyberattack and no data leaked: the result is, for example, no access to mailboxes, documents and contacts, which can make life much more difficult.”

In mid-December 2020, a well-known cloud service provider experienced a 45-minute outage, which cut off millions of users from their cloud data and services. A week earlier, the messaging service of a well-known social media platform broke down, which affected the exchange of information between millions of users.

“We should bear in mind that even when the services are available, they require a stable internet connection. Losing access to the wireless network or the failure of a nearby base transceiver station is more likely than the failure of a professional cloud storage centre. However professionally the data are maintained, there is always a risk of loss, so you should make backup copies of key information on local data storage devices from time to time,” recommends the scientist.

The article entitled “Where are our data?” was published in the February issue of “Gazeta Uniwersytecka UŚ” (University of Silesia Magazine) no. 5 (285).
