We see a lot of media hype around Artificial Intelligence/Machine Learning, Data Analytics, Data Science, the Internet of Things and similar fields, but we tend to forget that without data itself none of them would be very effective. Looking at these emerging technologies, we can say that data is really their backbone.
We can describe data as a collection of facts such as words, numbers, or descriptions of objects. Data can be collected in various ways, for example through observation, or it can even be generated by simulations.
Historically, humans have generated data in many forms: symbols, drawings, and books. In recent years, however, data generation has been unprecedented, and it is no longer only humans but also machines that generate data. To get a sense of the scale: more than 90% of all existing data was generated in the last two years, more than in the entire previous history of humanity. This is truly mind-boggling.
The rate of data generation is not slowing down, and we expect this trend to continue. The challenges of storing, accessing, and processing data will continue to grow with it.
The amount of data collected brings many opportunities if it is used and processed in the right context. However, data can also become a liability if we cannot use it properly. Moreover, personal data and data security will always be a challenge.
Artificial Intelligence, Machine Learning, Big Data Analytics, the Internet of Things, digitalization and many similar fields are creating huge value from data.
[Figure omitted. Source: Infosys]
5Vs – Volume, Variety, Velocity, Veracity and Value
The five big Vs (Volume, Velocity, Variety, Veracity and Value) are commonly used to describe Big Data. The 5Vs are a popular way to explain the evolution of data and data systems in general. However, it should not be assumed that the 5Vs completely explain data evolution or Big Data.
It is a common misconception that the amount of data (Volume) is the most important of the five Vs; Big Data is in fact about much more than just the amount of data. This is not to say that data volume does not matter, but the other dimensions need to be considered as well.
Then we have data velocity, which refers to the speed at which data is generated and processed. Looking at social media alone, every day around 900 million photos are uploaded to Facebook, around 500 million tweets are posted on Twitter, around 0.4 million hours of video are uploaded to YouTube, and approximately 3.5 billion searches are performed on Google. Processing this volume of data in real time is extremely challenging. To cope with it we need to combine multiple technologies: fast processors, fast memory, faster networks, and advanced distributed computing and algorithms. Technologies such as Hadoop, Spark and Google BigQuery are able to process data at the petabyte level. These tools keep getting better, and new tools are emerging to meet future velocity needs.
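To make the velocity idea a bit more concrete, here is a minimal Spark Structured Streaming sketch in Python (PySpark). It assumes a local text stream on port 9999; the host, port and word-count logic are illustrative choices, not a prescription for any particular platform.

```python
# Minimal PySpark Structured Streaming sketch: count words arriving on a socket in near real time.
# Assumes PySpark is installed and something is writing text to localhost:9999 (e.g. `nc -lk 9999`).
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("velocity-demo").getOrCreate()

lines = (spark.readStream
              .format("socket")          # toy source; production systems would use Kafka, Kinesis, etc.
              .option("host", "localhost")
              .option("port", 9999)
              .load())

words = lines.select(explode(split(lines.value, " ")).alias("word"))
counts = words.groupBy("word").count()   # continuously updated aggregation

query = (counts.writeStream
               .outputMode("complete")   # emit the full updated counts table on each trigger
               .format("console")
               .start())
query.awaitTermination()
```

The same pattern scales from a toy socket source to high-velocity feeds simply by swapping in a distributed source and sink.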
Data variety generally refers to the mix of structured and unstructured data. Most of the data produced today is unstructured: texts, tweets, pictures and videos. Other sources such as emails, voicemails, handwritten notes, ECG readings and audio recordings are also important elements of variety. Data variety is about the ability to classify incoming data into various categories, and it is often seen as the fundamental aspect of data complexity. The biggest challenge with data variety is putting data in the right context. It is no longer only large enterprises that struggle with data variety because of multiple data sources and databases; even small companies today run multiple databases and face challenges similar to those big enterprises used to face.
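As a small illustration of handling variety, the sketch below routes incoming payloads into rough categories based on how structured they look. The categories, magic-byte checks and function name are assumptions made purely for illustration.

```python
# Rough illustration of routing incoming data by how structured it appears.
import json

def classify_payload(raw: bytes) -> str:
    """Return a coarse category for an incoming payload (illustrative, not exhaustive)."""
    try:
        json.loads(raw)                      # parses as JSON -> treat as structured
        return "structured"
    except (ValueError, UnicodeDecodeError):
        pass
    if raw.startswith((b"\x89PNG", b"\xff\xd8\xff", b"RIFF")):  # PNG / JPEG / WAV signatures
        return "binary media"
    return "unstructured text"

print(classify_payload(b'{"order_id": 42, "amount": 19.90}'))           # structured
print(classify_payload("Hi team, notes from the meeting...".encode()))  # unstructured text
print(classify_payload(b"\x89PNG\r\n\x1a\n..."))                        # binary media
```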
Data veracity is the degree to which data is accurate, precise and trustworthy. Assessing veracity requires knowing the source of the data, how the data was produced, and who contributed to producing it. It is also important to know which methods were used to collect the data. Data is often not fully accurate, and it is very hard to get 100% accurate and precise data. To deal with this, we need to know what proportion of the data is unreliable or inaccurate. Data veracity is a growing problem as data is collected from more and more sources.
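One practical way to approach veracity is to measure the share of records that are missing or clearly implausible. The pandas sketch below does this on a tiny made-up table; the column names, sentinel value and thresholds are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical readings from multiple sources; columns and values are made up for illustration.
df = pd.DataFrame({
    "source": ["sensor_a", "sensor_a", "sensor_b", "manual", "sensor_b"],
    "value":  [21.4, np.nan, -999.0, 22.1, 20.8],   # -999.0 stands in for a failed reading
})

missing = df["value"].isna()          # values that were never recorded
implausible = df["value"] < -100      # sentinel / out-of-range readings
unreliable_share = (missing | implausible).mean()

print(f"Unreliable records: {unreliable_share:.0%}")                    # overall proportion of suspect rows
print(df.groupby("source")["value"].apply(lambda s: s.isna().mean()))   # missing rate per source
```

Tracking such proportions per source makes it easier to decide which feeds need cleaning or can be trusted as-is.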
Data value is the most important of them all. Data value in itself is an abstract concept: it is extremely hard to define objectively and is therefore very subjective. Value is associated with a defined goal. For example, if the goal is to increase profit, and data analytics provides insights that lead to higher profit, then the data is attributed positive value. If what we need is clean and correct data (data veracity), then the sheer volume of data may not matter much at all.
Data moving towards the cloud
Personal computers democratized computing and data processing. Computers were no longer meant only for big businesses and public organizations. The world used to have a few mainframe computers, which were for the most part remotely connected. The PC changed the whole paradigm by making it possible to own and use a computer at home.
However, most PCs were still not connected, and the data they generated was largely kept on local hard disks. The use of optical discs then made it possible to store and distribute large quantities of data. The introduction of broadband, in combination with smartphones and connected devices, completely changed the paradigm of data storage. Pictures and videos can now be instantly uploaded to remote storage services, in most cases automatically and without any manual action, including automatic backups of entire devices. Today there are more than 4 billion smartphone users with access to the Internet, and they are the biggest contributors pushing data towards the cloud.
Data volume continues to increase
It’s estimated that around 2.5 quintillion bytes of data are generated every day. To grasp the scale, that is enough to fill roughly 50 million Blu-ray discs, with a single Blu-ray disc holding 50 GB.
So far we don’t see any slowdown in data generation. As mentioned, most of the world’s data has been generated in the last few years rather than over the entire previous history of humanity. At present, we see three main sources of data generation: social media, machine data and transactional data. The rise of social media can be attributed to Facebook, YouTube, Instagram, Twitter and others.
There are around 2.6 billion active Facebook users, and YouTube has around 2 billion active users. In terms of data volume, pictures and videos are the most prominent contributors: 300 hours of video are uploaded to YouTube every minute, and Instagram users upload over 100 million photos and videos every day.
Email is one of the oldest applications and is still a source of large amounts of data. The use of email continues to grow: around 293 billion emails were sent daily in 2019, and this is expected to grow by about 4.2% per year to 347 billion by 2023.
Machine data is another big source of data generation. The biggest contributors are the Internet of Things, logs and tracking data, data generated by scientific experiments such as high-energy particle colliders like the Large Hadron Collider (LHC), and space exploration data. We also expect autonomous vehicles to contribute huge amounts of machine data in the near future.
AI and data
Data can be seen as the blood and oxygen of AI and machine learning. Machine learning in particular is highly data dependent: the more data we have, the better AI models can be trained. However, without well-prepared training data we will not reach a good level of performance. Simply having a lot of data is not enough; the quality of the data is as important as the amount. Data becomes even more valuable once it is cleaned, processed and structured.
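To illustrate the point about clean, processed and structured data, here is a minimal scikit-learn sketch in which missing values are imputed and features scaled before a model is trained. The tiny made-up dataset and the choice of logistic regression are purely illustrative assumptions.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Tiny made-up dataset with a missing value, just to show the cleaning steps.
X = np.array([[1.0, 200.0], [2.0, np.nan], [3.0, 180.0], [4.0, 220.0]])
y = np.array([0, 0, 1, 1])

model = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),   # fill gaps instead of training on dirty rows
    ("scale", StandardScaler()),                  # put features on a comparable scale
    ("clf", LogisticRegression()),                # any estimator could sit at the end of the pipeline
])
model.fit(X, y)
print(model.predict([[2.5, 190.0]]))
```

Bundling the cleaning steps into the pipeline keeps the preparation reproducible, which matters as much as the model itself.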
The explosion of AI and machine learning applications can be attributed, to a large extent, to this increasing amount of data. Many AI models require data to be properly labelled. For example, for an AI model to recognize an object in a picture, it first has to be trained on many examples of objects labelled with their corresponding descriptions. Labelling data is a slow, tedious and largely manual job. However, new techniques are being developed to automate the data labelling process. One well-known service is Amazon SageMaker Ground Truth, which uses a hybrid approach: raw data first goes through Amazon’s AI engine, which recognizes objects and assigns the labels with the highest matching probability, while items with lower matching probabilities are sent for manual human inspection. The end result is very accurately labelled data.
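The hybrid idea can be sketched as a simple confidence-threshold router, shown below. This is not SageMaker Ground Truth's actual API; the `model.predict_proba` call, the 0.9 threshold and the function name are hypothetical and only illustrate the pattern of accepting high-confidence machine labels and escalating the rest to human annotators.

```python
from typing import List, Tuple

def route_for_labelling(images: List[str], model, threshold: float = 0.9) -> Tuple[list, list]:
    """Split items into auto-labelled results and a queue for human review (illustrative sketch)."""
    auto_labelled, needs_human_review = [], []
    for image in images:
        label, confidence = model.predict_proba(image)   # assumed to return (best_label, probability)
        if confidence >= threshold:
            auto_labelled.append((image, label))          # accept the machine label as-is
        else:
            needs_human_review.append(image)              # low confidence -> send to a human annotator
    return auto_labelled, needs_human_review
```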
Business intelligence and data
Recently, we have seen a large number of companies emerge that solely provide data-related services to various types of businesses. More and more companies are realizing they can use data for business analytics, which is a good trend for improving business performance from the macro to the micro level. In today’s increasingly competitive market, every little gain matters. E-commerce in particular is an extremely competitive segment, with hundreds of online shops competing on the same or similar items and trying every possible way to acquire new customers or retain existing ones.
Intuition is still important in business, as the complexity of decision making remains overwhelmingly hard for machines. In pattern recognition, envisioning the future and prediction, and in particular creativity, human intelligence is still unmatched. However, the gap is narrowing with time, and in some specific areas AI even outperforms humans. An even better approach may be a hybrid one, in which data, AI and human intelligence are combined.