Today, as data science grows in popularity, we are surrounded by data of every kind. Data that has not yet been processed is called raw data, or primary data. It includes numbers, instrument readings, and records drawn from databases. Raw data is the foundation of scientific research and business analysis, but it must pass through a series of processing steps before it becomes information that people can understand.
Raw data is a relative concept, defined in contrast to processed data. Even after one research team has cleaned and processed a data set, another team may still treat it as raw data for its own work.
For example, suppose a scientist records the temperature of a chemical mixture in a test tube every minute during an experiment. The list of temperature readings is the raw data. At this stage the data has not been processed at all, so it may still contain problems such as measurement errors and data entry errors.
Data is generated in two main ways. The first is captured data, which is obtained through deliberate investigation or analysis. The second is exhaust data, which machines or terminals usually collect as a secondary function. A cash register or smartphone, for instance, automatically records data about every transaction alongside its primary function. This kind of data is often discarded because there is too much of it or because it is considered to have no reference value.
In computing, raw data has several typical characteristics: it may contain human, machine, or instrument errors; it may not have been validated; it may exist in different formats; and some entries may be suspect and require further confirmation. For example, a date in a data entry form may appear in various forms, such as "January 31, 1999" or "31/01/1999". Raw data like this must be converted into a standard format before it can be analyzed.
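As a rough illustration of that conversion step, the Python sketch below parses the two date forms mentioned above into one standard representation. The function name `normalize_date` and the list `KNOWN_FORMATS` are hypothetical, chosen only for this example. The sketch assumes day-first ordering for slash-separated dates and English month names; real data would likely need more formats and a rule for ambiguous entries.

```python
from datetime import date, datetime
from typing import Optional

# Assumed set of formats found in the raw entries (illustrative only).
# "%B" matches a full month name and depends on the active locale;
# English month names are assumed here.
KNOWN_FORMATS = (
    "%B %d, %Y",  # e.g. "January 31, 1999"
    "%d/%m/%Y",   # e.g. "31/01/1999" (day first, by assumption)
)


def normalize_date(raw: str) -> Optional[date]:
    """Try each known format in turn; return None if none matches."""
    text = raw.strip()
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue  # this format did not match, try the next one
    return None  # suspect entry: leave it for manual confirmation


if __name__ == "__main__":
    for entry in ["January 31, 1999", "31/01/1999", "sometime in 1999"]:
        print(repr(entry), "->", normalize_date(entry))
```

Entries that match no known format come back as `None` rather than being guessed at, which keeps suspicious values visible for later review.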
Raw data can become usable information, but only after it has been extracted, organized, analyzed, and formatted.
Take supermarket sales data as an example. The checkout system collects a large amount of raw data about customer purchases every day, but this data has little value until it is analyzed. By analyzing it, managers can learn about customer purchasing behavior, peak sales periods, and similar patterns, and use that information to make sound business decisions.
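To make the supermarket example concrete, here is a minimal Python sketch that turns a handful of raw transaction records into one piece of information: the busiest hour of the day. The sample records and the function names `transactions_per_hour` and `peak_hour` are invented for illustration and do not come from any real checkout system.

```python
from collections import Counter
from datetime import datetime

# Made-up raw transactions: (ISO 8601 timestamp, amount paid).
transactions = [
    ("2024-03-01T09:15:00", 12.50),
    ("2024-03-01T12:05:00", 30.00),
    ("2024-03-01T12:40:00", 8.25),
    ("2024-03-01T18:20:00", 45.10),
    ("2024-03-01T12:55:00", 19.99),
]


def transactions_per_hour(rows):
    """Count how many transactions fall in each hour of the day."""
    counts = Counter()
    for timestamp, _amount in rows:
        counts[datetime.fromisoformat(timestamp).hour] += 1
    return counts


def peak_hour(rows):
    """Return the hour (0-23) with the most transactions."""
    hour, _count = transactions_per_hour(rows).most_common(1)[0]
    return hour


if __name__ == "__main__":
    print("Transactions per hour:", dict(transactions_per_hour(transactions)))
    print("Peak hour:", peak_hour(transactions))
```

The individual rows say little on their own; the count per hour is the kind of processed result a manager could act on, for example when planning staffing.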
Processed data is then stored in a database so that it can be analyzed further from many angles. At that point, the raw data has also become a resource that enables deeper analysis. Tim Berners-Lee, the inventor of the World Wide Web, has argued that sharing raw data is crucial to society. He advocates that everyone should demand that governments and businesses share the data they collect.
Data drives much of what happens in our lives because someone takes that data and acts on it.
Proponents of open data believe that once citizens and civil society organizations have access to corporate and government data, they can conduct their own independent analysis and thereby strengthen their position. For example, a government may claim that its policies are reducing unemployment, but an anti-poverty advocacy group can analyze the raw data and may reach a different conclusion.
In summary, there is a clear difference between data and information. Information is the end product of data processing: only after raw data has been properly processed does it become meaningful information. That information helps decision-makers plan effectively, so the path from raw data to information is critical in every field.
So, in the face of today's flood of data, have you begun to think about how to effectively use these raw data to obtain truly valuable information?