Instruction
Data collection and preparation for analytics
Gathering the right data is the first and most important step in any analysis. Accurate predictions depend on good historical data, and more data generally allows more accurate models, provided the data is relevant and of good quality. However, historical data is often lacking, for example when you want to launch a new product. In that case you need to conduct market research to learn about customer preferences, competition, and other relevant factors. In many cases the best approach is a combination of both: use existing data to create a basic model, then add new data to improve it. If you do not have enough data of your own, consider purchasing external data from specialized companies. Such data can provide additional insights into the market and customers. Choosing the right strategy depends on the following factors:
Objectives: What questions do you want to answer with the analysis? What decisions do you want to make based on the analysis? The more specific the objective, the better you know what data you need.
Available resources: What data do you have available?
Time frame: How quickly do you need the results?
Cost: How much are you willing to invest in data collection and analysis?
***
Identification of relevant data
Go through all available sources: databases, spreadsheets, and files. What data are you missing? Make a list of the data you do not have and expect to need.
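The list of missing data can be drawn up mechanically once the required fields are written down. The following is a minimal sketch in Python, not a prescribed method. All field names and source names in it are hypothetical and only serve as an illustration.

```python
# Hypothetical example: field names and source names are illustrative only.

def find_missing_fields(required, available_by_source):
    """Return the required fields that no available source provides, sorted."""
    available = set()
    for fields in available_by_source.values():
        available.update(fields)
    return sorted(set(required) - available)


# Fields the analysis is assumed to need.
required_fields = ["customer_id", "order_date", "order_value", "customer_age"]

# Columns found in each source that was reviewed.
available_sources = {
    "crm_export.csv": ["customer_id", "customer_name"],
    "sales.xlsx": ["customer_id", "order_date", "order_value"],
}

missing = find_missing_fields(required_fields, available_sources)
print(missing)
```

The resulting list is the starting point for deciding what has to be collected or purchased.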
***
Collection and sorting of data
Internal sources: databases (CRM systems, ERP systems, customer databases, historical sales data, etc.); files (Excel spreadsheets, CSV files, PDF documents); web analytics (Google Analytics, Facebook Insights).
External sources: public databases (statistical office, company databases, etc.); APIs (interfaces for retrieving data from external systems, e.g. social networks); web scraping (automatic data retrieval from websites, requires extra services such as zite.com).
Collecting new data: If relevant historical data is missing or you want to answer completely new questions, new data collection is necessary. This can involve a variety of methods, from surveys and questionnaires to tracking customer behavior.
***
Data editing and data preprocessing
In general, the client should perform basic data editing, and the analyst should perform the more complex transformations and preprocessing. The client should ensure that the data is stored in a comprehensible format (e.g. Excel, CSV) and has a logical structure. The client should identify and remove obvious errors in the data, such as typos or incorrectly formatted values, and mark missing values clearly. The client should also create a key, i.e. a list of all codes, abbreviations, and definitions used, so that the analyst understands the meaning of each column. The analyst then checks the quality of the data using statistical methods, removes duplicates, and selects an appropriate method to fill in missing data. Afterwards, the analyst creates new variables, summarizes the data, normalizes it, and prepares it for further analysis.
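The analyst's steps can be illustrated with a short Python sketch that uses only the standard library. It makes several assumptions that are not part of the text above: missing values are stored as None, the median is chosen as the fill method (one option among many), min-max scaling is chosen as the normalization, and the column names are hypothetical.

```python
import statistics


def remove_duplicates(rows):
    """Drop exact duplicate rows, keeping the first occurrence."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def fill_missing_with_median(rows, column):
    """Replace None in `column` with the median of the known values."""
    known = [row[column] for row in rows if row[column] is not None]
    median = statistics.median(known)
    return [
        {**row, column: median if row[column] is None else row[column]}
        for row in rows
    ]


def min_max_normalize(rows, column):
    """Add a new variable `<column>_norm`, scaled to the range 0..1."""
    values = [row[column] for row in rows]
    low, high = min(values), max(values)
    span = high - low
    return [
        {**row, column + "_norm": (row[column] - low) / span if span else 0.0}
        for row in rows
    ]


# Hypothetical input data, as it might arrive from the client.
raw = [
    {"customer_id": "A1", "order_value": 100.0},
    {"customer_id": "A1", "order_value": 100.0},  # exact duplicate
    {"customer_id": "B2", "order_value": None},   # missing value
    {"customer_id": "C3", "order_value": 300.0},
]

clean = min_max_normalize(
    fill_missing_with_median(remove_duplicates(raw), "order_value"),
    "order_value",
)
for row in clean:
    print(row)
```

Whether the median is a suitable fill value depends on the data. For skewed or categorical columns, another method may be more appropriate.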
***
Handing over data to an analyst
Format: Choose a suitable format for handing over the data (e.g. Excel, CSV).
Documentation: Attach a detailed description of the data, the transformations used, and the meaning of each column.
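Before the handover, it can help to check that every column in the export is described in the accompanying documentation. The sketch below, in Python with the standard library only, writes rows as CSV text and lists the columns that have no description. The column names, the codes, and the structure of the data dictionary are hypothetical and shown purely for illustration.

```python
import csv
import io


def write_csv(rows, fieldnames, target):
    """Write rows (a list of dicts) as CSV with a header line to `target`."""
    writer = csv.DictWriter(target, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)


def undocumented_columns(fieldnames, data_dictionary):
    """Return the columns that have no entry in the data dictionary."""
    return [name for name in fieldnames if name not in data_dictionary]


# Hypothetical data and key; real column names and codes will differ.
fieldnames = ["customer_id", "seg"]
rows = [
    {"customer_id": "A1", "seg": "B"},
    {"customer_id": "C3", "seg": "A"},
]
data_dictionary = {
    "customer_id": "Unique customer identifier",
    "seg": "Customer segment code: A = retail, B = wholesale",
}

# An in-memory buffer stands in for the file that would be handed over.
buffer = io.StringIO()
write_csv(rows, fieldnames, buffer)
csv_text = buffer.getvalue()

print(csv_text)
print("Undocumented columns:", undocumented_columns(fieldnames, data_dictionary))
```

In practice the buffer would be replaced by a file opened with `newline=""`, and the data dictionary would be handed over as a separate document alongside the data.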