[D] Multi-level data, what is the best approach?
I’m working on a dataset and having some problems. I hope you can give me your insight.
My objective is to predict customer churn based on incidents. Each incident belongs to a contract, and each contract belongs to a client. The target is whether a contract gets terminated. The features fall into 3 categories:
Client: client’s ID and some basic information about them
Contract: the contract’s ID, contract-specific information, and the target (‘In service’/‘Terminated’)
Incidents: every entry is an incident related to a contract, with information like number of calls, creation date, date of last change, and incident category
Some clients have up to 10 contracts, some contracts have up to 20 incidents.
What I did was create a fresh table with one row per contract (plus the client’s information), and I now have to add relevant information to every contract.
I couldn’t help but find myself cherry-picking some ‘relevant’ information, like total incidents for the contract, total calls, and the last incident’s full information, plus higher-level features like the number of contracts the client has, how many of them are terminated, and total incidents for the client.
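To make this concrete, here is a minimal sketch of that kind of aggregation in pandas. The toy tables and every column name (`client_id`, `contract_id`, `n_calls`, `created`, `status`, and so on) are made up for illustration and are not my real schema.

```python
import pandas as pd

# Hypothetical toy tables; all names and values are invented for illustration.
clients = pd.DataFrame({"client_id": [1, 2], "segment": ["A", "B"]})
contracts = pd.DataFrame({
    "contract_id": [10, 11, 20],
    "client_id": [1, 1, 2],
    "status": ["In service", "Terminated", "In service"],
})
incidents = pd.DataFrame({
    "incident_id": [100, 101, 102, 103],
    "contract_id": [10, 10, 11, 20],
    "n_calls": [2, 1, 4, 3],
    "created": pd.to_datetime(
        ["2021-01-05", "2021-03-01", "2021-02-10", "2021-04-20"]
    ),
})

# Contract-level aggregates of the incident table.
inc_agg = incidents.groupby("contract_id").agg(
    n_incidents=("incident_id", "count"),
    total_calls=("n_calls", "sum"),
    last_incident=("created", "max"),
).reset_index()

# Client-level aggregates of the contract table.
cli_agg = contracts.groupby("client_id").agg(
    n_contracts=("contract_id", "count"),
    n_terminated=("status", lambda s: (s == "Terminated").sum()),
).reset_index()

# Client-level incident count: map each incident to its client first.
cli_inc = (
    incidents.merge(contracts[["contract_id", "client_id"]], on="contract_id")
    .groupby("client_id")
    .size()
    .rename("client_incidents")
    .reset_index()
)

# One row per contract: contract + client info + all aggregates.
features = (
    contracts
    .merge(clients, on="client_id", how="left")
    .merge(inc_agg, on="contract_id", how="left")
    .merge(cli_agg, on="client_id", how="left")
    .merge(cli_inc, on="client_id", how="left")
)

# n_terminated counts the contract's own status too, which would leak the
# target, so keep only the *other* terminated contracts of the same client.
features["n_other_terminated"] = (
    features["n_terminated"] - (features["status"] == "Terminated").astype(int)
)
features = features.drop(columns="n_terminated")
```

Every aggregate is a hand-picked summary (count, sum, max), which is exactly where the information gets lost: the order of incidents and their individual details do not survive the `groupby`.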
I feel it’s getting very messy, and I’m still losing A LOT of information by doing this. Is this the only approach I have?
This was supposed to be a machine learning problem, but so far there’s no machine learning in it at all; it’s pure data wrangling and feature engineering.