[D] If you use pandas: which tasks are the hardest for data cleaning and manipulation?
Hi,
I am obsessed with making data science in Python faster, and many people have told me that data cleaning and manipulation are the most tedious tasks in their daily work.
On which exact tasks do you spend (or lose) most of your time when cleaning and manipulating data in pandas?
- reading in datasets (finding the right separator, data format, …)
- adjusting the data types of the columns, e.g. parsing datetimes, converting to numeric or categorical, others?
- removing missing values
- finding and removing duplicate values
- parsing columns and removing invalid strings
- concatenating datasets
- joining multiple tables
- creating groupbys and aggregations
- filtering and selecting subsets
- creating new columns/feature engineering
- visualizing the dataset and exploring it
- Something else? Did I miss something?
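To make the list a bit more concrete, here is a minimal sketch of a few of these steps (reading, type conversion, duplicates, missing values) on a tiny made-up dataset. The column names and values are hypothetical and only there for illustration; real data will usually need more care.

```python
import io

import pandas as pd

# Hypothetical semicolon-separated data, invented for this example.
raw = io.StringIO(
    "id;signup;amount\n"
    "1;2020-01-05;10.5\n"
    "2;2020-02-10;unknown\n"
    "2;2020-02-10;unknown\n"
    "3;not a date;3\n"
    "4;2020-03-01;7\n"
)

# Reading: the separator has to be known (or guessed) up front.
df = pd.read_csv(raw, sep=";")

# Adjusting data types: values that cannot be parsed become NaT / NaN.
df["signup"] = pd.to_datetime(df["signup"], errors="coerce")
df["amount"] = pd.to_numeric(df["amount"], errors="coerce")

# Removing exact duplicate rows, then rows with missing values.
cleaned = df.drop_duplicates().dropna(subset=["signup", "amount"])

print(cleaned)
```

Even in this toy case each step needs a decision (which separator, what to do with unparseable values, which columns matter for missing data), which is where I suspect most of the time goes.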
I am planning to collect the best libraries for these tasks (or maybe write a library of my own to fill the gaps) in order to make the workflow much faster.
I would be grateful for any input.
Best,
Florian
submitted by /u/kite_and_code