[D] If you use pandas: which tasks are the hardest for data cleaning and manipulation?

Written by torontoai on May 28, 2019. Posted in Reddit MachineLearning.

Hi,

I am obsessed with making data science in Python faster, and many people have told me that data cleaning and manipulation are the most tedious parts of their daily work.

On which exact tasks do you spend (or lose) the most time when cleaning and manipulating data in pandas?

reading in datasets (finding the right separator, data format, …)
adjusting the data types of the columns – e.g. parsing datetimes, converting to numeric or categorical, others?
removing missing values
finding and removing duplicate values
parsing columns and removing invalid strings?
concatenating datasets
joining multiple tables
creating groupbys and aggregations
filtering and selecting subsets
creating new columns/feature engineering
visualizing the dataset and exploring it
Something else? Did I miss something?
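
To make the question concrete, here is a minimal sketch of what a few of these steps might look like on a toy dataset. The data, the column names (`id`, `signup`, `amount`, `plan`) and the `clean` helper are made up purely for illustration; real datasets will need different choices at almost every step.

```python
import io

import pandas as pd

# Hypothetical raw export: semicolon-separated, with a duplicated row,
# an invalid numeric string ("unknown") and a missing signup date.
RAW = """id;signup;amount;plan
1;2019-01-05;10.5;basic
2;2019-02-11;unknown;pro
2;2019-02-11;unknown;pro
3;2019-03-20;7.25;basic
4;;3.0;pro
"""


def clean(raw: str) -> pd.DataFrame:
    """Read, deduplicate, fix dtypes, drop incomplete rows, add a feature."""
    return (
        # Reading: the separator has to be known (or guessed) up front.
        pd.read_csv(io.StringIO(raw), sep=";")
        # Duplicates: drop rows that are identical in every column.
        .drop_duplicates()
        # Data types: values that cannot be parsed become NaT / NaN.
        .assign(
            signup=lambda d: pd.to_datetime(d["signup"], errors="coerce"),
            amount=lambda d: pd.to_numeric(d["amount"], errors="coerce"),
            plan=lambda d: d["plan"].astype("category"),
        )
        # Missing values: here we simply drop incomplete rows.
        .dropna(subset=["signup", "amount"])
        .reset_index(drop=True)
        # New column / feature engineering: month of signup.
        .assign(signup_month=lambda d: d["signup"].dt.month)
    )


cleaned = clean(RAW)

# Groupby and aggregation: total amount per plan, observed categories only.
per_plan = cleaned.groupby("plan", observed=True)["amount"].sum()

print(cleaned)
print(per_plan)
```

Even in this tiny case, each step hides a decision (coerce or raise on bad values, drop or impute missing rows), which is where I suspect much of the time goes.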

I am planning to collect the best libraries for these tasks (or maybe write a library of my own to fill the gaps) in order to make the whole workflow much faster.

I would be grateful for any input.

Best,

Florian

submitted by /u/kite_and_code
