[D] If you use pandas: which tasks are the hardest for data cleaning and manipulation?

Hi,

I am obsessed with making data science in Python faster, and many people have told me that data cleaning and manipulation are the most tedious parts of their daily work.

On which exact tasks do you spend (or lose) the most time when cleaning and manipulating data in pandas?

  1. reading in datasets (finding the right separator, data format, …)
  2. adjusting the data types of the columns, e.g. parsing datetimes, converting to numeric or categorical, others?
  3. removing missing values
  4. finding and removing duplicate values
  5. parsing columns and removing invalid strings
  6. concatenating datasets
  7. joining multiple tables
  8. creating groupbys and aggregations
  9. filtering and selecting subsets
  10. creating new columns/feature engineering
  11. visualizing the dataset and exploring it
  12. Something else? Did I miss something?
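
To make a few of these concrete, here is a minimal sketch of tasks 1 to 4 and 8 in plain pandas. The toy data, the column names, and the `clean` helper are all made up for illustration and are not from any real dataset or library; it shows one possible approach, not a recommended pipeline.

```python
import io

import pandas as pd

# Hypothetical toy export with a ';' separator, a duplicate row,
# an unparseable date and an unparseable number.
RAW = """id;signup;amount;plan
1;2021-01-05;10.5;basic
2;2021-02-11;20;pro
2;2021-02-11;20;pro
3;not a date;7;basic
4;2021-03-01;oops;pro
5;2021-03-09;3.5;basic
"""


def clean(raw: str) -> pd.DataFrame:
    """Illustrative helper (assumed name) covering tasks 1 to 4."""
    # Task 1: sep=None with the python engine lets pandas sniff the separator.
    df = pd.read_csv(io.StringIO(raw), sep=None, engine="python")
    # Task 4: drop rows that are exact duplicates.
    df = df.drop_duplicates()
    # Task 2: coerce types; values that cannot be parsed become NaT/NaN.
    df["signup"] = pd.to_datetime(df["signup"], errors="coerce")
    df["amount"] = pd.to_numeric(df["amount"], errors="coerce")
    df["plan"] = df["plan"].astype("category")
    # Task 3: drop rows where a required field is missing after coercion.
    df = df.dropna(subset=["signup", "amount"])
    return df.reset_index(drop=True)


cleaned = clean(RAW)
# Task 8: a simple groupby aggregation on the cleaned frame.
totals = cleaned.groupby("plan", observed=True)["amount"].sum()
```

Even in this small case, each step needs its own call and its own decision about what to do with bad values (here they are silently coerced and dropped), which is where much of the tedium tends to come from.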

I am planning to collect the best libraries for these tasks (or maybe write a library of my own to fill the gaps) in order to make the workflow much faster.

I would be grateful for any input.

Best,

Florian

submitted by /u/kite_and_code