Category: Reddit MachineLearning
[D] The German Credit Rating data set: widely used in ML, but no clear source
There’s a commonly-used machine learning data set called about German credit rating. I would ballpark estimate that it’s been used in hundreds of statistics and ML papers, in part due to its availability on the UCI Machine Learning Repository and in various packages, each with different variable encodings.
However, almost all of the versions I can find have missing/incomplete documentation. Many have the “Present residence since” field which takes values in {1, 2, 3, 4}, with no note on what those discretizations mean. It also lacks essential data e.g. when the data was collected and by what means.
Chasing down the citations, it looks like the original data set comes from this paper on CART from 1990:
Hofmann H. J. “Die anwendung des cart-verfahrens zur statistischen bonitatsanalyse von konsumentenkrediten”. Zeitschrift fur Betriebswirtschaft, 60:941–962, 1990
Translated:
Hofmann H. J. “The application of the CART method for statistical credit analysis of consumer credit”. Journal of Business Administration, 60:941–962, 1990
I can’t find that article anywhere. Google Scholar only has citations to it, SpringerLink doesn’t have that volume, my own university’s library only has much older and much newer volumes, and a German library network I searched only had links to some Swiss libraries which in turn linked back to SpringerLink. From the UCI link above, it appears that Dr. Hofmann was affiliated with the University of Hamburg around 1994 with the first name Hans, which led me to this page for a retired professor, though it provides no papers or contact information. There are also notable Hans J Hofmann’s in Chemistry and Anthropology, which complicates the search for this author.
It troubles me that such a commonly-used data set has no clear source. Can anyone find the original publication of this data set, and/or an original version of the data and documentation? The various versions available online (some with different variable encodings!) suggest that comparisons between papers that use this data set could be leading to false conclusions in our field (on top of the issue of so many papers being based off a single test set).
submitted by /u/SoFarFromHome
[link] [comments]
[R] Work Related to Normalizing Unicode using Computer Vision
I’m looking for work related to normalizing Unicode characters to their ASCII equivalent, through the use of computer vision, specifically employing CNN’s. Anyone familiar with work related to this? Thank you!
submitted by /u/hooligan_37
[link] [comments]
[D] Siraj Apologizes and Promises Refunds within 30 days
| |
Here is the twitter thread submitted by /u/permalip |
[D] Machine Learning – WAYR (What Are You Reading) – Week 71
This is a place to share machine learning research papers, journals, and articles that you’re reading this week. If it relates to what you’re researching, by all means elaborate and give us your insight, otherwise it could just be an interesting paper you’ve read.
Please try to provide some insight from your understanding and please don’t post things which are present in wiki.
Preferably you should link the arxiv page (not the PDF, you can easily access the PDF from the summary page but not the other way around) or any other pertinent links.
Previous weeks :
Most upvoted papers two weeks ago:
/u/blueNou_mars: Contrastive Multiview Coding
/u/StellaAthena: Detecting Learning vs Memorization in Deep Neural Networks using Shared Structure Validation Sets
Besides that, there are no rules, have fun.
submitted by /u/ML_WAYR_bot
[link] [comments]
[D] Understanding proof of MaxEnt theorem
| |
I’m reading Brian Ziebart’s work on maximum causal entropy optimization for inverse reinforcement learning. I’m reading through a few of his thesis chapters to get a deeper understanding, but have gotten stuck on one particular proof: the first line of the proof of Theorem 6.10. The theorem follows easily after the first line, but I can’t make sense of the logic behind the first line. In a nutshell, the theorem shows that under a maximum causal entropy distribution, the likelihood of any policy pi increases in proportion to the expected reward (linear in [state, action] features) under that policy. However to prove this, he starts off by writing the P(pi) = Product over all trajectories (A, S) of P_MaxEnt(A, S)^pi(A, S). I do not understand where this equation comes from. It seems strange to me that it is raising maximum entropy distribution probabilities to the power of the policy probabilities. I would greatly appreciate it if anyone could help me understand this. The theorem is from his thesis (pg 210), available here: http://www.cs.cmu.edu/~bziebart/publications/thesis-bziebart.pdf Full theorem and proof included below: submitted by /u/celestialquestrial |
[P] Trying to modify Tweepy parameters
Hey all,
I’m using Tweepy for the first time and am trying my best to follow tutorials online. I’m trying to extract tweets given a certain hashtag and am wondering if it’s possible to filter further. I am trying to:
1) Save tweets with hashtags at the end only.
2) Have less than 4 hashtags in total.
3) Filter out all images and links.
I’m having trouble finding ways to implement the first two parameters. For #3 I’ve used the following and it seems to work: Print (tweet.created_at, re.sub(r”httpS+”, “”, tweet.full_text)) .
Hope someone with more experience can shed light on it for me. I am trying to recreate a paper and for their Twitter Corpus, they followed those guidelines.
submitted by /u/MrMegaGamerz
[link] [comments]
[P] I’ve made User Behavior Prediction for everyone, called Behaiv
| |
Just finished working on java/android library for User Behavior Prediction. It makes it really easy for developers to use it. Essentially that’s just logistic regression, but there are prepared ways to get features. Still really raw, but the concept is working. https://github.com/dmi3coder/behaiv-java I hope such feature could be present in the Evernote app. It’s basically like app suggestion in ios(based on position and time) but targeted for the app itself. Planning to port it to javascript to use it with React/Angular. As well as Swift for ios support So, let me know if idea is good and should I continue working on this project or switch to something more interesting submitted by /u/dmi3coder |
[D] What are your favorite YouTube channels that features advanced research ML talks ?
Hi,
I am trying to collect some YouTube channels to follow, the idea is to find channels that features advanced research ML talks such the following [1], [2], [3].
I noticed that most of the scientific conferences don’t upload their talks such KDD, ICML, ICLR, ACL, NeurIPS except CVPR. where do you guys find these talks? When I search, I find them in several individual channels (talks upload by speakers or some random channels duplicating them from somewhere else)
submitted by /u/__Julia
[link] [comments]
[Discussion], [Project] I need YOUR help to build an NLP classifier
Hi,
It’s my first post on reedit so I would like to say hello.
I am working on my website. However, I would like the main page to be quite unusual. Instead of the menu section, I am going to use an NLP classifier with 3 classes:
-Portfolio
-About Me
-Contact
If the user typed ‘Could you show me some of your past projects’ or ‘Tell me something about you’, the classifier would classify it to one of three categories and then take to the according page.
However, the problem that I am facing is that it’s quite hard to get queries to train the classifier.
Therefore, I would like to ask you to write how would you ask someone for a portfolio, information about them and contact info.
Thanks for your time!
submitted by /u/malinjan
[link] [comments]