[D] The German Credit Rating data set: widely used in ML, but no clear source

There’s a commonly used machine learning data set on German credit ratings. As a ballpark estimate, it has been used in hundreds of statistics and ML papers, in part because it is available on the UCI Machine Learning Repository and in various packages, each with different variable encodings.

However, almost all of the versions I can find have missing or incomplete documentation. Many include a “Present residence since” field that takes values in {1, 2, 3, 4}, with no note on what those discretized values mean. The documentation also lacks essential information, e.g. when the data was collected and by what means.

Chasing down the citations, it looks like the original data set comes from this paper on CART from 1990:

Hofmann H. J. “Die Anwendung des CART-Verfahrens zur statistischen Bonitätsanalyse von Konsumentenkrediten”. Zeitschrift für Betriebswirtschaft, 60:941–962, 1990

Translated:

Hofmann H. J. “The application of the CART method to the statistical creditworthiness analysis of consumer loans”. Journal of Business Administration, 60:941–962, 1990

I can’t find that article anywhere. Google Scholar only has citations to it, SpringerLink doesn’t have that volume, my own university’s library only has much older and much newer volumes, and a German library network I searched only linked to some Swiss libraries, which in turn linked back to SpringerLink. From the UCI link above, it appears that Dr. Hofmann was affiliated with the University of Hamburg around 1994 and that his first name is Hans, which led me to this page for a retired professor, though it provides no papers or contact information. There are also notable Hans J. Hofmanns in chemistry and anthropology, which complicates the search for this author.

It troubles me that such a commonly used data set has no clear source. Can anyone find the original publication of this data set, and/or an original version of the data and documentation? The various versions available online (some with different variable encodings!) suggest that comparisons between papers that use this data set could be leading to false conclusions in our field (on top of the issue of so many papers being based on a single test set).
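For anyone who wants to check whether two copies they downloaded really differ in encoding, here is a rough sketch of one way to do it. It assumes both copies are CSV text with a header row, and it simply compares the value counts in each column. The function names, column names, and toy rows are all made up for illustration and are not taken from any real copy of the data set.

```python
import csv
import io
from collections import Counter


def column_profiles(csv_text):
    """Map each column name to a Counter of the values seen in that column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    profiles = {}
    for row in reader:
        for name, value in row.items():
            profiles.setdefault(name, Counter())[value] += 1
    return profiles


def compare_versions(text_a, text_b):
    """Report columns whose names or value counts differ between two copies."""
    a, b = column_profiles(text_a), column_profiles(text_b)
    shared = set(a) & set(b)
    return {
        "only_in_a": sorted(set(a) - set(b)),
        "only_in_b": sorted(set(b) - set(a)),
        "different_values": sorted(name for name in shared if a[name] != b[name]),
    }


# Hypothetical toy rows (invented, NOT real records): the same three
# applicants, with the "purpose" column encoded as text in one copy and as
# opaque codes in the other.
version_a = "residence_since,purpose\n1,car\n4,radio\n2,car\n"
version_b = "residence_since,purpose\n1,c0\n4,c1\n2,c0\n"

print(compare_versions(version_a, version_b))
```

A column flagged under `different_values` only tells you the two copies disagree on that field. It can't tell you which encoding is the original one, which is exactly why the missing documentation matters.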

submitted by /u/SoFarFromHome