[P] Strategies to improve data extraction from semi-structured documents (SEC filings)
I hope I am posting in the right place. If not, I’ll take the post down.
I am a university researcher working on a project that involves matching names with biographical data from SEC filings. I downloaded all the filings for the organizations I am interested in and wrote a Python script that finds officers’ names in each document and then looks for gender, age, education, and job title. It is tricky because companies use different formats for these documents, so you have to anticipate the different ways the data can be presented.
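To give a flavor of the format-variation problem: even the heading of the officers section comes in several variants, so one of the things I do is fuzzy-match candidate heading lines against known variants. A simplified sketch using the standard library’s `difflib` (the header list here is illustrative, not my real list):

```python
import difflib

# Illustrative variants of the officers-section heading (not exhaustive)
CANONICAL_HEADERS = [
    "Executive Officers",
    "Directors and Executive Officers",
    "Information about our Executive Officers",
]

def matches_officer_header(line, cutoff=0.8):
    """Fuzzy-match a line against known heading variants, tolerating
    typos and small wording differences."""
    return bool(difflib.get_close_matches(line.strip(), CANONICAL_HEADERS,
                                          n=1, cutoff=cutoff))
```

With a cutoff around 0.8, a typo like “Executive Oficers” still matches, while unrelated section titles are rejected.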
I use fuzzy string matching to account for differences in spelling, typos, and different ways of naming tables. But most of the “learning” came from me manually tuning the script. Unfortunately, the script has to do a lot of safety checks to avoid outputting gibberish data (e.g. mistaking a list of company names for a list of people). Finding age is also very tricky, as you often have to parse sentences to search for patterns such as “Mr. ABC, age 56” or “Mrs. Jean B. XYZ, MD, 46.”
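For concreteness, the age-sentence patterns I mean look roughly like this (a heavily simplified sketch; the real script handles many more variants and safety checks):

```python
import re

# Simplified versions of the two patterns from the examples above
AGE_PATTERNS = [
    # "Mr. ABC, age 56"
    re.compile(r"\b(?:Mr|Mrs|Ms|Dr)\.\s+(?P<name>[A-Z][\w.\s]*?),\s+age\s+(?P<age>\d{2})\b"),
    # "Mrs. Jean B. XYZ, MD, 46." -- optional credential between name and age
    re.compile(r"\b(?:Mr|Mrs|Ms|Dr)\.\s+(?P<name>[A-Z][\w.\s]*?)(?:,\s+[A-Z]{2,4})?,\s+(?P<age>\d{2})\b"),
]

def find_age(sentence):
    """Return (name, age) for the first pattern that matches, else None."""
    for pattern in AGE_PATTERNS:
        m = pattern.search(sentence)
        if m:
            age = int(m.group("age"))
            if 21 <= age <= 100:  # sanity check: plausible executive age
                return m.group("name").strip(), age
    return None
```

The range check at the end is one of the safety checks I mentioned: it stops a stray two-digit number (a year fragment, a footnote marker) from being reported as someone’s age.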
My script works well and outputs the data that I want. The main issue is that it takes a good 30 seconds to process a single company (around 5-20 executives per company). The reason, I think, is that my script tests a lot of different possibilities, even ones that are not applicable to the document.
I am sure I am not the only one working on extracting data from semi-structured documents, so I wanted to ask for feedback on strategies I could implement to improve my script. I would be particularly interested in methods that track the performance of each parsing strategy, so that the computer does not waste time on a method that doesn’t work well.
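To make the question concrete, here is roughly the kind of thing I am imagining (the class and names are made up, not something I have built): keep per-strategy success counts and try strategies in order of their historical hit rate, so the most productive format is attempted first and a first hit short-circuits the rest.

```python
from collections import defaultdict

class StrategyRanker:
    """Try extraction strategies in order of historical success rate
    (hypothetical sketch, not my actual code)."""

    def __init__(self, strategies):
        self.strategies = list(strategies)  # callables: text -> result or None
        self.attempts = defaultdict(int)
        self.successes = defaultdict(int)

    def _rate(self, strategy):
        # Laplace smoothing so untried strategies still get a chance
        name = strategy.__name__
        return (self.successes[name] + 1) / (self.attempts[name] + 2)

    def extract(self, text):
        # Best-performing strategies run first; stop at the first hit
        for strategy in sorted(self.strategies, key=self._rate, reverse=True):
            self.attempts[strategy.__name__] += 1
            result = strategy(text)
            if result is not None:
                self.successes[strategy.__name__] += 1
                return result
        return None
```

Over many filings the ordering should converge toward trying the common formats first, and strategies that never fire stop costing anything beyond the occasional exploratory attempt. Is something like this a reasonable approach, or is there a more standard technique?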
Thanks a lot for your help!