Last year on Valentine's Day, I did a casual analysis of the state of Coffee Meets Bagel (or CMB) and the clichés and trends I saw in the online profiles girls had written (published on a different site). However, I didn't have hard facts to back up what I saw, just anecdotal musings and common words I noticed while looking through the hundreds of profiles presented. This year, I have data to back my observations up, and we're going to dive into it.
Data Mining (or Technical Details for What I Did)
First, I had to find a way to get the text data out of the mobile app. The network data and local cache are encrypted, so instead I took screenshots and ran them through OCR to get the text. I did a few manually to see if it would work, and it worked well, but going through hundreds of profiles and manually copying text into a Google sheet would be tedious, so I had to automate this.
Android has a nice automation API called MonkeyRunner and an open source Python version called AndroidViewClient, which allowed full access to the Python libraries I already had. I spent a day coding the script, and using Python, PIL, and PyTesseract, I managed to comb through all of the profiles in under an hour. All of this was imported into a Google sheet, then downloaded to a Jupyter notebook where I ran more Python scripts using Pandas, NLTK, and Seaborn to filter through the data and generate the graphs below.
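For readers curious about the OCR step, here is a rough sketch of how a single screenshot could be turned into text with PIL and PyTesseract. It is not the actual script: the function names `clean_ocr_text` and `ocr_screenshot` are made up for illustration, and the screenshot capture through AndroidViewClient is left out entirely.

```python
import re


def clean_ocr_text(raw: str) -> str:
    """Collapse the line breaks and stray whitespace OCR leaves in the text."""
    return re.sub(r"\s+", " ", raw).strip()


def ocr_screenshot(path: str) -> str:
    """Run Tesseract over one saved screenshot and return the cleaned text.

    The imports are deferred so clean_ocr_text can be used on its own.
    Requires Pillow, pytesseract, and the Tesseract binary to be installed.
    """
    from PIL import Image
    import pytesseract

    with Image.open(path) as img:
        return clean_ocr_text(pytesseract.image_to_string(img))
```

Calling `ocr_screenshot` on each saved screenshot path would give one string per profile, ready to be written out as a row in a sheet or CSV.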
In total, I collected text from 2025 profiles.
The data from CMB is skewed in favor of the user's own profile, so the data I mined from the profiles I saw is skewed toward my preferences and doesn't represent all profiles. However, even with this bias, you can already see trends in how girls write their profiles. The data you're seeing comes from my profile: an Asian male in his 30s living in the Seattle area.
Number of Profiles per Day
The way CMB works is that every day at noon, you get a new profile to view that you can either pass or like. You can only talk to people if there's a mutual like. Sometimes, you get a bonus profile or two (or four) to view. That used to be the case, but around July 2016, they relaxed that policy to show up to 21 profiles per day, as you can see from the sudden spike. The flat lines around March 2016 and Sept 2016 are when I deactivated the app to take a break, so there are some data points I missed since I didn't get any profiles during that time. Of the profiles seen, about 9.4% had empty sections or were incomplete.
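The per-day counts and the share of empty profiles are simple tallies. A minimal sketch, assuming each profile is stored as a (date shown, OCR'd text) pair; the sample records and the helper names `profiles_per_day` and `empty_share` are hypothetical:

```python
from collections import Counter
from datetime import date

# Hypothetical records: (date the profile was shown, OCR'd profile text).
profiles = [
    (date(2016, 7, 1), "Love hiking and coffee"),
    (date(2016, 7, 1), ""),
    (date(2016, 7, 2), "Foodie, traveler, dog person"),
    (date(2016, 7, 2), "Looking for my partner in crime"),
    (date(2016, 7, 2), "   "),
]


def profiles_per_day(records):
    """Count how many profiles were shown on each date."""
    return Counter(day for day, _ in records)


def empty_share(records):
    """Fraction of profiles whose text is empty or only whitespace."""
    if not records:
        return 0.0
    empty = sum(1 for _, text in records if not text.strip())
    return empty / len(records)
```

Days on which the app was deactivated simply never appear in the counter, which is why they show up as flat stretches in a plot over time.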
Since the app shows profiles tailored to my own, the age grouping is pretty reasonable. However, I've noticed that a few profiles list the wrong age, whether intentionally or accidentally. Usually they say so in the profile, noting that their actual age differs from the one listed. It's either someone young trying to be older (an 18 year old listing themselves as 23) or someone older listing themselves as younger (a 39 year old listing themselves as 36). These are rare cases compared to the total number of profiles.
Profile length was an interesting data point. Since this is a mobile phone app, people won't be typing away too much (and trying to write a full essay with their UI is difficult because it wasn't designed for long text). The average number of words girls wrote was 47.5 with a standard deviation of 32.1. If we drop any rows that contain empty sections, the average number of words is 49.7 with a standard deviation of 31.6, so not much of a difference. There's a significant number of people with 10 words or fewer written (9%). A rare few wrote in only emoji or used emoji in 75% of their profile. A couple wrote their profile in Chinese. In both of these cases, the OCR returned the text as one ASCII mess of a word, because it was just a blob to the text recognition.
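The word-count figures above boil down to a mean and a standard deviation, with and without the empty rows. Here is a small sketch using only the standard library; the sample texts and the helper names are hypothetical, each profile is treated as one text field rather than separate sections, and it uses the sample standard deviation, which is an assumption about how the original numbers were computed.

```python
from statistics import mean, stdev


def word_count(text):
    """Count whitespace-separated words in one profile's text."""
    return len(text.split())


def summarize(texts, drop_empty=False):
    """Return (mean, sample standard deviation) of the word counts."""
    counts = [word_count(t) for t in texts]
    if drop_empty:
        counts = [c for c in counts if c > 0]
    return mean(counts), stdev(counts)


def short_share(texts, limit=10):
    """Fraction of profiles with `limit` words or fewer."""
    return sum(1 for t in texts if word_count(t) <= limit) / len(texts)


# Hypothetical sample, including one empty profile.
texts = ["", "coffee and hiking", "I like long walks on the beach", "foodie"]
avg_all, sd_all = summarize(texts)
avg_filled, sd_filled = summarize(texts, drop_empty=True)
```

Note that an emoji-only or Chinese profile that OCR collapses into a single garbled token would count as one word here, which pushes it into the "10 words or fewer" bucket.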
Ethnicity vs Word Count
A small note on how CMB does ethnicity. Users can select multiple ethnicities for themselves, which shows up as "White/Caucasian, Pacific Islander, Asian," with each ethnicity separated by a comma. However, from what I can tell, this doesn't happen often, so I graphed based on the Primary Ethnicity, which I designated as the first listed ethnicity. I did make another graph with all the different ethnicities together, but it made such a small difference in the graphs that it wasn't worth the effort to parse the data that way. Only 6% of the profiles had mixed ethnicity listed.
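Pulling out the Primary Ethnicity as described is just a matter of splitting the field on commas and taking the first entry. A minimal sketch; the function names and the sample list are hypothetical:

```python
def split_ethnicities(field):
    """Split a comma-separated ethnicity field into clean labels."""
    return [part.strip() for part in field.split(",") if part.strip()]


def primary_ethnicity(field):
    """Return the first listed ethnicity, or None if the field is empty."""
    parts = split_ethnicities(field)
    return parts[0] if parts else None


def mixed_share(fields):
    """Fraction of profiles that list more than one ethnicity."""
    mixed = sum(1 for f in fields if len(split_ethnicities(f)) > 1)
    return mixed / len(fields)


# Hypothetical sample of ethnicity fields as they might come out of OCR.
fields = ["Asian", "White/Caucasian, Pacific Islander, Asian", "Asian", "Pacific Islander"]
```

Splitting on the comma rather than the slash keeps labels like "White/Caucasian" intact as a single ethnicity.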
The graphs comparing word count to ethnicity show that people hover around the average with a small standard deviation. The lines (in the first graph) for a few of the groups are large because of their small sample size, so there's a larger standard deviation.
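To see why small groups get longer lines, it helps to compute the per-group statistics directly. The sketch below assumes the lines are error bars around each group's mean (the post does not say exactly how they were drawn) and uses the standard error of the mean, which is the standard deviation divided by the square root of the group size. The rows and the `group_stats` helper are hypothetical.

```python
from collections import defaultdict
from math import sqrt
from statistics import mean, stdev


def group_stats(rows):
    """Compute n, mean, sample std, and standard error per primary ethnicity."""
    groups = defaultdict(list)
    for ethnicity, words in rows:
        groups[ethnicity].append(words)

    stats = {}
    for ethnicity, counts in groups.items():
        n = len(counts)
        # A sample standard deviation needs at least two data points.
        sd = stdev(counts) if n > 1 else float("nan")
        stats[ethnicity] = {"n": n, "mean": mean(counts), "std": sd, "sem": sd / sqrt(n)}
    return stats


# Hypothetical (primary ethnicity, word count) rows.
rows = [
    ("Asian", 40), ("Asian", 50), ("Asian", 60), ("Asian", 50),
    ("Pacific Islander", 40), ("Pacific Islander", 60),
]
stats = group_stats(rows)
```

In this made-up sample both groups have the same mean, but the group with only two profiles ends up with a much larger standard error, which is the effect visible in the graph.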