The project has been split into different sections for discussion of the data collection, approach, analysis, interesting results, and more:
In general, communities form around common interests and bring people together. Online communities take many shapes and forms in today's interconnected world. More specifically, communities allow people to interact in social groups and contribute to shared experiences.
Developer communities are no different: groups of like-minded individuals come together to learn, ask questions, debug, and build programs for a variety of purposes, using a variety of languages. These communities exist on sites such as Reddit, StackOverflow, and Google Groups (to name a few). Developer and programming communities can take a macro approach (focusing on what good programming looks like, accessibility, and security across programming concepts in general) or a micro approach (usually focused on a specific language). These micro communities have interesting nuances:
For this project, we explored the interactions between different programming communities on one particular platform: Reddit. We wanted to learn more about the micro communities and what they can tell us about online developer and programming communities. We also wanted to identify influencers and topics of interest in selected subreddits and to find similarities between those subreddits. To that end, our research questions were:
We collected data from the 14 subreddits listed below using the Pushshift API. In total, we collected 312,851 records covering the date range 1/1/2021 - 6/30/2021.
C_Programming, COBOL, Fortran, Haskell, HTML, HTML5, JavaScript, LaTeX, learnjava, learnpython, LISP, MATLAB, perl, Rlanguage
Next, we preprocessed the data by removing any records that were missing or incomplete (for example, records with “deleted” authors) and by splitting the records into separate sheets of nodes and edges so that we could run network analyses.
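The preprocessing step can be sketched roughly as follows. This is a minimal illustration, not the project's actual pipeline: the record fields (`author`, `subreddit`, `body`), the `[deleted]` marker, and the choice of user-to-subreddit edges weighted by activity count are all assumptions made for the example.

```python
from collections import Counter

# Hypothetical records: one dict per post or comment (field names are assumed).
records = [
    {"author": "alice", "subreddit": "learnpython", "body": "How do I read a CSV?"},
    {"author": "[deleted]", "subreddit": "learnpython", "body": "removed"},
    {"author": "alice", "subreddit": "javascript", "body": "Is fetch async?"},
    {"author": "bob", "subreddit": "learnpython", "body": None},
    {"author": "carol", "subreddit": "COBOL", "body": "Copybook question"},
]

REQUIRED_FIELDS = ("author", "subreddit", "body")


def clean(records):
    """Drop records with a missing required field or a deleted author."""
    kept = []
    for rec in records:
        if any(rec.get(field) in (None, "") for field in REQUIRED_FIELDS):
            continue  # incomplete record
        if rec["author"] == "[deleted]":  # assumed marker for deleted authors
            continue
        kept.append(rec)
    return kept


def build_graph_tables(records):
    """Return (nodes, edges) where edges maps (user, subreddit) to a count."""
    edges = Counter((rec["author"], rec["subreddit"]) for rec in records)
    nodes = set()
    for user, subreddit in edges:
        nodes.add(("user", user))
        nodes.add(("subreddit", subreddit))
    return sorted(nodes), edges


cleaned = clean(records)
nodes, edges = build_graph_tables(cleaned)
```

Here `nodes` and `edges` stand in for the two sheets described above; each could be written out with the `csv` module for use in a graph tool.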
We focused on aggregated analysis rather than temporal analysis for the project. The initial batch of languages was based on languages we found in syllabi at George Mason University; beyond that, languages were picked somewhat arbitrarily, based on interest. We also made sure to include languages across the spectrum from new to mature. We selected the most active subreddit for each language and excluded “dead” languages (e.g., Algol, APL).
Below is the final network graph that was created for all the networks.
We initially wanted to see what else we could learn about our users, mainly through centrality measures and PageRank. We spent some time identifying each community within our map and labeling it. As the summary table of subreddits shows, the Python and JavaScript subreddits account for a large number of interactions between users.
Within each community, there are concentric circles moving outwards from the center. Users in the center interact with that subreddit often, and those toward the outside do so less. Users right on the edge of the circle often have only one post or comment (extremely low engagement).
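The engagement pattern described above can be checked numerically by tallying activity per user within one subreddit. The following is only a sketch: the input is assumed to be a flat list of author names, one entry per post or comment, and the function name is hypothetical.

```python
from collections import Counter


def engagement_summary(authors):
    """Summarize per-user activity for one subreddit.

    `authors` is assumed to hold one entry per post or comment.
    """
    counts = Counter(authors)
    single = sum(1 for n in counts.values() if n == 1)
    return {
        "users": len(counts),
        "single_activity_users": single,
        # Share of users who would sit on the outer edge of the circle.
        "single_share": single / len(counts) if counts else 0.0,
        "most_active": counts.most_common(3),
    }


# Made-up sample data for illustration only.
sample_authors = ["ann", "ann", "ann", "ben", "cat", "ann", "dan", "dan"]
summary = engagement_summary(sample_authors)
```

Users with a count of one correspond to the outermost ring, while the entries in `most_active` would sit near the center.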
Subreddit | Created Date | # of Subs (Jul 2021) | Daily Comments (Jul 2021) | Total Posts (01/21 - 06/21) | Users X-posting (01/21 - 06/21) |
---|---|---|---|---|---|
C_Programming | 3/27/2008 | 112,000 | 78 | 4,278 | 727 |
COBOL | 6/5/2009 | 2,294 | 3 | 133 | 36 |
Fortran | 9/29/2009 | 5,989 | 11 | 264 | 99 |
Haskell | 1/25/2008 | 67,417 | 65 | 2,068 | 320 |
HTML | 9/5/2009 | 34,545 | 4 | 1,284 | 271 |
HTML5 | 9/22/2009 | 39,348 | 1 | 524 | 203 |
JavaScript | 1/25/2008 | 1,720,113 | 57 | 7,008 | 618 |
LaTeX | 3/4/2008 | 35,613 | 5 | 1,706 | 388 |
learnjava | 1/27/2011 | 116,403 | 53 | 3,341 | 528 |
learnpython | 10/2/2009 | 571,031 | 525 | 24,213 | 1,360 |
LISP | 1/25/2008 | 33,291 | 43 | 824 | 182 |
MATLAB | 8/15/2009 | 43,046 | 18 | 2,095 | 301 |
perl | 1/25/2008 | 15,149 | 7 | 501 | 76 |
Rlanguage | 2/11/2011 | 26,494 | 8 | 1,389 | 251 |
We realized how much the size of each subreddit mattered to the conversation. The number of posts and subscribers differs considerably between the subreddits. The table above shows the number of subscribers, comments per day, total posts within the date range of our dataset, and the number of users who cross-posted to another subreddit. JavaScript was by far the largest, and COBOL the smallest, in terms of subscribers. Additionally, the Daily Comments column shows that users commented more in the Python subreddit every day, on average, than in any of the other subreddits.
From the network analysis we conducted, Python also had the largest number of posts and cross-posts. Having established that users were cross-posting across subreddits, our next step was to figure out which other subreddits the users of each subreddit cross-posted to most.
One of our hypotheses was that cross-posting between subreddits would follow the nature of the language. For example, HTML and JavaScript are often used together for front-end web design, so it would make sense for those subreddits to share users.
However, what we discovered was that Python seems to be the strongest link between all the subreddits: it has a high PageRank score on our graph and also the largest number of users bridging their specific subreddit and the Python subreddit.
Language | Top 1 | Top 2 | Top 3 |
---|---|---|---|
C_Programming | learnpython | javascript | learnjava |
COBOL | learnpython | C_Programming | javascript |
Fortran | C_Programming | LaTeX | learnpython |
Haskell | C_Programming | lisp | learnpython |
HTML | learnpython | javascript | html_5 |
HTML5 | javascript | HTML | learnpython |
JavaScript | learnpython | C_Programming | html_5 |
LaTeX | learnpython | matlab | C_Programming |
learnjava | learnpython | C_Programming | javascript |
learnpython | learnjava | C_Programming | javascript |
LISP | haskell | C_Programming | learnpython |
MATLAB | learnpython | LaTeX | C_Programming |
perl | C_Programming | learnpython | javascript |
Rlanguage | learnpython | LaTeX | matlab |
We found some interesting data to back up the claim of Python's predominance. The table above shows the top 3 other subreddits that users from a specific subreddit posted to. For example, users of C_Programming, the first row, cross-posted most to learnpython, then javascript, and finally learnjava.
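One way to derive a ranking like the one in the table is to count, for each subreddit, how many of its users were also active in each other subreddit. The sketch below assumes the input is a list of `(user, subreddit)` pairs and that "top" means the most shared users; the write-up does not state the exact measure that was used, and the sample data is made up.

```python
from collections import Counter, defaultdict


def top_cross_posted(user_subreddit_pairs, k=3):
    """For each subreddit, rank the other subreddits by number of shared users."""
    subs_by_user = defaultdict(set)
    for user, subreddit in user_subreddit_pairs:
        subs_by_user[user].add(subreddit)

    shared = defaultdict(Counter)
    for subs in subs_by_user.values():
        for home in subs:
            for other in subs:
                if other != home:
                    shared[home][other] += 1  # one shared user

    return {
        home: [name for name, _ in counts.most_common(k)]
        for home, counts in shared.items()
    }


# Illustrative pairs only, not taken from the dataset.
pairs = [
    ("u1", "COBOL"), ("u1", "learnpython"),
    ("u2", "COBOL"), ("u2", "learnpython"), ("u2", "C_Programming"),
    ("u3", "learnpython"), ("u3", "javascript"),
]
ranking = top_cross_posted(pairs)
```

Users who posted in only one subreddit contribute nothing to the counts, which matches the idea of looking only at cross-posters.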
The users in 8 of our 14 subreddits cross-posted most to Python. In fact, Python ranked within the top 3 for every other subreddit we analyzed. This is interesting, and our hypothesis, based on reviewing our data, is that it reflects how ubiquitous Python has become in programming today. It is taught in schools, bootcamps, and self-learning courses; it has a gentler learning curve than many other languages; it is used in industry; and it interfaces well with other languages.
On closer analysis, we found that much of the discussion in the subreddits for more mature languages, such as COBOL, Fortran, and LISP, involved troubleshooting how to get Python functionality to interface with legacy code or hardware, in addition to questions about language migration.
We calculated several centrality measures for the users in the network, including betweenness, closeness, and PageRank. Several of our top influencers had interesting profiles. Overall, they interact with other users often, helping them get their code working correctly and pointing them to additional resources they might not otherwise have found. Some of these top influencers also link to content on other sites, so there is some cross-posting across sites as well.
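To make the PageRank measure concrete, here is a small power-iteration sketch in plain Python. The write-up does not say which tool produced the scores, so this is an illustration rather than the method used; the damping factor of 0.85, the toy reply graph, and the edge direction (from a commenter to the user being replied to) are all assumptions. Betweenness and closeness are not shown.

```python
def pagerank(adjacency, damping=0.85, iterations=200, tol=1e-10):
    """Compute PageRank for a directed graph.

    `adjacency` maps each node to the list of nodes it links to.
    """
    nodes = set(adjacency)
    for targets in adjacency.values():
        nodes.update(targets)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}

    for _ in range(iterations):
        # Rank held by nodes with no outgoing links is spread evenly.
        dangling = sum(rank[v] for v in nodes if not adjacency.get(v))
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {v: base for v in nodes}
        for src, targets in adjacency.items():
            if targets:
                share = damping * rank[src] / len(targets)
                for dst in targets:
                    new_rank[dst] += share
        delta = sum(abs(new_rank[v] - rank[v]) for v in nodes)
        rank = new_rank
        if delta < tol:
            break
    return rank


# Hypothetical reply graph: each edge points at the user being replied to.
replies = {"a": ["helper"], "b": ["helper"], "c": ["helper", "a"]}
scores = pagerank(replies)
top_user = max(scores, key=scores.get)
```

In this toy graph the user everyone replies to ends up with the highest score, which is the same intuition behind treating high-PageRank users as influencers.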
We found one user with extremely high betweenness and PageRank scores, a high closeness score, and the largest degree of all our users. This user, as it turned out, was actually a bot known as AutoModerator, which subreddits often use as a management tool. Its purpose of spreading shared community guidelines and rules across multiple communities seems to work, and if we needed to spread a contagion-type message across all 14 subreddits, this user would be our pick.
For our topic modelling attempt, we used Latent Dirichlet Allocation (LDA) to create a fixed number of bins and sort the text into general, grouped topics. We expected about 4 topics, based on trial and error and empirical observation of several subreddits: Tech Help, General Language, Jobs and Advice, and Off-Topic. Because LDA is unsupervised, we decided that it might be better at finding hidden subjects or other patterns that we would not have tagged manually under a different approach.
Overall, it is difficult to generalize subreddits, even within the same domain.