Network & Collaborative Behavior Analysis in Developer Communities


An image depicting the final network mapping. 14 concentric circles with lines running between them.

The project has been split into different sections for discussion of the data collection, approach, analysis, interesting results, and more:

  1. Scope, Data, and Preprocessing
  2. Analysis and Findings
  3. Overall findings

In general, communities form around common interests and bring people together. Online communities can take many shapes and forms in today's interconnected world. More specifically, communities allow people to interact in social groups and contribute to shared experiences.

Developer communities are no different: groups of like-minded individuals come together to learn, ask, debug, and build programs and code for a variety of purposes, using a variety of languages. These communities exist on sites such as Reddit, StackOverflow, and Google Groups (to name a few). Developer and programming communities can take a macro approach (focusing on what good programming looks like, accessibility, and security across programming concepts in general) or a micro approach (usually focused on a specific language). These micro communities raise some interesting questions:

  • Are users often part of multiple communities? What do they talk about in each community?
  • Are users predominantly asking for technical help?
  • Are users talking abstractly about the programming languages?
  • Who are the “influencers” in these types of social networks?

For this project, we explored the interactions between different programming communities on one particular platform: Reddit. We wanted to learn more about the micro communities and what they can tell us about online developer and programming communities. We also wanted to identify influencers and topics of interest in selected subreddits and find similarities between those subreddits. To that end, our research questions were:

  1. Which subreddits have common users, who do they interact with, and who are important users in the network?
  2. What kind of posts (questions, news, etc.) do users post on each subreddit?


Scope, Data, and Preprocessing


We collected data from the 14 subreddits listed below using the Pushshift API. In total, we collected 312,851 records for the date range 1/1/2021 - 6/30/2021.

C_Programming, COBOL, Fortran, Haskell, HTML, HTML5, JavaScript, LaTeX, learnjava, learnpython, LISP, MATLAB, perl, Rlanguage

Next, we preprocessed the data by removing any records that were missing or incomplete (for example, records with “deleted” authors) and splitting the records into node and edge datasheets so that we could run network analyses.
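The preprocessing step can be sketched roughly as follows. This is a minimal illustration rather than the project's actual pipeline: the field names (`author`, `subreddit`, `parent_author`), the placeholder values, and the sample records are all assumptions made for the example.

```python
from collections import Counter

# Hypothetical records; the field names are assumptions for illustration,
# not the schema the project actually used.
records = [
    {"author": "alice", "subreddit": "learnpython", "parent_author": "bob"},
    {"author": "bob", "subreddit": "learnpython", "parent_author": None},
    {"author": "[deleted]", "subreddit": "perl", "parent_author": "alice"},
    {"author": "carol", "subreddit": "", "parent_author": "alice"},
    {"author": "alice", "subreddit": "javascript", "parent_author": "bob"},
]

# Values treated as "missing" (an assumption about how gaps are encoded).
PLACEHOLDERS = {"", "[deleted]", "[removed]", None}


def clean(rows):
    """Drop rows whose author or subreddit is missing or deleted."""
    return [
        r for r in rows
        if r.get("author") not in PLACEHOLDERS
        and r.get("subreddit") not in PLACEHOLDERS
    ]


def build_nodes_and_edges(rows):
    """Split cleaned rows into a node table and a weighted edge table.

    nodes maps each user to the set of subreddits they appear in;
    edges counts (replying user, replied-to user) pairs.
    """
    nodes = {}
    edges = Counter()
    for r in rows:
        nodes.setdefault(r["author"], set()).add(r["subreddit"])
        parent = r.get("parent_author")
        if parent not in PLACEHOLDERS:
            # The replied-to user is active in the same subreddit.
            nodes.setdefault(parent, set()).add(r["subreddit"])
            edges[(r["author"], parent)] += 1
    return nodes, edges


cleaned = clean(records)
nodes, edges = build_nodes_and_edges(cleaned)
print(len(cleaned), dict(edges))
```

Keeping nodes and edges as separate tables mirrors the node and edge datasheets described above and makes it straightforward to load them into a graph tool afterwards.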

We focused on aggregated analysis rather than temporal analysis for the project. The initial batch of languages was based on languages we found in syllabi at George Mason University, but beyond that the languages were picked somewhat arbitrarily, based on interest. We also made sure to include languages across the spectrum from new to mature. We selected the most active subreddit for each language and excluded “dead” languages (e.g., Algol, APL).


Analysis and Findings


Commonalities among Subreddit Communities


Below is the final network graph that was created for all the networks.

An image depicting the final network mapping. 14 concentric circles with lines running between them.

We initially wanted to see what else we could learn about our users, mainly through centrality measures and PageRank. We spent some time identifying each community within our map and labeling it. As the descriptive data table below shows, the Python and JavaScript subreddits account for a large number of interactions between users.

Within each community, you may notice concentric circles moving outward from the center. Users in the center interact with that subreddit often, and those on the outside do so less. Those right on the edge of the circle often have only 1 post or comment (extremely low engagement).

Programming Subreddit Descriptive Data

Subreddit      Created Date  # of Subs   Daily Comments  Total Posts      Users X-posting
                             (Jul 2021)  (Jul 2021)      (01/21 - 06/21)  (01/21 - 06/21)
C_Programming  3/27/2008     112,000     78              4,278            727
COBOL          6/5/2009      2,294       3               133              36
Fortran        9/29/2009     5,989       11              264              99
Haskell        1/25/2008     67,417      65              2,068            320
HTML           9/5/2009      34,545      4               1,284            271
HTML5          9/22/2009     39,348      1               524              203
JavaScript     1/25/2008     1,720,113   57              7,008            618
LaTeX          3/4/2008      35,613      5               1,706            388
learnjava      1/27/2011     116,403     53              3,341            528
learnpython    10/2/2009     571,031     525             24,213           1,360
LISP           1/25/2008     33,291      43              824              182
MATLAB         8/15/2009     43,046      18              2,095            301
perl           1/25/2008     15,149      7               501              76
Rlanguage      2/11/2011     26,494      8               1,389            251

We realized how much the size of each subreddit mattered in the conversation. The number of posts and subscribers differs between subreddits. The table above shows the number of subscribers, comments per day, total posts within the date range of our dataset, and the number of users who cross-posted to another subreddit. JavaScript was by far the largest and COBOL the smallest in terms of subscribers. Additionally, the Daily Comments column shows that users commented more in learnpython every day, on average, than in any other subreddit.

From the network analysis we conducted, Python also had the largest number of posts and cross-posts. Having established that users were cross-posting across subreddits, our next step was to figure out which other subreddits the users of each subreddit cross-posted to most.

One of our hypotheses was that cross-posting between subreddits would follow the nature of the languages. For example, HTML and JavaScript are often used together for front-end web design, so it would make sense for those subreddits to share users.

However, what we discovered was that Python seems to be the strongest link between all the subreddits: it has a high PageRank score on our graph and the largest number of users bridging their own subreddit and the Python subreddit.

Top 3 Connected Subreddits to Each Programming Language Subreddit

Language       Top 1          Top 2          Top 3
C_Programming  learnpython    javascript     learnjava
COBOL          learnpython    C_Programming  javascript
Fortran        C_Programming  LaTeX          learnpython
Haskell        C_Programming  lisp           learnpython
HTML           learnpython    javascript     html_5
HTML5          javascript     HTML           learnpython
JavaScript     learnpython    C_Programming  html_5
LaTeX          learnpython    matlab         C_Programming
learnjava      learnpython    C_Programming  javascript
learnpython    learnjava      C_Programming  javascript
LISP           haskell        C_Programming  learnpython
MATLAB         learnpython    LaTeX          C_Programming
perl           C_Programming  learnpython    javascript
Rlanguage      learnpython    LaTeX          matlab

We found some interesting data to back up the claim of Python's predominance. The table above lists the top 3 other subreddits that users from a given subreddit posted to. For C_Programming, the first row, users cross-posted most to learnpython, then javascript, and finally learnjava.
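One way such a ranking could be computed is sketched below. It ranks the other subreddits by the number of shared users, which is only one possible reading of "cross-posted most"; the function name `top_cross_posted` and the sample activity pairs are hypothetical and not taken from the project.

```python
from collections import Counter, defaultdict

# Hypothetical (author, subreddit) activity pairs, for illustration only.
activity = [
    ("alice", "C_Programming"), ("alice", "learnpython"),
    ("bob", "C_Programming"), ("bob", "learnpython"), ("bob", "javascript"),
    ("carol", "C_Programming"), ("carol", "javascript"),
    ("dave", "learnpython"),
    ("erin", "C_Programming"), ("erin", "learnpython"),
]


def top_cross_posted(pairs, k=3):
    """For each subreddit, rank the other subreddits by shared users."""
    subs_by_user = defaultdict(set)
    for author, sub in pairs:
        subs_by_user[author].add(sub)

    shared = defaultdict(Counter)
    for subs in subs_by_user.values():
        # Every ordered pair of distinct subreddits a user is active in
        # counts as one shared user between them.
        for home in subs:
            for other in subs:
                if other != home:
                    shared[home][other] += 1

    return {home: [s for s, _ in counts.most_common(k)]
            for home, counts in shared.items()}


top = top_cross_posted(activity)
print(top["C_Programming"])
```

Counting each user once per subreddit pair keeps a single very active cross-poster from dominating the ranking; weighting by number of posts instead would be an equally defensible choice.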

Users in 8 of our 14 subreddits cross-posted most to Python. In fact, Python ranked within the top 3 for every other subreddit we analyzed. This is interesting, and our hypothesis after reviewing the data is that it reflects how ubiquitous Python has become in programming today. It is taught in schools, bootcamps, and self-learning courses; has a relatively gentle learning curve compared to other languages; is used in industry; and interfaces well with other languages.

On closer analysis, we found that much of the discussion in the subreddits for more mature languages, such as COBOL, Fortran, and LISP, involved troubleshooting how to get Python functionality to interface with legacy code and hardware, in addition to language migration queries.


Influencer Profiles


We calculated several centrality measures for the users in the network, including betweenness, closeness, and PageRank. Several of our top influencers had interesting profiles. Overall, they interact with other users often, helping them get their code working correctly and pointing them to additional resources they might not otherwise have found. Some of these top influencers also link to content on other sites, so they appear to be at least somewhat aware of cross-posting across sites as well.

We found one user with extremely high betweenness and PageRank scores, a high closeness score, and the largest degree of all our users. This user, as it turned out, was actually a bot known as AutoModerator, which subreddits often use as a management tool. Its purpose of spreading shared community guidelines and rules across multiple communities seems to work, and if we needed to spread a contagion-type message across all 14 subreddits, this user would be our pick.
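To make two of these measures concrete, the sketch below computes degree and PageRank on a tiny directed reply graph. It is a plain power-iteration implementation written for illustration, not the tool used in the project; the edge list, the user names (including `mod_bot`), and the damping factor of 0.85 are assumptions.

```python
from collections import Counter

# Hypothetical reply edges (source replies to target), for illustration only.
edges = [
    ("alice", "mod_bot"),
    ("bob", "mod_bot"),
    ("carol", "mod_bot"),
    ("mod_bot", "alice"),
]


def degree(edge_list):
    """Total degree (in plus out) of every node."""
    deg = Counter()
    for src, dst in edge_list:
        deg[src] += 1
        deg[dst] += 1
    return deg


def pagerank(edge_list, damping=0.85, iterations=100, tol=1e-12):
    """Power-iteration PageRank over a directed edge list."""
    nodes = sorted({n for edge in edge_list for n in edge})
    out_links = {n: [] for n in nodes}
    for src, dst in edge_list:
        out_links[src].append(dst)

    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes with no outgoing links is spread evenly.
        dangling = sum(rank[v] for v in nodes if not out_links[v])
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}
        for src in nodes:
            if out_links[src]:
                share = damping * rank[src] / len(out_links[src])
                for dst in out_links[src]:
                    new_rank[dst] += share
        delta = sum(abs(new_rank[v] - rank[v]) for v in nodes)
        rank = new_rank
        if delta < tol:
            break
    return rank


ranks = pagerank(edges)
degrees = degree(edges)
print(max(ranks, key=ranks.get), degrees.most_common(1))
```

In this toy graph the hub account ends up with both the largest degree and the highest PageRank, which is the same pattern described above for AutoModerator.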


What are users posting about? Our Topic Modeling Methodology


A word cloud representing the most common words across subreddits.
A word cloud representing the most common words across subreddits with further processing.

For our topic modeling attempt, we used Latent Dirichlet Allocation (LDA) to create a set number of bins, sort through the text, and produce general grouped topics. Based on trial and error and empirical observation of several subreddits, we expected about 4 topics: Tech Help, General Language, Jobs and Advice, and Off-Topic. Because LDA is unsupervised, we decided that it might be better at finding hidden subjects or other patterns that we would not have tagged manually under a different approach.

4 generalized word clouds for the topics generated across all the data using LDA.

Overall findings


Overall, it is difficult to generalize subreddits, even within the same domain.

  1. High-level language communities seem to be the most popular.
  2. A few users are very active.
  3. Newer languages are more commonly discussed.
  4. Topics were specific to each subreddit, with some overlap.
  5. Even with overlap, the proportion of posts in similar topics varied considerably.