Mohit Shukla

StackEx DataDump Part: 1

The dataset comes from the Stack Exchange Data Dump published on December 15, 2016, and covers Programmers.stackexchange.com. The download is a zipped archive containing several .xml files. To read this data into pandas dataframes, I used pymysql with pandas. The same pipeline is used to read each .xml file into a pandas dataframe.
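As a rough illustration of the reading step, here is a minimal sketch that parses the dump's XML directly with the Python standard library. This is not the pymysql pipeline described above, just a simpler stand-in. The dump stores each record as a `<row .../>` element whose attributes are the columns; the `parse_rows` helper and the `SAMPLE_TAGS_XML` string are hypothetical names introduced for this example.

```python
import xml.etree.ElementTree as ET

# Tiny stand-in for Tags.xml; the real file has one <row .../> per tag.
SAMPLE_TAGS_XML = """<tags>
  <row Id="1" TagName="comments" Count="152" />
  <row Id="3" TagName="code-smell" Count="99" />
  <row Id="4" TagName="programming-languages" Count="1205" />
</tags>"""


def parse_rows(xml_text, int_columns=("Id", "Count")):
    """Return one dict per <row> element, converting the listed columns to int."""
    root = ET.fromstring(xml_text)
    rows = []
    for row in root.iter("row"):
        record = dict(row.attrib)  # XML attributes become column values
        for col in int_columns:
            if col in record:
                record[col] = int(record[col])
        rows.append(record)
    return rows


tags = parse_rows(SAMPLE_TAGS_XML)
print(tags[0])
```

The resulting list of dicts can be handed to `pandas.DataFrame(tags)` to get a dataframe with one column per attribute. For the full-size files, a streaming parser such as `ET.iterparse` would likely be a better fit than loading the whole document at once.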

The Tags dataset has 1628 distinct tags applied to posts. Each tag has an associated Count value.

Tags Table
Id   TagName                 Count
1    comments                152
3    code-smell              99
4    programming-languages   1205
5    usage                   7
7    business                121

Top 10 Tags based on count

Tags Table
Id    TagName           Count
37    java              3511
116   c#                3007
265   design            2886
174   design-patterns   2475
107   object-oriented   2036
226   c++               1822
142   algorithms        1732
68    php               1667
331   architecture      1639
319   javascript        1578
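A ranking like the one above can be sketched as follows. The rows are copied from the two tables in this post (only a subset of the 1628 tags), and `top_tags` is a hypothetical helper name, not the code originally used.

```python
import heapq

# A subset of the Tags table, taken from the rows shown in this post.
tags = [
    {"Id": 1, "TagName": "comments", "Count": 152},
    {"Id": 4, "TagName": "programming-languages", "Count": 1205},
    {"Id": 5, "TagName": "usage", "Count": 7},
    {"Id": 37, "TagName": "java", "Count": 3511},
    {"Id": 116, "TagName": "c#", "Count": 3007},
    {"Id": 265, "TagName": "design", "Count": 2886},
    {"Id": 174, "TagName": "design-patterns", "Count": 2475},
]


def top_tags(rows, n=10):
    """Return the n rows with the largest Count, highest first."""
    return heapq.nlargest(n, rows, key=lambda row: row["Count"])


top3 = [row["TagName"] for row in top_tags(tags, n=3)]
print(top3)
```

With the tags held in a pandas dataframe instead, `df.nlargest(10, "Count")` should give the same ordering.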

1D Plot

Next, take all the posts that contain python in their title or tag column. The most popular of these can then be chosen based on Score, ViewCount, AnswerCount, CommentCount, or FavoriteCount.
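The filter-then-rank step can be sketched like this. The sample posts are made up for illustration and are not taken from the dump; only the column names (Title, Tags, Score, ViewCount) follow the post. `mentions` and `top_posts` are hypothetical helper names.

```python
# Made-up sample rows; the real Posts data has many more columns and rows.
posts = [
    {"Id": 101, "Title": "Is Python suitable for large projects?",
     "Tags": "<python><architecture>", "Score": 40, "ViewCount": 9000},
    {"Id": 102, "Title": "How to structure a test suite",
     "Tags": "<python><testing>", "Score": 75, "ViewCount": 3000},
    {"Id": 103, "Title": "Java generics question",
     "Tags": "<java>", "Score": 120, "ViewCount": 20000},
    {"Id": 104, "Title": "Why do Python lists over-allocate?",
     "Tags": "<performance>", "Score": 10, "ViewCount": 15000},
]


def mentions(post, keyword="python"):
    """True if the keyword appears in the post's Title or Tags (case-insensitive)."""
    keyword = keyword.lower()
    title = (post.get("Title") or "").lower()
    tag_text = (post.get("Tags") or "").lower()
    return keyword in title or keyword in tag_text


def top_posts(rows, metric, n=5, keyword="python"):
    """Keep posts mentioning the keyword, then return the n highest by metric."""
    matching = [p for p in rows if mentions(p, keyword)]
    # Missing metric values are treated as 0 so they sort last.
    return sorted(matching, key=lambda p: p.get(metric) or 0, reverse=True)[:n]


by_score = [p["Id"] for p in top_posts(posts, "Score")]
by_views = [p["Id"] for p in top_posts(posts, "ViewCount")]
print(by_score, by_views)
```

The same function covers AnswerCount, CommentCount, and FavoriteCount by changing the `metric` argument. Note that this is a plain substring match, so a tag that merely contains the word python would also be included.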

Top 5 posts based on Score

Top 5 posts based on ViewCount

Top 5 posts based on AnswerCount

Top 5 posts based on CommentCount

Top 5 posts based on FavoriteCount

