StackEx DataDump Part: 1

28 Dec 2016

The dataset comes from the Stack Exchange Data Dump published on December 15, 2016, and covers Programmers.stackexchange.com. The download is a zipped archive containing several .xml files. To read this data into pandas dataframes, I used pymysql together with pandas. The same pipeline is used to read each .xml file into a pandas dataframe:

Create an SQL table with one column for each entry in a row of the XML data.
Load the local XML data into that SQL table.
Establish a connection and load the SQL table as a pandas dataframe.
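
The three steps above can be sketched roughly as follows. This is a minimal sketch, not the exact code used for this post: it assumes a MySQL server with `local_infile` enabled, relies on MySQL's `LOAD XML` statement to do the import, and the helper names (`create_table_sql`, `load_xml_sql`, `xml_to_dataframe`), the table name, and the column types are all hypothetical.

```python
# Assumed column layout for Tags.xml, matching the Tags table shown in this post.
TAGS_COLUMNS = [("Id", "INT"), ("TagName", "VARCHAR(255)"), ("Count", "INT")]


def create_table_sql(table, columns):
    """Step 1: build a CREATE TABLE statement from (name, SQL type) pairs."""
    cols = ", ".join(f"`{name}` {sql_type}" for name, sql_type in columns)
    return f"CREATE TABLE IF NOT EXISTS `{table}` ({cols})"


def load_xml_sql(xml_path, table):
    """Step 2: build a LOAD XML statement; each <row .../> element becomes one record."""
    return (
        f"LOAD XML LOCAL INFILE '{xml_path}' "
        f"INTO TABLE `{table}` ROWS IDENTIFIED BY '<row>'"
    )


def xml_to_dataframe(xml_path, table, columns, **connect_kwargs):
    """Step 3: run both statements, then read the table back with pandas."""
    # Imported here so the SQL helpers above work without a database driver.
    import pandas as pd
    import pymysql

    conn = pymysql.connect(local_infile=True, **connect_kwargs)
    try:
        with conn.cursor() as cur:
            cur.execute(create_table_sql(table, columns))
            cur.execute(load_xml_sql(xml_path, table))
        conn.commit()
        return pd.read_sql(f"SELECT * FROM `{table}`", conn)
    finally:
        conn.close()
```

A call might look like `xml_to_dataframe("Tags.xml", "tags", TAGS_COLUMNS, host="localhost", user="...", password="...", database="...")`, where the connection details are placeholders for your own setup.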

The Tags dataset has 1628 different tags for various posts. Each tag has an associated count value.

Tags Table

Id	TagName	Count
1	comments	152
3	code-smell	99
4	programming-languages	1205
5	usage	7
7	business	121

Top 10 Tags based on count

Tags Table

Id	TagName	Count
37	java	3511
116	c#	3007
265	design	2886
174	design-patterns	2475
107	object-oriented	2036
226	c++	1822
142	algorithms	1732
68	php	1667
331	architecture	1639
319	javascript	1578
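
A ranking like the one above can be produced by sorting on the Count column. The snippet below is a small sketch of one way to do it with `DataFrame.nlargest`; it uses a handful of rows copied from the two tables in this post rather than the full 1628-row dataset, and the variable names are only illustrative.

```python
import pandas as pd

# A few rows copied from the tables above (not the full Tags dataset).
tags = pd.DataFrame(
    {
        "Id": [1, 3, 4, 5, 7, 37, 116, 265],
        "TagName": [
            "comments", "code-smell", "programming-languages", "usage",
            "business", "java", "c#", "design",
        ],
        "Count": [152, 99, 1205, 7, 121, 3511, 3007, 2886],
    }
)

# Keep the n rows with the largest Count, in descending order.
top_tags = tags.nlargest(3, "Count")
print(top_tags.to_string(index=False))
```

With the full dataframe, `tags.nlargest(10, "Count")` would give the top 10.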

1D Plot

Next, take all the posts that contain python in their title or tags column. The most popular of these posts can then be chosen based on Score, ViewCount, AnswerCount, CommentCount, or FavoriteCount.
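
The filtering and ranking described here might look like the sketch below. The rows are made up purely for illustration and are not taken from the data dump; the column names follow the ones mentioned in this post, and the helpers `python_posts` and `top_posts` are hypothetical names.

```python
import pandas as pd

# Made-up rows for illustration only; NOT real data from the dump.
posts = pd.DataFrame(
    {
        "Id": [1, 2, 3, 4],
        "Title": ["Why is Python slow?", "Java generics", None, "Unit testing tips"],
        "Tags": ["<performance>", "<java>", "<python><oop>", "<testing>"],
        "Score": [10, 50, 30, 5],
        "ViewCount": [1000, 200, 300, 50],
    }
)


def python_posts(df):
    """Keep posts that mention python in the title or in the tags."""
    # na=False treats missing titles or tags as "no match".
    in_title = df["Title"].str.contains("python", case=False, na=False)
    in_tags = df["Tags"].str.contains("python", case=False, na=False)
    return df[in_title | in_tags]


def top_posts(df, metric, n=5):
    """Return the n posts with the highest value of the given metric."""
    return df.nlargest(n, metric)


py = python_posts(posts)
print(top_posts(py, "Score")[["Id", "Score"]])
print(top_posts(py, "ViewCount")[["Id", "ViewCount"]])
```

The same `top_posts` call works for AnswerCount, CommentCount, and FavoriteCount, provided those columns are present in the dataframe.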

Mohit Shukla

Top 5 posts based on Score

Top 5 posts based on ViewCount

Top 5 posts based on AnswerCount

Top 5 posts based on CommentCount

Top 5 posts based on FavoriteCount