Saturday 25 December 2010

Research 101

We've had requests to provide the public some tips on how to conduct research of their own. Glad to oblige! Employing the easy process below, you'll be able to model various trends online in fifteen minutes or less.

First things first, though. It's not uncommon to hear "people say" or "most members think that" when these phrases are, in reality, assumptions pulled out of thin air. Sure, we could all trust our intuition, but it would get messy, especially when you have two equally presumptuous opinions on the table. Solution? Get to the facts, the objective stuff.

How do you do that? You find the numbers. Why numbers? They are easy to analyse and difficult to misunderstand. When you have 2, 4 and 8, you can find their average, maximum value, their order in the sequence, the total, and even come to a conclusion of what number should come next. You can't do that with "dressing" - "snowman" - "goose". When you see 2, it is likely others will see it as a 2, causing fewer misunderstandings. On the other hand, "dressing" could be understood as getting clothes on or salad dressing. Being on the same page is what counts here.
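A minimal Python sketch of that point, using only the standard library. The variable names are illustrative, and the "next number" guess assumes the sequence keeps doubling, which is only one possible reading of it.

```python
from statistics import mean

numbers = [2, 4, 8]

total = sum(numbers)       # 14
average = mean(numbers)    # 14 / 3
largest = max(numbers)     # 8

# Each number is double the previous one, so one guess at the next
# value is the last number times that common ratio.
ratio = numbers[1] / numbers[0]
next_guess = numbers[-1] * ratio

print(total, round(average, 2), largest, next_guess)
```

None of these operations has a sensible equivalent for a list of words.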

Also, you may want to have a LOT of numbers. Why? Let's say you're standing in front of a zoo, asking people about the total number of exhibits. You ask a blind person and a small kid, two people. The first person tells you 3, the second - a gazillion. That can't be right. Solution? Ask more people. The theory is very simple: the more numbers you have on your list, the more likely it is that you'll get to the real deal. Sure, the most "reliable" way would be going to the zoo and counting yourself, but it's not that easy if the zoo closes in fifteen minutes and you don't want to spend $20 on a ticket. And if someone asked you about the number, you'll have no way to prove your count is correct. But if you have a list of people vouching for a certain number, majority rules, and it's likely the problem is solved.

Notice that you've been given one example, and some of you may not be convinced. Had there been twenty or fifty examples to justify the need to get many numbers, it would be very likely that everyone would be convinced because different examples would appeal to different people. The catch is that this is only a hypothesis, and while it sounds rational, it might not be true. What if all examples are equally dumb? It is likely when only three people work on them. What if these examples contradict one another? See, there is a lot of uncertainty without proper calculations, and if you want to get to the bottom of something, you need to get your hands dirty.

Think about the topic. Do you want to study a trend? Do you want to find what influences something else? Research can tell you which day to post a story to get the most reviews, what are the prospects for your fandom and all other things you've seen in the previous posts. Amazingly, trying to solve one problem usually solves several because you can reuse and adapt your data to show a whole system at work.

TOPIC

We're working in a real environment here, so let's find things you might care about, reviews. Our assumption is that you want reviews and are interested in finding out how to get more reviews. Whether that assumption is true is none of our concern because the purpose of this post is to show you how to conduct research.

Let's make a bet that something influences reviews and they are not written at random. To make things specific, we'll choose one fandom (Sonic the Hedgehog) and one language (English) and a date. Why one fandom? Because that's where a story would be located, and every fandom has different review patterns. When you are ready to post in Sonic the Hedgehog, for instance, you might find it more useful to know the outcome of posting in Sonic the Hedgehog rather than Tetris. More reasons are described in METHOD. As such, our topic would be: "Factors influencing the review count in FanFiction.Net's English Sonic the Hedgehog section in November 2010". Why November? December is not over yet, and November is the latest full month. Stories updated in November must have gotten all the reviews you can count on, both from browsing readers and favorites/alerts.

It's necessary to have a topic written out clearly for yourself and anyone who may want to read your findings. For one, the topic won't let you get sidetracked, so you reach the goal you set. For another, people will know what to expect from the whole research upon reading the topic's name. Naming your topic too broadly or incorrectly will make you answer questions you didn't ask. If you think it's no big deal to have the topic written wrong, the world of research will be very cruel to you. For instance, if you're looking for review trends of 2009, you may waste time if someone didn't write the year of their research's interest right at the top. Normally, if you don't write the date or the fandom, it is assumed you're doing a general or site-wide search, which is too difficult for this tiny example. When the example is done, though, you will easily be able to make it more up to date and applicable to more fandoms.

Our topic: "Factors influencing the review count in FanFiction.Net's English Sonic the Hedgehog section in November 2010".

VARIABLES

What influences a review count (in Sonic the Hedgehog)? The number of chapters, perhaps. The more chapters, the more reviews, we assume, because one person can only review once, and two chapters can mean two reviews from one reader. This might not be true, but our research will be able to answer that, too!

What else? Word count. Stories with fewer words have fewer reviews, we assume.

Experience? The more experienced the author, the more reviews his or her stories should get. Though, we don't have an experienceometer, so it has to be something else. Author's age? We could ask authors for their age, but they might lie, be unavailable or make us wait too long. Hmm, it seems deciding what variables to use is greatly limited by the ability to obtain the data. Hey, account age could serve as a measure of experience. The longer the account has been on FFN, the more reviews it should get, we assume. It's possible to get the information from account ID. The higher (newer) the ID, the fewer reviews an author gets, or so we expect.

Let's add a fourth variable, the number of stories posted on the author's account. The more stories you have now, the more reviews you're going to get, we assume.

We could have added a fictional "yes"/"no" AKA "Boolean" variable. Boolean variables are very useful to turn obscure qualities into numbers. For instance, writer's nationality is boolean when checked by a question like "is the writer American?" In it, "yes" ("American") would be 1 and "no" ("Other country") would be 0. When the variables you have picked logically don't work, add something boolean to set them apart. They can be anything from "acts like a jerk" to "has the word 'honey' on the profile". Just don't let them go dominant in your research.
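As a rough illustration, a Boolean variable of the "has the word 'honey' on the profile" kind could be produced like this. The function name and the two profile texts are made up for the example.

```python
def to_boolean_variable(profile_text, keyword="honey"):
    """Return 1 if the keyword appears in the profile text, else 0."""
    return 1 if keyword.lower() in profile_text.lower() else 0

# Hypothetical profile texts, not taken from any real account.
profiles = ["I love honey and tea", "Just here to write"]
flags = [to_boolean_variable(text) for text in profiles]

print(flags)  # [1, 0]
```

The resulting column of ones and zeros can sit next to the other variables in the table.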

We're doing quick research here, so four variables will be enough. You may not want to use too many variables in your research because it usually brings certain problems.

Rule of thumb - every variable requires 6 data points. We have 4 variables, so that's a minimum of 4 x 6 = 24 stories in the sample.

METHOD

Now that we've decided the variable we're trying to analyse (review count) and have deciding factors (chapter count, word count, author ID, number of posted stories), it's time to select a method for making the future steps.

Obviously, we're going to gather data based on observation, not a questionnaire. Surveys fail too often, and we don't have to bother anyone by taking notes of what we can see publicly.

We may want to use our research results more times than one, making them practical.

It brings both problems and opportunities. The more you want to predict and the more accurately you want to do it, the higher the requirements and the fewer choices you can make.

Let's look at some of the requirements we want to fulfil. We want to apply results of our study to a general audience. By that, it is implied we're using sampling. It's very time consuming to go through all stories on FanFiction.Net (over three million), so a sample, a part of the whole, will do. This part should have the same qualities as the whole.

A visual alternative: you have to draw a specific triangle, but you don't know its angles, only the perimeter (length of the line used to draw it all). You have very little chalk, so that'll have to be a proportionally smaller triangle. One inch of your triangle could mean a hundred inches of the life-size triangle, the qualities (angles/edge length) of which you're trying to determine. This proportion has to be kept for every edge.

The problem is that you don't know how large are any of the angles nor the length of an individual edge, only the perimeter.

Don't worry. There's a magic trick called "randomness". It's difficult to explain, but if you let randomness take the pick for you, you get the most accurate results. This has to do with bias. We're unconsciously biased towards certain numbers, and we can't let that get in our way. Our opinion could only go as far as assuming the logical factors to influence the variable. That's why dice and coins are used as tools of chance.

When making the topic, we saved ourselves a lot of logistics by defining one fandom, and one month, one language. This is our "large" triangle. On FFN it spans from page 23 till 44 of this fandom. For safety, let's reduce our page count to 24 - 43 because the pages may shift while this guide is being written. Every choice has to be justified.

Now, we see a beautifully placed list of stories. Every page has 25 stories. There are 20 pages, so 20 x 25 = 500 stories. Our large triangle's perimeter is 500. Now, we decide the proportion.

This is a crucial moment: you should pick enough stories to be accurate while not straining yourself with repetition. Another rule of thumb is to have at least 100 data points. If you examine 100 stories, you don't have to prove certain things and can give the data the default treatment. However, you may want to do things more precisely. Commercially popular sample sizes are 250, 500, 1067-1100. The sample size determines the error band. When you have 1100 people/stories examined, the error margin (confidence interval) tends to be 3% at a 95% confidence level. This margin determines the statistical difference. In statistics, numbers 43% and 44% are not necessarily different (don't have statistically significant differences) because they might be affected by your error margin. It's possible to determine the interval manually, but I like using this website to do it for me.

Even if you have a small confidence interval (3% is small), there is a chance some freak statistical accident happened and your research doesn't mean squat. How sure you can be that this did not happen is called the confidence level. In commercial data collection, it ranges from 90% to 99% because you cannot be 100% sure of anything when sampling. The higher the confidence level, the more data points you need. The higher the error margin/interval you accept, the fewer you need. These two are set independently. You may have a 99% certainty level the average review count is 10 +-8% or a 50% certainty level the average review count is 5 +-1%. Confidence levels are, generally, more important. Don't aim for lower than 93% because when you conduct 100 samplings, 7 of them would make no sense at all, and you don't want to be a part of those 7. The max error margin we can tolerate is 10%. There are other factors that matter, but we're aiming at a quickie.

We're going to pick 100 stories, dodging some evidence gathering, at a 95% confidence level, which would give us an 8.8% error margin.
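For those who prefer to determine the margin manually, here is a sketch of the usual formula for a proportion. It assumes the worst case (a 50/50 split) and a z-score of 1.96 for 95% confidence; the finite population correction is an assumption about how the 8.8% figure was reached, since a sample of 100 out of only 500 stories is a large share of the whole.

```python
import math

Z_95 = 1.96  # z-score commonly used for a 95% confidence level

def margin_of_error(sample_size, population=None, proportion=0.5, z=Z_95):
    """Margin of error for a proportion; proportion=0.5 is the worst case.

    If the population size is given, apply the finite population
    correction, which shrinks the margin when the sample is a large
    share of the whole.
    """
    margin = z * math.sqrt(proportion * (1 - proportion) / sample_size)
    if population is not None:
        margin *= math.sqrt((population - sample_size) / (population - 1))
    return margin

print(f"{margin_of_error(100, population=500):.1%}")  # 8.8%
print(f"{margin_of_error(1100):.1%}")                 # 3.0%
```

Without the correction, 100 stories would give roughly 9.8%, still under the 10% ceiling.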

All right, we have the proportion now: 100/500 = 1/5 = 20%. There are two ways we can go now. The easy way in our case is to do systematic sampling, which is, basically, taking every fifth story you see and taking notice of its data. The randomness here comes in having chance decide which story to start from. Since I don't have a 5-point die, I'm going to take five bits of paper, number them from 1 to 5 and let someone pick one of these. The number that is pulled out is going to be which story from the beginning of my sample on page 24 is going to be first, so I go to every fifth after that. Let's assume I did that, and the pulled out number was 1.
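The bits-of-paper draw can also be sketched in Python. The function name is made up, and ranks 1 to 500 stand in for the 500 stories in our sample frame.

```python
import random

def systematic_sample(first_rank, last_rank, step, rng=random):
    """Pick a random start within the first `step` items, then every `step`-th."""
    start = first_rank + rng.randrange(step)  # replaces the bits of paper
    return list(range(start, last_rank + 1, step))

# Hypothetical ranks 1..500 standing in for the 500 stories.
picked = systematic_sample(first_rank=1, last_rank=500, step=5)

print(len(picked), picked[:3])
```

Whatever the starting point, the list always holds 100 ranks spaced five apart.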

The second approach is more difficult in this case, but applicable to more things. Sometimes, it's impossible/unnecessary to know the proportion. FFN "should" have over six million stories, but it has only three million. If we hadn't done research before, we wouldn't have known this and, you would think, this would lead to bad samples. Not at all. Sometimes it's impossible to make a list or know the beginning or the end of something. This is where randomness replaces the list without changing any confidence-related issues. Had there been a different number of fanfics on every page (instead of 25 on every one), we would have used Excel's random number generation, and asked it to generate 100 story IDs from 1 to, say, 6 million.

Looks like we're all set for practice.

ANALYSIS IN EXCEL

We're going to work with Excel's 2003 version. First, let's make sure we have what we need. In the top menu, click "Tools" and see if you have "Data Analysis" in the drop-down menu.

If not, go to View-Toolbars-Customise. Click "Commands", find "Data" and see if you have "Data Analysis" to choose. If you do, just make it visible.

If you don't, we'll need to go to Tools-Add-Ins. Check "Analysis ToolPak - VBA" and/or any other version of the phrase you may have. Press "OK", restart Excel and see if you have "Data Analysis" in "Tools" now. If you don't, your version is either incomplete (use the installation disk, and install Add-Ins for Excel) or you are on a restricted computer.

Moving on. Click Tools-Data Analysis. "Random Number Generation" should be highlighted by default. Excel is clever and knows people need the random numbers first. Click it, and you'll see a new window. Pick the drop-down Distribution (in the middle) and pick "Patterned". Now, you're in total control. The number I see on top of page 24 is 576. The number on the bottom of page 43 is 1075. The important part is our proportion 20%, which means every fifth story goes, and we needed 5 bits of paper to decide whether we start at 576, 577, 578 et cetera. Our paper said 576, so that'll be the first number we write in ("From:").

Fill the window with the following.
Number of variables: 1
Number of random numbers: 100
From: 576 to 1075 in steps of 5
Repeating each number: 1 times
Repeating each sequence: 1 times

Delete the last number if you get 101 results. If you're picky, try "Uniform" in the drop-down of a new "Random Number Generation" window. You just enter the range (between 576 and 1075), and the other fields stay the same: 1 variable and 100 random numbers. Uniform is, generally, a better solution because it requires less input from you at first. You merely have to round the numbers and, if two of them end up identical after rounding, add 1 to one of them.
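If you don't have Excel at hand, both dialog choices can be approximated in Python. This is only a sketch of the same idea: the ranks 576 and 1075 come from the text, and `random.sample` is swapped in for Excel's "Uniform" option because it never returns duplicates.

```python
import random

FIRST_RANK, LAST_RANK = 576, 1075  # ranks quoted in the text
SAMPLE_SIZE = 100

# The "Patterned" choice: every fifth rank, starting from 576.
patterned = list(range(FIRST_RANK, LAST_RANK + 1, 5))

# A stand-in for "Uniform": random.sample draws without replacement,
# so no rounding or manual de-duplication is needed.
uniform = sorted(random.sample(range(FIRST_RANK, LAST_RANK + 1), SAMPLE_SIZE))

print(len(patterned), patterned[0], patterned[-1])  # 100 576 1071
print(len(uniform), len(set(uniform)))              # 100 100
```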

Now, we have the story numbers from the pages we need. We should note the data each story has. If you don't have the time to list the variables, just save links to the 100 stories to check them later. (Instead of making five clicks per story, you'd make only one.) Even if you have the time, save the link or story ID you get upon clicking the stories because page numbers change (that's why we moved from 23 to 24, and the pages did shift since the beginning of this paper). Of course, this is a matter of choice and seriousness.

You have 100 numbers now. I suggest selecting the 100 numbers (not the whole column), pressing CTRL+X, and putting the cursor on cell A2. Press CTRL+V. You should see them now placed one cell lower.

We should do some labelling (that's why we lowered the rank numbers). Write 'y' in cell B1, 'x1 - chapters' in C1, 'x2 - words' in D1, 'x3 - author ID' in E1 and 'x4 - story count' in F1. Add extra labels to make the columns more informative if needed. Consider freezing the first two rows in your sheet by selecting them and going to Window-Freeze panes, so the labels wouldn't get lost. You may notice that 100 stories on the list is more than 24 required by our "times six" rule of thumb. That's good.

What we have to do now is start taking notes. They should look something like this when you finish. Scroll down.

y - x1 - x2 - x3 - x4
9 - 1 - 870 - 1720168 - 3
0 - 1 - 2091 - 2332564 - 35
11 - 4 - 8977 - 1146820 - 4
13 - 2 - 1696 - 2497515 - 8
32 - 12 - 54094 - 370579 - 15
0 - 12 - 9070 - 2600296 - 3
2 - 3 - 2346 - 2625209 - 4
10 - 1 - 942 - 2322399 - 10
0 - 1 - 812 - 1445016 - 27
48 - 14 - 10430 - 2234950 - 2
1 - 1 - 756 - 1247257 - 4
4 - 2 - 6918 - 2254848 - 9
1 - 1 - 828 - 2500706 - 8
11 - 4 - 2851 - 2466270 - 4
51 - 5 - 27418 - 2464934 - 5
23 - 7 - 42338 - 2246255 - 17
32 - 7 - 9075 - 2349427 - 51
0 - 5 - 5611 - 2592567 - 2
23 - 11 - 3932 - 1960339 - 7
6 - 1 - 1326 - 2432493 - 22
2 - 3 - 7584 - 2254848 - 9
62 - 7 - 30210 - 2407962 - 8
36 - 19 - 29653 - 2469814 - 17
19 - 23 - 53349 - 1733388 - 23
4 - 1 - 3013 - 1802183 - 14
1 - 1 - 1480 - 2405648 - 24
3 - 2 - 1646 - 2001585 - 16
11 - 3 - 2300 - 2443927 - 4
2 - 1 - 1772 - 2619494 - 5
1 - 5 - 4977 - 2592556 - 1
16 - 10 - 16627 - 2416048 - 4
2 - 1 - 969 - 998811 - 5
7 - 2 - 3288 - 2621859 - 1
12 - 4 - 2387 - 2572568 - 1
1 - 1 - 1188 - 2576690 - 1
287 - 22 - 106903 - 1263516 - 24
21 - 7 - 9713 - 2229401 - 7
4 - 4 - 3526 - 1842866 - 8
25 - 9 - 6025 - 2418265 - 22
1 - 16 - 12514 - 2581451 - 4
2 - 8 - 5571 - 2324060 - 18
3 - 4 - 8165 - 1055075 - 7
0 - 1 - 641 - 2533529 - 1
0 - 1 - 637 - 2564950 - 12
0 - 1 - 411 - 2434313 - 5
145 - 12 - 51373 - 557082 - 42
1 - 2 - 816 - 2615675 - 1
1 - 1 - 1333 - 2397687 - 3
6 - 1 - 286 - 2363663 - 3
0 - 1 - 653 - 1890945 - 5
0 - 1 - 773 - 2434313 - 5
3 - 3 - 1378 - 2208560 - 9
64 - 23 - 83711 - 909079 - 12
2 - 2 - 2241 - 2603413 - 1
3 - 1 - 4643 - 1314061 - 14
1 - 6 - 807 - 2514303 - 5
0 - 1 - 450 - 2082789 - 3
9 - 1 - 691 - 2497515 - 8
1 - 1 - 201 - 2562978 - 1
6 - 2 - 3249 - 2315797 - 4
165 - 46 - 197328 - 1102393 - 1
3 - 13 - 40438 - 1598320 - 2
25 - 15 - 15107 - 2143219 - 2
33 - 10 - 13654 - 2141369 - 9
20 - 6 - 32769 - 120594 - 3
6 - 9 - 28965 - 1495936 - 22
4 - 1 - 1520 - 2349427 - 51
28 - 7 - 24952 - 2164733 - 8
10 - 8 - 10808 - 2048230 - 2
279 - 30 - 65882 - 1894188 - 10
5 - 1 - 1249 - 2603600 - 5
22 - 14 - 25407 - 2088418 - 1
0 - 2 - 1430 - 2421071 - 1
3 - 1 - 1530 - 1938657 - 7
0 - 1 - 4661 - 2605547 - 2
122 - 21 - 52851 - 1098628 - 5
15 - 2 - 3880 - 2100751 - 19
5 - 3 - 3723 - 2467839 - 2
0 - 2 - 1010 - 2127913 - 4
24 - 15 - 34346 - 1070963 - 5
3 - 1 - 1226 - 2474307 - 10
5 - 1 - 1698 - 2371159 - 11
0 - 2 - 1931 - 2602634 - 2
3 - 1 - 786 - 1890867 - 43
12 - 5 - 6181 - 1685030 - 5
26 - 7 - 6204 - 2133339 - 9
2 - 1 - 963 - 2316070 - 3
54 - 13 - 40760 - 2164733 - 8
8 - 6 - 1502 - 2338442 - 22
0 - 1 - 603 - 2127913 - 4
86 - 15 - 47367 - 1947992 - 15
30 - 11 - 16908 - 2140302 - 4
3 - 3 - 2289 - 2230728 - 6
3 - 11 - 6345 - 2400672 - 1
2 - 1 - 610 - 2547041 - 9
1 - 1 - 1266 - 2404673 - 5
1 - 9 - 6051 - 2596418 - 2
14 - 4 - 4525 - 1543587 - 7
3 - 3 - 3958 - 1514770 - 62
0 - 1 - 494 - 2594903 - 2

Every sample should be publicly available, so others could check your results for validity. It's easy to say "I've done research, surveyed 100,000 people and found that 2 out of 9 are pet owners," but others won't always take your word for it. Samples are usually available on demand as links you can download, not as lists in the middle of your research. The reason you see them here is to save you data collection efforts. By the way, gathering the above took me 35 minutes. This is a slow outcome because I had to click not only to the next page, but also on pen names to find their ID numbers and story counts in another window. Mind you, if you share the burden with two people or don't have to open new windows, you may do it sooner than your media player switches tunes.
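If you would rather process the list outside Excel, each line can be turned into named values. A small sketch, with column names chosen here for readability (they are not part of the original sheet):

```python
COLUMNS = ("reviews", "chapters", "words", "author_id", "story_count")

def parse_row(line):
    """Turn a line such as '9 - 1 - 870 - 1720168 - 3' into a dict of ints."""
    values = [int(part) for part in line.split(" - ")]
    if len(values) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} values, got {len(values)}")
    return dict(zip(COLUMNS, values))

# First three data rows of the list above.
sample_lines = [
    "9 - 1 - 870 - 1720168 - 3",
    "0 - 1 - 2091 - 2332564 - 35",
    "11 - 4 - 8977 - 1146820 - 4",
]
rows = [parse_row(line) for line in sample_lines]

print(rows[0])
```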

Intercorrelation

The "scary part" comes next. It's scary because it has a lot of symbols you probably won't understand and won't need. But first, we lighten up our model. The list of numbers above is the basis of our model, a simplified version of reality. By simplifying it, we may lose some accuracy, but that's okay, because we can always add variables, make the list longer and reach an impractical level of accuracy. You may feel the difference between no reviews and ten reviews, but not 1.04 and 1.06 reviews. Be practical.

By "lighten up" in the previous paragraph, I meant dodging derivatives. You see, we have a dependent variable, our y, the review count. This variable is influenced by what we call "independent variables" x1, x2, x3, x4. While we logically tried to decipher what would be a factor for the review count, we might have, accidentally or otherwise, added variables that depend on one another, that are derivatives of each other. Notice how we tried to pick a variable for experience, looking into alternatives. We didn't know for sure whether they were alternatives, but we deemed them so. Statistical analysis allows us to see whether we've included two or more similar alternatives in our model. A model should be efficient and practical, so it's unnecessary to have a variable that doesn't add value.

We're going to take the BACKWARD procedure. To use it, we need to have all our variables in the table, which we do, and start picking out the variables that don't add value. First, let's do a correlation matrix. In Tools-Data Analysis, pick "Correlation" and select all the data whilst not forgetting to tick "Labels in First Row". Click OK, and you should get a triangle of numbers.

Name: - reviews - chapter c - word c - author ID - story c
reviews - 1 - - - -
chapter c - 0,715994603 - 1 - - -
word c - 0,75572468 - 0,876722508 - 1 - -
author ID - -0,362859097 - -0,37127767 - -0,497448704 - 1 -
story c - 0,14074523 - 0,003728427 - 0,058269805 - -0,208831173 - 1

This matrix/triangle tells us how closely aligned one variable is with another. The first column explains how attached the dependent variable, our review count, is to our other variables. In the first column, the higher the coefficient, the better: anything above 0.8 is so awesome you can draw a straight line and call it a day. If anything in the other columns (save for the diagonal of 1) is 0.8 or higher (or -0.8 or lower), things are bad. It means one independent variable depends on another, they're alternatives. As such, one of them will have to go. And yes, we have that problem. Right in the centre, where "chapter c" meets "word c", we have 0.88. It means the two move together so closely that keeping both is redundant. One of them has to go.

How do we decide which? We go to the first column to find which of the two variables "chapter c" or "word c" has a smaller influence on our review count. 0.72 for chapter c vs 0.76 for word c. Therefore, the word count is more important to us than chapter count, and chapter count has got to go. What do we do now? We copy the chapter count column somewhere far, so it wouldn't get lost, and delete it from our main table. My suggestion is to have two sheets with tables, one being main and the other - your work horse, which you edit and mutilate according to what Data Analysis tells you. Arrange your columns comfortably if placement has shifted.
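Each cell of the matrix is a Pearson correlation coefficient. A hand-rolled sketch of that calculation follows. It uses only the first five rows of the sample, so the result will not match the 100-row figure in the matrix.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(spread_x * spread_y)

# Chapter and word counts from the first five rows of the sample only.
chapters = [1, 1, 4, 2, 12]
words = [870, 2091, 8977, 1696, 54094]

r = pearson(chapters, words)
print(round(r, 3))
```

A value near 1 or -1 means the two columns rise and fall together almost in lockstep, while a value near 0 means they are unrelated.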

Okay, we got rid of one faulty variable, and there weren't any more interdependent variables. Had there been more than one point above 0.8 or below -0.8 in columns after the first one, we would have needed to remove another variable, the less important of that pair.
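The rule we just applied can be written down as a short routine. The figures are copied from the matrix above and rounded to three decimals; the function name and the dictionary layout are illustrative choices, not anything Excel provides.

```python
# Correlation of each factor with the review count (first column).
WITH_REVIEWS = {"chapter c": 0.716, "word c": 0.756,
                "author ID": -0.363, "story c": 0.141}

# Correlations between the factors themselves (the other columns).
BETWEEN_FACTORS = {
    ("chapter c", "word c"): 0.877,
    ("chapter c", "author ID"): -0.371,
    ("word c", "author ID"): -0.497,
    ("chapter c", "story c"): 0.004,
    ("word c", "story c"): 0.058,
    ("author ID", "story c"): -0.209,
}

def variables_to_drop(with_target, between, threshold=0.8):
    """For each pair correlated at or beyond the threshold, drop the
    member that is less correlated with the target."""
    dropped = set()
    for (a, b), r in between.items():
        if abs(r) >= threshold:
            weaker = a if abs(with_target[a]) < abs(with_target[b]) else b
            dropped.add(weaker)
    return dropped

print(variables_to_drop(WITH_REVIEWS, BETWEEN_FACTORS))  # {'chapter c'}
```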

Regression analysis

We have just one magic trick left to discover, regression. Explaining what it is in non-math language can be difficult, but it is like a healthy, working generalisation. For instance, you see car tyres as round, you draw them as round, and they are used as an example of roundness. However, if we take a microscope, we'll find the tyre is very uneven, full of dents and little furrows we don't really care about. Regression lets you get to what matters, the essence of a happening, so you are not distracted by something insignificant or scarcely irregular.

Tools-Data Analysis-Regression. Click. We get a very frightening table with lots of tick boxes and input ranges.

Input Y Range: click on the white space after the colon and select the y (review count) column, finishing your selection by the last filled cell. Don't add empty cells, and don't add more than one column to your selection.

Input X Range: select the remaining three columns from the top to the last filled cells. It should be a rectangle with 3 columns and 101 (100 numbers + labels) rows.

Below you see three checkboxes. Tick "Labels" because we have included them this time. You don't have to include labels; Excel will give your variables generic names, but we want clarity here.

Tick "Confidence Level", and set it for 95%. It should be also the default number.

Never ever tick "Constant is Zero" or our car tyre model may turn into a square. If you're curious, ticking that would kill one number responsible for evening things out.

Don't touch anything else in the window, and just click "OK". You have a new sheet. Rename it to "regression" if you want. On top of the spreadsheet, you have SUMMARY OUTPUT and three weird tables, each with more columns than the previous. We'll be working only with the third one, but the other two are useful, too.

The top one tells you, basically, one thing. You may have seen "R2" or "R squared" mentioned in our previous releases. It is a coefficient that explains how well your variables determine changes. In our case, how well the word count, author ID and story count determine the review count. This number ranges from 0 to 1, and anything above 0.8 is awesome. Anything below 0.3 is horrible. In our case, Multiple R is 0.76, which we disregard, and look at the second row, R Square. It's 0.58. This means that 58% of the variation in review counts can be explained by how many words you used, how many stories you wrote and when you joined the site. The remaining 42% comes from factors we have missed.
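For the curious, R Square can be computed by hand as one minus the share of variation the model fails to explain. The sketch below uses toy numbers rather than our sample, purely to show the two extremes.

```python
def r_squared(actual, predicted):
    """1 minus (unexplained variation / total variation)."""
    mean_actual = sum(actual) / len(actual)
    total = sum((y - mean_actual) ** 2 for y in actual)
    unexplained = sum((y - p) ** 2 for y, p in zip(actual, predicted))
    return 1 - unexplained / total

# Toy review counts, not taken from the sample.
actual = [1, 2, 3, 4]
perfect_guess = [1, 2, 3, 4]    # a model that hits every value
average_guess = [2.5] * 4       # a model that always says "the average"

print(r_squared(actual, perfect_guess))  # 1.0
print(r_squared(actual, average_guess))  # 0.0
```

Our 0.58 sits between the two: better than guessing the average every time, short of a perfect fit.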

Now, when you have a lower R Square, below 0.5, it can mean two things: you've missed some important factor while brainstorming or there is a problem with the numbers you've attained. There are methods on refining your data, but our example looks good, so we won't need them.

Have a look at the second table creepily labelled ANOVA. On its right edge, you see Significance F. Let's call it "the fail factor". It's 4.06 divided by a number with 18 zeros or "4,06E-18" (0.000...0406). It's a very small number, which means our fail factor won't bother the results. When you see this number grow big, reaching 0.1 and the like, it means your research is destined to fail and you might as well give up because making it work would be as difficult as heart surgery. The fail factor applies not to one variable, but to everything at once: when it is big, any connections you make are a coincidence, a fake. But let's put a smile back on your face because our model is safe.

A bit robust, though. We're going to have to butcher it a bit. Third table. There are three methods we can use. All of them should (almost always) give you the same results. Before we do anything, though, look at the row that says "Intercept", the first row of numbers in the third table. Highlight it in yellow, make the text white and do whatever you need to ignore what's written there. Once that is done, here's what we're going to do: see if there are irrelevant variables in the model. Sometimes, a variable is not important enough, does not cause enough changes to your review count, so we may safely kick it out. We determine if any variables are useless, and carefully puncture them out.

Three methods for removing weak variables:

t Stat column. Rule of thumb: any value between -2 and 2 (higher than -2 but lower than 2 [0.7, for instance]) means you should highlight the number's row red, ready to kick it out.

P-value column. Roughly speaking, it shows the probability that a variable only looks influential by chance. See your confidence level (we have 95% or 0.95 without the percentage). If the P-value is higher than 1 - confidence level (we have 1-0.95 = 0.05), highlight the row red, ready to kick it out.

Lower 95%-Upper 95%. See if they have different signs (one is positive and one is negative). If they do, highlight red.

The most reliable is the third one because it's easy to see the difference between number signs, but any one of these is enough. If you check the table (word c, author ID and story c rows), all three tests would have given you the same results. word c has a high t Stat value, low P-value (E-17 means divided by a huge number), and Lower 95%, Upper 95% have the same sign.

Author ID has a low t Stat, only 0.545, lower than 2, a P-value higher than 0.05, and signs are different on the Lower-Upper columns.

Story c has 1.55 t Stat, lower than 2, but higher than what Author ID has. P-value is 0.12, higher than 0.05, but lower than what Author ID has. Lower-Upper columns have different signs.

Looks like Author ID and Story c would be highlighted red for removal, but we don't remove them both. Like when we ditched chapter count, we have to cull them one at a time, the least important first. Chances are both will end up as totally unimportant, but when we remove just one variable, the whole model might change.

As you could see, the Upper-Lower test with different signs only goes as far as telling you "there is/isn't a problem" (boolean), so use either t Stat or P-value for deciding which variable is removed. In our case, let's use t Stat. Author ID has a lower t Stat, so we go to our working table, and remove that column.
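The three tests and the one-at-a-time removal rule can be sketched as two small functions. The t Stat and P-value figures are the ones quoted in the text; the interval bounds are hypothetical placeholders with opposite signs, since the actual Lower/Upper values are not listed here.

```python
def looks_weak(t_stat, p_value, lower, upper, confidence=0.95):
    """Apply the three rules of thumb; normally all three agree."""
    return {
        "t_stat": abs(t_stat) < 2,
        "p_value": p_value > 1 - confidence,
        "interval_crosses_zero": lower * upper < 0,
    }

def weakest(t_stats):
    """Name of the variable whose t Stat is closest to zero."""
    return min(t_stats, key=lambda name: abs(t_stats[name]))

# story c: t Stat and P-value from the text, bounds are hypothetical.
print(looks_weak(t_stat=1.55, p_value=0.12, lower=-0.1, upper=0.9))

# t Stats of the two flagged variables, as quoted in the text.
print(weakest({"author ID": 0.545, "story c": 1.55}))  # author ID
```

Only the weakest variable is removed before the regression is run again.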

We are now down to two independent variables, word c and story c, along with our review count.

Tools-Data Analysis-Regression.

Repeat the process. Review column from top to the last number in Y Range, and word c, story c columns in X Range. Tick Labels, Tick Confidence Level 95, click OK.

Once again, we see three tables. Let's look at R Square. It's still 0.58, which means ditching author ID did not lose us even one percent of usefulness. It won't be missed. We skip the second table and go right to the third. Feeling fast, let's go for the P-value test. Only one P-value is higher than 1-0.95=0.05, story c. 0.14>0.05. The t Stat is also lower than 2, so we highlight the row red, and go to the working table.

Delete the "story c" column (should be the one on the right). Now, we're down to just reviews and "word c", two columns. Tools-Data Analysis-Regression. Repeat the steps, only the X Range will be one column instead of two. Click "OK".

And we have another sheet. Looking at the first table, R Square is 0.57 (was 0.58). Ah, so we did lose something with the story count. It may mean that the number of stories you write has an influence on your review count, but it is so insignificant that including it will only make our calculations complicated for very small perks. In any case, the drop was just 0.01 because the t Stat and other tests called that variable insignificant. Had you accidentally kicked an important variable, R Square would have dwindled...by a third or something.
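With only one x column left, the regression boils down to fitting a straight line by ordinary least squares. A minimal sketch, checked here on made-up points that lie exactly on a line:

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up points on the line y = 2x + 1, just to check the routine.
slope, intercept = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
print(slope, intercept)  # 2.0 1.0
```

Feeding it the 100 word counts and review counts should, in principle, give the same coefficients as Excel's Regression tool, since both rely on least squares.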

So, what do we have now? Obviously, t Stat and other tests are okay. We're out of insignificant and useless variables. Oh, and look at the corner of the second table! Our fail factor has become lower. It's 1.02E-19. Used to be 4.06E-18. 40 times smaller. Nearly 98% of our fail factor was contributed by the variables we kicked.
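The E-notation arithmetic can be checked in a couple of lines. The helper name is made up; it only swaps a decimal comma for a point so that values typed the way some Excel locales display them can be read.

```python
def parse_excel_number(text):
    """Convert text such as '4,06E-18' (decimal comma) into a float."""
    return float(text.replace(",", "."))

before = parse_excel_number("4,06E-18")  # three factors in the model
after = parse_excel_number("1.02E-19")   # word count only

ratio = before / after              # how many times smaller it became
share_removed = 1 - after / before  # part of the old value that is gone

print(f"{ratio:.1f} times smaller, {share_removed:.1%} removed")
# 39.8 times smaller, 97.5% removed
```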

Now, we can draw a rule for the review count in FanFiction.Net's Sonic the Hedgehog section in November, 2010.

y=0.001326x1 + 2.46 + e

y - review count
x1- word count
e - compulsory random error, for all the forces we did not account for

As you can see, the function is linear. By default, you should get about 2 reviews in Sonic the Hedgehog (2.46, the constant in the equation). Every word you write, according to this function, adds roughly a thousandth of a review. This means, if you write a thousand words, you, statistically, get 2.46+1.33, or nearly 4 reviews. "Statistically" means "on average". This equation is a pretty good tool to measure how well your story is faring against works of others.
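To try the rule on your own stories, here is a small sketch that plugs a word count into the equation, using the unrounded coefficients and leaving out the random error e. The function name is an assumption made for this example.

```python
# Coefficients from the equation above (Sonic the Hedgehog, November 2010)
SLOPE = 0.001326   # reviews gained per word
INTERCEPT = 2.46   # reviews at zero words


def expected_reviews(word_count):
    """Average review count the linear rule predicts for a word count."""
    return SLOPE * word_count + INTERCEPT


for words in (0, 1000, 10000, 50000):
    print(words, "words ->", round(expected_reviews(words), 2), "reviews")
```

If your story sits well above the predicted number, it is doing better than the average story of its length in that section; well below, and it is lagging.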

Right now, you can make an estimate for your stories written in that fandom. You know what influences the review count, and how many reviews you can expect when you start writing there. If you're a review hog, have a group of friends analyse several fandoms, and join the one that gives you the most reviews per written word.

CONCLUSION

Conclusions are necessary in research. They must be brief and informative because some people like spoilers, and skip to the results.

In Sonic the Hedgehog of FanFiction.Net during November, 2010, the total word count influenced the total review count. There was a positive linear relationship, where every extra word added a thousandth of a review.

The number of submitted stories and author ID were irrelevant to the total review count. So was the chapter count, an alternative to the word count.

EXTRAS

Here is a bonus for the curious. You did see that our linear function was described as "pretty good". What if there is a better way? Surely, if someone writes 50,000 words in one chapter, they can't possibly get as many reviews as someone with a more reasonable 5,000? Nobody reads 50k in one chapter, you may even think. And your thoughts may be right. Regression analysis gives us linear results, and a line can only go up or down indefinitely from start to finish. We could build a curve instead.

However, the problem with curves is that the more complicated they are, the more time it takes to put one to use. With that in mind, we go to our working table, with just two variables: review count and word count. We're going to draw a chart. First, move the reviews column to be on the right of the word count column. We need it to dodge some messy misconceptions on Excel's part.

Insert-Chart. Pick XY (Scatter). It's very important that you use the scattered dot matrix. Upon clicking it, select the default subtype without any connections. Click Next. In the window that appears, you may have what you need already, but, to be sure, look at Data range (below the chart), erase it, and select two columns, the review count and the word count. Make sure the series are in Columns (radio selector). Click either "Next" or "Finish" because we should have everything now.

You should see a weird mess of dots, lots near the zero point, and just a few far from the beginning. Left-click on one of the dots. Several of them should light up yellow. Right-click on the dot, and select "Add Trendline". A new window should appear. You should see different curve types you can select. The linear is the default one, and it would have been identical to the equation above. We select the top right one, Polynomial. Most of the time, it's the most useful curve type. Now, go to Options on top of the window. You should see three tick boxes. Tick the third one, Display R-squared value. Go back to Type, on top of the Add Trendline window. Look at "Order" next to the Polynomial curve.

Order 2 gives you a parabola. Order 3 gets you a cubic curve, and so on. The higher the order, the more steeply the curve can rise. Right now, we have to decide which order is the optimal one. The optimum is somewhat arbitrary. If a higher order does not give you a "sufficient" increase in the R-squared value, stick to the current one. If you recall, our linear trend gave us a 0.57 value, so 43% of all changes are a mystery. Let's pick order 2 and click "OK". A curve appears. It reaches the bottom at a certain point, and R Square is 0.62. That's an increase of 5 percentage points. We've found a better estimate for our function, but is there an even better one?

Repeat the steps: left-click dot, right-click-select Add Trendline, pick Polynomial - Options, tick Display R-squared value on chart - Type, pick order 3, OK. Now, it says 0.701. Eight more percentage points. We've gone up from 0.57 to 0.701 in total only by changing the curve's form. Truth is definitely out there. You can still try orders 4 and 5. Make the graph larger, so all the numbers fit on-screen. Order 4 gave 0.707, a gain of less than one percentage point. Usually, that's a sign that going higher is useless, and it's reasonable to assume things only get worse from there. Order 5 is too complicated, and too useless.
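The same order-by-order comparison can be sketched outside Excel. The snippet below fits polynomials of order 1 to 4 by least squares and prints R Square for each, so you can watch the gains shrink. It is a plain-Python sketch, not Excel's trendline code: the helper names are invented, the word counts are rescaled to the 0 to 1 range to keep the arithmetic stable, and the sample numbers are hypothetical, not the data from this study.

```python
def solve(matrix, rhs):
    """Solve a small linear system by Gaussian elimination with pivoting."""
    n = len(rhs)
    rows = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        # Swap in the row with the largest entry in this column
        pivot = max(range(col, n), key=lambda r: abs(rows[r][col]))
        rows[col], rows[pivot] = rows[pivot], rows[col]
        for r in range(col + 1, n):
            factor = rows[r][col] / rows[col][col]
            for c in range(col, n + 1):
                rows[r][c] -= factor * rows[col][c]
    solution = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(rows[r][c] * solution[c] for c in range(r + 1, n))
        solution[r] = (rows[r][n] - tail) / rows[r][r]
    return solution


def r_squared(xs, ys, order):
    """R Square of the least-squares polynomial of the given order."""
    scale = max(abs(x) for x in xs) or 1.0
    us = [x / scale for x in xs]  # rescaled x values
    size = order + 1
    # Normal equations for the polynomial coefficients
    normal = [[sum(u ** (i + j) for u in us) for j in range(size)]
              for i in range(size)]
    rhs = [sum(y * u ** i for u, y in zip(us, ys)) for i in range(size)]
    coeffs = solve(normal, rhs)
    fitted = [sum(c * u ** k for k, c in enumerate(coeffs)) for u in us]
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - f) ** 2 for y, f in zip(ys, fitted))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Hypothetical word counts and review counts, NOT the article's data
words = [500, 2000, 5000, 9000, 15000, 30000, 60000, 90000, 120000, 150000]
reviews = [2, 4, 6, 15, 22, 40, 95, 150, 170, 160]

for order in (1, 2, 3, 4):
    print("order", order, "R Square", round(r_squared(words, reviews, order), 3))
```

R Square can never go down when you raise the order on the same data, so the question is always whether the gain is worth the longer equation.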

Order 4 is going to be a pretty long equation, and minuscule extra accuracy isn't worth the high-power equations. Order 3 is good, but it will lead to an irrational end (look at where a cubic curve heads once the word count gets large). Order 2 isn't bad, but the gains aren't huge either. Let's leave it at order 3. The nearly 10% increase in accuracy is very nice. Right-click the second lowest (order 3) trend line, click Format Trendline, go to Options and tick Display equation on chart. OK.

It would be: y=-2E-13x^3+4E-8x^2-0.0002x+5.436

This one, while better suited to describe the review count in general, has two problems.

1. It cannot be used for stories longer than 170k.
2. It overappreciates the minimal number of reviews a story can get.

As such, it is good in theory, but, in practice, stories are shorter, and their brevity calls for a different system of reviews. For this reason, let's also include the 2nd degree polynomial function:

y=-5E-9x^2+0.002x-2.7011

Interestingly, the number of reviews would drop after a story gets more than 200k words. While reasonable, this function has an accuracy problem, compared to the 3rd order. The solutions can be mind-boggling, like taking one function for word counts 0 to 10,000 and another for 10,001+. Less exotically, once we decide to get to the bottom of the issue and stop tolerating discrepancies, we need to not only drop variables, but also drop data points. Without going into two complicated tests, pick Tools-Data Analysis-Descriptive Statistics. Select the two variables, labels in first row. Tick summary statistics and Kth Largest, Kth Smallest, both set to 1. OK. There should be a table with four columns, two per variable. We're going anomaly-hunting.

We need two things: the top value, Mean, and Standard Deviation, row 7. Add three times the Standard Deviation to the Mean. For word count, that would be 13,727+3*26,726=93,905. Why are we doing this? Anything above this value is an anomaly: if the values followed a bell curve, only about 0.3% of them would land more than three standard deviations away from the mean. Since we have 100 data points, any one word count above 93.9k is an anomaly. What do we do to anomalies? We delete them. What do we get afterwards? A headache, looping back to the charts. That's the beauty of statistics: while 80% of all accuracy requires 20% of effort, getting the remaining 20%, you guessed it, makes you sweat a whole 80%.
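The anomaly hunt can be sketched in a few lines of Python as well. This assumes the same rule as above (Mean plus three times the sample Standard Deviation, which is what Excel's Descriptive Statistics reports); the function names and the sample list are made up for illustration.

```python
from statistics import mean, stdev


def three_sigma_cutoff(values):
    """Mean plus three sample standard deviations."""
    return mean(values) + 3 * stdev(values)


def drop_anomalies(values):
    """Keep only the values at or below the three-sigma cutoff."""
    cutoff = three_sigma_cutoff(values)
    return [v for v in values if v <= cutoff]


# The article's own figures for word count: Mean 13,727, Standard Deviation 26,726
print(13_727 + 3 * 26_726)  # 93905

# Hypothetical word counts: twenty short stories and one monster
word_counts = [1000] * 20 + [100000]
print(len(drop_anomalies(word_counts)), "stories left after the cut")
```

After deleting anomalies, the mean and standard deviation change, so the regression and the charts have to be redone on the trimmed list.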

Hopefully, this has been an interesting enough adventure in the realm of online research. Calculating the basics really takes but a few minutes, but when the world gets you stumped in conclusions that seem impossible, you may spend hours. And when you think this is ludicrous, ask Facebook or Google if there's a better way to get into your head.

Merry Christmas, folks!

2 comments:

  1. I found this really interesting. Thanks for writing all this.

  2. As a data geek and FFN writer, this has been fascinating reading! I have written two substantive stories in a non-Sonic fandom and can concur that the 1/1000 reviews per wordcount is almost dead on. Right now, my stories are at:

    103,711 words, 116 reviews and
    21,518 words, 20 reviews

    Pretty cool. Are you at all able to do any analysis on either Favorites or Alerts?
