Q&A versus reference sites

02-Jul-11

Stack Exchange is a hub of vertical (domain-specific) question and answer sites (including programming, system administration, theoretical computer science, physics, gardening, English, and other subjects) using the eponymous software. The pioneering site for the software was Stack Overflow. Stack Exchange’s key features include: specific questions with the intent of getting specific answers, voting on questions, voting on answers, the concept of an accepted answer determined by the question-asker, reputation and gaming features based on upvotes received, and rearranging the order of appearance of answers in a page based on votes. Specific questions tend to get answered accurately and quickly (often within minutes), usually by “regulars” who visit the site hourly and have accumulated huge reputations. Stack Exchange’s site statistics as well as question answering rate are impressive (2010 summary).

Math Stack Exchange is one such site, meant for undergraduate level mathematics questions and answers (this differs from the more research-level question-and-answer site Math Overflow).

Some of my random thoughts on the relation between Q&A sites and reference sites (like Wikipedia or the subject wikis):

  • The success of Q&A sites, particularly those of the Stack Exchange type, often relies on the fact that reference sites either don’t have sufficiently deep information, or that such information is not easily accessible or navigable for many users. For instance, I’ve seen questions asked on Math Stack Exchange in group theory and two of my common reactions are: (i) Oh, this is content I’m planning to put up in Groupprops in a few weeks/months! and (ii) Oh, it’s already there on Groupprops, but somewhat hard to find. See for instance this question. If somebody has not yet already posted an answer, I post a link, but generally some enterprising soul has posted an answer already — often going to quite some effort to type out a laborious answer.

  • It’s overall preferable to have a reference site that automatically and quickly answers your question than to have to type it to a Q&A site and have somebody else type an answer. Apart from saving time, the reference site can go into greater depth, have greater reliability, link to related material, and answer questions at the periphery of your consciousness. Q&A is, for many purposes, an inferior good.

  • One goal of a good reference site is to push out the boundaries (i.e., raise the standards) of the questions that people consider asking on Q&A sites. The “Wikipedia test” (respectively, “Google test”) filters out any question that can be easily answered by Wikipedia (respectively, Google). People who ask questions that fail these tests risk their reputations. The better Wikipedia and Google get, the more questions get filtered out. With the subject wikis, I am hoping that yet more questions get filtered out because they can be immediately answered by search/lookup.

  • One advantage of the Q&A style as a reference — the focus on questions creates a somewhat different hierarchy of importance and relevance, which might be more useful for somebody with a question (or the beginnings of one) to locate it. Also, the Q&A navigation experience is more suited for some moods and needs. However, there’s an alternative to Stack Exchange-style active Q&A: “passive Q&A”. This is Q&A lists deeply embedded in a reference — with short questions and somewhat longer answers that link deeply to content in the reference. A prototype is Questions on normal subgroup. Due to the semantic richness of the underlying reference layer (Groupprops), the questions can be organized quite well and the whole experience can be very smooth.

  • Spontaneous questions asked by real people will still keep uncovering new areas of confusion and scope for clarification that passive Q&A has not yet covered. Passive Q&A must regularly borrow new insights from active Q&A to continually improve.

The observations above pertain strictly to Stack Exchange style sites. There are other Q&A sites, such as Quora, that are hinged more on exploration, discovery, and creating new knowledge — something that cannot be included in a reference because it needs discussion and opinion-soliciting to develop in the first place. Quora aims to use the question and answer model to create new knowledge; according to their about page, Quora is a “continually improving collection of questions and answers created, edited, and organized by everyone who uses it. The most important thing is to have each question page become the best possible resource for someone who wants to know about the question.” Here are Seb Paquet’s thoughts on Quora and how it differs from a reference.

For more on the similarities and differences between Quora and Stack Exchange, see this Quora question and the Quora vs Stack Exchange topic on Quora.

Dreaming — what will the subject wikis look like in 2015?

18-Apr-11

By 2015, the world will probably look pretty different from how it looks now (although the more things change …). On the computing and Internet connectivity side in particular, things will be different: touch interfaces, ubiquitous social networking, and all kinds of other forms of interactivity and intelligence will be embedded into web browsing. I hope that the subject wikis will continue to evolve. It’s hard to imagine exactly how we will interact with tomorrow’s technologies, but here’s a start on the kinds of improvements possible.

Instant page loading

This sounds mundane, but it’s the start. If individual pages can load in users’ browsers in under a second, then it becomes possible to load large numbers of pages. Velocity matters, and people spend entire conferences giving talks about shaving milliseconds off page load times.

Morphing pages based on user needs

Pages will no longer be static or quasi-static constructs but rather will be highly responsive to what the user is looking for. Underlying each page will be a static page (which the user can view as a static page if desired) but the default interaction will include such features as:

  • Highlighting or focusing on parts of the page that are best suited to the user’s state of knowledge, as inferred either directly from past history on the site, or through a quick, seamless survey question, or through the user’s social network information. For repeat visits, highlighting portions the user may have missed or ignored in the past, or new stuff that is best kept to a second visit.

  • Facilitating highly intelligent page annotation and sharing. For instance, if a page gives five equivalent definitions of a concept, a student for a course can circle Definition 1 and say “this is what we saw in class” and circle Definition 2 and say “this is what I need to show as equivalent to Definition 1 in the homework” and circle Definition 3 as “I don’t understand this yet, but it might be worth coming back to later if I’m stuck with some question involving this idea.” This annotation can affect the user experience in future visits, and it can be combined with sharing. Imagine two or three people doing homework together. One of them sees something relevant to a homework problem, then he/she can “share” it with friends, annotated with the number of the homework problem, and they can all study it together.

Large-scale statistical generation

As pages tailor themselves to user needs, the users in turn provide pages information as they interact with them, and this information gets aggregated in a data-rich environment. Rather than going by anecdotal impressions, we’ll then have fairly concrete, large-scale data, across different learning and course environments, of things people find difficult versus easy, of common confusions that they have, of the ideas that they find interesting, and all other kinds of correlations.

Explore in depth

For any topic or any statement that an individual user doesn’t understand, that person can tap into a vast network of further resources that elaborate on it — including not just the static wiki content, but also how other users in the past have interacted with the content, and how these other people ultimately resolved their issues with the content. It’s like a dream come true of somebody saying “Hey! I didn’t understand X” and actually getting a clear, more detailed, and interactive explanation of X — all without an actual instructor.

At the basic technological level, all of these are feasible today — but with the technology of today (both hardware and software) much of this isn’t seamless, which means that it would be very hard to implement, which means it will not scale. But circa 2015, much will be different!

Platform versus content, and the knowledge domain

18-Apr-11

There’s an interesting distinction between things that serve the role of a platform/utility and things that serve the role of content. A platform is something relatively neutral on top of which content is put (by one or many people).

For instance, an operating system (such as Linux, Windows, Mac OS) serves as a platform, and the various software applications that run on top of it serve as the “content”, though some of this software (such as word processing software) may itself serve as a platform for people to create content (documents). The underlying software, web site, and server of a social network are a platform on top of which people create content. For instance, on Facebook (the platform), people talk to each other, share links and news, play games and have other kinds of interactions, and exploit the power of Facebook’s social graph through tools such as Facebook Connect.

In a similar vein, the MediaWiki software plays the role of a platform for MediaWiki-based websites such as Wikipedia, where various contributors add and improve upon content, and others come and consume the content.

As a general rule, successful platforms enjoy a lot more leverage than successful content, i.e., the number of people adopting a successful platform could be much larger than the number of people who consume successful content. For instance, the Harry Potter series has sold copies in the tens of millions, but this is still a lot less than the number of users of the Facebook website (over 600 million). The market for particular content is limited to those people who want to consume that particular content. Platforms, by virtue of being (relative) blank slates, can attract a wider range of people who can build and customise them in different ways.

Of course, platforms are not complete blank slates. The success of a platform, whether a social network or an iPhone, depends a lot on its specific rules and user experience, as well as the various defaults and norms that evolve around it. However, the parameters of a platform are more in the nature of general rules of interaction for how content may be overlaid on them, rather than specific pieces of content. A platform comprises not just software or hard-coded/law-based decisions but also informal norms of behavior and “community rules” that evolve around this software.

Given the substantially greater leverage enjoyed by platforms, and the substantial increases in such leverage offered by the Internet, it is unsurprising that the most innovative companies and products in the Internet era have been platform-based. In a different era, a highly creative person spent time making paintings or designs. Today, a highly creative person can start a cool little website that soon turns into a world-changing platform for communication and social interaction.

A question of key relevance for the subject wikis is: what are effective platforms for content dissemination/absorption/education/learning? In other words, what kind of generic infrastructure best facilitates specific content dissemination/absorption/learning goals?

What’s interesting to me is that this question appears not to be one of great importance to people. Individually, there have been many new platforms/modes of knowledge dissemination: wiki software (including its features of strong internal linking, collaborative editing, and canonical page naming), blogs (with their timeline-based post nature and usually personalized/narrative style), open course videos, etc. However, with the exception of Wikipedia, none of these platforms have attracted sufficiently large amounts of content in a sufficiently easy-to-locate fashion to fundamentally transform people’s day-to-day experience of knowledge acquisition and assimilation.

Crudely, it seems to me that the state of knowledge dissemination/acquisition and information flow today is akin to the state of web search circa 1995, or the state of social networking in 2003. The Wikipedia model does have a lot of what I think are fairly right answers with respect to knowledge dissemination, but Wikipedia in isolation simply isn’t the complete answer, because it is meant to be an encyclopedia and not a learning resource for in-depth subject-specific knowledge.

Why hasn’t there been a lot of platform innovation in the knowledge arena? I think there are some reasons:

  • People mistakenly think the problem is solved: This is probably one of the more important reasons. People think that the current modes of knowledge acquisition and assimilation, many of them dating back many centuries (such as books and chalkboard lectures), are the best that will ever be devised, and that there is no real room for new platforms. Though they may agree that the specific books or lectures they have been subjected to could be improved upon, they don’t see the need for or the possibility of something fundamentally better. Even those who acknowledge some of the Internet’s innovations think that everything that had to happen in the knowledge arena has happened; for instance, some are of the view that Wikipedia has basically solved the knowledge problem. It’s a bit like people back in 1995 thinking that web search was a solved problem, or people in 2000 thinking that the existing methods of keeping track of friends via phone, email, and greeting cards meant there was no need for fundamental breakthroughs (such as social networking).

  • The problem is harder and has less leverage: The trickier the knowledge, the more quirky it is by nature, which means that generic delivery platforms fail to do the specific knowledge justice. Thus, platform design has to take into account the quirks of specific content. But this makes the platform much less scalable and means it enjoys much lower leverage. If the right platform differs for each discipline, then the leverage that a platform enjoys is limited to that discipline. With the exception of an encyclopedia-type platform (where Wikipedia already reigns), there’s no question of getting a platform that appeals to the needs of a billion users.

  • The combination of subject-matter expertise and platform development skills is highly rare: This is particularly true for hard subjects. The people who know the most about a topic are people who spent years learning that topic, and hence probably didn’t spend years thinking about platform development. The best platform developers have often been college dropouts (Steve Jobs, Bill Gates, Mark Zuckerberg) and their knowledge of other subject matter is limited.

  • Incentives in academia (where much knowledge resides) are not platform-oriented: Academics face incentives to publish, and those in teaching colleges face incentives to teach. These incentives are “content”-oriented: produce stuff, and get rewarded. Few have incentives to think about developing new approaches to building platforms, or even collaborating with others who are thus interested. Academics enter big-picture stuff typically after they have tenure, but the big-picture stuff now tends to be directing research programs, not making the existing knowledge more transparent and readily accessible. It simply isn’t fashionable in academia to boast about reducing the time taken for learning a new concept from four minutes to three.

Briefly noted

18-Apr-11

Brief news updates:

  • The MediaWiki engine for all the subject wikis was upgraded to MediaWiki 1.16.4, a security release, on April 17. See the security release notes.

  • As I noted in a buzz, a reply to a Math Overflow question (admittedly not a very difficult one) refers to Groupprops in a cool manner.

  • The Wikimedia Foundation is working on a simple survey tool. If well built, this could serve to replace the current SurveyMonkey-based single question surveys (though I’ll be using SurveyMonkey for longer surveys).

  • MediaWiki 1.17 is on its way — the wikis will be upgraded once an official release candidate is available.

MediaWiki upgrades, single question surveys

04-Feb-11

I’ve upgraded all the wikis to MediaWiki 1.16.2, the latest stable release of MediaWiki.

On Groupprops, I am introducing, currently in experimental form, “single question surveys” which will be of many types. These are expandable single question surveys advertised at the top of the page. Clicking on the “SHOW MORE” button expands to show the question. After the user selects an option and submits, the user is told the correct answer as well as how many people chose which option.

The complete collection of surveys can be viewed at this wiki page.

Pop quiz questions in group theory

The purpose of these is to stimulate interest in the group theory material on the website. The questions are generally of moderate difficulty for people who already know the material. They contain links to the relevant pages. After selecting an answer, users get to know the correct answer, as well as how many people selected each answer option.
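
The submit-and-reveal behavior described above can be sketched in a few lines of Python. This is only an illustration, not the code behind the actual surveys: the class name `SingleQuestionSurvey`, the sample question, and its options are all hypothetical.

```python
from collections import Counter


class SingleQuestionSurvey:
    """Hypothetical model of a pop quiz: one question, fixed options, one correct answer."""

    def __init__(self, question, options, correct):
        if correct not in options:
            raise ValueError("correct answer must be one of the options")
        self.question = question
        self.options = list(options)
        self.correct = correct
        self.counts = Counter()  # number of responses per option

    def submit(self, choice):
        """Record a response, then reveal the correct answer and the tally so far."""
        if choice not in self.options:
            raise ValueError(f"unknown option: {choice!r}")
        self.counts[choice] += 1
        # Report every option, including those nobody has picked yet.
        tally = {option: self.counts[option] for option in self.options}
        return {
            "correct_answer": self.correct,
            "was_correct": choice == self.correct,
            "tally": tally,
        }


# Illustrative question (made up for this sketch, not taken from the site).
quiz = SingleQuestionSurvey(
    "What is the order of the symmetric group S4?", ["12", "24", "48"], "24"
)
quiz.submit("12")
result = quiz.submit("24")
print(result)
```

Returning the tally only after a submission mirrors the described flow, in which users see how others answered only once they have committed to an option themselves.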

These questions could be a window both into group theory and into the material available on the website. I’m hoping that they will convert some casual visitors into high-depth visitors, as they discover more relevant content on the website.

As of now, there are four pop quiz questions, but I plan to expand considerably into a wider range of questions. Currently, all these questions are included in the site notice, and the question shown is not chosen based on the page the user is viewing. Later, some questions may be embedded inside pages too, so that users see questions that are more relevant to what they are currently learning.

Single question user profile surveys

There is only one such survey currently: a survey asking about the user’s background knowledge of group theory.

This survey was created January 19 and the question had a response rate of about 1-3 responses per day over the last couple of weeks.

Feedback questions

These are questions about what people thought/think about the material on the website, and whether it addressed the questions they originally had. There’s only one such question right now.

Magazine format versus computable data format

22-Dec-10

These are somewhat unstructured ruminations about the role and nature of the subject wikis. I’ll draw here a contrast between two alternative formats: the magazine/blog format and the computable data format.

In the magazine/blog format, time plays a key role, and the units of production are individual articles/blog posts/ideas posted at specific times. The interaction and feedback to these then influences subsequent articles/blog posts.

The computable data format focuses on providing data in a highly polished and easy-to-use form. As time passes, the data gets modified, but at any given stage, what is presented is a collection of data and not so much a timelined sequence of articles/posts.

The computable data format is great for getting specific data quickly. But it fails to offer something that the magazine format excels at: new, exciting, serendipitous stuff that we weren’t looking for or actively seeking.

In the modern web era, the magazine format can do double duty as semi-computable data when there is a plethora of archives to search. Conversely, the computable data format can be used to create small magazine-style missives.

The subject wikis are essentially of the computable data format, not the magazine format. However, it would be great if some features of the magazine format could be adapted to the subject wikis. This would help attract people to the websites to regularly check out content and generate a higher level of excitement.

Thinking through value propositions

22-Dec-10

Business types use the jargon “value proposition” for the (typically specific and unique) values offered by a product or service. When I started Groupprops four years ago, and incorporated it into Subject Wikis 2.5 years ago, I had some ideas about the value proposition, but these ideas have been continually modified based on the way people have actually used the website.

As I shift attention to improving Topospaces significantly to incorporate algebraic topology as well as the relationships between point set, algebraic, and basic differential topology, I think it’s a good time to reflect on the experience of building Groupprops and think about whether the experience can help with a more rapid development of Topospaces.

Information at fingertips

One of the things that I disliked about most conventional reference resources was the fact that specific information details were often left for readers to work out, and the information that was given was scattered across multiple sources. I’ve found this extremely frustrating. It may be a good approach to teaching a course, but it’s not helpful in a reference. One of the goals of subject wikis is to make specific information quickly and readily accessible to people in a way that they can eyeball it and get a sense of things.

This was not completely missing from the original goals, but my original focus had been more to include proofs of general statements. Now, information on each specific example in full, gory detail has taken on a lot of significance. Also, unlike the case of most references, these examples or particular cases/instances are treated as separate entities with their full development, rather than mentioned only in the context of general results that they may or may not illustrate. Some additional observations about the significance of this:

  • In textbooks, examples are usually developed when the context is ripe, and are developed only to the extent that they illustrate some important principle. Thus, a lot of examples that don’t illustrate any principle directly important to the author of the textbook are ignored. The subject wikis approach is different. Each example is developed as a separate entity. Then, for a particular example, general facts that highlight the particular features of that example are linked to (and necessary computations to elucidate the link are shown). Conversely, for a general fact, those examples that are best understood in context of that fact are linked to. Neither specific examples nor general facts are parasitic on each other.

  • In many areas of learning, there is a bunch of common misconceptions that students develop. Part of the source of these misconceptions is the fact that students haven’t seen enough examples of a sufficiently wide range. In the subject wikis, the goal is to design pages in a manner that scrolling through the page gives a “general feel” for the nature of examples, thus reducing the chance of misconceptions. For commonly identified misconceptions, cautionary notes and highlights are included both in the general page and in pages developing each specific example.

  • Separately developing examples without worrying about whether those examples are “important” makes the resource useful for discovery — somebody can come along and see a well-developed example and notice patterns that perhaps weren’t otherwise obvious.

For instance, consider the page element structure of symmetric group:S4. In addition to developing information about the element structure of this particular group, the page explains how this element structure fits into the context of interpretations of the group as a symmetric group, as well as a projective general linear group of degree two. Links are provided to general facts about the element structure of symmetric groups and also of projective general linear groups of degree two.

Relationships between facts, and identifying the critical jumps

As should be clear from what’s said so far, our goal is not to shy away from the gory details of computation and development of specific examples. Even more important here are the explicit calculations that underlie big theorems, results, and relationships, calculations that too many references leave as an exercise to the reader. In addition to providing these calculations, the context, generalization, and limitations of these calculations are considered. This could help people get a feel for exactly why and how the calculations work as they do.

For instance, those who’ve seen some point set topology and the beginnings of algebraic topology may have encountered the notion of fundamental group. In a typical algebraic topology text, the definition is accompanied by a brief description of why the fundamental group is a group. The details are often omitted or left as an exercise for the reader. The Topospaces page on fundamental group takes a different approach. Each of the aspects of showing that the fundamental group is a group is stated clearly, and the proof/explanation is deferred to a separate page, where the construction is covered in detail (these pages are still under development). It turns out that many of the same ideas turn up when we are trying to understand loop spaces, so this same proof/explanation page serves double duty as showing things about loop spaces.

Thinking deeply about simple things

Textbooks are often written for first-time learners or perhaps second-time learners. They use a linear ordering and, typically, have to restrict information on a topic to what can be explained based on topics covered so far.

It is often the case that the definitions of the simple and basic concepts in a subject (particularly in mathematics) have a number of subtleties that cannot be pointed out to first-time learners when the concepts are being introduced. These subtleties are mentioned in random places as people learn. Ten years after learning a subject, the experienced student has no single place to turn to to get an updated, improved, but concise description of all these added nuances.

With the subject wikis, we hope to overcome this limitation. Definitions of simple ideas are accompanied by alternative definitions/interpretations that rely on the development of more complicated machinery. The equivalence of these multiple definitions may itself rely on important results of the subject, which are linked to. Experts in a subject usually have all these multiple definitions in their head when they think of the concept, whereas novices are often stuck on the “basic” definition and can recall other definitions only on being prompted (even if they are aware of them). By having these multiple definitions all available in one single place, people can build their expertise in the topic faster. The pages on 2-subnormal subgroup and normal subgroup at Groupprops are illustrations.

Short-run milestone

03-Nov-10

In a previous blog post more than a year ago, I’d put out a list of possible success measures to indicate that the subject wikis have arrived. One of these success measures was first achieved for Groupprops about two months ago: “a single day with more than 1000 pageviews for a subject wiki.” In fact, there have been over 37000 pageviews over the last 31 days, with the highest ever being 1500+ on one day. Even restricting to “unique” pageviews (i.e., not counting multiple pageviews by the same visitor on the same day), on most weekdays over the past month this number has exceeded 1000.

Note that the visit and pageview numbers are based on Google Analytics aggregation, and attempt to exclude all bot accesses. So, these should mostly reflect human pageviews.

Secular trend in Groupprops

The overall secular trend in both visits and pageviews over the last two years has been that of somewhat more than doubling every 12 months. A great deal of weekly and seasonal variation (see here) masks the secular trend in the short run, but any comparison between a date and another date 364 days earlier (so that it’s the same time of year and the same day of week) reveals a robust more-than-doubling. For instance, here are total visits and total pageviews over 31-day periods:

  • October 5, 2008 to November 4, 2008: 1322 visits, 4101 pageviews, 3.10 pages/visit.

  • October 4, 2009 to November 3, 2009: 6888 visits, 16469 pageviews, 2.39 pages/visit.

  • October 3, 2010 to November 2, 2010: 16272 visits, 37142 pageviews, 2.28 pages/visit.
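
The arithmetic behind this comparison can be checked with a short Python sketch. The figures are the ones quoted in the list above; the names `PERIODS`, `pages_per_visit`, and `growth_factor` are made up for illustration. The script also confirms that consecutive periods start exactly 364 days (52 weeks) apart, and hence on the same day of the week.

```python
from datetime import date, timedelta

# (period start, visits, pageviews), as quoted in the list above
PERIODS = [
    (date(2008, 10, 5), 1322, 4101),
    (date(2009, 10, 4), 6888, 16469),
    (date(2010, 10, 3), 16272, 37142),
]


def pages_per_visit(visits, pageviews):
    """Average number of pageviews per visit, rounded to two decimals."""
    return round(pageviews / visits, 2)


def growth_factor(earlier, later):
    """Ratio of a later figure to an earlier one (2.0 means a doubling)."""
    return later / earlier


for start, visits, pageviews in PERIODS:
    print(f"{start}: {pages_per_visit(visits, pageviews)} pages/visit")

# Compare each period with the one starting 364 days earlier.
for (s0, v0, p0), (s1, v1, p1) in zip(PERIODS, PERIODS[1:]):
    assert s1 - s0 == timedelta(days=364)  # 52 weeks exactly
    assert s0.weekday() == s1.weekday()    # same day of the week
    print(
        f"{s0} -> {s1}: visits x{growth_factor(v0, v1):.2f}, "
        f"pageviews x{growth_factor(p0, p1):.2f}"
    )
```

Both year-over-year ratios come out above 2 for visits and for pageviews, which is the “more-than-doubling” referred to above.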

Despite the secular trend in growth, usage patterns remain fairly similar. The number of pages per visit is going down, but visits at all levels of depth are increasing in absolute numbers. For instance, if we consider only visits that have 5 or more pageviews, then the situation is:

  • October 4, 2009 to November 3, 2009: 785 visits, 7248 pageviews, 9.23 pages/visit.

  • October 3, 2010 to November 2, 2010: 1683 visits, 14865 pageviews, 8.83 pages/visit.

Thus, even at this high depth, the numbers are doubling.

The growth is not completely uniform across pages — the greatest growth has been in pages about specific groups, that have considerably expanded and improved over the last year. For instance, symmetric group:S4 saw 1576 pageviews over the last 31 days, as opposed to 518 a year ago, while symmetric group:S3 saw 1704 pageviews over the last 31 days, as opposed to 836 a year ago. Although for most individual pages, the growth has been less than a doubling, the expansion in the number of pages more than makes up for this.

Other wikis

One explanation for the rapid increase in the case of Groupprops is the expansion of content, as well as growing site visibility. The picture is more mixed for other wikis. The topology wiki, which has not changed much over the last year, has also enjoyed a more-than-doubling in visits and pageviews, albeit starting from a much smaller base. The total pageviews for the last 31 days were 3030, which gives an average of slightly less than 100 pageviews per day, while the total pageviews for a similar time interval last year was 1444. The market wiki has quadrupled over the last year, albeit from a very small base: from 147 visits and 204 pageviews to 651 visits and 837 pageviews. The classical mechanics wiki has grown by a factor of infinity — it had almost no content and precisely no visits last year in this time period, and now has 99 visits and 122 pageviews over a 31-day period.

For most other wikis, such as the commutative algebra wiki, the pageview growth has been near zero, and is much less visible than random fluctuation.

Recent improvements

18-May-10

The subject wikis are being upgraded to MediaWiki 1.16.0beta (see here for the security release). The high traffic wikis have already been upgraded; others should be upgraded in a few days. We’re also upgrading to Semantic MediaWiki 1.5.0.

The default skin on the upgraded wikis is the “Vector” skin, which is the same as Wikipedia’s new default skin. Those who want to change the skin appearance back to monobook should create an account and change their user settings. Users who already have accounts may still have their settings as “monobook” — so they need to manually change the settings to “vector.” For more information about why Wikipedia switched to a vector skin, see Usability Initiative.

Apart from these software changes, we’re also making changes to page content and appearance to make it easier to find relevant information. Most of these build further on the page design changes blogged about earlier. These include:

  • Front-ending the definitions: For long pages, the definition is being moved right to the top, above the table of contents and the article-tagging template boxes, which give information about the type of term and similar terms. This means that people in a hurry can quickly read the definition without scrolling down too much. This change may be rolled out even for shorter pages.

  • Increased use of tables: Tables allow for a compact expression of relates, correlates and analogies, and also make it easier to locate information. On the minus side, it becomes tedious to put detailed and lengthy paragraphs in a table. We are dealing with this by putting the most important summary information in tables, with further detailed information below expandable/collapsible “SHOW MORE”s. Even for definitions, we are switching from the earlier “Symbol-free definition” and “Definition with symbols” as separate subsections to a single tabular format for definitions where one column gives a shorthand phrase for the definition, one column gives the symbol-free definition, and another column gives the definition with symbols (for instance, pronormal subgroup (permalink to current version)). Additional columns may include applications of a particular definition to ways to prove/use the given term. We are also using tables for references, as seen in this example, making them easier to parse as well as look up.

  • Increased use of expandables/collapsibles: Expandable/collapsible “SHOW MORE”s allow for a lot of relevant information to be placed within pages without causing a cognitive overload or making the page too much effort to scroll down. Expandable/collapsibles are used both within pre-defined templates and on a discretionary basis within pages. Sometimes, the most important things are stated and the rest are hidden under a “SHOW MORE” — for instance, the list of properties stronger than characteristicity. The “SHOW MORE” feature uses the MediaWiki extension ToggleDisplay.

  • A continued shift away from categories to semantic constructions: We are continually trimming down the use of categories to the level where they help with broad “containment”-based navigations. We’re moving all relations and analogies to the semantic realm, which is much more flexible and allows for more powerful querying. For instance, variations on a particular term are no longer stored in a MediaWiki category, but can instead be accessed using a semantic query — for instance, here’s the query for variations of normal subgroup.

  • More suggested semantic queries: The pages now often contain links to semantic queries that might answer further questions the reader could or should have. Many of these links are generated automatically through templates.

Groupprops usage patterns update

13-Jan-10

I had started working on a report on usage pattern analytics for Groupprops, but for various reasons, will not have the time to complete the report in the near future. Also, I would like to subject some of the findings in my preliminary number-crunching to the test of more data, particularly data spanning more than one year. Nonetheless, it might be worthwhile to note some of the findings on the blog. (What follows are what I consider the most salient snippets from the current draft of the report.)

Nature of variation between daily traffic across days

The variation is broadly of three kinds:

  • Intra-week variation: There is a clear pattern here: weekend traffic is generally about 50-70% per day of weekday traffic. The minimum usually occurs on Saturdays, and the second lowest is on Sundays, with the third lowest on Fridays (extended weekends?). So, it seems like visits to Groupprops are somehow better classified as “work” than “leisure”. This is further corroborated by the fact that in holiday seasons, traffic is low on all days and the difference between weekdays and weekends is less pronounced. (More data, as well as further segmentation of existing data, will allow the testing of further hypotheses about intra-week variation.)

  • Seasonal variation: There is a reasonably clear pattern here too: seasons that are “off” in colleges and universities see less traffic. Holidays see less traffic in the regions that observe those holidays. For instance, the Thursday of Thanksgiving saw a significant drop in U.S. traffic while U.K. traffic remained at usual weekday levels. Christmas week saw a worldwide traffic drop. Traffic is highest from mid-September to mid-December and from mid-January to mid-June, and lower from mid-December to mid-January and from mid-June to mid-September. (These trends will be better understood once more multi-year data is available, because it is difficult to separate seasonal variation from a general upward trend in traffic. However, these observations are similar to those made in the 2005 full evaluation report for MIT OpenCourseWare.)

  • Secular increase (here, secular means over time, i.e., a long-run trend): Traffic has been increasing since May 2008, when the wiki was moved to this site, with most of the dips being accounted for by intra-week and seasonal variation. For instance, a comparison of the mid-December to mid-January of 2008-2009 with the mid-December to mid-January of 2009-2010 shows an increase of 260% (which means the new traffic quantity is 3.6 times the old). Interestingly, the same-time-of-year comparisons show that the proportional increase is least in holiday seasons and more during seasons when traffic is higher. This hypothesis needs to be tested further.
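Since "an increase of 260%" and "3.6 times the old" are easy to mix up, here is a small sketch of the conversion between the two. The function names and the baseline of 1,000 visits are hypothetical, chosen only to illustrate the arithmetic; they are not actual Groupprops figures.

```python
def multiplier_from_percent_increase(percent):
    """An increase of p% means new = old * (1 + p/100)."""
    return 1 + percent / 100

def percent_increase(old, new):
    """Percentage by which `new` exceeds `old`."""
    return (new - old) / old * 100

# The 260% increase quoted for mid-December to mid-January traffic.
factor = multiplier_from_percent_increase(260)
print(factor)

# Hypothetical baseline, used only to show the round trip.
old_visits = 1000
new_visits = old_visits * factor
print(percent_increase(old_visits, new_visits))
```

The same distinction applies in the other direction: a tripling of traffic is an increase of 200%, not 300%.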

Visits and pageviews

For overall magnitude estimates, there was a total of about 20,000 visits and 48,000 pageviews from mid-September to mid-December of 2009, higher than over previous three-month periods.

The ratio of pageviews to visits has remained steadily in the range of 2.4-2.6, and this ratio has not shown much change despite the secular increase in both the number of visits and the number of pageviews. Moreover, the composition of visitors by depth of visit has remained remarkably similar over time. The breakdown is roughly as follows: 60% of visitors had one pageview, 14% had two pageviews, 8.5% had three pageviews, 4.5% had four pageviews, 3% had five pageviews, 2% had six pageviews, 1.5% had seven pageviews, 1% had eight pageviews, and so on. About 1% had eighteen or more pageviews.
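As a rough consistency check on these numbers, the sketch below computes the pageviews-per-visit ratio from the rounded three-month totals, and the part of that average contributed by visits of depth one through eight. The variable names are illustrative, and the percentages are the approximate ones listed above, so the results are ballpark figures only.

```python
# Rounded totals for mid-September to mid-December 2009, from the text.
visits = 20_000
pageviews = 48_000
ratio = pageviews / visits
print(f"pageviews per visit: {ratio:.2f}")

# Approximate share of visitors (in percent) by number of pageviews.
depth_share = {1: 60, 2: 14, 3: 8.5, 4: 4.5, 5: 3, 6: 2, 7: 1.5, 8: 1}

# Percentage of visitors covered by the listed depths.
listed_total = sum(depth_share.values())
print(f"visitors with 1-8 pageviews: {listed_total}%")

# Contribution of those visitors to the average pageviews per visit.
partial_mean = sum(depth * share for depth, share in depth_share.items()) / 100
print(f"pageviews per visit from depths 1-8: {partial_mean:.2f}")
```

On these rough figures, visits with one to eight pageviews cover about 94.5% of visitors but contribute only about 1.77 of the roughly 2.4 pageviews per visit. The remainder would come from the small fraction of visitors with nine or more pageviews.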

Inter-country variation

This is another area with fertile analytical possibilities. Current results suggest the following picture.

  • In absolute numbers, in terms of visits (pageview rankings are almost the same), the top countries are, in decreasing order: United States, United Kingdom, Canada, India, Germany, Australia, Italy, Israel, Turkey and the Philippines. Note that there is likely to be quite a bias in favor of countries that are English-speaking (such as the United States, the United Kingdom, Canada, Australia) or countries where higher education and research are carried out in English, even though there are other local languages (such as India, Turkey, and perhaps the Philippines and (in the higher mathematical context) Israel). Despite this, Germany and Italy make it near the top of the list. However, the absence of countries such as Japan, Korea, China, and France from the top of the list may be explained by the language factor.

  • The picture looks a little different if we consider the number of visits per capita. Here, the United Kingdom comes out on top (largely due to the contributions of Cambridge and London), and other good performers include New Zealand, United States, Israel, Ireland, Singapore, Canada, and Australia. India falls very far down once we divide out by its huge population, though it still comes higher than China.

  • It is unclear how the per capita usage of Groupprops compares with other indicators, and more research needs to be done on the connection with such factors as wealth, Internet access, number of college students, etc. One clear finding seems to be that with the exception of Israel, the top nine countries in per capita visits are among the top ten in the 2007 Economic Freedom of the World rankings (download report as PDF). The connection with political freedom, as measured by Freedom House, seems more tenuous.

Top cities

The top cities (in absolute numbers, not on a per capita basis) include Cambridge (UK) (home to the University of Cambridge), London, Chicago (home to The University of Chicago, and also where I currently am), New York, and Cambridge (Massachusetts, USA) (home to MIT and Harvard). Other cities that do well include Oxford (home to Oxford University), Pasadena (home to Caltech), Portland (home to University of Oregon), Singapore, Atlanta, Charlottesville, Philadelphia, Don Mills, Stanford, Los Angeles, Chennai, Delhi, Ithaca, Sydney, Manchester, Austin, Champaign, Claremont, and Seoul. The fact that both the top cities are in the United Kingdom, and the clear lead enjoyed by Cambridge, UK, are as yet unexplained, though it seems that the high traffic from Cambridge is largely confined to the period from October 2009 onward.

Browser/OS combinations

In the analysis over one time period, the most popular browser/OS combination among Groupprops visitors appears to be Firefox/Windows (35%) followed by IE/Windows (32%). Other popular browser/OS combinations are Safari/Macintosh (8.37%), Firefox/Linux (7.36%), Firefox/Macintosh (6.11%), and Chrome/Windows (5.93%). 0.38% of the visits came from the Safari/iPhone combination. The changes in these proportions over time are potentially a subject of further study.

Network locations: university networks and commercial service providers

Among the top network locations, the universities were the University of Cambridge (rank 3), which accounts for about 97% of the traffic coming from Cambridge; the University of Chicago (rank 5), which accounts for about 60% of the traffic coming from Chicago; Harvard University (rank 11), which accounts for about 70% of the traffic coming from Cambridge, Massachusetts; Oxford University (rank 13), which accounts for about 90% of the traffic coming from Oxford; and Caltech (rank 15), which accounts for about 80% of the traffic coming from Pasadena. Note that the actual traffic from people affiliated with the university is probably higher, since many of the students and faculty may be using non-university Internet connections when at home.

Most of the network locations at the top are non-university. The top-ranked network location is Comcast Cable, an Internet provider in the United States.

Connection speeds

The most used connection speed is T1, and it generates more than a third of the traffic. Other connection speeds commonly in use are cable (slightly more than a fifth of the traffic) and DSL (about a sixth of the traffic). A large amount of traffic was generated through unknown connection speeds.

Traffic sources

Most of the traffic (varying between 75% and 90%) is generated by search engines, with about 99% of the search traffic originating from Google. The remaining traffic includes both direct visits and referring sites, and the proportions of these vary with time. Long-term trends in these will be among the things to be studied in a more in-depth investigation.