11/12/2008
In my last post, I outlined the elements that keep us from easily and/or quickly finding information and the costs associated with a poor findability solution. And in a future post, I'll outline how to build out a findability architecture that will integrate with your information architecture as well as integrating with the findability tools that already exist in your environment if you're running Microsoft technologies.
In this post, I'll discuss how managers and business leaders work and what kind of information they value. Why this is important to understand should be apparent: when they are successful, the business is successful. You can take this post as an example of how your findability solution(s) need to adapt to the work habits of different users in your organization. Managers are no exception to this rule. Build out a poor findability solution for them at your own peril. So, let's get going.
In a classic view of management, managers are supposed to organize, coordinate, plan and control. But the facts suggest otherwise. Let's take a look at some of these facts. First, managers work at an unrelenting pace and their work environments are characterized by:
- Brevity
- Variety
- Discontinuity
- An orientation toward action
- A dislike of reflective activities
For example, one study of CEOs found that 50% of their activities lasted less than 9 minutes, that their coffee breaks and lunches were work-oriented and that 93% of CEO verbal contacts were ad hoc. One researcher put it this way: "in not one single case did a manager report obtaining important external information from a general conversation or other undirected personal communication".
Secondly, managers strongly favor verbal media, telephone calls and meetings over documents or aggregated data. Now, being a manager myself, I can tell you that a dashboard is necessary to get a quick overview of what's working and not working. But a dashboard, in and of itself, doesn't provide good data for making decisions. Instead, managers often rely on ad hoc communication. Today's gossip might very well be tomorrow's fact. Most managers cherish "soft" information and often identify decision situations and build models using tidbits of data rather than hard, aggregated data. I can attest from experience that passing comments from people, which tend to be the most honest and transparent types of comments I hear, often form the core of what I think about a person or situation.
Thirdly, projects managed indirectly by managers emerge as a series of small decisions and actions sequenced over time and must be fit into a disjointed, busy schedule. Managers tend to comprehend complex issues gradually rather than through concentrated learning. One study found that CEOs were supervising as many as 50 ongoing projects related to new product development, re-organizing a weak department or launching a public relations campaign. The CEOs would give an individual project limited time and attention with enough energy to send it back into orbit while they focused on another project or C-level personnel decision.
Lastly, research has found that most authorization decisions are based on ad hoc information, even when aggregated data is available. The same studies of CEOs found that the CEOs were faced with many projects that could not wait or did not have quantifiable costs and benefits from which to make informed decisions. Yet, the timing of these decisions was forced by external factors outside the control of the CEO. Their choices were complex and involved a number of difficult-to-quantify elements, such as:
- How does this choice affect other choices already made or that still need to be made?
- Is my choice acceptable to internal influencers or will passive resistance sabotage my choice?
- Will resources be over-extended?
- Will this result in too much change for my organization?
- What are the non-monetary costs and benefits?
- Is this the right time to make this decision and move forward?
When we stop to ponder how decision-makers work and the type of information they tend to value, this helps us frame up the findability problem in terms that help them be successful. After considering these facts, one can understand how dashboards serve up the type of ad hoc information that managers really need. And if this information is generated by reliable systems, then dashboard information can become even more important to a manager. So, while it is true that dashboard information isn't a good basis from which to make decisions, a dashboard is a great system for giving managers the overview/ad hoc information they need to know which systems, products, teams or other parts of their business or department need more direct and sustained attention.
The reporting of data in a dashboard is technically not a findability issue. Reporting is not the same thing as finding. But there is overlap. Finding the right information at the right time assumes that there is "right" information that can be found and that this information is meaningful to its consumer. Just like the content items in the result set need to be meaningful to the one who executed the query, the information that is exposed in the dashboard needs to be meaningful to the managers who read it. The information needs to make accurate assessments for the manager or report data in such a way that it is comprehended and meaningful as soon as it is consumed.
So, while tagging of data to make documents more findable is a central part of a robust findability solution, don't forget that most of your managers don't have time to sit and read information. Consciously or not, they will flinch in the face of information overload and will filter out much of what they see and hear if it isn't immediately practical and helpful to them. So, ensure that helpful dashboards have been built. And then give them unambiguous ways of drilling down into problem areas from the dashboard to find the information they need to assess and correct and thus improve the business overall. Take time to understand the tools that your managers need to be successful and ensure that these tools add to their efficiency rather than to their workload. Doing so will help improve your business overall and might actually remove resistance to the implementation of a larger findability solution in your environment.
Even though I'm not a lawyer, in my next post, I'll discuss the problems with findability and e-discovery. This is a thorny area that we cannot ignore as we build out our findability architecture.
Bill English, MVP
11/5/2008
In this post, I want to focus on why we have poor findability solutions in our organizations and what the real costs are for maintaining a poor findability solution.
What keeps us from finding the information we need at the time we need it? Well, the AIIM research shows the following statistics:
- Poor search functionality: 71%
- Inconsistency in how we tag/describe data: 59%
- Lack of adequate tags/descriptors: 55%
- Information not available electronically: 49%
- Poor navigation: 48%
- Don't know where to look: 48%
- Constant information change: 37%
- Can't access the system that hosts the info: 30%
- Don't know what I'm looking for: 22%
- Lack the skills to find the information: 22%
Notice that the first and second bullet points deal with the most-often-pursued solution to findability (search application) and the least-often-pursued solution (tagging metadata). Why would this be? I think it's because most decision-makers can more readily understand and quantify the costs and benefits of implementing a software/hardware solution in response to a problem. Moreover, the market tends to define the problem in technical terms rather than cultural or process terms.
But is a poor findability solution really a problem? And if so, is there a way to quantify that problem in real monetary terms? I think the answer to both of these questions is "yes". The cost of a poor findability solution can be enormous. Consider the following research results:
- On any given day, the average information worker runs 20 queries in an effort to find information (Windows Live Enterprise Search, 2006)
- In any given week, the average information worker spends 9.5 hours trying to find the information s/he needs to do his/her job, costing $14,250/year/employee (Source: The Hidden Costs of Information Work, IDC)
- In any given week, the average information worker will spend 6.5 hours not finding information that s/he knows exists, only to re-create it so that s/he has it again. This costs $9,700/year/worker. (IDC paper cited above)
Hence, the costs of a poor findability solution are significant, costing organizations an average of roughly $24,000/employee/year. Implementing a solution that will reduce costs through increased productivity should be an obvious call for decision-makers. Yet, it is often overlooked or ignored. I believe this is due to multiple factors: A) culture change is never easy and the headaches associated with such change can be huge, B) the cost/benefit analysis is very difficult to quantify for a specific environment and C) a partial solution to this problem that can be quantified will likely be preferred over a fuller solution that is difficult to quantify.
In my next post, I'll discuss how to build out a Findability Architecture. Doing this will help begin the process of pulling together what I think is a real need in organizations today: how to build a robust findability solution that will improve your organization's performance.
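To see how the two IDC figures above add up to the roughly $24,000 number, and how quickly that scales, here is a minimal sketch. The per-employee costs come from the research cited above; the headcount of 500 is a hypothetical value chosen only for illustration.

```python
# Per-employee annual figures quoted above (IDC, "The Hidden Costs of Information Work").
SEARCH_COST_PER_EMPLOYEE = 14_250    # dollars/year spent searching for information
RECREATE_COST_PER_EMPLOYEE = 9_700   # dollars/year spent re-creating unfound information


def annual_findability_cost(information_workers: int) -> int:
    """Estimated yearly cost in dollars for a given number of information workers."""
    per_employee = SEARCH_COST_PER_EMPLOYEE + RECREATE_COST_PER_EMPLOYEE
    return information_workers * per_employee


if __name__ == "__main__":
    print(annual_findability_cost(1))    # one worker: the per-employee total
    # 500 is a hypothetical headcount, not a number from the research.
    print(annual_findability_cost(500))
```

The per-employee total is $23,950, which is where the rounded $24,000 comes from. Your own numbers will differ with salary levels and how much time your people really spend searching, so treat this as a starting point for a cost/benefit conversation rather than a measurement.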
Bill English, MVP
In my last post, I went over some of the latest research about Findability but never really made the case for why Google's appliance is not, in and of itself, a robust or complete findability solution. In this post, I'll go over this point as well as discuss, briefly, why a technical search solution is not a full findability solution in and of itself.
In my last post, I said this:
"Findability is not a technology. Instead, it is a way of managing information that is baked into the organization. Let me be clear on this point: findability is a well-defined and well-executed strategic model of consistent practices and actions. While it is true that technologies contribute to an overall Findability solution, it is equally true that a robust findability solution is much more than the implementation of search technology."
I want to re-emphasize that building findability into your organization's architecture is tantamount to changing your organization's culture and information processes. These changes need to incorporate several permanent changes:
- Consistent tagging of data
- Assignment of ownership to data
- Willingness of end-users to pragmatically engage in the Findability solution
- Development and training on tools that will support the information architecture and findability solution
- Consistent developing and training on these tools
- Upper level and grassroots support for this effort to improve Findability and re-use of content
Because the development of a Findability solution requires a culture change for the entire organization, it should become apparent that implementing a Search solution – no matter how robust the solution – is not the same thing as implementing a full findability solution. This is why Google's "plug it in, turn it on and find it" is unrealistic. While it is true that Google's appliance – as well as most other search engines – can create a seemingly robust findability solution when applied against a smaller set of documents that are homogenous in their word structures, it is also true that as the words and their meaning randomize, the keyword search method of finding information becomes less and less effective. (Please see my Part II post for a fuller discussion on this.)
I will also say that Search Server 2008 by itself is not a full findability solution either. In fact, no search application platform can form a full findability solution because search applications can find only that which exists. If the documents have not been tagged, then the search engine cannot find them by their metadata. If meaning cannot be accurately discerned from the keyword query, then findability is damaged and the user experience diminished because there is no metadata to discriminate between documents.
Do we need Search? Absolutely. But the technical solution is not a full findability solution. In my future posts, I'll continue to expand on the current research and build on this concept that implementing a full findability solution is much more than implementing a technical search application. I'll also, as promised before, build out a findability architecture that should help us all understand better how to align our information architectures with our technical solutions to produce a robust findability solution.
Bill English, MVP
When I go out and do SharePoint architecture and design sessions with companies, what I find is that most organizations really don't take time to organize their content well. Content organization and the tagging of metadata on content is nearly universally viewed as a waste of time – an overhead expense that just isn't necessary. Nothing could be further from the truth.
10/26/2008
In my first post about Findability and SharePoint, I explained what findability is and why you need to pay attention to it as you're developing your SharePoint Server 2007 deployment.
In this post, I would like to discuss some of the recently released research on findability in corporate America. I will then take a moment to go over why Google's promise of "plug it in, turn it on and find it" is both unrealistic and simplistic. So, let's start with a quick overview of some recent research on Findability in the marketplace.
You can do nearly everything right with your SharePoint deployment and yet still get it all wrong when it comes to users being able to find their information quickly and easily. For example, you can do all of the following correctly...
- Capacity plan your servers correctly
- Scale out your farm correctly
- Implement all of the customizations correctly
- Implement a robust Search and Indexing solution
- Train your users correctly
- Write the business and technical requirements for your deployment
- Manage your servers to the point where there are no errors or warnings in any of the event logs
...and still fail in your SharePoint implementation because people can't find the information they need quickly and easily. Now, I'm not suggesting at all that these elements should be ignored or should be considered unimportant. If there's one thing we have learned, it is that a SharePoint Server 2007 deployment requires excellence at each level of the deployment and from each group involved in the deployment if everyone is going to be happy. But this is to say that not integrating a coherent findability architecture into your SharePoint Server 2007 deployment will mean that you have missed one of the main reasons for implementing SharePoint Server 2007 in the first place: to find your information more quickly and easily than can be achieved by current technologies.
Published by AIIM, the Findability and Market IQ white paper exposed the perilous state of findability in most organizations today. When respondents were asked "How well is findability understood in your organization?", only 17% said that it was both well understood and adequately addressed. 30% of the respondents couldn't discern the difference between their search technologies and findability. 22% responded that there was no clear understanding of findability in their organization.
This single question revealed that over half of the organizations today either don't know what findability is or they think they have findability solved because they have a search appliance or software product in their environment. Search is too-often viewed as an application-specific solution for findability. But when one stops to look at what findability really is, a single search application suddenly misses the mark.
A typical search application focuses on trying to ask the right question by "matching" keywords with content under the assumption that if I find the right word, I've found the right content. The huge assumption here is that meaning can be derived syntactically from the word itself: I found the right word, therefore I've found the right meaning in the content that has been presented to me in the search results. But nothing could be further from the truth. Meaning and syntax are two entirely different things. Findability is not resolved by the humble keyword + search application.
Why would I say this? Well, consider the work of George Zipf, who developed the following axiom:
As the size of the corpus increases, the ability of the information retrieval system to retrieve relevant information will diminish
By analyzing large texts, Zipf found that, at least in the English language, a small number of words occur a large number of times. For example, he found that the 2 most frequent words account for 10% of all word occurrences in the English language. Furthermore, he found that the 6 most frequent words account for 20% of all word occurrences and the 50 most frequent words account for 50% of all word occurrences. Now, it can't be that we tend to write and discuss the same topics over and over, can it? Hardly. Instead, what we find is that the more often a word is used, the more general its meaning, and the less often a word is used, the more precise its meaning.
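If you want to check this kind of distribution against your own content, a minimal sketch like the following measures what share of all word occurrences the k most frequent words account for. The function name, the simple tokenizer and the sample sentence are my own assumptions for illustration; they are not part of Zipf's work.

```python
import re
from collections import Counter


def top_k_share(text: str, k: int) -> float:
    """Fraction of all word occurrences covered by the k most frequent words."""
    # Crude tokenizer (an assumption): lowercase runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    counts = Counter(words)
    covered = sum(count for _, count in counts.most_common(k))
    return covered / len(words)


if __name__ == "__main__":
    sample = "the cat and the dog and the bird"  # toy text, not real data
    print(top_k_share(sample, 1))  # share covered by the single most frequent word
    print(top_k_share(sample, 2))  # share covered by the two most frequent words
```

Run it against a large export of your own documents with k set to 2, 6 and 50 and compare the results with the figures above. A toy sentence proves nothing on its own; the pattern only shows up in a large body of text.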
But language naturally tends towards the use of fewer words to refer to different concepts, objects and elements. Hence, when a single word refers to vastly different concepts, ideas or elements, precision is impaired in the result set because syntax cannot express meaning. For example, consider these word pairs and ask yourself which concept each word is referring to:
- Minute or minute?
- Horn, horn or horn?
- Boot or boot?
- Bonnet or bonnet?
- Football or Football?
In the technology world, we borrow and reconceptualize words at an alarming rate:
- Site or site? (web site, Active Directory site)
- Zone or zone? (DNS zone or a SharePoint zone)
- Domain or domain?
This list could go on and on. Most of the words we use in technology are borrowed words with either new or extended meanings. Hence, when we enter a keyword into a search engine, the likelihood that the content items in the result set will be meaningful diminishes as the number of documents in the index increases. Why? Because, as the number of documents in the index increases, the chances that the same words are used in different ways to mean different concepts increase – often dramatically.
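A toy example makes the problem concrete. In the sketch below, a bare keyword match on "zone" returns documents about two unrelated concepts, while a human-assigned topic tag narrows the result to the intended meaning. The documents, the `topic` field and both function names are hypothetical; this is not how any particular search product works.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    doc_id: int
    text: str
    topic: str  # metadata tag assigned by a person (hypothetical field)


# Hypothetical corpus: the same word carries different meanings.
DOCUMENTS = [
    Document(1, "Create a forward lookup zone for the new domain", "DNS"),
    Document(2, "Map each zone to an authentication provider", "SharePoint"),
    Document(3, "Plan the site topology for replication", "Active Directory"),
]


def keyword_search(docs, keyword):
    """Return ids of documents whose text contains the keyword as a whole word."""
    keyword = keyword.lower()
    return [d.doc_id for d in docs if keyword in d.text.lower().split()]


def tagged_search(docs, keyword, topic):
    """Keyword match restricted to documents carrying the given topic tag."""
    keyword = keyword.lower()
    return [
        d.doc_id
        for d in docs
        if d.topic == topic and keyword in d.text.lower().split()
    ]


if __name__ == "__main__":
    print(keyword_search(DOCUMENTS, "zone"))        # both meanings of "zone" match
    print(tagged_search(DOCUMENTS, "zone", "DNS"))  # the tag separates the meanings
```

With three documents the ambiguity is easy to spot by eye. With three million, only the metadata can tell the two kinds of "zone" apart, and that metadata exists only if someone tagged the content consistently.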
Even more interesting is that meaning is often expressed using different phrases or idioms and is often not expressed in a single word. Consider the following:
- "Good Grief"
- "Give it a Go"
- "I don't have a dog in that race"
- "Oh please, give me a break"
What do these phrases really mean? How difficult is it to find documents that encourage someone to try something if the keywords entered are "give it a go"? Every search engine that I've run across is unable to account for meaning using only keyword queries. Google, SharePoint, Autonomy and the others do not have a good way to account for meaning in their current keyword queries.
Findability is not a technology. Instead, it is a way of managing information that is baked into the organization. Let me be clear on this point: findability is a well-defined and well-executed strategic model of consistent practices and actions. While it is true that technologies contribute to an overall Findability solution, it is equally true that a robust findability solution is much more than the implementation of search technology.
The Paradox of Findability
The AIIM study also found that when respondents were asked the degree to which findability is critical to their overall business goals and success, 62% of respondents indicated that it is imperative or significant. Only 5% felt it had minimal or no impact on business success. Yet, 49% responded that even though Findability is strategically essential, they have no formal plan or set of goals for Findability in their organization. Moreover, of the other 51% who claimed to have a strategy, 26% reported that their strategy was ad hoc, meaning that they have no strategy at all. So, 75% have no Findability strategy, even though many believe it is strategically essential. This is a state of affairs that must change.
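The 75% figure above is simple addition, and a short check shows which reading of the numbers it depends on. This is only my own arithmetic on the percentages quoted above, and the variable names are mine: the total comes to 75% when the 26% is read as a share of all respondents.

```python
NO_FORMAL_PLAN = 49  # percent of all respondents with no formal plan or goals
AD_HOC = 26          # percent reporting an ad hoc strategy

# Reading 1: the 26% is a share of ALL respondents.
share_if_of_all = NO_FORMAL_PLAN + AD_HOC

# Reading 2: the 26% is a share of only the remaining 51%.
share_if_of_remainder = NO_FORMAL_PLAN + (100 - NO_FORMAL_PLAN) * AD_HOC / 100

if __name__ == "__main__":
    print(share_if_of_all)        # the total under reading 1
    print(share_if_of_remainder)  # the total under reading 2
```

Reading 1 gives the 75% cited in this post. Under reading 2 the total would be closer to 62%, which is still a clear majority of organizations without a real strategy.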
In future posts, I'll outline how to develop a Findability Architecture that integrates with your SharePoint architecture and information architecture. But before I get there, I'll outline some additional research on this topic and then spend more time discussing the problems inherent with attaining precision, recall and relevance.
Bill English, MVP
Along with Steve Smith and Craig Carpenter at Combined-Knowledge, Kathy Hughes at Combined-Knowledge Asia and my partner, Todd Bleeker, we're pleased to announce the creation of the World Education Alliance.
The World Education Alliance is a global education alliance, incorporating Mindsharp (USA), Combined Knowledge Ltd, (UK and EMEA) and Combined Knowledge Asia Pacific, (Australia and New Zealand). Each of the companies within the WEA specializes in the development and delivery of Information Technology Training classes. The Exclusive portfolio of courses that are available from the World Education Alliance enables organizations to have Global Training Solutions that meet the needs of any role involved in an environment from Administrators, Developers, Designers and End Users to Executives.
The courses are written and delivered by Expert Information Technology trainers and consultants who have the knowledge and skills required to deliver quality training courses either on a public scheduled basis or as private courses to various organizations. Our Trainers and Consultants boast industry leading accreditations and references including the prestigious Microsoft MVP Accreditation.
The World Education Alliance enables you to organize your Information Technology Training requirements for Administrators, Developers, Designers and End Users over multiple locations in one easy step. You can maintain the same local contact, who will help you to plan and organize your training program across multiple office locations. This process enables you to maintain the quality of IT training that you require across all regions, ensuring that all people involved in an implementation are receiving the same training wherever they are based.
Benefits of using the WEA
The WEA provides clients with a 'One Stop Shop' for their IT Training Requirements. Through the WEA, a world-wide company can train everyone in their organization on SharePoint and Unified Communications and experience the following benefits:
- One-Stop-Shop for all of their SharePoint education needs
- Seamless booking for trainers, travel and materials for world-wide engagements
- Consistent courseware and customer experience
- Assurance that the instructors are well-trained and fully prepared to teach their classes
- Consistent pricing and invoicing processes
- Unique, but flexible bundling of solutions for customers, including trainer selection, course selection and delivery mechanisms
Benefits of working with Combined Knowledge, Mindsharp and the World Education Alliance
For those who train and work with us, there are several benefits of associating with Combined Knowledge and Mindsharp:
- Work with recognized, respected industry leaders
- Live anywhere in the world and train with us
- World-wide training opportunities
- Participate in internal education on new technologies
- Technical support for trainers for both the technology and the course
- Mentoring for professional development, including writing and training
Education Courses
The breadth of the education options within the WEA is substantial. Consider the following number of training classes, sorted by audiences for SharePoint Products and Technologies:
For Unified Communications, we offer the following course:
- Core Technologies in Microsoft Office Communications Server 2007
Delivery Methods
Both Mindsharp and Combined Knowledge offer our courses via a number of different delivery methods. These methods can be combined within a larger bundled solution to meet your exact needs. Utilizing over 50 authorized trainers world-wide, the World Education Alliance allows you to have our training delivered using the following methods:
- Instructor-Led
- CBT
- Train-the-Trainer (end-user courseware only)
- Remote Training to your desktop (Live Meeting plus Audio)
- Public classes
- Private classes
- Customized classes and workshops
Delivery Locations
The World Education Alliance can deliver education in most areas of the world. We have offices in London, Sydney and Minneapolis. We can offer training in the following locations and can bring education to your company privately anywhere in the world:
- Nearly any city in the United States
- Ottawa and Montreal, Canada
- UK Midlands – Public Classes
- UK London – Public classes
- UK Wide – Private classes
- Cologne, Germany – Public classes
- Amsterdam, Holland – Public classes
- Luxembourg – Public classes
- EMEA wide for private classes
- Sydney – Public classes
- Melbourne – Public classes
- Brisbane – Public classes
- Auckland NZ – Public classes
- Wellington NZ – Public classes
- Asia Pacific wide for private classes
Facts about WEA Trainers
There are over 50 authorized trainers in the World Education Alliance. Of these, twelve are MVPs and two are former MVPs. We also offer training in English, French, German, Spanish, Dutch and Finnish. Between 2007 and 2008, our corps of trainers more than doubled.
Please let us know how we can help you today. You'll find that this alliance between Combined Knowledge and Mindsharp will bring you benefits and advantages for SharePoint and Unified Communications education that few others can offer.
If you need education world-wide or in your own town, we can assist you today. If you live in North America, South America or Africa, please contact David Hoffeld at dhoffeld@mindsharp.com. If you live in the UK, Europe or the Middle East, please contact Zoe Watson at zoe@combined-knowledge.com. For Asia Pacific and Australia, please contact kerriann@combined-knowledge.com.au. We pledge to you that we'll do everything we can to ensure you're delighted with our education services.
Bill English, MVP
10/16/2008
The more I learn about findability, the more convinced I am that this concept needs to be an organizing principle of any SharePoint Server 2007 deployment. In this post, I'll explain what findability is in generic terms and then apply these concepts to a SharePoint Server 2007 deployment. I'll also wade into the difficult area of explaining why Google's current promise of "plug it in, turn it on and find it" is not enough for implementing a full findability solution in your organization.
What is Findability?
Succinctly stated, findability is the quality of being found or the ability to locate objects. Methods of findability include commonly known technologies such as navigation menus or search/indexing applications. Obviously, we'll want to implement findability tools that are easy to use and integrate seamlessly into our current end-user experience.
Truths about Findability
In his book, Ambient Findability, Peter Morville outlines several truths that I'll repeat here for the sake of our discussion:
- You can't use what you can't find
- Information that can't be found is worthless
- Our customers can't purchase what they can't find
- Information that is hard to find is hardly used
- Authority, trust and findability are interwoven
- Key to success when working with information is findability
The first truth to focus on is the most important: You can't use what you can't find. It doesn't really matter what the current technology platform is: SharePoint, Autonomy, WebSphere, Vignette, SAP, Oracle, Plumtree or any other portal or collaboration platform. If you can't find the information that has been placed inside it or if the search technology doesn't return relevant search results, then you might as well not have the information at all. The costs of producing that information will have been wasted if you're unable to find the information that you need, when you need it.
If you're running an e-commerce web site, then you should take note that your customers can't purchase what they can't find. If the findability tools on your web site are not good, then it doesn't do you any good to have your products for sale on your web site. Giving your customers the ability to find what they want, evaluate it and then make decisions on your products - all without talking to your sales staff - is the current state of e-commerce. In times past, marketing was mainly a push mechanism - the seller sent out flyers, made phone calls or invited prospects to seminars. Today, customers call the shots when it comes to marketing. They can find you on the web, look at your products, read evaluations about your products and services on the web, compare your products' features and prices, inform themselves about you and your company and formulate their purchasing decisions without ever talking to your sales team. Why do I emphasize this twice in the same paragraph? Because the customer of today expects to see this information on your e-commerce web site and if you don't have a good web site, your products and services won't be included in their evaluation process and you'll lose from the start. In a very real sense, the customer's experience on your web site is the experience of your brand. Put another way, the concept of a brand has been expanded to include not just a marketing position + tagline + logo + color scheme, but it now also includes the customer's experience on your web site and how easy it is for them to find that for which they're looking.
Being able to find information is one thing, but the information also needs to be trustworthy and have authority in the mind of the reader. For example, information that is found via the internet can be suspect in terms of authenticity and accuracy. Companies often claim to be "#1" or offer "high quality" products. What company is going to claim that they offer "low quality" products? Doesn't everyone want to be #1? Certain claims just lack authenticity, no matter who makes the claim. The fact that these claims are made on your web site doesn't mean that they are more trustworthy. However, when users do find information for which they are looking, they will necessarily make a value judgment on its trustworthiness - can we trust this information to be authentic and useful?
One experience illustrates this point well. Around 2001, I was speaking at Comdex in Las Vegas and was following an instructor who was discussing the benefits and features of Windows 2000 Server. I came into the room late in his presentation. Since he was a friend of mine, I thought I'd catch up on what he was talking about. In the last 10 minutes, he outlined how Windows 2000 Server was a "secure server platform". As soon as he said those three words - secure server platform - the room erupted in spontaneous laughter. The message from the attendees was clear: no one (at that time) thought that any Microsoft platform was secure, let alone this new Windows 2000 Server platform. My friend had to cut short his comments on the Windows security system because he knew the audience wasn't going to buy it. They simply didn't find the information credible or trustworthy.
These truths about findability can be ignored, but just like economic markets will behave in a certain way whether or not you believe in capitalism, these truths will play out in your environment whether or not you choose to pay attention to them. Take them to heart and you're on your way to building a great Findability solution in your environment. Ignore them and you'll do so at your own peril.
In my coming posts, I'll explain why search applications by themselves - as illustrated by Google's marketing messages - cannot form a full findability solution. I'll also explain how to conceptualize and build out a Findability architecture for your environment and will then illustrate how a number of tools across Microsoft's platforms can work together to formulate a full findability solution.
Stay tuned.
Bill English, MVP 8/18/2008
In advance of our Best Practices Conference, Mark Schneider has asked the most obvious and basic question: What is a best practice? In his post, he discusses his answer to this question. It is worth reading before you come to the conference.
By the way, I've heard from the folks at Microsoft Press that our Best Practices book is selling very, very well. Early returns are not always indicative of long-term, sustained sales, but for those who have purchased this book, I just want to say "thanks" and hope that you'll communicate with me (bill@mindsharp.com) and Ben Curry (bcurry@mindsharp.com). We'd love to hear from you about what you're learning and what your thinking is regarding the book or points we make in the book.
Bill English Mindsharp 7/15/2008
If you want to create and manage global scopes in Microsoft Search Server 2008, add /_layouts/viewscopesssp.aspx?mode=ssp to the search admin URL.
Remember, there is no link in the admin page, but the functionality is there.
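As a small illustration of the tip above, here is a minimal sketch that builds the scopes URL from an admin site URL. The host name and port in the example are hypothetical, and the function assumes you pass in the site URL itself rather than the URL of a specific page.

```python
# Path and query string are taken from the tip above.
SCOPES_SUFFIX = "/_layouts/viewscopesssp.aspx?mode=ssp"


def global_scopes_url(admin_site_url: str) -> str:
    """Append the global scopes page to a search admin site URL."""
    # Strip any trailing slash so the result never contains "//_layouts".
    return admin_site_url.rstrip("/") + SCOPES_SUFFIX


# Hypothetical admin site URL, for illustration only.
print(global_scopes_url("http://searchserver:8080/ssp/admin/"))
```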
Thanks to Daniel Webster for providing this tip.
Bill English Mindsharp 6/6/2008
One client that I worked with recently had the following scenario:
They need to keep their database sizes below 30GB for disaster recovery reasons. They have a 4 hour SLA with their business stakeholders for recovery of any SharePoint database. In the past, they have had very bad experiences with their SharePoint Portal Server 2003 implementation going down, both times taking out their implementation for over 4 days. Not good. They report that their first SharePoint consultant had told them to keep all of their content in a single content database and that database has grown to over 400GB. They point to the database size as the reason for their two outages in the last two years.
Now, they are in the process of moving their data around to new and different site collections with the hopes that they can spread out their data across multiple databases and site collections. But in the meantime, they have been partitioning and re-partitioning their database in SQL. They've never felt good about this.
Part of their database size problem is that their users often create new versions of their documents and by law, they are required to keep *all* of their old versions for 7 years. Their documents are often 10-20MB in size, kept in PDF format. So a site collection that might start out with 1GB of data can grow into a 50GB site collection within a matter of months. At present, they are averaging 5GB of new data each month to their overall SharePoint farm. That number is expected to grow as more business units come "online" with SharePoint. Yet they need to keep their database sizes below 30GB and we all know that we can't spread a single site collection across multiple databases.
So their business requirements are as follows:
- Must have the ability to create new/updated versions of procedure documents on an as-needed basis. This ability to update the document cannot be limited by storage limitations.
- Must keep all versions of each document pedigree for 7 years
- The storage method must meet regulatory agency standards
- The storage method must allow discovery of all document versions for legal purposes
- All SharePoint databases must be recoverable within 4 hours
- All SharePoint databases must not exceed 30 GB in size
What are their options? Well, here are some that came up (and some that didn't):
- Violate the regulatory standards and not keep all versions for 7 years.
- Don't keep all of the versions and hope the regulators don't catch you.
- Keep dividing and re-dividing the site collections into more numerous site collections that increasingly host fewer and fewer documents and their versions.
- Manually take the most recent version of the document and on an annual basis create a new document pedigree in a new library and archive off the previous year's versions to a backup file.
- Programmatically move older documents to off-line storage that is both protected and isolated.
With regard to options #1 and #2, that's similar to playing with fire, so they really aren't options. Pay the fines, open yourself to legal/criminal liability? I don't think so.
The problem with option #3 is that eventually you'll have a single site collection dedicated to hosting a single document and all its versions, perhaps writing to a dedicated database. This scenario will eventually not scale to the level they need since some document pedigrees could exceed 30GB over a 7 year period.
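To put rough numbers on that last point, the sketch below works out how many versions of a 10MB or 20MB document fit under the 30GB limit, and how many versions per month over 7 years that implies. It assumes 1GB = 1024MB and that every version is a full copy of the document; both are simplifying assumptions, not facts from the client.

```python
DB_LIMIT_GB = 30        # database size ceiling from the requirements
RETENTION_YEARS = 7     # legal retention period from the requirements
MB_PER_GB = 1024        # assumption: binary gigabytes


def versions_until_limit(version_size_mb: int, limit_gb: int = DB_LIMIT_GB) -> int:
    """Whole number of full-copy versions that fit under the size limit."""
    return (limit_gb * MB_PER_GB) // version_size_mb


def versions_per_month(version_size_mb: int) -> float:
    """Average monthly version rate that reaches the limit within the retention period."""
    return versions_until_limit(version_size_mb) / (RETENTION_YEARS * 12)


for size_mb in (10, 20):
    print(size_mb, "MB:", versions_until_limit(size_mb), "versions,",
          round(versions_per_month(size_mb), 1), "per month")
```

Under these assumptions, a 20MB document reaches the limit at roughly 18 new versions a month, so only very actively revised documents would outgrow a dedicated database on their own.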
Option #4 is certainly doable, but they felt it wouldn't pass muster with the regulators. Manual intervention couldn't prove Chain-of-Custody in a legal proceeding.
So, they are going to look into Option #5 – writing custom code to move the older versions off to a dedicated, isolated storage area. They feel this will meet with regulatory approval as well as passing muster for Chain-of-Custody standards in a legal proceeding.
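To make Option #5 a little more concrete, here is a rough sketch of the general idea using plain files on disk. It is not SharePoint code and not the client's design: the file names, the `archive_old_versions` helper and the SHA-256 manifest are all hypothetical, added only to show how an automated move could leave a verifiable record behind.

```python
import hashlib
import json
import shutil
import tempfile
from pathlib import Path
from typing import Dict, List


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def archive_old_versions(versions: List[Path], archive_dir: Path,
                         keep_latest: int = 1) -> List[Dict[str, str]]:
    """Move all but the newest versions into archive_dir and record their hashes.

    `versions` must be ordered oldest to newest.
    """
    archive_dir.mkdir(parents=True, exist_ok=True)
    cutoff = max(len(versions) - keep_latest, 0)
    manifest = []
    for src in versions[:cutoff]:
        digest = sha256_of(src)              # hash before the move
        dest = archive_dir / src.name
        shutil.move(str(src), str(dest))
        if sha256_of(dest) != digest:        # verify the archived copy is identical
            raise IOError(f"hash mismatch after moving {src.name}")
        manifest.append({"file": src.name, "sha256": digest})
    # Write the record alongside the archived files.
    (archive_dir / "manifest.json").write_text(json.dumps(manifest, indent=2))
    return manifest


# Demo with three throwaway "versions" of one hypothetical document.
with tempfile.TemporaryDirectory() as tmp:
    library = Path(tmp) / "library"
    library.mkdir()
    versions = []
    for n in range(1, 4):
        f = library / f"procedure_v{n}.pdf"
        f.write_bytes(f"version {n}".encode())
        versions.append(f)
    manifest = archive_old_versions(versions, Path(tmp) / "archive")
    remaining = sorted(p.name for p in library.iterdir())

print(manifest)
print(remaining)
```

Whether a hash manifest like this would satisfy a regulator is a question for the client's legal team; the sketch only shows that the move can be automated and checked.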
I would be interested in your thoughts about this scenario. You're welcome to post back about this with ideas/thoughts on the solutions discussed here (and any that aren't discussed here too!).
Bill English Mindsharp 6/3/2008
In this post, I'll explain how the crawler works in SharePoint Server 2007 and Search Server 2008 and from this, draw several Best Practices for the creation and maintenance of content sources and Crawler Impact Rules.
DISCLAIMER: While this post will explore how to view several tables within the SSP Search Database, please do not make modifications of any type to the SQL databases directly. Always use the object models to read from and write to the SQL databases in SharePoint.
I need to lay the groundwork for our scenario and the tests that we're going to run to illustrate certain principles that we've learned. So, here are the details.
I've created several content sources, as follows:
- Mindsharp Blogs has one start address at http://www.mindsharpblogs.com
- News Sites has three start addresses to three public news sites:
  - www.foxnews.com
  - www.cnn.com
  - www.bbc.co.uk
Over in SQL, we can learn some basic information about how these start addresses are assigned numbers for tracking purposes. If you open up the SSP Search database, you can run several queries to help you learn how to read and understand the crawl queue, which is something that we need to learn how to do. First, I'll open SQL Manager and navigate to the Search database. To learn which Hostname has been assigned which HostID in the database, I'll execute a simple SQL command: Select * from msscrawlhostlist. The resulting table appears in the Results tab. This is likely the table from which some of the Crawl Log information is displayed.
Each instance of each URL namespace is also assigned a StartAddressID in the SQL database and each instance of each content source is assigned a unique ContentSourceID. We can learn which ContentSourceIDs have been assigned to which namespace by executing the SQL command "Select * from msscrawlurl". When this command is executed, a table is returned that lists out all of the URLs that are currently in the index along with their ContentSourceID and StartAddressID. Both are incremented by one (1) when new start addresses and/or content sources are created. However, if a new start address is created inside an existing content source, then only the start address list is incremented by one because that is the only new object in the database. In this example, notice that foxnews.com, cnn.com and bbc.co.uk are all start addresses within the same content source, so they're all given the same ContentSourceID, but they each have different StartAddressIDs. Foxnews.com has a StartAddressID of 18, meaning that this was the 18th start address to be created in this install of Search Server 2008. Cnn.com has a StartAddressID of 19 and the bbc.co.uk news site has a StartAddressID of 20. Also, you should know that when you reset the index, the ContentSourceID and StartAddressID assignments are not changed. In addition, if you delete either the content source or the start address and then re-create them, they will be assigned new StartAddressID and ContentSourceID numbers.
Each crawl that is started, regardless of whether or not it successfully completes, is assigned a unique CrawlID number. Resetting the index does not reset the CrawlID numbers.
You can look at the msscrawlurl table to see exactly which URL was crawled and when. In addition, each crawl of each URL receives a unique TrackID number and this number increments by one and is never repeated. Each document at each start address receives a unique DocID. The msscrawlurl table shows the DocID, StartAddressID, ContentSourceID and AccessURL for each URL.
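To show the kind of read-only query described above without touching a real Search database, here is a minimal sketch that uses an in-memory SQLite table as a stand-in. The table name and the DocID, StartAddressID, ContentSourceID and AccessURL columns come from this post; the simplified schema, the ContentSourceID values, the mindsharpblogs StartAddressID and the URL paths are hypothetical.

```python
import sqlite3

# In-memory stand-in for the real table; the actual schema has more columns.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE msscrawlurl (
           DocID INTEGER PRIMARY KEY,
           StartAddressID INTEGER,
           ContentSourceID INTEGER,
           AccessURL TEXT)"""
)

# Hypothetical rows: three news start addresses share one content source.
conn.executemany(
    "INSERT INTO msscrawlurl VALUES (?, ?, ?, ?)",
    [
        (1, 17, 4, "http://www.mindsharpblogs.com"),
        (2, 18, 5, "http://www.foxnews.com"),
        (3, 18, 5, "http://www.foxnews.com/politics"),
        (4, 19, 5, "http://www.cnn.com"),
        (5, 19, 5, "http://www.cnn.com/world"),
        (6, 20, 5, "http://www.bbc.co.uk"),
    ],
)


def urls_per_start_address(connection: sqlite3.Connection):
    """Count indexed URLs for each (ContentSourceID, StartAddressID) pair."""
    return connection.execute(
        """SELECT ContentSourceID, StartAddressID, COUNT(*)
           FROM msscrawlurl
           GROUP BY ContentSourceID, StartAddressID
           ORDER BY StartAddressID"""
    ).fetchall()


for row in urls_per_start_address(conn):
    print(row)
```

The grouping makes the relationship easy to see: one ContentSourceID can map to several StartAddressIDs, but never the other way around.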
Monitoring the MSSCrawlQueue
When a crawl process commences, there are several things that transpire. First, the protocol handler connects to the start address(es) and begins the process of enumerating the documents that will be crawled. The enumeration process identifies the exact URL of each document and places that URL in the MSSCrawlQueue table of the Search database. Each web page is considered a single document, no matter how small or large that single page might be. In addition, each folder or container is also considered a separate object. This accounts for why a file share with one folder and three documents accounts for the index receiving four additional items. The folder is indexed as a separate item. The same holds true for list and document library containers. The protocol handler is responsible for filling up the queue at a reasonable rate so that the crawler knows which documents to call for content extraction.
If there are multiple start addresses, the protocol handler will connect to all of the start addresses as fast as reasonably possible and it will begin the process of document enumeration. The number of documents that are enumerated and placed in the queue from each start address varies, from what I can tell, by bandwidth, current user demand on the web site and the target server's ability to respond to the requests. For example, consider what happens when I try to perform a full crawl of my News Sites content source. In the first 817 URLs that were placed in the queue, the order appeared like this:
| Beginning Sequence ID | Start Address | Ending Sequence ID | Total # URLs from Start Address |
|---|---|---|---|
| 76484 | 29 (cnn.com) | 76559 | 75 |
| 76577 | 28 (foxnews.com) | 76721 | 144 |
| 76727 | 29 | 76729 | 3 |
| 76733 | 30 (bbc.co.uk) | 76734 | 2 |
| 76735 | 29 | 76802 | 67 |
| 76803 | 28 | 76914 | 111 |
| 76915 | 29 | 76916 | 2 |
| 76917 | 28 | 76919 | 3 |
| 76920 | 29 | 76978 | 58 |
The sequence continued to go back and forth between foxnews.com and cnn.com. Only 2 of the first 817 URLs that were placed in the queue came from bbc.co.uk. Based on this data, there are several important items for us to note. First, why are there only 2 bbc.co.uk URLs in this first 817 URL list? Since the crawl was initiated from Minneapolis, Minnesota (USA), one could easily argue that it was more difficult and time consuming for the protocol handler to connect to the BBC servers in London than it was to connect to the CNN and FoxNews servers in the United States. I've seen this pattern repeated when I enter different start addresses. Geographically speaking (which may just translate into more routers and backbones to traverse), the farther the server for a start address is from the index server relative to the other start addresses in the same content source, the more likely it is that the farthest server's content will be crawled last, simply because it will take more time for the protocol handler to connect to the farthest target server to enumerate that server's documents.
To illustrate this, please consider these tracerts for FoxNews.com and the BBC. Note not only that there are four additional hops to get from my computer in Minneapolis to the servers in London, but also consider the high milliseconds response time once we hit the international connections. Not only are the FoxNews servers much closer (Chicago is only about 400 miles from Minneapolis), but we also had fewer hops and faster response times in each hop. Hence, this would explain why the protocol handler was able to deposit 3 URLs out of more than 800 URLs placed in the crawl queue while it was able to place several hundred URLs in the queue for FoxNews. Repeating this test on a Sunday afternoon when, presumably, the networks are not busy produced similar results.
Secondly, you might have noticed that the sequence ID numbers were, well, out of sequence. Why is this? The answer is that the crawler had already started crawling the URLs that had been placed in the crawl queue. But does this account for the numbers being out of sequence? Yes, it does.
Recall that the default setting for the Crawler Impact Rules is to request 8 documents per second. How many URLs were missing in the queue? When we do the math, what we find is that 24 URLs are missing from the queue. This means that the crawler requested URLs out of the queue in multiples of eight (8) and this aligns with our Crawler Impact Rules settings. This table also helps us understand that the protocol handler was able to add 6 URLs from the BBC site, not just three as mentioned above.
| Start Sequence ID | Start Address | End Sequence ID | Number of URLs |
|---|---|---|---|
| 76560 | 28 (FoxNews) | 76576 | 16 |
| 76722 | 29 (CNN) | 76726 | 5 |
| 76730 | 30 (BBC) | 76732 | 3 |
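The missing ranges can also be derived mechanically from the sequence ID ranges that were still in the queue. The sketch below is only an illustration of that bookkeeping: the ranges are copied from the first table in this section, and `missing_ranges` is a hypothetical helper, not part of any SharePoint tool.

```python
from typing import List, Tuple

Range = Tuple[int, int]  # (beginning sequence ID, ending sequence ID), inclusive

# Ranges still present in the crawl queue, copied from the first table.
OBSERVED: List[Range] = [
    (76484, 76559), (76577, 76721), (76727, 76729),
    (76733, 76734), (76735, 76802), (76803, 76914),
    (76915, 76916), (76917, 76919), (76920, 76978),
]


def missing_ranges(observed: List[Range]) -> List[Range]:
    """Return the gaps between consecutive observed ranges."""
    ordered = sorted(observed)
    gaps = []
    for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]):
        if next_start > prev_end + 1:
            gaps.append((prev_end + 1, next_start - 1))
    return gaps


for gap in missing_ranges(OBSERVED):
    print(gap)
```

The three gaps it reports line up with the start and end sequence IDs for FoxNews, CNN and the BBC in the table of missing URLs.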
Thirdly, the missing URLs are out of sequence in the queue. The lack of sequence points to an important aspect for the crawler's architecture and how it works. Instead of taking the next X number of URLs from the queue in the order that they were placed in the queue, the crawler is architected to request X number of documents from the queue for each start address that has existing URLs in the queue. In this case, the first set of URLs in the queue belonged to CNN. The second set of URLs belonged to FoxNews. So, when the crawler requested the configured number of URLs from the queue so that it knew which URLs to crawl, it requested 8 URLs from CNN and at the same time, requested 8 URLs for FoxNews. I suspect that the start of the queue, in terms of sequence numbers, was not 76484, but was 76468. I suspect that the crawler had made three separate requests for documents by the time I was able to fully pause the crawling operation:
Request #1 had 8 URLs for CNN and 8 URLs for FoxNews.
Request #2 had 8 URLs for CNN, 8 URLs for FoxNews, 5 additional URLs for CNN and 3 URLs for BBC.
Why would it not request 8 URLs for the BBC? Because there were not 8 BBC URLs for the crawler to request, it requested a subset of the BBC URLs (I'm not sure why) and then requested its remaining number of URLs from CNN, using a logic that hasn't been published.
Timing of Crawl Actions
One of the questions that I always wondered about was how often the crawler requested additional documents. For example, would it make a request for 8 documents from the content source and then, without waiting, ask for an additional 8 documents? To test this out, I left the Crawler Impact Rules at their defaults of 0 seconds between requests while requesting 8 documents per request. I then crawled a content source that had a single start address, mindsharpblogs.com.
In watching both the total number of rows in the msscrawlqueue table and the sequence ID numbers, I found that there were often 90-120 seconds between requests for documents from the content source. In running a packet trace, what I learned was interesting, but perhaps not helpful. First, the first GET request is for */*. This request returns a listing of URLs from the start address. It is my assumption that this request is executed by the protocol handler. Secondly, the subsequent GET requests are for either containers that host individual documents or the individual document URLs within the start address. If the GET request is for a container, then the responses list out the documents in each container. Each individual URL has its own GET request and this is executed by the crawler. What is interesting is that the GET requests are not clustered around a certain time. Consider this (very) incomplete packet trace in the next graphic. Notice that the GET requests are not happening within the same second and some of them are for containers while others are for individual documents. For example, if you look near the bottom of the top pane, you'll see three GET requests at nearly the same moment: two are for containers (Kyle and Wayne) while one is for an individual document (Ben/Archive/….). As far as an explanation, we could assume that the requests for the containers are being executed by the protocol handler to enumerate what is inside the containers, whereas the requests for individual documents are executed by the crawler. But a clear argument could (perhaps should?) be made that the crawler must crawl and index the containers as well. It is true that the folders and list containers must also be crawled, so crawling containers in this trace would also be expected. Either way, it is clear that in the end, both the protocol handler and the crawler must enumerate and index both containers and content.
Mapping out the timing of when the GET requests were executed, we have the following pattern:
| Seconds from start of packet capture | Number of GET requests for individual documents |
|---|---|
| 14 | 5 |
| 15 | 5 |
| 16 | 0 |
| 17 | 4 |
| 18 | 4 |
| 19 | 3 |
| 20 | 5 |
| 21 | 4 |
The trace continues on like this with GET requests being executed nearly every second of the trace. What is clear is that we can't differentiate between protocol handler GET requests and crawler GET requests using a packet trace alone. To help us understand what the crawler considers to be a document, we need to move to the MSSCrawlURLLog table and obtain the results from that table. If we were to look at this table, we'd find the DocID number associated with the exact URL that the crawler is processing.
Notice how the AccessURL matches *exactly* the GET requests from our packet trace:
I reset the index before getting the two screen shots immediately above. When the index is reset, the SequenceID numbers are also reset so that the next crawl starts with SequenceID 1 in the Search database. You can safely count on the number of URLs that have exited the crawl queue being equal to the number of documents that appear in the index. But if you run a concurrent packet trace, it will be difficult to figure out which GET requests constituted a new document for the index. This is why the MSSCrawlURLLog table is so helpful.
So, to answer the timing question, after running many different crawls, watching the queues, changing the Crawler Impact Rules and other general observations, I can say that there is little consistency as to when the crawler crawls the content at the URLs that have been placed in the crawl queue. What is clear is that the crawler obeys the maximum number of documents requested. So if you set the maximum number of documents requested to 16, it will not use more than 16 URLs from the crawl queue per start address at a time. If there are times when this is exceeded by the crawler for a given start address, compensation is made from other start addresses so that the multiple is never higher than it should be.
Cleaning Out the Crawl Queue
There are three ways that the crawl queue can be cleared out. The first way – and the most often used method – is the successful completion of a crawl effort on a content source. The start addresses in the content source are said to have been fully crawled when all of the URLs in the crawl queue have been processed and crawled. The second method is to stop the crawl process. When you do this, the content source enters a state of "stopping" and remains in this state until all of the URLs have been emptied from the queue. If the queue has a large number of URLs in it (100,000 or more), it may take a number of minutes or hours (depending on the SQL Server's available resources) to completely empty the queue. Thirdly, resetting the index will clear out the queue. In my testing, I have seen orphaned URLs remain in the MSSCrawlQueue and the only way to remove them is to reset the index.
When a content source is deleted, all of the crawled URLs are placed in the MSSCrawlQueue and then are deleted from the index one-by-one, using a similar approach that was used to place the content into the index. Large content sources may take hours to delete because of this architecture.
Crawler Best Practices and Summation
So, what have we learned? At least the following:
- The protocol handler will enumerate documents at each start address simultaneously and place those URLs in the MSSCrawlQueue to instruct the crawler about which URLs it is supposed to open and gather data.
- Network and bandwidth realities will impact the ability of the system to enumerate and crawl content, both in terms of speed and in terms of order (assuming multiple start addresses per content source).
- The number of documents set in the Crawler Impact Rules applies to each start address in the content source.
- The number of documents requested is not always exact. In two consecutive requests, one content source had 7 documents requested and then 9 and vice versa on the other content source.
- The documents crawled must be processed before the next set of URLs are taken out of the queue for crawling purposes. What does not happen is a continual (vs. constant) request for documents by the crawler. A simple packet trace will deceive you into thinking this, but use of the MSSCrawlURLLog table and watching the URLs leave the MSSCrawlQueue will help you understand what is actually happening.
- The protocol handler did enumerate documents from each content source at the same time. Usually the URLs placed in the queue were consecutive from one content source or the other, in groups with different sizes. This grouping is based on network and resource factors at the time the crawl was commenced.
- URLs pulled out of the queue by the crawler will be for each start address in the content source. For example, if SA1 had URLs 1-30 in the queue and SA2 had URLs 31-60 in the queue, when the crawler needed the URLs to request documents, it would take URLs 1-8 and 31-38. The next request would take URLs 9-16 and 39-46.
- The protocol handler will continue to add to the queue list while the crawler is processing documents and not taking more URLs out of the queue.
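The per-start-address behavior described in this list can be modeled with a few lines of code. This is only a toy model of what I observed, not the crawler's actual (unpublished) logic: it takes exactly eight URLs per start address per request, takes whatever is left when fewer than eight remain, and ignores the compensation between start addresses noted earlier. The `take_batch` function and the SA1/SA2 names are hypothetical.

```python
from collections import deque
from typing import Deque, Dict, List


def take_batch(queues: Dict[str, Deque[int]], per_start_address: int = 8) -> List[int]:
    """Pull up to `per_start_address` URLs from the front of each start address's queue."""
    batch: List[int] = []
    for urls in queues.values():
        # Take fewer than the configured number if the queue is running short.
        for _ in range(min(per_start_address, len(urls))):
            batch.append(urls.popleft())
    return batch


# SA1 holds sequence IDs 1-30 and SA2 holds 31-60, as in the example above.
queues = {"SA1": deque(range(1, 31)), "SA2": deque(range(31, 61))}

first = take_batch(queues)
second = take_batch(queues)
print(first)
print(second)
```

In this model the IDs leave the queue in interleaved blocks rather than in strict sequence, which is the same out-of-sequence effect seen in the crawl queue.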
In terms of best practices, I would recommend the following:
- Try to group start addresses that have similar sizes and types into the same content source. For example, if SA1 has 50,000 documents and SA2 has 500 documents, then the crawler will likely end up crawling a good portion of SA2 near the end of its time crawling SA1. The protocol handler will likely not enumerate all of SA2 near the beginning of the crawl process, though this is certainly a possibility.
- Be sure you have enough server and bandwidth resources to crawl a number of start addresses within the same content source. While you *can* place 500 start addresses within a single content source, it's likely not the best practice to do this from a resource perspective. Best practice is to group similar start addresses within the same content source, then performance monitor your crawl activities to ensure your servers can crawl all the start addresses at the same time.
- If you have "long-distance" start addresses to crawl, consider grouping them into a common content source.
- Schedule the deletion of large content sources at a time when other crawl activities are not occurring.
- Use the Crawl Log, not SELECT statements, to learn about the success or failure of the crawling of an individual URL.
- If you choose to stop a large crawl process, allow the system enough time to fully clear out the MSSCrawlQueue before attempting to start that crawl process again.
If you would like to have a PDF copy of this post, please download your copy at the premium content site at www.mindsharp.com. I'm always interested to hear what others think, so please e-mail me at bill@mindsharp.com about comments on this post.
Thanks.
Bill English Mindsharp