2010-03-10 Wed
Original post: Google, transparency and our not-so-secret formula
Posted by: Matt Cutts, Principal Engineer, Search Quality Team
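Recently, the European Commission opened a preliminary inquiry into a number of competition complaints. Some of these complaints accuse Google of not being transparent enough about how and why it ranks search results. I find that claim hard to accept. Google has set the standard for how to communicate with website publishers. Let me explain how Google ranks search results and on what basis.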
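Discussion of how Google "scores" pages centers mostly on PageRank. In fact, the so-called "secret formula" is hardly a secret at all: this paper describes it in detail. That early paper covers not only how PageRank works but also other signals Google uses for ranking, including anchor text, the position of words within a document, the relative proximity of the search keywords, the size and type of fonts used, the raw HTML of the page, and the capitalization of words. Over the past several years, Google has published hundreds of research papers. They reveal many of the "secrets" behind how Google operates, as well as the infrastructure Google uses. Some of these papers have spurred open source projects and helped many companies grow.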
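Academic papers are only one part of it; Google explains how it operates in other ways too. In 1999, Sergey Brin attended the first Search Engine Strategies conference for webmasters. In 2001, Google became one of the first search engines to join WebmasterWorld, an online forum for web publishers. One Google representative has posted there more than 2,800 times, and another, the AdWords advisor, about 5,000 times.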
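As the web has grown, so have Google's efforts at transparency and communication. We started blogging in May 2004 and have since published thousands of posts on our official blogs. Google now has more than 70 official blogs, including the official webmaster blog, which is dedicated to helping webmasters understand how Google works and how to achieve the ranking they hope for in our search results. Google has published more blog posts than any other large company. We also provide extensive public documentation on our website, in dozens of languages, with advice for publishers.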
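As head of Google's webspam team (our team works to stop the kinds of spam that violate the clear, public rules in Google's Webmaster Guidelines), I am often asked how Google works. That is why I started my personal blog in 2005. Since then I have published hundreds of posts about Google, on topics ranging from common website mistakes to advice for new bloggers. I have also had the privilege of speaking to webmasters at more than 30 search engine conferences, and of reviewing and critiquing a number of public sites. In fact, this week I will be at yet another search engine conference with more than 10 Google colleagues to answer questions.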
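We have experimented with many ways of helping webmasters understand how Google's search ranking works. We have held several live online webmaster chats, each drawing hundreds of participants in real time. We have also tried Twitter and podcasts. Here is one of my favorite ways of giving candid advice to web publishers: last year we collected many questions from the public and posted hundreds of video answers on our webmaster video channel. Those videos have been watched more than 1.5 million times! We also answer the public's questions about how Google operates through online blogs.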
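The examples go on. Google has also worked with other search engines on ways to make webmasters' lives easier. The resulting industry standards include a way to specify the preferred URL format for a site, and Sitemaps, which let webmasters easily tell search engines about their pages. Google also launched a webmaster forum, where Google employees and experienced "power users" regularly log in to answer questions about specific sites. We have run staffed "virtual site clinics", giving one-on-one replies and advice to users from San Francisco to Russia and from India to the Spanish-speaking world. We have even confirmed ranking signals that Google's algorithms do not use, such as the keywords meta tag, because doing so saves webmasters wasted effort and helps them avoid unnecessary lawsuits.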
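The frustrating part is that even if all 20,000 Google employees answered publishers' questions around the clock, we still could not answer every webmaster's question. Why? Because there are more than 192 million registered domain names on the Internet. That is why we built Google Webmaster Tools, a one-stop site that provides scalable, self-service information and through which webmasters can also give data to us. Describing all the powerful free tools we offer webmasters would take another full blog post, so I will list only a few features here: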
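- Webmasters can get suggestions about problems such as duplicate meta tags or missing title tags.
- Webmasters of sites that we believe violated Google's Webmaster Guidelines, and against which Google has taken corresponding action in our index, can request reconsideration.
- Webmasters of hacked sites can get details about the malware attacking their site. After removing the compromised content, they can fetch pages from the site so that Googlebot can confirm the malicious content is completely gone.
- Webmasters can find out which errors Google encountered when crawling their site.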
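Recently, a Google employee wrote a blog post about using these free public tools to diagnose a problem with a web host that was going over its bandwidth limit. Millions of webmasters have done something similar, using Google's free tools to get useful information about their sites.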
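Google is committed to operating as openly as possible, even helping users export their data out of Google products. At the same time, we do not think it is unreasonable for a business to keep some trade secrets, especially when our secrecy is meant to stop spammers and hackers from gaming our system. If people trying to manipulate Google's rankings knew every detail of how ranking works, it would be far easier for them to "manipulate" our results into irrelevant content, such as pornography and malicious sites, and in the end it is users who would lose out.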
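Finally, criticizing Google's "secret formula" is easy to do, but the criticism does not match the facts. For years Google has worked to operate openly, giving publishers information about how its ranking works and doing its best to answer questions from publishers and users alike. If that is what people mean by a "secret", then Google's is surely the worst-kept secret in the world of search.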
Also I will upload the slides and put them here.
Starting today, SQLULDR2 can accept complex SQL statements from standard input. In the example below the SQL is typed by hand, and the slash (/) on the last line marks the end of input.
D:\>sqluldr2 scott/tiger file=- sql=-
select
*
from tab
/
DBOBJECTS,TABLE,
BLOB,TABLE,
SPACE_DAILY,TABLE,
SQLULDR2_LOG,TABLE,
TRADE_MONTHLY_SUMMARY,TABLE,
TRADE_DATA,TABLE,
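This saves the step of creating a SQL file. More importantly, when calling SQLULDR2 from a Linux shell or Perl script and passing in complex SQL, there is no longer any need to create a SQL file, and it is easy to substitute shell or Perl variables to generate dynamic SQL. For example, we often see scripts like this: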
sqlplus -s "/ as sysdba" << EOF
select
*
from tab;
exit
EOF
SQLULDR2 can now be used the same way, including from Linux shell or Perl scripts.
sqluldr2 sys file=- sql=- << EOF
select
*
from tab
EOF
With this feature, batch scripts that embed SQLULDR2 become more general and easier to port to different platforms.
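The heredoc pattern above can be combined with shell variables to generate dynamic SQL. The following is a minimal sketch, assuming a POSIX shell. The helper names `build_sql` and `unload_rows` and the variables `table` and `max_rows` are hypothetical and chosen only for illustration; the `file=-` and `sql=-` options and the `scott/tiger` login are taken from the examples in the post, and `rownum` is the standard Oracle pseudocolumn.

```shell
#!/bin/sh
# Sketch only: build_sql and unload_rows are hypothetical helper names.

# Print a SQL statement built from two shell variables.
build_sql() {
    table=$1
    max_rows=$2
    # The heredoc delimiter is unquoted, so the shell expands the variables.
    cat << EOF
select
*
from $table
where rownum <= $max_rows
EOF
}

# Feed the generated SQL to SQLULDR2 on standard input.
# Assumes sqluldr2 is on the PATH and scott/tiger is a valid login.
unload_rows() {
    build_sql "$1" "$2" | sqluldr2 scott/tiger file=- sql=-
}

# Show the generated statement without touching a database.
build_sql tab 10
```

Under those assumptions, `unload_rows tab 10` should unload at most ten rows of `tab` to standard output. Keeping the SQL generation in its own function makes it possible to inspect the statement before it is sent to the database.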
2010-03-08 Mon

Looking for new contributors and fresh perspectives for your open source software project? Through the Google Summer of Code™ program, we fund students worldwide to work with mentors from the FLOSS community on a three-month coding project. Over the past five years, we've successfully paired nearly 3,400 students with more than 3,000 mentors from backgrounds spanning industry to academia, with some spectacular results: more than 8 million lines of source code produced and over $20M in funding in support of open source development. We're particularly excited by the social ties our students form through the course of the program. We've connected people in more than 100 countries, and hope to bring people from even more places into the Google Summer of Code community this year. We're looking forward to our sixth year and welcoming another group of 1,000 student developers to the program.
We're now accepting applications from open source projects that wish to act as mentoring organizations. We'll be taking mentoring organization applications until Friday, March 12th at 23:00 UTC. Our list of approved organizations will be published on the 2010 Google Summer of Code site on March 18th. Interested students will then have several days to discuss their ideas with the accepted organizations before student applications open on March 29th.
Check out our Frequently Asked Questions page for more details and a preview of the application. And remember, if you have any questions, you can always find us in the Google Summer of Code Discussion group or in #gsoc on Freenode. Best of luck to all of our applicants!
2010-03-05 Fri
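Reading Kamus's comment on SQLULDR2 gave me a lot to think about. People care most about whether the features they want are convenient to use; the number of features is not the point. And SQLULDR2's many command-line options really are dizzying, even for me.
To make things easier for most people, I have simplified SQLULDR2's command-line help to the following: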
SQL*UnLoader: Fast Oracle Text Unloader (GZIP), Release 3.0.1
(@) Copyright Lou Fangxin (AnySQL.net) 2004 - 2010, all rights reserved.
Usage: SQLULDR2 keyword=value [,keyword=value,...]
Valid Keywords:
user = username/password@tnsname
sql = SQL file name
query = select statement
field = separator string between fields
record = separator string between records
rows = print progress for every given rows (default, 1000000)
file = output file name(default: uldrdata.txt)
log = log file name, prefix with + to append mode
fast = auto tuning the session level parameters(YES)
text = output type (MYSQL, CSV, MYSQLINS, ORACLEINS, FORM, SEARCH).
parfile = read command option from parameter file
for field and record, you can use '0x' to specify hex character code,
\r=0x0d \n=0x0a |=0x7c ,=0x2c, \t=0x09, :=0x3a, #=0x23, "=0x22 '=0x27
Experts can still get the full list of earlier command-line options as follows:
sqluldr2 help=yes
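The new TEXT option applies the relevant settings for each export format. This is more convenient, and it also gives a direct overview of what SQLULDR2 can do: export data for MySQL, export standard CSV files that Excel can open, generate INSERT statements for MySQL and Oracle, display records column by column, or generate a data source for certain specialized search programs.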
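The help text above notes that `field` and `record` accept separators written as `0x` hex character codes. As a small sketch, assuming a POSIX shell, the hypothetical helper `to_hex` below prints the code for a single character; it relies on the standard `printf` behavior that an argument with a leading quote is converted to the numeric value of the character that follows.

```shell
#!/bin/sh
# Sketch only: to_hex is a hypothetical helper, not part of SQLULDR2.

# Print the 0x hex code of a single character, for use with field= or record=.
to_hex() {
    # A leading single quote makes printf use the character's numeric value.
    printf '0x%02x\n' "'$1"
}

to_hex '|'
to_hex ','
to_hex '#'
```

The printed values match the table in the help text (`|=0x7c`, `,=0x2c`, `#=0x23`), so a pipe-separated export would be requested with `field=0x7c`.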
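Thanks again to Kamus for the good suggestion. This is the year of user experience at Alipay, and we should rethink things from the user's point of view.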
If you are interested in learning more about Google’s activities in computer science education, make sure to attend some of the talks we have scheduled or drop by the Google booth!
We continue to be impressed by the new solutions developers are bringing to market by leveraging the Google Analytics Platform. If you have developed a useful new tool or integration on top of Google Analytics, drop us an email at analytics-api@google.com. If it's innovative and useful we'll highlight it to our readers on this blog.
The open source projects he created as part of his work were two-fold: Linux Trace Toolkit Next Generation (LTTng), an LGPLv2.1/GPLv2 tracer for the Linux kernel; and Userspace RCU library (liburcu), a highly scalable user-space synchronization library, distributed under the LGPLv2.1 license.
Mathieu was kind enough to send us this summary of his research:
Computer systems, at both the hardware and software levels, are becoming increasingly complex. Tracing is the key to addressing some or all of this increasing complexity. In the case of Linux, used in a large range of applications, from small embedded devices to high-end servers, the size of operating system kernels is increasing, libraries are being added, and major redesign of existing software is required to benefit from multi-core architectures. As a result, the software development industry and individual developers are facing problems whose resolution requires an understanding of the interaction between applications and all components of an operating system.
In my thesis, I propose the LTTng (Linux Trace Toolkit next generation) tracer as an answer to the industry and open source community tracing needs. The low-intrusiveness of the tracer is a key aspect of its usefulness because we need to be able to reproduce problems occurring in normal conditions. In some cases, users leave tracers active at all times in production, which makes the tracer overhead definitely critical. Our approach involves the design of synchronization primitives that meet the low-impact requirements. The linearly scalable and wait-free RCU (Read-Copy Update) synchronization mechanism used by the LTTng tracer fulfills these requirements with respect to data read. A custom-made buffer synchronization scheme is proposed to extract tracing data while preserving linear scalability and wait-free characteristics.
By measuring the LTTng impact, I demonstrate that it is possible to create a tracer that satisfies all the following characteristics: low latency, deterministic real-time impact (wait-free), small impact on operating system throughput and linear scalability with the number of cores. Experiments on various architectures show that this tracer is portable.
I propose a general model for superscalar multi-core systems with weakly-ordered memory accesses to perform formal verification of the RCU correctness and wait-free guarantees by model-checking. The LTTng buffering scheme is also formally verified for safety and progress. Formal verification demonstrates that these algorithms allow reentrancy from multiple execution contexts, ranging from standard thread to non-maskable interrupt handlers, allowing a wide instrumentation coverage of the operating system.
Many thanks to Mathieu for sending us this report. You can download the full dissertation for more details.
- Spam or user-generated spam content
- Forum posts containing spam, or large volumes of spam comments
- Suspected hacking
2010-03-04 Thu
All of these changes in software are very exciting, but who is it all for? Why is anonymity online so important? Companies like Google have privacy and opt-out policies, but not everyone has this stance. Corporations, nations, criminal organizations and individuals want your information. Companies collect information on your web browsing habits and sell it or are sloppy when it comes to protecting it from identity thieves. Others can threaten lives, from repressive nations tracking down outspoken journalists, to abusive spouses or stalkers who want to find out where their victims are hiding; from enemy military forces trying to find a communications link, to criminals who know when law enforcement is watching online.
Political upheaval sparks protests and renewed efforts to control the flow of information online. Interest in censorship circumvention also rises. In 2009, use of Tor increased, as users tried to get around national firewalls during the elections in Iran, and after the introduction of national Internet filters in other countries.

In times of relative political stability, governments routinely filter out international news outlets, information on reproductive health, religion, human rights and other topics deemed unfit. Women blogging about things considered mundane elsewhere, like being forbidden to drive or shop alone, are harassed by authorities. On the one hand, technology has made it easier to crack down on dissent, but the right technology can influence policy in good ways. In Mauritania, the use of censorship circumvention software after 2005 became widespread enough to prompt the government to stop filtering, since it was becoming a waste of time.
Even people living in countries where free speech is protected by law need anonymity for political activities. People blogging about political views that differ from the prevailing attitudes in a small community may lose a job or face boycotts if they run a business. In a company town, writing about the misdeeds of the company that employs your neighbors may be dangerous. Telling people about corruption could lead to harassment from guilty officials.
When someone finds the courage to leave an abusive relationship, the support of victims' advocates is vital. The Internet can help a survivor find counseling, shelter, and encouragement from people who have gone through the same process. Sadly, stalkers are also using technology to find their victims. Abusers monitor web browsers to see if a victim is planning to leave. Information about a shelter's location can be found in email headers, forcing abuse survivors to relocate. According to the U.S. Bureau of Justice Statistics, over one in four people who are stalked experience some sort of cyberstalking. Though some software in a stalker's toolkit is installed on a home computer, IP addresses can reveal which internet cafe or library someone uses to get online. Even if you don't have a stalker, hiding your IP address can be a good idea. Kids and adults alike are advised not to tell strangers where they live, but an IP address can reveal it for them.
Sting operations fail if criminals can tell that the police are connecting to message boards and chat from a government network. The information disappears. Insurgents may be looking for soldiers connecting to their defense department's computers back home. Anonymous tip lines are not so anonymous if someone telling authorities about crime is the only person in the neighborhood connecting to a government website. Without anonymity, going after organized crime can be dangerous to officers and their families.
Some companies do not reveal how much they know about their customers, or who sees the information. Some Internet Service Providers feel entitled to sell data collected from their subscribers to marketers. Though they claim that the information is not tied to any particular users, it is easy to find someone based on their search history. Information about visits to banking websites, searches for details on pre-existing health conditions, or other sensitive online activity could be damaging in the wrong hands, whether made available through carelessness or commercial interest.
Privacy online can protect people offline, whether they are organizing protests, covering the news, blowing the whistle on threats to public health, or just blogging about daily life. In the "real world", assaults on privacy like peeking in windows, opening mail, or breaking and entering are obvious crimes. In the online world, however, assaults on privacy are subtle and unyielding. These threats to your health, your wealth and your well-being have no "opt-out" button. They have no "scrub my data" option. Your online activities, e-mails, bank transactions and everything else can be used to trace where you are and who you are. Using software like Tor gives ordinary citizens more choice about the information they reveal online.
For more information about online privacy and circumventing internet censorship, visit the Tor Project's website.


