Something Vast This Way Comes
The Vast.com (developer) Preview is finally available! If you’ve been wondering what we’ve been up to, here it is in a nutshell: we are building a search service that extracts classified ads from across the web, structures them, and then makes them available via an open REST API for commercial and non-commercial uses.
A little more detail:
- We are crawling the web and large parts of the blogosphere with a general crawler, similar to the ones operated by Yahoo!, Google, Ask, MSN, and Gigablast.
- The crawler activates forms and digs deep to find even dynamic data (although it certainly doesn’t fill in any logins and passwords).
- We automatically recognize classifieds listings (currently cars for sale, job postings, and personals profiles), and extract and normalize the surrounding metadata (make, model, price, mileage, salary, location, title, age, gender, etc.).
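To make "extract and normalize" a little more concrete, here is a minimal sketch of what normalizing one extracted car listing could look like. It is an illustration only, not our actual pipeline: the `CarListing` type, the field names, and the parsing rules are all assumptions made up for this example.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class CarListing:
    """Hypothetical normalized record for a car-for-sale listing."""
    make: str
    model: str
    price_usd: Optional[int]
    mileage: Optional[int]


def parse_number(text: str) -> Optional[int]:
    """Turn strings like '$12,500' or '45k miles' into an integer.

    Returns None when the text contains no digits at all.
    """
    match = re.search(r"(\d[\d,]*(?:\.\d+)?)\s*(k\b)?", text, re.IGNORECASE)
    if not match:
        return None
    value = float(match.group(1).replace(",", ""))
    if match.group(2):  # a trailing 'k' means thousands
        value *= 1000
    return int(value)


def normalize_car(raw: dict) -> CarListing:
    """Map loosely formatted scraped fields onto the structured record."""
    return CarListing(
        make=raw.get("make", "").strip().title(),
        model=raw.get("model", "").strip().title(),
        price_usd=parse_number(raw.get("price", "")),
        mileage=parse_number(raw.get("mileage", "")),
    )
```

Calling `normalize_car({"make": " honda ", "model": "civic", "price": "$12,500", "mileage": "45k miles"})` would give a record with a cleaned-up make and model and integer price and mileage. Real listings are far messier than this, which is why anything that fails to parse is left as `None` rather than guessed.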
Currently, we have one of the largest databases of its kind anywhere, with over 15 million classified listings across these three categories, automatically extracted and structured with no human oversight, from nearly 50,000 web sites and blogs. (We actually crawled many, many times that number, but these are just the sites that have results to date.)
If you are an end-user, you should be able to search for that hard-to-find listing without having to visit hundreds of sites, and compare cross-site results, with images, sorting, and statistics.
If you are a web-site owner or web developer, we’re offering a no-hassle API to show this data to your visitors, or to mash it up to your heart’s content. You can use it to build a huge destination site, an interesting application, or to supplement content and listings that you have today. You CAN use it for commercial purposes, and as long as it’s being shown to real end users, there’s NO LIMIT on the number of queries. Everything you see on the site is built on our API, so you should be able to replicate Vast.com on your own site or blog.
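As a rough idea of what calling a REST search API from your own site might involve, here is a small sketch that builds a query URL. Everything specific in it is a placeholder: the `api.example.com` endpoint, the `category` parameter, and the filter names are assumptions for illustration, not the real Vast.com API, so check the API documentation for the actual URLs and parameters.

```python
from urllib.parse import urlencode

# Hypothetical endpoint, used only for illustration.
BASE_URL = "https://api.example.com/listings"


def build_search_url(category: str, **filters) -> str:
    """Build a GET URL for a listings search.

    Filters set to None are dropped; the rest are sorted by name so the
    same query always produces the same URL.
    """
    params = {"category": category}
    params.update({k: v for k, v in sorted(filters.items()) if v is not None})
    return f"{BASE_URL}?{urlencode(params)}"
```

For example, `build_search_url("cars", make="Honda", max_price=8000)` yields `https://api.example.com/listings?category=cars&make=Honda&max_price=8000`, which you could then fetch with any HTTP client and render however you like.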
If you have a classifieds site or a blog and would like your ads to be included in our results, you shouldn’t have to do anything. Just post like you normally would, and we’ll find you. If we’re not getting your results or not getting them all, drop us a note at help – at – vast – dot – com and we’ll try and fix it.
We’re going to keep this site and the API as open as possible, and like a good net citizen, link directly back to the results. We don’t compete with the people that we crawl by taking direct listings. We don’t rely on explicit tagging. And we do an enormous amount of de-duplication and spam filtering to keep the results clean.
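For the curious, one simple way to think about de-duplication is fingerprinting: reduce each listing to a few normalized fields, hash them, and keep only the first listing per hash. The sketch below shows that general idea under made-up assumptions (the `title`, `price`, and `location` keys are hypothetical); it is not a description of what we actually run, which has to cope with much fuzzier matches.

```python
import hashlib
import re


def fingerprint(listing: dict) -> str:
    """Hash a few normalized fields so near-identical copies collide."""
    parts = []
    for key in ("title", "price", "location"):
        value = str(listing.get(key, "")).lower()
        # Collapse punctuation and whitespace so formatting differences vanish.
        value = re.sub(r"[^a-z0-9]+", " ", value).strip()
        parts.append(value)
    return hashlib.sha1("|".join(parts).encode("utf-8")).hexdigest()


def deduplicate(listings: list) -> list:
    """Keep the first listing seen for each fingerprint, preserving order."""
    seen = set()
    unique = []
    for listing in listings:
        fp = fingerprint(listing)
        if fp not in seen:
            seen.add(fp)
            unique.append(listing)
    return unique
```

With this approach, the same ad posted on two different sites with slightly different capitalization or punctuation collapses into one result, while a listing with a different price or location is kept as a separate entry.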
Of course, this is a search service, not a listing service, so you can expect that some spam and misclassified results will sneak through. Some links will break due to changes, expirations, and finicky databases that were not designed to be “deep crawled.” In those cases, the cache is your friend. There are also rivers of pornographic content that have to be filtered out, and occasionally we miss a few. Please help out by reporting bad results using the links next to each result.
We will be adding more sources, better crawling, improved classification, and many more categories over time – this is just a start. We want to support the web community that wants to take highly-structured content and build applications on top of these massive data flows. When we start making revenue through syndicating this data, we will share it with the developers and sites distributing it via the API.
What more would people like to see? How can we help or improve?
Update: Some coverage of the launch and reviews from TechCrunch, Paul Kedrosky, Peter Rip, and CNet.