Thursday, 29 April 2010

MarkLogic: XQuery - Creating XML Elements and Attributes programmatically

In the past, I've written XQuery modules which populate a particular sequence of elements in order (adding in the relevant values and returning the result). I had a brief discussion with Philip Fennell today who informed me of a useful aspect of XQuery: Element and Attribute constructors.

Below is an example of how they can be used with MarkLogic, hopefully useful if you want to generate sequences of elements from raw name and value pairs:
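A minimal sketch of computed element and attribute constructors. The element names and values here are hypothetical sample data, chosen only to show names and values being supplied at run time:

```xquery
xquery version "1.0-ml";

(: Hypothetical sample data: parallel sequences of names and values :)
let $names  := ("title", "author", "year")
let $values := ("XQuery Notes", "A. Writer", "2010")
return
  (: element { name } { content } builds an element whose name is computed :)
  element record {
    (: attribute name { value } builds an attribute on the enclosing element :)
    attribute created { fn:current-date() },
    for $name at $i in $names
    return element { $name } { $values[$i] }
  }
```

Because the names are ordinary expressions, the same loop works for any list of pairs without hard-coding the markup.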

MarkLogic: XQuery Function Overloading

I remember seeing in a few places that XQuery doesn't support function overloading. I also remember seeing examples where function overloading is used.

This simple example shows overloading at work, as written for MarkLogic (although it could easily be modified for eXist):
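A minimal sketch, using a hypothetical local:greet function. XQuery identifies a function by its name plus its number of arguments, so two declarations with the same name and different arity can coexist (two declarations that differ only in parameter types cannot):

```xquery
xquery version "1.0-ml";

(: Zero-argument version delegates to the one-argument version :)
declare function local:greet() as xs:string {
  local:greet("world")
};

(: One-argument version with the same name :)
declare function local:greet($name as xs:string) as xs:string {
  fn:concat("Hello, ", $name)
};

(local:greet(), local:greet("MarkLogic"))
```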

Tuesday, 27 April 2010

MarkLogic: Viewing Installed Modules on a Server

The format for this is:


Another useful trick for getting a list of installed modules - suggested by Lee Pollington - is to use cts:uri-match. You'll need to enable the URI lexicon to make this one work:
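A minimal sketch of the cts:uri-match approach. It assumes the query is run against the modules database, that the URI lexicon is enabled on it, and that modules use the .xqy extension:

```xquery
xquery version "1.0-ml";

(: Returns every URI in the current database matching the wildcard pattern.
   The "*.xqy" pattern is an assumption about how the modules are named. :)
cts:uri-match("*.xqy")
```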

Thursday, 22 April 2010

MarkLogic: Dump the XML Configuration of a given Database in cq/DQ

Useful if you want to dump the current database settings as an XML doc; open cq/DQ, set your source and run this against a specific database to get the XML configuration settings back from MarkLogic
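One way to do this, sketched here as an assumption rather than as the exact query from this post, is to read the cluster's databases.xml configuration file and select the entry for the database the query is running against:

```xquery
xquery version "1.0-ml";

declare namespace db = "http://marklogic.com/xdmp/database";

(: xdmp:database() with no arguments returns the ID of the current database,
   so the content source chosen in cq/DQ decides which entry comes back :)
xdmp:read-cluster-config-file("databases.xml")
  /db:databases/db:database[db:database-id = xdmp:database()]
```

The user running the query needs sufficient privileges to read cluster configuration files.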

MarkLogic: Managing "slow" document delete times

I ran into a problem where a process ingests a lot of documents into MarkLogic on a fairly regular basis. One of the modes of this particular process is to run a "full" delete of all directories in a given area of the database and to re-ingest.

The problem we were noticing is that deletion of this folder was timing out, which in turn was causing major server performance issues, which led to a lot of "let's restart the server and see what happens" comments.

Briefly, the setup was a directory with a significant number of child directories, each containing 10,000 XML documents or fewer. Attempting to delete any of the child folders in their entirety (like so) would cause a timeout:
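A minimal sketch of a whole-directory delete, assuming a hypothetical directory URI of /content/child-001/:

```xquery
xquery version "1.0-ml";

(: Deletes the directory and everything beneath it in one transaction.
   The URI is a placeholder, not a path from the real system. :)
xdmp:directory-delete("/content/child-001/")
```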



I found that doing this did seem to work (at least, it did for a folder containing 10,000 items):
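One pattern that fits this description, offered as an assumption rather than as the exact query used, is to delete the documents in a directory one at a time. The directory URI is again a hypothetical placeholder:

```xquery
xquery version "1.0-ml";

(: "1" limits xdmp:directory to the immediate children of the directory :)
for $doc in xdmp:directory("/content/child-001/", "1")
return xdmp:document-delete(xdmp:node-uri($doc))
```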


However, each child directory (and there were *well* over 100) took 2-3 minutes to delete, which meant an ingestion run that first cleared the existing folders would spend over 6 hours just on those deletes.

I found this gem of a post on MarkMail: http://www.mail-archive.com/general@developer.marklogic.com/msg03616.html

And I discovered that a default setting on creation of a new database meant slow batch deletes.

Changing Directory Creation from the default 'automatic' to 'manual' led to almost instant bulk deletes and allowed me to delete the entire folder structure without timing out. In essence this setting seems to allow the delete to use the indexes, which means the files can be removed very quickly.
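The setting can be changed in the Admin interface. As a sketch, it can also be changed with the Admin API; the database name "Documents" below is a hypothetical placeholder:

```xquery
xquery version "1.0-ml";

import module namespace admin = "http://marklogic.com/xdmp/admin"
  at "/MarkLogic/admin.xqy";

let $config := admin:get-configuration()
let $db-id  := xdmp:database("Documents")  (: hypothetical database name :)
(: Switch directory creation from "automatic" to "manual" :)
let $config := admin:database-set-directory-creation($config, $db-id, "manual")
return admin:save-configuration($config)
```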

There are some caveats, so I'd recommend reading the article first before deciding whether this process could be suitable for your situation.

Hope this helps someone else experiencing the same problem.

Saturday, 10 April 2010

MarkLogic (x64) install on Ubuntu 10.04

The following steps were required:

Friday, 9 April 2010

MarkLogic: Techniques for querying in-memory fragments using cts:contains

This snippet demonstrates the use of cts:contains and cts:element-attribute-word-query on an in-memory fragment (something that has been stored in the Expanded Tree Cache using a let statement):
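A minimal sketch of this first case. The fragment, its link text, and its attribute values are hypothetical sample data:

```xquery
xquery version "1.0-ml";

(: Hypothetical in-memory fragment built with a let statement :)
let $x :=
  <div>
    <a href="#" class="featured" rel="video">Example video</a>
    <a href="#" class="standard" rel="pdf">Example guide</a>
  </div>
for $item in $x/a
(: Keep only links whose class attribute contains the word "featured" :)
where cts:contains($item,
  cts:element-attribute-word-query(
    xs:QName("a"), xs:QName("class"), "featured"))
return $item
```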


To really add power to the search, you can use cts:contains with a cts:and-query. For this example, I want to return link(s) with a class of featured and a rel containing the value "video":
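A minimal sketch of the combined query, again over hypothetical sample links. Both attribute conditions must hold for a link to be returned:

```xquery
xquery version "1.0-ml";

(: Hypothetical in-memory fragment :)
let $x :=
  <div>
    <a href="#" class="featured" rel="video">Example video</a>
    <a href="#" class="featured" rel="pdf">Example guide</a>
    <a href="#" class="standard" rel="video">Another video</a>
  </div>
for $item in $x/a
where cts:contains($item,
  cts:and-query((
    cts:element-attribute-word-query(
      xs:QName("a"), xs:QName("class"), "featured"),
    cts:element-attribute-word-query(
      xs:QName("a"), xs:QName("rel"), "video")
  )))
return $item
```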

This last example demonstrates the use of an additional cts:or-query. This example creates a list of featured videos and documentation. Using the 'or' query will return any links that have the "featured" class attribute and have a rel of "video" or "pdf":

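xquery version '1.0-ml';

(: The markup in this listing has been reconstructed. The element names,
   the "#" hrefs, and the class and rel values are assumptions chosen to
   match the description above; only the link text and the query are
   original. :)
let $x :=
<div>
  <a href="#" class="featured" rel="video">Mark Logic Application Builder</a>
  <a href="#" class="standard" rel="video">Mark Logic Application Builder</a>
  <a href="#" class="standard" rel="website">Mark Logic Corporation</a>
  <a href="#" class="featured" rel="pdf">Mark Logic 4.1 Install Guide</a>
</div>

return
<html>
<head>
<title>Using cts:element-attribute-word-query on an in-memory fragment</title>
</head>
<body>
<h1>Featured Videos and Documentation</h1>
<ul>
    {
    for $item in $x/a
    where cts:contains($item,
      cts:and-query((
        (: the link must have the "featured" class ... :)
        cts:element-attribute-word-query(
          xs:QName("a"), xs:QName("class"),
          "FEATURED", "case-insensitive"),
        (: ... and a rel of either "video" or "pdf" :)
        cts:or-query((
          cts:element-attribute-word-query(
            xs:QName("a"), xs:QName("rel"),
            "VIDEO", "case-insensitive"),
          cts:element-attribute-word-query(
            xs:QName("a"), xs:QName("rel"),
            "PDF", "case-insensitive")
        ))
      )))
    return
    <li>{$item}</li>
    }
</ul>
</body>
</html>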



Thursday, 8 April 2010

MarkLogic: Query Optimisation Notes (Part One: getting started with xdmp:query-meters())

Here are some brief notes regarding query performance tuning in MarkLogic using xdmp:query-meters()

A simple starting point involves using xdmp:estimate to find out how many documents you currently have in MarkLogic:
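A minimal sketch of that starting point, assuming the estimate is taken over every document in the current database:

```xquery
xquery version "1.0-ml";

declare namespace qm = "http://marklogic.com/xdmp/query-meters";

(: Estimated document count, resolved from the indexes :)
xdmp:estimate(fn:doc()),
(: Elapsed time recorded so far for this request :)
xdmp:query-meters()/qm:elapsed-time/text()
```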

As this is an estimate - and as such, is returning the result direct from the indexes - it's going to return a result set almost instantaneously. Changing the xdmp:estimate to fn:count will take fractionally longer to compute a result. Also, removing the /qm:elapsed-time/text() will give you the full breakdown of where the indexes and caches are being hit (and where they're not).
You could express that like so:
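A sketch of that variation, with fn:count in place of xdmp:estimate and the full query-meters element returned instead of just the elapsed time:

```xquery
xquery version "1.0-ml";

(: Exact count: the documents are actually counted, not estimated :)
fn:count(fn:doc()),
(: Full breakdown, including cache hits and misses :)
xdmp:query-meters()
```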

xdmp:query-meters becomes really useful when you use it with cts:search. A simple example of such use would be:
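A minimal sketch, assuming a hypothetical search term of "example":

```xquery
xquery version "1.0-ml";

(: Returns every matching document, followed by the meters for the request.
   The search term is a placeholder. :)
cts:search(fn:doc(), cts:word-query("example")),
xdmp:query-meters()
```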

On a big result set, however, this could take a while, as it will return all of the matched XML documents in that set. What normally happens is that the query resolves quickly, then the rest of the time is taken up sending vast quantities of XML back to the browser over the network. This can (and with big result sets often does) cause both your machine and your browser to become unresponsive, so it's always best to estimate the size of the result set before you attempt this!

So for situations where you want the output from query meters but don't want MarkLogic to stream megabytes of XML back to cq/DQ (or your middle tier layer), you can use fn:count. After all, in most cases when you're getting query stats you're probably more interested in the result timings than in the result set. So here's the query rewritten to return just the number of records (and not the results themselves):
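A minimal sketch of the count-only form, using the same hypothetical search term of "example":

```xquery
xquery version "1.0-ml";

(: Only the number of matches is returned, not the documents themselves :)
fn:count(cts:search(fn:doc(), cts:word-query("example"))),
xdmp:query-meters()
```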


Part two will discuss example usage(s) for the xdmp:query-trace(true()) and xdmp:query-trace(false()) functions.

MarkLogic Search Note: cts:search vs. XPath

A quick note about MarkLogic's extensive search APIs with an emphasis on using cts:search

Trying an XQuery snippet like this in cq (or DQ):
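A sketch of the failing pattern, assuming a hypothetical search term of "example":

```xquery
xquery version "1.0-ml";

let $doc := fn:doc("/uri/for/doc.xml")
(: $doc is a variable, not a searchable path expression :)
return cts:search($doc, cts:word-query("example"))
```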

Will return [1.0-ml] XDMP-UNSEARCHABLE.

One of the most important aspects of cts:search is that it uses MarkLogic's indexes. As soon as you assign a document to a variable as in the above example (let $doc := fn:doc("/uri/for/doc.xml")), the document is no longer considered within the context of the indexes as it becomes its own entity and - as such - is considered as an in-memory fragment, rather than a "loaded" document. As cts:search relies on the use of indexes - which is what makes it so fast - the error gets thrown.

There's another important distinction we should make at this stage too; the assignation to a variable means the requested doc gets stored in MarkLogic's Expanded Tree Cache. If you fill the cache with a document which is too big, you'll see XDMP-EXPNTREECACHEFULL exceptions and your XQuery will fail.

If you do need to obtain the document as a variable, you can always use XPath to pull values from the fragment, and in some cases this will not have any noticeable effect on performance (MarkLogic's XPath handling is still pretty fast). However, I think it's safe to say that cts:search is almost always the right tool for the job and by using it you're getting your money's worth from the MarkLogic licence!

What's the workaround? The obvious one is to rewrite the original example like so:
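A minimal sketch of that rewrite, passing the document expression straight to cts:search so it can be resolved against the indexes (the search term is a placeholder):

```xquery
xquery version "1.0-ml";

(: fn:doc(...) is used directly as the searchable expression :)
cts:search(fn:doc("/uri/for/doc.xml"), cts:word-query("example"))
```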

However, as I've mentioned in previous discussions, when you assign documents to variables, you can use cts:contains - which will return an xs:boolean based on given search criteria. So this pattern will work:
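A minimal sketch of the cts:contains pattern, with the same placeholder search term:

```xquery
xquery version "1.0-ml";

let $doc := fn:doc("/uri/for/doc.xml")
(: Returns true() if the in-memory document matches the query :)
return cts:contains($doc, cts:word-query("example"))
```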


I hope this is useful to anyone wishing to learn more about how to get the most out of MarkLogic.